[BUG][Spark] Reliability issues when upserting data to ADLS using Delta in Synapse Analytics #3286
Labels: bug
Which Delta project/connector is this regarding?
Describe the problem
When upserting data from an Azure Synapse Analytics Spark session into ADLS Delta tables, we are seeing significant reliability issues in our job. In particular, during the Delta upsert, executors fail by running out of memory (OOM). Once enough executors fail, the job fails, bringing down the pipeline.
Our data size is not particularly large.
According to Azure support, our nodes are reaching maximum memory and CPU usage, not spilling to disk, then shutting down.
We are not sure whether this is due to an issue with the Delta upsert, the Synapse Analytics runtime, or something else. However, we have been able to isolate the issue to the write step to ADLS by first writing to a Synapse data catalog table and then reading it back in, which materializes the data and breaks the lineage.
Steps to reproduce
Create a table in the Synapse data catalog with around 40 GB of data in ~70 columns. It must contain columns named "year", "month", and "file_date", plus two PK columns "pk_1" and "pk_2"; "pk_1" is not unique and is clustered on a few values, whereas the combination of "pk_1" and "pk_2" is unique.
Use the below code to overwrite into Delta table:
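The original snippet was not captured in this copy of the issue. A minimal sketch of such an overwrite, assuming a source DataFrame `df`, a target path `target_path`, and partitioning on the `year` and `month` columns mentioned above, might look like:

```python
# Hypothetical sketch -- the issue's original snippet was not captured.
# Only the column names "year" and "month" come from the issue text; the
# function and variable names are assumptions.
PARTITION_COLUMNS = ["year", "month"]

def overwrite_delta(df, target_path):
    """Overwrite the target Delta table, partitioned by year and month."""
    (df.write
       .format("delta")
       .mode("overwrite")
       .partitionBy(*PARTITION_COLUMNS)
       .save(target_path))
```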
Load a similarly sized amount of data with same structure into another data catalog table.
Use the below code to upsert the data into the same Delta table:
where the function referenced is:
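Neither snippet survived in this copy of the issue. A minimal sketch of a Delta MERGE upsert keyed on `pk_1` and `pk_2`, with all function and variable names assumed, might look like:

```python
# Hypothetical sketch -- the issue's actual function was not captured.
# Only the key column names pk_1 and pk_2 come from the issue text.

def build_merge_condition(key_columns):
    """Build the MERGE ON clause matching target and source on the key columns."""
    return " AND ".join(f"target.{c} = source.{c}" for c in key_columns)

def upsert_to_delta(spark, source_df, target_path, key_columns=("pk_1", "pk_2")):
    """Upsert source_df into the Delta table at target_path."""
    # Imported lazily so the sketch can be read without delta-spark installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, target_path)
    (target.alias("target")
           .merge(source_df.alias("source"),
                  build_merge_condition(key_columns))
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
```

The call site would then be something along the lines of `upsert_to_delta(spark, incoming_df, target_path)`.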
Observed results
The execution runs fairly quickly, but executors appear to be failing. This can be inferred because tasks fail in groups the same size as the number of vCores allocated to each node, and the tasks fail with the message:
I have also just recently seen a failure, which may be a Delta issue:
I have not deleted this folder location nor truncated the Delta log.
Expected results
I would expect Spark to spill to memory or disk if required, rather than fully exhausting the resources of the nodes. There is currently no spillage.
I would also not expect Spark to generally need to spill in this case, because the amount of data is small relative to the memory available on the nodes; however, Spark keeps multiple copies of data, so some spillage may still be needed.
I would also expect Delta to be able to find the files in the Delta temporary folder, which does not seem to be happening reliably. This could be a consequence of the executor failures, but the two do not appear to occur side by side.
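For reference, spill behavior is governed mainly by a handful of Spark memory settings; the values below are the stock defaults, shown only to ground the discussion, not as a recommended fix:

```
# Illustrative Spark defaults (not a recommended fix); these settings govern
# when Spark spills to disk rather than exhausting executor memory.
# Fraction of heap used for execution and storage:
spark.memory.fraction          0.6
# Portion of that which is immune to eviction by execution:
spark.memory.storageFraction   0.5
# Shuffle parallelism used by joins and MERGE:
spark.sql.shuffle.partitions   200
```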
Further details
Our incoming dataset is around 40 GB before being loaded into Spark. This data currently lands in ADLS as Parquet without proper partitioning, which could affect how it is read in. We read it using last-modified timestamps, but cannot apply Hive partitioning to it. Copying the data and applying proper partitions did not seem to resolve the failures.
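The timestamp-based read can be sketched as follows (requires Spark 3.1+; the function name, path, and cutoff value are placeholders, not the issue's actual code):

```python
# Hypothetical sketch of an incremental Parquet read filtered by file
# modification time; the path and cutoff value are placeholders.

def read_recent_parquet(spark, source_path, modified_after):
    """Read only Parquet files modified after the given timestamp string."""
    # modifiedAfter (Spark 3.1+) filters input files by last-modified time,
    # e.g. "2024-01-01T00:00:00"; no Hive partition pruning is involved.
    return (spark.read
                 .option("modifiedAfter", modified_after)
                 .parquet(source_path))
```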
We are upserting into about 72 GB of data. Our target data is partitioned into 23 partitions, but all are used in the upsert.
We have run into failures with various node sizes and counts. We are currently using 6 XXL Azure nodes (64 vCores, 400 GB memory each) and still seeing these failures. The node size or count does not seem to significantly affect the failures.
We have tried various optimization techniques including salting and repartitioning. We have not tried z-ordering, but can try that if it might address the root cause.
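A typical salting approach for a key like `pk_1` that is clustered on a few values is sketched below; the salt count, the `salt` column name, and the function name are assumptions, not the code we actually ran:

```python
# Hypothetical sketch of key salting for the skewed pk_1 column.
# NUM_SALTS and all names other than pk_1 are assumptions.
NUM_SALTS = 32

def add_salt(df, num_salts=NUM_SALTS):
    """Append a random salt so rows sharing one pk_1 value spread across partitions."""
    # Imported lazily so the sketch can be read without pyspark installed.
    from pyspark.sql import functions as F

    return (df.withColumn("salt", (F.rand() * num_salts).cast("int"))
              .repartition("pk_1", "salt"))
```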
We have also tried doing a MERGE delete, then append, and we've seen node failure issues with this approach as well.
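The delete-then-append variant is along these lines (a sketch with assumed names; only the key columns `pk_1` and `pk_2` come from the issue text):

```python
# Hypothetical sketch of the MERGE-delete-then-append variant.
# Only the key columns pk_1 and pk_2 come from the issue text.

def delete_then_append(spark, source_df, target_path):
    """Delete target rows whose keys appear in source_df, then append source_df."""
    # Imported lazily so the sketch can be read without delta-spark installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, target_path)
    (target.alias("target")
           .merge(source_df.select("pk_1", "pk_2").distinct().alias("source"),
                  "target.pk_1 = source.pk_1 AND target.pk_2 = source.pk_2")
           .whenMatchedDelete()
           .execute())
    source_df.write.format("delta").mode("append").save(target_path)
```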
Environment information
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?