[SPARK-48807][SQL] Binary Support for CSV datasource #47212

yaooqinn · 2024-07-04T10:59:29Z

What changes were proposed in this pull request?

SPARK-42237 disabled binary output for CSV because binary values were written using java.lang.Object.toString. Now that UnivocityGenerator supports meaningful string representations of binary values, we can support binary output in CSV.
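To illustrate the problem described above, here is a minimal sketch that uses only the JDK and no Spark code. The object name `BinaryToStringDemo` is hypothetical. It shows that calling `toString` on a byte array yields a type tag plus an identity hash rather than the contents, whereas decoding the bytes produces text that reflects the data.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical demo object; not part of Spark.
object BinaryToStringDemo {
  val bytes: Array[Byte] = "Spark SQL".getBytes(StandardCharsets.UTF_8)

  // Arrays do not override toString, so this is java.lang.Object.toString:
  // a type tag ("[B" for byte[]) followed by "@" and an identity hash.
  val opaque: String = bytes.toString

  // Decoding the bytes as UTF-8 gives text that reflects the contents.
  val readable: String = new String(bytes, StandardCharsets.UTF_8)
}
```

The `opaque` value differs from run to run and cannot be mapped back to the original bytes, which is why it is not a useful CSV cell value.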

Why are the changes needed?

Improve CSV support for Spark SQL types.

Does this PR introduce any user-facing change?

Yes, but only in that operations on CSV tables with binary columns now succeed where they previously failed.

How was this patch tested?

new tests

Was this patch authored or co-authored using generative AI tooling?

no

dongjoon-hyun · 2024-07-09T03:21:13Z

cc @weiyuyilia and @HyukjinKwon from

[SPARK-42237][SQL] Change binary to unsupported dataType in CSV format #39802

HyukjinKwon · 2024-07-09T10:55:38Z

The only concern from me is that we won't be able to do a read/write round trip. Can we do this with the newer binary string format?

yaooqinn · 2024-07-09T11:33:01Z

For an IO round trip, the UTF8 output style works directly. Other styles can play with/ functions, or we can add an extra read option to help.
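As a rough illustration of what reading a non-UTF8 style back involves, the sketch below round-trips bytes through hex and Base64 text using only JDK calls. Hex and Base64 are assumptions chosen for the example, and the object and method names are hypothetical. It does not reproduce Spark's own formatter or any Spark SQL function.

```scala
import java.util.Base64

// Hypothetical helpers; not part of Spark.
object BinaryTextRoundTrip {
  // Encode each byte as two uppercase hex digits.
  def toHex(bytes: Array[Byte]): String =
    bytes.map(b => "%02X".format(b & 0xff)).mkString

  // Decode pairs of hex digits back into bytes.
  def fromHex(hex: String): Array[Byte] =
    hex.grouped(2).map(pair => Integer.parseInt(pair, 16).toByte).toArray

  def toBase64(bytes: Array[Byte]): String =
    Base64.getEncoder.encodeToString(bytes)

  def fromBase64(text: String): Array[Byte] =
    Base64.getDecoder.decode(text)
}
```

In both cases the text written to the file has to be decoded explicitly to recover the original bytes, which is the extra step the comment refers to.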

HyukjinKwon · 2024-07-10T02:25:54Z

If we specify the schema as binary, can we read it back as binary?

HyukjinKwon · 2024-07-10T02:26:21Z

I remember we do similar things in thriftserver (cc @wangyum), so I am fine with this, but I just want to make sure we can read it back.

yaooqinn · 2024-07-10T02:29:57Z

If we specify the schema as binary, can we read it back as binary?

https://github.com/apache/spark/pull/47212/files#diff-9ccc240a0142ac3674f47953eb70be3424c3f8bc19e6c7431d4575adfe9bd3fbR3185-R3190

Yes, I have added the tests linked above to verify both reading as a raw string and reading with a binary schema.

HyukjinKwon · 2024-07-10T04:10:23Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

+            .option("ds_option", "value")
+            .format(dataSourceFormat)
+            .save(path.getCanonicalPath)
+          val expectedStr = ToStringBase.getBinaryFormatter("Spark SQL".getBytes())


Can we change the value to a non-UTF8 output instead?

This helper method gets a binary formatter based on BINARY_OUTPUT_STYLE and converts the raw bytes here to both UTF8 and non-UTF8 outputs
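The idea of picking a formatter from a configured style can be sketched as follows. This is an assumption-laden illustration, not Spark's `ToStringBase`: the `Style` type, its three cases, and `formatterFor` are hypothetical names, and the set of styles here is chosen only for the example.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical style-keyed formatter; not Spark's implementation.
object BinaryStyleSketch {
  sealed trait Style
  case object Utf8 extends Style
  case object Hex extends Style
  case object Base64Style extends Style

  // Return a function that renders raw bytes as text for the given style.
  def formatterFor(style: Style): Array[Byte] => String = style match {
    case Utf8 =>
      (bytes: Array[Byte]) => new String(bytes, StandardCharsets.UTF_8)
    case Hex =>
      (bytes: Array[Byte]) => bytes.map(b => "%02X".format(b & 0xff)).mkString
    case Base64Style =>
      (bytes: Array[Byte]) => Base64.getEncoder.encodeToString(bytes)
  }
}
```

A test that computes its expected string through the same style-keyed lookup stays valid whichever style is configured, which appears to be the point of the reply above.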

yaooqinn · 2024-07-10T07:27:54Z

Thank you @HyukjinKwon @dongjoon-hyun

Merged to master

### What changes were proposed in this pull request?

SPARK-42237 disabled binary output for CSV because the binary values use `java.lang.Object.toString` for outputting. Now we have meaningful binary string representations support in UnivocityGenerator, we can support it now.

### Why are the changes needed?

improve csv with spark sql types

### Does this PR introduce _any_ user-facing change?

Yes, but it's from failures to success with binary csv tables

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47212 from yaooqinn/SPARK-48807.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>

[SPARK-48807][SQL] Binary Support for CSV datasource

ec71957

github-actions bot added the SQL label Jul 4, 2024

yaooqinn added 2 commits July 5, 2024 01:00

test

8eee9c3

fix tests

2b3fd3a

HyukjinKwon approved these changes Jul 10, 2024

HyukjinKwon reviewed Jul 10, 2024

yaooqinn closed this in b13fc16 Jul 10, 2024

yaooqinn deleted the SPARK-48807 branch July 10, 2024 07:27
