Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48807][SQL] Binary Support for CSV datasource #47212

Closed
wants to merge 3 commits into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Jul 4, 2024

What changes were proposed in this pull request?

SPARK-42237 disabled binary output for CSV because the binary values use java.lang.Object.toString for outputting. Now we have meaningful binary string representations support in UnivocityGenerator, we can support it now.

Why are the changes needed?

improve csv with spark sql types

Does this PR introduce any user-facing change?

Yes, but it's from failures to success with binary csv tables

How was this patch tested?

new tests

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Jul 4, 2024
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jul 9, 2024

cc @weiyuyilia and @HyukjinKwon from

@HyukjinKwon
Copy link
Member

only thing from me is that we won't be able to read/write roundtrip. Can we do this with the newer binary string format?

@yaooqinn
Copy link
Member Author

yaooqinn commented Jul 9, 2024

For IO roundtrip, the UFT8 output style can play it directly. Other styles can play with/ functions, or we can add an extra read option to help

@HyukjinKwon
Copy link
Member

If we specify the schema as binary, can we read it back as binary?

@HyukjinKwon
Copy link
Member

I remember we do similar things in thriftserver (cc @wangyum ) so I am fine with this but just want to make sure we can read it back

@yaooqinn
Copy link
Member Author

yaooqinn commented Jul 10, 2024

If we specify the schema as binary, can we read it back as binary?

https://github.com/apache/spark/pull/47212/files#diff-9ccc240a0142ac3674f47953eb70be3424c3f8bc19e6c7431d4575adfe9bd3fbR3185-R3190

Yes, I have added the above tests to verify read-as-raw-string and read-w/-binary-schema

.option("ds_option", "value")
.format(dataSourceFormat)
.save(path.getCanonicalPath)
val expectedStr = ToStringBase.getBinaryFormatter("Spark SQL".getBytes())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change the value as non UTF8 output instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helper method gets a binary formatter based on BINARY_OUTPUT_STYLE and converts the raw bytes here to both UTF8 and non-UTF8 outputs

@yaooqinn yaooqinn closed this in b13fc16 Jul 10, 2024
@yaooqinn yaooqinn deleted the SPARK-48807 branch July 10, 2024 07:27
@yaooqinn
Copy link
Member Author

Thanks you @HyukjinKwon @dongjoon-hyun

Merged to master

biruktesf-db pushed a commit to biruktesf-db/spark that referenced this pull request Jul 11, 2024
### What changes were proposed in this pull request?

SPARK-42237 disabled binary output for CSV because the binary values use `java.lang.Object.toString` for outputting. Now we have meaningful binary string representations support in UnivocityGenerator, we can support it now.

### Why are the changes needed?

improve csv with spark sql types

### Does this PR introduce _any_ user-facing change?

Yes, but it's from failures to success with binary csv tables

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47212 from yaooqinn/SPARK-48807.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
### What changes were proposed in this pull request?

SPARK-42237 disabled binary output for CSV because the binary values use `java.lang.Object.toString` for outputting. Now we have meaningful binary string representations support in UnivocityGenerator, we can support it now.

### Why are the changes needed?

improve csv with spark sql types

### Does this PR introduce _any_ user-facing change?

Yes, but it's from failures to success with binary csv tables

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47212 from yaooqinn/SPARK-48807.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants