Add BytesIO support to scan_csv
#4950
Comments
Out of curiosity, is this Zstandard compressed TSV file public data or is it from an internal pipeline? Never saw any Zstandard compressed TSV files in the wild myself unfortunately. |
The file I'm trying to work with is a variant information file from plink2 ( Some public data are available here. |
+1 to this feature request; in the meantime, you can use |
Why go via pandas? |
This would also help with #7514 |
See #9266 why |
You can use |
Looking through the code, it seems like the limitation eventually devolves to
There are likely other code paths that interact with
Specifically for the |
My first thought for implementation based on above is replacing all
enum FileSource {
    Path(PathBuf),
    Reader(Box<dyn MmapBytesReader>),
} |
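To make the enum idea above concrete, here is a minimal standalone sketch, not the actual polars code. The `MmapBytesReader` trait below is a simplified stand-in defined only for this example (the real trait lives in polars and has more to it), and `into_reader` and `first_line` are hypothetical helper names. The point is only that both variants can be resolved to one reader type, so downstream code keeps a single code path.

```rust
use std::fs::File;
use std::io::{self, Read, Seek};
use std::path::PathBuf;

/// Simplified stand-in for polars' `MmapBytesReader` (assumption: here it is
/// just "anything that can be read and seeked").
pub trait MmapBytesReader: Read + Seek + Send {}
impl<T: Read + Seek + Send> MmapBytesReader for T {}

/// The enum proposed in the comment above.
pub enum FileSource {
    Path(PathBuf),
    Reader(Box<dyn MmapBytesReader>),
}

impl FileSource {
    /// Hypothetical helper: resolve either variant to a boxed reader.
    pub fn into_reader(self) -> io::Result<Box<dyn MmapBytesReader>> {
        match self {
            // A path is opened lazily, only when a reader is requested.
            FileSource::Path(path) => Ok(Box::new(File::open(path)?)),
            // An in-memory or custom reader is passed through unchanged.
            FileSource::Reader(reader) => Ok(reader),
        }
    }
}

/// Hypothetical helper: return the first line, wherever the bytes live.
pub fn first_line(source: FileSource) -> io::Result<String> {
    let mut text = String::new();
    source.into_reader()?.read_to_string(&mut text)?;
    Ok(text.lines().next().unwrap_or("").to_string())
}
```

With this shape, a `std::io::Cursor` over an already decompressed buffer can be wrapped in `FileSource::Reader`, which is roughly the Rust-side analogue of handing over a BytesIO.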
But... that's sort of limiting. The more general solution is to make the CsvStreamer not assume the CSV fits in memory. In particular, the csv CoreReader could be modified to take a GetReaderBytes trait implementor as input instead of ReaderBytes. That trait lets you get ReaderBytes at some position, and ReaderBytes can implement GetReaderBytes for itself, so much of the code would stay the same.
trait GetReaderBytes {
    fn read(&self, offset: usize, length: usize) -> ReaderBytes;
} |
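A rough standalone sketch of how that trait could look, under several assumptions that are mine and not from polars: `ReaderBytes` is modelled as a plain `Cow<[u8]>` stand-in, the method is renamed `read_at` to avoid clashing with `std::io::Read::read`, it returns `io::Result` because file reads can fail, and `count_newlines` is a hypothetical consumer. In-memory bytes hand out borrowed windows, while a file seeks and reads only the requested window.

```rust
use std::borrow::Cow;
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Stand-in for polars' `ReaderBytes` (assumption): borrowed or owned bytes.
pub type ReaderBytes<'a> = Cow<'a, [u8]>;

/// Sketch of the proposed trait: fetch a window of bytes at some position.
pub trait GetReaderBytes {
    fn read_at(&self, offset: usize, length: usize) -> io::Result<ReaderBytes<'_>>;
}

/// In-memory data returns a borrowed window without copying.
impl GetReaderBytes for [u8] {
    fn read_at(&self, offset: usize, length: usize) -> io::Result<ReaderBytes<'_>> {
        let start = offset.min(self.len());
        let end = start.saturating_add(length).min(self.len());
        Ok(Cow::Borrowed(&self[start..end]))
    }
}

/// A file reads just the requested window into an owned buffer.
impl GetReaderBytes for File {
    fn read_at(&self, offset: usize, length: usize) -> io::Result<ReaderBytes<'_>> {
        let mut file = self; // `&File` implements `Read` and `Seek`
        file.seek(SeekFrom::Start(offset as u64))?;
        let mut buf = Vec::new();
        file.take(length as u64).read_to_end(&mut buf)?;
        Ok(Cow::Owned(buf))
    }
}

/// Hypothetical consumer: count newline bytes window by window, so the whole
/// input never has to be resident in memory at once.
pub fn count_newlines<S: GetReaderBytes + ?Sized>(source: &S, chunk: usize) -> io::Result<usize> {
    let (mut offset, mut count) = (0, 0);
    loop {
        let bytes = source.read_at(offset, chunk)?;
        if bytes.is_empty() {
            return Ok(count);
        }
        count += bytes.iter().filter(|&&b| b == b'\n').count();
        offset += bytes.len();
    }
}
```

A real CSV reader would also have to handle rows that straddle a window boundary, which this sketch ignores.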
I think I'm understanding correctly that this is a recommendation to use a rust library. Any advice for the less-cool among us who are still working pretty exclusively from a python environment? |
It is a command line tool, but part of the rust |
awesome, thanks for the correction, yeah cli tools work for my use case. |
+1 to this. Implementing this proposal would allow for scanning memory-mapped files that need additional processing, e.g. parsing custom headers. |
Problem Description
Firstly, thank you for making this fantastic library 😀
I have the following use case:
Where path is the path of a Zstandard compressed TSV file. I'm working with bioinformatics data, and bioinformaticians love to generate massive CSVs/TSVs and then compress them.
I would like to use scan_csv to read the decompressed BytesIO stream instead and take advantage of all the cool lazy evaluation features to reduce memory usage. Alternatively, it would be great if scan_csv supported Zstandard compressed file paths directly.
Thanks for your time!