
Add BytesIO support to scan_csv #4950

Closed
nebfield opened this issue Sep 23, 2022 · 15 comments · Fixed by #18532
Labels
A-io-csv Area: reading/writing CSV files accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@nebfield

nebfield commented Sep 23, 2022

Problem Description

Firstly, thank you for making this fantastic library 😀

I have the following use case:

import zstandard
import polars as pl

with open(path, 'rb') as f:
    dctx = zstandard.ZstdDecompressor()
    with dctx.stream_reader(f) as reader:
        df = pl.read_csv(reader, sep='\t')

Here, path is the path to a Zstandard-compressed TSV file. I'm working with bioinformatics data, and bioinformaticians love to generate massive CSVs/TSVs and then compress them.

I would like to use scan_csv to read the decompressed BytesIO stream instead and take advantage of all the cool lazy evaluation features to reduce memory usage. Alternatively, it would be great if scan_csv supported Zstandard compressed file paths directly.

Thanks for your time!

@ghuls
Collaborator

ghuls commented Sep 23, 2022

Out of curiosity, is this Zstandard-compressed TSV file public data or is it from an internal pipeline? I've never seen any Zstandard-compressed TSV files in the wild myself, unfortunately.

@nebfield
Author

nebfield commented Sep 23, 2022

The file I'm trying to work with is a variant information file from plink2 (.pvar.zst or .bim.zst). plink is a popular toolkit used in genome wide association studies and related areas.

Some public data are available here.

.pvar spec

@indigoviolet

+1 to this feature request; in the meantime, you can use pandas.read_csv or pyarrow.csv.open_csv to get some of this behavior.

@ritchie46
Member

Why go via pandas? pl.read_csv accepts a BytesIO?

@AroneyS

AroneyS commented May 10, 2023

This would also help with #7514

@corneliusroemer

I managed to get read_csv to work using with xopen("data.csv.zst", "rb") as f; see #7287

@ghuls at Nextstrain we've started to zst compress everything, so yeah, there are now metadata.tsv.zst's in the wild :)

@corneliusroemer

@ritchie46: Why go via pandas? pl.read_csv accepts a BytesIO?

See #9266 for why pl.read_csv doesn't work in this case: unless I'm doing something wrong, read_csv reads the whole file into memory. I haven't found a way to emulate pandas' usecols behaviour for zstd-compressed CSVs with polars.
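The usecols behaviour mentioned here keeps only selected columns while streaming through a compressed file. A standard-library sketch of the same idea (using gzip in place of zstd, since zstandard is a third-party package):

```python
import csv
import gzip
import io

# Build a small compressed TSV in memory for the demo.
raw = b"chrom\tpos\tid\n1\t100\trs1\n2\t200\trs2\n"
blob = gzip.compress(raw)

# Stream-decompress row by row and keep only one column, without ever
# materializing the whole decompressed file -- roughly what usecols buys you.
with gzip.open(io.BytesIO(blob), mode="rt", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    positions = [row["pos"] for row in reader]

print(positions)  # ['100', '200']
```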

@ghuls
Collaborator

ghuls commented Jun 16, 2023

You can use parquet-fromcsv in the meantime to convert compressed CSV/TSV files to parquet and use pl.scan_parquet on them:
#9283 (comment)

@itamarst
Contributor

itamarst commented Aug 17, 2023

Looking through the code, it seems like the limitation eventually comes down to LazyScanBuilder::scan_csv(), which accepts an Into<std::path::PathBuf>. It then opens that path and converts it into a File. After a bit of back and forth this gets turned into a ReaderBytes, and later it builds a LogicalPlan::Scan, which takes a PathBuf.

LogicalPlan::Scan eventually becomes an ALogicalPlan::Scan, I think, and there's some code in physical_plan/planner/lp.rs that creates a CsvExec from that, which in turn creates a CsvReader... specifically for CSVs. An ALogicalPlan::Scan can also be Parquet etc., so the options are either splitting the work across multiple PRs, with an intermediate stage where #10413 is not yet done (but closer), or doing it all in one big change, which seems worse.

There are likely other code paths that interact with (A)LogicalPlan::Scan.

CsvReader works off of anything with MmapBytesReader trait (as does ParquetReader).

Specifically for the BytesIO case, to support MmapBytesReader one could call getbuffer(), which at least prevents resizing. One could then use the buffer API to extract a pointer to the underlying data, and multiple threads could read from it without holding the GIL, so long as a reference is also kept to the BytesIO object (or Python buffer object) so it isn't garbage-collected.
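The "prevents resizing" point can be seen from Python alone: while a getbuffer() view is exported, CPython refuses any operation that would resize the BytesIO, which is what makes lending the raw buffer to other threads safe. A small demonstration:

```python
import io

buf = io.BytesIO(b"a,b\n1,2\n")
view = buf.getbuffer()      # zero-copy memoryview of the underlying buffer

# While the view is exported, the BytesIO cannot be resized: a write
# that would grow the buffer raises BufferError.
buf.seek(0, io.SEEK_END)
blocked = False
try:
    buf.write(b"3,4\n")     # would need to grow the buffer
except BufferError:
    blocked = True
print(blocked)              # True

view.release()              # drop the export...
buf.write(b"3,4\n")         # ...and resizing writes succeed again
print(buf.getvalue())
```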

@itamarst
Contributor

My first thought for an implementation, based on the above, is replacing all the PathBufs in the various types (LogicalPlan::Scan etc.) with:

enum FileSource {
    Path(PathBuf),
    Reader(Box<dyn MmapBytesReader>),
}

@itamarst
Contributor

But... that's sort of limiting. The more general solution is making the CsvStreamer not assume the CSV fits in memory. In particular, if the CSV CoreReader could be modified to take a GetReaderBytes trait implementor as input instead of ReaderBytes (which would let it fetch ReaderBytes at a given position, and ReaderBytes could implement GetReaderBytes for itself), much of the code would stay the same.

trait GetReaderBytes {
    fn read(&self, offset: usize, length: usize) -> ReaderBytes;
}

@stinodego stinodego added the A-io-csv Area: reading/writing CSV files label Feb 17, 2024
@MaxPowerWasTaken

> You can use parquet-fromcsv in the meantime to convert compressed CSV/TSV files to parquet and use pl.scan_parquet on them: #9283 (comment)

I think I'm understanding correctly that this is a recommendation to use a rust library. Any advice for the less-cool among us who are still working pretty exclusively from a python environment?

@ghuls
Collaborator

ghuls commented Mar 19, 2024

> You can use parquet-fromcsv in the meantime to convert compressed CSV/TSV files to parquet and use pl.scan_parquet on them: #9283 (comment)

> I think I'm understanding correctly that this is a recommendation to use a rust library. Any advice for the less-cool among us who are still working pretty exclusively from a python environment?

It is a command line tool, but part of the rust arrow library.

@MaxPowerWasTaken

> You can use parquet-fromcsv in the meantime to convert compressed CSV/TSV files to parquet and use pl.scan_parquet on them: #9283 (comment)

> I think I'm understanding correctly that this is a recommendation to use a rust library. Any advice for the less-cool among us who are still working pretty exclusively from a python environment?

> It is a command line tool, but part of the rust arrow library.

awesome, thanks for the correction, yeah cli tools work for my use case.

@AlexanderNenninger

+1 to this. Implementing this proposal would allow for scanning memory-mapped files that need additional processing, e.g. parsing custom headers.

@c-peters c-peters added the accepted Ready for implementation label Sep 9, 2024