`pl.read_csv_batched()` fails if `dtypes` is provided and not all columns are used #9654

julibeg · 2023-07-01T10:21:30Z

Polars version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Issue description

There appears to be a bug in how `pl.read_csv_batched()` handles `dtypes`. When I use `columns` to select a subset of columns and set their dtypes with `dtypes`, it errors out because it tries to apply a dtype to an unselected column. Running `pl.read_csv()` on the same file with the same arguments works just fine.

Reproducible example

import polars as pl

# write df with three columns for MRE
pl.DataFrame(
    {
        "c1": ["a", "b", "c"],
        "c2": [1, 2, 3],
        "c3": [10, 20, 30],
    }
).write_csv("xmpl.csv")

# we only want the last two columns and set them to u64
cols = ["c2", "c3"]
dtypes = {"c2": pl.UInt64, "c3": pl.UInt64}

# `pl.read_csv()` works
print(
    pl.read_csv(
        "xmpl.csv",
        columns=cols,
        dtypes=dtypes,
    )
)

# but `pl.read_csv_batched()` fails
reader = pl.read_csv_batched(
    "xmpl.csv",
    columns=cols,
    dtypes=dtypes,
)
print(reader.next_batches(1))

Expected behavior

Expected behaviour is parsing the CSV just like `pl.read_csv()`. Instead I get:

Traceback (most recent call last):
  File "...", line 29, in <module>
    print(reader.next_batches(1))
  File "/.../polars/io/csv/batched_reader.py", line 132, in next_batches
    batches = self._reader.next_batches(n)
exceptions.ComputeError: Could not parse `a` as dtype `u64` at column 'c2' (column number 1).
The current offset in the file is 9 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `a` to the `null_values` list.

Installed versions

--------Version info---------
Polars:      0.18.4
Index type:  UInt32
Platform:    Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:      3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]

----Optional dependencies----
numpy:       1.23.4
pandas:      1.5.1
pyarrow:     11.0.0
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  3.6.2
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>


JanaganSaravanan · 2023-07-03T05:29:04Z

Hi @julibeg! That is an interesting issue. Here is what I learned:

`pl.read_csv_batched` reads a large file in small chunks, or batches. Because of this incremental processing, some functionality, such as precise column selection and dtype specification, may not be fully supported or may have limitations compared to `pl.read_csv()`.
The solutions suggested in the error message either do not work or are not optimal.

Suggestion:
For this issue, you could include the dtypes of all columns, or remove the `dtypes` argument entirely from `pl.read_csv_batched` (letting it infer all the dtypes).

Happy to discuss more solutions!

julibeg · 2023-07-03T07:49:44Z

Hi @JanaganSaravanan!

Thanks for your comment!

> For this issue, you could include dtypes of all columns

Good point! This is the solution I have resorted to for now. Still, it would be preferable if the problem were solved in Polars (or, alternatively, the documentation should at least mention that all dtypes need to be provided).

julibeg added the bug and python labels Jul 1, 2023

stinodego added the needs triage and A-io labels Jan 13, 2024

CanglongCl mentioned this issue Apr 2, 2024

Proposal: Re-design columns, new_columns, schema, dtypes in read_csv #15431

Closed
