pl.read_csv_batched() fails if dtypes is provided and not all columns are used #9654

Open
2 tasks done
julibeg opened this issue Jul 1, 2023 · 2 comments
Labels
A-io Area: reading and writing data bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

julibeg commented Jul 1, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

There appears to be a bug in the code handling setting dtypes in pl.read_csv_batched(). When I use columns to select a subset of columns and set their dtypes with dtypes, it errors out because it tries to set the dtype of an unselected column. Running pl.read_csv() on the same file with the same arguments works just fine.

Reproducible example

import polars as pl

# write df with three columns for MRE
pl.DataFrame(
    {
        "c1": ["a", "b", "c"],
        "c2": [1, 2, 3],
        "c3": [10, 20, 30],
    }
).write_csv("xmpl.csv")

# we only want the last two columns and set them to u64
cols = ["c2", "c3"]
dtypes = {"c2": pl.UInt64, "c3": pl.UInt64}

# `pl.read_csv()` works
print(
    pl.read_csv(
        "xmpl.csv",
        columns=cols,
        dtypes=dtypes,
    )
)

# but `pl.read_csv_batched()` fails
reader = pl.read_csv_batched(
    "xmpl.csv",
    columns=cols,
    dtypes=dtypes,
)
print(reader.next_batches(1))

Expected behavior

The expected behaviour is that the CSV is parsed just like with pl.read_csv(). Instead, I get:

Traceback (most recent call last):
  File "...", line 29, in <module>
    print(reader.next_batches(1))
  File "/.../polars/io/csv/batched_reader.py", line 132, in next_batches
    batches = self._reader.next_batches(n)
exceptions.ComputeError: Could not parse `a` as dtype `u64` at column 'c2' (column number 1).
The current offset in the file is 9 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `a` to the `null_values` list.

Installed versions

--------Version info---------
Polars:      0.18.4
Index type:  UInt32
Platform:    Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:      3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]

----Optional dependencies----
numpy:       1.23.4
pandas:      1.5.1
pyarrow:     11.0.0
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  3.6.2
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>
@julibeg julibeg added bug Something isn't working python Related to Python Polars labels Jul 1, 2023
@JanaganSaravanan

Hi @julibeg! That is an interesting issue. Here is what I learned:

  1. pl.read_csv_batched() reads a large file in small chunks (batches). Because of this incremental processing, some features, such as precise column selection and dtype specification, may not be fully supported or may behave differently from pl.read_csv().
  2. The solutions suggested in the error message either do not work or are not optimal.

Suggestion:
For this issue, you could include dtypes of all columns or remove the dtypes argument entirely from pl.read_csv_batched() (letting it infer all the dtypes).
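
A minimal sketch of the first option, for files with too many columns to list by hand. The helper name `dtypes_for_all_columns` is hypothetical and not part of Polars; it only reads the header row with the standard `csv` module and fills in a default dtype for every column that was not explicitly requested. The commented usage assumes the argument names from the example in this issue (Polars 0.18.x) and has not been checked against other versions.

```python
import csv


def dtypes_for_all_columns(path, wanted, default):
    """Return a dtype mapping that covers every column in the CSV header.

    Columns named in `wanted` keep their requested dtype; all remaining
    columns are mapped to `default` (for example a string dtype).
    """
    with open(path, newline="") as fh:
        header = next(csv.reader(fh))  # first row holds the column names
    return {name: wanted.get(name, default) for name in header}


# Hypothetical usage with the file from this issue (assumes Polars 0.18.x names):
#
#   import polars as pl
#   dtypes = dtypes_for_all_columns(
#       "xmpl.csv", {"c2": pl.UInt64, "c3": pl.UInt64}, pl.Utf8
#   )
#   reader = pl.read_csv_batched("xmpl.csv", columns=["c2", "c3"], dtypes=dtypes)
```

The mapping is built from the file's own header, so it stays in step with the file if columns are added or reordered.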

Happy to discuss more solutions!

julibeg commented Jul 3, 2023

Hi @JanaganSaravanan!

Thanks for your comment!

> For this issue, you could include dtypes of all columns

Good point! This is the workaround I have resorted to for now. Still, it would be preferable if the problem were fixed in Polars (or, failing that, the documentation should at least mention that dtypes need to be provided for all columns).

@stinodego stinodego added needs triage Awaiting prioritization by a maintainer A-io Area: reading and writing data labels Jan 13, 2024