read_csv: dtypes not working and very confusing #14385

Open
2 tasks done
Julian-J-S opened this issue Feb 9, 2024 · 1 comment
Labels
A-io-csv Area: reading/writing CSV files bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@Julian-J-S
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# OPTION 1
pl.read_csv(
    source="a\n1\n2\n3".encode(),
    schema={"x": pl.UInt16},
    dtypes={"x": pl.Int32}, # no effect
)

# OPTION 2
pl.read_csv(
    source="a\n1\n2\n3".encode(),
    dtypes={"x": pl.Int32},  # works (but I think this is also problematic, because the CSV has no "x" column)
)

# OPTION 3
pl.read_csv(
    source="a\n1\n2\n3".encode(),
    schema={"x": pl.UInt16},
    dtypes=[pl.Int32], # works!
)

Log output

# OPTION 1
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ u16 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

# OPTION 2
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i32 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

# OPTION 3
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i32 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Issue description

The dtypes parameter behaves in a buggy and unexpected way:

  • it does NOT work in dict format when the schema is also specified
  • it DOES work in dict format if no schema is specified, but it renames the column, which is incorrect imo and inconsistent with DataFrame creation behaviour
  • it DOES work in list format

Expected behavior

Behaviour should be consistent and match what the user expects. Currently it is very confusing: you cannot know the result before you run the code, because there seems to be a lot of "magic" behind the scenes.

I am creating a separate issue for this, but imo there needs to be a slight rework of schema and dtypes on multiple levels to create a good and reasonable user experience! (I am open to looking into this myself.)

Installed versions

0.20.7
@Julian-J-S Julian-J-S added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 9, 2024
@stinodego stinodego added P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Jun 8, 2024
@stinodego
Member

dtypes has been renamed to schema_overrides. See intended behavior here:
#11723 (comment)

Your option 1 should indeed have Int32 as data type.
For your option 2, schema_overrides should have no effect.
For option 3, the result is correct.
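
To make the three intended outcomes above concrete, here is a minimal pure-Python sketch of that resolution order. It is only a model of the behaviour described in this comment, not Polars code: `resolve_schema` is a hypothetical helper, plain strings stand in for Polars dtypes, and the inferred dtype `Int64` for column "a" is an assumption.

```python
from collections.abc import Mapping, Sequence
from typing import Optional, Union

Schema = dict[str, str]  # column name -> dtype name (strings stand in for Polars dtypes)
Overrides = Union[Mapping[str, str], Sequence[str]]


def resolve_schema(
    inferred: Mapping[str, str],
    schema: Optional[Mapping[str, str]] = None,
    overrides: Optional[Overrides] = None,
) -> Schema:
    """Hypothetical model of the intended resolution order; not the Polars implementation."""
    # A full schema, when given, replaces the inferred names and dtypes.
    resolved: Schema = dict(schema) if schema is not None else dict(inferred)
    if overrides is None:
        return resolved
    if isinstance(overrides, Mapping):
        # Dict form: override by name; names that are not columns are ignored.
        for name, dtype in overrides.items():
            if name in resolved:
                resolved[name] = dtype
    else:
        # List form: override by position, left to right.
        for name, dtype in zip(list(resolved), overrides):
            resolved[name] = dtype
    return resolved


# Assumed inference result for the CSV "a\n1\n2\n3".
inferred = {"a": "Int64"}

option_1 = resolve_schema(inferred, schema={"x": "UInt16"}, overrides={"x": "Int32"})
option_2 = resolve_schema(inferred, overrides={"x": "Int32"})
option_3 = resolve_schema(inferred, schema={"x": "UInt16"}, overrides=["Int32"])
print(option_1, option_2, option_3)
```

Under this model, option 1 ends up with column "x" as Int32, option 2 keeps column "a" unchanged because the override names a column that does not exist, and option 3 applies Int32 by position.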

Projects
Status: Ready

2 participants