`read_csv`: `dtypes` not working and very confusing #14385

Julian-J-S · 2024-02-09T07:58:18Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# OPTION 1
pl.read_csv(
    source="a\n1\n2\n3".encode(),
    schema={"x": pl.UInt16},
    dtypes={"x": pl.Int32}, # no effect
)

# OPTION 2
pl.read_csv(
    source="a\n1\n2\n3".encode(),
    dtypes={"x": pl.Int32}, # works (but problematic imo: the CSV has no "x" column, yet the result has one)
)

# OPTION 3
pl.read_csv(
    source="a\n1\n2\n3".encode(),
    schema={"x": pl.UInt16},
    dtypes=[pl.Int32], # works!
)

Log output

# OPTION 1
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ u16 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

# OPTION 2
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i32 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

# OPTION 3
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i32 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Issue description

the dtypes parameter is very buggy and behaves unexpectedly

it does NOT work in dict format when the schema is also specified
it DOES work in dict format if no schema is specified, but it renames the column, which is incorrect imo and inconsistent with DataFrame creation behaviour
it DOES work in list format

Expected behavior

behaviour should be consistent and match what the user expects. Currently it is very confusing because you cannot know the result before you run the code; there seems to be a lot of "magic" behind the scenes.

creating a separate issue for this, but imo there needs to be a slight rework of schema and dtypes on multiple levels to create a good and reasonable user experience! (I am open to looking into this myself)

Installed versions

0.20.7

stinodego · 2024-06-08T09:19:26Z

dtypes has been renamed to schema_overrides. See intended behavior here:
#11723 (comment)

Your option 1 should indeed have Int32 as data type.
For your option 2, schema_overrides should have no effect.
For option 3, the result is correct.
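
The three outcomes described above can be sketched as a small name-resolution routine. This is only an illustration of the intended semantics as stated in this comment, not Polars code: `resolve_schema`, its parameters, and the use of plain strings for dtypes are all hypothetical, and it assumes (based on the log output in the report) that a full `schema` replaces the header names.

```python
from collections.abc import Mapping


def resolve_schema(header, inferred, schema=None, overrides=None):
    """Hypothetical sketch of how schema and schema_overrides could combine.

    header:    column names read from the CSV header
    inferred:  dtypes inferred for those columns (plain strings here)
    schema:    optional full schema; assumed to replace names and dtypes
    overrides: optional dict (by name) or list (by position) of dtypes
    """
    # Step 1: start from the full schema if given, else from inference.
    if schema is not None:
        resolved = dict(schema)
    else:
        resolved = dict(zip(header, inferred))

    if overrides is None:
        return resolved

    if isinstance(overrides, Mapping):
        # Step 2a: dict overrides apply by name; unknown names are ignored.
        for name, dtype in overrides.items():
            if name in resolved:
                resolved[name] = dtype
    else:
        # Step 2b: list overrides apply by position.
        for name, dtype in zip(list(resolved), overrides):
            resolved[name] = dtype

    return resolved


# Option 1: schema plus dict override on a column named in the schema
option_1 = resolve_schema(["a"], ["Int64"], schema={"x": "UInt16"}, overrides={"x": "Int32"})
# Option 2: dict override for a column that does not exist
option_2 = resolve_schema(["a"], ["Int64"], overrides={"x": "Int32"})
# Option 3: schema plus positional (list) override
option_3 = resolve_schema(["a"], ["Int64"], schema={"x": "UInt16"}, overrides=["Int32"])
```

Under these assumptions, options 1 and 3 both end up with column `x` as Int32, while option 2 leaves column `a` with its inferred dtype and never creates an `x` column.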

Julian-J-S added the bug, needs triage, and python labels Feb 9, 2024

This was referenced Mar 27, 2024

fix(python, rust): read_csv column order did not follow the columns parameter #15317

Closed

Proposal: Re-design columns, new_columns, schema, dtypes in read_csv #15431

Closed

stinodego added the P-medium label and removed the needs triage label Jun 8, 2024

stinodego mentioned this issue Jun 8, 2024

read_csv() and scan_csv() usually corrupt the schema when dtypes contains columns not in the schema #15605

Closed

stinodego added the A-io-csv Area: reading/writing CSV files label Jun 8, 2024
