Reading CSV files with variable number of columns not supported #1505

Open
allspatial opened this issue Oct 8, 2021 · 14 comments
Labels
accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), good first issue (Good for newcomers)

Comments

@allspatial

allspatial commented Oct 8, 2021

Are you using Python or Rust?

Python

Which feature gates did you use?

This can be ignored by Python users.

What version of polars are you using?

0.9.12

What operating system are you using polars on?

macOS

Describe your bug.

When reading a CSV file with a variable number of columns, polars assumes every row has the number of columns inferred from the first row (?) and does not parse any columns beyond that. Passing the columns to parse explicitly via the columns parameter results in an error:

RuntimeError: Any(NotFound("Unable to get field named "column_4". Valid fields: ["column_1", "column_2", "column_3"]"))

What are the steps to reproduce the behavior?

Dataset (test.csv):

a,b,c
a,b,c,d,e,f
g,h,i,j,k

Example 1 (no error but reads only 3 columns instead of 6)

import polars as pl

df = pl.read_csv("/tmp/test.csv", has_headers=False)

Example 2 (results in an error)

import polars as pl

df = pl.read_csv("/tmp/test.csv", has_header=False, infer_schema_length=0,
                 columns=["column_1", "column_2", "column_3", "column_4", "column_5", "column_6"])

What is the actual behavior?

Columns beyond the ones inferred from the first data row are not parsed.

What is the expected behavior?

All columns are parsed but are set to NaN/None for rows that don't have data for these columns.

@ghuls
Collaborator

ghuls commented Oct 8, 2021

You can use xsv fixlengths to fix this kind of broken CSV file:

$ cat test.csv
a,b,c
a,b,c,d,e,f
g,h,i,j,k

$ xsv fixlengths test.csv 
a,b,c,,,
a,b,c,d,e,f
g,h,i,j,k,
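For readers without xsv, the same padding can be sketched in Python with only the standard library. This is a minimal sketch, not how xsv works internally: the function name `fix_lengths` is made up for illustration, and it assumes the whole file fits in memory.

```python
import csv
import io


def fix_lengths(text: str) -> str:
    """Pad every CSV record with empty fields up to the longest record."""
    rows = list(csv.reader(io.StringIO(text)))
    # Width of the widest record; 0 for an empty input.
    width = max((len(row) for row in rows), default=0)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Append as many empty fields as this record is missing.
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()


sample = "a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n"
padded = fix_lengths(sample)
print(padded, end="")
```

The padded text can then be written back to disk and read as a regular CSV with a fixed number of columns.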

@allspatial
Author

Many thanks! That's a very useful tool I wasn't aware of.

@ritchie46 ritchie46 added the good first issue Good for newcomers label Oct 11, 2021
@ritchie46
Member

When there is no header in the CSV file, we use the first line to determine the new column names (column_1, column_2, ..., column_n). We should probably use the maximum line length among the lines we scan for dtype inference instead.
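The idea in this comment can be illustrated with a small Python sketch. It is not the polars implementation (which lives in Rust); the function name `infer_column_names` and the `n_rows` parameter are hypothetical, and the sketch only shows how taking the widest of the scanned rows changes the generated names.

```python
import csv
import io


def infer_column_names(text: str, n_rows: int = 100) -> list[str]:
    """Generate column_1..column_n, where n is the widest of the first n_rows records."""
    reader = csv.reader(io.StringIO(text))
    width = 0
    for i, row in enumerate(reader):
        if i >= n_rows:
            break
        width = max(width, len(row))
    return [f"column_{i}" for i in range(1, width + 1)]


sample = "a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n"
names = infer_column_names(sample)                     # scans all three rows
first_row_only = infer_column_names(sample, n_rows=1)  # mimics using only the first line
print(names)
print(first_row_only)
```

With the dataset from the report, scanning only the first line yields three names, while scanning all three rows yields six.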

@ghuls
Collaborator

ghuls commented Oct 12, 2021

@ritchie46 Where is that code located? For #1492 it would probably also be better if the column names can be retrieved as the code I have now to fix it in python only will work in specific conditions (when we got the column names as input or when they are autogenerated, but not in other cases).

@ritchie46
Member

> @ritchie46 Where is that code located? For #1492 it would probably also be better if the column names can be retrieved as the code I have now to fix it in python only will work in specific conditions (when we got the column names as input or when they are autogenerated, but not in other cases).

Here it is:

I think only the else (no-header) branch matters in this case. If there is a header, I think that should be the source of truth with regard to the number of fields.

> For #1492 it would probably also be better if the column names can be retrieved as the code I have now to fix it in python only will work in specific conditions (when we got the column names as input or when they are autogenerated, but not in other cases).

Only when the column names are overwritten and there is no-header, we should modify it, I think. The other cases the dtypes dict should be correct right? So I believe we have all information to overwrite the new_names with the auto-generated ones.

@ghuls
Collaborator

ghuls commented Oct 12, 2021

> > @ritchie46 Where is that code located? For #1492 it would probably also be better if the column names can be retrieved as the code I have now to fix it in python only will work in specific conditions (when we got the column names as input or when they are autogenerated, but not in other cases).
>
> Here it is:
>
> I think only the else (no-header) branch matters in this case. If there is a header, I think that should be the source of truth with regard to the number of fields.
>
> > For #1492 it would probably also be better if the column names can be retrieved as the code I have now to fix it in python only will work in specific conditions (when we got the column names as input or when they are autogenerated, but not in other cases).
>
> Only when the column names are overwritten and there is no-header, we should modify it, I think. The other cases the dtypes dict should be correct right? So I believe we have all information to overwrite the new_names with the auto-generated ones.

Not when the user provides new_columns.

@pradkrish
Contributor

I am not sure I understand the issue here. I see that CsvReader has an argument max_records, which can be used to do a full table scan for inferring the number of columns. Is it about exposing that argument in the Python API?

@ritchie46
Member

ritchie46 commented Nov 15, 2021

> I am not sure I understand the issue here. I see that CsvReader has an argument max_records, which can be used to do a full table scan for inferring the number of columns. Is it about exposing that argument in the Python API?

I think I already fixed this issue.

Edit: not entirely certain anymore

@pradkrish
Contributor

Okay, I will be happy to get the commit that you think might have fixed the issue.

@ghuls
Collaborator

ghuls commented Nov 16, 2021

I normally fixed it here: ee26601

cannero added commits to cannero/polars that referenced this issue on Jan 8, Jan 31, and Feb 22 (twice), 2023:
Infer the number of columns of a header-less csv from the same group
of rows which are used to infer the types.
The old logic is used if `infer_schema_length` is set to 0.

Closes pola-rs#1505
cannero added commits to cannero/polars that referenced this issue on Mar 18, Mar 20, Apr 7 (twice), and May 6, 2023:
Infer the number of columns of a header-less csv from the same group
of rows which are used to infer the types.
The old logic, to count only the columns in the first row, is used if
`infer_schema_length` is set to 0.

Closes pola-rs#1505
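The rule described in these commit messages can be sketched in Python as follows. This is only an illustration of the described behavior, not the code from the referenced commits: the function name `count_columns` is hypothetical, and treating `None` as "scan every row" is an assumption that the commit messages do not state.

```python
from itertools import islice
from typing import Iterable, Optional, Sequence


def count_columns(rows: Iterable[Sequence[str]], infer_schema_length: Optional[int]) -> int:
    """Column count of a header-less file, following the rule in the commit message.

    0 falls back to the old logic (first row only); a positive value takes the
    widest of that many rows; None scanning all rows is an assumption.
    """
    if infer_schema_length == 0:
        limit = 1  # old logic: count only the columns in the first row
    else:
        limit = infer_schema_length  # islice treats None as "no limit"
    return max((len(row) for row in islice(rows, limit)), default=0)


rows = [
    ["a", "b", "c"],
    ["a", "b", "c", "d", "e", "f"],
    ["g", "h", "i", "j", "k"],
]
print(count_columns(rows, 0))
print(count_columns(rows, 2))
print(count_columns(rows, None))
```

On the dataset from the report, the fallback gives 3 columns, while scanning two or more rows gives 6.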
@stinodego stinodego added enhancement New feature or an improvement of an existing feature and removed feature labels Jul 14, 2023
@jmakov

jmakov commented Sep 12, 2023

Any updates on this? Still doesn't work using infer_schema_length=0 or =None.

@stinodego stinodego added the accepted Ready for implementation label Jan 13, 2024
@Nagaprasadvr

Is this issue resolved? Can I take this and open a PR?

@deanm0000
Collaborator

@Nagaprasadvr I'm not a code reviewer, so I can't give an absolutely definitive answer, but @stinodego marked it as accepted, and if it were fixed it'd be closed, so I don't see why not.

One caveat is that it needs to be a Rust fix, not a Python fix, as the maintainers don't want feature divergence between Rust and Python.

@Nagaprasadvr

Thanks, I will take this issue and open a PR.

Projects
Status: Ready
9 participants