Reading CSV files with variable number of columns not supported #1505

allspatial · 2021-10-08T14:08:06Z

Are you using Python or Rust?

Python

Which feature gates did you use?

This can be ignored by Python users.

What version of polars are you using?

0.9.12

What operating system are you using polars on?

macOS

Describe your bug.

When reading a CSV file with a variable number of columns, polars assumes that every row has the number of columns inferred from the first row (?) and skips parsing any additional columns. Providing the columns to be parsed explicitly via the columns parameter results in an error:

RuntimeError: Any(NotFound("Unable to get field named "column_4". Valid fields: ["column_1", "column_2", "column_3"]"))

What are the steps to reproduce the behavior?

Dataset (test.csv):

a,b,c
a,b,c,d,e,f
g,h,i,j,k

Example 1 (no error but reads only 3 columns instead of 6)

import polars as pl

df = pl.read_csv("/tmp/test.csv", has_headers=False)

Example 2 (results in error)

import polars as pl

df = pl.read_csv("/tmp/test.csv", has_header=False, infer_schema_length=0,
                 columns=["column_1", "column_2", "column_3", "column_4", "column_5", "column_6"])

What is the actual behavior?

Columns beyond the ones inferred from the first data row are not parsed.

What is the expected behavior?

All columns are parsed but are set to NaN/None for rows that don't have data for these columns.

ghuls · 2021-10-08T14:22:14Z

You can use xsv fixlengths to fix that kind of broken CSV file:

$ cat test.csv
a,b,c
a,b,c,d,e,f
g,h,i,j,k

$ xsv fixlengths test.csv 
a,b,c,,,
a,b,c,d,e,f
g,h,i,j,k,
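If xsv is not at hand, the same padding step can be sketched in plain Python with the standard csv module. This is only an illustration of the idea; fix_lengths is a made-up helper name, not part of xsv or polars, and it reads the whole input into memory.

```python
import csv
import io


def fix_lengths(text: str) -> str:
    """Pad every record with empty fields up to the widest record.

    Hypothetical helper that mimics what `xsv fixlengths` does by default.
    """
    rows = list(csv.reader(io.StringIO(text)))
    # Width of the widest record; 0 for empty input.
    width = max((len(row) for row in rows), default=0)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Append as many empty fields as this record is missing.
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()


sample = "a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n"
padded = fix_lengths(sample)
print(padded, end="")
```

The padded text can then be written back to a file and read with pl.read_csv as in Example 1, since every row now has the same number of fields.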

allspatial · 2021-10-08T19:43:31Z

Many thanks! That's a very useful tool I wasn't aware of.

ritchie46 · 2021-10-11T14:19:35Z

When there is no header present in the CSV file, we use the first line to determine the new column names (column_1, column_2, ..., column_n). We should probably use the maximum number of fields among the lines we scan for dtype inference instead.

ghuls · 2021-10-12T14:37:41Z

@ritchie46 Where is that code located? For #1492 it would probably also be better if the column names could be retrieved, as the code I have now to fix it in Python only works under specific conditions (when we get the column names as input or when they are autogenerated, but not in other cases).

ritchie46 · 2021-10-12T15:01:00Z

Here it is:

polars/polars/polars-io/src/csv_core/utils.rs

Line 141 in 3d99b45

if has_header {

I think only the else (no-header) branch matters in this case. If there is a header, I think that should be the source of truth with regard to the number of fields.

> For #1492 it would probably also be better if the column names could be retrieved, as the code I have now to fix it in Python only works under specific conditions (when we get the column names as input or when they are autogenerated, but not in other cases).

I think we should modify it only when the column names are overwritten and there is no header. In the other cases the dtypes dict should be correct, right? So I believe we have all the information needed to overwrite new_names with the auto-generated ones.

ghuls · 2021-10-12T15:28:24Z

Not when the user provides new_columns.

pradkrish · 2021-11-15T20:21:54Z

I am not sure I understand the issue here. I see that CsvReader has an argument max_records, which can be used to do a full table scan to infer the number of columns. Is this about exposing that argument in the Python API?

ritchie46 · 2021-11-15T20:33:01Z

I think I already fixed this issue.

Edit: not entirely certain anymore

pradkrish · 2021-11-15T21:27:55Z

Okay, I will be happy to get the commit that you think might have fixed the issue.

ghuls · 2021-11-16T07:40:04Z

It should normally have been fixed here: ee26601

Infer the number of columns of a header-less csv from the same group of rows which are used to infer the types. The old logic, which counts only the columns in the first row, is used if `infer_schema_length` is set to 0. Closes pola-rs#1505
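The rule in that commit message can be sketched in Python roughly as follows. This is not the polars implementation (which lives in the Rust CSV reader); infer_column_names is a hypothetical helper, the default of 100 rows is only an illustrative value, and treating None as "scan every row" is an assumption.

```python
import csv
import io
from itertools import islice
from typing import List, Optional


def infer_column_names(
    text: str, infer_schema_length: Optional[int] = 100
) -> List[str]:
    """Name the columns of a header-less CSV after its widest sampled row.

    Hypothetical sketch: 0 falls back to counting only the first row,
    None is assumed to mean "scan every row".
    """
    reader = csv.reader(io.StringIO(text))
    if infer_schema_length == 0:
        # Old logic: only the first row decides the number of columns.
        sample_rows = list(islice(reader, 1))
    else:
        # islice(reader, None) consumes the whole file.
        sample_rows = list(islice(reader, infer_schema_length))
    width = max((len(row) for row in sample_rows), default=0)
    # Auto-generated names follow the column_1 .. column_n pattern.
    return [f"column_{i}" for i in range(1, width + 1)]


sample = "a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n"
names = infer_column_names(sample)
first_row_only = infer_column_names(sample, infer_schema_length=0)
print(names)
print(first_row_only)
```

With the test.csv data from the report, the sampled rows include the six-field second row, so six names are generated, while infer_schema_length=0 yields only the three names taken from the first row.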

jmakov · 2023-09-12T19:58:55Z

Any updates on this? Still doesn't work using infer_schema_length=0 or =None.

Nagaprasadvr · 2024-01-25T05:16:38Z

Is this issue resolved? Can I take this and open a PR?

deanm0000 · 2024-01-25T23:04:51Z

@Nagaprasadvr I'm not a code reviewer, so I can't give an absolutely definitive answer, but @stinodego marked it as accepted, and if it were fixed it would be closed, so I don't see why not.

One caveat is that it needs to be a Rust fix, not a Python fix, as the maintainers don't want feature divergence between Rust and Python.

Nagaprasadvr · 2024-01-26T02:25:21Z

Thanks, I will take this issue and open a PR.

ritchie46 added the good first issue label Oct 11, 2021

zundertj added the feature label Dec 26, 2021

cannero mentioned this issue Jan 8, 2023

feat(rust,python): csv with variable number of columns #6124

Closed

stinodego added the enhancement label and removed the feature label Jul 14, 2023

stinodego added the accepted label Jan 13, 2024

Nagaprasadvr mentioned this issue Jan 26, 2024

fix: Allow parsing csv file which doesn't has a header and has rows with variable number of columns #14012

Open
