
CSV parsing: ComputeError #15854

Open · 2 tasks done
CameronBieganek opened this issue Apr 23, 2024 · 3 comments
Labels
A-io-csv (Area: reading/writing CSV files) · bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

CameronBieganek commented Apr 23, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Use the following CSV file:

"serial_number","data_date","data_latitude","data_longitude","ign_status","is_power_on","is_zone_1_active","is_zone_1_door_open","unit_mode_detail","engine_hours","electrical_hours","engine_rpm","voltage","ambient_temperature","set_point_1","discharge_air_1","return_air_1","power_off_description","system_operating_mode","zone_1_control_condition"
"6001320386",2021-10-11 20:02:47.000,35.464762,-97.542528,false,False,,False,,6359,0,,13.57,,,,,Countdown,,

And the following Python script:

import polars as pl

schema = {
    "serial_number": pl.Utf8,
    "data_date": pl.Datetime,
    "data_latitude": pl.Float64,
    "data_longitude": pl.Float64,
    "ign_status": pl.Boolean,
    "is_power_on": pl.Boolean,
    "is_zone_1_active": pl.Boolean,
    "is_zone_1_door_open": pl.Boolean,
    "unit_mode_detail": pl.Utf8,
    "system_operating_mode": pl.Utf8,
    "zone_1_control_condition": pl.Utf8,
    "power_off_description": pl.Utf8,
    "engine_hours": pl.Float64,
    "electrical_hours": pl.Float64,
    "engine_rpm": pl.Float64,
    "voltage": pl.Float64,
    "ambient_temperature": pl.Float64,
    "set_point_1": pl.Float64,
    "discharge_air_1": pl.Float64,
    "return_air_1": pl.Float64
}

data = pl.read_csv("test.csv", schema=schema)

Output:

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
File ~\projects\polars_env\csv_parsing_bug.py:28
      3 import polars as pl
      5 schema = {
      6     "serial_number": pl.Utf8,
      7     "data_date": pl.Datetime,
   (...)
     25     "return_air_1": pl.Float64
     26 }
---> 28 data = pl.read_csv("test.csv", schema=schema)

File ~\projects\polars_env\venv\Lib\site-packages\polars\_utils\deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~\projects\polars_env\venv\Lib\site-packages\polars\_utils\deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~\projects\polars_env\venv\Lib\site-packages\polars\_utils\deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~\projects\polars_env\venv\Lib\site-packages\polars\io\csv\functions.py:416, in read_csv(source, has_header, columns, new_columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma)
    404         dtypes = {
    405             new_to_current.get(column_name, column_name): column_dtype
    406             for column_name, column_dtype in dtypes.items()
    407         }
    409 with prepare_file_arg(
    410     source,
    411     encoding=encoding,
   (...)
    414     storage_options=storage_options,
    415 ) as data:
--> 416     df = _read_csv_impl(
    417         data,
    418         has_header=has_header,
    419         columns=columns if columns else projection,
    420         separator=separator,
    421         comment_prefix=comment_prefix,
    422         quote_char=quote_char,
    423         skip_rows=skip_rows,
    424         dtypes=dtypes,
    425         schema=schema,
    426         null_values=null_values,
    427         missing_utf8_is_empty_string=missing_utf8_is_empty_string,
    428         ignore_errors=ignore_errors,
    429         try_parse_dates=try_parse_dates,
    430         n_threads=n_threads,
    431         infer_schema_length=infer_schema_length,
    432         batch_size=batch_size,
    433         n_rows=n_rows,
    434         encoding=encoding if encoding == "utf8-lossy" else "utf8",
    435         low_memory=low_memory,
    436         rechunk=rechunk,
    437         skip_rows_after_header=skip_rows_after_header,
    438         row_index_name=row_index_name,
    439         row_index_offset=row_index_offset,
    440         sample_size=sample_size,
    441         eol_char=eol_char,
    442         raise_if_empty=raise_if_empty,
    443         truncate_ragged_lines=truncate_ragged_lines,
    444         decimal_comma=decimal_comma,
    445     )
    447 if new_columns:
    448     return _update_columns(df, new_columns)

File ~\projects\polars_env\venv\Lib\site-packages\polars\io\csv\functions.py:559, in _read_csv_impl(source, has_header, columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma)
    555         raise ValueError(msg)
    557 projection, columns = parse_columns_arg(columns)
--> 559 pydf = PyDataFrame.read_csv(
    560     source,
    561     infer_schema_length,
    562     batch_size,
    563     has_header,
    564     ignore_errors,
    565     n_rows,
    566     skip_rows,
    567     projection,
    568     separator,
    569     rechunk,
    570     columns,
    571     encoding,
    572     n_threads,
    573     path,
    574     dtype_list,
    575     dtype_slice,
    576     low_memory,
    577     comment_prefix,
    578     quote_char,
    579     processed_null_values,
    580     missing_utf8_is_empty_string,
    581     try_parse_dates,
    582     skip_rows_after_header,
    583     parse_row_index_args(row_index_name, row_index_offset),
    584     sample_size=sample_size,
    585     eol_char=eol_char,
    586     raise_if_empty=raise_if_empty,
    587     truncate_ragged_lines=truncate_ragged_lines,
    588     decimal_comma=decimal_comma,
    589     schema=schema,
    590 )
    591 return wrap_df(pydf)

ComputeError: could not parse `Countdown` as dtype `f64` at column 'set_point_1' (column number 18)

The current offset in the file is 457 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Countdown` to the `null_values` list.

Original error: ```remaining bytes non-empty```

Installed versions

--------Version info---------
Polars:               0.20.22
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
CameronBieganek added the bug, needs triage, and python labels on Apr 23, 2024
CameronBieganek (Author) commented Apr 23, 2024

Note that scan_csv works, like this:

data = pl.scan_csv("test.csv", schema=schema)

...where the file and the schema dictionary are the same as above. I'm guessing the error with read_csv happens because the column order in the schema does not match the column order in the CSV. Normally I'd expect the order of entries in a dictionary to be immaterial, although as of Python 3.7 the built-in dict is guaranteed to preserve insertion order (CPython 3.6 already did so as an implementation detail).

I have a very similar issue open already. Basically this comes down to very poor error messages when the schema argument is involved. Not to mention, the docstring entry for schema could be more explicit about the requirements, e.g. that the order of entries in the dictionary must match the order of the columns in the file.
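
A possible workaround (my sketch, not from the thread; the helper name is hypothetical and not part of the Polars API): read only the header row to learn the file's column order, then rebuild the schema dict in that order before calling read_csv:

import polars as pl

# Hypothetical helper: reorder a schema dict so its keys match the
# column order of the file's header row.
def schema_in_file_order(path, schema):
    header = pl.read_csv(path, n_rows=0).columns  # header only, no data rows
    return {name: schema[name] for name in header}

data = pl.read_csv("test.csv", schema=schema_in_file_order("test.csv", schema))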

cmdlineluser (Contributor) commented Apr 23, 2024

That is odd.

Here is a visualization of how schema is treated differently by read_csv and scan_csv:

import tempfile
import polars as pl

f = tempfile.NamedTemporaryFile()
f.write(b"""
A,B
1,2
""".strip())
f.seek(0)

pl.read_csv(f.name, schema={"B": pl.String, "A": pl.Int32})
# shape: (1, 2)
# ┌─────┬─────┐
# │ B   ┆ A   │
# │ --- ┆ --- │
# │ str ┆ i32 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘

pl.scan_csv(f.name, schema={"B": pl.String, "A": pl.Int32}).collect()
# shape: (1, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i32 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘

[Update]: It seems #11723 contains a mention of it. Found in the redesign issue, from @bradfordlynch:
Ran into this issue as well. It is particularly surprising because the schema argument accepts a plain dict, which suggests key order should not matter, yet here it does. I messed around with various fixes until realizing that the order of the keys was causing my problems. I've created a PR to improve the documentation until this is fixed. For reference, here is a minimal demonstration of the issue:

from io import StringIO

import polars as pl

csv = """A,B
1,"foo"
3,"bar"
"""

buf = StringIO(csv)

# Works fine
schema_good = {"A": pl.Int64, "B": pl.String}
pl.read_csv(buf, schema=schema_good)

# Raises ComputeError
buf.seek(0)
schema_bad = {"B": pl.String, "A": pl.Int64}
pl.read_csv(buf, schema=schema_bad)
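
For completeness, a continuation I've added (not part of the original comment): rebuilding the same dict with its keys in the file's header order makes read_csv succeed, which confirms that key order is what trips it up.

# My addition, not from the original comment: reorder schema_bad's keys
# to match the header row, and the same read succeeds.
buf.seek(0)
header = csv.splitlines()[0].split(",")  # ["A", "B"]
schema_fixed = {name: schema_bad[name] for name in header}
pl.read_csv(buf, schema=schema_fixed)  # no ComputeError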
