Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_csv writes wrong with NaN value #18676

Closed
jackasser opened this issue Dec 7, 2017 · 23 comments · Fixed by #19091
Closed

to_csv writes wrong with NaN value #18676

jackasser opened this issue Dec 7, 2017 · 23 comments · Fixed by #19091
Labels
IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Testing pandas testing functions or related to the test suite
Milestone

Comments

@jackasser
Copy link

to_csv with Nan value at top row, unexpected "" in the csv file

import pandas as pd
df = pd.DataFrame([None,1,2])
df.to_csv("df.csv",header=None,index=None,encoding ='utf-8')

# df.csv
#""
#1.0
#2.0

# I want to make df.csv
#
#1.0
#2.0

Versions:

pandas.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
​
@jreback
Copy link
Contributor

jreback commented Dec 8, 2017

this does look like a bug.

@gfyoung ?

@jreback jreback added the IO CSV read_csv, to_csv label Dec 8, 2017
@gfyoung gfyoung added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Dec 8, 2017
@gfyoung
Copy link
Member

gfyoung commented Dec 8, 2017

Agreed. I can also replicate this on 0.21.0 (@jackasser : I noticed that you were on 0.20.3). Investigation and PR are welcome!

@Licht-T
Copy link
Contributor

Licht-T commented Dec 8, 2017

I am working on this. This seems not the encoding issue.

df.to_csv("df.csv",header=None,index=None)
[pandas] cat df.csv                                                                                                       14:54:23  ☁  add-tuple-list-factorize-test ☂ ✭
""
1.0
2.0

@Licht-T
Copy link
Contributor

Licht-T commented Dec 8, 2017

Also, this returns the right result.

df = pd.DataFrame([1,None,1,2])

@Licht-T
Copy link
Contributor

Licht-T commented Dec 8, 2017

@gfyoung Seems that the bug or the default spec. in csv library.

Case 1

Input:

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=csv.excel)
w.writerow(['1'])
w.writerow([''])
fp.close()

Output:

1
 

Case 2

Input:

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=csv.excel)
w.writerow([''])
w.writerow(['1'])
fp.close()

Output:

""
1

@Licht-T
Copy link
Contributor

Licht-T commented Dec 8, 2017

But this works.

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=None)
w.writerow(['', '1'])
w.writerow(['3', '2'])
fp.close()
,1
3,2

@Licht-T
Copy link
Contributor

Licht-T commented Dec 8, 2017

The csv library in CPython says the single empty field is quoted.
https://github.com/python/cpython/blob/0b3ec192259a65971001ce8f0de85a9c1e71d9c7/Modules/_csv.c#L1244

I don't know why this is needed.

@gfyoung
Copy link
Member

gfyoung commented Dec 8, 2017

@Licht-T : Ah! That's very good to know. Okay, this means that this issue is out of the control of the pandas library since we do wrap Python csv library when doing this.

@jackasser : Looks like you may have hit upon a point of contention in Python's CSV library. I would raise this issue in their library by submitting an issue on their python.org website.

@gfyoung
Copy link
Member

gfyoung commented Dec 8, 2017

Closing because this is out of pandas control. @Licht-T : Thanks for the investigation!

@gfyoung gfyoung closed this as completed Dec 8, 2017
@gfyoung gfyoung removed the Bug label Dec 8, 2017
@gfyoung gfyoung added this to the No action milestone Dec 8, 2017
@jreback jreback reopened this Dec 8, 2017
@jreback
Copy link
Contributor

jreback commented Dec 8, 2017

@gfyoung this can be fixed on pandas side
just because the csv library doesn’t do this correctly
we shouldn’t be inconsistent

doesn’t passing dialect=None work

@gfyoung
Copy link
Member

gfyoung commented Dec 8, 2017

@jreback : to_csv uses Python's csv.writer class to write to CSV. The dialect parameter goes straight to Python's csv writer, and that AFAICT has no impact on the output in the example @Licht-T provided.

Here's the code where we initialize our writer:

https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/format.py#L1644-L1656

UnicodeWriter utilizes the csv.writer class behind the scenes. Thus, we are always using Python's csv library to get the job done, meaning our CSV writing is vulnerable to any issues in Python's CSV module for the time being.

@gfyoung
Copy link
Member

gfyoung commented Dec 8, 2017

I suppose we could hack our away around this by checking for an empty first row before writing it to CSV and replace it with a space for example, though again as I said, very hackish IMO.

@jreback
Copy link
Contributor

jreback commented Dec 8, 2017

try passing dialect=None when we pass to the csvwriter

@jreback jreback removed this from the No action milestone Dec 8, 2017
@Licht-T
Copy link
Contributor

Licht-T commented Dec 8, 2017

@jreback I already tried, but the result is same.

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=None)
w.writerow([''])
w.writerow(['1'])
fp.close()
""
1

@Licht-T
Copy link
Contributor

Licht-T commented Dec 8, 2017

The only parameter that makes some impact is quoting.

import csv
fp = open('test.csv', 'w')
w = csv.writer(fp, dialect=None, quoting=csv.QUOTE_NONE)
w.writerow([''])
w.writerow(['1'])
fp.close()
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-56-97051671206d> in <module>()
      2 fp = open('test.csv', 'w')
      3 w = csv.writer(fp, dialect=csv.excel,quoting=csv.QUOTE_NONE)
----> 4 w.writerow([''])
      5 w.writerow(['1'])
      6 fp.close()

Error: single empty field record must be quoted

@Licht-T
Copy link
Contributor

Licht-T commented Dec 8, 2017

Well..., this csv library behavior is very inconsistent and should be fixed in CPython.
Or, is there any reason to do such behavior...?
https://github.com/python/cpython/blame/0b3ec192259a65971001ce8f0de85a9c1e71d9c7/Modules/_csv.c#L1244

@jreback
Copy link
Contributor

jreback commented Dec 8, 2017

@Licht-T certainly can file a bug report there.

ok I guess no easy way to fix this here, however, maybe we should add a small note in the code about this?

@jackasser
Copy link
Author

@jreback @Licht-T @gfyoung thank you to considering about it!
I understand that this problem should solve in Python's CSV library.
But I think we should add a small note in the code.

@Licht-T
Copy link
Contributor

Licht-T commented Dec 8, 2017

@jreback Created the issue on CPython.
https://bugs.python.org/issue32255

I'll add the small note on pandas.
Where should I add the note?

@jackasser
Copy link
Author

@Licht-T
Copy link
Contributor

Licht-T commented Dec 12, 2017

@jackasser @jreback @gfyoung Actually, the double quoted blank field is the default spec. when writing single column CSV. IOW, this issue is the correct behavior, but the "Case 1" in #18676 (comment) was wrong behavior. This is now fixed in CPython and the patch is backported to CPython 3.6.
python/cpython#4769

Please note that this bug does not exist in CPython 2.7.

@jreback
Copy link
Contributor

jreback commented Dec 13, 2017

@Licht-T ok can you add a test for >= 3.6 only, and xfail it for now (as not sure which release its on, though maybe its actually out?).

@jreback jreback added this to the 0.22.0 milestone Dec 13, 2017
@Licht-T
Copy link
Contributor

Licht-T commented Dec 14, 2017

@jreback Okay! (That fix is not released yet, will be included in the next release of CPython 3.6.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants