Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExcelFile class has no attribute 'encoding'. Is it correct? #25523

Closed
Dimaskuaskus opened this issue Mar 3, 2019 · 3 comments
Closed

ExcelFile class has no attribute 'encoding'. Is it correct? #25523

Dimaskuaskus opened this issue Mar 3, 2019 · 3 comments
Labels
Duplicate Report Duplicate issue or pull request Enhancement IO Excel read_excel, to_excel

Comments

@Dimaskuaskus
Copy link

Code Sample

import os
import pandas as pd

directory = '/Data'
# more than 300 files
for d, dirs, files in os.walk(directory):
    # i can not use read_excel function because i do not know how many sheets in each xls
    # i use instance of ExcelFile
    # in fact it returns xlrd.Book instance (because engine: Acceptable values are None or ``xlrd``)
    try
        current_file = pd.ExcelFile(os.path.join(d,f))
        #do not work. My files use cp1251 charset. See more in output
    except UnicodeDecodeError:
        pass

    sheet_names_ = current_file.sheet_names()
    for sn in current_file.sheet_names():
        # this Data Frame i shoud parse and import to my Pstgresql DB
        current_file.parse(sheet_name=sn)

Problem description

I have a lot of XLS files encoded with cp1251. Each file contains 1 or more sheets. I need to import them into my Postgesql DB. But Pandas can not do it simple. I can load one sheet (list of, but i don`t have it) at a time in any encoding using pandas.read_excel(file,encoding=my_encoding). Or I can load all sheets from all files in UTF encoding using instance of pandas.ExcelFile(file). It looks like some bad recursion.
I can use xlrd to read my files with specific charset to get xlrd.Book instance. Then pass it to pandas.read_excel(). But I do not like this decision.

If pandas.ExcelFile() would have an attribute of the encoding, then it could be passed to any "engine". I saw a new version of excel package. The same problem.

(I dont have MS Excel and can not create "bad" file for test)
Sorry for my english(((

Expected Output

Output pandas/xlrd Error

*** No CODEPAGE record, no encoding_override: will use 'ascii'
*** No CODEPAGE record, no encoding_override: will use 'ascii'
Traceback (most recent call last):
File "TableLoader.py", line 13, in
c = ExelLoader()
File "TableLoader.py", line 10, in ExelLoader
my_xls_x = pandas.read_excel('Data/qq.xls')
File "/home/dz/.local/lib/python3.6/site-packages/pandas/util/_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "/home/dz/.local/lib/python3.6/site-packages/pandas/util/_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "/home/dz/.local/lib/python3.6/site-packages/pandas/io/excel.py", line 350, in read_excel
io = ExcelFile(io, engine=engine)
File "/home/dz/.local/lib/python3.6/site-packages/pandas/io/excel.py", line 653, in init
self._reader = self._enginesengine
File "/home/dz/.local/lib/python3.6/site-packages/pandas/io/excel.py", line 424, in init
self.book = xlrd.open_workbook(filepath_or_buffer)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/init.py", line 157, in open_workbook
ragged_rows=ragged_rows,
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/book.py", line 120, in open_workbook_xls
bk.get_sheets()
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/book.py", line 723, in get_sheets
self.get_sheet(sheetno)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/book.py", line 714, in get_sheet
sh.read(self)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/sheet.py", line 820, in read
strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/biffh.py", line 250, in unpack_string
return unicode(data[pos:pos+nchars], encoding)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/timemachine.py", line 31, in
unicode = lambda b, enc: b.decode(enc)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcd in position 0: ordinal not in range(128)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 32
OS: Linux
OS-release: 4.15.0-45-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: ru_RU.UTF-8
LOCALE: ru_RU.UTF-8

pandas: 0.24.1
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd WillAyd added the IO Excel read_excel, to_excel label Mar 4, 2019
@WillAyd
Copy link
Member

WillAyd commented Mar 4, 2019

This is very closely related to #23444 @gfyoung

@WillAyd WillAyd added this to the Contributions Welcome milestone Mar 4, 2019
@gfyoung
Copy link
Member

gfyoung commented Mar 4, 2019

@WillAyd : I would be inclined to call this a duplicate in fact, since the underlying Enhancement request is the same, even though the premise is different.

@WillAyd
Copy link
Member

WillAyd commented Jan 19, 2021

Closing as a duplicate

@WillAyd WillAyd closed this as completed Jan 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

No branches or pull requests

3 participants