ENH: Add categorical support for Stata export #8767

Merged
merged 1 commit into from
Nov 13, 2014

Conversation

bashtage
Contributor

Add support for exporting DataFrames containing categorical data.

closes #8633
xref #7621

@bashtage
Contributor Author

Some questions about categoricals:

  • Are the underlying data types always ints?
  • Can they ever be int64?

Could also probably use some feedback to simplify the Python2/3 bytes/strings code.

A few small things left to do:

  • Test that errors are correctly raised
  • Add notes
  • Add something to docs

@jreback jreback added Categorical Categorical Data Type IO Stata read_stata, to_stata labels Nov 10, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 10, 2014
@jreback
Contributor

jreback commented Nov 10, 2014

Categories are usually strings, but can be ints/datetimes/floats, really anything.
Codes are always ints, with a size from 8 to 64 bits (a function of the number of categories).
In theory codes can be int64, but as a practical matter an int32 already holds up to np.iinfo(np.int32).max, so it will never happen.
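
To make the point about code sizes concrete, here is a minimal sketch (not part of the PR) that inspects the integer dtype pandas picks for the codes. The variable names are illustrative, and the widths shown reflect how current pandas behaves as far as I know, not something this thread guarantees.

```python
import numpy as np
import pandas as pd

# A handful of categories fits in the smallest signed integer type.
small = pd.Categorical(["a", "b", "a"])
print(small.codes.dtype)  # int8

# More categories than int8 can index forces a wider code type.
many = pd.Categorical(np.arange(1000))
print(many.codes.dtype)  # int16

# Missing values are stored as the sentinel code -1.
with_missing = pd.Categorical(["a", None])
print(with_missing.codes.tolist())  # [0, -1]
```

The categories themselves keep whatever type they had; only the codes are guaranteed to be integers.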

self.off = []
for vl in self.value_labels:
    category = vl[1]
    if not isinstance(category, string_types):
Contributor

you can stringify if non-string?

@jankatins jankatins mentioned this pull request Nov 10, 2014
4 tasks
@bashtage bashtage force-pushed the stata-categorical branch 4 times, most recently from 1873f71 to 08ad0a6 Compare November 10, 2014 18:42
@bashtage
Contributor Author

@jreback Should be pretty much ready.

In the end I decided to stringify categoricals and provide a warning to check the Stata file.
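
The stringify-and-warn approach described above could look roughly like the following sketch. `stringify_categories` is a hypothetical helper written for illustration, not the function added in this PR, and the warning text is invented.

```python
import warnings

import pandas as pd


def stringify_categories(cat):
    """Hypothetical helper: return text labels for every category.

    Non-string categories are converted with str(), and a single
    warning tells the user to double-check the exported labels.
    """
    labels = []
    converted = False
    for category in cat.categories:
        if not isinstance(category, str):
            converted = True
            category = str(category)
        labels.append(category)

    if converted:
        warnings.warn(
            "Non-string categories were converted to strings; "
            "check the value labels in the exported file.",
            UserWarning,
        )
    return labels
```

Calling `stringify_categories(pd.Categorical([1, 2, 1]))` would return the labels as strings and emit one warning, while an all-string categorical passes through silently.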

:class:`~pandas.io.stata.StataWriter` and
:func:`~pandas.core.frame.DataFrame.to_stata` only support fixed width
strings containing up to 244 characters, a limitation imposed by the version
115 dta file format. Attempting to write *Stata* dta files with strings
longer than 244 characters raises a ``ValueError``.

.. warning::

*Stata* data files only support text labels for categroical data. Exporting
Member

typo: "categroical"

null_byte = b'\x00'
# len
bio.write(struct.pack(byteorder + 'i', self.len))
# labname
Contributor

can you put a blank line before each 'block' (e.g. have a blank line, comment line, then code)
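
As an illustration of the requested layout (blank line, comment line, then code), here is a small self-contained sketch that packs a length field and a padded label name. `write_header` is a hypothetical name, and the 33-byte name width and latin-1 encoding are assumptions made for the example, not details confirmed in this thread.

```python
import struct
from io import BytesIO


def write_header(length, labname, byteorder="<"):
    """Hypothetical sketch: pack a length and a null-padded label name."""
    bio = BytesIO()

    # len: 4-byte signed integer in the requested byte order
    bio.write(struct.pack(byteorder + "i", length))

    # labname: truncated and null-padded to a fixed width (assumed 33 bytes)
    name = labname.encode("latin-1")[:32]
    bio.write(name + b"\x00" * (33 - len(name)))

    return bio.getvalue()
```

With `byteorder="<"` the integer is little-endian, so `write_header(10, "abc")` starts with the bytes `0a 00 00 00` followed by `abc` and null padding.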

@jreback
Contributor

jreback commented Nov 11, 2014

@bashtage otherwise looks good. ping when pushed and green.

@bashtage
Contributor Author

@jreback ready

"""Check for categorigal columns, retain categorical information for
Stata file and convert categorical data to int"""

is_cat = [True if com.is_categorical_dtype(data[col]) else False
Contributor

nitpick: [ com.is_categorical_dtype(data[col]) for col in data ]. The True/False are superfluous
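
A quick sketch of why the two spellings are equivalent. `is_categorical` below is a stand-in written for this example, since `com.is_categorical_dtype` is a pandas-internal helper; the DataFrame is made up.

```python
import pandas as pd


def is_categorical(series):
    # Stand-in for the internal com.is_categorical_dtype used in the PR.
    return isinstance(series.dtype, pd.CategoricalDtype)


data = pd.DataFrame({"a": pd.Categorical(["x", "y"]), "b": [1, 2]})

# The predicate already returns a bool, so the conditional adds nothing.
verbose = [True if is_categorical(data[col]) else False for col in data]
concise = [is_categorical(data[col]) for col in data]

print(verbose == concise)  # True
```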

@jreback
Contributor

jreback commented Nov 11, 2014

minor comment. ping when green.

@bashtage
Contributor Author

@jreback Should be ready

original = pd.concat([original[col].astype('category') for col in original], axis=1)

with tm.ensure_clean() as path:
    original.to_stata(path)
Member

I didn't look in detail, but shouldn't we also test that the file is written correctly (by reading it back in and comparing it to the original)?

@bashtage
Contributor Author

@jorisvandenbossche Good idea - I had only been checking that the files are correct in Stata, which is probably more important but cannot be automated. This showed there is a bug in the reader code that doesn't correctly handle missing values.

In general writing and reading it back in isn't that useful since both the writer and reader can agree but still be incorrect (this happened in the past).

So @jreback hold off on this one for a while until I can fix the reader.
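
For reference, the kind of round-trip check being discussed might be sketched as below. `round_trip` is a hypothetical helper that uses a temporary directory instead of the `tm.ensure_clean()` utility from the test suite. As noted above, agreement between writer and reader does not prove the file is valid for Stata itself.

```python
import os
import tempfile

import pandas as pd


def round_trip(df):
    """Hypothetical helper: write df to a .dta file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "roundtrip.dta")
        df.to_stata(path, write_index=False)
        return pd.read_stata(path)


original = pd.DataFrame({"fruit": pd.Categorical(["apple", "banana", "apple"])})
written_and_read = round_trip(original)

# Compare the labels, which should survive as value labels in the file.
print(written_and_read["fruit"].astype(str).tolist())
```

A fuller test would also cover missing values, which is where the reader bug mentioned above showed up.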

@jreback
Contributor

jreback commented Nov 12, 2014

ping when ready

Add support for exporting DataFrames containing categorical data.

closes pandas-dev#8633
xref pandas-dev#7621
@bashtage
Contributor Author

@jreback ready

jreback added a commit that referenced this pull request Nov 13, 2014
ENH: Add categorical support for Stata export
@jreback jreback merged commit 8d1ae49 into pandas-dev:master Nov 13, 2014
@jreback
Contributor

jreback commented Nov 13, 2014

thanks! this is excellent!

@bashtage bashtage deleted the stata-categorical branch November 13, 2014 16:51