ENH: Add categorical support for Stata export #8767

bashtage · 2014-11-10T00:13:09Z

Add support for exporting DataFrames containing categorical data.

closes #8633
xref #7621

bashtage · 2014-11-10T00:15:34Z

Some questions about categoricals:

Are the underlying data types always ints?
Can they ever be int64?

Could also probably use some feedback to simplify the Python2/3 bytes/strings code.

A few small things left to do:

Test that errors are correctly raised
Add notes
Add something to docs

jreback · 2014-11-10T00:20:32Z

Categories are usually strings, but can be ints/datetimes/floats, really anything.
Codes are always ints, and can be any size from 8 to 64 bits (a function of the number of categories).
In theory they can be int64, but as a practical matter an int32 can hold up to np.iinfo(np.int32).max, so that will never happen.
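The distinction between categories and codes can be checked directly. This is a small sketch using only the public `pd.Categorical` API; the variable names `small` and `wide` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Categories can be any type; codes are the integer positions into them.
small = pd.Categorical(["a", "b", "a"])
print(small.categories.tolist())
print(small.codes.tolist())

# The code dtype is chosen from the number of categories.
print(small.codes.dtype)   # 2 categories fit in the narrowest integer type

wide = pd.Categorical(np.arange(200))
print(wide.codes.dtype)    # 200 categories need a wider integer type
```

A missing value is stored as code -1, which is why the codes are signed.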

jreback · 2014-11-10T00:23:34Z

pandas/io/stata.py

+        self.off = []
+        for vl in self.value_labels:
+            category = vl[1]
+            if not isinstance(category, string_types):


Could you stringify the category if it isn't a string?

bashtage · 2014-11-10T18:43:49Z

@jreback Should be pretty much ready.

In the end I decided to stringify non-string categories and issue a warning telling the user to check the resulting Stata file.
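A rough sketch of the stringify-and-warn idea follows. It is not the code from this PR: `stringify_categories` is a hypothetical helper, and the warning text and the use of `UserWarning` are assumptions.

```python
import warnings

import pandas as pd


def stringify_categories(cat):
    """Hypothetical helper: return the category labels as strings.

    Emits a warning when any label had to be converted, since Stata
    value labels can only hold text.
    """
    labels = list(cat.categories)

    if not all(isinstance(label, str) for label in labels):
        warnings.warn(
            "Non-string categories were converted to strings; "
            "check the exported Stata file.",
            UserWarning,
        )

    return [str(label) for label in labels]


# Integer categories trigger the warning and come back as text.
labels = stringify_categories(pd.Categorical([1, 2, 1]))
```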

jorisvandenbossche · 2014-11-10T19:17:41Z

doc/source/io.rst

  :class:`~pandas.io.stata.StataWriter`` and
  :func:`~pandas.core.frame.DataFrame.to_stata` only support fixed width
  strings containing up to 244 characters, a limitation imposed by the version
  115 dta file format. Attempting to write *Stata* dta files with strings
  longer than 244 characters raises a ``ValueError``.

+.. warning::
+
+  *Stata* data files only support text labels for categroical data.  Exporting


typo: "categroical"

jreback · 2014-11-11T00:49:10Z

pandas/io/stata.py

+        null_byte = b'\x00'
+        # len
+        bio.write(struct.pack(byteorder + 'i', self.len))
+        # labname


Can you put a blank line before each 'block' (i.e. a blank line, then the comment line, then the code)?
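The requested layout, sketched on a cut-down example. `write_length_and_name` is a hypothetical function, and the two fields here are not the full Stata value-label record (there is no fixed-width padding, for instance); only the `struct.pack(byteorder + 'i', ...)` call mirrors the quoted diff.

```python
import struct
from io import BytesIO


def write_length_and_name(length, name, byteorder="<"):
    """Hypothetical sketch: pack a 4-byte length, then a null-terminated name."""
    bio = BytesIO()
    null_byte = b"\x00"

    # len
    bio.write(struct.pack(byteorder + "i", length))

    # labname
    bio.write(name.encode("ascii") + null_byte)

    return bio.getvalue()


payload = write_length_and_name(5, "lbl")
```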

jreback · 2014-11-11T00:50:22Z

@bashtage otherwise looks good. ping when pushed and green.

bashtage · 2014-11-11T09:51:35Z

@jreback ready

jreback · 2014-11-11T14:16:32Z

pandas/io/stata.py

+        """Check for categorigal columns, retain categorical information for
+        Stata file and convert categorical data to int"""
+
+        is_cat = [True if com.is_categorical_dtype(data[col]) else False


nitpick: use [com.is_categorical_dtype(data[col]) for col in data]; the True/False are superfluous.
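The same simplification, written against current public pandas API rather than the internal `com.is_categorical_dtype` helper used in the diff. The `isinstance` check on `pd.CategoricalDtype` is a substitution made for this sketch, and the example frame is made up.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "grade": pd.Categorical(["a", "b", "a"]),
        "score": [1.0, 2.0, 3.0],
    }
)

# The predicate already returns a bool, so no "True if ... else False" is needed.
is_cat = [isinstance(data[col].dtype, pd.CategoricalDtype) for col in data]
print(is_cat)
```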

jreback · 2014-11-11T14:17:38Z

minor comment. ping when green.

bashtage · 2014-11-12T03:31:03Z

@jreback Should be ready

jorisvandenbossche · 2014-11-12T08:03:33Z

pandas/io/tests/test_stata.py

+        original = pd.concat([original[col].astype('category') for col in original], axis=1)
+
+        with tm.ensure_clean() as path:
+            original.to_stata(path)


I didn't look in detail, but shouldn't we also test that the file is written correctly (by reading it back in and comparing it to the original)?

bashtage · 2014-11-12T12:59:33Z

@jorisvandenbossche Good idea - I had only been checking that the files are correct in Stata, which is probably more important but cannot be automated. Adding the round-trip test exposed a bug in the reader code: it doesn't correctly handle missing values.

In general, writing a file and reading it back in isn't that useful, since the writer and reader can agree with each other and still both be incorrect (this has happened in the past).
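A minimal round-trip check of the kind being discussed might look like the sketch below. It is not the test added in this PR; the frame, the file name, and the choice to compare labels as strings are all assumptions. As noted above, a passing round trip only shows that writer and reader agree, not that Stata itself would read the file correctly.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({"grade": pd.Categorical(["a", "b", "a"])})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.dta")

    # Write without the index so only the categorical column is stored.
    original.to_stata(path, write_index=False)

    # Value labels are converted back to a categorical by default.
    result = pd.read_stata(path)

# Compare the labels rather than the dtypes, since the reader may
# return an ordered categorical.
labels_match = (
    result["grade"].astype(str).tolist()
    == original["grade"].astype(str).tolist()
)
```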

So @jreback hold off on this one for a while until I can fix the reader.

jreback · 2014-11-12T13:00:18Z

ping when ready


bashtage · 2014-11-12T17:05:23Z

@jreback ready


jreback · 2014-11-13T11:15:50Z

thanks! this is excellent!

jreback added the Categorical (Categorical Data Type) and IO Stata (read_stata, to_stata) labels Nov 10, 2014

jreback added this to the 0.15.2 milestone Nov 10, 2014

jreback reviewed Nov 10, 2014

jankatins mentioned this pull request Nov 10, 2014

ENH: Categorical serialized #7621

Closed

4 tasks

bashtage force-pushed the stata-categorical branch 4 times, most recently from 1873f71 to 08ad0a6 Compare November 10, 2014 18:42

jorisvandenbossche reviewed Nov 10, 2014

bashtage force-pushed the stata-categorical branch from 08ad0a6 to 6d0e8cc Compare November 10, 2014 20:04

jreback reviewed Nov 11, 2014

bashtage force-pushed the stata-categorical branch from 6d0e8cc to 47d3d7e Compare November 11, 2014 09:18

bashtage force-pushed the stata-categorical branch from 47d3d7e to ffa73b9 Compare November 11, 2014 10:09

jreback reviewed Nov 11, 2014

bashtage force-pushed the stata-categorical branch from ffa73b9 to 3a713fc Compare November 12, 2014 02:42

jorisvandenbossche reviewed Nov 12, 2014

bashtage force-pushed the stata-categorical branch from 3a713fc to 3b4787b Compare November 12, 2014 15:48

ENH: Add categorical support for Stata export

204b50e

Add support for exporting DataFrames containing categorical data. closes pandas-dev#8633 xref pandas-dev#7621

bashtage force-pushed the stata-categorical branch from 3b4787b to 204b50e Compare November 12, 2014 16:10

jreback added a commit that referenced this pull request Nov 13, 2014

Merge pull request #8767 from bashtage/stata-categorical

8d1ae49

ENH: Add categorical support for Stata export

jreback merged commit 8d1ae49 into pandas-dev:master Nov 13, 2014

bashtage deleted the stata-categorical branch November 13, 2014 16:51
