Document pandas.DataFrame.to_stata data_label #13535

Closed
frehoy opened this issue Jun 30, 2016 · 6 comments
Labels
Docs IO Stata read_stata, to_stata
Comments

@frehoy

frehoy commented Jun 30, 2016

I work with pandas and Stata and find the DataFrame.to_stata() method very valuable. I would like to assign variable labels in my .dta files, but the data_label parameter of DataFrame.to_stata() is not documented, so I do not know in which format to supply my variable labels.

I tried a dictionary of the form {'df column name' : 'wanted label'}, but that raises

TypeError: unhashable type: 'slice'

Here is the page in the documentation I am referring to: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_stata.html#pandas.DataFrame.to_stata

If it could be updated to provide a working example of data_label I would be eternally grateful. This is my first issue on GitHub and I hope I specified it correctly. Feel free to point out if I messed up the format somehow.

Thanks!

Code Sample, a copy-pastable example if possible

import pandas as pd
d = {'one' : [1., 2., 3., 4.]}
df = pd.DataFrame(d)
labdict = {'one': 'foo'}
df.to_stata('test.dta', write_index=False, data_label=labdict)

Expected Output

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: 1.3.7
pip: 8.1.1
setuptools: 19.6.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: 2.8

@TomAugspurger
Contributor

The release note adding it is here

DataFrame.to_stata and StataWriter will accept keyword arguments time_stamp and data_label which allow the time stamp and dataset label to be set when creating a file. (:issue:`6545`)

So it looks like a single label for the entire dataset, not one per column. Though I could be wrong, as I have only used Stata once.

cc @bashtage who added this.
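
The dataset-level reading of data_label can be checked with a short round trip. This is a minimal sketch, assuming a reasonably recent pandas in which the reader returned by read_stata(..., iterator=True) works as a context manager and exposes data_label as a property; the label text and the temporary file are illustrative only.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.dta")

    # data_label takes one plain string that describes the whole dataset,
    # not a dict keyed by column name.
    df.to_stata(path, write_index=False, data_label="Example dataset")

    # Read the label back from the file header without loading the data.
    with pd.read_stata(path, iterator=True) as reader:
        stored_label = reader.data_label

print(stored_label)
```

Passing a dict here, as in the original report, fails because the writer expects a string.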

@frehoy
Author

frehoy commented Jun 30, 2016

Thanks Tom, yes that seems to be the case unfortunately. I would really like to be able to add labels for individual variables though. Should I open a feature request for that?

@bashtage
Contributor

Only dataset labels are implemented, not per-variable labels. The dataset label in the dta file format is:

5.1.5  Dataset label


    The dataset label is recorded as


              <label>llccccc........c</label>
                       |------------|
                          ll bytes


    Requirements:


                ccc..c        Up to 80 UTF-8 characters.
                              UTF-8 characters each require 1 to 4 bytes.
                              No trailing \0 is written.


                ll            The byte length of the UTF-8 characters, 
                              whose length is recorded in a 2-byte unsigned 
                              integer encoded according to byteorder.


                              Because ccc..c is allowed to contain up 
                              to 80 characters, 0 <= ll <= 4*80  
                              (4*80 = 320 = 0x140).


    If no characters are recorded (there is no data label), the .dta file
    contains


                <label>0000</label>


    where 0000 represents 2 bytes of 0.
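
The byte layout quoted above can be sketched in a few lines of Python. The helper name encode_dataset_label and its byteorder argument are hypothetical and only illustrate the excerpt; this is not how pandas itself writes the field.

```python
import struct

MAX_LABEL_CHARS = 80  # limit quoted in the spec excerpt above


def encode_dataset_label(label: str = "", byteorder: str = "<") -> bytes:
    """Hypothetical helper: build the <label>...</label> element.

    byteorder is a struct prefix: "<" for little-endian, ">" for big-endian.
    """
    if len(label) > MAX_LABEL_CHARS:
        raise ValueError("dataset label is limited to 80 characters")
    # Each character takes 1 to 4 bytes in UTF-8; no trailing NUL is added.
    payload = label.encode("utf-8")
    # ll: byte length of the payload as a 2-byte unsigned integer.
    length = struct.pack(byteorder + "H", len(payload))
    return b"<label>" + length + payload + b"</label>"
```

Called with no label, the helper produces the two zero bytes between the tags that the excerpt describes for a file without a data label.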

@bashtage
Contributor

@frehoy This would need a feature request. The two existing inputs, time_stamp and data_label, should also be documented.

@bashtage
Contributor

It seems that it is almost implemented here.

See
https://github.com/pydata/pandas/blob/master/pandas/io/stata.py#L2057

and

https://github.com/pydata/pandas/blob/master/pandas/io/stata.py#L2134

As you can see, this always passes the default None and so no labels are written.

It mostly needs an external interface and a tiny amount of wiring up. It also needs testing, especially that the labels can be read into Stata.

@TomAugspurger
Contributor

Thanks @bashtage.

@frehoy, are you interested in submitting a pull request for documenting data_label, or working on the per-variable labels? I'll make a second issue for the variable labels.

jreback added this to the 0.19.0 milestone Jul 13, 2016
bashtage added a commit to bashtage/pandas that referenced this issue Jul 15, 2016
Add support for writing variable labels
Fix documentation for to_stata
Clean up function name to improve readability

closes pandas-dev#13536
closes pandas-dev#13535
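
For readers arriving with the original question, per-variable labels can be written in pandas releases that include this change through the variable_labels argument of to_stata, which takes a dict mapping column names to labels. A minimal sketch, assuming a reasonably recent pandas; the temporary file and the read-back through the reader's variable_labels() method are illustrative only.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]})
labels = {"one": "foo"}  # column name -> variable label

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.dta")

    # variable_labels carries the per-column labels; data_label remains
    # a single string for the dataset as a whole.
    df.to_stata(path, write_index=False, variable_labels=labels)

    # Read the labels back from the file header.
    with pd.read_stata(path, iterator=True) as reader:
        stored_labels = reader.variable_labels()

print(stored_labels)
```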