pdexplorer is a Stata emulator for Python/pandas.
pdexplorer is available on PyPI. Run pip to install:
pip install pdexplorer
pdexplorer can be run in three modes:
1. Stata-like Emulation
from pdexplorer import *
This import adds Stata-like commands into the Python namespace. For example,
webuse('auto')
reg('mpg price')
2. Pure Stata Emulation
from pdexplorer import do
do() # Launches a Stata emulator that can run normal Stata commands
Now you can run regular Stata commands e.g.,
webuse auto
reg mpg price
do() also supports running the contents of a do-file, e.g.,
do('working.do')
Under the hood, the Stata emulator translates pure Stata commands into their Pythonic equivalents. For example, reg mpg price becomes reg('mpg price').
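The translation described above can be illustrated with a minimal sketch. The function name stata_to_python is hypothetical and is not part of pdexplorer; it only shows the idea of turning the first word into a function name and passing the rest of the line as a single string argument.

```python
def stata_to_python(line: str) -> str:
    """Turn one Stata command line into a Stata-like Python call (illustrative only)."""
    # The first word is the command; everything after it becomes one string argument.
    command, _, rest = line.strip().partition(" ")
    rest = rest.strip()
    return f"{command}({rest!r})" if rest else f"{command}()"


print(stata_to_python("webuse auto"))
print(stata_to_python("reg mpg price"))
```

The real emulator presumably handles prefixes, abbreviations, and two-word commands as well; this sketch covers only the simplest case.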
3. Inline Stata
For example,
from pdexplorer import do, current
do(inline="""
webuse auto
reg mpg price
""") # Launches a Stata emulator that can run normal Stata commands
print(current.df) # access DataFrame object in Python
The rest of this documentation shows examples using Stata-like emulation, but these commands can all be run using pure Stata emulation as well.
- pdexplorer uses Python libraries under the hood. The result of a command reflects the output of those libraries and may differ slightly from the equivalent Stata output.
- There is no support for mata. Under the hood, pdexplorer is just the Python data stack.
Stata is great for its conciseness and readability. But Python/pandas is free and easier to integrate with other applications. For example, you can build a web server in Python, but not in Stata; you can run Python in AWS SageMaker, but not Stata.
pdexplorer enables Stata to be easily integrated into the Python ecosystem.
In contrast to raw Python/pandas, Stata syntax achieves succinctness by:
- Using spaces and commas rather than parentheses, brackets, curly braces, and quotes (where possible)
- Specifying a set of concise commands on the "current" dataset rather than cluttering the namespace with multiple datasets
- Being verbose by default i.e., displaying output that represents the results of the command
- Having sensible defaults that cover the majority of use cases and demonstrate common usage
- Allowing for namespace abbreviations for both commands and variable names
- Employing two types of column names: Variable names are concise and used for programming. Variable labels are verbose and used for presentation.
- Packages are imported lazily, e.g., statsmodels is imported only when a command first uses it. This ensures that from pdexplorer import * runs quickly.
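Lazy importing can be sketched as follows. This is a general Python pattern, not pdexplorer's actual code: the function summarize_mean is hypothetical, and the standard-library statistics module stands in for a heavier dependency such as statsmodels.

```python
def summarize_mean(values):
    """Return the mean of values, importing the dependency only when called."""
    # The import runs the first time the command is used, not when the
    # enclosing module is imported, so the top-level import stays fast.
    import statistics  # stand-in for a heavy dependency

    return statistics.mean(values)


print(summarize_mean([1, 2, 3]))
```

Python caches imported modules, so repeated calls pay the import cost only once.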
webuse('auto')
li() # List the contents of the data
See https://www.stata.com/manuals/dwebuse.pdf
webuse('auto')
with by('foreign'):
summarize('mpg weight')
See https://www.stata.com/manuals/rsummarize.pdf
webuse('auto')
regress('mpg weight foreign')
ereturnlist()
In the last example, note the use of ereturnlist(), corresponding to the Stata command ereturn list. Additionally, a Python object may be available as the command's return value. For example,
webuse('auto')
results = regress('mpg weight foreign')
Here, results is a RegressionResultsWrapper object from the statsmodels package.
With few exceptions, the basic Stata language syntax (as documented here) is
[by varlist:] command [subcommand] [varlist] [=exp] [if exp] [in range] [weight] [, options]
where square brackets distinguish optional qualifiers and options from required ones. In this diagram, varlist denotes a list of variable names, command denotes a Stata command, exp denotes an algebraic expression, range denotes an observation range, weight denotes a weighting expression, and options denotes a list of options.
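To make the diagram concrete, here is a rough sketch that splits a command line into some of these parts with a regular expression. It is not pdexplorer's parser: the names PATTERN and parse_command are hypothetical, the by prefix, =exp, and weight are ignored, and a comma inside an if expression would be mistaken for the start of the options.

```python
import re

# Simplified pattern: command [varlist] [if exp] [in range] [, options]
PATTERN = re.compile(
    r"^(?P<command>\w+)"
    r"(?:\s+(?P<varlist>.*?))?"
    r"(?:\s+if\s+(?P<if_exp>.*?))?"
    r"(?:\s+in\s+(?P<in_range>.*?))?"
    r"(?:\s*,\s*(?P<options>.*))?$"
)


def parse_command(line: str) -> dict:
    """Split a Stata command line into named parts; missing parts are None."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"cannot parse: {line!r}")
    return match.groupdict()


parsed = parse_command("regress mpg weight if foreign == 1 in 1/50, robust")
for name, value in parsed.items():
    print(f"{name}: {value}")
```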
The by varlist: prefix causes Stata to repeat a command for each subset of the data for which the values of the variables in varlist are equal. When prefixed with by varlist:, the result of the command will be the same as if you had formed separate datasets for each group of observations, saved them, and then gave the command on each dataset separately. The data must already be sorted by varlist, although by has a sort option.
In pdexplorer, this gets translated to
with by('varlist'):
command("[subcommand] [varlist] [=exp] [if exp] [in range] [weight] [, options]", *args, **kwargs)
where *args and **kwargs represent additional arguments that are available in a pdexplorer command but not in the equivalent Stata command.
Sometimes, Stata commands are two words. In such cases, the pdexplorer command is a concatenation of the two words. For example,
label data "label"
becomes
labeldata("label")
pdexplorer command | package dependency |
---|---|
cf | ydata-profiling or sweetviz |
browse | xlwings |
regress | statsmodels |