Access the calling pandas data frame in loc[], iloc[], assign() and other methods with DF to write better chains of data frame operations, e.g.:
df = (
    df
    # Select all rows with column "x" < 2
    .loc[DF["x"] < 2]
    .assign(
        # Shift "x" by its minimum.
        y = DF["x"] - DF["x"].min(),
        # Clip "x" to its central 50% window. Note how DF is used
        # in the argument to `clip()`.
        z = DF["x"].clip(
            lower=DF["x"].quantile(0.25),
            upper=DF["x"].quantile(0.75)
        ),
    )
)
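For comparison, pandas itself accepts callables in loc[] and assign(), so the chain above can be approximated without DF by passing lambdas. The sketch below uses plain pandas only, and the sample values of "x" are an assumption made up for illustration. It is meant to show what DF abbreviates, not how pandas-paddles is implemented.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({"x": [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]})

out = (
    df
    # Select all rows with column "x" < 2
    .loc[lambda d: d["x"] < 2]
    .assign(
        # Shift "x" by its minimum.
        y=lambda d: d["x"] - d["x"].min(),
        # Clip "x" to its central 50% window.
        z=lambda d: d["x"].clip(
            lower=d["x"].quantile(0.25),
            upper=d["x"].quantile(0.75),
        ),
    )
)
print(out)
```

Each lambda receives the data frame as it exists at that point in the chain, which is the role DF plays in the shorter version.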
- Motivation: Make chaining Pandas operations easier and bring functionality to Pandas similar to Spark's col() function or referencing columns in R's dplyr.
- Install from PyPI with pip install pandas-paddles. Pandas versions 1+ (>=1,<3) are supported.
- Documentation can be found at readthedocs.
- Source code can be obtained from GitHub.
- Changelog
Instead of writing "traditional" Pandas like this:
import pandas as pd

df_in = pd.DataFrame({"x": range(5)})
df = df_in.copy()
df["y"] = df["x"] // 2
df = df.loc[df["y"] <= 1]
df
#    x  y
# 0  0  0
# 1  1  0
# 2  2  1
# 3  3  1
One can write:
from pandas_paddles import DF
df = (
    df_in
    .assign(y = DF["x"] // 2)
    .loc[DF["y"] <= 1]
)
This is especially handy when iterating on data frame manipulations interactively, e.g. in a notebook (just imagine you have to rename df to df_out).
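One way to picture how an object like DF can stand in for the data frame: pandas calls any callable passed to loc[] or assign() with the current data frame, so a small object that records operations and replays them when called is enough. The toy sketch below makes that idea concrete. The class Deferred and the name D are hypothetical and written for this illustration only. This is not the pandas-paddles implementation, and it covers only item access and three operators.

```python
import operator

import pandas as pd


class Deferred:
    """Toy deferred expression; a hypothetical stand-in, not pandas-paddles code."""

    def __init__(self, resolve=lambda df: df):
        # `resolve` maps the data frame from the chain to a concrete value.
        self._resolve = resolve

    def __call__(self, df):
        # pandas calls callables passed to loc[] / assign() with the
        # current data frame; that is when the expression is evaluated.
        return self._resolve(df)

    def __getitem__(self, key):
        # Record an item access, e.g. D["x"], to be replayed later.
        return Deferred(lambda df: self(df)[key])

    def _binary(self, op, other):
        def resolve(df):
            # Resolve the right-hand side too if it is itself deferred.
            rhs = other(df) if isinstance(other, Deferred) else other
            return op(self(df), rhs)

        return Deferred(resolve)

    def __lt__(self, other):
        return self._binary(operator.lt, other)

    def __le__(self, other):
        return self._binary(operator.le, other)

    def __floordiv__(self, other):
        return self._binary(operator.floordiv, other)


D = Deferred()  # hypothetical name; plays the role of DF in this sketch

df_in = pd.DataFrame({"x": range(5)})
df = df_in.assign(y=D["x"] // 2).loc[D["y"] <= 1]
print(df)
```

Because evaluation happens only when pandas calls the object, D["y"] can refer to a column that does not exist until the preceding assign() has run.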
You can also access all methods and attributes of the data frame from the context:
df = pd.DataFrame({
    "X": range(5),
    "y": ["1", "a", "c", "D", "e"],
})
df.loc[DF["y"].str.isupper() | DF["y"].str.isnumeric()]
#    X  y
# 0  0  1
# 3  3  D
df.loc[:, DF.columns.str.isupper()]
#    X
# 0  0
# 1  1
# 2  2
# 3  3
# 4  4
You can even use DF in the arguments to methods:
df = pd.DataFrame({
    "x": range(5),
    "y": range(2, 7),
})
df.assign(z = DF['x'].clip(lower=2.2, upper=DF['y'].median()))
#    x  y    z
# 0  0  2  2.2
# 1  1  3  2.2
# 2  2  4  2.2
# 3  3  5  3.0
# 4  4  6  4.0
When working with pd.Series, the S object is available. It can be used similarly to DF:
from pandas_paddles import S
s = pd.Series(range(5))
s[S < 3]
# 0    0
# 1    1
# 2    2
# dtype: int64
-
- (+) active
- (-) new API to learn
-
- (-) stale(?), last change 6 years ago
- (-) new API to learn
- (-) Symbol / pandas_ply.X works only with ply_* functions
-
- (+) no explicit df necessary
- (-) new API to learn
-
- (+) simple select accessor
- (-) usage inside chains is clumsy (needs explicit df): ((df.select.A == 'a').select.B == 'b')
- (-) hard-coded str, dt accessor methods
- (?) composable?
Development is containerized with Docker to separate it from host systems and improve reproducibility. No other prerequisites are needed on the host system.
Recommendation for Windows users: install WSL 2 (tested on Ubuntu 20.04), and for containerized workflows, Docker Desktop for Windows.
The common tasks are collected in Makefile (see make help for a complete list):
- Run the unit tests: make test, or make watch for continuously running tests on code changes.
- Build the documentation: make docs
- TODO: Update the poetry.lock file: make lock
- Add a dependency:
  1. Start a shell in a new container.
  2. Add the dependency with poetry add in the running container. This will update poetry.lock automatically:
     # 1. On the host system
     % make shell
     # 2. In the container instance:
     I have no name!@7d0e85b3a303:/app$ poetry add --dev --lock falcon
- Build the development image: make image (Note: This should be done automatically for the targets.)