A DataVil project.
FrameX is a lightweight dataset-fetching library for fast prototyping, tutorial creation, and experimentation.
Built on top of Polars.
To get started, install the library with:
pip install framex
Then import the library and load a dataset:
import framex as fx
iris = fx.load("iris")
which returns a polars DataFrame.
You can therefore use all polars functions and methods on the returned DataFrame.
iris.head()
shape: (5, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f32 ┆ f32 ┆ f32 ┆ f32 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
│ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
│ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
iris = fx.load("iris", lazy=True)
which returns a polars LazyFrame
Both of these operations create local copies of the datasets by default (cache=True).
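To make the idea of a local copy concrete, here is a generic cache-on-first-use sketch. It is not FrameX's actual implementation: `load_cached`, the `fetch` callback, the `.bin` file name, and the temporary cache directory are all hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_cached(name: str, fetch: Callable[[str], bytes], cache_dir: Path) -> bytes:
    """Return the bytes for `name`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.bin"
    if target.exists():
        return target.read_bytes()  # cache hit: nothing is downloaded
    data = fetch(name)  # cache miss: fetch once
    target.write_bytes(data)  # keep a local copy for next time
    return data


# Fake downloader that records how often it is called.
calls = []


def fake_fetch(name: str) -> bytes:
    calls.append(name)
    return b"sepal_length,species\n5.1,setosa\n"


with tempfile.TemporaryDirectory() as tmp:
    first = load_cached("iris", fake_fetch, Path(tmp))
    second = load_cached("iris", fake_fetch, Path(tmp))

print(len(calls))
```

The second call is served from the local copy, so the fake downloader runs only once.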
To see the list of available datasets, run:
fx.available()
{'remote': ['iris', 'mpg', 'netflix', 'starbucks', 'titanic'], 'local': ['titanic']}
which returns a dictionary of both locally and remotely available datasets.
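Because the result is a plain dictionary, ordinary Python is enough to work with it. The sketch below uses the dictionary shown above as a literal (rather than calling `fx.available()`) to find datasets that exist remotely but have no local copy yet; `not_cached` is an illustrative name.

```python
# The dictionary from the example output above, written out as a literal.
available = {
    "remote": ["iris", "mpg", "netflix", "starbucks", "titanic"],
    "local": ["titanic"],
}

# Datasets offered remotely that are not yet available locally.
not_cached = sorted(set(available["remote"]) - set(available["local"]))
print(not_cached)  # ['iris', 'mpg', 'netflix', 'starbucks']
```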
To see only local or remote datasets, run:
fx.available("local")
fx.available("remote")
{'local': ['titanic']}
{'remote': ['iris', 'mpg', 'netflix', 'starbucks', 'titanic']}
To get information on a dataset, run:
fx.about("mpg") # basically the same as `fx.about("mpg", mode="print")`
which prints information about the dataset as follows:
NAME : mpg
SOURCE : https://www.kaggle.com/datasets/uciml/autompg-dataset
LICENSE : CC0: Public Domain
ORIGIN : Kaggle
OG NAME : autompg-dataset
Or you can get the information as a single-row polars.DataFrame by running:
row = fx.about("mpg", mode="row")
print(row)
which prints the dataset information as an ASCII table:
shape: (1, 4)
┌──────┬─────────────────────────────────┬────────────────────┬────────┐
│ name ┆ source ┆ license ┆ origin │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞══════╪═════════════════════════════════╪════════════════════╪════════╡
│ mpg ┆ https://www.kaggle.com/dataset… ┆ CC0: Public Domain ┆ Kaggle │
└──────┴─────────────────────────────────┴────────────────────┴────────┘
Or you can simply treat row as a polars DataFrame in your code.
In case you need the file links:
url_pokemon = fx.get_url("pokemon")
By default, the format is "feather".
Optionally, you can specify the format of the dataset:
url_pokemon_csv = fx.get_url("pokemon", format="csv")
Get a single dataset:
fx get iris
or get multiple datasets:
fx get iris mpg titanic
which will download the dataset(s) to the current directory.
To download the datasets into the cache directory:
fx get iris mpg titanic --cache
or to a specific directory:
fx get iris mpg titanic --dir data
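If the `fx get` commands above need to be driven from a Python script, one option is to assemble the argument list and hand it to `subprocess`. The helper below is a hypothetical sketch: `build_get_command` is not part of FrameX, and treating `--cache` and `--dir` as alternatives is an assumption. It only builds the command and does not run it.

```python
from typing import List, Optional


def build_get_command(
    names: List[str], directory: Optional[str] = None, cache: bool = False
) -> List[str]:
    """Assemble an `fx get` argument list (hypothetical helper)."""
    cmd = ["fx", "get", *names]
    if cache:
        cmd.append("--cache")  # download into the cache directory
    elif directory is not None:
        cmd.extend(["--dir", directory])  # download into a specific directory
    return cmd


cmd = build_get_command(["iris", "mpg", "titanic"], directory="data")
print(" ".join(cmd))  # fx get iris mpg titanic --dir data

# To actually run it when `fx` is on PATH:
#   import subprocess; subprocess.run(cmd, check=True)
```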
To get the names of the datasets available on the remote server:
fx list
This lists all available datasets on the remote server.
To get information on a dataset or datasets, run:
fx about mpg iris
To show a preview of a single dataset:
fx show iris
To describe (or summarize) a dataset:
fx describe iris
For more parameters:
fx get --help