Add sampling parameters as a global config #192

Merged
merged 37 commits into from
Dec 31, 2020
37 commits
403bdd6
update export tutorial to add explanation for standalone argument
westernguy2 Sep 18, 2020
be1ddc3
minor fixes and remove cell output in notebooks
dorisjlee Sep 18, 2020
2f1c5b2
added contributing doc
dorisjlee Sep 19, 2020
a5caa69
fix bugs and uncomment some tests
westernguy2 Sep 21, 2020
3fb197d
remove raise warning
westernguy2 Sep 23, 2020
ef82410
remove unnecessary import
westernguy2 Sep 23, 2020
21d71ea
split up rename test into two parts
westernguy2 Sep 23, 2020
2b8abe1
fix setting warning, fix data_type bugs and add relevant tests
westernguy2 Sep 25, 2020
7942161
remove ordinal data type
westernguy2 Sep 25, 2020
98f4c2e
add test for small dataframe resetting index
westernguy2 Sep 28, 2020
18cace7
add loc and iloc tests
westernguy2 Sep 28, 2020
6e9195b
fix merge conflicts
westernguy2 Sep 28, 2020
dbdfdcd
fix attribute access directly to dataframe
westernguy2 Sep 28, 2020
d63d006
add small changes to code
westernguy2 Sep 30, 2020
4faff66
Merge branch 'master' into master
westernguy2 Sep 30, 2020
083e091
Merge branch 'master' of github.com:westernguy2/lux into westernguy2-…
dorisjlee Sep 30, 2020
a998646
added test for qcut and cut
dorisjlee Sep 30, 2020
acdd9c9
add check if dtype is Interval
westernguy2 Oct 2, 2020
b8fa059
added qcut test
dorisjlee Oct 2, 2020
1838ea9
Merge branch 'master' of github.com:westernguy2/lux into westernguy2-…
dorisjlee Oct 4, 2020
a826e34
fix Record KeyError
westernguy2 Nov 29, 2020
afc4f71
add tests
westernguy2 Dec 4, 2020
a96baa4
take care of reset_index case
westernguy2 Dec 4, 2020
da4c602
small edits
westernguy2 Dec 4, 2020
a03f275
add data_model to column_group Clause
westernguy2 Dec 7, 2020
4ff25e8
small edits for row_group
westernguy2 Dec 7, 2020
cfe8772
Merge branch 'master' of github.com:westernguy2/lux into westernguy2-…
dorisjlee Dec 7, 2020
cfcc50c
fixes to row group
dorisjlee Dec 7, 2020
3f60ca9
add config for start and cap for samples
westernguy2 Dec 23, 2020
71e481d
finish sampling config and tests
westernguy2 Dec 29, 2020
86006d6
black formatting
westernguy2 Dec 29, 2020
f015c34
add documentation for sampling config
westernguy2 Dec 29, 2020
f87d63b
remove small added issues
westernguy2 Dec 29, 2020
89d6310
Merge branch 'master' of github.com:westernguy2/lux into westernguy2-…
dorisjlee Dec 29, 2020
0782095
minor changes to docs
dorisjlee Dec 29, 2020
6f09f93
implement heatmap flag and add tests
westernguy2 Dec 31, 2020
2ed47b8
black formatting and documentation edits
westernguy2 Dec 31, 2020
17 changes: 17 additions & 0 deletions doc/source/guide/FAQ.rst
@@ -64,6 +64,23 @@ How do I turn off Lux?
To display only the Pandas view of the dataframe, print the dataframe by doing :code:`df.to_pandas()`.
To turn off Lux completely, remove the :code:`import lux` statement and restart your Jupyter notebook.

How do I disable sampling and have Lux visualize the full dataset?
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
When a dataframe is large, Lux displays a warning saying "Large dataframe detected: Lux is only visualizing a random sample". If you would like to disable sampling, you can run:

.. code-block:: python

    lux.config.sampling = False

Note that if you have already loaded your data and printed the visualizations, you need to reinitialize the dataframe: set the config first, then load your data again, as such:

.. code-block:: python

    lux.config.sampling = False
    df = pd.read_csv("...")

If you want to fine-tune the sampling parameters, you can edit :code:`lux.config.sampling_start` and :code:`lux.config.sampling_cap`. See `this page <https://lux-api.readthedocs.io/en/latest/source/reference/config.html>`_ for more details.

Troubleshooting Tips
--------------------

28 changes: 28 additions & 0 deletions doc/source/reference/config.rst
@@ -44,3 +44,31 @@ If you try to set the default_display to anything other than 'lux' or 'pandas,'
:align: center
:alt: Retrieves a single attribute from Lux's Action Manager using its defined id.

Change the sampling parameters of Lux
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To speed up visualization processing, Lux by default performs random sampling on datasets with more than 10000 rows, visualizing a random 75% sample of the rows. For datasets with more than 30000 rows, Lux instead randomly samples 30000 rows from the dataset.

To change these parameters, we can set `sampling_start` and `sampling_cap` via `lux.config`. By default, `sampling_start` is set to 10000 and `sampling_cap` is set to 30000; `sampling_cap` must be greater than or equal to `sampling_start`. In the following block, we increase these sampling bounds.

.. code-block:: python

    lux.config.sampling_start = 20000
    lux.config.sampling_cap = 40000

If we want Lux to use the full dataset in the visualization, we can also disable sampling altogether (but note that this may result in long processing times). Below is an example of disabling sampling:

.. code-block:: python

    lux.config.sampling = False

Disable the use of heatmaps for large datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to sampling, Lux replaces scatter plots with heatmaps for datasets with more than 5000 rows to speed up visualization.

We can disable this feature and revert to scatter plots by running the following code block (but note that this may result in long processing times).

.. code-block:: python

    lux.config.heatmap = False
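
The thresholds described in this documentation can be illustrated with a small, dependency-free sketch. The function name `expected_sample_rows` is hypothetical and not part of Lux; it mirrors the branch structure of `execute_sampling` in this PR's `PandasExecutor.py` change, and it assumes that a fractional sample of `n` rows yields `round(0.75 * n)` rows.

```python
def expected_sample_rows(
    n_rows: int,
    start: int = 10000,
    cap: int = 30000,
    frac: float = 0.75,
    enabled: bool = True,
) -> int:
    """Approximate how many rows would be visualized for a dataframe of n_rows.

    Hypothetical helper for illustration only; not part of the Lux API.
    """
    if enabled and n_rows > cap:
        # Above the cap: a fixed-size random sample of `cap` rows.
        return cap
    if enabled and n_rows > start:
        # Between the two bounds: a fractional random sample (assumed rounding).
        return round(n_rows * frac)
    # Sampling disabled, or the dataframe is small enough to use in full.
    return n_rows


print(expected_sample_rows(20000))
print(expected_sample_rows(48895))
print(expected_sample_rows(48895, enabled=False))
```

With the defaults, a 20000-row dataframe would be visualized from roughly 15000 rows, while the 48895-row Airbnb dataset used in the tests is capped at 30000 rows unless sampling is disabled.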
87 changes: 87 additions & 0 deletions lux/_config/config.py
@@ -155,6 +155,93 @@ def __init__(self):
        self.plot_config = None
        self.SQLconnection = ""
        self.executor = None
        self._sampling_start = 10000
        self._sampling_cap = 30000
        self._sampling_flag = True
        self._heatmap_flag = True

    @property
    def sampling_cap(self):
        return self._sampling_cap

    @sampling_cap.setter
    def sampling_cap(self, sample_number: int) -> None:
        """
        Parameters
        ----------
        sample_number : int
            Cap on the number of rows to sample. Must be larger than or equal to _sampling_start
        """
        if type(sample_number) == int:
            assert sample_number >= self._sampling_start
            self._sampling_cap = sample_number
        else:
            warnings.warn(
                "The cap on the number of samples must be an integer.",
                stacklevel=2,
            )

    @property
    def sampling_start(self):
        return self._sampling_start

    @sampling_start.setter
    def sampling_start(self, sample_number: int) -> None:
        """
        Parameters
        ----------
        sample_number : int
            Number of rows required to begin sampling. Must be smaller than or equal to _sampling_cap
        """
        if type(sample_number) == int:
            assert sample_number <= self._sampling_cap
            self._sampling_start = sample_number
        else:
            warnings.warn(
                "The sampling starting point must be an integer.",
                stacklevel=2,
            )

    @property
    def sampling(self):
        return self._sampling_flag

    @sampling.setter
    def sampling(self, sample_flag: bool) -> None:
        """
        Parameters
        ----------
        sample_flag : bool
            Whether or not sampling will occur.
        """
        if type(sample_flag) == bool:
            self._sampling_flag = sample_flag
        else:
            warnings.warn(
                "The flag for sampling must be a boolean.",
                stacklevel=2,
            )

    @property
    def heatmap(self):
        return self._heatmap_flag

    @heatmap.setter
    def heatmap(self, heatmap_flag: bool) -> None:
        """
        Parameters
        ----------
        heatmap_flag : bool
            Whether or not a heatmap will be used instead of a scatter plot.
        """
        if type(heatmap_flag) == bool:
            self._heatmap_flag = heatmap_flag
        else:
            warnings.warn(
                "The flag for enabling/disabling heatmaps must be a boolean.",
                stacklevel=2,
            )

    @property
    def default_display(self):
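
Because each bound's setter asserts against the other bound's current value, the order of assignment matters when both bounds change. The following is a minimal sketch of that behavior using a hypothetical stand-in class (`SamplingBounds`) and a hypothetical helper (`set_bounds`); neither is part of Lux, and the setters are simplified copies of the ones in this diff.

```python
import warnings


class SamplingBounds:
    """Hypothetical stand-in for the two bound properties added to lux.config."""

    def __init__(self) -> None:
        self._sampling_start = 10000
        self._sampling_cap = 30000

    @property
    def sampling_start(self) -> int:
        return self._sampling_start

    @sampling_start.setter
    def sampling_start(self, value: int) -> None:
        if type(value) == int:  # exact type check, so True/False are rejected
            assert value <= self._sampling_cap
            self._sampling_start = value
        else:
            warnings.warn("The sampling starting point must be an integer.", stacklevel=2)

    @property
    def sampling_cap(self) -> int:
        return self._sampling_cap

    @sampling_cap.setter
    def sampling_cap(self, value: int) -> None:
        if type(value) == int:
            assert value >= self._sampling_start
            self._sampling_cap = value
        else:
            warnings.warn("The cap on the number of samples must be an integer.", stacklevel=2)


def set_bounds(config: SamplingBounds, start: int, cap: int) -> None:
    """Assign both bounds in an order that keeps start <= cap at every step.

    Hypothetical helper; assumes the caller passes start <= cap.
    """
    if cap >= config.sampling_start:
        # Raising (or keeping) the cap first leaves room for the new start.
        config.sampling_cap = cap
        config.sampling_start = start
    else:
        # The new cap is below the current start, so lower the start first.
        config.sampling_start = start
        config.sampling_cap = cap


cfg = SamplingBounds()
set_bounds(cfg, start=50, cap=100)
print(cfg.sampling_start, cfg.sampling_cap)
```

Assigning `sampling_cap = 100` directly on a fresh config raises `AssertionError`, because the default start is 10000. That is consistent with the test in this PR, which lowers `sampling_start` before `sampling_cap` and restores `sampling_cap` before `sampling_start`. Note also that the exact type check rejects `True`/`False` with a warning, and that `assert` statements are skipped when Python runs with `-O`.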
12 changes: 7 additions & 5 deletions lux/executor/PandasExecutor.py
@@ -40,17 +40,19 @@ def __repr__(self):
 @staticmethod
 def execute_sampling(ldf: LuxDataFrame):
 # General Sampling for entire dataframe
-SAMPLE_START = 10000
-SAMPLE_CAP = 30000
+SAMPLE_FLAG = lux.config.sampling
+SAMPLE_START = lux.config.sampling_start
+SAMPLE_CAP = lux.config.sampling_cap
 SAMPLE_FRAC = 0.75
-if len(ldf) > SAMPLE_CAP:
+
+if SAMPLE_FLAG and len(ldf) > SAMPLE_CAP:
 if ldf._sampled is None: # memoize unfiltered sample df
 ldf._sampled = ldf.sample(n=SAMPLE_CAP, random_state=1)
 ldf._message.add_unique(
 f"Large dataframe detected: Lux is only visualizing a random sample capped at {SAMPLE_CAP} rows.",
 priority=99,
 )
-elif len(ldf) > SAMPLE_START:
+elif SAMPLE_FLAG and len(ldf) > SAMPLE_START:
 if ldf._sampled is None: # memoize unfiltered sample df
 ldf._sampled = ldf.sample(frac=SAMPLE_FRAC, random_state=1)
 ldf._message.add_unique(
@@ -99,7 +101,7 @@ def execute(vislist: VisList, ldf: LuxDataFrame):
 PandasExecutor.execute_binning(vis)
 elif vis.mark == "scatter":
 HBIN_START = 5000
-if len(ldf) > HBIN_START:
+if lux.config.heatmap and len(ldf) > HBIN_START:
 vis._postbin = True
 ldf._message.add_unique(
 f"Large scatterplots detected: Lux is automatically binning scatterplots to heatmaps.",
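
The scatter-to-heatmap decision in this hunk reduces to a single condition. A sketch with a hypothetical function name (`uses_heatmap`, not part of Lux), assuming the threshold stays hard-coded at 5000 as in the diff:

```python
HBIN_START = 5000  # threshold hard-coded in the hunk above


def uses_heatmap(n_rows: int, heatmap_enabled: bool = True) -> bool:
    """Return True when a scatter plot would be binned into a heatmap.

    Hypothetical helper for illustration; `heatmap_enabled` stands in for
    the value of lux.config.heatmap.
    """
    return heatmap_enabled and n_rows > HBIN_START


print(uses_heatmap(48895))
print(uses_heatmap(48895, heatmap_enabled=False))
```

Unlike the sampling bounds, the 5000-row threshold is not exposed through `lux.config` in this PR; only the on/off flag is configurable.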
35 changes: 35 additions & 0 deletions tests/test_config.py
@@ -196,6 +196,41 @@ def change_color_make_transparent_add_title(chart):
assert title_addition in exported_code_str


def test_sampling_flag_config():
    df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/airbnb_nyc.csv")
    df._repr_html_()
    assert df.recommendation["Correlation"][0].data.shape[0] == 30000
    lux.config.sampling = False
    df = df.copy()
    df._repr_html_()
    assert df.recommendation["Correlation"][0].data.shape[0] == 48895
    lux.config.sampling = True


def test_sampling_parameters_config():
    df = pd.read_csv("lux/data/car.csv")
    df._repr_html_()
    assert df.recommendation["Correlation"][0].data.shape[0] == 392
    lux.config.sampling_start = 50
    lux.config.sampling_cap = 100
    df = pd.read_csv("lux/data/car.csv")
    df._repr_html_()
    assert df.recommendation["Correlation"][0].data.shape[0] == 100
    lux.config.sampling_cap = 30000
    lux.config.sampling_start = 10000


def test_heatmap_flag_config():
    df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/airbnb_nyc.csv")
    df._repr_html_()
    assert df.recommendation["Correlation"][0]._postbin
    lux.config.heatmap = False
    df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/airbnb_nyc.csv")
    df = df.copy()
    assert not df.recommendation["Correlation"][0]._postbin
    lux.config.heatmap = True


# TODO: This test does not pass in pytest but is working in Jupyter notebook.
# def test_plot_setting(global_var):
# df = pytest.car_df
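
The tests added in this hunk restore the global flag on their last line, so a failing assertion would leave `lux.config` modified for later tests. One possible alternative is a context manager that restores values in a `finally` block. The sketch below uses a hypothetical helper (`temporary_config`) and a `SimpleNamespace` stand-in so it runs without Lux; it is meant for independent flags such as `sampling` and `heatmap`, not for `sampling_start`/`sampling_cap`, whose setters are order-sensitive.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_config(config, **overrides):
    """Set attributes on `config` and restore the previous values on exit.

    Hypothetical test helper; not part of Lux.
    """
    previous = {name: getattr(config, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(config, name, value)
        yield config
    finally:
        # Runs even if the body raises, e.g. on a failed assertion.
        for name, value in previous.items():
            setattr(config, name, value)


# Stand-in object so the sketch runs without Lux installed.
fake_config = SimpleNamespace(sampling=True, heatmap=True)

with temporary_config(fake_config, sampling=False):
    inside = fake_config.sampling  # False while the block is active

print(inside, fake_config.sampling)
```

In a real test, the stand-in would be replaced by `lux.config`, under the assumption that its flag properties accept plain attribute assignment as shown in this PR.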