CS 598 YP Spring 2024
Last Updated: January 30th 2024
Deadline: February 27th 2024, 11:59 PM CT
In this project, you will implement Online Aggregation (OLA) for a few basic operations in Pandas. You will observe OLA in action through dynamically updating Plotly plots, which display incrementally improving estimates as the dataframe is processed.
To get started, you will need to clone this repository to your own GitHub account. Please do not commit directly to this repository! You will then need to make your cloned repository private. To do so, navigate to the "Change repository visibility" setting in the "Settings" tab:
The goal of this project is to answer questions on the Predict Future Sales dataset (included in this repository as sales_train.csv) in OLA fashion.
The starter code for sampling the dataframe and dividing it into suitably sized slices for incremental processing has been provided to you in utils.py.
The visualization code is also provided to you in Visualization.ipynb.
Your task is to implement OLA for 5 different operations in ola.py:
- Filtered mean, i.e., avg(x) where y = z (5 points)
- Grouped means, i.e., avg(x) group by y (10 points)
- Grouped sums, i.e., sum(x) group by y (10 points)
- Grouped counts, i.e., count(x) group by y (10 points)
- Filtered cardinality via HLL, i.e., count_distinct(x) where y = z (Extra credit, 5 points)
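For readers unfamiliar with HLL (HyperLogLog), the following is a minimal, self-contained sketch of the general technique, not the implementation the autograder expects. The class name `HyperLogLog`, the register count (2^10), and the choice of SHA-1 as the hash are all assumptions made for illustration; the expected results in this repository may rely on a specific library or parameter choice.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical, not the course's reference)."""

    def __init__(self, p: int = 10):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this formula is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item) -> None:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        idx = h >> (64 - self.p)                    # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Usage: only values from rows that pass the filter (y = z) would be added.
hll = HyperLogLog()
for i in range(20000):
    hll.add(i)
approx = hll.estimate()
```

The state is a fixed-size list of registers regardless of how many rows are processed, which is what makes the approach compatible with a bounded bookkeeping budget. Adding a value that was already seen never changes the registers, so duplicates across slices do not inflate the estimate.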
You can find the skeleton code for each operation in the child classes of the base Ola class (e.g., GroupByAvgOla). An implementation of computing the mean with OLA (i.e., avg(x)) is provided to you as an example in the AvgOla class.
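To illustrate the idea behind an OLA mean, here is a minimal sketch that keeps only a running sum and a running count between slices. It is a stand-alone illustration: the class `RunningMean` and its constructor are hypothetical and do not reproduce the actual AvgOla code or the base Ola interface, and the Plotly update step is omitted.

```python
import pandas as pd


class RunningMean:
    """Hypothetical stand-in for an OLA mean; not the repository's AvgOla."""

    def __init__(self, col: str):
        self.col = col
        self.total = 0.0   # sum of the non-null values seen so far
        self.count = 0     # number of non-null values seen so far

    def process_slice(self, df_slice: pd.DataFrame) -> float:
        values = df_slice[self.col]
        self.total += float(values.sum())      # sum() skips NaN by default
        self.count += int(values.count())      # count() excludes NaN
        return self.total / self.count if self.count else float("nan")


# Feed a small dataframe in slices of two rows and collect each estimate.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
ola = RunningMean("x")
estimates = [ola.process_slice(df.iloc[i:i + 2]) for i in range(0, len(df), 2)]
```

Each call returns the mean of everything seen so far, so the estimate converges to the exact mean once the last slice has been processed.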
For each operation, you will implement the logic for processing incoming dataframe slices in the process_slice method: when a new slice arrives, you will perform computations on the slice to improve your estimated values, then update the Plotly plot with the improved estimates.
You are also allowed to use a limited amount of space for bookkeeping (e.g., storing rolling averages) in the form of class variables that persist across the processing of subsequent slices.
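As a sketch of what bounded bookkeeping can look like for a grouped aggregate, the example below stores one running number per group rather than any rows. It also scales the partial sums by total rows over rows seen, a common OLA technique for sums and counts when slices are a random sample. The class `GroupedSumEstimator`, its `total_rows` parameter, and the scaling itself are assumptions for illustration; this README does not state whether the expected results use such scaling.

```python
from collections import defaultdict

import pandas as pd


class GroupedSumEstimator:
    """Hypothetical grouped-sum estimator; state is one float per group."""

    def __init__(self, group_col: str, value_col: str, total_rows: int):
        self.group_col = group_col
        self.value_col = value_col
        self.total_rows = total_rows            # assumed known up front
        self.partial_sums = defaultdict(float)  # per-group sum seen so far
        self.rows_seen = 0

    def process_slice(self, df_slice: pd.DataFrame) -> dict:
        self.rows_seen += len(df_slice)
        if self.rows_seen == 0:
            return {}
        slice_sums = df_slice.groupby(self.group_col)[self.value_col].sum()
        for key, value in slice_sums.items():
            self.partial_sums[key] += float(value)
        # Extrapolate from the fraction of rows processed so far.
        scale = self.total_rows / self.rows_seen
        return {key: s * scale for key, s in self.partial_sums.items()}


df = pd.DataFrame({"y": ["a", "b", "a", "b"], "x": [1, 2, 3, 4]})
est = GroupedSumEstimator("y", "x", total_rows=len(df))
first = est.process_slice(df.iloc[0:2])
second = est.process_slice(df.iloc[2:4])
```

The memory used grows with the number of distinct groups, not with the number of rows, and the scale factor reaches 1 once every slice has been processed, so the final estimate equals the exact grouped sum.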
You are only required to implement the OLA logic; there is no SQL parsing involved in this assignment.
You can verify your implementations of OLA operations with the Visualization.ipynb notebook. It contains 5 Plotly plots, one for each OLA operation.
Once you correctly implement the OLA operations, you can click 'run all' to observe the Plotly plots dynamically updating with the processing of dataframe slices.
It is recommended that you run the notebook in a Jupyter Notebook session; the Plotly plots have been observed to fail on certain platforms/IDEs such as JupyterHub or PyCharm. Running the notebook and observing the plot updates is optional for this assignment: you are not required to record the dynamic updates.
Note: the dynamic plots may not work in JupyterLab. If you see display errors, consider using Jupyter Notebook to run the notebook instead, i.e., jupyter notebook.
You will be graded on the correctness of your implementations. You will receive the points for each operation if your implementation satisfies the following two criteria:
- The contents of the Plotly plots after each processed dataframe slice are correct
- The combined size of class variables you use for bookkeeping is smaller than a certain size during the OLA process (so no storing entire dataframes - that defeats the purpose of OLA)
You can check your assignment progress via the GitHub Actions workflow (described below). If the Actions workflow passes, you will receive a full score for this assignment.
You should not modify utils.py, test_ola.py or the expected_results directory as they are used by the autograder.
GitHub Actions can be accessed by clicking in the location specified in the image below. It currently displays a red X because the tests are failing (as nothing has been implemented yet).
You can check which test cases have passed or failed by clicking on the details tab:
A full-score submission with a passing GitHub Actions workflow looks like this:
You will submit your work for Project 1 by uploading the URL of your private repository to the Project 1 - OLA assignment on Canvas. You will also need to share access to your private repository with the two course TAs:
- Billy Li (BillyZhaohengLi)
- Hanxi Fang (iq180fq200)
You can share access by navigating to Settings -> Collaborators: