Project 1

CS 598 YP Spring 2024
Last Updated: January 30th 2024
Deadline: February 27th 2024, 11:59 PM CT

Project Overview

In this project, you will implement Online Aggregation (OLA) for a few basic operations in Pandas. You will observe OLA in action through dynamically updating Plotly plots, which display incrementally improving estimates as the dataframe is processed.

Getting Started

To get started, you will need to clone this repository to your own GitHub account. Please do not commit directly to this repository! You will then need to make your cloned repository private. To do so, navigate to the "Change repository visibility" setting in the "Settings" tab:

The goal of this project is to answer questions on the Predict Future Sales dataset (included in this repository as sales_train.csv) in OLA fashion. The starter code for sampling the dataframe and dividing it into suitable-sized slices for incremental processing has been provided to you in utils.py. The visualization code is also provided to you in Visualization.ipynb.
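To make the slice-by-slice processing concrete, here is a minimal sketch of shuffling a dataframe and splitting it into fixed-size slices. The function name `iter_slices`, its parameters, and the toy columns are assumptions made for illustration only; the actual helpers in utils.py may work differently and should be used as provided.

```python
import pandas as pd


def iter_slices(df, slice_size, seed=0):
    """Yield consecutive row slices of a shuffled copy of df.

    Hypothetical helper for illustration; this is not the utils.py API.
    """
    # Shuffle all rows so that each slice behaves like a random sample.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    for start in range(0, len(shuffled), slice_size):
        yield shuffled.iloc[start:start + slice_size]


# Toy data standing in for sales_train.csv.
toy = pd.DataFrame({"x": range(10), "y": list("ababababab")})
slices = list(iter_slices(toy, slice_size=4))
slice_lengths = [len(s) for s in slices]
print(slice_lengths)  # the last slice holds the leftover rows
```

Shuffling before slicing matters for OLA: estimates computed from the first few slices are only meaningful if those slices are representative of the whole dataframe.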

Tasks

Your task is to implement OLA for 5 different operations in ola.py:

Filtered mean, i.e., avg(x) where y = z (5 points)
Grouped means, i.e., avg(x) group by y (10 points)
Grouped sums, i.e., sum(x) group by y (10 points)
Grouped counts, i.e., count(x) group by y (10 points)
Filtered cardinality via HyperLogLog (HLL), i.e., count_distinct(x) where y = z (Extra credit, 5 points)
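For reference, the five queries above can be written as ordinary (non-incremental) Pandas expressions over a full dataframe. This sketch only illustrates what each query computes; the column names x and y, the filter value "a" standing in for z, and the toy data are assumptions, and it does not show the OLA logic you are asked to implement.

```python
import pandas as pd

# Toy dataframe; the column names and values are placeholders.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": ["a", "b", "a", "b"],
})

# avg(x) where y = 'a'
filtered_mean = df.loc[df["y"] == "a", "x"].mean()

# avg(x) group by y
grouped_means = df.groupby("y")["x"].mean()

# sum(x) group by y
grouped_sums = df.groupby("y")["x"].sum()

# count(x) group by y
grouped_counts = df.groupby("y")["x"].count()

# count_distinct(x) where y = 'a' (exact here; HLL would approximate it)
filtered_distinct = df.loc[df["y"] == "a", "x"].nunique()

print(filtered_mean, filtered_distinct)
print(grouped_means.to_dict(), grouped_sums.to_dict(), grouped_counts.to_dict())
```

An OLA implementation should converge to these exact answers once every slice has been processed, while reporting estimates along the way.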

You can find the skeleton code for each operation in the child classes of the base Ola class (e.g., GroupByAvgOla). An implementation of the mean with OLA (i.e., avg(x)) is provided as an example in the AvgOla class. For each operation, you will implement the logic for processing incoming dataframe slices in the process_slice method: when a new slice arrives, you will perform computations on the slice to improve your estimates, then update the Plotly plot with the improved estimates. You may also use a limited amount of space for bookkeeping (e.g., storing rolling averages) in the form of class variables that persist across the processing of subsequent slices.
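The bookkeeping idea can be sketched with a plain running mean: keep only a running sum and a running count, and refresh the estimate after every slice. The class below is a hypothetical stand-in written for this illustration; it does not subclass the repository's Ola class, its process_slice signature is an assumption, and it omits the Plotly update step.

```python
import pandas as pd


class RunningMean:
    """Hypothetical stand-in for an OLA operator (not the repository's Ola API)."""

    def __init__(self):
        # Constant-size bookkeeping: two scalars, no stored rows.
        self.total = 0.0
        self.count = 0

    def process_slice(self, df_slice):
        """Fold one slice into the running state and return the new estimate."""
        values = df_slice["x"].dropna()
        self.total += float(values.sum())
        self.count += int(values.count())
        return self.total / self.count if self.count else float("nan")


df = pd.DataFrame({"x": [2.0, 4.0, 6.0, 8.0]})
op = RunningMean()

# Feed the dataframe two rows at a time and record each estimate.
estimates = [op.process_slice(df.iloc[i:i + 2]) for i in range(0, len(df), 2)]
print(estimates)  # [3.0, 5.0]
```

After the last slice the estimate equals the exact mean of the column, which is the behavior an OLA operator should exhibit once all data has been seen.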

You are only required to implement the OLA logic; there is no SQL parsing involved in this assignment.

Verifying Your Results

You can verify your implementations of OLA operations with the Visualization.ipynb notebook. It contains 5 Plotly plots, one for each OLA operation. Once you correctly implement the OLA operations, you can click 'run all' to observe the Plotly plots dynamically updating with the processing of dataframe slices.

It is recommended that you run the notebook in a Jupyter Notebook session; the Plotly plots have been observed to fail on certain platforms/IDEs such as JupyterHub or PyCharm. Running the notebook and observing the plot updates is optional for this assignment: you are not required to record the dynamic updates.

Note: the dynamic plots may not work in JupyterLab. If you see display errors, consider running the notebook in Jupyter Notebook instead, i.e., by launching it with the jupyter notebook command.

Grading

You will be graded on the correctness of your implementations. You will receive the points for each operation if your implementation satisfies the following two criteria:

The contents of the Plotly plots after each processed dataframe slice are correct
The combined size of class variables you use for bookkeeping is smaller than a certain size during the OLA process (so no storing entire dataframes - that defeats the purpose of OLA)
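The second criterion can be illustrated by comparing the size of scalar bookkeeping state with the size of the dataframe it summarizes. The measure below (sys.getsizeof over a small dict) and the helper name `state_size_bytes` are assumptions for this sketch; the autograder's actual size check and threshold are not described here and may differ.

```python
import sys

import pandas as pd


def state_size_bytes(state):
    """Rough size of a dict of scalar bookkeeping values.

    An assumed measure for illustration, not the autograder's check.
    """
    return sys.getsizeof(state) + sum(
        sys.getsizeof(key) + sys.getsizeof(value) for key, value in state.items()
    )


df = pd.DataFrame({"x": [float(i) for i in range(10_000)]})
state = {"total": 0.0, "count": 0}
sizes = []

# Process the dataframe in slices of 1,000 rows, tracking the state size.
for start in range(0, len(df), 1_000):
    chunk = df.iloc[start:start + 1_000]
    state["total"] += float(chunk["x"].sum())
    state["count"] += len(chunk)
    sizes.append(state_size_bytes(state))

df_bytes = int(df.memory_usage(deep=True).sum())
print(max(sizes), df_bytes)  # state stays small; the dataframe is far larger
```

The state size does not grow with the number of rows processed, whereas keeping the slices themselves would grow linearly. For grouped operations the state grows with the number of groups rather than the number of rows.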

You can check your assignment progress via the GitHub Actions workflow (described below). If the workflow passes, you will receive a full score for this assignment.

You should not modify utils.py, test_ola.py or the expected_results directory as they are used by the autograder.

Accessing GitHub Actions

GitHub Actions can be accessed by clicking the location shown in the image below. It currently displays a red X because the tests are failing (nothing has been implemented yet).

You can check which test cases have passed or failed by clicking on the details tab:

A full-score submission with a passing GitHub Actions workflow looks like this:

Submission instructions

You will submit your work for Project 1 by uploading the URL of your private repository to the Project 1 - OLA assignment on Canvas. You will also need to share access to your private repository with the two course TAs:

Billy Li (BillyZhaohengLi)
Hanxi Fang (iq180fq200)

You can share access by navigating to Settings -> Collaborators:

illinoisdata/CS598-MP1-OLA
