We created this repository to help data scientists learning PySpark become familiar with the tools and functionality available in the API. It contains 11 lessons covering core concepts in data manipulation. The repository was forked from Guipsamora's Pandas Exercises project and repurposed to solve the same exercises using the PySpark API instead of pandas.
Tutorials are great resources, but to learn is to do: unless you practice, you won't learn. PySpark is no exception!
There will be three different types of files:
1. Exercise instructions
2. Solutions without code
3. Solutions with code and comments
Our suggestion is that you learn a topic from a tutorial, video, or documentation and then do the first exercises. Learn one more topic and do more exercises. If you get stuck, don't go directly to the solutions with code. Check the solutions without code first and try to reach the correct answer yourself.
Suggestions and collaborations are more than welcome. 🙂 Please open an issue or make a PR indicating the exercise and your problem/solution.
As a community project, we're seeking help converting this repo into a complete resource for mastering PySpark.
We need assistance with the following:
Select an issue in the Issues tab corresponding to one of the tutorial directories. In your pull request, rewrite the exercises in that directory using PySpark instead of pandas. So far, we've listed issues for every exercise in the repo.
We have a lot of refactoring to do outside of the lessons. If you see something that needs to be changed, please either raise an issue in the Issues tab or open a pull request for an existing issue.
Our README could use some work. For instance, we should list ways to run PySpark on local machines (Windows, macOS, Linux).
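As a starting point for contributors, here is a minimal sketch of a local setup check together with one pandas-style aggregation rewritten in PySpark. It assumes PySpark was installed with `pip install pyspark` and that a Java runtime is available, which PySpark needs on every operating system. The function names, the `item_name` and `quantity` columns, and the app name are illustrative assumptions and are not taken from any exercise in this repo.

```python
import importlib.util
import os
import platform
import shutil


def check_local_pyspark_prereqs():
    """Report whether the pieces PySpark needs locally appear to be present."""
    return {
        # Typically "Windows", "Darwin" (macOS) or "Linux".
        "os": platform.system(),
        # PySpark launches a JVM, so a Java runtime must be reachable.
        "java_on_path": shutil.which("java") is not None,
        "java_home_set": bool(os.environ.get("JAVA_HOME")),
        # True once `pip install pyspark` (or an equivalent) has been run.
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


def start_local_session(app_name="pyspark-exercises"):
    """Start a SparkSession that runs entirely on this machine."""
    # Imported here so the prerequisite check above still works
    # on a machine where PySpark is not installed yet.
    from pyspark.sql import SparkSession

    # "local[*]" runs Spark in-process using all available cores.
    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()


def total_quantity_per_item(spark, rows):
    """Sum `quantity` per `item_name`, largest total first.

    Rough pandas counterpart (illustrative, not copied from an exercise):
        df.groupby("item_name")["quantity"].sum().sort_values(ascending=False)

    `rows` is a list of (item_name, quantity) tuples; both column
    names are hypothetical.
    """
    from pyspark.sql import functions as F

    df = spark.createDataFrame(rows, ["item_name", "quantity"])
    return (
        df.groupBy("item_name")
        .agg(F.sum("quantity").alias("total_quantity"))
        .orderBy(F.desc("total_quantity"))
    )


if __name__ == "__main__":
    # Print the prerequisite report; this part needs only the standard library.
    for name, value in check_local_pyspark_prereqs().items():
        print(f"{name}: {value}")
```

Once the report shows both Java and PySpark as available, `spark = start_local_session()` followed by `total_quantity_per_item(spark, rows).show()` on a small list of made-up tuples makes a quick smoke test. Call `spark.stop()` when finished.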
Getting and knowing | Merge | Time Series |
Filtering and Sorting | Stats | Deleting |
Grouping | Visualization | Indexing |
Apply | Creating Series and DataFrames | Exporting |
Chipotle
Occupation
World Food Facts
Chipotle
Euro12
Fictional Army
Alcohol Consumption
Occupation
Regiment
Students Alcohol Consumption
US_Crime_Rates
Auto_MPG
Fictitious Names
House Market
Chipotle
Titanic Disaster
Scores
Online Retail
Tips
Apple_Stock
Getting_Financial_Data
Investor_Flow_of_Funds_US
Video tutorials of data scientists working through the above exercises: