👋 Welcome!
In this repository you can find my code for creating an automated data pipeline in the cloud using Python, SQL, and AWS. (This project is part of the Data Science Bootcamp at WBS Coding School.)
In this hypothetical project, I am hired as a data scientist by an e-scooter-sharing company called Gans. The goal is to assemble a data pipeline that automatically collects, from external sources, the data needed to predict e-scooter movement.
- I collected city information with web scraping and HTML parsing, using the Python library Beautiful Soup (see the first sketch after the list).
- I collected weather data (OpenWeather) and flight information (AeroDataBox) by making requests to Application Programming Interfaces (APIs) with Python's requests library (see the second sketch after the list).
- I created a SQL database to store the data (MySQL Workbench; see the third sketch after the list).
- I transferred the project to the cloud and scheduled it to run periodically (AWS RDS, Lambda, and EventBridge; see the fourth sketch after the list).
- In snippets_for_webscraping_and_APIcalls you can find examples of how to make API requests using OpenWeatherMap and Spotipy (you will need to add your own credentials; see the fifth sketch after the list).
- version_01 and version_02 are very similar, except that in version_01 city and airport data are loaded from .csv files and the code is structured to run for only one city, whereas in version_02 city and airport data are collected with web scraping (city data from Wikipedia) and API calls (airport ICAO codes from AeroDataBox; see the last sketch after the list).
- For both versions, the main script is the notebook called gans_main_script.ipynb.
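
Below are a few minimal sketches of the steps above. First, the web-scraping step: a hedged example of pulling a city fact from a Wikipedia page with Beautiful Soup. The URL pattern, the infobox lookup, and the scrape_city_population helper are illustrative assumptions, not code from this repository.

```python
# Illustrative sketch: scrape a city's population from its Wikipedia page.
# The "infobox" class and the label/value lookup are assumptions about
# Wikipedia's page layout and may need adjusting for individual pages.
import requests
from bs4 import BeautifulSoup

def scrape_city_population(city_name):
    url = f"https://en.wikipedia.org/wiki/{city_name}"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    # Wikipedia city pages keep key facts in an "infobox" table
    infobox = soup.find("table", class_="infobox")
    # Find the "Population" label, then read the value cell that follows it
    label = infobox.find(string=lambda s: s and "Population" in s)
    return label.find_next("td").get_text(strip=True)

print(scrape_city_population("Berlin"))
```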
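
Second, the API step: a sketch of requesting the 5-day / 3-hour forecast from OpenWeather with the requests library. The endpoint and response fields follow OpenWeather's public documentation; the API key is a placeholder you must replace with your own.

```python
# Sketch: query OpenWeather's 5-day / 3-hour forecast endpoint.
# Replace OPENWEATHER_API_KEY with your own key from openweathermap.org.
import requests

OPENWEATHER_API_KEY = "your_api_key_here"

def get_forecast(city, country_code):
    url = "https://api.openweathermap.org/data/2.5/forecast"
    params = {
        "q": f"{city},{country_code}",
        "appid": OPENWEATHER_API_KEY,
        "units": "metric",
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    forecast = response.json()
    # Each entry in "list" covers one 3-hour forecast window
    for entry in forecast["list"][:3]:
        print(entry["dt_txt"], entry["main"]["temp"], "°C")

get_forecast("Berlin", "DE")
```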
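
Third, the storage step: a sketch of writing a pandas DataFrame into a MySQL database with SQLAlchemy. The connection string and the cities table name are placeholders for your own setup.

```python
# Sketch: store collected data in MySQL via SQLAlchemy + pandas.
# The connection string (user, password, host, database) is a placeholder;
# the pymysql driver must be installed for the "mysql+pymysql" dialect.
import pandas as pd
import sqlalchemy

cities = pd.DataFrame({
    "city": ["Berlin", "Hamburg"],
    "country": ["Germany", "Germany"],
})

engine = sqlalchemy.create_engine(
    "mysql+pymysql://user:password@localhost:3306/gans"
)

# Append the rows; pandas creates the table if it does not exist yet
cities.to_sql("cities", con=engine, if_exists="append", index=False)
```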
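
Fourth, the automation step: a minimal sketch of an AWS Lambda entry point that an EventBridge schedule rule can invoke periodically. It assumes the notebook logic has been refactored into plain functions; collect_and_store is a placeholder name.

```python
# Sketch of an AWS Lambda handler. An EventBridge schedule rule
# (e.g. a daily cron expression) invokes lambda_handler periodically.

def collect_and_store():
    # placeholder for the scraping, API, and database logic above
    pass

def lambda_handler(event, context):
    collect_and_store()
    return {"statusCode": 200, "body": "Data pipeline executed."}
```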
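
Fifth, matching the snippets folder: a sketch of querying the Spotify Web API through Spotipy with client credentials. The client_id and client_secret values are placeholders for your own credentials.

```python
# Sketch: authenticate against the Spotify Web API with Spotipy.
# Replace the client_id/client_secret placeholders with your own
# credentials from the Spotify developer dashboard.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(
    auth_manager=SpotifyClientCredentials(
        client_id="your_client_id",
        client_secret="your_client_secret",
    )
)

# Search for an artist and print the name of the top match
results = sp.search(q="Daft Punk", type="artist", limit=1)
print(results["artists"]["items"][0]["name"])
```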
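
Finally, the airport lookup used in version_02: a sketch of fetching ICAO codes near a set of coordinates with the AeroDataBox API on RapidAPI. The endpoint path, response shape, and RAPIDAPI_KEY placeholder are assumptions based on AeroDataBox's RapidAPI listing and may have changed; check the current documentation before using.

```python
# Sketch: find airports (and their ICAO codes) near given coordinates
# with AeroDataBox via RapidAPI. Endpoint path and response fields are
# assumptions; verify them against the current RapidAPI documentation.
import requests

RAPIDAPI_KEY = "your_rapidapi_key_here"

def airports_near(lat, lon, radius_km=50):
    url = (
        "https://aerodatabox.p.rapidapi.com/airports/search/location"
        f"/{lat}/{lon}/km/{radius_km}/10"
    )
    headers = {
        "X-RapidAPI-Key": RAPIDAPI_KEY,
        "X-RapidAPI-Host": "aerodatabox.p.rapidapi.com",
    }
    response = requests.get(
        url, headers=headers, params={"withFlightInfoOnly": "true"}
    )
    response.raise_for_status()
    return [airport["icao"] for airport in response.json()["items"]]

print(airports_near(52.52, 13.41))  # coordinates of Berlin
```
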
You can refer to this Medium article for a more detailed description of the project and the code.