In this pipeline, data from the Polish job listings website https://www.pracuj.pl/ is extracted, transformed and loaded into a data warehouse. The pipeline consists of two Python scripts: a web scraping tool and a data transformation tool.
The web scraping tool uses Beautiful Soup as the main scraping framework. Since some content in the source HTML is dynamic and cannot be extracted with Beautiful Soup alone, the Selenium library is used as a second scraping framework. The tool looks for job listings based on the day they were published, configured with a start date and, optionally, an end date that defaults to today's date. Once the data for a given day has been extracted, that date is written to a metafile so that the same job listings are not scraped again. The job data is written to Apache Parquet files and stored in an AWS S3 bucket.
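A minimal sketch of how such a scraping step could look, assuming headless Chrome, a hypothetical date filter query parameter, and placeholder CSS selectors and bucket name (none of these are taken from the actual tool):

```python
# Sketch only: selector names, the date query parameter and the bucket name
# are placeholders, not the real tool's values.
from datetime import date

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def scrape_day(day: date) -> pd.DataFrame:
    """Render a listing page with Selenium and parse it with Beautiful Soup."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        # Hypothetical query parameter for filtering by publication date.
        driver.get(f"https://www.pracuj.pl/praca?date={day.isoformat()}")
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()

    offers = []
    for offer in soup.select("div.listing-item"):  # assumed CSS selector
        offers.append({"title": offer.get_text(strip=True), "published": day.isoformat()})
    return pd.DataFrame(offers)


if __name__ == "__main__":
    day = date.today()
    df = scrape_day(day)
    # Writing directly to S3 requires s3fs; the bucket name is a placeholder.
    df.to_parquet(f"s3://my-job-offers-bucket/offers_{day.isoformat()}.parquet")
```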
Once a new file appears in the S3 bucket, the data is transformed using the Pandas library: it is cleaned and then split into facts and dimensions. The facts in this case are the salaries offered by employers. Each new record in the dimension and fact tables is assigned a unique surrogate key. Finally, using the Amazon Redshift Python connector, the new data is loaded into Amazon Redshift, which serves as the data warehouse. The names of the files whose data has been transformed and loaded into the DWH are written to a metafile.
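A minimal sketch of the load step with the Amazon Redshift Python connector (`redshift_connector`), assuming placeholder connection details and a simplified fact table layout rather than the real schema:

```python
# Sketch only: host, credentials, table and column names are placeholders.
import pandas as pd
import redshift_connector


def load_fact_salary(df: pd.DataFrame) -> None:
    """Insert transformed salary rows into the fact table in Redshift."""
    conn = redshift_connector.connect(
        host="example-cluster.abc123.eu-central-1.redshift.amazonaws.com",
        database="dev",
        user="awsuser",
        password="***",
    )
    try:
        cursor = conn.cursor()
        rows = list(
            df[["salary_key", "offer_key", "min_salary", "max_salary"]]
            .itertuples(index=False, name=None)
        )
        cursor.executemany(
            "INSERT INTO fact_salary (salary_key, offer_key, min_salary, max_salary) "
            "VALUES (%s, %s, %s, %s)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()
```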
The pipeline is managed using Argo Workflows, a Kubernetes orchestration engine that allows containerised applications to be scheduled. For this purpose, Docker images of both tools (web scraping and data transformation) were created. The Kubernetes workflow manifest specifies a DAG (Directed Acyclic Graph) with two tasks, each responsible for pulling a specific Docker image and running one of the tools within a container.
The DAG is scheduled to run every day at 4:00 a.m. UTC. It starts with the web scraping task and, once that is finished, the data transformation task is executed. Kubernetes is hosted on a t3.micro Amazon EKS cluster (or was, until I found out it is not part of the AWS Free Tier services).
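A minimal sketch of such an Argo CronWorkflow manifest, with placeholder image names and with secrets, volumes and resource settings omitted; the actual manifest is more detailed:

```yaml
# Sketch only: names and images are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: pracuj-etl
spec:
  schedule: "0 4 * * *"        # every day at 4:00 a.m.
  timezone: "UTC"
  workflowSpec:
    entrypoint: etl
    templates:
      - name: etl
        dag:
          tasks:
            - name: web-scraping
              template: scrape
            - name: data-transformation
              template: transform
              dependencies: [web-scraping]   # runs only after scraping finishes
      - name: scrape
        container:
          image: myrepo/web-scraping-tool:latest        # placeholder image
      - name: transform
        container:
          image: myrepo/data-transforming-tool:latest   # placeholder image
```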
The DWH model uses a star schema. The measurement gathered in the DWH is the salary offered by the employer, which is stored in the fact table, while all other data forms dimension tables. Additionally, two bridge tables were created (https://www.kimballgroup.com/2012/02/design-tip-142-building-bridges/) to accommodate the many-to-many relationships between the fact table and the location and category dimension tables. Since salaries in job offers are often given as a range, the salary fact table contains both a minimum and a maximum salary (if the salary is not given as a range, then minimum = maximum).
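As an illustration of the fact/dimension split, here is a minimal sketch in Pandas, assuming simplified column names (employer, salary_from, salary_to) that are not taken from the actual model and covering only one dimension rather than the full schema with bridge tables:

```python
# Sketch only: column names and the single employer dimension are simplified.
import pandas as pd


def split_star_schema(offers: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split cleaned job offers into an employer dimension and a salary fact table."""
    # Build the employer dimension with a surrogate key.
    dim_employer = (
        offers[["employer"]]
        .drop_duplicates()
        .reset_index(drop=True)
        .rename_axis("employer_key")
        .reset_index()
    )

    # Attach the surrogate key to the offers and build the salary fact table.
    fact_salary = offers.merge(dim_employer, on="employer")
    # When a salary is not given as a range, minimum equals maximum.
    fact_salary["min_salary"] = fact_salary["salary_from"]
    fact_salary["max_salary"] = fact_salary["salary_to"].fillna(fact_salary["salary_from"])
    fact_salary = fact_salary[["employer_key", "min_salary", "max_salary"]]
    fact_salary.insert(0, "salary_key", range(1, len(fact_salary) + 1))
    return fact_salary, dim_employer
```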