Hello there! This is my data engineering repository, where I apply what I have learned about data engineering. The topics include building ETL pipelines, data modeling, error handling, code standards, logging, unit testing, and other related concepts. It is also where I document my progress in learning and applying these concepts.
This repository contains two main directories: Learning Projects and Projects. Learning Projects contains short projects in which I apply the concepts and knowledge that I have learned. Each of these will most probably take me about 1-2 weeks. Projects, on the other hand, contains larger projects that combine multiple concepts and knowledge gained from Learning Projects. Each of these will take me about 1-2 months to finish.
Here, you can see how I do the following:
- standardize Python scripts and SQL queries
- model data
- create an ETL pipeline
- apply unit tests to a pipeline
You are also welcome to provide feedback on anything that you think needs improvement.
The sections below describe the standards that I follow in each project, with the exception of the very first project.
General File Structure
General Data Model
General Pipeline
Python Scripts Standards
SQL Queries Standards
Project-
├── data/
|   ├── preprocessed/
|   |   ├── preprocessed data A
|   |   └── preprocessed data B
|   |
|   ├── raw/
|   |   ├── raw data A
|   |   └── raw data B
|   |
|   └── test/
|       ├── test data A
|       └── test data B
|
├── documents/
|   ├── data model
|   ├── file structure
|   ├── pipeline
|   └── requirements
|
├── scripts/
|   ├── etl
|   ├── code profiling
|   ├── style checker
|   ├── unit tests
|   └── main
|
└── README.md
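As a rough illustration, the directory layout above could be created with a few lines of Python. This is only a sketch: the function name `create_project_skeleton` and the `PROJECT_DIRECTORIES` list are hypothetical helpers, not scripts that exist in this repository, and only the directory names come from the tree above.

```python
from pathlib import Path

# Directory names follow the tree above; the list and helper are hypothetical.
PROJECT_DIRECTORIES = [
    "data/preprocessed",
    "data/raw",
    "data/test",
    "documents",
    "scripts",
]


def create_project_skeleton(project_root):
    """Create the standard directories and an empty README.md under project_root."""
    root = Path(project_root)
    created = []
    for relative_path in PROJECT_DIRECTORIES:
        directory = root / relative_path
        # parents=True also creates "data/"; exist_ok=True makes reruns safe
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    # Create README.md only if it is missing; existing content is untouched
    (root / "README.md").touch(exist_ok=True)
    return created
```

Calling `create_project_skeleton("my_project")` would create the folders without overwriting anything that is already there, so it should be safe to run on an existing project.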
graph TD;
data_source_A-->extracted_raw_data;
data_source_B-->extracted_raw_data;
data_source_C-->extracted_raw_data;
extracted_raw_data-->transform_A;
transform_A-->transform_B;
transform_A-->transform_C;
transform_B-->storage;
transform_C-->storage;
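The flow in the graph above can be sketched in plain Python. The function names mirror the node names in the graph, but what each transform actually does is an assumption made only for illustration (dropping empty values, summing values, counting records per source), and a dictionary stands in for the real storage.

```python
def extract(sources):
    """Combine the records of every data source into one raw list."""
    raw_records = []
    for records in sources:
        raw_records.extend(records)
    return raw_records


def transform_a(raw_records):
    """Hypothetical cleaning step: drop records that have no value."""
    return [record for record in raw_records if record.get("value") is not None]


def transform_b(records):
    """Hypothetical step: compute the total of all values."""
    return {"total_value": sum(record["value"] for record in records)}


def transform_c(records):
    """Hypothetical step: count the records that came from each source."""
    counts = {}
    for record in records:
        counts[record["source"]] = counts.get(record["source"], 0) + 1
    return counts


def run_pipeline(sources, storage):
    """Run extract -> transform_A -> (transform_B, transform_C) -> storage."""
    raw = extract(sources)
    cleaned = transform_a(raw)
    # transform_B and transform_C both read the output of transform_A
    storage["transform_b"] = transform_b(cleaned)
    storage["transform_c"] = transform_c(cleaned)
    return storage


# Made-up sample data for three sources
source_a = [{"source": "A", "value": 10}]
source_b = [{"source": "B", "value": None}]
source_c = [{"source": "C", "value": 5}, {"source": "C", "value": 7}]
storage = run_pipeline([source_a, source_b, source_c], {})
```

Keeping each stage as its own small function makes it possible to test the stages separately before testing the pipeline as a whole.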
- Boolean variable names should start with "is_" or "has_".
- Boolean values should ONLY be "True" or "False" when stored in a database.
- Date variable names should start with "date_".
- Date values should be formatted as "YYYY-MM-DD".
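The naming and value rules above could look like this in practice. This is a sketch under two assumptions: the function `build_customer_record` and its fields are made up for illustration, and the Boolean rule is read literally, so Booleans are stored as the text values "True" and "False".

```python
from datetime import date


def build_customer_record(name, is_active, has_orders, date_registered):
    """Return a record whose names and values follow the rules above."""
    return {
        "name": name,
        # Boolean names start with "is_" or "has_"; values become "True"/"False"
        "is_active": str(bool(is_active)),
        "has_orders": str(bool(has_orders)),
        # Date names start with "date_"; date.isoformat() gives YYYY-MM-DD
        "date_registered": date_registered.isoformat(),
    }


record = build_customer_record("Eve", 1, 0, date(2024, 3, 5))
```

Here `record["is_active"]` is "True", `record["has_orders"]` is "False", and `record["date_registered"]` is "2024-03-05".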
- Functions should only do one thing.
- Functions must display an example output, if applicable.
- Functions must have a docstring, a short explanation if needed, a try-except statement, and logging outputs.
import logging

logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)


def add(number_1, number_2):
    '''
    Add two numbers and return the sum as a float
    ----
    Parameters
    number_1: int/float - the first number
    number_2: int/float - the second number
    ----
    Return
    result: float - the sum of the first and second number in float type
    ----
    Example
    >>> add(4, 5)
    9.0
    '''
    try:
        result = float(number_1 + number_2)
    except Exception as e:  # Catch all kinds of errors
        logging.error(f"{e} caught in execution.")
        raise  # Re-raise, because result is undefined when an error occurs
    else:
        logging.info(f"Added {number_1} and {number_2} = {result}")
        return result
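Since unit testing is one of the topics of this repository, here is a minimal sketch of how the example function could be tested with the standard `unittest` module. The block repeats `add` so that it runs on its own, and it assumes that `add` re-raises the error after logging it. The test class `TestAdd` and its test names are hypothetical.

```python
import logging
import unittest


def add(number_1, number_2):
    """Copy of the example function (assumed to re-raise after logging)."""
    try:
        result = float(number_1 + number_2)
    except Exception as e:
        logging.error(f"{e} caught in execution.")
        raise
    else:
        logging.info(f"Added {number_1} and {number_2} = {result}")
        return result


class TestAdd(unittest.TestCase):
    def test_returns_float_sum(self):
        self.assertEqual(add(4, 5), 9.0)
        self.assertIsInstance(add(4, 5), float)

    def test_logs_info_on_success(self):
        # assertLogs captures records from the root logger at INFO or above
        with self.assertLogs(level="INFO") as captured:
            add(1, 2)
        self.assertIn("INFO:root:Added 1 and 2 = 3.0", captured.output)

    def test_logs_error_and_raises_on_bad_input(self):
        # Adding a str and an int raises TypeError, which is logged and re-raised
        with self.assertLogs(level="ERROR") as captured:
            with self.assertRaises(TypeError):
                add("4", 5)
        self.assertEqual(len(captured.output), 1)
```

If this block is saved as a file, for example `test_add.py`, the tests can be run with `python -m unittest test_add`.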
- SQL queries should follow the Modern SQL Style Guide.
select t1.name
     , t2.value
  from table_one as t1
  left join table_two as t2
    on t1.id = t2.id
 where t1.name like 'E%'
   and t2.value > 100
 order by t1.name
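To try the query above without setting up a database server, it can be run against an in-memory SQLite database from Python. This is only a sketch: the table definitions and the sample rows are made up for illustration, and the function name `run_example_query` is hypothetical.

```python
import sqlite3

QUERY = """
select t1.name
     , t2.value
  from table_one as t1
  left join table_two as t2
    on t1.id = t2.id
 where t1.name like 'E%'
   and t2.value > 100
 order by t1.name
"""


def run_example_query():
    """Create sample tables in memory, run the query, and return the rows."""
    connection = sqlite3.connect(":memory:")
    try:
        # Assumed schema and sample data, only for this illustration
        connection.executescript(
            """
            create table table_one (id integer primary key, name text);
            create table table_two (id integer primary key, value integer);
            insert into table_one values (1, 'Edgar'), (2, 'Alice'), (3, 'Eve'), (4, 'Emma');
            insert into table_two values (1, 150), (2, 300), (3, 50);
            """
        )
        return connection.execute(QUERY).fetchall()
    finally:
        connection.close()


rows = run_example_query()
```

With this sample data, only Edgar is returned. Emma has no matching row in `table_two`, so her `value` is null, and the condition `t2.value > 100` filters her out even though the query uses a left join.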