Data Engineering

A repository to document my knowledge

Background

Hello there! This is my data engineering repository, where I apply what I've learned about data engineering. The topics include building ETL pipelines, data modeling, error handling, code standards, logging, unit testing, and other related concepts. It is also where I document my progress in learning and applying these concepts.

Additionally, this repository contains two main directories: Bite Projects and Projects. Bite Projects contains short projects in which I apply specific concepts that I have learned; each will most likely take me about 1-2 weeks. Projects, on the other hand, contains larger projects that combine multiple concepts practiced in Bite Projects; each will take me about 1-2 months to finish.

How can this repository help you

You may find it useful to see how I do the following:

standardize Python scripts and SQL queries
model data
create an ETL pipeline
apply unit tests to a pipeline

How can you help

provide feedback on anything that you think needs improvement

General File Structure

Project-
├── data/
|   ├── preprocessed/
|   |   ├── preprocessed data A
|   |   └── preprocessed data B
|   |
|   ├── raw/
|   |   ├── raw data A
|   |   └── raw data B
|   |
|   └── test/
|       ├── test data A
|       └── test data B
|
├── documents/
|   ├── data model
|   ├── file structure
|   ├── pipeline
|   └── requirements
|
├── scripts/
|   ├── etl
|   ├── code profiling
|   ├── style checker
|   ├── unit tests
|   └── main
|
└── README.md

General Pipeline

graph TD;
    data_source_A-->extracted_raw_data;
    data_source_B-->extracted_raw_data;
    data_source_C-->extracted_raw_data;
    extracted_raw_data-->transform_A;
    transform_A-->transform_B;
    transform_A-->transform_C;
    transform_B-->storage;
    transform_C-->storage;
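
The graph above can be sketched in plain Python as follows. This is only an illustration of the flow, not the code used in the projects: the function names, the record layout (dictionaries with "id" and "value" keys), and what each transform does are all assumptions made for the example.

```python
def extract(sources):
    """Collect records from every data source into one raw list."""
    raw = []
    for source in sources:
        raw.extend(source)
    return raw


def transform_a(raw):
    """Shared step (hypothetical): drop records that have no value."""
    return [record for record in raw if record.get("value") is not None]


def transform_b(cleaned):
    """First branch (hypothetical): keep only the id and value fields."""
    return [{"id": r["id"], "value": r["value"]} for r in cleaned]


def transform_c(cleaned):
    """Second branch (hypothetical): total the values of the cleaned records."""
    return {"total_value": sum(r["value"] for r in cleaned)}


def run_pipeline(sources):
    """Run extract -> transform_A -> (transform_B, transform_C) -> storage."""
    storage = {}  # stand-in for a real database or file store
    raw = extract(sources)
    cleaned = transform_a(raw)
    storage["records"] = transform_b(cleaned)
    storage["summary"] = transform_c(cleaned)
    return storage


# Three made-up sources, mirroring data_source_A/B/C in the graph.
data_source_a = [{"id": 1, "value": 10}]
data_source_b = [{"id": 2, "value": None}]
data_source_c = [{"id": 3, "value": 5}]

storage = run_pipeline([data_source_a, data_source_b, data_source_c])
```

Keeping each stage in its own function makes it possible to unit test every transform separately from the extraction and storage steps.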

Python Scripts Standards

Variable Names and Values

Boolean variable names should start with "is_" or "has_".
Boolean values should ONLY be "True" and "False" when stored in a database.
Date variable names should start with "date_".
Date values should be formatted as "YYYY-MM-DD".
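
A minimal sketch of these conventions, assuming that "stored in a database" means the booleans are written as the strings "True" and "False". The variable names and the to_db_row helper are hypothetical and exist only for this example.

```python
from datetime import date

# Boolean names start with "is_" or "has_"; date names start with "date_".
is_active = True
has_discount = False
date_signup = date(2024, 1, 31)


def to_db_row(is_active, has_discount, date_signup):
    """Convert values to the text forms described by the rules above."""
    return {
        "is_active": str(is_active),        # "True" or "False"
        "has_discount": str(has_discount),  # "True" or "False"
        "date_signup": date_signup.isoformat(),  # "YYYY-MM-DD"
    }


row = to_db_row(is_active, has_discount, date_signup)
```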

Functions

Should only do one thing.
Must display an example output if applicable.
Must have a docstring, a short explanation if needed, a try-except statement, and logging output.

import logging
logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)


def add(number_1, number_2):
    '''
    Add two numbers and return the sum as a float
    ----
    Parameters
    number_1: int/float - the first number
    number_2: int/float - the second number
    ----
    Return
    result: float - the sum of the first and second number in float type
            (None if an error is caught and logged)
    ----
    Example
    >>> add(4, 5)
    9.0
    '''
    try:
        result = float(number_1 + number_2)
    except Exception as e:  # Catch all kinds of errors
        logging.error(f"{e} caught in execution.")
        return None  # Make the failure path explicit
    else:
        logging.info(f"Added {number_1} and {number_2} = {result}")
        return result
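
Since unit testing is one of the topics of this repository, here is a hedged sketch of how a function written to these standards could be tested with the standard unittest module. The add() function is repeated in condensed form so the block runs on its own; the test class and test names are assumptions made for this example.

```python
import logging
import unittest


def add(number_1, number_2):
    """Condensed copy of the add() example above."""
    try:
        result = float(number_1 + number_2)
    except Exception as e:
        logging.error(f"{e} caught in execution.")
        return None
    else:
        logging.info(f"Added {number_1} and {number_2} = {result}")
        return result


class TestAdd(unittest.TestCase):
    def test_returns_sum_as_float(self):
        result = add(4, 5)
        self.assertEqual(result, 9.0)
        self.assertIsInstance(result, float)

    def test_logs_error_and_returns_none_on_bad_input(self):
        # 4 + "5" raises TypeError inside add(), which is caught and logged.
        with self.assertLogs(level="ERROR") as captured:
            result = add(4, "5")
        self.assertIsNone(result)
        self.assertEqual(len(captured.records), 1)
```

If this were saved as a file such as test_add.py (a hypothetical name), it could be run with python -m unittest test_add.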

SQL Queries Standards

SQL queries should follow the Modern SQL Style Guide.

select t1.name
     , t2.value
  from table_one as t1
  left join table_two as t2
    on t1.id = t2.id
 where t1.name like 'E%'
   and t2.value > 100
 order by t1.name
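
To try the query above, one option is an in-memory SQLite database from Python's standard sqlite3 module. This is only a sketch: the table columns and the sample rows are made up for the example, and SQLite is an assumption, since no specific database is named here.

```python
import sqlite3

# The query from the style example, unchanged.
QUERY = """
select t1.name
     , t2.value
  from table_one as t1
  left join table_two as t2
    on t1.id = t2.id
 where t1.name like 'E%'
   and t2.value > 100
 order by t1.name
"""

connection = sqlite3.connect(":memory:")

# Hypothetical schema and sample data.
connection.executescript("""
create table table_one (id integer primary key, name text);
create table table_two (id integer primary key, value integer);
insert into table_one values (1, 'Emma'), (2, 'Eli'), (3, 'Noah');
insert into table_two values (1, 150), (2, 50), (3, 300);
""")

rows = connection.execute(QUERY).fetchall()
connection.close()
```

Note that because the where clause filters on t2.value, rows of table_one without a match in table_two are dropped, so the left join behaves like an inner join in this particular query.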
