Welcome to the Netflix Data Analysis & Database Normalization project! This project explores the process of cleaning, normalizing, and analyzing Netflix data using SQL and database best practices. It also dives into database design principles to ensure optimized data storage and retrieval.
The original dataset contained two key tables:
- `titles`: Information on unique shows and movies.
- `credits`: Details on the cast and crew involved in each show or movie.
We discovered discrepancies between the two tables:
- The `credits` table contained more unique show IDs than the `titles` table, leading to inconsistencies.
We created a unified view by selecting only the records common to both tables, ensuring data consistency throughout the analysis (a sketch of this view appears after the list below). Afterward, we applied database normalization techniques to split the data into smaller, well-organized tables, which brought several benefits:
- Less Data Duplication: Improved storage efficiency by reducing redundancy.
- Increased Data Integrity: Accurate and consistent data across all tables.
- Improved Query Performance: Faster and more efficient queries through proper indexing and structure.
- Enhanced Security: More controlled access to sensitive information.
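For reference, the unified view described above could be built by keeping only the show IDs present in both source tables. The sketch below is illustrative only: the view names (`common_titles`, `common_credits`) and the key column `id` are assumptions, not necessarily the identifiers used in the project.

```sql
-- Minimal sketch, assuming the shared key column is named "id".
-- Keep only titles that also appear in credits ...
CREATE VIEW common_titles AS
SELECT t.*
FROM titles AS t
WHERE EXISTS (
    SELECT 1 FROM credits AS c WHERE c.id = t.id
);

-- ... and only credits whose show exists in titles.
CREATE VIEW common_credits AS
SELECT c.*
FROM credits AS c
WHERE EXISTS (
    SELECT 1 FROM titles AS t WHERE t.id = c.id
);
```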
Database design is the organization of data according to a database model. The designer determines what data must be stored and how the data elements interrelate.
After cleaning the Netflix data in Part 1, we obtained two tables: `titles`, containing information about unique shows/movies, and `credits`, containing information about the cast and crew of each show/movie. The data is now distributed across these two tables.
When we counted the unique shows in each table (both have an `id` column that identifies a unique show), we found that the `credits` table contains more unique show IDs than the `titles` table.
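A check along these lines can be written as a pair of DISTINCT counts, followed by an anti-join to list the offending IDs. The sketch below assumes the shared key column is called `id`:

```sql
-- How many distinct shows does each table reference? (key column assumed: id)
SELECT
    (SELECT COUNT(DISTINCT id) FROM titles)  AS unique_shows_in_titles,
    (SELECT COUNT(DISTINCT id) FROM credits) AS unique_shows_in_credits;

-- Which show IDs appear in credits but have no matching row in titles?
SELECT DISTINCT c.id
FROM credits AS c
LEFT JOIN titles AS t ON t.id = c.id
WHERE t.id IS NULL;
```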
- Conceptual Data Model: High-level view of key entities and relationships.
- Logical Data Model: Detailed relationships and entity specifications.
- Physical Data Model: Actual implementation of the tables, ensuring optimal performance.
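As an illustration of the physical layer, the normalized tables might be implemented roughly as follows (PostgreSQL-style syntax; every table and column name here is an assumption for illustration, not the project's exact schema):

```sql
-- Hypothetical physical model; table and column names are illustrative only.
CREATE TABLE titles (
    id                TEXT PRIMARY KEY,
    title             TEXT NOT NULL,
    type              TEXT,        -- e.g. 'MOVIE' or 'SHOW'
    release_year      INT,
    age_certification TEXT
);

CREATE TABLE persons (
    person_id INT PRIMARY KEY,
    name      TEXT NOT NULL        -- each actor/director is stored once
);

CREATE TABLE credits (
    id             TEXT REFERENCES titles (id),
    person_id      INT  REFERENCES persons (person_id),
    role           TEXT,           -- e.g. 'ACTOR' or 'DIRECTOR'
    character_name TEXT,
    PRIMARY KEY (id, person_id, role)
);

-- Index to speed up lookups of a given person's credits.
CREATE INDEX idx_credits_person ON credits (person_id);
```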
- We performed Exploratory Data Analysis (EDA) to uncover trends in popular genres, actor appearances, and the distribution of shows across different ratings.
- The normalized tables made it easy to run complex queries on specific data points, providing deeper insights into Netflix's vast content library.
Some of the SQL queries we explored:
- Most Frequent Actors: Identify which actors appear most often in Netflix shows (an example query is sketched after this list).
- Genre Popularity: Analyze which genres dominate Netflix's catalog.
- Rating Distributions: Understand how shows and movies are rated across various regions.
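For example, the most-frequent-actors query could look like the sketch below, which relies on the hypothetical `persons`/`credits` schema shown earlier rather than the project's actual tables:

```sql
-- Sketch: which actors appear most often (assumes the hypothetical schema above).
SELECT p.name,
       COUNT(*) AS appearances
FROM credits AS c
JOIN persons AS p ON p.person_id = c.person_id
WHERE c.role = 'ACTOR'
GROUP BY p.name
ORDER BY appearances DESC
LIMIT 10;
```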
We used tools like Tableau and Power BI to visualize the findings and to illustrate how the data looks post-normalization.
Normalization is crucial for:
- Ensuring data consistency across related tables.
- Eliminating redundancy, so each piece of data is stored only once (illustrated below).
- Making your database scalable, easier to manage, and more flexible for future changes.
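To make the redundancy point concrete: once names live in a single `persons` table (as in the hypothetical schema above), correcting a misspelled actor name is a one-row update instead of a change to every credit row:

```sql
-- With names stored once in persons, the fix touches a single row;
-- in a flat, denormalized table it would touch every credit for that actor.
UPDATE persons
SET name = 'Jane Doe'      -- corrected spelling (hypothetical value)
WHERE person_id = 42;      -- hypothetical person_id
```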
- SQL (PostgreSQL, MySQL)
- Python (for additional data analysis)
- Tableau/Power BI (for visualizations)
- Clone the repository to your local environment:
```bash
git clone https://github.com/mayankyadav23/Netflix-Data-Analysis.git
```