The Big Chill

A project that applies machine learning (ML) to Netflix TV shows and movies

APPLICATION PURPOSE: Do you have a TV show or movie about to premiere on Netflix?

The description of your TV show or movie is an important part of the information people see as they scroll through the many options before they sit back to “Netflix & chill”.
Enter a TV show or movie description, and the application predicts the rating and OMDB genre that the description suggests.

This application can help you optimize the description you supply to Netflix so that viewers infer the intended genre and rating of your movie or TV show.

The underlying database also provides a way to explore and analyze Netflix shows and movies by genre and rating to find something to watch. The data is processed with ML to identify similar content by matching text-based features.

Table of Contents:

Application Diagram

Machine Learning

Flask

Heroku Deployment

Other Data Analysis

Extract Transform Load (ETL)

Data Sources

Team Members

Technologies Used

Application Diagram

Machine Learning

ML library used: OneVsRestClassifier (sklearn.multiclass)
The TV show or movie description is the input, and the model predicts one of:

28 OMDB genres (the most common genre is Drama, which skews our predictions toward that class)

5 ratings (G, PG, PG-13, R, NC-17), after grouping the TV ratings:

Grouped to G: 'TV-Y7', 'TV-G', 'TV-Y', 'TV-Y7-FV'

Grouped to R: 'TV-MA' and 'R'
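The grouping above can be sketched as a simple lookup table. This is a minimal illustration, not the project's code: only the two groups listed above are mapped, and the choice to pass any other rating through unchanged is an assumption. The names RATING_GROUPS and group_rating are hypothetical.

```python
# Only the groupings stated in this README are included here.
RATING_GROUPS = {
    "TV-Y7": "G",
    "TV-G": "G",
    "TV-Y": "G",
    "TV-Y7-FV": "G",
    "TV-MA": "R",
    "R": "R",
}


def group_rating(raw_rating: str) -> str:
    """Map a raw rating to its grouped rating.

    Ratings that are not in the table are returned unchanged
    (an assumption made for this sketch).
    """
    rating = raw_rating.strip()
    return RATING_GROUPS.get(rating, rating)


grouped = [group_rating(r) for r in ["TV-MA", "TV-Y", "PG-13"]]
```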

Common Description Words: Finding the right words to train the model on is key.

Picking informative words like "young" gives better results than common words like "the".

The nltk.corpus library was used to remove standard stopwords (top words removed: a, the, to, and).
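As a rough illustration of why stopword removal matters, the sketch below counts the most common words in a few descriptions after filtering. It uses only the standard library; the small STOPWORDS set is a hypothetical stand-in for the nltk.corpus stopword list, and the sample descriptions are made up.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-in for the nltk.corpus English stopword list.
STOPWORDS = {"a", "the", "to", "and", "of", "in", "is"}


def top_words(descriptions, n=3):
    """Return the n most common non-stopword words across descriptions."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


samples = [
    "A young woman returns to the city",
    "The young detective and the city",
]
common = top_words(samples, n=2)
```

Without the filter, "the" would dominate the counts even though it says nothing about genre or rating.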

Trained the genre model on 99% of the data and the ratings model on 88% of the data; the training percentages are high because of the limited number of records
Saved the model with Pickle so that it can be referenced through Flask
Predictive accuracy for genre = 43.5%
Predictive accuracy for ratings = 35.2%
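A minimal sketch of this kind of pipeline follows. Only OneVsRestClassifier and Pickle come from this README; the TfidfVectorizer features, the LogisticRegression base estimator, scikit-learn's built-in English stopword list (used here instead of nltk), and the toy data are all assumptions made for illustration. The model is pickled to bytes in memory rather than to a .pkl file to keep the example self-contained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real project trains on Netflix descriptions.
descriptions = [
    "A young detective hunts a killer through the city",
    "Two friends open a bakery and chaos follows",
    "A family fights to survive a haunted house",
    "A retired cop returns for one last murder case",
    "A clumsy teacher stumbles into a talent show",
    "A ghost stalks campers in the dark woods",
]
genres = ["Crime", "Comedy", "Horror", "Crime", "Comedy", "Horror"]

# Text features feed a one-vs-rest classifier (one binary model per genre).
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(descriptions, genres)

# Round-trip through pickle, as a saved .pkl file would be loaded later.
restored = pickle.loads(pickle.dumps(model))
predicted = restored.predict(["A detective investigates a murder in the city"])[0]
```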

Flask

Saved the machine learning (ML) model and referenced it in Flask using Pickle (.pkl)
Used Pandas to merge database tables as needed when pulling data from the PostgreSQL database
Used Firefox while developing the API that renders the data to the HTML
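One way a Flask route could expose a saved model is sketched below. The /predict route, the JSON field names, and the KeywordModel class are hypothetical; KeywordModel stands in for the model that the real app would load from its .pkl file with pickle.load.

```python
from flask import Flask, jsonify, request


class KeywordModel:
    """Hypothetical stand-in for the unpickled scikit-learn model."""

    def predict(self, texts):
        return ["Horror" if "ghost" in t.lower() else "Drama" for t in texts]


# In the real app this object would come from pickle.load on the .pkl file.
model = KeywordModel()

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    """Return the predicted genre for a posted description."""
    payload = request.get_json(silent=True) or {}
    description = payload.get("description", "").strip()
    if not description:
        return jsonify(error="description is required"), 400
    return jsonify(genre=model.predict([description])[0])
```

Loading the model once at import time, rather than inside the route, avoids unpickling it on every request.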

Heroku Deployment

Use a requirements text file to load only the Python libraries used in the app (i.e., make sure Pickle is included)
The Heroku database was limited to 10,000 records, so AWS RDS was used as the database server
Add config variables for the user name and password in Heroku, and create variables in the Flask app.py that reference them, to keep sensitive information masked
Deploy from the GitHub master branch
Include a 'Procfile' in the repository to connect to the Heroku app (web: gunicorn app:app)
Deployed site address: https://the-big-chill.herokuapp.com/
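The config-variable step can be sketched as follows. The variable names (DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME) and the defaults are assumptions for illustration; the actual names set in Heroku are not given here.

```python
import os
from urllib.parse import quote_plus


def database_url(env=os.environ):
    """Build a PostgreSQL URL from environment (config) variables.

    Credentials are read from the environment so they never appear in
    app.py or in the repository. All variable names are hypothetical.
    """
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "postgres")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Example with an explicit mapping instead of the real environment.
example_url = database_url({"DB_USER": "chill", "DB_PASSWORD": "p@ss word"})
```

Passing the password through quote_plus keeps characters such as "@" from breaking the URL.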

Other Data Analysis

Data Extract Transform Load (ETL)

ETL Summary

Extract - We used two datasets: a Netflix Movies and TV Shows CSV file from Kaggle and the OMDB API (OMDBAPI.com).
Transform - We pulled titles, ratings, genres, etc. from the OMDB API and used the Netflix CSV file to denote whether each title is on Netflix.
Load - We created the database in PostgreSQL.

Extract

From the OMDB source, we replaced the "NA" string with a blank
If a title wasn't a movie, the API response didn't have a boxoffice field, so we added an if statement in the code to account for that
Each API key allowed only 1,000 pulls per day and we had over 6,000 pulls, so we kept multiple API keys in a list and looped through them to avoid exceeding 1,000 per key.
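The extract steps above can be sketched without calling the API. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the field name "BoxOffice" and the "N/A" placeholder are assumed spellings (the text above writes "boxoffice" and "NA", so both placeholder spellings are handled), and the keys are dummy strings.

```python
PLACEHOLDERS = {"N/A", "NA"}  # assumed spellings of the missing-value marker


def clean_record(raw):
    """Blank out placeholder strings and tolerate a missing BoxOffice field."""
    cleaned = {k: ("" if v in PLACEHOLDERS else v) for k, v in raw.items()}
    # Non-movie titles may not include this field at all.
    if "BoxOffice" not in cleaned:
        cleaned["BoxOffice"] = ""
    return cleaned


def assign_keys(titles, api_keys, daily_limit=1000):
    """Pair each title with an API key so no key exceeds daily_limit pulls."""
    if len(titles) > len(api_keys) * daily_limit:
        raise ValueError("not enough API keys for this many pulls")
    return [(title, api_keys[i // daily_limit]) for i, title in enumerate(titles)]


plan = assign_keys([f"title {i}" for i in range(2500)], ["k1", "k2", "k3"])
record = clean_record({"Title": "Example Show", "Rated": "N/A"})
```

With over 6,000 pulls and a limit of 1,000 per key, a plan like this needs at least seven keys.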

Transform

The API pull Jupyter notebook exported a CSV. Transform_Load.ipynb read the two CSV files and merged them. We dropped the "listed_in", "genre", and "language" columns, which were not one-to-one with the title, and created separate tables for them, each with a junction table.
The "imdbVotes" column was stripped of commas and converted to an integer
The data was read from the CSVs into a dataframe with pandas. Pandas did not recognize the comma-delimited strings as lists, so we used the str.split(',').tolist() function on the "genre", "listed_in", and "language" columns. We did not do this for the "country", "director", and "cast" columns since we were not splitting them into extra tables.
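The two string conversions described above are sketched here in plain Python for a single value each; the notebook applies the equivalent operations to whole pandas columns. The function names are hypothetical, and returning None for an empty vote count is an assumption.

```python
def parse_votes(text):
    """Convert a string such as '1,234,567' to an int; empty input gives None."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits else None


def split_list(text):
    """Split a comma-delimited string into a list of trimmed values."""
    return [part.strip() for part in text.split(",") if part.strip()]


votes = parse_votes("1,234,567")
genre_list = split_list("Drama, Comedy, Romance")
```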

We removed the NaN values before looping through the "genre", "listed_in", and "language" columns to collect the unique values into a set. A set was used because we wanted only unique values for each category and did not need them in any particular order.

Used the pandas function .explode() to break out each show_id into one row per genre, listed_in, or language value.
The exploded tables were then merged with the lookup tables to map each show_id to its OMDB_genre_no, listed_in, and language_no values; these mappings serve as the junction tables.
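The explode-and-merge steps can be sketched with a tiny dataframe. The sample rows are made up, and the exact table layout is an assumption; only the column names show_id, genre, and OMDB_genre_no come from the description above.

```python
import pandas as pd

# Hypothetical sample: "genre" already holds lists (or NaN when missing).
titles = pd.DataFrame(
    {
        "show_id": [1, 2, 3],
        "genre": [["Drama", "Comedy"], ["Drama"], None],
    }
)

# Drop missing values, then create one row per (show_id, genre) pair.
exploded = titles.dropna(subset=["genre"]).explode("genre")

# Unique genres via a set; sorted only to make the numbering repeatable.
unique_genres = sorted(set(exploded["genre"]))
genre_table = pd.DataFrame(
    {
        "OMDB_genre_no": range(1, len(unique_genres) + 1),
        "genre": unique_genres,
    }
)

# Junction table: each show_id paired with the number of each of its genres.
junction = exploded.merge(genre_table, on="genre")[["show_id", "OMDB_genre_no"]]
```

The same pattern would apply to the "listed_in" and "language" columns.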

Load

The database schema is shown below in the image. We used the main merged "Title" table along with the OMDB_genre and netflix_genre tables and their junction tables.

The database and tables were initially set up in PostgreSQL, a relational database, because our data was consistent and relating the tables made it easier to query.
The Transform_Load.ipynb file loads the table data into the PostgreSQL tables
PostgreSQL was then connected to AWS RDS (for deployment beyond localhost).
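To illustrate how a junction table relates titles to genres, here is a small sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table names, column names, and rows are hypothetical simplifications of the schema described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for PostgreSQL
conn.executescript(
    """
    CREATE TABLE title (
        show_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL
    );
    CREATE TABLE omdb_genre (
        omdb_genre_no INTEGER PRIMARY KEY,
        genre         TEXT UNIQUE NOT NULL
    );
    CREATE TABLE title_omdb_genre (
        show_id       INTEGER REFERENCES title(show_id),
        omdb_genre_no INTEGER REFERENCES omdb_genre(omdb_genre_no),
        PRIMARY KEY (show_id, omdb_genre_no)
    );
    """
)
conn.executemany(
    "INSERT INTO title VALUES (?, ?)",
    [(1, "Example Show A"), (2, "Example Movie B")],
)
conn.executemany("INSERT INTO omdb_genre VALUES (?, ?)", [(1, "Comedy"), (2, "Drama")])
conn.executemany("INSERT INTO title_omdb_genre VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)])


def titles_in_genre(genre):
    """Return titles tagged with the given genre, in alphabetical order."""
    rows = conn.execute(
        """
        SELECT t.title
        FROM title AS t
        JOIN title_omdb_genre AS j ON j.show_id = t.show_id
        JOIN omdb_genre AS g ON g.omdb_genre_no = j.omdb_genre_no
        WHERE g.genre = ?
        ORDER BY t.title
        """,
        (genre,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because a title can have several genres and a genre many titles, the junction table holds one row per pairing instead of a comma-delimited string on the title row.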

Data Sources

Database

Public Domain Netflix Movies and TV Shows CSV file from Kaggle: https://www.kaggle.com/shivamb/netflix-shows (this dataset consists of TV shows and movies available on Netflix as of 2019)

OMDB API http://www.omdbapi.com/

Other

Netflix Research data: https://www.businessofapps.com/data/netflix-statistics/

Team Members

April Lagnevall alagnevall - Responsibility: HTML/CSS build, Genre predict ML model, js and D3 build to enable ML model in app

Haifa Najdawi HaifaNajdawi - Responsibility: Data visualizations & analysis, Machine Learning, API build, AWS RDS set up, and Heroku Deployment

Julia Headlee julieheadlee - Responsibility: Flask, Ratings predict ML model, js and D3 build to enable ML model in app

Melanie Nolker mnolker - Responsibility: Database ETL & project documentation

Technologies Used

flask

flask_cors

matplotlib

models

nltk

numpy

wordcloud

os

pandas

pickle

pprint

psycopg2

re

seaborn

sklearn

sqlalchemy

tqdm

typing

bootstrap

jquery

js.d3
