RedCarpetUp internship application
First step (for a batch of movies):
Primary dataset: download the primary data from
Feature generation:
- Casts:
- Awards_types (dataset AW):
- Actors (dataset A):
- Movies (dataset M):
First exercise (time: 12 hours):
- Load the primary dataset into pandas.
- Scrape the data from the secondary links and load it into pandas. Any method is fine, but Beautiful Soup is recommended.
- Use Levenshtein distance to match movie names in primary dataset with movies provided in dataset M. (Recommendations:
- Persist the Levenshtein distance scores between movies in the primary dataset and movies in dataset M, and share them as a CSV file.
- Assume that the pair of movies with the lowest Levenshtein distance (that is, the closest title match) is the same movie, and use that to merge the primary dataset with dataset M.
- Using the merged data, use datasets AW and A to create additional features.
- After this exercise, you should have multiple features for each movie. Share the processed data in CSV format.
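The matching steps above can be sketched roughly as follows. This is a minimal illustration and not a required implementation: the titles are made up, the function names (`levenshtein`, `score_pairs`, `best_matches`) are hypothetical, and titles are lower-cased and stripped before comparison as an assumption. It uses only the standard library; the resulting mapping could then drive a pandas merge.

```python
import csv
import io


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def score_pairs(primary_titles, m_titles):
    """Return (primary, candidate, distance) for every pair of titles."""
    return [(p, m, levenshtein(p.lower().strip(), m.lower().strip()))
            for p in primary_titles for m in m_titles]


def best_matches(scores):
    """For each primary title keep the candidate with the LOWEST distance."""
    best = {}
    for p, m, d in scores:
        if p not in best or d < best[p][1]:
            best[p] = (m, d)
    return best


# Hypothetical titles, for illustration only.
primary = ["The Godfather", "Se7en"]
dataset_m = ["Godfather, The", "The Godfather", "Seven"]

scores = score_pairs(primary, dataset_m)
matches = best_matches(scores)

# Persist every pairwise score as CSV. A string buffer is used here; pass a
# file opened with newline="" instead to write the scores to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["primary_title", "m_title", "distance"])
writer.writerows(scores)
```

Scoring every pair is quadratic in the number of titles, which is fine for a small batch; for larger datasets a dedicated Levenshtein library would likely be faster.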
Second exercise (time: 12 hours):
For modelling, divide the data into at least three samples:
- Training
- Testing
- Out-of-time testing: this dataset must contain all 2016 releases. DO NOT INCLUDE 2016 releases in the previous two datasets. Feel free to experiment with the distribution of the training and testing datasets.
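One way to produce the three samples is sketched below, assuming the processed data has a release-year column. The column names (`year`, `rating`), the toy rows, and the 70/30 ratio are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical processed data; "year" and "rating" are assumed column names.
movies = pd.DataFrame({
    "title": [f"movie_{i}" for i in range(10)],
    "year": [2012, 2013, 2013, 2014, 2014, 2015, 2015, 2016, 2016, 2016],
    "rating": [6.1, 7.4, 5.2, 8.0, 6.6, 7.1, 4.9, 6.8, 7.7, 5.5],
})

# Out-of-time sample: every 2016 release, and nothing else.
is_2016 = movies["year"] == 2016
out_of_time = movies[is_2016]
in_time = movies[~is_2016]

# Random split of the remaining rows; the 70/30 ratio is free to vary.
train = in_time.sample(frac=0.7, random_state=42)
test = in_time.drop(train.index)
```

Filtering on the year before the random split guarantees that no 2016 release can leak into the training or testing samples.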
Models to be implemented:
- SVM for multiclass prediction
- LARS Lasso
Model comparison metrics to be generated for each of your models:
- AUC (Multiclass)
- R2 (LARS)
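The two models and their metrics could be wired together along these lines with scikit-learn. This is only a sketch under stated assumptions: the features are synthetic stand-ins, the multiclass target is obtained by bucketing the continuous rating into three tercile classes, and the hyperparameters (`kernel="rbf"`, `alpha=0.1`) are arbitrary starting points rather than tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLars
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the processed movie features (X) and ratings (y).
X, y = make_regression(n_samples=300, n_features=8, n_informative=5,
                       noise=10.0, random_state=0)

# Assumption: the multiclass target is the rating bucketed into terciles,
# giving class labels 0, 1 and 2.
y_class = np.digitize(y, np.quantile(y, [1 / 3, 2 / 3]))

X_train, X_test, y_train, y_test, c_train, c_test = train_test_split(
    X, y, y_class, test_size=0.3, random_state=0, stratify=y_class)

# SVM for multiclass prediction; probability=True enables predict_proba,
# which the multiclass AUC needs (per-row scores that sum to 1).
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", probability=True, random_state=0))
svm.fit(X_train, c_train)
proba = svm.predict_proba(X_test)
auc = roc_auc_score(c_test, proba, multi_class="ovr", average="macro")

# LARS Lasso regression on the continuous rating, scored with R2.
lars = make_pipeline(StandardScaler(), LassoLars(alpha=0.1))
lars.fit(X_train, y_train)
r2 = r2_score(y_test, lars.predict(X_test))

print(f"SVM macro one-vs-rest AUC: {auc:.3f}")
print(f"LARS Lasso R2: {r2:.3f}")
```

The same two metrics would also be computed on the out-of-time sample, so that each model is compared on training-period and 2016 data alike.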
Share a Jupyter notebook that reads in the .csv files and contains all modelling code. Try to optimize the models as much as possible in the given time frame.
Brownie points:
Create a function that runs the prediction for a single movie. Something like this: def fun_name(movie_name): ......
Calling fun_name(movie_name) should predict the rating of that movie.
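A possible shape for such a function is sketched below. Everything here is hypothetical: the function is called `predict_rating` in place of `fun_name`, the feature columns and ratings are made-up values, and the model is a LARS Lasso fitted on four toy rows. The lookup is an exact, case-insensitive title match; the Levenshtein matching from the first exercise could be swapped in.

```python
import pandas as pd
from sklearn.linear_model import LassoLars

# Hypothetical processed features, indexed by movie title (made-up values).
features = pd.DataFrame(
    {"n_cast_awards": [12.0, 3.0, 7.0, 0.0],
     "n_lead_actors": [3.0, 2.0, 4.0, 1.0]},
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)
ratings = pd.Series([8.9, 6.4, 7.8, 4.1], index=features.index)

# Stand-in for the model trained in the second exercise.
model = LassoLars(alpha=0.01).fit(features, ratings)


def predict_rating(movie_name: str) -> float:
    """Predict the rating of one movie from its stored feature row."""
    key = movie_name.strip().lower()
    lookup = {title.lower(): title for title in features.index}
    if key not in lookup:
        raise KeyError(f"no features available for {movie_name!r}")
    row = features.loc[[lookup[key]]]  # double brackets keep a 2-D frame
    return float(model.predict(row)[0])


print(predict_rating("movie a"))
```

Raising `KeyError` for an unknown title keeps the failure explicit instead of returning a rating for the wrong movie.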