This repository dives deep into the world of The Simpsons. We use the data sets available on the data science platform Kaggle and in the #tidytuesday GitHub repository, which contain a variety of information on the show, including script lines, IMDb ratings, TV viewership, guest star appearances, and much more.
The repository contains a report illustrating exploratory and text analyses of the resources above. In particular,
- Exploratory data analysis: this allows us to discover, among other things, the most popular characters, the locations where they usually interact, how the ratings and viewership have evolved across almost 30 seasons, and whether guest star appearances have had any impact on the show's ratings.
- Text analysis: the availability of the script lines allows us to plunge into The Simpsons' world and investigate the most recurrent and most distinctive (as measured by tf-idf) words and bigrams, as well as the underlying sentiments. Through a Latent Dirichlet Allocation (LDA) analysis, we can uncover the main topics of the show, such as family, school, social life, and work relationships.
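For readers unfamiliar with tf-idf, the idea behind "distinctive" words can be sketched in a few lines. This is only an illustration in plain Python, not the R code used in the report; the function name `tf_idf`, the speaker labels, and the toy sentences are all made up for the example, and the weighting shown (term frequency times the natural log of inverse document frequency) is one common definition that the report may or may not follow exactly.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf scores.

    docs: dict mapping a document name (e.g. a speaker) to a list of tokens.
    Returns a dict: name -> {word: score}.
    """
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for c in counts.values() for word in c)
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            word: (n / total) * math.log(n_docs / df[word])
            for word, n in c.items()
        }
    return scores

# Toy, invented token lists (hypothetical, not taken from the data sets)
docs = {
    "speaker_a": "the donut is the best donut".split(),
    "speaker_b": "the saxophone is the best".split(),
}
scores = tf_idf(docs)
for name, words in scores.items():
    top = max(words, key=words.get)
    print(name, top, round(words[top], 3))
```

Words shared by every document (here "the", "is", "best") get a score of zero, so only the words specific to one speaker stand out.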
The analysis code is contained in the R Markdown report exploring-simpsons.Rmd.
If you just want to have a look at the rendered notebook, please refer to exploring-simpsons.html.
The appearance of the HTML report can be changed by editing the custom.css file, which is itself a modified version of the readthedown format.
All the data sets employed for the analyses are collected for convenience in the Data folder.
Suggestions and feedback are welcome!
Link to notebook on Kaggle: https://www.kaggle.com/elenageminiani/exploring-the-simpsons-show.