This analysis uses PySpark to run an ETL process on one of the Amazon review datasets, which includes reviews written by members of the paid Amazon Vine program. I extracted and transformed the data, connected to an AWS RDS instance, loaded the transformed data into PostgreSQL (managed through pgAdmin), and then used PySpark to determine whether Vine reviews in the dataset are biased toward favorable ratings.
-
Software:
- Google Colaboratory (Google Colab Notebook)
- PySpark
- Amazon Web Services (AWS)
- PostgreSQL 12
- pgAdmin 4
-
Data source:
- Amazon Review datasets
-
The total number of Vine and non-Vine reviews
- The dataset contains 18,155 reviews in total, Vine and non-Vine combined.
- Approximately 1% (136 reviews) are from Vine members.
- Approximately 99% (18,019 reviews) are from non-Vine members.
-
The number of 5-star reviews for Vine and non-Vine reviews
- 74 of the 136 Vine reviews are 5-star reviews.
- 8,482 of the 18,019 non-Vine reviews are 5-star reviews.
-
The percentage of 5-star reviews for Vine and non-Vine reviews
- Approximately 54% of Vine reviews are 5-star reviews.
- Approximately 47% of non-Vine reviews are 5-star reviews.
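The percentages above follow directly from the reported counts. As a quick check, here is a minimal sketch in plain Python (not the PySpark code used in the analysis); the variable names and the `percent` helper are illustrative assumptions, and only the four counts come from the results above.

```python
# Counts taken from the results above.
vine_total, vine_five_star = 136, 74
non_vine_total, non_vine_five_star = 18_019, 8_482


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Total number of reviews across both groups.
all_reviews = vine_total + non_vine_total

# Share of all reviews that are Vine reviews.
vine_share = percent(vine_total, all_reviews)

# Share of 5-star reviews within each group.
vine_five_star_pct = percent(vine_five_star, vine_total)
non_vine_five_star_pct = percent(non_vine_five_star, non_vine_total)

print(f"Total reviews: {all_reviews:,}")
print(f"Vine share of all reviews: {vine_share:.2f}%")
print(f"5-star share, Vine: {vine_five_star_pct:.1f}%")
print(f"5-star share, non-Vine: {non_vine_five_star_pct:.1f}%")
```

Rounded to whole percentages, these values match the figures reported in the lists above.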
From these results, we could conclude that there is a positivity bias in Vine program reviews for the furniture category: about 54% of Vine reviews are 5-star, compared with about 47% of non-Vine reviews. However, the Amazon Review datasets include more than 50 datasets across different categories, which we could use to test further whether Vine members are biased toward favorable reviews.
- Additional analyses that could support this conclusion:
- Repeat the analysis on other categories from the Amazon Review datasets.
- Compute more summary statistics of the star rating, such as the mean, median, and mode.
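The summary statistics mentioned above could be sketched as follows. This is a minimal illustration using Python's standard `statistics` module, not the PySpark code from the analysis; the `sample_ratings` list is made-up data for demonstration only and does not come from the dataset.

```python
from statistics import mean, median, mode

# Hypothetical star ratings, made up for illustration only (not from the dataset).
sample_ratings = [5, 4, 5, 3, 5, 1, 4]

# Compute the three summary statistics of the star rating.
summary = {
    "mean": mean(sample_ratings),
    "median": median(sample_ratings),
    "mode": mode(sample_ratings),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On the real data, the same three statistics would be computed separately for the Vine and non-Vine groups so that the two rating distributions can be compared, not just their 5-star shares.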