Analysis of the Amazon Vine review program using PySpark and AWS RDS with PostgreSQL
The Amazon Vine program is a service that allows manufacturers and publishers to receive reviews for their products. This project analyzes the Amazon Vine program to determine whether there is a bias toward favorable reviews from Vine members.
The analysis uses PySpark to perform the ETL (extract, transform, and load) process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into a PostgreSQL database (managed through pgAdmin).
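The load step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `build_jdbc_config` and `load_tables`, the endpoint, and the database name are hypothetical. It assumes the PostgreSQL JDBC driver is available to the Spark session and that each table is already a PySpark DataFrame.

```python
def build_jdbc_config(endpoint, database, user, password, port=5432):
    """Return the JDBC URL and connection properties for a PostgreSQL instance.

    All argument values are placeholders supplied by the caller; nothing here
    is taken from the project's real RDS configuration.
    """
    url = f"jdbc:postgresql://{endpoint}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return url, properties


def load_tables(tables, url, properties, mode="append"):
    """Write each DataFrame in `tables` (table name -> DataFrame) to PostgreSQL.

    Uses PySpark's DataFrameWriter.jdbc; `mode="append"` adds rows to an
    existing table instead of replacing it.
    """
    for table_name, df in tables.items():
        df.write.jdbc(url=url, table=table_name, mode=mode, properties=properties)


# Hypothetical endpoint and database name, for illustration only.
url, properties = build_jdbc_config(
    "example-instance.rds.amazonaws.com", "vine_db", "postgres", "secret"
)
```

With real DataFrames in hand, a call such as `load_tables({"vine_table": vine_df}, url, properties)` would write the table, where `vine_df` is an assumed variable name.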
After the ETL process, analysis was done to answer the following questions:
- How many Vine reviews and non-Vine reviews were there?
- How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?
- What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?
- Data Source: From Amazon Review Datasets, the Books_v1_02.tsv.gz dataset was chosen.
- Software: Google Colab Notebook, PostgreSQL 11.9, pgAdmin 4, AWS
The dataframes were created with PySpark in a Google Colab notebook and then loaded into AWS RDS over a connection from PySpark to PostgreSQL. See Amazon_Reviews_ETL.pynb for complete code.
Table1: customers_table
Table2: products_table
Table3: review_id_table
Table4: vine_table
Using the extracted vine_table, an analysis was performed on the Amazon Vine program to determine if there is a bias toward favorable reviews from Vine members. See Vine_Review_Analysis.pynb for complete code.
- Paid Reviews Dataframe
- Unpaid Reviews Dataframe
The analysis first kept only reviews with a total_votes count of 20 or more, then kept only those where helpful_votes divided by total_votes is 50% or more. The resulting dataframe was then split into the Paid and Unpaid Reviews dataframes.
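The two filters and the paid/unpaid split can be illustrated in plain Python. This is a sketch of the logic only, not the PySpark code from the notebook. It assumes each review is a dict with `total_votes`, `helpful_votes`, and a `vine` field holding `"Y"` or `"N"`; the function names and sample rows are made up for the example.

```python
def filter_helpful_reviews(rows, min_votes=20, min_helpful_ratio=0.5):
    """Keep reviews with enough votes and a high enough helpful-vote ratio."""
    kept = []
    for row in rows:
        total = row["total_votes"]
        # Check the vote count first so the division never sees a zero total.
        if total > 0 and total >= min_votes:
            if row["helpful_votes"] / total >= min_helpful_ratio:
                kept.append(row)
    return kept


def split_paid_unpaid(rows):
    """Split reviews into paid (Vine) and unpaid (non-Vine) lists."""
    paid = [r for r in rows if r["vine"] == "Y"]      # assumed flag value
    unpaid = [r for r in rows if r["vine"] == "N"]    # assumed flag value
    return paid, unpaid


# Made-up sample rows, not taken from the dataset.
sample = [
    {"review_id": "A", "total_votes": 25, "helpful_votes": 20, "vine": "N"},
    {"review_id": "B", "total_votes": 10, "helpful_votes": 10, "vine": "N"},
    {"review_id": "C", "total_votes": 40, "helpful_votes": 10, "vine": "Y"},
    {"review_id": "D", "total_votes": 30, "helpful_votes": 15, "vine": "Y"},
]

helpful = filter_helpful_reviews(sample)
paid, unpaid = split_paid_unpaid(helpful)
```

Review B is dropped for having fewer than 20 votes and review C for a helpful ratio below 50%, leaving A as unpaid and D as paid.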
The filtered data contains 0 Paid Reviews and 403,807 Unpaid Reviews. Of the Unpaid Reviews, 242,889 are 5-star reviews, which is 60.15% of Unpaid Reviews.
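The percentage can be checked with a few lines of Python. The helper name `five_star_percentage` is hypothetical; the counts are the ones reported above. Returning `None` for an empty group is one way to handle the Paid Reviews case, where there is nothing to divide by.

```python
def five_star_percentage(five_star_count, total_count):
    """Return the share of 5-star reviews as a percentage, or None if empty."""
    if total_count == 0:
        # No reviews in this group, so a percentage is undefined.
        return None
    return round(100 * five_star_count / total_count, 2)


unpaid_pct = five_star_percentage(242_889, 403_807)  # Unpaid Reviews
paid_pct = five_star_percentage(0, 0)                # Paid Reviews (none)

print(unpaid_pct)  # 60.15
print(paid_pct)    # None
```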
These numbers do not suggest a strong bias toward five-star reviews from paid Amazon Vine reviewers, although with 0 Paid Reviews in the filtered data there is no Vine percentage to compare against. One possible explanation is that Vine customers are more critical when submitting their reviews. Further analysis should include all of the data rather than only the reviews that pass the helpful-to-total votes filter used here.