Analysis on Amazon's vine review program using PySpark and AWS RDS with PostgreSQL

ramya-ramamur/Amazon_Vine_Analysis

Amazon_Vine_Analysis

Overview

The Amazon Vine program is a service that allows manufacturers and publishers to receive reviews for their products. This project analyzes the Amazon Vine program to determine whether there is a bias toward favorable reviews from Vine members.

The analysis uses PySpark for the ETL (extract, transform, and load) process: it extracts the dataset, transforms the data, connects to an AWS RDS instance, and loads the transformed data into a PostgreSQL database (accessed through pgAdmin).

After the ETL process, analysis was done to answer the following questions:

  • How many Vine reviews and non-Vine reviews were there?
  • How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?
  • What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?
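The three questions above reduce to grouped counts and a ratio. A minimal sketch in plain Python, using made-up sample rows rather than the project's data; the column names `vine` (with values "Y" and "N") and `star_rating` are assumptions, and the notebook itself works on PySpark dataframes:

```python
from collections import Counter

# Hypothetical sample rows, not taken from the dataset.
# "vine" ("Y"/"N") and "star_rating" are assumed column names.
sample_reviews = [
    {"vine": "Y", "star_rating": 5},
    {"vine": "Y", "star_rating": 3},
    {"vine": "N", "star_rating": 5},
    {"vine": "N", "star_rating": 5},
    {"vine": "N", "star_rating": 2},
]


def summarize(reviews):
    """Return per-group totals, 5-star counts, and 5-star percentages."""
    totals = Counter(r["vine"] for r in reviews)
    five_star = Counter(r["vine"] for r in reviews if r["star_rating"] == 5)
    summary = {}
    for group in ("Y", "N"):
        total = totals[group]
        fives = five_star[group]
        # The percentage is undefined for an empty group, so report None.
        pct = round(100 * fives / total, 2) if total else None
        summary[group] = {"total": total, "five_star": fives, "pct_five_star": pct}
    return summary


summary = summarize(sample_reviews)
print(summary)
```

Reporting `None` for an empty group keeps a missing group distinguishable from a group with no 5-star reviews.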

Resources

Results

ETL (extract, transform, and load)

The dataframes were created in Google Colab, where PySpark was run, and then loaded into AWS RDS over a connection from PySpark to PostgreSQL. See Amazon_Reviews_ETL.pynb for complete code.
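As a rough illustration of the load step, the sketch below builds a JDBC URL and writes each dataframe to a table of the same name through PySpark's `DataFrameWriter.jdbc`. The helper names `jdbc_settings` and `load_tables`, the placeholder endpoint and credentials, and the `append` mode are assumptions for illustration and may differ from the notebook. It also assumes the Spark session was started with the PostgreSQL JDBC driver available.

```python
def jdbc_settings(endpoint, database, user, password, port=5432):
    """Build the JDBC URL and connection properties for a PostgreSQL instance."""
    url = f"jdbc:postgresql://{endpoint}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return url, properties


def load_tables(dataframes, url, properties, mode="append"):
    """Write each PySpark DataFrame to the table named by its dict key."""
    for table_name, df in dataframes.items():
        # DataFrameWriter.jdbc sends the rows to the target table over JDBC.
        df.write.jdbc(url=url, table=table_name, mode=mode, properties=properties)


# Placeholder values (hypothetical); replace with the real RDS endpoint and credentials.
url, properties = jdbc_settings(
    endpoint="<rds-endpoint>",
    database="<database-name>",
    user="<user>",
    password="<password>",
)
print(url)
```

With the four dataframes in hand, a call could look like `load_tables({"customers_table": customers_df, "vine_table": vine_df}, url, properties)`, where the dataframe variable names are again hypothetical.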

Table 1: customers_table

[Screenshot: customers_table]

Table 2: products_table

[Screenshot: products_table]

Table 3: review_id_table

[Screenshot: review_id_table]

Table 4: vine_table

[Screenshot: vine_table]

Analysis with vine_table

Using the extracted vine_table, an analysis was performed on the Amazon Vine program to determine if there is a bias toward favorable reviews from Vine members. See Vine_Review_Analysis.pynb for complete code.

  • Paid Reviews Dataframe

[Screenshot: Paid Reviews dataframe]

  • Unpaid Reviews Dataframe

[Screenshot: Unpaid Reviews dataframe]

Total Reviews: Paid vs Unpaid

[Screenshot: total review counts, Paid vs Unpaid]

5-star Reviews: Paid vs Unpaid

[Screenshot: 5-star review counts, Paid vs Unpaid]

Percentage of 5-star Reviews: Paid vs Unpaid

[Screenshot: percentage of 5-star reviews, Paid vs Unpaid]

Summary

The analysis first kept only reviews with a total_votes count of 20 or more, then kept those where helpful_votes divided by total_votes was 50% or more. The resulting dataframe was then split into Paid and Unpaid Reviews dataframes.
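The two filters and the Paid/Unpaid split can be sketched in plain Python as follows. The rows are invented for illustration, and the `vine` column with values "Y" (paid) and "N" (unpaid) is an assumption; the notebook applies the same conditions to PySpark dataframes.

```python
# Hypothetical rows for illustration; "vine" ("Y"/"N") is an assumed column name.
rows = [
    {"review_id": "R1", "total_votes": 25, "helpful_votes": 20, "vine": "N"},
    {"review_id": "R2", "total_votes": 40, "helpful_votes": 10, "vine": "N"},  # ratio 0.25
    {"review_id": "R3", "total_votes": 5, "helpful_votes": 5, "vine": "Y"},    # too few votes
    {"review_id": "R4", "total_votes": 30, "helpful_votes": 15, "vine": "Y"},  # ratio 0.50
]


def filter_helpful(reviews, min_votes=20, min_ratio=0.5):
    """Keep reviews with enough votes and a high enough helpful-vote ratio."""
    # Filtering on total_votes first means the division below never sees a
    # zero denominator (as long as min_votes is at least 1).
    voted = [r for r in reviews if r["total_votes"] >= min_votes]
    return [r for r in voted if r["helpful_votes"] / r["total_votes"] >= min_ratio]


helpful = filter_helpful(rows)
paid = [r for r in helpful if r["vine"] == "Y"]
unpaid = [r for r in helpful if r["vine"] == "N"]
print([r["review_id"] for r in paid], [r["review_id"] for r in unpaid])
```

Both thresholds are inclusive, so a review with exactly 20 votes or an exactly 50% helpful ratio is kept.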

After filtering, there are 0 Paid Reviews and 403,807 Unpaid Reviews. Of the Unpaid Reviews, 242,889 are 5-star reviews, which is 60.15% of all Unpaid Reviews.
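The reported percentage can be checked directly from the two counts. In this short sketch the helper name `pct_five_star` is hypothetical; it returns `None` for the Paid group because a percentage of zero reviews is undefined.

```python
def pct_five_star(five_star, total):
    """Percentage of 5-star reviews, or None when there are no reviews."""
    if total == 0:
        return None
    return round(100 * five_star / total, 2)


unpaid_pct = pct_five_star(242_889, 403_807)  # counts reported in this summary
paid_pct = pct_five_star(0, 0)                # no Paid Reviews survived the filters
print(unpaid_pct, paid_pct)  # 60.15 None
```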

These numbers do not suggest a strong bias toward five-star reviews from paid Amazon Vine reviewers, though with no Paid Reviews left after filtering there is nothing to compare the 60.15% against directly. One possible reading is that Vine customers are more critical when submitting their reviews. Further analysis should include all of the data rather than only the reviews that pass the helpful-vote filters used here.
