This is a REST-based service for a given Amazon product dataset. The service accepts a query in the form of the template provided below.
Group By Template

```
SELECT <COLUMNS>, FUNC(COLUMN1)
FROM <TABLE>
WHERE <COLUMN1> = X
GROUP BY <COLUMNS>
HAVING FUNC(COLUMN1) > X
```
The service translates the query into MapReduce jobs and also into a Spark job. It runs these two jobs separately and returns the following in a JSON object (a sample response is sketched after this list):
- Time taken for Hadoop MapReduce execution.
- Time taken for Spark execution.
- Input and output of the map and reduce tasks, chained in the order they are applied to the data.
- Spark transformations and actions in the order they are applied.
- Result of the query.
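For illustration, a response might be shaped like this (all keys and values below are hypothetical, not the service's exact format):

```json
{
  "mapReduceTime": "12.4 s",
  "sparkTime": "3.1 s",
  "mapReduceChain": [
    {"task": "map", "input": "CSV rows", "output": "<groupKey, aggValue> pairs"},
    {"task": "reduce", "input": "sorted <key, value> pairs", "output": "groups passing the having condition"}
  ],
  "sparkChain": ["read csv", "filter (where)", "groupBy", "agg (count)", "filter (having)", "collect"],
  "result": [["Book", 123], ["DVD", 45]]
}
```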
There are four tables:
- product
- category
- similar
- reviews
product

id | ASIN | title | productGroup | salesrank | similarCount | categoriesCount | reviewCount | downloads | averageRating |
---|---|---|---|---|---|---|---|---|---|
- id : Primary key of the product table
- ASIN : Amazon Standard Identification Number
- title : Name/title of the product
- productGroup : Product group (Book, DVD, Video or Music)
- salesrank : Amazon Salesrank
- similarCount : Number of similar products
- categoriesCount : Number of categories
- reviewCount : Number of reviews on the product
- downloads : Number of downloads
- averageRating : Average rating of a product
category

id | productId | categoryName | categoryId |
---|---|---|---|
- id : Primary key of the category table
- productId : Product ASIN id from product table
- categoryName : Name of the category
- categoryId : Unique identifier of the category in Amazon
similar

id | productId | similarId |
---|---|---|
- id : Primary key of the similar table
- productId : Product ASIN id from the product table
- similarId : Similar Product ASIN id from the product table
reviews

id | date | productId | customerId | rating | votes | helpful |
---|---|---|---|---|---|---|
- id : Primary key of the review table
- date : Date of review
- productId : Product ASIN number from product table
- customerId : Customer id of the reviewer
- rating : Rating for the product from the reviewer
- votes : Number of votes to the review
- helpful : Number of people who found the review helpful
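As a concrete example, a query following the template against the product table could be (values are illustrative):

```
Select productGroup, averageRating, count(id) from product where averageRating = 5 group by productGroup having count(id) > 100
```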
- User submits a query. `initiator` sends the query string to the `Driver`.
- The `Driver` sends the query to the `Parser` and receives the parsed query. `Parser` creates a file `elements.json` which stores the breakdown elements of the query.
- `Driver` sends the parsed query to the `MRSession` and `Sparkler` to receive the query result.
- `MRSession` initiates the `mapper` & `reducer`. `MRResult` fetches the output of the MapReduce job from `hdfs` to local and sends the JSON string to the `Driver`.
- `Sparkler` executes the query and sends the result to the `Driver`.
- The JSON strings from `MRResult` and `Sparkler` are passed to the `Driver`. `Driver` sends the JSON string to the `initiator`, which sends the response to the client.
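A minimal sketch of that flow in Python (the component objects and method names here are hypothetical stand-ins for the real modules):

```python
import json

def handle_query(query, parser, mr_session, sparkler):
    """Hypothetical Driver flow: parse once, run both engines, merge results."""
    parsed = parser.parse(query)        # Parser also writes elements.json
    mr_json = mr_session.run(parsed)    # MapReduce path; MRResult pulls output from HDFS
    spark_json = sparkler.run(parsed)   # Spark path
    # Merge both JSON strings into the response the initiator returns to the client.
    return json.dumps({"mapreduce": json.loads(mr_json),
                       "spark": json.loads(spark_json)})
```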
- initiator.py : Links the user to the Driver for the execution of the query.
- Driver : Sends the input query to the Parser and upon receiving the parsed query sends it to the MapRed and Spark modules to get the result.
- Elements/MapRed : Executes the MapReduce job for the query and sends the result to the Driver.
- Elements/Spark : Executes the Spark job for the query and sends the result to the Driver.
- Dependencies : Stores the `Database schema`, `Column DataType`, `Query Elements` and `Config` files.
For the given query

```
Select Column1, Column2, count(Column3) from Table where Column2 = X group by Column1 having count(Column3) > Y
```

the `Parser` breaks it down into elements as shown below and stores them in `elements.json`:
```
selectColumns:
- 'Column1'
- 'Column2'
selectFunc: 'count'
selectFuncColumn: 'Column3'
fromTable: 'Table'
whereColumn: 'Column2'
whereOperator: '='
whereValue: 'X'
groupByColumn: 'Column1'
havingThreshold: 'Y'
havingOperation: '>'
```
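A minimal parser sketch for queries that follow this template exactly (the regex and function are illustrative, not the project's actual `Parser`):

```python
import json
import re

TEMPLATE = re.compile(
    r"select\s+(?P<cols>.+?),\s*(?P<func>\w+)\((?P<funccol>\w+)\)\s+"
    r"from\s+(?P<table>\w+)\s+"
    r"where\s+(?P<wcol>\w+)\s*(?P<wop>[=<>]+)\s*(?P<wval>\S+)\s+"
    r"group\s+by\s+(?P<gcol>\w+)\s+"
    r"having\s+\w+\(\w+\)\s*(?P<hop>[=<>]+)\s*(?P<hval>\S+)",
    re.IGNORECASE,
)

def parse(query):
    """Break a template query into its elements and write elements.json."""
    m = TEMPLATE.match(query.strip())
    elements = {
        "selectColumns": [c.strip() for c in m.group("cols").split(",")],
        "selectFunc": m.group("func"),
        "selectFuncColumn": m.group("funccol"),
        "fromTable": m.group("table"),
        "whereColumn": m.group("wcol"),
        "whereOperator": m.group("wop"),
        "whereValue": m.group("wval"),
        "groupByColumn": m.group("gcol"),
        "havingOperation": m.group("hop"),
        "havingThreshold": m.group("hval"),
    }
    with open("elements.json", "w") as f:
        json.dump(elements, f, indent=2)
    return elements
```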
The `MapRed` module takes its input from the CSV file on STDIN and uses `elements.json` to get the parsed query. It goes through the input and selects the rows that satisfy the where condition. For each selected row, a <key, value> pair is generated, with the key being the groupBy column values joined by a space and the value being the aggregation column value; key and value are separated by a tab delimiter. The reducer performs the group by and executes the having condition to generate the output.
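A minimal Hadoop Streaming-style sketch of such a mapper and reducer, assuming a headerless product CSV on STDIN, the `elements.json` layout above, a hardcoded column order, and only `=`/`count` support (all simplifications, not the project's exact code):

```python
#!/usr/bin/env python3
# mapper.py -- filter rows on the WHERE condition, emit tab-delimited <key, value>.
import csv
import json
import sys

elements = json.load(open("elements.json"))
# Assumed column order of the input CSV for the product table.
COLUMNS = ["id", "ASIN", "title", "productGroup", "salesrank", "similarCount",
           "categoriesCount", "reviewCount", "downloads", "averageRating"]

w = COLUMNS.index(elements["whereColumn"])
g = COLUMNS.index(elements["groupByColumn"])
a = COLUMNS.index(elements["selectFuncColumn"])

for row in csv.reader(sys.stdin):
    if row[w] == elements["whereValue"]:          # WHERE filter (only '=' here)
        print(f"{row[g]}\t{row[a]}")              # key = groupBy value, value = agg column
```

Hadoop Streaming sorts the mapper output by key before the reducer runs, so consecutive lines with the same key can be grouped directly:

```python
#!/usr/bin/env python3
# reducer.py -- group by key, aggregate, and apply the HAVING condition.
import json
import sys
from itertools import groupby

elements = json.load(open("elements.json"))
threshold = float(elements["havingThreshold"])

for key, lines in groupby(sys.stdin, key=lambda l: l.split("\t", 1)[0]):
    count = sum(1 for _ in lines)                 # 'count' aggregation only
    if count > threshold:                         # HAVING count(...) > threshold
        print(f"{key}\t{count}")
```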
The `Spark` module reads the CSV file into a DataFrame. It applies transformations for the `where` condition and the aggregate function, and generates the output.
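A minimal PySpark sketch of that pipeline, using the parsed elements from the earlier example (the file path and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Sparkler").getOrCreate()

# Read the CSV into a DataFrame (assumes a header row).
df = spark.read.csv("table.csv", header=True, inferSchema=True)

result = (df.filter(F.col("Column2") == "X")              # where Column2 = X
            .groupBy("Column1")                           # group by Column1
            .agg(F.count("Column3").alias("cnt"))         # count(Column3)
            .filter(F.col("cnt") > 10)                    # having count(Column3) > Y
            .collect())                                   # action: triggers execution
```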