Exploratory Data Analysis and Price Predictions for Avocado Dataset based on Machine Learning
- Problem Statement
- Data Loading and Description
- Data Profiling
- Understanding the Dataset
- Profiling
- Preprocessing
- Data Visualisation and Questions answered
- Q.1 Which type of avocado is more in demand (Conventional or Organic)?
- Q.2 In which range does the Average price lie, and what does its distribution look like?
- Q.3 How is the Average price distributed over the months for Conventional and Organic types?
- Q.4 What are the TOP 5 regions where the Average price is highest?
- Q.5 What are the TOP 5 regions where Average consumption is highest?
- Q.6 In which year and for which region was the Average price the highest?
- Q.7 How is the price distributed over the Date column?
- Q.8 How are the dataset features correlated with each other?
- Feature Engineering for Model building
- Model selection/predictions
- P.1 Are we good with Linear Regression? Let's find out.
- P.2 Are we good with Decision Tree Regression? Let's find out.
- P.3 Are we good with Random Forest Regressor? Let's find out.
- Let's see the final Actual vs. Predicted sample.
- Conclusions
- This notebook explores the basic use of Pandas and covers the basic commands of exploratory data analysis (EDA).
- In this study, we try to see whether we can predict the avocado's Average Price from several features (Total Bags, Date, Type, Year, Region…).
The variables of the dataset are the following:
- Categorical: ‘region’, ‘type’
- Date: ‘Date’
- Numerical: ‘Total Volume’, ‘4046’, ‘4225’, ‘4770’, ‘Total Bags’, ‘Small Bags’, ‘Large Bags’, ‘XLarge Bags’, ‘Year’
- Target: ‘AveragePrice’
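The grouping of variables by role can be written down once and reused when selecting features and the target. The following is a minimal sketch, assuming the column names match the spelling used in this write-up; the helper names `missing_columns` and `split_features_target` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Column groups as listed in this write-up (spelling assumed to match the CSV).
CATEGORICAL = ["region", "type"]
DATE = ["Date"]
NUMERICAL = [
    "Total Volume", "4046", "4225", "4770",
    "Total Bags", "Small Bags", "Large Bags", "XLarge Bags", "Year",
]
TARGET = "AveragePrice"


def missing_columns(df):
    """Return the expected columns that are absent from df."""
    expected = CATEGORICAL + DATE + NUMERICAL + [TARGET]
    return [col for col in expected if col not in df.columns]


def split_features_target(df):
    """Split df into a feature frame X and a target series y."""
    X = df[CATEGORICAL + DATE + NUMERICAL]
    y = df[TARGET]
    return X, y


# Empty demo frame that only carries the expected column names.
demo = pd.DataFrame(columns=CATEGORICAL + DATE + NUMERICAL + [TARGET])
X, y = split_features_target(demo)
print(missing_columns(demo))  # an empty list means every expected column is present
```

Running `missing_columns` on the real frame right after loading is a cheap way to catch a renamed or misspelled column before any plotting or modelling.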
This data, provided by INSAID, was downloaded from the Hass Avocado Board website in May of 2018 and compiled into a single CSV. It represents weekly 2018 retail scan data for national retail volume (units) and price. The dataset comprises 18249 observations of 14 columns. Below is a table showing the names of the columns and their descriptions.
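As a rough sketch of how such a CSV can be loaded and its shape checked with Pandas, the example below reads a tiny in-memory sample instead of the real file. The two rows are made-up values, the header is assumed to follow the column names used in this write-up, and `load_avocados` is a hypothetical helper name; with the real file you would pass its path instead of the `StringIO` object.

```python
import io

import pandas as pd

# Made-up two-row sample standing in for the real CSV (values are illustrative only).
# The first header cell is empty, as happens when an index was saved with the data.
SAMPLE_CSV = """,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,Year,region
0,2018-01-07,1.10,1000.0,300.0,400.0,50.0,250.0,200.0,45.0,5.0,conventional,2018,Albany
1,2018-01-14,1.25,900.0,250.0,380.0,40.0,230.0,190.0,36.0,4.0,organic,2018,Albany
"""


def load_avocados(source):
    """Read the CSV and parse the Date column as datetimes."""
    return pd.read_csv(source, parse_dates=["Date"])


df = load_avocados(io.StringIO(SAMPLE_CSV))
print(df.shape)    # (rows, columns); the full dataset is described as 18249 x 14
print(df.columns.tolist())
```

Pandas names the header-less first column `Unnamed: 0`, which is where the index-like feature in the table below comes from.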
The less obvious numerical variable names are explained in the table below:
Features | Description |
---|---|
‘Unnamed: 0’ | It's just a useless index feature that will be removed later |
‘Total Volume’ | Total sales volume of avocados |
‘4046’ | Total sales volume of Small/Medium Hass Avocado |
‘4225’ | Total sales volume of Large Hass Avocado |
‘4770’ | Total sales volume of Extra Large Hass Avocado |
‘Total Bags’ | Total number of Bags sold |
‘Small Bags’ | Total number of Small Bags sold |
‘Large Bags’ | Total number of Large Bags sold |
‘XLarge Bags’ | Total number of XLarge Bags sold |
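The table suggests two small clean-up steps: dropping the index-like column and, optionally, giving the numeric PLU columns readable names. A minimal sketch follows; the new column names in `PLU_NAMES` and the helper `tidy_columns` are hypothetical choices for illustration, and the demo values are made up.

```python
import pandas as pd

# Hypothetical readable names for the PLU code columns described in the table.
PLU_NAMES = {
    "4046": "small_medium_hass",
    "4225": "large_hass",
    "4770": "xlarge_hass",
}


def tidy_columns(df):
    """Drop the leftover index column (if present) and rename the PLU columns."""
    cleaned = df.drop(columns=["Unnamed: 0"], errors="ignore")
    return cleaned.rename(columns=PLU_NAMES)


# Made-up demo frame with a subset of the columns.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "4046": [300.0, 250.0],
    "4225": [400.0, 380.0],
    "4770": [50.0, 40.0],
    "Total Bags": [250.0, 230.0],
})
tidy = tidy_columns(raw)
print(tidy.columns.tolist())
```

Both `drop` and `rename` return new frames here, so `raw` is left untouched and the step can be rerun safely in a notebook.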