-
Notifications
You must be signed in to change notification settings - Fork 0
/
Super-Market _EDAnalysis (1).py
543 lines (369 loc) · 15.9 KB
/
Super-Market _EDAnalysis (1).py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
#!/usr/bin/env python
# coding: utf-8
# # Super Market Sales - Exploratory Ananlysis - EDA
# > About Project - By using this Data set I want to know about the 3 months of all braches' transactions have done And Explore every transaction between each individual category column and numerical column.
# > How The sales are influenced by gender-wise, City, Products, Customer Type.
# >Each Branch wise sales Behavior.
#
#
#
# ## Import Liabraries
# In[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# ## Reading Dataset
# In[2]:
df = pd.read_csv('https://raw.githubusercontent.com/sushantag9/Supermarket-Sales-Data-Analysis/master/supermarket_sales%20-%20Sheet1.csv')
# In[3]:
df.head() #exploring the number of columns of data set.
# In[4]:
df.info() #here we can get no.of rowss(entries) ,columns and dtype of each columns to further process.
# In[5]:
df.duplicated().sum() #checking about the duplicates
# In[6]:
df.isnull().sum() # checking the null values and sum it each column to know the count of null value
# ## Data Cleaning (Data Wrangling)
#
# 1. Removing the Unnecessary columns which are not required for my analysis.
#
# 2. Converting the data type, default to meaningful.
# In[7]:
#Note - i am removing the gp% because, if you check df(data frame) gp% has same observations so that...
# and also tax column is not required for my analysis.
df.drop(columns=['Tax 5%','gross margin percentage'],axis=1,inplace=True)
# In[8]:
df['Date'] = pd.to_datetime(df['Date'])
# In[9]:
df['Time'] = pd.to_datetime(df['Time']) #coverting object to data time formate
# In[10]:
df['Time'] = df.Time.dt.hour #keeping the column as a hour observations.
# ## Statistics Summary
#
# #### Statistics summary gives a high-level idea to identify whether the data has any Outliers, Data entry error,
# #### Distribution of data such as the data is normally distributed or left / right skewed.
# In[11]:
df.describe()
# ### From the statistics summary, we can infer the below findings :
#
# . we can observe here, the data set does not have much outliers and skewsness.
# above columns are all float or integer dtype. so see every column mean and standard deviation.
# looks preety normal.
#
# . Rating from users is 7 from median and 6.97 from mean. looks decent rating.
#
# . Here we can see the margin between total and cogs. with ".describe()" function we could check difference between them
# easily.
# #### Before we do EDA, lets seperate Numerical and Categorical variables for easy analysis
# In[12]:
Quantitative_data = df.select_dtypes(include=np.number).columns.tolist()
# In[13]:
Quantitative_data
# In[14]:
Qualitative_data = df.select_dtypes(include=['object']).columns
# In[15]:
Qualitative_data
# ## EDA Univariate Analysis
# In[16]:
#Note - By using the below syntax we can acheive the sknewness in boxplot and histogram plots
# here histograms for check the distribution of each numerical column.
for i in Quantitative_data:
print(i)
print(' ')
print('Skewness :',df[i].skew())
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.histplot(x=df[i],kde=True,palette='deep')
plt.subplot(1,2,2)
sns.boxplot(x=df[i],palette='deep')
plt.show()
# . Through plots and .skew() function we can get meaningfull information about our distribution behaviour.
#
# . so coming to the point, Unit price, Quantity and rating columns are having the normal distribution
# however Totals(sales), cogs, gross income are having right skew distribution with few outlier
# observations.
#
# In[17]:
#Note - With below Syntax, we can do univariate analysis between categorical variables. that is we can get Frequency
# by using "count plots..."
sns.set_style('darkgrid')
title = {'family':'times new roman','color':'blue','size':23}
fig,axes=plt.subplots(4,2,figsize=(20,30))
axes[0,0].set_title('Branch Vs Product Line',fontdict=title)
sns.countplot(ax=axes[0,0],x=df.Branch,hue=df['Product line'],palette='deep',hatch='...',order=df.Branch.value_counts().index)
axes[0,0].set_ylim(0,150)
axes[0,1].set_title('Branch Wise Frequency',fontdict=title)
sns.countplot(ax=axes[0,1],x=df.Branch,palette='deep',hatch='...',order=df.Branch.value_counts().index)
axes[0,1].set_ylim(0,400)
axes[1,0].set_title('Branch Vs Customer Type',fontdict=title)
sns.countplot(ax=axes[1,0],x=df.City,hue=df['Customer type'],palette='deep',hatch='...',order=df.City.value_counts().index)
axes[1,0].set_ylim(0,250)
axes[1,1].set_title('City Wise Frequency',fontdict=title)
sns.countplot(ax=axes[1,1],x=df.City,palette='deep',hatch='...',order=df.City.value_counts().index)
axes[1,1].set_ylim(0,400)
axes[2,0].set_title('Mode of Payment Vs City',fontdict=title)
sns.countplot(ax=axes[2,0],x=df.Payment,hue=df.City,palette='deep',hatch='...',order=df.Payment.value_counts().index)
axes[2,0].set_ylim(0,250)
axes[2,1].set_title('Mode of Payment Wise Frequency',fontdict=title)
sns.countplot(ax=axes[2,1],x=df['Payment'],palette='deep',hatch='...',order=df['Payment'].value_counts().index)
axes[2,1].set_ylim(0,400)
axes[3,0].set_title('Customer Type vs Gender',fontdict=title)
sns.countplot(ax=axes[3,0],x=df['Customer type'],hue=df.Gender,palette='deep',hatch='...',order=df['Customer type'].value_counts().index)
axes[3,0].set_ylim(0,350)
axes[3,1].set_title('Gender Vs Product Line',fontdict=title)
sns.countplot(ax=axes[3,1],x=df['Product line'],hue=df.Gender,palette='deep',hatch='...',order=df['Product line'].value_counts().index)
axes[3,1].set_ylim(0,150)
axes[3,1].tick_params(labelrotation=90)
#plt.subplots_adjust(hspace=0.5)
plt.show()
# ### From the count plot, we can have below observations -
#
# 1. (Branch vs Product Line)
#
# Through counts plots we could get the branch wise which product goining out from Mall. frequently(Higher). that is From
#
# A branch - Home and lifestyle 65 times.
#
# B branch - Fashion accessories & sports and travel both 62 times.
#
# C branch - Food and Beverages 66 times || Fashion accessories 65 times.
#
# #### _______
#
#
# 2. (Branch vs Customer Type)
#
# Customer with membership (173) > Normal customers in Yangon City
#
# as well as, Naypyitaw city recorded 169 memebers > Normal customers
#
# #### _______
#
# 3. (Mode of payment vs city)
#
# . From Mandalay city, Credit Card users are more then remain payment modes.
#
# . From Yangon city, Ewallet users are more than remain payment modes.
#
# . From Naypyitaw city, cash users are more than remain payment modes.
#
# #### _______
#
# 4. (Gender vs Product)
#
# . Females are frequently purchasing fashion and accessories products.
#
# . male are frequently purchasing health and beauty products And so on....
# ## EDA Bivariate Analysis
# #### Now, let’s move ahead with bivariate analysis. Bivariate Analysis helps to understand how variables are related to each other and the relationship between dependent and independent variables present in the dataset.
#
# #### For Numerical variables, Pair plots and Scatter plots are widely been used to do Bivariate Analysis.
#
# #### A Stacked bar chart can be used for categorical variables if the output variable is a classifier. Bar plots can be used if the output variable is continuous
# In[18]:
sns.pairplot(df,vars=['Rating','Quantity','Total','gross income','cogs','Unit price']);
# #### Pair Plot provides below insights:
#
# 1. Here we can see a strong positive correlation between totals and gross income || cogs and gross income.
# 2. And we can see there is no relation between rating and quantity( While Increasing)
# In[19]:
#Note - A bar plot can be used to show the relationship between Categorical variables and continuous variables
fig,axes=plt.subplots(4,2,figsize=(25,40))
title = {'family':'times new roman','color':'blue','size':23}
axes[0,0].set_title('PRODUCT Vs GENDER WISE RATING',fontdict=title)
sns.boxplot(ax=axes[0,0],y=df['Product line'],x=df.Rating,palette='deep',hue=df.Gender)
axes[0,0].legend(loc=1,prop={'size':15},framealpha=False)
axes[0,1].set_title('PRODUCT Vs GENDER WISE QUANTITY',fontdict=title)
sns.boxplot(ax=axes[0,1],y=df['Product line'],x=df['Quantity'],palette='deep',hue=df.Gender)
axes[0,1].legend(loc=1,prop={'size':15},framealpha=False)
axes[1,0].set_title('BRANCH Vs (COGS & SALES)',fontdict=title)
df.groupby('Branch')[['Total','cogs']].sum().plot.bar(ax=axes[1,0],hatch='...',fontsize=15)
axes[1,0].set_ylim(0,170000)
axes[2,0].set_title('CUSTOMER Vs SALES',fontdict=title)
df.groupby('Customer type')['Total'].sum().sort_values(ascending=False).plot.bar(ax=axes[2,0],hatch='...',fontsize=15)
axes[2,0].set_ylim(0,200000)
axes[2,1].set_title('GENDER Vs SALES',fontdict=title)
df.groupby('Gender')['Total'].sum().sort_values(ascending=False).plot.bar(ax=axes[2,1],hatch='...',fontsize=15)
axes[2,1].set_ylim(0,200000)
axes[1,1].set_title('PRODUCT Vs (COGS & SALES)',fontdict=title)
df.groupby('Product line')[['Total','cogs']].sum().plot.bar(ax=axes[1,1],hatch='...',fontsize=15)
axes[1,1].set_xticks([0,1,2,3,4,5],['E & A','F & A','F & B','H & B','H & L','S & T'])
axes[1,1].set_ylim(0,70000)
axes[3,0].set_title('CITY vs SALES',fontdict=title)
df.groupby('City')['Total'].sum().sort_values(ascending=False).plot.bar(ax=axes[3,0],hatch='...',fontsize=15)
axes[3,0].set_ylim(0,170000)
axes[3,1].set_title('PRODUCTS vs SALES',fontdict=title)
sns.barplot(x=df['Product line'],y=df['Total'],hue=df['Branch'],ax=axes[3,1],hatch='...')
axes[3,1].set_xticks([0,1,2,3,4,5],['H & B', 'E & A',
'H & L', 'S & T', 'F & B',
'F & A'],fontsize = 15)
axes[1,0].tick_params(labelrotation=0)
axes[0,1].tick_params(labelrotation=0)
axes[1,1].tick_params(labelrotation=0)
axes[2,0].tick_params(labelrotation=0)
axes[2,1].tick_params(labelrotation=0);
# ### Observations
#
# 1. (Product vs Gender's Rating)
#
# by using box plots-
#
# .females and males both have given highest average rating to
# food and beverages product.
#
# .indivdually females are giving highest average rating to health and beauty products.
# lowest average rating to electronic accessories.
#
# .males are giving highest average rating to food and behaverage products
# and lowest avearge rating to fashion accessories.
#
# #### _______
#
# 2. (Product vs Sales)
#
# comparing the all products,
# . Most of the higher sales from Food and Beverage products
#
# . lowest sales from Health and Beauty products.
#
# #### _______
#
# 3. (Branch vs Sales)
#
# . looks like all the branches are geeting same sale figure. however not much significant difference.
# from C branch getting higher sales compared to remain sales.
#
# #### _______
#
# 4. (City Vs Sales)
#
# . From Naypyitaw city, Getting more sales however not much significant difference between all cities.
#
# #### _______
#
# 5. (members_ship vs normal customers)
#
# . The customers who has membership, getting more sales from them than normal customers. And so on...
#
#
#
# ## EDA Multivariate Analysis
#
# #### As the name suggests, Multivariate analysis looks at more than two variables. Multivariate analysis is one of the most useful methods to determine relationships and analyze patterns for any dataset.
# In[20]:
# As the name suggests, Multivariate analysis looks at more than two variables.
# Multivariate analysis is one of the most useful methods to determine relationships and analyze patterns for any dataset.
plt.figure(figsize=(15,7))
sns.heatmap(df.corr(),linewidths=0.5,cmap='Blues',vmax=1,vmin=-1,annot=True);
# ### From the Heat map, we can infer the following:
#
# 1. Here It is appearing strong R(correlation cofficient) value is 0.63. Between units price and Totals
#
# 2. Positive Correlation R(correlation cofficient) value is 0.71. Between units Quantity and Totals.
#
# 3. Negative Correlation R(correlation cofficient) value is -0.016. Between Quantity and Rating.
# ## Timeline Analysis
#
#
# #### with the help of DateTime dtype in the data frame. there is a possibility to do the timeline analysis and forecasting. ie., Month and day, time-wise transactions.
#
#
# In[21]:
#Adding New Columns Month name and Day name due to further month wise and day wise analysis
df['Month'] = pd.DatetimeIndex(df['Date']).month
df['Day'] = df['Date'].dt.day_name()
# In[22]:
a=sns.relplot(x=df.Time,y=df.Total,kind='line',row=df.Day,col=df.Month,
row_order=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],)
a.set_ylabels('Total Sales')
plt.show() #below Darker shadow is confidence intervel(inference stats)
#Day wise Month sales
# ### Observations
#
# . In month 1 Saturday sales were going well up to 600 total Sales.
#
# . In month 2 Friday sales were going well up to 1000+ total Sales.
#
# . In month 3 Friday sales were going well up to 600 total Sales
#
# . In on average Friday in the week is going very well with less width of the confidence interval.
#
# . another side on Wednesday is not going up to marks compare to Remain days in the week
# and with the huge confidence interval
#
# . Here we can see in all charts the total sales figure raising in between 12noon to 2pm.
# In[23]:
sns.relplot(x=df.Time,y=df.Quantity,hue=df['Gender'],col=df.Branch,kind='line',ci=None,col_order=['A','B','C'])
plt.show()
# ### Observations
#
# #### Note :- FQ : female Quantity Curve, MQ : Male Quantity Curve
#
#
#
# . At all branches, FQ-curves are going over the MQ-curve. Except for Branch A.
#
# . At branch C. It has a significant difference between A and B after 6 pm
# FQ-curve increasing gradually to the top.
#
# . At branch B. from Noon 12 to 3 pm FQ-Curve raising gradually.
#
# . At branch A. after 3 pm MQ-curve is raising continuously.
#
# . At Branch c. MQ-Curve is falling down for long time between 11 am to 3 pm.
#
# In[24]:
time_series = df[['Date','Total']]
# In[25]:
plt.figure(figsize=(20,5))
plot = sns.lineplot(x=time_series['Date'],y=time_series['Total'],ci=None);
plt.ylabel('Total Sales',fontsize=15)
plt.xlabel('Date',fontsize=15);
# ## Observation
#
# . Sales look like no trend however it has variance and seasonality. But It is just
# 3 months of Sales. We can not say much more about behavior of the Sales
# In[26]:
from PIL import Image
from urllib.request import urlretrieve
urlretrieve('https://securecdn.pymnts.com/wp-content/uploads/2017/05/grocery-tracker-51517.jpg','sm.jgp')
sm_image = Image.open('sm.jgp')
plt.figure(figsize=(15,10))
plt.grid(False)
plt.axis(False)
plt.imshow(sm_image);
# # Conclusion
#
# 1. if you see here most of the sale transactions have been coming from females.so that Electronic accessories products are not getting more sale transactions.
#
#
# 2. Here The Strange point is Even though getting sales from females but getting fewer sales from Health and Beauty products. However Males-wise the highest product sold is Health and Beauty Product.
#
#
#
# 3. Coming to the mode of payment Credit Card sales are very less compared to other modes.
#
#
# 4. At Branch A, Home and lifestyle products are highest sold products
#
#
# 5. At Branch B, Health and Beauty products are highest sold products
#
#
# 6. At Branch C, Food and beverage products are highest sold products
#
#
# 7. Food and beverage sales are going very well at Branch c, Because compare to other topmost sold product of each Branch. you can understand It has a significant trend between 6 pm and 8 pm by females
#
#
# 8. During 2019-02 month we can see in overall sales-wise it has the huge Downfall.
#
#
# #### This way, we perform EDA on the datasets to explore the data and extract all possible insights, which can help in model building and better decision making.
#
# #### However, this was only an overview of how EDA works; you can go deeper into it and attempt the stages on larger datasets.
#
# #### If the EDA process is clear and precise, our model will work better and gives higher accuracy!