#!/usr/bin/env python
# coding: utf-8
# # PLAY STORE APPS ANALYSIS AND VISUALIZATION
# ### About the project:
#
# In this project, I will be working on a real-world dataset from the Google Play Store, one of the most widely used platforms for downloading Android apps. The project aims at cleaning the dataset, analyzing it, and mining quality insights from it. It also involves visualizing the data to more easily understand trends across different categories.
#
# ### Project Description:
#
# This project will help one understand how a real-world dataset is analyzed using SQL, how to extract the maximum available insights from it, how to pre-process the data using Python for better downstream performance, how structured query language helps us retrieve useful information from a database, and how to visualize the data with the Power BI tool.
# The project consists of two modules:
#
# - Module 1: Pre-processing and analyzing data using Python and SQL.
# - Module 2: Visualizing data using Power BI.
# ## Module 1: Pre-processing and Analyzing Data Using Python and SQL
#
# The first step of this analysis involves pre-processing the data using Python libraries. After this pre-processing, also known as data cleaning, the cleaned data is further analyzed using MySQL to garner insights based on our research questions.
# ## Task 1 : Pre-processing the data
# In[1]:
# First, import all the necessary libraries
import pandas as pd
import numpy as np
from numpy import nan
from datetime import datetime, timedelta
# In[2]:
# There are two datasets, which we import using pandas
apps = pd.read_csv("playstore_apps.csv", index_col = 'App')
reviews = pd.read_csv("playstore_reviews.csv", index_col = 'App')
# In[3]:
# Let's see the column attributes for the apps dataset
apps.info()
# In[4]:
# Let's find the number of rows and columns in the apps dataset
apps.shape
# In[5]:
# Let's view the first 5 rows of the apps dataset
apps.head()
# In[6]:
# Check for duplicate rows in the apps dataset
apps.duplicated().value_counts()
# ### Findings on `apps` dataset
#
# 1. The data contains 10841 rows and 12 columns.
# 2. When importing the datasets, the `App` column was used as the index column.
# 3. The `apps` dataset contains 492 duplicate rows and 10349 unique rows.
# 4. The `Last Updated` column has an object datatype.
# 5. The `Rating`, `Current Ver`, `Android Ver`, and `Type` columns contain missing values (see the quick check below).
# 6. Some columns contain irrelevant values that must be cleaned or removed.
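# As a quick optional check, not part of the original flow, the missing-value counts behind finding 5 can be confirmed per column:
# In[ ]:
# Count missing values in each column of the apps dataset
apps.isna().sum()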
# ### Subtask 1 : Removing Duplicate Rows
#
# We have established that the apps dataset contains 492 duplicate rows. We will now remove those duplicate rows from the apps dataset.
# In[7]:
# Drop all rows that appear more than once in the apps dataset while retaining the first occurrence
apps.drop_duplicates(keep='first', inplace=True)
# In[8]:
# Check to make sure duplicate rows were dropped
apps.info()
# ### Subtask 2: Remove irrelevant values from each column, if any
#
# We check each column to identify attributes that are irrelevant and must be removed from the dataset.
#
# We start by checking the unique attributes of each column to look for irregularities.
# In[9]:
# Check the distinct attributes of the `Category` column
apps['Category'].unique()
# The `Category` column looks okay except for the '1.9' attribute seen above, which is not a Google Play Store category, so we remove the row entry containing the category '1.9'.
# In[10]:
# Check the row that contains the category '1.9'
apps[apps['Category'] == '1.9']
# In[11]:
# Only one row (row 10472) contains the category '1.9'; that row is removed from our dataset since it's not needed.
apps.drop('Life Made WI-Fi Touchscreen Photo Frame', inplace=True)
# In[12]:
# Check to make sure the category of 1.9 was removed.
apps['Category'].unique()
# In[13]:
# Check the distinct attributes of `Rating` column
apps['Rating'].unique()
# The `Rating` column contains missing values. Since the rating of each app is a numerical value, these missing values will be replaced with 0.
# In[14]:
# Replace all NaN values with 0
apps['Rating'] = apps['Rating'].fillna(0)
# In[15]:
# Check the `Rating` column to confirm that NaN values have been filled with 0
apps['Rating'].unique()
# In[16]:
# Check the distinct attributes of `Reviews` column
apps['Reviews'].unique()
# In[17]:
# Check the distinct attributes of `Size` column
apps['Size'].unique()
# In[18]:
# Check the distinct attributes of `Installs` column
apps['Installs'].unique()
# In[19]:
# Check the distinct attributes of `Price` column
apps['Price'].unique()
# In[20]:
# Check the distinct attributes of `Genres` column
apps['Genres'].unique()
# In[21]:
# Check the `Current Ver` column for null values
apps[apps['Current Ver'].isna()]
# In[22]:
# Check the count of null values in `Current Ver` column
apps['Current Ver'].isna().value_counts()
# In[23]:
# Check the count of null values in `Android Ver` column
apps['Android Ver'].isna().value_counts()
# In[24]:
# Check the distinct attributes of `Android Ver` column
apps['Android Ver'].unique()
# In[25]:
# Drop rows with null values in the `Android Ver` column
apps.dropna(subset=['Android Ver'], inplace=True)
# In[26]:
# Check the distinct attributes of `Content Rating` column
apps['Content Rating'].unique()
# In[27]:
# Check the distinct attributes of the `Type` column
apps['Type'].unique()
# In[28]:
# Drop rows with missing values in `Type` column
apps.dropna(subset=['Type'], inplace=True)
# #### Check
#
# Let's confirm that the NaN row has been removed from the `Type` column
# In[29]:
apps['Type'].unique()
# #### Next, we check for incorrect datatypes in the column attributes. The `Last Updated` column has an object datatype and must be converted to datetime to reflect the dates it contains.
# In[30]:
# Change datatype of `Last Updated` column to datetime
apps['Last Updated'] = pd.to_datetime(apps['Last Updated'])
# In[31]:
# Let's check that all changes made are reflected in every column.
apps.info()
# ### Summary of Data Cleaning on `Apps` Dataset
#
# 1. Removed all duplicate rows in the apps dataset.
# 2. Removed the one row containing '1.9' in the `Category` variable.
# 3. Replaced all missing values in the `Rating` variable with 0.
# 4. Removed all rows containing missing values in the `Android Ver` variable.
# 5. Removed all rows containing missing values in `Type` variable.
# 6. Changed the datatype of `Last Updated` variable to datetime.
# From the above assessment, it can be seen that the `apps` dataset still contains missing values in the `Current Ver` column, which will be addressed in the next stage of our analysis using Microsoft Excel. All missing values in the `Current Ver` column will be replaced with NaN.
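# As a side note, here is a minimal pandas sketch of the same step, assuming the aim is to make missing `Current Ver` entries visible as the literal text 'NaN' in the exported CSV (the project performs this replacement in Excel; `apps_prepared` is a hypothetical name introduced here for illustration):
# In[ ]:
# Hypothetical pandas equivalent of the Excel step, applied to a copy
# so the original workflow is unchanged.
apps_prepared = apps.copy()
apps_prepared['Current Ver'] = apps_prepared['Current Ver'].fillna('NaN')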
# #### Now we can export our `apps` dataset to csv file for further assessment/analysis
# In[32]:
apps.to_csv('cleaned_apps_v2.csv', encoding='utf-8')
# ### Data Preparation
#
# The following data preparation steps were carried out on the cleaned apps dataset after exporting it to CSV:
#
# 1. Replaced all missing values in the `Current Ver` variable with NaN (carried out in Microsoft Excel).
# 2. Cleaned the app name column of special characters and irrelevant app names.
#
# After data preparation, the cleaned apps dataset contains 9766 rows and 13 variables.
#
# This cleaned apps dataset can now be analyzed to gain insight using MySQL and visualized using Power BI.
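# As a minimal sketch of how the cleaned CSV might be loaded into MySQL for the SQL stage (the connection string, credentials, and table name below are assumptions, not part of the project files):
# In[ ]:
# Load the cleaned apps CSV into a hypothetical local MySQL database
# using SQLAlchemy; adjust the credentials and database name as needed.
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/playstore")
cleaned_apps = pd.read_csv("cleaned_apps_v2.csv")
cleaned_apps.to_sql("apps", con=engine, if_exists="replace", index=False)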
# ## Pre-processing on the `reviews` dataset
# Let's programmatically view the reviews dataset to get familiar with its properties.
# In[33]:
# Check the dataset dimensions
reviews.shape
# In[34]:
# Let's see the column attributes for the reviews dataset
reviews.info()
# In[35]:
# View the first 10 rows of the dataset
reviews.head(10)
# In[36]:
# Check for missing values in the `Translated_Review` column
reviews['Translated_Review'].isna().value_counts()
# ### Summary of findings on `reviews` dataset
#
# 1. The `reviews` dataset contains 64295 rows and 4 variables.
# 2. The dataset contains a large number of missing values across its review records.
# 3. The `App` column was used as the index column
#
# #### Sub-task: Drop all rows with missing values
# In[37]:
# Drop all rows with missing values in the reviews dataset
reviews.dropna(inplace=True)
# We observed that the `reviews` dataset had 26868 null values in its `Translated_Review` column. Therefore, all rows with null values in the `Translated_Review` column were deleted, as rows without a translated review are of no use for our analysis.
# #### Check
#
# Check to make sure all missing values in the reviews dataset were dropped.
# In[38]:
# Check for any remaining missing values in the `Translated_Review` column
reviews['Translated_Review'].isna().value_counts()
# In[39]:
# Check for duplicates
reviews.duplicated().value_counts()
# In[40]:
# Check dataset to confirm that all data cleaning has reflected.
reviews.info()
# _Duplicate records in the reviews dataset are not dropped, as all review records will be useful for further analysis._
# ### Summary of Data Cleaning on `reviews` dataset
#
# 1. All rows containing missing values were dropped from the dataset.
# 2. After dropping missing values, the dataset now contains 37427 rows of data.
# The reviews dataset will now be exported as a CSV file for further analysis.
# In[41]:
reviews.to_csv('cleaned_reviews_v2.csv', encoding='utf-8')
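# Similarly, a minimal sketch (same assumptions as the apps sketch above) of loading the cleaned reviews CSV into the same hypothetical MySQL database for the SQL stage:
# In[ ]:
# Reuses the SQLAlchemy engine defined in the earlier sketch.
cleaned_reviews = pd.read_csv("cleaned_reviews_v2.csv")
cleaned_reviews.to_sql("reviews", con=engine, if_exists="replace", index=False)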