In this project I created two solutions: the first uses GeoPandas with Python and the second uses PySpark. Both analyse various files containing information about GPS positions and store positions in order to analyse customer behaviour.
This project was built with the Databricks service hosted on the AWS cloud platform, running on EC2 instances as the cluster, and using AWS S3 object storage to hold the training data and sample data and to store the results (you will also find a CSV copy of the results in this repo).
The S3 bucket was mounted to the Databricks DBFS (Databricks File System) using an AWS instance profile, which allows us to explore its paths and files (using Databricks dbutils) as if the S3 bucket were a local file system.
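A minimal sketch of the mount step, assuming the bucket name shown in the structure further down and a cluster that already has the instance profile attached (so no access keys are needed); the mount point name is just an example:

```python
# Mount the S3 bucket to DBFS; with an instance profile attached to the cluster
# no credentials need to be passed here. "/mnt/train-data" is an arbitrary name.
dbutils.fs.mount(
    source="s3a://train-data-20221903",
    mount_point="/mnt/train-data",
)

# Once mounted, the bucket can be browsed like a local file system.
display(dbutils.fs.ls("/mnt/train-data/"))
```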
The Python notebooks were exported in three different formats (.ipynb, Python source code and HTML), and all three formats were uploaded to this GitHub repository.
You can check the HTML version of the notebooks to see the code together with its execution output and results.
To run the script, you can upload any of the formats to your Databricks workspace and simply execute it (after changing the file paths or mounting your own S3 bucket), or run the Python source code version of the notebooks locally.
The notebook's cluster already contains the necessary dependencies, which were installed through pip:
%pip install geopandas
%pip install shapely
%pip install fsspec
%pip install s3fs
%pip install rtree
(if you are going to run it locally, you will probably also need to pip install pandas, fiona and GDAL)
The S3 bucket contains three different data sources:
- GPS signals of users
- Berlin store polygons
- User affinity datasets
The S3 bucket had the following structure:
https://s3.console.aws.amazon.com/s3/buckets/train-data-20221903 :
|- Berlin_map_data/plz_5-stellig_berlin.kml : .kml file containing the Berlin map information for GeoPandas (see the loading sketch below)
|- oregon-prod/ : contains job logs and metadata
|- Results_data/ : separated into folders named in 'year-month-day' format; each folder contains the results data of the corresponding day
|- sample_data/ : contains sample GPS data
|- train_data/ : contains the full GPS data, store polygons and affinity datasets
The Berlin map was downloaded from: https://www.suche-postleitzahl.org/berlin.13f
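The .kml file under Berlin_map_data/ can be read with GeoPandas once the KML driver is enabled in Fiona; a minimal sketch, assuming the example mount point from above:

```python
import fiona
import geopandas as gpd

# Enable the KML driver, which Fiona does not activate by default.
fiona.drvsupport.supported_drivers["KML"] = "rw"

# Read the Berlin postcode map from the mounted bucket.
berlin_map = gpd.read_file(
    "/dbfs/mnt/train-data/Berlin_map_data/plz_5-stellig_berlin.kml",
    driver="KML",
)
```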
The visualisations and insights below were produced after applying a spatial join between the GPS signal dataframes and the store polygons dataframe.
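As a rough illustration of the join step (not the exact code from the notebooks), it could look like the sketch below; the file names and the lon/lat/device column names are hypothetical and depend on the actual data layout:

```python
import pandas as pd
import geopandas as gpd

# Hypothetical file names under the mounted bucket; adjust to the real layout.
gps_df = pd.read_csv("/dbfs/mnt/train-data/train_data/gps_signals.csv")
stores_gdf = gpd.read_file("/dbfs/mnt/train-data/train_data/store_polygons.geojson")

# Turn the raw GPS rows into a GeoDataFrame of WGS84 points.
gps_gdf = gpd.GeoDataFrame(
    gps_df,
    geometry=gpd.points_from_xy(gps_df["lon"], gps_df["lat"]),
    crs="EPSG:4326",
)

# Keep only the signals that fall inside a store polygon.
visits = gpd.sjoin(
    gps_gdf,
    stores_gdf.to_crs("EPSG:4326"),
    how="inner",
    predicate="within",  # use op="within" on older GeoPandas versions
)
```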
In the first visualization we see the sum of unique visitors across all stores for each day. We can detect some anomalies where we have a very low number of unique visitors to all stores compared to the rest of the days, on the following dates:
- 2021-01-01 corresponds to Friday and the first day of the new year
- 2021-01-03 corresponds to Sunday
- 2021-01-10 corresponds to Sunday
- 2021-01-17 corresponds to Sunday
So on Sundays we have the lowest number of visitors or consumers at the stores.
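A possible way to reproduce these daily totals, assuming the joined `visits` frame from the sketch above and hypothetical `timestamp` and `device_id` columns:

```python
import pandas as pd

# Unique devices per calendar day across all stores.
visits["date"] = pd.to_datetime(visits["timestamp"]).dt.date
daily_visitors = (
    visits.groupby("date")["device_id"]
    .nunique()
    .reset_index(name="unique_visitors")
)

# Plot the daily totals to spot the low-traffic days (e.g. Sundays, New Year's Day).
daily_visitors.plot(x="date", y="unique_visitors", kind="line")
```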
In the second visualization, where we have a more detailed view, we see that on the mentioned dates (2021-01-01/03/10/17) fast food businesses (like McDonald's and Burger King) are an exception to the rule and keep their unique visitor counts close to their average, unlike hypermarket and retail businesses (like Aldi, Kaufland, Rewe).
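The per-store view can be sketched in a similar way, again assuming a hypothetical `store_name` column carried over from the store polygons:

```python
# Unique devices per store per day.
per_store_daily = (
    visits.groupby(["date", "store_name"])["device_id"]
    .nunique()
    .reset_index(name="unique_visitors")
)

# Compare each day's count with the store's own average across all days;
# ratios close to 1 on 2021-01-01/03/10/17 would indicate fast-food-like behaviour.
store_avg = per_store_daily.groupby("store_name")["unique_visitors"].transform("mean")
per_store_daily["ratio_to_average"] = per_store_daily["unique_visitors"] / store_avg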
The visualization of the distribution of all the GPS signals looks like a polygon, which means the device locations are not random, so we can predict the various locations for consumer behaviour or the best places to open new stores and attract more visitors.
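A quick sketch of this point-cloud view, assuming the `gps_gdf` frame from the join sketch above:

```python
import matplotlib.pyplot as plt

# Scatter all GPS points; a low alpha makes the dense, diamond-shaped area stand out.
fig, ax = plt.subplots(figsize=(10, 10))
gps_gdf.plot(ax=ax, markersize=0.5, alpha=0.3)
ax.set_title("Distribution of GPS signals")
plt.show()
```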
In this part we observe the locations of the stores, specifically in Berlin, which allows us to interpret the variation in the number of unique visitors to each store through its relation to the store's distance from the centre of Berlin, or to the diamond shape of the GPS visualization.
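One way to quantify this relation is to measure each store's distance from an approximate Berlin centre point and compare it with the store's average unique visitors; the centre coordinates, CRS choice and column names below are assumptions, not values from the notebooks:

```python
import geopandas as gpd
from shapely.geometry import Point

# Approximate Berlin centre (lon, lat), reprojected to a metric CRS (UTM zone 33N).
center = (
    gpd.GeoSeries([Point(13.405, 52.52)], crs="EPSG:4326")
    .to_crs("EPSG:25833")
    .iloc[0]
)

# Distance of each store polygon's centroid from the centre, in kilometres.
stores_m = stores_gdf.to_crs("EPSG:25833")
stores_m["distance_to_center_km"] = stores_m.geometry.centroid.distance(center) / 1000

# Relate distance to the average daily unique visitors per store.
avg_visitors = (
    per_store_daily.groupby("store_name")["unique_visitors"]
    .mean()
    .rename("avg_unique_visitors")
    .reset_index()
)
stores_m = stores_m.merge(avg_visitors, on="store_name")
print(stores_m[["distance_to_center_km", "avg_unique_visitors"]].corr())
```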