This project aims to perform feature extraction followed by PCA on large datasets using Spark in the cloud.
Updated Mar 14, 2024 - Jupyter Notebook
With online sales gaining popularity, tech companies are exploring ways to improve their sales by analyzing customer behavior and gaining insights about product trends. Furthermore, the websites make it easier for customers to find the products they require without much scavenging.
This project provides a detailed overview of creating an automated data engineering pipeline using Airflow, AWS services, Spark, Snowflake, and Tableau.
Data Modeling with Spark for a data lake hosted on S3
Preventing churn is key to improving revenue for Sparkify, a subscription-based company (fictitious). This project is to analyze data from Sparkify to build a model to predict user churn. First, a sample dataset (128MB) was used on a local machine to explore relevant features and develop a working model. Then similar steps were used to develop a…
Parsing the common crawl database using Scala and Spark
Elastic Data Factory
The goal of this repo is to analyze Amazon's digital products from different perspectives using AWS EMR.
Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS using a Spotinst AWS MrScaler resource
Implemented the PageRank algorithm in the Hadoop MapReduce framework and in Spark.
A robust data pipeline leveraging Amazon EMR and PySpark, orchestrated with Apache Airflow for efficient batch processing.
This repository aims to capture and clean data from the Twitter API in order to perform sentiment analysis on an EMR cluster.
Building deployment pipelines with GitHub Actions to provision version-controlled infrastructure on AWS with Terraform. Technologies used: writing in Delta format, Lambda Function, Kinesis Streaming, S3, Athena, Glue, and EMR.
Explore and replicate Amazon EMR (Elastic MapReduce) setup and utilization for big data processing and analytics tasks, featuring comprehensive demonstrations from VPC creation to Spark job execution.
Automate Amazon EMR clusters using Lambda for streamlined and scalable data processing workflows. Unlock the full potential of your data pipeline with LambdaEMR Automator.