As part of the Big Data and AI Engineering Onsite Bootcamp, we were asked to deliver a solution for the MENA market built with Big Data and AI tools. The project has to have an impact and solve a real-world problem using MENA datasets.
Table of Contents
This is the overview of the project's structure and files for easier navigation.
├── README.md
├── Deserts_Dashboard.pdf
├── presentation.pdf
├── Notebooks
│ ├── CapstoneProject2_Preprocessing_Notebook.ipynb
│ ├── CapstoneProject2_EDA_Notebook.zip
│ └── CapstoneProject2_ML_Notebook.ipynb
└── Datasets (too big to be uploaded)
  ├── High_School_Public_Results_2022_EG_both_attempts.csv (original dataset)
  └── capstone_project2_preprocessed_eng.csv (output of the preprocessing notebook: used for the EDA, dashboard, and machine learning models)
The goal of this project is to predict whether a student can enroll in one of Egypt's public institutions, based on their current major and a few extracted attributes. The data is a third-year secondary school dataset that was web scraped from Egypt's standardized test results.
- Preprocessing
- Feature Engineering
- Feature Selection
- Labeling and classifying the data
- Exploratory Data Analysis
- Data Visualization
- Machine Learning
- Oversampling
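As an illustration of the oversampling step, here is a minimal sketch of random oversampling in plain Python. It is not the project's actual implementation: the label name `can_enroll`, the toy rows, and the helper `random_oversample` are all hypothetical. The idea is to duplicate randomly chosen rows of the smaller class until every class is as large as the biggest one.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    matches the size of the largest class."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: "can_enroll" is an assumed label name.
toy_rows = (
    [{"branch": "science", "can_enroll": 1} for _ in range(6)]
    + [{"branch": "literature", "can_enroll": 0} for _ in range(2)]
)
balanced_rows = random_oversample(toy_rows, "can_enroll")
class_counts = Counter(row["can_enroll"] for row in balanced_rows)
```

Oversampling is normally applied to the training split only, so that duplicated rows never end up in the evaluation data.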
- Python, Jupyter
- Pandas
- Plotly
- Power BI
- Pig (Big Data tool)
- PySpark Machine Learning (Big Data tool)
This dataset provides the public exam results of Egyptian students. It includes each student's desk identifier number during the exam (unique across Egypt), their gender and school name, the administration and city their school belongs to, and how many test attempts they had. For each attempt, it lists all the courses they can take depending on their branch and the score they achieved in each course. Most courses count toward the total score, except for three: religion, national education, and economics and statistics. The dataset consists of 45 features and 683k records, covering a single year, 2022.
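To make the total-score rule concrete, the following sketch sums a student's course scores while skipping the three non-counted courses. The column names and the example scores are assumptions made for illustration; the real dataset's column names may differ.

```python
# Courses that do not count toward the total, per the dataset description.
# The exact key names are assumptions for illustration.
NON_COUNTED_COURSES = {
    "religion",
    "national_education",
    "economics_and_statistics",
}


def total_score(course_scores):
    """Sum the scores of counted courses.

    Courses the student did not take are expected to be None and are skipped.
    """
    return sum(
        score
        for course, score in course_scores.items()
        if course not in NON_COUNTED_COURSES and score is not None
    )


# Hypothetical student record for one attempt.
example_student = {
    "arabic": 60.0,
    "english": 40.5,
    "history": 50.0,
    "physics": None,             # not part of this student's branch
    "religion": 20.0,            # excluded from the total
    "national_education": 18.0,  # excluded from the total
}
example_total = total_score(example_student)
```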
The main challenge is that the features most helpful for predicting our target are the grades themselves, and these cannot be used because they would create data leakage in our model. To build a solid prediction, we need to extract more features from the remaining columns, e.g. the school name.
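A minimal sketch of this idea, assuming English school names and a hand-picked keyword list: grade columns are dropped to avoid leakage, and a few binary features are derived from the school name. The keywords, feature names, and record fields below are hypothetical and are not taken from the project notebooks.

```python
# Hypothetical keyword lists; real school names may need different keywords.
SCHOOL_KEYWORDS = {
    "is_private": ("private",),
    "is_language_school": ("language", "languages"),
    "is_girls_school": ("girls",),
    "is_boys_school": ("boys",),
}


def school_name_features(school_name):
    """Return 0/1 flags indicating which keywords appear in the school name."""
    words = set(school_name.lower().split())
    return {
        feature: int(any(keyword in words for keyword in keywords))
        for feature, keywords in SCHOOL_KEYWORDS.items()
    }


def build_model_row(record, grade_columns):
    """Drop grade columns (they would leak the target) and add
    features extracted from the school name."""
    row = {key: value for key, value in record.items() if key not in grade_columns}
    row.update(school_name_features(record["school_name"]))
    return row


# Hypothetical record and grade column names.
example_record = {
    "school_name": "Al Nour Private Language School for Girls",
    "gender": "F",
    "branch": "science",
    "arabic": 70.0,
    "total": 350.0,
}
model_row = build_model_row(example_record, grade_columns={"arabic", "total"})
```

Keyword matching is only one option; the same pattern works for any feature that can be computed from non-grade columns.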
At the beginning of our analysis, we raised some questions that we intend to answer using our EDA, dashboard visualization, and modeling. The questions are:
- How many branches are there? Do grades differ based on the branch?
- Has Egypt achieved the perfect normal distribution for the grading curve?
- Were there any unusual cases that happened to students during their exams?
- What exactly happens if a student fails or misses their exam? Are they given another chance? And does their score improve once they get a second chance?
- For people with disabilities, what is their gender? How many can join the university? What are their grades? Do they have more second attempts? More unusual cases?
- For Egypt and Saudi Arabia, do we have the same schooling system? How do schools handle disabled students?
- Do we have gender equality in our schools?
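One rough way to approach the normal-distribution question above is to compute the skewness and excess kurtosis of the total scores, both of which are close to zero for normally distributed data. The sketch below uses only the standard library and toy numbers. It is not a formal normality test and is not necessarily how the EDA notebook answers the question.

```python
import statistics


def skewness_and_excess_kurtosis(scores):
    """Return (skewness, excess kurtosis) using population moments.

    Both values are near 0 for a normal distribution.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    m2 = sum((x - mean) ** 2 for x in scores) / n  # variance
    if m2 == 0:
        raise ValueError("scores have zero variance")
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Toy symmetric sample, used only to exercise the function.
toy_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_skew, toy_kurt = skewness_and_excess_kurtosis(toy_scores)
```

A clearly negative skewness would indicate a long tail of low scores, while a positive one would indicate a long tail of high scores.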
Preprocessing is the essence of this project. In this README file, we give an overview of each step. For a more detailed description, visit our Medium Blog Post.
Before the Preprocessing:
After the Preprocessing:
General Preprocessing steps:
Saudi dashboards:
Egypt dashboards:
All of these models were evaluated in order to choose the best one.
For model selection, Gradient Boost is the best model since it has the highest accuracy. This is the result after optimization.
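The selection rule described here, picking the model with the highest accuracy, can be sketched as follows. The model names, labels, and predictions are invented purely to exercise the function; they are not the project's results, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best_model(y_true, predictions_by_model):
    """Score every model by accuracy and return (best_name, all_scores)."""
    scores = {
        name: accuracy(y_true, preds)
        for name, preds in predictions_by_model.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Toy labels and predictions; not project results.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
toy_predictions = {
    "model_a": [1, 0, 1, 0, 0, 1, 1, 0],
    "model_b": [1, 0, 1, 1, 0, 1, 0, 1],
    "model_c": [0, 0, 1, 0, 0, 1, 1, 0],
}
best_name, scores = select_best_model(y_true, toy_predictions)
```

When the classes are imbalanced, accuracy alone can be misleading, so it may be worth comparing precision, recall, or F1 alongside it.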
Team Leader: Reema Alaswad (Reema's LinkedIn)
Name | LinkedIn |
---|---|
Raghad Aleisa | Raghad's LinkedIn |
AlJohara Alkanhal | AlJohara's LinkedIn |
Maha AlHazzani | Maha's LinkedIn |
Eman Aldosari | Eman's LinkedIn |