📚 Educational Landscape Project

As part of the Big Data and AI Engineering Onsite Bootcamp, we were asked to address a real-world problem in the MENA market using Big Data and AI tools. The project has to be impactful and built on MENA datasets.

Table of Contents

Project Overview
Business Objective
- Methods Used
- Technologies
Dataset Overview
Preprocessing Overview
Visualization
Modeling Results
Contributing Members Contact
Acknowledgments

Project Overview

This is the overview of the project's structure and files for easier navigation.

├── README.md
├── Deserts_Dashboard.pdf
├── presentation.pdf
├── Notebooks
│   ├── CapstoneProject2_Preprocessing_Notebook.ipynb
│   ├── CapstoneProject2_EDA_Notebook.zip
│   └── CapstoneProject2_ML_Notebook.ipynb
└── Datasets (too big to be uploaded)
    ├── High_School_Public_Results_2022_EG_both_attempts.csv (original dataset) 
    └── capstone_project2_preprocessed_eng.csv (output of the preprocessing notebook; used for the EDA, dashboard, and machine learning models)

(back to top)

Business Objective

The goal of this project is to predict whether a student can enroll in one of Egypt's public institutions, based on their current major and a few extracted attributes. The data is a third-year secondary school dataset that was web-scraped from Egypt's standardized test results.

Methods Used

Preprocessing
Feature Engineering
Feature Selection
Labeling and classifying the data
Exploratory Data Analysis
Data Visualization
Machine Learning
Oversampling
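The README does not say which oversampling technique was used. As a rough illustration only, the sketch below shows plain random oversampling, where minority-class rows are duplicated until every class is as large as the biggest one. The function name, the `label_key` parameter, and the toy rows are assumptions made for this example, not part of the project code.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=0):
    """Duplicate minority-class rows until all classes have equal size.

    This is generic random oversampling, shown as an assumption; the
    project may have used a different technique.
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k may be 0).
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy data: 5 rows of class 1 and 2 rows of class 0 (made-up values).
rows = [{"label": 1}] * 5 + [{"label": 0}] * 2
balanced = random_oversample(rows, "label")
print(Counter(row["label"] for row in balanced))
```

Oversampling is normally applied to the training split only, so that duplicated rows do not end up in the evaluation data.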

Technologies

Python, Jupyter
Pandas
Plotly
Power BI
Pig (Big Data tool)
PySpark Machine Learning (Big Data tool)

(back to top)

Dataset Overview

This dataset provides the public exam results of Egyptian students. It includes each student's desk identifier number during the exam (unique across Egypt), their gender and school name, the administration and city their school belongs to, and how many test attempts they had. For each attempt, it lists all the courses the student can take depending on their branch, along with the score achieved in each course. Most courses count toward the total score, except for three: religion, national education, and economics and statistics. The dataset consists of 45 features and 683k records, all from a single year, 2022.
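To make the total-score rule concrete, here is a minimal sketch that sums a student's course scores while skipping the three courses that do not count. The course keys and the sample scores are hypothetical; the real column names in the dataset may differ.

```python
# Hypothetical course keys; the dataset's actual column names may differ.
EXCLUDED_COURSES = {"religion", "national_education", "economics_statistics"}


def total_score(course_scores):
    """Sum the scores of all courses that count toward the total.

    Courses the student did not take are represented as None and skipped.
    """
    return sum(
        score
        for course, score in course_scores.items()
        if course not in EXCLUDED_COURSES and score is not None
    )


# Made-up example: religion is excluded, physics was not taken.
student = {"arabic": 60.0, "english": 40.5, "religion": 20.0, "physics": None}
print(total_score(student))  # 100.5
```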

Dataset link

However, the problem is challenging: the features most informative for our target are the grades themselves, which cannot be used because they would introduce data leakage into our model. To build a solid prediction, we need to extract more features from the existing columns, e.g. the school name.
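A minimal sketch of this idea follows: grade columns are dropped from the model inputs, and a few boolean features are derived from the school name instead. The column names, the keywords, and the sample record are all illustrative assumptions; the features actually extracted in the preprocessing notebook may be different.

```python
# All column names and keywords below are illustrative assumptions.
GRADE_COLUMNS = {"arabic", "english", "total_score"}


def school_name_features(school_name):
    """Derive simple boolean features from keywords in the school name."""
    name = school_name.lower()
    return {
        "school_is_private": "private" in name,
        "school_is_girls_only": "girls" in name,
        "school_is_boys_only": "boys" in name,
    }


def build_features(record):
    """Return model inputs with grade columns removed to avoid leakage."""
    features = {
        key: value
        for key, value in record.items()
        if key not in GRADE_COLUMNS and key != "school_name"
    }
    features.update(school_name_features(record["school_name"]))
    return features


# Made-up record for demonstration.
record = {
    "gender": "F",
    "school_name": "Example Private Girls School",
    "arabic": 55,
    "total_score": 300,
}
print(build_features(record))
```

Keeping the list of grade columns in one place makes it easier to verify that nothing derived from the exam scores reaches the model.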

At the beginning of our analysis, we raised some questions that we intend to answer using our EDA, dashboard visualization, and modeling. The questions are:

How many branches are there? Do grades differ by branch?
Has Egypt achieved a perfect normal distribution for the grading curve?
Were there any unusual cases that happened to students during their exams?
What exactly happens if a student fails or misses their exam? Are they given another chance? And does their score improve once they get a second chance?
For people with disabilities: What is their gender distribution? How many can join university? What are their grades? Do they have more second attempts? More unusual cases?
For Egypt and Saudi Arabia: Do we have the same schooling system? How do schools handle disabled students?
Do we have gender equality in our schools?

(back to top)

Preprocessing Overview

Preprocessing is the core of this project. This README gives an overview of each step; for a more detailed description, see our Medium blog post.

Before the Preprocessing:

After the Preprocessing:

General Preprocessing steps:

Visualization

Saudi dashboards:

Egypt dashboards:

(back to top)

Modeling Results

All of these models were evaluated in order to choose the best one.

Gradient Boost was selected as the best model because it achieved the highest accuracy. This is the result after optimization.
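Selecting a model by accuracy can be sketched as follows. The model names and predictions here are made-up placeholders used only to show the comparison; they are not the project's models or results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def best_model(y_true, predictions):
    """Return the name of the highest-accuracy model and all scores."""
    scores = {
        name: accuracy(y_true, y_pred)
        for name, y_pred in predictions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Placeholder labels and predictions (not real project results).
y_true = [1, 0, 1, 1]
predictions = {
    "model_a": [1, 0, 0, 0],
    "model_b": [1, 0, 1, 0],
}
name, scores = best_model(y_true, predictions)
print(name, scores)
```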

(back to top)

Contributing Members Contact

Team Leader: Reema Alaswad (Reema's LinkedIn)

Other Members:

Name	LinkedIn
Raghad Aleisa	Raghad's LinkedIn
AlJohara Alkanhal	AlJohara's LinkedIn
Maha AlHazzani	Maha's LinkedIn
Eman Aldosari	Eman's LinkedIn

(back to top)

Acknowledgments

(back to top)
