Dengue fever is a public health concern in tropical and subtropical regions. Early detection of outbreaks can help health officials take proactive measures to contain their impact.
The goal of this project, as part of the DengAI competition, is to develop a predictive model that accurately forecasts the total number of dengue fever cases in San Juan and Iquitos based on historical, climatic, and environmental data.
To predict the total_cases label for each (city, year, weekofyear), we will work with the following datasets, which can be downloaded from the competition:
- Training features: dengue_features_train.csv
- Training labels: dengue_labels_train.csv
- Test features for prediction: dengue_features_test.csv
- The format for submission: submission_format.csv
The data cover two cities: San Juan, Puerto Rico and Iquitos, Peru. The training labels file gives the total_cases label for each (city, year, weekofyear) combination matching the climate features in dengue_features_train.csv. The test data contains the same climate features for each city, spanning 5 years for San Juan and 3 years for Iquitos.
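For illustration, loading and joining the training files might look like this (a minimal sketch; the file paths and the variable names features, labels, and df are our assumptions, while (city, year, weekofyear) is the join key stated above):

```python
import pandas as pd

# Load the training features and labels, then join them on the
# (city, year, weekofyear) key shared by both files.
features = pd.read_csv('dengue_features_train.csv')
labels = pd.read_csv('dengue_labels_train.csv')
df = features.merge(labels, on=['city', 'year', 'weekofyear'])
```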
Each features file contains the following columns:

city, year, weekofyear, week_start_date, ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw, precipitation_amt_mm, reanalysis_air_temp_k, reanalysis_avg_temp_k, reanalysis_dew_point_temp_k, reanalysis_max_air_temp_k, reanalysis_min_air_temp_k, reanalysis_precip_amt_kg_per_m2, reanalysis_relative_humidity_percent, reanalysis_sat_precip_amt_mm, reanalysis_specific_humidity_g_per_kg, reanalysis_tdtr_k, station_avg_temp_c, station_diur_temp_rng_c, station_max_temp_c, station_min_temp_c, station_precip_mm
- city – City abbreviations: sj for San Juan and iq for Iquitos
- week_start_date – Date given in yyyy-mm-dd format
- ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw – Normalized difference vegetation index (NDVI) for the pixels to the northeast, northwest, southeast, and southwest of the city centroid
- station_max_temp_c – Maximum temperature (°C)
- station_min_temp_c – Minimum temperature (°C)
- station_avg_temp_c – Average temperature (°C)
- station_precip_mm – Total precipitation (mm)
- station_diur_temp_rng_c – Diurnal temperature range (°C)
- precipitation_amt_mm – Total precipitation (mm)
- reanalysis_sat_precip_amt_mm – Total precipitation (mm)
- reanalysis_dew_point_temp_k – Mean dew point temperature (K)
- reanalysis_air_temp_k – Mean air temperature (K)
- reanalysis_relative_humidity_percent – Mean relative humidity (%)
- reanalysis_specific_humidity_g_per_kg – Mean specific humidity (g/kg)
- reanalysis_precip_amt_kg_per_m2 – Total precipitation (kg/m²)
- reanalysis_max_air_temp_k – Maximum air temperature (K)
- reanalysis_min_air_temp_k – Minimum air temperature (K)
- reanalysis_avg_temp_k – Average air temperature (K)
- reanalysis_tdtr_k – Diurnal temperature range (K)
The training labels file (dengue_labels_train.csv) contains the following columns:

city, year, weekofyear, total_cases
Before you begin, ensure you have the following installed:
- Conda
- Kedro

If you haven't installed Kedro or Conda, please follow their official installation instructions.
To set up the project, follow these steps:

First, clone the repository to your local machine:

```bash
git clone https://github.com/johnheusinger/dengue_fever.git
cd dengue_fever
```

Create and activate a new Conda environment:

```bash
conda create --name your_env_name python=3.12
conda activate your_env_name
```

Install the required packages from the requirements.txt file:

```bash
pip install -r requirements.txt
```

With your Conda environment activated and dependencies installed, you can run the project using Kedro:

```bash
kedro run
```
Getting familiar with the data:
- Checking for seasonality
- Checking for missing values
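The counts below were presumably produced with a one-liner along these lines (a sketch; df is the merged training data from the loading snippet above):

```python
# Count missing values per column in the training data
print(df.isnull().sum())
```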
```
year                                       0
weekofyear                                 0
week_start_date                            0
ndvi_ne                                  194
ndvi_nw                                   52
ndvi_se                                   22
ndvi_sw                                   22
precipitation_amt_mm                      13
reanalysis_air_temp_k                     10
reanalysis_avg_temp_k                     10
reanalysis_dew_point_temp_k               10
reanalysis_max_air_temp_k                 10
reanalysis_min_air_temp_k                 10
reanalysis_precip_amt_kg_per_m2           10
reanalysis_relative_humidity_percent      10
reanalysis_sat_precip_amt_mm              13
reanalysis_specific_humidity_g_per_kg     10
reanalysis_tdtr_k                         10
station_avg_temp_c                        43
station_diur_temp_rng_c                   43
station_max_temp_c                        20
station_min_temp_c                        14
station_precip_mm                         22
total_cases                                0
```
We used the KNNImputer to fill in the missing values in the dataset.
```python
from sklearn.impute import KNNImputer

# KNNImputer handles numeric data only, so non-numeric columns
# (city, week_start_date) are left untouched.
numeric_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```
The idea is to find, for each sample with a missing value, the 5 most similar rows (by Euclidean distance over the observed features) and fill the gap with the average of their values for that feature.
Going forward, it would be interesting to try different values of n_neighbors, as well as other imputation methods, to see whether they improve the model.
We decided to impute the missing values first and only then detect and handle outliers, on the assumption that the outliers still carry valuable information that the imputation, and possibly the model, could benefit from.
We used the z-score method to detect outliers in the dataset.
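As a rough sketch of the z-score approach (the cutoff of 3 standard deviations is our illustrative assumption, not necessarily the threshold used in the project):

```python
import numpy as np

# A value is flagged as an outlier when it lies more than 3 standard
# deviations from its column mean; rows containing any such value
# can then be inspected, removed, or kept.
z_scores = np.abs((df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std())
outlier_rows = z_scores.gt(3).any(axis=1)
```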
Overall we got mixed results: one team member found that removing outliers improved the model, while the other found that keeping them did. More testing is needed to establish whether or not to keep outliers.
We used the MaxAbsScaler to scale the data.
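A minimal sketch of the scaling step (assuming X holds the feature matrix):

```python
from sklearn.preprocessing import MaxAbsScaler

# MaxAbsScaler divides each feature by its maximum absolute value,
# mapping every column into the [-1, 1] range without shifting it.
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
```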
Although the competition provides no validation set, we split the training dataset into training and validation subsets so we could compare models and parameters.
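The split itself can be done with scikit-learn (a sketch: the 80/20 ratio and fixed random_state are our assumptions, and since the data are weekly time series, a chronological split would arguably leak less information than a random one):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data as a validation set
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)
```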
We tried three models to see which one performs best:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

models = [RandomForestRegressor(), LinearRegression(), SVR()]
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f'{model} MSE: {mean_squared_error(y_test, y_pred)}')
```
The MSE for each model was as follows:

```
RandomForestRegressor() MSE: 406.71694216027873
LinearRegression() MSE: 466.33090156794424
SVR() MSE: 515.257572877534
```
We used GridSearchCV to find the best hyperparameters for the RandomForestRegressor model.
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid,
                           cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```
Results of the GridSearchCV:

```
{'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 200}
```
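With the tuned model, a submission can then be assembled roughly as follows (a sketch: X_test_features is a hypothetical name for the test features after the same imputation and scaling pipeline, and the rounding step mirrors the note in the backlog below):

```python
import numpy as np
import pandas as pd

# GridSearchCV refits the best estimator on X_train by default
best_model = grid_search.best_estimator_

submission = pd.read_csv('submission_format.csv')
# total_cases must be a non-negative integer count
submission['total_cases'] = np.round(best_model.predict(X_test_features)).astype(int)
submission.to_csv('submission.csv', index=False)
```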
Our backlog of assumptions & experiment ideas:
- Cyclical encoding for the week of the year (see the sketch after this list) - IMPROVED
- Rolling averages (rainfall, temperature, etc.) as new features - DID NOT IMPROVE
- Keeping rather than removing outliers - MIXED RESULTS
- Splitting by city & building a separate model for each city - DID NOT IMPROVE
- Trying pyCaret - PENDING
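The cyclical encoding that improved our score can be sketched as follows (assuming weekofyear runs from 1 to 52; some ISO years have 53 weeks, which would slightly change the divisor):

```python
import numpy as np

# Map weekofyear onto the unit circle so that week 52 and week 1
# end up close together instead of 51 weeks apart.
df['week_sin'] = np.sin(2 * np.pi * df['weekofyear'] / 52)
df['week_cos'] = np.cos(2 * np.pi * df['weekofyear'] / 52)
```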
Other factors that may have (accidentally) improved the model (still to be tested):
- Merging the train and test datasets before imputation, which potentially improved imputation and thus model performance
- Rounding the target variable rather than directly converting the outcome to an integer (which truncates)
Final competition results:
- Best score: 25.3413
- Rank: 1,335