This is a repository for the work KidSat: satellite imagery to map childhood poverty.
The Demographic and Health Surveys (DHS) program gathers and shares vital data on population, health, and nutrition in developing countries to inform public health policies. Their collection procedures and methods are listed here.
To access DHS data, please follow these steps:
-
Register for DHS Access:
- Visit the registration page here and apply for access to the DHS data.
-
Obtain the Data for the Following Countries and Years
For the following countries and years, select ALL STATA and Geographic Data.

| Country | Year(s) |
| --- | --- |
| Zambia | 2007, 2013, 2018 |
| Malawi | 2000, 2004, 2010, 2015 |
| Uganda | 2000, 2006, 2011, 2016 |
| Comoros | 2012 |
| Tanzania | 1999, 2010, 2015, 2022 |
| Kenya | 2003, 2008, 2014, 2022 |
| Angola | 2015 |
| Ethiopia | 2000, 2005, 2011, 2016, 2019 |
| Rwanda | 2005, 2007, 2010, 2014, 2019 |
| Lesotho | 2004, 2009, 2014 |
| Madagascar | 1997, 2008, 2021 |
| Zimbabwe | 1999, 2005, 2010, 2015 |
| Burundi | 2010, 2016 |
| Mozambique | 2011 |
| Eswatini | 2006 |
| South Africa | 2016 |

The folders should be unzipped and stored in survey_processing/dhs_data/ (e.g. survey_processing/dhs_data/ should contain subfolders such as "ET_20XX_DHS_XXX...").
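Before running anything else, it can help to confirm that the unzipped survey folders ended up where expected. The snippet below is a minimal sketch, not part of the repository: the helper names `list_survey_folders` and `count_by_country_prefix` are hypothetical, and it assumes that each subfolder name begins with a two-letter country code, as in the "ET_20XX_DHS_XXX..." example.

```python
from pathlib import Path


def list_survey_folders(dhs_root):
    """Return the sorted names of the immediate subfolders of dhs_root."""
    root = Path(dhs_root)
    if not root.is_dir():
        raise FileNotFoundError(f"DHS data folder not found: {root}")
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def count_by_country_prefix(folder_names):
    """Count folders per two-character prefix.

    Assumption: the first two characters are the country code,
    e.g. 'ET' in 'ET_20XX_DHS_XXX...'.
    """
    counts = {}
    for name in folder_names:
        prefix = name[:2].upper()
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts


if __name__ == "__main__":
    root = Path("survey_processing/dhs_data")
    if root.is_dir():
        counts = count_by_country_prefix(list_survey_folders(root))
        for prefix, n in sorted(counts.items()):
            print(prefix, n)
    else:
        print(f"{root} does not exist yet")
```

Comparing the per-prefix counts against the table of countries and years is a quick way to spot a survey that was missed or not unzipped.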
This section provides step-by-step instructions on how to use this repository to achieve its intended functionality.
Before you start, make sure you have registered a Google Earth Engine project for academic purposes. You will need your project name to query the API. The sign-up page is here.
-
Set Up Environment
Example:
pip install -r requirements.txt
-
Configuration
You need to set your Google Earth Engine project name in imagery_scraping/config/google_config.json. The format (for me) was ee-YOUR_GMAIL_NAME. Please do not push your project name to GitHub.
-
Query File (Optional)
The file imagery_scraping/config/query.json contains an example of how to query imagery. You need to provide the latitude and longitude in WGS84 format. In our work, we mainly use shapefiles from DHS directly.
-
Running the Application
You first need to go to the imagery_scraping directory. An example of usage is shown below:
python main.py "config/query.csv" "EarthImagery" 2021 "L8" -r 5
It will prompt you to authenticate with Google. If all goes well, it will download the images to your Google Drive under a folder called EarthImagery. The images will be collected from the 2021 Landsat 8 dataset and will be centered on the coordinates you provided in the query file, with a 5 km square window.
If you have a shapefile from DHS, you can also use, for example,
python main.py "ETGE81FL" "Ethiopia2021Imagery" 2021 "S2" -r 5
to extract the imagery.
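To make the idea of a square window around a WGS84 coordinate concrete, here is a rough sketch of how such a window could be computed. It is an illustration only and is not taken from the repository: the function names and the constant are hypothetical, the conversion from kilometres to degrees is the usual spherical approximation, and the function takes the half-side of the square explicitly because this README does not say whether the value passed to -r is the full side or the half-side.

```python
import math

# Approximate length of one degree of latitude in kilometres (assumption:
# spherical Earth, good enough for windows a few kilometres wide).
KM_PER_DEGREE_LAT = 111.32


def validate_wgs84(lat, lon):
    """Raise ValueError if the coordinate is outside the WGS84 ranges."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return lat, lon


def square_window(lat, lon, half_side_km):
    """Return (min_lon, min_lat, max_lon, max_lat) for a square centered on (lat, lon).

    The approximation degrades close to the poles, where a degree of
    longitude shrinks towards zero length.
    """
    validate_wgs84(lat, lon)
    dlat = half_side_km / KM_PER_DEGREE_LAT
    dlon = half_side_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


if __name__ == "__main__":
    # Hypothetical coordinate, used only to show the call.
    print(square_window(9.0, 38.7, half_side_km=2.5))
```

Validating the coordinates before submitting a query catches the common mistake of swapping latitude and longitude, at least when the swapped latitude falls outside the range of -90 to 90.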
-
Visualization (Optional)
To see the imagery, you need to download the imagery data from Google Drive first. We provide sample data in
imagery_scraping/data
and a notebook to see the imagery you queried in true color. Note that this is only a visualization; the original data is much richer and contains more than the three RGB channels. For training, we should use the original data instead of the true-color image alone.
-
Getting All Imagery
We recommend using this notebook to download all imagery and keep track of progress, as GEE has an upper limit of 3000 simultaneous jobs. You will need to download the imagery and save it to an accessible location (referred to as path_to_parent_imagery_folder in later sections). Each of its subdirectories should be named country code + year + source (e.g. ET2019S2 for Ethiopia 2019 Sentinel 2). The notebook should already format the exports using this naming convention.
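The naming convention above can be captured in two small helpers, sketched below. They are hypothetical and not part of the repository. The only folder name confirmed by this README is ET2019S2; the pattern assumes that every source code is one letter followed by one digit (so a Landsat 8 folder would end in L8), which you should check against the folders the notebook actually produces.

```python
import re

# Assumption: two-letter country code, four-digit year, then a source code
# made of one letter and one digit (e.g. "S2").
_FOLDER_PATTERN = re.compile(r"^([A-Z]{2})(\d{4})([A-Z]\d)$")


def imagery_folder_name(country_code, year, source):
    """Build a folder name such as 'ET2019S2' from its three parts."""
    return f"{country_code.upper()}{int(year):04d}{source.upper()}"


def parse_imagery_folder_name(name):
    """Split a folder name into (country_code, year, source)."""
    match = _FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected imagery folder name: {name!r}")
    country, year, source = match.groups()
    return country, int(year), source


if __name__ == "__main__":
    name = imagery_folder_name("ET", 2019, "S2")
    print(name, parse_imagery_folder_name(name))
```

Running `parse_imagery_folder_name` over every subdirectory of the parent imagery folder is a cheap way to catch misnamed exports before training starts.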
Collect all DHS data to survey_processing/dhs_data
. The following command
python survey_processing/main.py survey_processing/dhs_data
creates 5 splits of the training and test data for the spatial analysis and a before/after 2020 split for the temporal analysis.
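For readers unfamiliar with the two kinds of split, the sketch below illustrates the general idea on plain Python records. It is not the logic of survey_processing/main.py, which this README does not describe: the function names, the "year" key, the random fold assignment, and the choice to place the year 2020 itself in the "after" group are all assumptions made for illustration (none of the surveys listed above took place in 2020, so the last choice does not affect them).

```python
import random


def temporal_split(records, cutoff_year=2020):
    """Split records into (before, after) by survey year.

    Assumption: records from cutoff_year itself go into the 'after' group.
    """
    before = [r for r in records if r["year"] < cutoff_year]
    after = [r for r in records if r["year"] >= cutoff_year]
    return before, after


def k_fold_splits(records, k=5, seed=0):
    """Return k (train, test) pairs; each record is in exactly one test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    # Synthetic records, used only to show the calls.
    demo = [{"id": i, "year": 2015 + (i % 8)} for i in range(20)]
    before, after = temporal_split(demo)
    print(len(before), len(after))
    for fold, (train, test) in enumerate(k_fold_splits(demo), start=1):
        print(fold, len(train), len(test))
```

A split intended for spatial analysis would typically group nearby survey clusters into the same fold rather than shuffling records independently; the random assignment here only shows the train/test bookkeeping.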
The MOSAIKS features were extracted using the IDinsight package. A notebook is provided in this repository for getting all features for MOSAIKS.
After having the splits in survey_processing/processed_data
, you can finetune DINOv2 using the following commands. For the spatial experiment with Landsat imagery, you can use the following code.
python modelling/dino/finetune_spatial.py --fold 1 --model_name dinov2_vitb14 --imagery_path {path_to_parent_imagery_folder} --batch_size 8 --imagery_source L --num_epochs 20
To finetune on Sentinel imagery, the usual command is
python modelling/dino/finetune_spatial.py --fold 1 --model_name dinov2_vitb14 --imagery_path {path_to_parent_imagery_folder} --batch_size 1 --imagery_source S --num_epochs 10
Note that to get a cross-validated result, you should run the command for each of folds 1 to 5.
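Running all five folds by hand is easy to get wrong, so a small driver can help. The sketch below is not part of the repository: it only assembles the Landsat command shown above for folds 1 to 5, using the same flags and values, and the helper names are hypothetical. With `dry_run=True` (the default) it prints the commands instead of executing them.

```python
import subprocess
import sys


def finetune_spatial_command(fold, imagery_path, imagery_source="L",
                             batch_size=8, num_epochs=20):
    """Build the argument list for one fold of spatial finetuning."""
    return [
        sys.executable, "modelling/dino/finetune_spatial.py",
        "--fold", str(fold),
        "--model_name", "dinov2_vitb14",
        "--imagery_path", str(imagery_path),
        "--batch_size", str(batch_size),
        "--imagery_source", imagery_source,
        "--num_epochs", str(num_epochs),
    ]


def run_all_folds(imagery_path, dry_run=True, **kwargs):
    """Run (or just print) the finetuning command for folds 1 to 5."""
    commands = [finetune_spatial_command(fold, imagery_path, **kwargs)
                for fold in range(1, 6)]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop as soon as one fold fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Replace the placeholder with your own parent imagery folder.
    run_all_folds("path_to_parent_imagery_folder")
```

For Sentinel imagery, the values from the Sentinel command above can be passed through, for example `run_all_folds(path, imagery_source="S", batch_size=1, num_epochs=10)`.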
For temporal finetuning, the command for Landsat is
python modelling/dino/finetune_temporal.py --model_name dinov2_vitb14 --imagery_path {path_to_parent_imagery_folder} --batch_size 8 --imagery_source L
and replace L with S for Sentinel finetuning.
For evaluation, make sure that all five finetuned spatial models for folds 1-5 (or the finetuned temporal model for temporal evaluation) are in modelling/dino/model
and run
python modelling/dino/evaluate.py --use_checkpoint --imagery_path {path_to_parent_imagery_folder} --imagery_source L --mode spatial
Change --mode to temporal for temporal evaluation, and change L to S to evaluate on Sentinel imagery.
Remove --use_checkpoint to evaluate the raw DINO models.
To run the finetuning process, you first need to download the checkpoints for fMoW-SatMAE non-temporal or temporal. Then run the following:
python -m modelling.satmae.satmae_finetune --pretrained_ckpt $CHECKPOINT_PATH --dhs_path ./survey_processing/processed_data/train_fold_1.csv --output_dir $OUTPUT_DIR --imagery_path $IMAGERY_PATH
Arguments:
- --pretrained_ckpt: Checkpoint of pretrained SatMAE model.
- --imagery_path: Path to imagery folder
- --dhs_path: Path to DHS .csv file
- --output_path: Path to export the output. A unique subdirectory will be created.
- --batch_size
- --random_seed
- --sentinel: Landsat is used by default. Turn this on to use Sentinel imagery
- --temporal: Add this flag to use the temporal mode
- --epochs: Number of epochs
- --stopping_delta: Delta for early stopping
- --stopping_patience: Early stopping patience
- --loss: Either l1 (default) or l2.
- --lr: Learning rate
- --weight_decay: Weight decay for Adam optimizer
- --enable_profiling: Enable reporting of loading/inference time.
Evaluation consists of 2 steps: exporting the model output and performing Ridge Regression. Since exporting the model output is expensive, we split these steps into 2 separate modules.
To carry out the first step, edit the file modelling/satmae/satmae_finetune and change the SATMAE_PATHS variable accordingly. Each entry gives a model checkpoint to evaluate (or None to use the pretrained checkpoint) along with its fold (1-5). The entries need not be in any particular order, and you do not need to include every fold, but the script caches the data from different folds in memory, which significantly reduces the time spent loading and preprocessing the satellite images.
python -m modelling.satmae.satmae_finetune --output_dir $OUTPUT_DIR --imagery_path $IMAGERY_PATH
Arguments:
- --imagery_path: Path to imagery folder
- --output_path: Path to export the output. A unique subdirectory will be created.
- --batch_size
- --sentinel: Landsat is used by default. Turn this on to use Sentinel imagery
- --temporal: Add this flag to use the temporal mode
This will export the data as NumPy arrays in .npy files in the output location. Each array has the shape (num_samples, 1025). The first 1024 columns (i.e. arr[:, :1024]) are the predicted feature vector from the model, and the last column (i.e. arr[:, 1024]) is the target. You can then adapt the script modelling/satmae/eval_dhs.py to conduct Ridge Regression or more advanced regression.