Creating tools to handle raster and vector data to split it into small pieces equaled in size for machine learning datasets

How To Use

Install docker https://docs.docker.com/engine/install/ (macos, Windows or Linux)
Clone the Repository :

git clone https://github.com/Youssef-Harby/split-rs-data.git
Go to project directory :

cd split-rs-data
Copy and paste your raster(.tif) and vector(.shp) files into a seperated folders :


./split-rs-data/DataSet/  # Dataset root directory
|--raster  # Original raster data
|  |--xxx1.tif (xx1.png)
|  |--...
|  └--...
|
|--vector # All shapefiles in the same place (.shx, .shp..etc)
|  |--xxx1.shp
|  |--xxx1.shx / .prj  / .cpg / .dbf ....
|  └--xxx2.shp

Build the docke image : docker compose up --build
go to http://127.0.0.1:8888/
you will find your token in the cli of the image.
Open Tutorial.ipynb to learn
Or define your vector and raster folders in multi_raster_vector.py file and run it in docker by open cli and type :

python multi_raster_vector.py

TODO

Creating Docker Image for development env.
Splitting raster data into equal pieces with rasterio (512×512) thanks to @geoyee.
Splitting raster data into equal pieces with GDAL , https://gdal.org/.
Rasterize shapefile to raster in the same satellite pixel size and projection.
Convert 24 or 16 bit raster to 8 bit.
Export as jpg (for raster) and png (for rasterized shapefile) with GDAL.
Validation of training and testing datasets for paddlepaddle.
GUI
QGIS Plugin ➡️ Deep Learning Datasets Maker

Code In Detail ⬇️

First - Prepareing Datasets

1.Convert Vector to Raster (Rasterize) with reference coordinate system from raster tiff

all these tools made for prepare data for paddlepaddlea.

from osgeo import gdal, ogr

fn_ras = Input raster data (GTiff)
fn_vec = input vector data (Shapefile)

fn_ras = 'DataSet/raster/01/01.tif'
fn_vec = 'DataSet/vector/01/01.shp'
output = 'DataSet/results/lab_all_values.tif'

import the GDAL driver "ESRI Shapefile" to open the shapefile

driver = ogr.GetDriverByName("ESRI Shapefile")

open raster and shapefile datasets with (shapefile , 1)

(shapefile , 1) read and write in the shapefile
(shapefile , 0) read onle the shapefile

ras_ds = gdal.Open(fn_ras)
vec_ds = driver.Open(fn_vec, 1)

Get the :

GetLayer (Only shapefiles have one lyrs other fomates maybe have multi-lyrs) #VECTOR
GetGeoTransform #FROM RASTER
GetProjection #FROM RASTER

lyr = vec_ds.GetLayer()
geot = ras_ds.GetGeoTransform()
proj = ras_ds.GetProjection() # Get the projection from original tiff (fn_ras)
geot

(342940.8074133941,
 0.18114600000000536,
 0.0,
 3325329.401211367,
 0.0,
 -0.1811459999999247)

Open the shapefile feature to edit in it

layerdefinition = lyr.GetLayerDefn()
feature = ogr.Feature(layerdefinition)

feature.GetFieldIndex make you to know the id of a specific field name you want to read/edit/delete

Also you can list all fields on the shapefile by :

schema = []
    for n in range(layerdefinition.GetFieldCount()):
        fdefn = layerdefinition.GetFieldDefn(n)
        schema.append(fdefn.name)

Then I will delete the field called "MLDS" has been assumed by me

yy = feature.GetFieldIndex("MLDS")
if yy < 0:
    print("MLDS field not found, we will create one for you and make all values to 1")
else:
    lyr.DeleteField(yy)

add new field to the shapefile with a default value "1" and don't forget to close feature after the edits

new_field = ogr.FieldDefn("MLDS", ogr.OFTInteger)
lyr.CreateField(new_field)
for feature in lyr:
        feature.SetField("MLDS", 1)
        lyr.SetFeature(feature)
        feature = None

Set the projection from original tiff (fn_ras) to the rasterized tiff

drv_tiff = gdal.GetDriverByName("GTiff")
chn_ras_ds = drv_tiff.Create(
        output, ras_ds.RasterXSize, ras_ds.RasterYSize, 1, gdal.GDT_Byte)
chn_ras_ds.SetGeoTransform(geot)
chn_ras_ds.SetProjection(proj)
chn_ras_ds.FlushCache()

gdal.RasterizeLayer(chn_ras_ds, [1], lyr, burn_values=[1], options=["ATTRIBUTE=MLDS"])
chn_ras_ds = None
vec_ds = None

DONE

Second - Splitting raster and rasterized files to small tiles 512×512 depends on your memory

ds = gdal.Open(fn_ras)
gt = ds.GetGeoTransform()

get coordinates of upper left corner

xmin = gt[0]
ymax = gt[3]
resx = gt[1]
res_y = gt[5]
resy = abs(res_y)

import math
import os.path as osp

the tile size i want (may be 256×256 for smaller memory size)

needed_out_x = 512
needed_out_y = 512

round up to the nearest int

xnotround = ds.RasterXSize / needed_out_x
xround = math.ceil(xnotround)
ynotround = ds.RasterYSize / needed_out_y
yround = math.ceil(ynotround)

print(xnotround)
print(xround)
print(ynotround)
print(yround)

9.30078125
10
5.689453125
6

pixel to meter - 512×10×0.18

pixtomX = needed_out_x * xround * resx
pixtomy = needed_out_y * yround * resy

print (pixtomX)
print (pixtomy)

927.4675200000274
556.4805119997686

size of a single tile

xsize = pixtomX / xround
ysize = pixtomy / yround

print (xsize)
print (ysize)

92.74675200000274
92.74675199996143

create lists of x and y coordinates

xsteps = [xmin + xsize * i for i in range(xround + 1)]
ysteps = [ymax - ysize * i for i in range(yround + 1)]
xsteps

[342940.8074133941,
 343033.5541653941,
 343126.3009173941,
 343219.0476693941,
 343311.7944213941,
 343404.54117339413,
 343497.28792539414,
 343590.03467739414,
 343682.78142939415,
 343775.5281813941,
 343868.2749333941]

set the output path

cdpath = "DataSet/image/"

loop over min and max x and y coordinates

for i in range(xround):
    for j in range(yround):
        xmin = xsteps[i]
        xmax = xsteps[i + 1]
        ymax = ysteps[j]
        ymin = ysteps[j + 1]

        # gdal translate to subset the input raster

        gdal.Translate(osp.join(cdpath,  \
                        (str("01") + "-" + str(j) + "-" + str(i) + "." + "jpg")), 
                ds, 
                projWin=(abs(xmin), abs(ymax), abs(xmax), abs(ymin)),
                xRes=resx, 
                yRes=-resy, 
                outputType=gdal.gdalconst.GDT_Byte, 
                format="JPEG")
ds = None

Third - Spilit Custom Dataset and Generate File List

For all data that is not divided into training set, validation set, and test set, PaddleSeg provides a script to generate segmented data and generate a file list.

Use scripts to randomly split the custom dataset proportionally and generate a file list

The data file structure is as follows:

./DataSet/  # Dataset root directory
|--image  # Original image catalog
|  |--xxx1.jpg (xx1.png)
|  |--...
|  └--...
|
|--label  # Annotated image catalog
|  |--xxx1.png
|  |--...
|  └--...

Among them, the corresponding file name can be defined according to needs.

The commands used are as follows, which supports enabling specific functions through different Flags.

python tools/split_dataset_list.py <dataset_root> <images_dir_name> <labels_dir_name> ${FLAGS}

Parameters:

dataset_root: Dataset root directory
images_dir_name: Original image catalog
labels_dir_name: Annotated image catalog

FLAGS:

FLAG	Meaning	Default	Parameter numbers
--split	Dataset segmentation ratio	0.7 0.3 0	3
--separator	File list separator	"	"
--format	Data format of pictures and label sets	"jpg" "png"	2
--label_class	Label category	'__background__' '__foreground__'	several
--postfix	Filter pictures and label sets according to whether the main file name (without extension) contains the specified suffix	"" ""（2 null characters）	2

After running, train.txt, val.txt, test.txt and labels.txt will be generated in the root directory of the dataset.

Note: Requirements for generating the file list: either the original image and the number of annotated images are the same, or there is only the original image without annotated images. If the dataset lacks annotated images, a file list without separators and annotated image paths will be generated.

Example

python tools/split_dataset_list.py <dataset_root> images annotations --split 0.6 0.2 0.2 --format jpg png

Dataset file organization

If you need to use a custom dataset for training, it is recommended to organize it into the following structure: custom_dataset | |--images | |--image1.jpg | |--image2.jpg | |--... | |--labels | |--label1.png | |--label2.png | |--... | |--train.txt | |--val.txt | |--test.txt

The contents of train.txt and val.txt are as follows:

image/image1.jpg label/label1.png
image/image2.jpg label/label2.png
...

Full Docs : https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.3/docs/data/custom/data_prepare.md

import sys
import subprocess

theproc = subprocess.Popen([
"python", 
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\split_dataset_list.py", #Split text py script
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet",  # Root DataSet ath
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet\image",  #images path
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet\label", 
# "--split", 
# "0.6",  # 60% training
# "0.2",  # 20% validating
# "0.2",  # 20% testing
"--format", 
"jpg", 
"png"])
theproc.communicate()

(None, None)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Creating tools to handle raster and vector data to split it into small pieces equaled in size for machine learning datasets

How To Use

TODO

Code In Detail ⬇️

First - Prepareing Datasets

1.Convert Vector to Raster (Rasterize) with reference coordinate system from raster tiff

Second - Splitting raster and rasterized files to small tiles 512×512 depends on your memory

Third - Spilit Custom Dataset and Generate File List

Use scripts to randomly split the custom dataset proportionally and generate a file list

Example

Dataset file organization

Files

README.md

Latest commit

History

README.md

File metadata and controls

Creating tools to handle raster and vector data to split it into small pieces equaled in size for machine learning datasets

How To Use

TODO

Code In Detail ⬇️

First - Prepareing Datasets

1.Convert Vector to Raster (Rasterize) with reference coordinate system from raster tiff

Second - Splitting raster and rasterized files to small tiles 512×512 depends on your memory

Third - Spilit Custom Dataset and Generate File List

Use scripts to randomly split the custom dataset proportionally and generate a file list

Example

Dataset file organization