Is it possible to split that dataset within your code? #1579

CPor99 · 2020-12-02T12:50:37Z

Instead of giving the train: and val: paths in the .yaml file, I added a dataset: attribute that contains the whole dataset. I then wrote a simple line of code that splits the dataset using sklearn, but this failed with an error in train.py when create_dataloader is called. My question: is it possible to split the dataset using ready-made functions?

glenn-jocher · 2020-12-02T13:04:03Z

@CPor99 sure. You can split a dataset automatically using autosplit() in utils/datasets, then you simply point your data.yaml to the new autosplit_*.txt files.

yolov5/utils/datasets.py

Lines 918 to 933 in 2c99560

    
    def autosplit(path='../coco128', weights=(0.9, 0.1, 0.0)):  # from utils.datasets import *; autosplit('../coco128')
        """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files
        # Arguments
            path:       Path to images directory
            weights:    Train, val, test weights (list)
        """
        path = Path(path)  # images dir
        files = list(path.rglob('*.*'))
        n = len(files)  # number of files
        indices = random.choices([0, 1, 2], weights=weights, k=n)  # assign each image to a split
        txt = ['autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt']  # 3 txt files
        [(path / x).unlink() for x in txt if (path / x).exists()]  # remove existing
        for i, img in tqdm(zip(indices, files), total=n):
            if img.suffix[1:] in img_formats:
                with open(path / txt[i], 'a') as f:
                    f.write(str(img) + '\n')  # add image to txt file
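
A note on how the split sizes come out: random.choices draws a split index independently for each file, so the resulting proportions only approximate the weights. A minimal sketch with 1,000 hypothetical file names (the names and the seed are illustrative assumptions, not part of the YOLOv5 code):

```python
import random
from collections import Counter

weights = (0.9, 0.1, 0.0)  # train, val, test weights, as in autosplit()
files = [f"img_{i:04d}.jpg" for i in range(1000)]  # hypothetical file names

random.seed(0)  # illustrative seed so the draw is repeatable
indices = random.choices([0, 1, 2], weights=weights, k=len(files))

counts = Counter(indices)  # how many files landed in each split
print({name: counts[i] for i, name in enumerate(("train", "val", "test"))})
```

The train and val counts should land near 900 and 100 but are usually not exact, and a split with weight 0.0 receives no files.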

CPor99 · 2020-12-02T13:42:19Z

@glenn-jocher Nice. Was this method added recently? I don't have it in my version.

glenn-jocher · 2020-12-02T14:02:52Z

Update your code, changes are pushed daily.

CPor99 · 2020-12-02T15:15:09Z

@glenn-jocher Thanks! One last question, where are the autosplit_*.txt files saved after using the autosplit()?

CPor99 · 2020-12-02T15:22:15Z

Never mind, I found them. Thanks for your help @glenn-jocher, much appreciated!

glenn-jocher · 2020-12-02T15:22:49Z

As stated in the comment section of the function:

 def autosplit(path='../coco128', weights=(0.9, 0.1, 0.0)):  # from utils.datasets import *; autosplit('../coco128') 
     """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files

lonnylundsten · 2021-07-19T22:00:30Z

For others who may be looking for code to implement autosplit:

from utils.datasets import *
autosplit('./Dataset', weights=(0.8, 0.2, 0.0))

glenn-jocher · 2021-07-20T08:57:13Z

@lonnylundsten autosplit is currently a two-step process:

Run autosplit function:

yolov5/utils/datasets.py

Lines 818 to 840 in 0cc7c58

    
    def autosplit(path='../datasets/coco128/images', weights=(0.9, 0.1, 0.0), annotated_only=False):
        """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files
        Usage: from utils.datasets import *; autosplit()
        Arguments
            path:            Path to images directory
            weights:         Train, val, test weights (list, tuple)
            annotated_only:  Only use images with an annotated txt file
        """
        path = Path(path)  # images dir
        files = sum([list(path.rglob(f"*.{img_ext}")) for img_ext in IMG_FORMATS], [])  # image files only
        n = len(files)  # number of files
        random.seed(0)  # for reproducibility
        indices = random.choices([0, 1, 2], weights=weights, k=n)  # assign each image to a split
        txt = ['autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt']  # 3 txt files
        [(path.parent / x).unlink(missing_ok=True) for x in txt]  # remove existing
        print(f'Autosplitting images from {path}' + ', using *.txt labeled images only' * annotated_only)
        for i, img in tqdm(zip(indices, files), total=n):
            if not annotated_only or Path(img2label_paths([str(img)])[0]).exists():  # check label
                with open(path.parent / txt[i], 'a') as f:
                    f.write('./' + img.relative_to(path.parent).as_posix() + '\n')  # add image to txt file

Update your data.yaml to point to your new *.txt files generated in Step 1.

lonnylundsten · 2021-07-20T17:31:14Z

Running in google colab, I had to remove '.unlink(missing_ok=True)' from line 833 for this to work.

Otherwise, I would get the following error: TypeError: unlink() got an unexpected keyword argument 'missing_ok'

glenn-jocher · 2021-07-22T11:27:58Z

@lonnylundsten I think the Colab pathlib may be out of date then.
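
For context, the missing_ok argument of Path.unlink() was added in Python 3.8, so on an older interpreter the keyword raises the TypeError quoted above. A minimal workaround sketch, assuming you only need to delete a file if it is present (remove_if_exists is a hypothetical helper name, not part of YOLOv5):

```python
from pathlib import Path


def remove_if_exists(file_path):
    """Delete file_path if it exists; also works on Python versions before 3.8."""
    file_path = Path(file_path)
    try:
        file_path.unlink()  # no missing_ok keyword, so this runs on Python 3.7 too
    except FileNotFoundError:
        pass  # nothing to remove
```

Inside autosplit(), the cleanup line could then loop over the txt names and call remove_if_exists(path.parent / x) for each one. Upgrading the runtime to Python 3.8 or newer should avoid the need for this.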

cyndyNKCM · 2021-08-19T17:01:17Z

Hello, I removed 'unlink(missing_ok=True)' but I still get the error: unlink() got an unexpected keyword argument 'missing_ok'
What can I do to get rid of this?

glenn-jocher · 2021-08-23T11:04:00Z

@cyndyNKCM 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

✅ Minimal – Use as little code as possible that still produces the same problem
✅ Complete – Provide all parts someone else needs to reproduce your problem in the question itself
✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

✅ Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
✅ Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

gembancud · 2021-11-07T09:25:08Z

@glenn-jocher

Update your data.yaml to point to your new *.txt files generated in Step 1.

How would I update my data.yaml to point to the generated autosplit?

Edit:
I didn't read thoroughly. It's here for anyone else:

yolov5/data/coco.yaml

Lines 11 to 14 in b8f979b

    
    path: ../datasets/coco  # dataset root dir
    train: train2017.txt  # train images (relative to 'path') 118287 images
    val: val2017.txt  # val images (relative to 'path') 5000 images
    test: test-dev2017.txt  # 20288 of 40670 images, submit to https://competitions.codalab.org/competitions/20794

sushant097 · 2023-04-12T08:36:21Z

Here is the complete code for Autosplit of data:

import os
import random
from pathlib import Path

from tqdm import tqdm


DATASETS_DIR = Path("dataset")

IMG_FORMATS = 'bmp', 'dng', 'jpeg', 'jpg', 'mpo', 'png', 'tif', 'tiff', 'webp', 'pfm'  # include image suffixes

def img2label_paths(img_paths):
    # Define label paths as a function of image paths
    sa, sb = f'{os.sep}images{os.sep}', f'{os.sep}labels{os.sep}'  # /images/, /labels/ substrings
    return [sb.join(x.rsplit(sa, 1)).rsplit('.', 1)[0] + '.txt' for x in img_paths]



def autosplit(path=DATASETS_DIR / 'images', weights=(0.9, 0.1, 0.0), annotated_only=False):
    """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files
    Usage: from utils.dataloaders import *; autosplit()
    Arguments
        path:            Path to images directory
        weights:         Train, val, test weights (list, tuple)
        annotated_only:  Only use images with an annotated txt file
    """
    path = Path(path)  # images dir
    files = sorted(x for x in path.rglob('*.*') if x.suffix[1:].lower() in IMG_FORMATS)  # image files only
    n = len(files)  # number of files
    random.seed(0)  # for reproducibility
    indices = random.choices([0, 1, 2], weights=weights, k=n)  # assign each image to a split

    txt = ['autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt']  # 3 txt files, one per index in [0, 1, 2]
    for x in txt:
        if (path.parent / x).exists():
            (path.parent / x).unlink()  # remove existing

    print(f'Autosplitting images from {path}' + ', using *.txt labeled images only' * annotated_only)
    for i, img in tqdm(zip(indices, files), total=n):
        if not annotated_only or Path(img2label_paths([str(img)])[0]).exists():  # check label
            with open(path.parent / txt[i], 'a') as f:
                f.write(f'./{img.relative_to(path.parent).as_posix()}' + '\n')  # add image to txt file


autosplit()

glenn-jocher · 2023-04-12T13:34:59Z

@sushant097 This is a great code snippet for implementing autosplit in YOLOv5. To update your data.yaml with the new splits, you would need to change the "train" and "val" fields in the data.yaml file as follows:

Set the "train" field to the path of the newly generated autosplit_train.txt file.
Set the "val" field to the path of the newly generated autosplit_val.txt file.

Here's an example of what those fields might look like:

train: path/to/autosplit_train.txt
val: path/to/autosplit_val.txt

Make sure to save your changes to the data.yaml file, and then you should be able to use the new splits to train your YOLOv5 model.
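
Before training, it may be worth checking that every entry in the generated files resolves to a real image. A minimal sketch, assuming the entries are written relative to the dataset root as ./images/... (as in the autosplit() code quoted earlier in this thread); check_split_file and the dataset path in the usage lines are hypothetical:

```python
from pathlib import Path


def check_split_file(txt_file, root=None):
    """Return (n_entries, missing) for one autosplit_*.txt file.

    Entries are resolved relative to root, which defaults to the
    directory that contains txt_file.
    """
    txt_file = Path(txt_file)
    root = Path(root) if root is not None else txt_file.parent
    entries = [line.strip() for line in txt_file.read_text().splitlines() if line.strip()]
    missing = [e for e in entries if not (root / e).is_file()]
    return len(entries), missing


if __name__ == "__main__":
    for name in ("autosplit_train.txt", "autosplit_val.txt"):
        split = Path("dataset") / name  # hypothetical dataset root
        if split.exists():
            n, missing = check_split_file(split)
            print(f"{name}: {n} entries, {len(missing)} missing")
```

If the missing list is not empty, the path: entry in data.yaml or the location of the txt files probably needs adjusting.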

Yuri-Njathi · 2024-03-13T15:59:08Z

Is there a function that can split the dataset into folders as required by yolov8 as well as generate the yaml file?

glenn-jocher · 2024-03-15T05:41:42Z

@Yuri-Njathi hello! 😊 There isn't a built-in function that directly splits the dataset into folders and generates a YAML file for YOLOv8. However, you can easily achieve this with a few lines of Python code. Here's a basic outline:

Use the autosplit function from YOLOv5's utils/datasets.py (utils/dataloaders.py in newer versions) for the initial split.
Move the split datasets into their respective folders.
Generate a YAML file by creating a Python script that writes the necessary paths and settings to a .yaml file.
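
The outline above can be sketched roughly as follows. This is an illustration under assumptions, not an Ultralytics function: it assumes the autosplit_*.txt files sit in the dataset root and list images as ./images/<name>, that labels live in a sibling labels/ directory, and that an images/<split> plus labels/<split> layout with path, train, val and names keys is what your training setup expects. split_into_folders is a hypothetical name, and the sketch copies files rather than moving them so the original dataset stays intact.

```python
import shutil
from pathlib import Path


def split_into_folders(root, out, class_names, splits=("train", "val", "test")):
    """Copy images (and labels, if present) into per-split folders and write data.yaml.

    Assumes unique image file names; files with the same name in different
    subfolders would overwrite each other in the flat output folders.
    """
    root, out = Path(root), Path(out)
    out.mkdir(parents=True, exist_ok=True)
    written = {}
    for split in splits:
        txt = root / f"autosplit_{split}.txt"
        if not txt.exists():
            continue  # autosplit only creates files for splits that received images
        img_dir = out / "images" / split
        lbl_dir = out / "labels" / split
        img_dir.mkdir(parents=True, exist_ok=True)
        lbl_dir.mkdir(parents=True, exist_ok=True)
        for line in txt.read_text().splitlines():
            entry = line.strip()
            if not entry:
                continue
            img = root / entry  # e.g. root/images/a.jpg
            shutil.copy2(img, img_dir / img.name)
            label = root / "labels" / img.relative_to(root / "images").with_suffix(".txt")
            if label.exists():
                shutil.copy2(label, lbl_dir / label.name)
        written[split] = f"images/{split}"

    # Write a small YAML file by hand; assumes plain class names that need no quoting
    lines = [f"path: {out.resolve().as_posix()}"]
    lines += [f"{split}: {rel}" for split, rel in written.items()]
    lines.append("names:")
    lines += [f"  {i}: {name}" for i, name in enumerate(class_names)]
    yaml_path = out / "data.yaml"
    yaml_path.write_text("\n".join(lines) + "\n")
    return yaml_path
```

A call might look like split_into_folders("dataset", "dataset_split", ["cat", "dog"]), where both directory names and the class names are placeholders.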

I hope this helps you get started! If you have more questions, feel free to ask.

a-sajjad72 · 2024-07-16T19:04:05Z


@sushant097 Great snippet, but it does not distribute the dataset evenly across the train, validation and test splits. In other words, there is no support for stratification.
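
One way to approximate stratification is to group images by a single representative class and split each group separately. The sketch below is an illustration under assumptions rather than part of YOLOv5: it takes a precomputed mapping from image path to class id (for detection data with several classes per image you would have to pick one, for example the most frequent class in the label file), and stratified_split and split_counts are hypothetical names.

```python
import random
from collections import defaultdict


def split_counts(n, weights):
    """Turn weights into integer split sizes that sum to n (largest-remainder rounding)."""
    total = sum(weights)
    exact = [n * w / total for w in weights]
    counts = [int(x) for x in exact]  # round down first
    # hand the leftover items to the splits with the largest fractional part
    order = sorted(range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[: n - sum(counts)]:
        counts[i] += 1
    return counts


def stratified_split(image_classes, weights=(0.8, 0.1, 0.1), seed=0):
    """Split images into train/val/test so that each class follows weights as closely as possible.

    image_classes: dict mapping image path -> class id (one class per image).
    Returns a dict with keys 'train', 'val', 'test' mapping to lists of paths.
    """
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    by_class = defaultdict(list)
    for img in sorted(image_classes):
        by_class[image_classes[img]].append(img)

    splits = {"train": [], "val": [], "test": []}
    for cls in sorted(by_class):
        imgs = by_class[cls]
        rng.shuffle(imgs)
        n_train, n_val, _ = split_counts(len(imgs), weights)
        splits["train"] += imgs[:n_train]
        splits["val"] += imgs[n_train:n_train + n_val]
        splits["test"] += imgs[n_train + n_val:]
    return splits
```

The resulting lists could then be written to autosplit_train.txt, autosplit_val.txt and autosplit_test.txt in the same ./images/... format that autosplit() uses.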

CPor99 added the question Further information is requested label Dec 2, 2020

CPor99 closed this as completed Dec 3, 2020

randy-seng mentioned this issue Feb 7, 2022

Split custom dataset in train, val, test. OpenTrafficCam/OTLabels#19

Open
