Is it possible to split the dataset within your code? #1579

Closed
CPor99 opened this issue Dec 2, 2020 · 18 comments
Labels
question Further information is requested

Comments

@CPor99

CPor99 commented Dec 2, 2020

I wrote a few lines of code so that, instead of giving the train: and val: paths in the .yaml file, I use a single dataset: attribute that contains the whole dataset. I then split the dataset with sklearn, but this failed with an error in train.py when create_dataloader is called. Is it possible to split the dataset using ready-made functions?

@CPor99 CPor99 added the question Further information is requested label Dec 2, 2020
@glenn-jocher
Member

glenn-jocher commented Dec 2, 2020

@CPor99 sure. You can split a dataset automatically using autosplit() in utils/datasets, then you simply point your data.yaml to the new autosplit_*.txt files.

yolov5/utils/datasets.py

Lines 918 to 933 in 2c99560

def autosplit(path='../coco128', weights=(0.9, 0.1, 0.0)):  # from utils.datasets import *; autosplit('../coco128')
    """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files
    # Arguments
        path:       Path to images directory
        weights:    Train, val, test weights (list)
    """
    path = Path(path)  # images dir
    files = list(path.rglob('*.*'))
    n = len(files)  # number of files
    indices = random.choices([0, 1, 2], weights=weights, k=n)  # assign each image to a split
    txt = ['autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt']  # 3 txt files
    [(path / x).unlink() for x in txt if (path / x).exists()]  # remove existing
    for i, img in tqdm(zip(indices, files), total=n):
        if img.suffix[1:] in img_formats:
            with open(path / txt[i], 'a') as f:
                f.write(str(img) + '\n')  # add image to txt file
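
Note that autosplit() draws a split for each image independently with random.choices, so the weights act as probabilities and the resulting split sizes are only approximately 90/10. A minimal sketch of that behaviour follows; count_splits is a hypothetical helper name used only for illustration, and the seed is arbitrary.

```python
import random
from collections import Counter


def count_splits(n, weights=(0.9, 0.1, 0.0), seed=0):
    """Count how many of n items land in each split (0=train, 1=val, 2=test).

    Hypothetical helper for illustration: it mirrors the random.choices call
    inside autosplit() but touches no files.
    """
    rng = random.Random(seed)  # local generator, leaves the global seed alone
    indices = rng.choices([0, 1, 2], weights=weights, k=n)
    counts = Counter(indices)
    return [counts.get(i, 0) for i in range(3)]


if __name__ == "__main__":
    # Small datasets can deviate noticeably from the requested ratio.
    for n in (10, 128, 10_000):
        print(n, count_splits(n))
```

A split with weight 0.0 stays empty, which is why the default weights produce no test list.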

@CPor99
Author

CPor99 commented Dec 2, 2020

@glenn-jocher Nice. Was this method added recently? I don't have it in my version.

@glenn-jocher
Member

Update your code, changes are pushed daily.

@CPor99
Author

CPor99 commented Dec 2, 2020

@glenn-jocher Thanks! One last question, where are the autosplit_*.txt files saved after using the autosplit()?

@CPor99
Author

CPor99 commented Dec 2, 2020

Never mind, I found them. Thanks for your help @glenn-jocher, much appreciated!

@glenn-jocher
Member

As stated in the comment section of the function:

 def autosplit(path='../coco128', weights=(0.9, 0.1, 0.0)):  # from utils.datasets import *; autosplit('../coco128') 
     """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files 

@CPor99 CPor99 closed this as completed Dec 3, 2020
@lonnylundsten

lonnylundsten commented Jul 19, 2021

For others who may be looking for code to implement autosplit:

from utils.datasets import *
autosplit('./Dataset', weights=(0.8, 0.2, 0.0))

@glenn-jocher
Member

@lonnylundsten autosplit is currently a two-step process:

  1. Run autosplit function:

    yolov5/utils/datasets.py

    Lines 818 to 840 in 0cc7c58

    def autosplit(path='../datasets/coco128/images', weights=(0.9, 0.1, 0.0), annotated_only=False):
        """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files
        Usage: from utils.datasets import *; autosplit()
        Arguments
            path:            Path to images directory
            weights:         Train, val, test weights (list, tuple)
            annotated_only:  Only use images with an annotated txt file
        """
        path = Path(path)  # images dir
        files = sum([list(path.rglob(f"*.{img_ext}")) for img_ext in IMG_FORMATS], [])  # image files only
        n = len(files)  # number of files
        random.seed(0)  # for reproducibility
        indices = random.choices([0, 1, 2], weights=weights, k=n)  # assign each image to a split
        txt = ['autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt']  # 3 txt files
        [(path.parent / x).unlink(missing_ok=True) for x in txt]  # remove existing
        print(f'Autosplitting images from {path}' + ', using *.txt labeled images only' * annotated_only)
        for i, img in tqdm(zip(indices, files), total=n):
            if not annotated_only or Path(img2label_paths([str(img)])[0]).exists():  # check label
                with open(path.parent / txt[i], 'a') as f:
                    f.write('./' + img.relative_to(path.parent).as_posix() + '\n')  # add image to txt file
  2. Update your data.yaml to point to your new *.txt files generated in Step 1.

@lonnylundsten

lonnylundsten commented Jul 20, 2021

Running in google colab, I had to remove '.unlink(missing_ok=True)' from line 833 for this to work.

Otherwise, I would get the following error: TypeError: unlink() got an unexpected keyword argument 'missing_ok'

@glenn-jocher
Member

@lonnylundsten I think the Colab pathlib may be out of date then.
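
For context, the missing_ok argument of Path.unlink() was added in Python 3.8, and pathlib ships with the interpreter, so the TypeError above points to an older Python version rather than a separately installed package. A minimal sketch of a workaround that behaves the same on older interpreters follows; remove_if_exists is a hypothetical helper name, not part of YOLOv5.

```python
from pathlib import Path


def remove_if_exists(p):
    """Delete a file, ignoring the case where it is already gone.

    Hypothetical stand-in for Path.unlink(missing_ok=True) on Python < 3.8.
    """
    try:
        Path(p).unlink()
    except FileNotFoundError:
        pass  # nothing to remove


if __name__ == "__main__":
    # Usage mirroring the cleanup step in autosplit(); 'dataset' is a placeholder.
    for name in ("autosplit_train.txt", "autosplit_val.txt", "autosplit_test.txt"):
        remove_if_exists(Path("dataset") / name)
```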

@cyndyNKCM

Hello, I removed 'unlink(missing_ok=True)' but I still get the error: unlink() got an unexpected keyword argument 'missing_ok'
What can I do to get rid of this?

@glenn-jocher
Member

glenn-jocher commented Aug 23, 2021

@cyndyNKCM 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

@gembancud

gembancud commented Nov 7, 2021

@glenn-jocher

  1. Update your data.yaml to point to your new *.txt files generated in Step 1.

How would I update my data.yaml to point to the generated autosplit?

Edit:
I didn't read thoroughly. It's here for anyone else:

yolov5/data/coco.yaml

Lines 11 to 14 in b8f979b

path: ../datasets/coco # dataset root dir
train: train2017.txt # train images (relative to 'path') 118287 images
val: val2017.txt # val images (relative to 'path') 5000 images
test: test-dev2017.txt # 20288 of 40670 images, submit to https://competitions.codalab.org/competitions/20794

@sushant097

Here is the complete code for Autosplit of data:

from tqdm import tqdm

from pathlib import Path

import random
import os


DATASETS_DIR = Path("dataset")

IMG_FORMATS = 'bmp', 'dng', 'jpeg', 'jpg', 'mpo', 'png', 'tif', 'tiff', 'webp', 'pfm'  # include image suffixes

def img2label_paths(img_paths):
    # Define label paths as a function of image paths
    sa, sb = f'{os.sep}images{os.sep}', f'{os.sep}labels{os.sep}'  # /images/, /labels/ substrings
    return [sb.join(x.rsplit(sa, 1)).rsplit('.', 1)[0] + '.txt' for x in img_paths]



def autosplit(path=DATASETS_DIR / 'images', weights=(0.9, 0.1, 0.0), annotated_only=False):
    """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files
    Usage: from utils.dataloaders import *; autosplit()
    Arguments
        path:            Path to images directory
        weights:         Train, val, test weights (list, tuple)
        annotated_only:  Only use images with an annotated txt file
    """
    path = Path(path)  # images dir
    files = sorted(x for x in path.rglob('*.*') if x.suffix[1:].lower() in IMG_FORMATS)  # image files only
    n = len(files)  # number of files
    random.seed(0)  # for reproducibility
    indices = random.choices([0, 1, 2], weights=weights, k=n)  # assign each image to a split

    txt = ['autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt']  # 3 txt files (index 2 is used when the test weight is non-zero)
    for x in txt:
        if (path.parent / x).exists():
            (path.parent / x).unlink()  # remove existing

    print(f'Autosplitting images from {path}' + ', using *.txt labeled images only' * annotated_only)
    for i, img in tqdm(zip(indices, files), total=n):
        if not annotated_only or Path(img2label_paths([str(img)])[0]).exists():  # check label
            with open(path.parent / txt[i], 'a') as f:
                f.write(f'./{img.relative_to(path.parent).as_posix()}' + '\n')  # add image to txt file


autosplit()

@glenn-jocher
Member

@sushant097 This is a great code snippet for implementing autosplit in YOLOv5. To update your data.yaml with the new splits, you would need to change the "train" and "val" fields in the data.yaml file as follows:

  1. Set the "train" field to the path of the newly generated autosplit_train.txt file.
  2. Set the "val" field to the path of the newly generated autosplit_val.txt file.

Here's an example of what those fields might look like:

train: path/to/autosplit_train.txt
val: path/to/autosplit_val.txt

Make sure to save your changes to the data.yaml file, and then you should be able to use the new splits to train your YOLOv5 model.
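
A minimal sketch of writing such a data.yaml from Python, for readers who prefer to script this step. The helper name write_data_yaml, the dataset root, and the class names are placeholders for illustration; the path/train/val keys follow the coco.yaml excerpt quoted earlier in this thread, and nc/names are assumed to be the usual class-count and class-name fields of a YOLOv5 dataset file.

```python
from pathlib import Path


def write_data_yaml(root, names, out="data.yaml"):
    """Write a small dataset YAML that points at the autosplit_*.txt files.

    Hypothetical helper: root is the directory that holds autosplit_train.txt
    and autosplit_val.txt (the parent of the images directory).
    """
    lines = [
        f"path: {Path(root).as_posix()}  # dataset root dir",
        "train: autosplit_train.txt  # relative to 'path'",
        "val: autosplit_val.txt  # relative to 'path'",
        f"nc: {len(names)}  # number of classes",
        "names: [" + ", ".join(f"'{n}'" for n in names) + "]",
    ]
    out = Path(out)
    out.write_text("\n".join(lines) + "\n")
    return out


if __name__ == "__main__":
    # Placeholder root and class names; replace with your own.
    print(write_data_yaml("dataset", ["cat", "dog"]).read_text())
```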

@Yuri-Njathi

Is there a function that can split the dataset into folders as required by yolov8 as well as generate the yaml file?

@glenn-jocher
Member

@Yuri-Njathi hello! 😊 There isn't a built-in function that directly splits the dataset into folders and generates a YAML file for YOLOv8. However, you can easily achieve this with a few lines of Python code. Here's a basic outline:

  1. Use the autosplit function from YOLOv5's utils/datasets.py for the initial split.
  2. Move the split datasets into their respective folders.
  3. Generate a YAML file by creating a Python script that writes the necessary paths and settings to a .yaml file.

I hope this helps you get started! If you have more questions, feel free to ask.
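
The three-step outline above could be sketched roughly as follows. This is an illustration under stated assumptions, not an Ultralytics function: split_into_folders is a hypothetical helper, the list files are assumed to contain './images/...' paths relative to the dataset root (as written by the autosplit() snippets in this thread), labels are assumed to live in a sibling labels directory, and the YAML keys are assumed to follow the usual path/train/val/names layout.

```python
import shutil
from pathlib import Path


def split_into_folders(root, names, splits=("train", "val", "test")):
    """Copy images and labels listed in root/autosplit_<split>.txt into
    root/images/<split> and root/labels/<split>, then write root/data.yaml.

    Hypothetical helper. Files are copied, not moved, so the originals stay
    intact; images with the same file name in different subfolders would collide.
    """
    root = Path(root)
    used = []
    for split in splits:
        list_file = root / f"autosplit_{split}.txt"
        if not list_file.exists():
            continue  # e.g. no test split was generated
        img_dir = root / "images" / split
        lbl_dir = root / "labels" / split
        img_dir.mkdir(parents=True, exist_ok=True)
        lbl_dir.mkdir(parents=True, exist_ok=True)
        for line in list_file.read_text().splitlines():
            if not line.strip():
                continue
            img = root / line.strip()
            shutil.copy2(img, img_dir / img.name)
            # Matching label: same relative path under labels/, with a .txt suffix
            rel = img.relative_to(root / "images").with_suffix(".txt")
            lbl = root / "labels" / rel
            if lbl.exists():
                shutil.copy2(lbl, lbl_dir / lbl.name)
        used.append(split)
    yaml_lines = [f"path: {root.resolve().as_posix()}"]
    yaml_lines += [f"{s}: images/{s}" for s in used]
    yaml_lines += ["names:"] + [f"  {i}: {n}" for i, n in enumerate(names)]
    (root / "data.yaml").write_text("\n".join(yaml_lines) + "\n")
    return used
```

Run autosplit() first, then call split_into_folders('dataset', ['cat', 'dog']) with your own root directory and class names.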

@a-sajjad72

@sushant097 great snippet, but it does not distribute the classes evenly across the train, validation, and test splits. In other words, there is no support for stratification.
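
A minimal sketch of what a stratified alternative might look like, using only the standard library. It is not part of YOLOv5: stratified_split is a hypothetical helper, and it assumes each image can be represented by ONE class id (for example the first or most frequent class in its label file), which is a simplification because detection images often contain several classes.

```python
import random
from collections import defaultdict


def stratified_split(image_classes, weights=(0.8, 0.1, 0.1), seed=0):
    """Split image ids into (train, val, test) lists, class by class.

    image_classes maps an image id to one representative class id; how that
    class is chosen is an assumption left to the caller.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = defaultdict(list)
    for img, cls in sorted(image_classes.items()):
        by_class[cls].append(img)

    total = sum(weights)
    train, val, test = [], [], []
    for cls in sorted(by_class):
        imgs = by_class[cls]
        rng.shuffle(imgs)
        n = len(imgs)
        # Cumulative cut points keep every image in exactly one split
        cut1 = round(n * weights[0] / total)
        cut2 = round(n * (weights[0] + weights[1]) / total)
        train.extend(imgs[:cut1])
        val.extend(imgs[cut1:cut2])
        test.extend(imgs[cut2:])
    return train, val, test
```

The returned lists can then be written to autosplit_train.txt, autosplit_val.txt and autosplit_test.txt in the same './images/...' format the snippet above uses.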
