Is it possible to split the dataset within your code? #1579
@CPor99 sure. You can split a dataset automatically using autosplit() in utils/datasets, then simply point your data.yaml to the new autosplit_*.txt files. (Lines 918 to 933 in 2c99560)
@glenn-jocher Nice. Has this method been added recently? I don't have it in my version.
Update your code, changes are pushed daily.
@glenn-jocher Thanks! One last question: where are the autosplit_*.txt files saved after running autosplit()?
Nevermind, found them. Thanks for your help @glenn-jocher, much appreciated!
As stated in the comment section of the function:

```python
def autosplit(path='../coco128', weights=(0.9, 0.1, 0.0)):  # from utils.datasets import *; autosplit('../coco128')
    """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files """
```
For others who may be looking for code to implement autosplit: from utils.datasets import *
@lonnylundsten autosplit is a two-step process currently:
Running in Google Colab, I had to remove '.unlink(missing_ok=True)' from line 833 for this to work. Otherwise, I would get the following error: TypeError: unlink() got an unexpected keyword argument 'missing_ok'
@lonnylundsten I think the Colab pathlib may be out of date then.
Hello, I removed 'unlink(missing_ok=True)' but I still get the error: unlink() got an unexpected keyword argument 'missing_ok'
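For context, `Path.unlink(missing_ok=...)` was only added in Python 3.8, so older interpreters raise this TypeError. A version-portable workaround (a sketch, not the upstream fix; `safe_unlink` is a hypothetical helper name) is to guard the call with an existence check instead:

```python
from pathlib import Path


def safe_unlink(p: Path) -> None:
    # Path.unlink(missing_ok=...) requires Python >= 3.8; on older
    # interpreters (e.g. some Colab images), check exists() first so a
    # missing file is silently ignored rather than raising an error.
    if p.exists():
        p.unlink()
```

Replacing the `(path.parent / x).unlink(missing_ok=True)` call with `safe_unlink(path.parent / x)` should behave the same on any Python 3 version.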
@cyndyNKCM 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:
In addition to the above requirements, for Ultralytics to provide assistance your code should be:
If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template, providing a minimum reproducible example to help us better understand and diagnose your problem. Thank you! 😃
How would I update my data.yaml to point to the generated autosplit? Edit: Lines 11 to 14 in b8f979b
Here is the complete code for autosplit of data:

```python
import os
import random
from pathlib import Path

from tqdm import tqdm

DATASETS_DIR = Path("dataset")
IMG_FORMATS = 'bmp', 'dng', 'jpeg', 'jpg', 'mpo', 'png', 'tif', 'tiff', 'webp', 'pfm'  # include image suffixes


def img2label_paths(img_paths):
    # Define label paths as a function of image paths
    sa, sb = f'{os.sep}images{os.sep}', f'{os.sep}labels{os.sep}'  # /images/, /labels/ substrings
    return [sb.join(x.rsplit(sa, 1)).rsplit('.', 1)[0] + '.txt' for x in img_paths]


def autosplit(path=DATASETS_DIR / 'images', weights=(0.9, 0.1, 0.0), annotated_only=False):
    """ Autosplit a dataset into train/val/test splits and save path/autosplit_*.txt files
    Usage: from utils.dataloaders import *; autosplit()
    Arguments
        path:            Path to images directory
        weights:         Train, val, test weights (list, tuple)
        annotated_only:  Only use images with an annotated txt file
    """
    path = Path(path)  # images dir
    files = sorted(x for x in path.rglob('*.*') if x.suffix[1:].lower() in IMG_FORMATS)  # image files only
    n = len(files)  # number of files
    random.seed(0)  # for reproducibility
    indices = random.choices([0, 1, 2], weights=weights, k=n)  # assign each image to a split

    # 3 txt files (the test file is needed if the test weight is nonzero)
    txt = ['autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt']
    for x in txt:
        if (path.parent / x).exists():
            (path.parent / x).unlink()  # remove existing

    print(f'Autosplitting images from {path}' + ', using *.txt labeled images only' * annotated_only)
    for i, img in tqdm(zip(indices, files), total=n):
        if not annotated_only or Path(img2label_paths([str(img)])[0]).exists():  # check label
            with open(path.parent / txt[i], 'a') as f:
                f.write(f'./{img.relative_to(path.parent).as_posix()}' + '\n')  # add image to txt file


autosplit()
```
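To sanity-check the generated files after running autosplit, a small helper can count the entries per split (a sketch; `split_counts` is a hypothetical name, not part of YOLOv5):

```python
from pathlib import Path


def split_counts(root):
    # Count the number of image paths written to each autosplit_*.txt file.
    # A file that was never created (e.g. autosplit_test.txt when the test
    # weight is 0.0) counts as 0.
    counts = {}
    for name in ('autosplit_train.txt', 'autosplit_val.txt', 'autosplit_test.txt'):
        p = Path(root) / name
        counts[name] = len(p.read_text().splitlines()) if p.exists() else 0
    return counts
```

With the default weights of (0.9, 0.1, 0.0), the train count should be roughly nine times the val count, and the test count should be zero.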
@sushant097 This is a great code snippet for implementing autosplit in YOLOv5. To use the new splits, change the "train" and "val" fields in your data.yaml file. Here's an example of what those fields might look like:

train: path/to/autosplit_train.txt
val: path/to/autosplit_val.txt

Make sure to save your changes to the data.yaml file, and then you should be able to use the new splits to train your YOLOv5 model.
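If you prefer to make that edit programmatically, here is a minimal sketch without a YAML dependency (the function name `point_yaml_at_autosplit` is hypothetical; it assumes the `train:` and `val:` keys each appear once at the top level of the file):

```python
from pathlib import Path


def point_yaml_at_autosplit(yaml_path, train_txt, val_txt):
    # Rewrite the top-level train:/val: keys of a data.yaml file to point
    # at the autosplit txt files, leaving every other line untouched.
    lines = Path(yaml_path).read_text().splitlines()
    out = []
    for line in lines:
        if line.startswith('train:'):
            out.append(f'train: {train_txt}')
        elif line.startswith('val:'):
            out.append(f'val: {val_txt}')
        else:
            out.append(line)
    Path(yaml_path).write_text('\n'.join(out) + '\n')
```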
Is there a function that can split the dataset into folders as required by YOLOv8, as well as generate the YAML file?
@Yuri-Njathi hello! 😊 There isn't a built-in function that directly splits the dataset into folders and generates a YAML file for YOLOv8. However, you can easily achieve this with a few lines of Python code. Here's a basic outline:
I hope this helps you get started! If you have more questions, feel free to ask. |
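To sketch what such a script could look like (all names here are hypothetical and this is not part of Ultralytics' codebase; it assumes a flat `images/` + `labels/` layout with one YOLO-format label file per image):

```python
import random
import shutil
from pathlib import Path


def split_to_folders(root, weights=(0.8, 0.2), names=('train', 'val'), seed=0):
    # Copy root/images/* and matching root/labels/*.txt into the
    # root/<split>/images and root/<split>/labels layout YOLOv8 expects,
    # then write a minimal data.yaml next to them.
    root = Path(root)
    images = sorted((root / 'images').glob('*'))
    random.seed(seed)  # for reproducibility
    buckets = random.choices(range(len(names)), weights=weights, k=len(images))
    for img, b in zip(images, buckets):
        split = names[b]
        (root / split / 'images').mkdir(parents=True, exist_ok=True)
        (root / split / 'labels').mkdir(parents=True, exist_ok=True)
        shutil.copy(img, root / split / 'images' / img.name)
        label = root / 'labels' / (img.stem + '.txt')
        if label.exists():
            shutil.copy(label, root / split / 'labels' / label.name)
    (root / 'data.yaml').write_text(
        f"path: {root.resolve().as_posix()}\n"
        "train: train/images\n"
        "val: val/images\n"
        "names:\n"
        "  0: object\n"  # placeholder class name; edit to match your dataset
    )
```

The class names in the generated data.yaml are a placeholder and need to be edited to match your dataset.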
@sushant097 great snippet, but it will not distribute the dataset across the train, test, and validation splits equally per class. In other words, there is no support for stratification.
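A stratified split can be sketched in pure Python, assuming each image can be assigned a single primary class label (`stratified_split` is a hypothetical helper; `sklearn.model_selection.train_test_split` with its `stratify=` parameter achieves the same thing):

```python
import random
from collections import defaultdict


def stratified_split(items, labels, val_frac=0.2, seed=0):
    # Group items by class, then take val_frac of each group for the
    # validation set, so every class is represented in both splits.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, lbl in zip(items, labels):
        by_class[lbl].append(item)
    train, val = [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_frac))  # at least 1 per class
        val += group[:n_val]
        train += group[n_val:]
    return train, val
```

Note that object-detection images often contain multiple classes, in which case stratifying on a single label per image is only an approximation.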
I wrote a few lines of code where, instead of giving the train: and val: paths in the .yaml file, I added a dataset: attribute containing the whole dataset. I then wrote a simple line of code that splits the dataset using sklearn, but this didn't succeed because it raised an error in train.py when create_dataloader is called. My question: is it possible to split the dataset using ready-made functions?