
Dataset index and slicing #878
Merged: 6 commits into twosixlabs:master on Nov 4, 2020
Conversation

@davidslater (Contributor)

Fixes #871

This PR does a few related things:

  1. Renames "split_type" to "split" to make clear that the value is passed verbatim to tfds.load (it is the same kwarg name):
    https://www.tensorflow.org/datasets/api_docs/python/tfds/load

  2. Modifies scenarios so that configs can override the train and eval splits. This enables TFDS slicing operations on them, such as "test[:10]" or "train_clean100+train_clean360+train_other500":
    https://www.tensorflow.org/datasets/splits#examples

  3. Adds additional parsing of splits to enable indexing of an individual data point, such as "test[355]", and indexing based on a list, such as "test[[1, 4, 5, 7, 9]]". Under the hood, these are translated to single-element slices. Slicing operations involving a step parameter, such as "test[10:20:2]", are not permitted and raise a NotImplementedError.
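The parsing described in item 3 can be sketched as follows. This is an illustrative plain-Python sketch, not the actual code from datasets.py; the function name `parse_split` and the regular expressions are assumptions made for the example.

```python
import re

def parse_split(split: str) -> str:
    """Sketch: translate index-style splits into TFDS-compatible slices.

    "test[355]"       -> "test[355:356]"
    "test[[1, 4, 5]]" -> "test[1:2]+test[4:5]+test[5:6]"
    Step slices such as "test[10:20:2]" are rejected.
    """
    # list form: name[[i, j, ...]]
    m = re.fullmatch(r"(\w+)\[\[([\d,\s]+)\]\]", split)
    if m:
        name, items = m.group(1), m.group(2)
        indices = [int(i) for i in items.split(",")]
        return "+".join(f"{name}[{i}:{i + 1}]" for i in indices)
    # step slices are not supported
    if re.fullmatch(r"\w+\[\d*:\d*:\d+\]", split):
        raise NotImplementedError("step slicing is not supported")
    # single index: name[i]
    m = re.fullmatch(r"(\w+)\[(\d+)\]", split)
    if m:
        name, i = m.group(1), int(m.group(2))
        return f"{name}[{i}:{i + 1}]"
    # plain splits, ordinary slices, and '+' unions pass through unchanged
    return split
```

The resulting string can then be handed to tfds.load as-is, since single-element slices are ordinary TFDS slicing syntax.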

@davidslater (Contributor, Author)

Biggest code changes are in datasets.py and its associated test file.

@lcadalzo (Contributor) commented Nov 3, 2020

What's the intended behavior when overlapping splits are provided? E.g. "train[1:3] + train[2]". The current behavior is to repeat elements, so in this case train[2] is returned in two consecutive batches.

@lcadalzo (Contributor) commented Nov 3, 2020

Is it of interest to allow negative indices when (1) specifying a single index and/or (2) providing a list of indices?

(1) E.g., if I only want to evaluate on the penultimate sample, specifying "train[-2]" will cause an error; "train[-2:-1]", however, will yield the one sample I want.

(2) E.g., if I want to evaluate on the first and penultimate samples, "train[[0, -2]]" will cause an error, since only positive indices are supported when passing a list, whereas "train[0] + train[-2:-1]" would work.
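For reference, the slice-based workarounds above mirror ordinary Python list slicing, where a one-element negative slice picks out exactly the penultimate sample. A minimal illustration:

```python
# Plain-Python illustration of the suggested workarounds (standard list
# slicing, which matches TFDS slice semantics for these simple cases).
samples = ["s0", "s1", "s2", "s3", "s4"]

# "train[-2:-1]": a one-element slice selecting just the penultimate sample
penultimate = samples[-2:-1]  # -> ["s3"]

# "train[0] + train[-2:-1]": the first and penultimate samples
first_and_penultimate = samples[0:1] + samples[-2:-1]  # -> ["s0", "s3"]
```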

@davidslater (Contributor, Author)

> What's the intended behavior when overlapping splits are provided? E.g. "train[1:3] + train[2]". The current behavior is to repeat elements, so in this case train[2] is returned in two consecutive batches.

The intended behavior is simply to match the underlying TFDS behavior, which allows duplicate and out-of-order splits. I don't expect people to make much use of it, but it is much simpler on our end than trying to determine uniqueness.
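That behavior can be seen in a plain-Python sketch (not TFDS itself): each "+"-separated term is resolved independently and the results are concatenated in order, with no uniqueness check. The `resolve` helper below is hypothetical and assumes a single "train" split for simplicity.

```python
def resolve(split: str, data: list) -> list:
    """Sketch: resolve a '+'-joined split against an in-memory list,
    keeping duplicates and the order in which terms appear."""
    out = []
    for part in (p.strip() for p in split.split("+")):
        # part looks like "train[start:stop]" or "train[i]"
        inner = part[part.index("[") + 1 : part.index("]")]
        if ":" in inner:
            start, stop = inner.split(":")
            out.extend(data[int(start or 0) : int(stop) if stop else None])
        else:
            out.append(data[int(inner)])
    return out

samples = ["a", "b", "c", "d", "e"]
resolve("train[1:3] + train[2]", samples)  # -> ["b", "c", "c"]
```

So "train[1:3] + train[2]" yields element 2 twice, exactly as the TFDS-style concatenation implies.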

@davidslater (Contributor, Author)

> Is it of interest to allow negative indices when (1) specifying a single index and/or (2) providing a list of indices?
>
> (1) E.g., if I only want to evaluate on the penultimate sample, specifying "train[-2]" will cause an error; "train[-2:-1]", however, will yield the one sample I want.
>
> (2) E.g., if I want to evaluate on the first and penultimate samples, "train[[0, -2]]" will cause an error, since only positive indices are supported when passing a list, whereas "train[0] + train[-2:-1]" would work.

I had not considered using negative indices. I didn't realize they would work. I can make an update to allow them if you think that makes sense.

@lcadalzo (Contributor) commented Nov 4, 2020

I think, since it's unlikely they would be commonly used, we can hold off and add them later only if a user specifically asks for them.

@lcadalzo (Contributor) left a review comment:

LGTM

@lcadalzo lcadalzo merged commit 6da24a4 into twosixlabs:master Nov 4, 2020
lcadalzo added a commit that referenced this pull request Nov 5, 2020
* fix pytorch deepspeech Dockerfile (#820)

* Bump to version 0.12.2-dev (#819)

* Update docs with new object detection datasets/models/metrics (#818)

* added object detection metrics

* adding xView and APRICOT docs

* update dataset doc

* fix typo

* don't throw spurious errors on container shutdown after armory exec or launch (#828)

* don't throw spurious errors on container shutdown after armory exec or launch

* only log a clean container exit when the exit is known clean

* Docs update (faq) (#831)

* added object detection metrics

* update faq with canonical preprocessing

* minor grammar nitpicks

* Canonical preprocessing for adversarial datasets and non-scenario datasets (#829)

* canonical adversarial datasets

* canonical datasets

* canonical datasets

* canonical datasets

* canonical datasets

* refactored ucf, xview, and apricot

* black formatting

Co-authored-by: lucas.cadalzo <lucas.cadalzo@twosixlabs.com>

* Update datasets, licensing, scenarios, baseline models, and metrics (#826)

* Updating datasets docs

* Updating adversarial dataset docs

* Updating metrics docs

* Fix formatting

* Fix formatting

* Update licensing

* Update scenarios doc

* Fix formatting

* Revert "Fix formatting"

This reverts commit cac7eb9.

* Update baseline model docs

* Remove precision/recall metrics from doc

* Update datasets.md

* Updating dataset docs

* Updating dataset docs

* Fix MNIST dtype

* Update scenarios doc (#844)

* updated ASR scenario documentation

* update ASR documentation

* updated multimodal scenario documentation

* updated xview scenario documentation

* updated APRICOT scenario documentation

* updated APRICOT scenario documentation

* updated poison scenario documentation

* updated UCF101 scenario documentation

* updated UCF101 scenario documentation

* Defences fix (#839)

* chmod

* ensure dtype output

* variable y (#841)

* update download command line (#840)

* Update attack params (#845)

* update

* updated to max of test and dev datasets

* fast untarring when possible (#832)

* fast untarring when possible

* standard output

* Piping from stdin (#842)

* accept standard input

* example

* exception -> error

* reject configs on raw stdin

Co-authored-by: Adam Jacobson <adam.jacobson@twosixlabs.com>

* xView model gpu fix (#853)

* added object detection metrics

* remove hardcoded device_type of cpu

* iou warning logic (#861)

* added object detection metrics

* warning is now thrown if the following holds for exactly one of the two boxes: all nonzero coordinates from a box are < 1

* scaled epsilon (#865)

* Better ASR Labels (#863)

* remove int check to enable str inputs

* label updates

* weight saving hack (#870)

* working FGSM config for ASR + lint (#859)

* apricot patch targeted AP metric (#866)

* added object detection metrics

* checkpoint progress with apricot_patch_targeted_AP_per_class metric; still need to decide how to handle duplicate predictions of targeted class

* ignore duplicate true positives for a single patch box; updated comments and changed variable names

* adding apricot patch metadata used in apricot_patch_targeted_AP_per_class metric

* minor edits to variable name and comments

* update apricot test to incorporate new metric

* make APRICOT_PATCHES a dict with id's for keys

* Update host-requirements.txt (#873)

* Add --skip-attack flag (#877)

* added object detection metrics

* added --skip-attack flag

* allow for passing --skip-benign and --skip-attack

* truncate very long videos (#880)

* Dataset index and slicing (#878)

* update split naming

* update dataset loading

* updated docs

* updated token tests

* working tests for slicing

* allow ordering and duplicates

* Sanity check model configurations (#875)

* Initial sanity checking

* Initial sanity checking

* New sanity checks. Ensuring baseline scenario compliance

* Moving to pytest framework

* Integrated into armory run command with validate-config flag

* Move test_config folder

* Explict check for classifier model

* Add specific ground truth box set

* Black formatting

* Updating docs

* Fix typo

* Pin to ART 1.4.2 and move to end of container installs (#887)

* added object detection metrics

* pin to ART 1.4.2 and move to end of docker container installs

* move ART installation to framework Dockerfiles

* add ART install to dev Dockerfiles

* UCF101 shape bug (#889)

* updated preprocessing

* revert

* Update xview results (#879)

* updated baseline results for xview

* formatting

* fix typos

Co-authored-by: lucas.cadalzo <lucas.cadalzo@twosixlabs.com>

* Split metrics when targeted attack is present (#884)

* Split adversarial metrics relative to ground truth and targeted labels

* Flake8

* Updating metrics doc

* added pgd_patch art_experimental attack (#883)

* added pgd_patch art_experimental attack

* ran format_json

* set use_gpu false

* removing accidentally added files

Co-authored-by: lucas.cadalzo <lucas.cadalzo@twosixlabs.com>

* updated configs (#897)

* Add type annotations to baseline models (#885)

* Add type annotations to baseline models

* Add docstrings to baseline models

* Update type annotations

* Update type annotation

* GTSRB canonical preprocessing (#888)

* Add canonical preprocessing for GTSRB

* Update canonical preprocessing for GTSRB

* Update GTSRB canonical preprocessing

* Update documentation

* Update scenario documentation

* Deepspeech nan rebase (#891)

* updated docker builds

* update docker file

* update

* xView perturbation metric (#898)

* added object detection metrics

* update l0 norm to divide by number of elements in array; update xView scenarios to use l0 norm

* update test_metrics.py

* Docs for --skip flags (#900)

* added object detection metrics

* adding docs for --skip flags

* Release version 0.12.2

Co-authored-by: Adam Jacobson <38766736+adamj26@users.noreply.github.com>
Co-authored-by: lcadalzo <39925313+lcadalzo@users.noreply.github.com>
Co-authored-by: davidslater <david.slater@twosixlabs.com>
Co-authored-by: lucas.cadalzo <lucas.cadalzo@twosixlabs.com>
Co-authored-by: kevinmerchant <67436031+kevinmerchant@users.noreply.github.com>
Co-authored-by: yusong-tan <59029053+yusong-tan@users.noreply.github.com>
Co-authored-by: Adam Jacobson <adam.jacobson@twosixlabs.com>
Successfully merging this pull request may close these issues.

Add "skip" feature to dataset kwargs and dataset generator to start at a specific index