
Commit

Merge pull request #1 from arranger1044/master
Sync with original repo
guyvdbroeck authored Oct 27, 2019
2 parents 25c2f34 + 8509086 commit 44c51c5
Showing 20 changed files with 362,446 additions and 1 deletion.
11 changes: 10 additions & 1 deletion README.md
A collection of datasets used in machine learning for density
estimation.

If you use any of the datasets, you should cite their original papers.<sup id="a1">[1](#f1)</sup><sup id="a1">[2](#f2)</sup><sup id="a1">[3](#f3)</sup><sup id="a1">[4](#f4)</sup>

## Datasets

| Dataset | Type | # Vars | # Train | # Valid | # Test | Density | Name |
| --- | --- | --- | --- | --- | --- | --- | --- |
|**Jester**<sup id="a1">[1](#f1)</sup>| binary | 100 | 9000 | 1000 | 4116 | 0.608|`jester`|
|**Netflix**<sup id="a1">[1](#f1)</sup>| binary | 100 | 15000 | 2000 | 3000 | 0.541|`bnetflix`|
|**Accidents**<sup id="a1">[2](#f2)</sup>| binary | 111 | 12758 | 1700 | 2551 | 0.291|`accidents`|
|**Mushrooms**<sup id="a3">[4](#f4)</sup>| binary | 112 | 2000 | 500 | 5624 | 0.187|`mushrooms`|
|**Adult**<sup id="a3">[4](#f4)</sup>| binary | 123 | 5000 | 1414 | 26147 | 0.112|`adult`|
|**Connect 4**<sup id="a3">[4](#f4)</sup>| binary | 126 | 16000 | 4000 | 47557 | 0.333|`connect4`|
|**OCR Letters**<sup id="a3">[4](#f4)</sup>| binary | 128 | 32152 | 10000 | 10000 | 0.220|`ocr_letters`|
|**RCV-1**<sup id="a3">[4](#f4)</sup>| binary | 150 | 40000 | 10000 | 150000 | 0.138|`rcv1`|
|**Retail**<sup id="a2">[2](#f2)</sup>| binary | 135 | 22041 | 2938 | 4408 | 0.024|`tretail`|
|**Pumsb-star**<sup id="a2">[2](#f2)</sup>| binary | 163 | 12262 | 1635 | 2452 | 0.270|`pumsb_star`|
|**DNA**<sup id="a2">[2](#f2)</sup>| binary | 180 | 1600 | 400 | 1186 | 0.253|`dna`|
|**Kosarek**<sup id="a2">[2](#f2)</sup>| binary | 190 | 33375 | 4450 | 6675 | 0.020|`kosarek`|
|**MSWeb**<sup id="a1">[1](#f1)</sup>| binary | 294 | 29441 | 3270 | 5000 | 0.010|`MSWeb`|
|**NIPS**<sup id="a3">[4](#f4)</sup>| binary | 500 | 400 | 100 | 1240 | 0.367|`nips`|
|**Book**<sup id="a1">[1](#f1)</sup>| binary | 500 | 8700 | 1159 | 1739 | 0.016|`book`|
|**EachMovie**<sup id="a1">[1](#f1)</sup>| binary | 500 | 4525 | 1002 | 591 | 0.059|`tmovie`|
|**WebKB**<sup id="a1">[1](#f1)</sup>| binary | 839 | 2803 | 558 | 838 | 0.064|`cwebkb`|
<b id="f2">2</b> Jan Van Haaren, Jesse Davis: [*Markov Network Structure Learning: A Randomized Feature Generation Approach*][VanHaaren2012]. AAAI 2012

<b id="f3">3</b> Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche, Guy Van den Broeck: [*Tractable Learning
for Complex Probability Queries*][Bekker2015]. NIPS 2015

<b id="f4">4</b> Hugo Larochelle, Iain Murray: [*The Neural Autoregressive Distribution Estimator*][Larochelle2011]. AISTATS 2011

[Lowd2010]: http://ix.cs.uoregon.edu/~lowd/icdm10lowd.pdf
[VanHaaren2012]: http://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/viewFile/5107/5534
[Bekker2015]: https://lirias.kuleuven.be/bitstream/123456789/513299/4/nips15_cr.pdf
[Larochelle2011]: http://proceedings.mlr.press/v15/larochelle11a/larochelle11a.pdf
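
Each dataset above ships as three plain-text splits, `<name>.train.data`, `<name>.valid.data` and `<name>.test.data`, with one instance per row. Below is a minimal loading sketch, assuming comma-separated binary values (the same format that `process_splits.py` further down writes via `numpy.savetxt(..., delimiter=',', fmt='%d')`); the `adult` paths are just an example:

```python
import numpy

# Hypothetical example: any <name>.{train,valid,test}.data triple is loaded the same way.
train = numpy.loadtxt('datasets/adult/adult.train.data', delimiter=',', dtype=int)
valid = numpy.loadtxt('datasets/adult/adult.valid.data', delimiter=',', dtype=int)
test = numpy.loadtxt('datasets/adult/adult.test.data', delimiter=',', dtype=int)

# Per the table above, Adult should give (5000, 123), (1414, 123), (26147, 123).
print(train.shape, valid.shape, test.shape)
```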
26,147 changes: 26,147 additions & 0 deletions datasets/adult/adult.test.data
5,000 changes: 5,000 additions & 0 deletions datasets/adult/adult.train.data
1,414 changes: 1,414 additions & 0 deletions datasets/adult/adult.valid.data
47,557 changes: 47,557 additions & 0 deletions datasets/connect4/connect4.test.data
16,000 changes: 16,000 additions & 0 deletions datasets/connect4/connect4.train.data
4,000 changes: 4,000 additions & 0 deletions datasets/connect4/connect4.valid.data
5,624 changes: 5,624 additions & 0 deletions datasets/mushrooms/mushrooms.test.data
2,000 changes: 2,000 additions & 0 deletions datasets/mushrooms/mushrooms.train.data
500 changes: 500 additions & 0 deletions datasets/mushrooms/mushrooms.valid.data
1,240 changes: 1,240 additions & 0 deletions datasets/nips/nips.test.data
400 changes: 400 additions & 0 deletions datasets/nips/nips.train.data
100 changes: 100 additions & 0 deletions datasets/nips/nips.valid.data
10,000 changes: 10,000 additions & 0 deletions datasets/ocr_letters/ocr_letters.test.data
32,152 changes: 32,152 additions & 0 deletions datasets/ocr_letters/ocr_letters.train.data
10,000 changes: 10,000 additions & 0 deletions datasets/ocr_letters/ocr_letters.valid.data
150,000 changes: 150,000 additions & 0 deletions datasets/rcv1/rcv1.test.data
40,000 changes: 40,000 additions & 0 deletions datasets/rcv1/rcv1.train.data
10,000 changes: 10,000 additions & 0 deletions datasets/rcv1/rcv1.valid.data
302 changes: 302 additions & 0 deletions process_splits.py
import argparse
import itertools
import os

import numpy

from utils import load_train_valid_test_csvs
from utils import dataset_to_instances_set
from utils import split_union

RND_SEED = 1337


DATASET_NAMES = ['accidents',
                 'ad',
                 'baudio',
                 'bbc',
                 'bnetflix',
                 'book',
                 'c20ng',
                 'cr52',
                 'cwebkb',
                 'dna',
                 'jester',
                 'kdd',
                 'msnbc',
                 'msweb',
                 'nltcs',
                 'plants',
                 'pumsb_star',
                 'tmovie',
                 'tretail',
                 'bin-mnist']

DATA_PATH = './datasets'
OUTPUT_PATH = './datasets'

SPLIT_NAMES = ['train', 'valid', 'test']

def only_shuffle_split_train_valid_test(data, percs=[0.75, 0.1, 0.15], rand_gen=None):
    """Shuffle all instances and slice them into consecutive splits of the given proportions."""

    if rand_gen is None:
        rand_gen = numpy.random.RandomState(RND_SEED)

    n_instances = data.shape[0]

    samples_ids = numpy.arange(n_instances)
    #
    # shuffle the ids
    shuffled_ids = rand_gen.permutation(samples_ids)

    #
    # getting slices
    n_splits = len(percs)
    nb_split_samples = [int(n_instances * p) for p in percs[:-1]]
    nb_split_samples += [n_instances - sum(nb_split_samples)]
    assert sum(nb_split_samples) == len(samples_ids)

    split_samples_ids = [0] * (n_splits + 1)
    for i in range(n_splits):
        split_samples_ids[i + 1] = nb_split_samples[i] + split_samples_ids[i]

    # print('Splits', split_samples_ids)
    sample_splits = [shuffled_ids[split_samples_ids[i]:split_samples_ids[i + 1]]
                     for i in range(n_splits)]

    #
    # extracting samples
    data_splits = []
    for split_ids in sample_splits:
        # print(split_ids)
        extracted_split = data[split_ids]
        data_splits.append(extracted_split)

    assert len(data_splits) == n_splits

    return data_splits


def shuffle_split_train_valid_test(data,
                                   percs=[0.75, 0.1, 0.15],
                                   only_uniques=False,
                                   rand_gen=None):
    """Split over unique instances so that identical instances never end up in different
    splits; optionally keep only unique instances within each split."""

    if rand_gen is None:
        rand_gen = numpy.random.RandomState(RND_SEED)

    n_instances = data.shape[0]

    ids_dict = {}

    samples_ids = numpy.ones(n_instances, dtype=int)

    #
    # collecting unique samples ids
    counter = 0
    for i, s in enumerate(data):
        #
        # making it hashable
        sample_tuple = tuple(s)

        if sample_tuple not in ids_dict:
            ids_dict[sample_tuple] = counter
            counter += 1

        samples_ids[i] = ids_dict[sample_tuple]

    print('There are {} unique samples'.format(counter))
    # print(samples_ids)
    unique_samples_ids = list(sorted(set(samples_ids)))
    assert len(unique_samples_ids) == counter

    #
    # shuffle the ids
    shuffled_ids = rand_gen.permutation(unique_samples_ids)

    #
    # getting slices
    n_splits = len(percs)
    nb_split_samples = [int(counter * p) for p in percs[:-1]]
    nb_split_samples += [counter - sum(nb_split_samples)]
    assert sum(nb_split_samples) == len(unique_samples_ids)

    split_samples_ids = [0] * (n_splits + 1)
    for i in range(n_splits):
        split_samples_ids[i + 1] = nb_split_samples[i] + split_samples_ids[i]

    # print('Splits', split_samples_ids)
    unique_sample_splits = [
        shuffled_ids[split_samples_ids[i]:split_samples_ids[i + 1]] for i in range(n_splits)]
    # print('unique splits', unique_sample_splits)

    #
    # extracting samples (each instance goes to the split its unique id was assigned to)
    data_splits = []
    for u_split in unique_sample_splits:
        split_ids = set(u_split)
        # print(split_ids)
        samples_mask = numpy.zeros(n_instances, dtype=bool)
        for i in range(n_instances):
            if samples_ids[i] in split_ids:
                # print(samples_ids[i])
                samples_mask[i] = True
        # print(samples_mask)
        extracted_split = data[samples_mask]
        data_splits.append(extracted_split)

    assert len(data_splits) == n_splits

    #
    # collapsing to uniques
    if only_uniques:
        unique_data_sets = [dataset_to_instances_set(split) for split in data_splits]
        for i, split in enumerate(unique_data_sets):
            data_array_list = [numpy.array(s) for s in split]
            data_splits[i] = numpy.array(data_array_list)

    return data_splits


def print_split_stats(dataset_splits, split_names=SPLIT_NAMES, out_log=None):
    """Report unique-instance counts per split and the overlap between splits."""

    unique_splits = [dataset_to_instances_set(split) for split in dataset_splits]

    uniq_str = 'Unique instances in splits:'
    print(uniq_str)
    if out_log:
        out_log.write(uniq_str + '\n')
    for orig_split, unique_split, split_name in zip(dataset_splits, unique_splits, split_names):
        uniq_str = '\t# all instances {2}: {0} -> {1}'.format(len(orig_split),
                                                              len(unique_split),
                                                              split_name)
        print(uniq_str)
        if out_log:
            out_log.write(uniq_str + '\n')

    over_str = 'Overlapping instances:'
    print(over_str)
    if out_log:
        out_log.write(over_str + '\n')
    split_pairs = itertools.combinations(unique_splits, 2)
    split_name_pairs = itertools.combinations(split_names, 2)
    for (name_1, name_2), (split_1, split_2) in zip(split_name_pairs, split_pairs):
        n_over_instances = len(split_1 & split_2)
        over_str = '\toverlapping instances from {0} to {1}:\t{2}'.format(name_1,
                                                                          name_2,
                                                                          n_over_instances)
        print(over_str)
        if out_log:
            out_log.write(over_str + '\n')

    n_over_instances = len(set.intersection(*unique_splits))
    over_str = '\toverlapping instances among {0}: {1}\n'.format(split_names, n_over_instances)
    print(over_str)
    if out_log:
        out_log.write(over_str)


if __name__ == '__main__':

    parser = argparse.ArgumentParser()

    parser.add_argument('dataset', type=str,
                        help='Name of the dataset to process (e.g. nltcs)')
    parser.add_argument('-o', '--output', type=str, nargs='?',
                        default=OUTPUT_PATH,
                        help='Output dir path')
    parser.add_argument('--prefix', type=str, nargs='?',
                        default='no-over',
                        help='Prefix for the new dataset name')
    parser.add_argument('--perc', type=float, nargs='+',
                        default=[0.75, 0.1, 0.15],
                        help='Percentages of split')
    parser.add_argument('--unique', action='store_true',
                        help='Whether to remove all duplicate instances')
    parser.add_argument('--shuffle', action='store_true',
                        help='Whether to shuffle data')
    parser.add_argument('--no-overlap', action='store_true',
                        help='No overlapping instances among splits')
    parser.add_argument('-v', '--verbose', type=int, nargs='?',
                        default=0,
                        help='Verbosity level')
    parser.add_argument('--seed', type=int, nargs='?',
                        default=1337,
                        help='Seed for the random generator')

    args = parser.parse_args()
    print("Starting with arguments:", args)

    rand_gen = numpy.random.RandomState(args.seed)

    new_dataset_name = '{0}-{1}'.format(args.prefix,
                                        args.dataset)
    out_path = os.path.join(args.output, new_dataset_name)
    os.makedirs(out_path, exist_ok=True)
    out_log_path = os.path.join(out_path, '{0}.exp.log'.format(new_dataset_name))
    print(out_path, out_log_path)

    split_names = SPLIT_NAMES

    with open(out_log_path, 'w') as out_log:

        #
        # loading dataset
        dataset_splits = load_train_valid_test_csvs(args.dataset,
                                                    os.path.join(DATA_PATH, args.dataset),
                                                    verbose=False)

        train, valid, test = dataset_splits
        split_shape_str = 'Data splits shapes:\n\ttrain: '\
            '{0}\n\tvalid: {1}\n\ttest: {2}'.format(train.shape,
                                                    valid.shape,
                                                    test.shape)
        print(split_shape_str)
        out_log.write(split_shape_str + '\n')

        print_split_stats(dataset_splits, split_names, out_log)

        #
        # merging all splits back into a single data matrix
        merged_data = split_union((train, valid, test))
        merge_shape_str = 'Merged splits shape: {}'.format(merged_data.shape)
        print(merge_shape_str)
        out_log.write(merge_shape_str + '\n')

        resampled_splits = None

        if args.shuffle and not args.unique and not args.no_overlap:
            print('shuffling only')
            resampled_splits = only_shuffle_split_train_valid_test(merged_data,
                                                                   percs=args.perc,
                                                                   rand_gen=rand_gen)

        else:
            #
            # resampling with no overlap (and optionally only unique instances)
            resampled_splits = shuffle_split_train_valid_test(merged_data,
                                                              percs=args.perc,
                                                              only_uniques=args.unique,
                                                              rand_gen=rand_gen)

        re_train, re_valid, re_test = resampled_splits
        split_shape_str = 'Shapes after resampling:\n\ttrain: ' \
            '{0}\n\tvalid: {1}\n\ttest: {2}'.format(re_train.shape,
                                                    re_valid.shape,
                                                    re_test.shape)
        print(split_shape_str)
        out_log.write(split_shape_str + '\n')

        print_split_stats(resampled_splits, split_names, out_log)

        numpy.savetxt(os.path.join(out_path, '{}.train.data'.format(new_dataset_name)),
                      re_train, delimiter=',', fmt='%d')
        numpy.savetxt(os.path.join(out_path, '{}.valid.data'.format(new_dataset_name)),
                      re_valid, delimiter=',', fmt='%d')
        numpy.savetxt(os.path.join(out_path, '{}.test.data'.format(new_dataset_name)),
                      re_test, delimiter=',', fmt='%d')
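
For reference, a minimal usage sketch of the resampling function above, assuming `process_splits.py` and the `utils` module it imports are on the Python path; the toy data here is hypothetical:

```python
import numpy

from process_splits import shuffle_split_train_valid_test

# Hypothetical toy data: 100 random binary instances over 5 variables (so duplicates are likely).
rand_gen = numpy.random.RandomState(1337)
data = rand_gen.randint(2, size=(100, 5))

# Splits are sliced over unique instances, so no instance can end up in two splits.
train, valid, test = shuffle_split_train_valid_test(data,
                                                    percs=[0.75, 0.1, 0.15],
                                                    rand_gen=rand_gen)
print(train.shape, valid.shape, test.shape)
```

From the command line, something like `python process_splits.py nltcs --shuffle --unique --no-overlap --seed 1337` would merge the original `nltcs` splits, resample them with the default 0.75/0.10/0.15 proportions without duplicate or overlapping instances, and write `no-over-nltcs.{train,valid,test}.data` plus an `.exp.log` file under `datasets/no-over-nltcs/` (per the argparse defaults above).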
