GitHub - tiantiangao7/text: Data loaders and abstractions for text and NLP

torchtext

This repository consists of:

torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)
torchtext.datasets: Pre-built loaders for common NLP datasets

Installation

Make sure you have Python 2.7 or 3.5+ and PyTorch 0.4.0 or newer. You can then install torchtext using pip:

pip install torchtext

For PyTorch versions before 0.4.0, please use pip install torchtext==0.2.3.

Optional requirements

If you want to use the English tokenizer from SpaCy, you need to install SpaCy and download its English model:

pip install spacy
python -m spacy download en

Alternatively, you can use the Moses tokenizer port in SacreMoses (split from NLTK). To do so, install SacreMoses:

pip install sacremoses

Documentation

Find the documentation here.

Data

The data module provides the following:

Ability to describe declaratively how to load a custom NLP dataset that's in a "normal" format:

>>> from torchtext import data
>>> pos = data.TabularDataset(
...    path='data/pos/pos_wsj_train.tsv', format='tsv',
...    fields=[('text', data.Field()),
...            ('labels', data.Field())])
...
>>> sentiment = data.TabularDataset(
...    path='data/sentiment/train.json', format='json',
...    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
...            'sentiment_gold': ('labels', data.Field(sequential=False))})
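
The two `fields` forms above differ in how columns are matched: a list assigns names by position (as in the TSV case), while a dict maps source keys to attribute names (as in the JSON case). The plain-Python sketch below illustrates that idea with the standard library only. It is not torchtext's implementation: the helpers `load_tsv` and `load_json_lines` and the sample rows are hypothetical, and it assumes the JSON file holds one object per line.

```python
import csv
import io
import json


def load_tsv(stream, names):
    """Map each tab-separated column, by position, to a field name."""
    reader = csv.reader(stream, delimiter="\t")
    return [dict(zip(names, row)) for row in reader]


def load_json_lines(stream, mapping):
    """Keep and rename selected JSON keys; mapping is {source_key: attribute_name}."""
    examples = []
    for line in stream:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        examples.append({attr: record[key] for key, attr in mapping.items()})
    return examples


# Hypothetical in-memory stand-ins for the files named in the README snippet.
tsv_data = io.StringIO("the dog barks\tDT NN VBZ\n")
json_data = io.StringIO(
    '{"sentence_tokenized": ["good", "film"], "sentiment_gold": "pos"}\n')

pos_examples = load_tsv(tsv_data, ["text", "labels"])
sentiment_examples = load_json_lines(
    json_data, {"sentence_tokenized": "text", "sentiment_gold": "labels"})
```

In both cases every example ends up with `text` and `labels` entries, which is the uniform shape the rest of a pipeline can rely on.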

Ability to define a preprocessing pipeline:

>>> from torchtext import data, datasets
>>> src = data.Field(tokenize=my_custom_tokenizer)
>>> trg = data.Field(tokenize=my_custom_tokenizer)
>>> mt_train = datasets.TranslationDataset(
...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
...     fields=(src, trg))
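
The snippet above does not define `my_custom_tokenizer`. A tokenizer passed to `data.Field` is a callable that takes a string and returns a list of tokens, so one possible stand-in looks like the sketch below. The regex and the lowercasing step are illustrative assumptions, not something shipped with torchtext.

```python
import re

# One token per run of word characters, or per single punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def my_custom_tokenizer(text):
    """Lowercase the input and split it into word and punctuation tokens."""
    return _TOKEN_RE.findall(text.lower())


tokens = my_custom_tokenizer("Hello, world!")
```

Because the field only needs a callable, the same slot can hold a language-specific tokenizer for the source and a different one for the target.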

Batching, padding, and numericalizing (including building a vocabulary object):

>>> # continuing from above
>>> mt_dev = datasets.TranslationDataset(
...     path='data/mt/newstest2014', exts=('.en', '.de'),
...     fields=(src, trg))
>>> src.build_vocab(mt_train, max_size=80000)
>>> trg.build_vocab(mt_train, max_size=40000)
>>> # mt_dev shares the fields, so it shares their vocab objects
>>>
>>> train_iter = data.BucketIterator(
...     dataset=mt_train, batch_size=32,
...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
>>> # usage
>>> next(iter(train_iter))
<data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
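
To show what "building a vocabulary, numericalizing, and padding" amounts to, here is a minimal plain-Python sketch. It is a simplification, not torchtext's code: `build_vocab` and `numericalize_and_pad` are hypothetical helpers, the tie-breaking order is an assumption made to keep the result deterministic, and the output is a list of lists rather than a tensor.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"


def build_vocab(sentences, max_size=None):
    """Return a token -> index map keeping the max_size most frequent tokens."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Descending frequency, then alphabetical, so ties resolve deterministically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, _ in ordered[:max_size]]
    # Special tokens come first and do not count toward max_size.
    return {tok: idx for idx, tok in enumerate([UNK, PAD] + kept)}


def numericalize_and_pad(batch, vocab):
    """Convert tokens to indices and right-pad each sentence to the batch maximum."""
    width = max(len(sent) for sent in batch)
    rows = []
    for sent in batch:
        ids = [vocab.get(tok, vocab[UNK]) for tok in sent]  # unseen tokens -> <unk>
        rows.append(ids + [vocab[PAD]] * (width - len(ids)))
    return rows


train = [["a", "b", "a"], ["a", "c"], ["b"]]
vocab = build_vocab(train, max_size=2)
batch = numericalize_and_pad([["a", "c", "b"], ["b"]], vocab)
```

Here `"c"` falls outside the two most frequent tokens, so it maps to the unknown index, and the shorter sentence is padded to the length of the longer one.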

Wrapper for dataset splits (train, validation, test):

>>> TEXT = data.Field()
>>> LABELS = data.Field()
>>>
>>> train, val, test = data.TabularDataset.splits(
...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
...     validation='_dev.tsv', test='_test.tsv', format='tsv',
...     fields=[('text', TEXT), ('labels', LABELS)])
>>>
>>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
...     (train, val, test), batch_sizes=(16, 256, 256),
...     sort_key=lambda x: len(x.text), device=0)
>>>
>>> TEXT.build_vocab(train)
>>> LABELS.build_vocab(train)
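
The `sort_key` given to `BucketIterator` determines which examples are grouped together. The sketch below shows the underlying idea in plain Python: sort by the key, then cut consecutive slices, so each batch holds examples of similar length and needs little padding. It deliberately omits shuffling and any other refinements, so it should be read as an illustration rather than torchtext's actual batching logic; `bucket_batches` is a hypothetical helper.

```python
def bucket_batches(examples, batch_size, sort_key):
    """Group examples with similar sort_key values into batches of at most batch_size."""
    ordered = sorted(examples, key=sort_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Six toy sentences of lengths 7, 2, 9, 1, 8 and 3.
sentences = [["a"] * n for n in (7, 2, 9, 1, 8, 3)]
batches = bucket_batches(sentences, batch_size=2, sort_key=len)
lengths = [[len(s) for s in batch] for batch in batches]
```

With the key set to `len`, the short sentences land in one batch and the long ones in another, instead of a 1-token sentence being padded out to 9 tokens.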

Datasets

The datasets module currently contains:

Sentiment analysis: SST and IMDb
Question classification: TREC
Entailment: SNLI, MultiNLI
Language modeling: abstract class + WikiText-2, WikiText-103, PennTreebank
Machine translation: abstract class + Multi30k, IWSLT, WMT14
Sequence tagging (e.g. POS/NER): abstract class + UDPOS, CoNLL2000Chunking
Question answering: 20 QA bAbI tasks
Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull

Others are planned or a work in progress:

Question answering: SQuAD

See the test directory for examples of dataset usage.

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!
