Fixed typos and update documentation (#42)
* Fixed typos and update documentation

* Fixed annotation

* Fixed comments
karan6181 authored Oct 13, 2022
1 parent 3f2af08 commit 02b65e7
Showing 16 changed files with 126 additions and 126 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -56,7 +56,7 @@ Please check the [quick start guide](https://docs.mosaicml.com/projects/streamin
- High performance, accurate streaming of training data from cloud storage
- Efficiently train anywhere, independent of training data location
- Cloud-native, no persistent storage required
- Enhanced data security—data exists ephemerally in training cluster
- Enhanced data security—data exists ephemerally on training cluster


# 🚀 Quickstart
@@ -69,11 +69,11 @@ Streaming is available with Pip:
pip install mosaicml-streaming
```

# Tutorial
Please check our [tutorial](https://docs.mosaicml.com/projects/streaming/) section for the end-to-end model training workflow using Streaming datasets.
# Examples
Please check our [Examples](https://docs.mosaicml.com/projects/streaming/) section for the end-to-end model training workflow using Streaming datasets.

# 📚 Documentation
Getting started guides, examples, tutorials, API reference, and other useful information can be found in our [docs](https://docs.mosaicml.com/projects/streaming).
Getting started guides, examples, API reference, and other useful information can be found in our [docs](https://docs.mosaicml.com/projects/streaming).

# 💫 Contributors
We welcome any contributions, pull requests, or issues!
2 changes: 1 addition & 1 deletion docs/source/getting_started/quick_start.md
@@ -65,6 +65,6 @@ Start training your model with the Streaming dataset in a few steps!
dataloader = DataLoader(dataset)
```

That's it! For additional details on using {mod}`streaming`, please see check out our [User Guide](user_guide.md) and [Tutorial](../tutorial/).
That's it! For additional details on using {mod}`streaming`, please check out our [User Guide](user_guide.md) and [Examples](../examples/cifar10.ipynb).

Happy training!
20 changes: 10 additions & 10 deletions docs/source/getting_started/user_guide.md
@@ -2,14 +2,14 @@

At a very high level, one needs to convert a raw dataset into streaming format files and then use the same streaming format files using {class}`streaming.Dataset` class for model training.

Streaming supports different dataset writers based on your need to for conversion of raw datasets into a streaming format such as
- {class}`streaming.MDSWriter`: Writes the dataset into `.mds` (Mosaic Data Shard) extension. It supports various encoding/decoding formats(`str`, `int`, `bytes`, `jpeg`, `png`, `pil`, `pkl`, and `json`) which converts the data from that format to bytes and vice-versa.
- {class}`streaming.CSVWriter`: Writes the dataset into `.csv` (Comma Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which converts the data from that format to string and vice-versa.
Streaming supports different dataset writers based on your need for conversion of raw datasets into a streaming format such as
- {class}`streaming.MDSWriter`: Writes the dataset into `.mds` (Mosaic Data Shard) extension. It supports various encoding/decoding formats(`str`, `int`, `bytes`, `jpeg`, `png`, `pil`, `pkl`, and `json`) which convert the data from that format to bytes and vice-versa.
- {class}`streaming.CSVWriter`: Writes the dataset into `.csv` (Comma Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which convert the data from that format to string and vice-versa.
- {class}`streaming.JSONWriter`: Writes the dataset into `.json` (JavaScript Object Notation) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`).
- {class}`streaming.TSVWriter`: Writes the dataset into `.tsv` (Tab Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which converts the data from that format to string and vice-versa.
- {class}`streaming.XSVWriter`: Writes the dataset into `.xsv` (user defined Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which converts the data from that format to string and vice-versa.
- {class}`streaming.TSVWriter`: Writes the dataset into `.tsv` (Tab Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which convert the data from that format to string and vice-versa.
- {class}`streaming.XSVWriter`: Writes the dataset into `.xsv` (user defined Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which convert the data from that format to string and vice-versa.

For more information about writers and its parameters, look at the [API reference doc](../api_reference/).
For more information about writers and their parameters, look at the [API reference doc](../api_reference/streaming.rst).

After the dataset has been converted to one of our streaming formats, one just needs to instantiate the {class}`streaming.Dataset` class by providing the dataset path of the streaming formats and use that dataset object in PyTorch {class}`torch.utils.data.DataLoader` class. For more information about `streaming.Dataset` and its parameters, look at the {class}`streaming.Dataset` API reference doc.
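Putting the two halves of that workflow together, here is a minimal sketch (an illustration, not a listing from the guide itself) of writing shards with `MDSWriter` and reading them back through `streaming.Dataset` and a PyTorch `DataLoader`; the column schema, sample values, and the `s3://`/local paths are hypothetical placeholders:

```python
from torch.utils.data import DataLoader

from streaming import MDSWriter, Dataset

# 1. Convert raw samples into MDS shard files under a local directory.
columns = {'number': 'int', 'words': 'str'}  # hypothetical column schema
with MDSWriter(dirname='dirname', columns=columns) as out:
    for i, words in enumerate(['foo', 'bar', 'baz']):
        out.write({'number': i, 'words': words})

# 2. Upload `dirname` to object storage (e.g. `aws s3 cp dirname s3://mybucket/myfolder --recursive`).

# 3. Stream the shards back during training through a DataLoader.
dataset = Dataset(local='/tmp/cache', remote='s3://mybucket/myfolder')
dataloader = DataLoader(dataset)
```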

@@ -71,7 +71,7 @@ The below parameters are optional to {class}`streaming.MDSWriter`. Let's look at
compression = 'zstd:7'
```

2. Provide a name of a hashing algorithm; the default is `None`. Streaming support families of hashing algorithm such as `sha`, `blake`, `md5`, `xxHash`, etc.
2. Provide a name of a hashing algorithm; the default is `None`. Streaming supports families of hashing algorithm such as `sha`, `blake`, `md5`, `xxHash`, etc.
<!--pytest-codeblocks:cont-->
```python
hashes = ['sha1']
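# (Illustrative addition, not part of the original guide: a sketch of how the optional
# `compression` and `hashes` values above could be passed to MDSWriter alongside the
# required `dirname` and `columns` arguments. The column schema and sample are hypothetical.)
from streaming import MDSWriter

columns = {'number': 'int', 'words': 'str'}
with MDSWriter(dirname='dirname', columns=columns,
               compression=compression, hashes=hashes) as out:
    out.write({'number': 0, 'words': 'hello'})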
@@ -131,9 +131,9 @@ $ aws s3 cp dirname s3://mybucket/myfolder --recursive

## Loading a streaming dataset

After writing a dataset in the streaming format in the previous step and uploaded to a cloud object storage as s3, we are ready to start loading the data.
After writing a dataset in the streaming format in the previous step and uploading to a cloud object storage as s3, we are ready to start loading the data.

To load the same dataset files that was created in the above steps, create a `CustomDataset` class by inheriting the {class}`streaming.Dataset` class and override the `__getitem__(idx: int)` method to get the samples. The {class}`streaming.Dataset` class requires to two mandatory parameters which are `remote` which is a remote directory (S3 or local filesystem) where dataset is stored and `local` which is a local directory where dataset is cached during operation.
To load the same dataset files that were created in the above steps, create a `CustomDataset` class by inheriting the {class}`streaming.Dataset` class and override the `__getitem__(idx: int)` method to get the samples. The {class}`streaming.Dataset` class requires two mandatory parameters which are `remote` which is a remote directory (S3 or local filesystem) where dataset is stored and `local` which is a local directory where dataset is cached during operation.
<!--pytest-codeblocks:cont-->
```python
from streaming.base import Dataset
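# (Illustrative sketch only, not the guide's actual listing: a CustomDataset that
# subclasses the streaming Dataset and overrides __getitem__ as described above.
# The column names 'x' and 'y' and the paths below are hypothetical.)
class CustomDataset(Dataset):
    def __init__(self, local: str, remote: str) -> None:
        super().__init__(local=local, remote=remote)

    def __getitem__(self, idx: int):
        sample = super().__getitem__(idx)  # dict keyed by the columns written earlier
        return sample['x'], sample['y']

dataset = CustomDataset(local='/tmp/cache', remote='s3://mybucket/myfolder')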
@@ -167,7 +167,7 @@ from torch.utils.data import DataLoader
dataloader = DataLoader(dataset=dataset)
```

You've now seen an in-depth look at how to prepare and use steaming datasets with PyTorch. To continue learning about Streaming, please continue to explore our [tutorials](../tutorial/)!
You've now seen an in-depth look at how to prepare and use streaming datasets with PyTorch. To continue learning about Streaming, please continue to explore our [examples](../examples/cifar10.ipynb/)!

## Other options

14 changes: 7 additions & 7 deletions docs/source/index.md
@@ -7,10 +7,10 @@ cloud-based object stores. Streaming can read files from local disk or from clou

<!--pytest.mark.skip-->
```python
dataloader = torch.utils.data.DataLoader(dataset=ImageStreamingDataset(remote=s3://...))
dataloader = torch.utils.data.DataLoader(dataset=ImageStreamingDataset(remote='s3://...'))
```

For additional details, please see our [Quick Start](getting_started/quick_start.md) and [User Guide](getting_started/user_start.md).
For additional details, please see our [Quick Start](getting_started/quick_start.md) and [User Guide](getting_started/user_guide.md).

Streaming was originally developed as a part of MosaicML’s Composer training library and is a critical component of our efficient machine learning infrastructure.

@@ -22,14 +22,14 @@ pip install mosaicml-streaming

## Key Benefits

- High performance, accurate streaming of training data from cloud storage.
- Efficiently train anywhere, independent of training data location.
- Cloud-native, no persistent storage required; simplifying infrastructure.
- Enhanced data security, data exists ephemerally training cluster.
- High performance, accurate streaming of training data from cloud storage
- Efficiently train anywhere, independent of training data location
- Cloud-native, no persistent storage required
- Enhanced data security—data exists ephemerally on training cluster

## Features

- Drop-in replacement for {class}`torch.utils.data.Dataset` datasets, compatible {class}`torch.utils.data.IterableDataset` style dataloaders.
- Drop-in replacement for {class}`torch.utils.data.IterableDataset` class.
- Built-in support for popular open source datasets (e.g., ADE20K, C4, COCO, Enwiki, ImageNet, etc.).
- Support for various image, structured and unstructured text formats.
- Helper utilities to convert proprietary datasets to streaming format.
34 changes: 16 additions & 18 deletions examples/cifar10.ipynb
@@ -61,14 +61,14 @@
"import time\n",
"import os\n",
"import shutil\n",
"from typing import Callable, Any\n",
"from typing import Callable, Any, Tuple\n",
"\n",
"import numpy as np\n",
"from tqdm import tqdm\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"from torch.utils.data import DataLoader\n",
"from torch.utils.data import DataLoader, Dataset\n",
"from torchvision import transforms, models\n",
"from torchvision.datasets import CIFAR10"
]
@@ -86,7 +86,7 @@
"metadata": {},
"outputs": [],
"source": [
"from streaming import MDSWriter, Dataset"
"import streaming as ms"
]
},
{
@@ -107,11 +107,11 @@
"# the location of our dataset\n",
"in_root = \"./dataset\"\n",
"\n",
"# the location of the \"remote\" streaming dataset. \n",
"# the location of the \"remote\" streaming dataset (`sds`). \n",
"# Upload `out_root` to your cloud storage provider of choice.\n",
"out_root = \"./sdl\"\n",
"out_train = \"./sdl/train\"\n",
"out_test = \"./sdl/test\"\n",
"out_root = \"./sds\"\n",
"out_train = \"./sds/train\"\n",
"out_test = \"./sds/test\"\n",
"\n",
"# the location to download the streaming dataset during training\n",
"local = './local'\n",
@@ -199,19 +199,17 @@
"metadata": {},
"outputs": [],
"source": [
"def write_datasets(dataset, split_dir) -> None:\n",
"def write_datasets(dataset: Dataset, split_dir: str) -> None:\n",
" fields = {\n",
" 'i': 'int',\n",
" 'x': 'pil',\n",
" 'y': 'int',\n",
" }\n",
" indices = np.random.permutation(len(dataset))\n",
" indices = tqdm(indices)\n",
" with MDSWriter(dirname=split_dir, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" with ms.MDSWriter(dirname=split_dir, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" for i in indices:\n",
" x, y = dataset[i]\n",
" out.write({\n",
" 'i': i,\n",
" 'x': x,\n",
" 'y': y,\n",
" })"
@@ -251,7 +249,7 @@
"metadata": {},
"outputs": [],
"source": [
"class CIFAR10Dataset(Dataset):\n",
"class CIFAR10Dataset(ms.Dataset):\n",
" def __init__(self,\n",
" remote: str,\n",
" local: str,\n",
@@ -273,7 +271,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform the data"
"## Initialize the data transformation"
]
},
{
@@ -393,7 +391,7 @@
"model = Net()\n",
"model = model.to(device)\n",
"criterion = nn.CrossEntropyLoss()\n",
"optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9,weight_decay=5e-4)"
"optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)"
]
},
{
@@ -409,7 +407,7 @@
"metadata": {},
"outputs": [],
"source": [
"def fit(model, train_dataloader):\n",
"def fit(model: nn.Module, train_dataloader: DataLoader) -> Tuple[float, float]:\n",
" model.train()\n",
" train_running_loss = 0.0\n",
" train_running_correct = 0\n",
@@ -445,7 +443,7 @@
"metadata": {},
"outputs": [],
"source": [
"def eval(model, test_dataloader):\n",
"def eval(model: nn.Module, test_dataloader: DataLoader) -> Tuple[float, float]:\n",
" model.eval()\n",
" val_running_loss = 0.0\n",
" val_running_correct = 0\n",
@@ -514,9 +512,9 @@
"\n",
"## What next?\n",
"\n",
"You've now seen an in-depth look at how to prepare and use steaming datasets with PyTorch.\n",
"You've now seen an in-depth look at how to prepare and use streaming datasets with PyTorch.\n",
"\n",
"To continue learning about Streaming, please continue to explore our tutorials!"
"To continue learning about Streaming, please continue to explore our examples!"
]
},
{
27 changes: 13 additions & 14 deletions examples/facesynthetics.ipynb
@@ -104,7 +104,7 @@
"metadata": {},
"outputs": [],
"source": [
"from streaming import MDSWriter, Dataset\n",
"import streaming as ms\n",
"from composer.models.deeplabv3 import composer_deeplabv3"
]
},
@@ -140,11 +140,11 @@
"# the location of our dataset\n",
"in_root = \"./dataset\"\n",
"\n",
"# the location of the \"remote\" streaming dataset. \n",
"# the location of the \"remote\" streaming dataset (`sds`). \n",
"# Upload `out_root` to your cloud storage provider of choice.\n",
"out_root = \"./sdl\"\n",
"out_train = \"./sdl/train\"\n",
"out_test = \"./sdl/test\"\n",
"out_root = \"./sds\"\n",
"out_train = \"./sds/train\"\n",
"out_test = \"./sds/test\"\n",
"\n",
"# the location to download the streaming dataset during training\n",
"local = './local'\n",
@@ -264,7 +264,6 @@
"\n",
" with open(image, 'rb') as x, open(annotation, 'rb') as y:\n",
" yield {\n",
" 'i': i,\n",
" 'x': x.read(),\n",
" 'y': y.read(),\n",
" }"
@@ -290,16 +289,16 @@
"outputs": [],
"source": [
"def write_datasets() -> None:\n",
" fields = {'i': 'int', 'x': 'png', 'y': 'png'}\n",
" fields = {'x': 'png', 'y': 'png'}\n",
" \n",
" num_training_images = int(num_images * training_ratio)\n",
" \n",
" start_ix, end_ix = 0, num_training_images\n",
" with MDSWriter(dirname=out_train, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" with ms.MDSWriter(dirname=out_train, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" for sample in each(in_root, start_ix, end_ix):\n",
" out.write(sample)\n",
" start_ix, end_ix = end_ix, num_images \n",
" with MDSWriter(dirname=out_test, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" with ms.MDSWriter(dirname=out_test, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" for sample in each(in_root, start_ix, end_ix):\n",
" out.write(sample) "
]
@@ -342,7 +341,7 @@
"metadata": {},
"outputs": [],
"source": [
"class FaceSynthetics(Dataset):\n",
"class FaceSynthetics(ms.Dataset):\n",
" def __init__(self,\n",
" remote: str,\n",
" local: str,\n",
@@ -504,9 +503,9 @@
"\n",
"## What next?\n",
"\n",
"You've now seen an in-depth look at how to prepare and use steaming datasets with Composer.\n",
"You've now seen an in-depth look at how to prepare and use streaming datasets with Composer.\n",
"\n",
"To continue learning about Streaming, please continue to explore our tutorials!"
"To continue learning about Streaming, please continue to explore our examples!"
]
},
{
@@ -534,7 +533,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('composer_py3_10')",
"display_name": "Python 3.10.6 ('streaming_py3_10')",
"language": "python",
"name": "python3"
},
@@ -552,7 +551,7 @@
},
"vscode": {
"interpreter": {
"hash": "9212f7dacb3fe721700a61800ceb3ababe1b81b3f1dae0e262f7fafb18a5051f"
"hash": "cb0371d9985d03b7be04a8e8a123b72f0ef8951070c9235d824cee9281d7d420"
}
}
},