Fixed typos and update documentation (#42)
* Fixed typos and update documentation

* Fixed annotation

* Fixed comments
karan6181 authored Oct 13, 2022
1 parent 3f2af08 commit 02b65e7
Showing 16 changed files with 126 additions and 126 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -56,7 +56,7 @@ Please check the [quick start guide](https://docs.mosaicml.com/projects/streamin
- High performance, accurate streaming of training data from cloud storage
- Efficiently train anywhere, independent of training data location
- Cloud-native, no persistent storage required
- Enhanced data security—data exists ephemerally in training cluster
- Enhanced data security—data exists ephemerally on training cluster


# 🚀 Quickstart
@@ -69,11 +69,11 @@ Streaming is available with Pip:
pip install mosaicml-streaming
```

# Tutorial
Please check our [tutorial](https://docs.mosaicml.com/projects/streaming/) section for the end-to-end model training workflow using Streaming datasets.
# Examples
Please check our [Examples](https://docs.mosaicml.com/projects/streaming/) section for the end-to-end model training workflow using Streaming datasets.

# 📚 Documentation
Getting started guides, examples, tutorials, API reference, and other useful information can be found in our [docs](https://docs.mosaicml.com/projects/streaming).
Getting started guides, examples, API reference, and other useful information can be found in our [docs](https://docs.mosaicml.com/projects/streaming).

# 💫 Contributors
We welcome any contributions, pull requests, or issues!
2 changes: 1 addition & 1 deletion docs/source/getting_started/quick_start.md
@@ -65,6 +65,6 @@ Start training your model with the Streaming dataset in a few steps!
dataloader = DataLoader(dataset)
```

That's it! For additional details on using {mod}`streaming`, please see check out our [User Guide](user_guide.md) and [Tutorial](../tutorial/).
That's it! For additional details on using {mod}`streaming`, please check out our [User Guide](user_guide.md) and [Examples](../examples/cifar10.ipynb).

Happy training!
20 changes: 10 additions & 10 deletions docs/source/getting_started/user_guide.md
@@ -2,14 +2,14 @@

At a very high level, one needs to convert a raw dataset into streaming format files and then use the same streaming format files using {class}`streaming.Dataset` class for model training.

Streaming supports different dataset writers based on your need to for conversion of raw datasets into a streaming format such as
- {class}`streaming.MDSWriter`: Writes the dataset into `.mds` (Mosaic Data Shard) extension. It supports various encoding/decoding formats(`str`, `int`, `bytes`, `jpeg`, `png`, `pil`, `pkl`, and `json`) which converts the data from that format to bytes and vice-versa.
- {class}`streaming.CSVWriter`: Writes the dataset into `.csv` (Comma Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which converts the data from that format to string and vice-versa.
Streaming supports different dataset writers based on your need for conversion of raw datasets into a streaming format such as
- {class}`streaming.MDSWriter`: Writes the dataset into `.mds` (Mosaic Data Shard) extension. It supports various encoding/decoding formats(`str`, `int`, `bytes`, `jpeg`, `png`, `pil`, `pkl`, and `json`) which convert the data from that format to bytes and vice-versa.
- {class}`streaming.CSVWriter`: Writes the dataset into `.csv` (Comma Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which convert the data from that format to string and vice-versa.
- {class}`streaming.JSONWriter`: Writes the dataset into `.json` (JavaScript Object Notation) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`).
- {class}`streaming.TSVWriter`: Writes the dataset into `.tsv` (Tab Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which converts the data from that format to string and vice-versa.
- {class}`streaming.XSVWriter`: Writes the dataset into `.xsv` (user defined Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which converts the data from that format to string and vice-versa.
- {class}`streaming.TSVWriter`: Writes the dataset into `.tsv` (Tab Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which convert the data from that format to string and vice-versa.
- {class}`streaming.XSVWriter`: Writes the dataset into `.xsv` (user defined Separated Values) extension. It supports various encoding/decoding formats(`str`, `int`, and `float`) which convert the data from that format to string and vice-versa.

For more information about writers and its parameters, look at the [API reference doc](../api_reference/).
For more information about writers and their parameters, look at the [API reference doc](../api_reference/streaming.rst).

After the dataset has been converted to one of our streaming formats, one just needs to instantiate the {class}`streaming.Dataset` class by providing the dataset path of the streaming formats and use that dataset object in PyTorch {class}`torch.utils.data.DataLoader` class. For more information about `streaming.Dataset` and its parameters, look at the {class}`streaming.Dataset` API reference doc.
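Putting the two halves of that workflow together, here is a minimal sketch (an illustration, not a listing from the guide itself) of writing shards with `MDSWriter` and reading them back through `streaming.Dataset` and a PyTorch `DataLoader`; the column schema, sample values, and the `s3://`/local paths are hypothetical placeholders:

```python
from torch.utils.data import DataLoader

from streaming import MDSWriter, Dataset

# 1. Convert raw samples into MDS shard files under a local directory.
columns = {'number': 'int', 'words': 'str'}  # hypothetical column schema
with MDSWriter(dirname='dirname', columns=columns) as out:
    for i, words in enumerate(['foo', 'bar', 'baz']):
        out.write({'number': i, 'words': words})

# 2. Upload `dirname` to object storage (e.g. `aws s3 cp dirname s3://mybucket/myfolder --recursive`).

# 3. Stream the shards back during training through a DataLoader.
dataset = Dataset(local='/tmp/cache', remote='s3://mybucket/myfolder')
dataloader = DataLoader(dataset)
```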

@@ -71,7 +71,7 @@ The below parameters are optional to {class}`streaming.MDSWriter`. Let's look at
compression = 'zstd:7'
```

2. Provide a name of a hashing algorithm; the default is `None`. Streaming support families of hashing algorithm such as `sha`, `blake`, `md5`, `xxHash`, etc.
2. Provide a name of a hashing algorithm; the default is `None`. Streaming supports families of hashing algorithm such as `sha`, `blake`, `md5`, `xxHash`, etc.
<!--pytest-codeblocks:cont-->
```python
hashes = ['sha1']
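# (Illustrative addition, not part of the original guide: a sketch of how the optional
# `compression` and `hashes` values above could be passed to MDSWriter alongside the
# required `dirname` and `columns` arguments. The column schema and sample are hypothetical.)
from streaming import MDSWriter

columns = {'number': 'int', 'words': 'str'}
with MDSWriter(dirname='dirname', columns=columns,
               compression=compression, hashes=hashes) as out:
    out.write({'number': 0, 'words': 'hello'})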
@@ -131,9 +131,9 @@ $ aws s3 cp dirname s3://mybucket/myfolder --recursive

## Loading a streaming dataset

After writing a dataset in the streaming format in the previous step and uploaded to a cloud object storage as s3, we are ready to start loading the data.
After writing a dataset in the streaming format in the previous step and uploading to a cloud object storage as s3, we are ready to start loading the data.

To load the same dataset files that was created in the above steps, create a `CustomDataset` class by inheriting the {class}`streaming.Dataset` class and override the `__getitem__(idx: int)` method to get the samples. The {class}`streaming.Dataset` class requires to two mandatory parameters which are `remote` which is a remote directory (S3 or local filesystem) where dataset is stored and `local` which is a local directory where dataset is cached during operation.
To load the same dataset files that were created in the above steps, create a `CustomDataset` class by inheriting the {class}`streaming.Dataset` class and override the `__getitem__(idx: int)` method to get the samples. The {class}`streaming.Dataset` class requires two mandatory parameters which are `remote` which is a remote directory (S3 or local filesystem) where dataset is stored and `local` which is a local directory where dataset is cached during operation.
<!--pytest-codeblocks:cont-->
```python
from streaming.base import Dataset
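# (Illustrative sketch only, not the guide's actual listing: a CustomDataset that
# subclasses the streaming Dataset and overrides __getitem__ as described above.
# The column names 'x' and 'y' and the paths below are hypothetical.)
class CustomDataset(Dataset):
    def __init__(self, local: str, remote: str) -> None:
        super().__init__(local=local, remote=remote)

    def __getitem__(self, idx: int):
        sample = super().__getitem__(idx)  # dict keyed by the columns written earlier
        return sample['x'], sample['y']

dataset = CustomDataset(local='/tmp/cache', remote='s3://mybucket/myfolder')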
@@ -167,7 +167,7 @@ from torch.utils.data import DataLoader
dataloader = DataLoader(dataset=dataset)
```

You've now seen an in-depth look at how to prepare and use steaming datasets with PyTorch. To continue learning about Streaming, please continue to explore our [tutorials](../tutorial/)!
You've now seen an in-depth look at how to prepare and use streaming datasets with PyTorch. To continue learning about Streaming, please continue to explore our [examples](../examples/cifar10.ipynb/)!

## Other options

14 changes: 7 additions & 7 deletions docs/source/index.md
@@ -7,10 +7,10 @@ cloud-based object stores. Streaming can read files from local disk or from clou

<!--pytest.mark.skip-->
```python
dataloader = torch.utils.data.DataLoader(dataset=ImageStreamingDataset(remote=s3://...))
dataloader = torch.utils.data.DataLoader(dataset=ImageStreamingDataset(remote='s3://...'))
```

For additional details, please see our [Quick Start](getting_started/quick_start.md) and [User Guide](getting_started/user_start.md).
For additional details, please see our [Quick Start](getting_started/quick_start.md) and [User Guide](getting_started/user_guide.md).

Streaming was originally developed as a part of MosaicML’s Composer training library and is a critical component of our efficient machine learning infrastructure.

@@ -22,14 +22,14 @@ pip install mosaicml-streaming

## Key Benefits

- High performance, accurate streaming of training data from cloud storage.
- Efficiently train anywhere, independent of training data location.
- Cloud-native, no persistent storage required; simplifying infrastructure.
- Enhanced data security, data exists ephemerally training cluster.
- High performance, accurate streaming of training data from cloud storage
- Efficiently train anywhere, independent of training data location
- Cloud-native, no persistent storage required
- Enhanced data security—data exists ephemerally on training cluster

## Features

- Drop-in replacement for {class}`torch.utils.data.Dataset` datasets, compatible {class}`torch.utils.data.IterableDataset` style dataloaders.
- Drop-in replacement for {class}`torch.utils.data.IterableDataset` class.
- Built-in support for popular open source datasets (e.g., ADE20K, C4, COCO, Enwiki, ImageNet, etc.).
- Support for various image, structured and unstructured text formats.
- Helper utilities to convert proprietary datasets to streaming format.
34 changes: 16 additions & 18 deletions examples/cifar10.ipynb
@@ -61,14 +61,14 @@
"import time\n",
"import os\n",
"import shutil\n",
"from typing import Callable, Any\n",
"from typing import Callable, Any, Tuple\n",
"\n",
"import numpy as np\n",
"from tqdm import tqdm\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"from torch.utils.data import DataLoader\n",
"from torch.utils.data import DataLoader, Dataset\n",
"from torchvision import transforms, models\n",
"from torchvision.datasets import CIFAR10"
]
@@ -86,7 +86,7 @@
"metadata": {},
"outputs": [],
"source": [
"from streaming import MDSWriter, Dataset"
"import streaming as ms"
]
},
{
@@ -107,11 +107,11 @@
"# the location of our dataset\n",
"in_root = \"./dataset\"\n",
"\n",
"# the location of the \"remote\" streaming dataset. \n",
"# the location of the \"remote\" streaming dataset (`sds`). \n",
"# Upload `out_root` to your cloud storage provider of choice.\n",
"out_root = \"./sdl\"\n",
"out_train = \"./sdl/train\"\n",
"out_test = \"./sdl/test\"\n",
"out_root = \"./sds\"\n",
"out_train = \"./sds/train\"\n",
"out_test = \"./sds/test\"\n",
"\n",
"# the location to download the streaming dataset during training\n",
"local = './local'\n",
@@ -199,19 +199,17 @@
"metadata": {},
"outputs": [],
"source": [
"def write_datasets(dataset, split_dir) -> None:\n",
"def write_datasets(dataset: Dataset, split_dir: str) -> None:\n",
" fields = {\n",
" 'i': 'int',\n",
" 'x': 'pil',\n",
" 'y': 'int',\n",
" }\n",
" indices = np.random.permutation(len(dataset))\n",
" indices = tqdm(indices)\n",
" with MDSWriter(dirname=split_dir, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" with ms.MDSWriter(dirname=split_dir, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" for i in indices:\n",
" x, y = dataset[i]\n",
" out.write({\n",
" 'i': i,\n",
" 'x': x,\n",
" 'y': y,\n",
" })"
@@ -251,7 +249,7 @@
"metadata": {},
"outputs": [],
"source": [
"class CIFAR10Dataset(Dataset):\n",
"class CIFAR10Dataset(ms.Dataset):\n",
" def __init__(self,\n",
" remote: str,\n",
" local: str,\n",
@@ -273,7 +271,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform the data"
"## Initialize the data transformation"
]
},
{
@@ -393,7 +391,7 @@
"model = Net()\n",
"model = model.to(device)\n",
"criterion = nn.CrossEntropyLoss()\n",
"optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9,weight_decay=5e-4)"
"optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)"
]
},
{
@@ -409,7 +407,7 @@
"metadata": {},
"outputs": [],
"source": [
"def fit(model, train_dataloader):\n",
"def fit(model: nn.Module, train_dataloader: DataLoader) -> Tuple[float, float]:\n",
" model.train()\n",
" train_running_loss = 0.0\n",
" train_running_correct = 0\n",
@@ -445,7 +443,7 @@
"metadata": {},
"outputs": [],
"source": [
"def eval(model, test_dataloader):\n",
"def eval(model: nn.Module, test_dataloader: DataLoader) -> Tuple[float, float]:\n",
" model.eval()\n",
" val_running_loss = 0.0\n",
" val_running_correct = 0\n",
@@ -514,9 +512,9 @@
"\n",
"## What next?\n",
"\n",
"You've now seen an in-depth look at how to prepare and use steaming datasets with PyTorch.\n",
"You've now seen an in-depth look at how to prepare and use streaming datasets with PyTorch.\n",
"\n",
"To continue learning about Streaming, please continue to explore our tutorials!"
"To continue learning about Streaming, please continue to explore our examples!"
]
},
{
27 changes: 13 additions & 14 deletions examples/facesynthetics.ipynb
@@ -104,7 +104,7 @@
"metadata": {},
"outputs": [],
"source": [
"from streaming import MDSWriter, Dataset\n",
"import streaming as ms\n",
"from composer.models.deeplabv3 import composer_deeplabv3"
]
},
@@ -140,11 +140,11 @@
"# the location of our dataset\n",
"in_root = \"./dataset\"\n",
"\n",
"# the location of the \"remote\" streaming dataset. \n",
"# the location of the \"remote\" streaming dataset (`sds`). \n",
"# Upload `out_root` to your cloud storage provider of choice.\n",
"out_root = \"./sdl\"\n",
"out_train = \"./sdl/train\"\n",
"out_test = \"./sdl/test\"\n",
"out_root = \"./sds\"\n",
"out_train = \"./sds/train\"\n",
"out_test = \"./sds/test\"\n",
"\n",
"# the location to download the streaming dataset during training\n",
"local = './local'\n",
@@ -264,7 +264,6 @@
"\n",
" with open(image, 'rb') as x, open(annotation, 'rb') as y:\n",
" yield {\n",
" 'i': i,\n",
" 'x': x.read(),\n",
" 'y': y.read(),\n",
" }"
@@ -290,16 +289,16 @@
"outputs": [],
"source": [
"def write_datasets() -> None:\n",
" fields = {'i': 'int', 'x': 'png', 'y': 'png'}\n",
" fields = {'x': 'png', 'y': 'png'}\n",
" \n",
" num_training_images = int(num_images * training_ratio)\n",
" \n",
" start_ix, end_ix = 0, num_training_images\n",
" with MDSWriter(dirname=out_train, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" with ms.MDSWriter(dirname=out_train, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" for sample in each(in_root, start_ix, end_ix):\n",
" out.write(sample)\n",
" start_ix, end_ix = end_ix, num_images \n",
" with MDSWriter(dirname=out_test, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" with ms.MDSWriter(dirname=out_test, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
" for sample in each(in_root, start_ix, end_ix):\n",
" out.write(sample) "
]
@@ -342,7 +341,7 @@
"metadata": {},
"outputs": [],
"source": [
"class FaceSynthetics(Dataset):\n",
"class FaceSynthetics(ms.Dataset):\n",
" def __init__(self,\n",
" remote: str,\n",
" local: str,\n",
@@ -504,9 +503,9 @@
"\n",
"## What next?\n",
"\n",
"You've now seen an in-depth look at how to prepare and use steaming datasets with Composer.\n",
"You've now seen an in-depth look at how to prepare and use streaming datasets with Composer.\n",
"\n",
"To continue learning about Streaming, please continue to explore our tutorials!"
"To continue learning about Streaming, please continue to explore our examples!"
]
},
{
@@ -534,7 +533,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('composer_py3_10')",
"display_name": "Python 3.10.6 ('streaming_py3_10')",
"language": "python",
"name": "python3"
},
@@ -552,7 +551,7 @@
},
"vscode": {
"interpreter": {
"hash": "9212f7dacb3fe721700a61800ceb3ababe1b81b3f1dae0e262f7fafb18a5051f"
"hash": "cb0371d9985d03b7be04a8e8a123b72f0ef8951070c9235d824cee9281d7d420"
}
}
},