diff --git a/NEWS.md b/NEWS.md index 5a5a0525e79..0412c1fb821 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,5 +1,21 @@

News

+2017-08-15: [New task added: CLEVR](https://github.com/facebookresearch/ParlAI/blob/master/parlai/tasks/task_list.py) + +2017-07-20: [ParlAI Request For Proposals: Funding university teams - 7 awards are available - deadline Aug 25](https://research.fb.com/programs/research-awards/proposals/parlai/) + +2017-07-20: [added building an (seq2seq) agent tutorial](http://www.parl.ai/static/docs/seq2seq_tutorial.html) + +2017-07-12: [Several new tasks added: MS Marco, TriviaQA, InsuranceQA, personalized-dialog and MNIST_QA](https://github.com/facebookresearch/ParlAI/blob/master/parlai/tasks/task_list.py) + +2017-06-27: [ExecutableWorld class for interactive worlds with dialog](https://github.com/facebookresearch/ParlAI/pull/170) + +2017-06-21: [MTurk now supports multiple assignments per HIT](https://github.com/facebookresearch/ParlAI/pull/156) + +2017-06-20: [updated MTurk tutorial to reflect new design](http://parl.ai/static/docs/mturk.html) + +2017-06-20: [MTurk now uses general world and agent classes](https://github.com/facebookresearch/ParlAI/pull/128) + 2017-06-16: [added Creating a New Task tutorial](http://parl.ai/static/docs/task_tutorial.html) 2017-05-31: [added Seq2Seq model](https://github.com/facebookresearch/ParlAI/pull/96) diff --git a/README.md b/README.md index bc02b32ecfe..8189824a790 100644 --- a/README.md +++ b/README.md @@ -5,12 +5,11 @@ ParlAI (pronounced “par-lay”) is a framework for dialog AI research, implemented in Python. 
Its goal is to provide researchers: -- a unified framework for training and testing dialog models +- a unified framework for sharing, training and testing dialog models - multi-task training over many datasets at once - seamless integration of [Amazon Mechanical Turk](https://www.mturk.com/mturk/welcome) for data collection and human evaluation - -Over 20 tasks are supported in the first release, including popular datasets such as [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/), [bAbI tasks](https://arxiv.org/abs/1502.05698), [MCTest](https://www.microsoft.com/en-us/research/publication/mctest-challenge-dataset-open-domain-machine-comprehension-text/), [WikiQA](https://www.microsoft.com/en-us/download/details.aspx?id=52419), [WebQuestions](http://www.aclweb.org/anthology/D13-1160), [SimpleQuestions](https://arxiv.org/abs/1506.02075), [WikiMovies](https://arxiv.org/abs/1606.03126), [QACNN & QADailyMail](https://arxiv.org/abs/1506.03340), [CBT](https://arxiv.org/abs/1511.02301), [BookTest](https://arxiv.org/abs/1610.00956), [bAbI Dialog tasks](https://arxiv.org/abs/1605.07683), [Ubuntu Dialog](https://arxiv.org/abs/1506.08909), [OpenSubtitles](http://opus.lingfil.uu.se/OpenSubtitles.php), [Cornell Movie](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) and [VQA-COCO2014](http://visualqa.org/). 
+Over 20 [tasks](https://github.com/facebookresearch/ParlAI/blob/master/parlai/tasks/task_list.py) are currently supported, including popular datasets such as [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/), [bAbI tasks](https://arxiv.org/abs/1502.05698), [MS MARCO](http://www.msmarco.org/), [MCTest](https://www.microsoft.com/en-us/research/publication/mctest-challenge-dataset-open-domain-machine-comprehension-text/), [WikiQA](https://www.microsoft.com/en-us/download/details.aspx?id=52419), [WebQuestions](http://www.aclweb.org/anthology/D13-1160), [SimpleQuestions](https://arxiv.org/abs/1506.02075), [WikiMovies](https://arxiv.org/abs/1606.03126), [QACNN & QADailyMail](https://arxiv.org/abs/1506.03340), [CBT](https://arxiv.org/abs/1511.02301), [BookTest](https://arxiv.org/abs/1610.00956), [bAbI Dialog tasks](https://arxiv.org/abs/1605.07683), [Ubuntu Dialog](https://arxiv.org/abs/1506.08909), [OpenSubtitles](http://opus.lingfil.uu.se/OpenSubtitles.php), [Cornell Movie](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), [VQA-COCO2014](http://visualqa.org/), [VisDial](https://arxiv.org/abs/1611.08669) and [CLEVR](http://cs.stanford.edu/people/jcjohns/clevr/). See [here](http://www.parl.ai/static/docs/tasks.html#) for the current complete task list. Included are examples of training neural models with [PyTorch](http://pytorch.org/) and [Lua Torch](http://torch.ch/), with batch training on GPU or hogwild training on CPUs. Using [Theano](http://deeplearning.net/software/theano/) or [Tensorflow](https://www.tensorflow.org/) instead is also straightforward. @@ -23,6 +22,8 @@ ParlAI is described in the following paper: We are in an early-release Beta. Expect some adventures and rough edges.
See the [news page](https://github.com/facebookresearch/ParlAI/blob/master/NEWS.md) for the latest additions & updates, and the website [http://parl.ai](http://parl.ai) for further docs. +Please also note there is a [ParlAI Request For Proposals funding university teams, 7 awards are available - deadline Aug 25.](https://research.fb.com/programs/research-awards/proposals/parlai/) + ## Goals Unified framework for evaluation of dialogue models @@ -85,14 +86,14 @@ Display the predictions of that same IR baseline model: python examples/display_model.py -m ir_baseline -t "#moviedd-reddit" -dt valid ``` -Train a simple cpu-based memory network on the "10k training examples" bAbI task 1 with 8 threads (python processes) using Hogwild (requires zmq and Lua Torch): +Train a seq2seq model on the "1k training examples" bAbI task 1 with batch size of 8 examples for one epoch (requires pytorch): ```bash -python examples/memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1 -nt 8 +python examples/train_model.py -m seq2seq -t babi:task1k:1 -bs 8 -e 1 -mf /tmp/model_s2s ``` Trains an attentive LSTM model on the SQuAD dataset with a batch size of 32 examples (pytorch and regex): ```bash -python examples/train_model.py -m drqa -t squad -bs 32 -mf /tmp/model +python examples/train_model.py -m drqa -t squad -bs 32 -mf /tmp/model_drqa ``` ## Requirements @@ -124,7 +125,8 @@ All needed data will be downloaded to ~/ParlAI/data, and any non-data files (suc The main concepts (classes) in ParlAI: - world - defines the environment (can be very simple, just two agents talking to each other). - agent – an agent in the world, e.g. the learner. (There can be multiple learners.) -- teacher – a type of agent that talks to the learner, implements one of the tasks listed before. +- teacher – a type of agent that talks to the learner, implements one of the +tasks listed before. 
After defining a world and the agents in it, a main loop can be run for training, testing or displaying, which calls the function world.parley(). The skeleton of an example main is given in the left panel, and the actual code for parley() on the right. @@ -234,15 +236,13 @@ This directory contains a few particular examples of basic loops. ### Tasks - -Over 20 tasks are supported in the first release, including popular datasets such as -SQuAD, bAbI tasks, MCTest, WikiQA, WebQuestions, SimpleQuestions, WikiMovies, QACNN, QADailyMail, CBT, BookTest, bAbI Dialog tasks, -Ubuntu, OpenSubtitles, Cornell Movie and VQA-COCO2014. - -Our first release includes the following datasets (shown in the left panel), and accessing one of them is as simple as specifying the name of the task as a command line option, as shown in the dataset display utility (right panel): +Our first release included the following datasets (shown in the left panel), and accessing one of them is as simple as specifying the name of the task as a command line option, as shown in the dataset display utility (right panel):

-See [here](https://github.com/facebookresearch/ParlAI/blob/master/parlai/tasks/task_list.py) for the current complete task list. +Over 20 tasks were supported in the first release, including popular datasets such as +SQuAD, bAbI tasks, MCTest, WikiQA, WebQuestions, SimpleQuestions, WikiMovies, QACNN, QADailyMail, CBT, BookTest, bAbI Dialog tasks, +Ubuntu, OpenSubtitles, Cornell Movie, VQA-COCO2014. +Since then, several datasets have been added such as VQAv2, VisDial, MNIST_QA, Personalized Dialog, InsuranceQA, MS MARCO, TriviaQA, and CLEVR. See [here](http://www.parl.ai/static/docs/tasks.html#) for the current complete task list. Choosing a task in ParlAI is as easy as specifying it on the command line, as shown in the above image (right). If the dataset has not been used before, ParlAI will automatically download it. As all datasets are treated in the same way in ParlAI (with a single dialog API), a dialog agent can in principle switch training and testing between any of them. Even better, one can specify many tasks at once (multi-tasking) by simply providing a comma-separated list, e.g. the command line “-t babi,squad”, to use those two datasets, or even all the QA datasets at once (-t #qa) or indeed every task in ParlAI at once (-t #all). The aim is to make it easy to build and evaluate very rich dialog models. @@ -300,17 +300,18 @@ If you have any questions, bug reports or feature requests, please don't hesitat ## The Team ParlAI is currently maintained by Alexander H. Miller, Will Feng and Jason Weston. A non-exhaustive list of other major contributors includes: -Adam Fisch, Jiasen Lu, Antoine Bordes, Devi Parikh and Dhruv Batra. +Adam Fisch, Jiasen Lu, Antoine Bordes, Devi Parikh, Dhruv Batra, +Filipe de Avila Belbute Peres and Chao Pan. 
## Citation -Please cite the arXiv paper if you use ParlAI in your work: +Please cite the [arXiv paper](https://arxiv.org/abs/1705.06476) if you use ParlAI in your work: ``` @article{miller2017parlai, title={ParlAI: A Dialog Research Software Platform}, author={{Miller}, A.~H. and {Feng}, W. and {Fisch}, A. and {Lu}, J. and {Batra}, D. and {Bordes}, A. and {Parikh}, D. and {Weston}, J.}, - journal={arXiv preprint arXiv:{1705.06476}, + journal={arXiv preprint arXiv:{1705.06476}}, year={2017} } ``` diff --git a/docs/source/_static/img/task_tutorial_skateboard.jpg b/docs/source/_static/img/task_tutorial_skateboard.jpg new file mode 100644 index 00000000000..d8048f91be8 Binary files /dev/null and b/docs/source/_static/img/task_tutorial_skateboard.jpg differ diff --git a/docs/source/basic_tutorial.rst b/docs/source/basic_tutorial.rst index b67b898c2ea..dbec28966c3 100644 --- a/docs/source/basic_tutorial.rst +++ b/docs/source/basic_tutorial.rst @@ -7,6 +7,7 @@ What is ParlAI? =============== +**Author**: Alexander Holden Miller It's a python-based platform for enabling dialog AI research. @@ -24,19 +25,20 @@ Follow the step by step guide on how to download and install ParlAI. .. code-block:: bash - git clone https://github.com/facebookresearch/ParlAI.git ~/ParlAI + git clone https://github.com/facebookresearch/ParlAI.git ~/ParlAI 2. Install ParlAI: -.. code-block:: bash +.. code-block:: bash + + cd ~/ParlAI; python setup.py develop - cd ~/ParlAI; python setup.py develop +3. Several models have additional requirements: -3. Several models have additional requirements +- DrQA and Seq2Seq require installing `PyTorch `_. - a. DrQA requires installing `PyTorch ` +- MemNN requires installing `Lua Torch `_. - b. MemNN requires installing `Lua Torch ` Getting Started --------------- @@ -202,11 +204,6 @@ Now that we have our our agent, we'll set up the display loop. 
parser = ParlaiParser() opt = parser.parse_args() - if 'task' not in opt: - # if task not specified from the command line, - # default to the 1000-training example bAbI task 1 - opt['task'] = 'babi:task1k:1' - agent = RepeatLabelAgent(opt) world = create_task(opt, agent) @@ -269,5 +266,5 @@ the labels aren't available: return reply -Of course, we can do much better than randomly guessing. In the next tutorial, +Of course, we can do much better than randomly guessing. In another tutorial, we'll set up a better agent which learns from the training data. diff --git a/docs/source/index.rst b/docs/source/index.rst index 38e9d4a022f..cba544f64d2 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -23,6 +23,7 @@ ParlAI is a one-stop-shop for dialog research. basic_tutorial task_tutorial + seq2seq_tutorial mturk .. toctree:: @@ -49,7 +50,6 @@ ParlAI is a one-stop-shop for dialog research. :maxdepth: 1 :caption: Reference Models - memnn_luatorch_cpu remote_agent repeat_label diff --git a/docs/source/memnn_luatorch_cpu.rst b/docs/source/memnn_luatorch_cpu.rst deleted file mode 100644 index 2567f612477..00000000000 --- a/docs/source/memnn_luatorch_cpu.rst +++ /dev/null @@ -1,20 +0,0 @@ -.. - Copyright (c) 2017-present, Facebook, Inc. - All rights reserved. - This source code is licensed under the BSD-style license found in the - LICENSE file in the root directory of this source tree. An additional grant - of patent rights can be found in the PATENTS file in the same directory. 
- -agents.memnn_luatorch_cpu -==================================== - -Memory Networks (LuaTorch CPU model) - - -memnn_agent_parsed.lua - -memnn_agent.lua - -memnn_zmq_parsed.lua - -memnn_zmq.lua diff --git a/docs/source/mturk.rst b/docs/source/mturk.rst index bdcb806a274..9d7cc1f4dc8 100644 --- a/docs/source/mturk.rst +++ b/docs/source/mturk.rst @@ -7,10 +7,11 @@ Using Mechanical Turk ===================== +**Author**: Will Feng -In ParlAI, you can use Amazon Mechanical Turk for **data collection**, **training** and **evaluation** of your dialog model. +In ParlAI, you can use Amazon Mechanical Turk for **data collection**, **training** and **evaluation** of your dialog model. -Human Turkers are viewed as just another type of agent in ParlAI, and hence person-to-person, person-to-bot, or multiple people and bots in group chat can all talk to each other within the same framework. +Human Turkers are viewed as just another type of agent in ParlAI, and hence person-to-person, person-to-bot, or multiple people and bots in group chat can all talk to each other within the same framework. The human Turkers communicate in observation/action dict format, the same as all other agents in ParlAI. During the conversation, the message that human Turkers receive is rendered on the live chat webpage in a pretty printed format, similar to the following: @@ -35,7 +36,7 @@ We provide a few examples of using Mechanical Turk with ParlAI: Task 1: Collecting Data ^^^^^^^^^^^^^^^^^^^^^^^ -One of the biggest use cases of Mechanical Turk is to collect natural language data from human Turkers. +One of the biggest use cases of Mechanical Turk is to collect natural language data from human Turkers. 
As an example, the `QA Data Collection task `__ does the following: @@ -61,7 +62,7 @@ You can easily evaluate your dialog model's performance with human Turkers using In ``ModelEvaluatorWorld``, there are two main components: one is the ``task_world`` that contains the task and the dialog model we are evaluating, the other is the ``MTurkAgent`` which is an interface to the human Turker. -Note that since the human Turker speaks only once to provide the rating, the ``ModelEvaluatorWorld`` doesn't need to use ``turn_index`` to keep track of the turns. +Note that since the human Turker speaks only once to provide the rating, the ``ModelEvaluatorWorld`` doesn't need to use ``turn_index`` to keep track of the turns. After one turn, the task is finished, and the Turker's work is submitted for your review. @@ -77,12 +78,14 @@ This task uses the ``MultiAgentDialogWorld`` which is already implemented in ``p Creating Your Own Task ---------------------- -ParlAI provides a generic MTurk dialog interface that one can use to implement any kind of dialog tasks. To create your own task, start with reading the tutorials on the provided examples, and then copy and modify the example ``worlds.py``, ``run.py`` and ``task_config.py`` files to create your task. +ParlAI provides a generic MTurk dialog interface that one can use to implement any kind of dialog tasks. To create your own task, start with reading the tutorials on the provided examples, and then copy and modify the example ``worlds.py``, ``run.py`` and ``task_config.py`` files to create your task. A few things to keep in mind: 1. To end a conversation, you should send a message with ``episode_done = True`` from the first non-MTurk agent, and the conversation is ended after all MTurk agents respond. -2. Make sure to test your dialog task using MTurk's sandbox mode before pushing it live, by using the ``--sandbox`` flag (enabled by default) when running ``run.py``. +2. 
In ``run.py``, You can use ``hit_index`` and ``assignment_index`` to differentiate between different HITs and assignments, and change the content of the task accordingly. +3. Make sure to test your dialog task using MTurk's sandbox mode before pushing it live, by using the ``--sandbox`` flag (enabled by default) when running ``run.py``. +4. [Optional] If you want to show a custom webpage (instead of the default one) for any of your MTurk agents, you can create an ``html`` folder within your task directory, and then create the ``_cover_page.html`` and ``_index.html`` files within the ``html`` directory. In those files, you can extend from ``core.html`` and override any code blocks that you want to change. (Please look at `parlai/mturk/core/html/mturk_index.html `__ as an example.) These agent-specific templates will automatically be shown to the Turkers in the next run. Running a Task @@ -118,7 +121,7 @@ Please make sure to test your task in MTurk sandbox mode first (``--sandbox``) b Reviewing Turker's Work ----------------------- -After all HITs are completed, you will be provided a webpage link to review them. +After all HITs are completed, you will be provided a webpage link to review them. If you don't take any action in 4 weeks, all HITs will be auto-approved and Turkers will be paid. diff --git a/docs/source/seq2seq_tutorial.rst b/docs/source/seq2seq_tutorial.rst new file mode 100644 index 00000000000..821cc14142e --- /dev/null +++ b/docs/source/seq2seq_tutorial.rst @@ -0,0 +1,441 @@ +.. + Copyright (c) 2017-present, Facebook, Inc. + All rights reserved. + This source code is licensed under the BSD-style license found in the + LICENSE file in the root directory of this source tree. An additional grant + of patent rights can be found in the PATENTS file in the same directory. 
+ +Creating an Agent +================= +**Author**: Alexander Holden Miller + +In this tutorial, we'll be setting up an agent which learns from the data it +sees to produce the right answers. + +For this agent, we'll be implementing a simple GRU Seq2Seq agent based on +Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014) and +Sean Robertson's `Seq2Seq PyTorch tutorial `_. + + +Part 1: Naming Things +^^^^^^^^^^^^^^^^^^^^^ + +In order to make programmatic importing easier, we use a simple naming scheme +for our models, so that on the command line we can just type "--model seq2seq" +to load up the seq2seq model. + +To this end, we create a folder under parlai/agents with the name seq2seq, and +then put an empty __init__.py file there along with seq2seq.py. +Then, we name our agent "Seq2seqAgent". + +This way, "--model seq2seq" can translate to "parlai.agents.seq2seq.seq2seq:Seq2seqAgent". +Underscores in the name become capitals in the class name: "--model local_human" +resides at "parlai.agents.local_human.local_human:LocalHumanAgent". +If you need to put a model at a different path, you can specify the full path +on the command line in the format above (with a colon in front of the class name). +For example, "--model parlai.agents.remote_agent.remote_agent:ParsedRemoteAgent". + +Part 2: Main Agent Methods +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +First off, generally we should inherit from the Agent class in parlai.core.agents. +This provides us with some default implementations (often, ``pass``) of some utility +functions like "shutdown". + +First let's focus on the primary functions to implement: ``__init__``, ``observe``, and ``act``. + +The standard initialization parameters for agents are a dict of command-line parameters `opt` +and an optional dict of shared parameters called `shared`. 
+ +For our Seq2Seq model we'll call our parent init method, which does a few basic operations +like setting self.observation to None and creating a deep copy of the `opt` dict. + +Then, we do a check to see if the `shared` parameter is set. +When it is not None, it's telling this instance to initialize with this particular +state, as this instance will be used either for batched or hogwild training +(depending on your preference). We'll take a quick digression to describe how +batching is set up. + +Batching Example +---------------- + +Let's say we are training our seq2seq model on `babi:task10k:1`. What happens +behind the scenes for a batch size of 4 is that we actually create four shared +versions of the bAbI Task10k teacher, and four shared versions of the seq2seq +agent. These shared versions are initialized from the originals: for the bAbI +teachers, they inherit the data from their parent agent, but they each have +their own local state such as the current example they're showing or how far +through a bAbI episode they are (bAbI task 1 has five examples per episode). +For the seq2seq agent, each shared agent is keeping track of the previous +examples they've seen in this same episode, since each observation does not +repeat previously seen but related information--the agent has to remember it. + +For example, in the first example the agent could get something like the following: +"John is in the bathroom. Mary is in the kitchen. Where is Mary?" +And in the second example in the episode, the agent could get: +"Mary picked up the milk. Mary went to the hallway. Where is John?" +Here, the answer is in the first example's context, so the agent had to remember it. + +Observations are generated by calling the ``act`` function on each teacher, then +passing those observations to each agent by calling the ``observe`` function of the +shared agents. 
The agents are free to transform the previous observation +(for example, prepending previously seen text from the same episode, if applicable). +These transformed observations are packed into a list, which is then passed to +``batch_act`` function our agent implements. We can implement ``batch_act`` differently +from the simple ``act`` function to take advantage of the effects of batching +over multiple examples when executing or updating our model. + +Thus, since our agent's shared-instances will only be used to keep track +of state particular to their sequence of examples in the batch, we have +barely anything to do when setting these shared instances up: we just initialize the +``self.episodeDone`` flag so we know whether we are in the middle of an episode or not. + +The full initialization of the model is included further below, but is very +particular to this particular implementation. Let's talk more about the primary +agent functions we need to define first. + +Observing and Acting +-------------------- +Let's take a look at the ``observe`` function. Here, we can modify the +observation dict if necessary, and then return it to be queued for batching. + +In this version, we first make a deep copy of the observation. Then, if this is +not the first entry in an episode (some datasets like SQuAD have only one entry +for every episode, but others like bAbI have multiple), then we prepend the +previous text to the current text. We use a newline to separate them in case the +model wants to recognize the difference between different lines. + +Then, we store whether this is the last entry in the episode so that we'll be +ready to reset next time if we need to. + +.. 
code-block:: python + + def observe(self, observation): + observation = copy.deepcopy(observation) + if not self.episode_done: + # if the last example wasn't the end of an episode, then we need to + # recall what was said in that example + prev_dialogue = self.observation['text'] + observation['text'] = prev_dialogue + '\n' + observation['text'] + self.observation = observation + self.episode_done = observation['episode_done'] + return observation + + +Next up is the ``act`` function. Since we are going to implement a batched +version, we'll just call the batched version from our single-example act to +reduce code duplication. The performance hit here won't matter much since we'll +only use a batch size of one when debugging. + +.. code-block:: python + + def act(self): + # call batch_act with this batch of one + return self.batch_act([self.observation])[0] + + +Now it's time for the batch_act function. This function gets a list of length +batchsize of observations and returns a list of the same length with this +agent's replies. + +We'll follow this loose format: + +1. Set up our list of dicts to send back as replies, with the agent's ID set. + +2. Convert the incoming observations into tensors to feed into our model. + +3. Produce predictions on the input text using the model. If labels were provided, update the model as well. + +4. Unpack the predictions into the reply dicts and return them. + +.. code-block:: python + + def batch_act(self, observations): + batchsize = len(observations) + # initialize a table of replies with this agent's id + batch_reply = [{'id': self.getID()} for _ in range(batchsize)] + + # convert the observations into batches of inputs and targets + # valid_inds tells us the indices of all valid examples + # e.g. 
for input [{}, {'text': 'hello'}, {}, {}], valid_inds is [1] + # since the other three elements had no 'text' field + xs, ys, valid_inds = self.batchify(observations) + + if len(xs) == 0: + # no valid examples, just return the empty responses we set up + return batch_reply + + # produce predictions either way, but use the targets if available + predictions = self.predict(xs, ys) + + for i in range(len(predictions)): + # map the predictions back to non-empty examples in the batch + # we join with spaces since we produce tokens one at a time + batch_reply[valid_inds[i]]['text'] = ' '.join( + c for c in predictions[i] if c != self.EOS) + + return batch_reply + +Since the implementation of ``batchify`` and ``predict`` are particular to our +model, we'll table those for now. Next up, we'll cover some of +the other methods in the Agent API. + + +Part 3: Extended Agent API +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There are a few other useful methods you may want to define in your agent to +take care of additional functionality one might want during training. Many of these +functions will be automatically called if you use our example training function +to train your model. + +share() +------- +Agents can use this method to share any information they might want between +different instances during batching or hogwild training. For example, during +hogwild training all models are being trained independently in multiple processes, +so you would want to share the model parameters between each one. Teacher classes +use this method to share their data and metrics with other shared instances. + +If you define this method, it's usually a good idea to initialize the shared +dict that's being returned by calling super().share() first. For example, the +Teacher class in parlai.core.agents defines it this way: + +.. 
code-block:: python + + def share(self): + """In addition to default Agent shared parameters, share metrics.""" + shared = super().share() + shared['metrics'] = self.metrics + return shared + +shutdown() +---------- +This function allows your model to do any final wrapup, such as writing any last +logging info, saving an end-state version of the model if desired, or closing +any open connections. + +Our seq2seq model doesn't implement this, but the agents in parlai/agents/remote_agent +use this to close their open TCP connection after sending a shutdown signal through. + + +Part 4: Finishing the Seq2Seq model +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Here we'll take a look at the full details of ``__init__``, ``batchify``, ``predict``, and more. + +Full __init__() +--------------- + +Here's the full code to get the initialization of our model working. +While you might define the model as a separate class if you prefer, +we're going to define its modules in-line here, since it's such a simple model. + +.. code-block:: python + + class Seq2seqAgent(Agent): + + def __init__(self, opt, shared=None): + # initialize defaults first + super().__init__(opt, shared) + if not shared: + # this is not a shared instance of this class, so do full + # initialization. if shared is set, only set up shared members. 
+ + self.dict = DictionaryAgent(opt) + self.id = 'Seq2Seq' + # we use EOS markers to break input and output and end our output + self.EOS = self.dict.eos_token + self.observation = {'text': self.EOS, 'episode_done': True} + self.EOS_TENSOR = torch.LongTensor(self.dict.parse(self.EOS)) + + # store important params directly + hsz = opt['hiddensize'] + self.hidden_size = hsz + self.num_layers = opt['numlayers'] + self.learning_rate = opt['learningrate'] + self.longest_label = 1 + + # set up modules + self.criterion = nn.NLLLoss() + # lookup table stores word embeddings + self.lt = nn.Embedding(len(self.dict), hsz, padding_idx=0, + scale_grad_by_freq=True) + # encoder captures the input text + self.encoder = nn.GRU(hsz, hsz, opt['numlayers']) + # decoder produces our output states + self.decoder = nn.GRU(hsz, hsz, opt['numlayers']) + # linear layer helps us produce outputs from final decoder state + self.h2o = nn.Linear(hsz, len(self.dict)) + # dropout on the linear layer helps us generalize + self.dropout = nn.Dropout(opt['dropout']) + # softmax maps output scores to probabilities + self.softmax = nn.LogSoftmax() + + # set up optims for each module + lr = opt['learningrate'] + self.optims = { + 'lt': optim.SGD(self.lt.parameters(), lr=lr), + 'encoder': optim.SGD(self.encoder.parameters(), lr=lr), + 'decoder': optim.SGD(self.decoder.parameters(), lr=lr), + 'h2o': optim.SGD(self.h2o.parameters(), lr=lr), + } + + # check for cuda + self.use_cuda = not opt.get('no_cuda') and torch.cuda.is_available() + if self.use_cuda: + print('[ Using CUDA ]') + torch.cuda.set_device(opt['gpu']) + if self.use_cuda: + self.cuda() + + self.episode_done = True + +batchify() +---------- +The batchify function takes in a list of observations and turns them into +tensors to use with our model. + +.. 
code-block:: python + + def batchify(self, observations): + """Convert a list of observations into input & target tensors.""" + # valid examples + exs = [ex for ex in observations if 'text' in ex] + # the indices of the valid (non-empty) tensors + valid_inds = [i for i, ex in enumerate(observations) if 'text' in ex] + + # set up the input tensors + batchsize = len(exs) + # tokenize the text + parsed = [self.parse(ex['text']) for ex in exs] + max_x_len = max([len(x) for x in parsed]) + xs = torch.LongTensor(batchsize, max_x_len).fill_(0) + # pack the data to the right side of the tensor for this model + for i, x in enumerate(parsed): + offset = max_x_len - len(x) + for j, idx in enumerate(x): + xs[i][j + offset] = idx + if self.use_cuda: + xs = xs.cuda(async=True) + xs = Variable(xs) + + # set up the target tensors + ys = None + if 'labels' in exs[0]: + # randomly select one of the labels to update on, if multiple + # append EOS to each label + labels = [random.choice(ex['labels']) + ' ' + self.EOS for ex in exs] + parsed = [self.parse(y) for y in labels] + max_y_len = max(len(y) for y in parsed) + ys = torch.LongTensor(batchsize, max_y_len).fill_(0) + for i, y in enumerate(parsed): + for j, idx in enumerate(y): + ys[i][j] = idx + if self.use_cuda: + ys = ys.cuda(async=True) + ys = Variable(ys) + return xs, ys, valid_inds + + +predict() +--------- +The predict function returns an output from our model. If the targets are +provided, then it also updates the model. The predictions will be biased in +this case, since we condition each token on the true label token, but we are +okay with that--it just improves training F1 scores. + +.. code-block:: python + + def predict(self, xs, ys=None): + """Produce a prediction from our model. Update the model using the + targets if available. 
+        """
+        batchsize = len(xs)
+
+        # first encode context
+        xes = self.lt(xs).t()
+        h0 = torch.zeros(self.num_layers, batchsize, self.hidden_size)
+        if self.use_cuda:
+            h0 = h0.cuda(async=True)
+        h0 = Variable(h0)
+        _output, hn = self.encoder(xes, h0)
+
+        # next we use EOS as an input to kick off our decoder
+        x = Variable(self.EOS_TENSOR)
+        xe = self.lt(x).unsqueeze(1)
+        xes = xe.expand(xe.size(0), batchsize, xe.size(2))
+
+        # list of output tokens for each example in the batch
+        output_lines = [[] for _ in range(batchsize)]
+
+        if ys is not None:
+            # update the model based on the labels
+            self.zero_grad()
+            loss = 0
+            # keep track of longest label we've ever seen
+            self.longest_label = max(self.longest_label, ys.size(1))
+            for i in range(ys.size(1)):
+                output, hn = self.decoder(xes, hn)
+                preds, scores = self.hidden_to_idx(output, drop=True)
+                y = ys.select(1, i)
+                loss += self.criterion(scores, y)
+                # use the true token as the next input instead of predicted
+                # this produces a biased prediction but better training
+                xes = self.lt(y).unsqueeze(0)
+                for b in range(batchsize):
+                    # convert the output scores to tokens
+                    token = self.v2t([preds.data[b][0]])
+                    output_lines[b].append(token)
+
+            loss.backward()
+            self.update_params()
+        else:
+            # just produce a prediction without training the model
+            done = [False for _ in range(batchsize)]
+            total_done = 0
+            max_len = 0
+
+            while total_done < batchsize and max_len < self.longest_label:
+                # keep producing tokens until we hit EOS or max length for each
+                # example in the batch
+                output, hn = self.decoder(xes, hn)
+                preds, scores = self.hidden_to_idx(output, drop=False)
+                xes = self.lt(preds.t())
+                max_len += 1
+                for b in range(batchsize):
+                    if not done[b]:
+                        # only add more tokens for examples that aren't done yet
+                        token = self.v2t(preds.data[b])
+                        if token == self.EOS:
+                            # if we produced EOS, we're done
+                            done[b] = True
+                            total_done += 1
+                        else:
+                            output_lines[b].append(token)
+
+        return output_lines
+
+hidden_to_idx()
+---------------
+
+Finally, this function converts our hidden state (from the decoder) to specific
+indices into our dictionary, allowing us to return tokens from the dictionary.
+
+.. code-block:: python
+
+    def hidden_to_idx(self, hidden, drop=False):
+        """Converts hidden state vectors into indices into the dictionary."""
+        if hidden.size(0) > 1:
+            raise RuntimeError('bad dimensions of tensor:', hidden)
+        hidden = hidden.squeeze(0)
+        scores = self.h2o(hidden)
+        if drop:
+            scores = self.dropout(scores)
+        scores = self.softmax(scores)
+        _max_score, idx = scores.max(1)
+        return idx, scores
+
+For other utility functions like loading from file, or to see any new features
+that we may have added to the model such as attention over the input or ranking
+candidates, check out the source code at parlai/agents/seq2seq.
diff --git a/docs/source/task_list.inc b/docs/source/task_list.inc
deleted file mode 100644
index c9aebee38db..00000000000
--- a/docs/source/task_list.inc
+++ /dev/null
@@ -1,368 +0,0 @@
-QA
---
-
-bAbI 1k
-^^^^^^^
-
-**Tag**: ``#bAbI-1k``
-
-**Full Path**: ``babi:All1k``
-
-**Group Tags**: ``#all``, ``#QA``
-
-**Description**: 20 synthetic tasks that each test a unique aspect of text and reasoning, and hence test different capabilities of learning models. From Weston et al. '16. Link: http://arxiv.org/abs/1502.05698
-
-**Notes**: You can access just one of the bAbI tasks with e.g. 'babi:Task1k:3' for task 3.
-
-
-bAbI 10k
-^^^^^^^^
-
-**Tag**: ``#bAbI-10k``
-
-**Full Path**: ``babi:All10k``
-
-**Group Tags**: ``#all``, ``#QA``
-
-**Description**: 20 synthetic tasks that each test a unique aspect of text and reasoning, and hence test different capabilities of learning models. From Weston et al. '16. Link: http://arxiv.org/abs/1502.05698
-
-**Notes**: You can access just one of the bAbI tasks with e.g. 'babi:Task10k:3' for task 3.
- - -MCTest -^^^^^^ - -**Tag**: ``#MCTest`` - -**Full Path**: ``mctest`` - -**Group Tags**: ``#all``, ``#QA`` - -**Description**: Questions about short children's stories, from Richardson et al. '13. Link: https://www.microsoft.com/en-us/research/publication/mctest-challenge-dataset-open-domain-machine-comprehension-text/ - - - -Movie Dialog QA -^^^^^^^^^^^^^^^ - -**Tag**: ``#MovieDD-QA`` - -**Full Path**: ``moviedialog:Task:1`` - -**Group Tags**: ``#all``, ``#QA``, ``#MovieDD`` - -**Description**: Closed-domain QA dataset asking templated questions about movies, answerable from Wikipedia, similar to WikiMovies. From Dodge et al. '15. Link: https://arxiv.org/abs/1511.06931 - - - -Movie Dialog Recommendations -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -**Tag**: ``#MovieDD-Recs`` - -**Full Path**: ``moviedialog:Task:2`` - -**Group Tags**: ``#all``, ``#QA``, ``#MovieDD`` - -**Description**: Questions asking for movie recommendations. From Dodge et al. '15. Link: https://arxiv.org/abs/1511.06931 - - - -MTurk WikiMovies -^^^^^^^^^^^^^^^^ - -**Tag**: ``#MTurkWikiMovies`` - -**Full Path**: ``mturkwikimovies`` - -**Group Tags**: ``#all``, ``#QA`` - -**Description**: Closed-domain QA dataset asking MTurk-derived questions about movies, answerable from Wikipedia. From Li et al. '16. Link: https://arxiv.org/abs/1611.09823 - - - -Simple Questions -^^^^^^^^^^^^^^^^ - -**Tag**: ``#SimpleQuestions`` - -**Full Path**: ``simplequestions`` - -**Group Tags**: ``#all``, ``#QA`` - -**Description**: Open-domain QA dataset based on Freebase triples from Bordes et al. '15. Link: https://arxiv.org/abs/1506.02075 - - - -SQuAD -^^^^^ - -**Tag**: ``#SQuAD`` - -**Full Path**: ``squad`` - -**Group Tags**: ``#all``, ``#QA`` - -**Description**: Open-domain QA dataset answerable from a given paragraph from Wikipedia, from Rajpurkar et al. '16. 
Link: https://arxiv.org/abs/1606.05250 - - - -Web Questions -^^^^^^^^^^^^^ - -**Tag**: ``#WebQuestions`` - -**Full Path**: ``webquestions`` - -**Group Tags**: ``#all``, ``#QA`` - -**Description**: Open-domain QA dataset from Web queries from Berant et al. '13. Link: http://www.aclweb.org/anthology/D13-1160 - - - -WikiMovies -^^^^^^^^^^ - -**Tag**: ``#WikiMovies`` - -**Full Path**: ``wikimovies`` - -**Group Tags**: ``#all``, ``#QA`` - -**Description**: Closed-domain QA dataset asking templated questions about movies, answerable from Wikipedia. From Miller et al. '16. Link: https://arxiv.org/abs/1606.03126 - - - -WikiQA -^^^^^^ - -**Tag**: ``#WikiQA`` - -**Full Path**: ``wikiqa`` - -**Group Tags**: ``#all``, ``#QA`` - -**Description**: Open domain QA from Wikipedia dataset from Yang et al. '15. Link: https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/ - - - -Cloze ------ - -BookTest -^^^^^^^^ - -**Tag**: ``#BookTest`` - -**Full Path**: ``booktest`` - -**Group Tags**: ``#all``, ``#Cloze`` - -**Description**: Sentence completion given a few sentences as context from a book. A larger version of CBT. From Bajgar et al., 16. Link: https://arxiv.org/abs/1610.00956 - - - -Children's Book Test (CBT) -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -**Tag**: ``#CBT`` - -**Full Path**: ``cbt`` - -**Group Tags**: ``#all``, ``#Cloze`` - -**Description**: Sentence completion given a few sentences as context from a children's book. From Hill et al., '16. Link: https://arxiv.org/abs/1511.02301 - - - -QA CNN -^^^^^^ - -**Tag**: ``#QACNN`` - -**Full Path**: ``qacnn`` - -**Group Tags**: ``#all``, ``#Cloze`` - -**Description**: Cloze dataset based on a missing (anonymized) entity phrase from a CNN article, Hermann et al. '15. 
Link: https://arxiv.org/abs/1506.03340 - - - -QA Daily Mail -^^^^^^^^^^^^^ - -**Tag**: ``#QADailyMail`` - -**Full Path**: ``qadailymail`` - -**Group Tags**: ``#all``, ``#Cloze`` - -**Description**: Cloze dataset based on a missing (anonymized) entity phrase from a Daily Mail article, Hermann et al. '15. Link: https://arxiv.org/abs/1506.03340 - - - -Goal ----- - -Dialog Based Language Learning: bAbI Task -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -**Tag**: ``#DBLL-bAbI`` - -**Full Path**: ``dbll_babi`` - -**Group Tags**: ``#all``, ``#Goal`` - -**Description**: Short dialogs based on the bAbI tasks, but in the form of a question from a teacher, the answer from the student, and finally a comment on the answer from the teacher. The aim is to find learning models that use the comments to improve. From Weston '16. Link: https://arxiv.org/abs/1604.06045 - - - -Dialog Based Language Learning: WikiMovies Task -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -**Tag**: ``#DBLL-Movie`` - -**Full Path**: ``dbll_movie`` - -**Group Tags**: ``#all``, ``#Goal`` - -**Description**: Short dialogs based on WikiMovies, but in the form of a question from a teacher, the answer from the student, and finally a comment on the answer from the teacher. The aim is to find learning models that use the comments to improve. From Weston '16. Link: https://arxiv.org/abs/1604.06045 - - - -Dialog bAbI -^^^^^^^^^^^ - -**Tag**: ``#dialog-bAbI`` - -**Full Path**: ``dialog_babi`` - -**Group Tags**: ``#all``, ``#Goal`` - -**Description**: Simulated dialogs of restaurant booking, from Bordes et al. '16. Link: https://arxiv.org/abs/1605.07683 - - - -Movie Dialog QA Recommendations -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -**Tag**: ``#MovieDD-QARecs`` - -**Full Path**: ``moviedialog:Task:3`` - -**Group Tags**: ``#all``, ``#Goal``, ``#MovieDD`` - -**Description**: Dialogs discussing questions about movies as well as recommendations. From Dodge et al. '15. 
Link: https://arxiv.org/abs/1511.06931 - - - -ChitChat --------- - -Cornell Movie -^^^^^^^^^^^^^ - -**Tag**: ``#CornellMovie`` - -**Full Path**: ``cornell_movie`` - -**Group Tags**: ``#all``, ``#ChitChat`` - -**Description**: Fictional conversations extracted from raw movie scripts. Link: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html - - - -Movie Dialog Reddit -^^^^^^^^^^^^^^^^^^^ - -**Tag**: ``#MovieDD-Reddit`` - -**Full Path**: ``moviedialog:Task:4`` - -**Group Tags**: ``#all``, ``#ChitChat``, ``#MovieDD`` - -**Description**: Dialogs discussing Movies from Reddit (the Movies SubReddit). From Dodge et al. '15. Link: https://arxiv.org/abs/1511.06931 - - - -Open Subtitles -^^^^^^^^^^^^^^ - -**Tag**: ``#OpenSubtitles`` - -**Full Path**: ``opensubtitles`` - -**Group Tags**: ``#all``, ``#ChitChat`` - -**Description**: Dataset of dialogs from movie scripts: http://opus.lingfil.uu.se/OpenSubtitles.php. A variant of the dataset used in Vinyals & Le '15, https://arxiv.org/abs/1506.05869. - - - -Ubuntu -^^^^^^ - -**Tag**: ``#Ubuntu`` - -**Full Path**: ``ubuntu`` - -**Group Tags**: ``#all``, ``#ChitChat`` - -**Description**: Dialogs between an Ubunt user and an expert trying to fix issue, from Lowe et al. '15. Link: https://arxiv.org/abs/1506.08909 - - - -Visual ------- - -VQAv1 -^^^^^ - -**Tag**: ``#VQAv1`` - -**Full Path**: ``vqa_v1`` - -**Group Tags**: ``#all``, ``#Visual`` - -**Description**: Open-ended question answering about visual content. From Agrawal et al. '15. Link: https://arxiv.org/abs/1505.00468 - - - -VQAv2 -^^^^^ - -**Tag**: ``#VQAv2`` - -**Full Path**: ``vqa_v2`` - -**Group Tags**: ``#all``, ``#Visual`` - -**Description**: Bigger, more balanced version of the original VQA dataset. From Goyal et al. '16. 
Link: https://arxiv.org/abs/1612.00837 - - - -VisDial -^^^^^^^ - -**Tag**: ``#VisDial`` - -**Full Path**: ``visdial`` - -**Group Tags**: ``#all``, ``#Visual`` - -**Description**: Task which requires agents to hold a meaningful dialog about visual content. From Das et al. '16. Link: https://arxiv.org/abs/1611.08669 - - - -MNIST_QA -^^^^^^^^ - -**Tag**: ``#MNIST_QA`` - -**Full Path**: ``mnist_qa`` - -**Group Tags**: ``#all``, ``#Visual`` - -**Description**: Task which requires agents to identify which number they are seeing. From the MNIST dataset. - - - diff --git a/docs/source/task_tutorial.rst b/docs/source/task_tutorial.rst index 35f8c33b83a..56472799e49 100644 --- a/docs/source/task_tutorial.rst +++ b/docs/source/task_tutorial.rst @@ -7,36 +7,58 @@ Creating a New Task =================== +**Author**: Filipe de Avila Belbute Peres -Adding new tasks to ParlAI is a simple process. In this tutorial we will go over the different ways a new task can be created. +Adding new tasks to ParlAI is a simple process. In this tutorial we will go over the different ways a new task can be created. -Tasks are located in the ``parlai/tasks`` directory. Therefore, the first thing to do is to create a directory for your new task there. (Don't forget to create an ``__init__.py`` file there.) The code for the tasks in this tutorial can also be found in this directory. +Tasks code is located in the ``parlai/tasks`` directory. You will need to create a directory for your new task there. (Don't forget to create an ``__init__.py`` file.) The code for the tasks in this tutorial can also be found in this directory. + + +Summary +^^^^^^^ + +In brief, to add your own task you need to: + +1. Implement ``build.py`` to `download and build any needed data `__. +2. Implement ``agents.py``, with at least a ``DefaultTeacher`` (extending ``Teacher`` or one of its children) + + - if your data is in FB Dialog format, subclass `FbDialogTeacher`_. 
+ - if your data consists of fixed logs, you can extend `DialogTeacher`_, in which case you just need to write your own ``setup_data()`` function, which provides an iterable over the data. + - if your data uses other fields, build your `task from scratch`_, by subclassing ``Teacher`` and writing your own ``act()`` method, which will provide observations from your task each time it's called. + +3. Add the task to the `task list `__. + +Below we go into more details for each of these steps. Part 1: Building the Data ^^^^^^^^^^^^^^^^^^^^^^^^^ -We first need to create functionality for downloading and setting up the dataset that is going to be used for the task. This is done in the ``build.py`` file. Useful functionality for setting up data can be found in ``parlai.core.build_data``. We thus start by importing it: +We first need to create functionality for downloading and setting up the dataset that is going to be used for the task. This is done in the ``build.py`` file. Useful functionality for setting up data can be found in ``parlai.core.build_data``. We thus start by importing it: .. code-block:: python import parlai.core.build_data as build_data import os -Now we define our build method, which takes in the argument ``opt``, which contains parsed arguments from the command line (or their default), including the path to the data directory. We then use the build_data utilities to check if this data has been previously built, so that work is only done once. If not, we proceed to creating the directory for the data, and then downloading and uncompressing it. Finally, we mark the build as done, so that ``build_data.built`` returns true from now on. Below is an example of setting up the MNIST dataset. +Now we define our build method, which takes in the argument ``opt``, which contains parsed arguments from the command line (or their default), including the path to the data directory. 
We can also define a version string, so that the data is updated automatically in case there is a new version (here it was just left as ``None`` as the MNIST dataset doesn't have a version). We then use the build_data utilities to check if this data hasn't been previously built or if the version is outdated. If not, we proceed to creating the directory for the data, and then downloading and uncompressing it. Finally, we mark the build as done, so that ``build_data.built`` returns true from now on. Below is an example of setting up the MNIST dataset. .. code-block:: python def build(opt): # get path to data directory dpath = os.path.join(opt['datapath'], 'mnist') - + # define version if any + version = None + # check if data had been previously built - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - - # make a clean directory - build_data.remove_dir(dpath) + + # make a clean directory if needed + if build_data.built(dpath): + # an older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # download the data. @@ -48,7 +70,7 @@ Now we define our build method, which takes in the argument ``opt``, which conta build_data.untar(dpath, fname) # mark the data as built - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) @@ -61,7 +83,7 @@ The simplest method available for creating a teacher is to use the ``FbDialogTea If the data is not in this format or there are different requirements, one can still use the ``DialogTeacher`` which automates much of the work in setting up a dialog task, but gives the user more flexibility in setting up the data. This is shown in the section `DialogTeacher`_. -Finally, if the requirements for the task do not fit any of the above, one can still write a task from scratch without much trouble. This is shown in the section `Task from Scratch`_. 
(Coming soon) +Finally, if the requirements for the task do not fit any of the above, one can still write a task from scratch without much trouble. This is shown in the section `Task from Scratch`_. FbDialogTeacher @@ -69,7 +91,7 @@ FbDialogTeacher In this section we will illustrate the process of using the ``FbDialogTeacher`` class by adding the `MTurk WikiMovies `__ question-answering task. This task has data in textual form and has been formatted to follow the Facebook Dialog format. It is thus very simple to implement it using ``FbDialogTeacher``. More information on this class and the dialog format can be found `here `__. -In this task, the agent is presented with presented with questions about movies that are answerable from Wikipedia. A sample dialog is demonstrated below. +In this task, the agent is presented with questions about movies that are answerable from Wikipedia. A sample dialog is demonstrated below. :: @@ -85,7 +107,7 @@ Every task requires a ``DefaultTeacher``. We will thus create one for this task. class DefaultTeacher(FbDialogTeacher): def __init__(self, opt, shared=None): opt = copy.deepcopy(opt) - + # get datafile opt['datafile'] = _path(opt, '') @@ -95,7 +117,7 @@ Every task requires a ``DefaultTeacher``. We will thus create one for this task. 'entities.txt') super().__init__(opt, shared) -We can notice there was a call to a ``_path()`` method, which returns the path to the correct datafile. The path to the file is then stored in the options dictionary under the ``'datafile'`` key. We still need to implement this ``_path()`` method. The version for this example is presented below. It first ensures the data is built by calling the ``build()`` method described above. It then sets up the paths for the built data. +We can notice there was a call to a ``_path()`` method, which returns the path to the correct datafile. The path to the file is then stored in the options dictionary under the ``'datafile'`` key. 
We still need to implement this ``_path()`` method. The version for this example is presented below. It first ensures the data is built by calling the ``build()`` method described above. It then sets up the paths for the built data. .. code-block:: python @@ -116,12 +138,12 @@ And this is all that needs to be done to create a teacher for our task using ``F DialogTeacher ~~~~~~~~~~~~~ -In this section we will demonstrate the process of using the ``DialogTeacher`` class by adding a simple question-answering task based on the MNIST dataset. This task depends on visual data and so does not fit the ``FbDialogTeacher`` class described above. Still, using ``DialogTeacher`` makes it easy to implement dialog tasks such as this one. +In this section we will demonstrate the process of using the ``DialogTeacher`` class by adding a simple question-answering task based on the MNIST dataset. This task depends on visual data and so does not fit the ``FbDialogTeacher`` class described above. Still, using ``DialogTeacher`` makes it easy to implement dialog tasks such as this one. -In this task, the agent is presented with the image of a digit and then asked to answer which number it is seeing. A sample episode is demonstrated below. +In this task, the agent is presented with the image of a digit and then asked to answer which number it is seeing. A sample episode is demonstrated below. :: - + [mnist_qa]: Which number is in the image? @@@@@@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@@@@@@@@@@ @@ -165,13 +187,13 @@ We will call our teacher ``MnistQATeacher``. Let's initialize this class first. 
def __init__(self, opt, shared=None): # store datatype self.datatype = opt['datatype'].split(':')[0] - + # _path method explained below, returns paths to images and labels labels_path, self.image_path = _path(opt) - + # store path to label data in options dictionary opt['datafile'] = labels_path - + # store identifier for the teacher in the dialog self.id = 'mnist_qa' @@ -182,9 +204,9 @@ We will call our teacher ``MnistQATeacher``. Let's initialize this class first. super().__init__(opt, shared) -The ``id`` field names the teacher in the dialog. The ``num_strs`` field is specific to this example task. It is being used simply to store the text version of the digits. +The ``id`` field names the teacher in the dialog. The ``num_strs`` field is specific to this example task. It is being used simply to store the text version of the digits. -More importantly, we can notice there was a call to a ``_path()`` method, which returns the paths to the image files and the labels. The path to the file is then stored in the options dictionary under the ``'datafile'`` key. This key should be used to store data that will be useful for performing the task. +More importantly, we can notice there was a call to a ``_path()`` method, which returns the paths to the image files and the labels. The path to the file is then stored in the options dictionary under the ``'datafile'`` key. This key should be used to store data that will be useful for performing the task. We still need to implement this ``_path()`` method. The version for this example is presented below. It first ensures the data is built by calling the ``build()`` method described above. It then sets up the paths for the built data. This should be specific to the dataset being used. If your dataset does not use images, the ``image_path`` is not necessary, for example. Or if your task will use data other than labels, the path to the file containing this information can also be returned. 
@@ -193,25 +215,25 @@ We still need to implement this ``_path()`` method. The version for this example def _path(opt): # ensure data is built build(opt) - + # set up paths to data (specific to each dataset) dt = opt['datatype'].split(':')[0] labels_path = os.path.join(opt['datapath'], 'mnist', dt, 'labels.json') image_path = os.path.join(opt['datapath'], 'mnist', dt) return labels_path, image_path -By creating ``MnistQATeacher`` as a subclass of ``DialogTeacher``, the job of creating a teacher for this task becomes much simpler: most of the work that needs to be done will limit itself to defining a ``setup_data`` method. This method is a generator that will take in a path to the data and yield a pair of elements for each call. The first element of the pair is a tuple containing the following information: ``(query, labels, reward, label_candidates, path_to_image)``. The second is a boolean flag ``episode_done?`` which indicates if the current query marks the end of an episode or not. +By creating ``MnistQATeacher`` as a subclass of ``DialogTeacher``, the job of creating a teacher for this task becomes much simpler: most of the work that needs to be done will limit itself to defining a ``setup_data`` method. This method is a generator that will take in a path to the data and yield a pair of elements for each call. The first element of the pair is a tuple containing the following information: ``(query, labels, reward, label_candidates, path_to_image)``. The second is a boolean flag ``episode_done?`` which indicates if the current query marks the end of an episode or not. More information on this format can be found in the documentation on ``data_loader`` in `DialogData `__ (``setup_data`` is provided as a data_loader to ``DialogData``). -The sample ``setup_data`` method for our task is presented below. +The sample ``setup_data`` method for our task is presented below. .. 
code-block:: python def setup_data(self, path): print('loading: ' + path) - # open data file with labels + # open data file with labels # (path will be provided to setup_data from opt['datafile'] defined above) with open(path) as labels_file: self.labels = json.load(labels_file) @@ -232,7 +254,7 @@ The sample ``setup_data`` method for our task is presented below. As we can see from the code above, for this specific task the question is always the same, and thus it is fixed. For different tasks, this might change at each iteration. Similarly, for this task, each episode consists of only one query, thus ``episode_done?`` is always true (*i.e.*, each query is the end of its episode). This could also vary depending on the task. -Looking at the tuple provided by the iterator at each yield, we can see that we defined a query, a label and an image path. When working with ``DialogTeacher`` in visual tasks, it is important to provide the path to the image in the ``setup_data`` tuple. This allows one to inherit functionality around the "image-mode" command line parameter, such as automatically returning ascii versions of images if -im ascii is set. +Looking at the tuple provided by the iterator at each yield, we can see that we defined a query, a label and an image path. When working with ``DialogTeacher`` in visual tasks, it is important to provide the path to the image in the ``setup_data`` tuple. This allows one to inherit functionality around the "image-mode" command line parameter, such as automatically returning ascii versions of images if -im ascii is set. Finally, one might notice that no reward or label candidates were provided in the tuple (both are set to ``None``). The reward is not specified because it is not useful for this task. The label candidates, however, were not specified per-example for this task because we instead use a single set of universal candidates for every example in this task (the digits from '0' to '9'). 
For cases like this, with fixed label candidates, one can simply define a method ``label_candidates()`` that returns the unchanging candidates, as demonstrated below. For cases where the label candidates vary for each query, the field in the tuple can be used. @@ -254,7 +276,190 @@ And we have finished building our task. Task from Scratch ~~~~~~~~~~~~~~~~~ -Coming soon. +In this section we will demonstrate the process of creating a task from scratch by adding the VQAv2 visual question-answering task. To implement this task we will inherit directly from the base ``Teacher`` class instead of using ``DialogTeacher``. This is usually not necessary, but it is done here as an example of creating a task from scratch. + +In this task, the agent is presented with an image of a scene and then asked to answer a question about that scene. A sample episode is demonstrated below. + +.. image:: _static/img/task_tutorial_skateboard.jpg + +:: + + [vqa_v2]: What is this man holding? + [labels: skateboard] + [Agent]: skateboard + + +We will call our teacher ``OeTeacher`` (for open-ended teacher, since it doesn't provide the agent with label candidates). Let's initialize this class first. + +.. 
code-block:: python + + class OeTeacher(Teacher): + def __init__(self, opt, shared=None): + super().__init__(opt) + # store datatype + self.datatype = opt['datatype'] + # _path method explained below, returns paths to images and labels + data_path, annotation_path, self.image_path = _path(opt) + + # setup data if it hasn't been provided in shared + if shared and 'ques' in shared: + self.ques = shared['ques'] + if 'annotation' in shared: + self.annotation = shared['annotation'] + else: + self._setup_data(data_path, annotation_path) + self.len = len(self.ques['questions']) + + # for ordered data in batch mode (especially, for validation and + # testing), each teacher in the batch gets a start index and a step + # size so they all process disparate sets of the data + self.step_size = opt.get('batchsize', 1) + self.data_offset = opt.get('batchindex', 0) + + # instantiate image loader for later usage + self.image_loader = ImageLoader(opt) + + self.reset() + +There are three important parts to this initialization. First, the call to the ``_path()`` method, which returns the paths to the data, annotation and image files. Second, setting up the data and handling the ``shared`` argument, which is used when initializing multiple teachers (*e.g.*, for batch training). It is a dictionary containing data that can be shared across instances of the class. Third, defining step sizes and offsets for walking over the data in batch mode. Let's look at each of these in order. + +First, we need to implement the ``_path()`` method. The version for this example is presented below. It first ensures the data is built by calling the ``build()`` method described above. In this case, it also calls a ``buildImage()`` method, which downloads the images for this task. This method is analogous to ``build()`` and can be found in the same ``build.py`` file. It then sets up the paths for the built data. This should be specific to the dataset being used. 
If your dataset does not use images, the ``image_path`` is not necessary, for example. (The same applies to the ``image_loader``.) + +.. code-block:: python + + def _path(opt): + # ensure data is built + build(opt) + buildImage(opt) + dt = opt['datatype'].split(':')[0] + + # verify datatype to decide which sub-dataset to load + if dt == 'train': + ques_suffix = 'v2_OpenEnded_mscoco_train2014' + annotation_suffix = 'v2_mscoco_train2014' + img_suffix = os.path.join('train2014', 'COCO_train2014_') + elif dt == 'valid': + ques_suffix = 'v2_OpenEnded_mscoco_val2014' + annotation_suffix = 'v2_mscoco_val2014' + img_suffix = os.path.join('val2014', 'COCO_val2014_') + elif dt == 'test': + ques_suffix = 'v2_OpenEnded_mscoco_test2015' + annotation_suffix = 'None' + img_suffix = os.path.join('test2015', 'COCO_test2015_') + else: + raise RuntimeError('Not valid datatype.') + + # set up paths to data + data_path = os.path.join(opt['datapath'], 'VQA-v2', + ques_suffix + '_questions.json') + + annotation_path = os.path.join(opt['datapath'], 'VQA-v2', + annotation_suffix + '_annotations.json') + + image_path = os.path.join(opt['datapath'], 'COCO-IMG', img_suffix) + + return data_path, annotation_path, image_path + +Now, we can look at how to setup the data and handle the ``shared`` argument. If an ``OeTeacher`` instance is the first one being created in a task execution, ``shared`` will be ``None``, and thus it will need to set up it's data. This is done in the ``_setup_data()`` method, pasted below. In the case of this task, ``_setup_data()`` simply loads the data (and possibly the annotations) and stores them as class attributes. + +.. 
code-block:: python
+
+    def _setup_data(self, data_path, annotation_path):
+        # loads data
+        print('loading: ' + data_path)
+        with open(data_path) as data_file:
+            self.ques = json.load(data_file)
+        # if not testing, load annotations
+        if self.datatype != 'test':
+            print('loading: ' + annotation_path)
+            with open(annotation_path) as data_file:
+                self.annotation = json.load(data_file)
+
+However, if the ``OeTeacher`` instance being created is not the first one for a certain task execution, we want to avoid having to reload the same data many times. For this to work we need to do two things. First, we define a ``share()`` method, which will set up the task-specific contents of the ``shared`` parameter. This method is presented below. It places the data we have just loaded in ``_setup_data()`` in the shared dictionary and returns it.
+
+.. code-block:: python
+
+    def share(self):
+        shared = super().share()
+        shared['ques'] = self.ques
+        if hasattr(self, 'annotation'):
+            shared['annotation'] = self.annotation
+        return shared
+
+Now that the data sharing is properly set up, when other instances of ``OeTeacher`` are created for a task execution, they will be able to use the ``shared`` argument passed to ``__init__()`` in order to use the already loaded data, as seen before.
+
+We have also seen that we have set up ``self.step_size`` to the size of the batch and ``self.data_offset`` to the batch index, so that different teachers in a batch access different parts of the data. A method ``reset()`` is then called to initialize the data loading. Let's look at that method below. It first sets the attribute ``self.lastY`` to ``None``. This attribute will be used to hold the label for the last example seen by the instance. Then, ``self.episode_idx`` is set to a ``step_size`` below the ``data_offset``, so that when the first action is executed, it is incremented and starts exactly at the ``data_offset`` index.
+
+..
code-block:: python + + def reset(self): + # Reset the dialog so that it is at the start of the epoch, + # and all metrics are reset. + super().reset() + self.lastY = None + self.episode_idx = self.data_offset - self.step_size + +Now that we are done with the class initialization, there are only a few steps left in creating the task. First, the ``OeTeacher`` requires a ``__len__()`` method that returns the size of the data it is presenting. Since ``self.len`` had already been defined in the initialization, this is easy to achieve. + +.. code-block:: python + + def __len__(self): + return self.len + +The final step is to define the important ``act()`` and ``observe()`` methods, which are required of all agents in parlai. In the observe method we simply check if a prediction was made in the last step and if so update the metrics with the last observation and label and clear ``lastY``. This is important because it is the job of the ``Teacher`` to update the metrics. + +.. code-block:: python + + def observe(self, observation): + """Process observation for metrics.""" + if self.lastY is not None: + self.metrics.update(observation, self.lastY) + self.lastY = None + return observation + +In the act method we need to return the ``Teacher``'s action, which will then be presented to the agent(s) performing the task. In this case, this includes an image and a question. We first select which example to use: randomly in the case of training or sequentially in the case of validation/testing. The ``OeTeacher`` then loads the appropriate question, which is placed in the ``text`` field of the dict. The image_path is also constructed and an image object (loaded utilizing the ``ImageLoader`` class) is passed in the ``image`` field. The ``episode_done`` flag is always set to true in this task specifically due to the fact that all episodes consist of only one example. + +.. 
code-block:: python + + def act(self): + # pick random example if training, else proceed sequentially + if self.datatype == 'train': + self.episode_idx = random.randrange(self.len) + else: + self.episode_idx = (self.episode_idx + self.step_size) % len(self) + if self.episode_idx == len(self) - self.step_size: + self.epochDone = True + # get question and image path for current example + qa = self.ques['questions'][self.episode_idx] + question = qa['question'] + image_id = qa['image_id'] + + img_path = self.image_path + '%012d.jpg' % (image_id) + # build action dict, all episodes consist of 1 example in this task + action = { + 'image': self.image_loader.load(img_path), + 'text': question, + 'episode_done': True + } + # if not testing get annotations and set lastY + if not self.datatype.startswith('test'): + anno = self.annotation['annotations'][self.episode_idx] + self.lastY = [ans['answer'] for ans in anno['answers']] + # if training, set fill labels field + if self.datatype.startswith('train'): + action['labels'] = self.lastY + + return action + +The only thing left to be done for this part is to define a ``DefaultTeacher`` class. This is a requirement for any task, since it defaults to this teacher when no one is specified. We can simply default to the class we have built so far. + +.. code-block:: python + + class DefaultTeacher(OeTeacher): + pass + +And we have finished building a task from scratch. + Part 3: Add Task to Task List @@ -281,11 +486,11 @@ Now that our task is complete, we must add an entry to the ``task_list.py`` file "description": "Task which requires agents to identify which number they are seeing. From the MNIST dataset." }, { - "id": "VQAv1", - "display_name": "VQAv1", - "task": "vqa_v1", + "id": "VQAv2", + "display_name": "VQAv2", + "task": "vqa_v2", "tags": [ "all", "Visual" ], - "description": "Open-ended question answering about visual content. From Agrawal et al. '15. 
Link: https://arxiv.org/abs/1505.00468" + "description": "Bigger, more balanced version of the original VQA dataset. From Goyal et al. '16. Link: https://arxiv.org/abs/1612.00837" }, # other tasks... ] @@ -300,3 +505,7 @@ A simple way of testing the basic functionality in a task is to run the ``displa To run the MNIST_QA task, while displaying the images in ascii format, we could call: ``python display_data.py -t mnist_qa -im ascii`` + +And for VQAv2: + +``python display_data.py -t vqa_v2`` diff --git a/examples/README.md b/examples/README.md index 67839006bda..14c8f9edfa9 100644 --- a/examples/README.md +++ b/examples/README.md @@ -48,14 +48,14 @@ Build a dictionary on a bAbI "1k training examples" task 1 and save it to /tmp/d python build_dict.py -t babi:task1k:1 --dict-file /tmp/dict.tsv ``` -Train a simple cpu-based memory network on the "10k training examples" bAbI task 1 with 8 threads (python processes) using Hogwild (requires zmq and Lua Torch): +Train a simple sequence to sequence model on the "1k training examples" bAbI task 1 with batch size of 8 examples for one epoch (requires pytorch): ```bash -python memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1 -nt 8 +python train_model.py -m seq2seq -t babi:task1k:1 -bs 8 -e 1 -mf /tmp/model_s2s ``` Trains an attentive LSTM model of [Chen et al.](https://arxiv.org/abs/1704.00051) on the SQuAD dataset with a batch size of 32 examples (requires pytorch): ```bash -python train_model.py -m drqa -t squad -bs 32 -mf /tmp/model +python train_model.py -m drqa -t squad -bs 32 -mf /tmp/model_drqa ``` Evaluates on an already trained SQuAD model: @@ -67,5 +67,10 @@ python eval_model.py -m drqa -t squad -mf squad.mdl -dt valid Interactive session on an already trained SQuAD model: ```bash wget https://s3.amazonaws.com/fair-data/parlai/_models/drqa/squad.mdl -python interactive.py -m drqa -mf squad.mdl +python interactive.py -m drqa -mf squad.mdl +``` + +Train a simple cpu-based memory network on the "10k training 
examples" bAbI task 1 with 8 threads (python processes) using Hogwild (requires zmq and Lua Torch): +```bash +python memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1 -nt 8 ``` diff --git a/examples/display_model.py b/examples/display_model.py index dbc213c29c1..bc2e856f583 100644 --- a/examples/display_model.py +++ b/examples/display_model.py @@ -25,6 +25,7 @@ def main(): parser = ParlaiParser(True, True) parser.add_argument('-n', '--num-examples', default=10) opt = parser.parse_args() + # Create model and assign it to the specified task agent = create_agent(opt) world = create_task(opt, agent) diff --git a/examples/eval_model.py b/examples/eval_model.py index eeb2ba01f3a..a085d2460ef 100644 --- a/examples/eval_model.py +++ b/examples/eval_model.py @@ -22,8 +22,9 @@ def main(): # Get command line arguments parser = ParlaiParser(True, True) - parser.add_argument('-n', '--num-examples', default=1000) + parser.add_argument('-n', '--num-examples', default=100000000) parser.add_argument('-d', '--display-examples', type='bool', default=False) + parser.set_defaults(datatype='valid') opt = parser.parse_args() # Create model and assign it to the specified task agent = create_agent(opt) diff --git a/examples/extract_image_feature.py b/examples/extract_image_feature.py new file mode 100644 index 00000000000..2198956ebfd --- /dev/null +++ b/examples/extract_image_feature.py @@ -0,0 +1,51 @@ +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +"""Basic example which iterates through the tasks specified and load/extract the +image features. + +For example, to extract the image feature of COCO images: +`python examples/extract_image_feature.py -t vqa_v1 -im resnet152`. 
+ +The CNN model and layer is specified at `--image-cnntype` and `--image-layernum` +in `parlai.core.image_featurizers`. + +For more options, check `parlai.core.image_featurizers` +""" + +from parlai.core.params import ParlaiParser +from parlai.agents.repeat_label.repeat_label import RepeatLabelAgent +from parlai.core.worlds import create_task +from parlai.core.image_featurizers import ImageLoader + +import random + +def main(): + random.seed(42) + + # Get command line arguments + parser = ParlaiParser() + parser.add_argument('-n', '--num-examples', default=10) + parser.set_defaults(datatype='train:ordered') + + ImageLoader.add_cmdline_args(parser) + opt = parser.parse_args() + + opt['no_cuda'] = False + opt['gpu'] = 0 + # create repeat label agent and assign it to the specified task + agent = RepeatLabelAgent(opt) + world = create_task(opt, agent) + + # Show some example dialogs. + with world: + for k in range(int(opt['num_examples'])): + world.parley() + print(world.display() + '\n~~') + if world.epoch_done(): + print('EPOCH DONE') + break + +if __name__ == '__main__': + main() diff --git a/examples/memnn_luatorch_cpu/full_task_train.py b/examples/memnn_luatorch_cpu/full_task_train.py index 59e68b480fe..8b01aa518d4 100644 --- a/examples/memnn_luatorch_cpu/full_task_train.py +++ b/examples/memnn_luatorch_cpu/full_task_train.py @@ -56,32 +56,31 @@ def main(): if not opt.get('dict_file'): # build dictionary since we didn't load it ordered_opt = copy.deepcopy(opt) - for datatype in ['train:ordered', 'valid']: - # we use train and valid sets to build dictionary - ordered_opt['datatype'] = datatype - ordered_opt['numthreads'] = 1 - world_dict = create_task(ordered_opt, dictionary) - - print('Dictionary building on {} data.'.format(datatype)) - cnt = 0 - # pass examples to dictionary - for _ in world_dict: - cnt += 1 - if cnt > opt['dict_max_exs'] and opt['dict_max_exs'] > 0: - print('Processed {} exs, moving on.'.format( - opt['dict_max_exs'])) - # don't wait too 
long... - break - - world_dict.parley() + ordered_opt['datatype'] = 'train:ordered' + ordered_opt['numthreads'] = 1 + world_dict = create_task(ordered_opt, dictionary) + + print('Dictionary building on training data.') + cnt = 0 + # pass examples to dictionary + for _ in world_dict: + cnt += 1 + if cnt > opt['dict_max_exs'] and opt['dict_max_exs'] > 0: + print('Processed {} exs, moving on.'.format( + opt['dict_max_exs'])) + # don't wait too long... + break + + world_dict.parley() # we need to save the dictionary to load it in memnn (sort it by freq) + dictionary.sort() dictionary.save('/tmp/dict.txt', sort=True) print('Dictionary ready, moving on to training.') opt['datatype'] = 'train' - agent = ParsedRemoteAgent(opt, {'dictionary': dictionary}) + agent = ParsedRemoteAgent(opt, {'dictionary_shared': dictionary.share()}) world_train = create_task(opt, agent) opt['datatype'] = 'valid' world_valid = create_task(opt, agent) diff --git a/examples/remote.py b/examples/remote.py new file mode 100644 index 00000000000..807741e658e --- /dev/null +++ b/examples/remote.py @@ -0,0 +1,66 @@ +# Copyright 2004-present Facebook. All Rights Reserved. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +"""Simple loop which sets up a remote connection. The paired agent can run this +same loop but with the '--remote-host' flag set. For example... + +Agent 1: +python remote.py + +Agent 2: +python remote.py --remote-host + +Now humans connected to each agent can communicate over that thread. 
+ + +If you want to use this to feed a dataset to a remote agent, set the '--task': + +Agent 1: +python remote.py -t "babi:task1k:1" + + +If you would like to use a model instead, merely set the '--model' flag: + +Either Agent: +python remote.py -m seq2seq +""" + +from parlai.agents.remote_agent.remote_agent import RemoteAgentAgent +from parlai.agents.local_human.local_human import LocalHumanAgent +from parlai.core.params import ParlaiParser +from parlai.core.agents import create_agent +from parlai.core.worlds import DialogPartnerWorld, create_task + +import random + +def main(): + random.seed(42) + + # Get command line arguments + parser = ParlaiParser(True, True) + RemoteAgentAgent.add_cmdline_args(parser) + opt = parser.parse_args() + + remote = RemoteAgentAgent(opt) + if opt.get('task'): + world = create_task(opt, [remote]) + else: + if opt.get('model'): + local = create_agent(opt) + else: + local = LocalHumanAgent(opt) + # the remote-host goes **second** + agents = [local, remote] if not opt['remote_host'] else [remote, local] + world = DialogPartnerWorld(opt, agents) + + + # Talk to the remote agent + with world: + while True: + world.parley() + print(world.display()) + +if __name__ == '__main__': + main() diff --git a/examples/train_model.py b/examples/train_model.py index 8a8aa320299..8223113d012 100644 --- a/examples/train_model.py +++ b/examples/train_model.py @@ -3,62 +3,75 @@ # This source code is licensed under the BSD-style license found in the # LICENSE file in the root directory of this source tree. An additional grant # of patent rights can be found in the PATENTS file in the same directory. -'''Train a model. +"""Train a model. After training, computes validation and test error. Run with, e.g.: -python examples/train_model.py -m ir_baseline -t dialog_babi:Task:1 -mf '/tmp/model' +python examples/train_model.py -m ir_baseline -t dialog_babi:Task:1 -mf /tmp/model ..or.. 
-python examples/train_model.py -m rnn_baselines/seq2seq -t babi:Task10k:1 -mf '/tmp/model' -bs 32 -lr 0.5 -hs 128 +python examples/train_model.py -m seq2seq -t babi:Task10k:1 -mf '/tmp/model' -bs 32 -lr 0.5 -hs 128 ..or.. -python examples/train_model.py -m drqa -t babi:Task10k:1 -mf '/tmp/model' -bs 10 +python examples/train_model.py -m drqa -t babi:Task10k:1 -mf /tmp/model -bs 10 TODO List: - More logging (e.g. to files), make things prettier. -''' +""" from parlai.core.agents import create_agent from parlai.core.worlds import create_task from parlai.core.params import ParlaiParser from parlai.core.utils import Timer import build_dict -import copy -import importlib import math -import os -def run_eval(agent, opt, datatype, still_training=False): - ''' Eval on validation/test data. ''' +def run_eval(agent, opt, datatype, max_exs=-1, write_log=False, valid_world=None): + """Eval on validation/test data. + - Agent is the agent to use for the evaluation. + - opt is the options that specific the task, eval_task, etc + - datatype is the datatype to use, such as "valid" or "test" + - write_log specifies to write metrics to file if the model_file is set + - max_exs limits the number of examples if max_exs > 0 + - valid_world can be an existing world which will be reset instead of reinitialized + """ print('[ running eval: ' + datatype + ' ]') opt['datatype'] = datatype if opt.get('evaltask'): + opt['task'] = opt['evaltask'] - valid_world = create_task(opt, agent) - for i in range(len(valid_world)): + + if valid_world is None: + valid_world = create_task(opt, agent) + else: + valid_world.reset() + cnt = 0 + for _ in valid_world: valid_world.parley() - if i == 1 and opt['display_examples']: + if cnt == 0 and opt['display_examples']: print(valid_world.display() + '\n~~') print(valid_world.report()) - if valid_world.epoch_done(): + cnt += opt['batchsize'] + if valid_world.epoch_done() or (max_exs > 0 and cnt > max_exs): + # note this max_exs is approximate--some batches 
won't always be + # full depending on the structure of the data break - valid_world.shutdown() valid_report = valid_world.report() + metrics = datatype + ':' + str(valid_report) print(metrics) - if still_training: - return valid_report - else: - if opt['model_file']: - # Write out metrics - f = open(opt['model_file'] + '.' + datatype, 'a+') - f.write(metrics + '\n') - f.close() + if write_log and opt['model_file']: + # Write out metrics + f = open(opt['model_file'] + '.' + datatype, 'a+') + f.write(metrics + '\n') + f.close() + + return valid_report, valid_world + def main(): # Get command line arguments @@ -69,13 +82,17 @@ def main(): 'one used for training if not set)')) train.add_argument('-d', '--display-examples', type='bool', default=False) - train.add_argument('-e', '--num-epochs', type=int, default=1) + train.add_argument('-e', '--num-epochs', type=float, default=-1) train.add_argument('-ttim', '--max-train-time', - type=float, default=float('inf')) + type=float, default=-1) train.add_argument('-ltim', '--log-every-n-secs', - type=float, default=1) + type=float, default=2) train.add_argument('-vtim', '--validation-every-n-secs', - type=float, default=0) + type=float, default=-1) + train.add_argument('-vme', '--validation-max-exs', + type=int, default=-1, + help='max examples to use during validation (default ' + + '-1 uses all)') train.add_argument('-vp', '--validation-patience', type=int, default=5, help=('number of iterations of validation where result ' @@ -88,6 +105,7 @@ def main(): if opt['dict_build_first'] and 'dict_file' in opt: if opt['dict_file'] is None and opt.get('model_file'): opt['dict_file'] = opt['model_file'] + '.dict' + print("[ building dictionary first... ]") build_dict.build_dict(opt) # Create model and assign it to the specified task agent = create_agent(opt) @@ -98,63 +116,93 @@ def main(): log_time = Timer() print('[ training... 
]') parleys = 0 - num_parleys = opt['num_epochs'] * int(len(world) / opt['batchsize']) + total_exs = 0 + max_exs = opt['num_epochs'] * len(world) + max_parleys = math.ceil(max_exs / opt['batchsize']) best_accuracy = 0 impatience = 0 saved = False - for i in range(num_parleys): + valid_world = None + while True: world.parley() - parleys = parleys + 1 - if train_time.time() > opt['max_train_time']: - print('[ max_train_time elapsed: ' + str(train_time.time()) + ' ]') + parleys += 1 + + if opt['num_epochs'] > 0 and parleys >= max_parleys: + print('[ num_epochs completed: {} ]'.format(opt['num_epochs'])) break - if log_time.time() > opt['log_every_n_secs']: + if opt['max_train_time'] > 0 and train_time.time() > opt['max_train_time']: + print('[ max_train_time elapsed: {} ]'.format(train_time.time())) + break + if opt['log_every_n_secs'] > 0 and log_time.time() > opt['log_every_n_secs']: if opt['display_examples']: print(world.display() + '\n~~') - parleys_per_sec = train_time.time() / parleys - time_left = (num_parleys - parleys) * parleys_per_sec - log = ('[ time:' + str(math.floor(train_time.time())) - + 's parleys:' + str(parleys) - + ' time_left:' - + str(math.floor(time_left)) + 's ]') + + logs = [] + # time elapsed + logs.append('time:{}s'.format(math.floor(train_time.time()))) + logs.append('parleys:{}'.format(parleys)) + + # get report and update total examples seen so far if hasattr(agent, 'report'): - log = log + str(agent.report()) + train_report = agent.report() + agent.reset_metrics() else: - log = log + str(world.report()) - # TODO: world.reset_metrics() + train_report = world.report() + world.reset_metrics() + + if hasattr(train_report, 'get') and train_report.get('total'): + total_exs += train_report['total'] + logs.append('total_exs:{}'.format(total_exs)) + + # check if we should log amount of time remaining + time_left = None + if opt['num_epochs'] > 0: + exs_per_sec = train_time.time() / total_exs + time_left = (max_exs - total_exs) * exs_per_sec + 
if opt['max_train_time'] > 0: + other_time_left = opt['max_train_time'] - train_time.time() + if time_left is not None: + time_left = min(time_left, other_time_left) + else: + time_left = other_time_left + if time_left is not None: + logs.append('time_left:{}s'.format(math.floor(time_left))) + + # join log string and add full metrics report to end of log + log = '[ {} ] {}'.format(' '.join(logs), train_report) + print(log) log_time.reset() - if (opt['validation_every_n_secs'] and - validate_time.time() > opt['validation_every_n_secs']): - valid_report = run_eval(agent, opt, 'valid', True) + + if (opt['validation_every_n_secs'] > 0 and + validate_time.time() > opt['validation_every_n_secs']): + valid_report, valid_world = run_eval(agent, opt, 'valid', opt['validation_max_exs'], valid_world=valid_world) if valid_report['accuracy'] > best_accuracy: best_accuracy = valid_report['accuracy'] impatience = 0 - print('[ new best accuracy: ' + str(best_accuracy) + ' ]') - if opt['model_file']: - agent.save(opt['model_file']) - saved = True + print('[ new best accuracy: ' + str(best_accuracy) + ' ]') + world.save_agents() + saved = True if best_accuracy == 1: print('[ task solved! stopping. ]') break else: impatience += 1 - print('[ did not beat best accuracy: ' + str(best_accuracy) + - ' impatience: ' + str(impatience) + ' ]') + print('[ did not beat best accuracy: {} impatience: {} ]'.format( + round(best_accuracy, 4), impatience)) validate_time.reset() - if impatience >= opt['validation_patience']: - print('[ ran out of patience! stopping. ]') + if opt['validation_patience'] > 0 and impatience >= opt['validation_patience']: + print('[ ran out of patience! stopping training. 
]') break world.shutdown() if not saved: - if opt['model_file']: - agent.save(opt['model_file']) + world.save_agents() else: # reload best validation model agent = create_agent(opt) - run_eval(agent, opt, 'valid') - run_eval(agent, opt, 'test') + run_eval(agent, opt, 'valid', write_log=True) + run_eval(agent, opt, 'test', write_log=True) if __name__ == '__main__': diff --git a/parlai/agents/drqa/config.py b/parlai/agents/drqa/config.py index be44513cd29..67608667942 100644 --- a/parlai/agents/drqa/config.py +++ b/parlai/agents/drqa/config.py @@ -85,7 +85,7 @@ def set_defaults(opt): # Embeddings options if opt.get('embedding_file'): if not os.path.isfile(opt['embedding_file']): - raise IOError('No such file: %s' % args.embedding_file) + raise IOError('No such file: %s' % opt['embedding_file']) with open(opt['embedding_file']) as f: dim = len(f.readline().strip().split(' ')) - 1 opt['embedding_dim'] = dim diff --git a/parlai/agents/drqa/drqa.py b/parlai/agents/drqa/drqa.py index ba33732e234..2de2f1bebb5 100644 --- a/parlai/agents/drqa/drqa.py +++ b/parlai/agents/drqa/drqa.py @@ -190,7 +190,7 @@ def act(self): if ex is None: return reply batch = batchify( - [ex], null=self.word_dict[''], cuda=self.opt['cuda'] + [ex], null=self.word_dict[self.word_dict.null_token], cuda=self.opt['cuda'] ) # Either train or predict @@ -223,7 +223,7 @@ def batch_act(self, observations): # Else, use what we have (hopefully everything). 
batch = batchify( - examples, null=self.word_dict[''], cuda=self.opt['cuda'] + examples, null=self.word_dict[self.word_dict.null_token], cuda=self.opt['cuda'] ) # Either train or predict @@ -237,10 +237,12 @@ def batch_act(self, observations): return batch_reply - def save(self, filename): + def save(self, fname=None): """Save the parameters of the agent to a file.""" - print("[ saving model: " + self.opt['model_file'] + " ]") - self.model.save(self.opt['model_file']) + fname = self.opt.get('model_file', None) if fname is None else fname + if fname: + print("[ saving model: " + fname + " ]") + self.model.save(fname) # -------------------------------------------------------------------------- # Helper functions. diff --git a/parlai/agents/drqa/rnn_reader.py b/parlai/agents/drqa/rnn_reader.py index 2f7cdf20af7..eed1cfd34b8 100644 --- a/parlai/agents/drqa/rnn_reader.py +++ b/parlai/agents/drqa/rnn_reader.py @@ -77,7 +77,7 @@ def __init__(self, opt, padding_idx=0): # Question merging if opt['question_merge'] not in ['avg', 'self_attn']: - raise NotImplementedError('merge_mode = %s' % opt['merge_mode']) + raise NotImplementedError('question_merge = %s' % opt['question_merge']) if opt['question_merge'] == 'self_attn': self.self_attn = layers.LinearSeqAttn(question_hidden_size) diff --git a/parlai/agents/drqa/utils.py b/parlai/agents/drqa/utils.py index 7978338c6c1..170a4a00c71 100644 --- a/parlai/agents/drqa/utils.py +++ b/parlai/agents/drqa/utils.py @@ -36,7 +36,7 @@ def load_embeddings(opt, word_dict): embeddings[word_dict[w]].copy_(vec) # Zero NULL token - embeddings[word_dict['']].fill_(0) + embeddings[word_dict['__NULL__']].fill_(0) return embeddings diff --git a/parlai/agents/hred/README.md b/parlai/agents/hred/README.md new file mode 100755 index 00000000000..3bebff886db --- /dev/null +++ b/parlai/agents/hred/README.md @@ -0,0 +1,133 @@ +### Description +This repository hosts the Latent Variable Hierarchical Recurrent Encoder-Decoder RNN model with Gaussian and 
piecewise constant latent variables for generative dialog modeling, as well as the HRED baseline model. These models were proposed in the paper "Piecewise Latent Variables for Neural Variational Text Processing" by Serban et al. + + +### Truncated BPTT +All models are implemented using Truncated Backpropagation Through Time (Truncated BPTT). +The truncated computation is carried out by splitting each document (dialogue) into shorter sequences (e.g. 80 tokens) and computing gradients for each sequence separately, such that the hidden state of the RNNs on each subsequence is initialized from the preceding sequence (i.e. the hidden states have been forward propagated through the previous states). + + +### Creating Datasets +The script convert-text2dict.py can be used to generate model datasets based on text files with dialogues. +It only requires that the document contains end-of-utterance tokens </s> which are used to construct the model graph, since the utterance encoder is only connected to the dialogue encoder at the end of each utterance. + +Prepare your dataset as a text file with one document per line (e.g. one dialogue per line). The documents are assumed to be tokenized. If you have validation and test sets, they must satisfy the same requirements. + +Once you're ready, you can create the model dataset files by running: + +python convert-text2dict.py <training_file> --cutoff <vocabulary_size> Training +python convert-text2dict.py <validation_file> --dict=Training.dict.pkl Validation +python convert-text2dict.py <test_file> --dict=Training.dict.pkl <vocabulary_size> Test + +where <training_file>, <validation_file> and <test_file> are the training, validation and test files, and <vocabulary_size> is the number of tokens that you want to train on (all other tokens, but the most frequent <vocabulary_size> tokens, will be converted to <unk> symbols). 
+ +NOTE: The script automatically adds the following special tokens specific to movie script dialogues: +- end-of-utterance: </s> +- end-of-dialogue: </d> +- first speaker: <first_speaker> +- second speaker: <second_speaker> +- third speaker: <third_speaker> +- minor speaker: <minor_speaker> +- voice over: <voice_over> +- off screen: <off_screen> +- pause: <pause> + +If these do not exist in your dataset, you can safely ignore these. The model will learn to assign approximately zero probability mass to them. + + +### Model Training +If you have Theano with GPU installed (bleeding edge version), you can train the model as follows: +1) Clone the Github repository +2) Unpack your dataset files into "Data" directory. +3) Create a new prototype inside state.py (look at prototype_test_variational for an example) +4) From the terminal, cd into the code directory and run: + + THEANO_FLAGS=mode=FAST_RUN,device=cuda,floatX=float32 python train.py --prototype > Model_Output.txt + +where <prototype_name> is a state (model configuration/architecture) defined inside state.py. +Training a model to convergence on a modern GPU on the Ubuntu Dialogue Corpus with 46 million tokens takes about 2 weeks. If your GPU runs out of memory, you can adjust the batch size (bs) parameter in the model state, but training will be slower. You can also play around with the other parameters inside state.py. + + +### Model Sampling & Testing +To generate model responses using beam search run: + + THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=cuda python sample.py --beam_search --n-samples= --ignore-unk --verbose + +where <model_name> is the name automatically generated during training, <contexts> is a file containing the dialogue contexts with one dialogue per line, and <beams> is the size of the beam search. The results are saved in the file <model_outputs>. 
+ + +### Citation +If you build on this work, we'd really appreciate it if you could cite our papers: + + Piecewise Latent Variables for Neural Variational Text Processing. Iulian V. Serban, Alexander G. Ororbia II, Joelle Pineau, Aaron Courville, Yoshua Bengio. 2017. https://arxiv.org/abs/1612.00377 + + A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. Iulian V. Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, Yoshua Bengio. 2016. http://arxiv.org/abs/1605.06069 + + Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, Joelle Pineau. 2016. AAAI. http://arxiv.org/abs/1507.04808. + + +### Reproducing Results in "Piecewise Latent Variables for Neural Variational Text Processing" +The results reported in the paper "Piecewise Latent Variables for Neural Variational Text Processing" by Serban et al. are based on the following model states found inside state.py: + + prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Baseline_Exp1 (HRED baseline) + prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp5 (P-VHRED) + prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp7 (G-VHRED) + prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp9 (H-VHRED) + +To reproduce these results from scratch, you must follow these steps: + +1) Download and unpack the preprocessed Ubuntu dataset available from http://www.iulianserban.com/Files/UbuntuDialogueCorpus.zip. + +2) a) Clone this Github repository locally on a machine. Use a machine with a fast GPU with large memory (preferably 12GB). + + b) Reconfigure the model states above in state.py appropriately: + 1) Change 'train\_dialogues', 'valid\_dialogues', 'test\_dialogues' to the path for the Ubuntu dataset files. + 2) Change 'dictionary' to the path for the dictionary. + + c) Train up the model. This takes about 2 weeks time! 
+ For example, for "prototype\_ubuntu\_GaussPiecewise\_NormOp\_VHRED\_Exp9" run: + + THEANO_FLAGS=mode=FAST_RUN,device=cuda,floatX=float32 python train.py --prototype prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp9 &> Model_Output.txt + + The model will be saved inside the directory Output/. + If the machine runs out of GPU memory, reduce the batch size (bs) and maximum number of gradient steps (max_grad_steps) in the model state. + + d) Generate outputs using beam search with size 5 on the Ubuntu test set. + To do this, run: + + THEANO_FLAGS=mode=FAST_RUN,device=cuda,floatX=float32 python sample.py --beam_search --n-samples=5 --n-turns=1 --verbose + + where <model_path_prefix> is the path to the saved model parameters excluding the postfix (e.g. Output/1482712210.89_UbuntuModel), + <text_set_contexts> is the path to the Ubuntu test set contexts and <output_file> is where the beam outputs will be stored. + + e) Compute performance using activity- and entity-based metrics. + Follow the instructions given here: https://github.com/julianser/Ubuntu-Multiresolution-Tools. + + +Following all steps to reproduce the results requires a few weeks time and, depending on your setup, may also require changing your Theano configuration and the state file. Therefore, we have also made available the trained models and the generated model responses on the test set. + +You can find the trained models here: https://drive.google.com/open?id=0B06gib_77EnxaDg2VkV1N1huUjg. + +You can find the model responses generated using beam search in this repository inside "TestSet_BeamSearch_Outputs/". + + +### Datasets +The pre-processed Ubuntu Dialogue Corpus and model responses used are available at: http://www.iulianserban.com/Files/UbuntuDialogueCorpus.zip. + +The original Ubuntu Dialogue Corpus as released by Lowe et al. 
(2015) can be found here: http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ + +Unfortunately due to Twitter's terms of service we are not allowed to distribute Twitter content. Therefore we can only make available the tweet IDs, which can then be used with the Twitter API to build a similar dataset. The tweet IDs and model test responses can be found here: http://www.iulianserban.com/Files/TwitterDialogueCorpus.zip. + +### References + + Piecewise Latent Variables for Neural Variational Text Processing. Iulian V. Serban, Alexander G. Ororbia II, Joelle Pineau, Aaron Courville, Yoshua Bengio. 2017. https://arxiv.org/abs/1612.00377 + + A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, Yoshua Bengio. 2016a. http://arxiv.org/abs/1605.06069 + + Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation. Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, Aaron Courville. 2016b. http://arxiv.org/abs/1606.00776. + + Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, Joelle Pineau. 2016c. AAAI. http://arxiv.org/abs/1507.04808. + + Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus. Ryan Lowe, Nissan Pow, Iulian V. Serban, Laurent Charlin, Chia-Wei Liu, Joelle Pineau. 2017. Dialogue & Discourse Journal. http://www.cs.mcgill.ca/~jpineau/files/lowe-dialoguediscourse-2017.pdf + + The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. Ryan Lowe, Nissan Pow, Iulian Serban, Joelle Pineau. 2015. SIGDIAL. http://arxiv.org/abs/1506.08909. 
diff --git a/parlai/agents/hred/SS_dataset.py b/parlai/agents/hred/SS_dataset.py new file mode 100755 index 00000000000..c4253aafb0c --- /dev/null +++ b/parlai/agents/hred/SS_dataset.py @@ -0,0 +1,182 @@
# Shuffled-sample dataset: a producer thread (SSFetcher) feeds batches of
# dialogues to its owning iterator (SSIterator, defined below) through a queue.
# NOTE: Python 2 module (cPickle / Queue imports) used by the Theano HRED code.
import numpy
import os, gc
import cPickle
import copy
import logging

import threading
import Queue

import collections

logger = logging.getLogger(__name__)

class SSFetcher(threading.Thread):
    """Producer thread that walks the parent iterator's in-memory corpus in a
    reshuffled order and puts lists of [token_ids, offset, reshuffle_count]
    entries onto the parent's queue.  A None sentinel marks the end of data
    when the parent is not configured for an infinite loop."""

    def __init__(self, parent, init_offset=0, init_reshuffle_count=1, eos_sym=-1,
            skip_utterance=False, skip_utterance_predict_both=False):
        # parent: the SSIterator that owns data, queue, batch_size and exit_flag.
        threading.Thread.__init__(self)
        self.parent = parent
        # Private RNG seeded from the parent so shuffling is reproducible.
        self.rng = numpy.random.RandomState(self.parent.seed)
        self.indexes = numpy.arange(parent.data_len)

        # Fast-forward targets used to resume from a saved training position.
        self.init_offset = init_offset
        self.init_reshuffle_count = init_reshuffle_count
        self.offset = 0
        self.reshuffle_count = 0

        self.eos_sym = eos_sym
        self.skip_utterance = skip_utterance
        self.skip_utterance_predict_both = skip_utterance_predict_both

    def apply_reshuffle(self):
        # Draw a fresh permutation of the corpus and restart from position 0.
        self.rng.shuffle(self.indexes)
        self.offset = 0
        self.reshuffle_count += 1

    def run(self):
        diter = self.parent
        # Initialize to previously set reshuffles and offset position
        while (self.reshuffle_count < self.init_reshuffle_count):
            self.apply_reshuffle()

        self.offset = self.init_offset

        while not diter.exit_flag:
            last_batch = False
            dialogues = []

            while len(dialogues) < diter.batch_size:
                if self.offset == diter.data_len:
                    if not diter.use_infinite_loop:
                        last_batch = True
                        break
                    else:
                        # Infinite loop here, we reshuffle the indexes
                        # and reset the self.offset
                        self.apply_reshuffle()

                index = self.indexes[self.offset]
                s = diter.data[index]

                # Flatten if this is a list of lists
                if len(s) > 0:
                    if isinstance(s[0], list):
                        s = [item for sublist in s for item in sublist]

                # Standard dialogue preprocessing
                if not self.skip_utterance:
                    # Append only if it is shorter than max_len
                    if diter.max_len == -1 or len(s) <= diter.max_len:
                        dialogues.append([s, self.offset, self.reshuffle_count])

                # Skip-utterance preprocessing
                else:
                    s = copy.deepcopy(s)
                    eos_indices = numpy.where(numpy.asarray(s) == self.eos_sym)[0]

                    # NOTE(review): the two fixups below insert/append the eos
                    # *symbol* into an array of *positions*; it looks like the
                    # boundary indices 0 and len(s)-1 were intended — confirm
                    # against the upstream hed-dlg code.
                    if not s[0] == self.eos_sym:
                        eos_indices = numpy.insert(eos_indices, 0, [self.eos_sym])
                    if not s[-1] == self.eos_sym:
                        eos_indices = numpy.append(eos_indices, [self.eos_sym])
                    if len(eos_indices) > 2:
                        # Compute forward and backward targets
                        first_utterance_index = self.rng.randint(0, len(eos_indices)-2)
                        s_forward = s[eos_indices[first_utterance_index]:eos_indices[first_utterance_index+2]+1]

                        s_backward_a = s[eos_indices[first_utterance_index+1]:eos_indices[first_utterance_index+2]]
                        s_backward_b = s[eos_indices[first_utterance_index]:eos_indices[first_utterance_index+1]+1]

                        # Sometimes an end-of-utterance token is missing at the end.
                        # Therefore, we need to insert it here.
                        if s_backward_a[-1] == self.eos_sym or s_backward_b[0] == self.eos_sym:
                            s_backward = s_backward_a + s_backward_b
                        else:
                            s_backward = s_backward_a + [self.eos_sym] + s_backward_b

                    else:
                        s_forward = [self.eos_sym]
                        s_backward = [self.eos_sym]

                    if self.skip_utterance_predict_both:
                        # Append only if it is shorter than max_len
                        if diter.max_len == -1 or len(s_forward) <= diter.max_len:
                            dialogues.append([s_forward, self.offset, self.reshuffle_count])
                        if diter.max_len == -1 or len(s_backward) <= diter.max_len:
                            dialogues.append([s_backward, self.offset, self.reshuffle_count])
                    else:
                        # Append only if it is shorter than max_len
                        if self.rng.randint(0, 2) == 0:
                            if diter.max_len == -1 or len(s_forward) <= diter.max_len:
                                dialogues.append([s_forward, self.offset, self.reshuffle_count])
                        else:
                            if diter.max_len == -1 or len(s_backward) <= diter.max_len:
                                dialogues.append([s_backward, self.offset, self.reshuffle_count])

                self.offset += 1


            if len(dialogues):
                diter.queue.put(dialogues)

            if last_batch:
                # None sentinel tells the consumer that the epoch is over.
                diter.queue.put(None)
                return
SSIterator(object): + def __init__(self, + dialogue_file, + batch_size, + seed, + max_len=-1, + use_infinite_loop=True, + init_offset=0, + init_reshuffle_count=1, + eos_sym=-1, + skip_utterance=False, + skip_utterance_predict_both=False): + + self.dialogue_file = dialogue_file + self.batch_size = batch_size + self.init_offset = init_offset + self.init_reshuffle_count = init_reshuffle_count + self.eos_sym = eos_sym + self.skip_utterance = skip_utterance + self.skip_utterance_predict_both = skip_utterance_predict_both + + args = locals() + args.pop("self") + self.__dict__.update(args) + self.load_files() + self.exit_flag = False + + def load_files(self): + self.data = cPickle.load(open(self.dialogue_file, 'r')) + self.data_len = len(self.data) + logger.debug('Data len is %d' % self.data_len) + + def start(self): + self.exit_flag = False + self.queue = Queue.Queue(maxsize=1000) + self.gather = SSFetcher(self, self.init_offset, self.init_reshuffle_count, + self.eos_sym, self.skip_utterance, self.skip_utterance_predict_both) + self.gather.daemon = True + self.gather.start() + + def __del__(self): + if hasattr(self, 'gather'): + self.gather.exitFlag = True + self.gather.join() + + def __iter__(self): + return self + + def next(self): + if self.exit_flag: + return None + + batch = self.queue.get() + if not batch: + self.exit_flag = True + return batch + + diff --git a/parlai/agents/hred/__init__.py b/parlai/agents/hred/__init__.py new file mode 100755 index 00000000000..de7579ee4a2 --- /dev/null +++ b/parlai/agents/hred/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. 
diff --git a/parlai/agents/hred/adam.py b/parlai/agents/hred/adam.py new file mode 100755 index 00000000000..24289f5dd9b --- /dev/null +++ b/parlai/agents/hred/adam.py @@ -0,0 +1,59 @@
"""
The MIT License (MIT)

Copyright (c) 2015 Alec Radford

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""

import theano
import theano.tensor as T

def sharedX(value, name=None, borrow=False, dtype=None):
    """Wrap *value* in a Theano shared variable, cast to floatX by default."""
    if dtype is None:
        dtype = theano.config.floatX
    return theano.shared(theano._asarray(value, dtype=dtype),
                         name=name,
                         borrow=borrow)

def Adam(grads, lr=0.0002, b1=0.1, b2=0.001, e=1e-8):
    """Build Adam update rules for Theano.

    grads: dict mapping each shared parameter p to its gradient expression g.
    b1/b2 are decay *complements*: the effective beta1 = 1 - b1 (0.9) and
    beta2 = 1 - b2 (0.999), as shown by the (1 - b1)**t terms below.
    Returns (updates, varlist): the update pairs for theano.function, and
    the list of m/v accumulator shared variables (e.g. for checkpointing).
    """
    updates = []
    varlist = []
    # i counts update steps; fix1/fix2 implement Adam's bias correction.
    i = sharedX(0.)
    i_t = i + 1.
    fix1 = 1. - (1. - b1)**i_t
    fix2 = 1. - (1. - b2)**i_t
    lr_t = lr * (T.sqrt(fix2) / fix1)
    for p, g in grads.items():
        # Per-parameter first (m) and second (v) moment accumulators.
        m = sharedX(p.get_value() * 0., name=p.name + '_adam_optimizer_m')
        v = sharedX(p.get_value() * 0., name=p.name + '_adam_optimizer_v')
        m_t = (b1 * g) + ((1. - b1) * m)
        v_t = (b2 * T.sqr(g)) + ((1. - b2) * v)
        g_t = m_t / (T.sqrt(v_t) + e)
        p_t = p - (lr_t * g_t)

        updates.append((m, m_t))
        updates.append((v, v_t))
        updates.append((p, p_t))

        varlist.append(m)
        varlist.append(v)

    updates.append((i, i_t))
    return updates, varlist
diff --git a/parlai/agents/hred/convert-text2dict.py b/parlai/agents/hred/convert-text2dict.py new file mode 100755 index 00000000000..cf70f2fe2fa --- /dev/null +++ b/parlai/agents/hred/convert-text2dict.py @@ -0,0 +1,146 @@
"""
Takes as input a dialogue file and creates a processed version of it.
If given an external dictionary, the input dialogue file will be converted
using that input dictionary.

@author Alessandro Sordoni, Iulian Vlad Serban
"""

import collections
import numpy
import operator
import os
import sys
import logging
import cPickle

from collections import Counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('text2dict')

def safe_pickle(obj, filename):
    """Pickle *obj* to *filename* (highest protocol), logging whether an
    existing file is being overwritten."""
    if os.path.isfile(filename):
        logger.info("Overwriting %s." % filename)
    else:
        logger.info("Saving to %s." % filename)

    with open(filename, 'wb') as f:
        cPickle.dump(obj, f, protocol=cPickle.HIGHEST_PROTOCOL)

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("input", type=str, help="Dialogue file; assumed shuffled with one document (e.g.
one movie dialogue, or one Twitter conversation or one Ubuntu conversation) per line")
parser.add_argument("--cutoff", type=int, default=-1, help="Vocabulary cutoff (optional)")
parser.add_argument("--dict", type=str, default="", help="External dictionary (pkl file)")
parser.add_argument("output", type=str, help="Prefix of the pickle binarized dialogue corpus")
args = parser.parse_args()

if not os.path.isfile(args.input):
    raise Exception("Input file not found!")

# NOTE(review): every special-token string literal in this section is an empty
# string '' — the angle-bracket tokens (presumably '<unk>', '<s>', '</s>' and
# the speaker/pause tags) appear to have been stripped when this file was
# extracted.  As written, the asserts are duplicates and the special-token
# dict below collapses to a single key; restore the original token strings.
unk = ""

###############################
# Part I: Create the dictionary
###############################
if args.dict != "":
    # Load external dictionary
    assert os.path.isfile(args.dict)
    vocab = dict([(x[0], x[1]) for x in cPickle.load(open(args.dict, "r"))])

    # Check consistency
    assert '' in vocab
    assert '' in vocab
    assert '' in vocab

    # Also check special tags, which must exist in the Movie-Scriptolog dataset
    assert '' in vocab
    assert '' in vocab
    assert '' in vocab
    assert '' in vocab
    assert '' in vocab
    assert '' in vocab
    assert '' in vocab

else:
    word_counter = Counter()


    for line in open(args.input, 'r'):
        line_words = line.strip().split()
        # Ensure each document ends with the end-of-document token
        # (token string garbled to '' here — see NOTE above).
        if line_words[len(line_words)-1] != '':
            line_words.append('')

        s = [x for x in line_words]
        word_counter.update(s)

    total_freq = sum(word_counter.values())
    logger.info("Total word frequency in dictionary %d " % total_freq)

    if args.cutoff != -1:
        logger.info("Cutoff %d" % args.cutoff)
        vocab_count = word_counter.most_common(args.cutoff)
    else:
        vocab_count = word_counter.most_common()

    # Add special tokens to the vocabulary
    vocab = {'': 0, '': 1, '': 2, '': 3,
             '': 4, '': 5, '': 6,
             '': 7, '': 8, '': 9}

    # Add other tokens to vocabulary in the order of their frequency
    i = 10
    for (word, count) in vocab_count:
        if not word in vocab:
            vocab[word] = i
            i += 1

logger.info("Vocab size %d" % len(vocab))

#################################
# Part II: Binarize the dialogues
+################################# + +# Everything is loaded into memory for the moment +binarized_corpus = [] +# Some statistics +unknowns = 0. +num_terms = 0. +freqs = collections.defaultdict(lambda: 0) + +# counts the number of dialogues each unique word exists in; also known as document frequency +df = collections.defaultdict(lambda: 0) + +for line, dialogue in enumerate(open(args.input, 'r')): + dialogue_words = dialogue.strip().split() + if dialogue_words[len(dialogue_words)-1] != '': + dialogue_words.append('') + + # Convert words to token ids and compute some statistics + dialogue_word_ids = [] + for word in dialogue_words: + word_id = vocab.get(word, 0) + dialogue_word_ids.append(word_id) + unknowns += 1 * (word_id == 0) + freqs[word_id] += 1 + + num_terms += len(dialogue_words) + + # Compute document frequency statistics + unique_word_indices = set(dialogue_word_ids) + for word_id in unique_word_indices: + df[word_id] += 1 + + # Add dialogue to corpus + binarized_corpus.append(dialogue_word_ids) + +safe_pickle(binarized_corpus, args.output + ".dialogues.pkl") + +if args.dict == "": + safe_pickle([(word, word_id, freqs[word_id], df[word_id]) for word, word_id in vocab.items()], args.output + ".dict.pkl") + +logger.info("Number of unknowns %d" % unknowns) +logger.info("Number of terms %d" % num_terms) +logger.info("Mean document length %f" % float(sum(map(len, binarized_corpus))/len(binarized_corpus))) +logger.info("Writing training %d dialogues (%d left out)" % (len(binarized_corpus), line + 1 - len(binarized_corpus))) diff --git a/parlai/agents/hred/data_iterator.py b/parlai/agents/hred/data_iterator.py new file mode 100755 index 00000000000..2b8cb49354f --- /dev/null +++ b/parlai/agents/hred/data_iterator.py @@ -0,0 +1,429 @@ +import numpy as np +import theano +import theano.tensor as T + +import sys, getopt +import logging + +from state import * +from utils import * +from SS_dataset import * + +import itertools +import sys +import pickle +import random 
import datetime
import math
import copy

logger = logging.getLogger(__name__)


def add_random_variables_to_batch(state, rng, batch, prev_batch, evaluate_mode):
    """
    This is a helper function, which adds random variables to a batch.
    We do it this way, because we want to avoid Theano's random sampling both to speed up and to avoid
    known Theano issues with sampling inside scan loops.

    The random variable 'ran_var_gaussian_constutterance' is sampled from a standard Gaussian distribution,
    which remains constant during each utterance (i.e. between a pair of end-of-utterance tokens).

    The random variable 'ran_var_uniform_constutterance' is sampled from a uniform distribution [0, 1],
    which remains constant during each utterance (i.e. between a pair of end-of-utterance tokens).

    When not in evaluate mode, the random vector 'ran_decoder_drop_mask' is also sampled.
    This variable represents the input tokens which are replaced by unk when given to
    the decoder RNN. It is required for the noise addition trick used by Bowman et al. (2015).
    """

    # If none return none
    if not batch:
        return batch

    # Variables to store random vector sampled at the beginning of each utterance;
    # shape (time, batch, latent_dim).
    Ran_Var_Gaussian_ConstUtterance = numpy.zeros((batch['x'].shape[0], batch['x'].shape[1], state['latent_gaussian_per_utterance_dim']), dtype='float32')
    Ran_Var_Uniform_ConstUtterance = numpy.zeros((batch['x'].shape[0], batch['x'].shape[1], state['latent_piecewise_per_utterance_dim']), dtype='float32')


    # Go through each sample, find end-of-utterance indices and sample random variables
    for idx in xrange(batch['x'].shape[1]):
        # Find end-of-utterance indices
        eos_indices = numpy.where(batch['x'][:, idx] == state['eos_sym'])[0].tolist()

        # Make sure we also sample at the beginning of the utterance, and that we stop appropriately at the end
        if len(eos_indices) > 0:
            if not eos_indices[0] == 0:
                eos_indices = [0] + eos_indices
            if not eos_indices[-1] == batch['x'].shape[0]:
                eos_indices = eos_indices + [batch['x'].shape[0]]
        else:
            eos_indices = [0] + [batch['x'].shape[0]]

        # Sample random variables using NumPy
        ran_gaussian_vectors = rng.normal(loc=0, scale=1, size=(len(eos_indices), state['latent_gaussian_per_utterance_dim']))
        ran_uniform_vectors = rng.uniform(low=0.0, high=1.0, size=(len(eos_indices), state['latent_piecewise_per_utterance_dim']))

        # Broadcast each per-utterance sample across all time steps of that utterance.
        for i in range(len(eos_indices)-1):
            for j in range(eos_indices[i], eos_indices[i+1]):
                Ran_Var_Gaussian_ConstUtterance[j, idx, :] = ran_gaussian_vectors[i, :]
                Ran_Var_Uniform_ConstUtterance[j, idx, :] = ran_uniform_vectors[i, :]

        # If a previous batch is given, and the last utterance in the previous batch
        # overlaps with the first utterance in the current batch, then we need to copy over
        # the random variables from the last utterance in the last batch to remain consistent.
        # NOTE(review): this block reads the loop-local `idx` and `eos_indices`,
        # so it is placed inside the per-sample loop here; the flattened source
        # does not preserve indentation — confirm against upstream.
        if prev_batch:
            if ('x_reset' in prev_batch) and (not numpy.sum(numpy.abs(prev_batch['x_reset'])) < 1) \
                    and (('ran_var_gaussian_constutterance' in prev_batch) or ('ran_var_uniform_constutterance' in prev_batch)):
                prev_ran_gaussian_vector = prev_batch['ran_var_gaussian_constutterance'][-1,idx,:]
                prev_ran_uniform_vector = prev_batch['ran_var_uniform_constutterance'][-1,idx,:]
                if len(eos_indices) > 1:
                    for j in range(0, eos_indices[1]):
                        Ran_Var_Gaussian_ConstUtterance[j, idx, :] = prev_ran_gaussian_vector
                        Ran_Var_Uniform_ConstUtterance[j, idx, :] = prev_ran_uniform_vector
                else:
                    for j in range(0, batch['x'].shape[0]):
                        Ran_Var_Gaussian_ConstUtterance[j, idx, :] = prev_ran_gaussian_vector
                        Ran_Var_Uniform_ConstUtterance[j, idx, :] = prev_ran_uniform_vector

    # Add new random Gaussian variable to batch
    batch['ran_var_gaussian_constutterance'] = Ran_Var_Gaussian_ConstUtterance
    batch['ran_var_uniform_constutterance'] = Ran_Var_Uniform_ConstUtterance

    # Create word drop mask based on 'decoder_drop_previous_input_tokens_rate' option:
    if evaluate_mode:
        # No token dropping at evaluation time: mask of all ones.
        batch['ran_decoder_drop_mask'] = numpy.ones((batch['x'].shape[0], batch['x'].shape[1]), dtype='float32')
    else:
        if state.get('decoder_drop_previous_input_tokens', False):
            ran_drop = rng.uniform(size=(batch['x'].shape[0], batch['x'].shape[1]))
            batch['ran_decoder_drop_mask'] = (ran_drop <= state['decoder_drop_previous_input_tokens_rate']).astype('float32')
        else:
            batch['ran_decoder_drop_mask'] = numpy.ones((batch['x'].shape[0], batch['x'].shape[1]), dtype='float32')


    return batch


def create_padded_batch(state, rng, x, force_end_of_utterance_token = False):
    # If flag 'do_generate_first_utterance' is off, then zero out the mask for the first utterance.
+ do_generate_first_utterance = True + if 'do_generate_first_utterance' in state: + if state['do_generate_first_utterance'] == False: + do_generate_first_utterance = False + + # Skip utterance model + if state.get('skip_utterance', False): + do_generate_first_utterance = False + + # x = copy.deepcopy(x) + # for idx in xrange(len(x[0])): + # eos_indices = numpy.where(numpy.asarray(x[0][idx]) == state['eos_sym'])[0] + # if not x[0][idx][0] == state['eos_sym']: + # eos_indices = numpy.insert(eos_indices, 0, state['eos_sym']) + # if not x[0][idx][-1] == state['eos_sym']: + # eos_indices = numpy.append(eos_indices, state['eos_sym']) + # + # if len(eos_indices) > 2: + # first_utterance_index = rng.randint(0, len(eos_indices)-2) + # + # # Predict next or previous utterance + # if state.get('skip_utterance_predict_both', False): + # if rng.randint(0, 2) == 0: + # x[0][idx] = x[0][idx][eos_indices[first_utterance_index]:eos_indices[first_utterance_index+2]+1] + # else: + # x[0][idx] = x[0][idx][eos_indices[first_utterance_index+1]:eos_indices[first_utterance_index+2]] + x[0][idx][eos_indices[first_utterance_index]:eos_indices[first_utterance_index+1]+1] + # else: + # + # else: + # x[0][idx] = [state['eos_sym']] + + + # Find max length in batch + mx = 0 + for idx in xrange(len(x[0])): + mx = max(mx, len(x[0][idx])) + + # Take into account that sometimes we need to add the end-of-utterance symbol at the start + mx += 1 + + n = state['bs'] + + X = numpy.zeros((mx, n), dtype='int32') + Xmask = numpy.zeros((mx, n), dtype='float32') + + # Variable to store each utterance in reverse form (for bidirectional RNNs) + X_reversed = numpy.zeros((mx, n), dtype='int32') + + # Fill X and Xmask. + # Keep track of number of predictions and maximum dialogue length. + num_preds = 0 + max_length = 0 + for idx in xrange(len(x[0])): + # Insert sequence idx in a column of matrix X + dialogue_length = len(x[0][idx]) + + # Fiddle-it if it is too long .. 
+ if mx < dialogue_length: + continue + + # Make sure end-of-utterance symbol is at beginning of dialogue. + # This will force model to generate first utterance too + if not x[0][idx][0] == state['eos_sym']: + X[:dialogue_length+1, idx] = [state['eos_sym']] + x[0][idx][:dialogue_length] + dialogue_length = dialogue_length + 1 + else: + X[:dialogue_length, idx] = x[0][idx][:dialogue_length] + + # Keep track of longest dialogue + max_length = max(max_length, dialogue_length) + + # Set the number of predictions == sum(Xmask), for cost purposes, minus one (to exclude first eos symbol) + num_preds += dialogue_length - 1 + + # Mark the end of phrase + if len(x[0][idx]) < mx: + if force_end_of_utterance_token: + X[dialogue_length:, idx] = state['eos_sym'] + + # Initialize Xmask column with ones in all positions that + # were just set in X (except for first eos symbol, because we are not evaluating this). + # Note: if we need mask to depend on tokens inside X, then we need to + # create a corresponding mask for X_reversed and send it further in the model + Xmask[0:dialogue_length, idx] = 1. + + # Reverse all utterances + # TODO: For backward compatibility. This should be removed in future versions + # i.e. move all the x_reversed computations to the model itself. + eos_indices = numpy.where(X[:, idx] == state['eos_sym'])[0] + X_reversed[:, idx] = X[:, idx] + prev_eos_index = -1 + for eos_index in eos_indices: + X_reversed[(prev_eos_index+1):eos_index, idx] = (X_reversed[(prev_eos_index+1):eos_index, idx])[::-1] + prev_eos_index = eos_index + if prev_eos_index > dialogue_length: + break + + + + if not do_generate_first_utterance: + eos_index_to_start_cost_from = eos_indices[0] + if (eos_index_to_start_cost_from == 0) and (len(eos_indices) > 1): + eos_index_to_start_cost_from = eos_indices[1] + Xmask[0:eos_index_to_start_cost_from+1, idx] = 0. + + if np.sum(Xmask[:, idx]) < 2.0: + Xmask[:, idx] = 0. 
# --- tail of create_padded_batch (signature and body begin on earlier lines) ---
    if do_generate_first_utterance:
        # Sanity check: predictions are all mask entries except the first eos.
        assert num_preds == numpy.sum(Xmask) - numpy.sum(Xmask[0, :])

    batch = {'x': X, \
             'x_reversed': X_reversed, \
             'x_mask': Xmask, \
             'num_preds': num_preds, \
             'num_dialogues': len(x[0]), \
             'max_length': max_length \
             }

    return batch

class Iterator(SSIterator):
    """Batching wrapper around SSIterator: sorts k fetched batches by length,
    pads them (create_padded_batch), splits long dialogues into mini-batches of
    at most state['max_grad_steps'] steps, and attaches the random variables
    needed by the latent-variable decoder (add_random_variables_to_batch)."""

    def __init__(self, dialogue_file, batch_size, **kwargs):
        self.state = kwargs.pop('state', None)
        self.k_batches = kwargs.pop('sort_k_batches', 20)

        if ('skip_utterance' in self.state) and ('do_generate_first_utterance' in self.state):
            if self.state['skip_utterance']:
                assert not self.state.get('do_generate_first_utterance', False)

        # Store whether the iterator operates in evaluate mode or not
        self.evaluate_mode = kwargs.pop('evaluate_mode', False)
        print 'Data Iterator Evaluate Mode: ', self.evaluate_mode

        if self.evaluate_mode:
            # Evaluation never resumes from a saved offset.
            SSIterator.__init__(self, dialogue_file, batch_size, \
                seed=kwargs.pop('seed', 1234), \
                max_len=kwargs.pop('max_len', -1), \
                use_infinite_loop=kwargs.pop('use_infinite_loop', False), \
                eos_sym=self.state['eos_sym'], \
                skip_utterance=self.state.get('skip_utterance', False), \
                skip_utterance_predict_both=self.state.get('skip_utterance_predict_both', False))
        else:
            # Training resumes from the checkpointed offset/reshuffle count.
            SSIterator.__init__(self, dialogue_file, batch_size, \
                seed=kwargs.pop('seed', 1234), \
                max_len=kwargs.pop('max_len', -1), \
                use_infinite_loop=kwargs.pop('use_infinite_loop', False), \
                init_offset=self.state['train_iterator_offset'], \
                init_reshuffle_count=self.state['train_iterator_reshuffle_count'], \
                eos_sym=self.state['eos_sym'], \
                skip_utterance=self.state.get('skip_utterance', False), \
                skip_utterance_predict_both=self.state.get('skip_utterance_predict_both', False))


        self.batch_iter = None
        self.rng = numpy.random.RandomState(self.state['seed'])

        # Keep track of previous batch, because this is needed to specify random variables
        self.prev_batch = None



        self.last_returned_offset = 0

    def get_homogenous_batch_iter(self, batch_size = -1):
        """Generator yielding padded, length-sorted mini-batches."""
        while True:
            batch_size = self.batch_size if (batch_size == -1) else batch_size

            # Pull k raw batches so similar-length dialogues can be grouped.
            data = []
            for k in range(self.k_batches):
                batch = SSIterator.next(self)
                if batch:
                    data.append(batch)

            if not len(data):
                return

            number_of_batches = len(data)
            data = list(itertools.chain.from_iterable(data))

            # Split list of words from the offset index and reshuffle count
            data_x = []
            data_offset = []
            data_reshuffle_count = []
            for i in range(len(data)):
                data_x.append(data[i][0])
                data_offset.append(data[i][1])
                data_reshuffle_count.append(data[i][2])

            if len(data_offset) > 0:
                self.last_returned_offset = data_offset[-1]
                self.last_returned_reshuffle_count = data_reshuffle_count[-1]

            x = numpy.asarray(list(itertools.chain(data_x)))

            # Sort by dialogue length so each mini-batch is homogeneous.
            lens = numpy.asarray([map(len, x)])
            order = numpy.argsort(lens.max(axis=0))

            for k in range(number_of_batches):
                indices = order[k * batch_size:(k + 1) * batch_size]
                full_batch = create_padded_batch(self.state, self.rng, [x[indices]])

                if full_batch['num_dialogues'] < batch_size:
                    print 'Skipping incomplete batch!'
                    continue

                if full_batch['max_length'] < 3:
                    print 'Skipping small batch!'
                    continue


                # Then split batches to have size 'max_grad_steps'
                splits = int(math.ceil(float(full_batch['max_length']) / float(self.state['max_grad_steps'])))
                batches = []
                for i in range(0, splits):
                    batch = copy.deepcopy(full_batch)

                    # Retrieve start and end position (index) of current mini-batch
                    start_pos = self.state['max_grad_steps'] * i
                    if start_pos > 0:
                        start_pos = start_pos - 1

                    # We need to copy over the last token from each batch onto the next,
                    # because this is what the model expects.
                    end_pos = min(full_batch['max_length'], self.state['max_grad_steps'] * (i + 1))

                    batch['x'] = full_batch['x'][start_pos:end_pos, :]
                    batch['x_reversed'] = full_batch['x_reversed'][start_pos:end_pos, :]
                    batch['x_mask'] = full_batch['x_mask'][start_pos:end_pos, :]
                    batch['max_length'] = end_pos - start_pos
                    batch['num_preds'] = numpy.sum(batch['x_mask']) - numpy.sum(batch['x_mask'][0,:])

                    # For each batch we compute the number of dialogues as a fraction of the full batch,
                    # that way, when we add them together, we get the total number of dialogues.
                    batch['num_dialogues'] = float(full_batch['num_dialogues']) / float(splits)
                    batch['x_reset'] = numpy.ones(self.state['bs'], dtype='float32')

                    batches.append(batch)

                if len(batches) > 0:
                    # x_reset == 0 on the final split marks the dialogue boundary.
                    batches[-1]['x_reset'] = numpy.zeros(self.state['bs'], dtype='float32')

                    # Trim the last very short batch
                    # NOTE(review): if only one split exists, del leaves the list
                    # empty and batches[-1] would raise IndexError — confirm.
                    if batches[-1]['max_length'] < 3:
                        del batches[-1]
                        batches[-1]['x_reset'] = numpy.zeros(self.state['bs'], dtype='float32')
                        logger.debug("Truncating last mini-batch...")

                for batch in batches:
                    if batch:
                        yield batch


    def start(self):
        SSIterator.start(self)
        self.batch_iter = None

    def next(self, batch_size = -1):
        """
        We can specify a batch size,
        independent of the object initialization.
        """
        # If there are no more batches in list, try to generate new batches
        if not self.batch_iter:
            self.batch_iter = self.get_homogenous_batch_iter(batch_size)

        try:
            # Retrieve next batch
            batch = next(self.batch_iter)

            # Add Gaussian random variables to batch.
            # We add them separetly for each batch to save memory.
            # If we instead had added them to the full batch before splitting into mini-batches,
            # the random variables would take up several GBs for big batches and long documents.
            batch = add_random_variables_to_batch(self.state, self.rng, batch, self.prev_batch, self.evaluate_mode)
            # Keep track of last batch
            self.prev_batch = batch
        except StopIteration:
            return None
        return batch


    def get_offset(self):
        # Corpus offset of the last example handed out (for checkpointing).
        return self.last_returned_offset

    def get_reshuffle_count(self):
        # Number of reshuffles performed so far (for checkpointing).
        return self.last_returned_reshuffle_count


def get_train_iterator(state):
    """Build the (train, valid) Iterator pair from a state dict."""
    train_data = Iterator(
        state['train_dialogues'],
        int(state['bs']),
        state=state,
        seed=state['seed'],
        use_infinite_loop=True,
        max_len=state.get('max_len', -1),
        evaluate_mode=False)

    valid_data = Iterator(
        state['valid_dialogues'],
        int(state['bs']),
        state=state,
        seed=state['seed'],
        use_infinite_loop=False,
        max_len=state.get('max_len', -1),
        evaluate_mode=True)
    return train_data, valid_data

def get_test_iterator(state):
    """Build the test-set Iterator from a state dict."""
    assert 'test_dialogues' in state

    test_data = Iterator(
        state.get('test_dialogues'),
        int(state['bs']),
        state=state,
        seed=state['seed'],
        use_infinite_loop=False,
        max_len=state.get('max_len', -1),
        evaluate_mode=True)
    return test_data
diff --git a/parlai/agents/hred/dialog_encdec.py b/parlai/agents/hred/dialog_encdec.py new file mode 100755 index 00000000000..4c5189beeda --- /dev/null +++ b/parlai/agents/hred/dialog_encdec.py @@ -0,0 +1,3350 @@
"""
Dialog hierarchical encoder-decoder code.
The code is inspired from nmt encdec code in groundhog
but we do not rely on groundhog infrastructure.
+""" +__docformat__ = 'restructedtext en' +__authors__ = ("Iulian Vlad Serban") + +import theano +import theano.tensor as T +import numpy as np +import cPickle +import logging +logger = logging.getLogger(__name__) + +from theano import scan +from theano.sandbox.rng_mrg import MRG_RandomStreams +# Deprecated +#from theano.tensor.nnet.conv3d2d import * + +from collections import OrderedDict + +from model import * +from utils import * + +import operator + +def add_to_params(params, new_param): + params.append(new_param) + return new_param + + +class EncoderDecoderBase(): + def __init__(self, state, rng, parent): + self.rng = rng + self.parent = parent + + self.state = state + self.__dict__.update(state) + + self.dialogue_rec_activation = eval(self.dialogue_rec_activation) + self.sent_rec_activation = eval(self.sent_rec_activation) + + self.params = [] + +class LinearCombination(EncoderDecoderBase): + """ + This module computes a per-dimension weighted sum of two vectors x and y. + The module can be extended, so that the weights of x and y depends on a conditioning vector (cond). 
+ """ + + def init_params(self, cond_size, output_size, force_min_max_intervals, min_val, max_val): + self.W = add_to_params(self.params, theano.shared(value=np.ones((output_size,), dtype='float32'), name='W_x'+self.name)) + + self.force_min_max_intervals = force_min_max_intervals + self.min_val = min_val + self.max_val = max_val + + def build_output(self, cond, x, y): + res = self.W*x + (np.float32(1.0) - self.W)*y + + if self.force_min_max_intervals: + return T.clip(res, self.min_val, self.max_val) + else: + return res + + def __init__(self, state, cond_size, output_size, force_min_max_intervals, min_val, max_val, rng, parent, name): + EncoderDecoderBase.__init__(self, state, rng, parent) + self.name = name + self.init_params(cond_size, output_size, force_min_max_intervals, min_val, max_val) + + +class OneLayerMLP(EncoderDecoderBase): + def init_params(self, inp_size, hidden_size, output_size): + # First layer + self.W1_in_act = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, inp_size, hidden_size), name='W1_in_'+self.name)) + self.b1_in_act = add_to_params(self.params, theano.shared(value=np.zeros((hidden_size,), dtype='float32'), name='b1_in_'+self.name)) + + # First layer batch norm / layer norm parameters + self.normop_in_act_h1_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((hidden_size,), dtype='float32'), name='normop_in_act_h1_gamma_'+self.name)) + self.normop_in_act_h1_mean = add_to_params(self.params, theano.shared(value=np.zeros((hidden_size,), dtype='float32'), name='normop_in_act_h1_mean_'+self.name)) + self.normop_in_act_h1_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((hidden_size,), dtype='float32'), name='normop_in_act_h1_var_'+self.name)) + + # Output layer + self.W2_in_act = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, hidden_size, output_size), name='W2_in_'+self.name)) + self.b2_in_act = add_to_params(self.params, 
theano.shared(value=np.zeros((output_size,), dtype='float32'), name='b2_in_'+self.name)) + + # Output layer batch norm / layer norm parameters + self.normop_in_act_h2_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((output_size,), dtype='float32'), name='normop_in_act_h2_gamma_'+self.name)) + self.normop_in_act_h2_mean = add_to_params(self.params, theano.shared(value=np.zeros((output_size,), dtype='float32'), name='normop_in_act_h2_mean_'+self.name)) + self.normop_in_act_h2_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((output_size,), dtype='float32'), name='normop_in_act_h2_var_'+self.name)) + + def build_output(self, inp, bnmask): + # Make sure bnmask is of type float32: + if bnmask: + bnmask = T.cast(bnmask, 'float32') + + # Execute normalization operator on inputs + h_nonlinear_inp, h_nonlinear_inp_mean, h_nonlinear_inp_var = NormalizationOperator(self.normop_type, T.dot(inp, self.W1_in_act) + self.b1_in_act, self.normop_in_act_h1_gamma, bnmask, self.normop_in_act_h1_mean, self.normop_in_act_h1_var) + + # Compute hidden layer + h = T.nnet.relu(h_nonlinear_inp) + + # Execute normalization operator on hidden layer + output, output_mean, output_var = NormalizationOperator(self.normop_type, T.dot(h, self.W2_in_act) + self.b2_in_act, self.normop_in_act_h2_gamma, bnmask, self.normop_in_act_h2_mean, self.normop_in_act_h2_var) + + # Create batch norm updates + updates = [] + if self.normop_type == 'BN': + print(' Creating batch norm updates for OneLayerMLP (' + self.name + '):') + vars_to_update = [self.normop_in_act_h1_mean, self.normop_in_act_h1_var] + vars_estimates = [h_nonlinear_inp_mean, h_nonlinear_inp_var, output_mean, output_var] + + assert len(vars_estimates) == len(vars_to_update) + + for i in range(len(vars_estimates)): + print(' ', vars_to_update[i]) + new_value = self.normop_moving_average_const*vars_to_update[i] \ + + (1.0 - self.normop_moving_average_const)*vars_estimates[i] + 
class TwoLayerMLP(EncoderDecoderBase):
    """
    Two-layer MLP where each layer combines a tanh non-linearity with a linear
    skip connection: ``out = tanh(norm(x W_tanh + b_tanh)) + norm(x W_skip) + b_skip``.

    Batch norm / layer norm (selected by ``self.normop_type``) is applied to the
    pre-activations of every path. When batch norm is used, ``build_output``
    also returns the moving-average updates for the running statistics.
    """

    def init_params(self, inp_size, hidden_size, output_size):
        """Create weights, biases and normalization parameters for both layers."""
        # First layer: tanh path and linear skip path
        self.W1_in_tanh = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, inp_size, hidden_size), name='W1_in_'+self.name))
        self.b1_in_tanh = add_to_params(self.params, theano.shared(value=np.zeros((hidden_size,), dtype='float32'), name='b1_in_'+self.name))
        self.W1_in_skip = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, inp_size, hidden_size), name='W1_in_skip_'+self.name))
        self.b1_in_skip = add_to_params(self.params, theano.shared(value=np.zeros((hidden_size,), dtype='float32'), name='b1_in_skip_'+self.name))

        # First layer batch norm / layer norm parameters
        self.normop_in_tanh_h1_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((hidden_size,), dtype='float32'), name='normop_in_tanh_h1_gamma_'+self.name))
        self.normop_in_tanh_h1_mean = add_to_params(self.params, theano.shared(value=np.zeros((hidden_size,), dtype='float32'), name='normop_in_tanh_h1_mean_'+self.name))
        self.normop_in_tanh_h1_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((hidden_size,), dtype='float32'), name='normop_in_tanh_h1_var_'+self.name))

        self.normop_in_skip_h1_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((hidden_size,), dtype='float32'), name='normop_in_skip_h1_gamma_'+self.name))
        self.normop_in_skip_h1_mean = add_to_params(self.params, theano.shared(value=np.zeros((hidden_size,), dtype='float32'), name='normop_in_skip_h1_mean_'+self.name))
        self.normop_in_skip_h1_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((hidden_size,), dtype='float32'), name='normop_in_skip_h1_var_'+self.name))

        # Second layer: tanh path and linear skip path
        self.W2_in_tanh = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, hidden_size, output_size), name='W2_in_'+self.name))
        self.b2_in_tanh = add_to_params(self.params, theano.shared(value=np.zeros((output_size,), dtype='float32'), name='b2_in_'+self.name))

        self.W2_in_skip = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, hidden_size, output_size), name='W2_in_skip_'+self.name))
        self.b2_in_skip = add_to_params(self.params, theano.shared(value=np.zeros((output_size,), dtype='float32'), name='b2_in_skip_'+self.name))

        # Second layer batch norm / layer norm parameters
        self.normop_in_tanh_h2_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((output_size,), dtype='float32'), name='normop_in_tanh_h2_gamma_'+self.name))
        self.normop_in_tanh_h2_mean = add_to_params(self.params, theano.shared(value=np.zeros((output_size,), dtype='float32'), name='normop_in_tanh_h2_mean_'+self.name))
        self.normop_in_tanh_h2_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((output_size,), dtype='float32'), name='normop_in_tanh_h2_var_'+self.name))

        self.normop_in_skip_h2_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((output_size,), dtype='float32'), name='normop_in_skip_h2_gamma_'+self.name))
        self.normop_in_skip_h2_mean = add_to_params(self.params, theano.shared(value=np.zeros((output_size,), dtype='float32'), name='normop_in_skip_h2_mean_'+self.name))
        self.normop_in_skip_h2_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((output_size,), dtype='float32'), name='normop_in_skip_h2_var_'+self.name))

    def build_output(self, inp, bnmask):
        """
        Build the symbolic output for input ``inp``.

        Returns ``(output, updates)`` where ``updates`` is the (possibly empty)
        list of batch norm moving-average updates.
        """
        # Make sure bnmask is of type float32.
        # BUGFIX: use an explicit None check -- truth-testing a symbolic Theano
        # variable (``if bnmask:``) is not valid.
        if bnmask is not None:
            bnmask = T.cast(bnmask, 'float32')

        # Execute normalization operator on inputs to the first layer.
        # BUGFIX: the first layer previously normalized the skip path with the
        # '*_tanh_h1_*' statistics and the tanh path with the '*_skip_h1_*'
        # statistics -- swapped relative to both the parameter names and the
        # second layer below. All gammas/means/vars are initialized identically,
        # so this does not change the initial function; it only makes the stored
        # running statistics consistent with their names.
        h_linear_inp, h_linear_inp_mean, h_linear_inp_var = NormalizationOperator(self.normop_type, T.dot(inp, self.W1_in_skip), self.normop_in_skip_h1_gamma, bnmask, self.normop_in_skip_h1_mean, self.normop_in_skip_h1_var)

        h_nonlinear_inp, h_nonlinear_inp_mean, h_nonlinear_inp_var = NormalizationOperator(self.normop_type, T.dot(inp, self.W1_in_tanh) + self.b1_in_tanh, self.normop_in_tanh_h1_gamma, bnmask, self.normop_in_tanh_h1_mean, self.normop_in_tanh_h1_var)

        # Compute first hidden layer (tanh path + linear skip path)
        h = T.tanh(h_nonlinear_inp) + h_linear_inp + self.b1_in_skip

        # Execute normalization operator on inputs to the second layer
        h2_linear_inp, h2_linear_inp_mean, h2_linear_inp_var = NormalizationOperator(self.normop_type, T.dot(h, self.W2_in_skip), self.normop_in_skip_h2_gamma, bnmask, self.normop_in_skip_h2_mean, self.normop_in_skip_h2_var)
        h2_nonlinear_inp, h2_nonlinear_inp_mean, h2_nonlinear_inp_var = NormalizationOperator(self.normop_type, T.dot(h, self.W2_in_tanh) + self.b2_in_tanh, self.normop_in_tanh_h2_gamma, bnmask, self.normop_in_tanh_h2_mean, self.normop_in_tanh_h2_var)

        output = T.tanh(h2_nonlinear_inp) + h2_linear_inp + self.b2_in_skip

        # Create batch norm updates
        updates = []
        if self.normop_type == 'BN':
            print(' Creating batch norm updates for TwoLayerMLP (' + self.name + '):')
            # Pair each running statistic with its batch estimate; skip path
            # first, then tanh path, for each layer -- matching the usage above.
            vars_to_update = [self.normop_in_skip_h1_mean, self.normop_in_skip_h1_var, self.normop_in_tanh_h1_mean, self.normop_in_tanh_h1_var, self.normop_in_skip_h2_mean, self.normop_in_skip_h2_var, self.normop_in_tanh_h2_mean, self.normop_in_tanh_h2_var]
            vars_estimates = [h_linear_inp_mean, h_linear_inp_var, h_nonlinear_inp_mean, h_nonlinear_inp_var, h2_linear_inp_mean, h2_linear_inp_var, h2_nonlinear_inp_mean, h2_nonlinear_inp_var]

            assert len(vars_estimates) == len(vars_to_update)

            for i in range(len(vars_estimates)):
                print('     ', vars_to_update[i])
                # Exponential moving average of the batch estimate
                new_value = self.normop_moving_average_const*vars_to_update[i] \
                    + (1.0 - self.normop_moving_average_const)*vars_estimates[i]
                updates.append((vars_to_update[i], new_value))

        return output, updates

    def __init__(self, state, rng, inp_size, hidden_size, output_size, parent, name):
        EncoderDecoderBase.__init__(self, state, rng, parent)
        self.name = name
        self.init_params(inp_size, hidden_size, output_size)
class UtteranceEncoder(EncoderDecoderBase):
    """
    This is the GRU-gated RNN encoder class, which operates on hidden states at the word level
    (intra-utterance level). It encodes utterances into real-valued fixed-sized vectors.
    """

    def init_params(self, word_embedding_param):
        # Initialize W_emb to given word embeddings (shared with the caller; not copied)
        assert(word_embedding_param != None)
        self.W_emb = word_embedding_param

        """ sent weights """
        # Input-to-hidden, hidden-to-hidden and bias for the candidate state
        self.W_in = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, self.qdim_encoder), name='W_in_'+self.name))
        self.W_hh = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.qdim_encoder, self.qdim_encoder), name='W_hh_'+self.name))
        self.b_hh = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_encoder,), dtype='float32'), name='b_hh_'+self.name))

        # Initialize batch norm / layer norm parameters.
        # Note the running mean/var are kept PER TIME STEP (shape
        # (normop_max_enc_seq, qdim_encoder)); GRU_step indexes into them with
        # a truncated step counter.
        self.normop_in_h_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.qdim_encoder,), dtype='float32'), name='normop_in_h_gamma_'+self.name))
        self.normop_in_h_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_in_h_mean_'+self.name))
        self.normop_in_h_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_in_h_var_'+self.name))

        self.normop_in_x_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.qdim_encoder,), dtype='float32'), name='normop_in_x_gamma_'+self.name))
        self.normop_in_x_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_in_x_mean_'+self.name))
        self.normop_in_x_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_in_x_var_'+self.name))

        if self.utterance_encoder_gating == "GRU":
            # Reset and update gate weights
            self.W_in_r = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, self.qdim_encoder), name='W_in_r_'+self.name))
            self.W_in_z = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, self.qdim_encoder), name='W_in_z_'+self.name))
            self.W_hh_r = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.qdim_encoder, self.qdim_encoder), name='W_hh_r_'+self.name))
            self.W_hh_z = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.qdim_encoder, self.qdim_encoder), name='W_hh_z_'+self.name))
            self.b_z = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_encoder,), dtype='float32'), name='b_z_'+self.name))
            self.b_r = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_encoder,), dtype='float32'), name='b_r_'+self.name))

            # Initialize batch norm / layer norm parameters for the gates
            # (again with per-time-step running statistics)
            self.normop_r_h_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.qdim_encoder,), dtype='float32'), name='normop_r_h_gamma_'+self.name))
            self.normop_r_h_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_r_h_mean_'+self.name))
            self.normop_r_h_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_r_h_var_'+self.name))

            self.normop_r_x_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.qdim_encoder,), dtype='float32'), name='normop_r_x_gamma_'+self.name))
            self.normop_r_x_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_r_x_mean_'+self.name))
            self.normop_r_x_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_r_x_var_'+self.name))

            self.normop_z_h_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.qdim_encoder,), dtype='float32'), name='normop_z_h_gamma_'+self.name))
            self.normop_z_h_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_z_h_mean_'+self.name))
            self.normop_z_h_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_z_h_var_'+self.name))

            self.normop_z_x_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.qdim_encoder,), dtype='float32'), name='normop_z_x_gamma_'+self.name))
            self.normop_z_x_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_z_x_mean_'+self.name))
            self.normop_z_x_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.normop_max_enc_seq, self.qdim_encoder), dtype='float32'), name='normop_z_x_var_'+self.name))

    # This function takes as input word indices and extracts their corresponding word embeddings
    def approx_embedder(self, x):
        return self.W_emb[x]

    def plain_step(self, x_t, m_t, bnmask_t, *args):
        """One vanilla-RNN step; ``args`` carries only the previous hidden state."""
        args = iter(args)
        h_tm1 = next(args)

        if m_t.ndim >= 1:
            m_t = m_t.dimshuffle(0, 'x')

        # If 'reset_utterance_encoder_at_end_of_utterance' flag is on,
        # then reset the hidden state if this is an end-of-utterance token
        # as given by m_t
        if self.reset_utterance_encoder_at_end_of_utterance:
            hr_tm1 = m_t * h_tm1
        else:
            hr_tm1 = h_tm1

        h_t = self.sent_rec_activation(T.dot(x_t, self.W_in) + T.dot(hr_tm1, self.W_hh) + self.b_hh)

        # Return hidden state only
        return [h_t]

    def GRU_step(self, x_t, m_t, bnmask_t, *args):
        """
        One GRU step with per-time-step batch/layer normalization.
        ``args`` carries the previous hidden state and the per-example step
        counter ``n_t`` (used to index the running norm statistics).
        """
        args = iter(args)
        h_tm1 = next(args)
        n_t = next(args)

        # Advance the step counter; when resetting at utterance boundaries the
        # counter is also zeroed at end-of-utterance tokens (m_t <= 0.5).
        if self.reset_utterance_encoder_at_end_of_utterance:
            new_n_t = T.gt(m_t, 0.5)*(n_t + 1) # n_t + T.gt(m_t, 0.5)
        else:
            new_n_t = n_t + 1

        new_n_t = T.cast(new_n_t, 'int8')

        # Clamp the counter into [0, normop_max_enc_seq - 1] so it can index
        # the fixed-size running-statistics tables.
        if n_t.ndim == 2:
            n_t_truncated = T.maximum(0, T.minimum(n_t[0,:], self.normop_max_enc_seq - 1))
        else:
            n_t_truncated = T.maximum(0, T.minimum(n_t, self.normop_max_enc_seq - 1))

        if m_t.ndim >= 1:
            m_t = m_t.dimshuffle(0, 'x')

        # If 'reset_utterance_encoder_at_end_of_utterance' flag is on,
        # then reset the hidden state if this is an end-of-utterance token
        # as given by m_t
        if self.reset_utterance_encoder_at_end_of_utterance:
            hr_tm1 = m_t * h_tm1
        else:
            hr_tm1 = h_tm1

        # Compute reset gate
        r_t_normop_x_inp, r_t_normop_x_mean, r_t_normop_x_var = NormalizationOperator(self.normop_type, T.dot(x_t, self.W_in_r), self.normop_r_x_gamma, bnmask_t, self.normop_r_x_mean[n_t_truncated, :], self.normop_r_x_var[n_t_truncated, :])
        r_t_normop_h_inp, r_t_normop_h_mean, r_t_normop_h_var = NormalizationOperator(self.normop_type, T.dot(hr_tm1, self.W_hh_r), self.normop_r_h_gamma, bnmask_t, self.normop_r_h_mean[n_t_truncated, :], self.normop_r_h_var[n_t_truncated, :])
        r_t = T.nnet.sigmoid(r_t_normop_x_inp + r_t_normop_h_inp + self.b_r)

        # Compute update gate
        z_t_normop_x_inp, z_t_normop_x_mean, z_t_normop_x_var = NormalizationOperator(self.normop_type, T.dot(x_t, self.W_in_z), self.normop_z_x_gamma, bnmask_t, self.normop_z_x_mean[n_t_truncated, :], self.normop_z_x_var[n_t_truncated, :])
        z_t_normop_h_inp, z_t_normop_h_mean, z_t_normop_h_var = NormalizationOperator(self.normop_type, T.dot(hr_tm1, self.W_hh_z), self.normop_z_h_gamma, bnmask_t, self.normop_z_h_mean[n_t_truncated, :], self.normop_z_h_var[n_t_truncated, :])
        z_t = T.nnet.sigmoid(z_t_normop_x_inp + z_t_normop_h_inp + self.b_z)

        # Compute h_tilde (candidate hidden state)
        h_tilde_normop_x_inp, h_tilde_normop_x_mean, h_tilde_normop_x_var = NormalizationOperator(self.normop_type, T.dot(x_t, self.W_in), self.normop_in_x_gamma, bnmask_t, self.normop_in_x_mean[n_t_truncated, :], self.normop_in_x_var[n_t_truncated, :])

        h_tilde_normop_h_inp, h_tilde_normop_h_mean, h_tilde_normop_h_var = NormalizationOperator(self.normop_type, T.dot(r_t * hr_tm1, self.W_hh), self.normop_in_h_gamma, bnmask_t, self.normop_in_h_mean[n_t_truncated, :], self.normop_in_h_var[n_t_truncated, :])

        h_tilde = self.sent_rec_activation(h_tilde_normop_x_inp + h_tilde_normop_h_inp + self.b_hh)

        # Compute h (standard GRU interpolation)
        h_t = (np.float32(1.0) - z_t) * hr_tm1 + z_t * h_tilde

        # return states, gates and batch norm parameters.
        # Order matters: build_encoder reads the 12 norm statistics starting at
        # index 5 (see the ``_res[5+varidx]`` indexing below).
        return [h_t, T.cast(new_n_t, 'int8'), r_t, z_t, h_tilde, r_t_normop_x_mean, r_t_normop_x_var, r_t_normop_h_mean, r_t_normop_h_var, z_t_normop_x_mean, z_t_normop_x_var, z_t_normop_h_mean, z_t_normop_h_var, h_tilde_normop_x_mean, h_tilde_normop_x_var, h_tilde_normop_h_mean, h_tilde_normop_h_var]

    def build_encoder(self, x, xmask=None, bnmask=None, prev_state=None, **kwargs):
        """
        Build the encoder graph over token indices ``x``.
        Returns ``(h, n, updates)``: hidden states, step counters (0 for the
        non-GRU gating) and the list of batch norm moving-average updates.
        Any keyword argument switches the graph into one-step (sampling) mode.
        """
        one_step = False
        if len(kwargs):
            one_step = True

        # if x.ndim == 2 then
        # x = (n_steps, batch_size)
        if x.ndim == 2:
            batch_size = x.shape[1]
        # else x = (word_1, word_2, word_3, ...)
        # or x = (last_word_1, last_word_2, last_word_3, ..)
        # in this case batch_size is
        else:
            batch_size = 1

        # if it is not one_step then we initialize everything to previous state or zero
        if not one_step:
            if prev_state:
                h_0, n_0 = prev_state
            else:
                h_0 = T.alloc(np.float32(0), batch_size, self.qdim_encoder)
                n_0 = T.alloc(np.int8(0), batch_size)

        # in sampling mode (i.e. one step) we require
        else:
            # in this case x.ndim != 2
            assert x.ndim != 2
            assert 'prev_h' in kwargs
            h_0 = kwargs['prev_h']
            n_0 = T.alloc(np.int8(0), batch_size)

        # We extract the word embeddings from the word indices
        xe = self.approx_embedder(x)
        if xmask == None:
            xmask = T.neq(x, self.eos_sym)

        bnmask_given = True
        if bnmask == None:
            bnmask_given = False
            bnmask = T.zeros(xmask.shape, dtype='float32')

        # We add ones at the the beginning of the reset vector to align the resets with y_training:
        # for example for
        # training_x = a b c d
        # xmask = 0 1 1 1 0 1
        # rolled_xmask = 1 0 1 1 1 0 1
        # Thus, we ensure that the no information in the encoder is carried from input "" to "a",
        # or from "" to "d".
        # Now, the state at exactly always reflects the previous utterance encoding.
        # Since the dialogue encoder uses xmask, and inputs it when xmask=0, it will input the utterance encoding
        # exactly on the state.

        if xmask.ndim == 2:
            ones_vector = T.ones_like(xmask[0,:]).dimshuffle('x', 0)
            rolled_xmask = T.concatenate([ones_vector, xmask], axis=0)
        else:
            # NOTE(review): this uses the module name ``numpy`` rather than the
            # ``np`` alias used everywhere else -- verify that the file imports
            # numpy under both names.
            ones_scalar = theano.shared(value=numpy.ones((1), dtype='float32'), name='ones_scalar')
            rolled_xmask = T.concatenate([ones_scalar, xmask])

        # GRU Encoder
        if self.utterance_encoder_gating == "GRU":
            f_enc = self.GRU_step
            # 17 outputs_info entries: state, counter, then None for the 15
            # non-recurrent outputs of GRU_step
            o_enc_info = [h_0, n_0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
        else:
            f_enc = self.plain_step
            o_enc_info = [h_0]

        # Run through all tokens (encode everything)
        if not one_step:
            _res, _ = theano.scan(f_enc,
                              sequences=[xe, rolled_xmask, bnmask],\
                              outputs_info=o_enc_info)
        else: # Make just one step further
            # NOTE(review): this passes ``[h_0, n_0]`` as a single positional
            # argument, while both step functions unpack *args element-wise
            # (h_tm1 = next(args); n_t = next(args)); combined with the ``[0]``
            # indexing here and ``_res[0], _res[1]`` below, the one-step GRU
            # path looks inconsistent -- confirm against the sampling code path
            # before relying on it.
            _res = f_enc(xe, rolled_xmask, bnmask, [h_0, n_0])[0]

        # Get the hidden state sequence
        if self.utterance_encoder_gating == 'GRU':
            h, n = _res[0], _res[1]
            updates = []

            # Create batch norm updates
            if self.normop_type == 'BN':
                if (not one_step) and (x.ndim == 2) and (bnmask_given):
                    updates = []
                    # Only the first n_max time steps have fresh estimates
                    n_max = T.maximum(0, T.minimum(h.shape[0]-1, self.normop_max_enc_seq))
                    vars_to_update = [self.normop_r_x_mean, self.normop_r_x_var, self.normop_r_h_mean, self.normop_r_h_var, self.normop_z_x_mean, self.normop_z_x_var, self.normop_z_h_mean, self.normop_z_h_var, self.normop_in_x_mean, self.normop_in_x_var, self.normop_in_h_mean, self.normop_in_h_var]

                    # GRU_step returns 5 state/gate outputs followed by these 12 statistics
                    assert len(_res) == len(vars_to_update)+5
                    print(' Creating batch norm updates for GRU Utterance Encoder (' + self.name + '):')
                    for varidx, var in enumerate(vars_to_update):
                        # Moving average over the first n_max time steps only
                        sub_new_value = self.normop_moving_average_const*var[0:n_max] \
                            + (1.0-self.normop_moving_average_const)*_res[5+varidx][0:n_max]
                        new_value = T.set_subtensor(var[0:n_max], sub_new_value)
                        updates.append((var, new_value))
                        print('     ' + str(var))

        else:
            h = _res
            n = 0
            updates = []

        return h, n, updates

    def __init__(self, state, rng, word_embedding_param, parent, name):
        EncoderDecoderBase.__init__(self, state, rng, parent)
        self.name = name
        self.init_params(word_embedding_param)
class DCGMEncoder(EncoderDecoderBase):
    """
    This is the bag-of-words (DCGM) RNN encoder class, which operates on hidden states at the word level (intra-utterance level).
    It encodes utterances into real-valued fixed-sized vectors by maintaining a
    running average of the word embeddings, followed by a linear projection.
    """

    def init_params(self, word_embedding_param):
        """Create the projection parameters; word embeddings are shared, not copied."""
        # Initialize W_emb to given word embeddings
        assert word_embedding_param is not None
        self.W_emb = word_embedding_param
        self.Wq_in = add_to_params(self.params, \
                                   theano.shared(value=NormalInit(self.rng, self.rankdim, self.output_dim), name='dcgm_Wq_in'+self.name))
        self.bq_in = add_to_params(self.params, \
                                   theano.shared(value=np.zeros((self.output_dim,), dtype='float32'), name='dcgm_bq_in'+self.name))

    def mean_step(self, x_t, m_t, *args):
        """
        One step of the running average: resets at utterance boundaries
        (m_t == 0) and folds x_t into the incremental mean.
        """
        args = iter(args)

        # already computed avg
        avg_past = next(args)
        n_past = next(args)

        if m_t.ndim >= 1:
            m_t = m_t.dimshuffle(0, 'x')

        # reset avg at utterance boundaries
        avg_past_r = m_t * avg_past
        n_past_r = m_t.T * n_past

        n = n_past_r + 1.0

        # Incremental mean: avg = (avg_old*(n-1) + x) / n, broadcast over dims
        resized_n = T.repeat(n.T, avg_past_r.shape[1], axis=1)
        avg = (avg_past_r * (resized_n - 1) + x_t) / resized_n

        # return state and pooled state
        return avg, n

    # This function takes as input word indices and extracts their corresponding word embeddings
    def approx_embedder(self, x):
        return self.W_emb[x]

    def build_encoder(self, x, xmask=None, prev_state=None, **kwargs):
        """
        Build the encoder graph over token indices ``x``.
        Returns ``(avg_q, avg, n)``: the projected average, the raw running
        average and the word counters. Any keyword argument switches the graph
        into one-step (sampling) mode.
        """
        one_step = False
        if len(kwargs):
            one_step = True

        if x.ndim == 2:
            batch_size = x.shape[1]
        else:
            batch_size = 1

        # if it is not one_step then we initialize everything to previous state or zero
        if not one_step:
            if prev_state:
                avg_0, n_0 = prev_state
            else:
                avg_0 = T.alloc(np.float32(0), batch_size, self.rankdim)
                n_0 = T.alloc(np.float32(0), batch_size)

        # in sampling mode (i.e. one step) we require the previous average
        else:
            # in this case x.ndim != 2
            assert x.ndim != 2
            assert 'prev_avg' in kwargs
            avg_0 = kwargs['prev_avg']
            # BUGFIX: n_0 was left undefined on the one-step path, raising a
            # NameError when o_enc_info was built below. Accept an optional
            # 'prev_n' kwarg (backward compatible) and default to zero counters.
            n_0 = kwargs.get('prev_n', T.alloc(np.float32(0), batch_size))

        xe = self.approx_embedder(x)
        if xmask is None:
            xmask = T.neq(x, self.eos_sym)

        # Prepend ones to the mask so resets align with the training targets
        if xmask.ndim == 2:
            ones_vector = T.ones_like(xmask[0,:]).dimshuffle('x', 0)
            rolled_xmask = T.concatenate([ones_vector, xmask], axis=0)
        else:
            ones_scalar = theano.shared(value=numpy.ones((1), dtype='float32'), name='ones_scalar')
            rolled_xmask = T.concatenate([ones_scalar, xmask])

        f_enc = self.mean_step
        o_enc_info = [avg_0, n_0]

        # Run through all tokens (encode everything)
        if not one_step:
            _res, _ = theano.scan(f_enc,
                              sequences=[xe, rolled_xmask],\
                              outputs_info=o_enc_info)
            avg, n = _res[0], _res[1]
        else: # Make just one step further
            # BUGFIX: mean_step unpacks its state from *args element-wise and
            # returns a (avg, n) tuple; the old call passed [avg_0, n_0] as a
            # single argument (StopIteration on the second next(args)) and then
            # mis-unpacked the result.
            avg, n = f_enc(xe, rolled_xmask, avg_0, n_0)

        # Linear activation
        avg_q = T.dot(avg, self.Wq_in) + self.bq_in
        return avg_q, avg, n

    def __init__(self, state, rng, word_embedding_param, output_dim, parent, name):
        EncoderDecoderBase.__init__(self, state, rng, parent)
        self.name = name
        self.output_dim = output_dim
        self.init_params(word_embedding_param)
+ """ + + def init_params(self): + """ Context weights """ + + # If the dialogue encoder is diabled, do not initialize any parameters + if self.disable_dialogue_encoder: + return + + if self.bidirectional_utterance_encoder: + # With the bidirectional flag, the dialog encoder gets input + # from both the forward and backward utterance encoders, hence it is double qdim_encoder + input_dim = self.qdim_encoder * 2 + else: + # Without the bidirectional flag, the dialog encoder only gets input + # from the forward utterance encoder, which has dim self.qdim_encoder + input_dim = self.qdim_encoder + + + transformed_input_dim = input_dim + if self.deep_dialogue_encoder_input: + transformed_input_dim = self.sdim + + self.input_mlp = TwoLayerMLP(self.state, self.rng, input_dim, self.sdim, self.sdim, self, '_input_mlp_'+self.name) + self.params += self.input_mlp.params + + self.Ws_in = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, transformed_input_dim, self.sdim), name='Ws_in'+self.name)) + self.Ws_hh = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.sdim, self.sdim), name='Ws_hh'+self.name)) + self.bs_hh = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='bs_hh'+self.name)) + + if self.dialogue_encoder_gating == "GRU": + self.Ws_in_r = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, transformed_input_dim, self.sdim), name='Ws_in_r'+self.name)) + self.Ws_in_z = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, transformed_input_dim, self.sdim), name='Ws_in_z'+self.name)) + self.Ws_hh_r = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.sdim, self.sdim), name='Ws_hh_r'+self.name)) + self.Ws_hh_z = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.sdim, self.sdim), name='Ws_hh_z'+self.name)) + self.bs_z = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), 
name='bs_z'+self.name)) + self.bs_r = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='bs_r'+self.name)) + + # Linear skip connections, which acts as an "overwrite" mechanism. + # It allows each GRU unit to replace its hidden state with the incoming input. + # This is potentially useful, for example, if the dialogue changes topic. + self.Ws_in_overwrite = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, transformed_input_dim, self.sdim), name='Ws_in_overwrite'+self.name)) + self.bs_overwrite = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='bs_overwrite'+self.name)) + + # Gating mechanism defining whether to overwrite or not + self.Ws_in_o = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, transformed_input_dim, self.sdim), name='Ws_in_o'+self.name)) + self.Ws_hh_o = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.sdim, self.sdim), name='Ws_hh_o'+self.name)) + self.bs_o = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='bs_o'+self.name)) + + + + + # Batch norm parameters + self.normop_in_hs_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.sdim,), dtype='float32'), name='normop_in_hs_gamma'+self.name)) + self.normop_in_hs_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='normop_in_hs_mean'+self.name)) + self.normop_in_hs_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.sdim,), dtype='float32'), name='normop_in_hs_var'+self.name)) + + self.normop_in_h_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.sdim,), dtype='float32'), name='normop_in_h_gamma'+self.name)) + self.normop_in_h_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='normop_in_h_mean'+self.name)) 
+ self.normop_in_h_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.sdim,), dtype='float32'), name='normop_in_h_var'+self.name)) + + self.normop_rs_hs_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.sdim,), dtype='float32'), name='normop_rs_hs_gamma'+self.name)) + self.normop_rs_hs_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='normop_rs_hs_mean'+self.name)) + self.normop_rs_hs_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.sdim,), dtype='float32'), name='normop_rs_hs_var'+self.name)) + + self.normop_rs_h_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.sdim,), dtype='float32'), name='normop_rs_h_gamma'+self.name)) + self.normop_rs_h_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='normop_rs_h_mean'+self.name)) + self.normop_rs_h_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.sdim,), dtype='float32'), name='normop_rs_h_var'+self.name)) + + self.normop_zs_hs_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.sdim,), dtype='float32'), name='normop_zs_hs_gamma'+self.name)) + self.normop_zs_hs_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='normop_zs_hs_mean'+self.name)) + self.normop_zs_hs_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.sdim,), dtype='float32'), name='normop_zs_hs_var'+self.name)) + + self.normop_zs_h_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.sdim,), dtype='float32'), name='normop_zs_h_gamma'+self.name)) + self.normop_zs_h_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='normop_zs_h_mean'+self.name)) + self.normop_zs_h_var = add_to_params(self.params, 
theano.shared(value=(1e-7)*np.ones((self.sdim,), dtype='float32'), name='normop_zs_h_var'+self.name)) + + self.normop_os_hs_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.sdim,), dtype='float32'), name='normop_os_hs_gamma'+self.name)) + self.normop_os_hs_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='normop_os_hs_mean'+self.name)) + self.normop_os_hs_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.sdim,), dtype='float32'), name='normop_os_hs_var'+self.name)) + + self.normop_os_h_gamma = add_to_params(self.params, theano.shared(value=self.normop_gamma_init*np.ones((self.sdim,), dtype='float32'), name='normop_os_h_gamma'+self.name)) + self.normop_os_h_mean = add_to_params(self.params, theano.shared(value=np.zeros((self.sdim,), dtype='float32'), name='normop_os_h_mean'+self.name)) + self.normop_os_h_var = add_to_params(self.params, theano.shared(value=(1e-7)*np.ones((self.sdim,), dtype='float32'), name='normop_os_h_var'+self.name)) + + def plain_dialogue_step(self, h_t, m_t, bnmask_t, hs_tm1, *args): + if m_t.ndim >= 1: + m_t = m_t.dimshuffle(0, 'x') + + + hs_tilde = self.dialogue_rec_activation(T.dot(h_t, self.Ws_in) + T.dot(hs_tm1, self.Ws_hh) + self.bs_hh) + + hs_t = (m_t) * hs_tm1 + (1 - m_t) * hs_tilde + + return hs_t + + + def GRU_dialogue_step(self, h_t, m_t, bnmask_t, hs_tm1, *args): + + #rs_t = T.nnet.sigmoid(T.dot(h_t, self.Ws_in_r) + T.dot(hs_tm1, self.Ws_hh_r) + self.bs_r) + rs_t_normop_h_inp, rs_t_normop_h_mean, rs_t_normop_h_var = NormalizationOperator(self.normop_type, T.dot(h_t, self.Ws_in_r), self.normop_rs_h_gamma, bnmask_t, self.normop_rs_h_mean, self.normop_rs_h_var) + rs_t_normop_hs_inp, rs_t_normop_hs_mean, rs_t_normop_hs_var = NormalizationOperator(self.normop_type, T.dot(hs_tm1, self.Ws_hh_r), self.normop_rs_hs_gamma, bnmask_t, self.normop_rs_hs_mean, self.normop_rs_hs_var) + rs_t = T.nnet.sigmoid(rs_t_normop_h_inp + 
rs_t_normop_hs_inp + self.bs_r) + + + #zs_t = T.nnet.sigmoid(T.dot(h_t, self.Ws_in_z) + T.dot(hs_tm1, self.Ws_hh_z) + self.bs_z) + zs_t_normop_h_inp, zs_t_normop_h_mean, zs_t_normop_h_var = NormalizationOperator(self.normop_type, T.dot(h_t, self.Ws_in_z), self.normop_zs_h_gamma, bnmask_t, self.normop_zs_h_mean, self.normop_zs_h_var) + zs_t_normop_hs_inp, zs_t_normop_hs_mean, zs_t_normop_hs_var = NormalizationOperator(self.normop_type, T.dot(hs_tm1, self.Ws_hh_z), self.normop_zs_hs_gamma, bnmask_t, self.normop_zs_hs_mean, self.normop_zs_hs_var) + zs_t = T.nnet.sigmoid(zs_t_normop_h_inp + zs_t_normop_hs_inp + self.bs_z) + + #os_t = T.nnet.sigmoid(T.dot(h_t, self.Ws_in_o) + T.dot(hs_tm1, self.Ws_hh_o) + self.bs_o) + os_t_normop_h_inp, os_t_normop_h_mean, os_t_normop_h_var = NormalizationOperator(self.normop_type, T.dot(h_t, self.Ws_in_o), self.normop_os_h_gamma, bnmask_t, self.normop_os_h_mean, self.normop_os_h_var) + os_t_normop_hs_inp, os_t_normop_hs_mean, os_t_normop_hs_var = NormalizationOperator(self.normop_type, T.dot(hs_tm1, self.Ws_hh_o), self.normop_os_hs_gamma, bnmask_t, self.normop_os_hs_mean, self.normop_os_hs_var) + os_t = T.nnet.sigmoid(os_t_normop_h_inp + os_t_normop_hs_inp + self.bs_o) + + hs_overwrite = T.dot(h_t, self.Ws_in_overwrite) + self.bs_overwrite + + + hs_tilde_normop_h_inp, hs_tilde_normop_h_mean, hs_tilde_normop_h_var = NormalizationOperator(self.normop_type, T.dot(h_t, self.Ws_in), self.normop_in_h_gamma, bnmask_t, self.normop_in_h_mean, self.normop_in_h_var) + hs_tilde_normop_hs_inp, hs_tilde_normop_hs_mean, hs_tilde_normop_hs_var = NormalizationOperator(self.normop_type, T.dot(rs_t * hs_tm1, self.Ws_hh), self.normop_in_hs_gamma, bnmask_t, self.normop_in_hs_mean, self.normop_in_hs_var) + hs_tilde = self.dialogue_rec_activation(hs_tilde_normop_h_inp + hs_tilde_normop_hs_inp + self.bs_hh) + + hs_hat = (np.float32(1.) - os_t) * hs_tilde + os_t * hs_overwrite + + hs_update = (np.float32(1.) 
- zs_t) * hs_tm1 + zs_t * hs_hat + + if m_t.ndim >= 1: + m_t = m_t.dimshuffle(0, 'x') + + hs_t = (m_t) * hs_tm1 + (1 - m_t) * hs_update + + return hs_t, hs_hat, rs_t, zs_t, rs_t_normop_h_mean, rs_t_normop_h_var, rs_t_normop_hs_mean, rs_t_normop_hs_var, zs_t_normop_h_mean, zs_t_normop_h_var, zs_t_normop_hs_mean, zs_t_normop_hs_var, os_t_normop_h_mean, os_t_normop_h_var, os_t_normop_hs_mean, os_t_normop_hs_var, hs_tilde_normop_h_mean, hs_tilde_normop_h_var, hs_tilde_normop_hs_mean, hs_tilde_normop_hs_var + + def build_encoder(self, h, x, xmask=None, bnmask=None, prev_state=None, **kwargs): + one_step = False + if len(kwargs): + one_step = True + + # if x.ndim == 2 then + # x = (n_steps, batch_size) + if x.ndim == 2: + batch_size = x.shape[1] + # else x = (word_1, word_2, word_3, ...) + # or x = (last_word_1, last_word_2, last_word_3, ..) + # in this case batch_size is + else: + batch_size = 1 + + # if it is not one_step then we initialize everything to 0 + if not one_step: + if prev_state: + hs_0 = prev_state + else: + hs_0 = T.alloc(np.float32(0), batch_size, self.sdim) + + # in sampling mode (i.e. 
one step) we require + else: + # in this case x.ndim != 2 + assert x.ndim != 2 + assert 'prev_hs' in kwargs + hs_0 = kwargs['prev_hs'] + + if xmask == None: + xmask = T.neq(x, self.eos_sym) + + bnmask_given = True + if bnmask == None: + bnmask_given = False + bnmask = T.zeros(xmask.shape, dtype='float32') + + + # If the dialogue encoder is disabled, return zeros + if self.disable_dialogue_encoder: + if x.ndim == 2: + zeros_out = T.alloc(np.float32(0), x.shape[0], x.shape[1], self.sdim) + else: + zeros_out = T.alloc(np.float32(0), x.shape[0], self.sdim) + + return zeros_out, [] + + + if self.dialogue_encoder_gating == "GRU": + f_hier = self.GRU_dialogue_step + o_hier_info = [hs_0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None] + else: + f_hier = self.plain_dialogue_step + o_hier_info = [hs_0] + + if self.deep_dialogue_encoder_input: + transformed_h, updates = self.input_mlp.build_output(h, xmask) + else: + transformed_h = h + updates = [] + + # The hs sequence is based on the original mask + if not one_step: + _res, _ = theano.scan(f_hier,\ + sequences=[transformed_h, xmask, bnmask],\ + outputs_info=o_hier_info) + # Just one step further + else: + _res = f_hier(transformed_h, xmask, bnmask, hs_0) + + if isinstance(_res, list) or isinstance(_res, tuple): + hs = _res[0] + else: + hs = _res + + + # Create batch norm updates + if self.normop_type == 'BN': + if self.dialogue_encoder_gating == "GRU": + if (not one_step) and (h.ndim == 3) and (bnmask_given): + vars_to_update = [self.normop_rs_h_mean, self.normop_rs_h_var, self.normop_rs_hs_mean, self.normop_rs_hs_var, self.normop_zs_h_mean, self.normop_zs_h_var, self.normop_zs_hs_mean, self.normop_zs_hs_var, self.normop_os_h_mean, self.normop_os_h_var, self.normop_os_hs_mean, self.normop_os_hs_var, self.normop_in_h_mean, self.normop_in_h_var, self.normop_in_hs_mean, self.normop_in_hs_var] + + batch_examples_per_timestep = T.sum(bnmask, axis=1).dimshuffle(0, 
'x') + + assert len(_res) == len(vars_to_update)+4 + print(' Creating batch norm updates for GRU Dialog Encoder (' + self.name + '):') + for varidx, var in enumerate(vars_to_update): + average_var = T.sum(_res[4+varidx]*batch_examples_per_timestep, axis=0) \ + / T.sum(batch_examples_per_timestep, axis=0) + + new_value = self.normop_moving_average_const*var \ + + (1.0-self.normop_moving_average_const)*average_var + + updates.append((var, new_value)) + print(' ' + str(var)) + + return hs, updates + + def __init__(self, state, rng, parent, name): + EncoderDecoderBase.__init__(self, state, rng, parent) + self.name = name + self.init_params() + + + + +class DialogDummyEncoder(EncoderDecoderBase): + """ + This class operates on hidden states at the dialogue level (inter-utterance level). + At the end of each utterance, the input from the utterance encoder(s) is transferred + to its hidden state, which can then be transfered to the decoder. + """ + + def init_params(self): + """ Context weights """ + if self.deep_direct_connection: + self.Ws_dummy_deep_input = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.inp_dim, self.inp_dim), name='Ws_dummy_deep_input'+self.name)) + self.bs_dummy_deep_input = add_to_params(self.params, theano.shared(value=np.zeros((self.inp_dim,), dtype='float32'), name='bs_dummy_deep_input'+self.name)) + + + def plain_dialogue_step(self, h_t, m_t, hs_tm1): + if m_t.ndim >= 1: + m_t = m_t.dimshuffle(0, 'x') + + transformed_h_t = h_t + if self.deep_direct_connection: + transformed_h_t = self.dialogue_rec_activation(T.dot(h_t, self.Ws_dummy_deep_input) + self.bs_dummy_deep_input) + + hs_t = (m_t) * hs_tm1 + (1 - m_t) * transformed_h_t + return hs_t + + def build_encoder(self, h, x, xmask=None, prev_state=None, **kwargs): + one_step = False + if len(kwargs): + one_step = True + + # if x.ndim == 2 then + # x = (n_steps, batch_size) + if x.ndim == 2: + batch_size = x.shape[1] + # else x = (word_1, word_2, word_3, ...) 
+ # or x = (last_word_1, last_word_2, last_word_3, ..) + # in this case batch_size is + else: + batch_size = 1 + + # if it is not one_step then we initialize everything to 0 + if not one_step: + if prev_state: + hs_0 = prev_state + else: + hs_0 = T.alloc(np.float32(0), batch_size, self.inp_dim) + + # in sampling mode (i.e. one step) we require + else: + # in this case x.ndim != 2 + assert x.ndim != 2 + assert 'prev_hs' in kwargs + hs_0 = kwargs['prev_hs'] + + if xmask == None: + xmask = T.neq(x, self.eos_sym) + + f_hier = self.plain_dialogue_step + o_hier_info = [hs_0] + + # The hs sequence is based on the original mask + if not one_step: + _res, _ = theano.scan(f_hier,\ + sequences=[h, xmask],\ + outputs_info=o_hier_info) + # Just one step further + else: + _res = f_hier(h, xmask, hs_0) + + if isinstance(_res, list) or isinstance(_res, tuple): + hs = _res[0] + else: + hs = _res + + return hs + + def __init__(self, state, rng, parent, inp_dim, name=''): + self.inp_dim = inp_dim + self.name = name + EncoderDecoderBase.__init__(self, state, rng, parent) + self.init_params() + + + +class UtteranceDecoder(EncoderDecoderBase): + """ + This is the decoder RNN class, which operates at the word level (intra-utterance level). + It is an RNNLM conditioned on additional information (e.g. context level hidden state, latent variables) + """ + + NCE = 0 + EVALUATION = 1 + SAMPLING = 2 + BEAM_SEARCH = 3 + + def __init__(self, state, rng, parent, dialog_encoder, word_embedding_param): + EncoderDecoderBase.__init__(self, state, rng, parent) + # Take as input the encoder instance for the embeddings.. 
+ # To modify in the future + assert(word_embedding_param != None) + self.word_embedding_param = word_embedding_param + self.dialog_encoder = dialog_encoder + self.trng = MRG_RandomStreams(self.seed) + self.init_params() + + def init_params(self): + + assert self.utterance_decoder_gating == self.utterance_decoder_gating.upper() + + # Compute input dimensionality + if self.direct_connection_between_encoders_and_decoder: + # When there is a direct connection between encoder and decoder, + # the input has dimensionality sdim + qdim_decoder if forward encoder, and + # sdim + 2 x qdim_decoder for bidirectional encoder + if self.bidirectional_utterance_encoder: + self.input_dim = self.sdim + self.qdim_encoder*2 + else: + self.input_dim = self.sdim + self.qdim_encoder + else: + # When there is no connection between encoder and decoder, + # the input has dimensionality sdim + self.input_dim = self.sdim + + if self.add_latent_gaussian_per_utterance and self.add_latent_piecewise_per_utterance: + if self.condition_decoder_only_on_latent_variable: + self.input_dim = self.latent_gaussian_per_utterance_dim + self.latent_piecewise_per_utterance_dim + else: + self.input_dim += self.latent_gaussian_per_utterance_dim + self.latent_piecewise_per_utterance_dim + elif self.add_latent_gaussian_per_utterance: + if self.condition_decoder_only_on_latent_variable: + self.input_dim = self.latent_gaussian_per_utterance_dim + else: + self.input_dim += self.latent_gaussian_per_utterance_dim + elif self.add_latent_piecewise_per_utterance: + if self.condition_decoder_only_on_latent_variable: + self.input_dim = self.latent_piecewise_per_utterance_dim + else: + self.input_dim += self.latent_piecewise_per_utterance_dim + + # Compute hidden state dimensionality + if self.utterance_decoder_gating == "LSTM": + # For LSTM decoder, the state hd is the concatenation of the cell state and hidden state + self.complete_hidden_state_size = self.qdim_decoder*2 + else: + self.complete_hidden_state_size = 
self.qdim_decoder + + # Compute deep input + if self.deep_utterance_decoder_input: + self.input_mlp = OneLayerMLP(self.state, self.rng, self.input_dim, + self.input_dim, self.input_dim, self, '_input_mlp_utterance_decoder') + self.params += self.input_mlp.params + + + self.bd_out = add_to_params(self.params, theano.shared(value=np.zeros((self.idim,), dtype='float32'), name='bd_out')) + self.Wd_emb = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.idim, self.rankdim), name='Wd_emb')) + + """ RNN decoder weights """ + if self.utterance_decoder_gating == "" or self.utterance_decoder_gating == "NONE" \ + or self.utterance_decoder_gating == "GRU" or self.utterance_decoder_gating == "LSTM": + + self.Wd_hh = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.qdim_decoder, self.qdim_decoder), name='Wd_hh')) + self.bd_hh = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_decoder,), dtype='float32'), name='bd_hh')) + self.Wd_in = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, self.qdim_decoder), name='Wd_in')) + + # We only include the initial hidden state if the utterance decoder is NOT reset + # and if its NOT a collapsed model (i.e. collapsed to standard RNN). + # In the collapsed model, we always initialize hidden state to zero. 
+ if (not self.collaps_to_standard_rnn) and (self.reset_utterance_decoder_at_end_of_utterance): + self.Wd_s_0 = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.complete_hidden_state_size), name='Wd_s_0')) + self.bd_s_0 = add_to_params(self.params, theano.shared(value=np.zeros((self.complete_hidden_state_size,), dtype='float32'), name='bd_s_0')) + + if self.utterance_decoder_gating == "GRU": + self.Wd_in_r = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, self.qdim_decoder), name='Wd_in_r')) + self.Wd_in_z = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, self.qdim_decoder), name='Wd_in_z')) + self.Wd_hh_r = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.qdim_decoder, self.qdim_decoder), name='Wd_hh_r')) + self.Wd_hh_z = add_to_params(self.params, theano.shared(value=OrthogonalInit(self.rng, self.qdim_decoder, self.qdim_decoder), name='Wd_hh_z')) + self.bd_r = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_decoder,), dtype='float32'), name='bd_r')) + self.bd_z = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_decoder,), dtype='float32'), name='bd_z')) + + if self.decoder_bias_type == 'all': + self.Wd_s_q = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.qdim_decoder), name='Wd_s_q')) + self.Wd_s_z = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.qdim_decoder), name='Wd_s_z')) + self.Wd_s_r = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.qdim_decoder), name='Wd_s_r')) + + elif self.utterance_decoder_gating == "LSTM": + # Input gate + self.Wd_in_i = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, self.qdim_decoder), name='Wd_in_i')) + self.Wd_hh_i = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.qdim_decoder, 
self.qdim_decoder), name='Wd_hh_i')) + self.Wd_c_i = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.qdim_decoder, self.qdim_decoder), name='Wd_c_i')) + self.bd_i = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_decoder,), dtype='float32'), name='bd_i')) + + # Forget gate + self.Wd_in_f = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, self.qdim_decoder), name='Wd_in_f')) + self.Wd_hh_f = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.qdim_decoder, self.qdim_decoder), name='Wd_hh_f')) + self.Wd_c_f = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.qdim_decoder, self.qdim_decoder), name='Wd_c_f')) + self.bd_f = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_decoder,), dtype='float32'), name='bd_f')) + + # Output gate + self.Wd_in_o = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, self.qdim_decoder), name='Wd_in_o')) + self.Wd_hh_o = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.qdim_decoder, self.qdim_decoder), name='Wd_hh_o')) + self.Wd_c_o = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.qdim_decoder, self.qdim_decoder), name='Wd_c_o')) + self.bd_o = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_decoder,), dtype='float32'), name='bd_o')) + + if self.decoder_bias_type == 'all' or self.decoder_bias_type == 'selective': + # Input gate + self.Wd_s_i = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.qdim_decoder), name='Wd_s_i')) + # Forget gate + self.Wd_s_f = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.qdim_decoder), name='Wd_s_f')) + # Cell input + self.Wd_s = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.qdim_decoder), name='Wd_s')) + # Output gate + self.Wd_s_o = 
add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.qdim_decoder), name='Wd_s_o')) + elif self.utterance_decoder_gating == "BOW": + self.Wd_bow_W_in = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.qdim_decoder), name='Wd_bow_W_in')) + self.Wd_bow_b_in = add_to_params(self.params, theano.shared(value=np.zeros((self.qdim_decoder,), dtype='float32'), name='Wd_bow_b_in')) + + + # Selective gating mechanism + if self.decoder_bias_type == 'selective': + # Selective gating mechanism is not compatible with bag-of-words decoder + assert not self.utterance_decoder_gating == "BOW" + + # Selective gating mechanism for LSTM + if self.utterance_decoder_gating == "LSTM": + self.bd_sel = add_to_params(self.params, theano.shared(value=np.zeros((self.input_dim,), dtype='float32'), name='bd_sel')) + + self.Wd_sel_s = add_to_params(self.params, \ + theano.shared(value=NormalInit(self.rng, self.input_dim, self.input_dim), \ + name='Wd_sel_s')) + # x_{n-1} -> g_r + self.Wd_sel_e = add_to_params(self.params, \ + theano.shared(value=NormalInit(self.rng, self.rankdim, self.input_dim), \ + name='Wd_sel_e')) + # h_{n-1} -> g_r + self.Wd_sel_h = add_to_params(self.params, \ + theano.shared(value=NormalInit(self.rng, self.qdim_decoder, self.input_dim), \ + name='Wd_sel_h')) + # c_{n-1} -> g_r + self.Wd_sel_c = add_to_params(self.params, \ + theano.shared(value=NormalInit(self.rng, self.qdim_decoder, self.input_dim), \ + name='Wd_sel_c')) + else: # Selective gating mechanism for GRU and plain decoder + self.bd_sel = add_to_params(self.params, theano.shared(value=np.zeros((self.input_dim,), dtype='float32'), name='bd_sel')) + self.Wd_s_q = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, self.qdim_decoder), name='Wd_s_q')) + # s -> g_r + self.Wd_sel_s = add_to_params(self.params, \ + theano.shared(value=NormalInit(self.rng, self.input_dim, self.input_dim), \ + name='Wd_sel_s')) + # 
x_{n-1} -> g_r + self.Wd_sel_e = add_to_params(self.params, \ + theano.shared(value=NormalInit(self.rng, self.rankdim, self.input_dim), \ + name='Wd_sel_e')) + # h_{n-1} -> g_r + self.Wd_sel_h = add_to_params(self.params, \ + theano.shared(value=NormalInit(self.rng, self.qdim_decoder, self.input_dim), \ + name='Wd_sel_h')) + + + + + ###################### + # Output layer weights + ###################### + if self.maxout_out: + if int(self.qdim_decoder) != 2*int(self.rankdim): + raise ValueError('Error with maxout configuration in UtteranceDecoder!' + + 'For maxout to work we need qdim_decoder = 2x rankdim') + + out_target_dim = self.qdim_decoder + if not self.maxout_out: + out_target_dim = self.rankdim + + self.Wd_out = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.qdim_decoder, out_target_dim), name='Wd_out')) + + # Set up deep output + if self.deep_utterance_decoder_out: + + if self.utterance_decoder_gating == "" or self.utterance_decoder_gating == "NONE" \ + or self.utterance_decoder_gating == "GRU" or self.utterance_decoder_gating == "LSTM": + + self.Wd_e_out = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.rankdim, out_target_dim), name='Wd_e_out')) + self.bd_e_out = add_to_params(self.params, theano.shared(value=np.zeros((out_target_dim,), dtype='float32'), name='bd_e_out')) + + if self.decoder_bias_type != 'first': + self.Wd_s_out = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.input_dim, out_target_dim), name='Wd_s_out')) + + + def build_output_layer(self, hs, xd, hd): + if self.utterance_decoder_gating == "LSTM": + if hd.ndim != 2: + pre_activ = T.dot(hd[:, :, 0:self.qdim_decoder], self.Wd_out) + else: + pre_activ = T.dot(hd[:, 0:self.qdim_decoder], self.Wd_out) + else: + pre_activ = T.dot(hd, self.Wd_out) + + if self.deep_utterance_decoder_out: + + if self.utterance_decoder_gating == "" or self.utterance_decoder_gating == "NONE" \ + or self.utterance_decoder_gating == "GRU" 
or self.utterance_decoder_gating == "LSTM": + + pre_activ += T.dot(xd, self.Wd_e_out) + self.bd_e_out + + if self.decoder_bias_type != 'first': + pre_activ += T.dot(hs, self.Wd_s_out) + # ^ if bias all, bias the deep output + + if self.maxout_out: + pre_activ = Maxout(2)(pre_activ) + + return pre_activ + + def build_next_probs_predictor(self, inp, x, prev_state): + """ + Return output probabilities given prev_words x, hierarchical pass hs, and previous hd + hs should always be the same (and should not be updated). + """ + return self.build_decoder(inp, x, mode=UtteranceDecoder.BEAM_SEARCH, prev_state=prev_state) + + def approx_embedder(self, x): + # Here we use the same embeddings learnt in the encoder.. !!! + return self.word_embedding_param[x] + + def output_softmax(self, pre_activ): + # returns a (timestep, bs, idim) matrix (huge) + return SoftMax(T.dot(pre_activ, self.Wd_emb.T) + self.bd_out) + + def output_nce(self, pre_activ, y, y_hat): + # returns a (timestep, bs, pos + neg) matrix (very small) + target_embedding = self.Wd_emb[y] + # ^ target embedding is (timestep x bs, rankdim) + noise_embedding = self.Wd_emb[y_hat] + # ^ noise embedding is (10, timestep x bs, rankdim) + + # pre_activ is (timestep x bs x rankdim) + pos_scores = (target_embedding * pre_activ).sum(2) + neg_scores = (noise_embedding * pre_activ).sum(3) + + pos_scores += self.bd_out[y] + neg_scores += self.bd_out[y_hat] + + pos_noise = self.parent.t_noise_probs[y] * 10 + neg_noise = self.parent.t_noise_probs[y_hat] * 10 + + pos_scores = - T.log(T.nnet.sigmoid(pos_scores - T.log(pos_noise))) + neg_scores = - T.log(1 - T.nnet.sigmoid(neg_scores - T.log(neg_noise))).sum(0) + return pos_scores + neg_scores + + def build_decoder(self, decoder_inp, x, xmask=None, xdropmask=None, y=None, y_neg=None, mode=EVALUATION, prev_state=None, step_num=None): + + # If model collapses to standard RNN reset all input to decoder + if self.collaps_to_standard_rnn: + decoder_inp = decoder_inp * 0 + + # Compute deep 
input + if self.deep_utterance_decoder_input: + decoder_inp, updates = self.input_mlp.build_output(decoder_inp, xmask) + else: + updates = [] + + + # Check parameter consistency + if mode == UtteranceDecoder.EVALUATION or mode == UtteranceDecoder.NCE: + assert y + else: + assert not y + assert prev_state + + # if mode == EVALUATION + # xd = (timesteps, batch_size, qdim_decoder) + # + # if mode != EVALUATION + # xd = (n_samples, dim) + + # If a drop mask is given, replace 'dropped' tokens with 'unk' token as input + # to the decoder RNN. + if self.decoder_drop_previous_input_tokens and xdropmask: + xdropmask = xdropmask.dimshuffle(0, 1, 'x') + xd = xdropmask*self.approx_embedder(x) + (1-xdropmask)*self.word_embedding_param[self.unk_sym].dimshuffle('x', 'x', 0) + else: + xd = self.approx_embedder(x) + + + if not xmask: + xmask = T.neq(x, self.eos_sym) + + # we must zero out the embedding + # i.e. the embedding x_{-1} is the 0 vector + # as well as hd_{-1} which will be reseted in the scan functions + if xd.ndim != 3: + assert mode != UtteranceDecoder.EVALUATION + xd = (xd.dimshuffle((1, 0)) * xmask).dimshuffle((1, 0)) + else: + assert mode == UtteranceDecoder.EVALUATION or mode == UtteranceDecoder.NCE + xd = (xd.dimshuffle((2,0,1)) * xmask).dimshuffle((1,2,0)) + + # Run RNN decoder + if self.utterance_decoder_gating == "" or self.utterance_decoder_gating == "NONE" \ + or self.utterance_decoder_gating == "GRU" or self.utterance_decoder_gating == "LSTM": + + if prev_state: + hd_init = prev_state + else: + hd_init = T.alloc(np.float32(0), x.shape[1], self.complete_hidden_state_size) + + if self.utterance_decoder_gating == "LSTM": + f_dec = self.LSTM_step + o_dec_info = [hd_init] + if self.decoder_bias_type == "selective": + o_dec_info += [None, None] + elif self.utterance_decoder_gating == "GRU": + f_dec = self.GRU_step + o_dec_info = [hd_init, None, None, None] + if self.decoder_bias_type == "selective": + o_dec_info += [None, None] + else: # No gating + f_dec = 
self.plain_step + o_dec_info = [hd_init] + if self.decoder_bias_type == "selective": + o_dec_info += [None, None] + + # If the mode of the decoder is EVALUATION + # then we evaluate by default all the utterances + # xd - i.e. xd.ndim == 3, xd = (timesteps, batch_size, qdim_decoder) + if mode == UtteranceDecoder.EVALUATION or mode == UtteranceDecoder.NCE: + _res, _ = theano.scan(f_dec, + sequences=[xd, xmask, decoder_inp],\ + outputs_info=o_dec_info) + # else we evaluate only one step of the recurrence using the + # previous hidden states and the previous computed hierarchical + # states. + else: + _res = f_dec(xd, xmask, decoder_inp, prev_state) + + if isinstance(_res, list) or isinstance(_res, tuple): + hd = _res[0] + else: + hd = _res + + # OBSOLETE: + # if we are using selective bias, we should update our decoder_inp + # to the step-selective decoder_inp + # if self.decoder_bias_type == "selective": + # decoder_inp = _res[1] + + elif self.utterance_decoder_gating == "BOW": # BOW (bag of words) decoder + hd = T.dot(decoder_inp, self.Wd_bow_W_in) + self.Wd_bow_b_in + + pre_activ = self.build_output_layer(decoder_inp, xd, hd) + + # EVALUATION : Return target_probs + all the predicted ranks + # target_probs.ndim == 3 + if mode == UtteranceDecoder.EVALUATION: + outputs = self.output_softmax(pre_activ) + target_probs = GrabProbs(outputs, y) + return target_probs, hd, outputs, updates + + elif mode == UtteranceDecoder.NCE: + return self.output_nce(pre_activ, y, y_neg), hd, updates + + # BEAM_SEARCH : Return output (the softmax layer) + the new hidden states + elif mode == UtteranceDecoder.BEAM_SEARCH: + return self.output_softmax(pre_activ), hd + + # SAMPLING : Return a vector of n_sample from the output layer + # + log probabilities + the new hidden states + elif mode == UtteranceDecoder.SAMPLING: + outputs = self.output_softmax(pre_activ) + if outputs.ndim == 1: + outputs = outputs.dimshuffle('x', 0) + sample = self.trng.multinomial(pvals=outputs, 
dtype='int64').argmax(axis=-1) + if outputs.ndim == 1: + sample = sample[0] + log_prob = -T.log(T.diag(outputs.T[sample])) + return sample, log_prob, hd + + def LSTM_step(self, xd_t, m_t, decoder_inp_t, hd_tm1): + if m_t.ndim >= 1: + m_t = m_t.dimshuffle(0, 'x') + + # If model collapses to standard RNN, or the 'reset_utterance_decoder_at_end_of_utterance' flag is off, + # then never reset decoder. Otherwise, reset the decoder at every utterance turn. + if (not self.collaps_to_standard_rnn) and (self.reset_utterance_decoder_at_end_of_utterance): + hd_tm1 = (m_t) * hd_tm1 + (1 - m_t) * T.tanh(T.dot(decoder_inp_t, self.Wd_s_0) + self.bd_s_0) + + # Unlike the GRU gating function, the LSTM gating function needs to keep track of two vectors: + # the output state and the cell state. To align the implementation with the GRU, we store + # both of these two states in a single vector for every time step, split them up for computation and + # then concatenate them back together at the end. + + # Given the previous concatenated hidden states, split them up into output state and cell state. + # By convention, we assume that the output state is always first, and the cell state second. 
+ hd_tm1_tilde = hd_tm1[:, 0:self.qdim_decoder] + cd_tm1_tilde = hd_tm1[:, self.qdim_decoder:self.qdim_decoder*2] + + # In the 'selective' decoder bias type each hidden state of the decoder + # RNN receives the decoder_inp_t modified by the selective bias -> decoder_inpr_t + if self.decoder_bias_type == 'selective': + rd_sel_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_sel_e) + T.dot(hd_tm1_tilde, self.Wd_sel_h) + T.dot(cd_tm1_tilde, self.Wd_sel_c) + T.dot(decoder_inp_t, self.Wd_sel_s) + self.bd_sel) + decoder_inpr_t = rd_sel_t * decoder_inp_t + + id_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_i) + T.dot(hd_tm1_tilde, self.Wd_hh_i) \ + + T.dot(decoder_inpr_t, self.Wd_s_i) \ + + T.dot(cd_tm1_tilde, self.Wd_c_i) + self.bd_i) + fd_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_f) + T.dot(hd_tm1_tilde, self.Wd_hh_f) \ + + T.dot(decoder_inpr_t, self.Wd_s_f) \ + + T.dot(cd_tm1_tilde, self.Wd_c_f) + self.bd_f) + cd_t = fd_t*cd_tm1_tilde + id_t*self.sent_rec_activation(T.dot(xd_t, self.Wd_in) \ + + T.dot(decoder_inpr_t, self.Wd_s) \ + + T.dot(hd_tm1_tilde, self.Wd_hh) + self.bd_hh) + od_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_o) + T.dot(hd_tm1_tilde, self.Wd_hh_o) \ + + T.dot(decoder_inpr_t, self.Wd_s_o) \ + + T.dot(cd_t, self.Wd_c_o) + self.bd_o) + + # Concatenate output state and cell state into one vector + hd_t = T.concatenate([od_t*self.sent_rec_activation(cd_t), cd_t], axis=1) + output = (hd_t, decoder_inpr_t, rd_sel_t) + + # In the 'all' decoder bias type each hidden state of the decoder + # RNN receives the decoder_inp_t vector as bias without modification + elif self.decoder_bias_type == 'all': + id_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_i) + T.dot(hd_tm1_tilde, self.Wd_hh_i) \ + + T.dot(decoder_inp_t, self.Wd_s_i) \ + + T.dot(cd_tm1_tilde, self.Wd_c_i) + self.bd_i) + fd_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_f) + T.dot(hd_tm1_tilde, self.Wd_hh_f) \ + + T.dot(decoder_inp_t, self.Wd_s_f) \ + + T.dot(cd_tm1_tilde, self.Wd_c_f) + self.bd_f) + cd_t = fd_t*cd_tm1_tilde + 
id_t*self.sent_rec_activation(T.dot(xd_t, self.Wd_in) \ + + T.dot(decoder_inp_t, self.Wd_s) \ + + T.dot(hd_tm1_tilde, self.Wd_hh) + self.bd_hh) + od_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_o) + T.dot(hd_tm1_tilde, self.Wd_hh_o) \ + + T.dot(decoder_inp_t, self.Wd_s_o) \ + + T.dot(cd_t, self.Wd_c_o) + self.bd_o) + + # Concatenate output state and cell state into one vector + hd_t = T.concatenate([od_t*self.sent_rec_activation(cd_t), cd_t], axis=1) + output = (hd_t,) + else: + # Do not bias the decoder at every time, instead, + # force it to store very useful information in the first state. + id_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_i) + T.dot(hd_tm1_tilde, self.Wd_hh_i) \ + + T.dot(cd_tm1_tilde, self.Wd_c_i) + self.bd_i) + fd_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_f) + T.dot(hd_tm1_tilde, self.Wd_hh_f) \ + + T.dot(cd_tm1_tilde, self.Wd_c_f) + self.bd_f) + cd_t = fd_t*cd_tm1_tilde + id_t*self.sent_rec_activation(T.dot(xd_t, self.Wd_in_c) \ + + T.dot(hd_tm1_tilde, self.Wd_hh) + self.bd_hh) + od_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_o) + T.dot(hd_tm1_tilde, self.Wd_hh_o) \ + + T.dot(cd_t, self.Wd_c_o) + self.bd_o) + + # Concatenate output state and cell state into one vector + hd_t = T.concatenate([od_t*self.sent_rec_activation(cd_t), cd_t], axis=1) + output = (hd_t,) + + return output + + def GRU_step(self, xd_t, m_t, decoder_inp_t, hd_tm1): + if m_t.ndim >= 1: + m_t = m_t.dimshuffle(0, 'x') + + # If model collapses to standard RNN, or the 'reset_utterance_decoder_at_end_of_utterance' flag is off, + # then never reset decoder. Otherwise, reset the decoder at every utterance turn. 
+ if (not self.collaps_to_standard_rnn) and (self.reset_utterance_decoder_at_end_of_utterance): + hd_tm1 = (m_t) * hd_tm1 + (1 - m_t) * T.tanh(T.dot(decoder_inp_t, self.Wd_s_0) + self.bd_s_0) + + # In the 'selective' decoder bias type each hidden state of the decoder + # RNN receives the decoder_inp_t modified by the selective bias -> decoder_inpr_t + if self.decoder_bias_type == 'selective': + rd_sel_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_sel_e) + T.dot(hd_tm1, self.Wd_sel_h) + T.dot(decoder_inp_t, self.Wd_sel_s) + self.bd_sel) + decoder_inpr_t = rd_sel_t * decoder_inp_t + + rd_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_r) + T.dot(hd_tm1, self.Wd_hh_r) + self.bd_r) + zd_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_z) + T.dot(hd_tm1, self.Wd_hh_z) + self.bd_z) + hd_tilde = self.sent_rec_activation(T.dot(xd_t, self.Wd_in) \ + + T.dot(rd_t * hd_tm1, self.Wd_hh) \ + + T.dot(decoder_inpr_t, self.Wd_s_q) \ + + self.bd_hh) + + + hd_t = (np.float32(1.) - zd_t) * hd_tm1 + zd_t * hd_tilde + output = (hd_t, decoder_inpr_t, rd_sel_t, rd_t, zd_t, hd_tilde) + + # In the 'all' decoder bias type each hidden state of the decoder + # RNN receives the decoder_inp_t vector as bias without modification + elif self.decoder_bias_type == 'all': + + rd_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_r) + T.dot(hd_tm1, self.Wd_hh_r) + T.dot(decoder_inp_t, self.Wd_s_r) + self.bd_r) + zd_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_z) + T.dot(hd_tm1, self.Wd_hh_z) + T.dot(decoder_inp_t, self.Wd_s_z) + self.bd_z) + hd_tilde = self.sent_rec_activation(T.dot(xd_t, self.Wd_in) \ + + T.dot(rd_t * hd_tm1, self.Wd_hh) \ + + T.dot(decoder_inp_t, self.Wd_s_q) \ + + self.bd_hh) + hd_t = (np.float32(1.) - zd_t) * hd_tm1 + zd_t * hd_tilde + output = (hd_t, rd_t, zd_t, hd_tilde) + + else: + # Do not bias the decoder at every time, instead, + # force it to store very useful information in the first state. 
+ rd_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_r) + T.dot(hd_tm1, self.Wd_hh_r) + self.bd_r) + zd_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_in_z) + T.dot(hd_tm1, self.Wd_hh_z) + self.bd_z) + hd_tilde = self.sent_rec_activation(T.dot(xd_t, self.Wd_in) \ + + T.dot(rd_t * hd_tm1, self.Wd_hh) \ + + self.bd_hh) + hd_t = (np.float32(1.) - zd_t) * hd_tm1 + zd_t * hd_tilde + output = (hd_t, rd_t, zd_t, hd_tilde) + return output + + def plain_step(self, xd_t, m_t, decoder_inp_t, hd_tm1): + if m_t.ndim >= 1: + m_t = m_t.dimshuffle(0, 'x') + + # If model collapses to standard RNN, or the 'reset_utterance_decoder_at_end_of_utterance' flag is off, + # then never reset decoder. Otherwise, reset the decoder at every utterance turn. + if (not self.collaps_to_standard_rnn) and (self.reset_utterance_decoder_at_end_of_utterance): + # We already assume that xd are zeroed out + hd_tm1 = (m_t) * hd_tm1 + (1-m_t) * T.tanh(T.dot(decoder_inp_t, self.Wd_s_0) + self.bd_s_0) + + if self.decoder_bias_type == 'first': + # Do not bias the decoder at every time, instead, + # force it to store very useful information in the first state. 
+ hd_t = self.sent_rec_activation( T.dot(xd_t, self.Wd_in) \ + + T.dot(hd_tm1, self.Wd_hh) \ + + self.bd_hh ) + output = (hd_t,) + elif self.decoder_bias_type == 'all': + hd_t = self.sent_rec_activation( T.dot(xd_t, self.Wd_in) \ + + T.dot(hd_tm1, self.Wd_hh) \ + + T.dot(decoder_inp_t, self.Wd_s_q) \ + + self.bd_hh ) + output = (hd_t,) + elif self.decoder_bias_type == 'selective': + rd_sel_t = T.nnet.sigmoid(T.dot(xd_t, self.Wd_sel_e) + T.dot(hd_tm1, self.Wd_sel_h) + T.dot(decoder_inp_t, self.Wd_sel_s) + self.bd_sel) + decoder_inpr_t = rd_sel_t * decoder_inp_t + + hd_t = self.sent_rec_activation( T.dot(xd_t, self.Wd_in) \ + + T.dot(hd_tm1, self.Wd_hh) \ + + T.dot(decoder_inpr_t, self.Wd_s_q) \ + + self.bd_hh ) + output = (hd_t, decoder_inpr_t, rd_sel_t) + + return output + + +class DialogLevelLatentGaussianEncoder(EncoderDecoderBase): + """ + This class operates on hidden states at the dialogue level (inter-utterance level). + At the end of each utterance, the input from the utterance encoder(s) is transferred + to its hidden state. This hidden state is then transformed to output a mean and a (diagonal) + covariance matrix, which parametrizes a latent Gaussian variable. 
+ """ + + def init_params(self): + """ Encoder weights """ + + # Initialize input MLP + self.input_mlp = TwoLayerMLP(self.state, self.rng, self.input_dim, self.latent_dim*2, self.latent_dim, self, '_input_mlp_'+self.name) + self.params += self.input_mlp.params + + # Initialize mean and diagonal covariance matrix + self.Wl_mean_out = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.latent_dim, self.latent_dim), name='Wl_mean_out'+self.name)) + self.bl_mean_out = add_to_params(self.params, theano.shared(value=np.zeros((self.latent_dim,), dtype='float32'), name='bl_mean_out'+self.name)) + + self.Wl_std_out = add_to_params(self.params, theano.shared(value=NormalInit(self.rng, self.latent_dim, self.latent_dim), name='Wl_std_out'+self.name)) + self.bl_std_out = add_to_params(self.params, theano.shared(value=np.zeros((self.latent_dim,), dtype='float32'), name='bl_std_out'+self.name)) + + def plain_dialogue_step(self, h_t, m_t, hs_tm1): + if m_t.ndim >= 1: + m_t = m_t.dimshuffle(0, 'x') + + hs_t = (m_t) * hs_tm1 + (1 - m_t) * h_t + + return hs_t + + def build_encoder(self, h, x, xmask=None, latent_variable_mask=None, prev_state=None, **kwargs): + one_step = False + if len(kwargs): + one_step = True + + # if x.ndim == 2 then + # x = (n_steps, batch_size) + if x.ndim == 2: + batch_size = x.shape[1] + # else x = (word_1, word_2, word_3, ...) + # or x = (last_word_1, last_word_2, last_word_3, ..) + else: + batch_size = 1 + + # if it is not one_step then we initialize everything to 0 + if not one_step: + if prev_state: + hs_0 = prev_state + else: + hs_0 = T.alloc(np.float32(0), batch_size, self.latent_dim) + + # sampling mode (i.e. 
one step) + else: + # in this case x.ndim != 2 + assert x.ndim != 2 + assert 'prev_hs' in kwargs + hs_0 = kwargs['prev_hs'] + + if xmask == None: + xmask = T.neq(x, self.eos_sym) + + if xmask.ndim == 1: + xmask = xmask.dimshuffle(0, 'x') + + if latent_variable_mask == None: + latent_variable_mask = T.eq(x, self.eos_sym) + + if latent_variable_mask.ndim == 1: + latent_variable_mask = latent_variable_mask.dimshuffle(0, 'x') + + + f_hier = self.plain_dialogue_step + o_hier_info = [hs_0] + + transformed_h, updates = self.input_mlp.build_output(h, latent_variable_mask) + + + if not one_step: + _res, _ = theano.scan(f_hier,\ + sequences=[transformed_h, xmask],\ + outputs_info=o_hier_info) + + # Just one step further + else: + _res = f_hier(transformed_h, xmask, hs_0) + + if isinstance(_res, list) or isinstance(_res, tuple): + hs = _res[0] + else: + hs = _res + + hs_mean = T.dot(hs, self.Wl_mean_out) + self.bl_mean_out + hs_var = T.nnet.softplus((T.dot(hs, self.Wl_std_out) + self.bl_std_out)) * self.scale_latent_gaussian_variable_variances + + hs_var = T.clip(hs_var, self.min_latent_gaussian_variable_variances, self.max_latent_gaussian_variable_variances) + + return [hs, hs_mean, hs_var], updates + + def __init__(self, state, input_dim, latent_dim, rng, parent, name): + EncoderDecoderBase.__init__(self, state, rng, parent) + self.input_dim = input_dim + self.latent_dim = latent_dim + self.name = name + self.init_params() + + +class DialogLevelLatentPiecewiseEncoder(EncoderDecoderBase): + """ + This class operates on hidden states at the dialogue level (inter-utterance level). + At the end of each utterance, the input from the utterance encoder(s) is transferred + to its hidden state. This hidden state is then transformed to output alpha vectors, which parametrize the vector of latent piecewise variables. 
+ """ + + def init_params(self): + """ Encoder weights """ + # Initialize input MLP + self.input_mlp = TwoLayerMLP(self.state, self.rng, self.input_dim, self.latent_dim*2, self.latent_dim, self, '_input_mlp_'+self.name) + self.params += self.input_mlp.params + + # Alpha output parameters + self.Wl_alpha_out = add_to_params(self.params, theano.shared(value=NormalInit3D(self.rng, self.latent_dim, self.latent_dim, self.pieces_alpha), name='Wl_alpha_out'+self.name)) + self.bl_alpha_out = add_to_params(self.params, theano.shared(value=np.zeros((self.latent_dim, self.pieces_alpha), dtype='float32'), name='bl_alpha_out'+self.name)) + + + + def plain_dialogue_step(self, h_t, m_t, hs_tm1): + if m_t.ndim >= 1: + m_t = m_t.dimshuffle(0, 'x') + + hs_t = (m_t) * hs_tm1 + (1 - m_t) * h_t + + return hs_t + + def build_encoder(self, h, x, xmask=None, latent_variable_mask=None, prev_state=None, **kwargs): + one_step = False + if len(kwargs): + one_step = True + + # if x.ndim == 2 then + # x = (n_steps, batch_size) + if x.ndim == 2: + batch_size = x.shape[1] + # else x = (word_1, word_2, word_3, ...) + # or x = (last_word_1, last_word_2, last_word_3, ..) + else: + batch_size = 1 + + # if it is not one_step then we initialize everything to 0 + if not one_step: + if prev_state: + hs_0 = prev_state + else: + hs_0 = T.alloc(np.float32(0), batch_size, self.latent_dim) + + # sampling mode (i.e. 
one step) + else: + # in this case x.ndim != 2 + assert x.ndim != 2 + assert 'prev_hs' in kwargs + hs_0 = kwargs['prev_hs'] + + if xmask == None: + xmask = T.neq(x, self.eos_sym) + + if xmask.ndim == 1: + xmask = xmask.dimshuffle(0, 'x') + + if latent_variable_mask == None: + latent_variable_mask = T.eq(x, self.eos_sym) + + if latent_variable_mask.ndim == 1: + latent_variable_mask = latent_variable_mask.dimshuffle(0, 'x') + + f_hier = self.plain_dialogue_step + o_hier_info = [hs_0] + + transformed_h, updates = self.input_mlp.build_output(h, latent_variable_mask) + + + + if not one_step: + _res, _ = theano.scan(f_hier,\ + sequences=[transformed_h, xmask],\ + outputs_info=o_hier_info) + + # Just one step further + else: + _res = f_hier(transformed_h, xmask, hs_0) + + if isinstance(_res, list) or isinstance(_res, tuple): + hs = _res[0] + else: + hs = _res + + hs_reshaped = hs.reshape((1,hs.shape[0],hs.shape[1],hs.shape[2])) + + hs_repeated = T.repeat(hs_reshaped, self.pieces_alpha, axis=0).reshape((self.pieces_alpha, hs.shape[0], hs.shape[1], hs.shape[2])).dimshuffle(1,2,3,0) + + hs_alpha = BatchedDot(hs_repeated, self.Wl_alpha_out, True) + self.bl_alpha_out + + # hs: time steps x batch size x hidden dim + # hs_reshaped: time steps x batch size x hidden dim x pieces + # Wl_alpha_out: hidden dim x latent dim x pieces + # hs_alpha: time steps x batch size x latent dim x pieces + + if self.scale_latent_piecewise_variable_alpha_use_softplus: + hs_alpha = T.nnet.softplus(hs_alpha)*self.scale_alpha + else: + hs_alpha = T.exp(hs_alpha)*self.scale_alpha + + return [hs, hs_alpha], updates + + def __init__(self, state, input_dim, latent_dim, pieces_alpha, scale_alpha, rng, parent, name): + EncoderDecoderBase.__init__(self, state, rng, parent) + self.input_dim = input_dim + self.latent_dim = latent_dim + self.pieces_alpha = pieces_alpha + self.scale_alpha = scale_alpha + self.name = name + self.init_params() + + + +class DialogLevelRollLeft(EncoderDecoderBase): + """ + This 
class operates on hidden states at the dialogue level (inter-utterance level). + It rolls the hidden states at utterance t to be at position t-1. + It is used for the latent variable approximate posterior, which needs to use the future h variable. + """ + def plain_dialogue_step(self, h_t, m_t, hs_tm1): + if m_t.ndim >= 1: + m_t = m_t.dimshuffle(0, 'x') + + hs_t = (m_t) * hs_tm1 + (1 - m_t) * h_t + return hs_t + + def build_encoder(self, h, x, xmask=None, **kwargs): + one_step = False + if len(kwargs): + one_step = True + + assert not one_step + + # if x.ndim == 2 then + # x = (n_steps, batch_size) + if x.ndim == 2: + batch_size = x.shape[1] + # else x = (word_1, word_2, word_3, ...) + # or x = (last_word_1, last_word_2, last_word_3, ..) + else: + batch_size = 1 + + # if it is not one_step then we initialize everything to 0 + if not one_step: + hs_0 = h[-1] + + # in sampling mode (i.e. one step) we require + else: + # in this case x.ndim != 2 + assert x.ndim != 2 + assert 'prev_hs' in kwargs + hs_0 = kwargs['prev_hs'] + + if xmask == None: + xmask = T.neq(x, self.eos_sym) + + f_hier = self.plain_dialogue_step + o_hier_info = [hs_0] + + h_reversed = h[::-1] + xmask_reversed = xmask[::-1] + if not one_step: + _res, _ = theano.scan(f_hier,\ + sequences=[h_reversed, xmask_reversed],\ + outputs_info=o_hier_info) + + + + + # Just one step further + else: + _res = f_hier(h, xmask, hs_0) + + if isinstance(_res, list) or isinstance(_res, tuple): + hs = _res[0][::-1] + else: + hs = _res[::-1] + + final_hs = hs[1:(self.parent.x_max_length-1)] + final_hs = T.concatenate([final_hs, h[-1].dimshuffle('x', 0, 1)], axis=0) + + return final_hs + + + def __init__(self, state, input_dim, rng, parent): + EncoderDecoderBase.__init__(self, state, rng, parent) + self.input_dim = input_dim + +class DialogEncoderDecoder(Model): + """ + Main model class, which links together all other sub-components + and provides functions for training and sampling from the model. 
+ """ + + def indices_to_words(self, seq, exclude_end_sym=True): + """ + Converts a list of words to a list + of word ids. Use unk_sym if a word is not + known. + """ + def convert(): + for word_index in seq: + if word_index > len(self.idx_to_str): + raise ValueError('Word index is too large for the model vocabulary!') + if not exclude_end_sym or (word_index != self.eos_sym): + yield self.idx_to_str[word_index] + return list(convert()) + + def words_to_indices(self, seq): + """ + Converts a list of words to a list + of word ids. Use unk_sym if a word is not + known. + """ + return [self.str_to_idx.get(word, self.unk_sym) for word in seq] + + def reverse_utterances(self, seq): + """ + Reverses the words in each utterance inside a sequence of utterance (e.g. a dialogue) + This is used for the bidirectional encoder RNN. + """ + reversed_seq = numpy.copy(seq) + for idx in range(seq.shape[1]): + eos_indices = numpy.where(seq[:, idx] == self.eos_sym)[0] + prev_eos_index = -1 + for eos_index in eos_indices: + reversed_seq[(prev_eos_index+1):eos_index, idx] = (reversed_seq[(prev_eos_index+1):eos_index, idx])[::-1] + prev_eos_index = eos_index + + return reversed_seq + + def compute_updates(self, training_cost, params): + updates = [] + + grads = T.grad(training_cost, params) + grads = OrderedDict(zip(params, grads)) + + # Gradient clipping + c = numpy.float32(self.cutoff) + clip_grads = [] + + norm_gs = T.sqrt(sum(T.sum(g ** 2) for p, g in grads.items())) + normalization = T.switch(T.ge(norm_gs, c), c / norm_gs, np.float32(1.)) + notfinite = T.or_(T.isnan(norm_gs), T.isinf(norm_gs)) + + for p, g in grads.items(): + clip_grads.append((p, T.switch(notfinite, numpy.float32(.1) * p, g * normalization))) + + grads = OrderedDict(clip_grads) + + if self.W_emb in grads: + if self.initialize_from_pretrained_word_embeddings and self.fix_pretrained_word_embeddings: + assert not self.fix_encoder_parameters + # Keep pretrained word embeddings fixed + logger.debug("Will use mask to fix 
pretrained word embeddings") + grads[self.W_emb] = grads[self.W_emb] * self.W_emb_pretrained_mask + elif self.fix_encoder_parameters: + # If 'fix_encoder_parameters' is on, the word embeddings will be excluded from parameter training set + logger.debug("Will fix word embeddings to initial embeddings or embeddings from resumed model") + else: + logger.debug("Will train all word embeddings") + + optimizer_variables = [] + if self.updater == 'adagrad': + updates = Adagrad(grads, self.lr) + elif self.updater == 'sgd': + raise Exception("Sgd not implemented!") + elif self.updater == 'adadelta': + updates = Adadelta(grads) + elif self.updater == 'rmsprop': + updates = RMSProp(grads, self.lr) + elif self.updater == 'adam': + updates, optimizer_variables = Adam(grads, self.lr) + else: + raise Exception("Updater not understood!") + + return updates, optimizer_variables + + # Batch training function. + def build_train_function(self): + if not hasattr(self, 'train_fn'): + # Compile functions + logger.debug("Building train function") + + if self.add_latent_gaussian_per_utterance and self.add_latent_piecewise_per_utterance: + + self.train_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, + self.x_max_length, + self.x_cost_mask, + self.x_reset_mask, + self.ran_gaussian_cost_utterance, + self.ran_uniform_cost_utterance, + self.x_dropmask], + outputs=[self.training_cost, self.kl_divergence_cost_acc, self.latent_gaussian_utterance_variable_approx_posterior_mean_var, self.latent_piecewise_utterance_variable_approx_posterior_alpha[-1], self.latent_piecewise_utterance_variable_prior_alpha[-1], self.kl_divergences_between_piecewise_prior_and_posterior, self.kl_divergences_between_gaussian_prior_and_posterior, self.latent_piecewise_posterior_sample], + updates=self.updates + self.state_updates, + on_unused_input='warn', + name="train_fn") + + elif self.add_latent_gaussian_per_utterance: + self.train_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, + 
self.x_max_length, + self.x_cost_mask, + self.x_reset_mask, + self.ran_gaussian_cost_utterance, + self.ran_uniform_cost_utterance, + self.x_dropmask], + outputs=[self.training_cost, self.kl_divergence_cost_acc, self.latent_gaussian_utterance_variable_approx_posterior_mean_var, self.kl_divergences_between_gaussian_prior_and_posterior], + updates=self.updates + self.state_updates, + on_unused_input='warn', + name="train_fn") + + elif self.add_latent_piecewise_per_utterance: + self.train_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, + self.x_max_length, + self.x_cost_mask, + self.x_reset_mask, + self.ran_gaussian_cost_utterance, + self.ran_uniform_cost_utterance, + self.x_dropmask], + outputs=[self.training_cost, self.kl_divergence_cost_acc, self.kl_divergences_between_piecewise_prior_and_posterior], + updates=self.updates + self.state_updates, + on_unused_input='warn', + name="train_fn") + + else: + self.train_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, + self.x_max_length, + self.x_cost_mask, + self.x_reset_mask, + self.ran_gaussian_cost_utterance, + self.ran_uniform_cost_utterance, + self.x_dropmask], + outputs=self.training_cost, + updates=self.updates + self.state_updates, + on_unused_input='warn', + name="train_fn") + + return self.train_fn + + def build_gamma_bounding_function(self): + if not hasattr(self, 'gamma_bounding_fn'): + # Compile functions + logger.debug("Building gamma bounding function") + + self.gamma_bounding_fn = theano.function(inputs=[], + outputs=[], + updates=self.gamma_bounding_updates, + on_unused_input='warn', + name="gamma_bounding_fn") + + return self.gamma_bounding_fn + + # Helper function used for computing the initial decoder hidden states before sampling starts. 
+ def build_decoder_encoding(self): + if not hasattr(self, 'decoder_encoding_fn'): + # Compile functions + logger.debug("Building decoder encoding function") + + self.decoder_encoding_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, + self.x_max_length, self.x_cost_mask, + self.x_reset_mask, + self.ran_gaussian_cost_utterance, + self.ran_uniform_cost_utterance, + self.x_dropmask], + outputs=[self.hd], + on_unused_input='warn', + name="decoder_encoding_fn") + + return self.decoder_encoding_fn + + # Helper function used for the training with noise contrastive estimation (NCE). + # This function is currently not supported. + def build_nce_function(self): + if not hasattr(self, 'train_fn'): + # Compile functions + logger.debug("Building NCE train function") + + self.nce_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, + self.y_neg, self.x_max_length, + self.x_cost_mask, + self.x_reset_mask, + self.ran_gaussian_cost_utterance, + self.ran_uniform_cost_utterance, + self.x_dropmask], + outputs=[self.training_cost, self.kl_divergence_cost_acc, self.latent_gaussian_utterance_variable_approx_posterior_mean_var], + updates=self.updates + self.state_updates, + on_unused_input='warn', + name="train_fn") + + return self.nce_fn + + # Batch evaluation function. + def build_eval_function(self): + if not hasattr(self, 'eval_fn'): + # Compile functions + logger.debug("Building evaluation function") + + self.eval_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, self.x_max_length, self.x_cost_mask, self.x_reset_mask, self.ran_gaussian_cost_utterance, self.ran_uniform_cost_utterance, self.x_dropmask], + outputs=[self.evaluation_cost, self.softmax_cost, self.kl_divergence_cost_acc], + updates=self.state_updates, + on_unused_input='warn', name="eval_fn") + + + + return self.eval_fn + + # Batch mean field update function. 
+ def build_mf_update_function(self): + if not hasattr(self, 'mf_update_fn'): + # Compile functions + logger.debug("Building mean field update function") + + mf_params = [] + + if self.add_latent_gaussian_per_utterance: + mf_params.append(self.latent_gaussian_utterance_variable_approx_posterior_mean_mfbias) + mf_params.append(self.latent_gaussian_utterance_variable_approx_posterior_var_mfbias) + + if self.add_latent_piecewise_per_utterance: + mf_params.append(self.latent_piecewise_utterance_variable_approx_posterior_alpha_mfbias) + + mf_updates, _ = self.compute_updates(self.training_cost, mf_params) + + if self.add_latent_gaussian_per_utterance and self.add_latent_piecewise_per_utterance: + + self.mf_update_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, + self.x_max_length, + self.x_cost_mask, + self.x_reset_mask, + self.ran_gaussian_cost_utterance, + self.ran_uniform_cost_utterance, + self.x_dropmask], + outputs=[self.training_cost, self.kl_divergence_cost_acc, + self.kl_divergences_between_piecewise_prior_and_posterior, + self.kl_divergences_between_gaussian_prior_and_posterior], + updates=mf_updates, + on_unused_input='warn', + name="mf_update_fn") + + elif self.add_latent_gaussian_per_utterance: + self.mf_update_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, + self.x_max_length, + self.x_cost_mask, + self.x_reset_mask, + self.ran_gaussian_cost_utterance, + self.ran_uniform_cost_utterance, + self.x_dropmask], + outputs=[self.training_cost, self.kl_divergence_cost_acc, + self.kl_divergences_between_gaussian_prior_and_posterior], + updates=mf_updates, + on_unused_input='warn', + name="mf_update_fn") + + elif self.add_latent_piecewise_per_utterance: + self.mf_update_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, + self.x_max_length, + self.x_cost_mask, + self.x_reset_mask, + self.ran_gaussian_cost_utterance, + self.ran_uniform_cost_utterance, + self.x_dropmask], + outputs=[self.training_cost, 
self.kl_divergence_cost_acc,\ + self.kl_divergences_between_piecewise_prior_and_posterior], + updates=mf_updates, + on_unused_input='warn', + name="mf_update_fn") + + + return self.mf_update_fn + + def build_mf_reset_function(self): + if not hasattr(self, 'mf_reset_fn'): + # Compile functions + logger.debug("Building mean field reset function") + + mf_reset_update = [] + + if self.add_latent_gaussian_per_utterance: + mf_reset_update.append((self.latent_gaussian_utterance_variable_approx_posterior_mean_mfbias, T.zeros_like(self.latent_gaussian_utterance_variable_approx_posterior_mean_mfbias))) + mf_reset_update.append((self.latent_gaussian_utterance_variable_approx_posterior_var_mfbias, T.zeros_like(self.latent_gaussian_utterance_variable_approx_posterior_var_mfbias))) + + if self.add_latent_piecewise_per_utterance: + mf_reset_update.append((self.latent_piecewise_utterance_variable_approx_posterior_alpha_mfbias, T.zeros_like(self.latent_piecewise_utterance_variable_approx_posterior_alpha_mfbias))) + + + + self.mf_reset_fn = theano.function(inputs=[], + outputs=[], + updates=mf_reset_update, + on_unused_input='warn', + name="mf_reset_fn") + + return self.mf_reset_fn + + # Batch saliency evaluation function. 
+ def build_saliency_eval_function(self): + if not hasattr(self, 'saliency_eval_fn'): + # Compile functions + logger.debug("Building saliency evaluation function") + + training_x = self.x_data[:(self.x_max_length-1)] + training_x_cost_mask = self.x_cost_mask[1:self.x_max_length] + latent_variable_mask = T.eq(training_x, self.eos_sym) * training_x_cost_mask + + # Compute Gaussian KL divergence saliency: + if self.add_latent_gaussian_per_utterance: + kl_saliency_gaussian = \ + T.grad(T.sum(self.kl_divergences_between_gaussian_prior_and_posterior*latent_variable_mask), self.W_emb)**2 + kl_saliency_gaussian = T.sum(kl_saliency_gaussian, axis=-1) + else: + kl_saliency_gaussian = T.sum(T.zeros_like(self.W_emb), axis=-1) + + + # Compute Piecewise KL divergence saliency: + if self.add_latent_piecewise_per_utterance: + kl_saliency_piecewise = \ + T.grad(T.sum(self.kl_divergences_between_piecewise_prior_and_posterior*latent_variable_mask), self.W_emb)**2 + kl_saliency_piecewise = T.sum(kl_saliency_piecewise, axis=-1) + else: + kl_saliency_piecewise = T.sum(T.zeros_like(self.W_emb), axis=-1) + + self.saliency_eval_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, self.x_max_length, self.x_cost_mask, self.x_reset_mask, self.ran_gaussian_cost_utterance, self.ran_uniform_cost_utterance, self.x_dropmask], + outputs=[kl_saliency_gaussian, kl_saliency_piecewise], + updates=self.state_updates, + on_unused_input='warn', name="saliency_eval_fn") + + + + return self.saliency_eval_fn + + # Helper function used to compute decoder hidden states and token probabilities. + # Currently this function does not supported truncated computations. 
    def build_next_probs_function(self):
        """
        Lazily compiles the beam-search step function: given the current beam
        state, it samples the enabled latent variables from their priors,
        forms the decoder input, and returns next-token probabilities together
        with the new decoder state.
        """
        if not hasattr(self, 'next_probs_fn'):

            if self.add_latent_gaussian_per_utterance or self.add_latent_piecewise_per_utterance:

                # Choose what the latent priors condition on: the dialogue encoder
                # state (possibly truncated to sdim) or an all-zero placeholder.
                if self.condition_latent_variable_on_dialogue_encoder:
                    if self.direct_connection_between_encoders_and_decoder:
                        hs_to_condition_latent_variable_on = self.beam_hs.dimshuffle((0, 'x', 1))
                    else:
                        hs_to_condition_latent_variable_on = self.beam_hs.dimshuffle((0, 'x', 1))[:, :, 0:self.sdim]
                else:
                    hs_to_condition_latent_variable_on = T.alloc(np.float32(0), self.beam_hs.shape[0], 1, self.beam_hs.shape[1])[:, :, 0:self.sdim]

                if self.add_latent_gaussian_per_utterance:
                    _gaussian_prior_out, _ = self.latent_gaussian_utterance_variable_prior_encoder.build_encoder(hs_to_condition_latent_variable_on, self.beam_x_data[-1])

                    latent_gaussian_utterance_variable_prior_mean = _gaussian_prior_out[1][-1]
                    latent_gaussian_utterance_variable_prior_var = _gaussian_prior_out[2][-1]

                    # Reparametrized Gaussian sample: eps * sigma + mu.
                    prior_gaussian_sample = self.beam_ran_gaussian_cost_utterance * T.sqrt(latent_gaussian_utterance_variable_prior_var) + latent_gaussian_utterance_variable_prior_mean

                if self.add_latent_piecewise_per_utterance:
                    _piecewise_prior_out, _ = self.latent_piecewise_utterance_variable_prior_encoder.build_encoder(hs_to_condition_latent_variable_on, self.beam_x_data[-1])

                    latent_piecewise_utterance_variable_prior_alpha_hat = _piecewise_prior_out[1][-1]

                    # Apply alpha parameter tying / convolution
                    # (smooths alpha_hat across neighbouring pieces with Gaussian weights)
                    if self.latent_piecewise_variable_alpha_parameter_tying:
                        latent_piecewise_utterance_variable_prior_alpha = \
                            T.zeros_like(latent_piecewise_utterance_variable_prior_alpha_hat)

                        for i in range(1, self.latent_piecewise_alpha_variables+1):
                            normalization_constant = 0.0
                            for j in range(1, self.latent_piecewise_alpha_variables+1):
                                # Compute current alpha_hat weight
                                w = numpy.exp(-self.latent_piecewise_variable_alpha_parameter_tying_beta*(i-j)**2)

                                # Add weight to normalization constant
                                normalization_constant += w

                            normalization_constant = normalization_constant.astype('float32')

                            for j in range(1, self.latent_piecewise_alpha_variables+1):
                                # Compute normalized alpha_hat weight
                                wn = numpy.exp(-self.latent_piecewise_variable_alpha_parameter_tying_beta*(i-j)**2)\
                                     /normalization_constant
                                wn = wn.astype('float32')

                                # Add weight to alpha prior
                                latent_piecewise_utterance_variable_prior_alpha = \
                                    T.inc_subtensor(latent_piecewise_utterance_variable_prior_alpha[:,:,i-1],\
                                                    wn*latent_piecewise_utterance_variable_prior_alpha_hat[:,:,j-1])

                    else:
                        latent_piecewise_utterance_variable_prior_alpha = \
                            latent_piecewise_utterance_variable_prior_alpha_hat

                    # Per-piece mass ki and total mass k of the piecewise density.
                    latent_piecewise_utterance_prior_ki = latent_piecewise_utterance_variable_prior_alpha / self.latent_piecewise_alpha_variables
                    latent_piecewise_utterance_prior_k = T.sum(latent_piecewise_utterance_prior_ki, axis=2)

                    # Sample from prior using inverse transform sampling:
                    # locate the piece whose CDF interval contains epsilon, then
                    # invert the (linear) CDF within that piece.
                    epsilon = self.beam_ran_uniform_cost_utterance
                    prior_piecewise_sample = T.zeros_like(epsilon)
                    for i in range(1, self.latent_piecewise_alpha_variables+1):
                        lowerbound = T.zeros_like(epsilon)
                        for j in range(1, i):
                            lowerbound += (1.0/latent_piecewise_utterance_prior_k)*latent_piecewise_utterance_prior_ki[:, :,j-1]
                        upperbound = lowerbound + (1.0/latent_piecewise_utterance_prior_k)*latent_piecewise_utterance_prior_ki[:, :,i-1]
                        # indicator is 1 exactly when epsilon falls in [lowerbound, upperbound).
                        indicator = T.ge(epsilon, lowerbound)*T.lt(epsilon, upperbound)

                        prior_piecewise_sample += \
                            indicator*((i - 1.0)/(self.latent_piecewise_alpha_variables) \
                            + (latent_piecewise_utterance_prior_k/latent_piecewise_utterance_variable_prior_alpha[:,:,i-1])*(epsilon - lowerbound))

                    # Transform sample to be in the range [-1, 1] with initial mean at zero.
                    prior_piecewise_sample = 2.0*prior_piecewise_sample - 1.0

                # Assemble the decoder input from the dialogue state and/or the
                # latent samples, depending on which variables are enabled.
                if self.add_latent_gaussian_per_utterance and self.add_latent_piecewise_per_utterance:
                    if self.condition_decoder_only_on_latent_variable:
                        decoder_inp = T.concatenate([prior_gaussian_sample, prior_piecewise_sample], axis=1)
                    else:
                        decoder_inp = T.concatenate([self.beam_hs, prior_gaussian_sample, prior_piecewise_sample], axis=1)
                elif self.add_latent_gaussian_per_utterance:
                    if self.condition_decoder_only_on_latent_variable:
                        decoder_inp = prior_gaussian_sample
                    else:
                        decoder_inp = T.concatenate([self.beam_hs, prior_gaussian_sample], axis=1)
                else:
                    if self.condition_decoder_only_on_latent_variable:
                        decoder_inp = prior_piecewise_sample
                    else:
                        decoder_inp = T.concatenate([self.beam_hs, prior_piecewise_sample], axis=1)

            else:
                decoder_inp = self.beam_hs

            outputs, hd = self.utterance_decoder.build_next_probs_predictor(decoder_inp, self.beam_source, prev_state=self.beam_hd)
            self.next_probs_fn = theano.function(inputs=[self.beam_hs, self.beam_hd, self.beam_source, self.beam_x_data, self.beam_ran_gaussian_cost_utterance, self.beam_ran_uniform_cost_utterance],
                                                 outputs=[outputs, hd],
                                                 on_unused_input='warn',
                                                 name="next_probs_fn")
        return self.next_probs_fn

    # Currently this function does not support truncated computations.
    # NOTE: If batch is given as input with padded endings,
    # e.g. last 'n' tokens are all zero and not part of the real sequence,
    # then the encoding must be extracted at index of the last non-padded (non-zero) token.
    def build_encoder_function(self):
        """
        Lazily compiles a function exposing the encoder-side representations:
        utterance encodings h, dialogue state hs_complete, and the decoder
        input (plus its variance) derived from the latent Gaussian prior.
        NOTE(review): this method continues beyond this chunk; the theano.function
        compilation follows below.
        """
        if not hasattr(self, 'encoder_fn'):

            if self.bidirectional_utterance_encoder:
                res_forward, _, _ = self.utterance_encoder_forward.build_encoder(self.x_data)
                res_backward, _, _ = self.utterance_encoder_backward.build_encoder(self.x_data_reversed)

                # Each encoder gives a single output vector
                h = T.concatenate([res_forward, res_backward], axis=2)
            else:
                h, _, _ = self.utterance_encoder.build_encoder(self.x_data)

            hs, _ = self.dialog_encoder.build_encoder(h, self.x_data)

            # Optionally concatenate a direct (dummy-encoder) connection onto hs.
            if self.direct_connection_between_encoders_and_decoder:
                hs_dummy = self.dialog_dummy_encoder.build_encoder(h, self.x_data)
                hs_complete = T.concatenate([hs, hs_dummy], axis=2)
            else:
                hs_complete = hs


            if self.add_latent_gaussian_per_utterance:

                # Initialize hidden states to zero
                platent_gaussian_utterance_variable_approx_posterior = theano.shared(value=numpy.zeros((self.bs, self.latent_gaussian_per_utterance_dim), dtype='float32'), name='encoder_fn_platent_gaussian_utterance_variable_approx_posterior')

                if self.condition_posterior_latent_variable_on_dcgm_encoder:
                    platent_dcgm_avg = theano.shared(value=numpy.zeros((self.bs, self.rankdim), dtype='float32'), name='encoder_fn_platent_dcgm_avg')
                    platent_dcgm_n = theano.shared(value=numpy.zeros((1, self.bs), dtype='float32'), name='encoder_fn_platent_dcgm_n')

                # Create computational graph for latent variables
                latent_variable_mask = T.eq(self.x_data, self.eos_sym)

                if self.condition_latent_variable_on_dialogue_encoder:
                    hs_to_condition_latent_variable_on = hs_complete
                else:
                    hs_to_condition_latent_variable_on = T.alloc(np.float32(0), hs.shape[0], hs.shape[1], hs.shape[2])

                logger.debug("Initializing approximate posterior encoder for utterance-level latent variable")
                # Input size for the posterior network depends on encoder
                # directionality and on the direct encoder->decoder connection.
                if self.bidirectional_utterance_encoder and not self.condition_posterior_latent_variable_on_dcgm_encoder:
                    posterior_latent_input_size = self.sdim + self.qdim_encoder*2
                    if self.direct_connection_between_encoders_and_decoder:
                        posterior_latent_input_size += self.qdim_encoder*2
                else:
                    posterior_latent_input_size = self.sdim + self.qdim_encoder
                    if self.direct_connection_between_encoders_and_decoder:
                        posterior_latent_input_size += self.qdim_encoder

                # "Future" utterance encodings for the approximate posterior are
                # obtained by rolling the encodings one utterance to the left.
                if self.condition_posterior_latent_variable_on_dcgm_encoder:
                    logger.debug("Build dcgm encoder")
                    latent_dcgm_res, latent_dcgm_avg, latent_dcgm_n = self.dcgm_encoder.build_encoder(self.x_data, prev_state=[platent_dcgm_avg, platent_dcgm_n])
                    h_future = self.utterance_encoder_rolledleft.build_encoder( \
                        latent_dcgm_res, \
                        self.x_data)

                else:
                    h_future = self.utterance_encoder_rolledleft.build_encoder( \
                        h, \
                        self.x_data)


                # Compute prior
                _prior_out, _ = self.latent_gaussian_utterance_variable_prior_encoder.build_encoder(hs_to_condition_latent_variable_on, self.x_data, latent_variable_mask=latent_variable_mask)

                latent_utterance_variable_prior_mean = _prior_out[1]
                latent_utterance_variable_prior_variance = _prior_out[2]

                # Decoder input: dialogue state concatenated with the prior mean;
                # the matching variance vector is zero for the deterministic parts.
                if self.direct_connection_between_encoders_and_decoder:
                    if self.condition_decoder_only_on_latent_variable:
                        hd_input = latent_utterance_variable_prior_mean
                        hd_input_variance = latent_utterance_variable_prior_variance
                    else:
                        hd_input = T.concatenate([hs, hs_dummy, latent_utterance_variable_prior_mean], axis=2)
                        hd_input_variance = T.concatenate([T.zeros_like(hs), T.zeros_like(hs_dummy), latent_utterance_variable_prior_variance], axis=2)
                else:
                    if self.condition_decoder_only_on_latent_variable:
                        hd_input = latent_utterance_variable_prior_mean
                        hd_input_variance = latent_utterance_variable_prior_variance
                    else:
                        hd_input = T.concatenate([hs, latent_utterance_variable_prior_mean], axis=2)
                        hd_input_variance = T.concatenate([T.zeros_like(hs), latent_utterance_variable_prior_variance], axis=2)


                ## Compute candidate posterior
                #hs_and_h_future = T.concatenate([hs_to_condition_latent_variable_on, h_future], axis=2)

                #logger.debug("Build approximate posterior encoder for utterance-level latent variable")
                #_posterior_out, _ = self.latent_gaussian_utterance_variable_approx_posterior_encoder.build_encoder( \
                #                     hs_and_h_future, \
                #                     self.x_data, \
                #                     latent_variable_mask=latent_variable_mask)


                ## Use an MLP to interpolate between prior mean and candidate posterior mean and variance.
                #latent_utterance_variable_approx_posterior_mean = self.gaussian_posterior_mean_combination.build_output(self.hs_and_h_future, _prior_out[1], _posterior_out[1])
                #latent_utterance_variable_approx_posterior_var = self.posterior_variance_combination.build_output(self.hs_and_h_future, _prior_out[2], _posterior_out[2])

            else:
                hd_input = hs_complete
                hd_input_variance = T.zeros_like(hs_complete)

            #decoder_inp = hd_input
            #if self.deep_utterance_decoder_input:
            #    decoder_inp, _ = self.utterance_decoder.input_mlp.build_output(hd_input, T.neq(self.x_data[1:self.x_data.shape[0]], self.eos_sym))

            # TODO: Implement posterior distribution encoding of piecewise latent variables here!
+ + if self.add_latent_gaussian_per_utterance: + self.encoder_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, \ + self.x_max_length], \ + outputs=[h, hs_complete, hd_input, hd_input_variance], on_unused_input='warn', name="encoder_fn") + #self.encoder_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, \ + # self.x_max_length], \ + # outputs=[h, hs_complete, hs_and_h_future, latent_utterance_variable_approx_posterior_mean], on_unused_input='warn', name="encoder_fn") + else: + self.encoder_fn = theano.function(inputs=[self.x_data, self.x_data_reversed, \ + self.x_max_length], \ + outputs=[h, hs_complete, hd_input, hd_input_variance], on_unused_input='warn', name="encoder_fn") + + + return self.encoder_fn + + + def compute_utterance_embeddings(self, utterances): + # Build encoder function if it doesn't already exist + if not hasattr(self, 'encoder_fn'): + self.build_encoder_function() + + maxlen = 1 + for utterance_id in range(len(utterances)): + words = utterances[utterance_id].split() + words_count = len(words) + if len(words) > 0: + if not words[0] == self.end_sym_utterance: + utterances[utterance_id] = (self.end_sym_utterance + ' ' + utterances[utterance_id]).replace(' ', ' ') + words_count += 1 + if not words[-1] == self.end_sym_utterance: + utterances[utterance_id] = (utterances[utterance_id] + ' ' + self.end_sym_utterance).replace(' ', ' ') + words_count += 1 + + maxlen = max(maxlen, words_count) + + maxlen = min(maxlen, self.max_len) + dialogue = numpy.zeros((maxlen, len(utterances)), dtype='int32') + dialogue_eos_indices = [] + for utterance_id in range(len(utterances)): + word_ids = self.words_to_indices(utterances[utterance_id].split()) + if word_ids > maxlen: + word_ids = word_ids[-maxlen:] + + dialogue[0:len(word_ids), utterance_id] = word_ids + dialogue_eos_indices.append(len(word_ids)-1) + + dialogue_reversed = self.reverse_utterances(dialogue) + + full_embeddings = self.encoder_fn(dialogue, dialogue_reversed, 
dialogue.shape[0]) + + # Use utterance encoder + full_embeddings = full_embeddings[0] + + # Use transformed input to decoder + #full_embeddings = full_embeddings[2] + + embeddings = numpy.zeros((full_embeddings.shape[1], full_embeddings.shape[2]), dtype='float32') + for utterance_id in range(len(utterances)): + embeddings[utterance_id, :] = full_embeddings[dialogue_eos_indices[utterance_id], utterance_id, :] + + normalized_embeddings = (embeddings.T / numpy.linalg.norm(embeddings, axis=1)).T + + return normalized_embeddings + + def compute_utterance_embeddings_from_list(self, utterances): + # Compute embedding size embeddings + # Use utterance encoder + if True: + if self.bidirectional_utterance_encoder: + embedding_dim = self.qdim_encoder*2 + else: + embedding_dim = self.qdim_encoder + + # Use transformed input to decoder + if False: + embedding_dim = self.utterance_decoder.input_dim + + # Compute utterance embeddings + utterance_embeddings = numpy.zeros((len(utterances), embedding_dim), dtype='float32') + last_utterance_id_computed = 0 + utterances_to_compute = [] + for utterance_id in range(len(utterances)): + utterances_to_compute.append(utterances[utterance_id]) + + if (len(utterances_to_compute) == self.bs) or (utterance_id+1 == len(utterances)): + print('utterance_id', utterance_id) + + computed_emb = self.compute_utterance_embeddings(utterances_to_compute) + utterance_embeddings[last_utterance_id_computed:last_utterance_id_computed+computed_emb.shape[0], :] = computed_emb[:, :] + last_utterance_id_computed = utterance_id+1 + utterances_to_compute = [] + + return utterance_embeddings + + def compute_utterance_embeddings_with_variance(self, utterances): + # Build encoder function if it doesn't already exist + if not hasattr(self, 'encoder_fn'): + self.build_encoder_function() + + maxlen = 1 + for utterance_id in range(len(utterances)): + words = utterances[utterance_id].split() + words_count = len(words) + if len(words) > 0: + if not words[0] == 
self.end_sym_utterance: + utterances[utterance_id] = (self.end_sym_utterance + ' ' + utterances[utterance_id]).replace(' ', ' ') + words_count += 1 + if not words[-1] == self.end_sym_utterance: + utterances[utterance_id] = (utterances[utterance_id] + ' ' + self.end_sym_utterance).replace(' ', ' ') + words_count += 1 + + maxlen = max(maxlen, words_count) + + maxlen = min(maxlen, self.max_len) + dialogue = numpy.zeros((maxlen, len(utterances)), dtype='int32') + dialogue_eos_indices = [] + for utterance_id in range(len(utterances)): + word_ids = self.words_to_indices(utterances[utterance_id].split()) + if word_ids > maxlen: + word_ids = word_ids[-maxlen:] + + dialogue[0:len(word_ids), utterance_id] = word_ids + dialogue_eos_indices.append(len(word_ids)-1) + + dialogue_reversed = self.reverse_utterances(dialogue) + + full_embeddings = self.encoder_fn(dialogue, dialogue_reversed, dialogue.shape[0]) + + # Use transformed input to decoder + full_embeddings_mean = full_embeddings[2] + full_embeddings_var = full_embeddings[3] + + embeddings = numpy.zeros((full_embeddings_mean.shape[1], full_embeddings_mean.shape[2]), dtype='float32') + embeddings_var = numpy.zeros((full_embeddings_mean.shape[1], full_embeddings_mean.shape[2]), dtype='float32') + for utterance_id in range(len(utterances)): + embeddings[utterance_id, :] = full_embeddings_mean[dialogue_eos_indices[utterance_id], utterance_id, :] + embeddings_var[utterance_id, :] = full_embeddings_var[dialogue_eos_indices[utterance_id], utterance_id, :] + + return embeddings, embeddings_var + + def compute_utterance_embeddings_with_variance_from_list(self, utterances): + # Compute embedding size embeddings + # Use utterance encoder + if self.bidirectional_utterance_encoder: + embedding_dim = self.qdim_encoder*2 + self.sdim + else: + embedding_dim = self.qdim_encoder + self.sdim + + if self.add_latent_gaussian_per_utterance: + embedding_dim += self.latent_gaussian_per_utterance_dim + + # Compute utterance embeddings + 
utterance_embeddings = numpy.zeros((len(utterances), embedding_dim), dtype='float32') + utterance_variance_embeddings = numpy.zeros((len(utterances), embedding_dim), dtype='float32') + last_utterance_id_computed = 0 + utterances_to_compute = [] + for utterance_id in range(len(utterances)): + utterances_to_compute.append(utterances[utterance_id]) + + if (len(utterances_to_compute) == self.bs) or (utterance_id+1 == len(utterances)): + print('utterance_id', utterance_id) + + computed_emb, computed_emb_variance = self.compute_utterance_embeddings_with_variance(utterances_to_compute) + utterance_embeddings[last_utterance_id_computed:last_utterance_id_computed+computed_emb.shape[0], :] = computed_emb[:, :] + utterance_variance_embeddings[last_utterance_id_computed:last_utterance_id_computed+computed_emb_variance.shape[0], :] = computed_emb_variance[:, :] + + last_utterance_id_computed = utterance_id+1 + utterances_to_compute = [] + + # Remove useless sdim values + utterance_embeddings = utterance_embeddings[:, self.sdim:] + utterance_variance_embeddings = utterance_variance_embeddings[:, self.sdim:] + + return utterance_embeddings, utterance_variance_embeddings + + def __init__(self, state): + Model.__init__(self) + + # Make sure eos_sym is never zero, otherwise generate_encodings script would fail + assert state['eos_sym'] > 0 + + if not 'bidirectional_utterance_encoder' in state: + state['bidirectional_utterance_encoder'] = False + + if 'encode_with_l2_pooling' in state: + assert state['encode_with_l2_pooling'] == False # We don't support L2 pooling right now... 
+ + if not 'direct_connection_between_encoders_and_decoder' in state: + state['direct_connection_between_encoders_and_decoder'] = False + + if not 'deep_direct_connection' in state: + state['deep_direct_connection'] = False + + if not 'disable_dialogue_encoder' in state: + state['disable_dialogue_encoder'] = False + + if state['disable_dialogue_encoder']: + # We can only disable the dialoge encoder, if the utterance encoder hidden state + # is given as input to the decoder directly. + assert state['direct_connection_between_encoders_and_decoder'] + + if not state['direct_connection_between_encoders_and_decoder']: + assert(state['deep_direct_connection'] == False) + + if not 'collaps_to_standard_rnn' in state: + state['collaps_to_standard_rnn'] = False + + if not 'reset_utterance_decoder_at_end_of_utterance' in state: + state['reset_utterance_decoder_at_end_of_utterance'] = True + + if not 'reset_utterance_encoder_at_end_of_utterance' in state: + state['reset_utterance_encoder_at_end_of_utterance'] = False + else: + assert state['reset_utterance_encoder_at_end_of_utterance'] == False + + if not 'deep_dialogue_encoder_input' in state: + state['deep_dialogue_encoder_input'] = True + + if not 'deep_utterance_decoder_input' in state: + state['deep_utterance_decoder_input'] = False + + if not 'reset_hidden_states_between_subsequences' in state: + state['reset_hidden_states_between_subsequences'] = False + + if not 'fix_encoder_parameters' in state: + state['fix_encoder_parameters'] = False + + if not 'decoder_drop_previous_input_tokens' in state: + state['decoder_drop_previous_input_tokens'] = False + else: + if state['decoder_drop_previous_input_tokens']: + assert state['decoder_drop_previous_input_tokens_rate'] + + if not 'add_latent_gaussian_per_utterance' in state: + state['add_latent_gaussian_per_utterance'] = False + if not 'latent_gaussian_per_utterance_dim' in state: + state['latent_gaussian_per_utterance_dim'] = 1 + if not 
'condition_latent_variable_on_dialogue_encoder' in state: + state['condition_latent_variable_on_dialogue_encoder'] = True + if not 'condition_posterior_latent_variable_on_dcgm_encoder' in state: + state['condition_posterior_latent_variable_on_dcgm_encoder'] = False + if not 'scale_latent_gaussian_variable_variances' in state: + state['scale_latent_gaussian_variable_variances'] = 0.01 + if not 'condition_decoder_only_on_latent_variable' in state: + state['condition_decoder_only_on_latent_variable'] = False + + if not 'train_latent_variables_with_kl_divergence_annealing' in state: + state['train_latent_variables_with_kl_divergence_annealing'] = False + if state['train_latent_variables_with_kl_divergence_annealing']: + assert 'kl_divergence_annealing_rate' in state + + if not 'kl_divergence_max_weight' in state: + state['kl_divergence_max_weight'] = 1.0 + + + if not 'add_latent_piecewise_per_utterance' in state: + state['add_latent_piecewise_per_utterance'] = False + if not 'gate_latent_piecewise_per_utterance' in state: + state['gate_latent_piecewise_per_utterance'] = True + + if not 'constraint_latent_piecewise_variable_posterior' in state: + state['constraint_latent_piecewise_variable_posterior'] = True + if not 'scale_latent_piecewise_variable_prior_alpha' in state: + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + if not 'scale_latent_piecewise_variable_posterior_alpha' in state: + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + if not 'scale_latent_piecewise_variable_alpha_use_softplus' in state: + state['scale_latent_piecewise_variable_alpha_use_softplus'] = True + if not 'latent_piecewise_variable_alpha_parameter_tying' in state: + state['latent_piecewise_variable_alpha_parameter_tying'] = False + + if not 'apply_meanfield_inference' in state: + state['apply_meanfield_inference'] = False + + if state['collaps_to_standard_rnn']: + # If we collapse to standard RNN (e.g. LSTM language model) then we should not reset. 
+ # If we did reset, we'd have a language model over individual utterances, which is what we want! + assert not state['reset_utterance_decoder_at_end_of_utterance'] + + if not 'compute_training_updates' in state: + state['compute_training_updates'] = True + + self.state = state + self.global_params = [] + + self.__dict__.update(state) + self.rng = numpy.random.RandomState(state['seed']) + + # Load dictionary + raw_dict = cPickle.load(open(self.dictionary, 'r')) + + # Probabilities for each term in the corpus used for noise contrastive estimation (NCE) + self.noise_probs = [x[2] for x in sorted(raw_dict, key=operator.itemgetter(1))] + self.noise_probs = numpy.array(self.noise_probs, dtype='float64') + self.noise_probs /= numpy.sum(self.noise_probs) + self.noise_probs = self.noise_probs ** 0.75 + self.noise_probs /= numpy.sum(self.noise_probs) + + self.t_noise_probs = theano.shared(self.noise_probs.astype('float32'), 't_noise_probs') + + # Dictionaries to convert str to idx and vice-versa + self.str_to_idx = dict([(tok, tok_id) for tok, tok_id, _, _ in raw_dict]) + self.idx_to_str = dict([(tok_id, tok) for tok, tok_id, freq, _ in raw_dict]) + + # Extract document (dialogue) frequency for each word + self.word_freq = dict([(tok_id, freq) for _, tok_id, freq, _ in raw_dict]) + self.document_freq = dict([(tok_id, df) for _, tok_id, _, df in raw_dict]) + + if self.end_sym_utterance not in self.str_to_idx: + raise Exception("Error, malformed dictionary!") + + # Number of words in the dictionary + self.idim = len(self.str_to_idx) + self.state['idim'] = self.idim + logger.debug("idim: " + str(self.idim)) + + logger.debug("Initializing Theano variables") + self.y_neg = T.itensor3('y_neg') + self.x_data = T.imatrix('x_data') + self.x_data_reversed = T.imatrix('x_data_reversed') + self.x_cost_mask = T.matrix('cost_mask') + self.x_reset_mask = T.vector('reset_mask') + self.x_max_length = T.iscalar('x_max_length') + self.ran_gaussian_cost_utterance = 
T.tensor3('ran_gaussian_cost_utterance') + self.ran_uniform_cost_utterance = T.tensor3('ran_uniform_cost_utterance') + self.x_dropmask = T.matrix('x_dropmask') + + + + # The 'x' data (input) is defined as all symbols except the last, and + # the 'y' data (output) is defined as all symbols except the first. + training_x = self.x_data[:(self.x_max_length-1)] + training_x_reversed = self.x_data_reversed[:(self.x_max_length-1)] + training_y = self.x_data[1:self.x_max_length] + training_x_dropmask = self.x_dropmask[:(self.x_max_length-1)] + + # Here we find the end-of-utterance tokens in the minibatch. + training_hs_mask = T.neq(training_x, self.eos_sym) + training_x_cost_mask = self.x_cost_mask[1:self.x_max_length] + training_x_cost_mask_flat = training_x_cost_mask.flatten() + + # Backward compatibility + if 'decoder_bias_type' in self.state: + logger.debug("Decoder bias type {}".format(self.decoder_bias_type)) + + + # Build word embeddings, which are shared throughout the model + if self.initialize_from_pretrained_word_embeddings == True: + # Load pretrained word embeddings from pickled file + logger.debug("Loading pretrained word embeddings") + pretrained_embeddings = cPickle.load(open(self.pretrained_word_embeddings_file, 'r')) + + # Check all dimensions match from the pretrained embeddings + assert(self.idim == pretrained_embeddings[0].shape[0]) + assert(self.rankdim == pretrained_embeddings[0].shape[1]) + assert(self.idim == pretrained_embeddings[1].shape[0]) + assert(self.rankdim == pretrained_embeddings[1].shape[1]) + + self.W_emb_pretrained_mask = theano.shared(pretrained_embeddings[1].astype(numpy.float32), name='W_emb_mask') + self.W_emb = add_to_params(self.global_params, theano.shared(value=pretrained_embeddings[0].astype(numpy.float32), name='W_emb')) + else: + # Initialize word embeddings randomly + self.W_emb = add_to_params(self.global_params, theano.shared(value=NormalInit(self.rng, self.idim, self.rankdim), name='W_emb')) + + # Variables to store 
encoder and decoder states + if self.bidirectional_utterance_encoder: + # Previous states variables + self.ph_fwd = theano.shared(value=numpy.zeros((self.bs, self.qdim_encoder), dtype='float32'), name='ph_fwd') + self.ph_fwd_n = theano.shared(value=numpy.zeros((1, self.bs), dtype='int8'), name='ph_fwd_n') + + self.ph_bck = theano.shared(value=numpy.zeros((self.bs, self.qdim_encoder), dtype='float32'), name='ph_bck') + self.ph_bck_n = theano.shared(value=numpy.zeros((1, self.bs), dtype='int8'), name='ph_bck_n') + + self.phs = theano.shared(value=numpy.zeros((self.bs, self.sdim), dtype='float32'), name='phs') + + if self.direct_connection_between_encoders_and_decoder: + self.phs_dummy = theano.shared(value=numpy.zeros((self.bs, self.qdim_encoder*2), dtype='float32'), name='phs_dummy') + + else: + # Previous states variables + self.ph = theano.shared(value=numpy.zeros((self.bs, self.qdim_encoder), dtype='float32'), name='ph') + self.ph_n = theano.shared(value=numpy.zeros((1, self.bs), dtype='int8'), name='ph_n') + + self.phs = theano.shared(value=numpy.zeros((self.bs, self.sdim), dtype='float32'), name='phs') + + if self.direct_connection_between_encoders_and_decoder: + self.phs_dummy = theano.shared(value=numpy.zeros((self.bs, self.qdim_encoder), dtype='float32'), name='phs_dummy') + + if self.utterance_decoder_gating == 'LSTM': + self.phd = theano.shared(value=numpy.zeros((self.bs, self.qdim_decoder*2), dtype='float32'), name='phd') + else: + self.phd = theano.shared(value=numpy.zeros((self.bs, self.qdim_decoder), dtype='float32'), name='phd') + + if self.add_latent_gaussian_per_utterance: + self.platent_gaussian_utterance_variable_prior = theano.shared(value=numpy.zeros((self.bs, self.latent_gaussian_per_utterance_dim), dtype='float32'), name='platent_gaussian_utterance_variable_prior') + self.platent_gaussian_utterance_variable_approx_posterior = theano.shared(value=numpy.zeros((self.bs, self.latent_gaussian_per_utterance_dim), dtype='float32'), 
name='platent_gaussian_utterance_variable_approx_posterior') + + if self.add_latent_piecewise_per_utterance: + self.platent_piecewise_utterance_variable_prior = theano.shared(value=numpy.zeros((self.bs, self.latent_piecewise_per_utterance_dim), dtype='float32'), name='platent_piecewise_utterance_variable_prior') + self.platent_piecewise_utterance_variable_approx_posterior = theano.shared(value=numpy.zeros((self.bs, self.latent_piecewise_per_utterance_dim), dtype='float32'), name='platent_piecewise_utterance_variable_approx_posterior') + + if self.add_latent_gaussian_per_utterance or self.add_latent_piecewise_per_utterance: + if self.condition_posterior_latent_variable_on_dcgm_encoder: + self.platent_dcgm_avg = theano.shared(value=numpy.zeros((self.bs, self.rankdim), dtype='float32'), name='platent_dcgm_avg') + self.platent_dcgm_n = theano.shared(value=numpy.zeros((1, self.bs), dtype='float32'), name='platent_dcgm_n') + + + # Build utterance encoders + if self.bidirectional_utterance_encoder: + logger.debug("Initializing forward utterance encoder") + self.utterance_encoder_forward = UtteranceEncoder(self.state, self.rng, self.W_emb, self, 'fwd') + logger.debug("Build forward utterance encoder") + res_forward, res_forward_n, res_forward_updates = self.utterance_encoder_forward.build_encoder(training_x, xmask=training_hs_mask, prev_state=[self.ph_fwd, self.ph_fwd_n]) + + logger.debug("Initializing backward utterance encoder") + self.utterance_encoder_backward = UtteranceEncoder(self.state, self.rng, self.W_emb, self, 'bck') + logger.debug("Build backward utterance encoder") + res_backward, res_backward_n, res_backward_updates = self.utterance_encoder_backward.build_encoder(training_x_reversed, xmask=training_hs_mask, prev_state=[self.ph_bck, self.ph_bck_n]) + + # The encoder h embedding is a concatenation of final states of the forward and backward encoder RNNs + self.h = T.concatenate([res_forward, res_backward], axis=2) + + else: + logger.debug("Initializing 
utterance encoder") + self.utterance_encoder = UtteranceEncoder(self.state, self.rng, self.W_emb, self, 'fwd') + + logger.debug("Build utterance encoder") + + # The encoder h embedding is the final hidden state of the forward encoder RNN + res_forward, res_forward_n, res_forward_updates = self.utterance_encoder.build_encoder(training_x, xmask=training_hs_mask, prev_state=[self.ph, self.ph_n]) + + self.h = res_forward + + + logger.debug("Initializing dialog encoder") + self.dialog_encoder = DialogEncoder(self.state, self.rng, self, '_dialogue_encoder') + + logger.debug("Build dialog encoder") + self.hs, self.dialogue_encoder_updates = self.dialog_encoder.build_encoder(self.h, training_x, xmask=training_hs_mask, prev_state=self.phs) + + # Define input vector for decoder + if self.direct_connection_between_encoders_and_decoder: + logger.debug("Initializing dialog dummy encoder") + if self.bidirectional_utterance_encoder: + self.dialog_dummy_encoder = DialogDummyEncoder(self.state, self.rng, self, self.qdim_encoder*2) + else: + self.dialog_dummy_encoder = DialogDummyEncoder(self.state, self.rng, self, self.qdim_encoder) + + logger.debug("Build dialog dummy encoder") + self.hs_dummy = self.dialog_dummy_encoder.build_encoder(self.h, training_x, xmask=training_hs_mask, prev_state=self.phs_dummy) + + + + # Compute quantities necessary for handling latent variables + if self.add_latent_gaussian_per_utterance or self.add_latent_piecewise_per_utterance: + # Define list storing variable updates related to latent modules + self.latent_variable_updates = [] + + # Define KL divergence cost + self.kl_divergence_cost = training_x_cost_mask*0 + + # Compute mask over latent variables. + # One means that a variable is part of the computational graph and zero that it's not. 
+ latent_variable_mask = T.eq(training_x, self.eos_sym) * training_x_cost_mask + + # We consider two kinds of prior: one case where the latent variables are + # conditioned on the dialogue encoder, and one case where they are not conditioned on anything. + if self.condition_latent_variable_on_dialogue_encoder: + if self.direct_connection_between_encoders_and_decoder: + self.hs_to_condition_latent_variable_on = T.concatenate([self.hs, self.hs_dummy], axis=2) + if self.bidirectional_utterance_encoder: + prior_latent_input_size = self.sdim + self.qdim_encoder*2 + else: + prior_latent_input_size = self.sdim + self.qdim_encoder + else: + self.hs_to_condition_latent_variable_on = self.hs + prior_latent_input_size = self.sdim + else: + self.hs_to_condition_latent_variable_on = T.alloc(np.float32(0), self.hs.shape[0], self.hs.shape[1], self.hs.shape[2]) + prior_latent_input_size = self.sdim + + + if self.bidirectional_utterance_encoder and not self.condition_posterior_latent_variable_on_dcgm_encoder: + posterior_latent_input_size = prior_latent_input_size + self.qdim_encoder*2 + else: + posterior_latent_input_size = prior_latent_input_size + self.qdim_encoder + + # Retrieve hidden state at the end of next utterance from the utterance encoders + # (or at the end of the batch, if there are no end-of-token symbols at the end of the batch) + if self.bidirectional_utterance_encoder: + self.utterance_encoder_rolledleft = DialogLevelRollLeft(self.state, self.qdim_encoder, self.rng, self) + else: + self.utterance_encoder_rolledleft = DialogLevelRollLeft(self.state, self.qdim_encoder*2, self.rng, self) + + if self.condition_posterior_latent_variable_on_dcgm_encoder: + logger.debug("Initializing DCGM encoder for conditioning input to the utterance-level latent variable") + + self.dcgm_encoder = DCGMEncoder(self.state, self.rng, self.W_emb, self.qdim_encoder, self, 'latent_dcgm_encoder') + logger.debug("Build DCGM encoder") + latent_dcgm_res, self.latent_dcgm_avg, self.latent_dcgm_n 
= self.dcgm_encoder.build_encoder(training_x, xmask=training_hs_mask, prev_state=[self.platent_dcgm_avg, self.platent_dcgm_n]) + + self.h_future = self.utterance_encoder_rolledleft.build_encoder( \ + latent_dcgm_res, \ + training_x, \ + xmask=training_hs_mask) + else: + self.h_future = self.utterance_encoder_rolledleft.build_encoder( \ + self.h, \ + training_x, \ + xmask=training_hs_mask) + + self.hs_and_h_future = T.concatenate([self.hs_to_condition_latent_variable_on, self.h_future], axis=2) + + + + + + + + + + # We initialize the multivariate Gaussian latent variables + if self.add_latent_gaussian_per_utterance: + logger.debug("Initializing prior encoder for utterance-level latent multivariate Gaussian variables") + + self.latent_gaussian_utterance_variable_prior_encoder = DialogLevelLatentGaussianEncoder(self.state, prior_latent_input_size, self.latent_gaussian_per_utterance_dim, self.rng, self, 'latent_gaussian_utterance_prior') + + logger.debug("Build prior encoder for utterance-level latent multivariate Gaussian variables") + _prior_gaussian_out, _prior_gaussian_updates = self.latent_gaussian_utterance_variable_prior_encoder.build_encoder(self.hs_to_condition_latent_variable_on, training_x, xmask=training_hs_mask, latent_variable_mask=latent_variable_mask, prev_state=self.platent_gaussian_utterance_variable_prior) + self.latent_variable_updates += _prior_gaussian_updates + + self.latent_gaussian_utterance_variable_prior = _prior_gaussian_out[0] + self.latent_gaussian_utterance_variable_prior_mean = _prior_gaussian_out[1] + self.latent_gaussian_utterance_variable_prior_var = _prior_gaussian_out[2] + + self.latent_gaussian_utterance_variable_approx_posterior_encoder = DialogLevelLatentGaussianEncoder(self.state, posterior_latent_input_size, self.latent_gaussian_per_utterance_dim, self.rng, self, 'latent_gaussian_utterance_approx_posterior') + + logger.debug("Build approximate posterior encoder for utterance-level latent multivariate Gaussian variables") + 
_posterior_gaussian_out, _posterior_gaussian_updates = \ + self.latent_gaussian_utterance_variable_approx_posterior_encoder.build_encoder( \ + self.hs_and_h_future, \ + training_x, \ + xmask=training_hs_mask, \ + latent_variable_mask=latent_variable_mask, \ + prev_state=self.platent_gaussian_utterance_variable_approx_posterior) + self.latent_variable_updates += _posterior_gaussian_updates + + self.latent_gaussian_utterance_variable_approx_posterior = _posterior_gaussian_out[0] + + # Use an MLP to interpolate between prior mean and candidate posterior mean. + # This allows model to revert back to prior easily for dimensions, where it is uncertain. + self.gaussian_posterior_mean_combination = LinearCombination(self.state, posterior_latent_input_size, self.latent_gaussian_per_utterance_dim, False, 0.0, 0.0, self.rng, self, 'latent_gaussian_utterance_approx_posterior_mean_combination') + self.latent_gaussian_utterance_variable_approx_posterior_mean = self.gaussian_posterior_mean_combination.build_output(self.hs_and_h_future, self.latent_gaussian_utterance_variable_prior_mean, _posterior_gaussian_out[1]) + + + # Use an MLP to interpolate between prior variance and candidate posterior variance. + # This allows model to revert back to prior easily for dimensions, where it is uncertain. + self.posterior_variance_combination = LinearCombination(self.state, posterior_latent_input_size, self.latent_gaussian_per_utterance_dim, True, self.min_latent_gaussian_variable_variances, self.max_latent_gaussian_variable_variances, self.rng, self, 'latent_gaussian_utterance_approx_posterior_variance_combination') + self.latent_gaussian_utterance_variable_approx_posterior_var = self.posterior_variance_combination.build_output(self.hs_and_h_future, self.latent_gaussian_utterance_variable_prior_var, _posterior_gaussian_out[2]) + + + # Apply mean-field inference? 
+ if self.apply_meanfield_inference: + self.latent_gaussian_utterance_variable_approx_posterior_mean_mfbias = \ + theano.shared(value=numpy.zeros((self.bs, self.latent_gaussian_per_utterance_dim), dtype='float32'), name='latent_gaussian_utterance_variable_approx_posterior_mean_mfbias') + self.latent_gaussian_utterance_variable_approx_posterior_var_mfbias = \ + theano.shared(value=numpy.zeros((self.bs, self.latent_gaussian_per_utterance_dim), dtype='float32'), name='latent_gaussian_utterance_variable_approx_posterior_var_mfbias') + + self.latent_gaussian_utterance_variable_approx_posterior_mean += \ + self.latent_gaussian_utterance_variable_approx_posterior_mean_mfbias.dimshuffle('x', 0, 1) + + self.latent_gaussian_utterance_variable_approx_posterior_var += \ + T.maximum(self.latent_gaussian_utterance_variable_approx_posterior_var_mfbias.dimshuffle('x', 0, 1), - self.latent_gaussian_utterance_variable_approx_posterior_var + 0.000001) + + + + + self.latent_gaussian_utterance_variable_approx_posterior_mean_var = T.sum(T.mean(self.latent_gaussian_utterance_variable_approx_posterior_var,axis=2)*latent_variable_mask) / (T.sum(latent_variable_mask) + 0.0000001) + + # Sample utterance latent variable from posterior + self.latent_gaussian_posterior_sample = self.ran_gaussian_cost_utterance[:(self.x_max_length-1)] * T.sqrt(self.latent_gaussian_utterance_variable_approx_posterior_var) + self.latent_gaussian_utterance_variable_approx_posterior_mean + + # Compute KL divergence cost + mean_diff_squared = (self.latent_gaussian_utterance_variable_prior_mean \ + - self.latent_gaussian_utterance_variable_approx_posterior_mean)**2 + + logger.debug("Build KL divergence cost for latent multivariate Gaussian variables") + #self.kl_divergences_between_gaussian_prior_and_posterior \ + # = T.maximum(0.0, (T.sum(self.latent_gaussian_utterance_variable_approx_posterior_var \ + # /self.latent_gaussian_utterance_variable_prior_var, axis=2) \ + # + 
T.sum(mean_diff_squared/self.latent_gaussian_utterance_variable_prior_var, axis=2) \ + # - state['latent_gaussian_per_utterance_dim'] \ + # + T.sum(T.log(self.latent_gaussian_utterance_variable_prior_var), axis=2) \ + # - T.sum(T.log(self.latent_gaussian_utterance_variable_approx_posterior_var), axis=2) \ + # ) / 2) + + # Numerically stable without truncation at zero + self.kl_divergences_between_gaussian_prior_and_posterior \ + = (T.sum(self.latent_gaussian_utterance_variable_approx_posterior_var \ + /self.latent_gaussian_utterance_variable_prior_var, axis=2) \ + + T.sum(mean_diff_squared/self.latent_gaussian_utterance_variable_prior_var, axis=2) \ + - state['latent_gaussian_per_utterance_dim'] \ + + T.sum(T.log(self.latent_gaussian_utterance_variable_prior_var), axis=2) \ + - T.sum(T.log(self.latent_gaussian_utterance_variable_approx_posterior_var), axis=2))/2 + + + + self.kl_divergence_cost += self.kl_divergences_between_gaussian_prior_and_posterior*latent_variable_mask + + else: + self.latent_gaussian_utterance_variable_approx_posterior_mean_var = theano.shared(value=numpy.float(0)) + + + + + + + + + + + + # We initialize the stochastic latent variables + # platent_piecewise_utterance_variable_prior + if self.add_latent_piecewise_per_utterance: + # Compute prior + logger.debug("Initializing prior encoder for utterance-level latent piecewise variables") + self.latent_piecewise_utterance_variable_prior_encoder = DialogLevelLatentPiecewiseEncoder(self.state, prior_latent_input_size, self.latent_piecewise_per_utterance_dim, self.latent_piecewise_alpha_variables, self.scale_latent_piecewise_variable_prior_alpha, self.rng, self, 'latent_piecewise_utterance_prior') + + logger.debug("Build prior encoder for utterance-level latent piecewise variables") + _prior_piecewise_out, _prior_piecewise_updates = self.latent_piecewise_utterance_variable_prior_encoder.build_encoder(self.hs_to_condition_latent_variable_on, training_x, xmask=training_hs_mask, 
latent_variable_mask=latent_variable_mask, prev_state=self.platent_piecewise_utterance_variable_prior) + self.latent_variable_updates += _prior_piecewise_updates + + self.latent_piecewise_utterance_variable_prior = _prior_piecewise_out[0] + self.latent_piecewise_utterance_variable_prior_alpha_hat = _prior_piecewise_out[1] + + + # Compute posterior using prior + logger.debug("Initializing approximate posterior encoder for utterance-level latent piecewise variables") + self.latent_piecewise_utterance_variable_approx_posterior_encoder = DialogLevelLatentPiecewiseEncoder(self.state, posterior_latent_input_size, self.latent_piecewise_per_utterance_dim, self.latent_piecewise_alpha_variables, self.scale_latent_piecewise_variable_posterior_alpha, self.rng, self, 'latent_piecewise_utterance_approx_posterior') + + logger.debug("Build approximate posterior encoder for utterance-level latent piecewise variables") + _posterior_piecewise_out, _posterior_piecewise_updates = \ + self.latent_piecewise_utterance_variable_approx_posterior_encoder.build_encoder( \ + self.hs_and_h_future, \ + training_x, \ + xmask=training_hs_mask, \ + latent_variable_mask=latent_variable_mask, \ + prev_state=self.platent_piecewise_utterance_variable_approx_posterior) + self.latent_variable_updates += _posterior_piecewise_updates + + self.latent_piecewise_utterance_variable_approx_posterior = _posterior_piecewise_out[0] + + # Apply gating mechanism for linear interpolation + if self.gate_latent_piecewise_per_utterance: + self.piecewise_posterior_mean_combination = LinearCombination(self.state, posterior_latent_input_size, self.latent_piecewise_per_utterance_dim, False, 0.0, 0.0, self.rng, self, 'latent_piecewise_utterance_approx_posterior_alpha_combination') + self.latent_piecewise_utterance_variable_approx_posterior_alpha_hat = self.piecewise_posterior_mean_combination.build_output(self.hs_and_h_future, self.latent_piecewise_utterance_variable_prior_alpha_hat.dimshuffle(0, 1, 3, 2), 
_posterior_piecewise_out[1].dimshuffle(0, 1, 3, 2)).dimshuffle(0, 1, 3, 2) + else: + self.latent_piecewise_utterance_variable_approx_posterior_alpha_hat = _posterior_piecewise_out[1] + + + # Apply alpha parameter trying / convolution + if self.latent_piecewise_variable_alpha_parameter_tying: + self.latent_piecewise_utterance_variable_prior_alpha = \ + T.zeros_like(self.latent_piecewise_utterance_variable_prior_alpha_hat) + self.latent_piecewise_utterance_variable_approx_posterior_alpha = \ + T.zeros_like(self.latent_piecewise_utterance_variable_approx_posterior_alpha_hat) + + for i in range(1, self.latent_piecewise_alpha_variables+1): + normalization_constant = 0.0 + for j in range(1, self.latent_piecewise_alpha_variables+1): + # Compute current alpha_hat weight + w = numpy.exp(-self.latent_piecewise_variable_alpha_parameter_tying_beta*(i-j)**2) + + # Add weight to normalization constant + normalization_constant += w + + normalization_constant = normalization_constant.astype('float32') + + for j in range(1, self.latent_piecewise_alpha_variables+1): + # Compute normalized alpha_hat weight + wn = numpy.exp(-self.latent_piecewise_variable_alpha_parameter_tying_beta*(i-j)**2)\ + /normalization_constant + wn = wn.astype('float32') + + # Add weight to alpha prior + self.latent_piecewise_utterance_variable_prior_alpha = \ + T.inc_subtensor(self.latent_piecewise_utterance_variable_prior_alpha[:,:,:,i-1],\ + wn*self.latent_piecewise_utterance_variable_prior_alpha_hat[:,:, :,j-1]) + + # Add weight to alpha posterior + self.latent_piecewise_utterance_variable_approx_posterior_alpha = \ + T.inc_subtensor(self.latent_piecewise_utterance_variable_approx_posterior_alpha[:,:,:,i-1],\ + wn*self.latent_piecewise_utterance_variable_approx_posterior_alpha_hat[:,:, :,j-1]) + + + else: + self.latent_piecewise_utterance_variable_prior_alpha = \ + self.latent_piecewise_utterance_variable_prior_alpha_hat + self.latent_piecewise_utterance_variable_approx_posterior_alpha = \ + 
self.latent_piecewise_utterance_variable_approx_posterior_alpha_hat + + + if self.apply_meanfield_inference: + self.latent_piecewise_utterance_variable_approx_posterior_alpha_mfbias = \ + theano.shared(value=numpy.zeros((self.bs, self.latent_piecewise_per_utterance_dim,\ + self.latent_piecewise_alpha_variables), dtype='float32'),\ + name='latent_piecewise_utterance_variable_approx_posterior_alpha_mfbias') + + self.latent_piecewise_utterance_variable_approx_posterior_alpha += \ + T.exp(self.latent_piecewise_utterance_variable_approx_posterior_alpha_mfbias.dimshuffle('x', 0, 1, 2)) + + + # Compute prior normalization constants: + latent_piecewise_utterance_prior_ki = self.latent_piecewise_utterance_variable_prior_alpha / self.latent_piecewise_alpha_variables + latent_piecewise_utterance_prior_k = T.sum(latent_piecewise_utterance_prior_ki, axis=3) + + + + # epsilon: a standard uniform sample in range [0, 1]; + # shape: (time steps x batch size x number of piecewise latent variables) + # latent_piecewise_posterior_sample: initialized to zeros; + # shape: (time steps x batch size x number of piecewise latent variables) + # latent_piecewise_alpha_variables: integer representing number of pieces (I set this to 3) + # latent_piecewise_utterance_variable_approx_posterior_alpha: + # un-normalized a values, i.e. 
the height of each rectangle; + # shape: (time steps x batch size x number of piecewise latent variables x latent_piecewise_alpha_variables) + + + # Compute posterior normalization constants: + # latent_piecewise_utterance_variable_prior_alpha: time steps x batch sizes x latent dim x pieces + latent_piecewise_utterance_posterior_ki = self.latent_piecewise_utterance_variable_approx_posterior_alpha / self.latent_piecewise_alpha_variables + latent_piecewise_utterance_posterior_k = T.sum(latent_piecewise_utterance_posterior_ki, axis=3) + + epsilon = self.ran_uniform_cost_utterance[:(self.x_max_length-1)] + + # Sample from posterior using inverse transform sampling: + self.latent_piecewise_posterior_sample = T.zeros_like(epsilon) + for i in range(1, self.latent_piecewise_alpha_variables+1): + lowerbound = T.zeros_like(epsilon) + for j in range(1, i): + lowerbound += (1.0/latent_piecewise_utterance_posterior_k)*latent_piecewise_utterance_posterior_ki[:,:, :,j-1] + upperbound = lowerbound + (1.0/latent_piecewise_utterance_posterior_k)*latent_piecewise_utterance_posterior_ki[:,:, :,i-1] + indicator = T.ge(epsilon, lowerbound)*T.lt(epsilon, upperbound) + + self.latent_piecewise_posterior_sample += \ + indicator*((i - 1.0)/(self.latent_piecewise_alpha_variables) \ + + (latent_piecewise_utterance_posterior_k/self.latent_piecewise_utterance_variable_approx_posterior_alpha[:,:,:,i-1])*(epsilon - lowerbound)) + + # Transform sample to be in the range [-1, 1] with initial mean at zero. + # This is considered as part of the decoder and does not affect KL divergence computations. 
+ self.latent_piecewise_posterior_sample = 2.0*self.latent_piecewise_posterior_sample - 1.0 + + # Next, compute KL divergence cost + self.kl_divergences_between_piecewise_prior_and_posterior = T.zeros_like(latent_variable_mask) + for i in range(1, self.latent_piecewise_alpha_variables+1): + self.kl_divergences_between_piecewise_prior_and_posterior += T.sum((1.0/self.latent_piecewise_alpha_variables)*(self.latent_piecewise_utterance_variable_approx_posterior_alpha[:,:,:,i-1]/latent_piecewise_utterance_posterior_k)*(T.log(self.latent_piecewise_utterance_variable_approx_posterior_alpha[:,:,:,i-1]/latent_piecewise_utterance_posterior_k)-T.log(self.latent_piecewise_utterance_variable_prior_alpha[:,:,:,i-1]/latent_piecewise_utterance_prior_k)), axis=2) + + self.kl_divergence_cost += self.kl_divergences_between_piecewise_prior_and_posterior*latent_variable_mask + + else: + self.latent_piecewise_utterance_variable_approx_posterior_alpha = theano.shared(value=numpy.float(0)) + self.latent_piecewise_utterance_variable_prior_alpha = theano.shared(value=numpy.float(0)) + + + # We initialize the decoder, and fix its word embeddings to that of the encoder(s) + logger.debug("Initializing decoder") + self.utterance_decoder = UtteranceDecoder(self.state, self.rng, self, self.dialog_encoder, self.W_emb) + + # Define input vector for decoder + if self.direct_connection_between_encoders_and_decoder: + logger.debug("Build decoder (NCE) with direct connection from encoder(s)") + if self.add_latent_gaussian_per_utterance and self.add_latent_piecewise_per_utterance: + if self.condition_decoder_only_on_latent_variable: + self.hd_input = T.concatenate([self.latent_gaussian_posterior_sample, self.latent_piecewise_posterior_sample], axis=2) + else: + self.hd_input = T.concatenate([self.hs, self.hs_dummy, self.latent_gaussian_posterior_sample, self.latent_piecewise_posterior_sample], axis=2) + + elif self.add_latent_gaussian_per_utterance: + if self.condition_decoder_only_on_latent_variable: + 
self.hd_input = self.latent_gaussian_posterior_sample + else: + self.hd_input = T.concatenate([self.hs, self.hs_dummy, self.latent_gaussian_posterior_sample], axis=2) + elif self.add_latent_piecewise_per_utterance: + if self.condition_decoder_only_on_latent_variable: + self.hd_input = self.latent_piecewise_posterior_sample + else: + self.hd_input = T.concatenate([self.hs, self.hs_dummy, self.latent_piecewise_posterior_sample], axis=2) + else: + self.hd_input = T.concatenate([self.hs, self.hs_dummy], axis=2) + + else: + if self.add_latent_gaussian_per_utterance and self.add_latent_piecewise_per_utterance: + if self.condition_decoder_only_on_latent_variable: + self.hd_input = T.concatenate([self.latent_gaussian_posterior_sample, self.latent_piecewise_posterior_sample], axis=2) + else: + self.hd_input = T.concatenate([self.hs, self.latent_gaussian_posterior_sample, self.latent_piecewise_posterior_sample], axis=2) + elif self.add_latent_gaussian_per_utterance: + if self.condition_decoder_only_on_latent_variable: + self.hd_input = self.latent_gaussian_posterior_sample + else: + self.hd_input = T.concatenate([self.hs, self.latent_gaussian_posterior_sample], axis=2) + elif self.add_latent_piecewise_per_utterance: + if self.condition_decoder_only_on_latent_variable: + self.hd_input = self.latent_piecewise_posterior_sample + else: + self.hd_input = T.concatenate([self.hs, self.latent_piecewise_posterior_sample], axis=2) + else: + self.hd_input = self.hs + + # Build decoder + logger.debug("Build decoder (NCE)") + contrastive_cost, self.hd_nce, self.utterance_decoder_nce_updates = self.utterance_decoder.build_decoder(self.hd_input, training_x, y_neg=self.y_neg, y=training_y, xmask=training_hs_mask, xdropmask=training_x_dropmask, mode=UtteranceDecoder.NCE, prev_state=self.phd) + + logger.debug("Build decoder (EVAL)") + target_probs, self.hd, target_probs_full_matrix, self.utterance_decoder_updates = self.utterance_decoder.build_decoder(self.hd_input, training_x, 
xmask=training_hs_mask, xdropmask=training_x_dropmask, y=training_y, mode=UtteranceDecoder.EVALUATION, prev_state=self.phd) + + # Prediction cost and rank cost + self.contrastive_cost = T.sum(contrastive_cost.flatten() * training_x_cost_mask_flat) + self.softmax_cost = -T.log(target_probs) * training_x_cost_mask_flat + self.softmax_cost_acc = T.sum(self.softmax_cost) + + # Prediction accuracy + self.training_misclassification = T.neq(T.argmax(target_probs_full_matrix, axis=2), training_y).flatten() * training_x_cost_mask_flat + + self.training_misclassification_acc = T.sum(self.training_misclassification) + + # Compute training cost, which equals standard cross-entropy error + self.training_cost = self.softmax_cost_acc + if self.use_nce: + self.training_cost = self.contrastive_cost + + # Compute training cost as variational lower bound with possible annealing of KL-divergence term + if self.add_latent_gaussian_per_utterance or self.add_latent_piecewise_per_utterance: + self.kl_divergence_cost_acc = T.sum(self.kl_divergence_cost) + if self.train_latent_variables_with_kl_divergence_annealing: + assert hasattr(self, 'max_kl_percentage') == False + + self.evaluation_cost = self.training_cost + T.minimum(self.kl_divergence_max_weight, 1.0)*self.kl_divergence_cost_acc + + self.kl_divergence_cost_weight = add_to_params(self.global_params, theano.shared(value=numpy.float32(0), name='kl_divergence_cost_weight')) + self.training_cost = self.training_cost + T.minimum(self.kl_divergence_max_weight, self.kl_divergence_cost_weight)*self.kl_divergence_cost_acc + + else: + if hasattr(self, 'max_kl_percentage'): + self.evaluation_cost = self.training_cost + self.kl_divergence_cost_acc + + if self.add_latent_gaussian_per_utterance: + self.training_cost += T.maximum(self.max_kl_percentage*self.training_cost, T.sum(self.kl_divergences_between_gaussian_prior_and_posterior*latent_variable_mask)) + + if self.add_latent_piecewise_per_utterance: + self.training_cost += 
T.maximum(self.max_kl_percentage*self.training_cost, T.sum(self.kl_divergences_between_piecewise_prior_and_posterior*latent_variable_mask)) + + else: + self.evaluation_cost = self.training_cost + self.kl_divergence_cost_acc + self.training_cost += self.kl_divergence_cost_acc + + else: + self.evaluation_cost = self.training_cost + self.kl_divergence_cost_acc = theano.shared(value=numpy.float(0)) + + + + # Init params + if self.collaps_to_standard_rnn: + self.params = self.global_params + self.utterance_decoder.params + assert len(set(self.params)) == (len(self.global_params) + len(self.utterance_decoder.params)) + else: + if self.bidirectional_utterance_encoder: + self.params = self.global_params + self.utterance_encoder_forward.params + self.utterance_encoder_backward.params + self.dialog_encoder.params + self.utterance_decoder.params + assert len(set(self.params)) == (len(self.global_params) + len(self.utterance_encoder_forward.params) + len(self.utterance_encoder_backward.params) + len(self.dialog_encoder.params) + len(self.utterance_decoder.params)) + else: + self.params = self.global_params + self.utterance_encoder.params + self.dialog_encoder.params + self.utterance_decoder.params + assert len(set(self.params)) == (len(self.global_params) + len(self.utterance_encoder.params) + len(self.dialog_encoder.params) + len(self.utterance_decoder.params)) + + if self.add_latent_gaussian_per_utterance: + assert len(set(self.params)) + len(set(self.latent_gaussian_utterance_variable_prior_encoder.params)) \ + == len(set(self.params+self.latent_gaussian_utterance_variable_prior_encoder.params)) + self.params += self.latent_gaussian_utterance_variable_prior_encoder.params + assert len(set(self.params)) + len(set(self.latent_gaussian_utterance_variable_approx_posterior_encoder.params)) \ + == len(set(self.params+self.latent_gaussian_utterance_variable_approx_posterior_encoder.params)) + self.params += self.latent_gaussian_utterance_variable_approx_posterior_encoder.params + 
+ assert len(set(self.params)) + len(set(self.gaussian_posterior_mean_combination.params)) \ + == len(set(self.params+self.gaussian_posterior_mean_combination.params)) + self.params += self.gaussian_posterior_mean_combination.params + + assert len(set(self.params)) + len(set(self.posterior_variance_combination.params)) \ + == len(set(self.params+self.posterior_variance_combination.params)) + self.params += self.posterior_variance_combination.params + + if self.condition_posterior_latent_variable_on_dcgm_encoder: + assert len(set(self.params)) + len(set(self.dcgm_encoder.params)) \ + == len(set(self.params+self.dcgm_encoder.params)) + self.params += self.dcgm_encoder.params + + if self.add_latent_piecewise_per_utterance: + assert len(set(self.params)) + len(set(self.latent_piecewise_utterance_variable_prior_encoder.params)) \ + == len(set(self.params+self.latent_piecewise_utterance_variable_prior_encoder.params)) + self.params += self.latent_piecewise_utterance_variable_prior_encoder.params + assert len(set(self.params)) + len(set(self.latent_piecewise_utterance_variable_approx_posterior_encoder.params)) \ + == len(set(self.params+self.latent_piecewise_utterance_variable_approx_posterior_encoder.params)) + self.params += self.latent_piecewise_utterance_variable_approx_posterior_encoder.params + + if self.gate_latent_piecewise_per_utterance: + assert len(set(self.params)) + len(set(self.piecewise_posterior_mean_combination.params)) \ + == len(set(self.params+self.piecewise_posterior_mean_combination.params)) + self.params += self.piecewise_posterior_mean_combination.params + + + # Create set of parameters to train + self.params_to_train = [] + self.params_to_exclude = [] + if self.fix_encoder_parameters: + # If the option fix_encoder_parameters is on, then we exclude all parameters + # related to the utterance encoder(s) and dialogue encoder, including the word embeddings, + # from the parameter training set. 
+ if self.bidirectional_utterance_encoder: + self.params_to_exclude = self.global_params + self.utterance_encoder_forward.params + self.utterance_encoder_backward.params + self.dialog_encoder.params + else: + self.params_to_exclude = self.global_params + self.utterance_encoder.params + self.dialog_encoder.params + + if self.add_latent_gaussian_per_utterance or self.add_latent_piecewise_per_utterance: + # We always need to exclude the KL-divergence term weight from training, + # since this value is being annealed (and should therefore not be optimized with SGD). + if self.train_latent_variables_with_kl_divergence_annealing: + self.params_to_exclude += [self.kl_divergence_cost_weight] + + # Add appropriate normalization operator parameters to list of parameters to exclude from training. + # These parameters will be updated elsewhere. + for param in self.params: + if len(param.name) > 3: + if param.name[0:7] == 'normop_': + if ('_mean_' in param.name) or ('_var_' in param.name): + self.params_to_exclude += [param] + + + for param in self.params: + if not param in self.params_to_exclude: + self.params_to_train += [param] + + if self.compute_training_updates: + self.updates, self.optimizer_variables = self.compute_updates(self.training_cost / training_x.shape[1], self.params_to_train) + + # Add additional updates, i.e. updates not related to SGD (e.g. batch norm updates) + self.updates += res_forward_updates + if self.bidirectional_utterance_encoder: + self.updates += res_backward_updates + + self.updates += self.dialogue_encoder_updates + self.updates += self.utterance_decoder_updates + + if self.add_latent_gaussian_per_utterance or self.add_latent_piecewise_per_utterance: + self.updates += self.latent_variable_updates + + # Add optimizer parameters to parameter set. This will ensure that they are saved and loaded correctly. 
+ assert len(set(self.params)) + len(set(self.optimizer_variables)) \ + == len(set(self.params+self.optimizer_variables)) + self.params += self.optimizer_variables + + # Truncate gradients properly by bringing forward previous states + # First, create reset mask + x_reset = self.x_reset_mask.dimshuffle(0, 'x') + # if flag 'reset_hidden_states_between_subsequences' is on, then always reset + if self.reset_hidden_states_between_subsequences: + x_reset = 0 + + # Next, compute updates using reset mask (this depends on the number of RNNs in the model) + self.state_updates = [] + if self.bidirectional_utterance_encoder: + self.state_updates.append((self.ph_fwd, x_reset * res_forward[-1])) + self.state_updates.append((self.ph_fwd_n, T.gt(x_reset.T, 0.0) * res_forward_n[-1])) + + self.state_updates.append((self.ph_bck, x_reset * res_backward[-1])) + self.state_updates.append((self.ph_bck_n, T.gt(x_reset.T, 0.0) * res_backward_n[-1])) + + self.state_updates.append((self.phs, x_reset * self.hs[-1])) + + self.state_updates.append((self.phd, x_reset * self.hd[-1])) + else: + self.state_updates.append((self.ph, x_reset * res_forward[-1])) + self.state_updates.append((self.ph_n, T.gt(x_reset.T, 0.0) * res_forward_n[-1])) + + self.state_updates.append((self.phs, x_reset * self.hs[-1])) + + self.state_updates.append((self.phd, x_reset * self.hd[-1])) + + if self.direct_connection_between_encoders_and_decoder: + self.state_updates.append((self.phs_dummy, x_reset * self.hs_dummy[-1])) + + if self.add_latent_gaussian_per_utterance: + self.state_updates.append((self.platent_gaussian_utterance_variable_prior, x_reset * self.latent_gaussian_utterance_variable_prior[-1])) + self.state_updates.append((self.platent_gaussian_utterance_variable_approx_posterior, x_reset * self.latent_gaussian_utterance_variable_approx_posterior[-1])) + + if self.add_latent_piecewise_per_utterance: + self.state_updates.append((self.platent_piecewise_utterance_variable_prior, x_reset * 
self.latent_piecewise_utterance_variable_prior[-1])) + self.state_updates.append((self.platent_piecewise_utterance_variable_approx_posterior, x_reset * self.latent_piecewise_utterance_variable_approx_posterior[-1])) + + if self.add_latent_gaussian_per_utterance or self.add_latent_piecewise_per_utterance: + if self.condition_posterior_latent_variable_on_dcgm_encoder: + self.state_updates.append((self.platent_dcgm_avg, x_reset * self.latent_dcgm_avg[-1])) + self.state_updates.append((self.platent_dcgm_n, x_reset.T * self.latent_dcgm_n[-1])) + + if self.train_latent_variables_with_kl_divergence_annealing: + self.state_updates.append((self.kl_divergence_cost_weight, T.minimum(1.0, self.kl_divergence_cost_weight + self.kl_divergence_annealing_rate))) + + + + # Add normalization operator updates, + # which projects gamma parameters back to their constrained intervals: + self.normop_gamma_params = [] + if not self.normop_type.upper() == 'NONE': + print(' Searching for gamma parameters which must have bounded interval:') + for param in self.params: + if len(param.name) > 9: + if param.name[0:3] == 'normop_': + if '_gamma_' in param.name: + if not '_optimizer_' in param.name: + self.normop_gamma_params += [param] + print(' ', param.name) + + self.gamma_bounding_updates = [] + for param in self.normop_gamma_params: + new_gamma = T.minimum(T.maximum(param, self.normop_gamma_min), self.normop_gamma_max) + self.gamma_bounding_updates.append((param, new_gamma)) + + else: + self.state_updates = [] + + + # Beam-search variables + self.beam_x_data = T.imatrix('beam_x_data') + self.beam_source = T.lvector("beam_source") + self.beam_hs = T.matrix("beam_hs") + self.beam_step_num = T.lscalar("beam_step_num") + self.beam_hd = T.matrix("beam_hd") + self.beam_ran_gaussian_cost_utterance = T.matrix('beam_ran_gaussian_cost_utterance') + self.beam_ran_uniform_cost_utterance = T.matrix('beam_ran_uniform_cost_utterance') diff --git a/parlai/agents/hred/hred.py b/parlai/agents/hred/hred.py new 
import logging

import numpy

logger = logging.getLogger(__name__)

# This is the list of strings required to ignore, if we're going to take a pretrained HRED model
# and fine-tune it as a variational model.
# parameter_strings_to_ignore = ["latent_utterance_prior", "latent_utterance_approx_posterior", "Wd_", "bd_"]


class Model(object):
    """Base class for Theano models.

    Holds the list of model parameters (Theano shared variables) and
    provides save/load of their values via numpy ``.npz`` archives,
    keyed by each parameter's ``name``.
    """

    def __init__(self):
        # Imported lazily so the module can be imported (and parameter
        # archives inspected) on machines without Theano installed.
        import theano

        self.floatX = theano.config.floatX
        # Parameters of the model
        self.params = []

    def save(self, filename):
        """
        Save the model to file `filename` as a numpy ``.npz`` archive,
        with one entry per parameter, keyed by the parameter's name.
        """
        vals = dict([(x.name, x.get_value()) for x in self.params])
        numpy.savez(filename, **vals)

    def load(self, filename, parameter_strings_to_ignore=None):
        """
        Load the model from file `filename`.

        Any parameter which has one of the strings inside parameter_strings_to_ignore
        as a substring will not be loaded from the file (but instead keeps its fresh
        initialization, which usually means random).

        Raises Exception on a shape mismatch between a stored value and the
        current parameter; logs (but does not fail on) missing or unknown entries.
        """
        # Fixed: the default used to be a mutable list ([]), which is shared
        # across calls; use the None sentinel instead.
        if parameter_strings_to_ignore is None:
            parameter_strings_to_ignore = []

        vals = numpy.load(filename)
        for p in self.params:
            load_parameter = True
            for string_to_ignore in parameter_strings_to_ignore:
                if string_to_ignore in p.name:
                    logger.debug('Initializing parameter {} as in new model'.format(p.name))
                    load_parameter = False

            if load_parameter:
                if p.name in vals:
                    logger.debug('Loading {} of {}'.format(p.name, p.get_value(borrow=True).shape))
                    if p.get_value().shape != vals[p.name].shape:
                        raise Exception('Shape mismatch: {} != {} for {}'.format(p.get_value().shape, vals[p.name].shape, p.name))
                    p.set_value(vals[p.name])
                else:
                    logger.error('No parameter {} given: default initialization used'.format(p.name))
        unknown = set(vals.keys()) - {p.name for p in self.params}
        if len(unknown):
            logger.error('Unknown parameters {} given'.format(unknown))
def parse_args():
    """Build and evaluate the command-line interface of the sampling script.

    Returns the parsed ``argparse.Namespace``. Note that ``--ignore-unk``
    uses ``store_false``: ``args.ignore_unk`` defaults to True and passing
    the flag stores False.
    """
    arg_parser = argparse.ArgumentParser("Sample (with beam-search) from the session model")

    # Required positional arguments (their relative order defines the CLI).
    arg_parser.add_argument("model_prefix",
                            help="Path to the model prefix (without _model.npz or _state.pkl)")
    arg_parser.add_argument("context", help="File of input contexts")
    arg_parser.add_argument("output", help="Output file")

    # Optional flags.
    arg_parser.add_argument("--ignore-unk", action="store_false",
                            help="Disables the generation of unknown words ( tokens)")
    arg_parser.add_argument("--beam_search", action="store_true",
                            help="Use beam search instead of random search")
    arg_parser.add_argument("--n-samples", default="1", type=int,
                            help="Number of samples")
    arg_parser.add_argument("--n-turns", default=1, type=int,
                            help="Number of dialog turns to generate")
    arg_parser.add_argument("--verbose", action="store_true", default=False,
                            help="Be verbose")

    # Trailing optional positional: ad-hoc overrides for the model state.
    arg_parser.add_argument("changes", nargs="?", default="", help="Changes to state")

    return arg_parser.parse_args()
def main():
    """Load a trained dialog model and sample responses for each input context.

    Reads contexts from ``args.context`` (one per line), samples with either
    random search or beam search, and writes the tab-joined samples for each
    context as one line of ``args.output``.
    """
    # Local import: the module header still imports py2 `cPickle`; under
    # Python 3 the stdlib `pickle` module is the replacement.
    import pickle

    args = parse_args()
    state = prototype_state()

    state_path = args.model_prefix + "_state.pkl"
    model_path = args.model_prefix + "_model.npz"

    # Pickled data must be read in binary mode under Python 3
    # (the original opened the file in text mode).
    with open(state_path, 'rb') as src:
        state.update(pickle.load(src))

    logging.basicConfig(level=getattr(logging, state['level']), format="%(asctime)s: %(name)s: %(levelname)s: %(message)s")

    # Sampling only: no training graph is needed.
    state['compute_training_updates'] = False

    model = DialogEncoderDecoder(state)

    sampler = search.RandomSampler(model)
    if args.beam_search:
        sampler = search.BeamSampler(model)

    if os.path.isfile(model_path):
        logger.debug("Loading previous model")
        model.load(model_path)
    else:
        raise Exception("Must specify a valid model path")

    # One context per line; an empty file means a single empty context.
    contexts = [[]]
    with open(args.context, "r") as context_file:  # was leaked (never closed)
        lines = context_file.readlines()
    if len(lines):
        contexts = [x.strip() for x in lines]

    print('Sampling started...')
    context_samples, context_costs = sampler.sample(contexts,
                                                    n_samples=args.n_samples,
                                                    n_turns=args.n_turns,
                                                    ignore_unk=args.ignore_unk,
                                                    verbose=args.verbose)
    print('Sampling finished.')
    print('Saving to file...')

    # Write to output file.
    # (`print >> handle` was Python 2 syntax and is a SyntaxError under Python 3.)
    with open(args.output, "w") as output_handle:
        for context_sample in context_samples:
            print('\t'.join(context_sample), file=output_handle)
    print('Saving to file finished.')
    print('All done!')
def sample_wrapper(sample_logic):
    """Decorator for Sampler.sample implementations.

    Converts each raw context string into a list of token ids (framed by
    end-of-utterance symbols), invokes the wrapped sampling logic, and
    converts the returned id sequences back into strings.

    Returns (context_samples, context_costs): per-context lists of sampled
    strings and of their costs.
    """
    def sample_apply(*args, **kwargs):
        sampler = args[0]
        contexts = args[1]

        verbose = kwargs.get('verbose', False)

        if verbose:
            logger.info("Starting {} : {} start sequences in total".format(sampler.name, len(contexts)))

        context_samples = []
        context_costs = []

        # Start loop for each utterance
        for context_id, context_utterances in enumerate(contexts):
            if verbose:
                logger.info("Searching for {}".format(context_utterances))

            # Convert contextes into list of ids
            joined_context = []
            if len(context_utterances) == 0:
                joined_context = [sampler.model.eos_sym]
            else:
                utterance_ids = sampler.model.words_to_indices(context_utterances.split())
                # Add eos tokens
                if len(utterance_ids) > 0:
                    if not utterance_ids[0] == sampler.model.eos_sym:
                        utterance_ids = [sampler.model.eos_sym] + utterance_ids
                    if not utterance_ids[-1] == sampler.model.eos_sym:
                        utterance_ids += [sampler.model.eos_sym]
                else:
                    utterance_ids = [sampler.model.eos_sym]

                joined_context += utterance_ids

            samples, costs = sample_logic(sampler, joined_context, **kwargs)

            # Convert indices back to lists of words.
            # Fixed: the original used py2 `map`, whose lazy py3 equivalent
            # breaks len()/indexing below and leaks map objects into the
            # returned samples; materialize as lists instead.
            converted_samples = [sampler.model.indices_to_words(sample, exclude_end_sym=kwargs.get('n_turns', 1) == 1)
                                 for sample in samples]
            # Join the list of words
            converted_samples = [' '.join(sample) for sample in converted_samples]

            if verbose:
                for i in range(len(converted_samples)):
                    # .encode('utf-8') dropped: under py3 it would log b'...' bytes
                    logger.info("Samples {}: {}".format(costs[i], converted_samples[i]))

            context_samples.append(converted_samples)
            context_costs.append(costs)

        return context_samples, context_costs
    return sample_apply
    def select_next_words(self, next_probs, step_num, how_many):
        # Abstract hook: subclasses (RandomSampler / BeamSampler) choose the
        # next word for each live beam and return the selected indices + costs.
        pass

    def count_n_turns(self, utterance):
        # Number of completed turns = number of end-of-utterance tokens.
        return len([w for w in utterance
                    if w == self.model.eos_sym])

    @sample_wrapper
    def sample(self, *args, **kwargs):
        # Core sampling loop. Receives a single context (list of token ids,
        # already framed with eos tokens by sample_wrapper) and returns up to
        # n_samples generated sequences together with their costs.
        context = args[0]

        # Truncate overly long contexts, keeping the most recent tokens.
        max_context_length = kwargs.get('max_context_length', 400)
        if len(context) > max_context_length:
            context = context[-max_context_length:]

        n_samples = kwargs.get('n_samples', 1)
        ignore_unk = kwargs.get('ignore_unk', True)
        min_length = kwargs.get('min_length', 1)
        max_length = kwargs.get('max_length', 30)
        beam_diversity = kwargs.get('beam_diversity', 1)  # NOTE(review): read but never used below
        normalize_by_length = kwargs.get('normalize_by_length', True)
        verbose = kwargs.get('verbose', False)
        n_turns = kwargs.get('n_turns', 1)

        # Lazily compile the Theano functions on first use.
        if not self.compiled:
            self.compile()

        # Convert to matrix, each column is a copy of the context
        # (one column per sample), e.g. [[1,1,1],[4,4,4],[2,2,2]].
        context = numpy.repeat(numpy.array(context, dtype='int32')[:,None],
                               n_samples, axis=1)

        if context[-1, 0] != self.model.eos_sym:
            raise Exception('Last token of context, when present,'
                            'should be the end of utterance: %d' % self.model.eos_sym)

        # Generate the reversed context (used by the backward encoder).
        reversed_context = self.model.reverse_utterances(context)

        # Size of the encoder summary fed to the decoder depends on whether
        # the utterance encoders are connected directly and bidirectional.
        if self.model.direct_connection_between_encoders_and_decoder:
            if self.model.bidirectional_utterance_encoder:
                dialog_enc_size = self.model.sdim+self.model.qdim_encoder*2
            else:
                dialog_enc_size = self.model.sdim+self.model.qdim_encoder
        else:
            dialog_enc_size = self.model.sdim

        # prev_hs: dialogue-encoder state per live beam; prev_hd: decoder state.
        prev_hs = numpy.zeros((n_samples, dialog_enc_size), dtype='float32')
        prev_hd = numpy.zeros((n_samples, self.model.utterance_decoder.complete_hidden_state_size), dtype='float32')

        # When the decoder state is NOT reset at end-of-utterance, warm it up
        # by running the decoder over the whole (padded) context first.
        if not self.model.reset_utterance_decoder_at_end_of_utterance:
            assert self.model.bs >= context.shape[1]
            enlarged_context = numpy.zeros((context.shape[0], self.model.bs), dtype='int32')
            enlarged_context[:, 0:context.shape[1]] = context[:]
            enlarged_reversed_context = numpy.zeros((context.shape[0], self.model.bs), dtype='int32')
            enlarged_reversed_context[:, 0:context.shape[1]] = reversed_context[:]

            # Random draws for the latent (Gaussian / piecewise) variables.
            ran_gaussian_vector = self.model.rng.normal(size=(context.shape[0],n_samples,self.model.latent_gaussian_per_utterance_dim)).astype('float32')
            ran_uniform_vector = self.model.rng.uniform(low=0.0, high=1.0, size=(context.shape[0],n_samples,self.model.latent_piecewise_per_utterance_dim)).astype('float32')

            zero_mask = numpy.zeros((context.shape[0], self.model.bs), dtype='float32')
            zero_vector = numpy.zeros((self.model.bs), dtype='float32')
            # NOTE(review): despite the name, this is initialized with zeros — confirm intent.
            ones_mask = numpy.zeros((context.shape[0], self.model.bs), dtype='float32')

            # Computes new utterance decoder hidden states (including intermediate
            # utterance encoder and dialogue encoder hidden states).
            new_hd = self.compute_decoder_encoding(enlarged_context, enlarged_reversed_context, self.max_len, zero_mask, zero_vector, ran_gaussian_vector, ran_uniform_vector, ones_mask)

            prev_hd[:] = new_hd[0][-1][0:context.shape[1], :]

        # fin_gen/fin_costs: finished hypotheses; gen/costs: live beams.
        fin_gen = []
        fin_costs = []

        gen = [[] for i in range(n_samples)]
        costs = [0. for i in range(n_samples)]
        beam_empty = False

        # Random draws for the latent variables, one row per live beam.
        ran_gaussian_vectors = self.model.rng.normal(size=(n_samples,self.model.latent_gaussian_per_utterance_dim)).astype('float32')
        ran_uniform_vectors = self.model.rng.uniform(low=0.0, high=1.0, size=(n_samples,self.model.latent_piecewise_per_utterance_dim)).astype('float32')

        # HACK
        #ran_uniform_vectors = numpy.greater(ran_uniform_vectors, 0.5).astype('float32')

        for k in range(max_length):
            # Stop when enough hypotheses finished or the beam died out.
            if len(fin_gen) >= n_samples or beam_empty:
                break

            if verbose:
                logger.info("{} : sampling step {}, beams alive {}".format(self.name, k, len(gen)))

            # Here we aggregate the context and recompute the hidden state
            # at both session level and query level.
            # Stack only when we sampled something.
            if k > 0:
                context = numpy.vstack([context,
                                        numpy.array(map(lambda g: g[-1], gen))]).astype('int32')
                reversed_context = numpy.copy(context)
                # Re-reverse each utterance segment (between eos tokens) per column.
                for idx in range(context.shape[1]):
                    eos_indices = numpy.where(context[:, idx] == self.model.eos_sym)[0]
                    prev_eos_index = -1
                    for eos_index in eos_indices:
                        reversed_context[(prev_eos_index+2):eos_index, idx] = (reversed_context[(prev_eos_index+2):eos_index, idx])[::-1]
                        prev_eos_index = eos_index

            prev_words = context[-1, :]

            # Recompute encoder states, hs and random variables
            # only for those particular beams that just emitted end-of-utterance.
            indx_update_hs = [num for num, prev_word in enumerate(prev_words)
                              if prev_word == self.model.eos_sym]

            if len(indx_update_hs):
                encoder_states = self.compute_encoding(context[:, indx_update_hs], reversed_context[:, indx_update_hs], self.max_len)
                prev_hs[indx_update_hs] = encoder_states[1][-1]
                # Resample the latent variables for the beams that start a new utterance.
                ran_gaussian_vectors[indx_update_hs,:] = self.model.rng.normal(size=(len(indx_update_hs),self.model.latent_gaussian_per_utterance_dim)).astype('float32')
                ran_uniform_vectors[indx_update_hs,:] = self.model.rng.uniform(low=0.0, high=1.0, size=(len(indx_update_hs),self.model.latent_piecewise_per_utterance_dim)).astype('float32')

            # HACK
            #ran_uniform_vectors = numpy.greater(ran_uniform_vectors, 0.5).astype('float32')

            # Predict the next-word distribution and new decoder states.
            next_probs, new_hd = self.next_probs_predictor(prev_hs, prev_hd, prev_words, context, ran_gaussian_vectors, ran_uniform_vectors)

            assert next_probs.shape[1] == self.model.idim

            # Adjust probabilities according to search restrictions:
            # optionally forbid <unk>, and forbid ending too early.
            if ignore_unk:
                next_probs[:, self.model.unk_sym] = 0
            if k <= min_length:
                next_probs[:, self.model.eos_sym] = 0
                next_probs[:, self.model.eod_sym] = 0

            # Accumulate negative log-likelihood costs per (beam, word).
            next_costs = numpy.array(costs)[:, None] - numpy.log(next_probs)

            # Select next words here (delegated to the concrete sampler).
            (beam_indx, word_indx), costs = self.select_next_words(next_costs, next_probs, k, n_samples)

            # Update the stacks of live/finished hypotheses.
            new_gen = []
            new_costs = []
            new_sources = []

            for num, (beam_ind, word_ind, cost) in enumerate(zip(beam_indx, word_indx, costs)):
                # NOTE(review): '>' admits n_samples+1 live beams; '>=' may be intended.
                if len(new_gen) > n_samples:
                    break

                hypothesis = gen[beam_ind] + [word_ind]

                # End of utterance has been detected: hypothesis is finished.
                n_turns_hypothesis = self.count_n_turns(hypothesis)
                if n_turns_hypothesis == n_turns:
                    if verbose:
                        logger.debug("adding utterance {} from beam {}".format(hypothesis, beam_ind))

                    # We finished sampling
                    fin_gen.append(hypothesis)
                    fin_costs.append(cost)
                elif self.model.eod_sym in hypothesis: # End of dialogue detected
                    # Truncate the hypothesis right after the end-of-dialogue token.
                    new_hypothesis = []
                    for wrd in hypothesis:
                        new_hypothesis += [wrd]
                        if wrd == self.model.eod_sym:
                            break
                    hypothesis = new_hypothesis

                    if verbose:
                        logger.debug("adding utterance {} from beam {}".format(hypothesis, beam_ind))

                    # We finished sampling
                    fin_gen.append(hypothesis)
                    fin_costs.append(cost)
                else:
                    # Hypothesis recombination: drop hypotheses whose last
                    # hyp_rec tokens duplicate an already-kept hypothesis.
                    # TODO: pick the one with lowest cost
                    has_similar = False
                    if self.hyp_rec > 0:
                        has_similar = len([g for g in new_gen if
                                           g[-self.hyp_rec:] == hypothesis[-self.hyp_rec:]]) != 0

                    if not has_similar:
                        new_sources.append(beam_ind)
                        new_gen.append(hypothesis)
                        new_costs.append(cost)

            if verbose:
                # NOTE(review): the loop variable shadows 'gen'; harmless only
                # because 'gen' is reassigned to new_gen just below.
                for gen in new_gen:
                    logger.debug("partial -> {}".format(' '.join(self.model.indices_to_words(gen))))

            # Re-index all per-beam state by the surviving source beams.
            prev_hd = new_hd[new_sources]
            prev_hs = prev_hs[new_sources]
            ran_gaussian_vectors = ran_gaussian_vectors[new_sources,:]
            ran_uniform_vectors = ran_uniform_vectors[new_sources,:]
            context = context[:, new_sources]
            reversed_context = reversed_context[:, new_sources]
            gen = new_gen
            costs = new_costs
            beam_empty = len(gen) == 0

        # If we have not sampled anything
        # then force include the (unfinished) live hypotheses.
        if len(fin_gen) == 0:
            fin_gen = gen
            fin_costs = costs

        # Normalize costs by hypothesis length, if requested.
        if normalize_by_length:
            fin_costs = [(fin_costs[num]/len(fin_gen[num]))
                         for num in range(len(fin_gen))]

        # Sort hypotheses by (normalized) cost and keep the best n_samples.
        fin_gen = numpy.array(fin_gen)[numpy.argsort(fin_costs)]
        fin_costs = numpy.array(sorted(fin_costs))
        return fin_gen[:n_samples], fin_costs[:n_samples]

class RandomSampler(Sampler):
    # Stochastic sampler: draws the next word of each beam from the model's
    # predicted distribution. No hypothesis recombination (hyp_rec = 0).
    def __init__(self, model):
        Sampler.__init__(self, model)
        self.name = 'RandomSampler'
        self.hyp_rec = 0

    def select_next_words(self, next_costs, next_probs, step_num, how_many):
        # Choice is complaining (numpy.random.choice requires float64 probabilities)
        next_probs = next_probs.astype("float64")
        # Draw one word per beam, proportionally to its (renormalized) probability.
        word_indx = numpy.array([self.model.rng.choice(self.model.idim, p = x/numpy.sum(x))
                                 for x in next_probs], dtype='int32')
        beam_indx = range(next_probs.shape[0])

        # Look up the cost of each sampled (beam, word) pair in the flat cost matrix.
        args = numpy.ravel_multi_index(numpy.array([beam_indx, word_indx]), next_costs.shape)
        return (beam_indx, word_indx), next_costs.flatten()[args]

class BeamSampler(Sampler):
    # Deterministic beam search: keeps the 'how_many' globally cheapest
    # (beam, word) extensions. Recombines hypotheses on the last 3 tokens.
    def __init__(self, model):
        Sampler.__init__(self, model)
        self.name = 'BeamSampler'
        self.hyp_rec = 3

    def select_next_words(self, next_costs, next_probs, step_num, how_many):
        # Pick only on the first line (for the beginning of sampling)
        # This will avoid duplicate tokens, since all beams are identical at step 0.
        if step_num == 0:
            flat_next_costs = next_costs[:1, :].flatten()
        else:
            # Set the next cost to infinite for finished utterances (they will be replaced)
            # by other utterances in the beam
            flat_next_costs = next_costs.flatten()

        voc_size = next_costs.shape[1]  # NOTE(review): unused

        # Partial sort: the 'how_many' cheapest entries, then fully sort those.
        args = numpy.argpartition(flat_next_costs, how_many)[:how_many]
        args = args[numpy.argsort(flat_next_costs[args])]

        return numpy.unravel_index(args, next_costs.shape), flat_next_costs[args]
def prototype_state():
    """Build and return the default configuration dictionary.

    Every other prototype starts from this dict and overrides a subset of
    its entries. Keys cover special token ids, model architecture switches,
    hidden-layer sizes, latent-variable options and training hyperparameters.
    """
    state = {}

    # ----- CONSTANTS -----
    state.update({
        'seed': 1234,             # random seed
        'level': 'DEBUG',         # logging level
        'oov': '',                # out-of-vocabulary token string
        'end_sym_utterance': '',  # end-of-sequence mark
    })

    # Special tokens need to be defined here, because the model architecture
    # may adapt depending on these ids.
    state.update({
        'unk_sym': 0,             # unknown word token
        'eos_sym': 1,             # end-of-utterance symbol
        'eod_sym': 2,             # end-of-dialogue symbol
        'first_speaker_sym': 3,   # first speaker symbol
        'second_speaker_sym': 4,  # second speaker symbol
        'third_speaker_sym': 5,   # third speaker symbol
        'minor_speaker_sym': 6,   # minor speaker symbol
        'voice_over_sym': 7,      # voice over symbol
        'off_screen_sym': 8,      # off screen symbol
        'pause_sym': 9,           # pause symbol
    })

    # ----- MODEL ARCHITECTURE -----
    state.update({
        # Reset all RNN hidden states between 'max_grad_steps' time steps.
        'reset_hidden_states_between_subsequences': False,
        # Maxout on the decoder output unit (requires qdim_decoder = 2x rankdim).
        'maxout_out': False,
        # One-layer linear MLP on the decoder state before the word distribution.
        'deep_utterance_decoder_out': True,
        # Extra MLP between utterance and dialogue encoder.
        'deep_dialogue_encoder_input': False,
        # Recommended activation is tanh for all three RNNs.
        'sent_rec_activation': 'lambda x: T.tanh(x)',
        'dialogue_rec_activation': 'lambda x: T.tanh(x)',
        # How encoders feed the decoder: 'first', 'all' or 'selective'.
        # Experiments show that 'all' is most effective.
        'decoder_bias_type': 'all',
        # Gating functions: encoders support 'None'/'GRU'; the decoder also
        # supports 'BOW' (bag of words) and 'LSTM'.
        'utterance_encoder_gating': 'GRU',
        'dialogue_encoder_gating': 'GRU',
        'utterance_decoder_gating': 'GRU',
        # Forward-only vs forward+backward utterance encoders.
        'bidirectional_utterance_encoder': False,
        # Direct connection between utterance encoder and decoder RNNs,
        # optionally through an extra MLP.
        'direct_connection_between_encoders_and_decoder': False,
        'deep_direct_connection': False,
        # With a direct connection, optionally drop the dialogue (context) encoder.
        'disable_dialogue_encoder': False,
        # Collapse the model to a standard RNN (zero encoder input, decoder
        # never reset, hidden state initialized to zero).
        'collaps_to_standard_rnn': False,
        # Reset decoder/encoder state after each end-of-utterance token.
        'reset_utterance_decoder_at_end_of_utterance': True,
        'reset_utterance_encoder_at_end_of_utterance': False,
    })

    # ----- HIDDEN LAYER DIMENSIONS -----
    state.update({
        'qdim_encoder': 512,   # word-level utterance encoder state
        'qdim_decoder': 512,   # word-level utterance decoder state
        'sdim': 1000,          # utterance-level context encoder state
        'rankdim': 256,        # low-rank word embedding approximation
    })

    # ----- LATENT VARIABLES WITH VARIATIONAL LEARNING -----
    # Gaussian latent variable per utterance (variational lower bound
    # training, cf. Kingma et al. 2013).
    state.update({
        'add_latent_gaussian_per_utterance': False,
        'condition_latent_variable_on_dialogue_encoder': False,
        # Condition the posterior on the DCGM (mean-pooling) encoder instead
        # of the utterance encoder RNN.
        'condition_posterior_latent_variable_on_dcgm_encoder': False,
        'latent_gaussian_per_utterance_dim': 10,
        # Scaling/clamping of the diagonal covariance; a high scale keeps the
        # KL divergence low at the beginning of training.
        'scale_latent_gaussian_variable_variances': 10,
        'min_latent_gaussian_variable_variances': 0.01,
        'max_latent_gaussian_variable_variances': 10.0,
        # One-layer MLP on the input of the Gaussian prior/posterior.
        'deep_latent_gaussian_variable_conditioning': True,
        # Condition the decoder ONLY on the Gaussian latent variable.
        'condition_decoder_only_on_latent_variable': False,
    })

    # Piecewise latent variable per utterance.
    state.update({
        'add_latent_piecewise_per_utterance': False,
        # Interpolate posterior with prior through a linear gate.
        'gate_latent_piecewise_per_utterance': True,
        'latent_piecewise_alpha_variables': 5,
        # Alpha scaling: e.g. scale 10 makes the KL cost ~10% of the total
        # initially, scale 1 makes it ~1%.
        'scale_latent_piecewise_variable_alpha_use_softplus': True,
        'scale_latent_piecewise_variable_prior_alpha': 1.0,
        'scale_latent_piecewise_variable_posterior_alpha': 1.0,
        'latent_piecewise_per_utterance_dim': 10,
        # Gaussian convolution tying of the alpha values (helped only
        # slightly in initial experiments).
        'latent_piecewise_variable_alpha_parameter_tying': False,
        'latent_piecewise_variable_alpha_parameter_tying_beta': 1.0,
        # One-layer MLP on the input of the piecewise prior/posterior.
        'deep_latent_piecewise_variable_conditioning': True,
    })

    # Rectified-linear MLP on the decoder input (also normalizes the decoder
    # RNN inputs when batch/layer normalization is enabled).
    state['deep_utterance_decoder_input'] = True

    # KL-divergence annealing: weight grows from zero to kl_divergence_max_weight,
    # increased by kl_divergence_annealing_rate per training batch.
    state.update({
        'train_latent_variables_with_kl_divergence_annealing': False,
        'kl_divergence_annealing_rate': 1.0/60000.0,
        'kl_divergence_max_weight': 1.0,
    })

    # Randomly replace previous-token inputs to the decoder with 'unk';
    # a keep-rate of zero disables teacher forcing entirely.
    state.update({
        'decoder_drop_previous_input_tokens': False,
        'decoder_drop_previous_input_tokens_rate': 0.75,
    })

    # Mean-field inference with SGD at test time (didn't help much).
    state['apply_meanfield_inference'] = False

    # Pretrained word embeddings.
    state.update({
        'initialize_from_pretrained_word_embeddings': False,
        'pretrained_word_embeddings_file': '',
        'fix_pretrained_word_embeddings': False,
    })

    # Freeze encoder RNNs and word embeddings (not applicable when
    # 'collaps_to_standard_rnn' is on).
    state['fix_encoder_parameters'] = False

    # Skip-utterance options: condition on one utterance and predict the
    # next or previous one; 'predict_both' doubles the batch size.
    state.update({
        'do_generate_first_utterance': True,
        'skip_utterance': False,
        'skip_utterance_predict_both': False,
    })

    # ----- TRAINING PROCEDURE -----
    state.update({
        'updater': 'adam',         # optimization algorithm
        'use_nce': False,          # noise-contrastive estimation (faster, worse)
        'cutoff': 0.01,            # gradient clipping threshold
        'lr': 0.0002,              # learning rate
        'patience': 20,            # early stopping patience
        'cost_threshold': 1.003,
        'bs': 80,                  # batch size (reduce if out of memory)
        'sort_k_batches': 20,      # sort-by-length group size
        'max_grad_steps': 80,      # maximum subsequence length for BPTT
        'save_dir': './',          # overridden in each prototype
        'train_freq': 10,          # training error report frequency (batches)
        'valid_freq': 5000,        # validation frequency
        'loop_iters': 3000000,     # number of batches to process
        'time_stop': 24*60*31,     # maximum number of minutes to run
        'minerr': -1,              # error level to stop at
        'max_len': -1,             # maximum dialogue length
    })

    # Normalization of encoder hidden states: 'NONE', 'BN' (batch norm) or
    # 'LN' (layer norm); only applies to GRU encoders and feed-forward nets.
    state['normop_type'] = 'LN'

    if state['normop_type'] == 'BN':
        state.update({
            'normop_gamma_init': 0.1,
            'normop_gamma_min': 0.05,
            'normop_gamma_max': 10.0,
            'normop_moving_average_const': 0.99,
            'normop_max_enc_seq': 50,
        })
    else:
        state.update({
            'normop_gamma_init': 1.0,
            'normop_gamma_min': 0.05,
            'normop_gamma_max': 10.0,
            'normop_moving_average_const': 0.99,
            'normop_max_enc_seq': 1,
        })

    # Training data iterator: first offset position and number of initial reshuffles.
    state.update({
        'train_iterator_offset': 0,
        'train_iterator_reshuffle_count': 1,
    })

    return state



def prototype_test():
    """Tiny configuration used by the unit tests (plain HRED variant)."""
    state = prototype_state()

    # Paths to the tiny test corpus.
    state.update({
        'train_dialogues': "./tests/data/ttrain.dialogues.pkl",
        'test_dialogues': "./tests/data/ttest.dialogues.pkl",
        'valid_dialogues': "./tests/data/tvalid.dialogues.pkl",
        'dictionary': "./tests/data/ttrain.dict.pkl",
        'save_dir': "./tests/models/",
    })

    state['max_grad_steps'] = 20

    # Pretrained word embeddings (using this requires rankdim = 10).
    state.update({
        'initialize_from_pretrained_word_embeddings': False,
        'pretrained_word_embeddings_file': './tests/data/MT_WordEmb.pkl',
        'fix_pretrained_word_embeddings': False,
    })

    state['valid_freq'] = 50
    state['prefix'] = "testmodel_"
    state['updater'] = 'adam'

    state.update({
        'maxout_out': False,
        'deep_utterance_decoder_out': True,
        'deep_dialogue_encoder_input': True,
        'utterance_encoder_gating': 'GRU',
        'dialogue_encoder_gating': 'GRU',
        'utterance_decoder_gating': 'GRU',
        'bidirectional_utterance_encoder': True,
        'direct_connection_between_encoders_and_decoder': True,
    })

    state.update({
        'bs': 5,
        'sort_k_batches': 1,
        'use_nce': False,
        'decoder_bias_type': 'all',
    })

    # Tiny hidden dimensions so the tests run fast.
    state.update({
        'qdim_encoder': 15,
        'qdim_decoder': 5,
        'sdim': 10,
        'rankdim': 10,
    })

    return state
def prototype_test_variational():
    """Tiny configuration used by the unit tests (latent-variable variant)."""
    state = prototype_state()

    # Paths to the tiny test corpus.
    state.update({
        'train_dialogues': "./tests/data/ttrain.dialogues.pkl",
        'test_dialogues': "./tests/data/ttest.dialogues.pkl",
        'valid_dialogues': "./tests/data/tvalid.dialogues.pkl",
        'dictionary': "./tests/data/ttrain.dict.pkl",
        'save_dir': "./tests/models/",
    })

    state['max_grad_steps'] = 20

    # Pretrained word embeddings (using this requires rankdim = 10).
    state.update({
        'initialize_from_pretrained_word_embeddings': True,
        'pretrained_word_embeddings_file': './tests/data/MT_WordEmb.pkl',
    })

    state['valid_freq'] = 5
    state['prefix'] = "testmodel_"
    state['updater'] = 'adam'

    state.update({
        'maxout_out': False,
        'deep_utterance_decoder_out': True,
        'deep_dialogue_encoder_input': True,
        'direct_connection_between_encoders_and_decoder': False,
        'deep_direct_connection': False,
        'utterance_encoder_gating': 'GRU',
        'dialogue_encoder_gating': 'GRU',
        'utterance_decoder_gating': 'LSTM',
        'bidirectional_utterance_encoder': False,
    })

    # Latent-variable settings: piecewise variable on, Gaussian off,
    # KL-divergence annealing enabled with a capped weight.
    state.update({
        'add_latent_gaussian_per_utterance': False,
        'latent_gaussian_per_utterance_dim': 5,
        'condition_latent_variable_on_dialogue_encoder': True,
        'condition_posterior_latent_variable_on_dcgm_encoder': False,
        'train_latent_variables_with_kl_divergence_annealing': True,
        'kl_divergence_annealing_rate': 1.0/60000.0,
        'kl_divergence_max_weight': 0.5,
        'add_latent_piecewise_per_utterance': True,
        'latent_piecewise_per_utterance_dim': 10,
        'gate_latent_piecewise_per_utterance': False,
        'decoder_drop_previous_input_tokens': True,
        'decoder_drop_previous_input_tokens_rate': 0.75,
    })

    # KL max-trick (kept for reference):
    #state['train_latent_variables_with_kl_divergence_annealing'] = False
    #state['max_kl_percentage'] = 0.01

    state.update({
        'bs': 5,
        'sort_k_batches': 1,
        'use_nce': False,
        'decoder_bias_type': 'all',
    })

    # Tiny hidden dimensions so the tests run fast.
    state.update({
        'qdim_encoder': 15,
        'qdim_decoder': 5,
        'sdim': 10,
        'rankdim': 10,
    })

    return state
###
### Twitter - Hyperparameter search for HRED:
###
# sdim = {500, 1000}
# qdim_encoder = {1000}
# qdim_decoder = {1000, 2000, 4000}
# rankdim = 400
# bidirectional_utterance_encoder = True
# reset_utterance_encoder_at_end_of_utterance = False
# reset_utterance_decoder_at_end_of_utterance = True
# lr = 0.0002
# bs = 80
# normop_type = 'LN'

def prototype_twitter_HRED_NormOp_ClusterExp1():
    """Twitter HRED cluster search, point 1: qdim_decoder=1000, sdim=500."""
    state = prototype_state()

    # Fill your paths here!
    state.update({
        'train_dialogues': "../TwitterDataBPE/Train.dialogues.pkl",
        'test_dialogues': "../TwitterDataBPE/Test.dialogues.pkl",
        'valid_dialogues': "../TwitterDataBPE/Valid.dialogues.pkl",
        'dictionary': "../TwitterDataBPE/Dataset.dict.pkl",
        'save_dir': "Output",
    })

    state.update({
        'max_grad_steps': 80,
        'valid_freq': 2500,
        'prefix': "TwitterModel_",
        'updater': 'adam',
        'patience': 20,
    })

    # Architecture for this search point.
    state.update({
        'bidirectional_utterance_encoder': True,
        'deep_dialogue_encoder_input': False,
        'deep_utterance_decoder_out': True,
        'bs': 80,
        'decoder_bias_type': 'all',  # 'first', 'all' or 'selective'
        'direct_connection_between_encoders_and_decoder': True,
        'deep_direct_connection': False,
        'utterance_decoder_gating': 'LSTM',
        'qdim_encoder': 1000,
        'qdim_decoder': 1000,
        'sdim': 500,
        'rankdim': 400,
    })

    # Latent variables are disabled for this (plain HRED) experiment.
    state.update({
        'add_latent_gaussian_per_utterance': False,
        'latent_gaussian_per_utterance_dim': 100,
        'scale_latent_gaussian_variable_variances': 0.1,
        'add_latent_piecewise_per_utterance': False,
        'latent_piecewise_per_utterance_dim': 100,
        'latent_piecewise_alpha_variables': 3,
        'scale_latent_piecewise_variable_alpha_use_softplus': False,
        'scale_latent_piecewise_variable_prior_alpha': 1.0,
        'scale_latent_piecewise_variable_posterior_alpha': 1.0,
        'condition_latent_variable_on_dialogue_encoder': False,
        'train_latent_variables_with_kl_divergence_annealing': False,
        'kl_divergence_annealing_rate': 1.0/60000.0,
        'decoder_drop_previous_input_tokens': False,
        'decoder_drop_previous_input_tokens_rate': 0.75,
    })

    return state



def prototype_twitter_HRED_NormOp_ClusterExp2():
    """Twitter HRED cluster search, point 2: qdim_decoder=1000, sdim=1000."""
    state = prototype_state()

    # Fill your paths here!
    state.update({
        'train_dialogues': "../TwitterDataBPE/Train.dialogues.pkl",
        'test_dialogues': "../TwitterDataBPE/Test.dialogues.pkl",
        'valid_dialogues': "../TwitterDataBPE/Valid.dialogues.pkl",
        'dictionary': "../TwitterDataBPE/Dataset.dict.pkl",
        'save_dir': "Output",
    })

    state.update({
        'max_grad_steps': 80,
        'valid_freq': 2500,
        'prefix': "TwitterModel_",
        'updater': 'adam',
        'patience': 20,
    })

    # Architecture for this search point.
    state.update({
        'bidirectional_utterance_encoder': True,
        'deep_dialogue_encoder_input': False,
        'deep_utterance_decoder_out': True,
        'bs': 80,
        'decoder_bias_type': 'all',  # 'first', 'all' or 'selective'
        'direct_connection_between_encoders_and_decoder': True,
        'deep_direct_connection': False,
        'utterance_decoder_gating': 'LSTM',
        'qdim_encoder': 1000,
        'qdim_decoder': 1000,
        'sdim': 1000,
        'rankdim': 400,
    })

    # Latent variables are disabled for this (plain HRED) experiment.
    state.update({
        'add_latent_gaussian_per_utterance': False,
        'latent_gaussian_per_utterance_dim': 100,
        'scale_latent_gaussian_variable_variances': 0.1,
        'add_latent_piecewise_per_utterance': False,
        'latent_piecewise_per_utterance_dim': 100,
        'latent_piecewise_alpha_variables': 3,
        'scale_latent_piecewise_variable_alpha_use_softplus': False,
        'scale_latent_piecewise_variable_prior_alpha': 1.0,
        'scale_latent_piecewise_variable_posterior_alpha': 1.0,
        'condition_latent_variable_on_dialogue_encoder': False,
        'train_latent_variables_with_kl_divergence_annealing': False,
        'kl_divergence_annealing_rate': 1.0/60000.0,
        'decoder_drop_previous_input_tokens': False,
        'decoder_drop_previous_input_tokens_rate': 0.75,
    })

    return state
state['patience'] = 20 + + return state + + + +def prototype_twitter_HRED_NormOp_ClusterExp3(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = False + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = False + state['train_latent_variables_with_kl_divergence_annealing'] = False + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = False + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['patience'] = 20 + + return state + + + +def 
prototype_twitter_HRED_NormOp_ClusterExp4(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 4000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = False + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = False + state['train_latent_variables_with_kl_divergence_annealing'] = False + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = False + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['patience'] = 20 + + return state + + + +def prototype_twitter_HRED_NormOp_ClusterExp5(): + state = prototype_state() + + 
# Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 2000 + state['qdim_decoder'] = 4000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = False + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = False + state['train_latent_variables_with_kl_divergence_annealing'] = False + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = False + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['patience'] = 20 + + return state + + + +### +### Twitter - Hyperparameter search for Gaussian VHRED: +### +# sdim = {500, 1000} +# qdim_encoder = {1000} +# qdim_decoder = {1000, 2000, 4000} +# 
rankdim = 400 +# latent_gaussian_per_utterance_dim = {100, 300} +# bidirectional_utterance_encoder = True +# reset_utterance_encoder_at_end_of_utterance = False +# reset_utterance_decoder_at_end_of_utterance = True +# lr = 0.0002 +# bs = 80 +# normop_type = 'LN' + +def prototype_twitter_GaussOnly_VHRED_NormOp_ClusterExp1(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 1000 + state['sdim'] = 500 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + 
state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + return state + + + +def prototype_twitter_GaussOnly_VHRED_NormOp_ClusterExp2(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 1000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + 
state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + return state + + + +def prototype_twitter_GaussOnly_VHRED_NormOp_ClusterExp3(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = True + 
state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + return state + + + +def prototype_twitter_GaussOnly_VHRED_NormOp_ClusterExp4(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 4000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + 
state['patience'] = 20 + + return state + + + +def prototype_twitter_GaussOnly_VHRED_NormOp_ClusterExp5(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 4000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 300 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + return state + + + +### +### Twitter - 
Hyperparameter search for Piecewise-Gaussian VHRED: +### +# sdim = {500, 1000} +# qdim_encoder = {1000} +# qdim_decoder = {1000, 2000, 4000} +# rankdim = 400 +# latent_gaussian_per_utterance_dim = {100, 300} +# latent_piecewise_per_utterance_dim = {100, 300} +# gate_latent_piecewise_per_utterance = {False, True} +# bidirectional_utterance_encoder = True +# reset_utterance_encoder_at_end_of_utterance = False +# reset_utterance_decoder_at_end_of_utterance = True +# lr = 0.0002 +# bs = 80 +# normop_type = 'LN' + +def prototype_twitter_GaussPiecewise_VHRED_NormOp_ClusterExp1(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 1000 + state['sdim'] = 500 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + 
state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + return state + + + +def prototype_twitter_GaussPiecewise_VHRED_NormOp_ClusterExp2(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 1000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + 
state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + return state + + + +def prototype_twitter_GaussPiecewise_VHRED_NormOp_ClusterExp3(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + 
state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + return state + + + +def prototype_twitter_GaussPiecewise_VHRED_NormOp_ClusterExp4(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 4000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + 
state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + return state + + + +def prototype_twitter_GaussPiecewise_VHRED_NormOp_ClusterExp5(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 4000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 300 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 300 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True 
+ state['kl_divergence_annealing_rate'] = 1.0/60000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + return state + + + +def prototype_twitter_GaussPiecewise_VHRED_NormOp_ClusterExp6(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + 
state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + state['gate_latent_piecewise_per_utterance'] = False + + return state + + + +def prototype_twitter_GaussPiecewise_VHRED_NormOp_ClusterExp7(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 4000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + 
state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + state['gate_latent_piecewise_per_utterance'] = False + + return state + + + +def prototype_twitter_GaussPiecewise_VHRED_NormOp_ClusterExp8(): + state = prototype_state() + + # Fill your paths here! + state['train_dialogues'] = "../TwitterDataBPE/Train.dialogues.pkl" + state['test_dialogues'] = "../TwitterDataBPE/Test.dialogues.pkl" + state['valid_dialogues'] = "../TwitterDataBPE/Valid.dialogues.pkl" + state['dictionary'] = "../TwitterDataBPE/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 2500 + + state['prefix'] = "TwitterModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + state['decoder_bias_type'] = 'all' # Choose between 'first', 'all' and 'selective' + + state['direct_connection_between_encoders_and_decoder'] = True + state['deep_direct_connection'] = False + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 4000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + state['utterance_decoder_gating'] = 'LSTM' + + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 300 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 300 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/60000.0 + 
state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + state['patience'] = 20 + + state['gate_latent_piecewise_per_utterance'] = False + + return state + + + +### +### Ubuntu - Hyperparameter search for (Gaussian/Piecewise) VHRED on Ubuntu: +### +### sdim = 1000 +### qdim_encoder = 1000 +### qdim_decoder = 2000 +### rankdim = 400 +### deep_utterance_decoder_input={False,True} +### +### +### bidirectional_utterance_encoder = True +### reset_utterance_encoder_at_end_of_utterance = False +### reset_utterance_decoder_at_end_of_utterance = True +### lr = 0.0002 +### bs = 80 +### normop_type = 'LN' +### +### For latent models, we also experiment with kl_divergence_max_weight={0.25, 0.50, 0.75} +### NOTE: In this case, we early stop according to the reweighted lower bound! +### +### + +# This is the Ubuntu HRED baseline used in "Piecewise Latent Variables for Neural Variational Text Processing" by Serban et al. +# It achieved best performance w.r.t. 
F1 activity performance on the validation set among all HRED baseline models +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Baseline_Exp1(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = False + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + 
state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = False + + state['patience'] = 20 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Baseline_Exp2(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # 
Latent variable configuration + state['add_latent_gaussian_per_utterance'] = False + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp1(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + 
state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = False + + state['patience'] = 20 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp2(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # 
pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = False + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = False + + state['patience'] = 20 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp3(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + 
state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + 
state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = False + + state['patience'] = 20 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp4(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + 
state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + return state + + + +# This is the Ubuntu P-VHRED model used in "Piecewise Latent Variables for Neural Variational Text Processing" by Serban et al. +# It achieved best performance w.r.t. F1 activity performance on the validation set among all P-VHRED models +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp5(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True 
+ state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = False + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp6(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = 
"../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + return state + + + +# This is the Ubuntu G-VHRED model used in "Piecewise Latent Variables for Neural Variational Text Processing" by Serban et al. +# It achieved best performance w.r.t. 
F1 activity performance on the validation set among all G-VHRED models +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp7(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + 
state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + state['kl_divergence_max_weight'] = 0.25 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp8(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 
+ state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = False + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + state['kl_divergence_max_weight'] = 0.25 + + return state + + + +# This is the Ubuntu H-VHRED model used in "Piecewise Latent Variables for Neural Variational Text Processing" by Serban et al. +# It achieved best performance w.r.t. 
F1 activity performance on the validation set among all H-VHRED models +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp9(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + 
state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + state['kl_divergence_max_weight'] = 0.25 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp10(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 
+ state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + state['kl_divergence_max_weight'] = 0.5 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp11(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + 
state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = False + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + state['kl_divergence_max_weight'] = 0.5 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp12(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + 
state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + state['kl_divergence_max_weight'] = 0.5 + + return state + + + +def 
prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp13(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = False + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 
+ + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + state['kl_divergence_max_weight'] = 0.75 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp14(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = False + 
state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + state['kl_divergence_max_weight'] = 0.75 + + return state + + + +def prototype_ubuntu_GaussPiecewise_NormOp_VHRED_Exp15(): + state = prototype_state() + + state['end_sym_utterance'] = '__eot__' + + state['unk_sym'] = 0 # Unknown word token + state['eos_sym'] = 1 # end-of-utterance symbol + state['eod_sym'] = -1 # end-of-dialogue symbol + state['first_speaker_sym'] = -1 # first speaker symbol + state['second_speaker_sym'] = -1 # second speaker symbol + state['third_speaker_sym'] = -1 # third speaker symbol + state['minor_speaker_sym'] = -1 # minor speaker symbol + state['voice_over_sym'] = -1 # voice over symbol + state['off_screen_sym'] = -1 # off screen symbol + state['pause_sym'] = -1 # pause symbol + + state['train_dialogues'] = "../UbuntuData/Training.dialogues.pkl" + state['test_dialogues'] = "../UbuntuData/Test.dialogues.pkl" + state['valid_dialogues'] = "../UbuntuData/Validation.dialogues.pkl" + state['dictionary'] = "../UbuntuData/Dataset.dict.pkl" + state['save_dir'] = "Output" + + state['max_grad_steps'] = 80 + + state['valid_freq'] = 5000 + + state['prefix'] = "UbuntuModel_" + state['updater'] = 'adam' + + 
state['bidirectional_utterance_encoder'] = True + state['deep_dialogue_encoder_input'] = False + state['deep_utterance_decoder_out'] = True + + state['bs'] = 80 + + state['utterance_decoder_gating'] = 'LSTM' + state['direct_connection_between_encoders_and_decoder'] = True + + state['qdim_encoder'] = 1000 + state['qdim_decoder'] = 2000 + state['sdim'] = 1000 + state['rankdim'] = 400 + + # Latent variable configuration + state['add_latent_gaussian_per_utterance'] = True + state['latent_gaussian_per_utterance_dim'] = 100 + state['scale_latent_gaussian_variable_variances'] = 0.1 + + state['add_latent_piecewise_per_utterance'] = True + state['latent_piecewise_per_utterance_dim'] = 100 + state['latent_piecewise_alpha_variables'] = 3 + state['scale_latent_piecewise_variable_alpha_use_softplus'] = False + state['scale_latent_piecewise_variable_prior_alpha'] = 1.0 + state['scale_latent_piecewise_variable_posterior_alpha'] = 1.0 + + state['condition_latent_variable_on_dialogue_encoder'] = True + state['train_latent_variables_with_kl_divergence_annealing'] = True + state['kl_divergence_annealing_rate'] = 1.0/75000.0 + state['decoder_drop_previous_input_tokens'] = True + state['decoder_drop_previous_input_tokens_rate'] = 0.75 + + state['deep_utterance_decoder_input'] = True + + state['patience'] = 20 + + state['kl_divergence_max_weight'] = 0.75 + + return state \ No newline at end of file diff --git a/parlai/agents/hred/train.py b/parlai/agents/hred/train.py new file mode 100755 index 00000000000..d78d8c3e689 --- /dev/null +++ b/parlai/agents/hred/train.py @@ -0,0 +1,613 @@ +# -*- coding: utf-8 -*- +#!/usr/bin/env python + +from data_iterator import * +from state import * +from dialog_encdec import * +from utils import * + +import time +import traceback +import sys +import argparse +import cPickle +import logging +import search +import pprint +import numpy +import collections +import signal +import math +import gc + +import os +import os.path + +# For certain clusters (e.g. 
Guillumin) we use flag 'DUMP_EXPERIMENT_LOGS_TO_DISC' +# to force dumping log outputs to file. +if 'DUMP_EXPERIMENT_LOGS_TO_DISC' in os.environ: + if os.environ['DUMP_EXPERIMENT_LOGS_TO_DISC'] == '1': + sys.stdout = open('Exp_Out.txt', 'a') + sys.stderr = open('Exp_Err.txt', 'a') + +from os import listdir +from os.path import isfile, join + +import matplotlib +matplotlib.use('Agg') +import pylab + + +class Unbuffered: + def __init__(self, stream): + self.stream = stream + + def write(self, data): + self.stream.write(data) + self.stream.flush() + + def __getattr__(self, attr): + return getattr(self.stream, attr) + +sys.stdout = Unbuffered(sys.stdout) +logger = logging.getLogger(__name__) + +### Unique RUN_ID for this execution +RUN_ID = str(time.time()) + +### Additional measures can be set here +measures = ["train_cost", "train_misclass", "train_kl_divergence_cost", "train_posterior_gaussian_mean_variance", "valid_cost", "valid_misclass", "valid_posterior_gaussian_mean_variance", "valid_kl_divergence_cost", "valid_emi"] + + +def init_timings(): + timings = {} + for m in measures: + timings[m] = [] + return timings + +def save(model, timings, train_iterator, post_fix = ''): + print("Saving the model...") + + # ignore keyboard interrupt while saving + start = time.time() + s = signal.signal(signal.SIGINT, signal.SIG_IGN) + + model.state['train_iterator_offset'] = train_iterator.get_offset() + 1 + model.state['train_iterator_reshuffle_count'] = train_iterator.get_reshuffle_count() + + model.save(model.state['save_dir'] + '/' + model.state['run_id'] + "_" + model.state['prefix'] + post_fix + 'model.npz') + cPickle.dump(model.state, open(model.state['save_dir'] + '/' + model.state['run_id'] + "_" + model.state['prefix'] + post_fix + 'state.pkl', 'w')) + numpy.savez(model.state['save_dir'] + '/' + model.state['run_id'] + "_" + model.state['prefix'] + post_fix + 'timing.npz', **timings) + signal.signal(signal.SIGINT, s) + + print("Model saved, took {}".format(time.time() 
- start)) + +def load(model, filename, parameter_strings_to_ignore): + print("Loading the model...") + + # ignore keyboard interrupt while saving + start = time.time() + s = signal.signal(signal.SIGINT, signal.SIG_IGN) + model.load(filename, parameter_strings_to_ignore) + signal.signal(signal.SIGINT, s) + + print("Model loaded, took {}".format(time.time() - start)) + +def main(args): + logging.basicConfig(level = logging.DEBUG, + format = "%(asctime)s: %(name)s: %(levelname)s: %(message)s") + + state = eval(args.prototype)() + timings = init_timings() + + auto_restarting = False + if args.auto_restart: + assert not args.save_every_valid_iteration + assert len(args.resume) == 0 + + directory = state['save_dir'] + if not directory[-1] == '/': + directory = directory + '/' + + auto_resume_postfix = state['prefix'] + '_auto_model.npz' + + if os.path.exists(directory): + directory_files = [f for f in listdir(directory) if isfile(join(directory, f))] + resume_filename = '' + for f in directory_files: + if len(f) > len(auto_resume_postfix): + if f[len(f) - len(auto_resume_postfix):len(f)] == auto_resume_postfix: + if len(resume_filename) > 0: + print('ERROR: FOUND MULTIPLE MODELS IN DIRECTORY:', directory) + assert False + else: + resume_filename = directory + f[0:len(f)-len('__auto_model.npz')] + + if len(resume_filename) > 0: + logger.debug("Found model to automatically resume: %s" % resume_filename) + auto_restarting = True + # Setup training to automatically resume training with the model found + args.resume = resume_filename + '__auto' + # Disable training from reinitialization any parameters + args.reinitialize_decoder_parameters = False + args.reinitialize_latent_variable_parameters = False + else: + logger.debug("Could not find any model to automatically resume...") + + + + if args.resume != "": + logger.debug("Resuming %s" % args.resume) + + state_file = args.resume + '_state.pkl' + timings_file = args.resume + '_timing.npz' + + if os.path.isfile(state_file) and 
os.path.isfile(timings_file): + logger.debug("Loading previous state") + + state = cPickle.load(open(state_file, 'r')) + timings = dict(numpy.load(open(timings_file, 'r'))) + for x, y in timings.items(): + timings[x] = list(y) + + # Increment seed to make sure we get newly shuffled batches when training on large datasets + state['seed'] = state['seed'] + + else: + raise Exception("Cannot resume, cannot find files!") + + + + logger.debug("State:\n{}".format(pprint.pformat(state))) + logger.debug("Timings:\n{}".format(pprint.pformat(timings))) + + if args.force_train_all_wordemb == True: + state['fix_pretrained_word_embeddings'] = False + + model = DialogEncoderDecoder(state) + rng = model.rng + + valid_rounds = 0 + save_model_on_first_valid = False + + if args.resume != "": + filename = args.resume + '_model.npz' + if os.path.isfile(filename): + logger.debug("Loading previous model") + + parameter_strings_to_ignore = [] + if args.reinitialize_decoder_parameters: + parameter_strings_to_ignore += ['Wd_'] + parameter_strings_to_ignore += ['bd_'] + + save_model_on_first_valid = True + if args.reinitialize_latent_variable_parameters: + parameter_strings_to_ignore += ['latent_utterance_prior'] + parameter_strings_to_ignore += ['latent_utterance_approx_posterior'] + parameter_strings_to_ignore += ['kl_divergence_cost_weight'] + parameter_strings_to_ignore += ['latent_dcgm_encoder'] + + save_model_on_first_valid = True + + load(model, filename, parameter_strings_to_ignore) + else: + raise Exception("Cannot resume, cannot find model file!") + + if 'run_id' not in model.state: + raise Exception('Backward compatibility not ensured! 
(need run_id in state)') + + else: + # assign new run_id key + model.state['run_id'] = RUN_ID + + logger.debug("Compile trainer") + if not state["use_nce"]: + if ('add_latent_gaussian_per_utterance' in state) and (state["add_latent_gaussian_per_utterance"]): + logger.debug("Training using variational lower bound on log-likelihood") + else: + logger.debug("Training using exact log-likelihood") + + train_batch = model.build_train_function() + else: + logger.debug("Training with noise contrastive estimation") + train_batch = model.build_nce_function() + + eval_batch = model.build_eval_function() + + gamma_bounding = model.build_gamma_bounding_function() + + random_sampler = search.RandomSampler(model) + beam_sampler = search.BeamSampler(model) + + logger.debug("Load data") + train_data, \ + valid_data, = get_train_iterator(state) + train_data.start() + + # Start looping through the dataset + step = 0 + patience = state['patience'] + start_time = time.time() + + train_cost = 0 + train_kl_divergence_cost = 0 + train_posterior_gaussian_mean_variance = 0 + train_misclass = 0 + train_done = 0 + train_dialogues_done = 0.0 + + prev_train_cost = 0 + prev_train_done = 0 + + ex_done = 0 + is_end_of_batch = True + start_validation = False + + batch = None + + while (step < state['loop_iters'] and + (time.time() - start_time)/60. 
< state['time_stop'] and + patience >= 0): + + # Flush to log files + sys.stderr.flush() + sys.stdout.flush() + + ### Sampling phase + if step % 200 == 0: + # First generate stochastic samples + for param in model.params: + print("%s = %.4f" % (param.name, numpy.sum(param.get_value() ** 2) ** 0.5)) + + samples, costs = random_sampler.sample([[]], n_samples=1, n_turns=3) + print("Sampled : {}".format(samples[0])) + + + ### Training phase + batch = train_data.next() + + # Train finished + if not batch: + # Restart training + logger.debug("Got None...") + break + + logger.debug("[TRAIN] - Got batch %d,%d" % (batch['x'].shape[1], batch['max_length'])) + + x_data = batch['x'] + x_data_reversed = batch['x_reversed'] + max_length = batch['max_length'] + x_cost_mask = batch['x_mask'] + x_reset = batch['x_reset'] + ran_gaussian_const_utterance = batch['ran_var_gaussian_constutterance'] + ran_uniform_const_utterance = batch['ran_var_uniform_constutterance'] + + ran_decoder_drop_mask = batch['ran_decoder_drop_mask'] + + is_end_of_batch = False + if numpy.sum(numpy.abs(x_reset)) < 1: + # Print when we reach the end of an example (e.g. 
the end of a dialogue or a document) + # Knowing when the training procedure reaches the end is useful for diagnosing training problems + # print('END-OF-BATCH EXAMPLE!') + is_end_of_batch = True + + if state['use_nce']: + y_neg = rng.choice(size=(10, max_length, x_data.shape[1]), a=model.idim, p=model.noise_probs).astype('int32') + c, kl_divergence_cost, posterior_gaussian_mean_variance = train_batch(x_data, x_data_reversed, y_neg, max_length, x_cost_mask, x_reset, ran_gaussian_const_utterance, ran_uniform_const_utterance, ran_decoder_drop_mask) + else: + + latent_piecewise_utterance_variable_approx_posterior_alpha = 0.0 + latent_piecewise_utterance_variable_prior_alpha = 0.0 + kl_divergences_between_piecewise_prior_and_posterior = 0.0 + kl_divergences_between_gaussian_prior_and_posterior = 0.0 + latent_piecewise_posterior_sample = 0.0 + posterior_gaussian_mean_variance = 0.0 + + if model.add_latent_piecewise_per_utterance and model.add_latent_gaussian_per_utterance: + c, kl_divergence_cost, posterior_gaussian_mean_variance, latent_piecewise_utterance_variable_approx_posterior_alpha, latent_piecewise_utterance_variable_prior_alpha, kl_divergences_between_piecewise_prior_and_posterior, kl_divergences_between_gaussian_prior_and_posterior, latent_piecewise_posterior_sample = train_batch(x_data, x_data_reversed, max_length, x_cost_mask, x_reset, ran_gaussian_const_utterance, ran_uniform_const_utterance, ran_decoder_drop_mask) + elif model.add_latent_gaussian_per_utterance: + c, kl_divergence_cost, posterior_gaussian_mean_variance, kl_divergences_between_gaussian_prior_and_posterior = train_batch(x_data, x_data_reversed, max_length, x_cost_mask, x_reset, ran_gaussian_const_utterance, ran_uniform_const_utterance, ran_decoder_drop_mask) + elif model.add_latent_piecewise_per_utterance: + c, kl_divergence_cost, kl_divergences_between_piecewise_prior_and_posterior = train_batch(x_data, x_data_reversed, max_length, x_cost_mask, x_reset, ran_gaussian_const_utterance, 
ran_uniform_const_utterance, ran_decoder_drop_mask) + else: + c = train_batch(x_data, x_data_reversed, max_length, x_cost_mask, x_reset, ran_gaussian_const_utterance, ran_uniform_const_utterance, ran_decoder_drop_mask) + kl_divergence_cost = 0.0 + + + + + + gamma_bounding() + + # Print batch statistics + print('cost_sum', c) + print('cost_mean', c / float(numpy.sum(x_cost_mask))) + + if model.add_latent_piecewise_per_utterance or model.add_latent_gaussian_per_utterance: + print('kl_divergence_cost_sum', kl_divergence_cost) + print('kl_divergence_cost_mean', kl_divergence_cost / float(len(numpy.where(x_data == model.eos_sym)[0]))) + + if model.add_latent_gaussian_per_utterance: + print('posterior_gaussian_mean_variance', posterior_gaussian_mean_variance) + print('kl_divergences_between_gaussian_prior_and_posterior', numpy.sum(kl_divergences_between_gaussian_prior_and_posterior), numpy.min(kl_divergences_between_gaussian_prior_and_posterior), numpy.max(kl_divergences_between_gaussian_prior_and_posterior)) + + if model.add_latent_piecewise_per_utterance: + print('kl_divergences_between_piecewise_prior_and_posterior', numpy.sum(kl_divergences_between_piecewise_prior_and_posterior), numpy.min(kl_divergences_between_piecewise_prior_and_posterior), numpy.max(kl_divergences_between_piecewise_prior_and_posterior)) + + + if numpy.isinf(c) or numpy.isnan(c): + logger.warn("Got NaN cost .. skipping") + gc.collect() + continue + + train_cost += c + train_kl_divergence_cost += kl_divergence_cost + train_posterior_gaussian_mean_variance += posterior_gaussian_mean_variance + + train_done += batch['num_preds'] + train_dialogues_done += batch['num_dialogues'] + + this_time = time.time() + if step % state['train_freq'] == 0: + elapsed = this_time - start_time + + # Keep track of training cost for the last 'train_freq' batches. 
+ current_train_cost = train_cost/train_done + if prev_train_done >= 1 and abs(train_done - prev_train_done) > 0: + current_train_cost = float(train_cost - prev_train_cost)/float(train_done - prev_train_done) + + if numpy.isinf(c) or numpy.isnan(c): + current_train_cost = 0 + + prev_train_cost = train_cost + prev_train_done = train_done + + h, m, s = ConvertTimedelta(this_time - start_time) + + # We need to catch exceptions due to high numbers in exp + try: + print(".. %.2d:%.2d:%.2d %4d mb # %d bs %d maxl %d acc_cost = %.4f acc_word_perplexity = %.4f cur_cost = %.4f cur_word_perplexity = %.4f acc_mean_word_error = %.4f acc_mean_kl_divergence_cost = %.8f acc_mean_posterior_variance = %.8f" % (h, m, s,\ + state['time_stop'] - (time.time() - start_time)/60.,\ + step, \ + batch['x'].shape[1], \ + batch['max_length'], \ + float(train_cost/train_done), \ + math.exp(float(train_cost/train_done)), \ + current_train_cost, \ + math.exp(current_train_cost), \ + float(train_misclass)/float(train_done), \ + float(train_kl_divergence_cost/train_done), \ + float(train_posterior_gaussian_mean_variance/train_dialogues_done))) + except: + pass + + + ### Inspection phase + if (step % 20 == 0): + if model.add_latent_gaussian_per_utterance and model.add_latent_piecewise_per_utterance: + try: + print('posterior_gaussian_mean_combination', model.posterior_mean_combination.W.get_value()) + + except: + pass + + print('latent_piecewise_utterance_variable_approx_posterior_alpha', numpy.mean(latent_piecewise_utterance_variable_approx_posterior_alpha), latent_piecewise_utterance_variable_approx_posterior_alpha) + + print('latent_piecewise_utterance_variable_prior_alpha', numpy.mean(latent_piecewise_utterance_variable_prior_alpha), latent_piecewise_utterance_variable_prior_alpha) + + print('latent_piecewise_utterance_variable_alpha_diff', (latent_piecewise_utterance_variable_approx_posterior_alpha-latent_piecewise_utterance_variable_prior_alpha)) + + print('latent_piecewise_posterior_sample', 
numpy.min(latent_piecewise_posterior_sample), numpy.max(latent_piecewise_posterior_sample), latent_piecewise_posterior_sample[0, 0, :]) + print('ran_uniform_const_utterance', numpy.min(ran_uniform_const_utterance), numpy.max(ran_uniform_const_utterance), ran_uniform_const_utterance[0, 0, :]) + + if model.utterance_decoder_gating.upper() == 'GRU' and model.decoder_bias_type.upper() == 'ALL': + Wd_s_q = model.utterance_decoder.Wd_s_q.get_value() + Wd_s_q_len = Wd_s_q.shape[0] + print('model.utterance_decoder Wd_s_q full', numpy.mean(numpy.abs(Wd_s_q)), numpy.mean(Wd_s_q**2)) + + if model.add_latent_gaussian_per_utterance and model.add_latent_piecewise_per_utterance: + Wd_s_q_gaussian = Wd_s_q[Wd_s_q_len-2*model.latent_piecewise_per_utterance_dim:Wd_s_q_len-model.latent_piecewise_per_utterance_dim, :] + Wd_s_q_piecewise = Wd_s_q[Wd_s_q_len-model.latent_piecewise_per_utterance_dim:Wd_s_q_len, :] + + print('model.utterance_decoder Wd_s_q gaussian', numpy.mean(numpy.abs(Wd_s_q_gaussian)), numpy.mean(Wd_s_q_gaussian**2)) + print('model.utterance_decoder Wd_s_q piecewise', numpy.mean(numpy.abs(Wd_s_q_piecewise)), numpy.mean(Wd_s_q_piecewise**2)) + + print('model.utterance_decoder Wd_s_q piecewise/gaussian', numpy.mean(numpy.abs(Wd_s_q_piecewise))/numpy.mean(numpy.abs(Wd_s_q_gaussian)), numpy.mean(Wd_s_q_piecewise**2)/numpy.mean(Wd_s_q_gaussian**2)) + + elif model.add_latent_gaussian_per_utterance: + Wd_s_q_piecewise = Wd_s_q[Wd_s_q_len-model.latent_piecewise_per_utterance_dim:Wd_s_q_len, :] + + print('model.utterance_decoder Wd_s_q piecewise', numpy.mean(numpy.abs(Wd_s_q_piecewise)), numpy.mean(Wd_s_q_piecewise**2)) + + + elif model.add_latent_piecewise_per_utterance: + Wd_s_q_gaussian = Wd_s_q[Wd_s_q_len-model.latent_piecewise_per_utterance_dim:Wd_s_q_len, :] + + print('model.utterance_decoder Wd_s_q gaussian', numpy.mean(numpy.abs(Wd_s_q_gaussian)), numpy.mean(Wd_s_q_gaussian**2)) + + + + if model.utterance_decoder_gating.upper() == 'BOW' and 
model.decoder_bias_type.upper() == 'ALL': + Wd_bow_W_in = model.utterance_decoder.Wd_bow_W_in.get_value() + Wd_bow_W_in_len = Wd_bow_W_in.shape[0] + print('model.utterance_decoder Wd_bow_W_in full', numpy.mean(numpy.abs(Wd_bow_W_in)), numpy.mean(Wd_bow_W_in**2)) + + if model.add_latent_gaussian_per_utterance and model.add_latent_piecewise_per_utterance: + Wd_bow_W_in_gaussian = Wd_bow_W_in[Wd_bow_W_in_len-2*model.latent_piecewise_per_utterance_dim:Wd_bow_W_in_len-model.latent_piecewise_per_utterance_dim, :] + Wd_bow_W_in_piecewise = Wd_bow_W_in[Wd_bow_W_in_len-model.latent_piecewise_per_utterance_dim:Wd_bow_W_in_len, :] + + print('model.utterance_decoder Wd_bow_W_in gaussian', numpy.mean(numpy.abs(Wd_bow_W_in_gaussian)), numpy.mean(Wd_bow_W_in_gaussian**2)) + print('model.utterance_decoder Wd_bow_W_in piecewise', numpy.mean(numpy.abs(Wd_bow_W_in_piecewise)), numpy.mean(Wd_bow_W_in_piecewise**2)) + + print('model.utterance_decoder Wd_bow_W_in piecewise/gaussian', numpy.mean(numpy.abs(Wd_bow_W_in_piecewise))/numpy.mean(numpy.abs(Wd_bow_W_in_gaussian)), numpy.mean(Wd_bow_W_in_piecewise**2)/numpy.mean(Wd_bow_W_in_gaussian**2)) + + elif model.add_latent_gaussian_per_utterance: + Wd_bow_W_in_piecewise = Wd_bow_W_in[Wd_bow_W_in_len-model.latent_piecewise_per_utterance_dim:Wd_bow_W_in_len, :] + + print('model.utterance_decoder Wd_bow_W_in piecewise', numpy.mean(numpy.abs(Wd_bow_W_in_piecewise)), numpy.mean(Wd_bow_W_in_piecewise**2)) + + + elif model.add_latent_piecewise_per_utterance: + Wd_bow_W_in_gaussian = Wd_bow_W_in[Wd_bow_W_in_len-model.latent_piecewise_per_utterance_dim:Wd_bow_W_in_len, :] + + print('model.utterance_decoder Wd_bow_W_in gaussian', numpy.mean(numpy.abs(Wd_bow_W_in_gaussian)), numpy.mean(Wd_bow_W_in_gaussian**2)) + + + + + + + + + + ### Evaluation phase + if valid_data is not None and\ + step % state['valid_freq'] == 0 and step > 1: + start_validation = True + + # Only start validation loop once it's time to validate and once all previous batches have 
been reset + if start_validation and is_end_of_batch: + start_validation = False + valid_data.start() + valid_cost = 0 + valid_kl_divergence_cost = 0 + valid_posterior_gaussian_mean_variance = 0 + + valid_wordpreds_done = 0 + valid_dialogues_done = 0 + + + logger.debug("[VALIDATION START]") + + while True: + batch = valid_data.next() + + # Validation finished + if not batch: + break + + + logger.debug("[VALID] - Got batch %d,%d" % (batch['x'].shape[1], batch['max_length'])) + + x_data = batch['x'] + x_data_reversed = batch['x_reversed'] + max_length = batch['max_length'] + x_cost_mask = batch['x_mask'] + + x_reset = batch['x_reset'] + ran_gaussian_const_utterance = batch['ran_var_gaussian_constutterance'] + ran_uniform_const_utterance = batch['ran_var_uniform_constutterance'] + + ran_decoder_drop_mask = batch['ran_decoder_drop_mask'] + + posterior_gaussian_mean_variance = 0.0 + + c, c_list, kl_divergence_cost = eval_batch(x_data, x_data_reversed, max_length, x_cost_mask, x_reset, ran_gaussian_const_utterance, ran_uniform_const_utterance, ran_decoder_drop_mask) + + + # Rehape into matrix, where rows are validation samples and columns are tokens + # Note that we use max_length-1 because we don't get a cost for the first token + # (the first token is always assumed to be eos) + c_list = c_list.reshape((batch['x'].shape[1],max_length-1), order=(1,0)) + c_list = numpy.sum(c_list, axis=1) + + words_in_dialogues = numpy.sum(x_cost_mask, axis=0) + c_list = c_list / words_in_dialogues + + + if numpy.isinf(c) or numpy.isnan(c): + continue + + valid_cost += c + valid_kl_divergence_cost += kl_divergence_cost + valid_posterior_gaussian_mean_variance += posterior_gaussian_mean_variance + + # Print batch statistics + print('valid_cost', valid_cost) + print('valid_kl_divergence_cost sample', kl_divergence_cost) + print('posterior_gaussian_mean_variance', posterior_gaussian_mean_variance) + + + valid_wordpreds_done += batch['num_preds'] + valid_dialogues_done += 
batch['num_dialogues'] + + logger.debug("[VALIDATION END]") + + valid_cost /= max(1.0, valid_wordpreds_done) + valid_kl_divergence_cost /= max(1.0, valid_wordpreds_done) + valid_posterior_gaussian_mean_variance /= max(1.0, valid_dialogues_done) + + if (len(timings["valid_cost"]) == 0) \ + or (valid_cost < numpy.min(timings["valid_cost"])) \ + or (save_model_on_first_valid and valid_rounds == 0): + patience = state['patience'] + + # Save model if there is decrease in validation cost + save(model, timings, train_data) + print('best valid_cost', valid_cost) + elif valid_cost >= timings["valid_cost"][-1] * state['cost_threshold']: + patience -= 1 + + if args.save_every_valid_iteration: + save(model, timings, train_data, '_' + str(step) + '_') + if args.auto_restart: + save(model, timings, train_data, '_auto_') + + + # We need to catch exceptions due to high numbers in exp + try: + print("** valid cost (NLL) = %.4f, valid word-perplexity = %.4f, valid kldiv cost (per word) = %.8f, valid mean posterior variance (per word) = %.8f, patience = %d" % (float(valid_cost), float(math.exp(valid_cost)), float(valid_kl_divergence_cost), float(valid_posterior_gaussian_mean_variance), patience)) + except: + try: + print("** valid cost (NLL) = %.4f, patience = %d" % (float(valid_cost), patience)) + except: + pass + + + timings["train_cost"].append(train_cost/train_done) + timings["train_kl_divergence_cost"].append(train_kl_divergence_cost/train_done) + timings["train_posterior_gaussian_mean_variance"].append(train_posterior_gaussian_mean_variance/train_dialogues_done) + timings["valid_cost"].append(valid_cost) + timings["valid_kl_divergence_cost"].append(valid_kl_divergence_cost) + timings["valid_posterior_gaussian_mean_variance"].append(valid_posterior_gaussian_mean_variance) + + # Reset train cost, train misclass and train done metrics + train_cost = 0 + train_done = 0 + prev_train_cost = 0 + prev_train_done = 0 + + # Count number of validation rounds done so far + valid_rounds += 
1 + + step += 1 + + logger.debug("All done, exiting...") + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--resume", type=str, default="", help="Resume training from that state") + + parser.add_argument("--force_train_all_wordemb", action='store_true', help="If true, will force the model to train all word embeddings in the encoder. This switch can be used to fine-tune a model which was trained with fixed (pretrained) encoder word embeddings.") + + parser.add_argument("--save_every_valid_iteration", action='store_true', help="If true, will save a unique copy of the model at every validation round.") + + parser.add_argument("--auto_restart", action='store_true', help="If true, will maintain a copy of the current model parameters updated at every validation round. Upon initialization, the script will automatically scan the output directory and and resume training of a previous model (if such exists). This option is meant to be used for training models on clusters with hard wall-times. This option is incompatible with the \"resume\" and \"save_every_valid_iteration\" options.") + + parser.add_argument("--prototype", type=str, help="Prototype to use (must be specified)", default='prototype_state') + + parser.add_argument("--reinitialize-latent-variable-parameters", action='store_true', help="Can be used when resuming a model. If true, will initialize all latent variable parameters randomly instead of loading them from previous model.") + + parser.add_argument("--reinitialize-decoder-parameters", action='store_true', help="Can be used when resuming a model. 
If true, will initialize all parameters of the utterance decoder randomly instead of loading them from previous model.") + + args = parser.parse_args() + return args + +if __name__ == "__main__": + # Models only run with float32 + assert(theano.config.floatX == 'float32') + + args = parse_args() + main(args) + + # grep 'valid cost' LSTM_Baseline_exp1/LOGS/python_train.py_prototype_twitter_LSTM_NormOp_ClusterExp1_2016-09-23_22-48-31.523628/dbi_146c0c3c23d.out-* | grep -o -P '(?<=word-perplexity = ).*(?=, valid kldiv)' diff --git a/parlai/agents/hred/utils.py b/parlai/agents/hred/utils.py new file mode 100755 index 00000000000..8156113ce88 --- /dev/null +++ b/parlai/agents/hred/utils.py @@ -0,0 +1,357 @@ +import numpy +import adam +import theano +import theano.tensor as T +from collections import OrderedDict + +PRINT_VARS = True + +def DPrint(name, var): + if PRINT_VARS is False: + return var + + return theano.printing.Print(name)(var) + +def sharedX(value, name=None, borrow=False, dtype=None): + if dtype is None: + dtype = theano.config.floatX + return theano.shared(theano._asarray(value, dtype=dtype), + name=name, + borrow=borrow) + +def Adam(grads, lr=0.0002, b1=0.1, b2=0.001, e=1e-8): + return adam.Adam(grads, lr, b1, b2, e) + +def Adagrad(grads, lr): + updates = OrderedDict() + for param in grads.keys(): + # sum_square_grad := \sum g^2 + sum_square_grad = sharedX(param.get_value() * 0.) 
+ if param.name is not None: + sum_square_grad.name = 'sum_square_grad_' + param.name + + # Accumulate gradient + new_sum_squared_grad = sum_square_grad + T.sqr(grads[param]) + + # Compute update + delta_x_t = (- lr / T.sqrt(numpy.float32(1e-5) + new_sum_squared_grad)) * grads[param] + + # Apply update + updates[sum_square_grad] = new_sum_squared_grad + updates[param] = param + delta_x_t + return updates + +def Adadelta(grads, decay=0.95, epsilon=1e-6): + updates = OrderedDict() + for param in grads.keys(): + # mean_squared_grad := E[g^2]_{t-1} + mean_square_grad = sharedX(param.get_value() * 0.) + # mean_square_dx := E[(\Delta x)^2]_{t-1} + mean_square_dx = sharedX(param.get_value() * 0.) + + if param.name is not None: + mean_square_grad.name = 'mean_square_grad_' + param.name + mean_square_dx.name = 'mean_square_dx_' + param.name + + # Accumulate gradient + new_mean_squared_grad = ( + decay * mean_square_grad + + (1 - decay) * T.sqr(grads[param]) + ) + + # Compute update + rms_dx_tm1 = T.sqrt(mean_square_dx + epsilon) + rms_grad_t = T.sqrt(new_mean_squared_grad + epsilon) + delta_x_t = - rms_dx_tm1 / rms_grad_t * grads[param] + + # Accumulate updates + new_mean_square_dx = ( + decay * mean_square_dx + + (1 - decay) * T.sqr(delta_x_t) + ) + + # Apply update + updates[mean_square_grad] = new_mean_squared_grad + updates[mean_square_dx] = new_mean_square_dx + updates[param] = param + delta_x_t + + return updates + +def RMSProp(grads, lr, decay=0.95, eta=0.9, epsilon=1e-6): + """ + RMSProp gradient method + """ + updates = OrderedDict() + for param in grads.keys(): + # mean_squared_grad := E[g^2]_{t-1} + mean_square_grad = sharedX(param.get_value() * 0.) + mean_grad = sharedX(param.get_value() * 0.) + delta_grad = sharedX(param.get_value() * 0.) 
+ + if param.name is None: + raise ValueError("Model parameters must be named.") + + mean_square_grad.name = 'mean_square_grad_' + param.name + + # Accumulate gradient + + new_mean_grad = (decay * mean_grad + (1 - decay) * grads[param]) + new_mean_squared_grad = (decay * mean_square_grad + (1 - decay) * T.sqr(grads[param])) + + # Compute update + scaled_grad = grads[param] / T.sqrt(new_mean_squared_grad - new_mean_grad ** 2 + epsilon) + new_delta_grad = eta * delta_grad - lr * scaled_grad + + # Apply update + updates[delta_grad] = new_delta_grad + updates[mean_grad] = new_mean_grad + updates[mean_square_grad] = new_mean_squared_grad + updates[param] = param + new_delta_grad + + return updates + +class Maxout(object): + def __init__(self, maxout_part): + self.maxout_part = maxout_part + + def __call__(self, x): + shape = x.shape + if x.ndim == 2: + shape1 = T.cast(shape[1] / self.maxout_part, 'int64') + shape2 = T.cast(self.maxout_part, 'int64') + x = x.reshape([shape[0], shape1, shape2]) + x = x.max(2) + else: + shape1 = T.cast(shape[2] / self.maxout_part, 'int64') + shape2 = T.cast(self.maxout_part, 'int64') + x = x.reshape([shape[0], shape[1], shape1, shape2]) + x = x.max(3) + return x + +def UniformInit(rng, sizeX, sizeY, lb=-0.01, ub=0.01): + """ Uniform Init """ + return rng.uniform(size=(sizeX, sizeY), low=lb, high=ub).astype(theano.config.floatX) + +def OrthogonalInit(rng, sizeX, sizeY, sparsity=-1, scale=1): + """ + Orthogonal Initialization + """ + + sizeX = int(sizeX) + sizeY = int(sizeY) + + assert sizeX == sizeY, 'for orthogonal init, sizeX == sizeY' + + if sparsity < 0: + sparsity = sizeY + else: + sparsity = numpy.minimum(sizeY, sparsity) + + values = numpy.zeros((sizeX, sizeY), dtype=theano.config.floatX) + for dx in xrange(sizeX): + perm = rng.permutation(sizeY) + new_vals = rng.normal(loc=0, scale=scale, size=(sparsity,)) + values[dx, perm[:sparsity]] = new_vals + + # Use SciPy: + if sizeX*sizeY > 5000000: + import scipy + u,s,v = 
scipy.linalg.svd(values) + else: + u,s,v = numpy.linalg.svd(values) + values = u * scale + return values.astype(theano.config.floatX) + +def GrabProbs(classProbs, target, gRange=None): + if classProbs.ndim > 2: + classProbs = classProbs.reshape((classProbs.shape[0] * classProbs.shape[1], classProbs.shape[2])) + else: + classProbs = classProbs + + if target.ndim > 1: + tflat = target.flatten() + else: + tflat = target + return T.diag(classProbs.T[tflat]) + +def NormalInit(rng, sizeX, sizeY, scale=0.01, sparsity=-1): + """ + Normal Initialization + """ + + sizeX = int(sizeX) + sizeY = int(sizeY) + + if sparsity < 0: + sparsity = sizeY + + sparsity = numpy.minimum(sizeY, sparsity) + values = numpy.zeros((sizeX, sizeY), dtype=theano.config.floatX) + for dx in xrange(sizeX): + perm = rng.permutation(sizeY) + new_vals = rng.normal(loc=0, scale=scale, size=(sparsity,)) + values[dx, perm[:sparsity]] = new_vals + + return values.astype(theano.config.floatX) + +def NormalInit3D(rng, sizeX, sizeY, sizeZ, scale=0.01, sparsity=-1): + """ + Normal Initialization for 3D tensor + """ + + sizeX = int(sizeX) + sizeY = int(sizeY) + sizeZ = int(sizeZ) + values = numpy.zeros((sizeX, sizeY, sizeZ), dtype=theano.config.floatX) + for i in range(sizeZ): + values[:,:,i] = NormalInit(rng, sizeX, sizeY, scale, sparsity) + + return values.astype(theano.config.floatX) + +def ConvertTimedelta(seconds_diff): + hours = seconds_diff // 3600 + minutes = (seconds_diff % 3600) // 60 + seconds = (seconds_diff % 60) + return hours, minutes, seconds + +def SoftMax(x): + x = T.exp(x - T.max(x, axis=x.ndim-1, keepdims=True)) + return x / T.sum(x, axis=x.ndim-1, keepdims=True) + +def stable_log(x): + return T.log(T.maximum(x, 0.0000000001)) + + + +# Performs either batch normalization or layer normalization +def NormalizationOperator(normop_type, x, gamma, mask, estimated_mean=0.0, estimated_var=1.0): + if normop_type.upper() == 'BN': + if x.ndim == 3: + return FeedforwardBatchNormalization(x, gamma, mask, 
estimated_mean=0.0, estimated_var=1.0) + elif x.ndim == 2: + return RecurrentBatchNormalization(x, gamma, mask, estimated_mean=0.0, estimated_var=1.0) + elif normop_type.upper() == 'LN': + return LayerNormalization(x, gamma, mask, estimated_mean=0.0, estimated_var=1.0) + elif normop_type.upper() == 'NONE' or normop_type.upper() == '': + assert x.ndim == 3 or x.ndim == 2 + + output = x + 0.0*gamma + if x.ndim == 3: + x_mean = T.mean(x, axis=1).dimshuffle(0, 1, 'x') + x_var = T.var(x, axis=1).dimshuffle(0, 1, 'x') + else: + x_mean = T.mean(x, axis=1).dimshuffle(0, 'x') + x_var = T.var(x, axis=1).dimshuffle(0, 'x') + + return output, x_mean[0], x_var[0] + else: + raise ValueError("Error! normop_type must take a value in set {\'BN\', \'LN\', \'NONE\'}!") + + +# Batch normalization of input variable on first and second tensor indices (time x batch example x hidden units) +# Elements where mask is zero, will not be used to compute the mean and variance estimates, +# however these elements will still be batch normalized. 
def FeedforwardBatchNormalization(x, gamma, mask, estimated_mean=0.0, estimated_var=1.0):
    """Batch-normalize a 3D tensor over its first two axes, scaling by ``gamma``.

    ``x`` is assumed to be (time x batch example x hidden units) per the header
    comment above. Positions where ``mask`` is zero are excluded from the
    mean/variance estimates but are still normalized. The batch statistics are
    blended with the running estimates (``estimated_mean``, ``estimated_var``)
    in proportion to how much of the batch is unmasked.

    Returns a triple: (normalized tensor, per-unit mean used, per-unit variance used).
    """
    assert x.ndim == 3
    # NOTE(review): truth-testing a symbolic variable; presumably this ran on a
    # Theano version where that evaluates True, with None disabling masking — confirm.
    if mask:
        assert mask.ndim == 2
        # Broadcast the (time x batch) mask across the hidden-unit axis.
        mask = mask.dimshuffle(0, 1, 'x')

        # Count of unmasked positions per hidden unit, and the weight given to
        # batch statistics versus the running estimates.
        mask_nonzeros = T.sum(T.sum(mask, axis=0), axis=0)
        mask_nonzeros_weight = T.cast(T.minimum(1.0, T.sum(mask, axis=0)) / mask.shape[1], 'float32')

        x_masked = x*mask

        # Per-hidden-unit mean over unmasked positions, blended with the estimate.
        x_mean = (T.sum(T.sum(x_masked, axis=0), axis=0)/mask_nonzeros).dimshuffle('x', 'x', 0)
        x_mean_adjusted = mask_nonzeros_weight*x_mean + (1.0 - mask_nonzeros_weight)*estimated_mean
        x_zero_mean = x - x_mean_adjusted

        # Per-hidden-unit variance over unmasked positions, blended likewise.
        x_var = (T.sum(T.sum(x_zero_mean**2, axis=0), axis=0)/mask_nonzeros).dimshuffle('x', 'x', 0)
        x_var_adjusted = mask_nonzeros_weight*x_var + (1.0 - mask_nonzeros_weight)*estimated_var

    else:
        # No mask given: normalize purely with the supplied running estimates.
        # NOTE(review): this branch calls .dimshuffle on estimated_mean/var, so it
        # assumes they are 1D tensors here, not the 0.0/1.0 float defaults — confirm.
        x_mean = estimated_mean.dimshuffle('x', 'x', 0)
        x_mean_adjusted = x_mean

        x_zero_mean = x - x_mean

        x_var = estimated_var.dimshuffle('x', 'x', 0)
        x_var_adjusted = x_var


    # 1e-7 guards against division by zero for near-constant units.
    return gamma*(x_zero_mean / T.sqrt(x_var_adjusted+1e-7)), x_mean_adjusted[0, 0], x_var_adjusted[0, 0]

# Batch normalization of input variable on first tensor index (time x batch example x hidden units)
# Elements where mask is zero, will not be used to compute the mean and variance estimates,
# however these elements will still be batch normalized.
def RecurrentBatchNormalization(x, gamma, mask, estimated_mean=0.0, estimated_var=1.0):
    """Batch-normalize a 2D tensor over its first axis, scaling by ``gamma``.

    Rows whose ``mask`` entry is zero are excluded from the mean/variance
    estimates but are still normalized; the batch statistics are blended with
    the running estimates (``estimated_mean``, ``estimated_var``) in proportion
    to the fraction of unmasked rows.

    Returns a triple: (normalized tensor, per-unit mean used, per-unit variance used).
    """
    assert x.ndim == 2
    assert mask.ndim == 1


    # Broadcast the per-row mask across the hidden-unit axis.
    row_mask = mask.dimshuffle(0, 'x')

    # Number of unmasked rows, and the fraction of the batch they represent.
    active_rows = T.sum(row_mask, axis=0)
    batch_weight = active_rows / T.sum(T.ones_like(row_mask), axis=0)

    masked_x = x * row_mask

    # Mean over the unmasked rows, blended with the running estimate.
    batch_mean = (T.sum(masked_x, axis=0) / active_rows).dimshuffle('x', 0)
    blended_mean = batch_weight * batch_mean + (1.0 - batch_weight) * estimated_mean

    centered = x - blended_mean  # centered = masked_x - blended_mean was considered upstream

    # Variance over the unmasked rows, blended with the running estimate.
    batch_var = T.sum(centered ** 2, axis=0) / active_rows.dimshuffle('x', 0)
    blended_var = batch_weight * batch_var + (1.0 - batch_weight) * estimated_var

    # 1e-7 guards against division by zero for near-constant units.
    normalized = gamma * (centered / T.sqrt(blended_var + 1e-7))
    return normalized, blended_mean[0], blended_var[0]

# Performs layer normalization of input variable on last tensor index,
# where we assume variable has shape (time x batch example x hidden units) or (batch example x hidden units).
# Similar to batch normalization, the function also returns the mean and variance across hidden units.
def LayerNormalization(x, gamma, mask, estimated_mean=0.0, estimated_var=1.0):
    """Layer-normalize ``x`` across its last axis and scale by ``gamma``.

    ``mask``, ``estimated_mean`` and ``estimated_var`` are accepted for
    interface parity with the batch-normalization variants but are unused here.

    Returns a triple: (normalized tensor, mean across units, variance across units).
    """
    assert x.ndim == 3 or x.ndim == 2
    if x.ndim == 3:
        unit_mean = T.mean(x, axis=2).dimshuffle(0, 1, 'x')
        unit_var = T.var(x, axis=2).dimshuffle(0, 1, 'x')
        scaled = gamma * ((x - unit_mean) / T.sqrt(unit_var + 1e-7))
        return scaled, unit_mean[0, 0], unit_var[0, 0]

    elif x.ndim == 2:
        unit_mean = T.mean(x, axis=1).dimshuffle(0, 'x')
        unit_var = T.var(x, axis=1).dimshuffle(0, 'x')
        scaled = gamma * ((x - unit_mean) / T.sqrt(unit_var + 1e-7))
        return scaled, unit_mean[0], unit_var[0]



# Does theano.batched_dot. If last_axis is on it will loop over the last axis, otherwise it will loop over the first axis.
+def BatchedDot(x, y, last_axis=False): + if last_axis==False: + return T.batched_dot(x, y) + elif last_axis: + if x.ndim == 2: + shuffled_x = x.dimshuffle(1,0) + elif x.ndim == 3: + shuffled_x = x.dimshuffle(2,0,1) + elif x.ndim == 4: + shuffled_x = x.dimshuffle(3,0,1,2) + else: + raise ValueError('BatchedDot inputs must have between 2-4 dimensions, but x has ' + str(x.ndim) + ' dimensions') + + if y.ndim == 2: + shuffled_y = y.dimshuffle(1,0) + elif y.ndim == 3: + shuffled_y = y.dimshuffle(2,0,1) + elif y.ndim == 4: + shuffled_y = y.dimshuffle(3,0,1,2) + else: + raise ValueError('BatchedDot inputs must have between 2-4 dimensions, but y has ' + str(y.ndim) + ' dimensions') + + dot = T.batched_dot(shuffled_x, shuffled_y) + if dot.ndim == 2: + return dot.dimshuffle(1,0) + elif dot.ndim == 3: + return dot.dimshuffle(1,2,0) + elif dot.ndim == 4: + return dot.dimshuffle(1,2,3,0) + + diff --git a/parlai/agents/ir_baseline/ir_baseline.py b/parlai/agents/ir_baseline/ir_baseline.py index 760b3caba71..e2b31d14f3d 100644 --- a/parlai/agents/ir_baseline/ir_baseline.py +++ b/parlai/agents/ir_baseline/ir_baseline.py @@ -135,8 +135,10 @@ def act(self): reply['text'] = "I don't know." return reply - def save(self, fname): - self.dictionary.save(fname + '.dict') + def save(self, fname=None): + fname = self.opt.get('model_file', None) if fname is None else fname + if fname: + self.dictionary.save(fname + '.dict') def load(self, fname): self.dictionary.load(fname + '.dict') diff --git a/parlai/agents/memnn/__init__.py b/parlai/agents/memnn/__init__.py new file mode 100644 index 00000000000..de7579ee4a2 --- /dev/null +++ b/parlai/agents/memnn/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. 
diff --git a/parlai/agents/memnn/memnn.py b/parlai/agents/memnn/memnn.py new file mode 100644 index 00000000000..b1e7dab382b --- /dev/null +++ b/parlai/agents/memnn/memnn.py @@ -0,0 +1,291 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. + +from parlai.core.agents import Agent +from parlai.core.dict import DictionaryAgent + +import torch +from torch import optim +from torch.autograd import Variable +from torch.nn import CrossEntropyLoss + +import os +import copy +import random + +from .modules import MemNN + + +class MemnnAgent(Agent): + """ Memory Network agent. + """ + + @staticmethod + def add_cmdline_args(argparser): + DictionaryAgent.add_cmdline_args(argparser) + argparser.add_arg('-lr', '--learning-rate', type=float, default=0.01, + help='learning rate') + argparser.add_arg('--embedding-size', type=int, default=128, + help='size of token embeddings') + argparser.add_arg('--hops', type=int, default=3, + help='number of memory hops') + argparser.add_arg('--mem-size', type=int, default=100, + help='size of memory') + argparser.add_arg('--time-features', type='bool', default=True, + help='use time features for memory embeddings') + argparser.add_arg('--position-encoding', type='bool', default=False, + help='use position encoding instead of bag of words embedding') + argparser.add_arg('--optimizer', default='adam', + help='optimizer type (sgd|adam)') + argparser.add_argument('--no-cuda', action='store_true', default=False, + help='disable GPUs even if available') + argparser.add_arg('--gpu', type=int, default=-1, + help='which GPU device to use') + + def __init__(self, opt, shared=None): + opt['cuda'] = not opt['no_cuda'] and torch.cuda.is_available() + if opt['cuda']: + print('[ Using CUDA ]') + 
torch.cuda.device(opt['gpu']) + + if not shared: + self.opt = opt + self.id = 'MemNN' + self.dict = DictionaryAgent(opt) + freqs = torch.LongTensor(list(self.dict.freqs().values())) + + self.model = MemNN(opt, freqs) + self.mem_size = opt['mem_size'] + self.loss_fn = CrossEntropyLoss() + self.answers = [None] * opt['batchsize'] + + optim_params = [p for p in self.model.parameters() if p.requires_grad] + if opt['optimizer'] == 'sgd': + self.optimizer = optim.SGD(optim_params, lr=opt['learning_rate']) + elif opt['optimizer'] == 'adam': + self.optimizer = optim.Adam(optim_params, lr=opt['learning_rate']) + else: + raise NotImplementedError('Optimizer not supported.') + + if opt['cuda']: + self.model.share_memory() + + if opt.get('model_file') and os.path.isfile(opt['model_file']): + print('Loading existing model parameters from ' + opt['model_file']) + self.load(opt['model_file']) + else: + self.answers = shared['answers'] + + self.episode_done = True + self.last_cands, self.last_cands_list = None, None + super().__init__(opt, shared) + + def share(self): + shared = super().share() + shared['answers'] = self.answers + return shared + + def observe(self, observation): + observation = copy.copy(observation) + if not self.episode_done: + # if the last example wasn't the end of an episode, then we need to + # recall what was said in that example + prev_dialogue = self.observation['text'] + batch_idx = self.opt.get('batchindex', 0) + if self.answers[batch_idx] is not None: + prev_dialogue += '\n' + self.answers[batch_idx] + self.answers[batch_idx] = None + observation['text'] = prev_dialogue + '\n' + observation['text'] + self.observation = observation + self.episode_done = observation['episode_done'] + return observation + + def update(self, xs, ys, cands): + self.model.train() + self.optimizer.zero_grad() + + # Organize inputs for network (see contents of xs and ys in batchify method) + inputs = [xs[0], xs[1], ys[0], xs[2], xs[3], ys[1]] + inputs = [Variable(x) for x in 
inputs] + output_embeddings, answer_embeddings = self.model(*inputs) + scores = self.score(cands, output_embeddings, answer_embeddings) + + label_inds = [cand_list.index(self.labels[i]) for i, cand_list in enumerate(cands)] + label_inds = Variable(torch.LongTensor(label_inds)) + if self.opt['cuda']: + label_inds = label_inds.cuda(async=True) + + loss = self.loss_fn(scores, label_inds) + loss.backward() + self.optimizer.step() + return self.ranked_predictions(cands, scores) + + def predict(self, xs, cands): + self.model.eval() + + # Organize inputs for network (see contents of xs in batchify method) + inputs = [xs[0], xs[1], None, xs[2], xs[3], None] + inputs = [Variable(x, volatile=True) for x in inputs] + output_embeddings, _ = self.model(*inputs) + + scores = self.score(cands, output_embeddings) + return self.ranked_predictions(cands, scores) + + def score(self, cands, output_embeddings, answer_embeddings=None): + last_cand = None + max_len = max([len(c) for c in cands]) + scores = Variable(torch.Tensor(len(cands), max_len).fill_(-float('inf'))) + if self.opt['cuda']: + scores = scores.cuda(async=True) + for i, cand_list in enumerate(cands): + if last_cand != cand_list: + candidate_lengths, candidate_indices = to_tensors(cand_list, self.dict) + candidate_lengths, candidate_indices = Variable(candidate_lengths), Variable(candidate_indices) + candidate_embeddings = self.model.answer_embedder(candidate_lengths, candidate_indices) + if self.opt['cuda']: + candidate_embeddings = candidate_embeddings.cuda(async=True) + last_cand = cand_list + scores[i, :len(cand_list)] = self.model.score.one_to_many(output_embeddings[i].unsqueeze(0), candidate_embeddings) + return scores + + def ranked_predictions(self, cands, scores): + _, inds = scores.data.sort(descending=True, dim=1) + return [[cands[i][j] for j in r if j < len(cands[i])] + for i, r in enumerate(inds)] + + def parse(self, text): + """Returns: + query = tensor (vector) of token indices for query + query_length = 
length of query + memory = tensor (matrix) where each row contains token indices for a memory + memory_lengths = tensor (vector) with lengths of each memory + """ + sp = text.split('\n') + query_sentence = sp[-1] + query = self.dict.txt2vec(query_sentence) + query = torch.LongTensor(query) + query_length = torch.LongTensor([len(query)]) + + sp = sp[:-1] + sentences = [] + for s in sp: + sentences.extend(s.split('\t')) + if len(sentences) == 0: + sentences.append(self.dict.null_token) + + num_mems = min(self.mem_size, len(sentences)) + memory_sentences = sentences[-num_mems:] + memory = [self.dict.txt2vec(s) for s in memory_sentences] + memory = [torch.LongTensor(m) for m in memory] + memory_lengths = torch.LongTensor([len(m) for m in memory]) + memory = torch.cat(memory) + + return (query, memory, query_length, memory_lengths) + + def batchify(self, obs): + """Returns: + xs = [memories, queries, memory_lengths, query_lengths] + ys = [labels, label_lengths] (if available, else None) + cands = list of candidates for each example in batch + valid_inds = list of indices for examples with valid observations + """ + exs = [ex for ex in obs if 'text' in ex] + valid_inds = [i for i, ex in enumerate(obs) if 'text' in ex] + + parsed = [self.parse(ex['text']) for ex in exs] + queries = torch.cat([x[0] for x in parsed]) + memories = torch.cat([x[1] for x in parsed]) + query_lengths = torch.cat([x[2] for x in parsed]) + memory_lengths = torch.LongTensor(len(exs), self.mem_size).zero_() + for i in range(len(exs)): + if len(parsed[i][3]) > 0: + memory_lengths[i, -len(parsed[i][3]):] = parsed[i][3] + xs = [memories, queries, memory_lengths, query_lengths] + + ys = None + self.labels = [random.choice(ex['labels']) for ex in exs if 'labels' in ex] + if len(self.labels) == len(exs): + parsed = [self.dict.txt2vec(l) for l in self.labels] + parsed = [torch.LongTensor(p) for p in parsed] + label_lengths = torch.LongTensor([len(p) for p in parsed]).unsqueeze(1) + labels = 
torch.cat(parsed) + ys = [labels, label_lengths] + + cands = [ex['label_candidates'] for ex in exs if 'label_candidates' in ex] + # Use words in dict as candidates if no candidates are provided + if len(cands) < len(exs): + cands = build_cands(exs, self.dict) + # Avoid rebuilding candidate list every batch if its the same + if self.last_cands != cands: + self.last_cands = cands + self.last_cands_list = [list(c) for c in cands] + cands = self.last_cands_list + + return xs, ys, cands, valid_inds + + def batch_act(self, observations): + batchsize = len(observations) + batch_reply = [{'id': self.getID()} for _ in range(batchsize)] + + xs, ys, cands, valid_inds = self.batchify(observations) + + if len(xs[1]) == 0: + return batch_reply + + # Either train or predict + if ys is not None: + predictions = self.update(xs, ys, cands) + else: + predictions = self.predict(xs, cands) + + for i in range(len(valid_inds)): + self.answers[valid_inds[i]] = predictions[i][0] + batch_reply[valid_inds[i]]['text'] = predictions[i][0] + batch_reply[valid_inds[i]]['text_candidates'] = predictions[i] + return batch_reply + + def act(self): + return self.batch_act([self.observation])[0] + + def save(self, path=None): + path = self.opt.get('model_file', None) if path is None else path + + if path: + model_state = self.model.state_dict() + optim_state = self.optimizer.state_dict() + with open(path, 'wb') as write: + torch.save((model_state, optim_state), write) + + def load(self, path): + with open(path, 'rb') as read: + (model, optim) = torch.load(read) + self.model.load_state_dict(model) + self.optimizer.load_state_dict(optim) + + +def to_tensors(sentences, dictionary): + lengths = [] + indices = [] + for sentence in sentences: + tokens = dictionary.txt2vec(sentence) + lengths.append(len(tokens)) + indices.extend(tokens) + lengths = torch.LongTensor(lengths) + indices = torch.LongTensor(indices) + return lengths, indices + + +def build_cands(exs, dict): + dict_list = list(dict.tok2ind.keys()) 
+ cands = [] + for ex in exs: + if 'label_candidates' in ex: + cands.append(ex['label_candidates']) + else: + cands.append(dict_list) + if 'labels' in ex: + cands[-1] += [l for l in ex['labels'] if l not in dict.tok2ind] + return cands diff --git a/parlai/agents/memnn/modules.py b/parlai/agents/memnn/modules.py new file mode 100644 index 00000000000..4559ffa96e9 --- /dev/null +++ b/parlai/agents/memnn/modules.py @@ -0,0 +1,176 @@ +import torch +import torch.nn as nn +from torch.autograd import Variable +from torch.nn.functional import softmax + +from functools import lru_cache + + +class MemNN(nn.Module): + def __init__(self, opt, freqs): + super(MemNN, self).__init__() + self.opt = opt + + # Prepare features + self.num_time_features = opt['mem_size'] + num_features = freqs.numel() + self.extra_features_slots = 0 + if opt['time_features']: + self.time_features = torch.LongTensor(range(num_features, + num_features + self.num_time_features)) + num_features += self.num_time_features + self.extra_features_slots += 1 + + def embedding(): + return Embed(num_features, opt['embedding_size'], + position_encoding=opt['position_encoding'], padding_idx=0) + + self.query_embedder = embedding() + self.answer_embedder = embedding() + self.in_memory_embedder = embedding() + self.out_memory_embedder = embedding() + self.memory_hop = Hop(opt['embedding_size']) + + self.score = DotScore() + + if opt['cuda']: + self.score.cuda() + if hasattr(self, 'memory_hop'): + self.memory_hop.cuda() + + self.original_cuda_params = [(p, p.data) for p in self.parameters() if p.data.is_cuda] + + def time_feature(self, t): + return self.time_features[min(t, self.num_time_features - 1)] + + def update_memories_with_extra_features_(self, memory_lengths, memories): + memory_lengths = memory_lengths.data + memories = memories.data + if self.extra_features_slots > 0: + num_nonempty_memories = memory_lengths.ne(0).sum() + updated_memories = memories.new(memories.numel() + num_nonempty_memories * 
self.extra_features_slots) + src_offset = 0 + dst_offset = 0 + for i in range(memory_lengths.size(0)): + for j in range(self.opt['mem_size']): + length = memory_lengths[i, j] + if length > 0: + if self.opt['time_features']: + updated_memories[dst_offset] = self.time_feature(j) + dst_offset += 1 + updated_memories[dst_offset:dst_offset + length] = memories[src_offset:src_offset + length] + src_offset += length + dst_offset += length + memory_lengths += memory_lengths.ne(0).long() * self.extra_features_slots + memories.set_(updated_memories) + + def forward(self, memories, queries, answers, + memory_lengths, query_lengths, answer_lengths): + self.update_memories_with_extra_features_(memory_lengths, memories) + + in_memory_embeddings = self.in_memory_embedder(memory_lengths, memories) + out_memory_embeddings = self.out_memory_embedder(memory_lengths, memories) + query_embeddings = self.query_embedder(query_lengths, queries) + answer_embeddings = None + if answer_lengths.numel() > 0: + answer_embeddings = self.answer_embedder(answer_lengths, answers) + attention_mask = Variable(memory_lengths.data.ne(0), requires_grad=False) + + if self.opt['cuda']: + in_memory_embeddings = in_memory_embeddings.cuda(async=True) + out_memory_embeddings = out_memory_embeddings.cuda(async=True) + query_embeddings = query_embeddings.cuda(async=True) + if answer_lengths.numel() > 0: + answer_embeddings = answer_embeddings.cuda(async=True) + attention_mask = attention_mask.cuda(async=True) + + for _ in range(self.opt['hops']): + query_embeddings = self.memory_hop(query_embeddings, + in_memory_embeddings, out_memory_embeddings, attention_mask) + + return query_embeddings, answer_embeddings + + +class Embed(nn.Embedding): + def __init__(self, *args, position_encoding=False, **kwargs): + self.position_encoding = position_encoding + super().__init__(*args, **kwargs) + + def forward(self, lengths, indices): + lengths_mat = lengths.data + indices = indices.data + if lengths.dim() == 1 or 
lengths.size(1) == 1: + lengths_mat = lengths_mat.squeeze().unsqueeze(0) + + input = torch.LongTensor(lengths_mat.size(0), lengths_mat.size(1), torch.max(lengths_mat)) + pad = self.padding_idx if self.padding_idx is not None else 0 + input.fill_(pad) + emb_list = [] + offset = 0 + for i, row in enumerate(lengths_mat): + for j, length in enumerate(row): + if length > 0: + input[i, j, :length] = indices[offset:offset+length] + offset += length + input = Variable(input) + + for i, row in enumerate(lengths_mat): + emb = super().forward(input[i, :, :]) + if self.position_encoding: + emb = emb * Variable(self.position_tensor(row, emb)) + emb = torch.sum(emb, dim=1).squeeze(1) + for j, length in enumerate(row): + if length > 0: + emb[j] /= length + emb_list.append(emb) + embs = torch.stack(emb_list) + + if lengths.dim() == 1: + embs = embs.squeeze(0) + elif lengths.size(1) == 1: + embs = embs.squeeze().unsqueeze(1) + return embs + + @staticmethod + @lru_cache(maxsize=32) + def position_matrix(J, d): + m = torch.Tensor(J, d) + for k in range(1, d+1): + for j in range(1, J+1): + m[j-1, k-1] = (1 - j/J) - (k/d) * (1 - 2 * j/J) + return m + + @staticmethod + def position_tensor(sentence_lengths, embeddings): + t = torch.zeros(embeddings.size()) + embedding_dim = embeddings.size()[-1] + for i, length in enumerate(sentence_lengths): + if length > 0: + t[i, :length, :] = Embed.position_matrix(length, embedding_dim) + return t + + +class Hop(nn.Module): + def __init__(self, embedding_size): + super(Hop, self).__init__() + self.embedding_size = embedding_size + self.linear = nn.Linear(embedding_size, embedding_size, bias=False) + + def forward(self, query_embeddings, in_memory_embeddings, out_memory_embeddings, attention_mask=None): + attention = torch.bmm(in_memory_embeddings, query_embeddings.unsqueeze(2)).squeeze(2) + if attention_mask is not None: + # exclude masked elements from the softmax + attention = attention_mask.float() * attention + (1 - attention_mask.float()) * 
-1e20 + probs = softmax(attention).unsqueeze(1) + memory_output = torch.bmm(probs, out_memory_embeddings).squeeze(1) + query_embeddings = self.linear(query_embeddings) + output = memory_output + query_embeddings + return output + + +class DotScore(nn.Module): + def one_to_one(self, query_embeddings, answer_embeddings, reply_embeddings=None): + return (query_embeddings * answer_embeddings).sum(dim=1).squeeze(1) + + def one_to_many(self, query_embeddings, answer_embeddings, reply_embeddings=None): + return query_embeddings.mm(answer_embeddings.t()) diff --git a/parlai/agents/remote_agent/remote_agent.py b/parlai/agents/remote_agent/remote_agent.py index e8ef08ed5ae..e471bcddf93 100644 --- a/parlai/agents/remote_agent/remote_agent.py +++ b/parlai/agents/remote_agent/remote_agent.py @@ -5,6 +5,7 @@ # of patent rights can be found in the PATENTS file in the same directory. from parlai.core.agents import Agent, create_agent_from_shared from parlai.core.dict import DictionaryAgent +import argparse import copy import numpy as np import json @@ -12,19 +13,36 @@ import zmq -class RemoteAgent(Agent): +def sanitize(obs): + if 'image' in obs and type(obs['image']) != str: + # can't json serialize images, unless they're in ascii format + obs.pop('image', None) + for k, v in obs.items(): + if type(v) == set: + obs[k] = list(v) + return obs + +class RemoteAgentAgent(Agent): """Agent which connects over ZMQ to a paired agent. 
The other agent is launched using the command line options set via `add_cmdline_args`.""" @staticmethod def add_cmdline_args(argparser): - argparser.add_arg( + remote = argparser.add_argument_group('Remote Agent Args') + remote.add_argument( '--port', default=5555, help='first port to connect to for remote agents') - argparser.add_arg( - '--remote-cmd', required=True, - help='command to launch paired agent') - argparser.add_arg( + remote.add_argument( + '--remote-address', default='localhost', + help='address to connect to, defaults to localhost for ' + + 'connections, overriden with `*` if remote-host is set') + remote.add_argument( + '--remote-host', action='store_true', + help='whether or not this connection is the host or the client') + remote.add_argument( + '--remote-cmd', + help='command to launch paired agent, if applicable') + remote.add_argument( '--remote-args', help='optional arguments to pass to paired agent') @@ -35,9 +53,14 @@ def __init__(self, opt, shared=None): the multithreading effectively in their environment. (We don't run subprocess.Popen for each thread.) """ + self.opt = copy.deepcopy(opt) + self.address = opt['remote_address'] + if opt.get('remote_host') and self.address == 'localhost': + self.address = '*' + self.socket_type = zmq.REP if opt['remote_host'] else zmq.REQ if shared and 'port' in shared: + # for multithreading, use specified port self.port = shared['port'] - self.opt = copy.deepcopy(shared['opt']) else: if 'port' in opt: self.port = opt['port'] @@ -45,32 +68,38 @@ def __init__(self, opt, shared=None): raise RuntimeError('You need to run RemoteAgent.' 
+ 'add_cmdline_args(argparser) before ' + 'calling this class to set up options.') - self.process = subprocess.Popen( - '{cmd} {port} {numthreads} {args}'.format( - cmd=opt['remote_cmd'], port=opt['port'], - numthreads=opt['numthreads'], - args=opt.get('remote_args', '') - ).split() - ) - self.opt = copy.deepcopy(opt) + if opt.get('remote_cmd'): + # if available, command to launch partner instance, passing on + # some shared parameters from ParlAI + # useful especially if "remote" agent is running locally, e.g. + # in a different language than python + self.process = subprocess.Popen( + '{cmd} {port} {numthreads} {args}'.format( + cmd=opt['remote_cmd'], port=opt['port'], + numthreads=opt['numthreads'], + args=opt.get('remote_args', '') + ).split() + ) self.connect() + super().__init__(opt, shared) def connect(self): - """Connect to ZMQ socket as client. Requires package zmq.""" + """Bind or connect to ZMQ socket. Requires package zmq.""" context = zmq.Context() - self.socket = context.socket(zmq.REQ) + self.socket = context.socket(self.socket_type) self.socket.setsockopt(zmq.LINGER, 1) - self.socket.connect('tcp://localhost:{0}'.format(self.port)) - print('python thread connected to ' + - 'tcp://localhost:{0}'.format(self.port)) + host = 'tcp://{}:{}'.format(self.address, self.port) + if self.socket_type == zmq.REP: + self.socket.bind(host) + else: + self.socket.connect(host) + print('python thread connected to ' + host) def act(self): """Send message to paired agent listening over zmq.""" - if 'image' in self.observation: - # can't json serialize images - self.observation.pop('image', None) - text = json.dumps(self.observation) - self.socket.send_unicode(text) + if self.observation is not None: + text = json.dumps(sanitize(self.observation)) + self.socket.send_unicode(text) reply = self.socket.recv_unicode() return json.loads(reply) @@ -106,15 +135,19 @@ def shutdown(self): self.process.kill() -class ParsedRemoteAgent(RemoteAgent): +class 
ParsedRemoteAgent(RemoteAgentAgent): """Same as the regular remote agent, except that this agent converts all text into vectors using its dictionary before sending them. """ @staticmethod def add_cmdline_args(argparser): - super().add_cmdline_args(argparser) - ParsedRemoteAgent.dictionary_class().add_cmdline_args(argparser) + RemoteAgentAgent.add_cmdline_args(argparser) + try: + ParsedRemoteAgent.dictionary_class().add_cmdline_args(argparser) + except argparse.ArgumentError: + # don't freak out if the dictionary has already been added + pass @staticmethod def dictionary_class(): diff --git a/parlai/agents/repeat_label/repeat_label.py b/parlai/agents/repeat_label/repeat_label.py index 7becab203fe..1559c168eee 100644 --- a/parlai/agents/repeat_label/repeat_label.py +++ b/parlai/agents/repeat_label/repeat_label.py @@ -33,6 +33,8 @@ def __init__(self, opt, shared=None): def act(self): obs = self.observation + if obs is None: + return { 'text': "Nothing to repeat yet." } reply = {} reply['id'] = self.getID() if ('labels' in obs and obs['labels'] is not None diff --git a/parlai/agents/seq2seq/seq2seq.py b/parlai/agents/seq2seq/seq2seq.py index fb6585e0ee3..bb8f86ab7f8 100644 --- a/parlai/agents/seq2seq/seq2seq.py +++ b/parlai/agents/seq2seq/seq2seq.py @@ -17,13 +17,20 @@ class Seq2seqAgent(Agent): - """Simple agent which uses an LSTM to process incoming text observations.""" + """Simple agent which uses an RNN to process incoming text observations. + The RNN generates a vector which is used to represent the input text, + conditioning on the context to generate an output token-by-token. + + For more information, see Sequence to Sequence Learning with Neural Networks + `(Sutskever et al. 2014) `_. 
+ """ @staticmethod def add_cmdline_args(argparser): + """Add command-line arguments specifically for this agent.""" DictionaryAgent.add_cmdline_args(argparser) agent = argparser.add_argument_group('Seq2Seq Arguments') - agent.add_argument('-hs', '--hiddensize', type=int, default=64, + agent.add_argument('-hs', '--hiddensize', type=int, default=128, help='size of the hidden layers and embeddings') agent.add_argument('-nl', '--numlayers', type=int, default=2, help='number of hidden layers') @@ -31,74 +38,123 @@ def add_cmdline_args(argparser): help='learning rate') agent.add_argument('-dr', '--dropout', type=float, default=0.1, help='dropout rate') + # agent.add_argument('-bi', '--bidirectional', type='bool', default=False, + # help='whether to encode the context with a bidirectional RNN') agent.add_argument('--no-cuda', action='store_true', default=False, help='disable GPUs even if available') agent.add_argument('--gpu', type=int, default=-1, help='which GPU device to use') + agent.add_argument('-r', '--rank-candidates', type='bool', default=False, + help='rank candidates if available. this is done by computing the' + + ' mean score per token for each candidate and selecting the ' + + 'highest scoring one.') def __init__(self, opt, shared=None): + # initialize defaults first super().__init__(opt, shared) - opt['cuda'] = not opt['no_cuda'] and torch.cuda.is_available() - if opt['cuda']: - print('[ Using CUDA ]') - torch.cuda.set_device(opt['gpu']) if not shared: + # this is not a shared instance of this class, so do full + # initialization. if shared is set, only set up shared members. 
+ + # check for cuda + self.use_cuda = not opt.get('no_cuda') and torch.cuda.is_available() + if self.use_cuda: + print('[ Using CUDA ]') + torch.cuda.set_device(opt['gpu']) + + if opt.get('model_file') and os.path.isfile(opt['model_file']): + # load model parameters if available + print('Loading existing model params from ' + opt['model_file']) + new_opt, self.states = self.load(opt['model_file']) + # override options with stored ones + opt = self.override_opt(new_opt) + self.dict = DictionaryAgent(opt) self.id = 'Seq2Seq' + # we use END markers to break input and output and end our output + self.END = self.dict.end_token + self.observation = {'text': self.END, 'episode_done': True} + self.END_TENSOR = torch.LongTensor(self.dict.parse(self.END)) + # get index of null token from dictionary (probably 0) + self.NULL_IDX = self.dict.txt2vec(self.dict.null_token)[0] + + # store important params directly hsz = opt['hiddensize'] - self.EOS = self.dict.eos_token - self.EOS_TENSOR = torch.LongTensor(self.dict.parse(self.EOS)) self.hidden_size = hsz self.num_layers = opt['numlayers'] self.learning_rate = opt['learningrate'] - self.use_cuda = opt.get('cuda', False) + self.rank = opt['rank_candidates'] self.longest_label = 1 + # set up modules self.criterion = nn.NLLLoss() - self.lt = nn.Embedding(len(self.dict), hsz, padding_idx=0, + # lookup table stores word embeddings + self.lt = nn.Embedding(len(self.dict), hsz, + padding_idx=self.NULL_IDX, scale_grad_by_freq=True) + # encoder captures the input text self.encoder = nn.GRU(hsz, hsz, opt['numlayers']) + # decoder produces our output states self.decoder = nn.GRU(hsz, hsz, opt['numlayers']) - self.d2o = nn.Linear(hsz, len(self.dict)) + # linear layer helps us produce outputs from final decoder state + self.h2o = nn.Linear(hsz, len(self.dict)) + # droput on the linear layer helps us generalize self.dropout = nn.Dropout(opt['dropout']) + # softmax maps output scores to probabilities self.softmax = nn.LogSoftmax() + # set up 
optims for each module lr = opt['learningrate'] self.optims = { 'lt': optim.SGD(self.lt.parameters(), lr=lr), 'encoder': optim.SGD(self.encoder.parameters(), lr=lr), 'decoder': optim.SGD(self.decoder.parameters(), lr=lr), - 'd2o': optim.SGD(self.d2o.parameters(), lr=lr), + 'h2o': optim.SGD(self.h2o.parameters(), lr=lr), } + + if hasattr(self, 'states'): + # set loaded states if applicable + self.set_states(self.states) + if self.use_cuda: self.cuda() - if opt.get('model_file') and os.path.isfile(opt['model_file']): - print('Loading existing model parameters from ' + opt['model_file']) - self.load(opt['model_file']) self.episode_done = True + def override_opt(self, new_opt): + """Print out each added key and each overriden key.""" + for k, v in new_opt.items(): + if k not in self.opt: + print('Adding new option [ {k}: {v} ]'.format(k=k, v=v)) + elif self.opt[k] != v: + print('Overriding option [ {k}: {old} => {v}]'.format( + k=k, old=self.opt[k], v=v)) + self.opt[k] = v + return self.opt + def parse(self, text): - return torch.LongTensor(self.dict.txt2vec(text)) + return self.dict.txt2vec(text) def v2t(self, vec): return self.dict.vec2txt(vec) def cuda(self): + self.END_TENSOR = self.END_TENSOR.cuda(async=True) self.criterion.cuda() self.lt.cuda() self.encoder.cuda() self.decoder.cuda() - self.d2o.cuda() + self.h2o.cuda() self.dropout.cuda() self.softmax.cuda() - def hidden_to_idx(self, hidden, drop=False): + def hidden_to_idx(self, hidden, dropout=False): + """Converts hidden state vectors into indices into the dictionary.""" if hidden.size(0) > 1: raise RuntimeError('bad dimensions of tensor:', hidden) hidden = hidden.squeeze(0) - scores = self.d2o(hidden) - if drop: + scores = self.h2o(hidden) + if dropout: scores = self.dropout(scores) scores = self.softmax(scores) _max_score, idx = scores.max(1) @@ -136,106 +192,156 @@ def observe(self, observation): self.episode_done = observation['episode_done'] return observation - def update(self, xs, ys): + def 
predict(self, xs, ys=None, cands=None): + """Produce a prediction from our model. Update the model using the + targets if available. + """ batchsize = len(xs) + text_cand_inds = None # first encode context xes = self.lt(xs).t() h0 = self.init_zeros(batchsize) _output, hn = self.encoder(xes, h0) - # start with EOS tensor for all - x = self.EOS_TENSOR - if self.use_cuda: - x = x.cuda(async=True) - x = Variable(x) + # next we use END as an input to kick off our decoder + x = Variable(self.END_TENSOR) xe = self.lt(x).unsqueeze(1) xes = xe.expand(xe.size(0), batchsize, xe.size(2)) + # list of output tokens for each example in the batch output_lines = [[] for _ in range(batchsize)] - self.zero_grad() - # update model - loss = 0 - self.longest_label = max(self.longest_label, ys.size(1)) - for i in range(ys.size(1)): - output, hn = self.decoder(xes, hn) - preds, scores = self.hidden_to_idx(output, drop=True) - y = ys.select(1, i) - loss += self.criterion(scores, y) - # use the true token as the next input - xes = self.lt(y).unsqueeze(0) - # hn = self.dropout(hn) - for j in range(preds.size(0)): - token = self.v2t([preds.data[j][0]]) - output_lines[j].append(token) - - loss.backward() - self.update_params() - - if random.random() < 0.1: - true = self.v2t(ys.data[0]) - #print('loss:', round(loss.data[0], 2), - # ' '.join(output_lines[0]), '(true: {})'.format(true)) - return output_lines - - def predict(self, xs): - batchsize = len(xs) - - # first encode context - xes = self.lt(xs).t() - h0 = self.init_zeros(batchsize) - _output, hn = self.encoder(xes, h0) - - # start with EOS tensor for all - x = self.EOS_TENSOR - if self.use_cuda: - x = x.cuda(async=True) - x = Variable(x) - xe = self.lt(x).unsqueeze(1) - xes = xe.expand(xe.size(0), batchsize, xe.size(2)) - - done = [False for _ in range(batchsize)] - total_done = 0 - max_len = 0 - output_lines = [[] for _ in range(batchsize)] - - while(total_done < batchsize) and max_len < self.longest_label: - output, hn = 
self.decoder(xes, hn) - preds, scores = self.hidden_to_idx(output, drop=False) - xes = self.lt(preds.t()) - max_len += 1 - for i in range(preds.size(0)): - if not done[i]: - token = self.v2t(preds.data[i]) - if token == self.EOS: - done[i] = True - total_done += 1 - else: - output_lines[i].append(token) - if random.random() < 0.1: - print('prediction:', ' '.join(output_lines[0])) - return output_lines - - def batchify(self, obs): - exs = [ex for ex in obs if 'text' in ex] - valid_inds = [i for i, ex in enumerate(obs) if 'text' in ex] - + if ys is not None: + # update the model based on the labels + self.zero_grad() + loss = 0 + # keep track of longest label we've ever seen + self.longest_label = max(self.longest_label, ys.size(1)) + for i in range(ys.size(1)): + output, hn = self.decoder(xes, hn) + preds, scores = self.hidden_to_idx(output, dropout=True) + y = ys.select(1, i) + loss += self.criterion(scores, y) + # use the true token as the next input instead of predicted + # this produces a biased prediction but better training + xes = self.lt(y).unsqueeze(0) + for b in range(batchsize): + # convert the output scores to tokens + token = self.v2t([preds.data[b][0]]) + output_lines[b].append(token) + + loss.backward() + self.update_params() + + if random.random() < 0.1: + # sometimes output a prediction for debugging + print('prediction:', ' '.join(output_lines[0]), + '\nlabel:', self.dict.vec2txt(ys.data[0])) + else: + # just produce a prediction without training the model + done = [False for _ in range(batchsize)] + total_done = 0 + max_len = 0 + + if cands: + # score each candidate separately + + # cands are exs_with_cands x cands_per_ex x words_per_cand + # cview is total_cands x words_per_cand + cview = cands.view(-1, cands.size(2)) + cands_xes = xe.expand(xe.size(0), cview.size(0), xe.size(2)) + sz = hn.size() + cands_hn = ( + hn.view(sz[0], sz[1], 1, sz[2]) + .expand(sz[0], sz[1], cands.size(1), sz[2]) + .contiguous() + .view(sz[0], -1, sz[2]) + ) + + 
cand_scores = torch.zeros(cview.size(0)) + cand_lengths = torch.LongTensor(cview.size(0)).fill_(0) + if self.use_cuda: + cand_scores = cand_scores.cuda(async=True) + cand_lengths = cand_lengths.cuda(async=True) + cand_scores = Variable(cand_scores) + cand_lengths = Variable(cand_lengths) + + for i in range(cview.size(1)): + output, cands_hn = self.decoder(cands_xes, cands_hn) + preds, scores = self.hidden_to_idx(output, dropout=False) + cs = cview.select(1, i) + non_nulls = cs.ne(self.NULL_IDX) + cand_lengths += non_nulls.long() + score_per_cand = torch.gather(scores, 1, cs.unsqueeze(1)) + cand_scores += score_per_cand.squeeze() * non_nulls.float() + cands_xes = self.lt(cs).unsqueeze(0) + + # set empty scores to -1, so when divided by 0 they become -inf + cand_scores -= cand_lengths.eq(0).float() + # average the scores per token + cand_scores /= cand_lengths.float() + + cand_scores = cand_scores.view(cands.size(0), cands.size(1)) + srtd_scores, text_cand_inds = cand_scores.sort(1, True) + text_cand_inds = text_cand_inds.data + + # now, generate a response from scratch + while(total_done < batchsize) and max_len < self.longest_label: + # keep producing tokens until we hit END or max length for each + # example in the batch + output, hn = self.decoder(xes, hn) + preds, scores = self.hidden_to_idx(output, dropout=False) + + xes = self.lt(preds.t()) + max_len += 1 + for b in range(batchsize): + if not done[b]: + # only add more tokens for examples that aren't done yet + token = self.v2t(preds.data[b]) + if token == self.END: + # if we produced END, we're done + done[b] = True + total_done += 1 + else: + output_lines[b].append(token) + + if random.random() < 0.1: + # sometimes output a prediction for debugging + print('prediction:', ' '.join(output_lines[0])) + + return output_lines, text_cand_inds + + def batchify(self, observations): + """Convert a list of observations into input & target tensors.""" + # valid examples + exs = [ex for ex in observations if 'text' in 
ex] + # the indices of the valid (non-empty) tensors + valid_inds = [i for i, ex in enumerate(observations) if 'text' in ex] + + # set up the input tensors batchsize = len(exs) - parsed = [self.parse(ex['text']) for ex in exs] - max_x_len = max([len(x) for x in parsed]) - xs = torch.LongTensor(batchsize, max_x_len).fill_(0) - for i, x in enumerate(parsed): - offset = max_x_len - len(x) - for j, idx in enumerate(x): - xs[i][j + offset] = idx - if self.use_cuda: - xs = xs.cuda(async=True) - xs = Variable(xs) + # tokenize the text + xs = None + if batchsize > 0: + parsed = [self.parse(ex['text']) for ex in exs] + max_x_len = max([len(x) for x in parsed]) + xs = torch.LongTensor(batchsize, max_x_len).fill_(0) + # pack the data to the right side of the tensor for this model + for i, x in enumerate(parsed): + offset = max_x_len - len(x) + for j, idx in enumerate(x): + xs[i][j + offset] = idx + if self.use_cuda: + xs = xs.cuda(async=True) + xs = Variable(xs) + # set up the target tensors ys = None - if 'labels' in exs[0]: - labels = [random.choice(ex['labels']) + ' ' + self.EOS for ex in exs] + if batchsize > 0 and any(['labels' in ex for ex in exs]): + # randomly select one of the labels to update on, if multiple + # append END to each label + labels = [random.choice(ex.get('labels', [''])) + ' ' + self.END for ex in exs] parsed = [self.parse(y) for y in labels] max_y_len = max(len(y) for y in parsed) ys = torch.LongTensor(batchsize, max_y_len).fill_(0) @@ -245,49 +351,103 @@ def batchify(self, obs): if self.use_cuda: ys = ys.cuda(async=True) ys = Variable(ys) - return xs, ys, valid_inds + + # set up candidates + cands = None + valid_cands = None + if ys is None and self.rank: + # only do ranking when no targets available and ranking flag set + parsed = [] + valid_cands = [] + for i in valid_inds: + if 'label_candidates' in observations[i]: + # each candidate tuple is a pair of the parsed version and + # the original full string + cs = 
list(observations[i]['label_candidates']) + parsed.append([self.parse(c) for c in cs]) + valid_cands.append((i, cs)) + if len(parsed) > 0: + # TODO: store lengths of cands separately, so don't have zero + # padding for varying number of cands per example + # found cands, pack them into tensor + max_c_len = max(max(len(c) for c in cs) for cs in parsed) + max_c_cnt = max(len(cs) for cs in parsed) + cands = torch.LongTensor(len(parsed), max_c_cnt, max_c_len).fill_(0) + for i, cs in enumerate(parsed): + for j, c in enumerate(cs): + for k, idx in enumerate(c): + cands[i][j][k] = idx + if self.use_cuda: + cands = cands.cuda(async=True) + cands = Variable(cands) + + return xs, ys, valid_inds, cands, valid_cands def batch_act(self, observations): batchsize = len(observations) + # initialize a table of replies with this agent's id batch_reply = [{'id': self.getID()} for _ in range(batchsize)] - xs, ys, valid_inds = self.batchify(observations) + # convert the observations into batches of inputs and targets + # valid_inds tells us the indices of all valid examples + # e.g. 
for input [{}, {'text': 'hello'}, {}, {}], valid_inds is [1] + # since the other three elements had no 'text' field + xs, ys, valid_inds, cands, valid_cands = self.batchify(observations) - if len(xs) == 0: + if xs is None: + # no valid examples, just return the empty responses we set up return batch_reply - # Either train or predict - if ys is not None: - predictions = self.update(xs, ys) - else: - predictions = self.predict(xs) + # produce predictions either way, but use the targets if available + predictions, text_cand_inds = self.predict(xs, ys, cands) for i in range(len(predictions)): - batch_reply[valid_inds[i]]['text'] = ' '.join( - c for c in predictions[i] if c != self.EOS) + # map the predictions back to non-empty examples in the batch + # we join with spaces since we produce tokens one at a time + curr = batch_reply[valid_inds[i]] + curr['text'] = ' '.join(c for c in predictions[i] if c != self.END + and c != self.dict.null_token) + + if text_cand_inds is not None: + for i in range(len(valid_cands)): + order = text_cand_inds[i] + batch_idx, curr_cands = valid_cands[i] + curr = batch_reply[batch_idx] + curr['text_candidates'] = [curr_cands[idx] for idx in order + if idx < len(curr_cands)] return batch_reply def act(self): + # call batch_act with this batch of one return self.batch_act([self.observation])[0] - def save(self, path): - model = {} - model['lt'] = self.lt.state_dict() - model['encoder'] = self.encoder.state_dict() - model['decoder'] = self.decoder.state_dict() - model['d2o'] = self.d2o.state_dict() - model['longest_label'] = self.longest_label + def save(self, path=None): + path = self.opt.get('model_file', None) if path is None else path - with open(path, 'wb') as write: - torch.save(model, write) + if path: + model = {} + model['lt'] = self.lt.state_dict() + model['encoder'] = self.encoder.state_dict() + model['decoder'] = self.decoder.state_dict() + model['h2o'] = self.h2o.state_dict() + model['longest_label'] = self.longest_label + 
model['opt'] = self.opt + + with open(path, 'wb') as write: + torch.save(model, write) def load(self, path): + """Return opt and model states.""" with open(path, 'rb') as read: model = torch.load(read) - self.lt.load_state_dict(model['lt']) - self.encoder.load_state_dict(model['encoder']) - self.decoder.load_state_dict(model['decoder']) - self.d2o.load_state_dict(model['d2o']) - self.longest_label = model['longest_label'] + return model['opt'], model + + def set_states(self, states): + """Set the state dicts of the modules from saved states.""" + self.lt.load_state_dict(states['lt']) + self.encoder.load_state_dict(states['encoder']) + self.decoder.load_state_dict(states['decoder']) + self.h2o.load_state_dict(states['h2o']) + self.longest_label = states['longest_label'] diff --git a/parlai/core/agents.py b/parlai/core/agents.py index 4367d20acc2..dfd4a822872 100644 --- a/parlai/core/agents.py +++ b/parlai/core/agents.py @@ -76,6 +76,14 @@ def getID(self): def reset(self): self.observation = None + def reset_metrics(self): + pass + + def save(self): + """If applicable, save any parameters needed to recreate this agent from + loaded parameters.""" + pass + def share(self): """If applicable, share any parameters needed to create a shared version of this agent. @@ -89,13 +97,14 @@ def shutdown(self): """Perform any final cleanup if needed.""" pass + class Teacher(Agent): """Basic Teacher agent which keeps track of how many times it's received messages. 
Teachers provide the ``report()`` method to get back metrics.""" def __init__(self, opt, shared=None): if not hasattr(self, 'opt'): - self.opt = opt + self.opt = copy.deepcopy(opt) if not hasattr(self, 'id'): self.id = opt.get('task', 'teacher') if not hasattr(self, 'metrics'): @@ -133,17 +142,19 @@ def report(self): def reset(self): super().reset() + self.reset_metrics() self.epochDone = False + + def reset_metrics(self): self.metrics.clear() def share(self): - """If applicable, share any parameters needed to create a shared version - of this agent. - """ + """In addition to default Agent shared parameters, share metrics.""" shared = super().share() shared['metrics'] = self.metrics return shared + class MultiTaskTeacher(Teacher): """Creates a teacher that is actually a set of teachers each based on a task string--each of these teachers will get called in turn, @@ -242,6 +253,14 @@ def reset(self): for t in self.tasks: t.reset() + def reset_metrics(self): + for t in self.tasks: + t.reset_metrics() + + def save(self): + for t in self.tasks: + t.save() + def share(self): shared = {} shared['class'] = type(self) @@ -249,6 +268,12 @@ def share(self): shared['tasks'] = [t.share() for t in self.tasks] return shared + def shutdown(self): + """Shutdown each agent.""" + for t in self.tasks: + t.shutdown() + + def name_to_agent_class(name): words = name.split('_') class_name = '' @@ -279,8 +304,11 @@ def create_agent(opt): (i.e. the path followed by the class name) or else just ``ir_baseline`` which assumes the path above, and a class name suffixed with 'Agent'. """ - model_class = get_agent_module(opt['model']) - return model_class(opt) + if opt.get('model'): + model_class = get_agent_module(opt['model']) + return model_class(opt) + else: + raise RuntimeError('Need to set `model` argument to use create_agent.') # Helper functions to create agent/agents given shared parameters # returned from agent.share(). Useful for parallelism, sharing params, etc. 
@@ -296,6 +324,27 @@ def create_agents_from_shared(shared): shared_agents.append(agent) return shared_agents +def get_task_module(taskname): + # get the module of the task agent + sp = taskname.strip().split(':') + if '.' in sp[0]: + module_name = sp[0] + else: + task = sp[0].lower() + module_name = "parlai.tasks.%s.agents" % (task) + if len(sp) > 1: + sp[1] = sp[1][0].upper() + sp[1][1:] + teacher = sp[1] + if '.' not in sp[0] and 'Teacher' not in teacher: + # Append "Teacher" to class name by default if + # a complete path is not given. + teacher += "Teacher" + else: + teacher = "DefaultTeacher" + my_module = importlib.import_module(module_name) + teacher_class = getattr(my_module, teacher) + return teacher_class + def create_task_agent_from_taskname(opt): """Creates task agent(s) assuming the input ``task_dir:teacher_class``. @@ -309,23 +358,7 @@ def create_task_agent_from_taskname(opt): '--task {task_name}.') if ',' not in opt['task']: # Single task - sp = opt['task'].strip().split(':') - if '.' in sp[0]: - module_name = sp[0] - else: - task = sp[0].lower() - module_name = "parlai.tasks.%s.agents" % (task) - if len(sp) > 1: - sp[1] = sp[1][0].upper() + sp[1][1:] - teacher = sp[1] - if '.' not in sp[0] and 'Teacher' not in teacher: - # Append "Teacher" to class name by default if - # a complete path is not given. - teacher += "Teacher" - else: - teacher = "DefaultTeacher" - my_module = importlib.import_module(module_name) - teacher_class = getattr(my_module, teacher) + teacher_class = get_task_module(opt['task']) task_agents = teacher_class(opt) if type(task_agents) != list: task_agents = [task_agents] diff --git a/parlai/core/build_data.py b/parlai/core/build_data.py index d8859e30610..f7b5ec4a0fb 100644 --- a/parlai/core/build_data.py +++ b/parlai/core/build_data.py @@ -8,6 +8,7 @@ These can be replaced if your particular file system does not support them. 
""" +import time import datetime import os import requests @@ -26,7 +27,7 @@ def built(path, version_string=None): else: with open(fname, 'r') as read: text = read.read().split('\n') - return (len(text) == 2 and text[1] == version_string) + return (len(text) > 1 and text[1] == version_string) else: return os.path.isfile(os.path.join(path, '.built')) @@ -39,6 +40,7 @@ def mark_done(path, version_string=None): if version_string: write.write('\n' + version_string) + def log_progress(curr, total, width=40): """Displays a bar showing the current progress.""" done = min(curr * width // total, width) @@ -52,32 +54,74 @@ def log_progress(curr, total, width=40): print(progress, end='\r') -def download(url, path, fname, redownload=True): +def download(url, path, fname, redownload=False): """Downloads file using `requests`. If ``redownload`` is set to false, then will not download tar file again if it is present (default ``True``).""" outfile = os.path.join(path, fname) - if redownload or not os.path.isfile(outfile): + download = not os.path.isfile(outfile) or redownload + + retry = 5 + exp_backoff = [2 ** r for r in reversed(range(retry))] + while download and retry >= 0: + resume_file = outfile + '.part' + resume = os.path.isfile(resume_file) + if resume: + resume_pos = os.path.getsize(resume_file) + mode = 'ab' + else: + resume_pos = 0 + mode = 'wb' + response = None + with requests.Session() as session: - response = session.get(url, stream=True) - CHUNK_SIZE = 32768 - total_size = int(response.headers.get('Content-Length', -1)) - done = 0 - with open(outfile, 'wb') as f: - for chunk in response.iter_content(CHUNK_SIZE): - if chunk: # filter out keep-alive new chunks - f.write(chunk) - if total_size > 0: - done += len(chunk) - if total_size < done: - # don't freak out if content-length was too small - total_size = done - log_progress(done, total_size) - if done < total_size: - raise RuntimeWarning('Received less data than specified in ' + - 'Content-Length header for ' + 
url + '.' + - ' There may be a download problem.') - print() - response.close() + try: + header = {'Range': 'bytes=%d-' % resume_pos, + 'Accept-Encoding': 'identity'} if resume else {} + response = session.get(url, stream=True, timeout=5, headers=header) + + # negative reply could be 'none' or just missing + if resume and response.headers.get('Accept-Ranges', 'none') == 'none': + resume_pos = 0 + mode = 'wb' + + CHUNK_SIZE = 32768 + total_size = int(response.headers.get('Content-Length', -1)) + # server returns remaining size if resuming, so adjust total + total_size += resume_pos + done = resume_pos + + with open(resume_file, mode) as f: + for chunk in response.iter_content(CHUNK_SIZE): + if chunk: # filter out keep-alive new chunks + f.write(chunk) + if total_size > 0: + done += len(chunk) + if total_size < done: + # don't freak out if content-length was too small + total_size = done + log_progress(done, total_size) + break + except requests.exceptions.ConnectionError: + retry -= 1 + print(''.join([' '] * 60), end='\r') # TODO Better way to clean progress bar? + if retry >= 0: + print('Connection error, retrying. (%d retries left)' % retry) + time.sleep(exp_backoff[retry]) + else: + print('Retried too many times, stopped retrying.') + finally: + if response: + response.close() + if retry < 0: + raise RuntimeWarning('Connection broken too many times. Stopped retrying.') + + if download and retry > 0: + print() + if done < total_size: + raise RuntimeWarning('Received less data than specified in ' + + 'Content-Length header for ' + url + '.' 
+ + ' There may be a download problem.') + move(resume_file, outfile) def make_dir(path): diff --git a/parlai/core/dialog_teacher.py b/parlai/core/dialog_teacher.py index a7ef472b66b..f0743230358 100644 --- a/parlai/core/dialog_teacher.py +++ b/parlai/core/dialog_teacher.py @@ -6,6 +6,7 @@ from .agents import Teacher +from .image_featurizers import ImageLoader from PIL import Image import random import os @@ -45,10 +46,11 @@ def __init__(self, opt, shared=None): # first initialize any shared objects self.random = self.datatype == 'train' if shared and shared.get('data'): - self.data = shared['data'] + self.data = DialogData(opt, shared=shared['data']) else: - self.data = DialogData(opt, self.setup_data(opt['datafile']), - cands=self.label_candidates()) + self.data = DialogData(opt, + data_loader=self.setup_data(opt['datafile']), + cands=self.label_candidates()) # for ordered data in batch mode (especially, for validation and # testing), each teacher in the batch gets a start index and a step @@ -83,7 +85,7 @@ def __next__(self): def share(self): shared = super().share() - shared['data'] = self.data + shared['data'] = self.data.share() return shared def label_candidates(self): @@ -176,15 +178,30 @@ class at request-time. should always point to the raw image file. or randomly when returning examples to the caller. 
""" - def __init__(self, opt, data_loader, cands=None): + def __init__(self, opt, data_loader=None, cands=None, shared=None): # self.data is a list of episodes # each episode is a tuple of entries # each entry is a tuple of values for the action/observation table self.opt = opt - self.data = [] - self._load(data_loader) - self.cands = None if cands == None else set(sys.intern(c) for c in cands) + if shared: + self.image_loader = shared.get('image_loader', None) + self.data = shared.get('data', []) + self.cands = shared.get('cands', None) + else: + self.image_loader = ImageLoader(opt) + self.data = [] + self._load(data_loader) + self.cands = None if cands == None else set(sys.intern(c) for c in cands) self.addedCands = [] + self.copied_cands = False + + def share(self): + shared = { + 'data': self.data, + 'cands': self.cands, + 'image_loader': self.image_loader + } + return shared def __len__(self): """Returns total number of entries available. Each episode has at least @@ -214,10 +231,13 @@ def _load(self, data_loader): new_entry.append(None) if len(entry) > 1: # process labels if available - if entry[1] is not None: + if entry[1] is None: + new_entry.append(None) + elif hasattr(entry[1], '__iter__') and type(entry[1]) is not str: + # make sure iterable over labels, not single string new_entry.append(tuple(sys.intern(e) for e in entry[1])) else: - new_entry.append(None) + raise TypeError('Must provide iterable over labels, not a single string.') if len(entry) > 2: # process reward if available if entry[2] is not None: @@ -225,19 +245,21 @@ def _load(self, data_loader): else: new_entry.append(None) if len(entry) > 3: - if entry[3] is not None: - # process label candidates if available - if last_cands and entry[3] is last_cands: - # if cands are shared, say "same" so we - # don't store them again - new_entry.append( - sys.intern('same as last time')) - else: - last_cands = entry[3] - new_entry.append(tuple( - sys.intern(e) for e in entry[3])) - else: + # process 
label candidates if available + if entry[3] is None: new_entry.append(None) + elif last_cands and entry[3] is last_cands: + # if cands are shared, say "same" so we + # don't store them again + new_entry.append( + sys.intern('same as last time')) + elif hasattr(entry[3], '__iter__') and type(entry[3]) is not str: + # make sure iterable over candidates, not single string + last_cands = entry[3] + new_entry.append(tuple( + sys.intern(e) for e in entry[3])) + else: + raise TypeError('Must provide iterable over label candidates, not a single string.') if len(entry) > 4 and entry[4] is not None: new_entry.append(sys.intern(entry[4])) @@ -272,7 +294,7 @@ def get(self, episode_idx, entry_idx=0): if entry[3] is not None: table['label_candidates'] = entry[3] if len(entry) > 4 and entry[4] is not None: - img = load_image(self.opt, entry[4]) + img = self.image_loader.load(entry[4]) if img is not None: table['image'] = img @@ -285,6 +307,9 @@ def get(self, episode_idx, entry_idx=0): for label in table['labels']: if label not in self.cands: # add labels, queue them for removal next time + if not self.copied_cands: + self.cands = self.cands.copy() + self.copied_cands = True self.cands.add(label) self.addedCands.append(label) table['label_candidates'] = self.cands @@ -296,43 +321,3 @@ def get(self, episode_idx, entry_idx=0): # last entry in this episode table['episode_done'] = episode_done return table, end_of_data - - -_greyscale = ' .,:;crsA23hHG#98&@' - - -def img_to_ascii(path): - im = Image.open(path) - im.thumbnail((60, 40), Image.BICUBIC) - im = im.convert('L') - asc = [] - for y in range(0, im.size[1]): - for x in range(0, im.size[0]): - lum = 255 - im.getpixel((x, y)) - asc.append(_greyscale[lum * len(_greyscale) // 256]) - asc.append('\n') - return ''.join(asc) - - -def load_image(opt, path): - mode = opt.get('image_mode', 'raw') - if mode is None or mode == 'none': - # don't need to load images - return None - elif mode == 'raw': - # raw just returns RGB values - return 
Image.open(path).convert('RGB') - elif mode == 'ascii': - # convert images to ascii ¯\_(ツ)_/¯ - return img_to_ascii(path) - else: - # otherwise, looks for preprocessed version under 'mode' directory - prepath, imagefn = os.path.split(path) - new_path = os.path.join(prepath, mode, imagefn) - if not os.path.isfile(new_path): - # currently only supports *downloaded* preprocessing - # TODO: generate preprocessed images if not available - raise NotImplementedError('image preprocessing mode' + - '{} not supported yet'.format(mode)) - else: - return Image.open(path) diff --git a/parlai/core/dict.py b/parlai/core/dict.py index 699dbaab308..a287ab99677 100644 --- a/parlai/core/dict.py +++ b/parlai/core/dict.py @@ -14,6 +14,20 @@ import re +def escape(s): + """Replace potential special characters with escaped version. + For example, newline => \\n and tab => \\t + """ + return s.replace('\n', '\\n').replace('\t', '\\t').replace('\r', '\\r') + + +def unescape(s): + """Revert escaped characters back to their special version. 
+ For example, \\n => newline and \\t => tab + """ + return s.replace('\\n', '\n').replace('\\t', '\t').replace('\\r', '\r') + + def find_ngrams(token_dict, text, n): """Breaks text into ngrams that appear in ``token_dict``.""" # base case @@ -57,8 +71,9 @@ class DictionaryAgent(Agent): default_maxngram = -1 default_minfreq = 0 default_null = '__NULL__' - default_eos = '__EOS__' + default_end = '__END__' default_unk = '__UNK__' + default_start = '__START__' @staticmethod def add_cmdline_args(argparser): @@ -87,11 +102,14 @@ def add_cmdline_args(argparser): '--dict-nulltoken', default=DictionaryAgent.default_null, help='empty token, can be used for padding or just empty values') dictionary.add_argument( - '--dict-eostoken', default=DictionaryAgent.default_eos, + '--dict-endtoken', default=DictionaryAgent.default_end, help='token for end of sentence markers, if needed') dictionary.add_argument( '--dict-unktoken', default=DictionaryAgent.default_unk, help='token to return for unavailable words') + dictionary.add_argument( + '--dict-starttoken', default=DictionaryAgent.default_start, + help='token for starting sentence generation, if needed') dictionary.add_argument( '--dict-maxexs', default=100000, type=int, help='max number of examples to build dict on') @@ -101,8 +119,9 @@ def __init__(self, opt, shared=None): # initialize fields self.opt = copy.deepcopy(opt) self.null_token = opt['dict_nulltoken'] - self.eos_token = opt['dict_eostoken'] + self.end_token = opt['dict_endtoken'] self.unk_token = opt['dict_unktoken'] + self.start_token = opt['dict_starttoken'] self.max_ngram_size = opt['dict_max_ngram_size'] if shared: @@ -118,11 +137,11 @@ def __init__(self, opt, shared=None): self.tok2ind[self.null_token] = 0 self.ind2tok[0] = self.null_token - if self.eos_token: - # set special unknown word token + if self.end_token: + # set special end of sentence word token index = len(self.tok2ind) - self.tok2ind[self.eos_token] = index - self.ind2tok[index] = self.eos_token + 
self.tok2ind[self.end_token] = index + self.ind2tok[index] = self.end_token if self.unk_token: # set special unknown word token @@ -130,6 +149,12 @@ def __init__(self, opt, shared=None): self.tok2ind[self.unk_token] = index self.ind2tok[index] = self.unk_token + if self.start_token: + # set special start of sentence word token + index = len(self.tok2ind) + self.tok2ind[self.start_token] = index + self.ind2tok[index] = self.start_token + if opt.get('dict_file') and os.path.isfile(opt['dict_file']): # load pre-existing dictionary self.load(opt['dict_file']) @@ -150,13 +175,17 @@ def __init__(self, opt, shared=None): if not shared: + if self.start_token: + # fix count for start of sentence token to one billion and three + self.freq[self.start_token] = 1000000003 + if self.null_token: # fix count for null token to one billion and two self.freq[self.null_token] = 1000000002 - if self.eos_token: + if self.end_token: # fix count for end of sentence token to one billion and one - self.freq[self.eos_token] = 1000000001 + self.freq[self.end_token] = 1000000001 if self.unk_token: # fix count for unknown token to one billion @@ -253,12 +282,12 @@ def load(self, filename): """Load pre-existing dictionary in 'token[count]' format. Initialize counts from other dictionary, or 0 if they aren't included. """ - print('Dictionary: loading existing dictionary from {}.'.format( + print('Dictionary: loading existing dictionary from {}'.format( filename)) with open(filename) as read: for line in read: split = line.strip().split('\t') - token = split[0] + token = unescape(split[0]) cnt = int(split[1]) if len(split) > 1 else 0 self.freq[token] = cnt if token not in self.tok2ind: @@ -267,7 +296,7 @@ def load(self, filename): self.ind2tok[index] = token print('[ num words = %d ]' % len(self)) - def save(self, filename, append=False, sort=True): + def save(self, filename=None, append=False, sort=True): """Save dictionary to file. 
Format is 'tokencount' for every token in the dictionary, sorted by count with the most frequent words first. @@ -277,14 +306,16 @@ def save(self, filename, append=False, sort=True): If ``sort`` (default ``True``), then first sort the dictionary before saving. """ - print('Dictionary: saving dictionary to {}.'.format(filename)) + filename = self.opt['model_file'] if filename is None else filename + print('Dictionary: saving dictionary to {}'.format(filename)) if sort: self.sort() + with open(filename, 'a' if append else 'w') as write: for i in range(len(self.ind2tok)): tok = self.ind2tok[i] cnt = self.freq[tok] - write.write('{tok}\t{cnt}\n'.format(tok=tok, cnt=cnt)) + write.write('{tok}\t{cnt}\n'.format(tok=escape(tok), cnt=cnt)) def sort(self): """Sorts the dictionary, so that the elements with the lowest index have diff --git a/parlai/core/fbdialog_teacher.py b/parlai/core/fbdialog_teacher.py index a1ebd7c653f..c6f669de069 100644 --- a/parlai/core/fbdialog_teacher.py +++ b/parlai/core/fbdialog_teacher.py @@ -77,10 +77,10 @@ def load_cands(self, path): cnt = 0 with open(path) as read: for line in read: - line = line.strip() + line = line.strip().replace('\\n', '\n') if len(line) > 0: cnt = cnt + 1 - # If lines are numbered we stip them of numbers. + # If lines are numbered we strip them of numbers. if cnt == 1 and line[0:2] == '1 ': lines_have_ids = True # If tabs then the label_candidates are all the replies. @@ -135,7 +135,7 @@ def setup_data(self, path): reward = None dialog_index = 0 for line in read: - line = line.strip() + line = line.strip().replace('\\n', '\n') if len(line) == 0: continue diff --git a/parlai/core/image_featurizers.py b/parlai/core/image_featurizers.py new file mode 100644 index 00000000000..2e0c92d08a1 --- /dev/null +++ b/parlai/core/image_featurizers.py @@ -0,0 +1,151 @@ +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. 
An additional grant +# of patent rights can be found in the PATENTS file in the same directory. + +import parlai.core.build_data as build_data + +import os +import copy +import numpy as np +from PIL import Image + +_greyscale = ' .,:;crsA23hHG#98&@' + +class ImageLoader(): + """Extract image feature using pretrained CNN network. + """ + def __init__(self, opt): + self.opt = copy.deepcopy(opt) + self.netCNN = None + + def init_cnn(self): + """Lazy initialization of preprocessor model in case we don't need any image preprocessing.""" + try: + import torch + except ModuleNotFoundError: + raise ModuleNotFoundError('Need to install Pytorch: go to pytorch.org') + from torch.autograd import Variable + import torchvision + import torchvision.transforms as transforms + import torch.nn as nn + + opt = self.opt + self.image_size = opt['image_size'] + self.crop_size = opt['image_cropsize'] + self.datatype = opt['datatype'] + self.image_mode = opt['image_mode'] + + opt['cuda'] = not opt['no_cuda'] and torch.cuda.is_available() + self.use_cuda = opt['cuda'] + + if self.use_cuda: + print('[ Using CUDA ]') + torch.cuda.set_device(opt['gpu']) + + cnn_type, layer_num = self.image_mode_switcher() + + # initialize the pretrained CNN using pytorch. + CNN = getattr(torchvision.models, cnn_type) + + # cut off the additional layer. + self.netCNN = nn.Sequential(*list(CNN(pretrained=True).children())[:layer_num]) + + # initialize the transform function using torch vision. + self.transform = transforms.Compose([ + transforms.Scale(self.image_size), + transforms.CenterCrop(self.crop_size), + transforms.ToTensor(), + transforms.Normalize(mean=[0.485, 0.456, 0.406], + std=[0.229, 0.224, 0.225]) + ]) + + # container for single image + self.xs = torch.FloatTensor(1, 3, self.crop_size, self.crop_size).fill_(0) + + if self.use_cuda: + self.cuda() + self.xs = self.xs.cuda() + + # make self.xs variable. 
+ self.xs = Variable(self.xs) + + def cuda(self): + self.netCNN.cuda() + + def save(self, feature, path): + feature = feature.cpu().data.numpy() + np.save(path, feature) + + def image_mode_switcher(self): + switcher = { + 'resnet152': ['resnet152', -1], + 'resnet101': ['resnet101', -1], + 'resnet50': ['resnet50', -1], + 'resnet34': ['resnet34', -1], + 'resnet18': ['resnet18', -1], + 'resnet152_spatial': ['resnet152', -2], + 'resnet101_spatial': ['resnet101', -2], + 'resnet50_spatial': ['resnet50', -2], + 'resnet34_spatial': ['resnet34', -2], + 'resnet18_spatial': ['resnet18', -2], + } + + if self.image_mode not in switcher: + raise NotImplementedError('image preprocessing mode' + + '{} not supported yet'.format(self.image_mode)) + + return switcher.get(self.image_mode) + + def extract(self, image, path): + # check whether initlize CNN network. + if not self.netCNN: + self.init_cnn() + + self.xs.data.copy_(self.transform(image)) + # extract the image feature + feature = self.netCNN(self.xs) + # save the feature + self.save(feature, path) + return feature + + def img_to_ascii(self, path): + im = Image.open(path) + im.thumbnail((60, 40), Image.BICUBIC) + im = im.convert('L') + asc = [] + for y in range(0, im.size[1]): + for x in range(0, im.size[0]): + lum = 255 - im.getpixel((x, y)) + asc.append(_greyscale[lum * len(_greyscale) // 256]) + asc.append('\n') + return ''.join(asc) + + def load(self, path): + opt = self.opt + mode = opt.get('image_mode', 'raw') + if mode is None or mode == 'none': + # don't need to load images + return None + elif mode == 'raw': + # raw just returns RGB values + return Image.open(path).convert('RGB') + elif mode == 'ascii': + # convert images to ascii ¯\_(ツ)_/¯ + return self.img_to_ascii(path) + else: + # otherwise, looks for preprocessed version under 'mode' directory + prepath, imagefn = os.path.split(path) + + dpath = os.path.join(prepath, mode) + + if not os.path.exists(dpath): + build_data.make_dir(dpath) + + imagefn = imagefn + 
'.npy' + new_path = os.path.join(prepath, mode, imagefn) + + if not os.path.isfile(new_path): + return self.extract(Image.open(path).convert('RGB'), new_path) + else: + return np.load(new_path) diff --git a/parlai/core/metrics.py b/parlai/core/metrics.py index ff22977f25e..e1aee4a11e8 100644 --- a/parlai/core/metrics.py +++ b/parlai/core/metrics.py @@ -8,23 +8,25 @@ between processes. """ -from .thread_utils import SharedTable +from parlai.core.thread_utils import SharedTable +from parlai.core.utils import round_sigfigs from collections import Counter import re -import string - +re_art = re.compile(r'\b(a|an|the)\b') +re_punc = re.compile(r'[!"#$%&()*+,-./:;<=>?@\[\]\\^`{|}~]') def _normalize_answer(s): """Lower text and remove punctuation, articles and extra whitespace.""" def remove_articles(text): - return re.sub(r'\b(a|an|the)\b', ' ', text) + return re_art.sub(' ', text) def white_space_fix(text): return ' '.join(text.split()) def remove_punc(text): - exclude = set(string.punctuation) + text = re_punc.sub(' ', text) # convert interword punctuation to spaces + exclude = set('_\'') # remove intraword punctuation completely return ''.join(ch for ch in text if ch not in exclude) def lower(text): @@ -109,12 +111,12 @@ def update_ranking_metrics(self, observation, labels): # Now loop through text candidates, assuming they are sorted. # If any of them is a label then score a point. # maintain hits@1, 5, 10, 50, 100, etc. 
- label_set = set(labels) if type(labels) != set else labels + label_set = set(_normalize_answer(l) for l in labels) cnts = {k: 0 for k in self.eval_pr} cnt = 0 for c in text_cands: cnt += 1 - if c in label_set: + if _normalize_answer(c) in label_set: for k in self.eval_pr: if cnt <= k: cnts[k] += 1 @@ -126,7 +128,6 @@ def update_ranking_metrics(self, observation, labels): if cnts[k] > 0: self.metrics['hits@' + str(k)] += 1 - def update(self, observation, labels): with self._lock(): self.metrics['cnt'] += 1 @@ -159,11 +160,14 @@ def report(self): m = {} m['total'] = self.metrics['cnt'] if self.metrics['cnt'] > 0: - m['accuracy'] = self.metrics['correct'] / self.metrics['cnt'] - m['f1'] = self.metrics['f1'] / self.metrics['cnt'] + m['accuracy'] = round_sigfigs( + self.metrics['correct'] / self.metrics['cnt'], 4) + m['f1'] = round_sigfigs( + self.metrics['f1'] / self.metrics['cnt'], 4) m['hits@k'] = {} for k in self.eval_pr: - m['hits@k'][k] = self.metrics['hits@' + str(k)] / self.metrics['cnt'] + m['hits@k'][k] = round_sigfigs( + self.metrics['hits@' + str(k)] / self.metrics['cnt'], 4) return m def clear(self): diff --git a/parlai/core/params.py b/parlai/core/params.py index ce12c2b2afa..19975237fb9 100644 --- a/parlai/core/params.py +++ b/parlai/core/params.py @@ -11,7 +11,8 @@ import importlib import os import sys -from parlai.core.agents import get_agent_module +from parlai.core.agents import get_agent_module, get_task_module +from parlai.tasks.tasks import ids_to_tasks def str2bool(value): v = value.lower() @@ -69,6 +70,7 @@ def __init__(self, add_parlai_args=True, add_model_args=False, model_argv=None): if add_parlai_args: self.add_parlai_args() + self.add_image_args() if add_model_args: self.add_model_args(model_argv) @@ -138,11 +140,25 @@ def add_parlai_args(self): '-bs', '--batchsize', default=1, type=int, help='batch size for minibatch training schemes') self.add_parlai_data_path(parlai) + self.add_task_args() + + def add_task_args(self, args=None): + # 
Find which task specified, and add its specific arguments. + args = sys.argv if args is None else args + task = None + for index, item in enumerate(args): + if item == '-t' or item == '--task': + task = args[index + 1] + if task: + for t in ids_to_tasks(task).split(','): + agent = get_task_module(t) + if hasattr(agent, 'add_cmdline_args'): + agent.add_cmdline_args(self) def add_model_args(self, args=None): model_args = self.add_argument_group('ParlAI Model Arguments') model_args.add_argument( - '-m', '--model', default='repeat_label', + '-m', '--model', default=None, help='the model class name, should match parlai/agents/') model_args.add_argument( '-mf', '--model-file', default=None, @@ -165,12 +181,27 @@ def add_model_args(self, args=None): s = class2str(agent.dictionary_class()) model_args.set_defaults(dict_class=s) + def add_image_args(self, args=None): + # Find which image mode specified, add its specific arguments if needed. + args = sys.argv if args is None else args + image_mode = None + for index, item in enumerate(args): + if item == '-im' or item == '--image-mode': + image_mode = args[index + 1] + if image_mode and image_mode != 'none': + parlai = self.add_argument_group('ParlAI Image Preprocessing Arguments') + parlai.add_argument('--image-size', type=int, default=256, + help='') + parlai.add_argument('--image-cropsize', type=int, default=224, + help='') + def parse_args(self, args=None, namespace=None, print_args=True): """Parses the provided arguments and returns a dictionary of the ``args``. We specifically remove items with ``None`` as values in order to support the style ``opt.get(key, default)``, which would otherwise return ``None``. 
""" - self.opt = vars(super().parse_args(args=args)) + self.args = super().parse_args(args=args) + self.opt = vars(self.args) # custom post-parsing self.opt['parlai_home'] = self.parlai_home @@ -189,5 +220,16 @@ def print_args(self): """Print out all the arguments in this parser.""" if not self.opt: self.parse_args(print_args=False) + values = {} for key, value in self.opt.items(): - print('[' + str(key) + ':' + str(value) + ']') + values[str(key)] = str(value) + for group in self._action_groups: + group_dict={a.dest:getattr(self.args,a.dest,None) for a in group._group_actions} + namespace = argparse.Namespace(**group_dict) + count = 0 + for key in namespace.__dict__: + if key in values: + if count == 0: + print('[ ' + group.title + ': ] ') + count += 1 + print('[ ' + key + ': ' + values[key] + ' ]') diff --git a/parlai/core/utils.py b/parlai/core/utils.py index 4188a8fda02..d82fce403fb 100644 --- a/parlai/core/utils.py +++ b/parlai/core/utils.py @@ -4,9 +4,7 @@ # LICENSE file in the root directory of this source tree. An additional grant # of patent rights can be found in the PATENTS file in the same directory. -from parlai.core.params import ParlaiParser -from parlai.core.agents import create_agent - +import math import sys import time @@ -31,6 +29,9 @@ def __init__(self, args=None, **kwargs): with hyphens, so 'dict_file=/tmp/dict.tsv' would be interpreted as '--dict-file /tmp/dict.tsv'. 
""" + from parlai.core.params import ParlaiParser + from parlai.core.agents import create_agent + if args is None: args = [] for k, v in kwargs.items(): @@ -80,3 +81,9 @@ def time(self): if self.running: return self.total + time.time() - self.start return self.total + + +def round_sigfigs(x, sigfigs=4): + if x == 0: + return 0 + return round(x, -math.floor(math.log10(abs(x)) - sigfigs + 1)) diff --git a/parlai/core/worlds.py b/parlai/core/worlds.py index 49625bfc9bd..f0131319c2d 100644 --- a/parlai/core/worlds.py +++ b/parlai/core/worlds.py @@ -54,12 +54,6 @@ def validate(observation): """Make sure the observation table is valid, or raise an error.""" if observation is not None and type(observation) == dict: - if ('text_candidates' in observation and - 'text' in observation and - observation['text'] != observation['text_candidates'][0]): - raise RuntimeError('If text and text_candidates fields are both ' + - 'filled, top text candidate should be the same' + - ' as text.') return observation else: raise RuntimeError('Must return dictionary from act().') @@ -202,6 +196,17 @@ def reset(self): for a in self.agents: a.reset() + def reset_metrics(self): + for a in self.agents: + a.reset_metrics() + + def save_agents(self): + """Saves all of the agents in the world by calling their respective + save() methods. + """ + for a in self.agents: + a.save() + def synchronize(self): """Can be used to synchronize processes.""" pass @@ -268,7 +273,7 @@ def shutdown(self): class MultiAgentDialogWorld(World): """Basic world where each agent gets a turn in a round-robin fashion, - recieving as input the actions of all other agents since that agent last + receiving as input the actions of all other agents since that agent last acted. 
""" def __init__(self, opt, agents=None, shared=None): @@ -311,10 +316,58 @@ def report(self): return self.agents[0].report() def shutdown(self): + """Shutdown each agent.""" for a in self.agents: a.shutdown() +class ExecutableWorld(MultiAgentDialogWorld): + """A world where messages from agents can be interpreted as _actions_ in the + world which result in changes in the environment (are executed). Hence a grounded + simulation can be implemented rather than just dialogue between agents. + """ + def __init__(self, opt, agents=None, shared=None): + super().__init__(opt, agents, shared) + self.init_world() + + def init_world(self): + """An executable world class should implement this function, otherwise + the actions do not do anything (and it is the same as MultiAgentDialogWorld). + """ + pass + + def execute(self, agent, act): + """An executable world class should implement this function, otherwise + the actions do not do anything (and it is the same as MultiAgentDialogWorld). + """ + pass + + def observe(self, agent, act): + """An executable world class should implement this function, otherwise + the observations for each agent are just the messages from other agents + and not confitioned on the world at all (and it is thus the same as + MultiAgentDialogWorld). """ + if agent.id == act['id']: + return None + else: + return act + + def parley(self): + """For each agent: act, execute and observe actions in world + """ + acts = self.acts + for index, agent in enumerate(self.agents): + # The agent acts. + acts[index] = agent.act() + # We execute this action in the world. + self.execute(agent, acts[index]) + # All agents (might) observe the results. + for other_agent in self.agents: + obs = self.observe(other_agent, acts[index]) + if obs is not None: + other_agent.observe(obs) + + class MultiWorld(World): """Container for a set of worlds where each world gets a turn in a round-robin fashion. 
The same user_agents are placed in each, @@ -433,6 +486,14 @@ def reset(self): for w in self.worlds: w.reset() + def reset_metrics(self): + for w in self.worlds: + w.reset_metrics() + + def save_agents(self): + # Assumes all worlds have same agents, picks first to save. + self.worlds[0].save_agents() + def override_opts_in_shared(table, overrides): """Looks recursively for ``opt`` dictionaries within shared dict and overrides @@ -457,7 +518,7 @@ class BatchWorld(World): """Creates a separate world for each item in the batch, sharing the parameters for each. The underlying world(s) it is batching can be either ``DialogPartnerWorld``, - ``MultiAgentWorld`` or ``MultiWorld``. + ``MultiAgentWorld``, ``ExecutableWorld`` or ``MultiWorld``. """ def __init__(self, opt, world): @@ -481,11 +542,20 @@ def __next__(self): if self.epoch_done(): raise StopIteration() - def batch_observe(self, index, batch_actions): + def batch_observe(self, index, batch_actions, index_acting): batch_observations = [] for i, w in enumerate(self.worlds): agents = w.get_agents() - observation = agents[index].observe(validate(batch_actions[i])) + observation = None + if hasattr(w, 'observe'): + # The world has its own observe function, which the action + # first goes through (agents receive messages via the world, + # not from each other). + observation = w.observe(agents[index], validate(batch_actions[i])) + else: + if index == index_acting: return None # don't observe yourself talking + observation = validate(batch_actions[i]) + observation = agents[index].observe(observation) if observation is None: raise ValueError('Agents should return what they observed.') batch_observations.append(observation) @@ -499,9 +569,9 @@ def batch_act(self, index, batch_observation): hasattr(a, 'batch_act')): batch_actions = a.batch_act(batch_observation) # Store the actions locally in each world. 
- for w in self.worlds: + for i, w in enumerate(self.worlds): acts = w.get_acts() - acts[index] = batch_actions[index] + acts[index] = batch_actions[i] else: # Reverts to running on each individually. batch_actions = [] @@ -523,11 +593,17 @@ def parley(self): w.parley_init() for index in range(num_agents): + # The agent acts. batch_act = self.batch_act(index, batch_observations[index]) + # We possibly execute this action in the world. + for i, w in enumerate(self.worlds): + if hasattr(w, 'execute'): + w.execute(w.agents[i], batch_act[i]) + # All agents (might) observe the results. for other_index in range(num_agents): - if index != other_index: - batch_observations[other_index] = ( - self.batch_observe(other_index, batch_act)) + obs = self.batch_observe(other_index, batch_act, index) + if obs is not None: + batch_observations[other_index] = obs def display(self): s = ("[--batchsize " + str(len(self.worlds)) + "--]\n") @@ -553,12 +629,26 @@ def epoch_done(self): return True def report(self): - return self.worlds[0].report() + return self.world.report() def reset(self): for w in self.worlds: w.reset() + def reset_metrics(self): + self.world.reset_metrics() + + def save_agents(self): + # Because all worlds share the same parameters through sharing, saving + # one copy would suffice + self.world.save_agents() + + def shutdown(self): + """Shutdown each world.""" + for w in self.worlds: + w.shutdown() + self.world.shutdown() + class HogwildProcess(Process): """Process child used for ``HogwildWorld``. @@ -656,6 +746,9 @@ def getID(self): def report(self): return self.inner_world.report() + def save_agents(self): + self.inner_world.save_agents() + def synchronize(self): """Sync barrier: will wait until all queued examples are processed.""" with self.epochDone: @@ -712,6 +805,9 @@ def create_task(opt, user_agents): see ``parlai/tasks/tasks.py`` and see ``parlai/tasks/task_list.py`` for list of tasks. """ + if not opt.get('task'): + raise RuntimeError('No task specified. 
Please select a task with ' + + '--task {task_name}.') if type(user_agents) != list: user_agents = [user_agents] diff --git a/parlai/mturk/core/__init__.py b/parlai/mturk/core/__init__.py index 8eff276d72d..fdc3ad907dc 100644 --- a/parlai/mturk/core/__init__.py +++ b/parlai/mturk/core/__init__.py @@ -2,4 +2,15 @@ # All rights reserved. # This source code is licensed under the BSD-style license found in the # LICENSE file in the root directory of this source tree. An additional grant -# of patent rights can be found in the PATENTS file in the same directory. \ No newline at end of file +# of patent rights can be found in the PATENTS file in the same directory. + +# Check 3rd-party dependencies +try: + import boto3 + import botocore + import psycopg2 + import sqlalchemy + import joblib +except ModuleNotFoundError: + raise SystemExit("Please install 3rd-party dependencies by running: pip install boto3 psycopg2 sqlalchemy joblib") + diff --git a/parlai/mturk/core/agents.py b/parlai/mturk/core/agents.py index ab6bb4eabc0..356574abcb0 100644 --- a/parlai/mturk/core/agents.py +++ b/parlai/mturk/core/agents.py @@ -16,67 +16,82 @@ import json import requests from parlai.core.agents import create_agent_from_shared -from .setup_aws import setup_aws, check_mturk_balance, create_hit_type, create_hit_with_hit_type, setup_aws_credentials +from parlai.mturk.core.setup_aws import setup_aws, calculate_mturk_cost, check_mturk_balance, create_hit_type, create_hit_with_hit_type, setup_aws_credentials, create_hit_config, get_mturk_client import threading -from .data_model import Base, Message -from .data_model import get_new_messages as _get_new_messages +from parlai.mturk.core.data_model import Base, Message +from parlai.mturk.core.data_model import get_new_messages as _get_new_messages +from parlai.mturk.core.data_model import COMMAND_GET_NEW_MESSAGES, COMMAND_SEND_MESSAGE, COMMAND_SUBMIT_HIT from sqlalchemy.orm import sessionmaker, scoped_session from sqlalchemy import create_engine 
from sqlalchemy.pool import StaticPool +from botocore.exceptions import ClientError +import uuid try: import sqlite3 except ModuleNotFoundError: raise SystemExit("Please install sqlite3 by running: pip install sqlite3") -local_db_file_path_template = os.path.dirname(os.path.dirname(os.path.realpath(__file__))) + '/tmp/parlai_mturk_.db' +ASSIGNMENT_NOT_DONE = 'NotDone' +ASSIGNMENT_DONE = 'Submitted' +ASSIGNMENT_APPROVED = 'Approved' +ASSIGNMENT_REJECTED = 'Rejected' + +TIMEOUT_MESSAGE = '[TIMEOUT]' + polling_interval = 1 # in seconds -create_hit_type_lock = threading.Lock() local_db_lock = threading.Lock() +debug = False class MTurkManager(): - def __init__(self): + def __init__(self, opt, mturk_agent_ids): self.html_api_endpoint_url = None self.json_api_endpoint_url = None - self.requester_key_gt = None self.task_group_id = None self.db_last_message_id = 0 self.db_thread = None self.db_thread_stop_event = None - self.local_db_file_path = None self.run_id = None - self.mturk_agent_ids = None - self.all_agent_ids = None - - def init_aws(self, opt): - self.run_id = str(int(time.time())) + self.mturk_agent_ids = mturk_agent_ids + self.task_files_to_copy = None + self.unsent_messages_lock = threading.Lock() + self.unsent_messages = [] + self.is_sandbox = opt['is_sandbox'] + def init_aws(self, opt, task_directory_path=None): print("\nYou are going to allow workers from Amazon Mechanical Turk to be an agent in ParlAI.\nDuring this process, Internet connection is required, and you should turn off your computer's auto-sleep feature.\n") key_input = input("Please press Enter to continue... 
") print("") setup_aws_credentials() - if not check_mturk_balance(num_hits=opt['num_hits'], hit_reward=opt['reward'], is_sandbox=opt['is_sandbox']): + payment_opt = { + 'type': 'reward', + 'num_hits': opt['num_hits'], + 'num_assignments': opt['num_assignments'], + 'reward': opt['reward'] # in dollars + } + total_cost = calculate_mturk_cost(payment_opt=payment_opt) + if not check_mturk_balance(balance_needed=total_cost, is_sandbox=opt['is_sandbox']): return print('Setting up MTurk backend...') - html_api_endpoint_url, json_api_endpoint_url, requester_key_gt = setup_aws(task_description=opt['task_description'], num_hits=opt['num_hits'], num_assignments=opt['num_assignments'], is_sandbox=opt['is_sandbox']) + create_hit_config(task_description=opt['task_description'], num_hits=opt['num_hits'], num_assignments=opt['num_assignments'], is_sandbox=opt['is_sandbox']) + if not self.task_files_to_copy: + self.task_files_to_copy = [] + if not task_directory_path: + task_directory_path = os.path.join(opt['parlai_home'], 'parlai', 'mturk', 'tasks', opt['task']) + for mturk_agent_id in self.mturk_agent_ids: + self.task_files_to_copy.append(os.path.join(task_directory_path, 'html', mturk_agent_id+'_cover_page.html')) + self.task_files_to_copy.append(os.path.join(task_directory_path, 'html', mturk_agent_id+'_index.html')) + html_api_endpoint_url, json_api_endpoint_url = setup_aws(task_files_to_copy = self.task_files_to_copy) self.html_api_endpoint_url = html_api_endpoint_url self.json_api_endpoint_url = json_api_endpoint_url - self.requester_key_gt = requester_key_gt + if debug: + print(self.json_api_endpoint_url) print("MTurk setup done.\n") - self.task_group_id = str(opt['task']) + '_' + str(self.run_id) - - # self.connection = sqlite3.connect(local_db_file_name) - - self.local_db_file_path = local_db_file_path_template.replace('', self.run_id) - - if not os.path.exists(os.path.dirname(self.local_db_file_path)): - os.makedirs(os.path.dirname(self.local_db_file_path)) - - # Create 
an engine - engine = create_engine('sqlite:///'+self.local_db_file_path, + # Create an engine connected to the in-memory database + engine = create_engine('sqlite://', connect_args={'check_same_thread':False}, poolclass=StaticPool) @@ -88,13 +103,22 @@ def init_aws(self, opt): self.db_session = scoped_session(session_maker) + def start_new_run(self, opt): + if self.db_thread_stop_event: + self.db_thread_stop_event.set() + + self.run_id = str(int(time.time())) + self.task_group_id = str(opt['task']) + '_' + str(self.run_id) + self.db_thread_stop_event = threading.Event() - self.db_thread = threading.Thread(target=self._poll_new_messages_and_save_to_db, args=()) + self.db_thread = threading.Thread(target=self._sync_with_remote_db, args=()) self.db_thread.daemon = True self.db_thread.start() - def _poll_new_messages_and_save_to_db(self): + def _sync_with_remote_db(self): while not self.db_thread_stop_event.is_set(): + if debug: + print("Syncing with remote db...") self.get_new_messages_and_save_to_db() time.sleep(polling_interval) @@ -103,12 +127,13 @@ def get_new_messages_and_save_to_db(self): 'method_name': 'get_new_messages', 'task_group_id': self.task_group_id, 'last_message_id': self.db_last_message_id, + 'receiver_agent_id': '[World]' } - request = requests.get(self.json_api_endpoint_url, params=params) + response = requests.get(self.json_api_endpoint_url, params=params) try: - ret = json.loads(request.json()) - except TypeError as e: - print(request.json()) + ret = json.loads(response.json()) + except Exception as e: + print(response.content) raise e conversation_dict = ret['conversation_dict'] if ret['last_message_id']: @@ -119,36 +144,39 @@ def get_new_messages_and_save_to_db(self): for new_message in new_messages: with local_db_lock: if self.db_session.query(Message).filter(Message.id==new_message['message_id']).count() == 0: - obs_act_dict = {k:new_message[k] for k in new_message if k != 'message_id'} + obs_act_dict = {k:new_message[k] for k in new_message 
if k not in ['message_id']} new_message_in_local_db = Message( id = new_message['message_id'], task_group_id = self.task_group_id, conversation_id = conversation_id, - agent_id = new_message['id'], + sender_agent_id = new_message['id'], + receiver_agent_id = new_message['receiver_agent_id'], message_content = json.dumps(obs_act_dict) ) self.db_session.add(new_message_in_local_db) self.db_session.commit() # Only gets new messages from local db, which syncs with remote db every `polling_interval` seconds. - def get_new_messages(self, task_group_id, conversation_id, after_message_id, excluded_agent_id=None, included_agent_id=None): + def get_new_messages(self, task_group_id, conversation_id, receiver_agent_id, after_message_id, excluded_sender_agent_id=None, included_sender_agent_id=None): with local_db_lock: return _get_new_messages( db_session=self.db_session, task_group_id=task_group_id, conversation_id=conversation_id, + receiver_agent_id=receiver_agent_id, after_message_id=after_message_id, - excluded_agent_id=excluded_agent_id, - included_agent_id=included_agent_id, + excluded_sender_agent_id=excluded_sender_agent_id, + included_sender_agent_id=included_sender_agent_id, populate_meta_info=True ) - def send_new_message(self, task_group_id, conversation_id, agent_id, message_text=None, reward=None, episode_done=False): + def send_new_message(self, task_group_id, conversation_id, sender_agent_id, receiver_agent_id, message_text=None, reward=None, episode_done=False): post_data_dict = { 'method_name': 'send_new_message', 'task_group_id': task_group_id, 'conversation_id': conversation_id, - 'cur_agent_id': agent_id, + 'sender_agent_id': sender_agent_id, + 'receiver_agent_id': receiver_agent_id, 'episode_done': episode_done, } if message_text: @@ -156,40 +184,68 @@ def send_new_message(self, task_group_id, conversation_id, agent_id, message_tex if reward: post_data_dict['reward'] = reward - request = requests.post(self.json_api_endpoint_url, 
data=json.dumps(post_data_dict)) + response = requests.post(self.json_api_endpoint_url, data=json.dumps(post_data_dict)) + try: + ret = json.loads(response.json()) + return ret + except Exception as e: + print(response.content) + raise e + + def send_new_command(self, task_group_id, conversation_id, receiver_agent_id, command): + post_data_dict = { + 'method_name': 'send_new_command', + 'task_group_id': task_group_id, + 'conversation_id': conversation_id, + 'receiver_agent_id': receiver_agent_id, + 'command': command, + } + response = requests.post(self.json_api_endpoint_url, data=json.dumps(post_data_dict)) try: - ret = json.loads(request.json()) + ret = json.loads(response.json()) return ret - except TypeError as e: - print(request.json()) + except Exception as e: + print(response.content) raise e - def get_approval_status_count(self, task_group_id, approval_status, requester_key, conversation_id=None): + def get_hit_assignment_info(self, task_group_id, conversation_id, agent_id): params = { - 'method_name': 'get_approval_status_count', + 'method_name': 'get_hit_assignment_info', 'task_group_id': task_group_id, - 'approval_status': approval_status, - 'requester_key': requester_key + 'agent_id': agent_id, + 'conversation_id': conversation_id } - if conversation_id: - params['conversation_id'] = conversation_id - request = requests.get(self.json_api_endpoint_url, params=params) - return request.json() + response = requests.get(self.json_api_endpoint_url, params=params) + try: + ret = json.loads(response.json()) + return ret['assignment_id'], ret['hit_id'], ret['worker_id'] + except Exception as e: + print(response.content) + raise e + + def get_agent_work_status(self, assignment_id): + client = get_mturk_client(self.is_sandbox) + try: + response = client.get_assignment(AssignmentId=assignment_id) + return response['Assignment']['AssignmentStatus'] + except ClientError as e: + if 'This operation can be called with a status of: Reviewable,Approved,Rejected' in 
e.response['Error']['Message']: + return ASSIGNMENT_NOT_DONE def create_hits(self, opt): print('Creating HITs...') + mturk_agent_HIT_url_dict = {} for mturk_agent_id in self.mturk_agent_ids: for hit_index in range(1, opt['num_hits']+1): - with create_hit_type_lock: - hit_type_id = create_hit_type( - hit_title=opt['hit_title'], - hit_description=opt['hit_description'] + ' (ID: ' + self.task_group_id + ', Role: ' + mturk_agent_id + ')', - hit_keywords=opt['hit_keywords'], - hit_reward=opt['reward'], - is_sandbox=opt['is_sandbox'] - ) - all_agent_ids_string = str(self.all_agent_ids).replace("'", '''"''') - mturk_chat_url = self.html_api_endpoint_url + "?method_name=chat_index&task_group_id="+str(self.task_group_id)+"&all_agent_ids="+all_agent_ids_string+"&cur_agent_id="+str(mturk_agent_id)+"&task_additional_info="+str(opt.get('task_additional_info', '')) + hit_type_id = create_hit_type( + hit_title=opt['hit_title'], + hit_description=opt['hit_description'] + ' (ID: ' + self.task_group_id + ', Role: ' + mturk_agent_id + ')', + hit_keywords=opt['hit_keywords'], + hit_reward=opt['reward'], + assignment_duration_in_seconds=opt.get('assignment_duration_in_seconds', 30 * 60), # Set to 30 minutes by default + is_sandbox=opt['is_sandbox'] + ) + mturk_chat_url = self.html_api_endpoint_url + "?method_name=chat_index&task_group_id="+str(self.task_group_id)+"&cur_agent_id="+str(mturk_agent_id) mturk_page_url = create_hit_with_hit_type( page_url=mturk_chat_url, hit_type_id=hit_type_id, @@ -198,70 +254,122 @@ def create_hits(self, opt): ) print("Link to HIT for " + str(mturk_agent_id) + ": " + mturk_page_url + "\n") print("Waiting for Turkers to respond... 
(Please don't close your laptop or put your computer into sleep or standby mode.)\n") - - def review_hits(self): - mturk_agent_ids_string = str(self.mturk_agent_ids).replace("'", '''"''') - mturk_approval_url = self.html_api_endpoint_url + "?method_name=approval_index&task_group_id="+str(self.task_group_id)+"&hit_index=1&assignment_index=1&mturk_agent_ids="+mturk_agent_ids_string+"&requester_key="+self.requester_key_gt - - print("\nAll HITs are done! Please go to the following link to approve/reject them (or they will be auto-approved in 4 weeks if no action is taken):\n") - print(mturk_approval_url) - print("") - - # Loop for checking approval status - while self.get_approval_status_count(task_group_id=self.task_group_id, approval_status='pending', requester_key=self.requester_key_gt) > 0: - time.sleep(polling_interval) - - print("All reviews are done!") + mturk_agent_HIT_url_dict[mturk_agent_id] = mturk_page_url + return mturk_agent_HIT_url_dict + + def approve_work(self, assignment_id): + client = get_mturk_client(self.is_sandbox) + client.approve_assignment(AssignmentId=assignment_id) + + def reject_work(self, assignment_id, reason): + client = get_mturk_client(self.is_sandbox) + client.reject_assignment(AssignmentId=assignment_id, RequesterFeedback=reason) + + def block_worker(self, worker_id, reason): + client = get_mturk_client(self.is_sandbox) + client.create_worker_block(WorkerId=worker_id, Reason=reason) + + def pay_bonus(self, worker_id, bonus_amount, assignment_id, reason, unique_request_token): + total_cost = calculate_mturk_cost(payment_opt={'type': 'bonus', 'amount': bonus_amount}) + if not check_mturk_balance(balance_needed=total_cost, is_sandbox=self.is_sandbox): + print("Cannot pay bonus. 
Reason: Insufficient fund in your MTurk account.") + return False + + client = get_mturk_client(self.is_sandbox) + client.send_bonus( + WorkerId=worker_id, + BonusAmount=str(bonus_amount), + AssignmentId=assignment_id, + Reason=reason, + UniqueRequestToken=unique_request_token # Could be useful in the future, for handling network errors + ) + + return True + + def email_worker(self, worker_id, subject, message_text): + client = get_mturk_client(self.is_sandbox) + response = client.notify_workers( + Subject=subject, + MessageText=message_text, + WorkerIds=[worker_id] + ) + if len(response['NotifyWorkersFailureStatuses']) > 0: + return {'failure': response['NotifyWorkersFailureStatuses'][0]['NotifyWorkersFailureMessage']} + else: + return {'success': True} def shutdown(self): + setup_aws_file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'setup_aws.py') + print("Remote database instance will accumulate cost over time (about $30/month for t2.medium instance). Please run `python "+setup_aws_file_path+" remove_rds` to remove RDS instance if you don't plan to use MTurk often.") self.db_thread_stop_event.set() - if os.path.exists(self.local_db_file_path): - os.remove(self.local_db_file_path) class MTurkAgent(Agent): - def __init__(self, id, manager, conversation_id, opt, shared=None): + def __init__(self, id, manager, hit_index, assignment_index, opt, shared=None): super().__init__(opt) - self.conversation_id = conversation_id + self.conversation_id = str(hit_index) + '_' + str(assignment_index) self.manager = manager self.id = id self.last_message_id = 0 + self.assignment_id = None + self.hit_id = None + self.worker_id = None + self.hit_is_abandoned = False + + # Wait for MTurk-specific info + while not (self.assignment_id and self.hit_id and self.worker_id): + self.assignment_id, self.hit_id, self.worker_id = self.manager.get_hit_assignment_info(self.manager.task_group_id, self.conversation_id, self.id) + time.sleep(polling_interval) def observe(self, 
msg): - if msg['id'] not in self.manager.mturk_agent_ids: # If the message sender is an mturk agent, then there is no need to upload this message to db since it's already been done on the message sender side. - self.manager.get_new_messages_and_save_to_db() # Force a refresh for local db. - conversation_dict, _ = self.manager.get_new_messages( - task_group_id=self.manager.task_group_id, - conversation_id=self.conversation_id, - after_message_id=self.last_message_id, - included_agent_id=msg['id']) - if self.conversation_id in conversation_dict: - agent_last_message_in_db = conversation_dict[self.conversation_id][-1] - agent_last_message_in_db.pop('message_id', None) - if 'episode_done' not in msg: - msg['episode_done'] = False - if agent_last_message_in_db == msg: - return - - self.manager.send_new_message( - task_group_id=self.manager.task_group_id, - conversation_id=self.conversation_id, - agent_id=msg['id'], - message_text=msg.get('text', None), - reward=msg.get('reward', None), - episode_done=msg.get('episode_done', False), - ) + self.manager.send_new_message( + task_group_id=self.manager.task_group_id, + conversation_id=self.conversation_id, + sender_agent_id=msg['id'], + receiver_agent_id=self.id, + message_text=msg.get('text', None), + reward=msg.get('reward', None), + episode_done=msg.get('episode_done', False), + ) + self.manager.send_new_command( + task_group_id=self.manager.task_group_id, + conversation_id=self.conversation_id, + receiver_agent_id=self.id, + command=COMMAND_GET_NEW_MESSAGES + ) + + def act(self, timeout=None): # timeout in seconds + if timeout: + start_time = time.time() + + self.manager.send_new_command( + task_group_id=self.manager.task_group_id, + conversation_id=self.conversation_id, + receiver_agent_id=self.id, + command=COMMAND_SEND_MESSAGE + ) - def act(self): while True: + if timeout: + current_time = time.time() + if (current_time - start_time) > timeout: + self.hit_is_abandoned = True + msg = { + 'id': self.id, + 'text': 
TIMEOUT_MESSAGE, + 'episode_done': True + } + return msg + conversation_dict, new_last_message_id = self.manager.get_new_messages( task_group_id=self.manager.task_group_id, conversation_id=self.conversation_id, + receiver_agent_id='[World]', after_message_id=self.last_message_id, - included_agent_id=self.id + included_sender_agent_id=self.id ) - + if self.conversation_id in conversation_dict: if new_last_message_id: self.last_message_id = new_last_message_id @@ -275,8 +383,71 @@ def act(self): def episode_done(self): return False - def shutdown(self): - # Loop to ensure all HITs are done - while self.manager.get_approval_status_count(task_group_id=self.manager.task_group_id, conversation_id=self.conversation_id, approval_status='pending', requester_key=self.manager.requester_key_gt) < len(self.manager.mturk_agent_ids): + def approve_work(self): + if self.hit_is_abandoned: + print('Conversation ID: ' + str(self.conversation_id) + ', Agent ID: ' + self.id + ' - HIT is abandoned and thus not available for review.') + else: + if self.manager.get_agent_work_status(assignment_id=self.assignment_id) == ASSIGNMENT_DONE: + self.manager.approve_work(assignment_id=self.assignment_id) + print('Conversation ID: ' + str(self.conversation_id) + ', Agent ID: ' + self.id + ' - HIT is approved.') + else: + print("Cannot approve HIT. Reason: Turker hasn't completed the HIT yet.") + + def reject_work(self, reason='unspecified'): + if self.hit_is_abandoned: + print('Conversation ID: ' + str(self.conversation_id) + ', Agent ID: ' + self.id + ' - HIT is abandoned and thus not available for review.') + else: + if self.manager.get_agent_work_status(assignment_id=self.assignment_id) == ASSIGNMENT_DONE: + self.manager.reject_work(assignment_id=self.assignment_id, reason=reason) + print('Conversation ID: ' + str(self.conversation_id) + ', Agent ID: ' + self.id + ' - HIT is rejected.') + else: + print("Cannot reject HIT. 
Reason: Turker hasn't completed the HIT yet.") + + def block_worker(self, reason='unspecified'): + self.manager.block_worker(worker_id=self.worker_id, reason=reason) + print("Blocked worker ID: " + str(self.worker_id) + ". Reason: " + reason) + + def pay_bonus(self, bonus_amount, reason='unspecified'): + if self.hit_is_abandoned: + print('Conversation ID: ' + str(self.conversation_id) + ', Agent ID: ' + self.id + ' - HIT is abandoned and thus not available for bonus.') + else: + if self.manager.get_agent_work_status(assignment_id=self.assignment_id) != ASSIGNMENT_NOT_DONE: + unique_request_token = str(uuid.uuid4()) + if self.manager.pay_bonus(worker_id=self.worker_id, bonus_amount=bonus_amount, assignment_id=self.assignment_id, reason=reason, unique_request_token=unique_request_token): + print("Paid $" + str(bonus_amount) + " bonus to WorkerId: " + self.worker_id) + else: + print("Cannot pay bonus for HIT. Reason: Turker hasn't completed the HIT yet.") + + def email_worker(self, subject, message_text): + response = self.manager.email_worker(worker_id=self.worker_id, subject=subject, message_text=message_text) + if 'success' in response: + print("Email sent to worker ID: "+str(self.worker_id)+": Subject: "+str(subject)+": Text: "+str(message_text)) + return True + elif 'failure' in response: + print("Unable to send email to worker ID: "+str(self.worker_id)+". 
Error: "+str(response['failure'])) + return False + + def wait_for_hit_completion(self, timeout=None): # Timeout in seconds + if timeout: + start_time = time.time() + while self.manager.get_agent_work_status(assignment_id=self.assignment_id) != ASSIGNMENT_DONE: + if timeout: + current_time = time.time() + if (current_time - start_time) > timeout: + print("Timed out waiting for Turker to complete the HIT.") + self.hit_is_abandoned = True + return False + if debug: + print("Waiting for Turker to complete the HIT...") time.sleep(polling_interval) print('Conversation ID: ' + str(self.conversation_id) + ', Agent ID: ' + self.id + ' - HIT is done.') + + def shutdown(self, timeout=None): # Timeout in seconds + if not self.hit_is_abandoned: + self.manager.send_new_command( + task_group_id=self.manager.task_group_id, + conversation_id=self.conversation_id, + receiver_agent_id=self.id, + command=COMMAND_SUBMIT_HIT + ) + self.wait_for_hit_completion(timeout=timeout) diff --git a/parlai/mturk/core/data_model.py b/parlai/mturk/core/data_model.py index 50840239fa6..83f4aab9970 100644 --- a/parlai/mturk/core/data_model.py +++ b/parlai/mturk/core/data_model.py @@ -12,87 +12,106 @@ from sqlalchemy.ext.declarative import declarative_base from sqlalchemy.orm import sessionmaker, scoped_session from sqlalchemy import create_engine, func +from sqlalchemy.pool import NullPool +from sqlalchemy import inspect is_python_2 = False if sys.version_info[0] < 3: is_python_2 = True - Base = declarative_base() engine = None -session = None +COMMAND_GET_NEW_MESSAGES = 'COMMAND_GET_NEW_MESSAGES' # MTurk agent is expected to get new messages from server +COMMAND_SEND_MESSAGE = 'COMMAND_SEND_MESSAGE' # MTurk agent is expected to send a new message to server +COMMAND_SUBMIT_HIT = 'COMMAND_SUBMIT_HIT' # MTurk agent is expected to hit "DONE" button and submit the HIT + +def object_as_dict(obj): + return {c.key: getattr(obj, c.key) + for c in inspect(obj).mapper.column_attrs} class Message(Base): 
__tablename__ = 'message' id = Column(Integer, primary_key=True) task_group_id = Column(String(255), index=True) # We assign a new task_group_id for each HIT group conversation_id = Column(String(255), index=True) - agent_id = Column(String(255)) + sender_agent_id = Column(String(255), index=True) + receiver_agent_id = Column(String(255), index=True, default=None) message_content = Column(UnicodeText) - -class MTurkHITInfo(Base): - __tablename__ = 'mturk_hit_info' +class Command(Base): + __tablename__ = 'command' id = Column(Integer, primary_key=True) task_group_id = Column(String(255), index=True) conversation_id = Column(String(255), index=True) - assignment_id = Column(String(255)) - hit_id = Column(String(255)) - worker_id = Column(String(255)) - is_sandbox = Column(Boolean()) - approval_status = Column(String(100), index=True) + receiver_agent_id = Column(String(255), index=True) + command = Column(String(255)) - def as_dict(self): - return {c.name: getattr(self, c.name) for c in self.__table__.columns} - - -class MTurkHITAssignmentInfo(Base): - __tablename__ = 'mturk_hit_assignment_info' +class MTurkHITAgentAllocation(Base): + __tablename__ = 'mturk_hit_agent_allocation' id = Column(Integer, primary_key=True) task_group_id = Column(String(255), index=True) agent_id = Column(String(255), index=True) + conversation_id = Column(String(255), index=True, default=None) + assignment_id = Column(String(255), default=None) + hit_id = Column(String(255), default=None) + worker_id = Column(String(255), default=None) -def is_database_schema_consistent(Base, engine): +def check_database_health(): session_maker = sessionmaker(bind=engine) session = scoped_session(session_maker) - # Try insert new objects with current schema try: - test_message = Message(id=0, task_group_id='Test', conversation_id='Test', agent_id='Test', message_content='Test') - session.add(test_message) - session.commit() - session.delete(test_message) - session.commit() - - test_hit_info = 
MTurkHITInfo(id=0, task_group_id='Test', conversation_id='Test', assignment_id='Test', hit_id='Test', worker_id='Test', is_sandbox=True, approval_status='Test') - session.add(test_hit_info) - session.commit() - session.delete(test_hit_info) - session.commit() - - test_hit_assignment_info = MTurkHITAssignmentInfo(id=0, task_group_id='Test', agent_id='Test') - session.add(test_hit_assignment_info) - session.commit() - session.delete(test_hit_assignment_info) - session.commit() - - return True - except: - return False - - -def init_database(host, db_name, username, password, should_check_schema_consistency=False): + # Check whether all tables exist + for model_class in [Message, MTurkHITAgentAllocation]: + if not engine.dialect.has_table(engine, model_class.__tablename__): + return 'missing_table' + + # Try insert new objects with current schema + try: + test_message = Message(id=0, task_group_id='Test', conversation_id='Test', sender_agent_id='Test', receiver_agent_id='Test', message_content='Test') + session.add(test_message) + session.commit() + session.delete(test_message) + session.commit() + + test_command = Command(id=0, task_group_id='Test', conversation_id='Test', receiver_agent_id='Test', command='Test') + session.add(test_command) + session.commit() + session.delete(test_command) + session.commit() + + test_agent_allocation = MTurkHITAgentAllocation(id=0, task_group_id='Test', agent_id='Test', conversation_id='Test', assignment_id='Test', hit_id='Test', worker_id='Test') + session.add(test_agent_allocation) + session.commit() + session.delete(test_agent_allocation) + session.commit() + + return 'healthy' + except KeyboardInterrupt: + raise + except Exception as e: + return 'inconsistent_schema' + except KeyboardInterrupt: + raise + except Exception as e: + raise e + return 'unknown_error' + + +def setup_database_engine(host, db_name, username, password): # Create an engine - engine = 
create_engine('postgres://'+username+':'+password+'@'+host+':5432/'+db_name) - - if should_check_schema_consistency and not is_database_schema_consistent(Base, engine): - # Database schema is inconsistent - input_key = input("Remote database schema is inconsistent. Please stop all other ParlAI MTurk instances, and press any key to continue:") - print('Creating database schema...') - Base.metadata.drop_all(engine) + global engine + engine = create_engine('postgres://'+username+':'+password+'@'+host+':5432/'+db_name, poolclass=NullPool) + +def close_connection(db_engine, db_session): + db_session.close() + db_engine.dispose() + + +def init_database(): # Create all tables in the engine. This is equivalent to "Create Table" # statements in raw SQL. Base.metadata.create_all(engine) @@ -103,7 +122,33 @@ def init_database(host, db_name, username, password, should_check_schema_consist return engine, session_maker -def send_new_message(db_session, task_group_id, conversation_id, agent_id, message_text=None, reward=None, episode_done=False): +def send_new_command(db_session, task_group_id, conversation_id, receiver_agent_id, command): + new_command_object = Command( + task_group_id = task_group_id, + conversation_id = conversation_id, + receiver_agent_id = receiver_agent_id, + command = command + ) + db_session.add(new_command_object) + db_session.commit() + + return new_command_object + + +def get_command(db_session, task_group_id, conversation_id, receiver_agent_id, after_command_id): + query = db_session.query(Command).filter(Command.task_group_id==task_group_id) \ + .filter(Command.conversation_id==conversation_id) \ + .filter(Command.receiver_agent_id==receiver_agent_id) \ + .filter(Command.id > after_command_id) \ + .order_by(Command.id) + command_object = query.first() + if command_object: + return command_object + else: + return None + + +def send_new_message(db_session, task_group_id, conversation_id, sender_agent_id, receiver_agent_id, message_text=None, 
reward=None, episode_done=False): """ Message format: { @@ -112,16 +157,13 @@ def send_new_message(db_session, task_group_id, conversation_id, agent_id, messa "id": xxx, # id of speaker(s) "reward": xxx, "episode_done": xxx, # signals end of episode - - # Extra fields for MTurk state maintenance - "message_id": xxx, # populated with record on database } """ # ParlAI observation/action dict fields: new_message = { "text": message_text, - "id": agent_id, + "id": sender_agent_id, } if reward: new_message['reward'] = reward @@ -134,7 +176,8 @@ def send_new_message(db_session, task_group_id, conversation_id, agent_id, messa new_message_object = Message( task_group_id = task_group_id, conversation_id = conversation_id, - agent_id = agent_id, + sender_agent_id = sender_agent_id, + receiver_agent_id = receiver_agent_id, message_content = message_content ) db_session.add(new_message_object) @@ -143,7 +186,7 @@ def send_new_message(db_session, task_group_id, conversation_id, agent_id, messa return new_message_object -def get_new_messages(db_session, task_group_id, conversation_id=None, after_message_id=None, excluded_agent_id=None, included_agent_id=None, populate_meta_info=False): +def get_new_messages(db_session, task_group_id, receiver_agent_id, conversation_id=None, after_message_id=None, excluded_sender_agent_id=None, included_sender_agent_id=None, populate_meta_info=False): """ Return: conversation_dict = { @@ -166,23 +209,25 @@ def get_new_messages(db_session, task_group_id, conversation_id=None, after_mess if not after_message_id: after_message_id = -1 - included_agent_ids = [] - if included_agent_id: - included_agent_ids = [included_agent_id] + included_sender_agent_ids = [] + if included_sender_agent_id: + included_sender_agent_ids = [included_sender_agent_id] - excluded_agent_ids = [] - if excluded_agent_id: - excluded_agent_ids = [excluded_agent_id] + excluded_sender_agent_ids = [] + if excluded_sender_agent_id: + excluded_sender_agent_ids = 
[excluded_sender_agent_id] last_message_id = None query = db_session.query(Message).filter(Message.task_group_id==task_group_id).filter(Message.id > after_message_id) - if len(included_agent_ids) > 0: - query = query.filter(Message.agent_id.in_(included_agent_ids)) - if len(excluded_agent_ids) > 0: - query = query.filter(~Message.agent_id.in_(excluded_agent_ids)) + if len(included_sender_agent_ids) > 0: + query = query.filter(Message.sender_agent_id.in_(included_sender_agent_ids)) + if len(excluded_sender_agent_ids) > 0: + query = query.filter(~Message.sender_agent_id.in_(excluded_sender_agent_ids)) if conversation_id: query = query.filter(Message.conversation_id==conversation_id) + if receiver_agent_id: + query = query.filter(Message.receiver_agent_id==receiver_agent_id) new_message_objects = query.order_by(Message.id) conversation_dict = {} @@ -196,11 +241,12 @@ def get_new_messages(db_session, task_group_id, conversation_id=None, after_mess new_message_dict = { "text": text, - "id": new_message_object.agent_id, + "id": new_message_object.sender_agent_id, } if 'reward' in message_content: new_message_dict['reward'] = message_content['reward'] new_message_dict['episode_done'] = message_content.get('episode_done', False) + new_message_dict['receiver_agent_id'] = new_message_object.receiver_agent_id if populate_meta_info: new_message_dict['message_id'] = new_message_object.id @@ -212,58 +258,50 @@ def get_new_messages(db_session, task_group_id, conversation_id=None, after_mess return conversation_dict, last_message_id -def get_hit_index_and_assignment_index(db_session, task_group_id, agent_id, num_assignments): - new_assignment_object = MTurkHITAssignmentInfo(task_group_id=task_group_id, agent_id=agent_id) - db_session.add(new_assignment_object) +def sync_hit_assignment_info(db_session, task_group_id, agent_id, num_assignments, assignment_id, hit_id, worker_id): + new_allocation_object = MTurkHITAgentAllocation( + task_group_id=task_group_id, + agent_id=agent_id, + 
conversation_id=None, + assignment_id=assignment_id, + hit_id=hit_id, + worker_id=worker_id + ) + db_session.add(new_allocation_object) db_session.commit() - object_id = new_assignment_object.id - existing_assignment_id_list = db_session.query(MTurkHITAssignmentInfo.id) \ - .filter(MTurkHITAssignmentInfo.task_group_id==task_group_id) \ - .filter(MTurkHITAssignmentInfo.agent_id==agent_id) \ - .order_by(MTurkHITAssignmentInfo.id).all() - existing_assignment_id_list = [id for (id, ) in existing_assignment_id_list] - index_in_list = existing_assignment_id_list.index(object_id) - return {'hit_index': math.floor(index_in_list / num_assignments) + 1, 'assignment_index': index_in_list % num_assignments + 1} - - -def set_hit_info(db_session, task_group_id, conversation_id, assignment_id, hit_id, worker_id, is_sandbox, approval_status='pending'): - existing_object = db_session.query(MTurkHITInfo) \ - .filter(MTurkHITInfo.task_group_id==task_group_id) \ - .filter(MTurkHITInfo.conversation_id==conversation_id) \ - .filter(MTurkHITInfo.assignment_id==assignment_id) \ - .filter(MTurkHITInfo.hit_id==hit_id) \ - .first() - if not existing_object: - new_hit_info_object = MTurkHITInfo( - task_group_id=task_group_id, - conversation_id=conversation_id, - assignment_id=assignment_id, - hit_id=hit_id, - worker_id=worker_id, - is_sandbox=is_sandbox, - approval_status=approval_status - ) - db_session.add(new_hit_info_object) - db_session.commit() - else: - existing_object.assignment_id = assignment_id - existing_object.hit_id = hit_id - existing_object.worker_id = worker_id - existing_object.is_sandbox = is_sandbox - existing_object.approval_status = approval_status - db_session.add(existing_object) - db_session.commit() - -def get_all_matching_hit_infos(db_session, task_group_id, conversation_id): - matching_hit_infos = list(db_session.query(MTurkHITInfo).filter(MTurkHITInfo.task_group_id==task_group_id).filter(MTurkHITInfo.conversation_id==conversation_id).all()) - return 
matching_hit_infos - -def get_approval_status_count(db_session, task_group_id, approval_status, conversation_id=None): - query = db_session.query(MTurkHITInfo).filter(MTurkHITInfo.task_group_id==task_group_id).filter(MTurkHITInfo.approval_status==approval_status) - if conversation_id: - query = query.filter(MTurkHITInfo.conversation_id==conversation_id) - return query.count() + object_id = new_allocation_object.id + existing_allocation_id_list = db_session.query(MTurkHITAgentAllocation.id) \ + .filter(MTurkHITAgentAllocation.task_group_id==task_group_id) \ + .filter(MTurkHITAgentAllocation.agent_id==agent_id) \ + .order_by(MTurkHITAgentAllocation.id).all() + existing_allocation_id_list = [id for (id, ) in existing_allocation_id_list] + index_in_list = existing_allocation_id_list.index(object_id) + + hit_index = int(math.floor(index_in_list / num_assignments) + 1) + assignment_index = index_in_list % num_assignments + 1 + conversation_id = str(hit_index) + '_' + str(assignment_index) + new_allocation_object.conversation_id = conversation_id + db_session.add(new_allocation_object) + db_session.commit() -def get_all_approval_status(db_session, task_group_id): - return db_session.query(MTurkHITInfo).filter(MTurkHITInfo.task_group_id==task_group_id).order_by(MTurkHITInfo.conversation_id).all() \ No newline at end of file + return {'hit_index': hit_index, 'assignment_index': assignment_index} + +def get_hit_assignment_info(db_session, task_group_id, agent_id, conversation_id): + existing_allocation_object = db_session.query(MTurkHITAgentAllocation) \ + .filter(MTurkHITAgentAllocation.task_group_id==task_group_id) \ + .filter(MTurkHITAgentAllocation.agent_id==agent_id) \ + .filter(MTurkHITAgentAllocation.conversation_id==conversation_id) \ + .first() + assignment_id = None + hit_id = None + worker_id = None + if existing_allocation_object: + assignment_id = existing_allocation_object.assignment_id + hit_id = existing_allocation_object.hit_id + worker_id = 
existing_allocation_object.worker_id + return { + 'assignment_id': assignment_id, + 'hit_id': hit_id, + 'worker_id': worker_id + } \ No newline at end of file diff --git a/parlai/mturk/core/handler_template.py b/parlai/mturk/core/handler_template.py index 3b5f4d3dd1d..b37658e217e 100755 --- a/parlai/mturk/core/handler_template.py +++ b/parlai/mturk/core/handler_template.py @@ -17,11 +17,12 @@ import data_model # Dynamically generated code begin -# Expects mturk_submit_url, frame_height, rds_host, rds_db_name, rds_username, rds_password, task_description, requester_key_gt, num_hits, num_assignments, is_sandbox +# Expects mturk_submit_url, frame_height, rds_host, rds_db_name, rds_username, rds_password # {{block_task_config}} # Dynamically generated code end -db_engine, db_session_maker = data_model.init_database(rds_host, rds_db_name, rds_username, rds_password) +data_model.setup_database_engine(rds_host, rds_db_name, rds_username, rds_password) +db_engine, db_session_maker = data_model.init_database() db_session = db_session_maker() @@ -32,6 +33,8 @@ def _render_template(template_context, template_file_name): return rendered_template def lambda_handler(event, context): + global db_engine, db_session + params = None if event['method'] == 'GET': params = event['query'] @@ -40,7 +43,9 @@ def lambda_handler(event, context): method_name = params['method_name'] if method_name in globals(): - return globals()[method_name](event, context) + result = globals()[method_name](event, context) + data_model.close_connection(db_engine, db_session) + return result def chat_index(event, context): if event['method'] == 'GET': @@ -51,54 +56,76 @@ def chat_index(event, context): try: task_group_id = event['query']['task_group_id'] - hit_index = event['query'].get('hit_index', 'Pending') - assignment_index = event['query'].get('assignment_index', 'Pending') - all_agent_ids = event['query']['all_agent_ids'] cur_agent_id = event['query']['cur_agent_id'] assignment_id = 
event['query']['assignmentId'] # from mturk - task_additional_info = event['query'].get('task_additional_info', '') # Maximum length: 1000 characters if assignment_id == 'ASSIGNMENT_ID_NOT_AVAILABLE': - template_context['task_description'] = task_description template_context['is_cover_page'] = True + + custom_cover_page = cur_agent_id + '_cover_page.html' + if os.path.exists(custom_cover_page): + return _render_template(template_context, custom_cover_page) + else: + return _render_template(template_context, 'cover_page.html') else: + template_context['is_cover_page'] = False template_context['task_group_id'] = task_group_id - template_context['hit_index'] = hit_index - template_context['assignment_index'] = assignment_index + template_context['hit_index'] = 'Pending' + template_context['assignment_index'] = 'Pending' template_context['cur_agent_id'] = cur_agent_id - template_context['all_agent_ids'] = all_agent_ids - template_context['task_description'] = task_description.replace('{{task_additional_info}}', task_additional_info) - template_context['mturk_submit_url'] = mturk_submit_url - template_context['is_cover_page'] = False template_context['frame_height'] = frame_height - return _render_template(template_context, 'mturk_index.html') + custom_index_page = cur_agent_id + '_index.html' + if os.path.exists(custom_index_page): + return _render_template(template_context, custom_index_page) + else: + return _render_template(template_context, 'mturk_index.html') except KeyError: - raise Exception('400') + raise -def save_hit_info(event, context): +def get_hit_config(event, context): + if event['method'] == 'GET': + with open('hit_config.json', 'r') as hit_config_file: + return json.loads(hit_config_file.read().replace('\n', '')) + +def send_new_command(event, context): if event['method'] == 'POST': - """ - Saves HIT info to DB. 
- Expects , , , , as POST body parameters - """ params = event['body'] task_group_id = params['task_group_id'] conversation_id = params['conversation_id'] - assignment_id = params['assignmentId'] - hit_id = params['hitId'] - worker_id = params['workerId'] - - data_model.set_hit_info( - db_session = db_session, - task_group_id = task_group_id, - conversation_id = conversation_id, - assignment_id = assignment_id, - hit_id = hit_id, - worker_id = worker_id, - is_sandbox = is_sandbox + receiver_agent_id = params['receiver_agent_id'] + command = params['command'] + + new_command_object = data_model.send_new_command( + db_session=db_session, + task_group_id=task_group_id, + conversation_id=conversation_id, + receiver_agent_id=receiver_agent_id, + command=command + ) + + return json.dumps(data_model.object_as_dict(new_command_object)) + +def get_command(event, context): + if event['method'] == 'GET': + task_group_id = event['query']['task_group_id'] + conversation_id = event['query']['conversation_id'] + receiver_agent_id = event['query']['receiver_agent_id'] + last_command_id = int(event['query']['last_command_id']) + + command_object = data_model.get_command( + db_session=db_session, + task_group_id=task_group_id, + conversation_id=conversation_id, + receiver_agent_id=receiver_agent_id, + after_command_id=last_command_id ) + + if command_object: + return json.dumps(data_model.object_as_dict(command_object)) + else: + return None def get_new_messages(event, context): if event['method'] == 'GET': @@ -107,24 +134,27 @@ def get_new_messages(event, context): Expects in GET query parameters: + (optional) - (optional) + (optional) """ task_group_id = event['query']['task_group_id'] last_message_id = int(event['query']['last_message_id']) + receiver_agent_id = event['query']['receiver_agent_id'] conversation_id = None if 'conversation_id' in event['query']: conversation_id = event['query']['conversation_id'] - excluded_agent_id = event['query'].get('excluded_agent_id', None) - 
included_agent_id = event['query'].get('included_agent_id', None) + excluded_sender_agent_id = event['query'].get('excluded_sender_agent_id', None) + included_sender_agent_id = event['query'].get('included_sender_agent_id', None) conversation_dict, new_last_message_id = data_model.get_new_messages( db_session=db_session, task_group_id=task_group_id, + receiver_agent_id=receiver_agent_id, conversation_id=conversation_id, after_message_id=last_message_id, - excluded_agent_id=excluded_agent_id, - included_agent_id=included_agent_id, + excluded_sender_agent_id=excluded_sender_agent_id, + included_sender_agent_id=included_sender_agent_id, populate_meta_info=True ) @@ -139,14 +169,11 @@ def get_new_messages(event, context): def send_new_message(event, context): if event['method'] == 'POST': - """ - Send new message for this agent. - Expects , , and as POST body parameters - """ params = event['body'] task_group_id = params['task_group_id'] conversation_id = params['conversation_id'] - cur_agent_id = params['cur_agent_id'] + sender_agent_id = params['sender_agent_id'] + receiver_agent_id = params['receiver_agent_id'] if 'receiver_agent_id' in params else None message_text = params['text'] if 'text' in params else None reward = params['reward'] if 'reward' in params else None episode_done = params['episode_done'] @@ -155,7 +182,8 @@ def send_new_message(event, context): db_session=db_session, task_group_id=task_group_id, conversation_id=conversation_id, - agent_id=cur_agent_id, + sender_agent_id=sender_agent_id, + receiver_agent_id=receiver_agent_id, message_text=message_text, reward=reward, episode_done=episode_done @@ -163,7 +191,7 @@ def send_new_message(event, context): new_message = { "message_id": new_message_object.id, - "id": cur_agent_id, + "id": sender_agent_id, "text": message_text, } if reward: @@ -172,143 +200,49 @@ def send_new_message(event, context): return json.dumps(new_message) -def get_hit_index_and_assignment_index(event, context): - if event['method'] 
== 'GET': - """ - Handler for get assignment index endpoint. - Expects , as query parameters. - """ - try: - task_group_id = event['query']['task_group_id'] - agent_id = event['query']['agent_id'] - return data_model.get_hit_index_and_assignment_index( - db_session=db_session, - task_group_id=task_group_id, - agent_id=agent_id, - num_assignments=num_assignments - ) - except KeyError: - raise Exception('400') - -def approval_index(event, context): - if event['method'] == 'GET': - """ - Handler for approval page endpoint. - Expects , , , as query parameters. - """ - try: - requester_key = event['query']['requester_key'] - if not requester_key == requester_key_gt: - raise Exception('403') - - task_group_id = event['query']['task_group_id'] - hit_index = event['query']['hit_index'] - assignment_index = event['query']['assignment_index'] - mturk_agent_ids = event['query']['mturk_agent_ids'] - - template_context = {} - template_context['task_group_id'] = task_group_id - template_context['hit_index'] = hit_index - template_context['assignment_index'] = assignment_index - template_context['mturk_agent_ids'] = mturk_agent_ids - template_context['task_description'] = task_description - template_context['is_cover_page'] = False - template_context['is_approval_page'] = True - template_context['num_hits'] = int(num_hits) - template_context['num_assignments'] = int(num_assignments) - template_context['frame_height'] = frame_height - - return _render_template(template_context, 'mturk_index.html') - - except KeyError: - raise Exception('400') - -def review_hit(event, context): +def sync_hit_assignment_info(event, context): if event['method'] == 'POST': """ - Approve or reject assignment. - Expects , , , as POST body parameters + Handler for syncing HIT assignment info between webpage client and remote database. 
""" try: params = event['body'] - requester_key = params['requester_key'] - if not requester_key == requester_key_gt: - raise Exception('403') - task_group_id = params['task_group_id'] - conversation_id = params['conversation_id'] - action = params['action'] # 'approve' or 'reject' - - hit_infos = data_model.get_all_matching_hit_infos( - db_session=db_session, - task_group_id=task_group_id, - conversation_id=conversation_id - ) - - if len(hit_infos) > 0: - for hit_info in hit_infos: - assignment_id = hit_info.assignment_id - client = boto3.client( - service_name = 'mturk', - region_name = 'us-east-1', - endpoint_url = 'https://mturk-requester-sandbox.us-east-1.amazonaws.com' - ) - # Region is always us-east-1 - if not hit_info.is_sandbox: - client = boto3.client(service_name = 'mturk', region_name='us-east-1') - - if action == 'approve': - client.approve_assignment(AssignmentId=assignment_id) - hit_info.approval_status = 'approved' - elif action == 'reject': - client.reject_assignment(AssignmentId=assignment_id, RequesterFeedback='') - hit_info.approval_status = 'rejected' - db_session.add(hit_info) - db_session.commit() - - except KeyError: - raise Exception('400') + agent_id = params['agent_id'] + num_assignments = params['num_assignments'] + assignment_id = params['assignment_id'] + hit_id = params['hit_id'] + worker_id = params['worker_id'] -def get_approval_status_count(event, context): - if event['method'] == 'GET': - """ - Handler for getting the number of pending approvals. - Expects , , as query parameters. 
- """ - try: - requester_key = event['query']['requester_key'] - if not requester_key == requester_key_gt: - raise Exception('403') - - task_group_id = event['query']['task_group_id'] - conversation_id = event['query'].get('conversation_id', None) - approval_status = event['query']['approval_status'] - return data_model.get_approval_status_count( + return data_model.sync_hit_assignment_info( db_session=db_session, task_group_id=task_group_id, - conversation_id=conversation_id, - approval_status=approval_status + agent_id=agent_id, + num_assignments=int(num_assignments), + assignment_id=assignment_id, + hit_id=hit_id, + worker_id=worker_id ) except KeyError: - raise Exception('400') + raise -def get_all_approval_status(event, context): +def get_hit_assignment_info(event, context): if event['method'] == 'GET': """ - Handler for getting the number of pending approvals. - Expects , as query parameters. + Handler for getting HIT assignment info. """ try: - requester_key = event['query']['requester_key'] - if not requester_key == requester_key_gt: - raise Exception('403') - task_group_id = event['query']['task_group_id'] - hit_info_objects = data_model.get_all_approval_status( + agent_id = event['query']['agent_id'] + conversation_id = event['query']['conversation_id'] + + return json.dumps(data_model.get_hit_assignment_info( db_session=db_session, - task_group_id=task_group_id - ) - return [hio.as_dict() for hio in hit_info_objects] + task_group_id=task_group_id, + agent_id=agent_id, + conversation_id=conversation_id + )) except KeyError: - raise Exception('400') \ No newline at end of file + raise + diff --git a/parlai/mturk/core/html/core.html b/parlai/mturk/core/html/core.html new file mode 100644 index 00000000000..78107ab29bc --- /dev/null +++ b/parlai/mturk/core/html/core.html @@ -0,0 +1,435 @@ + + + +{% block html_head %} + +MTurk Chat + + +{% endblock %} + + +
+
+ +{% block main_pane %} +{% block left_pane %} +
+

Live Chat

+
+ + +
+{% endblock %} + +{% block right_pane %} +
+
+
+
+ +
+ +
+ + +
+ +
+
+
+{% endblock %} +{% endblock %} + +
+
+ + + + + + +{% block additional_scripts %} +{% endblock %} + + + \ No newline at end of file diff --git a/parlai/mturk/core/html/cover_page.html b/parlai/mturk/core/html/cover_page.html new file mode 100644 index 00000000000..db345956b94 --- /dev/null +++ b/parlai/mturk/core/html/cover_page.html @@ -0,0 +1,17 @@ + +{% extends "core.html" %} + +{% block main_pane %} +
+

Live Chat

+
+ + +
+{% endblock %} \ No newline at end of file diff --git a/parlai/mturk/core/html/mturk_index.html b/parlai/mturk/core/html/mturk_index.html new file mode 100644 index 00000000000..711ea910578 --- /dev/null +++ b/parlai/mturk/core/html/mturk_index.html @@ -0,0 +1,63 @@ + +{% extends "core.html" %} + + \ No newline at end of file diff --git a/parlai/mturk/core/mturk_index.html b/parlai/mturk/core/mturk_index.html deleted file mode 100755 index 7566f3b8198..00000000000 --- a/parlai/mturk/core/mturk_index.html +++ /dev/null @@ -1,549 +0,0 @@ - - - -MTurk Chat - - - -
-
-
-

Live Chat

-
- - {{task_description}} - -
-
-
- {% if not is_cover_page %} -
-
- {% if not is_approval_page %} - - {% endif %} - {% endif %} -
- - {% if not is_cover_page %} -
- {% if not is_approval_page %} - - - -
-
-
- - -
-
-
-
- -
- {% else %} -
-
- Do you approve this work? -
-
- - -
-
- {% endif %} -
- {% endif %} -
-
-
- - - - - \ No newline at end of file diff --git a/parlai/mturk/core/setup_aws.py b/parlai/mturk/core/setup_aws.py index 2e783b090fb..c37ed9a2bac 100644 --- a/parlai/mturk/core/setup_aws.py +++ b/parlai/mturk/core/setup_aws.py @@ -8,12 +8,8 @@ import shutil from subprocess import call import zipfile -try: - import boto3 - import botocore - import psycopg2 -except ModuleNotFoundError: - raise SystemExit("Please install boto3 and psycopg2 by running: pip install boto3 psycopg2") +import boto3 +import botocore import time import json import webbrowser @@ -21,10 +17,10 @@ import getpass from botocore.exceptions import ClientError from botocore.exceptions import ProfileNotFound -from .data_model import init_database +from parlai.mturk.core.data_model import setup_database_engine, init_database, check_database_health aws_profile_name = 'parlai_mturk' -region_name = 'us-west-2' +region_name = 'us-east-1' user_name = getpass.getuser() iam_role_name = 'parlai_relay_server' @@ -40,9 +36,16 @@ rds_password = 'parlai_user_password' rds_security_group_name = 'parlai-mturk-db-security-group' rds_security_group_description = 'Security group for ParlAI MTurk DB' +rds_db_instance_class = 'db.t2.medium' parent_dir = os.path.dirname(os.path.abspath(__file__)) -files_to_copy = [parent_dir+'/'+'data_model.py', parent_dir+'/'+'mturk_index.html'] +generic_files_to_copy = [ + os.path.join(parent_dir, 'hit_config.json'), + os.path.join(parent_dir, 'data_model.py'), + os.path.join(parent_dir, 'html', 'core.html'), + os.path.join(parent_dir, 'html', 'cover_page.html'), + os.path.join(parent_dir, 'html', 'mturk_index.html') +] lambda_server_directory_name = 'lambda_server' lambda_server_zip_file_name = 'lambda_server.zip' mturk_hit_frame_height = 650 @@ -148,15 +151,6 @@ def setup_aws_credentials(): print("AWS credentials successfully saved in "+aws_credentials_file_path+" file.\n") os.environ["AWS_PROFILE"] = aws_profile_name -def get_requester_key(): - # Compute requester key - session = 
boto3.Session(profile_name=aws_profile_name) - hash_gen = hashlib.sha512() - hash_gen.update(session.get_credentials().access_key.encode('utf-8')+session.get_credentials().secret_key.encode('utf-8')) - requester_key_gt = hash_gen.hexdigest() - - return requester_key_gt - def setup_rds(): # Set up security group rules first ec2 = boto3.client('ec2', region_name=region_name) @@ -190,63 +184,149 @@ def setup_rds(): response = ec2.describe_security_groups(GroupNames=[rds_security_group_name]) security_group_id = response['SecurityGroups'][0]['GroupId'] - rds = boto3.client('rds', region_name=region_name) - try: - rds.create_db_instance(DBInstanceIdentifier=rds_db_instance_identifier, - AllocatedStorage=20, - DBName=rds_db_name, - Engine='postgres', - # General purpose SSD - StorageType='gp2', - StorageEncrypted=False, - AutoMinorVersionUpgrade=True, - MultiAZ=False, - MasterUsername=rds_username, - MasterUserPassword=rds_password, - VpcSecurityGroupIds=[security_group_id], - DBInstanceClass='db.t2.micro', - Tags=[{'Key': 'Name', 'Value': rds_db_instance_identifier}]) - print('RDS: Starting RDS instance...') - except ClientError as e: - if e.response['Error']['Code'] == 'DBInstanceAlreadyExists': - print('RDS: DB instance already exists.') - else: - raise + rds_instance_is_ready = False + while not rds_instance_is_ready: + rds = boto3.client('rds', region_name=region_name) + try: + rds.create_db_instance(DBInstanceIdentifier=rds_db_instance_identifier, + AllocatedStorage=20, + DBName=rds_db_name, + Engine='postgres', + # General purpose SSD + StorageType='gp2', + StorageEncrypted=False, + AutoMinorVersionUpgrade=True, + MultiAZ=False, + MasterUsername=rds_username, + MasterUserPassword=rds_password, + VpcSecurityGroupIds=[security_group_id], + DBInstanceClass=rds_db_instance_class, + Tags=[{'Key': 'Name', 'Value': rds_db_instance_identifier}]) + print('RDS: Starting RDS instance...') + except ClientError as e: + if e.response['Error']['Code'] == 
'DBInstanceAlreadyExists': + print('RDS: DB instance already exists.') + else: + raise + + response = rds.describe_db_instances(DBInstanceIdentifier=rds_db_instance_identifier) + db_instances = response['DBInstances'] + db_instance = db_instances[0] + + if db_instance['DBInstanceClass'] != rds_db_instance_class: # If instance class doesn't match + print('RDS: Instance class does not match.') + remove_rds_database() + rds_instance_is_ready = False + continue + + status = db_instance['DBInstanceStatus'] - response = rds.describe_db_instances(DBInstanceIdentifier=rds_db_instance_identifier) - db_instances = response['DBInstances'] - db_instance = db_instances[0] - status = db_instance['DBInstanceStatus'] + if status == 'deleting': + print("RDS: Waiting for previous delete operation to complete. This might take a couple minutes...") + try: + while status == 'deleting': + time.sleep(5) + response = rds.describe_db_instances(DBInstanceIdentifier=rds_db_instance_identifier) + db_instances = response['DBInstances'] + db_instance = db_instances[0] + status = db_instance['DBInstanceStatus'] + except ClientError as e: + rds_instance_is_ready = False + continue + + if status == 'creating': + print("RDS: Waiting for newly created database to be available. This might take a couple minutes...") + while status == 'creating': + time.sleep(5) + response = rds.describe_db_instances(DBInstanceIdentifier=rds_db_instance_identifier) + db_instances = response['DBInstances'] + db_instance = db_instances[0] + status = db_instance['DBInstanceStatus'] - if status not in ['available', 'backing-up']: - print("RDS: Waiting for newly created database to be available. 
This might take a couple minutes...") + endpoint = db_instance['Endpoint'] + host = endpoint['Address'] - while status not in ['available', 'backing-up']: - time.sleep(5) + setup_database_engine(host, rds_db_name, rds_username, rds_password) + database_health_status = check_database_health() + if database_health_status in ['missing_table', 'healthy']: + print("Remote database health status: "+database_health_status) + init_database() + elif database_health_status in ['inconsistent_schema', 'unknown_error']: + print("Remote database error: "+database_health_status+". Removing RDS database...") + remove_rds_database() + rds_instance_is_ready = False + continue + + print('RDS: DB instance ready.') + rds_instance_is_ready = True + + return host + +def remove_rds_database(): + # Remove RDS database + rds = boto3.client('rds', region_name=region_name) + try: response = rds.describe_db_instances(DBInstanceIdentifier=rds_db_instance_identifier) db_instances = response['DBInstances'] db_instance = db_instances[0] status = db_instance['DBInstanceStatus'] - endpoint = db_instance['Endpoint'] - host = endpoint['Address'] + if status == 'deleting': + print("RDS: Waiting for previous delete operation to complete. This might take a couple minutes...") + else: + response = rds.delete_db_instance( + DBInstanceIdentifier=rds_db_instance_identifier, + SkipFinalSnapshot=True, + ) + response = rds.describe_db_instances(DBInstanceIdentifier=rds_db_instance_identifier) + db_instances = response['DBInstances'] + db_instance = db_instances[0] + status = db_instance['DBInstanceStatus'] - init_database(host, rds_db_name, rds_username, rds_password, should_check_schema_consistency=True) + if status == 'deleting': + print("RDS: Deleting database. 
This might take a couple minutes...") - print('RDS: DB instance ready.') + try: + while status == 'deleting': + time.sleep(5) + response = rds.describe_db_instances(DBInstanceIdentifier=rds_db_instance_identifier) + db_instances = response['DBInstances'] + db_instance = db_instances[0] + status = db_instance['DBInstanceStatus'] + except ClientError as e: + print("RDS: Database deleted.") + + except ClientError as e: + print("RDS: Database doesn't exist.") - return host -def setup_relay_server_api(mturk_submit_url, rds_host, task_description, is_sandbox, num_hits, num_assignments, requester_key_gt, should_clean_up_after_upload=True): +def create_hit_config(task_description, num_hits, num_assignments, is_sandbox): + mturk_submit_url = 'https://workersandbox.mturk.com/mturk/externalSubmit' + if not is_sandbox: + mturk_submit_url = 'https://www.mturk.com/mturk/externalSubmit' + hit_config = { + 'task_description': task_description, + 'num_hits': num_hits, + 'num_assignments': num_assignments, + 'is_sandbox': is_sandbox, + 'mturk_submit_url': mturk_submit_url, + } + hit_config_file_path = os.path.join(parent_dir, 'hit_config.json') + if os.path.exists(hit_config_file_path): + os.remove(hit_config_file_path) + with open(hit_config_file_path, 'w') as hit_config_file: + hit_config_file.write(json.dumps(hit_config)) + +def setup_relay_server_api(rds_host, task_files_to_copy, should_clean_up_after_upload=True): # Dynamically generate handler.py file, and then create zip file print("Lambda: Preparing relay server code...") # Create clean folder for lambda server code - if os.path.exists(parent_dir + '/' + lambda_server_directory_name): - shutil.rmtree(parent_dir + '/' + lambda_server_directory_name) - os.makedirs(parent_dir + '/' + lambda_server_directory_name) - if os.path.exists(parent_dir + '/' + lambda_server_zip_file_name): - os.remove(parent_dir + '/' + lambda_server_zip_file_name) + if os.path.exists(os.path.join(parent_dir, lambda_server_directory_name)): + 
shutil.rmtree(os.path.join(parent_dir, lambda_server_directory_name)) + os.makedirs(os.path.join(parent_dir, lambda_server_directory_name)) + if os.path.exists(os.path.join(parent_dir, lambda_server_zip_file_name)): + os.remove(os.path.join(parent_dir, lambda_server_zip_file_name)) # Copying files with open(os.path.join(parent_dir, 'handler_template.py'), 'r') as handler_template_file: @@ -254,22 +334,16 @@ def setup_relay_server_api(mturk_submit_url, rds_host, task_description, is_sand handler_file_string = handler_file_string.replace( '# {{block_task_config}}', "frame_height = " + str(mturk_hit_frame_height) + "\n" + \ - "mturk_submit_url = \'" + mturk_submit_url + "\'\n" + \ "rds_host = \'" + rds_host + "\'\n" + \ "rds_db_name = \'" + rds_db_name + "\'\n" + \ "rds_username = \'" + rds_username + "\'\n" + \ - "rds_password = \'" + rds_password + "\'\n" + \ - "requester_key_gt = \'" + requester_key_gt + "\'\n" + \ - "num_hits = " + str(num_hits) + "\n" + \ - "num_assignments = " + str(num_assignments) + "\n" + \ - "is_sandbox = " + str(is_sandbox) + "\n" + \ - 'task_description = ' + task_description) + "rds_password = \'" + rds_password + "\'") with open(os.path.join(parent_dir, lambda_server_directory_name, 'handler.py'), 'w') as handler_file: handler_file.write(handler_file_string) create_zip_file( lambda_server_directory_name=lambda_server_directory_name, lambda_server_zip_file_name=lambda_server_zip_file_name, - files_to_copy=files_to_copy + files_to_copy=generic_files_to_copy + task_files_to_copy ) with open(os.path.join(parent_dir, lambda_server_zip_file_name), mode='rb') as zip_file: zip_file_content = zip_file.read() @@ -328,7 +402,7 @@ def setup_relay_server_api(mturk_submit_url, rds_host, task_description, is_sand Code={ 'ZipFile': zip_file_content }, - Timeout = 10, # in seconds + Timeout = 300, # in seconds MemorySize = 128, # in MB Publish = True, ) @@ -348,8 +422,9 @@ def setup_relay_server_api(mturk_submit_url, rds_host, task_description, is_sand # 
Clean up if needed if should_clean_up_after_upload: - shutil.rmtree(parent_dir + '/' + lambda_server_directory_name) - os.remove(parent_dir + '/' + lambda_server_zip_file_name) + shutil.rmtree(os.path.join(parent_dir, lambda_server_directory_name)) + os.remove(os.path.join(parent_dir, lambda_server_zip_file_name)) + os.remove(os.path.join(parent_dir, 'hit_config.json')) # Check API Gateway existence. # If doesn't exist, create the APIs, point them to Lambda function, and set correct configurations @@ -440,6 +515,7 @@ def setup_relay_server_api(mturk_submit_url, rds_host, task_description, is_sand api_gateway_client.create_deployment( restApiId = rest_api_id, stageName = "prod", + cacheClusterEnabled = False, ) html_api_endpoint_url = 'https://' + rest_api_id + '.execute-api.' + region_name + '.amazonaws.com/prod/' + endpoint_api_name_html @@ -447,7 +523,35 @@ def setup_relay_server_api(mturk_submit_url, rds_host, task_description, is_sand return html_api_endpoint_url, json_api_endpoint_url -def check_mturk_balance(num_hits, hit_reward, is_sandbox): +def calculate_mturk_cost(payment_opt): + """MTurk Pricing: https://requester.mturk.com/pricing + 20% fee on the reward and bonus amount (if any) you pay Workers. + HITs with 10 or more assignments will be charged an additional 20% fee on the reward you pay Workers. 
+ + Example payment_opt format for paying reward: + { + 'type': 'reward', + 'num_hits': 1, + 'num_assignments': 1, + 'reward': 0.05 # in dollars + } + + Example payment_opt format for paying bonus: + { + 'type': 'bonus', + 'amount': 1000 # in dollars + } + """ + total_cost = 0 + if payment_opt['type'] == 'reward': + total_cost = payment_opt['num_hits'] * payment_opt['num_assignments'] * payment_opt['reward'] * 1.2 + if payment_opt['num_assignments'] >= 10: + total_cost = total_cost * 1.2 + elif payment_opt['type'] == 'bonus': + total_cost = payment_opt['amount'] * 1.2 + return total_cost + +def check_mturk_balance(balance_needed, is_sandbox): client = boto3.client( service_name = 'mturk', region_name = 'us-east-1', @@ -469,7 +573,7 @@ def check_mturk_balance(num_hits, hit_reward, is_sandbox): else: raise - balance_needed = num_hits * hit_reward * 1.2 + balance_needed = balance_needed * 1.2 # AWS charges 20% fee for both reward and bonus payment if user_balance < balance_needed: print("You might not have enough money in your MTurk account. 
Please go to https://requester.mturk.com/account and increase your balance to at least $"+f'{balance_needed:.2f}'+", and then try again.") @@ -477,7 +581,18 @@ def check_mturk_balance(num_hits, hit_reward, is_sandbox): else: return True -def create_hit_type(hit_title, hit_description, hit_keywords, hit_reward, is_sandbox): +def get_mturk_client(is_sandbox): + client = boto3.client( + service_name = 'mturk', + region_name = 'us-east-1', + endpoint_url = 'https://mturk-requester-sandbox.us-east-1.amazonaws.com' + ) + # Region is always us-east-1 + if not is_sandbox: + client = boto3.client(service_name = 'mturk', region_name='us-east-1') + return client + +def create_hit_type(hit_title, hit_description, hit_keywords, hit_reward, assignment_duration_in_seconds, is_sandbox): client = boto3.client( service_name = 'mturk', region_name = 'us-east-1', @@ -505,7 +620,7 @@ def create_hit_type(hit_title, hit_description, hit_keywords, hit_reward, is_san # Create the HIT type response = client.create_hit_type( AutoApprovalDelayInSeconds=4*7*24*3600, # auto-approve after 4 weeks - AssignmentDurationInSeconds=1800, + AssignmentDurationInSeconds=assignment_duration_in_seconds, Reward=str(hit_reward), Title=hit_title, Keywords=hit_keywords, @@ -602,30 +717,33 @@ def setup_all_dependencies(lambda_server_directory_name): # Set up all other dependencies if has_anaconda: - call(("pip install --target="+parent_dir+'/'+lambda_server_directory_name+" -r "+parent_dir+"/lambda_requirements.txt").split(" "), stdout=devnull, stderr=devnull) + call(("pip install --target="+os.path.join(parent_dir, lambda_server_directory_name)+" -r "+os.path.join(parent_dir, "lambda_requirements.txt")).split(" "), stdout=devnull, stderr=devnull) else: - shutil.rmtree(parent_dir+"/venv", ignore_errors=True) + shutil.rmtree(os.path.join(parent_dir, "venv"), ignore_errors=True) call("pip install virtualenv".split(" "), stdout=devnull, stderr=devnull) - call("virtualenv -p python2 venv".split(" "), 
stdout=devnull, stderr=devnull) - call(("venv/bin/pip install --target="+parent_dir+'/'+lambda_server_directory_name+" -r "+parent_dir+"/lambda_requirements.txt").split(" "), stdout=devnull, stderr=devnull) - shutil.rmtree(parent_dir+"/venv") + call(("virtualenv -p python2 "+os.path.join(parent_dir, "venv")).split(" "), stdout=devnull, stderr=devnull) + call((os.path.join(parent_dir, 'venv', 'bin', 'pip')+" install --target="+os.path.join(parent_dir, lambda_server_directory_name)+" -r "+os.path.join(parent_dir, "lambda_requirements.txt")).split(" "), stdout=devnull, stderr=devnull) + shutil.rmtree(os.path.join(parent_dir, "venv"), ignore_errors=True) # Set up psycopg2 - shutil.rmtree(parent_dir + '/awslambda-psycopg2/', ignore_errors=True) - call(("git clone https://github.com/jkehler/awslambda-psycopg2.git " + parent_dir + "/awslambda-psycopg2").split(" "), stdout=devnull, stderr=devnull) - shutil.copytree(parent_dir + '/awslambda-psycopg2/with_ssl_support/psycopg2', parent_dir+'/'+lambda_server_directory_name+"/psycopg2") - shutil.rmtree(parent_dir + '/awslambda-psycopg2/') + shutil.rmtree(os.path.join(parent_dir, 'awslambda-psycopg2'), ignore_errors=True) + call(("git clone https://github.com/jkehler/awslambda-psycopg2.git " + os.path.join(parent_dir, "awslambda-psycopg2")).split(" "), stdout=devnull, stderr=devnull) + shutil.copytree(os.path.join(parent_dir, 'awslambda-psycopg2', 'with_ssl_support', 'psycopg2'), os.path.join(parent_dir, lambda_server_directory_name, "psycopg2")) + shutil.rmtree(os.path.join(parent_dir, 'awslambda-psycopg2')) def create_zip_file(lambda_server_directory_name, lambda_server_zip_file_name, files_to_copy=None, verbose=False): setup_all_dependencies(lambda_server_directory_name) parent_dir = os.path.dirname(os.path.abspath(__file__)) - src = parent_dir + '/' + lambda_server_directory_name - dst = parent_dir + '/' + lambda_server_zip_file_name + src = os.path.join(parent_dir, lambda_server_directory_name) + dst = 
os.path.join(parent_dir, lambda_server_zip_file_name) if files_to_copy: for file_path in files_to_copy: - shutil.copy2(file_path, src) + try: + shutil.copy2(file_path, src) + except FileNotFoundError: + pass zf = zipfile.ZipFile("%s" % (dst), "w", zipfile.ZIP_DEFLATED) abs_src = os.path.abspath(src) @@ -643,19 +761,13 @@ def create_zip_file(lambda_server_directory_name, lambda_server_zip_file_name, f if verbose: print("Done!") -def setup_aws(task_description, num_hits, num_assignments, is_sandbox): - mturk_submit_url = 'https://workersandbox.mturk.com/mturk/externalSubmit' - if not is_sandbox: - mturk_submit_url = 'https://www.mturk.com/mturk/externalSubmit' - requester_key_gt = get_requester_key() +def setup_aws(task_files_to_copy): rds_host = setup_rds() - html_api_endpoint_url, json_api_endpoint_url = setup_relay_server_api(mturk_submit_url, rds_host, task_description, is_sandbox, num_hits, num_assignments, requester_key_gt) + html_api_endpoint_url, json_api_endpoint_url = setup_relay_server_api(rds_host=rds_host, task_files_to_copy=task_files_to_copy) - return html_api_endpoint_url, json_api_endpoint_url, requester_key_gt + return html_api_endpoint_url, json_api_endpoint_url def clean_aws(): - setup_aws_credentials() - # Remove RDS database try: rds = boto3.client('rds', region_name=region_name) @@ -781,4 +893,8 @@ def clean_aws(): if __name__ == "__main__": if sys.argv[1] == 'clean': + setup_aws_credentials() clean_aws() + elif sys.argv[1] == 'remove_rds': + setup_aws_credentials() + remove_rds_database() diff --git a/parlai/mturk/core/test/__init__.py b/parlai/mturk/core/test/__init__.py new file mode 100644 index 00000000000..8eff276d72d --- /dev/null +++ b/parlai/mturk/core/test/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. 
An additional grant +# of patent rights can be found in the PATENTS file in the same directory. \ No newline at end of file diff --git a/parlai/mturk/core/test/auto_complete_hit.py b/parlai/mturk/core/test/auto_complete_hit.py new file mode 100644 index 00000000000..2aa92643c83 --- /dev/null +++ b/parlai/mturk/core/test/auto_complete_hit.py @@ -0,0 +1,78 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +""" +Script for auto-completing HITs. Please change the test flow according to your task. +""" +try: + from selenium import webdriver + import chromedriver_installer +except ModuleNotFoundError: + raise SystemExit("Please make sure your computer has Chrome installed, and then install selenium and chromedriver by running: pip install selenium chromedriver_installer") +from selenium.webdriver.common.by import By +from selenium.webdriver.support import expected_conditions as EC +from selenium.webdriver.support.ui import WebDriverWait +from selenium.webdriver.common.keys import Keys +from selenium.common.exceptions import TimeoutException +import sys +import time +import random + +HIT_page_url = sys.argv[1] + +# create a new Chrome session +driver = webdriver.Chrome() +driver.implicitly_wait(30) +driver.maximize_window() + +# login to your MTurk sandbox account +print("Please log into your MTurk sandbox account within 10 minutes...") +driver.get("https://workersandbox.mturk.com/mturk/beginsignin") +while not "Sign Out" in driver.page_source: + time.sleep(1) +print("Successfully logged into your MTurk sandbox account.") + +# navigate to the HIT page +driver.get(HIT_page_url) + +total_hits_done = 0 + +while not "There are no HITs in this group available to you at the moment." 
in driver.page_source: + # Click "Accept" button + wait = WebDriverWait(driver, 30) + accept_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '''#cookieDependentFunctionality > input[type="image"]'''))) + time.sleep(random.uniform(2, 10)) + print("Clicking on Accept button...") + accept_button.send_keys("\n") + + # Wait for main page to show up + iframe = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, "body > form > iframe"))) + driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") + driver.switch_to.frame(iframe) + input_box = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#id_text_input"))) + + # Send message + time.sleep(random.uniform(2, 10)) + input_box.send_keys("text to send") + time.sleep(random.uniform(2, 10)) + print("Sending message...") + input_box.send_keys(Keys.RETURN) + + # Click "Done with this HIT" button + wait = WebDriverWait(driver, 30) + done_button = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#done-button"))) + time.sleep(random.uniform(2, 10)) + print("Clicking on Done button...") + done_button.click() + total_hits_done += 1 + print("Total HITs done: " + str(total_hits_done)) + print("\n") + + time.sleep(random.uniform(2, 10)) + print("Going to next HIT...") + driver.get(sys.argv[1]) + +print("All HITs are done!") +driver.quit() \ No newline at end of file diff --git a/parlai/mturk/core/test/test_concurrent_polling.py b/parlai/mturk/core/test/test_concurrent_polling.py new file mode 100644 index 00000000000..0cdc37b1f8d --- /dev/null +++ b/parlai/mturk/core/test/test_concurrent_polling.py @@ -0,0 +1,44 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. 
+""" +You should run this test in clusters where there is less limitation on the number of outbound requests per second. +""" +import requests +import json +import time +import sys +from joblib import Parallel, delayed + +num_concurrent_requests = int(sys.argv[1]) +wait_time_between_requests = 1 # in seconds + +task_group_id = '' +db_last_message_id = -1 +json_api_endpoint_url = '' + +global test_thread +def test_thread(thread_id): + print("Thread "+str(thread_id)+" is on.") + count = 0 + avg_elapsed = 0 + while True: + count += 1 + params = { + 'method_name': 'get_new_messages', + 'task_group_id': task_group_id, + 'last_message_id': db_last_message_id, + } + response = requests.get(json_api_endpoint_url, params=params, allow_redirects=False) + try: + ret = json.loads(response.json()) + avg_elapsed = (avg_elapsed * (count - 1) + response.elapsed.total_seconds()) / count + print("Thread "+str(thread_id)+": Count: "+str(count)+" Success: "+str(ret)+" Elapsed time: "+str(avg_elapsed)) + time.sleep(wait_time_between_requests) + except Exception as e: + print(response.content) + raise e + +results = Parallel(n_jobs=num_concurrent_requests, backend='threading')(delayed(test_thread)(thread_id) for thread_id in range(num_concurrent_requests)) diff --git a/parlai/mturk/core/worlds.py b/parlai/mturk/core/worlds.py new file mode 100644 index 00000000000..ce0ab69da20 --- /dev/null +++ b/parlai/mturk/core/worlds.py @@ -0,0 +1,49 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. 
+from parlai.core.worlds import World, validate + +class MTurkWorld(World): + """Generic world for MTurk.""" + def __init__(self, opt, mturk_agent): + self.mturk_agent = mturk_agent + self.episodeDone = False + + def parley(self): + self.episode_done = True + + def episode_done(self): + return self.episodeDone + + def report(self): + pass + + def shutdown(self): + self.mturk_agent.shutdown() + """ + Use the following code if there are multiple MTurk agents: + + global shutdown_agent + def shutdown_agent(mturk_agent): + mturk_agent.shutdown() + Parallel(n_jobs=len(self.mturk_agents), backend='threading')(delayed(shutdown_agent)(agent) for agent in self.mturk_agents) + """ + + def review_work(self): + """Programmatically approve/reject the turker's work. + For example: + .. code-block:: python + if self.turker_response == '0': + self.mturk_agent.reject_work('You rated our model's response as a 0/10 but we know we're better than that') + else: + if self.turker_response == '10': + self.mturk_agent.pay_bonus(1, 'Thanks for the great rating!') + self.mturk_agent.approve_work() + """ + # self.mturk_agent.approve_work() + # self.mturk_agent.reject_work() + # self.mturk_agent.pay_bonus(1000) # Pay $1000 as bonus + # self.mturk_agent.block_worker() # Block this worker from future HITs + pass diff --git a/parlai/mturk/tasks/model_evaluator/run.py b/parlai/mturk/tasks/model_evaluator/run.py index b04ebd3877d..b71b711d455 100644 --- a/parlai/mturk/tasks/model_evaluator/run.py +++ b/parlai/mturk/tasks/model_evaluator/run.py @@ -11,10 +11,7 @@ import os import copy from itertools import product -try: - from joblib import Parallel, delayed -except ModuleNotFoundError: - raise SystemExit("Please install joblib by running: pip install joblib") +from joblib import Parallel, delayed def main(): @@ -26,7 +23,7 @@ def main(): from parlai.agents.ir_baseline.ir_baseline import IrBaselineAgent IrBaselineAgent.add_cmdline_args(argparser) opt = argparser.parse_args() - opt['task'] = 
os.path.basename(os.getcwd()) + opt['task'] = os.path.basename(os.path.dirname(os.path.abspath(__file__))) opt.update(task_config) # The task that we will evaluate the dialog model on @@ -35,31 +32,30 @@ def main(): task_opt['datapath'] = opt['datapath'] task_opt['task'] = '#MovieDD-Reddit' - mturk_manager = MTurkManager() - mturk_manager.init_aws(opt=opt) - mturk_agent_id = 'Worker' - mturk_manager.mturk_agent_ids = [mturk_agent_id] - mturk_manager.all_agent_ids = [ModelEvaluatorWorld.evaluator_agent_id, mturk_agent_id] # In speaking order + mturk_manager = MTurkManager( + opt=opt, + mturk_agent_ids = [mturk_agent_id] + ) + mturk_manager.init_aws(opt=opt) + mturk_manager.start_new_run(opt=opt) global run_hit def run_hit(hit_index, assignment_index, opt, task_opt, mturk_manager): - conversation_id = str(hit_index) + '_' + str(assignment_index) - model_agent = IrBaselineAgent(opt=opt) # Create the MTurk agent which provides a chat interface to the Turker - mturk_agent = MTurkAgent(id=mturk_agent_id, manager=mturk_manager, conversation_id=conversation_id, opt=opt) + mturk_agent = MTurkAgent(id=mturk_agent_id, manager=mturk_manager, hit_index=hit_index, assignment_index=assignment_index, opt=opt) world = ModelEvaluatorWorld(opt=opt, model_agent=model_agent, task_opt=task_opt, mturk_agent=mturk_agent) while not world.episode_done(): world.parley() world.shutdown() + world.review_work() mturk_manager.create_hits(opt=opt) results = Parallel(n_jobs=opt['num_hits'] * opt['num_assignments'], backend='threading') \ (delayed(run_hit)(hit_index, assignment_index, opt, task_opt, mturk_manager) \ for hit_index, assignment_index in product(range(1, opt['num_hits']+1), range(1, opt['num_assignments']+1))) - mturk_manager.review_hits() mturk_manager.shutdown() if __name__ == '__main__': diff --git a/parlai/mturk/tasks/model_evaluator/worlds.py b/parlai/mturk/tasks/model_evaluator/worlds.py index 593c32b99e5..8068d21f789 100644 --- a/parlai/mturk/tasks/model_evaluator/worlds.py +++ 
b/parlai/mturk/tasks/model_evaluator/worlds.py @@ -3,9 +3,10 @@ # This source code is licensed under the BSD-style license found in the # LICENSE file in the root directory of this source tree. An additional grant # of patent rights can be found in the PATENTS file in the same directory. -from parlai.core.worlds import World, validate, create_task +from parlai.core.worlds import validate, create_task +from parlai.mturk.core.worlds import MTurkWorld -class ModelEvaluatorWorld(World): +class ModelEvaluatorWorld(MTurkWorld): """ World for letting Turkers evaluate a dialog model's performance given a context. Assumes the context is a context from a given task, e.g. from SQuAD, CBT, etc. @@ -42,9 +43,11 @@ def episode_done(self): return self.episodeDone def report(self): - # TODO: Add logging code here pass def shutdown(self): self.task_world.shutdown() self.mturk_agent.shutdown() + + def review_work(self): + pass diff --git a/parlai/mturk/tasks/multi_agent_dialog/run.py b/parlai/mturk/tasks/multi_agent_dialog/run.py index 50d87401e49..259706332ec 100644 --- a/parlai/mturk/tasks/multi_agent_dialog/run.py +++ b/parlai/mturk/tasks/multi_agent_dialog/run.py @@ -7,15 +7,12 @@ import time from parlai.core.params import ParlaiParser from parlai.mturk.core.agents import MTurkAgent, MTurkManager +from parlai.mturk.tasks.multi_agent_dialog.worlds import MTurkMultiAgentDialogWorld from parlai.agents.local_human.local_human import LocalHumanAgent -from parlai.core.worlds import MultiAgentDialogWorld from task_config import task_config import copy from itertools import product -try: - from joblib import Parallel, delayed -except ModuleNotFoundError: - raise SystemExit("Please install joblib by running: pip install joblib") +from joblib import Parallel, delayed """ This task consists of two local human agents and two MTurk agents, @@ -28,26 +25,25 @@ def main(): argparser.add_parlai_data_path() argparser.add_mturk_args() opt = argparser.parse_args() - opt['task'] = 
os.path.basename(os.getcwd()) + opt['task'] = os.path.basename(os.path.dirname(os.path.abspath(__file__))) opt.update(task_config) - mturk_manager = MTurkManager() - mturk_manager.init_aws(opt=opt) - mturk_agent_1_id = 'mturk_agent_1' mturk_agent_2_id = 'mturk_agent_2' human_agent_1_id = 'human_1' human_agent_2_id = 'human_2' - mturk_manager.mturk_agent_ids = [mturk_agent_1_id, mturk_agent_2_id] - mturk_manager.all_agent_ids = [human_agent_1_id, human_agent_2_id] + mturk_manager.mturk_agent_ids # In speaking order + mturk_manager = MTurkManager( + opt=opt, + mturk_agent_ids = [mturk_agent_1_id, mturk_agent_2_id] + ) + mturk_manager.init_aws(opt=opt) + mturk_manager.start_new_run(opt=opt) global run_hit def run_hit(hit_index, assignment_index, opt, mturk_manager): - conversation_id = str(hit_index) + '_' + str(assignment_index) - # Create mturk agents - mturk_agent_1 = MTurkAgent(id=mturk_agent_1_id, manager=mturk_manager, conversation_id=conversation_id, opt=opt) - mturk_agent_2 = MTurkAgent(id=mturk_agent_2_id, manager=mturk_manager, conversation_id=conversation_id, opt=opt) + mturk_agent_1 = MTurkAgent(id=mturk_agent_1_id, manager=mturk_manager, hit_index=hit_index, assignment_index=assignment_index, opt=opt) + mturk_agent_2 = MTurkAgent(id=mturk_agent_2_id, manager=mturk_manager, hit_index=hit_index, assignment_index=assignment_index, opt=opt) # Create the local human agents human_agent_1 = LocalHumanAgent(opt=None) @@ -55,7 +51,7 @@ def run_hit(hit_index, assignment_index, opt, mturk_manager): human_agent_2 = LocalHumanAgent(opt=None) human_agent_2.id = human_agent_2_id - world = MultiAgentDialogWorld(opt=opt, agents=[human_agent_1, human_agent_2, mturk_agent_1, mturk_agent_2]) + world = MTurkMultiAgentDialogWorld(opt=opt, agents=[human_agent_1, human_agent_2, mturk_agent_1, mturk_agent_2]) while not world.episode_done(): world.parley() @@ -65,7 +61,6 @@ def run_hit(hit_index, assignment_index, opt, mturk_manager): results = Parallel(n_jobs=opt['num_hits'] * 
opt['num_assignments'], backend='threading') \ (delayed(run_hit)(hit_index, assignment_index, opt, mturk_manager) \ for hit_index, assignment_index in product(range(1, opt['num_hits']+1), range(1, opt['num_assignments']+1))) - mturk_manager.review_hits() mturk_manager.shutdown() if __name__ == '__main__': diff --git a/parlai/mturk/tasks/multi_agent_dialog/worlds.py b/parlai/mturk/tasks/multi_agent_dialog/worlds.py new file mode 100644 index 00000000000..3c569ed2dc0 --- /dev/null +++ b/parlai/mturk/tasks/multi_agent_dialog/worlds.py @@ -0,0 +1,20 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +from parlai.core.worlds import MultiAgentDialogWorld +from joblib import Parallel, delayed + +class MTurkMultiAgentDialogWorld(MultiAgentDialogWorld): + """Basic world where each agent gets a turn in a round-robin fashion, + receiving as input the actions of all other agents since that agent last + acted. 
+ """ + def shutdown(self): + """Shutdown all mturk agents in parallel, otherwise if one mturk agent + is disconnected then it could prevent other mturk agents from completing.""" + global shutdown_agent + def shutdown_agent(mturk_agent): + mturk_agent.shutdown() + Parallel(n_jobs=len(self.agents), backend='threading')(delayed(shutdown_agent)(agent) for agent in self.agents) diff --git a/parlai/mturk/tasks/qa_data_collection/run.py b/parlai/mturk/tasks/qa_data_collection/run.py index cccef0f86ad..2362af434cb 100644 --- a/parlai/mturk/tasks/qa_data_collection/run.py +++ b/parlai/mturk/tasks/qa_data_collection/run.py @@ -12,17 +12,15 @@ import importlib import copy from itertools import product -try: - from joblib import Parallel, delayed -except ModuleNotFoundError: - raise SystemExit("Please install joblib by running: pip install joblib") +from joblib import Parallel, delayed + def main(): argparser = ParlaiParser(False, False) argparser.add_parlai_data_path() argparser.add_mturk_args() opt = argparser.parse_args() - opt['task'] = os.path.basename(os.getcwd()) + opt['task'] = os.path.basename(os.path.dirname(os.path.abspath(__file__))) opt.update(task_config) # Initialize a SQuAD teacher agent, which we will get context from @@ -34,30 +32,29 @@ def main(): task_opt['datatype'] = 'train' task_opt['datapath'] = opt['datapath'] - mturk_manager = MTurkManager() - mturk_manager.init_aws(opt=opt) - mturk_agent_id = 'Worker' - mturk_manager.mturk_agent_ids = [mturk_agent_id] - mturk_manager.all_agent_ids = [QADataCollectionWorld.collector_agent_id, mturk_agent_id] # In speaking order + mturk_manager = MTurkManager( + opt=opt, + mturk_agent_ids = [mturk_agent_id] + ) + mturk_manager.init_aws(opt=opt) + mturk_manager.start_new_run(opt=opt) global run_hit def run_hit(hit_index, assignment_index, task_class, task_opt, opt, mturk_manager): - conversation_id = str(hit_index) + '_' + str(assignment_index) - task = task_class(task_opt) # Create the MTurk agent which provides a 
chat interface to the Turker - mturk_agent = MTurkAgent(id=mturk_agent_id, manager=mturk_manager, conversation_id=conversation_id, opt=opt) + mturk_agent = MTurkAgent(id=mturk_agent_id, manager=mturk_manager, hit_index=hit_index, assignment_index=assignment_index, opt=opt) world = QADataCollectionWorld(opt=opt, task=task, mturk_agent=mturk_agent) while not world.episode_done(): world.parley() world.shutdown() + world.review_work() mturk_manager.create_hits(opt=opt) results = Parallel(n_jobs=opt['num_hits'] * opt['num_assignments'], backend='threading') \ (delayed(run_hit)(hit_index, assignment_index, task_class, task_opt, opt, mturk_manager) \ for hit_index, assignment_index in product(range(1, opt['num_hits']+1), range(1, opt['num_assignments']+1))) - mturk_manager.review_hits() mturk_manager.shutdown() if __name__ == '__main__': diff --git a/parlai/mturk/tasks/qa_data_collection/worlds.py b/parlai/mturk/tasks/qa_data_collection/worlds.py index a3a9522206a..268b6fdfc3e 100644 --- a/parlai/mturk/tasks/qa_data_collection/worlds.py +++ b/parlai/mturk/tasks/qa_data_collection/worlds.py @@ -3,10 +3,10 @@ # This source code is licensed under the BSD-style license found in the # LICENSE file in the root directory of this source tree. An additional grant # of patent rights can be found in the PATENTS file in the same directory. -from parlai.core.worlds import World, validate +from parlai.core.worlds import validate +from parlai.mturk.core.worlds import MTurkWorld - -class QADataCollectionWorld(World): +class QADataCollectionWorld(MTurkWorld): """ World for recording a turker's question and answer given a context. Assumes the context is a random context from a given task, e.g. 
@@ -59,9 +59,11 @@ def episode_done(self): return self.episodeDone def report(self): - # TODO: Add logging code here pass def shutdown(self): self.task.shutdown() self.mturk_agent.shutdown() + + def review_work(self): + pass diff --git a/parlai/tasks/babi/agents.py b/parlai/tasks/babi/agents.py index 8b74cc3d1db..77d8409df20 100644 --- a/parlai/tasks/babi/agents.py +++ b/parlai/tasks/babi/agents.py @@ -11,6 +11,7 @@ import copy import os + def _path(exsz, task, opt, dt=''): # Build the data if it doesn't exist. build(opt) @@ -21,23 +22,58 @@ def _path(exsz, task, opt, dt=''): 'qa{task}_{type}.txt'.format(task=task, type=dt)) +def mod_labels(ys, task): + if ys is not None: + # replace comma-labeled babi tasks with spaces + # this is more friendly to our tokenizer which makes commas full tokens + # this way models won't be penalized for not generating a comma + if task == '8': + # holding: labels like 'milk,cookies,football' + # replace with spaces 'milk football cookies' + ys = [y.replace(',', ' ') for y in ys] + elif task == '19': + # pathfinding: labels like 'n,e' or 's,w' + # replace with spaces, 'n e' + ys = [y.replace(',', ' ') for y in ys] + + return ys + + # Single bAbI task (1k training). class Task1kTeacher(FbDialogTeacher): def __init__(self, opt, shared=None): task = opt.get('task', 'babi:Task1k:1') - opt['datafile'] = _path('', task.split(':')[2], opt) + self.task_num = task.split(':')[2] + opt['datafile'] = _path('', self.task_num, opt) opt['cands_datafile'] = _path('', task.split(':')[2], opt, 'train') super().__init__(opt, shared) + def setup_data(self, path): + for entry, new in super().setup_data(path): + entry[1] = mod_labels(entry[1], self.task_num) + yield entry, new + + def load_cands(self, path): + return mod_labels(super().load_cands(path), self.task_num) + # Single bAbI task (10k training). 
class Task10kTeacher(FbDialogTeacher): def __init__(self, opt, shared=None): task = opt.get('task', 'babi:Task10k:1') - opt['datafile'] = _path('-10k', task.split(':')[2], opt) - opt['cands_datafile'] = _path('', task.split(':')[2], opt, 'train') + self.task_num = task.split(':')[2] + opt['datafile'] = _path('-10k', self.task_num, opt) + opt['cands_datafile'] = _path('-10k', task.split(':')[2], opt, 'train') super().__init__(opt, shared) + def setup_data(self, path): + for entry, new in super().setup_data(path): + entry[1] = mod_labels(entry[1], self.task_num) + yield entry, new + + def load_cands(self, path): + return mod_labels(super().load_cands(path), self.task_num) + # By default train on all tasks at once. class All1kTeacher(MultiTaskTeacher): diff --git a/parlai/tasks/babi/build.py b/parlai/tasks/babi/build.py index 5f729c49ec5..7fa5bf9ec0c 100644 --- a/parlai/tasks/babi/build.py +++ b/parlai/tasks/babi/build.py @@ -12,10 +12,13 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'bAbI') + version = 'None' - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -25,4 +28,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. 
- build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/booktest/build.py b/parlai/tasks/booktest/build.py index fef3f3febfd..6b5d942e0dc 100644 --- a/parlai/tasks/booktest/build.py +++ b/parlai/tasks/booktest/build.py @@ -8,12 +8,16 @@ import parlai.core.build_data as build_data import os + def build(opt): dpath = os.path.join(opt['datapath'], 'BookTest') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -23,4 +27,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/cbt/build.py b/parlai/tasks/cbt/build.py index cda6efcf1e6..8a80a54bd44 100644 --- a/parlai/tasks/cbt/build.py +++ b/parlai/tasks/cbt/build.py @@ -10,10 +10,13 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'CBT') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -23,4 +26,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/clevr/__init__.py b/parlai/tasks/clevr/__init__.py new file mode 100644 index 00000000000..8eff276d72d --- /dev/null +++ b/parlai/tasks/clevr/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. 
+# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. \ No newline at end of file diff --git a/parlai/tasks/clevr/agents.py b/parlai/tasks/clevr/agents.py new file mode 100644 index 00000000000..7c3a26bc1a5 --- /dev/null +++ b/parlai/tasks/clevr/agents.py @@ -0,0 +1,71 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. + +from parlai.core.dialog_teacher import DialogTeacher +from .build import build + +import json +import os + + +def _path(opt): + build(opt) + dt = opt['datatype'].split(':')[0] + + if dt == 'valid': + dt = 'val' + elif dt != 'train' and dt != 'test': + raise RuntimeError('Not valid datatype.') + + prefix = os.path.join(opt['datapath'], 'CLEVR', 'CLEVR_v1.0') + questions_path = os.path.join(prefix, 'questions', + 'CLEVR_' + dt + '_questions.json') + images_path = os.path.join(prefix, 'images', dt) + + return questions_path, images_path + + +counts = [str(i) for i in range(11)] +materials = ['metal', 'rubber'] +sizes = ['small', 'large'] +shapes = ['cube', 'sphere', 'cylinder'] +colors = ['gray', 'blue', 'brown', 'yellow', 'red', 'green', 'purple', 'cyan'] + + +class DefaultTeacher(DialogTeacher): + # all possile answers for the questions + cands = ['yes', 'no'] + counts + materials + sizes + shapes + colors + + def __init__(self, opt, shared=None): + self.datatype = opt['datatype'] + data_path, self.images_path = _path(opt) + opt['datafile'] = data_path + self.id = 'clevr' + + super().__init__(opt, shared) + + def label_candidates(self): + return self.cands + + def setup_data(self, path): + print('loading: ' + path) + with open(path) 
as data_file: + clevr = json.load(data_file) + + image_file = None + for ques in clevr['questions']: + # episode done if first question or image changed + new_episode = ques['image_filename'] != image_file + + # only show image at beginning of episode + image_file = ques['image_filename'] + img_path = None + if new_episode: + img_path = os.path.join(self.images_path, image_file) + + question = ques['question'] + answer = [ques['answer']] if ques['split'] != 'test' else None + yield (question, answer, None, None, img_path), new_episode diff --git a/parlai/tasks/clevr/build.py b/parlai/tasks/clevr/build.py new file mode 100644 index 00000000000..1a80176f7e6 --- /dev/null +++ b/parlai/tasks/clevr/build.py @@ -0,0 +1,33 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +# Download and build the data if it does not exist. + +import parlai.core.build_data as build_data +import os + +from parlai.tasks.vqa_v1.build import buildImage + + +def build(opt): + dpath = os.path.join(opt['datapath'], 'CLEVR') + version = 'v1.0' + + if not build_data.built(dpath, version_string=version): + print('[building data: ' + dpath + ']') + # An older version exists, so remove these outdated files. + if build_data.built(dpath): + build_data.remove_dir(dpath) + build_data.make_dir(dpath) + + # Download the data. + fname = 'CLEVR_v1.0.zip' + url = 'https://s3-us-west-1.amazonaws.com/clevr/' + + build_data.download(url + fname, dpath, fname) + build_data.untar(dpath, fname) + + # Mark the data as built. 
+ build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/cornell_movie/build.py b/parlai/tasks/cornell_movie/build.py index 83b432c55d6..5b2cf387c40 100644 --- a/parlai/tasks/cornell_movie/build.py +++ b/parlai/tasks/cornell_movie/build.py @@ -50,10 +50,13 @@ def create_fb_format(lines_file, convo_file, outpath): def build(opt): dpath = os.path.join(opt['datapath'], 'CornellMovie') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -68,4 +71,4 @@ def build(opt): dpath) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/dbll_babi/build.py b/parlai/tasks/dbll_babi/build.py index a8404f4dba5..cd91ff20df7 100644 --- a/parlai/tasks/dbll_babi/build.py +++ b/parlai/tasks/dbll_babi/build.py @@ -10,10 +10,13 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'DBLL') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -23,4 +26,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. 
- build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/dbll_movie/build.py b/parlai/tasks/dbll_movie/build.py index acebde1cf68..ac30469ece0 100644 --- a/parlai/tasks/dbll_movie/build.py +++ b/parlai/tasks/dbll_movie/build.py @@ -16,9 +16,13 @@ def build(opt): wikimovies_build.build(opt) dpath = os.path.join(opt['datapath'], 'DBLL') - if not build_data.built(dpath): + version = None + + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -28,4 +32,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/dialog_babi/agents.py b/parlai/tasks/dialog_babi/agents.py index b83d52c7b56..61ea1ad59e1 100644 --- a/parlai/tasks/dialog_babi/agents.py +++ b/parlai/tasks/dialog_babi/agents.py @@ -22,6 +22,7 @@ def _path(task, opt): # Build the data if it doesn't exist. build(opt) + prefix = os.path.join(opt['datapath'], 'dialog-bAbI', 'dialog-bAbI-tasks') suffix = '' dt = opt['datatype'].split(':')[0] if dt == 'train': @@ -30,8 +31,16 @@ def _path(task, opt): suffix = 'tst' elif dt == 'valid': suffix = 'dev' - return os.path.join(opt['datapath'], 'dialog-bAbI', 'dialog-bAbI-tasks', - '{tsk}-{type}.txt'.format(tsk=tasks[int(task)], type=suffix)) + datafile = os.path.join(prefix, + '{tsk}-{type}.txt'.format(tsk=tasks[int(task)], type=suffix)) + + if opt['task'].split(':')[2] != '6': + cands_datafile = os.path.join(prefix, 'dialog-babi-candidates.txt') + else: + cands_datafile = os.path.join(prefix, + 'dialog-babi-task6-dstc2-candidates.txt') + + return datafile, cands_datafile # The knowledge base of facts that can be used to answer questions. 
@@ -47,10 +56,8 @@ def __init__(self, opt, shared=None): # Single task. class TaskTeacher(FbDialogTeacher): def __init__(self, opt, shared=None): - opt['datafile'] = _path(opt['task'].split(':')[2], opt) - opt['cands_datafile'] = os.path.join(opt['datapath'], 'dialog-bAbI', - 'dialog-bAbI-tasks', - 'dialog-babi-candidates.txt') + paths = _path(opt['task'].split(':')[2], opt) + opt['datafile'], opt['cands_datafile'] = paths super().__init__(opt, shared) @@ -60,7 +67,4 @@ def __init__(self, opt, shared=None): opt = copy.deepcopy(opt) opt['task'] = ','.join('dialog_babi:Task:%d' % (i + 1) for i in range(6)) - opt['cands_datafile'] = os.path.join(opt['datapath'], 'dialog-bAbI', - 'dialog-bAbI-tasks', - 'dialog-babi-candidates.txt') super().__init__(opt, shared) diff --git a/parlai/tasks/dialog_babi/build.py b/parlai/tasks/dialog_babi/build.py index f412adde83c..955f6e4efdf 100644 --- a/parlai/tasks/dialog_babi/build.py +++ b/parlai/tasks/dialog_babi/build.py @@ -11,10 +11,13 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'dialog-bAbI') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -24,4 +27,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/fromfile/__init__.py b/parlai/tasks/fromfile/__init__.py new file mode 100644 index 00000000000..8eff276d72d --- /dev/null +++ b/parlai/tasks/fromfile/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. 
An additional grant +# of patent rights can be found in the PATENTS file in the same directory. \ No newline at end of file diff --git a/parlai/tasks/fromfile/agents.py b/parlai/tasks/fromfile/agents.py new file mode 100644 index 00000000000..e2c0bc8ffdc --- /dev/null +++ b/parlai/tasks/fromfile/agents.py @@ -0,0 +1,31 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +# +# This task simply loads the specified file: useful for quick tests without +# setting up a new task. + +from parlai.core.fbdialog_teacher import FbDialogTeacher + +import copy +import os + +class DefaultTeacher(FbDialogTeacher): + """This task simply loads the specified file: useful for quick tests without + setting up a new task. + """ + + @staticmethod + def add_cmdline_args(argparser): + agent = argparser.add_argument_group('FromFile Task Arguments') + agent.add_argument('--fromfile_datapath', type=str, + help="Data file in FbDialogFormat") + + def __init__(self, opt, shared=None): + opt = copy.deepcopy(opt) + if not opt.get('fromfile_datapath'): + raise RuntimeError('fromfile_datapath not specified') + opt['datafile'] = opt['fromfile_datapath'] + super().__init__(opt, shared) diff --git a/parlai/tasks/insuranceqa/__init__.py b/parlai/tasks/insuranceqa/__init__.py new file mode 100644 index 00000000000..8eff276d72d --- /dev/null +++ b/parlai/tasks/insuranceqa/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. 
\ No newline at end of file diff --git a/parlai/tasks/insuranceqa/agents.py b/parlai/tasks/insuranceqa/agents.py new file mode 100644 index 00000000000..03fb8b56598 --- /dev/null +++ b/parlai/tasks/insuranceqa/agents.py @@ -0,0 +1,46 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +import copy +import os + +from parlai.core.fbdialog_teacher import FbDialogTeacher +from .build import build + + +def _path(version, opt, exsz=''): + # Build the data if it doesn't exist. + build(opt) + dt = opt['datatype'].split(':')[0] + if exsz: + fname = '%s.%s.txt' % (dt, exsz) + else: + fname = '%s.txt' % dt + return os.path.join(opt['datapath'], 'InsuranceQA', version, fname) + + +# V1 InsuranceQA task +class V1Teacher(FbDialogTeacher): + def __init__(self, opt, shared=None): + opt = copy.deepcopy(opt) + opt['datafile'] = _path('V1', opt) + super().__init__(opt, shared) + + +# V2 InsuranceQA task +class V2Teacher(FbDialogTeacher): + def __init__(self, opt, shared=None): + opt = copy.deepcopy(opt) + task = opt.get('task', None) + if not task: + # options are 100, 500, 1000, or 1500 + task = 'insuranceqa:V2:100' + split = task.split(':') + opt['datafile'] = _path('V2', opt, split[2]) + super().__init__(opt, shared) + + +class DefaultTeacher(V1Teacher): + pass diff --git a/parlai/tasks/insuranceqa/build.py b/parlai/tasks/insuranceqa/build.py new file mode 100644 index 00000000000..121b8b0c95a --- /dev/null +++ b/parlai/tasks/insuranceqa/build.py @@ -0,0 +1,201 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. 
An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +# Download and build the data if it does not exist. + +import gzip +import os + +import parlai.core.build_data as build_data + + +class ParseInsuranceQA(object): + version = None + label2answer_fname = None + + @classmethod + def read_gz(cls, filename): + f = gzip.open(filename, 'rb') + return [x.decode('utf-8') for x in f.readlines()] + + @classmethod + def readlines(cls, path): + if path.endswith(".gz"): + lines = cls.read_gz(path) + else: + lines = open(path).readlines() + return lines + + @classmethod + def wids2sent(cls, wids, d_vocab): + return " ".join([d_vocab[w] for w in wids]) + + @classmethod + def read_vocab(cls, vocab_path): + d_vocab = {} + with open(vocab_path, "r") as f: + for line in f: + fields = line.rstrip('\n').split("\t") + if len(fields) != 2: + raise ValueError("vocab file (%s) corrupted. Line (%s)" % (repr(line), vocab_path)) + else: + wid, word = fields + d_vocab[wid] = word + return d_vocab + + @classmethod + def read_label2answer(cls, label2answer_path_gz, d_vocab): + lines = cls.readlines(label2answer_path_gz) + + d_label_answer = {} + for line in lines: + fields = line.rstrip("\n").split("\t") + if len(fields) != 2: + raise ValueError("label2answer file (%s) corrupted. 
Line (%s)" % (repr(line), label2answer_path_gz)) + else: + aid, s_wids = fields + sent = cls.wids2sent(s_wids.split(), d_vocab) + d_label_answer[aid] = sent + return d_label_answer + + @classmethod + def create_fb_format(cls, out_path, dtype, inpath, d_vocab, d_label_answer): + pass + + @classmethod + def write_data_files(cls, dpext, out_path, d_vocab, d_label_answer): + pass + + @classmethod + def build(cls, dpath): + print("building version: %s" % cls.version) + + # the root of dataset + dpext = os.path.join(dpath, 'insuranceQA-master/%s' % cls.version) + + # read vocab file + vocab_path = os.path.join(dpext, "vocabulary") + d_vocab = cls.read_vocab(vocab_path) + + # read label2answer file + label2answer_path_gz = os.path.join(dpext, cls.label2answer_fname) + d_label_answer = cls.read_label2answer(label2answer_path_gz, d_vocab) + + # Create out path + out_path = os.path.join(dpath, cls.version) + build_data.make_dir(out_path) + + # Parse and write data files + cls.write_data_files(dpext, out_path, d_vocab, d_label_answer) + + +class ParseInsuranceQAV1(ParseInsuranceQA): + version = "V1" + label2answer_fname = "answers.label.token_idx" + + @classmethod + def write_data_files(cls, dpext, out_path, d_vocab, d_label_answer): + data_fnames = [ + ("train", "question.train.token_idx.label"), + ("valid", "question.dev.label.token_idx.pool"), + ("test", "question.test1.label.token_idx.pool"), + # ("test2", "question.test2.label.token_idx.pool") + ] + for dtype, data_fname in data_fnames: + data_path = os.path.join(dpext, data_fname) + cls.create_fb_format(out_path, dtype, data_path, d_vocab, d_label_answer) + + @classmethod + def create_fb_format(cls, out_path, dtype, inpath, d_vocab, d_label_answer): + print('building fbformat:' + dtype) + fout = open(os.path.join(out_path, dtype + '.txt'), 'w') + lines = open(inpath).readlines() + + for line in lines: + fields = line.rstrip("\n").split("\t") + if dtype == "train": + assert len(fields) == 2, "data file (%s) corrupted." 
% inpath + s_q_wids, s_good_aids = fields + + q = cls.wids2sent(s_q_wids.split(), d_vocab) + good_ans = [d_label_answer[aid_] for aid_ in s_good_aids.split()] + # save good answers (train only) + s = '1 ' + q + '\t' + "|".join(good_ans) + fout.write(s + '\n') + else: + assert len(fields) == 3, "data file (%s) corrupted." % inpath + s_good_aids, s_q_wids, s_bad_aids = fields + + q = cls.wids2sent(s_q_wids.split(), d_vocab) + good_ans = [d_label_answer[aid_] for aid_ in s_good_aids.split()] + bad_ans = [d_label_answer[aid_] for aid_ in s_bad_aids.split()] + # save good answers and candidates + s = '1 ' + q + '\t' + "|".join(good_ans) + '\t\t' + "|".join(good_ans + bad_ans) + fout.write(s + '\n') + fout.close() + + +class ParseInsuranceQAV2(ParseInsuranceQA): + version = "V2" + label2answer_fname = "InsuranceQA.label2answer.token.encoded.gz" + + @classmethod + def write_data_files(cls, dpext, out_path, d_vocab, d_label_answer): + data_fnames_tmpl = [ + ("train.%s", "InsuranceQA.question.anslabel.token.%s.pool.solr.train.encoded.gz"), + ("valid.%s", "InsuranceQA.question.anslabel.token.%s.pool.solr.valid.encoded.gz"), + ("test.%s", "InsuranceQA.question.anslabel.token.%s.pool.solr.test.encoded.gz") + ] + for n_cands in [100, 500, 1000, 1500]: + for dtype_tmp, data_fname_tmp in data_fnames_tmpl: + dtype = dtype_tmp % n_cands + data_fname = data_fname_tmp % n_cands + data_path = os.path.join(dpext, data_fname) + cls.create_fb_format(out_path, dtype, data_path, d_vocab, d_label_answer) + + @classmethod + def create_fb_format(cls, out_path, dtype, inpath, d_vocab, d_label_answer): + print('building fbformat:' + dtype) + fout = open(os.path.join(out_path, dtype + '.txt'), 'w') + lines = cls.readlines(inpath) + + for line in lines: + fields = line.rstrip("\n").split("\t") + if len(fields) != 4: + raise ValueError("data file (%s) corrupted. 
Line (%s)" % (repr(line), inpath)) + else: + _, s_q_wids, s_good_aids, s_bad_aids = fields + q = cls.wids2sent(s_q_wids.split(), d_vocab) + good_ans = [d_label_answer[aid_] for aid_ in s_good_aids.split()] + bad_ans = [d_label_answer[aid_] for aid_ in s_bad_aids.split()] + # save + s = '1 ' + q + '\t' + "|".join(good_ans) + '\t\t' + "|".join(good_ans + bad_ans) + fout.write(s + '\n') + fout.close() + + +def build(opt): + dpath = os.path.join(opt['datapath'], 'InsuranceQA') + version = '1' + + if not build_data.built(dpath, version_string=version): + print('[building data: ' + dpath + ']') + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) + build_data.make_dir(dpath) + + # Download the data from github. + fname = 'insuranceqa.zip' + url = 'https://github.com/shuzi/insuranceQA/archive/master.zip' + print('[downloading data from: ' + url + ']') + build_data.download(url, dpath, fname) + build_data.untar(dpath, fname) + + ParseInsuranceQAV1.build(dpath) + ParseInsuranceQAV2.build(dpath) + + # Mark the data as built. + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/mctest/build.py b/parlai/tasks/mctest/build.py index b3546c9c33c..d8239a62c1f 100644 --- a/parlai/tasks/mctest/build.py +++ b/parlai/tasks/mctest/build.py @@ -42,10 +42,13 @@ def create_fb_format(outpath, dtype, inpath, inpath2): def build(opt): dpath = os.path.join(opt['datapath'], 'MCTest') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -71,4 +74,4 @@ def build(opt): os.path.join(dpext, 'MCTestAnswers', 'mc500.test.ans')) # Mark the data as built. 
- build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/mnist_qa/build.py b/parlai/tasks/mnist_qa/build.py index 2811a9c5de0..f5c0f14d023 100644 --- a/parlai/tasks/mnist_qa/build.py +++ b/parlai/tasks/mnist_qa/build.py @@ -12,10 +12,13 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'mnist') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -25,4 +28,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/moviedialog/build.py b/parlai/tasks/moviedialog/build.py index 7810a38a682..cac2e482362 100644 --- a/parlai/tasks/moviedialog/build.py +++ b/parlai/tasks/moviedialog/build.py @@ -11,22 +11,27 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'MovieDialog') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + dpath2 = os.path.join(dpath, 'movie_dialog_dataset', 'task4_reddit') + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) + build_data.make_dir(dpath2) # Download the data. 
fname = 'moviedialog.tar.gz' url = 'https://s3.amazonaws.com/fair-data/parlai/moviedialog/' + fname build_data.download(url, dpath, fname) - build_data.untar(dpath, fname) - dpath2 = os.path.join(dpath, 'movie_dialog_dataset', 'task4_reddit') url2 = 'http://tinyurl.com/' + 'p6tyohj' build_data.download(url2, dpath2, 'p6tyohj.tgz') + + build_data.untar(dpath, fname) build_data.untar(dpath2, 'p6tyohj.tgz') # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/ms_marco/__init__.py b/parlai/tasks/ms_marco/__init__.py new file mode 100644 index 00000000000..8eff276d72d --- /dev/null +++ b/parlai/tasks/ms_marco/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. \ No newline at end of file diff --git a/parlai/tasks/ms_marco/agents.py b/parlai/tasks/ms_marco/agents.py new file mode 100644 index 00000000000..e3e4dd8f965 --- /dev/null +++ b/parlai/tasks/ms_marco/agents.py @@ -0,0 +1,55 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +import copy +import json +import os + +from parlai.core.dialog_teacher import DialogTeacher +from parlai.core.fbdialog_teacher import FbDialogTeacher +from .build import build + + +def _path(opt, is_passage=False): + # Build the data if it doesn't exist. 
+ build(opt) + dt = opt['datatype'].split(':')[0] + + if is_passage: # for passage selection task + fname = "%s.passage.txt" % dt + else: + fname = "%s.txt" % dt + + return os.path.join(opt['datapath'], 'MS_MARCO', fname) + + +class PassageTeacher(FbDialogTeacher): + def __init__(self, opt, shared=None): + opt = copy.deepcopy(opt) + opt['datafile'] = _path(opt, is_passage=True) + super().__init__(opt, shared) + + +class DefaultTeacher(DialogTeacher): + def __init__(self, opt, shared=None): + opt = copy.deepcopy(opt) + self.datatype = opt['datatype'] + opt['datafile'] = _path(opt, is_passage=False) + super().__init__(opt, shared) + + def setup_data(self, path): + print('loading: ' + path) + with open(path) as data_file: + for jline in data_file: + d_example = json.loads(jline) + context = [d['passage_text'] for d in d_example['passages']] + question = d_example['query'] + if self.datatype != 'test': + answers = d_example['answers'] + if not answers: + answers = ['NULL'] # empty list of answers will cause exception + else: + answers = ['NULL'] + yield ('\n'.join(context) + '\n' + question, answers), True diff --git a/parlai/tasks/ms_marco/build.py b/parlai/tasks/ms_marco/build.py new file mode 100644 index 00000000000..144b383d806 --- /dev/null +++ b/parlai/tasks/ms_marco/build.py @@ -0,0 +1,79 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +# Download and build the data if it does not exist. 
+import gzip +import json +import os + +import parlai.core.build_data as build_data + + +def read_gz(filename, delete_gz=True): + f = gzip.open(filename, 'rb') + lines = [x.decode('utf-8') for x in f.readlines()] + if delete_gz: + os.remove(filename) + return lines + + +def create_fb_format(outpath, dtype, inpath): + print('building fbformat:' + dtype) + + lines = read_gz(inpath) + + # save the raw json version for span selection task (default) + fout1 = open(os.path.join(outpath, dtype + '.txt'), 'w') + for line in lines: + fout1.write(line.rstrip("\n") + "\n") + fout1.close() + + # save the file for passage selection task + fout2 = open(os.path.join(outpath, dtype + '.passage.txt'), 'w') + for line in lines: + dic = json.loads(line) + lq = dic["query"] + if dtype != "test": + ans = "|".join([d["passage_text"] for d in dic["passages"] if d["is_selected"] == 1]) + cands = "|".join([d["passage_text"] for d in dic["passages"] if d["is_selected"] == 0]) + cands = ans + "|" + cands + if ans == "": continue # if no true label, skip for now + else: # ground truth for test data is not available yet + ans = "" + cands = "|".join([d["passage_text"] for d in dic["passages"]]) + s = '1 ' + lq + '\t' + ans.lstrip("|") + '\t\t' + cands + fout2.write(s + '\n') + fout2.close() + + +def build(opt): + dpath = os.path.join(opt['datapath'], 'MS_MARCO') + version = None + + if not build_data.built(dpath, version_string=version): + print('[building data: ' + dpath + ']') + if build_data.built(dpath): + # An older version exists, so remove these outdated files. 
+ build_data.remove_dir(dpath) + build_data.make_dir(dpath) + + # Download the data + url = "https://msmarco.blob.core.windows.net/msmarco/" + + fname = "train_v1.1.json.gz" + build_data.download(url + fname, dpath, 'train.gz') + + fname = "dev_v1.1.json.gz" + build_data.download(url + fname, dpath, 'valid.gz') + + fname = "test_public_v1.1.json.gz" + build_data.download(url + fname, dpath, 'test.gz') + + create_fb_format(dpath, "train", os.path.join(dpath, 'train.gz')) + create_fb_format(dpath, "valid", os.path.join(dpath, 'valid.gz')) + create_fb_format(dpath, "test", os.path.join(dpath, 'test.gz')) + + # Mark the data as built. + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/mturkwikimovies/build.py b/parlai/tasks/mturkwikimovies/build.py index 9951d90df75..116395d7180 100644 --- a/parlai/tasks/mturkwikimovies/build.py +++ b/parlai/tasks/mturkwikimovies/build.py @@ -15,9 +15,13 @@ def build(opt): wikimovies_build.build(opt) dpath = os.path.join(opt['datapath'], 'MTurkWikiMovies') - if not build_data.built(dpath): + version = None + + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -28,4 +32,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. 
- build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/opensubtitles/build.py b/parlai/tasks/opensubtitles/build.py index 6f450e67eac..dff4c13d559 100644 --- a/parlai/tasks/opensubtitles/build.py +++ b/parlai/tasks/opensubtitles/build.py @@ -60,10 +60,13 @@ def create_fb_format(inpath, outpath): def build(opt): dpath = os.path.join(opt['datapath'], 'OpenSubtitles') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -74,4 +77,4 @@ def build(opt): create_fb_format(os.path.join(dpath, 'OpenSubtitles', 'en'), dpath) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/personalized_dialog/__init__.py b/parlai/tasks/personalized_dialog/__init__.py new file mode 100644 index 00000000000..8eff276d72d --- /dev/null +++ b/parlai/tasks/personalized_dialog/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. \ No newline at end of file diff --git a/parlai/tasks/personalized_dialog/agents.py b/parlai/tasks/personalized_dialog/agents.py new file mode 100644 index 00000000000..e06b6bb9d5f --- /dev/null +++ b/parlai/tasks/personalized_dialog/agents.py @@ -0,0 +1,92 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. 
An additional grant +# of patent rights can be found in the PATENTS file in the same directory. + +from parlai.core.fbdialog_teacher import FbDialogTeacher +from parlai.core.agents import MultiTaskTeacher +from .build import build + +import copy +import os + +tasks = {} +tasks[1] = 'personalized-dialog-task1-API-calls' +tasks[2] = 'personalized-dialog-task2-API-refine' +tasks[3] = 'personalized-dialog-task3-options' +tasks[4] = 'personalized-dialog-task4-info' +tasks[5] = 'personalized-dialog-task5-full-dialogs' + +def _path(exsz, task, opt): + # Build the data if it doesn't exist. + build(opt) + suffix = '' + dt = opt['datatype'].split(':')[0] + if dt == 'train': + suffix = 'trn' + elif dt == 'test': + suffix = 'tst' + elif dt == 'valid': + suffix = 'dev' + return os.path.join(opt['datapath'], 'personalized-dialog', 'personalized-dialog-dataset', + '{exsz}'.format(exsz=exsz), + '{tsk}-{type}.txt'.format(tsk=tasks[int(task)], type=suffix)) + + +# The knowledge base of facts that can be used to answer questions. +class KBTeacher(FbDialogTeacher): + def __init__(self, opt, shared=None): + build(opt) + opt['datafile'] = os.path.join(opt['datapath'], 'personalized-dialog', 'personalized-dialog-dataset', + 'personalized-dialog-kb-all.txt') + super().__init__(opt, shared) + + +# python -t personalized_dialog:FullTask: +# Single full task. +class FullTaskTeacher(FbDialogTeacher): + def __init__(self, opt, shared=None): + opt['datafile'] = _path('full', opt['task'].split(':')[2], opt) + opt['cands_datafile'] = os.path.join(opt['datapath'], 'personalized-dialog', 'personalized-dialog-dataset', + 'personalized-dialog-candidates.txt') + super().__init__(opt, shared) + + +# python -t personalized_dialog:SmallTask: +# Single small task. 
+class SmallTaskTeacher(FbDialogTeacher): + def __init__(self, opt, shared=None): + opt['datafile'] = _path('small', opt['task'].split(':')[2], opt) + opt['cands_datafile'] = os.path.join(opt['datapath'], 'personalized-dialog', 'personalized-dialog-dataset', + 'personalized-dialog-candidates.txt') + super().__init__(opt, shared) + + +# python -t personalized_dialog:AllFull +# By default train on all tasks at once. +class AllFullTeacher(MultiTaskTeacher): + def __init__(self, opt, shared=None): + opt = copy.deepcopy(opt) + opt['task'] = ','.join('personalized_dialog:FullTask:%d' % (i + 1) + for i in range(5)) + opt['cands_datafile'] = os.path.join(opt['datapath'], 'personalized-dialog', 'personalized-dialog-dataset', + 'personalized-dialog-candidates.txt') + super().__init__(opt, shared) + + +# python -t personalized_dialog:AllSmall +# By default train on all tasks at once. +class AllSmallTeacher(MultiTaskTeacher): + def __init__(self, opt, shared=None): + opt = copy.deepcopy(opt) + opt['task'] = ','.join('personalized_dialog:SmallTask:%d' % (i + 1) + for i in range(5)) + opt['cands_datafile'] = os.path.join(opt['datapath'], 'personalized-dialog', 'personalized-dialog-dataset', + 'personalized-dialog-candidates.txt') + super().__init__(opt, shared) + + +# By default train on all tasks at once. +class DefaultTeacher(AllSmallTeacher): + pass diff --git a/parlai/tasks/personalized_dialog/build.py b/parlai/tasks/personalized_dialog/build.py new file mode 100644 index 00000000000..815aa9ad51f --- /dev/null +++ b/parlai/tasks/personalized_dialog/build.py @@ -0,0 +1,31 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +# Download and build the data if it does not exist. 
+ +import parlai.core.build_data as build_data +import os + + +def build(opt): + dpath = os.path.join(opt['datapath'], 'personalized-dialog') + version = None + + if not build_data.built(dpath, version_string=version): + print('[building data: ' + dpath + ']') + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) + build_data.make_dir(dpath) + + # Download the data. + # https://www.dropbox.com/s/4i9u4y24pt3paba/personalized-dialog-dataset.tar.gz?dl=1 + fname = 'personalized-dialog-dataset.tar.gz' + url = 'https://www.dropbox.com/s/4i9u4y24pt3paba/' + fname + '?dl=1' + build_data.download(url, dpath, fname) + build_data.untar(dpath, fname) + + # Mark the data as built. + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/qacnn/build.py b/parlai/tasks/qacnn/build.py index fcf953416b3..29818766fc4 100644 --- a/parlai/tasks/qacnn/build.py +++ b/parlai/tasks/qacnn/build.py @@ -15,13 +15,13 @@ def _process(fname, fout): # main article s = '1 ' + lines[2] # add question - s = s + lines[4] + s = s + ' ' + lines[4] # add answer s = s + '\t' + lines[6] # add candidates (and strip them of the real names) for i in range(8, len(lines)): lines[i] = lines[i].split(':')[0] - s = s + '\t\t' + '|'.join(lines[8:-1]) + s = s + '\t\t' + '|'.join(lines[8:]) fout.write(s + '\n\n') @@ -35,11 +35,14 @@ def create_fb_format(outpath, dtype, inpath): def build(opt): + version = 'v1.0' dpath = os.path.join(opt['datapath'], 'QACNN') - if not build_data.built(dpath): + if not build_data.built(dpath, version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -56,4 +59,4 @@ def build(opt): os.path.join(dpath, 'cnn', 'questions', 'test')) # Mark the data as built. 
- build_data.mark_done(dpath) + build_data.mark_done(dpath, version) diff --git a/parlai/tasks/qadailymail/build.py b/parlai/tasks/qadailymail/build.py index 2ee790987c6..171d3151874 100644 --- a/parlai/tasks/qadailymail/build.py +++ b/parlai/tasks/qadailymail/build.py @@ -15,13 +15,13 @@ def _process(fname, fout): # main article s = '1 ' + lines[2] # add question - s = s + lines[4] + s = s + ' ' + lines[4] # add answer s = s + '\t' + lines[6] # add candidates (and strip them of the real names) for i in range(8, len(lines)): lines[i] = lines[i].split(':')[0] - s = s + '\t\t' + '|'.join(lines[8:-1]) + s = s + '\t\t' + '|'.join(lines[8:]) fout.write(s + '\n\n') @@ -35,11 +35,14 @@ def create_fb_format(outpath, dtype, inpath): def build(opt): + version = 'v1.0' dpath = os.path.join(opt['datapath'], 'QADailyMail') - if not build_data.built(dpath): + if not build_data.built(dpath, version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -54,4 +57,4 @@ def build(opt): create_fb_format(dpath, 'test', os.path.join(dpath, ext, 'test')) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version) diff --git a/parlai/tasks/simplequestions/build.py b/parlai/tasks/simplequestions/build.py index e41e22e61f5..a32e0b727a5 100644 --- a/parlai/tasks/simplequestions/build.py +++ b/parlai/tasks/simplequestions/build.py @@ -11,10 +11,13 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'SimpleQuestions') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. 
@@ -25,4 +28,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/squad/build.py b/parlai/tasks/squad/build.py index 08e20069514..417b1d1a99f 100644 --- a/parlai/tasks/squad/build.py +++ b/parlai/tasks/squad/build.py @@ -11,10 +11,13 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'SQuAD') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -25,4 +28,4 @@ def build(opt): build_data.download(url + fname2, dpath, fname2) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/task_list.py b/parlai/tasks/task_list.py index 9bf5d429f46..3e0cb26387a 100644 --- a/parlai/tasks/task_list.py +++ b/parlai/tasks/task_list.py @@ -12,7 +12,7 @@ "id": "bAbI-1k", "display_name": "bAbI 1k", "task": "babi:All1k", - "tags": [ "all", "QA" ], + "tags": [ "All", "QA" ], "description": "20 synthetic tasks that each test a unique aspect of text and reasoning, and hence test different capabilities of learning models. From Weston et al. '16. Link: http://arxiv.org/abs/1502.05698", "notes": "You can access just one of the bAbI tasks with e.g. 'babi:Task1k:3' for task 3." }, @@ -20,7 +20,7 @@ "id": "bAbI-10k", "display_name": "bAbI 10k", "task": "babi:All10k", - "tags": [ "all", "QA" ], + "tags": [ "All", "QA" ], "description": "20 synthetic tasks that each test a unique aspect of text and reasoning, and hence test different capabilities of learning models. From Weston et al. '16. 
Link: http://arxiv.org/abs/1502.05698", "notes": "You can access just one of the bAbI tasks with e.g. 'babi:Task10k:3' for task 3." }, @@ -28,175 +28,217 @@ "id": "BookTest", "display_name": "BookTest", "task": "booktest", - "tags": [ "all", "Cloze" ], + "tags": [ "All", "Cloze" ], "description": "Sentence completion given a few sentences as context from a book. A larger version of CBT. From Bajgar et al., 16. Link: https://arxiv.org/abs/1610.00956" }, { "id": "CBT", "display_name": "Children's Book Test (CBT)", "task": "cbt", - "tags": [ "all", "Cloze" ], + "tags": [ "All", "Cloze" ], "description": "Sentence completion given a few sentences as context from a children's book. From Hill et al., '16. Link: https://arxiv.org/abs/1511.02301" }, { "id": "CornellMovie", "display_name": "Cornell Movie", "task": "cornell_movie", - "tags": [ "all", "ChitChat" ], + "tags": [ "All", "ChitChat" ], "description": "Fictional conversations extracted from raw movie scripts. Link: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html" }, { "id": "DBLL-bAbI", "display_name": "Dialog Based Language Learning: bAbI Task", "task": "dbll_babi", - "tags": [ "all", "Goal" ], + "tags": [ "All", "Goal" ], "description": "Short dialogs based on the bAbI tasks, but in the form of a question from a teacher, the answer from the student, and finally a comment on the answer from the teacher. The aim is to find learning models that use the comments to improve. From Weston '16. Link: https://arxiv.org/abs/1604.06045" }, { "id": "DBLL-Movie", "display_name": "Dialog Based Language Learning: WikiMovies Task", "task": "dbll_movie", - "tags": [ "all", "Goal" ], + "tags": [ "All", "Goal" ], "description": "Short dialogs based on WikiMovies, but in the form of a question from a teacher, the answer from the student, and finally a comment on the answer from the teacher. The aim is to find learning models that use the comments to improve. From Weston '16. 
Link: https://arxiv.org/abs/1604.06045" }, { "id": "dialog-bAbI", "display_name": "Dialog bAbI", "task": "dialog_babi", - "tags": [ "all", "Goal" ], + "tags": [ "All", "Goal" ], "description": "Simulated dialogs of restaurant booking, from Bordes et al. '16. Link: https://arxiv.org/abs/1605.07683" }, { "id": "MCTest", "display_name": "MCTest", "task": "mctest", - "tags": [ "all", "QA" ], + "tags": [ "All", "QA" ], "description": "Questions about short children's stories, from Richardson et al. '13. Link: https://www.microsoft.com/en-us/research/publication/mctest-challenge-dataset-open-domain-machine-comprehension-text/" }, { "id": "MovieDD-QA", "display_name": "Movie Dialog QA", "task": "moviedialog:Task:1", - "tags": [ "all", "QA", "MovieDD" ], + "tags": [ "All", "QA", "MovieDD" ], "description": "Closed-domain QA dataset asking templated questions about movies, answerable from Wikipedia, similar to WikiMovies. From Dodge et al. '15. Link: https://arxiv.org/abs/1511.06931" }, { "id": "MovieDD-QARecs", "display_name": "Movie Dialog QA Recommendations", "task": "moviedialog:Task:3", - "tags": [ "all", "Goal", "MovieDD" ], + "tags": [ "All", "Goal", "MovieDD" ], "description": "Dialogs discussing questions about movies as well as recommendations. From Dodge et al. '15. Link: https://arxiv.org/abs/1511.06931" }, { "id": "MovieDD-Recs", "display_name": "Movie Dialog Recommendations", "task": "moviedialog:Task:2", - "tags": [ "all", "QA", "MovieDD" ], + "tags": [ "All", "QA", "MovieDD" ], "description": "Questions asking for movie recommendations. From Dodge et al. '15. Link: https://arxiv.org/abs/1511.06931" }, { "id": "MovieDD-Reddit", "display_name": "Movie Dialog Reddit", "task": "moviedialog:Task:4", - "tags": [ "all", "ChitChat", "MovieDD" ], + "tags": [ "All", "ChitChat", "MovieDD" ], "description": "Dialogs discussing Movies from Reddit (the Movies SubReddit). From Dodge et al. '15. 
Link: https://arxiv.org/abs/1511.06931" }, { "id": "MTurkWikiMovies", "display_name": "MTurk WikiMovies", "task": "mturkwikimovies", - "tags": [ "all", "QA" ], + "tags": [ "All", "QA" ], "description": "Closed-domain QA dataset asking MTurk-derived questions about movies, answerable from Wikipedia. From Li et al. '16. Link: https://arxiv.org/abs/1611.09823" }, { "id": "OpenSubtitles", "display_name": "Open Subtitles", "task": "opensubtitles", - "tags": [ "all", "ChitChat" ], + "tags": [ "All", "ChitChat" ], "description": "Dataset of dialogs from movie scripts: http://opus.lingfil.uu.se/OpenSubtitles.php. A variant of the dataset used in Vinyals & Le '15, https://arxiv.org/abs/1506.05869." }, + { + "id": "personalized-dialog-full", + "display_name": "Personalized Dialog Full Set", + "task": "personalized_dialog:full", + "tags": [ "All", "Goal", "Personalization" ], + "description": "Simulated dataset of restaurant booking focused on personalization based on user profiles. From Joshi et al. '17. Link: https://arxiv.org/abs/1706.07503" + }, + { + "id": "personalized-dialog-small", + "display_name": "Personalized Dialog Small Set", + "task": "personalized_dialog:small", + "tags": [ "All", "Goal", "Personalization" ], + "description": "Simulated dataset of restaurant booking focused on personalization based on user profiles. From Joshi et al. '17. Link: https://arxiv.org/abs/1706.07503" + }, { "id": "QACNN", "display_name": "QA CNN", "task": "qacnn", - "tags": [ "all", "Cloze" ], + "tags": [ "All", "Cloze" ], "description": "Cloze dataset based on a missing (anonymized) entity phrase from a CNN article, Hermann et al. '15. Link: https://arxiv.org/abs/1506.03340" }, { "id": "QADailyMail", "display_name": "QA Daily Mail", "task": "qadailymail", - "tags": [ "all", "Cloze" ], + "tags": [ "All", "Cloze" ], "description": "Cloze dataset based on a missing (anonymized) entity phrase from a Daily Mail article, Hermann et al. '15. 
Link: https://arxiv.org/abs/1506.03340" }, { "id": "SimpleQuestions", "display_name": "Simple Questions", "task": "simplequestions", - "tags": [ "all", "QA" ], + "tags": [ "All", "QA" ], "description": "Open-domain QA dataset based on Freebase triples from Bordes et al. '15. Link: https://arxiv.org/abs/1506.02075" }, { "id": "SQuAD", "display_name": "SQuAD", "task": "squad", - "tags": [ "all", "QA" ], + "tags": [ "All", "QA" ], "description": "Open-domain QA dataset answerable from a given paragraph from Wikipedia, from Rajpurkar et al. '16. Link: https://arxiv.org/abs/1606.05250" }, + { + "id": "TriviaQA", + "display_name": "TriviaQA", + "task": "triviaqa", + "tags": [ "All", "QA" ], + "description": "Open-domain QA dataset with question-answer-evidence triples, from Joshi et al. '17. Link: https://arxiv.org/abs/1705.03551" + }, { "id": "Ubuntu", "display_name": "Ubuntu", "task": "ubuntu", - "tags": [ "all", "ChitChat" ], + "tags": [ "All", "ChitChat" ], "description": "Dialogs between an Ubuntu user and an expert trying to fix issue, from Lowe et al. '15. Link: https://arxiv.org/abs/1506.08909" }, { "id": "WebQuestions", "display_name": "Web Questions", "task": "webquestions", - "tags": [ "all", "QA" ], + "tags": [ "All", "QA" ], "description": "Open-domain QA dataset from Web queries from Berant et al. '13. Link: http://www.aclweb.org/anthology/D13-1160" }, { "id": "WikiMovies", "display_name": "WikiMovies", "task": "wikimovies", - "tags": [ "all", "QA" ], + "tags": [ "All", "QA" ], "description": "Closed-domain QA dataset asking templated questions about movies, answerable from Wikipedia. From Miller et al. '16. Link: https://arxiv.org/abs/1606.03126" }, { "id": "WikiQA", "display_name": "WikiQA", "task": "wikiqa", - "tags": [ "all", "QA" ], + "tags": [ "All", "QA" ], "description": "Open domain QA from Wikipedia dataset from Yang et al. '15. 
Link: https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/" }, { - "id": "VQAv1", - "display_name": "VQAv1", - "task": "vqa_v1", - "tags": [ "all", "Visual" ], - "description": "Open-ended question answering about visual content. From Agrawal et al. '15. Link: https://arxiv.org/abs/1505.00468" + "id": "VQAv1", + "display_name": "VQAv1", + "task": "vqa_v1", + "tags": [ "All", "Visual" ], + "description": "Open-ended question answering about visual content. From Agrawal et al. '15. Link: https://arxiv.org/abs/1505.00468" }, { - "id": "VQAv2", - "display_name": "VQAv2", - "task": "vqa_v2", - "tags": [ "all", "Visual" ], - "description": "Bigger, more balanced version of the original VQA dataset. From Goyal et al. '16. Link: https://arxiv.org/abs/1612.00837" + "id": "VQAv2", + "display_name": "VQAv2", + "task": "vqa_v2", + "tags": [ "All", "Visual" ], + "description": "Bigger, more balanced version of the original VQA dataset. From Goyal et al. '16. Link: https://arxiv.org/abs/1612.00837" }, { - "id": "VisDial", - "display_name": "VisDial", - "task": "visdial", - "tags": [ "all", "Visual" ], - "description": "Task which requires agents to hold a meaningful dialog about visual content. From Das et al. '16. Link: https://arxiv.org/abs/1611.08669" + "id": "VisDial", + "display_name": "VisDial", + "task": "visdial", + "tags": [ "All", "Visual" ], + "description": "Task which requires agents to hold a meaningful dialog about visual content. From Das et al. '16. Link: https://arxiv.org/abs/1611.08669" }, { - "id": "MNIST_QA", - "display_name": "MNIST_QA", - "task": "mnist_qa", - "tags": [ "all", "Visual" ], - "description": "Task which requires agents to identify which number they are seeing. From the MNIST dataset." + "id": "MNIST_QA", + "display_name": "MNIST_QA", + "task": "mnist_qa", + "tags": [ "All", "Visual" ], + "description": "Task which requires agents to identify which number they are seeing. 
From the MNIST dataset." }, + { + "id": "InsuranceQA", + "display_name": "InsuranceQA", + "task": "insuranceqa", + "tags": [ "All", "QA" ], + "description": "Task which requires agents to identify high quality answers composed by professionals with deep domain knowledge. From Feng et al. '15. Link: https://arxiv.org/abs/1508.01585" + }, + { + "id": "MS_MARCO", + "display_name": "MS_MARCO", + "task": "ms_marco", + "tags": [ "All", "QA" ], + "description": "A large scale Machine Reading Comprehension Dataset with questions sampled from real anonymized user queries and contexts from web documents. From Nguyen et al. '16. Link: https://arxiv.org/abs/1611.09268" + }, + { + "id": "CLEVR", + "display_name": "CLEVR", + "task": "clevr", + "tags": [ "All", "Visual" ], + "description": "A visual reasoning dataset that tests abilities such as attribute identification, counting, comparison, spatial relationships, and logical operations. From Johnson et al. '16. Link: https://arxiv.org/abs/1612.06890" + } ] diff --git a/parlai/tasks/triviaqa/__init__.py b/parlai/tasks/triviaqa/__init__.py new file mode 100644 index 00000000000..8eff276d72d --- /dev/null +++ b/parlai/tasks/triviaqa/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. \ No newline at end of file diff --git a/parlai/tasks/triviaqa/agents.py b/parlai/tasks/triviaqa/agents.py new file mode 100644 index 00000000000..53f971c6df8 --- /dev/null +++ b/parlai/tasks/triviaqa/agents.py @@ -0,0 +1,133 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. 
An additional grant +# of patent rights can be found in the PATENTS file in the same directory. + +from parlai.core.dialog_teacher import DialogTeacher +from parlai.core.agents import MultiTaskTeacher +from .build import build + +import copy +import json +import os +import random + +def _path(opt): + build(opt) + + return (os.path.join(opt['datapath'], 'TriviaQA', 'qa'), + os.path.join(opt['datapath'], 'TriviaQA', 'evidence')) + + +class WebTeacher(DialogTeacher): + def __init__(self, opt, shared=None): + if not hasattr(self, 'prefix'): + self.prefix = '' + if opt['datatype'].startswith('train'): + self.suffix = 'train' + else: + self.suffix = 'dev' + + qa_dir, self.evidence_dir = _path(opt) + opt['datafile'] = os.path.join(qa_dir, self.prefix + 'web-' + + self.suffix + '.json') + self.id = 'triviaqa' + super().__init__(opt, shared) + + def setup_data(self, path): + print('loading: ' + path) + with open(path) as data_file: + data = json.load(data_file)['Data'] + for datapoint in data: + question = datapoint['Question'] + answers = datapoint['Answer']['Aliases'] + evidence_list = datapoint['SearchResults'] + + if len(evidence_list) == 0: + continue + + for evidence_item in evidence_list: + evidence_file_path = os.path.join(self.evidence_dir, 'web', + evidence_item['Filename']) + with open(evidence_file_path) as evidence_file: + evidence = 'Title: %s\n' % evidence_item['Title'] + evidence += evidence_file.read() + yield (evidence + '\n' + question, answers), True + + +class VerifiedWebTeacher(WebTeacher): + def __init__(self, opt, shared=None): + self.prefix = 'verified-' + self.suffix = 'dev' + if opt['datatype'] != 'valid': + print('WARNING: Verified teacher only provides dev data') + + opt['datafile'], self.evidence_dir = _path(opt) + self.id = 'triviaqa' + super().__init__(opt, shared) + + +class WikipediaTeacher(DialogTeacher): + def __init__(self, opt, shared=None): + if not hasattr(self, 'prefix'): + self.prefix = '' + if opt['datatype'].startswith('train'): + 
self.suffix = 'train' + else: + self.suffix = 'dev' + + qa_dir, self.evidence_dir = _path(opt) + opt['datafile'] = os.path.join(qa_dir, self.prefix + 'wikipedia-' + + self.suffix + '.json') + + self.id = 'triviaqa' + super().__init__(opt, shared) + + def setup_data(self, path): + print('loading: ' + path) + with open(path) as data_file: + data = json.load(data_file)['Data'] + for datapoint in data: + question = datapoint['Question'] + answers = datapoint['Answer']['Aliases'] + evidence_list = datapoint['EntityPages'] + + if len(evidence_list) == 0: + continue + + evidence = '' + for evidence_item in evidence_list: + evidence_file_path = os.path.join(self.evidence_dir, + 'wikipedia', + evidence_item['Filename']) + with open(evidence_file_path) as evidence_file: + evidence += 'Title: %s\n' % evidence_item['Title'] + evidence += evidence_file.read() + '\n\n' + + yield (evidence + question, answers), True + + +class VerifiedWikipediaTeacher(WikipediaTeacher): + def __init__(self, opt, shared=None): + self.prefix = 'verified-' + self.suffix = 'dev' + if opt['datatype'] != 'valid': + print('WARNING: Verified teacher only provides dev data') + + opt['datafile'], self.evidence_dir = _path(opt) + self.id = 'triviaqa' + super().__init__(opt, shared) + + +class VerifiedTeacher(MultiTaskTeacher): + def __init__(self, opt, shared=None): + opt = copy.deepcopy(opt) + opt['task'] = 'triviaqa:VerifiedWikipedia,triviaqa:VerifiedWeb' + super().__init__(opt, shared) + +class DefaultTeacher(MultiTaskTeacher): + def __init__(self, opt, shared=None): + opt = copy.deepcopy(opt) + opt['task'] = 'triviaqa:wikipedia,triviaqa:web' + super().__init__(opt, shared) diff --git a/parlai/tasks/triviaqa/build.py b/parlai/tasks/triviaqa/build.py new file mode 100644 index 00000000000..e955164c86b --- /dev/null +++ b/parlai/tasks/triviaqa/build.py @@ -0,0 +1,31 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. 
+# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. +# +# Download and build the data if it does not exist. + +import parlai.core.build_data as build_data +import os + + +def build(opt): + dpath = os.path.join(opt['datapath'], 'TriviaQA') + version = None + + if not build_data.built(dpath, version_string=version): + print('[building data: ' + dpath + ']') + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) + build_data.make_dir(dpath) + + # Download the data. + fname = 'triviaqa-rc.tar.gz' + url = 'http://nlp.cs.washington.edu/triviaqa/data/' + build_data.download(url + fname, dpath, fname) + build_data.untar(dpath, fname) + + # Mark the data as built. + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/ubuntu/build.py b/parlai/tasks/ubuntu/build.py index 6c263a1f125..36fd5cb965a 100644 --- a/parlai/tasks/ubuntu/build.py +++ b/parlai/tasks/ubuntu/build.py @@ -12,10 +12,13 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'Ubuntu') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -25,4 +28,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. 
- build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/visdial/build.py b/parlai/tasks/visdial/build.py index f111418ecaf..c4efeb2f176 100644 --- a/parlai/tasks/visdial/build.py +++ b/parlai/tasks/visdial/build.py @@ -9,32 +9,7 @@ import os import json - -def buildImage(opt): - dpath = os.path.join(opt['datapath'], 'COCO-IMG') - - if not build_data.built(dpath): - print('[building image data: ' + dpath + ']') - build_data.remove_dir(dpath) - build_data.make_dir(dpath) - - # download the image data. - fname1 = 'train2014.zip' - fname2 = 'val2014.zip' - fname3 = 'test2014.zip' - - url = 'http://msvocds.blob.core.windows.net/coco2014/' - - build_data.download(url + fname1, dpath, fname1) - build_data.download(url + fname2, dpath, fname2) - build_data.download(url + fname3, dpath, fname3) - - build_data.untar(dpath, fname1) - build_data.untar(dpath, fname2) - build_data.untar(dpath, fname3) - - # Mark the data as built. - build_data.mark_done(dpath) +from parlai.tasks.vqa_v1.build import buildImage def build(opt): @@ -44,7 +19,9 @@ def build(opt): if not build_data.built(dpath, version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. diff --git a/parlai/tasks/vqa_v1/agents.py b/parlai/tasks/vqa_v1/agents.py index 7de400cf223..2db211ce646 100644 --- a/parlai/tasks/vqa_v1/agents.py +++ b/parlai/tasks/vqa_v1/agents.py @@ -5,7 +5,7 @@ # of patent rights can be found in the PATENTS file in the same directory. 
from parlai.core.agents import Teacher -from parlai.core.dialog_teacher import load_image +from parlai.core.image_featurizers import ImageLoader from .build import build, buildImage import json @@ -29,7 +29,7 @@ def _path(opt): elif dt == 'test': ques_suffix = 'MultipleChoice_mscoco_test2015' annotation_suffix = 'None' - img_suffix = os.path.join('test2014', 'COCO_test2014_') + img_suffix = os.path.join('test2015', 'COCO_test2015_') else: raise RuntimeError('Not valid datatype.') @@ -66,7 +66,7 @@ def __init__(self, opt, shared=None): # size so they all process disparate sets of the data self.step_size = opt.get('batchsize', 1) self.data_offset = opt.get('batchindex', 0) - + self.image_loader = ImageLoader(opt) self.reset() def __len__(self): @@ -101,7 +101,7 @@ def act(self): img_path = self.image_path + '%012d.jpg' % (image_id) action = { - 'image': load_image(self.opt, img_path), + 'image': self.image_loader.load(img_path), 'text': question, 'episode_done': True } diff --git a/parlai/tasks/vqa_v1/build.py b/parlai/tasks/vqa_v1/build.py index ab274a06b92..309fcfe387d 100644 --- a/parlai/tasks/vqa_v1/build.py +++ b/parlai/tasks/vqa_v1/build.py @@ -11,18 +11,21 @@ def buildImage(opt): dpath = os.path.join(opt['datapath'], 'COCO-IMG') + version = '1' - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building image data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) - # download the image data. + # Download the image data. 
fname1 = 'train2014.zip' fname2 = 'val2014.zip' - fname3 = 'test2014.zip' + fname3 = 'test2015.zip' - url = 'http://msvocds.blob.core.windows.net/coco2014/' + url = 'https://s3.amazonaws.com/fair-data/parlai/COCO-IMG/' build_data.download(url + fname1, dpath, fname1) build_data.download(url + fname2, dpath, fname2) @@ -33,15 +36,18 @@ def buildImage(opt): build_data.untar(dpath, fname3) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) def build(opt): dpath = os.path.join(opt['datapath'], 'VQA-v1') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -66,4 +72,4 @@ def build(opt): build_data.untar(dpath, fname5) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/vqa_v2/agents.py b/parlai/tasks/vqa_v2/agents.py index 1d46a9432bf..2ffd66ed598 100644 --- a/parlai/tasks/vqa_v2/agents.py +++ b/parlai/tasks/vqa_v2/agents.py @@ -5,7 +5,7 @@ # of patent rights can be found in the PATENTS file in the same directory. 
from parlai.core.agents import Teacher -from parlai.core.dialog_teacher import load_image +from parlai.core.image_featurizers import ImageLoader from .build import build, buildImage import json @@ -29,7 +29,7 @@ def _path(opt): elif dt == 'test': ques_suffix = 'v2_OpenEnded_mscoco_test2015' annotation_suffix = 'None' - img_suffix = os.path.join('test2014', 'COCO_test2014_') + img_suffix = os.path.join('test2015', 'COCO_test2015_') else: raise RuntimeError('Not valid datatype.') @@ -60,12 +60,14 @@ def __init__(self, opt, shared=None): self.annotation = shared['annotation'] else: self._setup_data(data_path, annotation_path) + self.len = len(self.ques['questions']) # for ordered data in batch mode (especially, for validation and # testing), each teacher in the batch gets a start index and a step # size so they all process disparate sets of the data self.step_size = opt.get('batchsize', 1) self.data_offset = opt.get('batchindex', 0) + self.image_loader = ImageLoader(opt) self.reset() @@ -90,7 +92,9 @@ def act(self): if self.datatype == 'train': self.episode_idx = random.randrange(self.len) else: - self.episode_idx = (self.episode_idx + 1) % self.len + self.episode_idx = (self.episode_idx + self.step_size) % len(self) + if self.episode_idx == len(self) - self.step_size: + self.epochDone = True qa = self.ques['questions'][self.episode_idx] question = qa['question'] @@ -99,7 +103,7 @@ def act(self): img_path = self.image_path + '%012d.jpg' % (image_id) action = { - 'image': load_image(self.opt, img_path), + 'image': self.image_loader.load(img_path), 'text': question, 'episode_done': True } @@ -130,8 +134,6 @@ def _setup_data(self, data_path, annotation_path): with open(annotation_path) as data_file: self.annotation = json.load(data_file) - self.len = len(self.ques['questions']) - class DefaultTeacher(OeTeacher): pass diff --git a/parlai/tasks/vqa_v2/build.py b/parlai/tasks/vqa_v2/build.py index 76a0784ea0d..9c1f4f874d8 100644 --- a/parlai/tasks/vqa_v2/build.py +++ 
b/parlai/tasks/vqa_v2/build.py @@ -8,40 +8,18 @@ import parlai.core.build_data as build_data import os - -def buildImage(opt): - dpath = os.path.join(opt['datapath'], 'COCO-IMG') - - if not build_data.built(dpath): - print('[building image data: ' + dpath + ']') - build_data.remove_dir(dpath) - build_data.make_dir(dpath) - - # download the image data. - fname1 = 'train2014.zip' - fname2 = 'val2014.zip' - fname3 = 'test2014.zip' - - url = 'http://msvocds.blob.core.windows.net/coco2014/' - - build_data.download(url + fname1, dpath, fname1) - build_data.download(url + fname2, dpath, fname2) - build_data.download(url + fname3, dpath, fname3) - - build_data.untar(dpath, fname1) - build_data.untar(dpath, fname2) - build_data.untar(dpath, fname3) - - # Mark the data as built. - build_data.mark_done(dpath) +from parlai.tasks.vqa_v1.build import buildImage def build(opt): dpath = os.path.join(opt['datapath'], 'VQA-v2') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + # An older version exists, so remove these outdated files. + if build_data.built(dpath): + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -67,4 +45,4 @@ def build(opt): build_data.untar(dpath, fname5) # Mark the data as built. 
- build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/webquestions/build.py b/parlai/tasks/webquestions/build.py index 226743842b0..ddf54c2c3ec 100644 --- a/parlai/tasks/webquestions/build.py +++ b/parlai/tasks/webquestions/build.py @@ -35,10 +35,13 @@ def create_fb_format(outpath, dtype, inpath): def build(opt): dpath = os.path.join(opt['datapath'], 'WebQuestions') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -55,4 +58,4 @@ def build(opt): create_fb_format(dpath, 'test', os.path.join(dpath, 'test.json')) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/wikimovies/agents.py b/parlai/tasks/wikimovies/agents.py index 059f82e6cab..0bb417f0f73 100644 --- a/parlai/tasks/wikimovies/agents.py +++ b/parlai/tasks/wikimovies/agents.py @@ -30,8 +30,15 @@ def _path(opt): class KBTeacher(FbDialogTeacher): def __init__(self, opt, shared=None): build(opt) - task = opt.get('task', 'wikimovies:KB:kb') - kb = task.split(':')[2] + task = opt.get('task') + if not task: + task = 'wikimovies:KB:kb' + kb = task.split(':') + if len(kb) == 3: + kb = kb[2] + elif len(kb) == 2: + # default to 'kb' if 'kb', 'wiki', or 'ie' not specified + kb = 'kb' kbs = {} kbs['kb'] = os.path.join('wiki_entities', 'wiki_entities_kb.txt') kbs['wiki'] = 'wiki.txt' diff --git a/parlai/tasks/wikimovies/build.py b/parlai/tasks/wikimovies/build.py index 141ada2b7e4..9205d182a72 100644 --- a/parlai/tasks/wikimovies/build.py +++ b/parlai/tasks/wikimovies/build.py @@ -11,10 +11,13 @@ def build(opt): dpath = os.path.join(opt['datapath'], 'WikiMovies') + version = None - if 
not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -24,4 +27,4 @@ def build(opt): build_data.untar(dpath, fname) # Mark the data as built. - build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/parlai/tasks/wikiqa/build.py b/parlai/tasks/wikiqa/build.py index 156f2635c81..749915a051d 100644 --- a/parlai/tasks/wikiqa/build.py +++ b/parlai/tasks/wikiqa/build.py @@ -39,10 +39,13 @@ def create_fb_format(outpath, dtype, inpath): def build(opt): dpath = os.path.join(opt['datapath'], 'WikiQA') + version = None - if not build_data.built(dpath): + if not build_data.built(dpath, version_string=version): print('[building data: ' + dpath + ']') - build_data.remove_dir(dpath) + if build_data.built(dpath): + # An older version exists, so remove these outdated files. + build_data.remove_dir(dpath) build_data.make_dir(dpath) # Download the data. @@ -66,4 +69,4 @@ def build(opt): os.path.join(dpext, 'WikiQA-test.tsv')) # Mark the data as built. 
- build_data.mark_done(dpath) + build_data.mark_done(dpath, version_string=version) diff --git a/tests/check_examples.sh b/tests/check_examples.sh index aa29a8b22dc..e40e705d3a6 100755 --- a/tests/check_examples.sh +++ b/tests/check_examples.sh @@ -14,9 +14,10 @@ python display_data.py -t babi:task1k:1,squad -n 100 python eval_model.py -m ir_baseline -t "#moviedd-reddit" -dt valid -n 10 python display_model.py -m ir_baseline -t "#moviedd-reddit" -dt valid -n 10 python build_dict.py -t babi:task1k:1 --dict-file /tmp/dict.tsv +python train_model.py -m seq2seq -t babi:task1k:1 -bs 8 -e 1 -mf /tmp/model_s2s # TODO: this one breaks when done in scripts due to some environment variable issues #python memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1 -nt 8 --num-examples 100 --num-its 1 # if this returns without an error code, you're good! -python train_model.py -m drqa -t squad -bs 32 -mf /tmp/model & sleep 60 ; kill $! +python train_model.py -m drqa -t squad -bs 32 -mf /tmp/model_drqa & sleep 60 ; kill $! diff --git a/tests/run_tests_long.sh b/tests/run_tests_long.sh index a0f2ec8e05d..92c48483d7f 100755 --- a/tests/run_tests_long.sh +++ b/tests/run_tests_long.sh @@ -7,4 +7,4 @@ # of patent rights can be found in the PATENTS file in the same directory. set -e # stop if any tests fail -python test_data.py +python3 test_downloads.py diff --git a/tests/run_tests_short.sh b/tests/run_tests_short.sh index d306b33fff4..ecbad8af794 100755 --- a/tests/run_tests_short.sh +++ b/tests/run_tests_short.sh @@ -7,8 +7,9 @@ # of patent rights can be found in the PATENTS file in the same directory. 
set -e # stop if any tests fail -python test_init.py -python test_import.py -python test_dict.py -python test_tasklist.py -python test_threadutils.py +python3 test_init.py +python3 test_import.py +python3 test_dict.py +python3 test_tasklist.py +python3 test_threadutils.py +python3 test_utils.py diff --git a/tests/test_data.py b/tests/test_downloads.py similarity index 88% rename from tests/test_data.py rename to tests/test_downloads.py index d27b446f260..eb46cf3bdab 100644 --- a/tests/test_data.py +++ b/tests/test_downloads.py @@ -284,6 +284,22 @@ def test_squad(self): shutil.rmtree(self.TMP_PATH) + def test_triviaqa(self): + from parlai.core.params import ParlaiParser + from parlai.tasks.triviaqa.agents import WebTeacher, WikipediaTeacher + + opt = ParlaiParser().parse_args(args=self.args) + + for teacher_class in (WebTeacher, WikipediaTeacher): + for dt in ['train:ordered', 'valid']: + opt['datatype'] = dt + + teacher = teacher_class(opt) + reply = teacher.act() + check(opt, reply) + + shutil.rmtree(self.TMP_PATH) + def test_ubuntu(self): from parlai.core.params import ParlaiParser from parlai.tasks.ubuntu.agents import DefaultTeacher @@ -379,6 +395,44 @@ def test_coco_datasets(self): shutil.rmtree(self.TMP_PATH) + def test_insuranceqa(self): + from parlai.core.params import ParlaiParser + from parlai.tasks.insuranceqa.agents import V1Teacher, V2Teacher + + opt = ParlaiParser().parse_args(args=self.args) + + for dt in ['train', 'valid', 'test']: + opt['datatype'] = dt + + teacher = V1Teacher(opt) + reply = teacher.act() + check(opt, reply) + + teacher = V2Teacher(opt) + reply = teacher.act() + check(opt, reply) + + shutil.rmtree(self.TMP_PATH) + + def test_ms_marco(self): + from parlai.core.params import ParlaiParser + from parlai.tasks.ms_marco.agents import DefaultTeacher, PassageTeacher + + opt = ParlaiParser().parse_args(args=self.args) + + for dt in ['train', 'valid']: + opt['datatype'] = dt + + teacher = DefaultTeacher(opt) + reply = teacher.act() + 
check(opt, reply) + + teacher = PassageTeacher(opt) + reply = teacher.act() + check(opt, reply) + + shutil.rmtree(self.TMP_PATH) + if __name__ == '__main__': # clean out temp dir first diff --git a/tests/test_init.py b/tests/test_init.py index ae7b11c83de..216fd763965 100644 --- a/tests/test_init.py +++ b/tests/test_init.py @@ -9,13 +9,16 @@ class TestInit(unittest.TestCase): - """Make sure the package is alive.""" + """Make sure all python packages have init files.""" def test_init_everywhere(self): from parlai.core.params import ParlaiParser opt = ParlaiParser().parse_args() for root, subfolder, files in os.walk(os.path.join(opt['parlai_home'], 'parlai')): if not root.endswith('__pycache__'): + if os.path.basename(root) == 'html': + # skip mturk core's html folder--not a python module + continue assert '__init__.py' in files, 'Dir {} is missing __init__.py'.format(root) diff --git a/tests/test_utils.py b/tests/test_utils.py new file mode 100644 index 00000000000..6ff48e4e6ce --- /dev/null +++ b/tests/test_utils.py @@ -0,0 +1,67 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# All rights reserved. +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. An additional grant +# of patent rights can be found in the PATENTS file in the same directory. 
+ +from parlai.core.utils import Timer, round_sigfigs +import time +import unittest + + +class TestUtils(unittest.TestCase): + + def test_round_sigfigs(self): + x = 0 + y = 0 + assert round_sigfigs(x, 2) == y + + x = 100 + y = 100 + assert round_sigfigs(x, 2) == y + + x = 0.01 + y = 0.01 + assert round_sigfigs(x, 2) == y + + x = 0.00123 + y = 0.001 + assert round_sigfigs(x, 1) == y + + x = 0.37 + y = 0.4 + assert round_sigfigs(x, 1) == y + + x = 2353 + y = 2350 + assert round_sigfigs(x, 3) == y + + x = 3547345734 + y = 3547350000 + assert round_sigfigs(x, 6) == y + + x = 0.0000046246 + y = 0.00000462 + assert round_sigfigs(x, 3) == y + + def test_timer(self): + t = Timer() + elapsed = t.stop().time() + assert elapsed > 0 + + same = t.time() + assert elapsed == same + + t.resume() + time.sleep(0.1) + more = t.time() + assert more > elapsed + + other = Timer() + less = other.reset().time() + assert less > 0 + assert less < t.time() + + +if __name__ == '__main__': + unittest.main()