Application Example

Scott Sievert edited this page Feb 17, 2017 · 1 revision

Algorithms

We will now look at the code for an example algorithm for PoolBasedBinaryClassification. The algorithm code obeys an interface defined in apps/PoolBasedBinaryClassification/algs/Algs.yaml, and the actual code is given in apps/PoolBasedBinaryClassification/RandomSamplingLinearLeastSquares/RandomSamplingLinearLeastSquares.py.

Algs.yaml

Algs.yaml provides a uniform interface that every algorithm has to satisfy. As with PoolBasedBinaryClassification.yaml, the interface is specified in a standard YAML format, described here. Requiring a specific interface makes it easier to know what your algorithm is allowed to take as inputs and return as outputs. In projects where there are multiple algorithms for a given application, this makes it easier for many developers to work together and spread the task of developing algorithms.

initExp:
  args:
    n:
      type: num
      description: Number of targets available.
  rets:
    type: bool
    description: A boolean indicating if the algorithm initialization succeeded or failed
    values: true

getQuery:
  args:
    participant_uid:
      type: str
      description: Participant unique ID
  rets:
    type: num
    description: The index of the target to ask about

processAnswer:
  args:
    target_index:
      type: num
      description: The ID of the target we are asking about
    target_label:
      type: num
      description: The label assigned to the target
  rets:
    description: Indicates if the algorithm succeeded
    type: bool
    values: true

getModel:
  rets:
    type: dict
    description: The current state of the model
    values:
      weights:
        type: list
        description: The linear model weights
        values:
          type: num
      num_reported_answers:
        type: num
        description: The number of reported answers (for this algorithm)

This interface is best understood when compared to the algorithm file, RandomSamplingLinearLeastSquares.py. The args in each API function correspond to the inputs of that API function in the algorithm. The rets correspond to the outputs of those functions.

import numpy.random

class RandomSamplingLinearLeastSquares:
    def initExp(self, butler, n, d):
        return ...

    def getQuery(self, butler, participant_uid):
        return ...

    def processAnswer(self, butler, target_index, target_label):
        return ...

    def getModel(self, butler):
        return ...

    def full_embedding_update(self, butler, args):
        ...

Ignoring the butler input, we see that the args to initExp should be n and d. These are precisely the inputs to the associated initExp function. Note that these also correspond to the keys in the inputs to the alg call in initExp in the application code. We recommend checking this consistency in inputs across Algs.yaml and the algorithm code with the application code for each API function.
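One way to run this consistency check is to compare the argument names in the spec with each method's signature. The following is a minimal sketch, not part of NEXT: missing_and_extra and Skeleton are hypothetical names, the spec dict is copied by hand from the Algs.yaml excerpt above rather than parsed from the file, and inspect.signature assumes Python 3.

```python
import inspect


def missing_and_extra(spec_args, method, ignored=("self", "butler")):
    """Compare arg names from the spec with a method's parameters.

    Returns (missing, extra): names in the spec but not in the signature,
    and names in the signature but not in the spec.
    """
    params = [p for p in inspect.signature(method).parameters if p not in ignored]
    missing = sorted(set(spec_args) - set(params))
    extra = sorted(set(params) - set(spec_args))
    return missing, extra


class Skeleton:
    # Signatures copied from the algorithm skeleton above.
    def initExp(self, butler, n, d):
        return True

    def getQuery(self, butler, participant_uid):
        return 0


# Hand-copied arg names; in practice these would come from parsing Algs.yaml.
spec = {"initExp": ["n"], "getQuery": ["participant_uid"]}

report = {
    name: missing_and_extra(args, getattr(Skeleton, name))
    for name, args in spec.items()
}
```

With the excerpt as listed, the report for initExp flags d as present in the method but absent from the spec, which is the kind of mismatch this check is meant to surface.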

RandomSamplingLinearLeastSquares.py

We now describe the API functions in RandomSamplingLinearLeastSquares.py in detail.

QUESTION: Can the butler in the algorithm access other collections and change them? Look into how this is setup.

As we can see above, each API function takes in a butler object. Algorithm-specific variables should be set and retrieved using the butler.algorithms collection. The butler has to be used to ensure that variables are stored in the NEXT database and can be retrieved across different workers and user web sessions. Again, the full set of features of the butler is documented in the Butler API.

initExp:

    def initExp(self, butler, n, d):
        # Save the number of targets and the dimension to algorithm storage
        butler.algorithms.set(key='n', value=n)
        butler.algorithms.set(key='d', value=d)

        # Initialize the weights to a list of d+1 zeros (d features plus a bias term)
        butler.algorithms.set(key='weights', value=[0]*(d+1))
        return True

The initExp function is very simple. It saves n, d, and a weights vector of zeros (representing the weights in our least squares linear model) in the butler.algorithms collection. You may recall that n and d are also stored in the butler.experiment collection. Again, best practice dictates that any variables needed in an algorithm be stored in and retrieved from the butler.algorithms collection directly.
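To try initExp without a running NEXT instance, one option is a small in-memory stand-in for the butler. The sketch below is hypothetical: FakeCollection and FakeButler are not NEXT classes, and their behaviour (append creating the list, increment returning the new value, get accepting a list of keys) is only inferred from how the calls are used on this page.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for one butler collection."""

    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value

    def get(self, key):
        # Assumption: a list of keys returns a dict of key -> value.
        if isinstance(key, list):
            return {k: self.store.get(k) for k in key}
        return self.store.get(key)

    def append(self, key, value):
        # Appending to a missing key creates the list first.
        self.store.setdefault(key, []).append(value)

    def increment(self, key, value=1):
        self.store[key] = self.store.get(key, 0) + value
        return self.store[key]


class FakeButler:
    """Hypothetical butler exposing only the algorithms collection."""

    def __init__(self):
        self.algorithms = FakeCollection()


def initExp(butler, n, d):
    # Same body as the initExp method above, written as a plain function.
    butler.algorithms.set(key='n', value=n)
    butler.algorithms.set(key='d', value=d)
    butler.algorithms.set(key='weights', value=[0] * (d + 1))
    return True


butler = FakeButler()
ok = initExp(butler, n=10, d=3)
stored_weights = butler.algorithms.get(key='weights')
```

A stand-in like this keeps everything in one process, so it says nothing about how the real butler behaves across workers.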

Note about atomicity Insert one here.

getQuery:

    def getQuery(self, butler, participant_uid):
        # Retrieve the number of targets
        n = butler.algorithms.get(key='n')
        # Get the list of queries already asked of this participant
        asked_queries = butler.participants.get(uid=participant_uid, key='asked_queries')
        # If we have asked this participant to label all the targets, return 0
        if len(asked_queries) == n:
            return 0
        # Choose a random target that has not been asked yet
        i = numpy.random.choice(n)
        while i in asked_queries:
            i = numpy.random.choice(n)
        return i

In an active algorithm, the procedure that returns an active query is at the heart of the algorithm. In this example, our (in)active algorithm is very simple: it just returns a random index between 0 and the number of targets minus 1. This index corresponds to the target_id of the random target that we wish to return to the user. We also want to ensure that the user has not labelled this item previously. It is a good idea to review how this index is used by the application code.

Note that we retrieve the set of answered queries from the butler.participants collection. Our decision to return 0 if we run out of targets is arbitrary. It is up to the developer to decide how to handle that.
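The retry loop in getQuery can take many iterations when a participant has answered almost every target. As an alternative sketch (not the code used in NEXT), the unasked indices can be listed first and one drawn from that list. The helper name choose_unasked and its fallback parameter are hypothetical, and the standard library random module is used here only to keep the example dependency-free.

```python
import random


def choose_unasked(n, asked_queries, fallback=0, rng=random):
    """Return a random index in range(n) that is not in asked_queries.

    Returns `fallback` when every target has already been asked,
    mirroring the arbitrary choice of 0 in getQuery.
    """
    asked = set(asked_queries)
    remaining = [i for i in range(n) if i not in asked]
    if not remaining:
        return fallback
    return rng.choice(remaining)


only_option = choose_unasked(3, [0, 2])   # index 1 is the only unasked target
exhausted = choose_unasked(2, [0, 1])     # nothing left, so the fallback is used
```

This trades the unbounded loop for a pass over all n indices on every call, which may or may not be preferable depending on n.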

Note. As we discuss in more depth in widgets, the participant_uid is not associated with a person but rather with a browser session, so refreshing a query page will assign a new participant_uid.

processAnswer:

    def processAnswer(self, butler, target_index, target_label):
        # S maintains a list of labelled items. Appending to S will create it.
        butler.algorithms.append(key='S', value=(target_index, target_label))

        # Increment the number of reported answers by one.
        num_reported_answers = butler.algorithms.increment(key='num_reported_answers')

        # Run a model update job after every d answers
        d = butler.algorithms.get(key='d')
        if num_reported_answers % int(d) == 0:
            butler.job('full_embedding_update', {}, time_limit=30)
        return True

processAnswer appends the target index and its associated label to the S list, the algorithm's internal record of the answered queries. The algorithm could also access the set of queries by calling butler.queries, but this is a much slower operation than pulling S, so we recommend against it. The number of reported answers for this algorithm is also incremented.

Finally, an asynchronous job is given to the butler to run every d steps. In our case, the job is a full_embedding_update which uses a least squares model to update our weights. In the case where we have a lot of targets, and a large set of answered queries, least squares may be very slow and it is best not to leave the user waiting for the response from processAnswer. Instead the model will be updated in the background by the butler and the result can be retrieved and used later.

The downside of this approach is that the weights may be out of date at any given time if the model has not fully updated. This can lead to "stale" queries in algorithms which use the weights to generate active queries, i.e. queries which have not been generated by the most up to date information. It is up to the application/algorithm developer to manage this tradeoff. We address this issue more carefully in our NIPS paper on NEXT.
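To make the "every d answers" rule concrete, the sketch below lists the answer counts at which processAnswer would queue a full_embedding_update job. The helper update_points is a hypothetical name used only for illustration; it applies the same modulo test as processAnswer and does not talk to the butler.

```python
def update_points(num_answers, d):
    """Answer counts (1-based) at which an update job would be queued."""
    return [k for k in range(1, num_answers + 1) if k % int(d) == 0]


# With d = 3, ten answers queue an update after the 3rd, 6th and 9th answer.
queued_at = update_points(10, 3)
```

Between two consecutive update points, queries are generated from weights that do not yet reflect the most recent answers, which is the staleness described above.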

full_embedding_update:

    def full_embedding_update(self, butler, args):
        # Main function to update the model.
        labelled_items = butler.algorithms.get(key='S')
        # Get the list of targets.
        targets = butler.targets.get_targetset(butler.exp_uid)
        # Make sure the targets are sorted by id
        targets = sorted(targets, key=lambda x: x['target_id'])
        # Extract the features from each target and then append a bias feature.
        target_features = [targets[i]['meta']['features'] for i in range(len(targets))]
        for feature_vector in target_features:
            feature_vector.append(1.)
        # Build a list of feature vectors and associated labels.
        X = []
        y = []
        for index, label in labelled_items:
            X.append(target_features[index])
            y.append(label)
        # Convert to numpy arrays and use numpy.linalg.lstsq to find the weights.
        X = numpy.array(X)
        y = numpy.array(y)
        weights = numpy.linalg.lstsq(X, y)[0]
        # Save the weights under the key 'weights'.
        butler.algorithms.set(key='weights', value=weights.tolist())

The embedding update code first pulls the full set of queries asked and answered by this algorithm. The butler.targets collection is then used to extract the associated feature vectors, which are then aggregated into a matrix X. The labels are similarly aggregated into a vector y, and then numpy's least squares routine is used to compute the weights, which are stored under the key weights.
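The least squares step can be tried in isolation on toy data. The sketch below uses made-up features and labels that lie exactly on the line y = 2x - 1, so the recovered weights should be close to [2, -1]. It copies the feature vectors rather than appending to them in place, and it passes rcond=None, which assumes NumPy 1.14 or later.

```python
import numpy

# Toy data (made up for illustration): one feature per target.
target_features = [[0.0], [1.0], [2.0], [3.0]]
# (target_index, label) pairs, in the same shape as the S list.
labelled_items = [(0, -1.0), (2, 3.0), (3, 5.0)]

# Append the bias feature, as full_embedding_update does,
# but build new lists instead of mutating the originals.
with_bias = [features + [1.0] for features in target_features]

# Stack the labelled rows into X and the labels into y.
X = numpy.array([with_bias[index] for index, _ in labelled_items])
y = numpy.array([label for _, label in labelled_items])

# lstsq returns (solution, residuals, rank, singular values); keep the solution.
weights = numpy.linalg.lstsq(X, y, rcond=None)[0]
```

The last entry of weights is the bias term, because the constant feature was appended at the end of each feature vector.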

getModel:

    def getModel(self, butler):
        # The model is simply the vector of weights and a record of the number of reported answers.
        return butler.algorithms.get(key=['weights', 'num_reported_answers'])

getModel is intended to return the data that comprises the classifier. In this case, that is simply the list of weights and the number of reported answers for this algorithm.

Note: Extra parameters not specified in Algs.yaml

The argument checking we use (described here) supports many types (dict, str, num, etc.). It also supports the types "any", "anything" and "stuff", which all mean an arbitrary type. When initializing the experiment, you can add a parameter of type "anything" to Algs.yaml and then include it when launching the experiment in an initExp['args']['alg_list'] item.

Of course, a parameter might not be changed by the user of your algorithm and only by you, the algorithm developer. It's up to you how you want to do this; more defaults in the appropriate functions might be a good call (globals probably aren't a good call).
