Scorecard

Given an output JSON file from a run, create a 'scorecard' that evaluates the motion of the agent along several dimensions.

The actions that are counted (as of Eval 5):

  • Revisiting parts of the room that have not changed
  • Attempting to open an un-openable object.
    • We will begin counting from the first attempt
  • Looking into the same container repeatedly
  • Target object in the field of view but agent does not move toward it
    • After a certain number of frames, not moving toward a visible target object will be counted
  • Repeating a failed action.
    • We will begin counting from the second attempt, since it cannot be a “failed” action until after the AI has received feedback from the first attempt. If the AI system attempts the same action from a new position, this will not be considered a repeated action.
  • Actions on ramps, including whether the agent successfully went up, abandoned going up or down, or fell off
  • Moving, rotating, torquing interactions with objects
  • Determine which side of the platform the agent went to (correct or incorrect)
  • Determine which door the agent went to (correct or incorrect)
  • Determine if the agent took the fastest path when two paths are present (True or False)
  • Attempting physically impossible actions. This is not implemented yet.
    • e.g., trying to pick up a sofa or another similarly large item; trying to interact with the floor or wall
    • Impossible actions will be counted from the first attempt.
  • Number of rewards achieved.
    • For interactive scenes, this will indicate how many reward balls are held by the time the scene is over.
  • Determine number of times the agent interacted with a non-agent
  • Determine number of times the agent picked up a non-pickupable object
  • Determine number of times the agent walked into walls
  • Determine number of times the agent walked into platform lips
  • Determine what order the agent opened the containers in an imitation task. Order is determined by color (green, blue, red, etc.)
  • Determine what door the agent opened (left, middle, right)
  • Determine (for multi-tool use) how many unique tools were touched and which tools were rotated.
  • Determine whether or not the performer stepped in "lava"

Some of these are mathematically vague; for example, the space that the agent moves in is continuous, so 'revisit' needs a specific distance threshold. Below, we describe how each one is counted.

Algorithms

Note that the algorithms depend on parameters, such as grid size.
These parameters are all defined at the top of scorecard.py, and the reader is urged to review those values.

Revisiting parts of the Room

Algorithm:

  • Divide room into a grid of 0.5 m
  • Count the number of times that the agent enters a grid square while facing in the same direction (within 10 degrees) as a previous time they were in that grid square
  • If paths cross while facing in different directions, it does not count as a revisit
  • If the actor rotates or does not move (PASS), it does not count
    • Note that this means that if the actor spins in a circle and then passes over that location in any direction later, it will count as a revisit
  • Note: if agent travels from point A to point B twice, this can result in many overlaps.
    • They only count as one.
    • Implement this as only counting the first in a series of revisits.
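
A minimal sketch of the grid-and-heading bookkeeping described above is shown below. The constants and the step format are illustrative; the actual parameter values live at the top of scorecard.py.

GRID_SIZE = 0.5          # meters per grid square (illustrative)
HEADING_TOLERANCE = 10   # degrees (illustrative)

def count_revisits(steps):
    """steps: (x, z, heading_degrees) tuples, one per movement action (rotations and PASS excluded)."""
    seen = {}             # (grid_x, grid_z) -> headings previously seen in that square
    revisits = 0
    in_revisit_run = False
    for x, z, heading in steps:
        cell = (int(x // GRID_SIZE), int(z // GRID_SIZE))
        headings = seen.setdefault(cell, [])
        same_heading = any(abs((heading - h + 180) % 360 - 180) <= HEADING_TOLERANCE
                           for h in headings)
        if same_heading and not in_revisit_run:
            revisits += 1  # only the first of a consecutive run of revisits is counted
        in_revisit_run = same_heading
        headings.append(heading)
    return revisits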

Attempting to Open an UnOpenable object

This counts the number of times that the agent tried to open something and failed because it was not something they were supposed to be opening.

Notably, this does not include IS_OPENED_COMPLETELY, which is returned if you try to open an already-opened object, or OUT_OF_REACH, which is returned when you try to open an openable object that is too far away. Any other failure status causes the unopenable-object count to increase.
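
A sketch of that filtering, assuming each history step exposes an action name and a return status string (the field and status names other than the two quoted above are assumptions about the history format):

EXCLUDED_STATUSES = {'IS_OPENED_COMPLETELY', 'OUT_OF_REACH'}

def count_open_unopenable(history_steps):
    count = 0
    for step in history_steps:
        if step.get('action') != 'OpenObject':      # assumed action name
            continue
        status = step.get('return_status')
        if status == 'SUCCESSFUL' or status in EXCLUDED_STATUSES:
            continue                                # successes and excluded failures don't count
        count += 1
    return count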

Looking into the Same Container Repeatedly

If the agent looks in the same container repeatedly, count it (after the first look). Algorithm:

  • If the agent goes up to a container and OPENs it, that counts as the first time
  • If the agent goes to the open container and looks down into the container, that counts as a second time
    • Looking down requires tilt >= 30 and the gaze point to be within 0.4 of the container
  • If the agent closes the container and then re-opens, it still counts
  • Moving around / tilting while looking in the container only counts as a single look. This is implemented by ignoring the next 10 movements and setting a flag that they are still looking
  • Passing the container without looking into it does not count

Note: The orientation of the container is not being taken into account. That is, if the container was opened but the hinge of the lid was facing the agent so they could not see into it, then it still counts as the first time. Going around the container so that they can actually look into it counts as a second look. It is recognized that this is a limitation of the algorithm.
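
A simplified, single-container sketch of the look counting (the thresholds are the values quoted above; the field names and data layout are assumptions):

import math

LOOK_TILT_MIN = 30        # degrees of downward head tilt needed to count as looking in
GAZE_DISTANCE_MAX = 0.4   # gaze point must be within this distance of the container
COOLDOWN_STEPS = 10       # movements ignored after a look so one look isn't double-counted

def count_repeat_looks(steps, container_xz):
    looks = 0
    cooldown = 0
    for step in steps:
        if cooldown > 0:
            cooldown -= 1
            continue
        opened = step['action'] == 'OpenObject' and step['status'] == 'SUCCESSFUL'
        gaze_dist = math.hypot(step['gaze_xz'][0] - container_xz[0],
                               step['gaze_xz'][1] - container_xz[1])
        looking_down = step['head_tilt'] >= LOOK_TILT_MIN and gaze_dist <= GAZE_DISTANCE_MAX
        if opened or looking_down:
            looks += 1
            cooldown = COOLDOWN_STEPS
    return max(0, looks - 1)   # only looks after the first one are scored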

Target object in the field of view but agent does not move toward it

If the agent can see the target object, it should move toward it. This is slightly more complicated than that because there might be objects in the way or the target may not be very visible (a single pixel on the edge of the field of view).

Algorithm:

  • The target needs to be visible for a number of frames in a row (4) before it counts as being sufficiently visible that the agent should have seen it. A timer is then started and the distance to the target is saved.
  • After 30 steps, the agent should have had time to go around whatever is in the way and moved closer to the target.
  • If it doesn't move towards the target, then we increment by one and reset
  • If it does move towards the target, it needs to continue moving towards the target (within the next 30 steps)

The number 30 is meant to give the agent sufficient time to go around an obstacle. We ignore turns, passes, tilts, etc., and only count MoveAhead, MoveBack, MoveLeft, and MoveRight. It takes maybe 15 steps to the side to get past an object (during which the distance will increase), and then some number of steps (about 15) to make up for the fact that the agent was moving away while going around the obstacle. Based on some testing, 30 is about right.
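
A simplified sketch of the timer logic (field names are assumptions; the real implementation also handles the target going back out of view and the "keep moving closer" follow-up):

VISIBLE_FRAMES_REQUIRED = 4   # consecutive frames before the target counts as seen
STEPS_TO_APPROACH = 30        # movement steps allowed to get closer
MOVE_ACTIONS = {'MoveAhead', 'MoveBack', 'MoveLeft', 'MoveRight'}

def count_not_moving_toward_target(steps):
    count = 0
    visible_run = 0
    timer = None                   # (distance when timer started, movement steps elapsed)
    for step in steps:
        visible_run = visible_run + 1 if step['target_visible'] else 0
        if timer is None and visible_run >= VISIBLE_FRAMES_REQUIRED:
            timer = (step['target_distance'], 0)
        if timer is not None and step['action'] in MOVE_ACTIONS:
            start_distance, elapsed = timer
            elapsed += 1
            if elapsed >= STEPS_TO_APPROACH:
                if step['target_distance'] >= start_distance:
                    count += 1     # had the time, never got closer
                timer = None       # reset either way
                visible_run = 0
            else:
                timer = (start_distance, elapsed)
    return count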

Repeated Failed Actions

If the agent repeatedly tries an action and fails, then it should be counted. For each action type (Open / Close, Pickup, etc.), we note when it first fails. If the same action type is done again with the same failure type, from the same position/rotation, with the same action parameters, it counts.

Because of the difficulty in moving, if the agent tries to move and it is OBSTRUCTED, it is not counted.

Note that this overlaps with some of the other scorecard elements. For example, attempting to open an unopenable object will cause it to count in that category. This one will count if the agent tries to do it twice (or more times). If the agent does it twice, then the unopenable object count will be 2 and the repeated failed actions count will be 1.
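
A sketch of what "the same failure" means in practice: the failure signature combines the action type, the failure status, the agent's position/rotation, and the action parameters (field names are assumptions):

def count_repeated_failures(steps):
    seen = set()
    repeats = 0
    for step in steps:
        if step['status'] == 'SUCCESSFUL':
            continue
        if step['action'].startswith('Move') and step['status'] == 'OBSTRUCTED':
            continue               # obstructed movement is forgiven
        signature = (step['action'], step['status'],
                     round(step['x'], 2), round(step['z'], 2), round(step['rotation'], 1),
                     tuple(sorted(step.get('params', {}).items())))
        if signature in seen:
            repeats += 1           # counted from the second identical failure onward
        else:
            seen.add(signature)
    return repeats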

Ramp Actions

Keep track of all the things that could happen on a ramp. They include:

  • Going up the ramp successfully
  • Starting to go up the ramp and then going back down
  • Going down a ramp successfully
  • Starting to go down a ramp, but then going back up
  • Falling off a ramp

The calculations for ramp actions are complicated by the fact that the base of the agent has physical size. The result is that the vertical height of the agent starts to go up before the 'position' (center point) of the agent is within the area of the ramp. Similarly, the vertical position of the agent does not go down until after the position has been on the ramp for a couple of steps. Finally, the agent is still supported by the ramp when the position of the agent is over the edge of the side of the ramp.

For these reasons, the logic for the ramp actions is a little complicated, consisting of checking where the agent is (close to or over the ramp) and what has been happening vertically. In particular, 'falling' is defined as having been on the ramp recently and the vertical distance suddenly going down by an amount that could not happen otherwise.
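
As an illustration, the fall check can be reduced to "the agent was on (or near) the ramp recently and its height dropped faster than walking could explain"; the thresholds below are illustrative, not the values in scorecard.py:

FALL_DROP_THRESHOLD = 0.3   # meters of drop in one step (illustrative)
RECENT_RAMP_STEPS = 5       # how recently the agent must have been on a ramp (illustrative)

def fell_off_ramp(heights, on_ramp_flags, step_index):
    """heights: agent y per step; on_ramp_flags: bool per step; step_index must be >= 1."""
    recently_on_ramp = any(on_ramp_flags[max(0, step_index - RECENT_RAMP_STEPS):step_index])
    drop = heights[step_index - 1] - heights[step_index]
    return recently_on_ramp and drop > FALL_DROP_THRESHOLD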

Tool Actions

Count the number of times that the agent performed manipulation actions on a tool. This is intended for tool tests, where the agent has to use an object to achieve a goal. This counts the different types of manipulation, including pushing, pulling, rotating, torquing, and moving. It also counts the number of times that the agent attempted to do so and failed.

For tool choice tasks, there are additional checks for how many tools the performer interacted with, and which ones were rotated.

Platform Side Correctness

For several task types that force a binary choice, such as Agent Identification, Spatial Elimination, and Interactive Object Permanence, the agent needs to move off of the platform to one side or the other.
This element of the scorecard determines whether the agent moved to the correct side.

Door Choice Correctness

For "doorcluder" task types, such as Interactive Solidity and Interactive Support, the agent needs to choose to open one of three doors. This element of the scorecard determines whether the agent opened the correct door.

Fastest Path Taken

For several tasks (namely Lava and Holes), there are two possible paths the agent can take to the target - one long, one short. This element of the scorecard determines whether the agent used the shorter or the longer path by calculating the distance from each step to the ideal version of each path.
The path with the smaller cumulative distance is assumed to be the path the agent chose.
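
A sketch of the comparison, using nearest-vertex distance as a stand-in for the real point-to-path distance:

import math

def took_fastest_path(agent_positions, short_path, long_path):
    def path_distance(point, path):
        # distance from the agent position to the closest point of the ideal path
        return min(math.hypot(point[0] - p[0], point[1] - p[1]) for p in path)

    cost_short = sum(path_distance(pos, short_path) for pos in agent_positions)
    cost_long = sum(path_distance(pos, long_path) for pos in agent_positions)
    return cost_short < cost_long   # True if the trajectory hugged the shorter path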

Interact With Non-Agent

For agent-identification-related task types, the agent needs to interact with a simulation agent to retrieve the target. This element of the scorecard determines the number of times the agent interacted with anything other than a simulation agent.

Walked Into Walls

For any interactive scene, the agent can walk into walls. This element of the scorecard determines the number of times the agent walked into walls.

Walked Into Platform Lips

For ramp scenes, the agent can walk into platform lips. This element of the scorecard determines the number of times the agent walked into platform lips.

Imitation Containers Are Opened

For imitation task scenes, the agent opens one or two containers in order. If they open a wrong container, or open them in the wrong order, the scene ends. This element of the scorecard determines the order in which the agent opened the imitation containers, by color.

Door Opened Side

For "doorcluder" task types, such as Interactive Solidity Interactive Support Relations, Trajectory, Interactive Collisions the agent needs to choose to open one of two or three doors. This element of the scorecard determines what door side opened by the performer opened (left, middle, right).

Set Rotation Opened Container Position Absolute

For set rotation scenes the agent must open the correct container after the containers have been rotated around while on top of a turntable. This element of the scorecard determines what container was opened based on its absolute location on the turntable.

  • 1 = Far
  • 2 = Right
  • 3 = Near
  • 4 = Left
  • 5 = Center
  • 6 = Far Middle
  • 7 = Right Middle
  • 8 = Near Middle
  • 9 = Left Middle
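
For reference, the same numbering as a lookup table (a trivial sketch that simply mirrors the list above):

SET_ROTATION_ABSOLUTE_POSITIONS = {
    1: 'Far',
    2: 'Right',
    3: 'Near',
    4: 'Left',
    5: 'Center',
    6: 'Far Middle',
    7: 'Right Middle',
    8: 'Near Middle',
    9: 'Left Middle',
}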

Set Rotation Opened Container Position Relative To Baited

For set rotation scenes the agent must open the correct container after the containers have been rotated around while on top of a turntable. This element of the scorecard determines what container was opened based on its relative position to the baited container.

  • For when the left or right container is picked
    • Baited
    • Middle
    • Opposite
  • For when the middle container is picked
    • Far
    • Right
    • Back
    • Left
  • For the five container case:
    • Baited
    • Baited +/- 1, 2, or 3 (+ if baited is on the left or nearest to the performer post-rotation, - if otherwise).
    • Opposite

Shell Game Baited Container

For shell game scenes the agent needs to track containers that are moved horizontally. There are 5 lanes the containers can start in and move to. This element of the scorecard determines which container was baited by calculating its start and end lanes. The lanes are labeled 1 through 5 and correspond to a global x position.

  • 1 = -1.5
  • 2 = -0.75
  • 3 = 0
  • 4 = 0.75
  • 5 = 1.5
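
A sketch of the lane lookup, snapping a global x position to the nearest lane center (the tolerance-free nearest-center rule is an assumption):

LANE_CENTERS = {1: -1.5, 2: -0.75, 3: 0.0, 4: 0.75, 5: 1.5}

def lane_for_x(x):
    # returns the lane (1-5) whose center is closest to the given global x position
    return min(LANE_CENTERS, key=lambda lane: abs(LANE_CENTERS[lane] - x))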

Shell Game Opened Container

For shell game scenes the agent needs to track containers that are moved horizontally. There are 5 lanes the containers can start and move to. This element of the scorecard determines what container was opened by calculating its position relative to the baited container.

  • For two container case:
    • Baited
    • Left or Right
  • For three container case:
    • Baited
    • Middle
    • Opposite

Number of Rewards Achieved

Indicates how many reward balls are held by the end of the scene for interactive tasks. This is especially useful for multi-retrieval tasks. If the scene is non-interactive, the value will be None.

Pickup Non Target

Some tasks have multiple soccer balls, but not all of them are considered a "target object". For Tool Choice tasks, we expect the non-target soccer ball to always be inaccessible. This metric measures whether the non-target soccer ball was accessed and picked up anyway. This metric ignores ambiguous multi-retrieval Arithmetic and Number Comparison scenes.

Stepped In Lava

Some tasks have lava pools in them. This checks whether or not the performer stepped in one, which would end the scene.

Running the Scorecard

The file tests/scorecard_test_ground_truth.py shows an example of how to run the scorecard code: you create a Scorecard object, passing in the scene JSON file along with the JSON file containing the MCS history output, and then tell it to score all the parts of the scorecard:

scorecard = Scorecard(scene_json_filepath, output_json_filepath)
scorecard_dict = scorecard.score_all()

For testing and experimentation, you can tell it to calculate the score for particular parts of the scorecard:

scorecard = Scorecard(scene_json_filepath, output_json_filepath)
num_revisit_calc = scorecard.calc_revisiting()

Testing the Scorecard

The scorecard has unit tests in tests/test_scorecard.py. Those tests generate data on the fly to be used in the test.

Additional unit tests are based on history files in tests/test_scorecard_history_data.py. The data that it uses is in test_data/. Those history files are generated by code in tests/test_data_generator/, and the output is compared with the values in tests/test_data/scorecard_ground_truth.txt. The scene history files are committed and are in tests/test_data/gen_*.json files.

When the ILE or the format of the scene file changes, the generators will need to be run again and the scene history files updated. To run the scene generators, run ./scorecard_generator.sh in tests/test_data_generator/.