\setcounter{chapter}{6}
\chapter{Embodiment in humanoid robots}
\label{c:embodiment}
In this third part we will apply the lexicon formation models from the
previous three chapters to real-world situated interactions of
autonomous robots. We will discuss mechanisms and representations for
conceptualization that allow the robots to link words to their visual
perceptions, and we will analyze what impact the additional
challenges and complexities coming from embodiment and
conceptualization have on the performance of these models. But in
order to do that, we will first dedicate one chapter to the perceptual
and social skills that we endowed the robots with for engaging in
grounded language games\footnote{Some parts of the first two
  sections of this chapter are taken from \cite*{loetzsch12grounding}
  and additionally appeared in shorter form in
  \cite*{spranger12perceptual}.}.
\begin{figure}[p]
\begin{tabular}{lr}
\multicolumn{2}{l}{\includegraphics[width=1\textwidth]{figures/photo-qrio-1}} \\
& \\
& \\
\includegraphics[width=0.48\textwidth]{figures/photo-qrio-3} &
\includegraphics[width=0.48\textwidth]{figures/photo-qrio-2} \\
& \\
\end{tabular}
\caption{The Sony humanoid robot.}
\label{f:photo-qrio}
\end{figure}
We used two ``Sony humanoid robots'' (\citealp{fujita03autonomous},
see Figure \ref{f:photo-qrio}) for all of our robotic
experiments. They are about 60 cm high, weigh approximately 7 kg and
have 38 degrees of freedom (4 in the head, 2 in the body, 5$\times$2
in the arms, 6$\times$2 in the legs and 5$\times$2 in the
fingers). The main sensors are three CCD cameras in the head, of which
we used only one here. The camera delivers up to 30 images per second, has
an opening angle of about 120$^\circ$ and a resolution of
176$\times$144 pixels. It uses the $YCrCb$ color space ($Y$: luma or
brightness, $Cr$: chroma red and $Cb$: chroma blue) with 8 bits per
channel. Furthermore, the robots have three accelerometers and gyro
sensors in the trunk and one accelerometer in each foot. The feet are
equipped with force feedback sensors to detect ground contact. The
batteries have enough capacity for about an hour of autonomous
operation.
We endowed the robots with a vision system for recognizing and
tracking objects in their environment. This system is explained in
Section \ref{s:vision-system}, and Section
\ref{s:joint-attention-and-social-skills} introduces a set of social
skills for engaging in language games that were programmed into the
robots. Finally, in Section \ref{s:doing-experiments-with-robots} we
describe the overall experimental setup, i.e. how higher-level
cognitive processes for conceptualization and language interact with
the sensori-motor capabilities of the robots, and we characterize some
properties of the sensory experiences that the robots construct.
\section{Visual object recognition and tracking}
\label{s:vision-system}
\begin{figure}[t]
\includegraphics[width=1\textwidth]{figures/vision-system-stages-overview}
\caption{Image processing steps for three subsequent points in
time. A: Source images provided by the camera of the robot. B:
Foreground/ background classification and motion detection (blue
rectangles). Foreground regions are then associated to existing
object models or become seeds for new object representations. C/D:
The changing histogram of the green-red channel for object
$o_{716}$ is used to track $o_{716}$ in space and time and thus to
create a persistent model of the object. E: Knowing the offset and
orientation of the camera relative to the body, the robots are
able to estimate the position and size of objects in the
world. Black arrows denote the positions of the two robots
perceiving the scene.}
\label{f:vision-system-overview}
\end{figure}
The environment of the robots consists of a variety of physical
objects such as toys, cones, barrels and cuboids (see Figure
\ref{f:object-sets}, page \pageref{f:object-sets}) that are initially
unknown to the robots. Objects are frequently added to the scene and
removed again. In addition, objects are moved within a scene and their
appearance may alter. For example the red block in Figure
\ref{f:vision-system-overview}A is standing up in the beginning and is
then put down, changing the perception of the object from being high
and thin to low and broad. In addition, perceiving objects is made
difficult by partial occlusions and other interfering factors such as
human experimenters manipulating the objects in front of the robots.
A prerequisite for building the internal cognitive structures needed
for communicating about objects is that the robots have mechanisms for
constructing perceptual representations of the objects in their
immediate surroundings from the raw sensations streaming from the
robots' sensors. Constructing such representations involves three
sub-systems: First, low-level vision routines process raw camera
images to yield basic \emph{percepts} -- connected regions that differ
from the background of the environment. Figure
\ref{f:vision-system-overview}B gives an example and the mechanisms
involved are explained in Section \ref{s:detecting-foreground-regions}
below. Second, these foreground regions are tracked in subsequent
camera images despite changing positions and appearances of the
objects. In order to do so, the vision system needs to establish a
correspondence between an internal \emph{object model} and the image
regions that refer to the same physical object, a process known in
robotics as \emph{anchoring}
\citep{coradeschi03anchoring,loutfy05maintaining}. For example as
illustrated in Figure \ref{f:vision-system-overview}D, the changing
raw sensations for the red block in Figure
\ref{f:vision-system-overview}A are continuously connected to the same
\emph{anchor} $o_{716}$. We used \emph{Kalman Filters} for maintaining
such persistent object models (Section \ref{s:object-models}). Third,
when needed in communicative interactions, the vision system encodes a
set of visual properties about each object model. In this particular
setup these properties are the object's position in a robot egocentric
reference system, an estimated width and height and color information,
as shown in Figure \ref{f:vision-system-overview}E. This process is
discussed further in Section \ref{s:computing-object-features}.
\subsection{Detecting foreground regions in images}
\label{s:detecting-foreground-regions}
The robots do not know in advance what kind of objects to expect in
their environment. Thus, everything that was not in the environment
before is considered to be a potential
object. The system therefore gathers statistical information about
the environment's background in a calibration phase
and image regions that sufficiently differ from the background are
treated as candidates for object models.
For generating a statistical model of the scene
background, the robots observe the experiment space without objects
for some time (see Figure \ref{f:photo-calibration-phase}) and
perceive a series of calibration images such as in Figure
\ref{f:object-perception}A. For all three color channels $c \in
\{Y,Cr,Cb\}$ the mean $\mu_{c,\vec{p}}$ and variance
$\sigma_{c,\vec{p}}^2$ of the image intensities at every image pixel
$\vec{p}$ are computed over all calibration images. After the
calibration phase the robots are presented with objects, resulting in
raw camera images such as in Figure \ref{f:object-perception}B. The
generated background statistics are used to classify all image pixels
as being foreground or background. A pixel is considered foreground
when the difference between the image intensity $i_c(\vec{p})$ and the
mean of that pixel is bigger than the pixel's standard deviation
($\mid i_c(\vec{p}) - \mu_{c,\vec{p}}\mid > \sigma_{c,\vec{p}}$) for
one of the color channels $c \in \{Y,Cr,Cb\}$. As a result, a binary
image as shown in Figure \ref{f:object-perception}C is generated with
all foreground pixels having the value of $1$ and all others $0$.
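The following Python/NumPy sketch (not taken from the actual
implementation; array shapes and names are assumptions) summarizes the
background calibration and the per-pixel foreground rule described
above:
\begin{verbatim}
import numpy as np

def background_statistics(calibration_images):
    # calibration_images: (n, height, width, 3) array of YCrCb images
    stack = np.asarray(calibration_images, dtype=np.float32)
    mu = stack.mean(axis=0)      # per-pixel, per-channel mean
    sigma = stack.std(axis=0)    # per-pixel, per-channel std deviation
    return mu, sigma

def classify_foreground(image, mu, sigma):
    # A pixel is foreground if |i_c(p) - mu_c(p)| > sigma_c(p)
    # for at least one of the channels Y, Cr, Cb.
    diff = np.abs(image.astype(np.float32) - mu)
    return (diff > sigma).any(axis=-1).astype(np.uint8)
\end{verbatim}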
\begin{figure}[t]
\includegraphics[width=0.65\textwidth]{figures/vision-system-photo-calibration-phase}
\caption{Calibration phase of the vision system. Both robots are
shown an empty environment for some extended period of time,
allowing them to observe the statistical characteristics of the
scene background.}
\label{f:photo-calibration-phase}
\end{figure}
\begin{figure}[t]
\centerline{\footnotesize\sffamily
\renewcommand{\arraystretch}{2.0}
\begin{tabular}{@{}l@{}p{0.025\textwidth}@{}l@{}p{0.025\textwidth}@{}l@{}p{0.025\textwidth}@{}l@{}p{0.025\textwidth}@{}l@{}}
A & & B & & C & & D \\
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-perception-1} & &
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-perception-2} & &
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-perception-3} & &
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-perception-4} \\
E & & F & & G & & H \\
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-perception-5} & &
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-perception-6} & &
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-perception-7} & &
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-perception-8} \\
\end{tabular}}
\caption{From foreground regions to object models. A: A raw camera
image taken during the calibration phase. B: A camera image of a
scene containing objects. C: The result of foreground/ background
classification. White pixels are foreground, green pixels were not
classified. D: The noise-reduced classification image. E: The
segmented foreground regions drawn in their average color and with
bounding boxes. Note that the partially overlapping blue and green
blocks in the right bottom of the original image are segmented
into the same foreground region. F: Classification of foreground
pixels using existing color models. Pixels are drawn in the
average color of the most similar object model. G: Bounding boxes
and average colors of the segmented classification image. Note
that the use of previous color models helped to generate separate
percepts for the blue and green blocks at the right bottom of the
image. H: Kalman filtered object models. The state bounding boxes
are drawn in the average color of the model. I: Computation of
position and size in a robot-egocentric reference system. The
width and height of objects is indicated by the width and height
of the triangles.}
\label{f:object-perception}
\end{figure}
This binary image is further noise-reduced using standard image
operators (dilation, erosion, see for example
\citealp{parker96algorithms,soille02morphological}) as illustrated in
Figure \ref{f:object-perception}D. First, noise is removed through
applying a $3\times 3$ erosion operator. Second, the change in size of
regions caused by the erosion operator is compensated by applying a
$3\times 3$ dilation operator. Then a segmentation algorithm scans the
filtered image and computes for all connected foreground pixels a
surrounding polygon, the bounding box, and color histograms of the
pixels contained in the region (for each color channel, from the
original image). Color histograms $M^c$ represent frequencies of image
intensities on the color channel $c$, computed either over complete
images or parts of them in the case of foreground regions. The whole
range of intensities is divided into $m$ bins $k~\in~\{1,\dots,m\}$ of
equal size. The number of pixels that have intensities falling into
each bin $M^c(k)$ is counted using a function $h(i_{c}(\vec{p}))$ that
assigns the intensity $i_c$ of a pixel $\vec{p}$ to a bin
$k$. Normalized histograms $\hat{M}^c(k)$ are computed from such
histograms by dividing each frequency $M^c(k)$ by the number of pixels
sampled, resulting in a representation where the sum of all
$\hat{M}^c(k)$ for $k~\in~\{1,\dots,m\}$ is equal to $1$, which allows
$\hat{M}^c(h(i_c(\vec{p})))$ to be interpreted as the probability of an image
intensity to occur in an image (or a sub-region). Figure
\ref{f:object-perception}E shows the estimated bounding boxes and
average colors extracted from the regions.
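A minimal Python/NumPy sketch of this histogram computation follows
(the number of bins is an illustrative choice, not a value taken from
the text):
\begin{verbatim}
import numpy as np

def normalized_histogram(intensities, m=16):
    # Divide the intensity range [0, 255] into m equally sized bins,
    # count the pixels per bin and divide by the number of samples,
    # so that the bin values sum to 1.
    counts, _ = np.histogram(intensities, bins=m, range=(0, 256))
    return counts / max(len(intensities), 1)

def region_histograms(image, region_mask, m=16):
    # One normalized histogram per color channel (Y, Cr, Cb), computed
    # over the pixels belonging to a segmented foreground region.
    return [normalized_histogram(image[..., c][region_mask], m)
            for c in range(3)]
\end{verbatim}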
Objects frequently occlude each other, due to particular spatial
placement, but also when moved around in the scene. For example the
green cube is partly overlapping the blue cuboid in the right bottom
of Figure \ref{f:object-perception}B and thus the segmentation
algorithm creates only one foreground region for both
objects. Provided that there is an established object model (see next
Section \ref{s:object-models}) for at least one of the objects, it is
possible to further divide such regions. Each pixel in a foreground
region is assigned to the most similar color model of previously
perceived objects as shown in Figure \ref{f:object-perception}F. Given
the normalized color histograms $M_I^c$ of all pixels in the current
image $I$ and $M_1^c,\dots,M_n^c$ of the $n$ previously established
object models, the likelihood $p_j$ of a pixel $\vec{p}$ in a
foreground region to belong to a color model $j$ can be calculated:
$$p_{j}(\vec{p})=M^Y_{j}(h(i_Y(\vec{p}))) \cdot M_{j}^{Cr}(h(i_{Cr}(\vec{p}))) \cdot M_{j}^{Cb}(h(i_{Cb}
(\vec{p})))$$
\noindent Based on this probability, each pixel is either classified
to belong to the model $j$ with the highest likelihood
$\operatorname{class}({\vec{p}})=\arg\max_{j=1..n}(p_{j}(\vec{p}))$
or, when the highest $p_{j}$ is smaller than a threshold $t$ or when
no previous model exists, to a ``no model'' class. Classified pixels
are again segmented into connected regions. As shown in Figures
\ref{f:object-perception}G and \ref{f:object-perception}H, the
initially connected foreground region for the blue and green objects
in the right bottom of the image could be divided into separate
regions due to the use of previous color models.
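The per-pixel classification against previously established color
models can be sketched as follows (Python/NumPy; the bin count and the
threshold $t$ are illustrative values):
\begin{verbatim}
import numpy as np

def classify_pixel(pixel, models, m=16, t=1e-6):
    # pixel:  (Y, Cr, Cb) intensities of a foreground pixel
    # models: per-object lists of three normalized histograms
    # Returns the index of the most likely color model, or None
    # ("no model" class) if the best likelihood is below t.
    bins = (np.asarray(pixel) * m // 256).astype(int).clip(0, m - 1)
    best_j, best_p = None, t
    for j, (m_y, m_cr, m_cb) in enumerate(models):
        p = m_y[bins[0]] * m_cr[bins[1]] * m_cb[bins[2]]
        if p > best_p:
            best_j, best_p = j, p
    return best_j
\end{verbatim}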
The resulting subdivided foreground regions are called
\emph{percepts}. They represent the result of the low-level image
processing mechanisms acting separately on each image without
incorporating past knowledge (except for the color information of
previous objects). A percept $P$ is defined as $P:=\langle
x_P,y_P,w_P,h_P,M_P^Y,M_P^{Cr},M_P^{Cb},n_P\rangle$ with $x_P,y_P$
describing the center of the percept's bounding rectangle in image
coordinates, $w_P$ and $h_P$ the width and height of the bounding
rectangle in pixels, $M_P^Y$, $M_P^{Cr}$ and $M_P^{Cb}$ the normalized
histograms for the three color channels and $n_P$ the number of pixels
contained in the region.\\
\noindent In order to improve the tracking algorithm described in the next
Section, we also implemented a component for identifying regions in
the image where motion has occurred. Image intensities
$i_{c,t}(\vec{p})$ at time $t$ are compared to those of the image taken
at time $t-1$. A pixel $\vec{p}$ is classified as being subject to motion
when the difference is bigger than the standard deviation
$\sigma_{c,\vec{p}}$ of this pixel's intensities calculated during the
calibration phase ($\mid i_{c,t}(\vec{p}) - i_{c,t-1}(\vec{p})\mid >
\sigma_{c,\vec{p}}$) for one of the color channels $c
\in\{Y,Cr,Cb\}$. The resulting classification image is noise-reduced
and segmented into regions of motion as shown in Figure
\ref{f:vision-system-overview}B. This information is used to loosen
the parameters for the association of percepts to object models. If
there is motion in a particular region of the image, then object
models are allowed to move and change color more drastically than if
there is no motion.
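A corresponding sketch of the motion classification rule, reusing the
per-pixel standard deviations from the calibration phase (Python/NumPy,
again an illustration rather than the actual implementation):
\begin{verbatim}
import numpy as np

def motion_mask(image_t, image_t_minus_1, sigma):
    # A pixel is subject to motion if the intensity change between two
    # consecutive frames exceeds that pixel's background standard
    # deviation on at least one of the channels Y, Cr, Cb.
    diff = np.abs(image_t.astype(np.float32)
                  - image_t_minus_1.astype(np.float32))
    return (diff > sigma).any(axis=-1)
\end{verbatim}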
\subsection{Maintaining persistent object models}
\label{s:object-models}
For maintaining a set of stable and persistent models of the objects
in their environment, the robots have to associate the percepts
extracted from each raw image to existing object models. Furthermore,
they have to create new models when new objects enter the scene and
eventually delete some models when objects disappear. This task is
difficult because objects can move and the detection of regions
through foreground/background separation is noisy and
unreliable. Extracted properties such as size or position may highly
vary from image to image and it can happen that objects are only
detected in some of the images streaming from the camera.
The internal object model $O_t$ of an object at time step $t$
(whenever a new camera image is processed) is defined as $O_t:=\langle
id_O,s_{O,t},\Sigma_{O,t},M^Y_{O,t},M^{Cr}_{O,t},M^{Cb}_{O,t}\rangle$,
with $id_{O}$ being a unique ID serving as an anchor for the object,
$s_{O,t}$ a state vector capturing spatial properties, $\Sigma_{O,t}$
the $8 \times 8$ state covariance matrix and $M_{O,t}^Y$,
$M_{O,t}^{Cr}$ and $M_{O,t}^{Cb}$ normalized color histograms. A state
vector $s$ is defined as $s_{O,t}:=\begin{pmatrix} x_{O,t} & y_{O,t} &
w_{O,t} & h_{O,t} & \dot{x}_{O,t} & \dot{y}_{O,t} & \dot{w}_{O,t} &
\dot{h}_{O,t}\end{pmatrix}^T$, with $x_{O,t},y_{O,t}$ describing the
center of the object in the image, $w_{O,t}$ and $h_{O,t}$ the
object's width and the height in pixels and $\dot{x}_{O,t}$,
$\dot{y}_{O,t}$, $\dot{w}_{O,t}$ and $\dot{h}_{O,t}$ the change
variables (speed of change in position and size).
We use Kalman Filters \citep{kalman60new} to model the spatial
component $s_{O,t}$ of object models. In every time step $t$ all Kalman
Filter states $s_{O,t-1}$ and $\Sigma_{O,t-1}$ of the last time step $t-1$
are used to \emph{predict} a new a priori state $\overline{s}_{O,t}$ and a
state covariance matrix $\overline{\Sigma}_{O,t}$ given the $8\times8$
state transition matrix $A$ and the process noise covariance matrix
$Q$:
\begin{eqnarray*}
\overline{s}_{O,t}&:=&As_{O,t-1} \\
\overline{\Sigma}_{O,t}&:=& A \Sigma_{O,t-1} A^T+Q
\end{eqnarray*}
We found it sufficient to use a constant state transition matrix $A$,
which predicts every dimension via its change variable,
and a constant process noise covariance matrix $Q=10^{-5}\cdot I_8$.
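A minimal sketch of the prediction step in Python/NumPy, assuming the
state layout $(x, y, w, h, \dot{x}, \dot{y}, \dot{w}, \dot{h})$ given
above:
\begin{verbatim}
import numpy as np

# Constant transition matrix: each dimension is predicted via its
# change variable, e.g. x_t = x_{t-1} + dx_{t-1}.
A = np.eye(8)
A[0:4, 4:8] = np.eye(4)
Q = 1e-5 * np.eye(8)   # constant process noise covariance

def predict(s, Sigma):
    s_prior = A @ s                    # a priori state
    Sigma_prior = A @ Sigma @ A.T + Q  # a priori state covariance
    return s_prior, Sigma_prior
\end{verbatim}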
Next, attempts are made to associate percepts to existing models.
Since the position, dimension and color of objects change over time,
no a priori known invariant properties of objects make it possible to decide
which percept belongs to which model. Instead, a similarity score
$\hat{s}$ based on position and color is used. The score reflects a
set of assumptions and heuristics that are based on intuitive
notions of how objects behave, so that experimenters can change the
scene without having to adjust to particular properties of the vision
system. First, it is assumed that an object cannot randomly jump in
the image or disappear at one point in space and appear at
another. Consequently, a spatial similarity $\hat{s}_{euclid}$ can be
defined using the Euclidean distance between the center of a percept
$P$ and the predicted position $\overline{x}_{O,t},\overline{y}_{O,t}$
of a model $O$
$$
\hat{s}_{euclid}(P,O):=1 - \frac{\sqrt{(x_P-\overline{x}_{O,t})^2+(y_P-\overline{y}_{O,t})^2}}{l}
$$
\noindent with $l$ being the length of the image diagonal in
pixels. The result of $\hat{s}_{euclid}$ is $1$ when the two points
are identical and $0$ when they are in opposite corners of the image.
Since objects are assumed to move in a predictable fashion, a
threshold $t_{space}$ restricts the radius around a model in which
percepts are associated -- the spatial association score
$\hat{s}_{space}$ equals $\hat{s}_{euclid}$ when it is bigger than
$t_{space}$ and $0$ otherwise. Second, it is assumed that objects do
not change their color in a random fashion. An object's color
histogram that has a very high value in a certain bin will not have a
zero value in that bin in the next image. Percepts and object models
can thus be compared using a color similarity $\hat{s}_{color}$. It is
based on the Bhattacharyya coefficient $BC$
\citep{bhattacharyya43measure,aherne98bhattacharyya} that is used as a
similarity measure between two normalized histograms $M$ and $M'$:
$$BC(M,M'):=\sum_{k=1}^{m}\sqrt{M(k) \cdot M'(k)}$$
\noindent Using the color histograms $M_P^c$ of a percept $P$ and the
histograms $M_{O,t-1}^c$ of a previous model $O$, a similarity measure
combining all three color channels is defined as:
$$
\hat{s}_{Bhatt}(P,O) := \prod_{c \in \{Y,Cr,Cb\}} BC(M^c_P,M^c_{O,t-1})
$$
\noindent The association score $\hat{s}_{color}(P,O)$ then yields the
result from the above measure when it is bigger than a threshold
$t_{color}$ or $0$ otherwise. In order to allow more rapid changes in
space and color when objects move, the two association thresholds
$t_{space}$ and $t_{color}$ are loosened when motion has been detected
within the area spanned by a state.
The overall similarity score between a particular percept and an
existing object model is then defined as:
$$\hat{s}(P,O)= \hat{s}_{space}(P,O) \cdot \hat{s}_{color}(P,O)$$
\noindent Each percept is associated with the internal state that has
the highest non-zero association score $\hat{s}$ with respect to that
percept. If no such state exists (when either the spatial or color
similarity is below the threshold), then the percept is stored in a
list of unassociated percepts.
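The association score can be sketched as follows (Python/NumPy; the
percept and model are represented as simple dictionaries and the
threshold values are illustrative, not the ones used in the actual
system):
\begin{verbatim}
import numpy as np

def bhattacharyya(m1, m2):
    # Similarity between two normalized histograms (1 = identical).
    return np.sum(np.sqrt(m1 * m2))

def association_score(percept, model, l, t_space=0.8, t_color=0.5):
    # percept: {"x", "y", "hists"}; model: predicted position and
    # color histograms of an existing object model.
    d = np.hypot(percept["x"] - model["x"], percept["y"] - model["y"])
    s_euclid = 1.0 - d / l
    s_space = s_euclid if s_euclid > t_space else 0.0
    s_bhatt = np.prod([bhattacharyya(hp, hm) for hp, hm
                       in zip(percept["hists"], model["hists"])])
    s_color = s_bhatt if s_bhatt > t_color else 0.0
    return s_space * s_color
\end{verbatim}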
\begin{figure}[t]
\parbox{0.486\textwidth}{%
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-modeling-1}%
\hspace{0.025\textwidth}%
\includegraphics[width=0.23\textwidth]{figures/vision-system-object-modeling-2}}
\caption{Kalman filtered object models. The state bounding boxes are
drawn in the average color of the model and the state covariance
is visualized with the thin cross in the center of each model.}
\label{f:object-modeling}
\end{figure}
The Kalman Filter states are \emph{updated} given the associated
percepts, which are beforehand combined into a single percept.
Percepts are combined by computing a bounding polygon and a histogram
representing the color frequency in the combined region. Using the
predicted a priori state vector $\overline{s}_{O,t}$ and state
covariance $\overline{\Sigma}_{O,t}$ as well as the spatial components
$p$ of the combined percept $p:=\begin{pmatrix}x_P&y_P
&w_P&h_P\end{pmatrix}^T$, the a posteriori state $s_{t}$ and the a
posteriori state covariance matrix $\Sigma_{O,t}$ are computed
\begin{eqnarray*}
K_{O,t}&=&\overline{\Sigma}_{O,t} H^T (H\overline{\Sigma}_{O,t}H^T+R)^{-1}\\
s_{O,t}&=&\overline{s}_{O,t}+K_{O,t}(p-H\overline{s}_{O,t})\\
\Sigma_{O,t}&=&(I-K_{O,t}H)\overline{\Sigma}_{O,t}
\end{eqnarray*}
\noindent with $R$ as the constant $4\times4$ measurement covariance
matrix (with $R=10^{-1}\cdot I_4$) and $H$ a constant $4\times8$ matrix
relating the state space to the measurement space (with $h_{i,j}=1$
for all $i=j$ and $0$ for all others). In principle $H$ and $R$ are
allowed to change over time, but the above estimates resulted in
sufficient tracking performance. Additionally, the color histograms
of a model $O$ are updated using
$$M^c_{O,t}(k):=(1-\alpha) M^c_{O,t-1}(k)+\alpha M^c_{P}(k)$$
\noindent for all color channels $c\in\{Y,Cr,Cb\}$, all histogram bins
$k\in\{1,\dots,m\}$ and with $\alpha \in [0,1]$ being the influence
of the combined percept.
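A corresponding sketch of the update step (Python/NumPy; the value of
$\alpha$ is illustrative):
\begin{verbatim}
import numpy as np

H = np.hstack([np.eye(4), np.zeros((4, 4))])  # state -> measurement
R = 1e-1 * np.eye(4)                          # measurement covariance

def update(s_prior, Sigma_prior, p, hists_prior, hists_percept,
           alpha=0.1):
    # p: measured (x, y, w, h) of the combined percept
    K = Sigma_prior @ H.T @ np.linalg.inv(H @ Sigma_prior @ H.T + R)
    s_post = s_prior + K @ (p - H @ s_prior)
    Sigma_post = (np.eye(8) - K @ H) @ Sigma_prior
    # Blend the color histograms of model and combined percept.
    hists_post = [(1 - alpha) * ho + alpha * hp
                  for ho, hp in zip(hists_prior, hists_percept)]
    return s_post, Sigma_post, hists_post
\end{verbatim}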
New object models are created from unassociated percepts. All
unassociated percepts lying in the same foreground region are combined
and used as a seed for a new model which is assigned a new
unique ID. In order to avoid creating models from percepts generated
for body parts of the experimenter, new models are only created when
no motion was detected. Models that have not been associated
with percepts for some time are deleted. This mainly happens when
objects disappear from the scene and consequently no percepts are
associated with them. As a result of the modeling process, Figure
\ref{f:object-modeling} shows the object models at the time when the
percepts in Figure \ref{f:object-perception} were generated.
\subsection{Computing object features}
\label{s:computing-object-features}
From each object model, a set of features such as color, position and
size is extracted. These feature vectors are called \emph{sensory
experiences} and are used by the agents to construct the different
conceptual entities needed for engaging in the different kinds of
language games introduced in this thesis.
\begin{figure}[t]
\includegraphics[width=0.75\textwidth]{figures/vision-system-coordinate-systems}
\caption{Computation of object positions on the ground plane, size
estimation and the involved coordinate systems. Note that all
systems except the image coordinate system are three
dimensional. }
\label{f:vision-system-coordinate-systems}
\end{figure}
The two robots can perceive the environment from arbitrary angles,
which makes the position and size of objects in the camera image bad
features for communicating about objects. For example the width of an
object in the image depends on how far away the object is from the
robot and is thus not at all shared by the robots. In order to be
independent of how objects are projected onto camera images, spatial
features are computed in an egocentric coordinate system relative to
the robot. However, without the use of stereo vision or a priori known
object sizes, positions cannot be determined solely from camera
images. But given the reasonable assumption that objects are located
on the ground, they can be calculated by geometrically projecting
image pixels onto the ground plane using the offset and rotation of
the camera relative to the robot as shown in Figure
\ref{f:vision-system-coordinate-systems}. The egocentric robot
coordinate system originates between the two feet of the robot, the
$z$ axis is perpendicular to the ground and the $x$ axis runs along
the sagittal and the $y$ axis along the coronal plane. First, a
virtual image projection plane orthogonal to the optical axis of the
camera is used to relate image pixels in the two-dimensional image
coordinate system to the three-dimensional camera coordinate system
(which has its origin in the optical center of the camera, with the
$x$ axis running along the optical axis and the $y$ and $z$ axes
being parallel to the virtual image plane). Given the camera
resolution width and height $r_{w}$ and $r_{h}$ (in pixels) as well as
the horizontal and vertical camera opening angles $\phi_{h}$ and
$\phi_{v}$, the $x_i$ and $y_i$ coordinates of an image pixel can be
transformed into a vector $\vec{v}_c$ in the camera coordinate system
$$
\vec{v}_{c}= \begin{pmatrix}1\\
-\frac{x_i}{r_{w}} \cdot \tan\frac{\phi_{h}}{2}\\
\frac{y_i}{r_{h}} \cdot \tan\frac{\phi_{v}}{2}\\
\end{pmatrix}
$$
\noindent that ``points'' to the pixel on the virtual projection
plane. Given the orientation of the camera relative to the robot
represented by the $3\times3$ rotation matrix $R_{c}$, a vector
$\vec{v}_c$ can be rotated into a vector $\vec{v}_t$ in the camera
translated coordinate system (which originates in the center of the
camera, with the axes being parallel to the robot coordinate system)
with $\vec{v}_{t}=R_{c} \cdot \vec{v}_{c}$. Furthermore, given the
offset from the origin of the robot coordinate system to the center of
the camera $\vec{t}_{c}$, the position of a pixel projected onto the
ground plane $\vec{v}_r$ in the egocentric robot coordinate system can
be computed by intersecting the ray $\vec{v}_{t}$ with the ground
plane using simple geometric triangulation: The equation
$$
\vec{v}_r= a \cdot \vec{v}_t + \vec{t}_c
$$
\noindent with the unknown scalar $a$ has exactly one solution for
$x_r$ and $y_r$ when the pixel designated by $\vec{v}_t$ lies below
the horizon. The operating system of the Sony humanoid readily
provides estimates for $R_{c}$ and $\vec{t}_{c}$ that are computed
from joint sensor values.
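The projection of a pixel onto the ground plane can be sketched as
follows (Python/NumPy; pixel coordinates are assumed to be given
relative to the image center, as in the formula above, and $R_{c}$ and
$\vec{t}_{c}$ are the camera rotation and offset provided by the
robot's operating system):
\begin{verbatim}
import numpy as np

def pixel_to_ground(x_i, y_i, r_w, r_h, phi_h, phi_v, R_c, t_c):
    # Ray through the pixel on the virtual projection plane (camera
    # coordinate system), rotated into the camera translated system
    # and intersected with the ground plane z = 0.
    v_c = np.array([1.0,
                    -(x_i / r_w) * np.tan(phi_h / 2.0),
                    (y_i / r_h) * np.tan(phi_v / 2.0)])
    v_t = R_c @ v_c
    if v_t[2] >= 0.0:
        return None              # pixel lies above the horizon
    a = -t_c[2] / v_t[2]         # solves a * v_t[2] + t_c[2] = 0
    v_r = a * v_t + t_c
    return v_r[0], v_r[1]        # x, y in the robot coordinate system
\end{verbatim}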
Using these transformations, the position features {\tt x} and {\tt y}
(in mm) are extracted from an object model by projecting the pixel at
the center of the lower edge of the object's bounding box onto the
ground plane. For estimating a {\tt width} feature, the lower left and
right corners of the bounding box are transformed into positions
relative to the robot and the distance between them is calculated. For
the computation of {\tt height}, the ray of the pixel on the middle of
the upper bounding box edge is intersected with a virtual plane
perpendicular to the ground and passing through the position of the object as
shown in Figure \ref{f:vision-system-coordinate-systems}. The
extraction of color features from object models is also
straightforward. The feature {\tt luminance} is computed as the mean
of an internal state's color histogram $M_t^{Y}$, {\tt green-red} as
the mean of $M_t^{Cr}$ and {\tt yellow-blue} as the mean of
$M_t^{Cb}$. % Similarily, the {\tt stdev-luminance}, {\tt
% stdev-green-red} and {\tt stdev-yellow-blue} features are the
% standard deviations of the color histograms $M_t^c$ and express how
% uniform the image pixels perceived for an object are.
\begin{figure}[t]
\parbox{0.75\columnwidth}{%
\vspace{-3mm}\hspace{3mm}%
\includegraphics[width=0.75\columnwidth]{figures/vision-system-scaling}}
\caption{Scaling of feature values. The distribution of the 'height'
feature sampled over all objects of the geometric objects data set
(see Section \ref{s:recording-data-sets} on page
\pageref{s:recording-data-sets}) is used to define an interval
$\mathsf{[\mu-2\sigma,\mu+2\sigma]}$ for scaling feature values
into the interval $\mathsf{[0,1]}$.}
\label{f:vision-system-scaling}
\end{figure}
\begin{figure}[t]
\input{figures/vision-system-example-scene}
\caption{Snapshots of the sensory experiences of both robots at the
end of the image sequence in Figure
\ref{f:vision-system-overview}. Top: The camera images at that
point in time are overlaid with the object anchors maintained by
the tracking system. Left of them, the positions of objects and
other robots in the egocentric reference system of each robot are
shown. Each object is drawn as a circle in its average color, with
the radius representing the object's width. The positions of the
two robots (see Section \ref{s:extimating-robot-position} below)
are indicated using black arrows. Bottom: The actual feature
values are shown in each first column and feature values scaled to
the interval $[0,1]$ in each second column. On the right side of
the table, the third columns give for each scaled feature the
difference between the perception of robot A and B.}
\label{f:vision-system-example-scene}
\end{figure}
The values of the {\tt x} and {\tt y} features are usually in the
range of meters, {\tt width} and {\tt height} can range from a few
centimeters up to half a meter and values on color channels are within
the interval $[0,255]$. In order to be able to handle all features
independently from the dimensions of their domains, feature values are
scaled to be within the interval $[0,1]$ using the statistical
distributions of feature values as illustrated in Figure
\ref{f:vision-system-scaling}. In theory the robots could gradually
build up such distributions by seeing many different objects over the
course of time; in practice the distributions are sampled from objects
of recorded data sets (see Section \ref{s:recording-data-sets}). Given
the mean $\mu$ and standard deviation $\sigma$ of the distribution of
a feature over a (large) number of objects, a scaled value is computed
%$$\overline{v}:=\min(1,\max(0,\frac{v-\mu}{4\cdot\sigma}+\frac{1}{2}))$$
by mapping values in the interval $\mathsf{[\mu-2\sigma,\mu+2\sigma]}$
onto $\mathsf{[0,1]}$ and clipping all others. Figure
\ref{f:vision-system-example-scene} gives an example of the sensory
experiences of the two robots. For each object, both the unscaled and
scaled feature values are given.
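A minimal sketch of the scaling function (Python), equivalent to
mapping $\mathsf{[\mu-2\sigma,\mu+2\sigma]}$ onto $\mathsf{[0,1]}$ and
clipping:
\begin{verbatim}
def scale_feature(v, mu, sigma):
    # Map the interval [mu - 2*sigma, mu + 2*sigma] linearly onto
    # [0, 1] and clip all values outside of it.
    return min(1.0, max(0.0, (v - mu) / (4.0 * sigma) + 0.5))
\end{verbatim}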
\subsection{Visual perception in humans and robots}
The psychological and neurobiological literature on vision contains a
lot of evidence for correlates of these three sub-systems in the human
brain. First, there are dedicated neural assemblies along the visual
stream from the retina to the primary visual cortex that detect basic
visual features on a number of separable dimensions such as color,
orientation, spatial frequency, brightness and direction of movement.
These \emph{early vision} processes operate independently from
attention to objects and features ``are registered early,
automatically, and in parallel across the visual field''
\citep[p. 98]{treisman80feature-integration}. From there on, two
separate visual pathways (also known as the ``what'' and ``where''
systems) are responsible for identifying objects and encoding
properties about them (see \citealp{mishkin83object} for an early
review): A dorsal stream (the ``where'' system) connecting the primary
visual cortex and the posterior parietal cortex is responsible for the
primitive individuation of visual objects, mainly based on spatial
features. ``Infants divide perceptual arrays into units that move as
connected wholes, that move separately from one another, and that tend
to maintain their size and shape over motion''
(\citealp{spelke90principles}, p. 29). These ``units'' can be
understood as ``pointers'' to sensory data about physical objects that
enable the brain for example to count or grasp objects without having
to encode their properties. They can be compared to the \emph{anchors}
mentioned above and are the subject of a large number of studies:
\cite{marr82vision} calls them \emph{place tokens},
\cite{pylyshyn01visual,pylyshyn89role} \emph{visual indexes},
\cite{ballard97deictic} \emph{deictic codes} and
\cite{hurford03neural} discusses them from an artificial intelligence
and linguistics perspective as \emph{deictic variables}. In a second
ventral stream (the ``what'' system) running to the infero-temporal
cortex, properties of objects are \emph{encoded} and temporarily stored in
\emph{working memory} \citep{baddeley86working-memory} for use
in other cognitive processes. What these properties are depends on
top-down attentional processes -- for example different aspects of objects
have to be encoded when a subject is asked to count the number of
``big objects'' vs. the number of ``chairs''.
In addition to findings from neuroscience, there is also a variety of
previous work in robotics to rely on. The most widely known setup for
grounding symbolic representations in visual data for the purpose of
communication is probably the Talking Heads experiment
(\citealp{steels98origins}, see \citealp{belpaeme98construction} for
details of the vision system). Static scenes consisting of geometric
shapes on a blackboard are perceived by robotic pan-tilt cameras and
the vision system is able to extract features such as color, size and
position from these shapes. \cite{siskind95grounding} describes a
computer program for creating hierarchical symbolic representations
for simple motion events from simulated video input and in
\citep{siskind01grounding} from real video sequences (see also
\citealp{baillie00action,steels03shared,dominey05learning} for very
similar systems and \citealp{chella00understanding,chella03anchoring}
for a comparable framework inspired by the \emph{conceptual spaces} of
\citealp{gardenfors00conceptual-spaces}).
Furthermore, there is a vast literature on object detection and
tracking algorithms for purposes other than symbol grounding (see
\citealp*{yilmaz06object-tracking} for an extensive review). And the
vision system introduced here does not reinvent the wheel but makes
use of well-established techniques such as color histograms and Kalman
filters. It differs, however, from many other approaches in the notion
of what is considered to be an object. The types of objects that are
expected to occur in the world are often explicitly represented in the
vision system, for example by using pre-specified color ranges for
identifying different object classes in images
(e.g. \citealp{perez02color-based}), by matching (sometimes learnt)
object templates with images (e.g. \citealp{hager98efficient}) or by
engineering dedicated algorithms tailored for recognizing specific
classes of objects (e.g. \citealp*{juengel04real-time}).
In contrast, our robots have no preconceptions of what to expect in
their environment and thus can detect and track any type of object,
using only two assumptions: First, everything appearing in the
environment that sufficiently distinguishes itself from the background
and that was not there before is considered to be an object. Second,
objects have to be on the ground so that reliable
position and size estimates can be made. Furthermore, what makes the approach
presented here quite special is the tight integration of visual
perception with other cognitive mechanisms such as social behavior
(see below), conceptualization and language (as discussed in the
next chapters).
\section{Joint attention \& mechanisms for social learning in robots}
\label{s:joint-attention-and-social-skills}
Robots learning a language are not only grounded in the physical world
through their sensorimotor apparatus but also socially grounded in
interactions with others. In addition to perceptual capabilities for
detecting and tracking objects in their environment they need a set of
social skills for engaging in communicative interactions with each
other. This includes mechanisms for joint attention and pointing as
well as behavioral scripts for structured conversations. Joint
attentional scenes \citep{tomasello95jointattention} ``are social
interactions in which the child and the adult are jointly attending to
some third thing, and to one another's attention to that third thing,
for some reasonably extended length of time''
\citep[p. 97]{tomasello99cultural}. In our robotic experiments, establishing
joint attention means that two robots taking part in a language
game must (1) share a physical environment, (2) attend to a set of
objects in their surroundings, (3) track whether the respective other
robot is able to attend to the same set of objects and (4) be able to
manipulate attention by pointing to distal objects and perceiving
these pointing gestures (see Figure \ref{f:qrio-pointing}).
\begin{figure}[t]
\parbox{0.7\textwidth}{ \includegraphics[width=0.7\textwidth]{figures/qrio-pointing-photo-of-scene}
\vspace{0.03\textwidth}\includegraphics[width=0.335\textwidth]{figures/qrio-pointing-camera-image-b}\hspace{0.03\textwidth}\includegraphics[width=0.335\textwidth]{figures/qrio-pointing-camera-image-a}
}
\caption{Demonstration of a Sony humanoid robot drawing the
attention of the other robot to an object in the shared
environment by pointing at it. The images at the right show the
scene as seen through the camera of the pointer (top) and the
robot observing the pointing (bottom). However, please note that
the robots are not able to detect pointing gestures using their
built-in cameras. Instead, they directly transmit $x,y$
coordinates of the object pointed at.}
\label{f:qrio-pointing}
\end{figure}
\subsection{Social robotics}
How social mechanisms can be implemented in robots is a research area
in its own right. Scientists in this field are mainly interested in how
social skills can improve communication and collaboration between
humans and robots \citep{breazeal02designing}. Additionally, by trying
to endow robots with social behaviors that appear ``natural'' to human
observers, they want to understand what social cues humans are
responding to. For reviews, refer to \cite*{dautenhahn02from} who
developed taxonomies for different degrees of robots' embodiment and
``social embeddedness'', \cite*{fong03survey} who give a general
survey of socially interactive robots, and \cite{vinciarelli09social}
who review the field of ``social signal processing'', i.e. the
detection of social cues in human behavior. For an overview of skills
that are prerequisites for joint attention and the state of the art in
robotic experiments trying to implement these skills, refer to
\cite{kaplan06challenges}. Some examples of work relevant for the
experiments in this thesis are listed below.
\cite{scassellati99imitation} endowed the ``Cog'' robot
\citep{brooks99cog} with capabilities for finding human faces,
extracting the location of the eye within the face, and determining if
the eye is looking at the robot for maintaining eye contact (or mutual
gaze). \cite*{marjanovic99self-taught} showed how the same robot could
learn to control its arm for pointing at distal objects in the
surrounding space, guided by the camera of the robot. Gaze
recognition was investigated, among many others, by
\cite{kozima01robot}, who demonstrated how the ``Infanoid'' robot is
able to track gaze direction in human faces and use this information
to identify objects that humans are looking at. Joint attention is
established by alternately looking at distal objects and the human's
face. \cite{nagai03constructive} modeled the transitions between
different developmental stages that infants are going through in the
process of learning to engage in joint attentional scenes, resulting
in the robot being able to determine which object a human caregiver is
looking at.
For recognizing pointing gestures performed by humans,
\cite*{kortenkamp96recognizing} developed a robot that can detect and
track the 3D positions of arm and shoulder joints of humans in dynamic
scenes, without requiring the humans to wear special markers. By
searching along the vector defined by the detected arm joints, the
robot can determine which object the experimenter was pointing
at. Similarly, \cite{martin09estimation} used pointing gestures to
instruct a mobile robot where to navigate to. \cite*{colombo03visual}
used multiple cameras for tracking humans pointing at areas on walls
in a room. \cite{nickel07visual} equipped a robot with stereo cameras
and used color and disparity information and Hidden Markov Models to
track both the direction of gaze and the position at which a human is
pointing. \cite{haasch05multi-modal} applied the ability to
recognize pointing gestures for teaching words for objects in a
domestic environment, and \cite*{imai03physical} showed how the robot
"Robovie" could combine mechanisms for establishing mutual gaze and
pointing at objects to draw the attention of humans to a poster in the
environment of the robot. Finally, \cite{hafner05learning}
demonstrated how recognition of pointing gestures could be learned in
Aibo robots. One robot performs a hard-wired pointing gesture and the
other one has to detect whether it was to the left or to the right.
Additionally there is considerable research into implementing and
learning the necessary behaviors for engaging in structured
conversations. \cite{breazeal03sociable} investigated turn taking
with the Kismet robot, focussing on the factors regulating the
exchange of speaking turns so that the communication seems natural to
human interlocutors. \cite{cassell99turntaking} discussed how
nonverbal gestures and gaze can support turn taking behaviors in
multimodal dialogs with the embodied conversational agent (ECA)
``Gandalf'', trying to replicate findings from psychological data. A bit
more on the theoretical side, \cite{iizuka03adaptive} followed a
Dynamic Systems approach to processes of cognition and
action \citep{thelen94dynamic} to understand turn-taking in wheeled
mobile robots in terms of the underlying dynamics of recurrent neural
networks. Recent work on communication with ECAs is reviewed by
\cite{kroeger09model} for the co-ordination of communicative bodily
actions across different modalities and by \cite{kopp10social} for the
alignment of communicative behaviors between interlocutors.
\subsection{Implementing language games in robots}
\label{s:scaffolding-social-skills}
Language games are coordinated by behavioral scripts (see Section
\ref{s:language-game}, page \pageref{s:language-game}). Every agent in
the population knows the language game script and individually reacts
to changes in the environment and actions of the other robot. For
example the speaker triggers the action of pointing to the intended
topic when the hearer signals that he did not understand the
utterance. The scripts are implemented in the form of finite-state
machines: actions are performed depending on the current state in the
game flow, the perception of the environment and the history of the
interaction (see also \citealp*{loetzsch06xabsl}).
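The following Python fragment sketches what one update step of such a
finite-state machine could look like for the hearer role; the state
names and events are illustrative and much simpler than the actual
game scripts:
\begin{verbatim}
def hearer_step(state, utterance_received, topic_found):
    # Choose the next state from the current state and the events
    # perceived in this cycle (a heavily simplified hearer script).
    if state == "wait-for-utterance" and utterance_received:
        return "interpret"
    if state == "interpret":
        # Point to the interpreted topic, or signal non-understanding
        # so that the speaker points to the intended topic.
        return "point-to-topic" if topic_found else "signal-failure"
    return state
\end{verbatim}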
Joint attention is monitored by an external computer program that has
access to the world models of both interacting robots. This system
initiates the interaction between two agents as soon as both agents
observe the same set of objects. It is the task of the human
experimenter to find spatial setups in which joint attention is
possible; the program only monitors whether the robots are seeing the same
set of objects. But in the literature there are also other proposals
for establishing joint attention in embodied language game
experiments. For example \cite{steels97grounding} programmed
sophisticated signaling protocols into LEGO robots. A robot that
decides to become a speaker emits an infrared signal and the other
robot then aligns its position so that it faces the speaker. The
robots ``point'' to objects by orienting themselves toward them. In
the Talking Heads experiment \citep{steels98origins}, the speaker
directly controls the view direction of the hearer's camera in order
to make sure that their cameras perceive the same objects on the
whiteboard. An agent points to an object by letting the other agent's
camera zoom in on it. In contrast, establishing joint attention in
social language learning scenarios between humans and robots is
usually easier because the human experimenter (as a well-trained
social being) is good at monitoring the attention of the robot and can
for example (as in \citealp{dominey05learning}) point to an object by
moving it.
For constructing a naming system robots need non-linguistic means of
conveying information, such as pointing to an object or conveying
notions of success, failure and agreement in communication. For
demonstration purposes robots were equipped with behaviors for
pointing at objects (see Figure \ref{f:qrio-pointing}). We used motion
teaching for creating a set of 18 pointing motions for different areas
in front of the robot. Depending on the $x,y$ coordinates of the object
to point at, the pointing routine selects and performs one of these
pre-taught motions.
Nevertheless, in the communicative interactions underlying the
experiments presented here, robots use a different mechanism in order
to avoid further difficulties stemming from uncertainties in pointing
(see \citealp{steels98stochasticity} for a discussion of the impact
of such uncertainties on the performance in language games). When a
robot wants to point to an object in the environment, he directly
transmits the $x_o,y_o$ coordinates of the intended object $o$ to the
interlocutor. Since robots model object positions in their own
(egocentric) coordinate systems, additional steps have to be taken to
interpret these coordinates. Most importantly, the robot has to know
the position $x_r,y_r$ and orientation $\theta_r$ of the pointing robot
$r$ (see Section \ref{s:extimating-robot-position} below for
details on how robots estimate these values). With this information
robots transform the coordinates into their own coordinate system:
$$
\vec{v}=\begin{pmatrix} \cos \theta_r & -\sin \theta_r \\ \sin
\theta_r & \cos \theta_r \end{pmatrix} \begin{pmatrix} x_o \\
y_o\end{pmatrix} + \begin{pmatrix} x_r \\ y_r\end{pmatrix}
$$
\noindent The robot interpreting the pointing determines the
intended object by choosing the object in his world model that is
closest to $\vec{v}$. Furthermore, although we implemented gestures
for giving non-linguistic communicative feedback (nodding the head for
success and shaking for failure) and we used the built-in speech
synthesizer of the Sony humanoid robots for producing utterances,
feedback signals (whose meaning is shared) and utterances are directly
passed between
interlocutors.\\
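This interpretation step can be sketched as follows (Python/NumPy; the
world model is represented here as a list of dictionaries, which is an
illustrative assumption):
\begin{verbatim}
import numpy as np

def interpret_pointing(x_o, y_o, x_r, y_r, theta_r, world_model):
    # Transform the transmitted coordinates (egocentric to the
    # pointing robot) into the interpreter's own coordinate system
    # and choose the closest object in the interpreter's world model.
    c, s = np.cos(theta_r), np.sin(theta_r)
    vx = c * x_o - s * y_o + x_r
    vy = s * x_o + c * y_o + y_r
    return min(world_model,
               key=lambda o: np.hypot(o["x"] - vx, o["y"] - vy))
\end{verbatim}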
\noindent The mechanisms presented in this Section provide simple solutions to
the capacities required for social language learning and are not meant to
be in themselves proposals as to how these skills could be
implemented. Nevertheless, we claim that the realism of this study
does not suffer from this simplicity: humans rely on extremely
powerful mechanisms for perceiving and sharing intentions within
interactive situations \citep{tomasello05understanding} and similarly
our solutions provide us with the technical prerequisites for letting
our robots learn from communicative interactions.
\subsection{Robot pose estimation}
\label{s:extimating-robot-position}
A requirement for the pointing mechanisms described above is a quite
precise estimate of the position and orientation of the other
robot. For that, robots localize themselves with respect to landmark
objects in the environment and transmit their position with respect to
these landmarks to the other robot. This way both agents establish
mutual knowledge about their position. We use carton boxes enhanced
with visual markers (see Figure \ref{f:artoolkit}) as landmark
objects. The unique, black and white, barcode-like, 2D-patterns
attached to carton boxes are tracked using the ARToolKitPlus library
\citep{wagner07artoolkitplus}, which is an improved version of
ARToolKit \citep{kato99marker,kato00virtual}, especially adapted for
mobile devices.
\begin{figure}[t]
\begin{tabular}{lr}
\multicolumn{2}{l}{
\includegraphics[width=0.6\columnwidth]{figures/artoolkit-0}\vspace{4mm}} \\
\includegraphics[width=0.28\columnwidth]{figures/artoolkit-1} &
\includegraphics[width=0.28\columnwidth]{figures/artoolkit-2}\vspace{4mm}\\
\includegraphics[width=0.28\columnwidth]{figures/artoolkit-3} &
\includegraphics[width=0.28\columnwidth]{figures/artoolkit-4}\\
\end{tabular}
\caption{Using objects enhanced with visual markers for estimating
the position and orientation of the other robot. Top: Example of a
carton box that is enhanced with 2D patterns. Center left: A
carton box with markers as seen through the camera of a Sony
humanoid robot. Center right: Binary image generated from the
original image. Bottom left: The marker as detected by the
ARToolKit tracking system. Bottom right: Both robots send the
position and orientation of the carton box (blue) to each other
and are thus able to deduce the position and orientation of the
respective other robot. }
\label{f:artoolkit}
\end{figure}
From each camera image, a histogram of the pixel luminance is
computed. This histogram is then used to derive a threshold for
creating a binary image as shown in the center right of Fig.
\ref{f:artoolkit}. The binary image is passed to the tracking library,
which searches it for marker patterns and determines the four vertices
of the polygon surrounding the marker in the image (see bottom left of
Fig. \ref{f:artoolkit}). Provided with the camera resolution width
and height (in pixels), the horizontal and vertical camera opening angles (in
degrees) and the widths of the markers used on the carton boxes (in mm),
the tracking library is able to make an orientation and position
estimate from the edges of the detected patterns, which is then
iteratively enhanced by matrix fitting. As a result, the system
returns for each detected marker pattern a unique ID and a matrix
describing the position and orientation of the marker relative to the
camera of the robot (for details of the pose estimation algorithm
see \citealt*{kato99marker}).
The camera-relative marker position and orientation are transformed into
robot-egocentric coordinates using the offset
and orientation of the camera relative to the ground point of the
robot (see Section \ref{s:computing-object-features}). Finally, for
each marker attached to a carton box, the offset and orientation
relative to the center of the box, which is a priori known, is used to
determine the position and orientation of the box in egocentric
coordinates. To filter out noise and recognition errors, the resulting
box poses are averaged over the last $n$ images. Also, when two
markers of the same box are detected in the same image, their
resulting box poses are averaged. The output of the landmark modeling
system is a list of objects consisting of an ID (the ID of the box, not
to be confused with the IDs of the marker patterns) and a pose
$\vec{b}:=\begin{pmatrix}x_b & y_b & \theta_b\end{pmatrix}$ of the
carton box in robot egocentric coordinates.
In order to determine the position $x_r,y_r$ and orientation
$\theta_r$ of the respective other robot, the robots use the carton
boxes as global landmarks (see bottom right of Fig.
\ref{f:artoolkit}). About five times per second they exchange the
poses of the boxes they have seen over a wireless network
connection. Given that both robots see the same box (all robots use
the same box IDs for the same visual markers), they can compute the
pose of the other robot from the box pose $\vec{b}$ as perceived by
the robot (in egocentric coordinates) and the $\vec{b}'$ as sent by
the other robot (in the coordinate system of the other robot):
$$
\begin{pmatrix} x_r \\ y_r \\ \theta_r \end{pmatrix}
:=
\begin{pmatrix}
x_b - \cos(\theta_b - \theta_b') \cdot x_b' + \sin(\theta_b - \theta_b') \cdot y_b' \\
y_b - \sin(\theta_b - \theta_b') \cdot x_b' - \cos(\theta_b - \theta_b') \cdot y_b' \\
\theta_b - \theta_b'
\end{pmatrix}
$$
\noindent When both robots see multiple boxes, the results of the above
transformation are averaged.
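A sketch of this pose computation in Python/NumPy, following the
transformation above:
\begin{verbatim}
import numpy as np

def other_robot_pose(box_own, box_other):
    # box_own:   (x_b, y_b, theta_b) box pose seen by this robot
    # box_other: box pose as sent by the other robot (its own frame)
    x_b, y_b, th_b = box_own
    x_o, y_o, th_o = box_other
    d = th_b - th_o                           # relative orientation
    x_r = x_b - np.cos(d) * x_o + np.sin(d) * y_o
    y_r = y_b - np.sin(d) * x_o - np.cos(d) * y_o
    return x_r, y_r, d
\end{verbatim}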
\section{Experimental setup}
\label{s:doing-experiments-with-robots}
Integrating all the mechanisms for visual perception and behavior
control into a complete setup for doing language game experiments is a