\setcounter{chapter}{1}
\chapter{Building blocks of situated communicative interactions}
\label{c:building-blocks}
Let us now introduce some of the building blocks that form the basis
of all experiments in this thesis. We will start with a detailed
characterization of the communicative interactions between agents and
how the agents can learn from them. Then we will discuss different
ways of representing linguistic knowledge, i.e. how word forms can be
connected to meanings and what impact the structure of this
association has on the complexity of the learning task. And finally,
we give an overview of how word meanings can be grounded in robots, i.e.
how persisting conceptual representations can be constructed by
robotic agents and how they co-evolve with language. For all of these
mechanisms and representational structures we will motivate their
underlying design choices from various perspectives. But we will not
give formal definitions yet and leave that to the description of the
actual experiments later in this thesis.
\section{Language games: the social context}
Following the assumption that communication is a social act in which a
speaker uses language to affect the mental states of a hearer (see
Section \ref{s:communication-as-a-social-act} above) and that a shared
language is constructed and shaped in repeated conversations (Section
\ref{s:language-as-a-complex-adaptive-system}), we will design all of
our experiments around one particular such type of interaction, called
a \emph{language game}. This term is commonly associated with
\cite{wittgenstein67philosophische}, who made an analogy between the
use of language in dialogue and playing a game (e.g. a ball-game; in
both cases there are sets of context-dependent rules for each
interaction step), and it is
\cite{steels95selforganizing,steels01language} who is recognized for
adapting Wittgenstein's concept of language games to the modeling of
communicative interactions between artificial agents.
\subsection{Distributed co-ordination in language games}
\label{s:language-game}
Language games are played by populations of autonomous \emph{agents}
that are modeled as software programs (utilizing standard agent-based
techniques of artificial intelligence, see
e.g. \citealp{wooldridge95intelligent,russel95artificial}). Each agent
maintains its own set of initially empty \emph{inventories}
(e.g. ontologies, lexicons, etc.) for memorizing acquired
knowledge. The agents have built-in mechanisms for using these
inventories to produce and interpret language in a rather automatic
way, \emph{diagnostics} and \emph{repair strategies} for detecting and
overcoming problems in their internal information processing and
\emph{alignment mechanisms} to adapt their inventories in order to
perform better in future interactions. The agents make their own
decisions solely based on internal goals and states, their perception
of the environment and their interaction with others -- i.e. there is
no central control, agents cannot directly affect the mental states of
others nor do they have access to others' mental states (there is no
telepathy) and no agent has an overview over the whole population.
The agents are situated in a \emph{world} to which they are connected
via sensors and actuators. The external goal that is given to the
agents is to communicate about things in the world. Thus, the
environment creates a \emph{communicative task} for the agents and
part of designing an experiment is defining what things in the world
will be presented to the agents. The world is usually not static,
i.e. the configuration of the scenes presented to the agents may
continuously change. As mentioned before, we will investigate models
of lexicon formation both in simulated worlds and with physical robots
in real environments. For our simulated environments we will not try to
set up virtual worlds in which simulated robots interact; instead we
completely scaffold all problems of perception and categorization by
generating pre-conceptualized scene descriptions that are directly
perceived by the agents. An example scene consisting of two objects
created by such a world generator could look like this:
\begin{verbatim}
green(obj-1), small(obj-1), square(obj-1), red(obj-2), small(obj-2), circle(obj-2)
\end{verbatim}
In contrast, in our experiments with physical environments, real
robots perceive actual objects through their cameras (see Chapter
\ref{c:embodiment}).
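Returning to the simulated case: purely for illustration (this is not
the world generator actually used in our experiments, and all names in
the sketch are hypothetical), such pre-conceptualized scene
descriptions could be generated along the following lines:
\begin{verbatim}
# Illustrative sketch of a world generator producing symbolic,
# pre-conceptualized scene descriptions (all names are made up).
import random

COLORS = ["red", "green", "blue"]
SIZES  = ["small", "large"]
SHAPES = ["square", "circle", "triangle"]

def generate_scene(n_objects=2):
    """Return predicate strings such as 'green(obj-1)'."""
    scene = []
    for i in range(1, n_objects + 1):
        obj = "obj-%d" % i
        for value in (random.choice(COLORS),
                      random.choice(SIZES),
                      random.choice(SHAPES)):
            scene.append("%s(%s)" % (value, obj))
    return scene

print(", ".join(generate_scene()))
# e.g. green(obj-1), small(obj-1), square(obj-1), red(obj-2), ...
\end{verbatim}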
A language game follows a strict script. That is, the agents conform
to routinized dialogue patterns that consist of distinct actions
applicable only to specific contexts and which constrain how to
interpret utterances. An example of such a routinized dialogue is the
procedure for running into a person that one knows: (in western
English-speaking cultures) it starts with a greeting phrase (``hi'',
``hello'', etc.), usually accompanied by eye contact and optionally
complemented by a handshake or other greeting gestures. Then the
chances are very high that one of the interlocutors will take
initiative and say ``How are you?'', a question which the other person
is not supposed to answer honestly but to reply with ``fine'',
``great'', etc., optionally followed by ``, and you?'' (which does not
need to be answered). Only after these compulsory steps can the two
persons start a real conversation. And it is not acceptable to end the
dialogue by simply walking away; the departure has to be announced (e.g. ``Well, I
have to leave.'') and concluded by a final phrase such as ``see you
later'', ``goodbye'', etc.
Another example is the routine for buying a train ticket at a counter.
After an optional greeting, the customer will utter his
request. Because the ticket seller already knows that the customer
will most likely want to buy a ticket, it is enough to say for
example: ``One ticket to London for tomorrow morning please'' (the
``please'' does not add any information but is compulsory). The seller
will then issue the ticket, if necessary asking for more details. When
the ticket gets printed, the seller will say a price (e.g. ``seventeen
pounds''), which functions (since the customer could also read the
price from the electronic display in front of him) as a request to
hand over the money. The interaction ends with both involved persons
thanking each other and optional greetings.
The type of game that we are going to use for our experiments is not
embedded in complex activities such as meeting another person on
the street or buying a train ticket. The underlying purpose of the
dialogue lies solely in the communication itself and in providing rich
opportunities for learning and alignment. The game is thus a rather
idealized interaction scenario with only one goal: drawing attention
to an object in the external environment. But it doesn't lack realism:
we will discuss below that children indeed learn many words from such
interactions and it is also very close to one of the games discussed
by \cite{wittgenstein67philosophische}, in which parents teach
children words by pointing at an object and uttering a name for it. A
situation in which somebody points at a thing (e.g. a cow) and tells
its name (e.g. ``cow'') with the purpose of teaching the word to a
child can be conceptualized as a game because in order for the child
to successfully learn the name for the object it has to know how the game
works, i.e. that the parent is saying something about the thing that
he is pointing at (it could also be that pointing at a cow and
uttering ``cow'' is an action that the parent performs in order to
make the cow go away or to get milk from it, but that's not the case
-- the game is about learning words and the child has to know this in
order to make sense of the action).
\begin{figure}[t!]
\centerline{\includegraphics[width=0.80\textwidth]{figures/guessing-game-flow}}
\caption{Flow of one language game. A speaker and a hearer follow a
routinized script. The speaker tries to draw the attention of the
hearer to a physical object in their shared environment. Both
agents are able to monitor whether they reached communicative
success and thus learn from the interaction by pointing to the
topic of the conversation and giving non-linguistic feedback.
Populations of agents gradually reach consensus about the meanings
of words by taking turns being speaker and hearer over thousands
of such games. }
\label{f:guessing-game-flow}
\end{figure}
Figure \ref{f:guessing-game-flow} shows a schematic view of the
language game that our agents are going to play (we will discuss each
of the mechanisms mentioned below in much more detail in the description
of the actual experiments -- here we will only outline the general
dialogue script that is shared by all experiments throughout this
thesis). Two agents are randomly drawn from the population and
together establish a \emph{joint attentional scene}
\citep{tomasello95jointattention} -- a situation in which both agents
attend to the same set of objects in the environment and in which both
agents know that the respective other agent is attending to the same
set of objects. Once such a state is reached, the game starts. One of
the agents is randomly assigned to take the role of the speaker and
the other the role of the hearer. Both agents then perceive a
\emph{sensory context} from the joint attentional scene and keep it in
their short-term memory (visual perception and joint attention with
real robots are enormously difficult and we will dedicate the whole of
Chapter \ref{c:embodiment} to them; in our experiments involving
simulated environments all these issues will be scaffolded and both
agents will perceive the same scene description that is generated by
the world generator mentioned above).
Next, the speaker randomly picks one object from his context to be the
\emph{topic} of the interaction -- his communicative goal will be to
draw the attention of the hearer to that object. For this he
constructs an utterance, which involves first coming up with a mental
representation of the meanings to express (\emph{conceptualization})
and then finding words that cover these meanings. When the speaker
does not have the necessary categories or words in his inventories, he
\emph{invents} them. Additionally, the speaker uses himself as a model
of the hearer and by listening to himself (\emph{re-entrance}), he
checks whether the words he came up with are clear and precise enough
to be understood (given his own inventories). Once the speaker is
satisfied with the constructed utterance, he speaks out the words to
the hearer. The hearer then \emph{parses} the utterance and tries to
find the object from his own perception of the scene that he believes
to be most probable given his interpreted meanings. He will then point
to that object and the speaker will either confirm that this was
indeed the object he intended to talk about (and signal
\emph{communicative success}) or he will point to his chosen topic
(and thus signal \emph{communicative failure}). It could also happen
that the hearer is confronted with a novel word or that his
interpretation doesn't match any of the objects in his context. In
this case, the hearer signals a communicative failure and the speaker
then also points to the object he intended. In both cases, the hearer
is able to learn from the interaction by \emph{adopting} the words
heard and associating them with the topic pointed at by the speaker
(and, if necessary, also inventing categories that are needed to
conceptualize the topic). Finally, at the end of each interaction both
agents \emph{adapt} their inventories based on the sensory context,
the topic, the words used and the outcome of the game in order to be
more successful in future interactions (\emph{alignment}). The
population of agents plays \emph{series} of such language games. Each
agent starts with initially empty inventories and has never before
seen any of the objects in the world. Each agent tries to optimize his
own communicative success and cognitive effort and thus coherent
mental representations and shared language emerge (solely through
processes of invention, adoption and alignment) as a side-effect of
the game.
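To summarize this script in procedural form, the following highly
simplified sketch (in Python, for illustration only; conceptualization,
invention, adoption and alignment are reduced to their most trivial
form here and do not correspond to the mechanisms introduced in the
following chapters) shows the flow of a single game and of a series of
games:
\begin{verbatim}
# Highly simplified, illustrative sketch of the dialogue script
# described above; meanings are reduced to object identifiers and
# alignment to simple overwriting.
import random

class Agent:
    def __init__(self):
        self.lexicon = {}                  # meaning -> form

    def produce(self, topic):
        if topic not in self.lexicon:      # invention
            self.lexicon[topic] = "w%d" % random.randint(0, 999999)
        return self.lexicon[topic]

    def interpret(self, form, context):
        for meaning, f in self.lexicon.items():
            if f == form and meaning in context:
                return meaning
        return None                        # unknown word / no referent

    def adopt(self, form, topic):
        self.lexicon[topic] = form         # learn from pointing

def play_game(speaker, hearer, context):
    topic = random.choice(context)         # speaker picks a topic
    form = speaker.produce(topic)          # conceptualize and produce
    guess = hearer.interpret(form, context)
    if guess == topic:
        return True                        # hearer points correctly
    hearer.adopt(form, topic)              # speaker points to the topic
    return False

population = [Agent() for _ in range(10)]
world = ["obj-%d" % i for i in range(5)]
successes = 0
for i in range(2000):
    speaker, hearer = random.sample(population, 2)
    context = random.sample(world, 3)
    successes += play_game(speaker, hearer, context)
print("success rate:", successes / 2000.0)
\end{verbatim}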
Finally, a note on terminology: this type of game has often been
called \emph{Guessing Game}, either because the hearer has to guess
the topic of the utterance and point to it or because the hearer
cannot know what aspect of an object the speaker intended with a
particular word (referential uncertainty, see below). When the focus
is on the kind of languages learnt, our game could also be called
\emph{Object Naming Game} because it is about naming objects (in
contrast to describing objects and their relations to other objects or
their roles in events). We will avoid possible confusion by always
using the term ``language game'' when referring to this particular
interaction pattern.
\subsection{Other social learning scenarios}
The language game paradigm has proved to be very successful in
demonstrating how groups of artificial agents can establish a shared
set of conventions through self-organization processes. However, when
it comes to explaining human communication, it has been -- rightfully
-- criticized for two reasons: First, it happens very rarely that
humans have to construct a communication system from scratch and the
normal case is that children learn the existing language of their
parents' culture. And second, the explicit feedback that our agents
give each other (including pointing and corrections) is not necessary
for children to learn the
meanings of words.\\
\noindent Because our agents start without any prior language, speakers have to
invent words whenever their lexicons are not sufficient for their
communicative needs. And when multiple speakers independently invent
words for the same thing, a large number of competing words spread
through the population before eventually one word ``wins'' and a
convention is established (as we will see further below). Although
some psychologists have demonstrated that humans are indeed able to
bootstrap and align symbolic communication systems in similar ways
(e.g. \citealp{galantucci05experimental,healey07graphical}), it is not
the normal situation that children are confronted with in language
acquisition -- they are born into a culture with an established
language and parents also won't adopt inventions made by their
children.
An alternative to this \emph{horizontal transmission} of language is
the \emph{iterated learning model}
(\citealp*{kirby01spontaneous,smith03iterated}; see also
\citealp{steels02iterated} for a comparison with the language game
framework). Instead of focusing on how language propagates among
members of the same generation, it investigates \emph{vertical
transmission} from one generation to the next. Following an
inductive machine learning approach, training sets consisting of
meaning-form pairs created by a parent are used to train the
inventories of a child, which then becomes the parent for the next
generation. The language of the first generation is usually
initialized randomly.
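As an illustration only (the sketch below does not reproduce the cited
models; the learning algorithm and the transmission bottleneck are
reduced to their simplest possible form), vertical transmission can be
thought of as follows:
\begin{verbatim}
# Illustrative sketch of iterated learning: each generation is
# trained on meaning-form pairs produced by the previous one.
import random

MEANINGS = ["m%d" % i for i in range(5)]

def random_language():
    return {m: "f%d" % random.randint(0, 9) for m in MEANINGS}

def learn(training_pairs):
    return dict(training_pairs)            # purely inductive memorizing

language = random_language()               # first generation: random
for generation in range(10):
    sample = random.sample(MEANINGS, 3)    # transmission "bottleneck"
    training = [(m, language[m]) for m in sample]
    child = learn(training)
    for m in MEANINGS:                     # unseen meanings filled in
        child.setdefault(m, "f%d" % random.randint(0, 9))
    language = child                       # child becomes next parent
\end{verbatim}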
However, the purely inductive nature of iterated learning leaves out
crucial aspects of communication such as joint attention, shared
context and communicative goals. Furthermore, languages also change
within generations and these changes cannot be explained by effects
of vertical transmission because they rely on processes of
coordination and alignment.\\
\noindent The agents in our language game experiments always give each other
non-linguistic corrective feedback, i.e. the speaker either confirms
that the topic pointed at by the hearer was the intended one or he
points to the right topic. But children don't necessarily need such
social scaffolds in order to learn the language of their parents --
they are smart enough to make sense of the communicative intentions of
speakers, even when just overhearing conversations of others.
\cite{lieven94crosslinguistic} extensively reviews cross-cultural
differences in the social interactions from which children learn
language; the conclusion is that parents in some cultures give
extensive feedback, while others give almost none: ``children are clearly not
having to learn language from something like a television set; but nor
are they being presented with a graded set of syntax lessons''
\citep[p. 73]{lieven94crosslinguistic}.
Some researchers investigated other types of games with less explicit
feedback. Best known are \emph{Description Games} in which the speaker
describes a scene and the hearer either agrees that it is a good
description for the current scene or he disagrees. The disadvantage is
that the speaker has no way to verify whether the hearer indeed
understood him (the fact that the hearer agreed does not mean that
they had a similar understanding of the words used). But description
games actually need to be played when the topic of a conversation is
not an object (which can be pointed at) but for example an aspect of
an event or other relations between objects (which can't be pointed
at). The lack of consensus between speaker and hearer on what the
topic of the conversation is makes self-organizing a shared language
harder and the problem is usually tackled with
\emph{cross-situational} learning techniques (discussed further
below). \cite{vogt03investigating} have compared the performance of the
language game introduced above with so-called ``selfish games'', in
which there is no feedback at all (so it's like learning language from
a television set). Their conclusion is that selfish games are -- albeit
viable -- much more difficult.
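To give an impression of how such cross-situational techniques work (a
purely illustrative sketch, not the specific algorithms discussed
later), a hearer who receives no feedback can accumulate co-occurrence
counts between a word and the meanings present in the situations in
which it was heard:
\begin{verbatim}
# Illustrative sketch of cross-situational learning: the meaning
# that co-occurs most often with a word across situations wins.
from collections import defaultdict

cooccurrence = defaultdict(lambda: defaultdict(int))

def observe(word, context_meanings):
    for meaning in context_meanings:
        cooccurrence[word][meaning] += 1

def best_meaning(word):
    candidates = cooccurrence[word]
    return max(candidates, key=candidates.get) if candidates else None

observe("nuzega", {"red", "small", "circle"})
observe("nuzega", {"red", "large", "square"})
observe("nuzega", {"red", "small", "triangle"})
print(best_meaning("nuzega"))              # -> 'red'
\end{verbatim}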
Even if children don't need extensive teaching and feedback, it
nevertheless helps them. For example \cite{chouinard03adult}
demonstrated that learning improves when parents reformulate erroneous
utterances of their children. And \citet{tomasello83joint-attention}
compared lexical learning rates in trials where mothers directed the
attention of their children at novel objects with trials where they
just followed what their child was already looking at -- the results
suggest that joint attention supports lexical acquisition.
\cite{bloom01precis} puts it this way: ``The natural conclusion here
is that these naming patterns on the part of adults really are useful,
they just aren't necessary. Environments differ in how supportive they
are, and word learning is easier when speakers make the effort to
clarify their intent and exclude alternative interpretations. But
children are good enough at word learning that they can succeed
without such support'' (p. 1099).
Our agents don't have a `theory of mind', i.e. hearers have no
non-linguistic pragmatic means available to them for figuring out what
the speaker intends. And they don't have additional heuristics for
determining whether they reached their communicative goal, because
they use language only to direct attention (it would, for example, be
easier if the speaker did not try to draw attention to an object but
instead requested the hearer to bring him the object -- if the hearer
brings the wrong one, then the speaker knows that he said something wrong). The
only way for our agents to deal with these limitations is thus to
establish joint attention and to use pointing as a means to check
whether the words were used correctly. So our language game is, in a
way, designed to overcome our agents' lack of social intelligence by
making it easy to verify whether communicative goals were reached. And
again \cite{bloom01precis}: ``Because of this, the best way to teach a
child an object name is to make it as clear as possible that you are
intending to refer to the referent of that name; and the best way to
do this is to point and say the word. In this way, the child can infer
that the speaker means to pick out the dog when using this new word,
`dog', and the meaning will be quickly and accurately learned''
(p. 1099).
\subsection{Evaluating the performance of language games}
\label{s:evaluating-language-games}
How can we then compare the performance of the different language game
experiments that we're going to do, i.e. how do we assess the
development of our agents' \emph{communicative competence}?
Intuitively, we would say that a person who knows more words than
somebody else and who complies better with the rules of, for example,
English is a better speaker of the language. The underlying conception
is that a language is some homogeneous public entity, cast into
dictionaries and internalized by its speakers. But even a person who
learnt the English dictionary by heart and follows all rules of the
language can still find himself in a situation where he will not
understand what other English speakers say. The person could for
example attend a mathematics conference and (although he understands
all the words) have no clue what they are talking about. Or he could
meet a group of adolescents who use slang words that did not make it
into the dictionaries yet.
Despite still ongoing debates about the historical distinction between
linguistic \emph{competence} and \emph{performance}
\citep{chomsky65aspects}, most linguists and philosophers now agree
that mastering a language is not about knowing the words and rules,
but about reaching communicative goals: ``We forget that there is no
such thing as a language apart from the sounds and marks people make,
and the habits and expectations that go with them. `Sharing a
language' with someone else consists in understanding what they say,
and talking pretty much the same way they do''
\citep[p. 131]{davidson05truth}.
Therefore, we will make \emph{communicative
success} our main criterion for performance in language games. That
is, the focus is not on the content of our agents' inventories, but on how
they use this knowledge in communication. As detailed before (Section
\ref{s:language-game}), our language game script allows both the
speaker and the hearer to determine whether the communicative goal
(drawing attention to an external object) was reached. After each
interaction in an experiment's ongoing series of dialogues, we will
determine how the agents assessed their success in communication and
record it using the following measure:
\begin{measure}[h]{Communicative success}{m:communicative-success}
Measures the fraction of successful games as assessed by the agents.
An interaction is a success when the hearer is able to point to the
topic intended by the speaker (see Figure
\ref{f:guessing-game-flow}, page
\pageref{f:guessing-game-flow}). After each successful interaction
the value of 1 is recorded, for each failure 0. Values are averaged
over the last $n$ interactions ($n=250$ if not stated otherwise).
\end{measure}
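As an illustration of how this measure is recorded (a minimal sketch;
not the actual measuring code of our framework), one can keep a
sliding window over the outcomes of the last $n$ games:
\begin{verbatim}
# Illustrative sketch: record 1/0 per game and average over the
# last n interactions (here n = 250).
from collections import deque

class SlidingAverage:
    def __init__(self, window=250):
        self.values = deque(maxlen=window)

    def record(self, success):
        self.values.append(1.0 if success else 0.0)

    def value(self):
        return sum(self.values) / len(self.values) if self.values else 0.0

measure = SlidingAverage(window=250)
# after each game: measure.record(game_was_successful)
\end{verbatim}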
\startfiguregroup
\begin{figure}[p]
\gnuplotfigure{figures/communicative-success-example-average-window-1}
\caption{Example for the evolution of communicative success over
time. Values were recorded for 10 different series of the same
experiment, each consisting of 10000 interactions. The size of the
average window for recording the values of each series is 1,
i.e. values within a series are not averaged.}
\label{f:communicative-success-example-average-window-1}
\end{figure}
\begin{figure}[p]
\gnuplotfigure{figures/communicative-success-example-average-window-100}
\caption{A graph of communicative success in the same experimental
run as above, but with values averaged over the last 100
interactions in each series. Error bars are standard deviations
across the 10 repeated series of the same experiment.}
\label{f:communicative-success-example-average-window-100}
\end{figure}
\begin{figure}[p]
\gnuplotfigure{figures/communicative-success-example-average-window-1000}
\caption{The same as above, but with an average window of 1000. Note
that this curve seems to be ``delayed'' compared to the other two
as a result of the bigger averaging window. Another side-effect of
averaging is the little ``bend'' in the curve at around
interaction 1000.}
\label{f:communicative-success-example-average-window-1000}
\end{figure}
\stopfiguregroup
\noindent Throughout this thesis, we will record such data over
repeated series of language games (together with data for many other
measures) to generate graphs such as in Figures
\ref{f:communicative-success-example-average-window-1}--\ref{f:communicative-success-example-average-window-1000}. How
should these graphs be read? The recorded values (in this case for the
communicative success measure) are plotted over the number of
interactions along the x-axis. So in this example the agents reach an
average communicative success of about 80\% after 1000 interactions,
which then later on increases to about 95\%.
Three things are important when interpreting such graphs. First, the
fact that it takes 1000 interactions to reach 80\% success does not
mean that each agent played 1000 games up to that point. In the
example the population consisted of 10 agents, and with each time two
agents participating in an interaction, 1000 interactions means that
each agent played 200 games on average, being speaker in about 100
interactions. Second, values are averaged over an average window. The
example graphs show the same results for average windows of 1, 100 and
1000. Many authors in the field of artificial language evolution
include graphs such as Figure
\ref{f:communicative-success-example-average-window-1} in their papers
(no averaging). But we believe that the noisy curve in that example
does not add any information and makes comparisons with other graphs
harder. We will thus use higher averaging windows (usually 250, but
sometimes even higher), which produces cleaner curves. The
disadvantage of heavy averaging is, as shown in the other two
graphs (Figures
\ref{f:communicative-success-example-average-window-100} and
\ref{f:communicative-success-example-average-window-1000}), that the
curves lag a bit ``behind'' the non-averaged data (which has to be
kept in mind). Third and finally, we will always repeat the same
experiment 10 times and average the results of each series to rule out
effects of randomness (the agents will always talk about different
scenes, each time with other randomly chosen partners, leading always
to varying dynamics). The error bars in Figures
\ref{f:communicative-success-example-average-window-100} and
\ref{f:communicative-success-example-average-window-1000} still give an
indication of how values vary across the different series (they indicate the
standard deviation of the values at that interaction number in all 10
series).
Of course communicative success is not the only measure we are
interested in (we will introduce others later). Part of
self-organizing a language is also that agents improve their cognitive
economy. That means that inventory sizes will converge to an optimal
number of elements that are needed to cope with the communicative task
(making processing faster) and the number of changes in the agents'
inventories will decrease. And we will compute measures of
\emph{coherence} that indicate how similar the inventories of the
population's agents are. But, as we will see, it is possible (and in
the case of embodied agents unavoidable) that agents have very
different conceptual and linguistic inventories but still communicate
successfully. Thus: ``What matters, the point of language or speech or
whatever you want to call it, is communication, getting across to
someone else what you have in mind by means of words that they
interpret (understand) as you want them to''
\citep[p. 120]{davidson05truth}.
\section{Words: representing linguistic knowledge}
\label{s:representing-linguistic-knowlede}
\begin{figure}
\parbox{0.6\textwidth}{\centerline{\includegraphics{figures/saussurean-sign}}}
\caption{A diagram that illustrates our notion of the term ``word''
as referring to the whole association of a meaning to a form.}
\label{f:saussurean-sign}
\end{figure}
We have introduced the social context in which our communicative
interactions are going to take place. Next, we're going to define what
it means for our agents to ``know a language''. Since the focus of our
thesis is on lexicon formation (which leaves out many crucial aspects
of natural language such as grammar and morphology), our agents'
linguistic inventories are single \emph{lexicons}, consisting solely
of \emph{words}. Words are couplings between a \emph{meaning} and a
\emph{form} (see Figure \ref{f:saussurean-sign}) and we will
consistently use the term \emph{word} to refer to the whole of this
association (and not to the form). What meanings are and where they
come from will be the topic of the next Section \ref{s:meanings}. For
now we will treat them as sets of unstructured symbols (or
\emph{categories}, \emph{attributes}, \emph{features},
\emph{conceptual entities}, whatever you want to call them) such as
{\tt object-34}, {\tt category-17}, {\tt red-2} and so on. Forms are
random character strings that are created by speakers whenever they
invent a new word. Throughout our thesis, these forms will be built
from three random consonant-vowel pairs, as for example in
``nuzega'' or ``firopa''.
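Purely for illustration, such forms could be generated as follows (the
consonant and vowel inventories in this sketch are arbitrary):
\begin{verbatim}
# Illustrative sketch: build a word form from three random
# consonant-vowel pairs, e.g. "nuzega" or "firopa".
import random

CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"

def make_form(n_syllables=3):
    return "".join(random.choice(CONSONANTS) + random.choice(VOWELS)
                   for _ in range(n_syllables))

print(make_form())                          # e.g. 'nuzega'
\end{verbatim}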
\subsection{Saussurean signs}
\label{s:saussurean-signs}
For the coupling between meaning and form we rely on the concept of
the \emph{Saussurean Sign} \citep{saussure67cours}. It is a
bi-directional relation between a concept (in the sense of some entity
of thought, \emph{signified}) and a form (a sound, a gesture, etc.,
\emph{signifier}). Bi-directional means that the same representation
is used to parse and produce utterances (which is not self-evident --
it is easy to imagine non-reciprocal communication systems in which
agents use different representations for parsing and producing or in
which agents lack the capability to either parse or produce). The
connection between the signified and the signifier is arbitrary, i.e.
there is nothing in the concept of a donkey that determines the sound
``donkey'' (in fact, different cultures arbitrarily connect very
different forms to similar concepts of donkeyness, e.g. ``Esel'' in
German). It's important to note that Saussurean Signs don't link
actual sound waves to physical objects existing in the world; rather, both
the signifier and the signified are mental patterns of recurring
sensory experiences of sounds and objects. Furthermore, and this will
become clearer later on, it is not the signs directly that determine
what we speak or how we interpret utterances -- it is the differences
in meaning and form within a whole system of signs that govern
the speech of individuals (\emph{parole} in Saussure's terms). That
is, speakers don't follow explicit rules (in a classical artificial
intelligence rule system sense) such as {\tt \ "if donkey visible
$\rightarrow$ produce sound `donkey'"} -- instead, they consider
their whole system of signs and their differences in meaning to
eventually use the sign that \emph{distinguishes} the donkey from
the other objects in the scene.
We'll assume Saussurean signs to be an appropriate construct for the
representation of form-meaning couplings in our work (especially the
notion of bi-directionality, arbitrariness and the importance of
relative differences to other signs), and we think that this is not a
controversial choice. But there is still the question of where this
particular nature of words comes from. To investigate this,
\cite{hurford89biological} compared different strategies for lexicon
formation in computer simulations. Learners either separately imitated
the production and interpretation behavior of others or used observed
production behavior both in production and interpretation. The latter
strategy clearly had advantages because it made it easier for the
agents to learn. Additionally, \cite{oliphant96dilemma} carried out
similar simulation studies which demonstrated that Saussurean
communication is favourable in populations of repeatedly interacting
agents (e.g. as in our language games), especially when the
populations are spatially organized. These experiments clearly show
that the Saussurean nature of words has advantages over other
communication systems. But the authors discuss these results under the
assumption that Saussurean communication evolved by means of natural
selection, a view that is challenged nowadays (see
\citealp[pp. 74--78]{bloom00how-children} for a discussion). As an
alternative, the bi-directional use of signs can be seen as a
consequence of our theory of mind: ``Children's ability to reproduce
intentional communicative actions via some form of cultural or
imitative learning involves a role reversal -- the child has
intentions towards the other person's intentional states -- which
leads to the creation of linguistic conventions''
\citep[p. 153]{tomasello01perceiving}. So we don't directly imitate
the linguistic behavior of others, that is, we don't imitate the
production of the sound ``donkey'' in the presence of a donkey but we
imitate the action of saying ``donkey'' as a method for directing
attention to donkeys. ``Once a child believes that the adult's use of
the word \emph{dog} was used with the intent to refer to a dog, then
she could use the same means (saying `dog') to satisfy this goal''
\citep[p. 76]{bloom00how-children}.
Finally, how are we going to implement our agents' systems of
Saussurean signs in terms of data structures? We'll choose the
simplest representation possible: a lexicon is represented as a list of
words, each having a meaning, a form and a score reflecting how
successfully that word was used in past interactions. As we will see
later, the lexicon is usually part of a larger \emph{semiotic
network}, a complex network \citep{strogatz01exploring} that
connects an agent's sensory experiences to forms and back and whose
overall behavior is the result of a coupling of different processes
that each have their own dynamics. Many representations are conceivable
that are more cognitively plausible than lists of words. For
example \cite{kosko88bidirectional} implemented a two-layer neural
network that can store paired data associations and
\cite*{billard99drama} developed DRAMA (dynamical recurrent
associative memory architecture) specifically for representing words
in robots. We prefer our representation over more integrated solutions
because it gives us full control over processes of language use and
learning. We assume that these structures could be easily transferred
into more natural representations (e.g. neural networks).
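For concreteness, a minimal sketch of this representation could look
as follows (illustrative only; the concrete data structures and
scoring schemes used in our experiments are defined together with the
individual experiments):
\begin{verbatim}
# Minimal illustrative sketch of the word/lexicon data structures:
# a list of meaning-form couplings, each with a score.
class Word:
    def __init__(self, meaning, form, score=0.5):
        self.meaning = meaning             # e.g. a set of categories
        self.form = form                   # e.g. "nuzega"
        self.score = score                 # success in past games

class Lexicon:
    def __init__(self):
        self.words = []

    def words_for_form(self, form):        # used when parsing
        return [w for w in self.words if w.form == form]

    def words_for_meaning(self, meaning):  # used when producing
        return [w for w in self.words if w.meaning == meaning]
\end{verbatim}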
\subsection{Increasing complexity in the coupling between form and
meaning}
\label{s:nature-of-form-meaning-couplings}
Words are couplings between meaning and form. We'll treat forms as
simple random strings and what meanings are will be explained in the
next section. We will turn now to the nature of this coupling, i.e.
how a form is coupled to meaning and how words in a lexicon relate to
each other. This structure is part of an agent's cognitive
infrastructure, especially his mechanisms for production and
interpretation, learning and alignment. And it has direct consequences
on the dynamics of the language game experiments, i.e. how quickly the
agents reach communicative success and coherence. Depending on the
``degrees of freedom'' in what the agents can associate to a form, in
how words with equivalent meanings/forms relate to each other, and in
how agents combine different words into utterances, various kinds (and
degrees) of \emph{ambiguities} arise in an agent's lexicon. For
example in all of these models it happens that different forms for the
same meaning spread in the population (because agents independently
invent them), causing \emph{synonyms} (the same meaning is associated
with multiple forms) to occur in the agent's lexicon. Similarly,
different contexts and other reasons might cause an agent to associate
multiple meanings with the same form (\emph{homonymy}).
What does it mean for an agent to have for example a synonym in his
lexicon? Technically, an agent that learnt two different forms $f_1$
and $f_2$ for the meaning $m_1$ will not store them in the same word
with connections to both forms, but he maintains two separate
representations $w_1: m_1 \Leftrightarrow f_1$ and $w_2: m_1
\Leftrightarrow f_2$. Part of the self-organization process in the
series of language games is that the whole population eventually
agrees on one single form for a particular meaning (and vice
versa). In order to reach this goal, each agent individually tries to
optimize his own lexicon by preferring the most conventionalized
associations and eliminating \emph{competing} synonymous and
homonymous words. We will introduce various algorithms that achieve
this -- all of them rely on scoring each word depending on how
successfully it is used in communication. When enough agents in the
population start preferring a particular form-meaning association, it
will prevail over the others, causing each individual agent to remove
competing synonyms and homonyms.
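As an illustration of such a scoring scheme (a sketch of a lateral
inhibition style update, building on the Word/Lexicon sketch from the
previous section; the concrete update rules used in our experiments
are introduced later), the used word is rewarded after a successful
game while its synonymous and homonymous competitors are punished:
\begin{verbatim}
# Illustrative lateral-inhibition-style alignment: reward the used
# word on success, punish its competitors, and prune dead words.
def align(lexicon, used_word, success, delta=0.1):
    if success:
        used_word.score = min(1.0, used_word.score + delta)
        competitors = [w for w in lexicon.words
                       if w is not used_word
                       and (w.meaning == used_word.meaning    # synonyms
                            or w.form == used_word.form)]     # homonyms
        for w in competitors:
            w.score = max(0.0, w.score - delta)
    else:
        used_word.score = max(0.0, used_word.score - delta)
    lexicon.words = [w for w in lexicon.words if w.score > 0.0]
\end{verbatim}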
\begin{figure}[t]
\centerline{
\begin{tabular}{lp{1cm}l}
A & & B \\
\includegraphics[width=0.30\columnwidth]{figures/models-of-form-meaning-couplings-1} & &
\includegraphics[width=0.30\columnwidth]{figures/models-of-form-meaning-couplings-2} \\
& & \\
C & & D \\
\includegraphics[width=0.30\columnwidth]{figures/models-of-form-meaning-couplings-3} & &
\includegraphics[width=0.30\columnwidth]{figures/models-of-form-meaning-couplings-4}
\end{tabular}}
\caption{Increasing complexity in the nature of the coupling between
form and meaning. Hypothetical example lexicons of one agent are
shown for four different models of lexicon formation. Line widths
denote different connection weights (scores). A: One-to-one
mappings between names and individuals in the Naming Game. There can
be competing mappings involving the same individual (synonyms). B:
One-to-one mappings between words and single categories in Guessing
Games. Additionally to synonymy, there can be competing mappings
involving the same words (homonymy). C: Many-to-one mappings between
sets of categories and words. In addition to synonymy and homonymy,
words can be mapped to different competing sets of categories that
partially overlap each other. D: Flexible word meaning
representations. Competition is not explicitly represented but words
have flexible associations to different categories that are shaped
through language use.
}
\label{f:models-of-form-meaning-couplings}
\end{figure}
Furthermore, other ambiguities arise from the use of \emph{multi-word}
utterances (it can become unclear which word covers which meaning),
from \emph{specificity} relations (whether a new word refers to the
whole object, to its kind or to a general property of it), and others.
The degrees of freedom in what to associate to a new form can be
interpreted as the \emph{complexity} of a lexicon formation model and
we will classify a variety of models according to this degree of
freedom. For now, Figure \ref{f:models-of-form-meaning-couplings}
illustrates the nature of the coupling between meaning and form for
four of them.
The simplest of these four models, the \emph{Naming Game}
(\citealp{steels95selforganizing,steels99spatially}; Figure
\ref{f:models-of-form-meaning-couplings}A), is historically also the
oldest. The task in this game is to agree on a set of names for
established individuals (for example proper names such as ``John'' and
``Mary'' for individual persons). Agents jointly perceive sets of
uniquely identifiable objects such as for example {\tt object-3}, {\tt
object-8}, {\tt object-4}; or (as in
\citealp{steels95selforganizing}) unambiguously interpretable
positions on a spatial grid relative to the speaker (e.g. {\tt front},
{\tt side}, {\tt behind}, {\tt left}, etc). Words are thus one-to-one
associations between a representation for an individual and a name.
Since both speaker and hearer have the same representations of
individuals (the world they perceive consists already of
pre-conceptualized symbolic representations for unique objects or
locations), the hearer immediately knows which concept to associate to
a novel word after the speaker pointed to it. But synonymy can occur
because different speakers might invent different names for the same
object (for example in Figure
\ref{f:models-of-form-meaning-couplings}A the words {\tt
individual-2}$\Leftrightarrow$\emph{form-2} and {\tt
individual-2}$\Leftrightarrow$\emph{form-3} are synonymous).
Figure \ref{f:models-of-form-meaning-couplings}B illustrates a next
class of models. It is commonly referred to as a \emph{Guessing Game}
and was first introduced by \cite{steels96emergent}. It takes away the
scaffold that objects are represented as unique concepts by letting
the agents perceive scenes in which objects are sets of
pre-conceptualized discrete categories as for example in:
\begin{verbatim}
object-1: [weight heavy] [size medium] [shape square]
object-2: [weight light] [size small] [shape round]
object-3: [weight heavy] [size tall] [shape square]
\end{verbatim}
\noindent The speaker then searches for a category that
\emph{discriminates} the chosen topic from the other objects in the
context (for example {\tt [size medium]} discriminates {\tt object-1}
from the rest, {\tt [weight light]} or {\tt [shape round]}
discriminate {\tt object-2}, etc.) and then uses a single word to
express that meaning (the game stops when no discriminative category
can be found). So the words acquired by the agents are comparable to
adjectives for basic categories such as ``red'', ``small'' or
``round''. The representation of words is identical to that of Naming
Games (a one-to-one mapping between an atomic category and a form), but
further difficulties arise because the hearer does not know which
sensory quality (or channel) a novel word refers to. Consequently,
homonyms may appear in addition to synonyms because a hearer might
adopt different interpretations of the same word (for example, the
agent in Figure \ref{f:models-of-form-meaning-couplings}B interprets
the form \emph{form-3} both as {\tt feature-2} and as {\tt
feature-3}). Because the words in this game still ``name'' single
categories, such experiments are sometimes called Naming Games as
well, reserving the term Guessing Game for the language game script.
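For illustration, the speaker's search for a discriminating category
could be sketched as follows (objects are represented here as sets of
attribute-value pairs, mirroring the example context above; this is a
sketch, not the discrimination mechanism used in our experiments):
\begin{verbatim}
# Illustrative sketch of the discrimination step: find categories
# of the topic that no other object in the context has.
def discriminating_categories(topic, context):
    others = [obj for obj in context if obj is not topic]
    return [c for c in topic
            if not any(c in other for other in others)]

object_1 = {("weight", "heavy"), ("size", "medium"), ("shape", "square")}
object_2 = {("weight", "light"), ("size", "small"), ("shape", "round")}
object_3 = {("weight", "heavy"), ("size", "tall"), ("shape", "square")}

context = [object_1, object_2, object_3]
print(discriminating_categories(object_1, context))
# -> [('size', 'medium')]
\end{verbatim}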
\cite{looveren99multiple,looveren00analysis} presented two further
innovations: first, \emph{multi-word} utterances were introduced:
objects don't need to be discriminated anymore by a single category
but combinations of categories can be expressed by different words
(e.g. ``red'' and ``small'' when some other objects in the context are
also red and some others also small, but none of them red and small at
the same time). This leads to the additional difficulty that when a
hearer is confronted with two novel words at the same time then he
does not know which word covers which part of the inferred meaning
(such a situation is usually seen as too difficult: hearers only learn
when there is only one unknown word so that they can infer its meaning
using the known words and the context). Second, meanings of words can
be \emph{structured}: instead of expressing a single individual or
category, words are many-to-one mappings between forms and sets of
discrete categories (see Figure
\ref{f:models-of-form-meaning-couplings}C). Due to this, another
challenge arises for the hearer: he does not know with which subset of
the topic's features he has to associate a new word. As a result, the
agents' lexicons do not only contain homonyms but also competing words
where the meaning of one is a subset of another (e.g. in Figure
\ref{f:models-of-form-meaning-couplings}C there are two words with the
form \emph{form-3}: one that expresses only {\tt feature-3} and one
that covers both {\tt feature-2} and {\tt feature-3}).
In order to scale up the above three lexicon formation models towards
more complex meaning spaces and in order to allow for the emergence of
more natural communication systems,
\cite*{wellens08flexible,wellens12multi-dimensional} proposed another
lexicon representation as shown in Figure
\ref{f:models-of-form-meaning-couplings}D. The main innovation is to
tackle ambiguities in what words mean with a \emph{flexible} coupling
between meaning and form: whereas agents in the previous models try to
figure out the meaning of a word by adopting multiple associations
between a form and its alternative meanings (and then use word scoring
techniques to rule out all of them except one), here the uncertainty
is put in the word representation itself. Instead of having a single
score for the whole coupling between a form and a set of categories,
each connection to a category is scored separately, which allows the
meaning of a word to gradually change towards its conventional use in
the population. Figure \ref{f:models-of-form-meaning-couplings}D tries
to illustrate this: an agent's lexicon is represented as a
many-to-many association between categories and forms, with each
connection scored separately.
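As a purely illustrative sketch of this idea (the update rule below is
hypothetical and not the one proposed by the cited authors), each
connection between a form and a category can carry its own weight,
which is shifted after every game:
\begin{verbatim}
# Illustrative sketch of a flexible word: one form with separately
# weighted connections to categories, shaped by language use.
class FlexibleWord:
    def __init__(self, form):
        self.form = form
        self.weights = {}                  # category -> weight

    def shift(self, used_categories, success, delta=0.1):
        change = delta if success else -delta
        for c in used_categories:          # strengthen (or weaken) used
            self.weights[c] = self.weights.get(c, 0.0) + change
        for c in list(self.weights):       # erode unused connections
            if c not in used_categories:
                self.weights[c] -= delta / 2
            if self.weights[c] <= 0.0:
                del self.weights[c]

w = FlexibleWord("nuzega")
w.shift({"red", "small"}, success=True)
w.shift({"red", "circle"}, success=True)
print(sorted(w.weights.items()))
\end{verbatim}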
\section{Meanings: grounded word semantics}
\label{s:meanings}
In addition to the social context of the communicative interactions
and the nature of word representations, the notion of ``meaning'' is
central to the understanding of communication in general and models
of lexicon formation in particular. In our simulated language game
experiments, as discussed before, the world of the agents already
provides shared pre-conceptualized meanings consisting of (sets of)
symbols such as {\tt object-34}, {\tt category-17} or {\tt
red-2}. With the meanings already being ``in the world'', they are
also immediately shared by all agents in the population and the
question of what meanings are and where they come from is not posed --
the focus is rather on reaching consensus on which meanings to connect
to which forms. However, objects in the real world -- which is also
the world of our robots -- do not come with universally shared
properties directly accessible to observers. Instead, each agent has
to construct ``meanings'' as his own interpretation of a scene from
his sensori-motor interaction with the environment.
\subsection{From Saussure to Peirce}
\label{s:saussure-to-peirce}
A widely accepted notion of meaning is that meanings are not something to
be found in the world, but that they are used to \emph{refer} to
things in the world: ``The traditional view, emerging first in
Aristotle, is that the meaning of a word is what determines its
reference. ... Hence the meaning of \emph{dog} determines which things
are and are not dogs, and knowing the meaning of dog entails knowing
what things are dogs and are not dogs''
\citep[p. 18]{bloom00how-children}.
\begin{figure}
\parbox{0.6\textwidth}{\centerline{\includegraphics{figures/semiotic-triangle}}}
\caption{Diagram of a semiotic triangle. The relation between a
meaning, a form and a referent loosely resembles the definition of
a sign by \cite{peirce31collected}. }
\label{f:semiotic-triangle}
\end{figure}
Adding referents to De Saussure's (\citeyear{saussure67cours}, see
also Section \ref{s:saussurean-signs} above) definition of a sign as a
relation between a meaning and a form, \cite{peirce31collected}
introduced the concept of a sign as a triadic relationship between a
form, a meaning and a referent (see Figure
\ref{f:semiotic-triangle}). Peirce originally used the term
\emph{representamen} for the \emph{shape} (form) of the sign and
\emph{interpretant} for its \emph{sense} or \emph{concept} (meaning).
For the referent, which is a physical object in the world but which
also can be abstract, Peirce used the term \emph{object}.
The relation between form and meaning is, analogous to the Saussurean
sign, an arbitrary conventionalized bi-directional association between
a meaning and a form. And although finding the appropriate meaning
underlying a form or finding the form that expresses a meaning of
course requires some look-up process, these associations can be
considered to be ``stored'' in the lexicon of an agent. In contrast,
the relation between meanings and referents is of a different
nature. Word meanings are representations that allow an agent to determine to
which referents a word applies and to which it does not. Therefore, finding
out whether a specific meaning is applicable to a specific referent in
the context is an active process that establishes the relation between
a meaning and a referent anew in each interaction. We call the
process of determining the meanings that are applicable to a referent
\emph{conceptualization} and the reverse process of applying the
meanings underlying an utterance to a situation in order to determine
a referent \emph{interpretation}.
The third relation in Figure \ref{f:semiotic-triangle} between forms
and referents is even less direct. The meaning representations
maintained by each agent are not accessible to other agents -- they
can only observe forms and referents. Meanings thus constitute an
intermediate layer that allows agents to relate the same words to
similar referents in the world, i.e. \emph{use} a word in the same
way: ``For a large class of cases -- though not for all -- in which we
employ the word `meaning' it can be defined thus: the meaning of a
word is its use in the language'' \citep[Part I, Section
43]{wittgenstein67philosophische}. For example, the meaning of ``red''
is a shared convention on how to classify the world into things that are
red and things that are not. Moreover, meaning representations are
constructed individually by each agent from sensory experiences of
specific referents. And because every agent has a different history of
interactions with the world and other agents, two agents will never
connect exactly the same meaning representation to the same
form. Intuitively, every two humans will also have slightly different
opinions about which border cases of red objects should be considered
red, but they will still use ``red'' successfully in most of the cases
to refer to red objects. As we will see later, conceptual coherence,
i.e. the similarity between meanings acquired by different agents, is
not necessarily a prerequisite for successful communication. It is
enough that we all use a word to refer to the same things -- further
cognitive overlap is not necessary.
Furthermore, conceptualizing a referent or interpreting a meaning
never happens in a vacuum. Words can be used differently in different
contexts (for example ``the red block'' can be used to refer to an
orange block when all other objects are blue, but not when there is
another red block). And more importantly, the interpretation of words
depends also on the social context, i.e. the previous discourse and
the kind of communicative interaction. As discussed above in Section
\ref{s:language-game}, the language game played determines how words
have to be interpreted to yield a referent. ``We must therefore
explicitly acknowledge the theoretical point that linguistic reference
is a \emph{social} act in which one person attempts to get another
person to focus her attention on something in the world''
\citep[p. 97]{tomasello99cultural}. In our experiments, the type of
communicative interaction is fixed (see Figure
\ref{f:guessing-game-flow}, page \pageref{f:guessing-game-flow}) and
the implicit communicative goal underlying each utterance is to draw
attention to a single object in the environment of the
robots. Consequently, when an agent says for example ``red small'',
then the built-in convention is to interpret these words as ``please
point to the object that is small and red''.
The question of how to represent and process word meanings is very
closely related to the \emph{symbol grounding problem}
\citep{harnad90symbolgrounding}, which is ``..., generally speaking,
the problem of how to causally connect an artificial agent with its
environment such that the agent's behavior, as well as the mechanisms,
representations, etc. underlying it, can be intrinsic and meaningful
to itself, rather than dependent on an external designer or observer''
\citep[p.~177]{ziemke99rethinking}. The debate around this problem was
started by \cite{searle80minds} with the Chinese room argument as a
critique to early paradigms in artificial intelligence that envisioned
the possibility of intelligence based solely on the manipulation of
idealized \emph{physical symbol systems}
\citep{newell76computer,newell80physical} and has since occupied
many philosophers and cognitive scientists. However, when adopting the
notion of meaning discussed above as a functional relation between
forms, internal representations and referents, then ``... one may
argue that the semiotic symbol is \emph{per definition}
grounded, because the triadic relation (i.e. the semiotic symbol)
already bears the symbol's meaning with respect to reality''
\citep[p.~434]{vogt02physical}. We will thus not take part in this
debate and rather focus on the technical challenge of the acquisition
of meanings through the interaction of a physical body with the
environment and on processes for conceptualization and semantic
interpretation, which together ``solve the symbol grounding problem''
\citep*{steels08symbol-grounding,steels07semiotic}.
\subsection{Mental representations for categorization}
\label{s:mental-representations-for-categorizations}
Peirce's definition of a sign can be discussed without subscribing to
any theory of what word meanings are and how they are represented in
an agent, a question which has occupied philosophers, logicians,
linguists and psychologists for a very long time. We will not delve
into the history of this debate but rather stick with contemporary
notions of meaning in the cognitive sciences that are based on the
concept of \emph{categories}, as advanced by scholars such as
\cite{lakoff87woman}, \cite{harnad87categorial} or
\cite{barsalou99perceptual}. A category is a representation that
allows objects to be classified according to some criterion, or ``a category
exists whenever two or more distinguishable objects or events are
treated equivalently'' \citep[p. 89]{mervis81categorization}. We call
the long-term memory of categories that are acquired by an agent an
\emph{ontology}.
Categories are abstractions from the continuous sensori-motor
interaction with the environment that have proved to be useful for an
agent, for example in communication: ``one purpose of categorization
is to reduce the infinite differences among stimuli to behaviorally
and cognitively usable proportions. It is to the organism's advantage
not to differentiate one stimulus from others when that
differentiation is irrelevant for the purposes at hand'' \citep[page
384]{rosch76basic}. Consequently, well-tuned category systems
contribute to the cognitive economy of an agent because they limit the
number of sensori-motor patterns that have to be memorized and they can
be processed independently of the context in which they were created
and the objects and the events that they stand for, a phenomenon which
\cite{gardenfors05detachment} calls the ``detachment of
thought''. Finally, categories are not only used for language, but
also for a big variety of other cognitive activities such as for
example planning. Some scholars such as
Peirce~(\citeyear[p.~2.302]{peirce31collected}) even claim that ``we
think only in signs''.
Early psychological studies by \cite{rosch73natural} have shown that
many categories do not have strict borders but that membership in a
category is graded. For example, the category \texttt{red} does
not unambiguously divide all things in the world into a set of objects
that are red and into another set of objects that are not red, but
instead provides a graded judgement of \emph{how} red an object
is. And at least for `basic level' categories, \cite{rosch73natural}
demonstrated that the gradedness of this classification is a function
of the similarity to a \emph{prototype}, which can be understood as a
point in a sensori-motor space that defines the center of the
category. Such a space is defined by multiple dimensions representing
continuous sensory or other qualities and multiple categories defined
by points in that space. For example color categories can be
represented as points in a two- or three-dimensional color space and