-
Notifications
You must be signed in to change notification settings - Fork 0
/
intro.html
697 lines (602 loc) · 38.1 KB
/
intro.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
<!doctype html public "-//W3C//DTD HTML 4.0 Transitional //EN">
<html>
<head>
<meta name="GENERATOR" content="mkd2html 2.1.8 DEBUG DL=DISCOUNT FENCED-CODE">
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8"></head>
<body>
<h1>Introduction to Programming in Dino</h1>
<h2>Vladimir Makarov, vmakarov@gcc.gnu.org</h2>
<h2>Mar, 2016</h2>
<p><em>Dino</em> is a high-level, dynamically typed, scripting language that has
been designed for simplicity, uniformity, and expressiveness. Dino is
similar to such well known scripting languages as <em>Python</em>, <em>Perl</em>,
and <em>Lua</em>. As most programmers know the C language, Dino resembles C
where possible.</p>
<p>Dino is an extensible, object oriented language that has garbage
collection. It supports parallelism description, exception handling,
pattern matching, and dynamic loading of libraries written on other
languages. Although Dino works on Mac <em>OS X</em> and on <em>Windows</em> under
CYGWIN, its main platform is <em>Linux</em>.</p>
<p>This document is a concise introduction to the new Dino scripting
language, but is not a programmer’s manual.</p>
<h1>Some History</h1>
<p>Originally, Dino was designed and implemented by the Russian graphics
company
<a href="http://www.mobygames.com/company/animatek-international-inc">ANIMATEK</a>
to describe the movement of dinosaurs in an animation project. (This
is origin of the language’s name.) At that time it worked in only 64KB
memory. It has been considerably redesigned and reimplemented with the
aid of the COCOM toolset.</p>
<h1>Let’s Begin</h1>
<p>The best way to get the feel of a programming language is to see a
program written in it. Because I have worked in the compiler field for
the last 30 years, I’ll write a small syntactic parser
assembler in Dino.</p>
<p>Most of us do not remember how programmers wrote programs for old
computers that had only a few kilobytes of memory. Long ago I read
about an Algol 60 compiler that worked on a computer that had only 420
20-bits words<code>[1]</code>. In another old book<code>[2]</code>, the author
describes an Algol compiler working on 1024 42-bits words. How did
they achieve this? One of the ways is to use an interpreter for a
specialized language; a program in a specialized language is usually
smaller. Let’s implement an assembler for syntactic parsers. The
assembler output will be a syntactic parser interpreter in C. The
assembler instructions have the following format:</p>
<pre><code> [label:] [code [operand]]
</code></pre>
<p>Here, the constructions in brackets are optional. For convenience we
will allow comments that start with <code>;</code> and finish at the end of the
line.</p>
<p>The assembler will have the following instructions:</p>
<table>
<thead>
<tr>
<th> Instruction </th>
<th> Description </th>
</tr>
</thead>
<tbody>
<tr>
<td> <code>goto</code> <em>label</em> </td>
<td> Transfer control to the instruction marked <em>label</em>. </td>
</tr>
<tr>
<td><code>gosub</code> <em>label</em> </td>
<td> Transfer control to the instruction marked <em>label</em> and save the next instruction address. </td>
</tr>
<tr>
<td> <code>return</code> </td>
<td> Transfer control to the instruction following the latest executed gosub instruction. </td>
</tr>
<tr>
<td><code>skipif</code> <em>symbol</em> </td>
<td> If the current token is <em>symbol</em>, the following instruction is skipped. Otherwise, transfer control to the following instruction. </td>
</tr>
<tr>
<td> <code>match</code> <em>symbol</em> </td>
<td> The current token should be <em>symbol</em>, otherwise a syntax error is set. After matching, the next token is read and become the current token. </td>
</tr>
<tr>
<td> <code>next</code> </td>
<td> The next token is read and become the current token. </td>
</tr>
</tbody>
</table>
<p>The following assembler fragment recognizes Pascal designators.</p>
<pre><code> ;
; Designator = Ident { "." Ident | "[" { expr / ","} "]" | "@" }
;
start:
Designator:
match Ident
point: skipif Point
goto lbrack
next ; skip .
match Ident
goto point
lbrack: skipif LBracket
goto at
next ; skip [
next: gosub expr
skipif Comma
goto rbrack
next ; skip ,
goto next
rbrack: match RBracket
goto point
at: skipif At
return
next ; skip @
goto point
</code></pre>
<h2>Overall structure of the assembler.</h2>
<p>As a rule, assemblers work in two passes. Therefore, we need to have
some internal representation (IR) to store the program between the
passes. We will create the following Dino files:</p>
<ul>
<li>The code that describes the IR and an auxiliary function will be
in the file <a href="./ir.d"><code>ir.d</code></a>.</li>
<li>The code for reading the assembler program and for generating the
IR will be in the file <a href="./input.d"><code>input.d</code></a>.</li>
<li>The code for checking the IR will be in the file <a href="./check.d"><code>check.d</code></a>.</li>
<li>The code for generating the interpreter in C will be in the file
<a href="./gen.d"><code>gen.d</code></a>.</li>
<li>The top level code will be in the file <a href="./sas.d"><code>sas.d</code></a>.</li>
</ul>
<p>These files are described in detail below.</p>
<h2>File ir.d</h2>
<p>This file contains the description of the IR in Dino and also some
auxiliary function. Dino has dynamic type variables. In other words, a
Dino variable may contain a value of any Dino type. The major Dino
types are:</p>
<ul>
<li><em>characters</em> (Unicode characters)</li>
<li><em>integers</em> (64-bit signed integers)</li>
<li><em>long integers</em> (arbitrary precision signed integers)</li>
<li><em>floating point numbers</em> (IEEE double floating point numbers)</li>
<li><em>heterogeneous vectors</em> (that is, vectors that may contain
elements of different types. A typical example of vector is a
string, a vector whose values are only characters)</li>
<li><em>associative tables</em></li>
<li><em>objects</em></li>
</ul>
<p>The values of the last three types are <strong>shared</strong>. That means that if a
variable value is assigned to another variable, any changes to the
shared value through the first variable are reflected in the value of
the second variable. In general, working with shared values is
analogous to working with pointers in C, but with fewer risks.</p>
<p>On line 1 we see a definition of the <em>singleton object</em> <code>ir</code>. The
class of a singleton object is anonymous and only one instance of the
class exists. That is why such instance is called a singleton object.
Singleton objects in Dino are frequently used as name spaces. The
object <code>ir</code> (lines 1-12) contains information about the entire
assembler program:</p>
<ul>
<li> <code>ns</code>, which is initialized by an empty vector, will contain a
vector of IR nodes that correspond to all instructions in the
source program.</li>
<li> <code>l2i</code>, which is initialized by an empty associative table, will
contain a table for transforming label names into an index of
the node in <code>ns</code>. This node will represent assembler instruction
marked by the label.</li>
<li> <code>i2l</code>, which is initialized by an empty associative table, will
contain a table for transforming the index of the node in <code>ns</code>
into a vector of label names. A node with such an index in <code>ns</code>
will represent assembler instruction marked by the labels in the
vector.</li>
<li><code>ss</code>, which is initialized by an empty associative table, will
contain a table of all symbols in the assembler instructions
<code>match</code> and <code>skipif</code>.</li>
<li><code>mind</code> and <code>maxd</code> will contain the minimum and maximum displacements
of labels in the source program.</li>
</ul>
<p>Inside the singleton object object <code>ir</code>, classes describing each
assembler instruction are defined. Line 5 describes an abstract node
of an IR. A node of such class has the variable <code>lno</code> (which is the
source line of the corresponding assembler instruction). The variable
is also a class parameter. That means that you should define its value
when creating a class instance or object.</p>
<p>On line 6 we can see a class composition operation <code>use</code> which can
describe (multiple) inheritance, traits, and duck typing. The
operation <code>use</code> has the following semantics:</p>
<ul>
<li>Definitions of class mentioned in <code>use</code> are inlayed</li>
<li>Definitions before the <code>use</code> rewrite corresponding inlayed
definitions mentioned in former-clause</li>
<li>Definitions after the <code>use</code> rewrite corresponding inserted
definitions mentioned in later-clause</li>
<li>The definitions should be matched</li>
<li>The original and new definitions should be present if they are in
former- or later-clause</li>
</ul>
<p>To follow a common terminology, we call a class which uses definitions
of another class a sub-class of the another class.</p>
<pre><code> 1. obj ir {
2. var ns = [], t2i = tab []; // all ir nodes, token name->token index
3. var l2i = tab [], i2l = tab []; // label -> node index -> label vector
4. var mind = nil, maxd = nil;
5. class irn (lno) {}
6. class goto (lno, lab) { use irn former lno; }
7. class skipif (lno, sym) { use irn former lno; }
8. class match (lno, sym) { use irn former lno; }
9. class gosub (lno, lab) { use irn former lno; }
10. class next (lno) { use irn former lno; }
11. class ret (lno) { use irn former lno; }
12. }
13.
14. fun err (...) {
15. fput (stderr, argv[0], ": ");
16. for (var i = 0; i < #args; i++)
17. if (args[i] != nil)
18. fput (stderr, args[i]);
19. fputln (stderr);
20. exit (1);
21. }
</code></pre>
<p>Lines 14-21 contain a function to output errors. The function accepts
a variable number of parameters. Any function or class should be
called with the same number of actual parameters as the number of
formal parameters. The number of actual parameters can be more if the
last formal parameter is “…”. The last actual parameters
corresponding to “…” will form a vector assigned to implicitly
defined variable <code>args</code>.</p>
<p>The function <code>err</code> uses the following predefined definitions are:</p>
<ul>
<li>The <code>fput</code> function, which outputs strings, characters, integers,
or floating point numbers</li>
<li>The <code>fputln</code> function, which is the same as <code>fput</code>, but additionally
outputs a new line)</li>
<li>The <code>exit</code> function, which finishes the Dino program with given
code.</li>
<li>The variables <code>argv</code>, which are all command line arguments of the
Dino program. So <code>argv[0]</code> will be an assembler program file name.</li>
<li><code>stderr</code> (standard error stream), which is predefined in Dino.</li>
</ul>
<p>There are many other predefined functions, classes, and variables in
Dino. On line 16 you can see the operator <code>#</code>, which returns the number
of elements in a vector or an associative table.</p>
<p>Here we should say some words about the <em>definition scope</em>, in other
words places where the definition can be referred by its identifier.
In most cases the scope is a range between the definition point and
end of the block in which the definition is given. The scope excludes
scopes of definitions with the same identifier in the nested blocks.
To make an useful <em>REPL</em> (an interactive Dino shell –
read-eval-print loop), another definition with the same identifier is
permitted in the same block. So the definition scope can finish
earlier before another definition with the same identifier in the same
block.</p>
<h2>File input.d</h2>
<p>This file contains the function <code>get_ir</code>, which reads the file given as
its parameter, performs some checks on the file, and generates the IR
of the source program.</p>
<p>The first line contains an <em>include-clause</em> that specifies a source file
without the suffix <strong>.d</strong> (all Dino source files should have this
suffix). The file is given as a string in the clause; this means that
the entire file is inserted in place of the clause. As result, we
could check the file by calling the Dino interpreter with <code>input.d</code> on a
command line. There are several rules that define which directories
are searched for the included file. One such directory is the
directory of the file that contains the include-clause. Thus, we can
place all the assembler files in that one directory and forget about
the other rules.</p>
<p>The file is inserted only once in a given <em>block</em> (this is the
construction that starts with <code>{</code> and finishes with <code>}</code>). This is
important for our program because each file will contain an inclusion
of the file <code>ir.d</code>, and eventually all the files will be included into
the main program file. Unconditional inclusion in this case would
result in many error messages about repeated definitions. By the way,
there is also special form of the include-clause that permits
unconditional file inclusion.</p>
<p>On lines 4-13 we define some variables. Definitions can start with
keyword <code>var</code> or <code>val</code>. Variables defined with <code>val</code> can not change
their initial value, in other words they can be considered as named
constant. We use regular expressions to assign them strings that
describe the correct assembler lines. The regular expressions are Ruby
dialect regular expressions of
<a href="https://github.com/kkos/oniguruma">ONIGURUMA</a> package. These regular
expressions are extension of ones that are described in POSIX 10003.2.
To concatenate the strings (vectors), we use the operator <code>@</code>.</p>
<pre><code> 1. include "ir";
2.
3. fun get_ir (f) {
4. var ln, lno = 0, code, lab, op, v;
5. // Patterns
6. val p_sp = "[ \t]*";
7. val p_code = p_sp @ "(goto|skipif|gosub|match|return|next)";
8. val p_id = "[a-zA-Z][a-zA-Z0-9]*";
9. val p_lab = p_sp @ "((" @ p_id @ "):)?";
10. val p_str = "\"[^\"]*\"";
11. val p_op = p_sp @ "(" @ p_id @ "|" @ p_str @ ")?";
12. val p_comment = p_sp @ "(;.*)?";
13. val pattern = "^" @ p_lab @ "(" @ p_code @ p_op @ ")?" @ p_comment @ "$";
14.
15. for (;try (ln = fgetln (f), eof);) {
16. lno++;
17. v = re.match (pattern, ln);
18. if (v == nil)
19. err ("syntax error on line ", lno);
20. lab = (v[4] >= 0 ? subv (ln, v[4], v[5] - v[4]) : nil);
21. if (!(#ir.ns in ir.i2l))
22. ir.i2l[#ir.ns] = [];
23. if (lab != nil) {
24. if (lab in ir.l2i)
25. err ("redefinition lab ", lab, " on line ", lno);
26. ir.l2i[lab] = #ir.ns;
27. ins (ir.i2l [#ir.ns], lab, -1);
28. }
29. code = (v[8] >= 0 ? subv (ln, v[8], v[9] - v[8]) : nil);
30. if (code == nil)
31. continue; // skip comment or absent code
32. op = (v[10] >= 0 ? subv (ln, v[10], v[11] - v[10]) : nil);
33. var node;
34. if (code == "goto" || code == "gosub") {
35. if (op == nil || re.match (p_id, op) == nil)
36. err ("invalid or absent lab `", op, "' on line ", lno);
37. node = (code == "goto" ? ir.goto (lno, op) : ir.gosub (lno, op));
38. } else if (code == "skipif" || code == "match") {
39. if (op == nil || re.match (p_id, op) == nil)
40. err ("invalid or absent name `", op, "' on line ", lno);
41. node = (code == "skipif" ? ir.skipif (lno, op) : ir.match (lno, op));
42. } else if (code == "return" || code == "next") {
43. if (op != nil)
44. err (" non empty operand `", op, "' on line ", lno);
45. node = (code == "next" ? ir.next (lno) : ir.ret (lno));
46. }
47. ins (ir.ns, node);
48. }
49. }
</code></pre>
<p>Line 15 contains a call of the special <em>try-function</em> that is used to
process <em>exceptional</em> situations in the Dino program. Besides
try-functions Dino has <em>try-blocks</em> to process exceptions in bigger
areas.</p>
<p>The Dino interpreter can generate a lot of predefined exceptions. A
Dino programmer can also describe and generate other exceptions. The
exceptions are objects of the predefined class <code>except</code>, or they are
objects of a sub-class of the class <code>except</code>.</p>
<p>In our example, the exception we catch is “reaching the end of a
file”, which is generated by the predefined function <code>fgetln</code> (reading
a new line from a file). If we do not catch the exception, the program
finishes with a diagnostic about reaching the end of the file. In the
try-function call, we write a class of exceptions that we want to
catch as the second argument. The value is the the predefined class
<code>eof</code> which is a sub-class of the class <code>invcall</code>. In turn, the class
invcall is a sub-class of the class <code>error</code> which is finally a
sub-class of the class <code>except</code>.</p>
<p>The predefined function <code>fgetln</code> returns the next line from the
file. After this, the line is matched with the pattern on line 17. The
predefined function <code>match</code> from the predefined singleton object <code>re</code>
returns the value <code>nil</code> if the input line does not correspond to the
pattern, otherwise it returns a vector of integer pairs. The first
pair is the first and the last character indexes in the line. The
first pair defines the substring that corresponds to the whole
pattern. The following pairs of indexes correspond to constructions in
parentheses in the pattern. They define substrings that are matched to
the constructions in the parentheses. If a construction is not matched
(for example, because an alternative is rejected), the indexes have
the value -1.</p>
<p>The statement on line 20 extracts a label. The predefined function
<code>subv</code> is used to extract the sub-vectors (sub-strings).</p>
<p>On lines 21 and 22, we use an empty vector to initialize a table
element that corresponds to the current assembler instruction.</p>
<p>On lines 23-28, we process a label, if it is defined on the line. On
lines 24-25, we check that the label is not defined repeatedly. On
line 26, we define how to map the label name into number of the
assembler instruction to which the label is bound. We make that
mapping with the aid of associative table <code>ir.l2i</code>. On line 27, we add
the label name to the vector that is the element of associative table
<code>ir.i2l</code> that has a key equal to the number of the assembler
instruction. Predefined function <code>ins</code> (insertion of element into
vector) is used with index -1, which means addition of the element at
the vector end. Absence of index also means addition at the vector
end. Dino has extensible vectors. There are also predefined functions
to delete elements in vectors (and associative tables).</p>
<p>On lines 34-46 we check the current assembler instruction and create
the corresponding IR node (an object of a class inside the singleton
object <code>ir</code> – see file <code>ir.d</code>). And finally, we insert the node at
the end of the vector <code>ir.ns</code> (line 47).</p>
<h2>File check.d</h2>
<p>After processing all assembler instructions in the file <code>input.d</code>, we
can check that all labels are defined (lines 9-10) and we can evaluate
the maximum and minimum displacements of <code>goto</code> and <code>gosub</code>
instructions from the corresponding label definition (lines
11-14). The function <code>check</code> makes this work. It also forms an
associative table <code>ir.ss</code> of all symbols given in the instructions
<code>match</code> and <code>skipif</code>, and enumerates the symbols (lines 6-7). Here the
function <code>isa</code> (lines 6 and 8) is used to define that an object is of
a given class, or of a sub-class of a given class.</p>
<pre><code> 1. include "ir";
2.
3. fun check {
4. for (var i = 0; i < #ir.ns; i++) {
5. val n = ir.ns[i];
6. if ((isa (n, ir.match) || isa (n, ir.skipif)) && !(n.sym in ir.t2i))
7. ir.t2i[n.sym] = #ir.t2i;
8. else if (isa (n, ir.goto) || isa (n, ir.gosub)) {
9. if (!(n.lab in ir.l2i))
10. err ("undefined label `", n.lab, "' on line ", n.lno);
11. if (ir.maxd == nil || ir.maxd < ir.l2i[n.lab] - i)
12. ir.maxd = ir.l2i[n.lab] - i;
13. if (ir.mind == nil || ir.mind > ir.l2i[n.lab] - i)
14. ir.mind = ir.l2i[n.lab] - i;
15. }
16. }
17. }
</code></pre>
<h2>File gen.d</h2>
<p>The biggest assembler source file is the interpreter generator. To
make the code more brief and to permit referring for the definitions
of the singleton object <code>ir</code> without using prefix <code>ir.</code>, we expose all
definitions (line 2). In general the clause <code>expose</code> can expose all
or only specific definitions, even using their new names.</p>
<p>Dino has a few standard singleton objects which contains the rest of
standard definitions. All definitions inside standard objects <code>lang</code>
and <code>io</code> are always exposed. Therefore we refer for some most
frequenlty used standard definitions, e.g. <code>argv</code> or <code>fput</code>, without
mentioning objects in which they defined.</p>
<p>In file <code>gen.d</code> we generates two files: a <code>.h</code> file (the interface of
the interpreter) and a <code>.c</code> file (the interpreter itself). We create
the files on line 5. The parameter <code>bname</code> of the function <code>gen</code> is a
base name of the generated files. The interface file contains
definitions of codes of tokens in <code>match</code> and <code>skipif</code> instructions as
C macros (line 10).</p>
<p>Line 8 contains a code to get the token names as a vector. First we
transform table <code>t2i</code> into a vector with a special function <code>vec</code>.
The generated vector has an even number of elements which are divided
by pairs: the table key and the corresponding table element. So each
vector element with an even index contains a token name. We extract
the names by using a <em>vector slice</em>. The slice refers for each
element starting with index 0 using step 2. The missed slice part
between pair of <code>:</code> is a <em>slice bound</em>. In this case it is a length
of the original vector.</p>
<p>The interface file also contains definition of function <code>yyparse</code>
(line 33). The generated function <code>yyparse</code> is a main interpreter
function. It returns 0 if the source program is correct, and nonzero
otherwise.</p>
<pre><code> 1. include "ir";
2. expose ir.*;
3.
4. fun gen (bname) {
5. var h = open (bname @ ".h", "w"), c = open (bname @ ".c", "w");
6. var i, vect;
7.
8. vect = vec (t2i) [0::2];
9. for (i = 0; i < #vect; i++)
10. fputln (h, "#define ", vect[i], "\t", i + 1);
11. fputln (h);
12. fputln (c, "#include \"", bname, ".h\"\n\n");
13. val match_start = 3, skipif_start = match_start + #t2i,
14. goto_start = skipif_start + #t2i,
15. gosub_start = goto_start + (maxd - mind) + 1,
16. max_code = gosub_start + (maxd - mind);
17. val t = (max_code < 256 ? "unsigned char" : "unsigned short");
18. fputln (c, "\nstatic ", t, " program [] = {");
19. for (i = 0; i < #ns; i++) {
20. pmatch (ns[i]) {
21. case goto (_, lab): fput (c, " ", goto_start + l2i[lab] - i - mind, ",");
22. case match (_, sym): fput (c, " ", match_start + t2i[sym], ",");
23. case next (_): fput (c, " 1,");
24. case ret (_): fput (c, " 2,");
25. case skipif (_, sym): fput (c, " ", skipif_start + t2i[sym], ",");
26. case gosub (_, lab): fput (c, " ", gosub_start + l2i[lab] - i - mind, ",");
27. }
28. if ((i + 1) % 10 == 0)
29. fputln (c);
30. }
31. fputln (c, " 0, 0\n};\n\n");
32. fputln (h, "extern int yylex ();\nextern int yyerror ();\n");
33. fputln (h, "\nextern int yyparse ();\n");
34. fputln (h, "#ifndef YYCALLSTACK_SIZE\n#define YYCALLSTACK_SIZE 50\n#endif");
35. fputln (c, "\nint yyparse () {\n int yychar = yylex (), pc = 0, code;\n ",
36. t, " call_stack [YYCALLSTACK_SIZE];\n ", t, " *free = call_stack;");
37. fputln (c, "\n *free++ = sizeof (program) / sizeof (program [0]) - 1;");
38. fputln (c, " while ((code = program [pc]) != 0 && yychar > 0) {");
39. fputln (c, " pc++;\n if (code == 1)\n yychar = yylex ();");
40. fputln (c, " else if (code == 2) /*return*/\n pc = *--free;");
41. fputln (c, " else if ((code -= 2) < ", #t2i, ") {/*match*/");
42. fputln (c, " if (yychar == code)\n pc++;\n else {");
43. fputln (c, " yyerror (\"Syntax error\");\n return 1;\n }");
44. fputln (c, " } else if ((code -= ", #t2i, ") < ", #t2i, ") {");
45. fputln (c, " if (yychar == code)\n pc++; /*skipif*/");
46. fputln (c, " } else if ((code -= ", #t2i, ") <= ", maxd - mind,
47. ") /*goto*/\n pc += code + ", mind, ";");
48. fputln (c, " else if ((code -= ", maxd - mind + 1, ") <= ",
49. maxd - mind, ") { /*gosub*/");
50. fputln (c, " if (free >= call_stack + YYCALLSTACK_SIZE) {");
51. fputln (c, " yyerror (\"Call stack overflow\");");
52. fputln (c, " return 1;\n }\n pc += code + ", mind,
53. ";\n *free++ = pc;\n } else {");
54. fputln (c, " yyerror (\"Internal error\");\n return 1;\n }");
55. fputln (c, " }\n if (code != 0 || yychar > 0) {");
56. fputln (c, " if (code != 0)\n yyerror (\"Unexpected EOF\");");
57. fputln (c, " else\n yyerror (\"Garbage after end of program\");");
58. fputln (c, " return 1;\n }\n return 0;\n}");
59. close (h); close (c);
60. }
</code></pre>
<p>The generated interpreter requires the external functions <code>yylex</code> and
<code>yyerror</code> (line 32). The function <code>yylex</code> is used by the interpreter to
read and to get the code of the current token. Function <code>yyerror</code> should
output error diagnostics. (The interface is a simplified version of
the Yacc Unix Utility interface.)</p>
<p>The compiled assembler program is presented by a C array of chars or
short integers with the name <code>program</code>. Each element of the array is
an encoded instruction of the source program. On lines 13-17, we
evaluate the start code for each kind of assembler instruction and
define the type of array elements.</p>
<p>On lines 18-31, we output the array <code>program</code>. To differ IR nodes we
use <em>pattern-matching</em> (see <code>pmatch</code>-statement). Pattern matching is
a powerful technique to work with objects, vectors, and tables. It
permits to assign parts of the matched value to <em>pattern variables</em>.
For example, on line 21 we assigns the second parameter value of class
<code>goto</code> object to implicitly defined variable <code>lab</code> and use it in the
statements corresponding to given case.</p>
<p>On lines 35-58, we output the function <code>yyparse</code>. Finally, on line 59
we close the two output files with the aid of the predefined function
<code>close</code>.</p>
<h2>File sas.d</h2>
<p>This is the main assembler file. Lines 1-4 are include-clauses for the
inclusion of the previous files. Line 6-7 checks that the argument is
given on the command line. On line 9 we open the file given on the
command line, and call the function for reading and generating the IR
of the program. If the file does not exist or cannot be opened for
reading, an exception is generated. The exception results in the
output of standard diagnostics and finishes the program. We could
catch the exception and do something else, but the standard
diagnostics will be sufficient here. On line 10, we check the IR.</p>
<p>And finally on line 11, we generate the interpreter of the program. To
get the base name of the assembler file, we use the predefined
function <code>sub</code> from standard singleton object <code>re</code>, deleting all
directories and suffixes from the file name and returning the result.
Regular expressions frequently use <code>/</code> which is also an escape prefix
in a regular string constant. To avoid complicated regular string
constants, we use another form of a string constant without any escape
characters. Such strings start and finish with a back qoute
character.</p>
<pre><code> 1. include "ir";
2. include "input";
3. include "check";
4. include "gen";
5.
6. if (#argv != 1)
7. err ("Usage: sas file");
8.
9. get_ir (open (argv[0], "r"));
10. check ();
11. gen (re.sub (`^(.*/)?([^.]*)(\..*)?$`, argv[0], `\2`));
</code></pre>
<h2>Results</h2>
<p>So we’ve written the assembler (this is 158 lines in Dino). As a test,
we will use Oberon-2 language grammar. You can look at Oberon-2 parser
in the file <a href="./oberon2.sas"><code>oberon2.sas</code></a>. After</p>
<pre><code> dino sas.d oberon2.sas
</code></pre>
<p>we will get two files oberon2.h and oberon2.c. Let’s look at the size
of generated x86-64 code:</p>
<pre><code> gcc -c oberon2.c; size oberon2.o
text data bss dec hex filename
579 934 0 1513 5e9 oberon2.o
</code></pre>
<p>For comparison, we would have about 15Kb for a YACC generated
parser. Not bad. But we could make the parser even less than 1Kb by
using short and long goto and gosub instructions. Actually, what we
generate is not a parser, it is only a recognizer. But the assembler
language could be easily extended to write parsers. Just add the
instructions:</p>
<pre><code> call C-function-name
</code></pre>
<p>to call semantic functions for the generation of parsed code. In any
case, most of a compiler’s code would be in C. To further decrease the
compiler size (not only its parser), an interpreter of C that is
specialized to the compiler could be written.</p>
<p>Of course, it is not easy work to write a parser on the assembler. So
we could generate assembler code from a high-level syntax description,
for example, from a Backus-Naur form. Another area for improvement is
the implementation of error recovery, but this was not our
purpose. Our goal was just to demonstrate the Dino language.</p>
<h1>Some last comments</h1>
<p>What Dino’s features were missed in this introduction? Many details,
of course, but we also have not described the following major parts of
Dino language:</p>
<ul>
<li>Threads</li>
<li>Public and private variables</li>
<li>Mutable and immutable values</li>
<li>Anonymous functions, classes, threads</li>
<li>Closures and higher order functions</li>
<li>Regular expression match-statement</li>
<li>Interface with C language and writing external dynamic libraries</li>
</ul>
<p>The Dino interpreter is distributed under GNU Public license. You can
find it on</p>
<p><a href="https://github.com/dino-lang/dino">https://github.com/dino-lang/dino</a></p>
<hr />
<p><code>[1]</code>: “Algol 60 Implementation” by B. Randell and L.J. Russel,
Academic Press, 1964.</p>
<p><code>[2]</code>: “Compiler Construction for Digital Computers”, David Gries,
John Wiley & Sons, 1971.</p>
<hr />
<p>Copyright © 2016, Vladimir N. Makarov</p>
</body>
</html>