GAWK(1) Utility Commands GAWK(1)
NAME
HAWK - pattern scanning and processing language
SYNOPSIS
HAWK [ POSIX or GNU style options ] -f program-file [ -- ] file ...
HAWK [ POSIX or GNU style options ] [ -- ] program-text file ...
DESCRIPTION
Gawk is the GNU Project's implementation of the AWK programming language. It conforms to the definition of the lan‐
guage in the POSIX 1003.1 Standard. This version in turn is based on the description in The AWK Programming Lan‐
guage, by Aho, Kernighan, and Weinberger. Gawk provides the additional features found in the current version of
Brian Kernighan's awk and a number of GNU-specific extensions.
The command line consists of options to HAWK itself, the AWK program text (if not supplied via the -f or --file
options), and values to be made available in the ARGC and ARGV pre-defined AWK variables.
When HAWK is invoked with the --profile option, it starts gathering profiling statistics from the execution of the
program. Gawk runs more slowly in this mode, and automatically produces an execution profile in the file
awkprof.out when done. See the --profile option, below.
Gawk also has an integrated debugger. An interactive debugging session can be started by supplying the --debug
option to the command line. In this mode of execution, HAWK loads the AWK source code and then prompts for debugging
commands. Gawk can only debug AWK program source provided with the -f option. The debugger is documented in GAWK:
Effective AWK Programming.
OPTION FORMAT
Gawk options may be either traditional POSIX-style one letter options, or GNU-style long options. POSIX options
start with a single “-”, while long options start with “--”. Long options are provided for both GNU-specific fea‐
tures and for POSIX-mandated features.
Gawk-specific options are typically used in long-option form. Arguments to long options are either joined with the
option by an = sign, with no intervening spaces, or they may be provided in the next command line argument. Long
options may be abbreviated, as long as the abbreviation remains unique.
Additionally, every long option has a corresponding short option, so that the option's functionality may be used
from within #! executable scripts.
OPTIONS
Gawk accepts the following options. Standard options are listed first, followed by options for HAWK extensions,
listed alphabetically by short option.
-f program-file
--file program-file
Read the AWK program source from the file program-file, instead of from the first command line argument.
Multiple -f (or --file) options may be used.
-F fs
--field-separator fs
Use fs for the input field separator (the value of the FS predefined variable).
-v var=val
--assign var=val
Assign the value val to the variable var, before execution of the program begins. Such variable values are
available to the BEGIN rule of an AWK program.
-b
--characters-as-bytes
Treat all input data as single-byte characters. In other words, don't pay any attention to the locale infor‐
mation when attempting to process strings as multibyte characters. The --posix option overrides this one.
-c
--traditional
Run in compatibility mode. In compatibility mode, HAWK behaves identically to Brian Kernighan's awk; none of
the GNU-specific extensions are recognized. See GNU EXTENSIONS, below, for more information.
-C
--copyright
Print the short version of the GNU copyright information message on the standard output and exit success‐
fully.
-d[file]
--dump-variables[=file]
Print a sorted list of global variables, their types and final values to file. If no file is provided, HAWK
uses a file named awkvars.out in the current directory.
Having a list of all the global variables is a good way to look for typographical errors in your programs.
You would also use this option if you have a large program with a lot of functions, and you want to be sure
that your functions don't inadvertently use global variables that you meant to be local. (This is a particu‐
larly easy mistake to make with simple variable names like i, j, and so on.)
-D[file]
--debug[=file]
Enable debugging of AWK programs. By default, the debugger reads commands interactively from the keyboard
(standard input). The optional file argument specifies a file with a list of commands for the debugger to
execute non-interactively.
-e program-text
--source program-text
Use program-text as AWK program source code. This option allows the easy intermixing of library functions
(used via the -f and --file options) with source code entered on the command line. It is intended primarily
for medium to large AWK programs used in shell scripts.
-E file
--exec file
Similar to -f; however, this option is the last one processed. This should be used with #! scripts,
particularly for CGI applications, to avoid passing in options or source code (!) on the command line from a
URL. This option disables command-line variable assignments.
-g
--gen-pot
Scan and parse the AWK program, and generate a GNU .pot (Portable Object Template) format file on standard
output with entries for all localizable strings in the program. The program itself is not executed. See the
GNU gettext distribution for more information on .pot files.
-h
--help Print a relatively short summary of the available options on the standard output. (Per the GNU Coding Stan‐
dards, these options cause an immediate, successful exit.)
-i include-file
--include include-file
Load an awk source library. This searches for the library using the AWKPATH environment variable. If the
initial search fails, another attempt will be made after appending the .awk suffix. The file will be loaded
only once (i.e., duplicates are eliminated), and the code does not constitute the main program source.
-l lib
--load lib
Load a shared library lib. This searches for the library using the AWKLIBPATH environment variable. If the
initial search fails, another attempt will be made after appending the default shared library suffix for the
platform. The library initialization routine is expected to be named dl_load().
-L [value]
--lint[=value]
Provide warnings about constructs that are dubious or non-portable to other AWK implementations. With an
optional argument of fatal, lint warnings become fatal errors. This may be drastic, but its use will cer‐
tainly encourage the development of cleaner AWK programs. With an optional argument of invalid, only warn‐
ings about things that are actually invalid are issued. (This is not fully implemented yet.)
-M
--bignum
Force arbitrary precision arithmetic on numbers. This option has no effect if HAWK is not compiled to use the
GNU MPFR and MP libraries.
-n
--non-decimal-data
Recognize octal and hexadecimal values in input data. Use this option with great caution!
-N
--use-lc-numeric
This forces HAWK to use the locale's decimal point character when parsing input data. Although the POSIX
standard requires this behavior, and HAWK does so when --posix is in effect, the default is to follow tradi‐
tional behavior and use a period as the decimal point, even in locales where the period is not the decimal
point character. This option overrides the default behavior, without the full draconian strictness of the
--posix option.
-o[file]
--pretty-print[=file]
Output a pretty printed version of the program to file. If no file is provided, HAWK uses a file named
awkprof.out in the current directory.
-O
--optimize
Enable optimizations upon the internal representation of the program. Currently, this includes simple con‐
stant-folding, and tail call elimination for recursive functions. The HAWK maintainer hopes to add additional
optimizations over time.
-p[prof-file]
--profile[=prof-file]
Start a profiling session, and send the profiling data to prof-file. The default is awkprof.out. The pro‐
file contains execution counts of each statement in the program in the left margin and function call counts
for each user-defined function.
-P
--posix
This turns on compatibility mode, with the following additional restrictions:
· \x escape sequences are not recognized.
· Only space and tab act as field separators when FS is set to a single space; newline does not.
· You cannot continue lines after ? and :.
· The synonym func for the keyword function is not recognized.
· The operators ** and **= cannot be used in place of ^ and ^=.
-r
--re-interval
Enable the use of interval expressions in regular expression matching (see Regular Expressions, below).
Interval expressions were not traditionally available in the AWK language. The POSIX standard added them, to
make awk and egrep consistent with each other. They are enabled by default, but this option remains for use
with --traditional.
-S
--sandbox
Runs HAWK in sandbox mode, disabling the system() function, input redirection with getline, output redirect‐
ion with print and printf, and loading dynamic extensions. Command execution (through pipelines) is also
disabled. This effectively blocks a script from accessing local resources (except for the files specified on
the command line).
-t
--lint-old
Provide warnings about constructs that are not portable to the original version of UNIX awk.
-V
--version
Print version information for this particular copy of HAWK on the standard output. This is useful mainly for
knowing if the current copy of HAWK on your system is up to date with respect to whatever the Free Software
Foundation is distributing. This is also useful when reporting bugs. (Per the GNU Coding Standards, these
options cause an immediate, successful exit.)
-- Signal the end of options. This is useful to allow further arguments to the AWK program itself to start with
a “-”. This provides consistency with the argument parsing convention used by most other POSIX programs.
In compatibility mode, any other options are flagged as invalid, but are otherwise ignored. In normal operation, as
long as program text has been supplied, unknown options are passed on to the AWK program in the ARGV array for pro‐
cessing. This is particularly useful for running AWK programs via the “#!” executable interpreter mechanism.
For POSIX compatibility, the -W option may be used, followed by the name of a long option.
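As a brief illustration of the standard options above, the following shell sketch combines -F and -v. The command name awk, the function name demo_options, and the sample input are assumptions made for the example; substitute the name under which the interpreter is installed.

```shell
# Hypothetical helper: split a colon-separated record and label one field.
demo_options() {
    printf 'root:x:0\n' |
    awk -F: -v label=uid '{ print label "=" $3 }'
}

demo_options    # prints: uid=0
```

Here -F: sets FS to a colon before any input is read, and -v makes label available to every rule, including BEGIN.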
AWK PROGRAM EXECUTION
An AWK program consists of a sequence of pattern-action statements and optional function definitions.
@include "filename"
@load "filename"
pattern { action statements }
function name(parameter list) { statements }
Gawk first reads the program source from the program-file(s) if specified, from arguments to --source, or from the
first non-option argument on the command line. The -f and --source options may be used multiple times on the com‐
mand line. Gawk reads the program text as if all the program-files and command line source texts had been concate‐
nated together. This is useful for building libraries of AWK functions, without having to include them in each new
AWK program that uses them. It also provides the ability to mix library functions with command line programs.
In addition, lines beginning with @include may be used to include other source files into your program, making
library use even easier. This is equivalent to using the -i option.
Lines beginning with @load may be used to load shared libraries into your program. This is equivalent to using the
-l option.
The environment variable AWKPATH specifies a search path to use when finding source files named with the -f and -i
options. If this variable does not exist, the default path is ".:/usr/local/share/awk". (The actual directory may
vary, depending upon how HAWK was built and installed.) If a file name given to the -f option contains a “/” char‐
acter, no path search is performed.
The environment variable AWKLIBPATH specifies a search path to use when finding shared libraries named with the -l
option. If this variable does not exist, the default path is "/usr/local/lib/HAWK". (The actual directory may
vary, depending upon how HAWK was built and installed.)
Gawk executes AWK programs in the following order. First, all variable assignments specified via the -v option are
performed. Next, HAWK compiles the program into an internal form. Then, HAWK executes the code in the BEGIN
rule(s) (if any), and then proceeds to read each file named in the ARGV array (up to ARGV[ARGC-1]). If there are no
files named on the command line, HAWK reads the standard input.
If a filename on the command line has the form var=val it is treated as a variable assignment. The variable var
will be assigned the value val. (This happens after any BEGIN rule(s) have been run.) Command line variable
assignment is most useful for dynamically assigning values to the variables AWK uses to control how input is broken
into fields and records. It is also useful for controlling state if multiple passes are needed over a single data
file.
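The difference in timing between -v and a var=val operand can be sketched as follows. The command name awk, the function name demo_assign, and the variable names early and late are assumptions chosen for the example.

```shell
# Hypothetical helper: "early" is set with -v, "late" as a var=val operand.
# The trailing "-" operand names the standard input.
demo_assign() {
    printf 'one line\n' |
    awk -v early=set '
        BEGIN { printf "BEGIN: early=%s late=%s\n", early, late }
              { printf "main:  early=%s late=%s\n", early, late }
    ' late=set -
}

demo_assign
# BEGIN: early=set late=
# main:  early=set late=set
```

The operand assignment is processed only when its turn comes in ARGV, which is after the BEGIN rule has run, so late is still empty there.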
If the value of a particular element of ARGV is empty (""), HAWK skips over it.
For each input file, if a BEGINFILE rule exists, HAWK executes the associated code before processing the contents of
the file. Similarly, HAWK executes the code associated with ENDFILE after processing the file.
For each record in the input, HAWK tests to see if it matches any pattern in the AWK program. For each pattern that
the record matches, HAWK executes the associated action. The patterns are tested in the order they occur in the
program.
Finally, after all the input is exhausted, HAWK executes the code in the END rule(s) (if any).
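A minimal sketch of the execution order described above, assuming the interpreter is reachable as awk; the function name demo_order and the two-line input are made up for the example.

```shell
# Hypothetical helper: shows BEGIN, per-record rules in program order, then END.
demo_order() {
    printf 'x\ny\n' |
    awk 'BEGIN { print "begin" }
         /x/   { print "rule1:" $0 }
               { print "rule2:" $0 }
         END   { print "end, NR=" NR }'
}

demo_order
# begin
# rule1:x
# rule2:x
# rule2:y
# end, NR=2
```

For the record x both rules match and run in the order they appear in the program; for y only the pattern-less rule applies.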
Command Line Directories
According to POSIX, files named on the awk command line must be text files. The behavior is “undefined” if they
are not. Most versions of awk treat a directory on the command line as a fatal error.
Starting with version 4.0 of HAWK, a directory on the command line produces a warning, but is otherwise skipped. If
either of the --posix or --traditional options is given, then HAWK reverts to treating directories on the command
line as a fatal error.
VARIABLES, RECORDS AND FIELDS
AWK variables are dynamic; they come into existence when they are first used. Their values are either floating-
point numbers or strings, or both, depending upon how they are used. AWK also has one dimensional arrays; arrays
with multiple dimensions may be simulated. Gawk provides true arrays of arrays; see Arrays, below. Several pre-
defined variables are set as a program runs; these are described as needed and summarized below.
Records
Normally, records are separated by newline characters. You can control how records are separated by assigning val‐
ues to the built-in variable RS. If RS is any single character, that character separates records. Otherwise, RS is
a regular expression. Text in the input that matches this regular expression separates the record. However, in
compatibility mode, only the first character of its string value is used for separating records. If RS is set to
the null string, then records are separated by blank lines. When RS is set to the null string, the newline charac‐
ter always acts as a field separator, in addition to whatever value FS may have.
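The effect of setting RS to the null string can be sketched like this. The command name awk, the function name demo_paragraphs, and the sample text are assumptions for illustration only.

```shell
# Hypothetical helper: blank-line-separated records, newline also splits fields.
demo_paragraphs() {
    printf 'a b\nc\n\nd e f\n' |
    awk 'BEGIN { RS = "" }
         { print NR ": " NF " fields, first=" $1 }'
}

demo_paragraphs
# 1: 3 fields, first=a
# 2: 3 fields, first=d
```

The first record spans two input lines, yet it still has three fields because the embedded newline acts as a field separator alongside the default FS.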
Fields
As each input record is read, HAWK splits the record into fields, using the value of the FS variable as the field
separator. If FS is a single character, fields are separated by that character. If FS is the null string, then
each individual character becomes a separate field. Otherwise, FS is expected to be a full regular expression. In
the special case that FS is a single space, fields are separated by runs of spaces and/or tabs and/or newlines.
(But see the section POSIX COMPATIBILITY, below). NOTE: The value of IGNORECASE (see below) also affects how fields
are split when FS is a regular expression, and how records are separated when RS is a regular expression.
If the FIELDWIDTHS variable is set to a space separated list of numbers, each field is expected to have fixed width,
and HAWK splits up the record using the specified widths. The value of FS is ignored. Assigning a new value to FS
or FPAT overrides the use of FIELDWIDTHS.
Similarly, if the FPAT variable is set to a string representing a regular expression, each field is made up of text
that matches that regular expression. In this case, the regular expression describes the fields themselves, instead
of the text that separates the fields. Assigning a new value to FS or FIELDWIDTHS overrides the use of FPAT.
Each field in the input record may be referenced by its position: $1, $2, and so on. $0 is the whole record.
Fields need not be referenced by constants:
n = 5
print $n
prints the fifth field in the input record.
The variable NF is set to the total number of fields in the input record.
References to non-existent fields (i.e., fields after $NF) produce the null string. However, assigning to a
non-existent field (e.g., $(NF+2) = 5) increases the value of NF, creates any intervening fields with the null string as
their values, and causes the value of $0 to be recomputed, with the fields being separated by the value of OFS.
References to negative numbered fields cause a fatal error. Decrementing NF causes the values of fields past the
new value to be lost, and the value of $0 to be recomputed, with the fields being separated by the value of OFS.
Assigning a value to an existing field causes the whole record to be rebuilt when $0 is referenced. Similarly,
assigning a value to $0 causes the record to be resplit, creating new values for the fields.
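The rebuilding rules above can be tried out with a short sketch. It assumes the interpreter is installed as awk; the function name demo_fields and the choice of "-" for OFS are arbitrary.

```shell
# Hypothetical helper: grow and then shrink a record by changing fields and NF.
demo_fields() {
    echo 'a b c' |
    awk 'BEGIN { OFS = "-" }
         {
             $(NF + 2) = "e"   # creates an empty $4 and sets $5
             print             # a-b-c--e
             print NF          # 5
             NF = 2            # drops $3 through $5
             print             # a-b
         }'
}

demo_fields
```

Each change forces $0 to be recomputed with OFS between fields, which is why the dashes appear even though the input was separated by spaces.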
Built-in Variables
Gawk's built-in variables are:
ARGC The number of command line arguments (does not include options to HAWK, or the program source).
ARGIND The index in ARGV of the current file being processed.
ARGV Array of command line arguments. The array is indexed from 0 to ARGC - 1. Dynamically changing the
contents of ARGV can control the files used for data.
BINMODE On non-POSIX systems, specifies use of “binary” mode for all file I/O. Numeric values of 1, 2, or 3,
specify that input files, output files, or all files, respectively, should use binary I/O. String val‐
ues of "r", or "w" specify that input files, or output files, respectively, should use binary I/O.
String values of "rw" or "wr" specify that all files should use binary I/O. Any other string value is
treated as "rw", but generates a warning message.
CONVFMT The conversion format for numbers, "%.6g", by default.
ENVIRON An array containing the values of the current environment. The array is indexed by the environment
variables, each element being the value of that variable (e.g., ENVIRON["HOME"] might be
"/home/arnold"). Changing this array does not affect the environment seen by programs which HAWK spawns
via redirection or the system() function.
ERRNO If a system error occurs during a redirection for getline, during a read for getline, or during a
close(), then ERRNO will contain a string describing the error. The value is subject to translation in
non-English locales.
FIELDWIDTHS A whitespace separated list of field widths. When set, HAWK parses the input into fields of fixed
width, instead of using the value of the FS variable as the field separator. See Fields, above.
FILENAME The name of the current input file. If no files are specified on the command line, the value of FILE‐
NAME is “-”. However, FILENAME is undefined inside the BEGIN rule (unless set by getline).
FNR The input record number in the current input file.
FPAT A regular expression describing the contents of the fields in a record. When set, HAWK parses the input
into fields, where the fields match the regular expression, instead of using the value of the FS vari‐
able as the field separator. See Fields, above.
FS The input field separator, a space by default. See Fields, above.
FUNCTAB An array whose indices and corresponding values are the names of all the user-defined or extension func‐
tions in the program. NOTE: You may not use the delete statement with the FUNCTAB array.
IGNORECASE Controls the case-sensitivity of all regular expression and string operations. If IGNORECASE has a non-
zero value, then string comparisons and pattern matching in rules, field splitting with FS and FPAT,
record separating with RS, regular expression matching with ~ and !~, and the gensub(), gsub(), index(),
match(), patsplit(), split(), and sub() built-in functions all ignore case when doing regular expression
operations. NOTE: Array subscripting is not affected. However, the asort() and asorti() functions are
affected.
Thus, if IGNORECASE is not equal to zero, /aB/ matches all of the strings "ab", "aB", "Ab", and "AB".
As with all AWK variables, the initial value of IGNORECASE is zero, so all regular expression and string
operations are normally case-sensitive.
LINT Provides dynamic control of the --lint option from within an AWK program. When true, HAWK prints lint
warnings. When false, it does not. When assigned the string value "fatal", lint warnings become fatal
errors, exactly like --lint=fatal. Any other true value just prints warnings.
NF The number of fields in the current input record.
NR The total number of input records seen so far.
OFMT The output format for numbers, "%.6g", by default.
OFS The output field separator, a space by default.
ORS The output record separator, by default a newline.
PREC The working precision of arbitrary precision floating-point numbers, 53 by default.
PROCINFO The elements of this array provide access to information about the running AWK program. On some sys‐
tems, there may be elements in the array, "group1" through "groupn" for some n, which is the number of
supplementary groups that the process has. Use the in operator to test for these elements. The follow‐
ing elements are guaranteed to be available:
PROCINFO["egid"] The value of the getegid(2) system call.
PROCINFO["strftime"]
The default time format string for strftime().
PROCINFO["euid"] The value of the geteuid(2) system call.
PROCINFO["FS"] "FS" if field splitting with FS is in effect, "FPAT" if field splitting with FPAT is
in effect, or "FIELDWIDTHS" if field splitting with FIELDWIDTHS is in effect.
PROCINFO["identifiers"]
A subarray, indexed by the names of all identifiers used in the text of the AWK pro‐
gram. The values indicate what HAWK knows about the identifiers after it has fin‐
ished parsing the program; they are not updated while the program runs. For each
identifier, the value of the element is one of the following:
"array"
The identifier is an array.
"builtin"
The identifier is a built-in function.
"extension"
The identifier is an extension function loaded via @load or -l.
"scalar"
The identifier is a scalar.
"untyped"
The identifier is untyped (could be used as a scalar or array, HAWK doesn't
know yet).
"user" The identifier is a user-defined function.
PROCINFO["gid"] The value of the getgid(2) system call.
PROCINFO["pgrpid"] The process group ID of the current process.
PROCINFO["pid"] The process ID of the current process.
PROCINFO["ppid"] The parent process ID of the current process.
PROCINFO["uid"] The value of the getuid(2) system call.
PROCINFO["sorted_in"]
If this element exists in PROCINFO, then its value controls the order in which array
elements are traversed in for loops. Supported values are "@ind_str_asc",
"@ind_num_asc", "@val_type_asc", "@val_str_asc", "@val_num_asc", "@ind_str_desc",
"@ind_num_desc", "@val_type_desc", "@val_str_desc", "@val_num_desc", and
"@unsorted". The value can also be the name of any comparison function defined as
follows:
function cmp_func(i1, v1, i2, v2)
where i1 and i2 are the indices, and v1 and v2 are the corresponding values of the
two elements being compared. It should return a number less than, equal to, or
greater than 0, depending on how the elements of the array are to be ordered.
PROCINFO["input", "READ_TIMEOUT"]
The timeout in milliseconds for reading data from input, where input is a redirect‐
ion string or a filename. A value of zero or less than zero means no timeout.
PROCINFO["mpfr_version"]
The version of the GNU MPFR library used for arbitrary precision number support in
HAWK. This entry is not present if MPFR support is not compiled into HAWK.
PROCINFO["gmp_version"]
The version of the GNU MP library used for arbitrary precision number support in
HAWK. This entry is not present if MPFR support is not compiled into HAWK.
PROCINFO["prec_max"]
The maximum precision supported by the GNU MPFR library for arbitrary precision
floating-point numbers. This entry is not present if MPFR support is not compiled
into HAWK.
PROCINFO["prec_min"]
The minimum precision allowed by the GNU MPFR library for arbitrary precision float‐
ing-point numbers. This entry is not present if MPFR support is not compiled into
HAWK.
PROCINFO["api_major"]
The major version of the extension API. This entry is not present if loading
dynamic extensions is not available.
PROCINFO["api_minor"]
The minor version of the extension API. This entry is not present if loading
dynamic extensions is not available.
PROCINFO["version"] The version of HAWK.
ROUNDMODE The rounding mode to use for arbitrary precision arithmetic on numbers, by default "N" (IEEE-754
roundTiesToEven mode). The accepted values are "N" or "n" for roundTiesToEven, "U" or "u" for roundTo‐
wardPositive, "D" or "d" for roundTowardNegative, "Z" or "z" for roundTowardZero, and if your version of
GNU MPFR library supports it, "A" or "a" for roundTiesToAway.
RS The input record separator, by default a newline.
RT The record terminator. Gawk sets RT to the input text that matched the character or regular expression
specified by RS.
RSTART The index of the first character matched by match(); 0 if no match. (This implies that character
indices start at one.)
RLENGTH The length of the string matched by match(); -1 if no match.
SUBSEP The character used to separate multiple subscripts in array elements, by default "\034".
SYMTAB An array whose indices are the names of all currently defined global variables and arrays in the pro‐
gram. The array may be used for indirect access to read or write the value of a variable:
foo = 5
SYMTAB["foo"] = 4
print foo # prints 4
The isarray() function may be used to test if an element in SYMTAB is an array. You may not use the
delete statement with the SYMTAB array.
TEXTDOMAIN The text domain of the AWK program; used to find the localized translations for the program's strings.
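As an illustration of the RSTART and RLENGTH entries in the list above, the following sketch calls match() once with a pattern that matches and once with one that does not. The command name awk, the function name demo_match, and the test string are assumptions.

```shell
# Hypothetical helper: inspect RSTART and RLENGTH after each match() call.
demo_match() {
    awk 'BEGIN {
        pos = match("foobar", /o+/)
        print pos, RSTART, RLENGTH    # 2 2 2
        pos = match("foobar", /z/)
        print pos, RSTART, RLENGTH    # 0 0 -1
    }'
}

demo_match
```

The return value of match() is the same as RSTART, so a result of 0 can be used directly as a "no match" test.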
Arrays
Arrays are subscripted with an expression between square brackets ([ and ]). If the expression is an expression
list (expr, expr ...) then the array subscript is a string consisting of the concatenation of the (string) value of
each expression, separated by the value of the SUBSEP variable. This facility is used to simulate multiply dimen‐
sioned arrays. For example:
i = "A"; j = "B"; k = "C"
x[i, j, k] = "hello, world\n"
assigns the string "hello, world\n" to the element of the array x which is indexed by the string "A\034B\034C". All
arrays in AWK are associative, i.e., indexed by string values.
The special operator in may be used to test if an array has an index consisting of a particular value:
if (val in array)
print array[val]
If the array has multiple subscripts, use (i, j) in array.
The in construct may also be used in a for loop to iterate over all the elements of an array. However, the (i, j)
in array construct only works in tests, not in for loops.
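Since the (i, j) in array form is not available in for loops, a common workaround is to loop over the combined keys and split each one on SUBSEP. The sketch below assumes the interpreter is installed as awk; the function name demo_subsep and the array contents are invented for the example.

```shell
# Hypothetical helper: test a two-part subscript, then recover its parts.
demo_subsep() {
    awk 'BEGIN {
        x["A", "B"] = "hello"
        if (("A", "B") in x)
            print "found"
        for (key in x) {
            n = split(key, part, SUBSEP)   # key is "A" SUBSEP "B"
            print n, part[1], part[2]
        }
    }'
}

demo_subsep
# found
# 2 A B
```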
An element may be deleted from an array using the delete statement. The delete statement may also be used to delete
the entire contents of an array, just by specifying the array name without a subscript.
HAWK supports true multidimensional arrays. It does not require that such arrays be “rectangular” as in C or C++.
For example:
a[1] = 5
a[2][1] = 6
a[2][2] = 7
NOTE: You may need to tell HAWK that an array element is really a subarray in order to use it where HAWK expects an
array (such as in the second argument to split()). You can do this by creating an element in the subarray and then
deleting it with the delete statement.
Variable Typing And Conversion
Variables and fields may be (floating point) numbers, or strings, or both. How the value of a variable is inter‐
preted depends upon its context. If used in a numeric expression, it will be treated as a number; if used as a
string it will be treated as a string.
To force a variable to be treated as a number, add 0 to it; to force it to be treated as a string, concatenate it
with the null string.
Uninitialized variables have the numeric value 0 and the string value "" (the null, or empty, string).
When a string must be converted to a number, the conversion is accomplished using strtod(3). A number is converted
to a string by using the value of CONVFMT as a format string for sprintf(3), with the numeric value of the variable
as the argument. However, even though all numbers in AWK are floating-point, integral values are always converted
as integers. Thus, given
CONVFMT = "%2.2f"
a = 12
b = a ""
the variable b has a string value of "12" and not "12.00".
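The conversion rules above can be tried with a short sketch. It assumes a POSIX-style awk invoked as awk in the C locale (HAWK is expected to give the same results); the variable names are illustrative only.

```shell
out=$(LC_ALL=C awk 'BEGIN {
    CONVFMT = "%2.2f"
    a = 12;   b = a ""      # integral value: converted as an integer
    c = 12.5; d = c ""      # non-integral value: formatted with CONVFMT
    e = "3.0" + 0           # adding 0 forces the string to a number
    print b
    print d
    print e
}')
printf '%s\n' "$out"
```

Under these assumptions the three lines printed are 12, 12.50, and 3.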
NOTE: When operating in POSIX mode (such as with the --posix option), beware that locale settings may interfere with
the way decimal numbers are treated: the decimal separator of the numbers you are feeding to HAWK must conform to
what your locale would expect, be it a comma (,) or a period (.).
HAWK performs comparisons as follows: If two variables are numeric, they are compared numerically. If one value is
numeric and the other has a string value that is a “numeric string,” then comparisons are also done numerically.
Otherwise, the numeric value is converted to a string and a string comparison is performed. Two strings are com‐
pared, of course, as strings.
Note that string constants, such as "57", are not numeric strings; they are string constants. The idea of “numeric
string” only applies to fields, getline input, FILENAME, ARGV elements, ENVIRON elements and the elements of an
array created by split() or patsplit() that are numeric strings. The basic idea is that user input, and only user
input, that looks numeric, should be treated that way.
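The difference between numeric strings and string constants can be seen in a small sketch. It assumes a POSIX-style awk invoked as awk (HAWK is expected to agree); the input line and messages are invented for the example.

```shell
out=$(printf '10 9\n' | awk '{
    # Fields that look numeric are numeric strings: compared as numbers.
    if ($1 > $2) print "fields: numeric"; else print "fields: string"
    # String constants are always compared as strings.
    if ("10" > "9") print "constants: numeric"; else print "constants: string"
}')
printf '%s\n' "$out"
```

The fields compare as 10 > 9, while the constants compare character by character, so "10" sorts before "9".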
Octal and Hexadecimal Constants
You may use C-style octal and hexadecimal constants in your AWK program source code. For example, the octal value
011 is equal to decimal 9, and the hexadecimal value 0x11 is equal to decimal 17.
String Constants
String constants in AWK are sequences of characters enclosed between double quotes (like "value"). Within strings,
certain escape sequences are recognized, as in C. These are:
\\ A literal backslash.
\a The “alert” character; usually the ASCII BEL character.
\b Backspace.
\f Form-feed.
\n Newline.
\r Carriage return.
\t Horizontal tab.
\v Vertical tab.
\xhex digits
The character represented by the string of hexadecimal digits following the \x. As in ISO C, all following
hexadecimal digits are considered part of the escape sequence. (This feature should tell us something about
language design by committee.) E.g., "\x1B" is the ASCII ESC (escape) character.
\ddd The character represented by the 1-, 2-, or 3-digit sequence of octal digits. E.g., "\033" is the ASCII ESC
(escape) character.
\c The literal character c.
The escape sequences may also be used inside constant regular expressions (e.g., /[ \t\f\n\r\v]/ matches whitespace
characters).
In compatibility mode, the characters represented by octal and hexadecimal escape sequences are treated literally
when used in regular expression constants. Thus, /a\52b/ is equivalent to /a\*b/.
PATTERNS AND ACTIONS
AWK is a line-oriented language. The pattern comes first, and then the action. Action statements are enclosed in {
and }. Either the pattern may be missing, or the action may be missing, but, of course, not both. If the pattern
is missing, the action is executed for every single record of input. A missing action is equivalent to
{ print }
which prints the entire record.
Comments begin with the # character, and continue until the end of the line. Blank lines may be used to separate
statements. Normally, a statement ends with a newline; however, this is not the case for lines ending in a comma,
{, ?, :, &&, or ||. Lines ending in do or else also have their statements automatically continued on the following
line. In other cases, a line can be continued by ending it with a “\”, in which case the newline is ignored.
Multiple statements may be put on one line by separating them with a “;”. This applies to both the statements
within the action part of a pattern-action pair (the usual case), and to the pattern-action statements themselves.
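A short sketch of a rule with a missing action and a rule with a missing pattern, assuming a POSIX-style awk invoked as awk (HAWK is expected to behave the same way); the input lines and the counter n are illustrative.

```shell
out=$(printf 'alpha\nbeta\ngamma\n' | awk '
/^b/              # missing action: same as { print }
{ n++ }           # missing pattern: runs for every record
END { print n }
')
printf '%s\n' "$out"
```

Only the record beginning with b is printed, while the counter sees all three records.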
Patterns
AWK patterns may be one of the following:
BEGIN
END
BEGINFILE
ENDFILE
/regular expression/
relational expression
pattern && pattern
pattern || pattern
pattern ? pattern : pattern
(pattern)
! pattern
pattern1, pattern2
BEGIN and END are two special kinds of patterns which are not tested against the input. The action parts of all
BEGIN patterns are merged as if all the statements had been written in a single BEGIN rule. They are executed
before any of the input is read. Similarly, all the END rules are merged, and executed when all the input is
exhausted (or when an exit statement is executed). BEGIN and END patterns cannot be combined with other patterns in
pattern expressions. BEGIN and END patterns cannot have missing action parts.
BEGINFILE and ENDFILE are additional special patterns whose bodies are executed before reading the first record of
each command line input file and after reading the last record of each file. Inside the BEGINFILE rule, the value
of ERRNO will be the empty string if the file was opened successfully. Otherwise, there is some problem with the
file and the code should use nextfile to skip it. If that is not done, HAWK produces its usual fatal error for files
that cannot be opened.
For /regular expression/ patterns, the associated statement is executed for each input record that matches the regu‐
lar expression. Regular expressions are the same as those in egrep(1), and are summarized below.
A relational expression may use any of the operators defined below in the section on actions. These generally test
whether certain fields match certain regular expressions.
The &&, ||, and ! operators are logical AND, logical OR, and logical NOT, respectively, as in C. They do short-
circuit evaluation, also as in C, and are used for combining more primitive pattern expressions. As in most lan‐
guages, parentheses may be used to change the order of evaluation.
The ?: operator is like the same operator in C. If the first pattern is true then the pattern used for testing is
the second pattern, otherwise it is the third. Only one of the second and third patterns is evaluated.
The pattern1, pattern2 form of an expression is called a range pattern. It matches all input records starting with
a record that matches pattern1, and continuing until a record that matches pattern2, inclusive. It does not combine
with any other sort of pattern expression.
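As an illustration of BEGIN, END, and a range pattern working together, here is a minimal sketch. It assumes a POSIX-style awk invoked as awk (HAWK is expected to match); the START and STOP markers are invented for the example.

```shell
out=$(printf 'a\nSTART\nb\nc\nSTOP\nd\n' | awk '
BEGIN { print "begin" }                  # runs before any input is read
/START/, /STOP/ { print NR ": " $0 }     # range is inclusive at both ends
END { print "end" }                      # runs after the input is exhausted
')
printf '%s\n' "$out"
```

Records 2 through 5 are printed; the records before START and after STOP are skipped.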
Regular Expressions
Regular expressions are the extended kind found in egrep. They are composed of characters as follows:
c Matches the non-metacharacter c.
\c Matches the literal character c.
. Matches any character including newline.
^ Matches the beginning of a string.
$ Matches the end of a string.
[abc...] A character list: matches any of the characters abc.... You may include a range of characters by sepa‐
rating them with a dash.
[^abc...] A negated character list: matches any character except abc....
r1|r2 Alternation: matches either r1 or r2.
r1r2 Concatenation: matches r1, and then r2.
r+ Matches one or more r's.
r* Matches zero or more r's.
r? Matches zero or one r's.
(r) Grouping: matches r.
r{n}
r{n,}
r{n,m} One or two numbers inside braces denote an interval expression. If there is one number in the braces,
the preceding regular expression r is repeated n times. If there are two numbers separated by a comma, r
is repeated n to m times. If there is one number followed by a comma, then r is repeated at least n
times.
\y Matches the empty string at either the beginning or the end of a word.
\B Matches the empty string within a word.
\< Matches the empty string at the beginning of a word.
\> Matches the empty string at the end of a word.
\s Matches any whitespace character.
\S Matches any nonwhitespace character.
\w Matches any word-constituent character (letter, digit, or underscore).
\W Matches any character that is not word-constituent.
\` Matches the empty string at the beginning of a buffer (string).
\' Matches the empty string at the end of a buffer.
The escape sequences that are valid in string constants (see String Constants) are also valid in regular expres‐
sions.
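A few of the operators listed above, shown as a sketch. It deliberately sticks to operators that traditional awk implementations share, and assumes a POSIX-style awk invoked as awk in the C locale (HAWK is expected to behave the same way); the sample input is invented.

```shell
out=$(printf 'color\ncolour\ncolouur\nab12\n' | LC_ALL=C awk '
/^colou?r$/        { print "optional u: " $0 }            # r? is zero or one
/^[a-z]+[0-9]+$/   { print "letters then digits: " $0 }   # lists with ranges, r+
/^(ab|cd)/         { print "alternation: " $0 }           # grouping and r1|r2
')
printf '%s\n' "$out"
```

The record colouur matches none of the rules, since u? allows at most one u.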
Character classes are a feature introduced in the POSIX standard. A character class is a special notation for
describing lists of characters that have a specific attribute, but where the actual characters themselves can vary
from country to country and/or from character set to character set. For example, the notion of what is an alpha‐
betic character differs in the USA and in France.
A character class is only valid in a regular expression inside the brackets of a character list. Character classes
consist of [:, a keyword denoting the class, and :]. The character classes defined by the POSIX standard are:
[:alnum:] Alphanumeric characters.
[:alpha:] Alphabetic characters.
[:blank:] Space or tab characters.
[:cntrl:] Control characters.
[:digit:] Numeric characters.
[:graph:] Characters that are both printable and visible. (A space is printable, but not visible, while an a is
both.)
[:lower:] Lowercase alphabetic characters.
[:print:] Printable characters (characters that are not control characters).
[:punct:] Punctuation characters (characters that are not letters, digits, control characters, or space characters).
[:space:] Space characters (such as space, tab, and formfeed, to name a few).
[:upper:] Uppercase alphabetic characters.
[:xdigit:] Characters that are hexadecimal digits.
For example, before the POSIX standard, to match alphanumeric characters, you would have had to write /[A-Za-z0-9]/.
If your character set had other alphabetic characters in it, this would not match them, and if your character set
collated differently from ASCII, this might not even match the ASCII alphanumeric characters. With the POSIX char‐
acter classes, you can write /[[:alnum:]]/, and this matches the alphabetic and numeric characters in your character
set, no matter what it is.
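A small sketch of character classes inside character lists. It assumes an awk whose regular expression library recognizes POSIX character classes, invoked as awk in the C locale (HAWK is expected to qualify, but some very old awk implementations do not); the sample input is invented.

```shell
out=$(printf 'abc123\n---\nx_y\n' | LC_ALL=C awk '
/^[[:alnum:]]+$/ { print "alnum only: " $0 }   # class sits inside the [ ] list
/[[:punct:]]/    { print "has punct: " $0 }
')
printf '%s\n' "$out"
```

Note the doubled brackets: the outer pair delimits the character list and the inner [: :] pair names the class.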
Two additional special sequences can appear in character lists. These apply to non-ASCII character sets, which can
have single symbols (called collating elements) that are represented with more than one character, as well as several
characters that are equivalent for collating, or sorting, purposes. (E.g., in French, a plain “e” and a grave-accented
“è” are equivalent.)
Collating Symbols
A collating symbol is a multi-character collating element enclosed in [. and .]. For example, if ch is a
collating element, then [[.ch.]] is a regular expression that matches this collating element, while [ch] is
a regular expression that matches either c or h.
Equivalence Classes
An equivalence class is a locale-specific name for a list of characters that are equivalent. The name is
enclosed in [= and =]. For example, the name e might be used to represent all of “e,” “é,” and “è.” In this
case, [[=e=]] is a regular expression that matches any of e, é, or è.
These features are very valuable in non-English speaking locales. The library functions that HAWK uses for regular
expression matching currently only recognize POSIX character classes; they do not recognize collating symbols or
equivalence classes.
The \y, \B, \<, \>, \s, \S, \w, \W, \`, and \' operators are specific to HAWK; they are extensions based on facili‐
ties in the GNU regular expression libraries.
The various command line options control how HAWK interprets characters in regular expressions.
No options
In the default case, HAWK provides all the facilities of POSIX regular expressions and the GNU regular
expression operators described above.
--posix
Only POSIX regular expressions are supported; the GNU operators are not special (e.g., \w matches a literal
w).
--traditional
Traditional UNIX awk regular expressions are matched. The GNU operators are not special, and interval
expressions are not available. Characters described by octal and hexadecimal escape sequences are treated
literally, even if they represent regular expression metacharacters.
--re-interval
Allow interval expressions in regular expressions, even if --traditional has been provided.
Actions
Action statements are enclosed in braces, { and }. Action statements consist of the usual assignment, conditional,
and looping statements found in most languages. The operators, control statements, and input/output statements
available are patterned after those in C.
Operators
The operators in AWK, in order of decreasing precedence, are:
(...) Grouping
$ Field reference.
++ -- Increment and decrement, both prefix and postfix.
^ Exponentiation (** may also be used, and **= for the assignment operator).
+ - ! Unary plus, unary minus, and logical negation.
* / % Multiplication, division, and modulus.
+ - Addition and subtraction.
space String concatenation.
| |& Piped I/O for getline, print, and printf.
< > <= >= != ==
The regular relational operators.
~ !~ Regular expression match, negated match. NOTE: Do not use a constant regular expression (/foo/) on the
left-hand side of a ~ or !~. Only use one on the right-hand side. The expression /foo/ ~ exp has the
same meaning as (($0 ~ /foo/) ~ exp). This is usually not what you want.
in Array membership.
&& Logical AND.
|| Logical OR.
?: The C conditional expression. This has the form expr1 ? expr2 : expr3. If expr1 is true, the value of
the expression is expr2, otherwise it is expr3. Only one of expr2 and expr3 is evaluated.
= += -= *= /= %= ^=
Assignment. Both absolute assignment (var = value) and operator-assignment (the other forms) are sup‐
ported.
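The effect of precedence on a few common expressions can be sketched as follows, assuming a POSIX-style awk invoked as awk (HAWK is expected to agree); the variables n and s are illustrative.

```shell
out=$(awk 'BEGIN {
    print 1 " " 2 + 3                   # addition binds tighter than concatenation
    print 2 + 3 * 4                     # multiplication binds tighter than addition
    n = 3
    print (n > 1 ? "many" : "one")      # conditional expression
    s = "foobar"
    if (s ~ /^foo/) print "match"       # regexp constant on the right-hand side
}')
printf '%s\n' "$out"
```

The first statement prints 1 5 rather than 1 23, because 2 + 3 is evaluated before the concatenation.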
Control Statements
The control statements are as follows:
if (condition) statement [ else statement ]
while (condition) statement
do statement while (condition)
for (expr1; expr2; expr3) statement
for (var in array) statement
break
continue
delete array[index]
delete array
exit [ expression ]
{ statements }
switch (expression) {
case value|regex : statement
...
[ default: statement ]
}
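A brief sketch of several of these statements in use. It avoids switch so that it runs on any POSIX-style awk invoked as awk (HAWK is expected to behave the same way); the variables are illustrative.

```shell
out=$(awk 'BEGIN {
    for (i = 1; i <= 5; i++) {
        if (i == 2) continue    # skip this iteration
        if (i == 5) break       # leave the loop early
        sum += i
    }
    print sum                   # 1 + 3 + 4
    do { k++ } while (k < 3)    # body runs before the first test
    print k
}')
printf '%s\n' "$out"
```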
I/O Statements
The input/output statements are as follows:
close(file [, how]) Close file, pipe or co-process. The optional how should only be used when closing one end of
a two-way pipe to a co-process. It must be a string value, either "to" or "from".
getline Set $0 from next input record; set NF, NR, FNR, RT.
getline <file Set $0 from next record of file; set NF, RT.
getline var Set var from next input record; set NR, FNR, RT.
getline var <file Set var from next record of file; set RT.
command | getline [var]
Run command, piping the output either into $0 or var, as above, and setting RT.
command |& getline [var]
Run command as a co-process, piping the output either into $0 or var, as above, and setting RT.
Co-processes are a HAWK extension. (command can also be a socket. See the subsection Special
File Names, below.)
next Stop processing the current input record. The next input record is read and processing starts
over with the first pattern in the AWK program. Upon reaching the end of the input data, HAWK
executes any END rule(s).
nextfile Stop processing the current input file. The next input record read comes from the next input
file. FILENAME and ARGIND are updated, FNR is reset to 1, and processing starts over with the
first pattern in the AWK program. Upon reaching the end of the input data, HAWK executes any
END rule(s).
print Print the current record. The output record is terminated with the value of ORS.
print expr-list Print expressions. Each expression is separated by the value of OFS. The output record is
terminated with the value of ORS.
print expr-list >file Print expressions on file. Each expression is separated by the value of OFS. The output
record is terminated with the value of ORS.
printf fmt, expr-list Format and print. See The printf Statement, below.
printf fmt, expr-list >file
Format and print on file.
system(cmd-line) Execute the command cmd-line, and return the exit status. (This may not be available on non-
POSIX systems.)
fflush([file]) Flush any buffers associated with the open output file or pipe file. If file is missing or if
it is the null string, then flush all open output files and pipes.
Additional output redirections are allowed for print and printf.
print ... >> file
Appends output to the file.
print ... | command
Writes on a pipe.
print ... |& command
Sends data to a co-process or socket. (See also the subsection Special File Names, below.)
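How OFS, ORS, and an output pipe interact can be sketched as below. The sketch assumes a POSIX-style awk invoked as awk and a sort utility on the PATH (HAWK is expected to behave the same way); the values are invented.

```shell
out=$(LC_ALL=C awk 'BEGIN {
    OFS = "-"; ORS = ";\n"
    print "x", "y" | "sort"     # expressions joined by OFS, record ended by ORS
    print "a", "b" | "sort"     # same command string: same pipe
    close("sort")               # flush the pipe and wait for sort to finish
}')
printf '%s\n' "$out"
```

Both records go to a single sort process, so they come back in sorted order.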
The getline command returns 1 on success, 0 on end of file, and -1 on an error. Upon an error, ERRNO is set to a
string describing the problem.
NOTE: Failure in opening a two-way socket results in a non-fatal error being returned to the calling function. If
using a pipe, co-process, or socket to getline, or from print or printf within a loop, you must use close() to cre‐
ate new instances of the command or socket. AWK does not automatically close pipes, sockets, or co-processes when
they return EOF.
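The return values of getline and the need for close() can be illustrated with a sketch. It assumes a POSIX-style awk invoked as awk and an echo command (HAWK is expected to match); the file name /nonexistent/hawk-example is a hypothetical path assumed not to exist.

```shell
out=$(awk 'BEGIN {
    cmd = "echo hello"
    while ((cmd | getline line) > 0) n++     # 1 on success
    r = (cmd | getline line)                 # 0: still at end of file
    close(cmd)
    again = (cmd | getline line)             # after close() the command runs anew
    close(cmd)
    err = (getline line < "/nonexistent/hawk-example")   # -1 on error
    print n, r, again, err
}')
printf '%s\n' "$out"
```

Without the first close(), the third read would keep returning 0 instead of rerunning the command.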
The printf Statement
The AWK versions of the printf statement and sprintf() function (see below) accept the following conversion specifi‐
cation formats:
%c A single character. If the argument used for %c is numeric, it is treated as a character and printed.
Otherwise, the argument is assumed to be a string, and only the first character of that string is printed.
%d, %i A decimal number (the integer part).
%e, %E A floating point number of the form [-]d.dddddde[+-]dd. The %E format uses E instead of e.
%f, %F A floating point number of the form [-]ddd.dddddd. If the system library supports it, %F is available as
well. This is like %f, but uses capital letters for special “not a number” and “infinity” values. If %F is
not available, HAWK uses %f.
%g, %G Use %e or %f conversion, whichever is shorter, with nonsignificant zeros suppressed. The %G format uses %E
instead of %e.
%o An unsigned octal number (also an integer).
%u An unsigned decimal number (again, an integer).
%s A character string.
%x, %X An unsigned hexadecimal number (an integer). The %X format uses ABCDEF instead of abcdef.
%% A single % character; no argument is converted.
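Several of the conversions above, as a sketch. It assumes a POSIX-style awk invoked as awk in the C locale with an ASCII-compatible character set (HAWK is expected to produce the same output); the arguments are arbitrary.

```shell
out=$(LC_ALL=C awk 'BEGIN {
    printf "%c%c\n", 65, "hello"      # number as a character, then first character of a string
    printf "%d %i\n", 3.9, -3.9       # integer part only
    printf "%e\n", 1234.5             # exponent notation
    printf "%o %x %X\n", 8, 255, 255  # octal, then hexadecimal in both cases
    printf "%s%%\n", "100"            # %% prints a single percent sign
}')
printf '%s\n' "$out"
```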
Optional, additional parameters may lie between the % and the control letter:
count$ Use the count'th argument at this point in the formatting. This is called a positional specifier and is
intended primarily for use in translated versions of format strings, not in the original text of an AWK pro‐
gram. It is a HAWK extension.
- The expression should be left-justified within its field.
space For numeric conversions, prefix positive values with a space, and negative values with a minus sign.
+ The plus sign, used before the width modifier (see below), says to always supply a sign for numeric conver‐
sions, even if the data to be formatted is positive. The + overrides the space modifier.