This repository has been archived by the owner on Mar 6, 2024. It is now read-only.
TODO
bug: presorted relations use new/delete
bug: exec_wrap seems to hang if command is used along with long running stream
or even by itself. try better test cases here.
support "<expression> AS <name>" syntax in usercolumn lists
just to reduce typo thunking between sql and here.
list functions and metric functions should be interchangeable,
at least for those that return expressions
sum,max,avg,first,last, etc...
then, you could make metrics into expressions and vice versa
sum( 1, 2, 3) same as sum in a group by, but sum in a
non-group by relation would just return
list functions
expression metric(list);//for all unary groupby metrics
expression metric(list,list);//for all binary groupby metrics
expression ca([ad]*)r (list)
list cd([ad]*)r (list)
expression first(list) ;//same as car
expression last(list) ;
expression size(list) ;
list rest(list) ;//same as cdr
expression nth(i,list)
list slice(start,count,list)
list split(regex,expression)
list range(regex,expression)
list words(expression)
bool equal(list,list)
expression mismatch(list,list);//index of first mismatch
vector vector(list)
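A minimal Python sketch of how a few of the list functions above might behave. It is not cplusql code; the zero-based index for nth, the -1 result for equal lists in mismatch, and the helper name cxr are assumptions made for illustration.

```python
import re


def cxr(name, lst):
    """Apply a Lisp-style c[ad]+r accessor, e.g. 'cadr' means car(cdr(lst))."""
    m = re.fullmatch(r"c([ad]+)r", name)
    if m is None:
        raise ValueError(f"not a c[ad]+r name: {name!r}")
    result = lst
    # Letters apply right to left: 'cadr' takes the cdr first, then the car.
    for op in reversed(m.group(1)):
        result = result[0] if op == "a" else result[1:]
    return result


def nth(i, lst):
    """Element at zero-based position i (the index base is an assumption)."""
    return lst[i]


def slice_(start, count, lst):
    """Sub-list of `count` elements beginning at `start`."""
    return lst[start:start + count]


def mismatch(a, b):
    """Index of the first differing position, or -1 if the lists are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One list is a prefix of the other: they differ where the shorter ends.
    return -1 if len(a) == len(b) else min(len(a), len(b))
```

With this shape, first(list) is cxr("car", list) and rest(list) is cxr("cdr", list), which matches the "same as car" and "same as cdr" notes above.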
maps as expressions
map_expression wordcount(text)
or more generally...
list words(text)
map_expression counts(list)
so...
map_expression counts(words(text))
then subsequent relations could have expressions that
used dynamic cast at parse time to make sure they were pointed at a map.
list map_keys(map_expression)
list map_values(map_expression)
list map_keys_sort_by_key(map_expression)
list map_keys_sort_by_value(map_expression)
list map_values_sort_by_key(map_expression)
list map_values_sort_by_value(map_expression)
expression map_expression[expression];
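The composition counts(words(text)) and the sorted key/value accessors could look roughly like the Python sketch below. It is only an illustration of the intended semantics: the word pattern (\w+), ascending sort order, and tie-breaking by key are assumptions, not decisions recorded in these notes.

```python
import re
from collections import Counter


def words(text):
    """Split text into words; the \\w+ pattern is an assumption."""
    return re.findall(r"\w+", text)


def counts(items):
    """Map each distinct item to the number of times it occurs."""
    return Counter(items)


def map_keys_sort_by_key(m):
    return sorted(m)


def map_keys_sort_by_value(m):
    # Ties on value are broken by key so the order is deterministic.
    return sorted(m, key=lambda k: (m[k], k))


def map_values_sort_by_key(m):
    return [m[k] for k in sorted(m)]


def map_values_sort_by_value(m):
    return sorted(m.values())


# wordcount(text) is then just the composition of the two general pieces.
wc = counts(words("the cat and the hat and the bat"))
```

Here wc["the"] plays the role of the map_expression[expression] lookup.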
vectors as expressions
take list as construction arg,
or just be pointer to named vector constructed from column
designate as column or row vector
then do linear algebra transforms on vectors and matrices.
relations could support anonymous numbered columns, rather than named.
optional default relation called stdin
named matrix created from columns, eg expression_list
create relation from rss_feed(expression)
named vectors and expression_lists
map_value syntax sugar using square brackets
merge join should support the outer_clause
make sure large file support compiled for platforms that need that.
command() hangs on some platforms (centos, smvdapp01)
error string not there on some platforms (centos, smvdapp01)
*** job control with dependencies ! ***
parallelize
pipeline
restart transactionally
report errors
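One possible shape for dependency-driven job control, sketched in Python with the standard graphlib module (Python 3.9+). The function name run_jobs, the deps mapping, and the done set used for restarts are all hypothetical; jobs that become ready together are independent and are the ones that could be parallelized, though this sketch runs them serially.

```python
from graphlib import TopologicalSorter


def run_jobs(deps, run, done=frozenset()):
    """Run jobs in dependency order.

    deps: maps each job name to the set of jobs it depends on.
    run:  callable invoked with a job name; raising marks the job as failed.
    done: jobs completed by an earlier attempt; they are skipped on restart.
    Returns (completed, errors).
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    completed = set(done)
    errors = {}
    while True:
        ready = sorter.get_ready()
        if not ready:
            break  # either everything ran or the rest is blocked by failures
        for job in ready:
            if job not in completed:
                try:
                    run(job)
                except Exception as exc:
                    # Report the error and leave the job unfinished so that
                    # nothing depending on it is ever scheduled.
                    errors[job] = str(exc)
                    continue
                completed.add(job)
            sorter.done(job)
    return completed, errors
```

Passing the completed set from a failed attempt back in as done gives a simple form of restart: finished jobs are not re-run, and only the failed job and its dependents execute.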
normalize() syntax should not require two lists if no stop words used.
pidfiles for cplusql
autodetermine column names based on first line of input
DONE optionally write out column names of a given relation to .meta file.
have an option to create vcg graphs
have a parse/lex only option that simply validates parse, and optionally
produces metadata and graph.
record streams : just 1 column, no delimiter streams
cat (command lines )
jobqstat (command lines )
stat (command lines )
DONE, but has performance bug - stream pg inputs from copy or query...
DONE- threaded N buffered output
DONE - threaded N buffered input
- allow weak parse to deliver more columns than required, or fewer...
- support for cronolog
for loop.
create global/local options for map behavior when input is not unique.
allow for anonymous relations to be embedded in create map foo using ...
validation
cmd line shortcuts
-S --sql "sql text" acts like extractor, number and name of columns determined by result set
--DBHOST
--DBUSER
--DBPASS
--DBTYPE
-C --cat <#cols> behaves like cat, except that it parses input and requires
there to be <#cols>. filters and modifiers may be defined which
refer to "input" as a relation and c1 ... cN as column names.
--input-delimiter <char> sets default input delimiter
-d --delimiter <char> sets default delimiter
--output-delimiter <char> sets default output delimiter
--mapkey-delimiter <char> sets mapfile delimiter
-F --filter "cplusql text"
-M --map <mapfile> <keycols> "cplusql text of value, may be cN of mapfile"
-U --user-column <name> "cplusql text"
-L --map-file-column <name> <mapfile> <mapfile-key> <mapfile-value> <input-key>
-O --output-column-list <name> [ , <name> [ ... ]]
-G --groupby <columnlist>
-q to shut up the sync messages (use VERBOSE_SYNC=0)
-v to set APPLOG_MINLOG=0
all incompatible with:
-f <cplusql file>
-y
-l
bug: *_ON_PANIC breaks intentional use of exceptions, such as in mergejoin creation
setup for distribution:
- finalize directory structure
- doxygenize
- configure ; make ; make install
design review points:
- current class structure
- proposed plugin architecture
- proposed build platforms
important, 1.1, but not 1.0:
- gzip
- db
- sort (execpipe?)
- execpipe
- self reference
2.x:
- sets
- numeric constants
FEATURES:
Expressions:
is_float( exp )
is_integer( exp )
x integer expression ceil( exp )
x integer expression floor( exp )
integer expression round( exp )
boolean plusp()
boolean minusp()
complex(number,number)
boolean expression isNull( exp ); [ length(s) == 0 ]
x untyped expression nvl( exp1, exp2 ); #equiv to if(isNull(exp1))then{exp2}else{exp1} (see coalesce)
untyped expression decode( source-exp, special-exp, special-value-exp, default-value );
x boolean expression op1 ~= op2 true if strings are equal.
x boolean expression strequal( exp, exp ) true if strings are equal.
x boolean expression match( exp, pattern );
x string expression strcat( exp1, exp2 ) [ see ~+ ]
string expression truncate( exp1, number )
x string expression substr( exp1, startpos, length )
string expression substitute( exp1, string pattern, string replacement )
string expression replace_str( exp1, string pattern, string replacement, skip count, replace count )
string expression replace_chars( exp1, ( chars or hex codes ), ( replacement chars or hex codes) )
x lcase( expression ) lowercased version of input
ltrim( expression ) eliminate space on left
rtrim( expression ) eliminate space on right
wtrim( expression ) eliminate left, right white space and squeeze redundant spaces into one
normalize( expression ) combination of lcase(),wtrim()
strip_punct() (see bytestrip)
bitwise operators
x & and
x | or
^ xor
~ complement
>> shift right N bits
<< shift left N bits
x abort( "message", message, ... )
x warn( "message", message, ... )
info( "message", message, ... )
dbg( "message", message, ... )
rowexception( "message", message, ... )
date conversion functions
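The intended semantics of a few of the string and null-handling expressions above can be pinned down with a small Python sketch. This is an illustration only, not cplusql code: it assumes NULL means the empty string (as the isNull note suggests), that "space" means any whitespace, and that decode() compares with simple equality.

```python
def is_null(s):
    """NULL is modelled as the empty string: length(s) == 0."""
    return len(s) == 0


def nvl(exp1, exp2):
    """Equivalent to: if isNull(exp1) then exp2 else exp1."""
    return exp2 if is_null(exp1) else exp1


def decode(source, special, special_value, default_value):
    """Return special_value when source equals special, else default_value."""
    return special_value if source == special else default_value


def lcase(s):
    return s.lower()


def ltrim(s):
    return s.lstrip()


def rtrim(s):
    return s.rstrip()


def wtrim(s):
    """Trim both ends and squeeze runs of whitespace into one space."""
    return " ".join(s.split())


def normalize(s):
    """Combination of lcase() and wtrim()."""
    return wtrim(lcase(s))
```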
allow optional destination section in stream defs.
-s flag creates stdin default with loose parsing and "c1","c2", column names
and sends to stdout
projection expression nthval( exp, N )
nthvalue( exp )
other joins ( 2 day )
presorted group by ( 2 day )
unsorted group by ( 1 day, milestone mar 9 )
projection expression set ( 1 day )
sync must accept text and print out start and stop, each joint must also.
better usage
--key=value args go into global
verbose flag
don't require stdin be cplusql
non 1.0:
more hash functions: crc32, md5
rewrite grammar using spirit c++ tool from boost.org.
where clause for all streams, not just from
user_column clause for all streams, not just from
execpipe as source or dest or both
db as source or dest
#include other cplusql files.
set based relational ops
trigonometry expressions
sin,cos,tan
acos,asin,atan,atan2
asinh,atanh,acosh
sinh,cosh,tanh
hypot
sqrt,cbrt
signbit
constants: pi, pi/2, etc...
abs
match oracle 9x statistical expressions, ie: regression, etc...
#first, last
lag, lead
pvm and/or runsw integration to exec remote procs and pipe them data.
use configs to set delimiters and buffer sizes
- create distinction between row-only exceptions and other exceptions for which
it is not ok to continue. allow Joint to continue calling next for the
former, if so configured, up to some other perstream and/or global
configurable number of row exceptions.
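The row-exception idea can be sketched in Python as below. RowException, pump, and max_row_exceptions are hypothetical names standing in for the Joint loop and its per-stream or global limit; any other exception type is deliberately not caught, so it still stops processing.

```python
class RowException(Exception):
    """A failure confined to one row; processing may continue after it."""


def pump(rows, transform, max_row_exceptions=0):
    """Apply transform to each row, tolerating a limited number of bad rows.

    Returns (output_rows, skipped_count). Once more than max_row_exceptions
    rows have failed, the RowException is re-raised to the caller.
    """
    output = []
    skipped = 0
    for row in rows:
        try:
            output.append(transform(row))
        except RowException:
            skipped += 1
            if skipped > max_row_exceptions:
                raise
    return output, skipped


def parse_int(row):
    """Example transform: a bad value is a row-level problem only."""
    try:
        return int(row)
    except ValueError as exc:
        raise RowException(f"not an integer: {row!r}") from exc
```

With max_row_exceptions=0 the behaviour is the strict one: the first bad row aborts the stream.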
HYGIENE:
doxygen
BUGS:
- NUMBER-NUMBER is parse error, parse does "5","-6", not "5","-","6"
- it is possible to specify an expression on a relation that is not an
immediate parent of the current relation. this will cause undefined behavior
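For the NUMBER-NUMBER bug, one common approach is to let the lexer fold a minus sign into a number only when nothing operand-like precedes it. The Python sketch below illustrates that rule on a toy token set; it is not the cplusql lexer, and the token pattern and function names are assumptions.

```python
import re

# Toy token set: numbers, identifiers, and a few single-character operators.
TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/()])")


def _is_operand(tok):
    """True for tokens that can end an operand: ')', a name, or a number."""
    if tok is None:
        return False
    if tok == ")":
        return True
    if tok[0].isalnum() or tok[0] == "_":
        return True
    return tok[0] == "-" and len(tok) > 1  # an already folded negative number


def tokenize(text):
    tokens = []
    pos = 0
    text = text.strip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tok, pos = m.group(1), m.end()
        prev = tokens[-1] if tokens else None
        before_prev = tokens[-2] if len(tokens) > 1 else None
        # A pending '-' is a sign only if no operand comes right before it;
        # otherwise it stays a separate binary-minus token.
        if tok[0].isdigit() and prev == "-" and not _is_operand(before_prev):
            tokens[-1] = "-" + tok
        else:
            tokens.append(tok)
    return tokens
```

Under this rule "5-6" yields "5","-","6", while a leading "-6" is still a single negative number.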
example helper scripts:
example join, appending some data.
(inner, outer, single column with default values )
example of sequential key generation using single db.
example of distributed hashed key generation (static distribution of hashes)
data profiling support
profile_column.cplusql
uses COLUMN_INDEX or chooses 0
( most of this is possible already, or better done with sql + graphing tool,
but wanted to get these down, maybe write generic cplusql script that does all
this on one column of input. )
most frequent value k
count of distinct values k
count of distinct text normalized values k
count of rows k
count of empty strings (NULL) k
decimal? needs is_decimal()
float? needs is_decimal()
string? needs is_decimal()
ascii? needs is_decimal()
utf8? needs is_decimal()
iso-8859-1 ? needs is_decimal()
avg,min,max,stddev of numeric value
avg,min,max,stddev of string length
is value distribution gaussian distribution ? pareto distribution?
do disparate columns join? how well? k
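As a rough illustration of what profile_column.cplusql would report, here is a Python sketch covering the counting and string-length items from the list above. The tab delimiter, the dictionary keys, and the use of population standard deviation are assumptions; the type checks and distribution tests are left out.

```python
import statistics
from collections import Counter


def profile_column(rows, column_index=0, delimiter="\t"):
    """Profile one column of delimited text rows."""
    values = [row.split(delimiter)[column_index] for row in rows]
    freq = Counter(values)
    lengths = [len(v) for v in values]
    # "Text normalized" here means lowercased with whitespace squeezed.
    normalized = {" ".join(v.lower().split()) for v in values}
    return {
        "rows": len(values),
        "distinct": len(freq),
        "distinct_normalized": len(normalized),
        "empty": freq.get("", 0),
        "most_frequent": freq.most_common(1)[0][0] if freq else None,
        "len_min": min(lengths, default=0),
        "len_max": max(lengths, default=0),
        "len_avg": statistics.fmean(lengths) if lengths else 0.0,
        "len_stddev": statistics.pstdev(lengths) if lengths else 0.0,
    }
```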
check_join.cplusql
uses COLUMN_INDEX or chooses 0 for left
uses JOIN_FILE for right values
uses JOIN_FILE_COLUMN_INDEX or 0 for right values
requires that join_file join column fits into memory with index.
4 columns...
left row count and distinct value count that did and did not join
next 2 columns
most frequent columns that did/didn't match
leave 4 files with value,count for did/didn't join, left, right
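A Python sketch of the statistics check_join.cplusql is meant to produce, assuming both join columns fit in memory. The function name, the result keys, and returning the value,count tables in memory instead of writing four files are all simplifications made for illustration.

```python
from collections import Counter


def check_join(left_values, right_values):
    """Summarize how well two columns join on exact value equality."""
    left = Counter(left_values)
    right = Counter(right_values)

    def summarize(side, other):
        joined = Counter({v: n for v, n in side.items() if v in other})
        unjoined = Counter({v: n for v, n in side.items() if v not in other})
        return {
            "joined_rows": sum(joined.values()),
            "joined_distinct": len(joined),
            "unjoined_rows": sum(unjoined.values()),
            "unjoined_distinct": len(unjoined),
            "top_joined": joined.most_common(1),
            "top_unjoined": unjoined.most_common(1),
            # value -> count tables; these would become the four output files
            "joined_counts": dict(joined),
            "unjoined_counts": dict(unjoined),
        }

    return {"left": summarize(left, right), "right": summarize(right, left)}
```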
check if "correlated" (ie: one dimension or two ) call it "co-dimensionality"
A card = 100, B card = 1000,
is
bigger dimension unique values / A ^ B unique values
A ^ B = 1000 B maps 1-1 with A (perfectly correlated = 1000/1000 = 1)
A ^ B = 10000 B correlated with A, rate of 1000/10000 = 1/10
A ^ B = 75000 B correlated with A, rate of 1000/75000 = 1/75
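The "co-dimensionality" ratio above can be computed directly from the (A, B) value pairs. In this Python sketch the function name is hypothetical and "A ^ B unique values" is read as the number of distinct (A, B) combinations that occur in the data.

```python
def co_dimensionality(pairs):
    """Larger single-column cardinality divided by distinct (A, B) pairs.

    1.0 means one column fully determines the other; smaller values mean
    the two columns behave more like independent dimensions.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no rows to compare")
    a_card = len({a for a, _ in pairs})
    b_card = len({b for _, b in pairs})
    joint_card = len(set(pairs))
    return max(a_card, b_card) / joint_card


# B maps 1-1 onto A: 100 A values, 1000 B values, 1000 distinct pairs.
perfect = [(b % 100, b) for b in range(1000)]

# Every A value occurs with every B value: 10 * 100 = 1000 distinct pairs.
independent = [(a, b) for a in range(10) for b in range(100)]
```

For the perfectly correlated data the ratio is 1000/1000, and for the fully crossed data it is 100/1000, mirroring the worked figures above.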