Word-Count (a.k.a. the "Hello, World!" of data analytics) counts the occurrences of each distinct word in a text file.
This example shows the killer feature of the PiCo API: there is no data! A PiCo application is described only in terms of processing (i.e., pipeline stages), rather than processing and data.
A simple PiCo pipeline for word-count:

1. reads an input file line by line, by a `ReadFromFile` stage
2. tokenizes each line into words, by a `FlatMap` stage - a `FlatMap` maps an item (a line) to multiple items (words)
3. maps each word `w` to a key-value pair `<w,1>`, by a `Map` stage
4. groups the pairs by word and sums them up, by a `ReduceByKey` stage
5. finally, the word-occurrence pairs `<w,n>` are written to an output file, by a `WriteToFile` stage
In `pico_wc.cpp`, we show a common optimization known as stage fusion: the `wc` pipeline fuses step 3 into step 2, letting the `FlatMap` stage produce the `<w,1>` pairs directly from each word in the processed line.
See the home README for build instructions.
Generate 1k random lines:

```shell
cd /path/to/build/examples/word-count
cd testdata
./generate_text dictionary.txt 1000 >> lines.txt
```
Count those words:

```shell
cd ..
./pico_wc testdata/lines.txt count.txt
```
💡 The parallelism degree can be set:

- externally, application-wide, by the `PARDEG` environment variable
- within the code, for each operator, by passing an (optional) argument to the operator's constructor; per-operator parallelism overrides `PARDEG`
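For instance, the run above could be repeated with an application-wide parallelism degree of 4 (the value is illustrative):

```shell
PARDEG=4 ./pico_wc testdata/lines.txt count.txt
```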
Call the `to_dotfile()` function on a PiCo pipeline to produce a `dot` representation of its semantics.
To visualize the graph for the `wc` pipeline:

```shell
dot -Tpng word-count.dot -o word-count.png
```