organizing and attic-ing material to determine final outline
Philip (flip) Kromer committed Sep 16, 2014
1 parent 2852f5b commit 2cf4b21
Showing 100 changed files with 1,991 additions and 1,208 deletions.
10 changes: 10 additions & 0 deletions 00-outlines.asciidoc
@@ -1,4 +1,14 @@

9. Statistics
10. Event streams -- has some good examples, but no real flow. The topic I'd be most excited to get into the book is the geo-IP matching, which demonstrates a range join.
12, 21, 22, 23. Hadoop internals and tuning. As you can see just from the number of files involved, this is particularly disorganized. If you and I worked out a structure of what should be there, I can organize the spare parts around it.
13. Data munging. This is some of the earliest material and thus some of the messiest. I don't believe this is worth reworking.
14. Organizing data -- the only real material here is a rundown of data formats. Rough.
15. Filesystem mojo and `cat` herding -- runs down the command-line tools: wc, cut, etc. This is actually in decent shape, but I think it should become an appendix.
18. Native Java API -- I'd like to keep this chapter, with its content being either the single sentence "Don't", or that sentence plus one prose paragraph saying you should write Hive or Pig UDFs instead.
19. Advanced Pig -- the material that's there, on Pig config variables and two of the fancy joins, is not too messy. I'd like to at least tell readers about the replicated join, and probably even move it into the earlier chapters. The most we should do here would be to also describe an inline Python UDF and a Java UDF, but there's no material for that (though I do have code examples of UDFs).
10. **Event log**
- geo IP via range query
- sessionizing, user paths
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,9 +1,9 @@
-== Analytic Patterns part 1: Pipeline Operations
+== Analytic Patterns: Map-only Operations

-This chapter focuses exclusively on what we'll call 'pipelineable operations'.
-A pipelineable operations is one that can handle each record in isolation, like the translator chimps from Chimpanzee & Elephant's first job. That property makes those operations trivially parallelizable: they require no reduce phase of their own.
+This chapter focuses exclusively on what we'll call 'Map-only operations'.
+A map-only operation is one that can handle each record in isolation, like the translator chimps from Chimpanzee & Elephant's first job. That property makes those operations trivially parallelizable: they require no reduce phase of their own.

-When a script has only pipelineable operations, they give rise to one mapper-only job which executes the composed pipeline stages. When pipelinable operations are combined with the structural operations you'll meet in the next chapter, they are composed with the stages of the mapper or reducer (depending on whether they come before or after the structural operation).
+When a script has only map-only operations, they give rise to one mapper-only job which executes the composed pipeline stages. When map-only operations are combined with the structural operations you'll meet in the next chapter, they are composed with the stages of the mapper or reducer (depending on whether they come before or after the structural operation).

All of these are listed first and together for a few reasons. One, they are largely fundamental; it's hard to get much done without `FILTER` or `FOREACH`. Two, the way you reason about the performance impact of these operations is largely the same. Since these operations are trivially parallelizable, they scale efficiently and the computation cost rarely impedes throughput. And when pipelined, their performance cost can be summarized as "kids eat free with purchase of adult meal". For datasets of any material size, it's very rare that the cost of preliminary or follow-on processing rivals the cost of the reduce phase. Finally, since these operations handle records in isolation, their memory impact is modest. So learn to think of these together.
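
To make this concrete, here is a minimal sketch of a script built entirely from map-only operations. The `bat_seasons` relation, its path, and its fields are stand-ins, not the book's exact sample code; the point is that Pig compiles the whole pipeline into a single mapper-only job:

[source,pig]
----
-- hypothetical player-season records
bat_seasons = LOAD '/data/gold/bat_seasons' AS (
    player_id:chararray, name_first:chararray, name_last:chararray,
    year_id:int, team_id:chararray, G:int, PA:int, AB:int, H:int, HR:int);

-- each statement handles one record at a time, in isolation...
modern   = FILTER bat_seasons BY (year_id >= 1900) AND (PA > 0);
hr_rates = FOREACH modern GENERATE
    player_id, year_id, (float)HR / PA AS hr_per_pa:float;

-- ...so the whole script becomes one map-only job, with no reduce phase
STORE hr_rates INTO '/data/out/hr_rates';
----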

@@ -70,7 +70,7 @@ Blocks like the following will show up after each of the patterns or groups of patterns
- Programmers take note: `AND`, `OR` -- not `&&`, `||`.
* _Output Count_ -- (_How many records in the output: fewer, same, more, explosively more?_) Zero to 100% of the input record count. Data size will decrease accordingly
* _Records_ -- (_A sketch of what the records coming out of this operation look like_) Identical to input
-* _Data Flow_ -- (_The Hadoop jobs this operation gives rise to. In this chapter, all the lines will look like this one; in the next chapters that will change_) Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- (_The Hadoop jobs this operation gives rise to. In this chapter, all the lines will look like this one; in the next chapters that will change_) Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
* _Exercises for You_ -- (_A mission to carry forward, if you choose. Don't go looking for an answer section -- we haven't done any of them. In many cases you'll be the first to find the answer._) Play around with `null`s and the conditional operators until you have a good sense of its quirks.
* _See Also_ -- (_Besides the patterns in its section of the book, what other topics might apply if you're considering this one? Sometimes this is another section in the book, sometimes it's a pointer elsewhere_) The Distinct operations, some Set operations, and some Joins are also used to eliminate records according to some criteria. See especially the Semi-Join and Anti-Join (REF), which select or reject matches against a large list of keys.
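
For instance, one filter of the kind this template describes -- a sketch using our hypothetical `bat_seasons` fields:

[source,pig]
----
-- note AND/OR rather than &&/||, and that a null PA would fail
-- both comparisons: nulls never satisfy a condition
qualified = FILTER bat_seasons BY (PA IS NOT NULL) AND ((PA >= 450) OR (G >= 100));
----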

@@ -123,7 +123,7 @@ NOTE: Sadly, the Nobel Prize-winning physicists Gerard 't Hooft, Louis-Victor Pi
- You're far better off learning one extra thing to do with a regular expression than learning most of the other string-conditional functions Pig offers.
- ... and enough other Importants to Know that we made a sidebar of them (REF).
* _Records_ -- You can use this in a filter clause but also anywhere else an expression is permitted, like the preceding snippet
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
* _Exercises for You_ -- Follow the http://regexp.info/tutorial.html[regexp.info tutorial], but _only up to the part on Grouping & Capturing_. The rest you are far better off picking up once you find you need it.
* _See Also_ -- The Pig `REGEX_EXTRACT` and http://pig.apache.org/docs/r0.12.0/func.html#replace[`REPLACE`] functions. Java's http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#sum[Regular Expression] documentation for details on its peccadilloes (but not for an education about regular expressions).
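
A quick sketch of the `matches` operator in action; the field and pattern are our own illustration:

[source,pig]
----
-- Pig's `matches` tests the *entire* string against the pattern,
-- so the trailing .* is required to match names of any length
vowel_names = FILTER bat_seasons BY (name_last MATCHES '[AEIOU].*');
----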

@@ -152,7 +152,7 @@ The general case is handled by using a join, as described in the next chapter (REF).
* _Hello, SQL Users_ -- This isn't anywhere near as powerful as SQL's `IN` expression. Most importantly, you can't supply another table as the list.
* _Important to Know_ -- A regular expression alternation is often the right choice instead.
* _Output Count_ -- As many records as the cardinality of its key, i.e. the number of distinct values. Data size should decrease greatly.
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
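
A sketch of both approaches, with a made-up list of team identifiers:

[source,pig]
----
-- spelling out the list with conditionals, in place of SQL's IN (...)
chosen = FILTER bat_seasons BY
    (team_id == 'BOS') OR (team_id == 'NYA') OR (team_id == 'BAL');

-- or the regular-expression alternation mentioned above
chosen_rx = FILTER bat_seasons BY (team_id MATCHES 'BOS|NYA|BAL');
----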

=== Project Only Chosen Columns by Name

@@ -194,7 +194,7 @@ The first projection puts the `home_team_id` into the team slot, renaming it `team_id`.
* _Important to Know_ -- As you can see, we take a lot of care visually aligning subexpressions within the code snippets. That's not because we've tidied up the house for students coming over -- this is what the code we write, and the code our teammates expect us to write, looks like.
* _Output Count_ -- Exactly the same as the input.
* _Records_ -- However you define them to be
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
* _See Also_ -- "Assembling Literals with Complex Type" (REF)

==== Extracting a Random Sample of Records
@@ -219,7 +219,7 @@ Experienced software developers will reach for a "seeding" function -- such as R
- The DataFu package has UDFs for sampling with replacement and other advanced features.
* _Output Count_ -- Determined by the sampling fraction. As a rule of thumb, variances of things are square-root-ish; expect the size of a 10% sample to be in the 7%-13% range.
* _Records_ -- Identical to the input
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
* _Exercises for You_ -- Modify Pig's SAMPLE function to accept a seed parameter, and submit that patch back to the open-source project. This is a bit harder to do than it seems: sampling is key to efficient sorting and so the code to sample data is intertwingled with a lot of core functionality.
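
The operator itself is a one-liner; a sketch:

[source,pig]
----
-- keep roughly 10% of the records: the exact count varies from run
-- to run, and (alas) there is no way to supply a seed
a_sample = SAMPLE bat_seasons 0.10;
----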

==== Extracting a Consistent Sample of Records by Key
@@ -242,7 +242,7 @@ We called this a terrible hash function, but it does fit the bill. When applied
- If you'll be spending a bunch of time with a data set, using any kind of random sample to prepare your development sample might be a stupid idea. You'll notice that Red Sox players show up a lot of times in our examples -- that's because our development samples are "seasons by Red Sox players" and "seasons from 2000-2010", which lets us make good friends with the data.
* _Output Count_ -- Determined by the sampling fraction. As a rule of thumb, variances of things are square-root-ish; expect the size of a 10% sample to be in the 7%-13% range.
* _Records_ -- Identical to the input
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
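
One way to realize the "terrible hash function" idea -- our sketch, not necessarily the book's exact code -- is to bucket each key on its final character and keep a fixed set of buckets:

[source,pig]
----
-- every record for a given player_id lands in the same bucket, so
-- (unlike SAMPLE) the same keys are selected on every run
sampled_seasons = FILTER bat_seasons BY (player_id MATCHES '.*[02468]');
----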

==== Sampling Carelessly by Only Loading Some `part-` Files
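
A sketch of the idea, with hypothetical paths -- a Hadoop glob in the `LOAD` statement picks up only some of a prior job's output files:

[source,pig]
----
-- grab just two part files: quick and careless, since records are
-- not randomly distributed across a job's output files
some_seasons = LOAD '/data/out/bat_seasons/part-m-0000{0,1}' AS (
    player_id:chararray, year_id:int, team_id:chararray);
----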
