Exemplars and Touchstones

There are three touchstones to hit in every data exploration:

  • Confirm the things you know.

  • Confirm or refute the things you suspect.

  • Uncover at least one thing you never suspected.

Things we know: First, common words should show no geographic flavor. Second, geographic features ("beach", "mountain", and so on) should be intensely localized.

Things we suspect: first, taking as a whole the terms that have a strong geographic flavor, we should largely see cultural terms (foods, sports, and so on).

Second, compared to other color words, there will be larger regional variation for the terms "white" and "black" (as they describe race as well as color).

You don’t have to stop exploring when you find a new mystery, but no data exploration is complete until you uncover at least one.

Next, we’ll choose some exemplars: familiar records, such as "Barbeque", to trace through the analysis.

Chapter in progress. The story so far: we’ve counted the words in each document and each geographic grid region, and want to use those counts to estimate each word’s frequency in context. Picking up there…

Smoothing the counts

The count of each word is an imperfect estimate of the probability of seeing that word in the context of the given topic. Consider, for instance, the words that would have shown up if the article were 50% longer, or the cases where an author chose one synonym out of many equivalents. This is particularly significant for words with a count of zero.

We want to treat "missing" terms as having occurred some small number of times, and adjust the probabilities of all the observed terms downward so that everything still sums to one.
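To make the problem concrete, here is a minimal sketch in Python of the naive estimate; the document counts and the helper name naive_probability are made up purely for illustration.

[source,python]
----
from collections import Counter

# Hypothetical word counts for one document (illustrative values only)
doc_counts = Counter({"barbeque": 3, "beach": 1, "the": 40})

total = sum(doc_counts.values())

def naive_probability(term):
    # Maximum-likelihood estimate: count / total. Any term absent from this
    # document gets probability zero, which is exactly the imperfection the
    # smoothing described below is meant to correct.
    return doc_counts[term] / total

print(naive_probability("barbeque"))  # 3/44
print(naive_probability("banana"))    # 0.0 -- surely too pessimistic
----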

Note
Minimally Invasive

It’s essential to use "minimally invasive" methods to address confounding factors.

What we’re trying to do is expose a pattern that we believe is robust: that it will shine through any occlusions in the data. Occasionally, as here, we need to directly remove some confounding factor. The naive practitioner thinks, "I will use a powerful algorithm! That’s good, because powerful is better than not powerful!" No — simple and clear is better than powerful.

Suppose you were instead telling a story set in space: somehow or another, you must address the complication of faster-than-light travel. Star Wars does this early and well: its choices ("Ships can jump to faraway points in space, but not from too close to a planet and only after calculations taking several seconds; it happens instantaneously, causing nearby stars to appear as nifty blue tracks") are made clear in a few deft lines of dialog.

A ham-handed sci-fi author instead brings in complicated machinery requiring a complicated explanation, resulting in complicated dialog. There are two obvious problems: first, the added detail makes the story less clear. It’s literally not rocket science: concentrate on heroes and the triumph over darkness, not on rocket engines. Second, writing that dialog is wasted work. If it’s enough to just have the Wookiee hit the computer with a large wrench, do that.

But it’s essential to appreciate that this also introduces extra confounding factors. Rather than a nifty special effect and a few lines shouted by a space cowboy at his hairy sidekick, your junkheap space freighter now needs an astrophysicist, a whiteboard and a reason to have the one use the other. The story isn’t just muddier, it’s flawed.

We’re trying to tell a story ("words have regional flavor"), but the plot requires a few essential clarifications ("low-frequency terms are imperfectly estimated"). If these patterns are robust, complicated machinery is detrimental. It confuses the audience, and is more work for you; it can also bring more pattern to the data than is actually there, perverting your results.

The only time you should bring in something complicated or novel is when it’s a central element of your story. In that case, it’s worth spending multiple scenes in which Jedi masters show and tell the mechanics and limitations of The Force.

There are two reasonable strategies: be lazy, or consult a sensible mathematician.

To be lazy, add a 'pseudocount' to each term: pretend you saw it an extra small number of times. For the common pseudocount choice of 0.5, you would treat absent terms as having been seen 0.5 times, terms observed once as having been seen 1.5 times, and so forth. Calculate probabilities using the adjusted count divided by the sum of all adjusted counts (so that they sum to 1). It’s not well-justified mathematically, but it is easy to code.
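As a sketch of the lazy route (assuming the per-document counts sit in a Python Counter; the counts, the vocabulary, and the helper name pseudocount_probabilities are illustrative, not part of our pipeline):

[source,python]
----
from collections import Counter

def pseudocount_probabilities(doc_counts, vocabulary, pseudocount=0.5):
    # Pretend every term in the vocabulary was seen an extra `pseudocount`
    # times, then renormalize so the probabilities sum to 1.
    adjusted = {term: doc_counts.get(term, 0) + pseudocount for term in vocabulary}
    total = sum(adjusted.values())
    return {term: count / total for term, count in adjusted.items()}

# Illustrative counts and vocabulary
doc_counts = Counter({"barbeque": 3, "beach": 1, "the": 40})
vocabulary = {"barbeque", "beach", "the", "banana"}

probs = pseudocount_probabilities(doc_counts, vocabulary)
print(probs["banana"])      # absent term, treated as seen 0.5 times
print(sum(probs.values()))  # 1.0
----

Note that the result depends on which vocabulary you choose to smooth over: a larger vocabulary spreads more probability mass onto unseen terms.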

Consult a mathematician: for something that is mathematically justifiable, yet still simple enough to be minimally invasive, she will recommend "Good-Turing" smoothing.

In this approach, we expand the dataset to include both the pool of counts for terms we saw and an "absent" pool of fractional counts, to be shared by all the terms we didn’t see. Good-Turing says to count the terms that occurred exactly once, and guess that an equal quantity of things would have occurred once, but didn’t. This is handwavy, but minimally invasive; we oughtn’t say too much about the things we definitionally can’t say much about.

We then make the following adjustments:

  • Set the total count of words in the absent pool equal to the number of terms that occur once. There are of course tons of terms in this pool; we’ll give each some small fractional share of an appearance.

  • Specifically, treat each absent term as occupying the same share of the absent pool as it does in the whole corpus (minus this doc). So, if "banana" does not appear in the document, but occurs at (TODO: value) ppm across all docs, we’ll treat it as occupying the same fraction of the absent pool (with slight correction for the absence of this doc).

  • Finally, estimate the probability for each present term as its count divided by the total count in the present and absent pools.
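Here is one way the adjustments above might look in code. This is a sketch rather than a finished implementation: the names good_turing_probabilities, doc_counts, and corpus_counts and the sample data are invented for the example, and the slight correction for excluding this document from the corpus totals is omitted for brevity.

[source,python]
----
from collections import Counter

def good_turing_probabilities(doc_counts, corpus_counts):
    # Size of the "absent" pool: one pseudo-appearance for every term that
    # occurred exactly once in this document.
    singletons = sum(1 for count in doc_counts.values() if count == 1)

    # Terms that never appear in this document share the absent pool in
    # proportion to their corpus-wide frequency.
    absent = {t: c for t, c in corpus_counts.items() if t not in doc_counts}
    absent_corpus_total = sum(absent.values())

    # Total mass: everything we saw, plus the absent pool.
    grand_total = sum(doc_counts.values()) + singletons

    probs = {term: count / grand_total for term, count in doc_counts.items()}
    for term, corpus_count in absent.items():
        share_of_pool = corpus_count / absent_corpus_total if absent_corpus_total else 0.0
        probs[term] = share_of_pool * singletons / grand_total
    return probs

# Illustrative data only
doc_counts = Counter({"barbeque": 3, "beach": 1, "the": 40})
corpus_counts = Counter({"barbeque": 200, "beach": 900, "the": 90000, "banana": 50})

probs = good_turing_probabilities(doc_counts, corpus_counts)
print(round(sum(probs.values()), 6))  # 1.0 -- present and absent pools together
----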