Skip to content
This repository has been archived by the owner on May 12, 2020. It is now read-only.

Overview

Peter Snyder edited this page Apr 23, 2018 · 6 revisions

Brave Ad Block

This documentation is meant to be read from start to finish, it is best not to skip to a specific section.

Lists and filters

  • Ad blocking works based on various lists, for example EasyList, EasyPrivacy, regional lists, etc.
  • Each ad block list contains a list of ad block filters.

You can get a good overview about how ad block filters work here and here.

Filters follow specific semantics and is not simply a regex which would be slow and could not be optimized.

The Brave ad block client parses these lists and helps you determine which URLs should or should not be blocked.

DAT files

Brave optimizes these ad block filters and lists into DAT files. After parsing a series of lists, the Brave ad block client can save it to a DAT file for later use. These DAT files can be easily loaded into memory and give the same result as if you parsed the lists. They are faster than re-parsing the lists on each application startup.

What are fingerprints?

This is not to be confused with browser fingerprinting, which is a completely different thing.

  • Brave ad block client defines a "fingerprint" for each ad block filter it parses.
  • The length of this fingerprints is a constant.
  • These fingerprints are entered into a bloom filter.

How fingerprints are created

Fingerprints are created by taking the first substring of a filter which is not a bad fingerprint.

Since filter syntax can contain special characters that would not be in the URL substring lookups, fingerprints cannot be created for substrings that contain those special characters.

Sometimes fingerprints cannot be created because:

  • The filter is too short in length.
  • The fingerprint is already blacklisted.
  • It doesn't have a long enough length without special characters in between.

When a "no fingerprint" filter is discovered, there is no fingerprint added to the bloom filter. Instead it goes into a manual checked no fingerprint list.

These no fingerprint checks are expensive and can slow things down because they must be manually scanned for all URL checks.

No fingerprint filters

We sometimes blacklist a fingerprint because a substring would occur for every URL substring lookup. We call blacklisted fingerprints "bad fingerprints".

https: is a bad fingerprint because it would appear in every URL lookup.

How are URLs checked to be blocked?

When checking to see if a URL should be blocked, instead of scanning N filters, where N is extremely large, we instead split up the input URL into a distinct substring of the same length as the fingerprint length.

Any URL that should be blocked, must have a fixed length substring which exists in the bloom filter.

Recall that a bloom filter gives you an answer of either:

  • Probably in the list, or
  • definitely not in the list

Since most URLs are not blocked when doing lookups, we can automatically get a deterministic answer for most URLs that should not be blocked with a series of simple bloom filter lookups for the substrings of the URL.

If the bloom filter returns "definitely not in the list" for all of them, we're done checking. This is by far the most common case. If the bloom filter returns "probably in the list", we need to do a manual scan through all the filters to verify. If verified, we block.

Frequency of matching filters

When doing URL lookups, a matching filter is very rare. Most URLs are not bad, and therefore should not be blocked.

False positives

A false positive is when a bloom filter fingerprint lookup for a substring says there is a match, but then when the filters are scanned, there is no actual match.

Keys to optimizing

The best way to optimizing is by doing these 2 things:

  1. Minimizing the number of the no fingerprint filters.
  2. Minimizing false positives.

There are probably other optimizations that could be done on speeding up the manual checks which are needed.