Truncated Records in WARC Files

Content payload in Common Crawl archives is truncated if the content exceeds a limit of

Truncation is required to keep the crawl archives at a limited size and to ensure that they cover a broad sample of web pages. It also prevents the archives from being filled with accidentally captured video or audio streams. In addition, the crawler has to buffer content temporarily, and a limit guarantees that this is possible with a bounded amount of RAM across many parallel connections.
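The buffering argument can be illustrated with a minimal sketch of a bounded read. This is not the crawler's actual code: the function name `read_with_limit`, the `chunk_size` default, and the limit passed in the usage example are hypothetical and chosen only for illustration.

```python
import io


def read_with_limit(stream, limit, chunk_size=8192):
    """Read at most `limit` bytes from `stream`.

    Returns (payload, truncated), where `truncated` is True if the
    stream held more data than the limit allowed us to keep.
    Hypothetical helper, not part of any crawler API.
    """
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            # Stream ended before the limit was reached: nothing was cut off.
            return b"".join(chunks), False
        chunks.append(chunk)
        remaining -= len(chunk)
    # Limit reached: probe one more byte to see whether content was cut off.
    truncated = bool(stream.read(1))
    return b"".join(chunks), truncated


# Usage with an in-memory stream and an illustrative limit of 4 bytes.
payload, truncated = read_with_limit(io.BytesIO(b"0123456789"), limit=4)
```

Because no more than `limit` bytes are ever held per connection, the memory needed for N parallel fetches is bounded by roughly N times the limit, regardless of how large the remote content is.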

The notebooks in this folder analyze various aspects of payload truncation:

  • [cc-main-2018-43-single-warc-file.ipynb] - truncation counts for a single WARC file of CC-MAIN-2018-43 and the broken marking of truncated records in that crawl
  • [cc-main-2019-35-100-warc-files.ipynb] - the marking of truncated records was fixed for CC-MAIN-2019-35; 100 randomly selected WARC files are analyzed to verify the marking and to get more detailed metrics
  • [cc-main-2019-47-truncation-by-mime-type.ipynb] - since November 2019 (CC-MAIN-2019-47), truncated records are marked in the URL indexes, which makes it possible to analyze the distribution of truncated records over the entire monthly crawl
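
As a rough illustration of what "marking" means here: the WARC format defines a `WARC-Truncated` record header whose value gives the reason for truncation (for example `length`). The sketch below counts truncation reasons over a list of record header blocks using only the standard library. It is not taken from the notebooks; the helper names and the sample header blocks are made up for illustration, and reading real WARC files would need a proper WARC parser.

```python
from collections import Counter


def truncation_reason(header_block):
    """Return the WARC-Truncated value of one record header block, or None."""
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        # Header names are matched case-insensitively; lines without a
        # colon (such as the "WARC/1.0" version line) are skipped.
        if sep and name.strip().lower() == "warc-truncated":
            return value.strip()
    return None


def count_truncations(header_blocks):
    """Count records per truncation reason; unmarked records are grouped together."""
    counts = Counter()
    for block in header_blocks:
        reason = truncation_reason(block)
        counts[reason if reason is not None else "not truncated"] += 1
    return counts


# Made-up header blocks for illustration only.
sample = [
    "WARC/1.0\r\nWARC-Type: response\r\nWARC-Truncated: length\r\n",
    "WARC/1.0\r\nWARC-Type: response\r\nWARC-Target-URI: http://example.com/\r\n",
]
counts = count_truncations(sample)
```

Grouping the same counts by a record's MIME type instead of only by reason would give the kind of breakdown the last notebook describes.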