CommonCrawl NEWS dataset indexer

Creates a CDXJ index for the CommonCrawl NEWS dataset (there is no official index server).

Usage

  1. Set AWS API key and secret in boto.cfg (see example: example_boto.cfg)

  2. Set GNU parallel nodefile (see example: example_nodefile)

    • Copy this directory to the same path on all machines
  3. Set parameters as environment variables:

    • PYTHON (default: python3)
    • OUTPUT_DIR (default: $(PWD)/output)
    • BOTO_CFG (default: $(PWD)/boto.cfg)
    • NO_OF_THREADS (default: 80)
    • NICEVALUE (default: 10)
  4. Set the languages to collect in languages_to_collect.txt. The format is "[LANGUAGE NAME AS IN LINGUA]": (including the quotation marks and the trailing colon, because the entries are grepped from a JSONL file for speed)
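
To illustrate why the quotation marks and the trailing colon matter, here is a minimal sketch of a fixed-string grep against a JSONL file. The sample records, the field layout, and the language names are assumptions made for illustration only; the actual JSONL produced by the pipeline and the grep invocation in the Makefile may differ.

```shell
# Hypothetical sample data: the real JSONL structure may differ.
workdir=$(mktemp -d)

# One pattern per line, with quotes and trailing colon (assumed layout).
cat > "$workdir/languages_to_collect.txt" <<'EOF'
"HUNGARIAN":
"ENGLISH":
EOF

cat > "$workdir/sample.jsonl" <<'EOF'
{"url": "http://example.com/a", "HUNGARIAN": 0.98}
{"url": "http://example.com/ENGLISH-news", "GERMAN": 0.95}
{"url": "http://example.com/c", "ENGLISH": 0.91}
EOF

# -F treats every pattern as a fixed string (no regex parsing),
# -f reads the patterns from the file, one per line.
matched=$(grep -F -f "$workdir/languages_to_collect.txt" "$workdir/sample.jsonl")
printf '%s\n' "$matched"

rm -rf "$workdir"
```

In this sketch the second record is skipped even though its URL contains the word ENGLISH, because only the quoted key followed by a colon counts as a match.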

Run make to execute the whole process, or consult the Makefile for the individual steps.
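
A typical invocation might look like the sketch below. The variable names come from the parameter list above; the values are only examples, and the thread count of 16 is an arbitrary choice rather than a recommendation. The guard around make is there only so the snippet does nothing harmful when run outside the repository.

```shell
# Override the defaults through the environment (example values).
export PYTHON=python3
export OUTPUT_DIR="$PWD/output"
export BOTO_CFG="$PWD/boto.cfg"
export NO_OF_THREADS=16   # arbitrary example, the default is 80
export NICEVALUE=10

# Run the whole process from the repository root.
if [ -f Makefile ]; then
    make
fi
```

Any variable left unset should fall back to the default listed above.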

License

This code is licensed under the GPL 3.0 license.