Interface to the Common Crawl dataset on Amazon S3.
An instance of the corpus is obtained as:

```julia
cc = CrawlCorpus(cache_location::String, debug::Bool=false)
```
Since the crawl corpus files are large, they are cached locally at `cache_location`. The first time a file is accessed, it is downloaded in full into the cache location; subsequent reads are served from the local copy.
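For example, a corpus instance backed by a local cache directory might be created as follows. This is a minimal sketch: the module name `CommonCrawl` and the cache path are assumptions, not part of the API shown above.

```julia
using CommonCrawl  # assumed module name for this package

# Cache downloaded crawl files under /tmp/cc_cache;
# the second argument enables debug output.
cc = CrawlCorpus("/tmp/cc_cache", true)
```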
All cached files, or a particular cached archive file, can be deleted:

```julia
clear_cache(cc::CrawlCorpus)
clear_cache(cc::CrawlCorpus, archive::URI)
```
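For instance, to drop one cached archive and then empty the cache entirely (here `arc` stands for a `URI` returned by `archives`, described below):

```julia
arc = archives(cc, 1)[1]  # any previously listed archive URI

clear_cache(cc, arc)  # delete just this cached archive file
clear_cache(cc)       # delete all cached files
```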
Segments, and the archive files in a segment, can be listed as:

```julia
segment_names = segments(cc::CrawlCorpus)
archive_uris = archives(cc::CrawlCorpus, segment::String)
```
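A quick sketch of listing the segments and then the archive files of the first segment:

```julia
segment_names = segments(cc)
archive_uris = archives(cc, segment_names[1])
println(length(archive_uris), " archive files in segment ", segment_names[1])
```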
Archive files across all segments can be accessed easily as:

```julia
archive_uris = archives(cc::CrawlCorpus, count::Int=0)
```
Passing `count` as `0` lists all available archive files (the full list can be large).
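For example, to fetch the URIs of just the first few archive files without enumerating everything:

```julia
# First five archive files across all segments;
# archives(cc, 0) would return the complete (large) list.
first_few = archives(cc, 5)
```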
A particular archive file can be opened as:

```julia
open(cc::CrawlCorpus, archive::URI)
```
Crawl entries can then be read from an opened archive:

```julia
entry = read_entry(cc::CrawlCorpus, f::IO, mime_part::String="", metadata_only::Bool=false)
entries = read_entries(cc::CrawlCorpus, f::IO, mime_part::String="", num_entries::Int=0, metadata_only::Bool=false)
```
Method `read_entry` returns an `ArchiveEntry` instance corresponding to the next entry in the file whose MIME type begins with `mime_part`. Method `read_entries` returns an array of `ArchiveEntry` objects; if `num_entries` is `0`, all matching entries in the archive file are returned. If `metadata_only` is `true`, only the file metadata (URL and MIME type) is populated in the entries.
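Putting the pieces together, a hedged end-to-end sketch (the `"text/"` MIME filter and the use of `close` on the returned `IO` are assumptions based on the signatures above, not documented behavior):

```julia
# Open the first available archive file.
arc = archives(cc, 1)[1]
f = open(cc, arc)

# Next entry whose MIME type starts with "text/".
entry = read_entry(cc, f, "text/")

# All remaining matching entries, metadata only (no payload).
entries = read_entries(cc, f, "text/", 0, true)

close(f)
```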