
CommonCrawl.jl


Interface to the Common Crawl dataset on Amazon S3

Usage

An instance of the corpus is obtained as:

cc = CrawlCorpus(cache_location::String, debug::Bool=false)

Since the crawl corpus files are large, they are cached locally at cache_location. The first time a file is accessed, it is downloaded in full into the cache; subsequent reads are served from the local copy.
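For example (the cache path here is illustrative):

using CommonCrawl

cc = CrawlCorpus("/tmp/cc_cache")  # downloaded archives are cached under this directory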

All cached files, or a particular cached archive file, can be deleted:

clear_cache(cc::CrawlCorpus)
clear_cache(cc::CrawlCorpus, archive::URI)
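For example, assuming arc is an archive URI obtained from archives (shown below):

clear_cache(cc, arc)  # delete one cached archive file
clear_cache(cc)       # delete all cached files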

Segments and archive files in a segment can be listed as:

segment_names = segments(cc::CrawlCorpus)
archive_uris = archives(cc::CrawlCorpus, segment::String)

Archive files across all segments can be listed directly:

archive_uris = archives(cc::CrawlCorpus, count::Int=0)

Passing count as 0 lists all available archive files (the full list can be very large).
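A minimal sketch of listing segments and archives (variable names are illustrative):

segs = segments(cc)           # names of all segments
uris = archives(cc, segs[1])  # archive URIs in the first segment
first5 = archives(cc, 5)      # first 5 archive URIs across all segments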

A particular archive file can be opened as:

open(cc::CrawlCorpus, archive::URI)

And crawl entries can be read from an opened archive as:

entry = read_entry(cc::CrawlCorpus, f::IO, mime_part::String="", metadata_only::Bool=false)
entries = read_entries(cc::CrawlCorpus, f::IO, mime_part::String="", num_entries::Int=0, metadata_only::Bool=false)

The read_entry method returns an ArchiveEntry instance for the next entry in the file whose MIME type begins with mime_part. The read_entries method returns an array of ArchiveEntry objects; if num_entries is 0, all matching entries in the archive file are returned. If metadata_only is true, only the entry metadata (URL and MIME type) is populated.
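Putting it together, a minimal sketch that prints the metadata of a few HTML entries from the first archive (the MIME filter and entry count are illustrative, and the ArchiveEntry field names uri and mime are assumptions):

arc = archives(cc, 1)[1]                             # first archive across all segments
f = open(cc, arc)                                    # download (if needed) and open it
entries = read_entries(cc, f, "text/html", 5, true)  # metadata of 5 HTML entries
for e in entries
    println(e.uri, " => ", e.mime)                   # field names are assumptions
end
close(f)                                             # close the underlying stream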
