CC_Tools

✗ python3 main.py -h        
usage: main.py [-h] {getpath,update,rank2domain,process,download} ...

用于CommonCrawl数据的下载与处理

positional arguments:
  {getpath,update,merge,rank2domain,process,download}
    getpath             下载的两个时间戳内的path, eg:2013/05
    rank2domain         转换ranks为domain,需要安装go
    process             处理warc.gz
    download            下载指定path中的文件

options:
  -h, --help            show this help message and exit

获取path文件，WARC/WET可选

✗ python3 main.py getpath -h                                             
usage: main.py getpath [-h] [--start_timestamp START_TIMESTAMP] [--end_timestamp END_TIMESTAMP] [--type TYPE]

options:
  -h, --help            show this help message and exit
  --start_timestamp START_TIMESTAMP, -s START_TIMESTAMP
  --end_timestamp END_TIMESTAMP, -e END_TIMESTAMP
  --type TYPE, -t TYPE  类型: cc-news, cc-main

从CommonCrawl的ranks文件提取domain数据~~，暂时没啥用~~

✗ python3 main.py rank2domain -h
usage: main.py rank2domain [-h] rank_path

positional arguments:
  rank_path   CommonCrawl rank地址

options:
  -h, --help  show this help message and exit

处理WARC或WET文件

✗ python3 main.py process -h                                            
usage: main.py process [-h] [--input INPUT] [--output OUTPUT] [--type TYPE]

options:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        待处理文件的文件夹地址
  --output OUTPUT, -o OUTPUT
                        输出文件的文件夹地址
  --type TYPE, -t TYPE  warc or wet

下载指定文件中的WARC/WET

✗ python3 main.py download -h
usage: main.py download [-h] [-i INPUT] [-o OUTPUT] [--num NUM]

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        目标path文件
  -o OUTPUT, --output OUTPUT
                        下载位置
  --num NUM, -n NUM     同时下载的线程数

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
TextQualityScore-main		TextQualityScore-main
archive_process		archive_process
archive_process_warc		archive_process_warc
record_refine		record_refine
resource		resource
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
cc_tools-rs.md		cc_tools-rs.md
dirty_domainsv.txt		dirty_domainsv.txt
domainExtract.go		domainExtract.go
hash_dedup.py		hash_dedup.py
main.py		main.py
nohup.out		nohup.out
path_get.py		path_get.py
printlu.py		printlu.py
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt
tool_funcs.py		tool_funcs.py
warc_download.py		warc_download.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CC_Tools

About

Releases

Packages

Languages

Force1ess/cc_tools

Folders and files

Latest commit

History

Repository files navigation

CC_Tools

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages