wdict

Create dictionaries by scraping webpages or crawling local files.

Similar tools (some features inspired by them):

CeWL
CeWLeR

Take it for a spin

# build with nix and run the result
nix build .#
./result/bin/wdict --help

# just run it directly
nix run .# -- --help

# run it without cloning
nix run github:pyqlsa/wdict -- --help

# install from crates.io
# (nixOS users may need to do this within a dev shell)
cargo install wdict

# using a dev shell
nix develop .#
cargo build
./target/debug/wdict --help

# ...or a release version
cargo build --release
./target/release/wdict --help

Usage

Create dictionaries by scraping webpages or crawling local files.

Usage: wdict [OPTIONS] <--url <URL>|--theme <THEME>|--path <PATH>|--resume|--resume-strict>

Options:
  -u, --url <URL>
          URL to start crawling from

      --theme <THEME>
          Pre-canned theme URLs to start crawling from (for fun)

          Possible values:
          - star-wars:   Star Wars themed URL <https://www.starwars.com/databank>
          - tolkien:     Tolkien themed URL <https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html>
          - witcher:     Witcher themed URL <https://witcher.fandom.com/wiki/Elder_Speech>
          - pokemon:     Pokemon themed URL <https://www.smogon.com>
          - bebop:       Cowboy Bebop themed URL <https://cowboybebop.fandom.com/wiki/Cowboy_Bebop>
          - greek:       Greek Mythology themed URL <https://www.theoi.com>
          - greco-roman: Greek and Roman Mythology themed URL <https://www.gutenberg.org/files/22381/22381-h/22381-h.htm>
          - lovecraft:   H.P. Lovecraft themed URL <https://www.hplovecraft.com>

  -p, --path <PATH>
          Local file path to start crawling from

      --resume
          Resume crawling from a previous run; state file must exist; existence of dictionary is optional; parameters from state are ignored, instead favoring arguments provided on the command line

      --resume-strict
          Resume crawling from a previous run; state file must exist; existence of dictionary is optional; 'strict' enforces that all arguments from the state file are observed

  -d, --depth <DEPTH>
          Limit the depth of crawling URLs

          [default: 1]

  -m, --min-word-length <MIN_WORD_LENGTH>
          Only save words greater than or equal to this value

          [default: 3]

  -x, --max-word-length <MAX_WORD_LENGTH>
          Only save words less than or equal to this value

          [default: 18446744073709551615]

  -j, --include-js
          Include javascript from <script> tags and URLs

  -c, --include-css
          Include CSS from <style> tags and URLs

      --filters <FILTERS>...
          Filter strategy for words; multiple can be specified (comma separated)

          [default: none]

          Possible values:
          - deunicode:    Transform unicode according to <https://github.com/kornelski/deunicode>
          - decancer:     Transform unicode according to <https://github.com/null8626/decancer>
          - all-numbers:  Ignore words that consist of all numbers
          - any-numbers:  Ignore words that contain any number
          - no-numbers:   Ignore words that contain no numbers
          - only-numbers: Keep only words that exclusively contain numbers
          - all-ascii:    Ignore words that consist of all ascii characters
          - any-ascii:    Ignore words that contain any ascii character
          - no-ascii:     Ignore words that contain no ascii characters
          - only-ascii:   Keep only words that exclusively contain ascii characters
          - none:         Leave the word as-is

      --site-policy <SITE_POLICY>
          Site policy for discovered URLs

          [default: same]

          Possible values:
          - same:      Allow crawling URL, only if the domain exactly matches
          - subdomain: Allow crawling URLs if they are the same domain or subdomains
          - sibling:   Allow crawling URLs if they are the same domain or a sibling
          - all:       Allow crawling all URLs, regardless of domain

  -r, --req-per-sec <REQ_PER_SEC>
          Number of requests to make per second

          [default: 5]

  -l, --limit-concurrent <LIMIT_CONCURRENT>
          Limit the number of concurrent requests to this value

          [default: 5]

  -o, --output <OUTPUT>
          File to write dictionary to (will be overwritten if it already exists)

          [default: wdict.txt]

      --append
          Append extracted words to an existing dictionary

      --output-state
          Write crawl state to a file

      --state-file <STATE_FILE>
          File to write state, json formatted (will be overwritten if it already exists)

          [default: state-wdict.json]

  -v, --verbose...
          Increase logging verbosity

  -q, --quiet...
          Decrease logging verbosity

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Lib

This crate exposes a library, but for the time being, the interfaces should be considered unstable.

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
scripts		scripts
src		src
utils/readme		utils/readme
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md
default.nix		default.nix
flake.lock		flake.lock
flake.nix		flake.nix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

wdict

Take it for a spin

Usage

Lib

License

Contribution

About

Licenses found

Releases

Packages

Languages

License

Licenses found

pyqlsa/wdict

Folders and files

Latest commit

History

Repository files navigation

wdict

Take it for a spin

Usage

Lib

License

Contribution

About

Topics

Resources

License

Licenses found

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages