Ulysses Fetcher

Fetch resources for the Ulysses project.

Installation

python -m pip install "git+https://github.com/ulysses-camara/ulysses-fetcher"

Available resources

Pretrained machine learning models

Task name	Model name
`legal_text_segmentation`	- `2_layer_6000_vocab_size_bert_v3` - `4_layer_6000_vocab_size_bert_v3` - `256_hidden_dim_6000_vocab_size_1_layer_lstm_v3` - `6000_subword_tokenizer`
`sentence_similarity`	- `legal_sroberta_v0` - `legal_sroberta_v1` - `sbert_1mil_anama` - `sbert_650k_nheeng` - `sbert_map2doc_v1` - `ulysses_LaBSE_3000`
`stance_detection`	- `political_brsd_v0`

Datasets

Task name	Dataset name
`sentence_model_evaluation`	- `bill_summary_to_topics` - `code_estatutes_cf88` - `factnews_news_bias` - `factnews_news_factuality` - `fakebr_size_normalized` - `faqs` - `hatebr_offensive_lang` - `masked_law_name_in_news` - `masked_law_name_in_summaries` - `oab_first_part` - `oab_second_part` - `offcombr2` - `stj_summary` - `sts_state_news` - `summary_vs_bill` - `tampered_leg` - `trf_examinations` - `ulysses_sd`
`probing_task`	- `dataset_wikipedia_ptbr_bigram_shift_v1` - `dataset_wikipedia_ptbr_coordination_inversion_v1` - `dataset_wikipedia_ptbr_obj_number_v1` - `dataset_wikipedia_ptbr_odd_man_out_v1` - `dataset_wikipedia_ptbr_past_present_v1` - `dataset_wikipedia_ptbr_sentence_length_v1` - `dataset_wikipedia_ptbr_subj_number_v1` - `dataset_wikipedia_ptbr_top_constituents_v1` - `dataset_wikipedia_ptbr_tree_depth_v1` - `dataset_wikipedia_ptbr_word_content_v1` - `dataset_sp_court_cases_bigram_shift_v1` - `dataset_sp_court_cases_coordination_inversion_v1` - `dataset_sp_court_cases_obj_number_v1` - `dataset_sp_court_cases_odd_man_out_v1` - `dataset_sp_court_cases_past_present_v1` - `dataset_sp_court_cases_sentence_length_v1` - `dataset_sp_court_cases_subj_number_v1` - `dataset_sp_court_cases_top_constituents_v1` - `dataset_sp_court_cases_tree_depth_v1` - `dataset_sp_court_cases_word_content_v1` - `dataset_political_speeches_ptbr_bigram_shift_v1` - `dataset_political_speeches_ptbr_coordination_inversion_v1` - `dataset_political_speeches_ptbr_obj_number_v1` - `dataset_political_speeches_ptbr_odd_man_out_v1` - `dataset_political_speeches_ptbr_past_present_v1` - `dataset_political_speeches_ptbr_sentence_length_v1` - `dataset_political_speeches_ptbr_subj_number_v1` - `dataset_political_speeches_ptbr_top_constituents_v1` - `dataset_political_speeches_ptbr_tree_depth_v1` - `dataset_political_speeches_ptbr_word_content_v1` - `dataset_leg_pop_comments_ptbr_bigram_shift_v1` - `dataset_leg_pop_comments_ptbr_coordination_inversion_v1` - `dataset_leg_pop_comments_ptbr_obj_number_v1` - `dataset_leg_pop_comments_ptbr_odd_man_out_v1` - `dataset_leg_pop_comments_ptbr_past_present_v1` - `dataset_leg_pop_comments_ptbr_sentence_length_v1` - `dataset_leg_pop_comments_ptbr_subj_number_v1` - `dataset_leg_pop_comments_ptbr_top_constituents_v1` - `dataset_leg_pop_comments_ptbr_tree_depth_v1` - `dataset_leg_pop_comments_ptbr_word_content_v1` - `dataset_leg_docs_ptbr_bigram_shift_v1` - `dataset_leg_docs_ptbr_coordination_inversion_v1` - `dataset_leg_docs_ptbr_obj_number_v1` - `dataset_leg_docs_ptbr_odd_man_out_v1` - `dataset_leg_docs_ptbr_past_present_v1` - `dataset_leg_docs_ptbr_sentence_length_v1` - `dataset_leg_docs_ptbr_subj_number_v1` - `dataset_leg_docs_ptbr_top_constituents_v1` - `dataset_leg_docs_ptbr_tree_depth_v1` - `dataset_leg_docs_ptbr_word_content_v1`
`quantization`	- `ulysses_tesemo_v2_subset_static_quantization`

Deprecated resources

Task name	Model name
`legal_text_segmentation`	- ~~`2_layer_6000_vocab_size_bert_v2`~~ (DEPRECATED) - ~~`4_layer_6000_vocab_size_bert_v2`~~ (DEPRECATED) - ~~`256_hidden_dim_6000_vocab_size_1_layer_lstm_v2`~~ (DEPRECATED) - ~~`2_layer_6000_vocab_size_bert`~~ (DEPRECATED) - ~~`512_hidden_dim_6000_vocab_size_1_layer_lstm`~~ (DEPRECATED)
`sentence_similarity`	- ~~`distil_sbert_br_ctimproved_12_epochs_v1`~~ (DEPRECATED)

Usage (as a library)

import buscador

has_succeed = buscador.download_resource(
    task_name="<task_name>",
    resource_name="<resource_name_given_the_task>",
    output_dir="<directory_to_save_downloaded_resources>",
    show_progress_bar=True,
    check_cached=True,
    clean_compressed_files=True,
    check_resource_hash=True,
    timeout_limit_seconds=10,
)

print("Download was successful!" if has_succeed else "Download was not successful.")

task_name (str): Resource task name. You can get a list of currently supported tasks programmatically by calling buscador.get_available_tasks();
resource_name (str): Resource to download. You can get a list of available resources per task by calling buscador.get_task_available_resources(task_name);
output_dir (str): Output directory in which downloaded resources are saved;
show_progress_bar (bool, default=True): If True, display a progress bar;
check_cached (bool, default=True): If True, skip the download when a file with the same output URI is found;
clean_compressed_files (bool, default=True): If True, remove compressed files after decompression;
check_resource_hash (bool, default=True): If True, verify that the downloaded file's hash matches the expected hash value;
timeout_limit_seconds (int, default=10): Time limit, in seconds, after which stalled downloads are aborted.
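To pick valid values for task_name and resource_name, the two listing functions mentioned above can be combined into a single lookup table. The following is a minimal sketch under stated assumptions: build_resource_index is a hypothetical helper that is not part of buscador, and it assumes both listing functions return iterables of strings. The functions are passed in as arguments so the helper itself does not import the package.

```python
from typing import Callable, Dict, Iterable, List


def build_resource_index(
    get_tasks: Callable[[], Iterable[str]],
    get_resources: Callable[[str], Iterable[str]],
) -> Dict[str, List[str]]:
    """Map every task name to a sorted list of its resource names.

    Hypothetical helper (not part of buscador). Assumes both callables
    return iterables of strings.
    """
    return {task: sorted(get_resources(task)) for task in get_tasks()}
```

With the library installed, the helper would be called as build_resource_index(buscador.get_available_tasks, buscador.get_task_available_resources).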

Usage by command line

After installation, this library can be run directly from the command line as a module:

python -m buscador --help

Positional arguments:
- task_name: Task name to retrieve a resource from.
- resource_name: Pretrained resource name to retrieve.
Optional arguments:
- -h, --help: display help message.
- --output-dir: Output directory to store downloaded resources.
- --timeout-limit TIMEOUT_LIMIT: Timeout limit for stale downloads, in seconds.
- --disable-progress-bar: If enabled, do not display progress bar.
- --ignore-cached-files: If enabled, download files even if they are found locally.
- --keep-compressed-files: If enabled, do not delete compressed files (.zip, .tar) after decompression.
- --ignore-resource-hash: If enabled, do not verify that the downloaded file's hash matches the expected value.
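When the command line is driven from a script, the arguments listed above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; build_fetch_command is a hypothetical helper, not part of buscador, and it uses only the flags documented above.

```python
import sys
from typing import List, Optional


def build_fetch_command(
    task_name: str,
    resource_name: str,
    output_dir: Optional[str] = None,
    timeout_limit: Optional[int] = None,
    disable_progress_bar: bool = False,
    ignore_cached_files: bool = False,
    keep_compressed_files: bool = False,
    ignore_resource_hash: bool = False,
) -> List[str]:
    """Build the argument list for `python -m buscador` (hypothetical helper)."""
    # Positional arguments come first: task name, then resource name.
    command = [sys.executable, "-m", "buscador", task_name, resource_name]

    # Options that take a value are added only when a value is given.
    if output_dir is not None:
        command += ["--output-dir", output_dir]
    if timeout_limit is not None:
        command += ["--timeout-limit", str(timeout_limit)]

    # Boolean switches are appended only when enabled.
    switches = {
        "--disable-progress-bar": disable_progress_bar,
        "--ignore-cached-files": ignore_cached_files,
        "--keep-compressed-files": keep_compressed_files,
        "--ignore-resource-hash": ignore_resource_hash,
    }
    command += [flag for flag, enabled in switches.items() if enabled]
    return command
```

The resulting list can be passed to subprocess.run, for example subprocess.run(build_fetch_command("stance_detection", "political_brsd_v0", output_dir="resources")). How buscador reports failures through its exit status is not covered here, so inspect the return code with that in mind.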

For developers

Register a new resource

To register a new resource in Ulysses Fetcher, please follow the steps below:

Make sure that the resource filename (or directory name, in case your resource is represented by more than one file) matches exactly the desired resource name.
Compress your resource in either .zip or .tar format (you can skip this step if it is a PyTorch binary, .pt).

zip -r my_resource_file_or_directory.zip my_resource_file_or_directory/

Store your resource in cloud storage services and get its direct download URLs. It is recommended to store the resource with at least two distinct cloud providers.
Hash your resource with SHA256, using Python's hashlib as follows:

import hashlib

def produce_hash(model_uri: str) -> str:
    read_block_size_in_bytes = 64 * 1024 * 1024  # Read in blocks of 64MiB; any amount will work.
    hasher = hashlib.sha256()
    
    with open(model_uri, "rb") as f_in:
        for data_chunk in iter(lambda: f_in.read(read_block_size_in_bytes), b""):
            hasher.update(data_chunk)
    
    return hasher.hexdigest()

my_resource_sha256 = produce_hash("path/to/my_resource.zip")
print(my_resource_sha256)
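Before registering the hash, it can be worth confirming that the file uploaded to each cloud provider still matches it, by downloading a copy and comparing. The sketch below is a hypothetical check written for this purpose; it is not the library's own verification code, and matches_expected_hash is a name introduced here for illustration.

```python
import hashlib


def file_sha256(path: str, block_size: int = 1024 * 1024) -> str:
    """Return the SHA256 hex digest of a file, read in fixed-size blocks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f_in:
        for chunk in iter(lambda: f_in.read(block_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_expected_hash(path: str, expected_sha256: str) -> bool:
    """Compare a file's SHA256 with an expected hex digest (hypothetical helper).

    The expected value is normalized so that surrounding whitespace or
    uppercase hex digits do not cause a false mismatch.
    """
    return file_sha256(path) == expected_sha256.strip().lower()
```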

Register your resource in a JSON file within the trusted_urls directory, providing the resource task, resource name, file extension (.zip or .tar for compressed resources), SHA256 hash, and direct download URLs, as depicted in the example below (use buscador/trusted_urls/models.json as a reference). You can either create a new JSON file or register your resource in an existing one, as long as the resource remains semantically coherent with the configuration filename. Note that Ulysses Fetcher tries the download URLs in the order they are listed in urls, so later URLs act as fallback addresses that are used only if every previous URL fails.

{
  "task_name": {
    "resource_name": {
      "sha256": "<my_resource_sha256>",
      "file_extension": ".zip",
      "urls": [
        "https://url_1",
        "https://url_2",
        "..."
      ]
    }
  }
}
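The ordering of the urls list matters because of the fallback behavior described above. The sketch below illustrates that ordering only; it is not the library's implementation. fetch_with_fallback is a hypothetical function, and fetch stands in for any download function that raises an exception on failure.

```python
from typing import Callable, Dict, List, Tuple


def fetch_with_fallback(
    urls: List[str],
    fetch: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Try each URL in order and return (url, data) for the first success.

    Hypothetical sketch; `fetch` is any callable that returns the
    downloaded bytes or raises an exception on failure.
    """
    errors: Dict[str, Exception] = {}
    for url in urls:
        try:
            return url, fetch(url)
        except Exception as err:  # real code would catch narrower errors
            # Remember why this URL failed, then fall through to the next one.
            errors[url] = err
    raise RuntimeError(f"all {len(urls)} URLs failed: {errors}")
```

Here urls would be the "urls" list of a registered entry, so the first address is always preferred and the remaining ones are tried only after a failure.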

Create a Pull Request with your changes, providing all information about your resource. Your contribution will be reviewed and, if appropriate to this library, it may get accepted.

License

MIT.
