update instructions
baberabb committed May 31, 2024
1 parent 1e38feb commit e012ee5
Showing 3 changed files with 13 additions and 2 deletions.
8 changes: 7 additions & 1 deletion uspto/README.md
@@ -6,7 +6,13 @@ USPTO dataset extracted from [Google Patents Public Dataset](https://cloud.googl

To clone the unprocessed dataset from HuggingFace run `bash setup.sh`. The default location is `/uspto/data`

The main script can be run with `bash run process_uspto.sh --output-dir <output_dir> --max-concurrency <int> --limit <max_rows>`
`pandoc` is required to run the script. The command to install it is provided in the script (commented out). Alternatively, you can install it with `sudo apt-get install pandoc`, but that installs an older version.


The main script can be run with `bash run process_uspto.sh --output-dir <output_dir> --max-concurrency <int> --limit <max_rows>`.

Note: The script will take a long time to run. The `--max-concurrency` flag can be used to speed up the process. The `--limit` flag can be used to limit the number of rows processed.
It takes ~30 mins to process 1 file with 256 threads. The bulk of the processing is done by pandoc.

To save the processed data to parquet add the `--to-parquet` flag.

3 changes: 2 additions & 1 deletion uspto/requirements.txt
@@ -1,2 +1,3 @@
polars
lxml
pypandoc
rich
4 changes: 4 additions & 0 deletions uspto/setup.sh
@@ -3,3 +3,7 @@
# Download HF Dataset repo
git clone git@hf.co:datasets/baber/USPTO /uspto/data
echo "HF Dataset repo cloned to /uspto/data"

# install pandoc
#wget https://github.com/jgm/pandoc/releases/download/3.2/pandoc-3.2-1-amd64.deb
#sudo dpkg -i pandoc-3.2-1-amd64.deb
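
Because the install lines above are left commented out, a run can reach the main script without `pandoc` on the PATH. A minimal sketch of a pre-flight check follows; the helper names `have_cmd` and `check_pandoc` are hypothetical and not part of this commit, and the sketch only reports the situation rather than installing anything.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the commit).

# Succeeds if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Reports the installed pandoc version, or points back to setup.sh if it is missing.
check_pandoc() {
    if have_cmd pandoc; then
        echo "pandoc found: $(pandoc --version | head -n 1)"
    else
        echo "pandoc not found; see the commented-out wget/dpkg lines in setup.sh" >&2
        return 1
    fi
}
```

One possible use is to call `check_pandoc || exit 1` before launching the processing script, so a missing dependency fails fast instead of partway through a long run.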
