
Commit

fix instructions
baberabb committed May 29, 2024
1 parent 351460f commit 5f75798
Showing 4 changed files with 8 additions and 65 deletions.
18 changes: 4 additions & 14 deletions uspto/README.md
@@ -4,17 +4,11 @@ USPTO dataset extracted from [Google Patents Public Dataset](https://cloud.googl

## Data Download and Processing

The script uses a local API to convert the MATHML equations to LaTeX. To download the dataset and install the code necessary to install the server, run `bash setup.sh`.
<details>
<summary>Under the hood of run.sh</summary>
setup.sh has 3 main steps:
To clone the unprocessed dataset from HuggingFace run `bash setup.sh`. The default location is `/uspto/data`

1. Clones the dataset from Huggingface
2. Clone the MathML to LaTeX server
3. Compiles the TypeScript code.
</details>
The main script can be run with `bash run process_uspto.sh --output-dir <output_dir> --max-concurrency <int> --limit <max_rows>`

The main script can be run with `bash run process_uspto.sh --output_dir <output_dir> --max_concurrency <int> --limit <max_rows>`
To save the processed data to parquet add the `--to-parquet` flag.

<details>
<summary>Under the hood of process_uspto.sh</summary>
@@ -24,7 +18,6 @@ The main script can be run with `bash run process_uspto.sh --output_dir <output_
#### Usage
1. Ensure you are in the correct directory structure:
1. The script expects to be run from the parent directory of the `uspto` directory.
2. Inside the `uspto` directory, there should be a `mathml-to-latex` directory with the Node.js server script.

#### Running the Script:
- Make sure the script has execute permissions. If not, run:
@@ -33,10 +26,7 @@ The main script can be run with `bash run process_uspto.sh --output_dir <output_
```

#### It has the following steps:
1. Checks if we are in the `uspto` directory.
2. Starts the MathML to LaTeX server.
3. Runs the Python script to process the parquet files.
4. Cleans up after the process is finished.
1. The bulk of the processing in the Python script is the pandoc conversions. A progress bar is displayed for each column/file.

</details>
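
The README change above describes the processing as per-column pandoc conversions with a progress indicator. A minimal sketch of that shape follows, under loud assumptions: `placeholder_convert` is a hypothetical stand-in for the pandoc call (the real conversion code is not part of this diff), `convert_column` is an illustrative name, and treating `workers=0` as "run serially" is a guess at what the new default might mean, not something the commit states.

```python
import sys
from multiprocessing import Pool
from typing import Callable, Iterable, List


def placeholder_convert(text: str) -> str:
    """Hypothetical stand-in for a pandoc call; the real conversion is not shown in this commit."""
    return text.strip()


def convert_column(
    name: str,
    values: Iterable[str],
    convert: Callable[[str], str] = placeholder_convert,
    workers: int = 0,
) -> List[str]:
    """Convert every cell of one column and report progress on stderr."""
    items = list(values)
    total = len(items)
    results: List[str] = []

    def report(done: int) -> None:
        # Rewrite the same terminal line to imitate a simple progress bar.
        sys.stderr.write(f"\r{name}: {done}/{total}")
        sys.stderr.flush()

    if workers > 0:
        # Pool.imap yields results in input order as they become available.
        with Pool(processes=workers) as pool:
            for done, out in enumerate(pool.imap(convert, items), start=1):
                results.append(out)
                report(done)
    else:
        # Assumption: 0 workers means no multiprocessing at all.
        for done, out in enumerate(map(convert, items), start=1):
            results.append(out)
            report(done)

    if total:
        sys.stderr.write("\n")
    return results
```

When the parallel branch is used from a script, the call would normally sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.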

28 changes: 0 additions & 28 deletions uspto/process_uspto.sh
@@ -2,35 +2,7 @@

#!/bin/bash

# Get the current directory
CURRENT_DIR=$(basename "$PWD")

# Check if we are already in the uspto directory
if [ "$CURRENT_DIR" != "uspto" ]; then
cd uspto || { echo "Failed to navigate to the uspto directory"; exit 1; }
fi

# Start the Node server in the background
cd mathml-to-latex || { echo "Failed to navigate to mathml-to-latex directory"; exit 1; }
node dist/server.js &
echo "MathML to LaTeX server started."

# Get the PID of the Node server
NODE_PID=$!
echo "MathML to LaTeX server started with PID: $NODE_PID"

# Navigate back to the uspto directory
cd ..

# Run the Python script
echo "Running the uspto-to-dolma.py script..."
python uspto-to-dolma.py "$@"
echo "uspto-to-dolma.py script completed."

# Kill the Node server after the Python script completes
kill $NODE_PID

# Navigate back to the top-level directory
cd ..
echo "MathML to LaTeX server stopped."
echo "Data processing complete."
23 changes: 2 additions & 21 deletions uspto/setup.sh
@@ -1,24 +1,5 @@
#!/bin/bash

# Download HF Dataset repo
git clone git@hf.co:datasets/baber/USPTO ./data/uspto/raw
echo "HF Dataset repo cloned to ./data/uspto/raw."

# Clone MathML to LaTeX converter
echo "Cloning MathML to LaTeX converter..."
git clone https://github.com/baberabb/mathml-to-latex.git

# Navigate to the Node.js project directory
echo "Navigating to mathml-to-latex directory..."
cd mathml-to-latex || { echo "Failed to navigate to mathml-to-latex directory"; exit 1; }

# Install Node.js dependencies
npm install

# Compile TypeScript to JavaScript
npx tsc
echo "TypeScript compilation completed."

# Navigate back to the original directory
cd ..
echo "Setup complete."
git clone git@hf.co:datasets/baber/USPTO /uspto/data
echo "HF Dataset repo cloned to /uspto/data"
4 changes: 2 additions & 2 deletions uspto/uspto-to-dolma.py
@@ -177,8 +177,8 @@ def create_args_parser() -> argparse.ArgumentParser:
parser.add_argument(
"--max-concurrency",
type=int,
default=8,
help="Maximum number of parquet files to process concurrently",
default=0,
help="Maximum number of multiprocessing for pandoc conversions",
)
parser.add_argument(
"--to-parquet",
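
For orientation, here is a hedged sketch of how the options named in this commit could fit together in one parser. Only `--max-concurrency` (type `int`, default `0`, with the help text shown) and the existence of `--to-parquet` come from the diff; `--output-dir` and `--limit` are taken from the README command line, and their types, defaults, and help strings below are assumptions, as is the `store_true` action for `--to-parquet`.

```python
import argparse


def create_args_parser() -> argparse.ArgumentParser:
    """Illustrative parser; not the full one from uspto-to-dolma.py."""
    parser = argparse.ArgumentParser(description="Sketch of the USPTO processing CLI")
    parser.add_argument(
        "--output-dir",
        type=str,
        default="output",  # assumed default, not from the diff
        help="Directory for processed files (assumed wording)",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=0,  # assumed default, not from the diff
        help="Maximum number of rows to process (assumed wording)",
    )
    parser.add_argument(
        "--max-concurrency",
        type=int,
        default=0,  # value introduced by this commit
        help="Maximum number of multiprocessing for pandoc conversions",
    )
    parser.add_argument(
        "--to-parquet",
        action="store_true",  # assumed: the README calls it a flag
        help="Save the processed data to parquet (assumed wording)",
    )
    return parser


if __name__ == "__main__":
    args = create_args_parser().parse_args()
    print(args)
```

Note that argparse exposes hyphenated options with underscores, so the values are read as `args.max_concurrency` and `args.to_parquet`.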
