Replace pdf parsing libs #1861

Merged: 6 commits, Aug 24, 2024
18 changes: 8 additions & 10 deletions .github/workflows/pr.yaml
@@ -43,6 +43,14 @@ permissions:

 jobs:

+  pr-builder:
+    needs:
+      - prepare
+      - checks
+      - ci_pipe
+    secrets: inherit
+    uses: rapidsai/shared-workflows/.github/workflows/pr-builder.yaml@branch-24.02
+
   prepare:
     # Executes the get-pr-info action to determine if the PR has the skip-ci label, if the action fails we assume the
     # PR does not have the label
@@ -91,13 +99,3 @@ jobs:
       test_container: nvcr.io/ea-nvidia-morpheus/morpheus:morpheus-ci-test-240614
     secrets:
       NGC_API_KEY: ${{ secrets.NGC_API_KEY }}
-
-  pr-builder:
-    # Always run this step even if others are skipped or cancelled
-    if: '!cancelled()'
-    needs:
-      - prepare
-      - checks
-      - ci_pipe
-    secrets: inherit
-    uses: rapidsai/shared-workflows/.github/workflows/pr-builder.yaml@branch-24.02
2 changes: 1 addition & 1 deletion conda/environments/all_cuda-121_arch-x86_64.yaml
@@ -83,6 +83,7 @@ dependencies:
 - pydantic
 - pylint=3.0.3
 - pypdf=3.17.4
+- pypdfium2=4.30
 - pytest-asyncio
 - pytest-benchmark=4.0
 - pytest-cov
@@ -120,7 +121,6 @@ dependencies:
 - pip:
     - --find-links https://data.dgl.ai/wheels-test/repo.html
     - --find-links https://data.dgl.ai/wheels/cu121/repo.html
-    - PyMuPDF==1.23.*
     - databricks-cli < 0.100
     - databricks-connect
     - dgl==2.0.0
2 changes: 1 addition & 1 deletion conda/environments/dev_cuda-121_arch-x86_64.yaml
@@ -67,6 +67,7 @@ dependencies:
 - pybind11-stubgen=0.10.5
 - pydantic
 - pylint=3.0.3
+- pypdfium2=4.30
 - pytest-asyncio
 - pytest-benchmark=4.0
 - pytest-cov
@@ -98,7 +99,6 @@ dependencies:
 - yapf=0.40.1
 - zlib=1.2.13
 - pip:
-    - PyMuPDF==1.23.*
     - databricks-cli < 0.100
     - databricks-connect
     - milvus==2.3.5
2 changes: 1 addition & 1 deletion conda/environments/examples_cuda-121_arch-x86_64.yaml
@@ -44,6 +44,7 @@ dependencies:
 - pluggy=1.3
 - pydantic
 - pypdf=3.17.4
+- pypdfium2=4.30
 - python-confluent-kafka>=1.9.2,<1.10.0a0
 - python-docx==1.1.0
 - python-graphviz
@@ -67,7 +68,6 @@ dependencies:
 - pip:
     - --find-links https://data.dgl.ai/wheels-test/repo.html
     - --find-links https://data.dgl.ai/wheels/cu121/repo.html
-    - PyMuPDF==1.23.*
     - databricks-cli < 0.100
     - databricks-connect
     - dgl==2.0.0
4 changes: 2 additions & 2 deletions dependencies.yaml
@@ -364,14 +364,14 @@ dependencies:
       - output_types: [conda]
         packages:
           - &nodejs nodejs=18.*
+          - &pypdfium2 pypdfium2=4.30
           - pytest-asyncio
           - pytest-benchmark=4.0
           - pytest-cov
           - pytest=7.4.4
           - &python-docx python-docx==1.1.0
           - pip
           - pip:
-              - &PyMuPDF PyMuPDF==1.23.*
               - pytest-kafka==0.6.0

   example-dfp-prod:
@@ -410,6 +410,7 @@ dependencies:
           - onnx=1.15
           - openai=1.13
           - pypdf=3.17.4
+          - *pypdfium2
           - *python-docx
           - requests-toolbelt=1.0 # Transitive dep needed by nemollm, specified here to ensure we get a compatible version
           - sentence-transformers=2.7
@@ -420,7 +421,6 @@ dependencies:
               - faiss-gpu==1.7.*
               - google-search-results==2.4
               - nemollm==0.3.5
-              - *PyMuPDF

   model-training-tuning:
     common:
13 changes: 8 additions & 5 deletions examples/llm/vdb_upload/module/content_extractor_module.py
@@ -22,11 +22,11 @@
 from typing import Dict
 from typing import List

-import fitz
 import fsspec
 import mrc
 import mrc.core.operators as ops
 import pandas as pd
+import pypdfium2 as libpdfium
 from docx import Document
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from pydantic import BaseModel  # pylint: disable=no-name-in-module
@@ -172,10 +172,13 @@ def wrapper(input_info: ConverterInputInfo, *args, **kwargs):
 @_converter_error_handler
 def _pdf_to_text_converter(input_info: ConverterInputInfo) -> str:
     text = ""
-    pdf_document = fitz.open(stream=input_info.io_bytes, filetype="pdf")
-    for page_num in range(pdf_document.page_count):
-        page = pdf_document[page_num]
-        text += page.get_text()
+    pdf_document = libpdfium.PdfDocument(input_info.io_bytes)
+    for page_idx in range(len(pdf_document)):
+        page = pdf_document.get_page(page_idx)
+        textpage = page.get_textpage()
+        page_text = textpage.get_text_bounded()
+        text += page_text
+
     return text

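
The page-by-page extraction in the new `_pdf_to_text_converter` can be sketched on its own, as below. This is a minimal illustration, not code from the PR: `extract_text` is a hypothetical helper, it accepts any object exposing the same `len()`, `get_page()`, `get_textpage()` and `get_text_bounded()` calls the diff uses, and the explicit `close()` calls plus the final `join` are assumptions about tidy resource and string handling that the merged code does not include.

```python
def extract_text(pdf_document) -> str:
    """Join the text of every page in a pypdfium2-style document.

    Hypothetical helper for illustration only; not part of the PR.
    """
    parts = []
    for page_idx in range(len(pdf_document)):
        page = pdf_document.get_page(page_idx)
        textpage = page.get_textpage()
        try:
            # Called without bounds, get_text_bounded() covers the whole page.
            parts.append(textpage.get_text_bounded())
        finally:
            # Assumed cleanup step: release the text page, then its parent page.
            textpage.close()
            page.close()
    # One join at the end avoids repeated string concatenation in the loop.
    return "".join(parts)
```

With pypdfium2 installed, the call would presumably look like `extract_text(libpdfium.PdfDocument(input_info.io_bytes))`, mirroring the constructor call in the diff.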