Detecting language page-wise #18

Open
wants to merge 1 commit into base: master
64 changes: 19 additions & 45 deletions marker/convert.py
@@ -28,10 +28,18 @@
from marker.cleaners.text import cleanup_text
from marker.images.extract import extract_images
from marker.images.save import images_to_dict
from marker.ocr.langdetect import get_text, detect_language_text, detect_language_ocr, keep_most_frequent_element
from marker.ocr.langdetect import (
get_text,
detect_language_text,
detect_language_ocr,
keep_most_frequent_element,
Comment on lines +32 to +35
⚠️ Potential issue

Remove unused imports to clean up code.

The functions get_text, detect_language_text, detect_language_ocr, and keep_most_frequent_element are imported but not used in this file. Removing these unused imports will improve code readability and maintainability.

Apply this diff to remove the unused imports:

 from marker.ocr.langdetect import (
-    get_text,
-    detect_language_text,
-    detect_language_ocr,
-    keep_most_frequent_element,
     language_detection,
 )
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Tools
Ruff

32-32: marker.ocr.langdetect.get_text imported but unused; remove unused import (F401)
33-33: marker.ocr.langdetect.detect_language_text imported but unused; remove unused import (F401)
34-34: marker.ocr.langdetect.detect_language_ocr imported but unused; remove unused import (F401)
35-35: marker.ocr.langdetect.keep_most_frequent_element imported but unused; remove unused import (F401)

language_detection,
)

from typing import List, Dict, Tuple, Optional
from marker.settings import settings


def convert_single_pdf(
fname: str,
model_lst: List,
@@ -76,41 +84,10 @@ def convert_single_pdf(
}
)

valid_langs=["en","hi","or"]
valid_langs = ["en", "hi", "or"]
🛠️ Refactor suggestion

Consider making valid_langs configurable for flexibility.

Instead of hard-coding valid_langs = ["en", "hi", "or"], consider retrieving the valid languages from configuration settings or passing them as a parameter. This approach enhances flexibility and maintainability, allowing for easier updates and broader language support in the future.
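A minimal sketch of what that could look like, assuming a hypothetical `VALID_LANGS` setting (the setting name and helper function are illustrative, not part of the current codebase):

```python
from typing import List, Optional

# Fallback mirrors the hard-coded list introduced in this PR.
DEFAULT_VALID_LANGS = ["en", "hi", "or"]


def resolve_valid_langs(configured: Optional[List[str]] = None) -> List[str]:
    """Return the language allow-list, preferring an explicit configuration."""
    return list(configured) if configured else list(DEFAULT_VALID_LANGS)


# The call site would then become something like:
# valid_langs = resolve_valid_langs(getattr(settings, "VALID_LANGS", None))
```

This keeps the current behavior as the default while letting deployments extend the allow-list without editing convert.py.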


# Detecting language of the text layer present. Getting empty means OCR is needed.
language = detect_language_text(get_text(pages))
langs = [language]
# validate_langs(langs)

print("langs >",langs)
if language not in valid_langs:
OCR_ALL_PAGES = True
language = detect_language_ocr(fname)
langs = language
# if language in valid_langs:
# pages = convert_pages_to_unicode(pages)

# else:
if keep_most_frequent_element(language)[0] not in valid_langs:
langs = ["en"]
langs=list(set(langs))
if "unknown" in langs:
langs.remove("unknown")
for lang in langs:
if lang not in valid_langs:
langs.remove(lang)
if len(langs)==0:
langs = ["en"]
langs=list(langs)

print("langs >",langs)


# OCR_ALL_PAGES=True
# language = detect_language_ocr(fname)
# langs = language
# print("langs >",langs)
languages_meta = language_detection(pages, fname, valid_langs)
out_meta.update({"languages": languages_meta})

# Trim pages from doc to align with start page
if start_page:
@@ -128,7 +105,13 @@

# OCR pages as needed
pages, ocr_stats = run_ocr(
doc, pages, langs, ocr_model, OCR_ALL_PAGES, batch_multiplier=batch_multiplier
doc,
pages,
langs,
ocr_model,
OCR_ALL_PAGES,
languages_meta,
batch_multiplier=batch_multiplier,
Comment on lines +108 to +114
💡 Codebase verification

Remove the languages_meta parameter from the run_ocr function call in marker/convert.py (lines 108-114) as it is not utilized within the function.

Analysis chain

Verify the necessity of passing both langs and languages_meta to run_ocr.

Both langs and languages_meta are being passed to the run_ocr function. Ensure that both parameters are required and used appropriately within the function. If they serve similar purposes, consider consolidating them to streamline the code and reduce potential confusion.

Run the following script to inspect the run_ocr function's parameters and usage:

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Inspect the `run_ocr` function definition to verify parameter usage.

# Test: Search for the `run_ocr` function definition.
rg --type python -A 5 -B 2 '^def run_ocr\('

# Expected result: The function definition showing how `langs` and `languages_meta` are utilized.

Length of output: 76


Script:

#!/bin/bash
# Description: Inspect the `run_ocr` function definition without specifying file type to verify parameter usage.

# Search for the `run_ocr` function definition across all files.
rg '^def run_ocr\(' -A 5 -B 2

Length of output: 342

)
flush_cuda_memory()

@@ -226,13 +209,4 @@ def convert_single_pdf(
out_meta["postprocess_stats"] = {"edit": edit_stats}
doc_images = images_to_dict(pages)

language = detect_language_text(full_text)
langs=[language]
print("langs >",langs)
out_meta.update(
{
"languages": langs
}
)

return full_text, doc_images, out_meta, merged_lines
120 changes: 91 additions & 29 deletions marker/ocr/langdetect.py
@@ -12,33 +12,35 @@
from pdf2image import convert_from_path
import pytesseract

from langdetect import detect, DetectorFactory
from langdetect import detect, DetectorFactory, detect_langs

⚠️ Potential issue

Remove unused import detect_langs.

The detect_langs function is imported but never used in the code. Removing unused imports helps keep the code clean and maintainable.

Apply this diff to remove the unused import:

-from langdetect import detect, DetectorFactory, detect_langs
+from langdetect import detect, DetectorFactory
Tools
Ruff

15-15: langdetect.detect_langs imported but unused; remove unused import (F401)

from langdetect.lang_detect_exception import LangDetectException
from marker.settings import settings
⚠️ Potential issue

Remove unused import settings.

The settings module from marker.settings is imported but never used. Cleaning up unused imports improves code readability.

Apply this diff to remove the unused import:

-from marker.settings import settings
Tools
Ruff

17-17: marker.settings.settings imported but unused; remove unused import (F401)


# Ensure the results are deterministic
DetectorFactory.seed = 0


def detect_language_text(text):
try:
# Detect the language
language = detect(text)
except LangDetectException:
# If detection fails, return 'unknown'
language = 'unknown'
language = "unknown"
return language


api_key = os.environ.get("OPENAI_API_KEY")

client = OpenAI(api_key=api_key)

text_prompt = '''
text_prompt = """
You are a bot who identifies language of a given text. If text is gibberish or encoded, do not return any language (return empty string).
Response format: json object with key: "language", value:"ISO 639-1 code of language."
Here's the text:
'''
"""

image_prompt = '''
image_prompt = """
You are a powerful language model with vision capabilities. Your task is to analyze the provided image, and then determine the language of the text in it.

Provide the result in the following JSON format:
Expand All @@ -47,50 +49,48 @@ def detect_language_text(text):
}

Here is the image for analysis:
'''
"""


def detect_language_llm(text, prompt=text_prompt):
if len(text.split()) > 1000:
text=" ".join(text.split()[0:1000])
text = " ".join(text.split()[0:1000])
try:
print("Detecting language...")
response = client.chat.completions.create(
# model="gpt-4",
model="gpt-3.5-turbo-0125",
temperature=0,
messages=[
{
"role": "user",
"content": prompt + text
}
],
messages=[{"role": "user", "content": prompt + text}],
response_format={"type": "json_object"},
)

llm_response = response.choices[0].message.content
language=json.loads(llm_response)["language"]
language = json.loads(llm_response)["language"]

except:
print("Error while detecting language")
language = ""

Comment on lines 71 to 74
⚠️ Potential issue

Avoid using bare except; specify the exception type.

Using a bare except clause can catch unintended exceptions and make debugging difficult. It's better to specify the exception you intend to catch or use except Exception as e.

Apply this diff to specify the exception and handle it appropriately:

-except:
+except Exception as e:
    print("Error while detecting language")
+   print(e)
    language = ""
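If narrower handling is preferred over a blanket `except Exception`, the JSON-decoding step can be isolated with its specific failure modes; a sketch under the assumption that parsing the model's reply is the main failure point (`parse_language` is a hypothetical helper, not in the PR):

```python
import json


def parse_language(llm_response: str) -> str:
    """Extract the "language" field from an LLM JSON reply, defaulting to ""."""
    try:
        return json.loads(llm_response)["language"]
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # JSONDecodeError: malformed JSON; KeyError: missing field;
        # TypeError: response was None or a non-string payload.
        print("Error while detecting language:", e)
        return ""
```

The network call itself can then keep a separate, broader handler, so API errors and parse errors are distinguishable in the logs.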
Tools
Ruff

71-71: Do not use bare except (E722)


return language
return language


def detect_language_page(image, prompt=image_prompt):
try:
base64_image = encode_image(image)
print("Detecting language.")
response = client.chat.completions.create(
model = "gpt-4o-mini",
messages = [
model="gpt-4o-mini",
messages=[
{
"role":"user",
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
"image_url": {
"url": f"data:image/jpeg;base64,{base64_image}"
},
},
],
}
Expand All @@ -99,26 +99,31 @@ def detect_language_page(image, prompt=image_prompt):
)
llm_response = response.choices[0].message.content
print(llm_response)
language=json.loads(llm_response)["language"]
language = json.loads(llm_response)["language"]

except Exception as e:
print("Error while detecting language.")
language = "en"

return language


def detect_language_ocr(pdf_path):
try:
print("Detecting language using OCR")
pdf_document = fitz.open(pdf_path)
n_pages = pdf_document.page_count

results = []
for page_number in range(min(3, n_pages)):
page = pdf_document.load_page(page_number) # Page numbers are 0-indexed in PyMuPDF
for page_number in range(n_pages):
page = pdf_document.load_page(
page_number
) # Page numbers are 0-indexed in PyMuPDF
pix = page.get_pixmap()

with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as temp_image_file:
with tempfile.NamedTemporaryFile(
suffix=".png", delete=False
) as temp_image_file:
image_path = temp_image_file.name
pix.save(image_path)

Expand All @@ -134,11 +139,46 @@ def detect_language_ocr(pdf_path):
os.remove(image_path)

except Exception as e:
print("failed ocr language detection",e)
print("failed ocr language detection", e)
return ["en"]

return results


def detect_language_ocr_page(pdf_path, page_number):
try:
print("Detecting language using OCR")
pdf_document = fitz.open(pdf_path)
n_pages = pdf_document.page_count

page = pdf_document.load_page(
page_number
) # Page numbers are 0-indexed in PyMuPDF
pix = page.get_pixmap()

with tempfile.NamedTemporaryFile(
suffix=".png", delete=False
) as temp_image_file:
image_path = temp_image_file.name
pix.save(image_path)

# language = detect_language_page(image_path)

# results.append(language)

text = extract_text_from_image(image_path)
result = detect_language_text(text)

# Clean up the temporary image file
os.remove(image_path)

Comment on lines +159 to +174
⚠️ Potential issue

Ensure temporary files are deleted even if an exception occurs.

If an exception occurs after the temporary image file is created but before it's deleted, the temporary file may not be cleaned up, causing resource leakage. Use a context manager or a try...finally block to ensure the file is deleted.

Refactor the code to use a context manager:

with tempfile.NamedTemporaryFile(suffix=".png") as temp_image_file:
    image_path = temp_image_file.name
    pix.save(image_path)
    # Process the image
    text = extract_text_from_image(image_path)
    result = detect_language_text(text)
    # The temporary file is automatically deleted when the block exits

except Exception as e:
print("failed ocr language detection", e)
return "unknown"

return result


def extract_text_from_image(image_path):
"""
Extract text from an image using OCR.
Expand All @@ -151,30 +191,52 @@ def extract_text_from_image(image_path):
"""
try:
image = Image.open(image_path)
text = pytesseract.image_to_string(image, lang='eng+hin+ori')
text = pytesseract.image_to_string(image, lang="eng+hin+ori")
return text
except Exception as e:
print(f"An error occurred while OCR language detection: {e}")
return ""


def language_detection(pages: List[Page], pdf_path, valid_langs):
languages_meta = {}
for i, page in enumerate(pages):
page_text = page.prelim_text
page_language = detect_language_text(page_text)

if page_language not in valid_langs:
page_language = detect_language_ocr_page(pdf_path, i)
languages_meta[str(i)] = (
page_language if page_language in valid_langs else "unknown"
)

else:
languages_meta[str(i)] = page_language

return languages_meta


def get_text(pages: List[Page]):
full_text = ""
for page in pages:
full_text += page.prelim_text
return full_text.strip()


# Function to encode the image
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")



def pdf_page_to_image(pdf_path, page_number):
images = convert_from_path(pdf_path, first_page=page_number, last_page=page_number)
return images[0]


def keep_most_frequent_element(lst):
if not lst:
return lst
counter = Counter(lst)
most_common_element, _ = counter.most_common(1)[0]
return [most_common_element]
return [most_common_element]
9 changes: 8 additions & 1 deletion marker/ocr/recognition.py
@@ -35,16 +35,23 @@ def run_ocr(
langs: List[str],
rec_model,
OCR_ALL_PAGES,
languages_meta,
batch_multiplier=1,
) -> (List[Page], Dict):
ocr_pages = 0
ocr_success = 0
ocr_failed = 0
no_text = no_text_found(pages)
ocr_idxs = []
ocr_langs = [
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
]
unknown_langs = [
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
]
Comment on lines +46 to +51
⚠️ Potential issue

Logical Error: Duplicate Definitions of ocr_langs and unknown_langs

The variables ocr_langs and unknown_langs are both assigned the same list comprehension, resulting in identical lists. Given their intended purposes, this seems unintended. Specifically, ocr_langs should contain page indices where the language is neither "en" nor "unknown", while unknown_langs should contain indices where the language is "unknown".

Apply this diff to correct the definition of unknown_langs:

 ocr_langs = [
     int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
 ]
 unknown_langs = [
-    int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
+    int(k) for k, v in languages_meta.items() if v == "unknown"
 ]
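To make the intended split concrete, a small sketch of the corrected partitioning (assuming `languages_meta` maps string page indices to ISO 639-1 codes or "unknown", as built by `language_detection`; the helper name is illustrative):

```python
def partition_pages(languages_meta: dict) -> tuple:
    """Split page indices into non-English OCR pages and unknown-language pages."""
    # Pages in a known, non-English language: route through OCR with that language.
    ocr_langs = [
        int(k) for k, v in languages_meta.items() if v not in ("en", "unknown")
    ]
    # Pages whose language could not be determined: skipped by the OCR loop below.
    unknown_langs = [int(k) for k, v in languages_meta.items() if v == "unknown"]
    return ocr_langs, unknown_langs
```

With the corrected second comprehension, a page can never appear in both lists, which matches how the loop below uses them.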

for pnum, page in enumerate(pages):
ocr_needed = should_ocr_page(page, no_text, OCR_ALL_PAGES)
if ocr_needed:
if (ocr_needed or pnum in ocr_langs) and pnum not in unknown_langs:
ocr_idxs.append(pnum)
ocr_pages += 1
