Detecting language page-wise #18
base: master
@@ -28,10 +28,18 @@
 from marker.cleaners.text import cleanup_text
 from marker.images.extract import extract_images
 from marker.images.save import images_to_dict
-from marker.ocr.langdetect import get_text, detect_language_text, detect_language_ocr, keep_most_frequent_element
+from marker.ocr.langdetect import (
+    get_text,
+    detect_language_text,
+    detect_language_ocr,
+    keep_most_frequent_element,
+    language_detection,
+)

 from typing import List, Dict, Tuple, Optional
 from marker.settings import settings


 def convert_single_pdf(
     fname: str,
     model_lst: List,
@@ -76,41 +84,10 @@ def convert_single_pdf(
         }
     )

-    valid_langs=["en","hi","or"]
+    valid_langs = ["en", "hi", "or"]
Review comment — 🛠️ Refactor suggestion: Consider making `valid_langs` configurable instead of hard-coding the list.
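A minimal sketch of the configurable alternative the reviewer is pointing at — the `MARKER_VALID_LANGS` environment variable and the `get_valid_langs` helper are hypothetical names, not part of the marker codebase:

```python
import os

# Current hard-coded defaults from the PR, kept as a fallback.
DEFAULT_VALID_LANGS = ["en", "hi", "or"]


def get_valid_langs():
    # Hypothetical override: a comma-separated list in an env var,
    # e.g. MARKER_VALID_LANGS="en,ta,bn".
    raw = os.environ.get("MARKER_VALID_LANGS", "")
    codes = [code.strip() for code in raw.split(",") if code.strip()]
    return codes or DEFAULT_VALID_LANGS
```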
-    # Detecting language of the text layer present. Getting empty means OCR is needed.
-    language = detect_language_text(get_text(pages))
-    langs = [language]
-    # validate_langs(langs)
-
-    print("langs >",langs)
-    if language not in valid_langs:
-        OCR_ALL_PAGES = True
-        language = detect_language_ocr(fname)
-        langs = language
-        # if language in valid_langs:
-        #     pages = convert_pages_to_unicode(pages)
-
-    # else:
-    if keep_most_frequent_element(language)[0] not in valid_langs:
-        langs = ["en"]
-    langs=list(set(langs))
-    if "unknown" in langs:
-        langs.remove("unknown")
-    for lang in langs:
-        if lang not in valid_langs:
-            langs.remove(lang)
-    if len(langs)==0:
-        langs = ["en"]
-    langs=list(langs)
-
-    print("langs >",langs)
-
-
-    # OCR_ALL_PAGES=True
-    # language = detect_language_ocr(fname)
-    # langs = language
-    # print("langs >",langs)
+    languages_meta = language_detection(pages, fname, valid_langs)
+    out_meta.update({"languages": languages_meta})

     # Trim pages from doc to align with start page
     if start_page:
@@ -128,7 +105,13 @@ def convert_single_pdf(

     # OCR pages as needed
     pages, ocr_stats = run_ocr(
-        doc, pages, langs, ocr_model, OCR_ALL_PAGES, batch_multiplier=batch_multiplier
+        doc,
+        pages,
+        langs,
+        ocr_model,
+        OCR_ALL_PAGES,
+        languages_meta,
+        batch_multiplier=batch_multiplier,
Review comment on lines +108 to +114 — 💡 Codebase verification: Verify the necessity of passing both `langs` and `languages_meta` to `run_ocr`. Both now carry language information, so one may be redundant. Scripts executed to inspect the `run_ocr` definition:

#!/bin/bash
# Description: Inspect the `run_ocr` function definition to verify parameter usage.
rg --type python -A 5 -B 2 '^def run_ocr\('
# Length of output: 76 (the type filter failed; rerun without it)

#!/bin/bash
# Search for the `run_ocr` function definition across all files.
rg '^def run_ocr\(' -A 5 -B 2
# Length of output: 342
     )
     flush_cuda_memory()
@@ -226,13 +209,4 @@ def convert_single_pdf(
     out_meta["postprocess_stats"] = {"edit": edit_stats}
     doc_images = images_to_dict(pages)

-    language = detect_language_text(full_text)
-    langs=[language]
-    print("langs >",langs)
-    out_meta.update(
-        {
-            "languages": langs
-        }
-    )
-
     return full_text, doc_images, out_meta, merged_lines
@@ -12,33 +12,35 @@
 from pdf2image import convert_from_path
 import pytesseract

-from langdetect import detect, DetectorFactory
+from langdetect import detect, DetectorFactory, detect_langs
Review comment — Remove unused import: `detect_langs` is imported but never used. Apply this diff to remove the unused import:

-from langdetect import detect, DetectorFactory, detect_langs
+from langdetect import detect, DetectorFactory

(Flagged by Ruff.)
 from langdetect.lang_detect_exception import LangDetectException
 from marker.settings import settings
Review comment — Remove unused import: `settings` is imported but never used. Apply this diff to remove the unused import:

-from marker.settings import settings

(Flagged by Ruff.)

 # Ensure the results are deterministic
 DetectorFactory.seed = 0


 def detect_language_text(text):
     try:
         # Detect the language
         language = detect(text)
     except LangDetectException:
         # If detection fails, return 'unknown'
-        language = 'unknown'
+        language = "unknown"
     return language


 api_key = os.environ.get("OPENAI_API_KEY")

 client = OpenAI(api_key=api_key)

-text_prompt = '''
+text_prompt = """
 You are a bot who identifies language of a given text. If text is gibberish or encoded, do not return any language (return empty string).
 Response format: json object with key: "language", value:"ISO 639-1 code of language."
 Here's the text:
-'''
+"""

-image_prompt = '''
+image_prompt = """
 You are a powerful language model with vision capabilities. Your task is to analyze the provided image, and then determine the language of the text in it.

 Provide the result in the following JSON format:
@@ -47,50 +49,48 @@ def detect_language_text(text):
 }

 Here is the image for analysis:
-'''
+"""


 def detect_language_llm(text, prompt=text_prompt):
     if len(text.split()) > 1000:
-        text=" ".join(text.split()[0:1000])
+        text = " ".join(text.split()[0:1000])
     try:
         print("Detecting language...")
         response = client.chat.completions.create(
             # model="gpt-4",
             model="gpt-3.5-turbo-0125",
             temperature=0,
-            messages=[
-                {
-                    "role": "user",
-                    "content": prompt + text
-                }
-            ],
+            messages=[{"role": "user", "content": prompt + text}],
             response_format={"type": "json_object"},
         )

         llm_response = response.choices[0].message.content
-        language=json.loads(llm_response)["language"]
+        language = json.loads(llm_response)["language"]

     except:
         print("Error while detecting language")
         language = ""
Review comment on lines 71 to 74 — Avoid using bare `except`: a bare `except` clause catches all exceptions, including system-exiting ones, and hides the real error. Apply this diff to specify the exception and surface it:

-    except:
+    except Exception as e:
         print("Error while detecting language")
+        print(e)
         language = ""

(Flagged by Ruff.)
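As a sketch of the narrower handling the reviewer recommends, the realistic failure modes of the JSON-parsing step can each be caught explicitly (the `parse_language_response` helper name is illustrative, not from the PR):

```python
import json


def parse_language_response(llm_response):
    # Narrow handling instead of a bare except: distinguish malformed
    # JSON from a response that lacks the "language" key.
    try:
        return json.loads(llm_response)["language"]
    except json.JSONDecodeError as e:
        print("Response was not valid JSON:", e)
    except KeyError:
        print("Response JSON had no 'language' key")
    return ""
```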

-    return language
+    return language


 def detect_language_page(image, prompt=image_prompt):
     try:
         base64_image = encode_image(image)
         print("Detecting language.")
         response = client.chat.completions.create(
-            model = "gpt-4o-mini",
-            messages = [
+            model="gpt-4o-mini",
+            messages=[
                 {
-                    "role":"user",
+                    "role": "user",
                     "content": [
                         {"type": "text", "text": prompt},
                         {
                             "type": "image_url",
-                            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
+                            "image_url": {
+                                "url": f"data:image/jpeg;base64,{base64_image}"
+                            },
                         },
                     ],
                 }
@@ -99,26 +99,31 @@ def detect_language_page(image, prompt=image_prompt):
         )
         llm_response = response.choices[0].message.content
         print(llm_response)
-        language=json.loads(llm_response)["language"]
+        language = json.loads(llm_response)["language"]

     except Exception as e:
         print("Error while detecting language.")
         language = "en"

     return language


 def detect_language_ocr(pdf_path):
     try:
         print("Detecting language using OCR")
         pdf_document = fitz.open(pdf_path)
         n_pages = pdf_document.page_count

         results = []
-        for page_number in range(min(3, n_pages)):
-            page = pdf_document.load_page(page_number)  # Page numbers are 0-indexed in PyMuPDF
+        for page_number in range(n_pages):
+            page = pdf_document.load_page(
+                page_number
+            )  # Page numbers are 0-indexed in PyMuPDF
             pix = page.get_pixmap()

-            with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as temp_image_file:
+            with tempfile.NamedTemporaryFile(
+                suffix=".png", delete=False
+            ) as temp_image_file:
                 image_path = temp_image_file.name
                 pix.save(image_path)
@@ -134,11 +139,46 @@ def detect_language_ocr(pdf_path):
             os.remove(image_path)

     except Exception as e:
-        print("failed ocr language detection",e)
+        print("failed ocr language detection", e)
         return ["en"]

     return results


+def detect_language_ocr_page(pdf_path, page_number):
+    try:
+        print("Detecting language using OCR")
+        pdf_document = fitz.open(pdf_path)
+        n_pages = pdf_document.page_count
+
+        page = pdf_document.load_page(
+            page_number
+        )  # Page numbers are 0-indexed in PyMuPDF
+        pix = page.get_pixmap()
+
+        with tempfile.NamedTemporaryFile(
+            suffix=".png", delete=False
+        ) as temp_image_file:
+            image_path = temp_image_file.name
+            pix.save(image_path)
+
+        # language = detect_language_page(image_path)
+
+        # results.append(language)
+
+        text = extract_text_from_image(image_path)
+        result = detect_language_text(text)
+
+        # Clean up the temporary image file
+        os.remove(image_path)
Review comment on lines +159 to +174 — Ensure temporary files are deleted even if an exception occurs: if an exception is raised after the temporary image file is created but before `os.remove` runs, the file is never cleaned up, leaking resources. Use a context manager or a `finally` block so deletion is guaranteed:

with tempfile.NamedTemporaryFile(suffix=".png") as temp_image_file:
    image_path = temp_image_file.name
    pix.save(image_path)
    # Process the image
    text = extract_text_from_image(image_path)
    result = detect_language_text(text)
    # The temporary file is automatically deleted when the block exits
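One caveat with the suggested context manager: on some platforms an open `NamedTemporaryFile` cannot be reopened by another process, which may be why the PR uses `delete=False`. A hedged alternative sketch that keeps `delete=False` but still guarantees cleanup with `try`/`finally` (the `ocr_with_temp_png` helper is hypothetical):

```python
import os
import tempfile


def ocr_with_temp_png(save_pixmap, process_image):
    # delete=False so other processes (e.g. tesseract on Windows) can
    # reopen the file; the finally block guarantees removal even if
    # processing raises.
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as f:
        image_path = f.name
    try:
        save_pixmap(image_path)
        return process_image(image_path)
    finally:
        os.remove(image_path)
```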
+
+    except Exception as e:
+        print("failed ocr language detection", e)
+        return "unknown"
+
+    return result
+
+
 def extract_text_from_image(image_path):
     """
     Extract text from an image using OCR.

@@ -151,30 +191,52 @@ def extract_text_from_image(image_path):
     """
     try:
         image = Image.open(image_path)
-        text = pytesseract.image_to_string(image, lang='eng+hin+ori')
+        text = pytesseract.image_to_string(image, lang="eng+hin+ori")
         return text
     except Exception as e:
         print(f"An error occurred while OCR language detection: {e}")
         return ""


+def language_detection(pages: List[Page], pdf_path, valid_langs):
+    languages_meta = {}
+    for i, page in enumerate(pages):
+        page_text = page.prelim_text
+        page_language = detect_language_text(page_text)
+
+        if page_language not in valid_langs:
+            page_language = detect_language_ocr_page(pdf_path, i)
+            languages_meta[str(i)] = (
+                page_language if page_language in valid_langs else "unknown"
+            )
+
+        else:
+            languages_meta[str(i)] = page_language
+
+    return languages_meta
+
+
 def get_text(pages: List[Page]):
     full_text = ""
     for page in pages:
         full_text += page.prelim_text
     return full_text.strip()


 # Function to encode the image
 def encode_image(image_path):
     with open(image_path, "rb") as image_file:
         return base64.b64encode(image_file.read()).decode("utf-8")


 def pdf_page_to_image(pdf_path, page_number):
     images = convert_from_path(pdf_path, first_page=page_number, last_page=page_number)
     return images[0]


 def keep_most_frequent_element(lst):
     if not lst:
         return lst
     counter = Counter(lst)
     most_common_element, _ = counter.most_common(1)[0]
-    return [most_common_element]
+    return [most_common_element]
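For context, `keep_most_frequent_element` collapses a list of per-page language detections to its modal value; a small self-contained usage sketch:

```python
from collections import Counter


def keep_most_frequent_element(lst):
    if not lst:
        return lst
    counter = Counter(lst)
    most_common_element, _ = counter.most_common(1)[0]
    return [most_common_element]


# Mixed per-page detections collapse to the dominant language.
print(keep_most_frequent_element(["hi", "hi", "en"]))  # → ['hi']
```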
@@ -35,16 +35,23 @@ def run_ocr(
     langs: List[str],
     rec_model,
     OCR_ALL_PAGES,
+    languages_meta,
     batch_multiplier=1,
 ) -> (List[Page], Dict):
     ocr_pages = 0
     ocr_success = 0
     ocr_failed = 0
     no_text = no_text_found(pages)
+    ocr_idxs = []
+    ocr_langs = [
+        int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
+    ]
+    unknown_langs = [
+        int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
+    ]
Review comment on lines +46 to +51 — Logical error: duplicate definitions of `ocr_langs` and `unknown_langs`. The two list comprehensions use the same filter, so `unknown_langs` never selects the pages whose detected language is "unknown". Apply this diff to correct the definition of `unknown_langs`:

 ocr_langs = [
     int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
 ]
 unknown_langs = [
-    int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
+    int(k) for k, v in languages_meta.items() if v == "unknown"
 ]
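A quick self-contained demonstration of the bug the reviewer flags — with identical filters the two comprehensions always match, while the suggested `v == "unknown"` filter actually isolates the unknown pages (the `languages_meta` values here are illustrative):

```python
# Illustrative page -> language mapping; real values come from language_detection().
languages_meta = {"0": "hi", "1": "en", "2": "unknown", "3": "or"}

ocr_langs = [int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]]

# As written in the PR: same filter, so this duplicates ocr_langs.
unknown_langs_buggy = [
    int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
]

# With the reviewer's fix: only the genuinely unknown pages.
unknown_langs_fixed = [int(k) for k, v in languages_meta.items() if v == "unknown"]

print(ocr_langs)            # [0, 3]
print(unknown_langs_buggy)  # [0, 3] — identical, which is the bug
print(unknown_langs_fixed)  # [2]
```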

     for pnum, page in enumerate(pages):
         ocr_needed = should_ocr_page(page, no_text, OCR_ALL_PAGES)
-        if ocr_needed:
+        if (ocr_needed or pnum in ocr_langs) and pnum not in unknown_langs:
             ocr_idxs.append(pnum)
             ocr_pages += 1
Review comment — Remove unused imports to clean up code: the functions `get_text`, `detect_language_text`, `detect_language_ocr`, and `keep_most_frequent_element` are imported but not used in this file. Removing these unused imports will improve code readability and maintainability. (Flagged by Ruff.)