Table Detection and Parser #11
Actionable comments posted: 8
Outside diff range, codebase verification and nitpick comments (5)
marker/tables/utils.py (1)
46-49: Remove or clarify commented-out code.
The commented-out lines suggest functionality that is not currently implemented. If these lines are not needed, consider removing them to improve code clarity. If they are placeholders for future work, add a comment explaining their purpose.
- # box = [int(coord) for coord in box.tolist()]
- # Crop format: (left, upper, right, lower)
- # left, upper, right, lower = box
marker/tables/schema.py (1)
51-52: Remove or clarify commented-out code.
The commented-out __str__ method in the Rectangle class should be removed if not needed, or uncommented if it serves a purpose. Consider adding a comment to explain its intended use.
- # def __str__(self) -> str:
- #     return f"x: {self.x}, y: {self.y}, width:{self.width}, height:{self.height}"
marker/tables/intersections.py (2)
Line range hint 228-228: Avoid ambiguous variable names.
The variable name l is ambiguous and can be confused with the number 1 or the letter I. Consider renaming it to something more descriptive.
- for l in h_lines:
+ for line in h_lines:
- for l in v_lines:
+ for line in v_lines:
Also applies to: 249-249
Tools
Ruff
64-64: Loop control variable box_id not used within loop body. Rename unused box_id to _box_id (B007)
Line range hint 251-251: Remove unused variable start_point.
The variable start_point is assigned but never used. Consider removing it to clean up the code.
- start_point = x - LEN_THRESHOLD, y - line_width
marker/convert.py (1)
171-172: Clarify or remove commented-out code.
The commented-out lines related to format_tables suggest a change in strategy. If this logic is no longer needed, consider removing it. Otherwise, add comments to explain why it's commented out.
- # table_count = format_tables(pages)
- # out_meta["block_stats"]["table"] = table_count
    def fromCoords(x1, y1, x2, y2):
        return Rectangle(x1, y1, x2 - x1, y2 - y1)

    def fromPoints(top_left_point: Point, bottom_right_point: Point):
Consider making fromCoords and fromPoints class methods.
The fromCoords and fromPoints methods could be converted to class methods using the @classmethod decorator. This would make it clear that they are alternative constructors for the Rectangle class.
+     @classmethod
      def fromCoords(cls, x1, y1, x2, y2):
          return cls(x1, y1, x2 - x1, y2 - y1)
+     @classmethod
      def fromPoints(cls, top_left_point: Point, bottom_right_point: Point):
          return cls(
              top_left_point.x,
              top_left_point.y,
              abs(top_left_point.x - bottom_right_point.x),
              abs(top_left_point.y - bottom_right_point.y),
          )
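A minimal, self-contained sketch of this suggestion follows. The Point and Rectangle dataclasses here are hypothetical stand-ins (field names assumed from the commented-out __str__ method), not the actual definitions in marker/tables/schema.py, and LabeledRectangle exists only to show what cls(...) buys over a hard-coded Rectangle(...).

```python
from dataclasses import dataclass


@dataclass
class Point:
    # Hypothetical stand-in for the project's Point class.
    x: float
    y: float


@dataclass
class Rectangle:
    # Hypothetical stand-in; field names assumed from the commented-out __str__.
    x: float
    y: float
    width: float
    height: float

    @classmethod
    def fromCoords(cls, x1, y1, x2, y2):
        # cls is whichever class the method was called on.
        return cls(x1, y1, x2 - x1, y2 - y1)

    @classmethod
    def fromPoints(cls, top_left_point: Point, bottom_right_point: Point):
        return cls(
            top_left_point.x,
            top_left_point.y,
            abs(top_left_point.x - bottom_right_point.x),
            abs(top_left_point.y - bottom_right_point.y),
        )


class LabeledRectangle(Rectangle):
    """Subclass used only to show that cls(...) preserves the subclass type."""


rect = Rectangle.fromCoords(10, 20, 110, 70)
labeled = LabeledRectangle.fromPoints(Point(0, 0), Point(5, 8))
```

With the class-method form, calling the constructor on a subclass returns an instance of that subclass, which a hard-coded Rectangle(...) call would not do.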
    new_cells = []
    GAP = 10

    for box_id, box in enumerate(boxes):
Use is not for object identity checks.
The conditions should use is not for checking object identity.
- if not vertical_lines_width is None:
+ if vertical_lines_width is not None:
- if not horizontal_lines_height is None:
+ if horizontal_lines_height is not None:
Also applies to: 117-117
Tools
Ruff
64-64: Loop control variable box_id not used within loop body. Rename unused box_id to _box_id (B007)
    original_vertical_lines = vertical_lines.copy()

    # Modify the dimensions for intersection checking if specified
    if not vertical_lines_width is None:
Use is not for object identity checks.
The conditions should use is not for checking object identity.
- if not vertical_lines_width is None:
+ if vertical_lines_width is not None:
- if not horizontal_lines_height is None:
+ if horizontal_lines_height is not None:
Also applies to: 117-117
Tools
Ruff
111-111: Test for object identity should be is not. Convert to is not (E714)
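As a quick illustration of what E714 is about: `not x is None` parses as `not (x is None)`, so both spellings behave identically and the change is purely about readability. The helper below is hypothetical and only borrows the parameter name from the diff.

```python
def effective_width(line_width, vertical_lines_width=None):
    """Return the override when one is given, else the original width."""
    # Preferred spelling: reads as a single "is not" operator.
    if vertical_lines_width is not None:
        return vertical_lines_width
    return line_width


# The two spellings agree for every value, including falsy ones such as 0.
for value in (None, 0, 2.5):
    assert (not value is None) == (value is not None)
```

Note that either spelling differs from a plain truthiness test: `if vertical_lines_width:` would ignore an explicit override of 0.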
    img_h, img_w, _ = img.shape
    checking_lines_left = []
    checking_lines_right = []
    for l in h_lines:
Avoid ambiguous variable names.
The variable name l is ambiguous and can be confused with the number 1 or the letter I. Consider renaming it to something more descriptive.
- for l in h_lines:
+ for line in h_lines:
- for l in v_lines:
+ for line in v_lines:
Also applies to: 249-249
Tools
Ruff
228-228: Ambiguous variable name: l (E741)
    checking_lines_bottom = []
    for l in v_lines:
        x, y, w, h = l.tolist()
        start_point = x - LEN_THRESHOLD, y - line_width
Remove unused variable start_point.
The variable start_point is assigned but never used. Consider removing it to clean up the code.
- start_point = x - LEN_THRESHOLD, y - line_width
Tools
Ruff
251-251: Local variable start_point is assigned to but never used. Remove assignment to unused variable start_point (F841)
    # TODO: Add page check if page_num is valid
    page = pdf.load_page(page_num)
    pix = page.get_pixmap(dpi=180)
    image = PIL.Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    return image
Reminder: Add page validation.
The TODO comment indicates that page validation is missing.
Do you want me to implement the page validation logic or open a GitHub issue to track this task?
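One possible shape for the missing check, offered as a sketch rather than the project's implementation. validate_page_num is a hypothetical helper; it takes the page count as a plain integer (in this PR that would presumably come from len(doc), as table_detection already does) so it can be exercised without opening a PDF.

```python
def validate_page_num(page_count: int, page_num: int) -> int:
    """Return page_num if it is a valid zero-based index, else raise."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(page_num, bool) or not isinstance(page_num, int):
        raise TypeError(f"page_num must be an int, got {type(page_num).__name__}")
    # Assumption: negative indices are treated as invalid in this sketch.
    if not 0 <= page_num < page_count:
        raise IndexError(
            f"page_num {page_num} is out of range for a document "
            f"with {page_count} pages"
        )
    return page_num
```

Whether negative indices should be rejected or mapped to pages counted from the end is a design decision for the maintainers; this sketch takes the stricter option.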
    image = get_page(pdf, page_num)

    encodings = encoder(image, return_tensors="pt")

    with torch.no_grad():
        outputs = detector(**encodings)
    width, height = image.size
    target_size = [(height, width)]
    results = encoder.post_process_object_detection(
        outputs, threshold=THRESHOLD, target_sizes=target_size
    )[0]

    if len(results["boxes"]) > 0:
        for idx, ex_box in enumerate(results["boxes"]):
            box = Rectangle.fromCoords(*ex_box)
            with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as temp_file:
                image_path = temp_file.name
                save_table_image(image, box, image_path)
                image = PIL.Image.open(image_path)
                img_cv2 = cv2.imread(image_path)
                ocr_data = ocr_img(image)

                horizontal_lines_bbox, vertical_line_bboxes = detect_borderlines(
                    image_path
                )
                text_line_bbox, text_line_height = detect_horizontal_textlines(
                    ocr_data, image
                )

                text_line_bbox = [l for l in text_line_bbox if l.width > l.height]

                _, _, _, new_horizontal_lines_bbox = filter_non_intersecting_lines(
                    text_line_bbox,
                    horizontal_lines_bbox,
                    horizontal_lines_height=text_line_height / 4,
                )

                horizontal_lines_bbox_clustered = cluster_horizontal_lines(
                    new_horizontal_lines_bbox, text_line_height
                )
                vertical_line_bboxes_clustered = cluster_vertical_lines(
                    vertical_line_bboxes, text_line_height
                )

                horizontal_lines_bbox_clustered_filt, _, _, _ = (
                    filter_non_intersecting_lines(
                        horizontal_lines_bbox_clustered,
                        vertical_line_bboxes_clustered,
                        vertical_lines_width=text_line_height / 8,
                    )
                )

                h_lines, v_lines = extend_lines(
                    img_cv2,
                    horizontal_lines_bbox_clustered_filt,
                    vertical_line_bboxes_clustered,
                    text_line_height / 8,
                )
                for h in h_lines:
                    h.draw(img_cv2)

                for v in v_lines:
                    v.draw(img_cv2)

                cv2.imwrite(f"out/pg_{page_num}.png", img_cv2)
                return h_lines, v_lines, ocr_data, ex_box, image_path
Consider renaming ambiguous variable names and addressing the unused loop variable.
The function extract_table is well-implemented, but consider renaming the variable l to something more descriptive to avoid ambiguity. Also, rename the unused loop variable idx to _idx.
- for idx, ex_box in enumerate(results["boxes"]):
+ for _idx, ex_box in enumerate(results["boxes"]):
- text_line_bbox = [l for l in text_line_bbox if l.width > l.height]
+ text_line_bbox = [line for line in text_line_bbox if line.width > line.height]
Committable suggestion
def extract_table(pdf: fitz.Document, page_num: int, encoder, detector, THRESHOLD=0.8):
    image = get_page(pdf, page_num)
    encodings = encoder(image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**encodings)
    width, height = image.size
    target_size = [(height, width)]
    results = encoder.post_process_object_detection(
        outputs, threshold=THRESHOLD, target_sizes=target_size
    )[0]
    if len(results["boxes"]) > 0:
        for _idx, ex_box in enumerate(results["boxes"]):
            box = Rectangle.fromCoords(*ex_box)
            with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as temp_file:
                image_path = temp_file.name
                save_table_image(image, box, image_path)
                image = PIL.Image.open(image_path)
                img_cv2 = cv2.imread(image_path)
                ocr_data = ocr_img(image)
                horizontal_lines_bbox, vertical_line_bboxes = detect_borderlines(
                    image_path
                )
                text_line_bbox, text_line_height = detect_horizontal_textlines(
                    ocr_data, image
                )
                text_line_bbox = [line for line in text_line_bbox if line.width > line.height]
                _, _, _, new_horizontal_lines_bbox = filter_non_intersecting_lines(
                    text_line_bbox,
                    horizontal_lines_bbox,
                    horizontal_lines_height=text_line_height / 4,
                )
                horizontal_lines_bbox_clustered = cluster_horizontal_lines(
                    new_horizontal_lines_bbox, text_line_height
                )
                vertical_line_bboxes_clustered = cluster_vertical_lines(
                    vertical_line_bboxes, text_line_height
                )
                horizontal_lines_bbox_clustered_filt, _, _, _ = (
                    filter_non_intersecting_lines(
                        horizontal_lines_bbox_clustered,
                        vertical_line_bboxes_clustered,
                        vertical_lines_width=text_line_height / 8,
                    )
                )
                h_lines, v_lines = extend_lines(
                    img_cv2,
                    horizontal_lines_bbox_clustered_filt,
                    vertical_line_bboxes_clustered,
                    text_line_height / 8,
                )
                for h in h_lines:
                    h.draw(img_cv2)
                for v in v_lines:
                    v.draw(img_cv2)
                cv2.imwrite(f"out/pg_{page_num}.png", img_cv2)
                return h_lines, v_lines, ocr_data, ex_box, image_path
Tools
Ruff
300-300: Loop control variable idx not used within loop body. Rename unused idx to _idx (B007)
316-316: Ambiguous variable name: l (E741)
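When the index is never needed, an alternative to renaming it is to drop enumerate altogether. The sketch below uses made-up box tuples and is unrelated to the detector output in this PR; it only contrasts the two ways of satisfying B007.

```python
boxes = [(0, 0, 10, 10), (5, 5, 20, 30)]  # made-up (x1, y1, x2, y2) tuples

# Option 1: keep enumerate, but mark the index as intentionally unused.
areas_renamed = []
for _idx, (x1, y1, x2, y2) in enumerate(boxes):
    areas_renamed.append((x2 - x1) * (y2 - y1))

# Option 2: iterate directly, so there is no unused variable at all.
areas_direct = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
```

Both loops compute the same values; the second simply has nothing left for the linter to flag.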
    from transformers import (
        TableTransformerForObjectDetection,
        DetrFeatureExtractor
    )

    # Model setup
    THRESHOLD = 0.8
    table_detect_model = TableTransformerForObjectDetection.from_pretrained(
        "microsoft/table-transformer-detection"
    )
    feature_extractor = DetrFeatureExtractor()

    doc = fitz.open(filename)
    length = len(doc)
    if max_pages:
        length = min(length, max_pages)

    for i_pg in range(length):
        extracted_data = extract_table(doc, i_pg, feature_extractor, table_detect_model, THRESHOLD=THRESHOLD)
        if extracted_data is None:
            continue

        (
            h_lines,
            v_lines,
            ocr_data,
            table_bbox,
            table_img_path,
        ) = extracted_data
        img = cv2.imread(table_img_path)

        print("H Lines: ", len(h_lines))
        print("H Lines: ", len(v_lines))

        h_lines.sort(key=lambda l: l.y)
        v_lines.sort(key=lambda l: l.x)
        rowwise_intersections = detect_rowwise_intersection(h_lines, v_lines)
        for p_r in rowwise_intersections:
            for p in p_r:
                p.draw(img)
        boxes = detect_boxes(rowwise_intersections)
        cells = get_cells(boxes, h_lines, v_lines)

        words_original = [
            {"x": x, "y": y, "width": width, "height": height, "text": text}
            for x, y, width, height, text in zip(
                ocr_data["left"],
                ocr_data["top"],
                ocr_data["width"],
                ocr_data["height"],
                ocr_data["text"],
            )
            if text.strip()
        ]
        fill_text_in_cells(words_original, cells, img)
        # row wise cell segregation
        rows = defaultdict(dict)
        for cell in cells:
            rows[cell.r][cell.c] = cell.text

        table_df = pd.DataFrame.from_dict(rows, orient="index")
        if table_df.empty:
            print("The DataFrame is empty")
            continue
        table_df = table_df.drop_duplicates(keep='last')
        table_df.columns = table_df.iloc[0]
        table_df = table_df[1:].reset_index(drop=True)
        table_df.to_csv(f"{i_pg}_table.csv", index=False)
        md = table_df.to_markdown(index=False)

        table_block = Block(
            bbox=table_bbox,
            block_type="Table",
            pnum=i_pg,
            lines=[Line(
                bbox=table_bbox,
                spans=[Span(
                    bbox=table_bbox,
                    span_id=f"{i_pg}_table",
                    font="Table",
                    font_size=0,
                    font_weight=0,
                    block_type="TABLE",
                    text=md
                )]
            )]
        )
        pages[i_pg].blocks.append(table_block)
Consider renaming ambiguous variable names.
The function table_detection is well-implemented, but consider renaming the variable l to something more descriptive to avoid ambiguity.
- h_lines.sort(key=lambda l: l.y)
- v_lines.sort(key=lambda l: l.x)
+ h_lines.sort(key=lambda line: line.y)
+ v_lines.sort(key=lambda line: line.x)
Committable suggestion
def table_detection(filename: str, pages: List[Page], max_pages: int):
    from transformers import (
        TableTransformerForObjectDetection,
        DetrFeatureExtractor
    )
    # Model setup
    THRESHOLD = 0.8
    table_detect_model = TableTransformerForObjectDetection.from_pretrained(
        "microsoft/table-transformer-detection"
    )
    feature_extractor = DetrFeatureExtractor()
    doc = fitz.open(filename)
    length = len(doc)
    if max_pages:
        length = min(length, max_pages)
    for i_pg in range(length):
        extracted_data = extract_table(doc, i_pg, feature_extractor, table_detect_model, THRESHOLD=THRESHOLD)
        if extracted_data is None:
            continue
        (
            h_lines,
            v_lines,
            ocr_data,
            table_bbox,
            table_img_path,
        ) = extracted_data
        img = cv2.imread(table_img_path)
        print("H Lines: ", len(h_lines))
        print("H Lines: ", len(v_lines))
        h_lines.sort(key=lambda line: line.y)
        v_lines.sort(key=lambda line: line.x)
        rowwise_intersections = detect_rowwise_intersection(h_lines, v_lines)
        for p_r in rowwise_intersections:
            for p in p_r:
                p.draw(img)
        boxes = detect_boxes(rowwise_intersections)
        cells = get_cells(boxes, h_lines, v_lines)
        words_original = [
            {"x": x, "y": y, "width": width, "height": height, "text": text}
            for x, y, width, height, text in zip(
                ocr_data["left"],
                ocr_data["top"],
                ocr_data["width"],
                ocr_data["height"],
                ocr_data["text"],
            )
            if text.strip()
        ]
        fill_text_in_cells(words_original, cells, img)
        # row wise cell segregation
        rows = defaultdict(dict)
        for cell in cells:
            rows[cell.r][cell.c] = cell.text
        table_df = pd.DataFrame.from_dict(rows, orient="index")
        if table_df.empty:
            print("The DataFrame is empty")
            continue
        table_df = table_df.drop_duplicates(keep='last')
        table_df.columns = table_df.iloc[0]
        table_df = table_df[1:].reset_index(drop=True)
        table_df.to_csv(f"{i_pg}_table.csv", index=False)
        md = table_df.to_markdown(index=False)
        table_block = Block(
            bbox=table_bbox,
            block_type="Table",
            pnum=i_pg,
            lines=[Line(
                bbox=table_bbox,
                spans=[Span(
                    bbox=table_bbox,
                    span_id=f"{i_pg}_table",
                    font="Table",
                    font_size=0,
                    font_weight=0,
                    block_type="TABLE",
                    text=md
                )]
            )]
        )
        pages[i_pg].blocks.append(table_block)
Tools
Ruff
213-213: Ambiguous variable name: l (E741)
214-214: Ambiguous variable name: l (E741)