Processing PDF files programmatically #60

winterpi · 2023-12-08T05:06:25Z

Extracting tables from PDF files

Using camelot

Example:

import camelot
tables = camelot.read_pdf("./[1].pdf")  # extract tables (camelot reads only page 1 unless `pages` is set)
print(tables)
print(tables[0].df)

tables.export("./foo.json", f="json")

Drawbacks: extraction is incomplete, and Chinese text in UTF-8 is not yet supported.

Using pdfplumber

It allows finer-grained control, and the extraction is also more accurate.
Example:

import os

import pdfplumber


def extract_tables(pdf_path, output_folder):
    # Create the output folder if it does not exist
    os.makedirs(output_folder, exist_ok=True)

    with pdfplumber.open(pdf_path) as pdf:
        for pi, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            print("page:", pi, ", table size:", len(tables))

            # Write each table to its own tab-separated text file
            for i, table in enumerate(tables):
                out_path = os.path.join(output_folder, "table_{}_{}.txt".format(pi, i + 1))
                with open(out_path, "w", encoding="utf-8") as f:
                    for row in table:
                        # pdfplumber returns None for empty cells
                        row = [x if x is not None else "" for x in row]
                        f.write("\t".join(row) + "\n")
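
One thing to watch with the tab-joined output above: a cell extracted by pdfplumber can contain line breaks or tabs, which would split one table row across several lines of the text file. Below is a minimal sketch of a stricter serializer using only the standard library. The helper name `table_to_tsv` and the choice to collapse whitespace inside cells are my own assumptions, not part of pdfplumber.

```python
import csv
import io


def table_to_tsv(table):
    """Serialize one extracted table (rows of str-or-None cells) to TSV text.

    `table_to_tsv` is a hypothetical helper, not a pdfplumber function.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in table:
        # Replace missing cells with empty strings and collapse any
        # whitespace run (newlines, tabs) inside a cell to a single space,
        # so each table row stays on exactly one output line.
        writer.writerow(
            ["" if cell is None else " ".join(cell.split()) for cell in row]
        )
    return buffer.getvalue()


# Sample data shaped like the output of page.extract_tables()[0]
sample = [["Name", None, "Qty"], ["Widget\nA", "blue", "3"]]
print(table_to_tsv(sample))
```

The returned string can be written to the per-table file in place of the manual `'\t'.join(...)` loop. Collapsing whitespace loses the original line breaks inside a cell, so skip that step if they matter.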
