Processing PDF files programmatically #60

Open
winterpi opened this issue Dec 8, 2023 · 0 comments

Extracting tables from PDF files

Using camelot

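  • Example:
import camelot
tables = camelot.read_pdf("./[1].pdf")  # extract the tables (only page 1 is parsed by default)
print(tables)
print(tables[0].df)

tables.export("./foo.json", f="json")
  • Drawbacks: extraction is incomplete, and Chinese (UTF-8) text is not supported for now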
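If the Chinese-text problem shows up as `\uXXXX` escapes in the exported JSON (an assumption; the issue does not say how the problem appears), one possible workaround is to serialize the rows yourself with Python's `json` module and `ensure_ascii=False`. A minimal sketch, where `save_rows_as_json` is a hypothetical helper and `rows` stands in for something like `tables[0].df.values.tolist()`:

```python
import json


def save_rows_as_json(rows, json_path):
    """Write a list of table rows to a UTF-8 JSON file (hypothetical helper)."""
    with open(json_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False writes non-ASCII characters (e.g. Chinese) as-is
        # instead of \uXXXX escape sequences
        json.dump(rows, f, ensure_ascii=False, indent=2)
```

This does not address incomplete extraction; it only changes how already-extracted cells are written out.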

Using pdfplumber

  • Allows finer-grained processing, and the extraction is more accurate
  • Example:
import os

import pdfplumber


def extract_tables(pdf_path, output_folder):
    # Create the output folder if it does not exist
    os.makedirs(output_folder, exist_ok=True)

    with pdfplumber.open(pdf_path) as pdf:
        for pi, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            print("page:", pi, ", table size:", len(tables))

            # Write each table to its own tab-separated text file
            for i, table in enumerate(tables):
                out_path = os.path.join(output_folder, 'table_{}_{}.txt'.format(pi, i + 1))
                with open(out_path, 'w', encoding='utf-8') as f:
                    for row in table:
                        # Empty cells come back as None; replace them with ''
                        row = [x if x is not None else '' for x in row]
                        f.write('\t'.join(row) + '\n')
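Cells extracted from a PDF can contain line breaks or tabs, which would break a plain tab-joined text file. As a possible alternative, the sketch below writes one table with Python's `csv` module, which quotes such cells. `write_table_csv` is a hypothetical helper, and `table` is assumed to be a list of rows like those returned by `page.extract_tables()`, with `None` for empty cells:

```python
import csv


def write_table_csv(table, csv_path):
    """Write one extracted table (list of rows, cells may be None) to a UTF-8 CSV file."""
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in table:
            # Replace missing cells with empty strings; the writer quotes
            # cells that contain commas, quotes, or line breaks
            writer.writerow(["" if cell is None else cell for cell in row])
```

Reading the file back with `csv.reader` (also opened with `newline=""`) should recover the same cells, including any multi-line ones.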
