Structured Index of Documents #9411

Merged
merged 9 commits into PaddlePaddle:develop
Dec 20, 2024


dfmz759837901
Contributor

PR types

New features

PR changes

Others

Description

A pipeline for building a structured index of documents.


paddle-bot bot commented Nov 12, 2024

Thanks for your contribution!

@DrownFish19
Collaborator

We recommend removing the PDF files from the code; providing a download path or URL is enough.


codecov bot commented Dec 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.81%. Comparing base (da7a7d2) to head (8f4d9f1).
Report is 18 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #9411   +/-   ##
========================================
  Coverage    52.81%   52.81%           
========================================
  Files          710      710           
  Lines       111238   111238           
========================================
  Hits         58749    58749           
  Misses       52489    52489           


@dfmz759837901
Contributor Author

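Removed the PDF files and added download links plus a PDF download script:
`data/source/source_url.json` contains the download links for the sample documents.
`data/source/download.sh` is the download script.
The related instructions in the README have also been updated.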
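For readers who prefer Python over the shell script, the download step could be sketched roughly as follows. The layout of `source_url.json` is not shown in this thread, so the sketch assumes a hypothetical flat JSON object mapping output file names to URLs; `load_manifest`, `plan_downloads`, and `download_all` are illustrative names and not part of the PR.

```python
import json
import urllib.request
from pathlib import Path


def load_manifest(text):
    """Parse the manifest text.

    ASSUMPTION: the file is a flat JSON object {"file name": "url"}.
    The real source_url.json may be structured differently.
    """
    manifest = json.loads(text)
    if not isinstance(manifest, dict):
        raise ValueError("expected a JSON object mapping file names to URLs")
    return {str(name): str(url) for name, url in manifest.items()}


def plan_downloads(manifest, out_dir):
    """Return (url, destination) pairs for files not already on disk."""
    out = Path(out_dir)
    return [
        (url, out / name)
        for name, url in manifest.items()
        if not (out / name).exists()
    ]


def download_all(manifest_path, out_dir):
    """Download every missing file listed in the manifest (needs network)."""
    manifest = load_manifest(Path(manifest_path).read_text(encoding="utf-8"))
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for url, dest in plan_downloads(manifest, out_dir):
        urllib.request.urlretrieve(url, str(dest))
```

Splitting planning from downloading keeps the skip-if-present logic testable without network access; calling `download_all("source_url.json", ".")` from `data/source` would then fetch only the files that are missing.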


```bash
conda install nccl -c conda-forge
conda install paddlepaddle-gpu==2.6.1 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
```

Collaborator

This install command does not appear to be the one from the official site:
conda install paddlepaddle-gpu==2.6.2 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge

paddlenlp==3.0.0b2
tqdm
numpy
paddleocr

Collaborator

Do faiss-gpu and paddleocr need specific versions or a special installation method?

The script `data/source/download.sh` can be used to download the sample documents:
```bash
cd data/source
bash download.sh
```

Collaborator

`download.sh` calls `jq` internally; we suggest adding an `apt install jq -y` command.
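Since `download.sh` depends on `jq`, a preflight check along these lines could fail early with a clear message instead of partway through the script. This is a minimal sketch and not part of the PR; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(required, which=shutil.which):
    """Return the names in `required` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    depending on what is installed on the current machine.
    """
    return [tool for tool in required if which(tool) is None]


def format_hint(missing):
    """Build a human-readable message for the missing tools."""
    if not missing:
        return "all required tools found"
    return "missing tools: " + ", ".join(missing) + " (e.g. apt install jq -y)"
```

For example, `print(format_hint(missing_tools(["jq"])))` could be run before invoking `bash download.sh`.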

--parse_model_name_or_path Qwen/Qwen2-72B-Instruct \
--summarize_model_name_or_path Qwen/Qwen2-72B-Instruct \
--encode_model_name_or_path BAAI/bge-large-en-v1.5 \
--log_dir .logs
Collaborator

Running this command requires installing fitz and frontend. After installing them, it fails with `RuntimeError: Directory 'static/' does not exist`. This appears to be related to the paddleocr version; we suggest adapting the code to the paddleocr version.

Collaborator

We suggest adapting to the latest version first; if that proves difficult, paddleocr==2.7.3 runs correctly.

@DrownFish19
Collaborator

Only the environment installation instructions need to change; the contents of README.md have been tested and pass.

@dfmz759837901
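Contributor Author

Modified requirements.txt

  • Pinned the faiss-gpu and paddleocr versions. Adapting to the latest paddleocr, 2.9.1, ran into difficulties; paddleocr 2.7.3 from the original environment runs correctly.

Modified README.md

  • Changed the paddle installation command
  • Added a description of the `apt install jq -y` command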
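Because the pipeline was only confirmed to work with paddleocr 2.7.3, a small runtime check like the one below could warn users whose environment drifted from the pin. This is a sketch under stated assumptions: `installed_version` and `check_pin` are hypothetical helpers, and only the paddleocr version is taken from this thread (the pinned faiss-gpu version is not stated here).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package, expected, lookup=installed_version):
    """Return a warning string if `package` is not at `expected`, else None.

    `lookup` is injectable so the comparison can be tested without
    installing the package.
    """
    found = lookup(package)
    if found is None:
        return f"{package} is not installed (expected {expected})"
    if found != expected:
        return f"{package} {found} is installed, but {expected} is the tested version"
    return None
```

Calling `check_pin("paddleocr", "2.7.3")` at startup and printing any non-`None` result would surface the mismatch before the `static/` error described earlier in the thread.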

Collaborator

@DrownFish19 DrownFish19 left a comment


LGTM

@DrownFish19 DrownFish19 merged commit a26ddc4 into PaddlePaddle:develop Dec 20, 2024
10 of 12 checks passed
blacksheep-Aristotle pushed a commit to blacksheep-Aristotle/PaddleNLP that referenced this pull request Dec 23, 2024
* Structured Index of Documents

* Replace PDFs with URLs

* Change the download method

* Update README

* Update README

* Update the environment installation instructions