[Question] QA-Corpus mapping #854
Comments
Hello @kun432
raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
There is nothing wrong with this code. So, in conclusion,
|
Wohh...
|
@thap2331 Okay we will. |
@vkehfdl1 Thank you. Well, I think I got it. Let's take the step2 notebook as an example. First, parse a PDF with Parser.
This produces "/content/parse_project_dir/0/0.parquet", a single "parsed" parquet file. Then chunk it with Chunker, using this YAML config.
This produces 4 "chunked" parquet files, depending on the combinations of chunk_method and chunk_size.
Then, to generate QA, prepare Raw and Corpus.
At this point, I think all of the code below creates the same "parsed" instance.
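The count of 4 chunked files mentioned above can be illustrated with a minimal sketch in plain Python (this is not AutoRAG code). The method names and sizes are hypothetical, since the YAML config from the notebook is not reproduced in this thread, and the assumption that outputs are numbered in combination order is only a guess for illustration.

```python
from itertools import product

# Hypothetical values; the real YAML config is not reproduced in this thread.
chunk_methods = ["Token", "Sentence"]
chunk_sizes = [512, 1024]


def expected_chunk_files(methods, sizes, trial_dir="/content/chunk_project_dir/0"):
    """Return one assumed output path per (chunk_method, chunk_size) combination."""
    combos = list(product(methods, sizes))
    # Assumption: outputs are numbered 0.parquet, 1.parquet, ... in combination order.
    return {f"{trial_dir}/{i}.parquet": combo for i, combo in enumerate(combos)}


files = expected_chunk_files(chunk_methods, chunk_sizes)
for path, (method, size) in files.items():
    print(path, method, size)
```

Two methods times two sizes gives four files, while the parse step still yields a single "parsed" parquet file.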
|
@kun432 The So,
This is the basic structure.
raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)
This code is not the intended usage. The chunked parquet file must be in
@thap2331 I assigned this to you, thanks :) |
@vkehfdl1 hmm...kind of lost again. I think I need more time to think... will read the docs and try. Thanks, anyway. |
Here is a brief image of our Data Creation structure. I hope it helps. For RAG, here are the steps.
|
Hello @vkehfdl1, I'm facing the same issue. I understand that the
Then the Corpus instance must come from the chunked result, so we get it from
That makes sense. However, when I follow this, create the QA, and run the evaluation, I always get: It only works if I create the
But it does not make sense to use the chunked parquet for What's the correct way to do this? Because the documentation is using |
Hello @RemyOstyn
raw_df = pd.read_parquet("./content/parse_project_dir/0/0.parquet") # parse_project_dir folder
raw_instance = Raw(raw_df)
corpus_df = pd.read_parquet("/content/chunk_project_dir/1/6.parquet") # chunk_project_dir folder
corpus_instance = Corpus(corpus_df, raw_instance)
This is the right code, as you said. Just one thing: did you run the corpus & qa dataset once? |
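A small guard can catch the Raw/Corpus mix-up discussed in this thread before any QA is generated. This is a hypothetical helper, not part of AutoRAG; it only checks which project directory each path lives in, and it assumes the directory names `parse_project_dir` and `chunk_project_dir` used in the snippet above.

```python
from pathlib import PurePosixPath


def check_raw_and_corpus_paths(raw_path, corpus_path,
                               parse_dir_name="parse_project_dir",
                               chunk_dir_name="chunk_project_dir"):
    """Raise ValueError unless Raw comes from the parse dir and Corpus from the chunk dir."""
    raw_parts = PurePosixPath(raw_path).parts
    corpus_parts = PurePosixPath(corpus_path).parts
    if parse_dir_name not in raw_parts:
        raise ValueError(f"Raw should be read from {parse_dir_name}: {raw_path}")
    if chunk_dir_name not in corpus_parts:
        raise ValueError(f"Corpus should be read from {chunk_dir_name}: {corpus_path}")
    return True


# Paths taken from the snippet above.
check_raw_and_corpus_paths(
    "./content/parse_project_dir/0/0.parquet",
    "/content/chunk_project_dir/1/6.parquet",
)
```

Passing a chunked parquet path as `raw_path` raises a `ValueError`, which flags the pattern described earlier in the thread as not intended.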
I am having the same issue as Remy, left a comment here: #630 (comment) |
@RemyOstyn @tha23rd |
I updated this issue with both the stack trace and the code I used to generate it: #630 (comment) |
Hello @RemyOstyn @tha23rd So there are two options to use.
autorag evaluate --config your/path/to/default_config.yaml --qa_data_path your/path/to/qa.parquet --corpus_data_path your/path/to/corpus.parquet --project_dir ./your/project/directory --skip_validation true
or
from autorag.evaluator import Evaluator
evaluator = Evaluator(qa_data_path='your/path/to/qa.parquet', corpus_data_path='your/path/to/corpus.parquet',
                      project_dir='your/path/to/project_directory')
evaluator.start_trial('your/path/to/config.yaml', skip_validation=True)
In the meantime, we will fix the #863 issue ASAP. Thank you. |
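If the CLI option is driven from a script, the command above can be assembled as an argument list. This is a minimal sketch: the flags are copied from the command above, the helper name `build_evaluate_command` is hypothetical, and nothing is executed here.

```python
import shlex


def build_evaluate_command(config, qa_data_path, corpus_data_path,
                           project_dir, skip_validation=True):
    """Assemble the `autorag evaluate` argument list shown above (hypothetical helper)."""
    cmd = [
        "autorag", "evaluate",
        "--config", config,
        "--qa_data_path", qa_data_path,
        "--corpus_data_path", corpus_data_path,
        "--project_dir", project_dir,
    ]
    if skip_validation:
        # Only add the flag when requested; otherwise the CLI default applies.
        cmd += ["--skip_validation", "true"]
    return cmd


cmd = build_evaluate_command(
    "your/path/to/default_config.yaml",
    "your/path/to/qa.parquet",
    "your/path/to/corpus.parquet",
    "./your/project/directory",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the placeholder paths are replaced with real ones.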
Hello @vkehfdl1 You're right, I was using the passage augmenter, so I will disable it for now. Thank you for your support! |
I'm confused because the step2 notebook and the tutorial document are a little different.
In the step2 notebook, QA is generated from these 2 instances:
It seems that initial_qa and the initial corpus are created from one of the "chunked" parquet files, and the other optimized QA data and corpus are created from a combination of another "chunked" parquet file and the initial corpus and QA.
OTOH, the tutorial document (https://docs.auto-rag.com/data_creation/tutorial.html) seems to do this:
In this tutorial, the optimized QA data and corpus are created from a combination of "initial_raw_df", which is the same as the "parsed" data, and the initial corpus and QA.
It seems that different data is used, and I'm confused.
So my basic question: is this code in the step2 notebook correct?