[Question] QA-Corpus mapping #854

Closed
kun432 opened this issue Oct 17, 2024 · 15 comments · Fixed by #905

kun432 commented Oct 17, 2024

I'm confused because the step2 notebook and the tutorial document differ slightly.

In the step2 notebook, QA is generated from these two instances:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

initial_qa = (
    corpus_instance.sample(random_single_hop, n=3)
    (snip)
)

initial_qa.to_parquet('/content/initial_qa.parquet', '/content/initial_corpus.parquet')

new_corpus_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
new_corpus_instance = Corpus(new_corpus_df, raw_instance)

new_qa = initial_qa.update_corpus(new_corpus_instance)
new_qa.to_parquet("/content/new_qa.parquet", "/content/new_corpus.parquet")

It seems initial_qa and initial_corpus are created from one of the "chunked" parquet files, and then the optimized QA data and corpus are created from the combination of another "chunked" parquet file and the initial corpus and QA.

On the other hand, the tutorial document (https://docs.auto-rag.com/data_creation/tutorial.html) seems to do this:

# initial_raw seems "parsed" data
initial_raw = Raw(initial_raw_df)

# initial chunk
initial_corpus = initial_raw.chunk(
    "llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5
)
llm = OpenAI()
initial_qa = (
    initial_corpus.sample(random_single_hop, n=3)
   (snip)
)
initial_qa.to_parquet("./initial_qa.parquet", "./initial_corpus.parquet")

# chunk optimization, now we have other variations of "chunk"s
chunker = Chunker.from_parquet("./initial_raw.parquet", "./chunk_project_dir")
chunker.start_chunking("./chunking.yaml")

# corpus-qa mapping
raw = Raw(initial_raw_df)   # <-- "parsed" data, right?
corpus = Corpus(initial_corpus_df, raw)
qa = QA(initial_qa_df, corpus)

new_qa = qa.update_corpus(Corpus(new_corpus_df, raw))

In this tutorial, the optimized QA data and corpus are created from the combination of initial_raw_df, which is the same as the "parsed" data, plus the initial corpus and QA.

It seems "different data" are used in the two places, and I'm confused.

So my basic question: is this code in the step2 notebook correct?

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
@vkehfdl1 (Contributor)

Hello @kun432

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

There is nothing wrong with this code. A Raw instance is basically parsed data. You can make it using the Parser and a parse.yaml: https://docs.auto-rag.com/data_creation/parse/parse.html
Or you can use the Hugging Face Space to get a raw.parquet.
Of course, you can make it on your own with pandas if you already have parsed data.

In conclusion, raw.parquet is the 'parsed' file, so it does not matter whether you load it from a parquet file or a DataFrame (see the sketch below).

  • Raw : parsed data
  • Corpus : chunked data
  • QA : Question & Answer dataset based on the corpus
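
In other words, loading from a file or wrapping an existing DataFrame should be equivalent. A minimal sketch (the autorag.data.qa.schema import path is an assumption based on the data-creation docs and may differ by version; already_parsed_df is a hypothetical DataFrame you parsed yourself):

import pandas as pd
from autorag.data.qa.schema import Raw  # import path assumed from the docs

# Hypothetical pre-parsed data; columns follow the parsed-data format
# described in the docs (assumption).
already_parsed_df = pd.DataFrame({
    "texts": ["some parsed text"],
    "path": ["doc.pdf"],
    "page": [1],
    "last_modified_datetime": [pd.Timestamp.now()],
})

# Both should yield an equivalent 'parsed' Raw instance:
raw_from_file = Raw(pd.read_parquet("./parse_project_dir/0/0.parquet"))
raw_from_df = Raw(already_parsed_df)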

@thap2331

Whoa...
This needs to be somewhere in the docs... Please...

Raw : parsed data
Corpus : chunked data
QA : Question & Answer dataset based on the corpus

@vkehfdl1 (Contributor)

@thap2331 Okay, we will.
I will file this as a documentation issue.

vkehfdl1 added the documentation label on Oct 20, 2024
vkehfdl1 self-assigned this on Oct 20, 2024
kun432 (Author) commented Oct 20, 2024

@vkehfdl1 Thank you.

Well, I think I got it. It's like Raw turns its input parquet/DataFrame into a "parsed" instance regardless of whether that input is chunked or not, right?

Let's take the step2 notebook as an example.

First, parse a PDF with the Parser.

from autorag.parser import Parser

parser = Parser(data_path_glob="/content/raw_documents/sample.pdf", project_dir="/content/parse_project_dir")
parser.start_parsing("/content/parse.yaml")

This produces "/content/parse_project_dir/0/0.parquet" as one "parsed" parquet file.

Then chunk it with the Chunker, using this YAML config.

modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: en

from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="/content/parse_project_dir/0/0.parquet", project_dir="/content/chunk_project_dir")
chunker.start_chunking("/content/chunk.yaml")

This produces four "chunked" parquet files, one for each combination of chunk_method and chunk_size:

  • /content/chunk_project_dir/0/0.parquet
  • /content/chunk_project_dir/0/1.parquet
  • /content/chunk_project_dir/0/2.parquet
  • /content/chunk_project_dir/0/3.parquet

Then, to generate QA, prepare Raw and Corpus:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

At this point, I think all of the snippets below make the same kind of "parsed" instance:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)
# choose "parsed" parquet before chunked
raw_df = pd.read_parquet("/content/parse_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

@thap2331

> @thap2331 Okay, we will. I will file this as a documentation issue.

@vkehfdl1 you can assign this to me. Or, I can get this going. If I do, I will assign this to you.

vkehfdl1 (Contributor) commented Oct 21, 2024

@kun432 The Raw instance should get only the 'non-chunked' dataframe; the chunked dataframe must go into the Corpus instance.
Yes, you can make a Raw instance from a 'chunked' dataframe, but that is not the intended usage.

So,

  • Raw : parsed data
  • Corpus : chunked data
  • QA : Question & Answer dataset based on the corpus

This is the basic structure.
All of your code is correct and working, but

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)

This code is not the intended usage; a chunked parquet file belongs in a Corpus instance. The intended pairing is sketched below.
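
A minimal sketch of the intended pairing, assuming the step2 notebook's directory layout (your trial/module indices may differ):

raw_df = pd.read_parquet("/content/parse_project_dir/0/0.parquet")  # parsed output -> Raw
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")  # chunked output -> Corpus
corpus_instance = Corpus(corpus_df, raw_instance)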

@thap2331 I assigned it to you, thanks :)
Go to the docs/ folder and edit the .md files.

kun432 (Author) commented Oct 21, 2024

@vkehfdl1 hmm... kind of lost again. I think I need more time to think... I will read the docs and try. Thanks anyway.

@vkehfdl1 (Contributor)

@kun432 [image: diagram of the Data Creation structure]

Here is a brief diagram of our Data Creation structure. I hope it helps.

For RAG, here are the steps (a condensed sketch follows the list):

  1. Parse raw documents into text. => This becomes the Raw instance.
  2. Chunk the parsed text into small passages (chunks). => This becomes the Corpus instance.
  3. Generate questions & answers from the corpus dataset. => This becomes the QA instance.
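
A condensed end-to-end sketch assembled from the snippets earlier in this thread (the import paths for Raw, Corpus, and random_single_hop are assumptions taken from the tutorial docs and may differ by version; paths and the snipped QA-generation steps are placeholders):

import pandas as pd
from autorag.parser import Parser
from autorag.chunker import Chunker
from autorag.data.qa.schema import Raw, Corpus        # import path assumed
from autorag.data.qa.sample import random_single_hop  # import path assumed

# 1. Parse raw documents -> Raw instance
parser = Parser(data_path_glob="./raw_docs/*.pdf", project_dir="./parse_project_dir")
parser.start_parsing("./parse.yaml")
raw_instance = Raw(pd.read_parquet("./parse_project_dir/0/0.parquet"))

# 2. Chunk the parsed text -> Corpus instance
chunker = Chunker.from_parquet(parsed_data_path="./parse_project_dir/0/0.parquet", project_dir="./chunk_project_dir")
chunker.start_chunking("./chunk.yaml")
corpus_instance = Corpus(pd.read_parquet("./chunk_project_dir/0/0.parquet"), raw_instance)

# 3. Generate QA from the corpus (query/answer generation steps snipped, as in the notebook)
initial_qa = corpus_instance.sample(random_single_hop, n=3)
initial_qa.to_parquet("./qa.parquet", "./corpus.parquet")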

RemyOstyn commented Oct 24, 2024

Hello @vkehfdl1 ,

I'm facing the same issue.

I understand that the Raw instance must come from the parsed result, so I suppose we have to get it from /content/parse_project_dir/, right? For example:

raw_df = pd.read_parquet("./content/parse_project_dir/0/0.parquet") # parse_project_dir folder
raw_instance = Raw(raw_df)

Then the Corpus instance must come from the chunked result, so we get it from /content/chunk_project_dir/, for example:

corpus_df = pd.read_parquet("/content/chunk_project_dir/1/6.parquet") # chunk_project_dir folder
corpus_instance = Corpus(corpus_df, raw_instance)

That makes sense; however, when I follow this, then create the QA and run the evaluation, I always get:
ValueError: doc_id: 0eec7e3a-e1c0-4d33-8cc5-7e604b30339b not found in corpus_data.

It only works if I create the Raw instance from /content/chunk_project_dir/ like this:

raw_df = pd.read_parquet("/content/chunk_project_dir/1/6.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/1/6.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

But it does not make sense to use the chunked parquet for the Raw instance, as you mentioned.

What's the correct way of doing this? The documentation uses chunk_project_dir for both the Raw and Corpus instances.

@vkehfdl1 (Contributor)

Hello @RemyOstyn
That is weird. I think it might be a bug. The right code is:

raw_df = pd.read_parquet("./content/parse_project_dir/0/0.parquet") # parse_project_dir folder
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/1/6.parquet") # chunk_project_dir folder
corpus_instance = Corpus(corpus_df, raw_instance)

This is the right code, just as you said.

Just one thing: did you already build the corpus & QA dataset once?
If you change the data, you have to make a new project directory and use that, as sketched below.
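
For example, re-running a trial in a fresh directory (mirroring the Evaluator usage later in this thread; fresh_project_dir is a placeholder name):

from autorag.evaluator import Evaluator

# Point the trial at a directory that has never seen the old data,
# so no stale corpus/QA artifacts are reused.
evaluator = Evaluator(qa_data_path='your/path/to/qa.parquet',
                      corpus_data_path='your/path/to/corpus.parquet',
                      project_dir='your/path/to/fresh_project_dir')
evaluator.start_trial('your/path/to/config.yaml')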

tha23rd commented Oct 25, 2024

I am having the same issue as Remy, left a comment here: #630 (comment)

@vkehfdl1 (Contributor)

@RemyOstyn @tha23rd
Can you send me a full stack trace? If you are both hitting trouble in similar code, it could be a serious bug that I still haven't figured out.

tha23rd commented Oct 25, 2024

> @RemyOstyn @tha23rd Can you send me a full stack trace? If you are both hitting trouble in similar code, it could be a serious bug that I still haven't figured out.

I updated that issue with both the stack trace and the code I used to generate it: #630 (comment)

vkehfdl1 (Contributor) commented Oct 26, 2024

Hello @RemyOstyn @tha23rd
It is certain that you are both using the passage_augmenter node in your YAML files!!
Unfortunately, validation of the passage augmenter is not working right now. See #863.
But when you start a trial, it automatically runs validation, which raises an error like ValueError: doc_id: 0eec7e3a-e1c0-4d33-8cc5-7e604b30339b not found in corpus_data.

So there are two options:

  1. Simply skip validation on the trial:
autorag evaluate --config your/path/to/default_config.yaml --qa_data_path your/path/to/qa.parquet --corpus_data_path your/path/to/corpus.parquet --project_dir ./your/project/directory --skip_validation true

or

from autorag.evaluator import Evaluator

evaluator = Evaluator(qa_data_path='your/path/to/qa.parquet', corpus_data_path='your/path/to/corpus.parquet',
                      project_dir='your/path/to/project_directory')
evaluator.start_trial('your/path/to/config.yaml', skip_validation=True)

  2. Do not use the passage augmenter. => Delete the passage_augmenter node from your config YAML file.

In the meantime, we will fix the #863 issue ASAP. Thank you.

@RemyOstyn

Hello @vkehfdl1

You're right, I was using the passage augmenter, so I will disable it for now.

Thank you for your support!
