[Question] QA-Corpus mapping #854

Closed
kun432 opened this issue Oct 17, 2024 · 15 comments · Fixed by #905

kun432 commented Oct 17, 2024

I'm confused because the step2 notebook and the tutorial document differ slightly.

In the step2 notebook, QA is generated from these two instances:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

initial_qa = (
    corpus_instance.sample(random_single_hop, n=3)
    (snip)
)

initial_qa.to_parquet('/content/initial_qa.parquet', '/content/initial_corpus.parquet')

new_corpus_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
new_corpus_instance = Corpus(new_corpus_df, raw_instance)

new_qa = initial_qa.update_corpus(new_corpus_instance)
new_qa.to_parquet("/content/new_qa.parquet", "/content/new_corpus.parquet")

It seems initial_qa and initial_corpus are created from one of the "chunked" parquet files, and then the optimized QA data and corpus are created from the combination of another "chunked" parquet file and the initial corpus and QA.

On the other hand, the tutorial document (https://docs.auto-rag.com/data_creation/tutorial.html) seems to do this:

# initial_raw seems "parsed" data
initial_raw = Raw(initial_raw_df)

# initial chunk
initial_corpus = initial_raw.chunk(
    "llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5
)
llm = OpenAI()
initial_qa = (
    initial_corpus.sample(random_single_hop, n=3)
   (snip)
)
initial_qa.to_parquet("./initial_qa.parquet", "./initial_corpus.parquet")

# chunk optimization, now we have other variations of "chunk"s
chunker = Chunker.from_parquet("./initial_raw.parquet", "./chunk_project_dir")
chunker.start_chunking("./chunking.yaml")

# corpus-qa mapping
raw = Raw(initial_raw_df)   # <-- "parsed" data, right?
corpus = Corpus(initial_corpus_df, raw)
qa = QA(initial_qa_df, corpus)

new_qa = qa.update_corpus(Corpus(new_corpus_df, raw))

In this tutorial, the optimized QA data and corpus are created from the combination of initial_raw_df, which is the same as the "parsed" data, plus the initial corpus and QA.

It seems "different data" are used in the two places, and I'm confused.

So my basic question: is this code in the step2 notebook correct?

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
@vkehfdl1 (Contributor)

Hello @kun432

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

There is nothing wrong with this code. A Raw instance is basically parsed data. You can make it using the Parser and a parse.yaml: https://docs.auto-rag.com/data_creation/parse/parse.html
Or you can use the Hugging Face Space to get a raw.parquet.
Of course, you can make it on your own with pandas if you already have parsed data.

In conclusion, raw.parquet is the 'parsed' file, so it does not matter whether you load it from a parquet file or a DataFrame (see the sketch below).

  • Raw : parsed data
  • Corpus : chunked data
  • QA : Question & Answer dataset based on the corpus
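
In other words, loading from a file or wrapping an existing DataFrame should be equivalent. A minimal sketch (the autorag.data.qa.schema import path is an assumption based on the data-creation docs and may differ by version; already_parsed_df is a hypothetical DataFrame you parsed yourself):

import pandas as pd
from autorag.data.qa.schema import Raw  # import path assumed from the docs

# Hypothetical pre-parsed data; columns follow the parsed-data format
# described in the docs (assumption).
already_parsed_df = pd.DataFrame({
    "texts": ["some parsed text"],
    "path": ["doc.pdf"],
    "page": [1],
    "last_modified_datetime": [pd.Timestamp.now()],
})

# Both should yield an equivalent 'parsed' Raw instance:
raw_from_file = Raw(pd.read_parquet("./parse_project_dir/0/0.parquet"))
raw_from_df = Raw(already_parsed_df)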

@thap2331

Whoa...
This needs to be somewhere in the docs... Please...

Raw : parsed data
Corpus : chunked data
QA : Question & Answer dataset based on the corpus

@vkehfdl1 (Contributor)

@thap2331 Okay, we will.
I will file this as a documentation issue.

vkehfdl1 added the documentation label on Oct 20, 2024
vkehfdl1 self-assigned this on Oct 20, 2024
kun432 (Author) commented Oct 20, 2024

@vkehfdl1 Thank you.

Well, I think I got it. It's like Raw turns its input parquet/DataFrame into a "parsed" instance regardless of whether that input is chunked or not, right?

Let's take the step2 notebook as an example.

First, parse a PDF with the Parser.

from autorag.parser import Parser

parser = Parser(data_path_glob="/content/raw_documents/sample.pdf", project_dir="/content/parse_project_dir")
parser.start_parsing("/content/parse.yaml")

This produces "/content/parse_project_dir/0/0.parquet" as one "parsed" parquet file.

Then chunk it with the Chunker, using this YAML config.

modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: en

from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="/content/parse_project_dir/0/0.parquet", project_dir="/content/chunk_project_dir")
chunker.start_chunking("/content/chunk.yaml")

This produces four "chunked" parquet files, one for each combination of chunk_method and chunk_size:

  • /content/chunk_project_dir/0/0.parquet
  • /content/chunk_project_dir/0/1.parquet
  • /content/chunk_project_dir/0/2.parquet
  • /content/chunk_project_dir/0/3.parquet

Then, to generate QA, prepare Raw and Corpus:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

At this point, I think all of the snippets below make the same kind of "parsed" instance:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)
# choose "parsed" parquet before chunked
raw_df = pd.read_parquet("/content/parse_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

@thap2331

> @thap2331 Okay, we will. I will file this as a documentation issue.

@vkehfdl1 you can assign this to me. Or, I can get this going. If I do, I will assign this to you.

vkehfdl1 (Contributor) commented Oct 21, 2024

@kun432 The Raw instance should get only the 'non-chunked' dataframe; the chunked dataframe must go into the Corpus instance.
Yes, you can make a Raw instance from a 'chunked' dataframe, but that is not the intended usage.

So,

  • Raw : parsed data
  • Corpus : chunked data
  • QA : Question & Answer dataset based on the corpus

This is the basic structure.
All of your code is correct and working, but

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)

This code is not the intended usage; a chunked parquet file belongs in a Corpus instance. The intended pairing is sketched below.
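
A minimal sketch of the intended pairing, assuming the step2 notebook's directory layout (your trial/module indices may differ):

raw_df = pd.read_parquet("/content/parse_project_dir/0/0.parquet")  # parsed output -> Raw
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")  # chunked output -> Corpus
corpus_instance = Corpus(corpus_df, raw_instance)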

@thap2331 I assigned it to you, thanks :)
Go to the docs/ folder and edit the .md files.

kun432 (Author) commented Oct 21, 2024

@vkehfdl1 hmm... kind of lost again. I think I need more time to think... I will read the docs and try. Thanks anyway.

@vkehfdl1 (Contributor)

@kun432 [image: diagram of the Data Creation structure]

Here is a brief diagram of our Data Creation structure. I hope it helps.

For RAG, here are the steps (a condensed sketch follows the list):

  1. Parse raw documents into text. => This becomes the Raw instance.
  2. Chunk the parsed text into small passages (chunks). => This becomes the Corpus instance.
  3. Generate questions & answers from the corpus dataset. => This becomes the QA instance.
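
A condensed end-to-end sketch assembled from the snippets earlier in this thread (the import paths for Raw, Corpus, and random_single_hop are assumptions taken from the tutorial docs and may differ by version; paths and the snipped QA-generation steps are placeholders):

import pandas as pd
from autorag.parser import Parser
from autorag.chunker import Chunker
from autorag.data.qa.schema import Raw, Corpus        # import path assumed
from autorag.data.qa.sample import random_single_hop  # import path assumed

# 1. Parse raw documents -> Raw instance
parser = Parser(data_path_glob="./raw_docs/*.pdf", project_dir="./parse_project_dir")
parser.start_parsing("./parse.yaml")
raw_instance = Raw(pd.read_parquet("./parse_project_dir/0/0.parquet"))

# 2. Chunk the parsed text -> Corpus instance
chunker = Chunker.from_parquet(parsed_data_path="./parse_project_dir/0/0.parquet", project_dir="./chunk_project_dir")
chunker.start_chunking("./chunk.yaml")
corpus_instance = Corpus(pd.read_parquet("./chunk_project_dir/0/0.parquet"), raw_instance)

# 3. Generate QA from the corpus (query/answer generation steps snipped, as in the notebook)
initial_qa = corpus_instance.sample(random_single_hop, n=3)
initial_qa.to_parquet("./qa.parquet", "./corpus.parquet")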

RemyOstyn commented Oct 24, 2024

Hello @vkehfdl1 ,

I'm facing the same issue.

I understand that the Raw instance must come from the parsed result, so I suppose we have to get it from /content/parse_project_dir/, right? For example:

raw_df = pd.read_parquet("./content/parse_project_dir/0/0.parquet") # parse_project_dir folder
raw_instance = Raw(raw_df)

Then the Corpus instance must come from the chunked result, so we get it from /content/chunk_project_dir/, for example:

corpus_df = pd.read_parquet("/content/chunk_project_dir/1/6.parquet") # chunk_project_dir folder
corpus_instance = Corpus(corpus_df, raw_instance)

That makes sense; however, when I follow this, then create the QA and run the evaluation, I always get:
ValueError: doc_id: 0eec7e3a-e1c0-4d33-8cc5-7e604b30339b not found in corpus_data.

It only works if I create the Raw instance from /content/chunk_project_dir/ like this:

raw_df = pd.read_parquet("/content/chunk_project_dir/1/6.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/1/6.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

But it does not make sense to use the chunked parquet for the Raw instance, as you mentioned.

What's the correct way of doing this? The documentation uses chunk_project_dir for both the Raw and Corpus instances.

@vkehfdl1 (Contributor)

Hello @RemyOstyn
That is weird. I think it might be a bug. The right code is:

raw_df = pd.read_parquet("./content/parse_project_dir/0/0.parquet") # parse_project_dir folder
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/1/6.parquet") # chunk_project_dir folder
corpus_instance = Corpus(corpus_df, raw_instance)

This is the right code, just as you said.

Just one thing: did you already build the corpus & QA dataset once?
If you change the data, you have to make a new project directory and use that, as sketched below.
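
For example, re-running a trial in a fresh directory (mirroring the Evaluator usage later in this thread; fresh_project_dir is a placeholder name):

from autorag.evaluator import Evaluator

# Point the trial at a directory that has never seen the old data,
# so no stale corpus/QA artifacts are reused.
evaluator = Evaluator(qa_data_path='your/path/to/qa.parquet',
                      corpus_data_path='your/path/to/corpus.parquet',
                      project_dir='your/path/to/fresh_project_dir')
evaluator.start_trial('your/path/to/config.yaml')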

tha23rd commented Oct 25, 2024

I am having the same issue as Remy, left a comment here: #630 (comment)

@vkehfdl1 (Contributor)

@RemyOstyn @tha23rd
Can you send me a full stack trace? If you are both hitting trouble in similar code, it could be a serious bug that I still haven't figured out.

tha23rd commented Oct 25, 2024

> @RemyOstyn @tha23rd Can you send me a full stack trace? If you are both hitting trouble in similar code, it could be a serious bug that I still haven't figured out.

I updated that issue with both the stack trace and the code I used to generate it: #630 (comment)

vkehfdl1 (Contributor) commented Oct 26, 2024

Hello @RemyOstyn @tha23rd
It is certain that you are both using the passage_augmenter node in your YAML files!!
Unfortunately, validation of the passage augmenter is not working right now. See #863.
But when you start a trial, it automatically runs validation, which raises an error like ValueError: doc_id: 0eec7e3a-e1c0-4d33-8cc5-7e604b30339b not found in corpus_data.

So there are two options:

  1. Simply skip validation on the trial:
autorag evaluate --config your/path/to/default_config.yaml --qa_data_path your/path/to/qa.parquet --corpus_data_path your/path/to/corpus.parquet --project_dir ./your/project/directory --skip_validation true

or

from autorag.evaluator import Evaluator

evaluator = Evaluator(qa_data_path='your/path/to/qa.parquet', corpus_data_path='your/path/to/corpus.parquet',
                      project_dir='your/path/to/project_directory')
evaluator.start_trial('your/path/to/config.yaml', skip_validation=True)

  2. Do not use the passage augmenter. => Delete the passage_augmenter node from your config YAML file.

In the meantime, we will fix the #863 issue ASAP. Thank you.

@RemyOstyn

Hello @vkehfdl1

You're right, I was using the passage augmenter, so I will disable it for now.

Thank you for your support!
