Web Retrieval (2.x) #5614

ZanSara · 2023-08-23T12:44:39Z

Haystack 1.x supports retrievers that perform a web search instead of looking into a document store.

In Haystack 2.0 we decided to split this component into smaller components. In a web retrieval RAG pipeline they would be wired together like this:

graph TD;

IN{IN} -- query --> WebSearch
IN{IN} -- query --> PromptBuilder
WebSearch -- links --> UrlCacheChecker
UrlCacheChecker -- misses --> LinkContentFetcher
LinkContentFetcher -- html --> HTMLToDocument
LinkContentFetcher -- pdf --> PDFToDocument
LinkContentFetcher -- json --> JSONToDocument
LinkContentFetcher -- ... --> ...ToDocument
HTMLToDocument -- docs --> Join
PDFToDocument -- docs --> Join
JSONToDocument -- docs --> Join
...ToDocument -- docs --> Join
Join -- docs --> TextSplitter
TextSplitter -- docs --> DocumentWriter
TextSplitter -- docs -->  Ranker
UrlCacheChecker -- hits --> Ranker
Ranker -- docs --> PromptBuilder
PromptBuilder -- prompt --> GPTGenerator
GPTGenerator -- replies --> AnswerBuilder
AnswerBuilder -- answers --> OUT{OUT}

`...WebSearch`

WebSearch components expect a question and return a list of links, similarly to what WebSearch components do in Haystack 1.x. As in Haystack 1.x, we expect to have about 4 implementations:

SerpWebSearch
SerperWebSearch
GoogleWebSearch
BingWebSearch

and any other search engine backend we may want to support in the future.

Note that WebSearch components can return links, snippets, or both. We can select the preferred behavior with an init flag, or simply make the component always return both and leave the outputs we are not interested in unconnected.
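To make the "always return both" option concrete, here is a minimal sketch in plain Python. `SketchWebSearch`, `SearchHit` and `fake_backend` are hypothetical names used only for illustration; they are not the Haystack API, and a real component would call an actual search engine backend.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchHit:
    """One search result as returned by a (hypothetical) backend."""
    link: str
    snippet: str


class SketchWebSearch:
    """Hypothetical stand-in for a ...WebSearch component."""

    def __init__(self, backend: Callable[[str], List[SearchHit]], top_k: int = 10):
        self.backend = backend  # any callable that maps a query to hits
        self.top_k = top_k

    def run(self, query: str) -> Dict[str, List[str]]:
        hits = self.backend(query)[: self.top_k]
        # Always emit both outputs; downstream components connect only
        # to the one they need.
        return {
            "links": [hit.link for hit in hits],
            "snippets": [hit.snippet for hit in hits],
        }


def fake_backend(query: str) -> List[SearchHit]:
    # Canned results so the sketch runs without network access.
    return [
        SearchHit("https://example.com/a", f"First result for {query}"),
        SearchHit("https://example.com/b", f"Second result for {query}"),
        SearchHit("https://example.com/c", f"Third result for {query}"),
    ]


search = SketchWebSearch(backend=fake_backend, top_k=2)
result = search.run("haystack")
```

With this shape no init flag is needed: a pipeline that only wants links simply leaves the `snippets` output unconnected.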

`LinkFetcher`

LinkFetcher also works very similarly to its Haystack 1.x version, but it is much simpler, because it will have neither callbacks for document processing nor a preprocessor instance. Instead it will work similarly to a FileTypeClassifier: it will check the type of the file it fetched from the input link and return that file on an edge named after the type, for example html, pdf, json, ...

The file conversion and the preprocessing will be done later in the pipeline by the respective components.
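The routing idea can be sketched as follows. This is an illustration only: the `route_by_content_type` helper, the MIME-to-edge mapping and the `unknown` fallback edge are assumptions, not part of the proposal, and the actual fetching of the link is left out.

```python
from typing import Dict

# Assumed mapping from MIME type to output edge name.
EDGE_BY_MIME = {
    "text/html": "html",
    "application/pdf": "pdf",
    "application/json": "json",
}


def route_by_content_type(content_type: str, body: bytes) -> Dict[str, bytes]:
    """Return the fetched body on the single edge matching its type."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    # Hypothetical fallback: unrecognized types go to an "unknown" edge.
    edge = EDGE_BY_MIME.get(mime, "unknown")
    return {edge: body}


html_out = route_by_content_type("text/html; charset=utf-8", b"<html></html>")
pdf_out = route_by_content_type("application/pdf", b"%PDF-1.7")
other_out = route_by_content_type("image/png", b"\x89PNG")
```

Because exactly one edge carries data per link, only the matching converter (for example HTMLToDocument) would run for that link.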

`CacheChecker`

CacheChecker is a document-store-aware component that simply checks for the presence of a document in a document store. In this case, rather than checking for document instances, it should check for the presence of documents with a specific URL in their metadata. It also supports cache expiration, so the documents might also need their retrieval datetime in the metadata.

Note that CacheChecker returns two different outputs: missing, a list of links it could not find in the docstore (or whose cached version is too old), and found, a list of Documents that correspond to the links it received as input.
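A minimal sketch of that lookup, assuming documents are plain dicts held in a list rather than a real document store. The `check_cache` function and the `url` and `retrieved_at` metadata keys are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def check_cache(
    links: List[str],
    documents: List[Dict[str, Any]],
    max_age: timedelta,
    now: datetime,
) -> Dict[str, list]:
    """Split links into cached documents (found) and links to fetch (missing)."""
    found: List[Dict[str, Any]] = []
    missing: List[str] = []
    for link in links:
        # A cached document counts only if its URL matches and it is fresh enough.
        fresh = [
            doc
            for doc in documents
            if doc["meta"]["url"] == link
            and now - doc["meta"]["retrieved_at"] <= max_age
        ]
        if fresh:
            found.extend(fresh)
        else:
            missing.append(link)
    return {"found": found, "missing": missing}


now = datetime(2023, 8, 23, 12, 0)
store = [
    {"content": "fresh page",
     "meta": {"url": "https://example.com/a", "retrieved_at": now - timedelta(hours=1)}},
    {"content": "stale page",
     "meta": {"url": "https://example.com/b", "retrieved_at": now - timedelta(days=7)}},
]
outputs = check_cache(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    store,
    max_age=timedelta(days=1),
    now=now,
)
```

Here the stale page and the never-seen link both end up in `missing`, so they would be fetched again, while the fresh document goes straight to the next component.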

Tasks

feat: Add SerperDevWebSearch Haystack 2.0 component #5712
Labels: 2.x, topic:CI, topic:tests, type:documentation

feat: Add LinkContentFetcher Haystack 2.0 component #5724
Labels: 2.x, topic:tests, type:documentation

UrlCacheChecker 2.0 #5840
Labels: 2.x

feat: Join component #5852
Labels: 2.x, topic:tests, type:documentation
Timoeller · 2023-11-23T13:31:21Z

@vblagoje this seems finished, please close this epic if it is.

ZanSara mentioned this issue Aug 23, 2023

Migrate Components to Pipeline v2 #5265

Closed

ZanSara changed the title ~~Web Retrievers (v2)~~ Web Retrieval (v2) Aug 23, 2023

ZanSara assigned vblagoje Aug 23, 2023

ZanSara added the 2.x Related to Haystack v2.0 label Aug 23, 2023

This was referenced Sep 4, 2023

feat: Add SerperDevWebSearch Haystack 2.0 component #5712

Merged

feat: Add LinkContentFetcher Haystack 2.0 component #5724

Merged

ZanSara closed this as completed in #5712 Sep 6, 2023

ZanSara reopened this Sep 6, 2023

ZanSara mentioned this issue Sep 20, 2023

feat: Join component #5852

Closed

MichelBartels mentioned this issue Sep 20, 2023

Add file abstraction #5856

Closed

ZanSara added this to the 2.0 blockers milestone Sep 25, 2023

masci added the epic label Oct 2, 2023

masci changed the title ~~Web Retrieval (v2)~~ Web Retrieval (2.x) Oct 2, 2023

Timoeller closed this as completed Nov 23, 2023
