Web Retrieval (2.x) #5614

Closed
4 tasks done
Tracked by #5265
ZanSara opened this issue Aug 23, 2023 · 1 comment · Fixed by #5712

ZanSara commented Aug 23, 2023

Haystack 1.x supports retrievers that perform a web search instead of looking into a document store.

In Haystack 2.0 we decided to split this component into smaller components, which would look like this in a web retrieval RAG pipeline:

graph TD;

IN{IN} -- query --> WebSearch
IN{IN} -- query --> PromptBuilder
WebSearch -- links --> UrlCacheChecker
UrlCacheChecker -- misses --> LinkContentFetcher
LinkContentFetcher -- html --> HTMLToDocument
LinkContentFetcher -- pdf --> PDFToDocument
LinkContentFetcher -- json --> JSONToDocument
LinkContentFetcher -- ... --> ...ToDocument
HTMLToDocument -- docs --> Join
PDFToDocument -- docs --> Join
JSONToDocument -- docs --> Join
...ToDocument -- docs --> Join
Join -- docs --> TextSplitter
TextSplitter -- docs --> DocumentWriter
TextSplitter -- docs --> Ranker
UrlCacheChecker -- hits --> Ranker
Ranker -- docs --> PromptBuilder
PromptBuilder -- prompt --> GPTGenerator
GPTGenerator -- replies --> AnswerBuilder
AnswerBuilder -- answers --> OUT{OUT}

...WebSearch

WebSearch components expect a question and return a list of links, as the WebSearch components in Haystack 1.x do. As in Haystack 1.x, we expect to have about four implementations:

  • SerpWebSearch
  • SerperWebSearch
  • GoogleWebSearch
  • BingWebSearch

and any other search engine backend we may want to support in the future.

Note that WebSearch components can return links, snippets, or both. We can select the behavior we prefer with an init flag, or simply make the component always return both and discard or ignore the outputs we are not interested in.
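
To make the init-flag option concrete, here is a minimal sketch in plain Python. It is not the Haystack 2.0 API: the class name `SketchWebSearch`, the `return_mode` flag, the `SearchResult` type, and the `fake_backend` function are all hypothetical names chosen for illustration, and the backend is a stub rather than a real search engine call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchResult:
    """One raw search hit. Field names are illustrative assumptions."""
    link: str
    snippet: str


class SketchWebSearch:
    """Hypothetical stand-in for a ...WebSearch component (not the Haystack API)."""

    def __init__(self, backend: Callable[[str], List[SearchResult]], return_mode: str = "both"):
        # return_mode is an assumed init flag: "links", "snippets", or "both".
        if return_mode not in ("links", "snippets", "both"):
            raise ValueError(f"unsupported return_mode: {return_mode!r}")
        self.backend = backend
        self.return_mode = return_mode

    def run(self, query: str) -> Dict[str, List[str]]:
        results = self.backend(query)
        output: Dict[str, List[str]] = {}
        # Only the requested outputs are produced, one key per output edge.
        if self.return_mode in ("links", "both"):
            output["links"] = [r.link for r in results]
        if self.return_mode in ("snippets", "both"):
            output["snippets"] = [r.snippet for r in results]
        return output


def fake_backend(query: str) -> List[SearchResult]:
    """Stub backend returning fixed results instead of calling a search engine."""
    return [
        SearchResult(link="https://example.com/a", snippet=f"About {query}"),
        SearchResult(link="https://example.com/b", snippet=f"More on {query}"),
    ]


links_only = SketchWebSearch(fake_backend, return_mode="links").run("haystack")
both = SketchWebSearch(fake_backend).run("haystack")
```

With the alternative design (always return both), the flag disappears and downstream components simply leave the unused output unconnected.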

LinkFetcher

LinkFetcher also works much like its Haystack 1.x version, but it is much simpler, because it will have neither callbacks for document processing nor a preprocessor instance. Instead it will work similarly to a FileTypeClassifier: it will check the type of the file it obtained from the input link and return that file on an edge named after the type, for example html, pdf, json, ...

The file conversion and the preprocessing will be done later in the pipeline by the respective components.
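
The routing step can be sketched as follows. This is an assumption-laden illustration, not the actual component: `route_by_type`, the `EDGE_BY_MIME` table, and the `unknown` fallback edge are hypothetical, and the sketch takes already-fetched content as input so that it involves no network access.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from MIME type to output edge name.
EDGE_BY_MIME = {
    "text/html": "html",
    "application/pdf": "pdf",
    "application/json": "json",
}


def route_by_type(
    fetched: List[Tuple[str, str, bytes]]
) -> Dict[str, List[Tuple[str, bytes]]]:
    """Group fetched content by file type.

    Each input item is (url, Content-Type header value, raw body).
    The result maps an edge name to the (url, body) pairs sent on that edge.
    """
    edges: Dict[str, List[Tuple[str, bytes]]] = {}
    for url, content_type, body in fetched:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";", 1)[0].strip().lower()
        # Unrecognized types go to an assumed "unknown" edge.
        edge = EDGE_BY_MIME.get(mime, "unknown")
        edges.setdefault(edge, []).append((url, body))
    return edges


routed = route_by_type(
    [
        ("https://example.com/", "text/html; charset=utf-8", b"<html></html>"),
        ("https://example.com/x.pdf", "application/pdf", b"%PDF"),
        ("https://example.com/y", "image/png", b""),
    ]
)
```

Each key of `routed` would feed the matching converter in the graph above (html to HTMLToDocument, pdf to PDFToDocument, and so on).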

CacheChecker

CacheChecker is a document-store-aware component that simply checks whether a document is present in a document store. In this case, rather than checking Document instances, it should check for documents with a specific URL in their metadata. It also supports cache expiration, so the documents may also need their retrieval datetime in the metadata.

Recall that CacheChecker returns two different outputs: missing, a list of links it could not find in the docstore (or whose cached version is too old), and found, a list of Documents that correspond to the links it received as input.
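
The hit/miss split with expiration can be sketched like this. Everything here is a hypothetical illustration rather than the real component: the `Doc` class stands in for a Document, a plain list stands in for the document store, and the metadata keys `url` and `retrieved_at` as well as the `max_age` parameter are assumed names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Minimal stand-in for a Document: content plus a metadata dict."""
    content: str
    meta: Dict[str, object] = field(default_factory=dict)


def check_cache(
    links: List[str],
    stored_docs: List[Doc],
    max_age: Optional[timedelta] = None,
    now: Optional[datetime] = None,
) -> Dict[str, list]:
    """Split links into cached Documents ("found") and links to fetch ("missing")."""
    now = now or datetime.now(timezone.utc)
    # Index stored documents by URL; a split page can yield several per URL.
    by_url: Dict[object, List[Doc]] = {}
    for doc in stored_docs:
        by_url.setdefault(doc.meta.get("url"), []).append(doc)

    found: List[Doc] = []
    missing: List[str] = []
    for link in links:
        # A document counts as fresh if no expiry is set or it is young enough.
        fresh = [
            d
            for d in by_url.get(link, [])
            if max_age is None or now - d.meta["retrieved_at"] <= max_age
        ]
        if fresh:
            found.extend(fresh)
        else:
            # Not in the store, or every cached copy is too old.
            missing.append(link)
    return {"found": found, "missing": missing}


now = datetime(2023, 8, 23, tzinfo=timezone.utc)
store = [
    Doc("fresh", {"url": "https://example.com/a", "retrieved_at": now - timedelta(hours=1)}),
    Doc("stale", {"url": "https://example.com/b", "retrieved_at": now - timedelta(days=10)}),
]
result = check_cache(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    store,
    max_age=timedelta(days=1),
    now=now,
)
```

In this sketch a link is a hit as soon as one fresh document exists for it; a real implementation would need to decide how to treat URLs whose chunks have mixed ages.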

Tasks

  1. 2.x topic:CI topic:tests type:documentation
  2. 2.x topic:tests type:documentation
  3. 2.x
  4. 2.x topic:tests type:documentation
ZanSara changed the title from "Web Retrievers (v2)" to "Web Retrieval (v2)" on Aug 23, 2023
ZanSara added the "2.x Related to Haystack v2.0" label on Aug 23, 2023
ZanSara reopened this on Sep 6, 2023
ZanSara added this to the "2.0 blockers" milestone on Sep 25, 2023
masci added the "epic" label on Oct 2, 2023
masci changed the title from "Web Retrieval (v2)" to "Web Retrieval (2.x)" on Oct 2, 2023
@Timoeller
Contributor

@vblagoje this seems finished, please close this epic if it is.
