QAEncoder: Towards Aligned Representation Learning in Question Answering System

Official implementation of our QAEncoder method for more advanced QA systems.

Introduction

Modern QA systems rely on retrieval-augmented generation (RAG) for accurate and trustworthy responses. However, the inherent gap between user queries and relevant documents hinders precise matching. Motivated by our conical distribution hypothesis, which posits that potential queries and documents form a cone-like structure in the embedding space, we introduce QAEncoder, a training-free approach to bridge this gap. Specifically, QAEncoder estimates the expectation of potential queries in the embedding space as a robust surrogate for the document embedding, and attaches document fingerprints so that these embeddings remain distinguishable across documents. Extensive experiments on fourteen embedding models across six languages and eight datasets validate QAEncoder's alignment capability. QAEncoder is a plug-and-play solution that integrates with existing RAG architectures and training-based methods.
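The core idea can be illustrated with a minimal sketch: embed several generated queries for a document, average them as a Monte Carlo estimate of the expected query embedding, and mix in the document's own embedding as a fingerprint. This is not the released implementation. The function names, the toy `toy_embed` encoder, and the interpolation weight `alpha` are assumptions made for illustration; in practice `embed` would be a real embedding model and the queries would come from an LLM-based generator.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean: the Monte Carlo estimate of the expectation."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def qa_embedding(
    doc: str,
    queries: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    alpha: float = 0.3,  # hypothetical fingerprint weight, not from the source
) -> Vector:
    """Query-aligned document embedding with a simple embedding fingerprint.

    Assumption: the fingerprint is a convex mix of the mean query embedding
    and the document embedding. The source only states that fingerprints
    are attached, not how.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    query_mean = mean_vector([normalize(embed(q)) for q in queries])
    doc_vec = normalize(embed(doc))
    mixed = [(1.0 - alpha) * m + alpha * d for m, d in zip(query_mean, doc_vec)]
    return normalize(mixed)


def toy_embed(text: str, dim: int = 8) -> Vector:
    """Stand-in encoder (character histogram); replace with a real model."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


doc = "QAEncoder is a training-free method for aligning documents with queries."
generated_queries = [
    "What is QAEncoder?",
    "Does QAEncoder require training?",
    "What does QAEncoder align?",
]
aligned = qa_embedding(doc, generated_queries, toy_embed, alpha=0.3)
print(len(aligned))
```

With `alpha = 0` the result is the normalized mean query embedding alone, and with `alpha = 1` it falls back to the plain document embedding, so the weight trades alignment against distinguishability.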

Illustration of QAEncoder's alignment process: Solid lines represent diversified query generation, while dashed lines indicate Monte Carlo estimation. The heatmap depicts the similarity scores among the embeddings of the different queries, the document, and the mean estimation.

Architecture of QAEncoder. Left: Corpus documents are embedded using QAEncoder to obtain query-aligned representations for indexing. User queries are encoded with a vanilla encoder and used to retrieve relevant documents. Right: Internal mechanism of QAEncoder. QAEncoder addresses the document-query gap by generating a diverse set of queries for each document to create semantically aligned embeddings. Additionally, document fingerprint strategies are employed to ensure document distinguishability.
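The retrieval side described in the caption above can be sketched as a cosine-similarity search over a precomputed index. This is a hedged illustration, not the project's code: the `retrieve` function, the dictionary-based index, and the two-dimensional toy vectors are all assumptions. The index is assumed to hold query-aligned document embeddings, while the query vector is assumed to come from an unmodified encoder.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank indexed documents by similarity to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index: in a real system each value would be a query-aligned
# embedding produced offline for one corpus document.
index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.7, 0.7],
}

# Toy query vector: in a real system this comes from the vanilla encoder.
hits = retrieve([0.9, 0.1], index, top_k=2)
print([doc_id for doc_id, _ in hits])
```

Because only the document side changes, the query path and the vector store stay as they are, which is what makes the approach plug-and-play.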

Conical distribution hypothesis validation. The figure presents three visualizations supporting the conical distribution hypothesis: (a) t-SNE visualization of queries derived from various documents in the embedding space, illustrating distinct clustering behavior. (b) Angular distribution of document and query embeddings, showing the distribution of angles between $v_d = \mathcal{E}(d) - \mathbb{E}[\mathcal{E}(\mathcal{Q}(d))]$ and $v_{q_i} = \mathcal{E}(q_i) - \mathbb{E}[\mathcal{E}(\mathcal{Q}(d))]$. The angles form a bell curve centered just below 90°, which supports the claim that $v_d$ is approximately orthogonal to each $v_{q_i}$ and serves as the normal vector. (c) 3D visualization illustrating the conical distribution of the document (black point) and query (colored points) embeddings within a unit sphere. The star indicates the queries' cluster center.
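The angle measurement in panel (b) can be reproduced in a few lines. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the expectation is replaced by the sample mean of the given query embeddings, and the three-dimensional toy points are constructed so that the document sits directly above the query cluster.

```python
import math
from typing import List, Sequence


def subtract(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [x - y for x, y in zip(a, b)]


def angle_deg(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle between two non-zero vectors, in degrees."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos))


def cone_angles(
    doc_vec: Sequence[float],
    query_vecs: Sequence[Sequence[float]],
) -> List[float]:
    """Angles between v_d and each v_{q_i}, both taken relative to the query mean."""
    n = len(query_vecs)
    center = [sum(col) / n for col in zip(*query_vecs)]  # sample mean of queries
    v_d = subtract(doc_vec, center)
    return [angle_deg(v_d, subtract(q, center)) for q in query_vecs]


# Toy data: queries spread in the plane z = 0, document on the z axis.
queries = [
    [1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, -1.0, 0.0],
]
document = [0.0, 0.0, 1.0]

angles = cone_angles(document, queries)
print(angles)
```

In this constructed case every angle is a right angle. With real embeddings the angles would be spread out, and the caption above reports a bell curve just below 90°.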

Experiments

Experiments On Classical Datasets

Experiments On Latest Datasets

Ablation Studies

Training-based and Document-centric Methods

Quick Start

Set up the environment and run the demo script:

git clone https://github.com/IAAR-Shanghai/QAEncoder.git
cd QAEncoder
conda create -n QAE python=3.8
conda activate QAE
pip install -r requirements-demo.txt
python demo.py

The results should look like this:

TODO

This work is currently under review, and the code is being refactored. We plan to open-source the full project in the following order:

Release Demo
Release QAEncoder code and datasets
Release QAEncoder code compatible with LlamaIndex and LangChain
Release QAEncoder++, our future work

📖 BibTeX

@article{wang2024qaencoder,
    title={QAEncoder: Towards Aligned Representation Learning in Question Answering System}, 
    author={Wang, Zhengren and Yu, Qinhan and Wei, Shida and Li, Zhiyu and Xiong, Feiyu and Wang, Xiaoxing and Niu, Simin and Liang, Hao and Zhang, Wentao},
    journal={arXiv preprint arXiv:2409.20434},
    year={2024}
}
