SimXNS

✨Updates | 📜Citation | 🤘Furthermore | ❤️Contributing | 📚Trademarks

SimXNS is a research project on information retrieval by the MSRA NLC team. Some of its techniques are actively used in Microsoft Bing. This repo provides the official code implementations.

Currently, this repo contains several methods designed for, or related to, information retrieval. Brief descriptions of each work follow:

  • SimANS is a simple, general, and flexible method for sampling ambiguous negatives in dense text retrieval. It can be easily plugged into various dense retrieval methods such as AR2, and it is deployed in the Bing search engine, where it has proven effective (a sampling sketch follows this list).
  • MASTER is a multi-task pre-trained model that unifies multiple pre-training tasks with different learning objectives under a bottlenecked masked autoencoder architecture.
  • PROD is a distillation framework for dense retrieval that gradually improves the student through teacher progressive distillation and data progressive distillation.
  • CAPSTONE is a curriculum sampling method for dense retrieval with document expansion, designed to bridge the gap between training and inference for the dual-cross-encoder.
  • ALLIES leverages LLMs to iteratively generate new queries related to the original query, enabling an iterative reasoning process. By refining and expanding the scope of the original query, ALLIES captures and uses hidden knowledge that may not be directly obtainable through retrieval.
  • LEAD aligns the layer features of the student and the teacher, re-weighting them so that the more informative layers contribute more (a layer-alignment sketch also follows this list).
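
To make SimANS concrete: the paper samples each negative with probability peaked around the positive's relevance score, on the intuition that negatives scored far below the positive are uninformative, while those scored above it are likely false negatives. Below is a minimal NumPy sketch of that sampling rule; the function name, default hyperparameters, and example scores are illustrative and not part of this repo's API.

import numpy as np

def sample_ambiguous_negatives(pos_score, neg_scores, k, a=1.0, b=0.0, seed=None):
    """Illustrative SimANS-style sampling, not this repo's actual API.

    Candidate negative i is drawn with probability
        p_i ∝ exp(-a * (s_i - s_pos - b)^2),
    which concentrates sampling on "ambiguous" negatives whose retrieval
    scores lie close to the positive's score.
    """
    rng = np.random.default_rng(seed)
    neg_scores = np.asarray(neg_scores, dtype=np.float64)
    logits = -a * (neg_scores - pos_score - b) ** 2
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(neg_scores), size=k, replace=False, p=probs)

# Negatives scored near the positive (5.0) are sampled most often.
picked = sample_ambiguous_negatives(5.0, [4.8, 5.1, 2.0, -1.0, 4.9], k=2, seed=0)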
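
LEAD's exact informativeness-based re-weighting is defined in its paper and code; as a generic illustration of the layer-alignment idea, here is a PyTorch sketch in which per-layer weights come from a softmax over learnable scores. All names here are assumptions for illustration, and the softmax weighting is a stand-in for LEAD's actual scheme, not a reproduction of it.

import torch
import torch.nn.functional as F

def weighted_layer_distill_loss(student_feats, teacher_feats, projections, layer_logits):
    """Generic weighted layer-alignment loss (illustrative stand-in for LEAD).

    student_feats / teacher_feats: lists of [batch, seq_len, dim] hidden states,
    one entry per aligned layer pair. `projections` holds nn.Linear modules
    mapping student dims to teacher dims; `layer_logits` is a learnable vector
    whose softmax re-weights the layers, letting informative ones dominate.
    """
    weights = torch.softmax(layer_logits, dim=0)
    per_layer = torch.stack([
        F.mse_loss(proj(s), t)
        for s, t, proj in zip(student_feats, teacher_feats, projections)
    ])
    return (weights * per_layer).sum()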

Updates

  • 2023/10/29: release the official code of CAPSTONE.
  • 2023/10/18: release the official code of ALLIES.
  • 2023/07/03: upload the pretrained MASTER checkpoints for MARCO and Wikipedia to the Hugging Face model hub.
  • 2023/07/03: update the instructions for downloading resources.
  • 2023/05/29: release the official code of LEAD.
  • 2023/02/16: refine the SimANS resources by uploading the files separately and providing a file list.
  • 2023/02/02: release the official code of PROD.
  • 2022/12/16: release the official code of MASTER.
  • 2022/11/17: release the official code of SimANS.

Citation

If you extend or use this work, please cite the corresponding paper in which it was introduced:

  • SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval. Kun Zhou, Yeyun Gong, Xiao Liu, Wayne Xin Zhao, Yelong Shen, Anlei Dong, Jingwen Lu, Rangan Majumder, Ji-Rong Wen, Nan Duan, Weizhu Chen. EMNLP Industry Track 2022. Code, Paper.
  • MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers. Kun Zhou, Xiao Liu, Yeyun Gong, Wayne Xin Zhao, Daxin Jiang, Nan Duan, Ji-Rong Wen. ECML-PKDD 2023. Code, Paper.
  • PROD: Progressive Distillation for Dense Retrieval. Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu, Daxin Jiang, Rangan Majumder, Nan Duan. WWW 2023. Code, Paper.
  • CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion. Xingwei He, Yeyun Gong, A-Long Jin, Hang Zhang, Anlei Dong, Jian Jiao, Siu Ming Yiu, Nan Duan. EMNLP 2023. Code, Paper.
  • Allies: Prompting Large Language Model with Beam Search. Hao Sun, Xiao Liu, Yeyun Gong, Yan Zhang, Daxin Jiang, Linjun Yang, Nan Duan. Findings of EMNLP 2023. Code, Paper.
  • LEAD: Liberal Feature-based Distillation for Dense Retrieval. Hao Sun, Xiao Liu, Yeyun Gong, Anlei Dong, Jian Jiao, Jingwen Lu, Yan Zhang, Daxin Jiang, Linjun Yang, Rangan Majumder, Nan Duan. WSDM 2024. Code, Paper.
@inproceedings{zhou2022simans,
   title     = {{SimANS:} Simple Ambiguous Negatives Sampling for Dense Text Retrieval},
   author    = {Kun Zhou and Yeyun Gong and Xiao Liu and Wayne Xin Zhao and Yelong Shen and Anlei Dong and Jingwen Lu and Rangan Majumder and Ji-Rong Wen and Nan Duan and Weizhu Chen},
   booktitle = {{EMNLP Industry Track}},
   year      = {2022}
}
@inproceedings{zhou2023master,
   title     = {{MASTER:} Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers},
   author    = {Kun Zhou and Xiao Liu and Yeyun Gong and Wayne Xin Zhao and Daxin Jiang and Nan Duan and Ji-Rong Wen},
   booktitle = {{ECML-PKDD}},
   year      = {2023}
}
@inproceedings{lin2023prod,
   title     = {{PROD:} Progressive Distillation for Dense Retrieval},
   author    = {Zhenghao Lin and Yeyun Gong and Xiao Liu and Hang Zhang and Chen Lin and Anlei Dong and Jian Jiao and Jingwen Lu and Daxin Jiang and Rangan Majumder and Nan Duan},
   booktitle = {{WWW}},
   year      = {2023}
}
@inproceedings{he2023capstone,
   title     = {{CAPSTONE:} Curriculum Sampling for Dense Retrieval with Document Expansion},
   author    = {Xingwei He and Yeyun Gong and A-Long Jin and Hang Zhang and Anlei Dong and Jian Jiao and Siu Ming Yiu and Nan Duan},
   booktitle = {{EMNLP}},
   year      = {2023}
}
@inproceedings{sun2023allies,
   title     = {{Allies:} Prompting Large Language Model with Beam Search},
   author    = {Hao Sun and Xiao Liu and Yeyun Gong and Yan Zhang and Daxin Jiang and Linjun Yang and Nan Duan},
   booktitle = {{Findings of EMNLP}},
   year      = {2023}
}
@inproceedings{sun2024lead,
   title     = {{LEAD:} Liberal Feature-based Distillation for Dense Retrieval},
   author    = {Hao Sun and Xiao Liu and Yeyun Gong and Anlei Dong and Jian Jiao and Jingwen Lu and Yan Zhang and Daxin Jiang and Linjun Yang and Rangan Majumder and Nan Duan},
   booktitle = {{WSDM}},
   year      = {2024}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.