diff --git a/README.md b/README.md
index d500cf8..9ce4633 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,11 @@
+## ✨ Latest News
+
+- [11/06/2024]: Our paper is available on arXiv. You can access it [here](https://arxiv.org/abs/2411.02959).
+- [11/05/2024]: The open-source toolkit and models are released. You can apply HtmlRAG in your own RAG systems now.
+
 We propose HtmlRAG, which uses HTML instead of plain text as the format of external knowledge in RAG systems. To tackle the long context brought by HTML, we propose **Lossless HTML Cleaning** and **Two-Step Block-Tree-Based HTML Pruning**.
 - **Lossless HTML Cleaning**: This cleaning process just removes totally irrelevant contents and compress redundant structures, retaining all semantic information in the original HTML. The compressed HTML of lossless HTML cleaning is suitable for RAG systems that have long-context LLMs and are not willing to loss any information before generation.
@@ -24,6 +29,11 @@ We provide a simple tookit to apply HtmlRAG in your own RAG systems.
 ### 📦 Installation
+Install the package using pip:
+```bash
+pip install htmlrag
+```
+Or install the package from source:
 ```bash
 cd toolkit/
 pip install -e .
diff --git a/jupyter/module_test.ipynb b/jupyter/module_test.ipynb
index ae89076..68e1ce7 100644
--- a/jupyter/module_test.ipynb
+++ b/jupyter/module_test.ipynb
@@ -2,39 +2,14 @@
 "cells": [
 {
 "cell_type": "code",
- "execution_count": 1,
 "id": "initial_id",
 "metadata": {
+ "collapsed": true,
 "ExecuteTime": {
- "end_time": "2024-10-20T05:33:09.950503Z",
- "start_time": "2024-10-20T05:32:33.270934Z"
- },
- "collapsed": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/data_train/search/InternData/jiejuntan/anaconda3/envs/py39/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
- " from .autonotebook import tqdm as notebook_tqdm\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.\n",
- "Some other text\n",
- "Some other text\n",
- "Some other text\n",
 "# \n",
 "#