Separators: Implement html chunking strategy. #23

Merged
merged 1 commit into from
May 14, 2024
17 changes: 9 additions & 8 deletions lib/text_chunker.ex
@@ -21,20 +21,21 @@ defmodule TextChunker do
   @supported_formats [
     :doc,
     :docx,
+    :elixir,
     :epub,
+    :html,
+    :javascript,
     :latex,
+    :markdown,
     :odt,
     :pdf,
-    :rtf,
-    :markdown,
-    :plaintext,
-    :elixir,
-    :ruby,
     :php,
+    :plaintext,
     :python,
-    :vue,
-    :javascript,
-    :typescript
+    :rtf,
+    :ruby,
+    :typescript,
+    :vue
Collaborator commented:

Thank you for organising these!

   ]

   @opts_schema [
76 changes: 40 additions & 36 deletions lib/text_chunker/strategies/recursive_chunk/separators.ex
@@ -34,20 +34,12 @@ defmodule TextChunker.Strategies.RecursiveChunk.Separators do
       "```\n\n",
       "\n\n___\n\n",
       "\n\n---\n\n",
-      "\n\n***\n\n",
-      "\n\n",
-      "\n",
-      " "
-    ]
+      "\n\n***\n\n"
+    ] ++ empty_and_new_line_separators()
   end

   def get_separators(:plaintext) do
-    [
-      "\n\n",
-      "\n",
-      " ",
-      ""
-    ]
+    empty_and_new_line_separators() ++ [""]
   end

   def get_separators(:elixir) do
@@ -68,11 +60,8 @@ defmodule TextChunker.Strategies.RecursiveChunk.Separators do
       " with ",
       " cond ",
       " case ",
-      " if ",
-      "\n\n",
-      "\n",
-      " "
-    ]
+      " if "
+    ] ++ empty_and_new_line_separators()
   end

   def get_separators(:ruby) do
@@ -90,11 +79,8 @@ defmodule TextChunker.Strategies.RecursiveChunk.Separators do
       " for ",
       " do ",
       " begin ",
-      " rescue ",
-      "\n\n",
-      "\n",
-      " "
-    ]
+      " rescue "
+    ] ++ empty_and_new_line_separators()
   end

   def get_separators(:php) do
@@ -113,22 +99,16 @@ defmodule TextChunker.Strategies.RecursiveChunk.Separators do
       " while ",
       " do ",
       " switch ",
-      " case ",
-      "\n\n",
-      "\n",
-      " "
-    ]
+      " case "
+    ] ++ empty_and_new_line_separators()
   end

   def get_separators(:python) do
     [
       "\nclass ",
       "\ndef ",
-      "\n\tdef ",
-      "\n\n",
-      "\n",
-      " "
-    ]
+      "\n\tdef "
+    ] ++ empty_and_new_line_separators()
   end

   def get_separators(:vue) do
@@ -157,14 +137,38 @@ defmodule TextChunker.Strategies.RecursiveChunk.Separators do
       " while ",
       " switch ",
       " case ",
-      " default ",
-      "\n\n",
-      "\n",
-      " "
-    ]
+      " default "
+    ] ++ empty_and_new_line_separators()
   end

+  def get_separators(:html) do
+    [
+      "<h1",
+      "<h2",
+      "<h3",
+      "<h4",
+      "<h5",
+      "<h6",
+      "<p",
+      "<ul",
+      "<ol",
+      "<li",
+      "<article",
+      "<section",
+      "<table"
+    ] ++
Collaborator commented:

I'm curious, why is it splitting on <h1 rather than <h1>?

Contributor Author commented:

I was just being lazy and following the other examples (vue).

Collaborator commented:

It just struck me while driving that this will also split on any HTML tags that have attrs! So it makes way more sense than adding in the closing >.

+      empty_and_new_line_separators()
+  end
+
   def get_separators(:typescript), do: get_separators(:javascript)

   def get_separators(format) when format in @plaintext_formats, do: get_separators(:plaintext)
+
+  defp empty_and_new_line_separators do
Collaborator commented:

nice

+    [
+      "\n\n",
+      "\n",
+      " "
+    ]
+  end
 end
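
The review discussion on `<h1` versus `<h1>` can be checked with a small sketch. It only counts plain substring matches and says nothing about how the library applies its separators internally. `PrefixSplitSketch`, `count/2`, and the sample markup are hypothetical and not part of this PR.

```elixir
defmodule PrefixSplitSketch do
  # Counts plain-substring occurrences of `separator` in `text`.
  def count(text, separator) do
    length(String.split(text, separator)) - 1
  end
end

# Hypothetical sample markup, not taken from the PR's fixture.
html = ~s(<h1 class="title">Intro</h1><p>Body</p><h1>Next</h1><pre>code</pre>)

# "<h1>" only matches the heading without attributes.
IO.inspect(PrefixSplitSketch.count(html, "<h1>"))
# prints 1

# "<h1" matches both headings, with or without attributes.
IO.inspect(PrefixSplitSketch.count(html, "<h1"))
# prints 2

# The same prefix logic means "<p" also matches "<pre".
IO.inspect(PrefixSplitSketch.count(html, "<p"))
# prints 2
```

The last call shows the trade-off of prefix separators: they tolerate attributes, but a short prefix such as `<p` or `<li` can also match longer tag names that start the same way.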
107 changes: 106 additions & 1 deletion test/recursive_chunk_test.exs
@@ -356,6 +356,111 @@ defmodule TextChunkerTest do
end
end

describe "chunker with HTML separators" do
test "splits an HTML file" do
opts = [
chunk_size: 100,
chunk_overlap: 20,
format: :html
]

{:ok, text} = File.read("test/support/fixtures/document_fixtures/test_file.html")
result = text |> TextChunker.split(opts) |> TestHelpers.extract_text_from_chunks()

expected_result = [
"<h1>Elixir: A Powerful Language for Building Scalable Applications</h1>\n",
"<p>Elixir is a dynamic, functional programming language designed for building scalable and",
" scalable and maintainable applications. It runs on the Erlang Virtual Machine (VM), which is known",
" which is known for its robust concurrency and fault-tolerance capabilities. In this article, we'll",
" this article, we'll explore the key features of Elixir and discuss different chunking approaches",
" chunking approaches for Retrieval Augmented Generation (RAG).</p>",
"\n\n",
"<h2>Key Features of Elixir</h2>\n",
"<ul>\n ",
"<li>Functional Programming: Elixir is built on the principles of functional programming, which",
" programming, which emphasizes immutability, higher-order functions, and recursive algorithms.</li>",
"\n ",
"<li>Concurrency and Scalability: Elixir leverages the power of the Erlang VM to provide lightweight",
" provide lightweight processes and efficient message passing, enabling massive concurrency and",
" concurrency and scalability.</li>",
"\n ",
"<li>Fault-Tolerance: With its actor-based concurrency model and support for supervisors, Elixir",
" supervisors, Elixir allows you to build fault-tolerant systems that can handle failures",
" can handle failures gracefully.</li>",
"\n ",
"<li>Metaprogramming: Elixir provides powerful metaprogramming capabilities through macros, allowing",
" macros, allowing you to extend the language and write expressive and reusable code.</li>",
"\n</ul>",
"\n\n",
"<article>\n ",
"<h3>Chunking Approaches for Retrieval Augmented Generation</h3>\n ",
"<p>Retrieval Augmented Generation (RAG) is a technique that combines information retrieval with",
" retrieval with language generation to generate high-quality and informative text. Chunking, the",
" text. Chunking, the process of breaking down text into smaller units, plays a crucial role in RAG.",
" role in RAG. Let's explore different chunking approaches commonly used in Elixir:</p>",
"\n ",
"<ol>\n ",
"<li>Sentence-based Chunking: This approach splits the text into individual sentences using",
" sentences using punctuation markers such as periods, question marks, and exclamation points. Each",
" points. Each sentence becomes a separate chunk, allowing for fine-grained retrieval and",
" retrieval and generation.</li>",
"\n ",
"<li>Paragraph-based Chunking: With this approach, the text is divided into paragraphs based on the",
" based on the presence of newline characters or specific paragraph delimiters. Paragraphs provide a",
" provide a coherent and self-contained unit of information suitable for RAG.</li>",
"\n ",
"<li>Semantic Chunking: Semantic chunking involves analyzing the text and identifying meaningful",
" meaningful semantic units or phrases. This can be achieved using techniques like named entity",
" like named entity recognition, noun phrase extraction, or dependency parsing. Semantic chunks",
" Semantic chunks capture the core concepts and ideas within the text.</li>",
"\n ",
"<li>Custom Chunking: Elixir provides the flexibility to define custom chunking rules based on",
" rules based on specific requirements. For example, you can chunk text based on a certain number of",
" a certain number of words, specific delimiters, or regular expressions that match particular",
" match particular patterns.</li>",
"\n </ol>\n ",
"<p>The choice of chunking approach depends on the nature of the text and the desired granularity of",
" granularity of retrieval and generation. Elixir's powerful string manipulation and pattern matching",
" pattern matching capabilities make it easy to implement various chunking strategies",
" chunking strategies efficiently.</p>",
"\n</article>",
"\n\n",
"<section>\n ",
"<h3>Benefits of Elixir/Erlang</h3>\n ",
"<h4>Elixir and Erlang offer several advantages over other language stacks when it comes to building",
" comes to building scalable and fault-tolerant systems. Let's take a look at some of the key",
" at some of the key benefits:</h4>",
"\n ",
"<table>\n <thead>\n <tr>\n <th>Benefit</th>\n <th>Description</th>\n </tr>",
"\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>Concurrency and Scalability</td>",
"\n <td>Built-in support for lightweight processes and efficient message passing.</td>",
"\n </tr>\n <tr>\n <td>Fault-Tolerance</td>",
"\n <td>\"Let it crash\" philosophy and automatic recovery from failures.</td>\n </tr>",
"\n </tr>\n <tr>\n <td>Hot Code Swapping</td>",
"\n <td>Update code without stopping the system, providing zero downtime.</td>\n </tr>",
"\n </tr>\n <tr>\n <td>Distribution and Clustering</td>",
"\n <td>Built-in support for distributed systems and easy horizontal scaling.</td>\n </tr>",
"\n </tr>\n <tr>\n <td>Ecosystem and Libraries</td>",
"\n <td>Growing ecosystem with a wide range of libraries and supportive community.</td>",
"\n </tr>\n </tbody>\n </table>\n ",
"<p>These benefits make Elixir and Erlang a compelling choice for building scalable, fault-tolerant,",
" fault-tolerant, and maintainable systems, especially in domains like web development, real-time",
" real-time applications, and distributed systems.</p>",
"\n</section>",
"\n\n",
"<h5>Conclusion</h5>\n",
"<p>Elixir's combination of functional programming, concurrency, and fault-tolerance makes it a",
" makes it a powerful language for building scalable and maintainable RAG systems. By leveraging the",
" By leveraging the appropriate chunking approaches and utilizing Elixir's strengths, you can create",
" you can create efficient and high-quality retrieval augmented generation solutions.</p>",
"\n\n",
"<h6>© Elixir RAG. All rights reserved.</h6>"
]

assert result == expected_result
end
end

describe "rejects unsupported options" do
test "rejects a chunk_overlap of -1" do
opts = [
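
In the expected chunks of the HTML test, consecutive chunks repeat a few words, which appears to be the effect of `chunk_overlap: 20`. A minimal sketch for inspecting that overlap between two adjacent chunks follows. `OverlapSketch` and `shared/2` are hypothetical helper names and are not part of TextChunker; the two strings are copied from the expected result in the test.

```elixir
defmodule OverlapSketch do
  # Returns the longest suffix of `left` that is also a prefix of `right`.
  def shared(left, right) do
    max = min(String.length(left), String.length(right))

    # Try the longest candidate first; n = 0 yields "" and always matches.
    Enum.find_value(max..0//-1, "", fn n ->
      candidate = String.slice(right, 0, n)
      if String.ends_with?(left, candidate), do: candidate
    end)
  end
end

left = "<p>Elixir is a dynamic, functional programming language designed for building scalable and"
right = " scalable and maintainable applications. It runs on the Erlang Virtual Machine (VM), which is known"

overlap = OverlapSketch.shared(left, right)
IO.inspect(overlap)
# prints " scalable and"
IO.inspect(String.length(overlap))
# prints 13
```

The shared text is 13 characters long, which stays within the configured overlap of 20.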
@@ -384,7 +489,7 @@ defmodule TextChunkerTest do

       assert result == {
                :error,
-               "invalid value for :format option: expected one of [:doc, :docx, :epub, :latex, :odt, :pdf, :rtf, :markdown, :plaintext, :elixir, :ruby, :php, :python, :vue, :javascript, :typescript], got: :made_up_format"
+               "invalid value for :format option: expected one of [:doc, :docx, :elixir, :epub, :html, :javascript, :latex, :markdown, :odt, :pdf, :php, :plaintext, :python, :rtf, :ruby, :typescript, :vue], got: :made_up_format"
              }
     end
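
The error message asserted above lists the formats in the same order as `@supported_formats`, which is why reordering the list changes the expected string. The library validates options through its `@opts_schema`; the sketch below is not that code, only a self-contained illustration of a membership check that yields the same message shape. `FormatCheckSketch` and `validate/1` are hypothetical names.

```elixir
defmodule FormatCheckSketch do
  # Same atoms, in the same order, as the updated list in this PR.
  @supported_formats [
    :doc, :docx, :elixir, :epub, :html, :javascript, :latex, :markdown, :odt,
    :pdf, :php, :plaintext, :python, :rtf, :ruby, :typescript, :vue
  ]

  # Accepts any format found in the list.
  def validate(format) when format in @supported_formats, do: {:ok, format}

  # Rejects everything else with a message that echoes the list order.
  def validate(format) do
    {:error,
     "invalid value for :format option: expected one of " <>
       "#{inspect(@supported_formats)}, got: #{inspect(format)}"}
  end
end

IO.inspect(FormatCheckSketch.validate(:html))
# prints {:ok, :html}
IO.inspect(FormatCheckSketch.validate(:made_up_format))
```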

63 changes: 63 additions & 0 deletions test/support/fixtures/document_fixtures/test_file.html
@@ -0,0 +1,63 @@
<h1>Elixir: A Powerful Language for Building Scalable Applications</h1>
<p>Elixir is a dynamic, functional programming language designed for building scalable and maintainable applications. It runs on the Erlang Virtual Machine (VM), which is known for its robust concurrency and fault-tolerance capabilities. In this article, we'll explore the key features of Elixir and discuss different chunking approaches for Retrieval Augmented Generation (RAG).</p>

<h2>Key Features of Elixir</h2>
<ul>
<li>Functional Programming: Elixir is built on the principles of functional programming, which emphasizes immutability, higher-order functions, and recursive algorithms.</li>
<li>Concurrency and Scalability: Elixir leverages the power of the Erlang VM to provide lightweight processes and efficient message passing, enabling massive concurrency and scalability.</li>
<li>Fault-Tolerance: With its actor-based concurrency model and support for supervisors, Elixir allows you to build fault-tolerant systems that can handle failures gracefully.</li>
<li>Metaprogramming: Elixir provides powerful metaprogramming capabilities through macros, allowing you to extend the language and write expressive and reusable code.</li>
</ul>

<article>
<h3>Chunking Approaches for Retrieval Augmented Generation</h3>
<p>Retrieval Augmented Generation (RAG) is a technique that combines information retrieval with language generation to generate high-quality and informative text. Chunking, the process of breaking down text into smaller units, plays a crucial role in RAG. Let's explore different chunking approaches commonly used in Elixir:</p>
<ol>
<li>Sentence-based Chunking: This approach splits the text into individual sentences using punctuation markers such as periods, question marks, and exclamation points. Each sentence becomes a separate chunk, allowing for fine-grained retrieval and generation.</li>
<li>Paragraph-based Chunking: With this approach, the text is divided into paragraphs based on the presence of newline characters or specific paragraph delimiters. Paragraphs provide a coherent and self-contained unit of information suitable for RAG.</li>
<li>Semantic Chunking: Semantic chunking involves analyzing the text and identifying meaningful semantic units or phrases. This can be achieved using techniques like named entity recognition, noun phrase extraction, or dependency parsing. Semantic chunks capture the core concepts and ideas within the text.</li>
<li>Custom Chunking: Elixir provides the flexibility to define custom chunking rules based on specific requirements. For example, you can chunk text based on a certain number of words, specific delimiters, or regular expressions that match particular patterns.</li>
</ol>
<p>The choice of chunking approach depends on the nature of the text and the desired granularity of retrieval and generation. Elixir's powerful string manipulation and pattern matching capabilities make it easy to implement various chunking strategies efficiently.</p>
</article>

<section>
<h3>Benefits of Elixir/Erlang</h3>
<h4>Elixir and Erlang offer several advantages over other language stacks when it comes to building scalable and fault-tolerant systems. Let's take a look at some of the key benefits:</h4>
<table>
<thead>
<tr>
<th>Benefit</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Concurrency and Scalability</td>
<td>Built-in support for lightweight processes and efficient message passing.</td>
</tr>
<tr>
<td>Fault-Tolerance</td>
<td>"Let it crash" philosophy and automatic recovery from failures.</td>
</tr>
<tr>
<td>Hot Code Swapping</td>
<td>Update code without stopping the system, providing zero downtime.</td>
</tr>
<tr>
<td>Distribution and Clustering</td>
<td>Built-in support for distributed systems and easy horizontal scaling.</td>
</tr>
<tr>
<td>Ecosystem and Libraries</td>
<td>Growing ecosystem with a wide range of libraries and supportive community.</td>
</tr>
</tbody>
</table>
<p>These benefits make Elixir and Erlang a compelling choice for building scalable, fault-tolerant, and maintainable systems, especially in domains like web development, real-time applications, and distributed systems.</p>
</section>

<h5>Conclusion</h5>
<p>Elixir's combination of functional programming, concurrency, and fault-tolerance makes it a powerful language for building scalable and maintainable RAG systems. By leveraging the appropriate chunking approaches and utilizing Elixir's strengths, you can create efficient and high-quality retrieval augmented generation solutions.</p>

<h6>© Elixir RAG. All rights reserved.</h6>
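
The fixture's table is a useful case: `<table` is in the `:html` separator list, but `<thead`, `<tr`, and `<td` are not, and the expected chunks for the table rows begin with `\n`. A rough sketch of why, assuming the recursive strategy tries separators in list order (as recursive splitters commonly do; this is an assumption, not something the diff states). `SeparatorPickSketch` and `first_match/1` are hypothetical names.

```elixir
defmodule SeparatorPickSketch do
  # Mirrors the :html list from this PR: tag prefixes first, then whitespace fallbacks.
  @html_separators [
                     "<h1", "<h2", "<h3", "<h4", "<h5", "<h6",
                     "<p", "<ul", "<ol", "<li",
                     "<article", "<section", "<table"
                   ] ++ ["\n\n", "\n", " "]

  # Returns the first separator, in list order, that occurs in `text` (nil if none).
  def first_match(text) do
    Enum.find(@html_separators, &String.contains?(text, &1))
  end
end

# A heading inside a section: "<h3" comes earlier in the list than "<section".
IO.inspect(SeparatorPickSketch.first_match("<section>\n<h3>Benefits</h3>"))
# prints "<h3"

# A table row contains no listed tag prefix, so the newline fallback applies.
IO.inspect(SeparatorPickSketch.first_match("<tr>\n<td>Fault-Tolerance</td>\n</tr>"))
# prints "\n"
```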