Fuzzy Parser

Currently just a fuzzy regex-inspired parser at the moment, but hopefully the fuzzy family will grow.

This project is still in it's infancy, and in particular, I expect to make big changes to the pattern language and the cost model used to find the best match.

For example: the current cost model equally penalizes skipping pattern characters vs. skipping text characters, so if you have a large text and a relatively small pattern, then the cost of skipping the large text is much bigger than the cost of skipping the pattern and it makes complete sense to cram nearly all the text into whatever ".*" parts exist in the pattern, regardless of whether some of that text matched other parts of the pattern.

The main algorithm appears to be solid enough to work on most inputs you might want to give it (albeit memory hungry), and given enough memory to create the required tables can process a 62,000 word book in a matter of seconds. Presumably there is a lot of performance still on the table.

We haven't yet created a proper executable, but will do so soon. As we do that, we'll test the program on real world use-cases and tweak the algorithm until it appears useful.

To do

Improve practical use of tool based on real world experience:
- Extend pattern to enable us to give more restricted forms of "variable text"
- Create a more powerful yaml based pattern representation to enable named matches and other non-regex tools
- Make Match more detailed and create a yaml based representation for it
Get benchmarks up and running and then start investigating whether we can make code more optimised and safer. Current approach is both overly simplistic but also more skeptical of creating new objects than standard Scala code.
Think about how to reduce memory usage. How many cells in early rows in table will never be visited by any of the optimal traversals we are tracking from later rows?
Can we support generic monadic parsing rather than just regex inspired pattern? Will require both effiency and a more dynamic replacement for tables
Support for optional reporting of lots of intermediate state in yaml output, and visualisation mechanisms for that output. Can we stream output?
Think about parallelisation. If we had no kleene stars, we could potentially process all elements in pattern in parallel, as long as we process from (end,end) to (start,start) in a "diagonal-breadth-first" pattern. This would require elements on the diagonal to be computed in lock-step and elements at step N to be available to computations at step N+1. Not sure how badly the precence of kleene stars in pattern interferes with this?

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
api-testkit/src/main/scala/fuzzy/api/testkit		api-testkit/src/main/scala/fuzzy/api/testkit
api/src/main/scala/fuzzy/api		api/src/main/scala/fuzzy/api
cli/src/main/scala/fuzzy/cli		cli/src/main/scala/fuzzy/cli
examples		examples
matcher-loop/src		matcher-loop/src
matcher-simple/src		matcher-simple/src
matcher-testkit/src/main/scala/fuzzy/matcher/testkit		matcher-testkit/src/main/scala/fuzzy/matcher/testkit
pattern-regex/src		pattern-regex/src
project		project
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fuzzy Parser

To do

About

Releases

Packages

Languages

License

SamRoberts/fuzzy

Folders and files

Latest commit

History

Repository files navigation

Fuzzy Parser

To do

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages