yenta

A fast, fuzzy, flexible command-line matchmaker for textual data

Overview

yenta matches names across two data files. It has the following features:

Intelligent: Matching is based on the rareness of words, so there is no need to preprocess names to remove common, non-informative words (e.g., and, the, company). Just feed your data into the program and get results.
Robust: yenta incorporates features that are commonly needed in name matching. It is insensitive to both word order and case (Shawn Spencer matches SPENCER, SHAWN). yenta removes punctuation by default.
Unicode aware: By default, yenta automatically converts accented Unicode characters to their ASCII equivalents.
Customizable: Users may optionally allow for misspellings, implement phonetic algorithms, trim the constituent words of a name at a prespecified number of characters, output any number of potential matches (with and without ties), and combine any of the preceding customizations.
High performance: yenta is a multi-core program written in Rust, a blazingly fast and memory-efficient language.
Subset matching: yenta supports finding a match within a subset using the --group-match option.
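
To make the idea of rareness-based, order- and case-insensitive matching concrete, here is a minimal sketch. It is not yenta's actual implementation: the function names (tokens, idf_weights, score) and the inverse-document-frequency weighting are assumptions chosen for illustration, and the sketch does not convert accented characters to ASCII.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercase a name, treat punctuation as whitespace, and return its set of words.
/// Using a set makes the comparison insensitive to word order.
/// (Hypothetical helper, not part of yenta.)
fn tokens(name: &str) -> HashSet<String> {
    let lowered = name.to_lowercase();
    let cleaned: String = lowered
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();
    cleaned.split_whitespace().map(str::to_string).collect()
}

/// Assumed weighting: ln(N / number of names containing the word).
/// Words that occur in few names receive larger weights.
fn idf_weights(names: &[&str]) -> HashMap<String, f64> {
    let mut doc_freq: HashMap<String, usize> = HashMap::new();
    for name in names {
        for word in tokens(name) {
            *doc_freq.entry(word).or_insert(0) += 1;
        }
    }
    let n = names.len() as f64;
    doc_freq
        .into_iter()
        .map(|(word, df)| (word, (n / df as f64).ln()))
        .collect()
}

/// Score two names by summing the weights of the words they share.
fn score(a: &str, b: &str, weights: &HashMap<String, f64>) -> f64 {
    let (ta, tb) = (tokens(a), tokens(b));
    ta.intersection(&tb)
        .map(|w| weights.get(w).copied().unwrap_or(0.0))
        .sum()
}

fn main() {
    // Illustrative candidate names only.
    let candidates = ["SPENCER, SHAWN", "The Acme Company", "The Widget Company"];
    let weights = idf_weights(&candidates);
    for candidate in candidates {
        println!(
            "Acme Co vs {:?}: {:.3}",
            candidate,
            score("Acme Co", candidate, &weights)
        );
    }
}
```

In this sketch, sharing the rare word acme counts for more than sharing the common words the and company, which is why common words do not need to be stripped beforehand.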

Installation

Install Rust
Clone this repository
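At the command line, change to the root of the cloned repository and then type: cargo install --path=. (the trailing dot is the path argument and refers to the current directory). Depending on your version of Cargo, the command may instead be cargo install --release.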

Quick Start

Save your data files in CSV format. You will match names from one file to potential matches in a second file. Assume that the first file is called from_names.csv and the second file is called to_names.csv. yenta requires that each of your CSV files has a column called name, in lower case. This column will be used by the fuzzy matcher. You may also have an optional column called id, which, if used, simply serves as a reference identifier that is echoed to the output.

On the command line, cd into the directory with your files. To create an output file called matches.csv use the following command:

yenta from_names.csv to_names.csv --output-file=matches.csv
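
Because yenta requires a column called name in lower case, it can help to check a file's header line before running the command above. The following is a minimal sketch of such a check. The helper has_name_column is hypothetical and not part of yenta, and it assumes a simple comma-separated header with no quoted fields that contain commas.

```rust
/// Returns true if the header line has a column named exactly `name` (lower case).
/// Hypothetical pre-flight helper; assumes no quoted fields containing commas.
fn has_name_column(header: &str) -> bool {
    header
        .split(',')
        // Drop surrounding whitespace (including a trailing newline) and quotes.
        .map(|col| col.trim().trim_matches('"'))
        .any(|col| col == "name")
}

fn main() {
    // Illustrative header lines only.
    for header in ["id,name", "ID,Name", "name"] {
        println!("{:?} -> {}", header, has_name_column(header));
    }
}
```

To use it on real files, pass in the first line of from_names.csv and to_names.csv, for example as read with the standard library's BufRead::lines.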

Recipes

Information

See the wiki for information on installation, usage, and best practices. It also includes some examples for matching problems that commonly arise in research.

Contributing

Submit a pull request and I will respond.

If yenta has in any way made your life easier, please send me an email or star this repository. If you would like to see a feature added, let me know through the GitHub forum.

