String Comparisons

This library provides functions for measuring how similar two pieces of text are, using well-established similarity and distance metrics. It currently supports the following algorithms:

  • Cosine Similarity
  • Jaccard Similarity
  • Jaro Similarity
  • Jaro-Winkler Similarity
  • Damerau-Levenshtein Distance
  • Hamming Distance
  • Levenshtein Distance
  • Smith-Waterman Alignment
  • Sørensen-Dice Coefficient
  • Jaccard Similarity based on Trigrams
  • Szymkiewicz Simpson Overlap
  • N-Gram
  • Q-Gram
  • Optimal String Alignment

Installation

Assuming you have Node.js and npm/yarn/pnpm installed, install the library using:

# Install the 'string-comparisons' package using npm
npm install string-comparisons

# Alternatively, install the 'string-comparisons' package using yarn
yarn add string-comparisons

# Or, install the 'string-comparisons' package using pnpm
pnpm add string-comparisons

String Similarity Algorithm Comparison

| Algorithm | Normalized | Metric | Similarity | Distance | Space Complexity |
|---|---|---|---|---|---|
| cosine.js | Yes | Vector Space Model | | | O(n) |
| jaro.js | No | Edit Distance | | | O(min(n, m)) |
| jaccard.js | No | Set Theory | | | O(min(n, m)) |
| damerauLevenshtein.js | No | Edit Distance | | | O(max(n, m)²) |
| hammingDistance.js | No | Bitwise Operations | | | O(1) |
| jaroWinkler.js | No | Edit Distance | | | O(min(n, m)) |
| levenshtein.js | No | Edit Distance | | | O(max(n, m)²) |
| smithWaterman.js | No | Dynamic Programming (Local Alignment) | | | O(n * m) |
| sorensenDice.js | No | Set Theory | | | O(min(n, m)) |
| trigram.js | No | N-gram Overlap | | | O(n²) |
| szymkiewiczSimpsonOverlap.js | Yes | Overlap Coefficient | | | O(min(m, n)) |
| nGram.js | Yes | Jaccard Similarity Coefficient | | | O(m * n) |
| qGram.js | Yes | Jaccard Similarity Coefficient | | | O(n + m) |
| optimalStringAlignment.js | No | Edit Distance | | | O(max(n, m)²) |

Explanation of Columns:

  • Normalized: Indicates whether the algorithm produces a score between 0 and 1 (normalized).
  • Metric: The underlying mathematical concept used for comparison.
  • Similarity: Whether the algorithm outputs a higher score for more similar strings.
  • Distance: Whether the algorithm outputs a lower score for more similar strings. Similarity and distance scores run in opposite directions, so check which one an algorithm returns before comparing results.
  • Space Complexity: The amount of extra memory the algorithm needs to run the comparison.
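
To make the Space Complexity column concrete, here is a minimal sketch of a Hamming distance that needs only one counter of extra memory, which is what O(1) means in the table. The function name `hammingDistance` and its error handling are assumptions made for illustration; this is not the library's implementation.

```javascript
// Illustrative sketch only; not the code shipped in hammingDistance.js.
// Counts the positions at which two equal-length strings differ.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('Hamming distance requires strings of equal length');
  }
  let mismatches = 0; // the only extra state, whatever the input size
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) mismatches++;
  }
  return mismatches;
}

console.log(hammingDistance('karolin', 'kathrin')); // 3
```

Because the result is a count of differences, it is a distance: identical strings score 0.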

Notes:

  • ✓ indicates the algorithm applies to that category.
  • Some algorithms can be used for both similarity and distance calculations depending on the interpretation of the score.
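
As a rough illustration of how one score can be read either way, the sketch below computes a Levenshtein distance and then rescales it into a similarity between 0 and 1 by dividing by the longer string's length. Both helper names (`levenshteinDistance`, `normalizedSimilarity`) and the choice of normalization are assumptions for this example, not the library's API.

```javascript
// Illustrative sketch only; not the library's implementation.
// Classic dynamic-programming edit distance, keeping two rows at a time.
function levenshteinDistance(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // deleting i characters turns a's prefix into ''
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// One possible normalization (an assumption): 1 means identical, 0 means nothing in common.
function normalizedSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings are identical
  return 1 - levenshteinDistance(a, b) / longest;
}

console.log(levenshteinDistance('programming', 'programmer')); // 3
console.log(normalizedSimilarity('programming', 'programmer').toFixed(3)); // 0.727
```

The distance (3) gets smaller as strings get closer, while the derived similarity (about 0.727) gets larger.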

Example Usage

import StringComparisons from 'string-comparisons';

const { Cosine, Jaccard, Jaro, DamerauLevenshtein, HammingDistance, JaroWrinker, Levenshtein, SmithWaterman, SorensenDice, Trigram } = StringComparisons;

const string1 = 'programming';
const string2 = 'programmer';


console.log('Jaro-Winkler similarity:', JaroWrinker.similarity(string1, string2)); // Output: ~0.9054545454545454
console.log('Levenshtein distance:', Levenshtein.similarity(string1, string2)); // Output: 3 (an edit distance: lower means more similar)
console.log('Smith-Waterman similarity:', SmithWaterman.similarity(string1, string2)); // Output: 16

const set1 = new Set([1, 2, 3]);
const set2 = new Set([2, 3, 4]);

console.log('Sørensen-Dice similarity:', SorensenDice.similarity(set1, set2)); // Output: ~0.667

const trigram1 = 'hello';
const trigram2 = 'world';

console.log('Trigram Jaccard similarity:', Trigram.similarity(trigram1, trigram2)); // Output: 0 (no shared trigrams)

// ...and so on for the remaining algorithms
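
To show where the set-based numbers in the example come from, here is a self-contained sketch that recomputes the trigram Jaccard and Sørensen-Dice results by hand. The helpers `trigrams`, `intersectionSize`, `jaccard`, and `sorensenDice` are hypothetical names written for this illustration; the library may tokenize or handle edge cases differently.

```javascript
// Illustrative sketch only; these helpers are not part of string-comparisons.

// Collect every run of three consecutive characters into a set.
function trigrams(text) {
  const grams = new Set();
  for (let i = 0; i + 3 <= text.length; i++) {
    grams.add(text.slice(i, i + 3));
  }
  return grams;
}

// Number of elements present in both sets.
function intersectionSize(a, b) {
  let count = 0;
  for (const item of a) {
    if (b.has(item)) count++;
  }
  return count;
}

// Jaccard: |A ∩ B| / |A ∪ B|
function jaccard(a, b) {
  const shared = intersectionSize(a, b);
  const unionSize = a.size + b.size - shared;
  return unionSize === 0 ? 1 : shared / unionSize;
}

// Sørensen-Dice: 2 * |A ∩ B| / (|A| + |B|)
function sorensenDice(a, b) {
  const total = a.size + b.size;
  return total === 0 ? 1 : (2 * intersectionSize(a, b)) / total;
}

// 'hello' -> {hel, ell, llo}, 'world' -> {wor, orl, rld}: nothing in common.
console.log(jaccard(trigrams('hello'), trigrams('world'))); // 0

// {1, 2, 3} and {2, 3, 4} share two elements: 2 * 2 / (3 + 3).
console.log(sorensenDice(new Set([1, 2, 3]), new Set([2, 3, 4])).toFixed(3)); // 0.667
```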

Resources