
remove allocations from jaro #6050

Merged · 7 commits · Jan 11, 2019

Conversation

@AviAvni (Contributor) commented Dec 21, 2018

I removed the allocation of the ResizeArray.
I also removed the string concat from FilterPredictions.
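
For readers following along, a toy sketch of the pattern (not this PR's actual diff): count matches with a mutable local instead of collecting them in a ResizeArray, so the hot path performs no per-call allocations.

// Before: allocates (and grows) a ResizeArray on every comparison.
let countMatchesOld (s1: string) (s2: string) =
    let matches = ResizeArray<char>()
    for c in s1 do
        if s2.IndexOf c >= 0 then matches.Add c
    matches.Count

// After: same result, zero allocations.
let countMatchesNew (s1: string) (s2: string) =
    let mutable count = 0
    for c in s1 do
        if s2.IndexOf c >= 0 then count <- count + 1
    count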

@forki (Contributor) commented Dec 21, 2018

awesome.

here is another version that you might try:

// The Winkler modification will not be applied unless the
// percent match was at or above the weightThreshold percent
// without the modification.
// Winkler's paper used a default value of 0.7
let private weightThreshold = 0.7

// Size of the prefix to be considered by the Winkler modification.
// Winkler's paper used a default value of 4
let private prefixLength = 4

/// Calculates the Jaro-Winkler edit distance between two strings.
/// The edit distance is a metric that measures the amount of similarity between two strings.
let JaroWinklerDistance (s1:string) (s2:string) = 
    let length1 = s1.Length
    let length2 = s2.Length
    if length1 = 0 then 
        if length2 = 0 then 1.0 else 0.0
    else
        let searchRange = Math.Max(0,Math.Max(length1,length2) / 2 - 1)

        // default initialized to false
        let matched1 = Array.zeroCreate<bool> length1
        let matched2 = Array.zeroCreate<bool> length2

        let mutable commonChars = 0
        for i in 0..length1-1 do
            let lStart = Math.Max(0,i-searchRange)
            let lEnd = Math.Min(i+searchRange+1,length2)
            let mutable j = lStart
            let mutable cont = true
            while cont && j < lEnd do
                if (not matched2.[j]) && s1.[i] = s2.[j] then
                    matched1.[i] <- true
                    matched2.[j] <- true
                    commonChars <- commonChars + 1
                    cont <- false
              
        if commonChars = 0 then 0.0 else

        let mutable numberOfHalfTransposed = 0
        let mutable k = 0
        for i in 0..length1-1 do
            if matched1.[i] then
                while not matched2.[k] do 
                    k <- k + 1
                if s1.[i] <> s2.[k] then
                    numberOfHalfTransposed <- numberOfHalfTransposed + 1
                k <- k + 1

        let numberOfTransposed = float numberOfHalfTransposed / 2.0

        let commonChars = float commonChars
        let weight = (commonChars / float length1
                            + commonChars / float length2
                            + (commonChars - numberOfTransposed) / commonChars) / 3.0

        if weight <= weightThreshold then weight else
        let lMax = Math.Min(prefixLength, Math.Min(length1, length2))
        let mutable pos = 0
        while pos < lMax && s1.[pos] = s2.[pos] do
            pos <- pos + 1
        if pos = 0 then weight else weight + 0.1 * float pos * (1.0 - weight)

@AviAvni (Contributor, Author) commented Dec 21, 2018

@forki your suggestion starts by allocating two arrays.

@AviAvni (Contributor, Author) commented Dec 21, 2018

@forki I tried your code, but the loop while cont && j < lEnd do never terminates: j is never incremented when no character matches.
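
For reference, a minimal sketch of the likely fix, assuming the scan was meant to advance j on every pass:

            let mutable j = lStart
            let mutable cont = true
            while cont && j < lEnd do
                if (not matched2.[j]) && s1.[i] = s2.[j] then
                    matched1.[i] <- true
                    matched2.[j] <- true
                    commonChars <- commonChars + 1
                    cont <- false
                // the unconditional increment is what the posted version is missing;
                // without it the loop spins forever whenever no match is found
                j <- j + 1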

@forki (Contributor) commented Dec 21, 2018

I tried to port https://stackoverflow.com/a/19165108/145701 but probably messed it up ;-)

But yours looks better anyway.

Review thread on src/utils/EditDistance.fs (outdated, resolved)
@AviAvni (Contributor, Author) commented Dec 21, 2018

Why did the "Test request for parse and check doesn't check whole project" test fail?

@cartermp (Contributor)

Ugh, it's one of those flaky FCS tests.

@cartermp closed this Dec 21, 2018
@cartermp reopened this Dec 21, 2018
@cartermp (Contributor)

@TIHan and I will benchmark this before pulling. At least on the surface it seems fine, but we'll have to measure it to be sure.

Review thread on src/fsharp/ErrorResolutionHints.fs (resolved)
Two review threads on src/utils/EditDistance.fs (outdated, resolved)
Comments from @vasily-kirichenko, @cartermp, and @majocha have been minimized.

@cartermp (Contributor) commented Dec 31, 2018

(Edited to reflect a correct benchmark with proper baselining, setup, and a FilterPredictions implementation that matches this PR)

I'll want to benchmark this with actual data from VisualFSharp, but here's a quick benchmark comparing my name against 200 other male first names, with that list appended to itself 42 times, so comparing against 8,400 strings in total:

https://github.com/cartermp/JaroWinklerBenchmarks
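
The harness has roughly this shape (a sketch only; the list contents are stand-ins and the function under test is assumed to be in scope, the real code is in the repo above):

open BenchmarkDotNet.Attributes

[<MemoryDiagnoser>]
type JaroBenchmarks() =
    // Stand-in data: the real list holds 200 male first names.
    let names = [ "aaron"; "adam"; "alan" ]
    // 8,400 candidates = 200 names x 42 repetitions.
    let candidates = List.replicate 42 names |> List.concat

    [<Benchmark(Baseline = true)>]
    member __.JaroCurrent() =
        // assumes the implementation under test is in scope as JaroWinklerDistance
        candidates |> List.map (JaroWinklerDistance "phillip")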

BenchmarkDotNet=v0.11.3, OS=macOS Mojave 10.14.2 (18C54) [Darwin 18.2.0]
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.2.100
  [Host]     : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT DEBUG
  DefaultJob : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT

| Method             |     Mean |     Error |    StdDev |   Median | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|------------------- |---------:|----------:|----------:|---------:|------:|--------:|------------:|------------:|------------:|--------------------:|
| JaroCurrent        | 3.500 ms | 0.1343 ms | 0.3744 ms | 3.384 ms |  1.00 |    0.00 |    679.6875 |           - |           - |          2108.52 KB |
| JaroNew            | 3.041 ms | 0.1387 ms | 0.3913 ms | 2.950 ms |  0.88 |    0.15 |    578.1250 |           - |           - |          1783.68 KB |
| JaroNewStructTuple | 2.636 ms | 0.0527 ms | 0.1322 ms | 2.628 ms |  0.76 |    0.09 |    136.7188 |           - |           - |           424.59 KB |

Takeaways:

  • The struct-tuple version is the clear winner: much less CPU time and 92% less memory allocated.
  • The reference-tuple version is still an improvement, but its extra allocations appear to cost CPU time as well, and the memory savings are modest.

Keep in mind that these numbers measure a single run of the algorithm over 8,400 strings. In an IDE session it runs many more times, as the user keeps generating errors that trigger this code path. In my own testing with #6044, this code path accounted for essentially all of the time.
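
For context, the shape of the struct-tuple variant as a sketch (not the PR's exact code): returning struct (string * float) keeps each candidate's score off the heap, which is the difference the JaroNewStructTuple row measures.

// Sketch: a struct tuple result avoids one heap allocation per scored candidate.
let scoreCandidate (input: string) (candidate: string) : struct (string * float) =
    struct (candidate, JaroWinklerDistance input candidate)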

@cartermp (Contributor) commented Jan 1, 2019

I wonder if there is a way to remove the need for normalizing the strings with ToUpper. Naively removing it does decrease CPU time and does remove more allocations (in #6044, this was 228 MB over 90 seconds). However, it clearly gives bad results, so it can't be used. If I replace the character comparisons with normalized character comparisons, CPU time increases 2-3x, which wouldn't make it worth it.
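
Roughly what the per-character normalization looks like (my reconstruction of the 2-3x-slower variant):

// Normalizing per comparison avoids the ToUpper string allocations, but it
// moves the case-folding work into the innermost loop, hence the CPU-time hit.
let inline charsEqualIgnoreCase (a: char) (b: char) =
    System.Char.ToUpperInvariant a = System.Char.ToUpperInvariant b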

@abelbraaksma (Contributor) commented Jan 4, 2019

I wonder if there is a way to remove the need for normalizing the strings with ToUpper

@cartermp: A micro-optimization would be to replace it with CultureInfo.InvariantCulture.TextInfo.ToUpper (where the TextInfo object is module-level). This removes a null check and two levels of function calls; BenchmarkDotNet timings on strings of size 1-100 suggest a 1% to 20% performance increase (but this isn't even the innermost loop, so I doubt the impact would be measurable).
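
That is, something like this sketch:

open System.Globalization

// Resolve the invariant culture's TextInfo once at module load, rather than
// going through String.ToUpper's culture lookup and null check on every call.
let private textInfo = CultureInfo.InvariantCulture.TextInfo
let toUpperInvariant (s: string) = textInfo.ToUpper s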

If I replace the character comparisons with normalized character comparisons

Perhaps instead of normalizing for an invariant culture, it would be "good enough" to use normalization similar to OrdinalIgnoreCase (though I don't see a BCL function for chars)? Special-casing ASCII comparisons (only two ops) could perhaps also be beneficial.
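
The ASCII fast path is just a bitwise fold and a compare, e.g. this sketch (callers must guarantee both characters are ASCII letters):

// OR-ing in 0x20 folds 'A'..'Z' onto 'a'..'z'. Only safe for ASCII letters:
// e.g. '[' (0x5B) and '{' (0x7B) would otherwise compare as equal.
let inline asciiLettersEqualIgnoreCase (a: char) (b: char) =
    (int a ||| 0x20) = (int b ||| 0x20)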

I can see numerous other potential improvements in the code, such as loop unrolling and building hash tables over the set of identifiers, which largely remains unchanged between queries. I've seen examples online suggesting that hashing is much faster than ToUpper; more importantly, it would turn the allocations into a one-time penalty, and you could store the strings as arrays, which may save a few ops down the line.
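
For instance, caching the normalized form per identifier (a sketch, assuming the identifier set is long-lived):

open System.Collections.Concurrent

// Pay the ToUpper allocation once per distinct identifier rather than on
// every comparison; the cache survives across queries.
let private normalizedIdents = ConcurrentDictionary<string, string>()
let normalize (ident: string) =
    normalizedIdents.GetOrAdd(ident, fun s -> s.ToUpperInvariant())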

It may also be interesting to see whether the inner loop in jaro can be replaced with IndexOf or IndexOfAny, the former potentially benefiting from OrdinalIgnoreCase, which is an order of magnitude faster than InvariantCultureIgnoreCase.
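
That is, replacing the hand-rolled window scan with something like this sketch (lStart and lEnd as in the loop above):

// Scan the window [lStart, lEnd) for s1.[i] using an ordinal IndexOf, which
// the runtime can optimize. Note: unlike the original loop, this does not
// skip positions already claimed in matched2, so a real version would need
// to keep searching past such hits.
let hit = s2.IndexOf(s1.[i], lStart, lEnd - lStart)
// hit >= 0: record the match as before; hit < 0: no match in the window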

@cartermp (Contributor) left a review

This is a good change, thanks @AviAvni

@TIHan (Contributor) left a review

Nice work. Based on @cartermp's testing, this seems to improve compute time and allocations overall; therefore, the change is very much worth it.

@cartermp merged commit 4566e64 into dotnet:master Jan 11, 2019