replace addressit with pelias native parser #1287
Conversation
I did some tests (only in France for now) and it seems that this solves #1279 🎉
(force-pushed 536104c to cfe5851)
Here is the current status of acceptance tests on our planet build.
What are we targeting? As good as addressit?
I'd like to see all the sane tests still passing; if tests are brittle or just bad then they can be changed. I'm working off the assumption that all the broken tests at this stage are either venue queries or intersection queries? Could you please post any of your test cases which don't fall in one of those two buckets? But yeah, pretty good so far considering that it's pretty much a rewrite of autocomplete :)
Yes, most of them are venue queries. Here is a diff of the acceptance-test results, master vs the pelias_parser branch:

```diff
--- master        2019-05-21 17:14:45.528007350 +0200
+++ pelias_parser 2019-05-21 17:18:56.555871513 +0200
autocomplete admin areas
- ✔ [1-2] "{"text":"brook"}"
+ ✘ regression [1-2] "{"text":"brook"}": score 0 out of 1
+ diff:
+ label
+ expected: Brooklyn, New York, NY, USA
+ actual: Crystal Brook, NY, USA
...
autocomplete daly city
- ✔ [1] "{"focus.point.lat":"37.769316","focus.point.lon":"-122.484223","text":"dal"}"
+ ✘ regression [1] "{"focus.point.lat":"37.769316","focus.point.lon":"-122.484223","text":"dal"}": score 0 out of 1
+ diff:
+ name
+ expected: Daly City
+ actual: Berg en Dal
- ✔ [5] "{"focus.point.lat":"37.769316","focus.point.lon":"-122.484223","text":"daly ci"}"
+ ✘ regression [5] "{"focus.point.lat":"37.769316","focus.point.lon":"-122.484223","text":"daly ci"}": score 1 out of 2
+ diff:
+ priorityThresh is 1 but found at position 9
...
autocomplete focus
- ✔ [4] "{"focus.point.lat":52.507054,"focus.point.lon":13.321714,"text":"hard rock cafe"}"
- ✔ [5] "{"focus.point.lat":40.744243,"focus.point.lon":-73.990342,"text":"hard rock cafe"}"
+ ✘ regression [4] "{"focus.point.lat":52.507054,"focus.point.lon":13.321714,"text":"hard rock cafe"}": score 1 out of 2
+ diff:
+ priorityThresh is 1 but found at position 5
+ ✘ regression [5] "{"focus.point.lat":40.744243,"focus.point.lon":-73.990342,"text":"hard rock cafe"}": score 1 out of 2
+ diff:
+ priorityThresh is 1 but found at position 4
...
autocomplete street centroids
- ✔ [1] "{"sources":"osm","layers":"street","text":"rushendon furlong, england"}"
- ✔ [2] "{"sources":"osm","layers":"street","text":"grolmanstr, berlin"}"
+ ✘ regression [1] "{"sources":"osm","layers":"street","text":"rushendon furlong, england"}": no results returned
+ ✘ regression [2] "{"sources":"osm","layers":"street","text":"grolmanstr, berlin"}": no results returned
...
autocomplete street fallback
- ✔ [1] "{"text":"grolmanstr, berlin"}"
+ ✘ regression [1] "{"text":"grolmanstr, berlin"}": no results returned
...
autocomplete synonyms
- ✔ [3] "{"text":"ucsf mt zion, san francisco, ca"}"
+ ✘ regression [3] "{"text":"ucsf mt zion, san francisco, ca"}": score 2 out of 3
+ diff:
+ name
+ expected: UCSF Mount Zion Campus
+ actual: San Francisco
...
labels
- ✔ [24] "{"text":"national air and space museum, washington dc"}"
+ ✘ regression [24] "{"text":"national air and space museum, washington dc"}": score 0 out of 1
+ diff:
+ label
+ expected: National Air and Space Museum, Washington, DC, USA
+ actual: Frigid Air, Tacoma, WA, USA
...
landmarks
- ✔ [15] "{"text":"australian war memorial, canberra, australia"}"
+ ✘ regression [15] "{"text":"australian war memorial, canberra, australia"}": score 2 out of 4
+ diff:
+ name
+ expected: Australian War Memorial
+ actual: Australian Capital Territory
+ locality
+ expected: Campbell
+ actual:
...
locality geodisambiguation
- ✔ [22] "{"text":"Germa","lang":"ru"}"
+ ✘ regression [22] "{"text":"Germa","lang":"ru"}": score 0 out of 2
+ diff:
+ name
+ expected: Германия
+ actual: Germa
+ country_a
+ expected: DEU
+ actual: GRC
...
search
- ✔ [1426636804303:51] "{"text":"4th and King"}"
+ ✘ regression [1426636804303:51] "{"text":"4th and King"}": score 0 out of 6
+ diff:
+ name
+ expected: 4th & King
+ actual: ۴/۴
+ locality
+ expected: San Francisco
+ actual:
+ country_a
+ expected: USA
+ actual: IRN
+ name
+ expected: San Francisco 4th & King Street
+ actual: ۴/۴
+ locality
+ expected: San Francisco
+ actual:
+ country_a
+ expected: USA
+ actual: IRN
...
search_poi
- ✔ [9] "{"text":"Ohio State University"}"
- ✔ [10] "{"text":"Antioch University Seattle"}"
- ✔ [11] "{"text":"Union college, kentucky"}"
+ ✘ regression [9] "{"text":"Ohio State University"}": score 0 out of 2
+ diff:
+ name
+ expected: Ohio State University Lima
+ actual: Ohio
+ localadmin
+ expected: Bath Township
+ actual:
+ ✘ regression [10] "{"text":"Antioch University Seattle"}": score 1 out of 2
+ diff:
+ name
+ expected: Antioch University
+ actual: Seattle
+ ✘ regression [11] "{"text":"Union college, kentucky"}": score 1 out of 2
+ diff:
+ name
+ expected: Union College
+ actual: Kentucky
```
(force-pushed 07e0296 to e9b68ba)
This PR is showing really good results, especially for intersection queries, where it's a massive improvement over using addressit. Before merging, it would be great to get a handle on the performance implications for request time. Here are some things we can look at:
So many improvements since my last acceptance test 😄
I tested the heap usage and it's actually a lot lower than I expected!

```js
const mem = function () {
  let used = process.memoryUsage().heapUsed / 1024 / 1024
  console.log(`The script uses approximately ${used} MB\n`)
}

console.log('baseline')
mem()

let AddressParser = require('./parser/AddressParser')
console.log('require')
mem()

let parser = new AddressParser()
console.log('instantiate')
mem()
```
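Along the same lines, per-query latency (raised above as a pre-merge concern) could be spot-checked with a similar standalone script. This is only a sketch: it reuses the same `./parser/AddressParser` require path as above, and the `parse()` call is a hypothetical stand-in for whatever entry point the parser actually exposes.

```js
// Rough per-query latency check for the native parser (sketch only).
// The parse() call below is a hypothetical stand-in, not the real API.
const { performance } = require('perf_hooks')
const AddressParser = require('./parser/AddressParser')

const parser = new AddressParser()
const queries = [
  '20 boulevard saint germain, paris',
  'grolmanstr, berlin',
  'union college, kentucky'
]

for (const text of queries) {
  const start = performance.now()
  parser.parse(text) // hypothetical entry point, for illustration only
  console.log(`${text} -> ${(performance.now() - start).toFixed(2)} ms`)
}
```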
(force-pushed b7dfd00 to c411c5e)
Hi there, just to let you know that I'm using this (mixed with #1268) at @jawg, and I have some feedback about the analyzer.

If I use the full address, the parse is:

```json
{
  "subject": "20 boulevard saint germain",
  "housenumber": "20",
  "street": "boulevard saint germain",
  "locality": "paris",
  "country": "france",
  "admin": "paris france"
}
```

and we find the correct answer in first position. But when we do the same search using a synonym ("st" instead of "saint"), the parse is:

```json
{
  "subject": "20 boulevard st germain",
  "housenumber": "20",
  "street": "boulevard st germain",
  "locality": "paris",
  "country": "france",
  "admin": "paris france"
}
```

and we get no answers, because of the analyzer used on these fields. If I replace the …
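For context on why "st" vs "saint" can behave differently, this is the kind of variation an Elasticsearch synonym token filter normally absorbs at index or query time. The snippet below is a generic illustration of such a filter, not the actual Pelias schema; the filter and analyzer names are made up.

```js
// Generic Elasticsearch analysis settings with a synonym filter for
// street-name abbreviations; names here are illustrative only and do
// not correspond to the actual Pelias schema.
const analysis = {
  filter: {
    street_name_synonyms: {
      type: 'synonym',
      synonyms: [
        'saint, st',
        'boulevard, blvd'
      ]
    }
  },
  analyzer: {
    street_name_analyzer: {
      tokenizer: 'standard',
      filter: ['lowercase', 'street_name_synonyms']
    }
  }
}

module.exports = { settings: { analysis } }
```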
…ultimatch", clean up related code
(force-pushed 3dc0d8f to 88656e0)
merged! Thanks everyone for your patience. I think this is a big step forward in improving autocomplete and taking on more control of how we parse inputs. 🎉 🍾
👏 🚀 🎉
These all appear to pass now thanks to pelias/api#1287
Unclear why these fail, but it may be due to pelias/api#1287
Hi guys,
Hi @nilsnolde, there is no configuration change required.
Oh damn, blind... Thanks Peter. Good rant btw! Totally agreed. Might steal your wording maybe ;)
The issue described has been solved by the Pelias Parser in pelias/api#1287!
This PR replaces addressit with the new pelias/parser module. Ready for testing and review.
Functionally, I tried to keep them pretty similar, although the native parser is better at detecting addresses in general, and particularly in places like Germany and France 🎉
One major change is that we used to rely on the admin parsing to generate targeted subqueries (i.e. when state='CA' was matched we would produce a subquery for region='CA').
This had the disadvantage of not generating a scoring subquery for cases where 'CA' referred to the country of Canada.
So the change I made is that we consider everything from the first matched admin character to the end of the string as 'admin_parts' and generate subqueries for them to match on any admin field.
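To make that idea concrete, here is a rough sketch of what it could look like; the helper name, offset handling, and field list are illustrative, not the actual code added in this PR.

```js
// Sketch of the admin_parts approach described above (illustrative only,
// not the actual implementation). Everything from the first matched admin
// character to the end of the string is split into parts, and each part
// is scored against any admin field rather than one targeted field.
const ADMIN_FIELDS = [
  'parent.country',
  'parent.region',
  'parent.county',
  'parent.locality',
  'parent.neighbourhood'
]

function adminPartsSubqueries (text, firstAdminCharOffset) {
  const adminParts = text
    .slice(firstAdminCharOffset)
    .split(',')
    .map(part => part.trim())
    .filter(part => part.length > 0)

  // one scoring subquery per admin part, matched against all admin fields
  return adminParts.map(part => ({
    multi_match: {
      query: part,
      fields: ADMIN_FIELDS
    }
  }))
}

// e.g. adminPartsSubqueries('union college, kentucky', 15)
// -> [ { multi_match: { query: 'kentucky', fields: ADMIN_FIELDS } } ]
```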