New parsing engine #12

giuseongit · 2019-04-12T13:12:30Z

With this engine the HTML class stores a list of pointers to some specific nodes: all id, class and tags are easily accessible without browsing the entire document using xpath.

giuseongit · 2019-04-12T13:20:54Z

I'm trying to improve performances. Instead of using xpath to browse te document every time the user needs to access a node, I chose to visit the resulting tree starting from the root node (html tag) and populate a map of internal pointers to easly access those nodes.

I'm using the following code to test these performances:

require "crystagiri"

Crystagiri::HTML.experimental = false
t = Time.now
# doc = Crystagiri::HTML.from_url "https://www.google.com"
doc = Crystagiri::HTML.from_file "./index.html"
# doc = Crystagiri::HTML.from_file "./HTML.html"
1..100000.times do
    doc.at_tag("div")
    doc.where_class("step-title") { |tag| tag }
end
puts "executed in #{Time.now - t} milliseconds"

Crystagiri::HTML.experimental = true
t = Time.now
# doc = Crystagiri::HTML.from_url "https://www.google.com"
doc = Crystagiri::HTML.from_file "./index.html"
# doc = Crystagiri::HTML.from_file "./HTML.html"
1..100000.times do
  doc.at_tag("div")
  doc.where_class("step-title") { |tag| tag }
end
puts "executed with experimental code in #{Time.now - t} milliseconds"

having the following results:

crystagiri_test ❯ crystal test.cr
executed in 00:00:56.401452000 milliseconds
executed with experimental code in 00:00:00.085011000 milliseconds

crystagiri_test ❯ crystal test.cr
executed in 00:00:06.580082000 milliseconds
executed with experimental code in 00:00:00.119495000 milliseconds

The file HTML.html is simply the one in spec/fixture/HTML.html. index.html is this file

It seems a good performance improvement, but I'm still working on it. Anyway, any ideas, suggestions or feedbacks of every kind are welcome.

madeindjs · 2019-04-12T13:47:19Z

It seem a good idea! But in my mind this method is faster when you make many search on the same document but is it faster when you make on simple search?

I think you should make deeper test because Html#visit visit the whole document and performance may be worse for big documents...

giuseongit · 2019-04-12T13:53:52Z

I think you should make deeper test because Html#visit visit the whole document and performance may be worse for big documents...

Yes, I'd like to have a more robust structure first, then I'll look for some big docs to parse (any help is welcome).

The use case I'm thinking about is the one in which the user needs this library to extract several data from the document. Instead of visiting it every single time data need to be extracted I think that visit the whole tree at the beginning may be a good tradeoff, as the subsequent access will be much faster, as we already have a bunch of pointers to some nodes inside the doc. Keep in mind that when extracting data the underlying documents must not change, so basically there's no need to browse it every time.

giuseongit · 2019-04-15T15:41:50Z

@madeindjs I've wrote a little benchmark for testing the new code's performances, you can find it in the benchmark folder. I've counted only computational time (download times are not taken in account). It seems that the parsing is indeed a bit longer, but as every access time is drastically reduced, the overall computing time is quite shorter.

Here's my last run of the benchmark (time is expressed in ms):

benchmark git:experimental/better-engine ❯ crystal tree_visit_vs_xpath.cr                                                                                                                                                                                                             Running on https://www.quirksmode.org/html5/tests/video.html
*|..........*|..........
method  |  total time | parse time
========+=============+============
xpath   |  22685.673   |  1.921
--------+-------------+------------
exper.  |   148.412   |  1.484
~~~~~~~~~~~~~~~
Running on http://www.8164.org/the-big-table-issue/
*|..........*|..........
method  |  total time | parse time
========+=============+============
xpath   |  97709.436   |  2.55
--------+-------------+------------
exper.  |   785.565   |  3.861
~~~~~~~~~~~~~~~
Running on https://lawsofux.com/
*|..........*|..........
method  |  total time | parse time
========+=============+============
xpath   |  162025.837   |  4.871
--------+-------------+------------
exper.  |   175.667   |  9.446
~~~~~~~~~~~~~~~
Running on https://www.youtube.com/
*|..........*|..........
method  |  total time | parse time
========+=============+============
xpath   |  4222152.497   |  19.293
--------+-------------+------------
exper.  |   8095.195   |  72.428
~~~~~~~~~~~~~~~

Test ran on 4 links.
Time with xpath method: 4504573.443 ms with 28.634999999999998 ms needed for parsing.
Time with experimental method: 9204.839 ms with 87.219 ms needed for parsing.

Even for a big HTML doc like the one youtube returns the parse time is quite irrelevant (~72ms), so I think it still would be a big improvement in performances.

madeindjs · 2019-04-16T08:43:39Z

This is really impressive. I made some tests using your script and I draw this graph:

Also note you should build you file with --release flag before to run benchmark (but I had sames results..)

So I think the best option is remove old comportment and rewrite with your own. Then publish this version with an alpha flag and wait a moment to see I there are no problem with a more complicated case..

What do you think about it?

giuseongit · 2019-04-16T09:40:51Z

Sounds good to me. I'll keep the HTML#css method because the new code is lacking a css selectors parser (we'll work on it).
Ok then, I'll clean the code, remove the benchmark. write some docstrings for internal docs. I'll try to push a git history as clean as possible.

With this engine the `HTML` class stores a list of pointers to some specific nodes: all `id`, `class` and `tags` are easily accessible without browsing the entire document using xpath.

The feature is almost complete, we're merging it in the main codebase This is not a fixup or sqashable commit so hystory is preserved (hence the benchmark used to decide the parsing strategy switch)

giuseongit · 2019-04-16T11:57:31Z

The pr is now ready, you can review it. Keep in mind that also the README.md must be updated to show the exetution time of the current code instead the old one (don't use HTML#css and HTML#at_css for comparison because they still use old code)

madeindjs · 2019-04-18T15:24:43Z

Published on v0.4.0-alpha, thanks a lot!

giuseongit force-pushed the experimental/better-engine branch from 03e15c9 to 42bc2bd Compare April 12, 2019 13:24

madeindjs added the enhancement label Apr 12, 2019

giuseongit force-pushed the experimental/better-engine branch 3 times, most recently from cd7dd97 to 894254f Compare April 15, 2019 14:29

giuseongit and others added 3 commits April 16, 2019 13:49

New parsing engine

bd638d3

With this engine the `HTML` class stores a list of pointers to some specific nodes: all `id`, `class` and `tags` are easily accessible without browsing the entire document using xpath.

Add benchmark

790550d

Code cleanup

17590d1

The feature is almost complete, we're merging it in the main codebase This is not a fixup or sqashable commit so hystory is preserved (hence the benchmark used to decide the parsing strategy switch)

giuseongit force-pushed the experimental/better-engine branch from 894254f to 17590d1 Compare April 16, 2019 11:52

giuseongit marked this pull request as ready for review April 16, 2019 11:53

giuseongit mentioned this pull request Apr 16, 2019

Tag browsing #13

Merged

3 tasks

madeindjs added this to the 0.4.0-alpha milestone Apr 16, 2019

madeindjs merged commit e70753b into madeindjs:master Apr 18, 2019

giuseongit deleted the experimental/better-engine branch April 21, 2019 19:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New parsing engine #12

New parsing engine #12

giuseongit commented Apr 12, 2019 •

edited

Loading

giuseongit commented Apr 12, 2019 •

edited

Loading

madeindjs commented Apr 12, 2019

giuseongit commented Apr 12, 2019

giuseongit commented Apr 15, 2019 •

edited

Loading

madeindjs commented Apr 16, 2019

giuseongit commented Apr 16, 2019

giuseongit commented Apr 16, 2019 •

edited

Loading

madeindjs commented Apr 18, 2019

New parsing engine #12

New parsing engine #12

Conversation

giuseongit commented Apr 12, 2019 • edited Loading

giuseongit commented Apr 12, 2019 • edited Loading

madeindjs commented Apr 12, 2019

giuseongit commented Apr 12, 2019

giuseongit commented Apr 15, 2019 • edited Loading

madeindjs commented Apr 16, 2019

giuseongit commented Apr 16, 2019

giuseongit commented Apr 16, 2019 • edited Loading

madeindjs commented Apr 18, 2019

giuseongit commented Apr 12, 2019 •

edited

Loading

giuseongit commented Apr 12, 2019 •

edited

Loading

giuseongit commented Apr 15, 2019 •

edited

Loading

giuseongit commented Apr 16, 2019 •

edited

Loading