Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New parsing engine #12

Merged
merged 3 commits into from
Apr 18, 2019

Conversation

giuseongit
Copy link
Contributor

@giuseongit giuseongit commented Apr 12, 2019

With this engine the HTML class stores a list of pointers to some specific nodes: all id, class and tags are easily accessible without browsing the entire document using xpath.

@giuseongit
Copy link
Contributor Author

giuseongit commented Apr 12, 2019

I'm trying to improve performances. Instead of using xpath to browse te document every time the user needs to access a node, I chose to visit the resulting tree starting from the root node (html tag) and populate a map of internal pointers to easly access those nodes.

I'm using the following code to test these performances:

require "crystagiri"

Crystagiri::HTML.experimental = false
t = Time.now
# doc = Crystagiri::HTML.from_url "https://www.google.com"
doc = Crystagiri::HTML.from_file "./index.html"
# doc = Crystagiri::HTML.from_file "./HTML.html"
1..100000.times do
    doc.at_tag("div")
    doc.where_class("step-title") { |tag| tag }
end
puts "executed in #{Time.now - t} milliseconds"

Crystagiri::HTML.experimental = true
t = Time.now
# doc = Crystagiri::HTML.from_url "https://www.google.com"
doc = Crystagiri::HTML.from_file "./index.html"
# doc = Crystagiri::HTML.from_file "./HTML.html"
1..100000.times do
  doc.at_tag("div")
  doc.where_class("step-title") { |tag| tag }
end
puts "executed with experimental code in #{Time.now - t} milliseconds"

having the following results:

crystagiri_test ❯ crystal test.cr
executed in 00:00:56.401452000 milliseconds
executed with experimental code in 00:00:00.085011000 milliseconds

crystagiri_test ❯ crystal test.cr
executed in 00:00:06.580082000 milliseconds
executed with experimental code in 00:00:00.119495000 milliseconds

The file HTML.html is simply the one in spec/fixture/HTML.html. index.html is this file

It seems a good performance improvement, but I'm still working on it. Anyway, any ideas, suggestions or feedbacks of every kind are welcome.

@giuseongit giuseongit force-pushed the experimental/better-engine branch from 03e15c9 to 42bc2bd Compare April 12, 2019 13:24
@madeindjs
Copy link
Owner

It seem a good idea! But in my mind this method is faster when you make many search on the same document but is it faster when you make on simple search?

I think you should make deeper test because Html#visit visit the whole document and performance may be worse for big documents...

@giuseongit
Copy link
Contributor Author

I think you should make deeper test because Html#visit visit the whole document and performance may be worse for big documents...

Yes, I'd like to have a more robust structure first, then I'll look for some big docs to parse (any help is welcome).

The use case I'm thinking about is the one in which the user needs this library to extract several data from the document. Instead of visiting it every single time data need to be extracted I think that visit the whole tree at the beginning may be a good tradeoff, as the subsequent access will be much faster, as we already have a bunch of pointers to some nodes inside the doc. Keep in mind that when extracting data the underlying documents must not change, so basically there's no need to browse it every time.

@giuseongit giuseongit force-pushed the experimental/better-engine branch 3 times, most recently from cd7dd97 to 894254f Compare April 15, 2019 14:29
@giuseongit
Copy link
Contributor Author

giuseongit commented Apr 15, 2019

@madeindjs I've wrote a little benchmark for testing the new code's performances, you can find it in the benchmark folder. I've counted only computational time (download times are not taken in account). It seems that the parsing is indeed a bit longer, but as every access time is drastically reduced, the overall computing time is quite shorter.

Here's my last run of the benchmark (time is expressed in ms):

benchmark git:experimental/better-engine ❯ crystal tree_visit_vs_xpath.cr                                                                                                                                                                                                             Running on https://www.quirksmode.org/html5/tests/video.html
*|..........*|..........
method  |  total time | parse time
========+=============+============
xpath   |  22685.673   |  1.921
--------+-------------+------------
exper.  |   148.412   |  1.484
~~~~~~~~~~~~~~~
Running on http://www.8164.org/the-big-table-issue/
*|..........*|..........
method  |  total time | parse time
========+=============+============
xpath   |  97709.436   |  2.55
--------+-------------+------------
exper.  |   785.565   |  3.861
~~~~~~~~~~~~~~~
Running on https://lawsofux.com/
*|..........*|..........
method  |  total time | parse time
========+=============+============
xpath   |  162025.837   |  4.871
--------+-------------+------------
exper.  |   175.667   |  9.446
~~~~~~~~~~~~~~~
Running on https://www.youtube.com/
*|..........*|..........
method  |  total time | parse time
========+=============+============
xpath   |  4222152.497   |  19.293
--------+-------------+------------
exper.  |   8095.195   |  72.428
~~~~~~~~~~~~~~~

Test ran on 4 links.
Time with xpath method: 4504573.443 ms with 28.634999999999998 ms needed for parsing.
Time with experimental method: 9204.839 ms with 87.219 ms needed for parsing.

Even for a big HTML doc like the one youtube returns the parse time is quite irrelevant (~72ms), so I think it still would be a big improvement in performances.

@madeindjs
Copy link
Owner

This is really impressive. I made some tests using your script and I draw this graph:

Also note you should build you file with --release flag before to run benchmark (but I had sames results..)

graph

So I think the best option is remove old comportment and rewrite with your own. Then publish this version with an alpha flag and wait a moment to see I there are no problem with a more complicated case..

What do you think about it?

@giuseongit
Copy link
Contributor Author

Sounds good to me. I'll keep the HTML#css method because the new code is lacking a css selectors parser (we'll work on it).
Ok then, I'll clean the code, remove the benchmark. write some docstrings for internal docs. I'll try to push a git history as clean as possible.

giuseongit and others added 3 commits April 16, 2019 13:49
With this engine the `HTML` class stores a list of pointers
to some specific nodes: all `id`, `class` and `tags` are easily
accessible without browsing the entire document using xpath.
The feature is almost complete, we're merging it in the main codebase
This is not a fixup or sqashable commit so hystory is preserved
(hence the benchmark used to decide the parsing strategy switch)
@giuseongit giuseongit force-pushed the experimental/better-engine branch from 894254f to 17590d1 Compare April 16, 2019 11:52
@giuseongit giuseongit marked this pull request as ready for review April 16, 2019 11:53
@giuseongit
Copy link
Contributor Author

giuseongit commented Apr 16, 2019

The pr is now ready, you can review it. Keep in mind that also the README.md must be updated to show the exetution time of the current code instead the old one (don't use HTML#css and HTML#at_css for comparison because they still use old code)

@giuseongit giuseongit mentioned this pull request Apr 16, 2019
3 tasks
@madeindjs madeindjs added this to the 0.4.0-alpha milestone Apr 16, 2019
@madeindjs madeindjs merged commit e70753b into madeindjs:master Apr 18, 2019
@madeindjs
Copy link
Owner

Published on v0.4.0-alpha, thanks a lot!

@giuseongit giuseongit deleted the experimental/better-engine branch April 21, 2019 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants