-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New parsing engine #12
New parsing engine #12
Conversation
I'm trying to improve performances. Instead of using xpath to browse te document every time the user needs to access a node, I chose to visit the resulting tree starting from the root node ( I'm using the following code to test these performances: require "crystagiri"
Crystagiri::HTML.experimental = false
t = Time.now
# doc = Crystagiri::HTML.from_url "https://www.google.com"
doc = Crystagiri::HTML.from_file "./index.html"
# doc = Crystagiri::HTML.from_file "./HTML.html"
1..100000.times do
doc.at_tag("div")
doc.where_class("step-title") { |tag| tag }
end
puts "executed in #{Time.now - t} milliseconds"
Crystagiri::HTML.experimental = true
t = Time.now
# doc = Crystagiri::HTML.from_url "https://www.google.com"
doc = Crystagiri::HTML.from_file "./index.html"
# doc = Crystagiri::HTML.from_file "./HTML.html"
1..100000.times do
doc.at_tag("div")
doc.where_class("step-title") { |tag| tag }
end
puts "executed with experimental code in #{Time.now - t} milliseconds" having the following results: crystagiri_test ❯ crystal test.cr
executed in 00:00:56.401452000 milliseconds
executed with experimental code in 00:00:00.085011000 milliseconds
crystagiri_test ❯ crystal test.cr
executed in 00:00:06.580082000 milliseconds
executed with experimental code in 00:00:00.119495000 milliseconds The file It seems a good performance improvement, but I'm still working on it. Anyway, any ideas, suggestions or feedbacks of every kind are welcome. |
03e15c9
to
42bc2bd
Compare
It seem a good idea! But in my mind this method is faster when you make many search on the same document but is it faster when you make on simple search? I think you should make deeper test because |
Yes, I'd like to have a more robust structure first, then I'll look for some big docs to parse (any help is welcome). The use case I'm thinking about is the one in which the user needs this library to extract several data from the document. Instead of visiting it every single time data need to be extracted I think that visit the whole tree at the beginning may be a good tradeoff, as the subsequent access will be much faster, as we already have a bunch of pointers to some nodes inside the doc. Keep in mind that when extracting data the underlying documents must not change, so basically there's no need to browse it every time. |
cd7dd97
to
894254f
Compare
@madeindjs I've wrote a little benchmark for testing the new code's performances, you can find it in the Here's my last run of the benchmark (time is expressed in ms): benchmark git:experimental/better-engine ❯ crystal tree_visit_vs_xpath.cr Running on https://www.quirksmode.org/html5/tests/video.html
*|..........*|..........
method | total time | parse time
========+=============+============
xpath | 22685.673 | 1.921
--------+-------------+------------
exper. | 148.412 | 1.484
~~~~~~~~~~~~~~~
Running on http://www.8164.org/the-big-table-issue/
*|..........*|..........
method | total time | parse time
========+=============+============
xpath | 97709.436 | 2.55
--------+-------------+------------
exper. | 785.565 | 3.861
~~~~~~~~~~~~~~~
Running on https://lawsofux.com/
*|..........*|..........
method | total time | parse time
========+=============+============
xpath | 162025.837 | 4.871
--------+-------------+------------
exper. | 175.667 | 9.446
~~~~~~~~~~~~~~~
Running on https://www.youtube.com/
*|..........*|..........
method | total time | parse time
========+=============+============
xpath | 4222152.497 | 19.293
--------+-------------+------------
exper. | 8095.195 | 72.428
~~~~~~~~~~~~~~~
Test ran on 4 links.
Time with xpath method: 4504573.443 ms with 28.634999999999998 ms needed for parsing.
Time with experimental method: 9204.839 ms with 87.219 ms needed for parsing. Even for a big HTML doc like the one youtube returns the parse time is quite irrelevant (~72ms), so I think it still would be a big improvement in performances. |
This is really impressive. I made some tests using your script and I draw this graph:
So I think the best option is remove old comportment and rewrite with your own. Then publish this version with an alpha flag and wait a moment to see I there are no problem with a more complicated case.. What do you think about it? |
Sounds good to me. I'll keep the |
With this engine the `HTML` class stores a list of pointers to some specific nodes: all `id`, `class` and `tags` are easily accessible without browsing the entire document using xpath.
The feature is almost complete, we're merging it in the main codebase This is not a fixup or sqashable commit so hystory is preserved (hence the benchmark used to decide the parsing strategy switch)
894254f
to
17590d1
Compare
The pr is now ready, you can review it. Keep in mind that also the |
Published on v0.4.0-alpha, thanks a lot! |
With this engine the
HTML
class stores a list of pointers to some specific nodes: allid
,class
andtags
are easily accessible without browsing the entire document using xpath.