
Grammar that parses in milliseconds in Java runtime, takes many seconds in python #1219

pavelvelikhov opened this issue Jun 23, 2016 · 77 comments

@pavelvelikhov

I have a grammar of Python3 (from the antlr4 grammar repository), extended with query language constructs. The grammar file is here: Grammar file

The whole project is here: PythonQL

This tiny program parses in milliseconds with the Java runtime, but takes about 1.5 seconds in Python (after the recent fix; before it, over 2 seconds).

# This example illustrates the window query in PythonQL

from collections import namedtuple
trade = namedtuple('Trade', ['day','ammount', 'stock_id'])

trades = [ trade(1, 15.34, 'APPL'),
           trade(2, 13.45, 'APPL'),
           trade(3, 8.34,  'APPL'),
           trade(4, 9.87,  'APPL'),
           trade(5, 10.99, 'APPL'),
           trade(6, 76.16, 'APPL') ]

# Maximum 3-day sum

res = (select win
        for sliding window win in ( select t.ammount for t in trades )
        start at s when True
        only end at e when (e-s == 2))

print (res)

Here is a profiler trace just in case (I left only the relevant entries):

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    21    0.000    0.000    0.094    0.004 PythonQLParser.py:7483(argument)
    8    0.000    0.000    0.195    0.024 PythonQLParser.py:7379(arglist)  
    9    0.000    0.000    0.196    0.022 PythonQLParser.py:6836(trailer)  
    5/3    0.000    0.000    0.132   0.044 PythonQLParser.py:6765(testlist_comp)
    1    0.000    0.000    0.012    0.012 PythonQLParser.py:6154(window_end_cond)
    1    0.000    0.000    0.057    0.057 PythonQLParser.py:6058(sliding_window)
    1    0.000    0.000    0.057    0.057 PythonQLParser.py:5941(window_clause)
    1   0.000    0.000    0.004   0.004 PythonQLParser.py:5807(for_clause_entry)
    1    0.000    0.000    0.020 0.020 PythonQLParser.py:5752(for_clause) 
    2/1   0.000   0.000   0.068   0.068 PythonQLParser.py:5553(query_expression)    
   48/10    0.000    0.000    0.133    0.013 PythonQLParser.py:5370(atom) 
    48/7    0.000    0.000    0.315    0.045 PythonQLParser.py:5283(power) 
    48/7    0.000    0.000    0.315    0.045 PythonQLParser.py:5212(factor) 
    48/7    0.000    0.000    0.331    0.047 PythonQLParser.py:5132(term) 
    47/7    0.000    0.000    0.346    0.049 PythonQLParser.py:5071(arith_expr) 
    47/7    0.000    0.000    0.361    0.052 PythonQLParser.py:5010(shift_expr) 
    47/7    0.000    0.000    0.376    0.054 PythonQLParser.py:4962(and_expr) 
    47/7    0.000    0.000    0.390    0.056 PythonQLParser.py:4914(xor_expr) 
    47/7    0.000    0.000    0.405    0.058 PythonQLParser.py:4866(expr) 
    44/7    0.000    0.000    0.405    0.058 PythonQLParser.py:4823(star_expr) 
    43/7    0.000    0.000    0.422    0.060 PythonQLParser.py:4615(not_test) 
    43/7    0.000    0.000    0.438    0.063 PythonQLParser.py:4563(and_test) 
    43/7    0.000    0.000    0.453    0.065 PythonQLParser.py:4509(or_test) 
    43/7    0.000    0.000    0.467    0.067 PythonQLParser.py:4293(old_test) 
    43/7    0.000    0.000    0.467    0.067 PythonQLParser.py:4179(try_catch_expr)
    43/7    0.000    0.000    0.482    0.069 PythonQLParser.py:3978(test) 
    1    0.000    0.000    0.048    0.048 PythonQLParser.py:2793(import_from) 
    1    0.000    0.000    0.048    0.048 PythonQLParser.py:2702(import_stmt) 
    7    0.000    0.000    1.728    0.247 PythonQLParser.py:2251(testlist_star_expr) 
    4    0.000    0.000    1.770    0.443 PythonQLParser.py:2161(expr_stmt) 
    5    0.000    0.000    1.822    0.364 PythonQLParser.py:2063(small_stmt) 
    5    0.000    0.000    1.855    0.371 PythonQLParser.py:1980(simple_stmt) 
    5    0.000    0.000    1.859    0.372 PythonQLParser.py:1930(stmt) 
    1    0.000    0.000    1.898    1.898 PythonQLParser.py:1085(file_input)
    176    0.002    0.000    0.993    0.006 Lexer.py:127(nextToken)
    420    0.000   0.000   0.535   0.001 ParserATNSimulator.py:1120(closure)
   705    0.003    0.000    1.642    0.002 ParserATNSimulator.py:315(adaptivePredict)

I have attached a file that parses for 7 seconds on my Macbook Pro as well.

I'd be happy to reduce this case to a minimal case for debugging, but don't really know where to start.

The grammar doesn't seem to have any problems like ambiguity, etc.

@pavelvelikhov
Author

This program takes 5 seconds to parse


import csv
import sys
from collections import namedtuple
import json
import numpy as np
from dateutil.parser import parse
from pythonql.PQTuple import pq_wrap,pq_flatten

fields = ['', 'event_date', 'event_name', 'client_id', 'lead_id', 'visitor_id', 'partner_id', 'address', 'amount', 'birthdate', 'card_type', 'cardholder_name', 'city', 'credit_limit', 'current_duration', 'domain', 'duration', 'effective_amount', 'email', 'first_name', 'gender', 'interest_rate', 'last_name', 'lead_source', 'middle_name', 'mobile_head', 'mobile_number', 'new_amount', 'new_duration', 'new_interest_rate', 'pan_tail', 'query', 'referer', 'region', 'repaid_amount', 'repaid_interest', 'user_agent', 'utm_campaign', 'webmaster_id', 'reason', 'dt']

schema = {f:i for (i,f) in enumerate(fields)}

out_fields = ['event','date']
out_schema = {f:i for (i,f) in enumerate(out_fields)}

csv.field_size_limit(sys.maxsize)

data = []
f = open('sm_cust_journey.csv')
rd = csv.reader(f)
for (i, (cust_id,cj_data)) in enumerate(rd):
  steps = json.loads(cj_data)
  data.append( {'id':cust_id, 'cj': steps } )
  if i==1000:
    break

# compute number of new users, registrations, scorecard checks, accepts, issued, repaid, npl 1, 30, 60, 90
events =( select [
           select 'reg' as event, parse(e.event_date) as dt, channel
           for e in [item for item in cj_data if item.event_name=="NAME_SAVED"]
           count c1
           where c1 == 0,

           select 'repaid' as event, parse(e2.event_date) as dt, channel
           for e in [item for item in cj_data if item.event_name=="LOAN_IS_REPAID"]
           count c1
           where c1 == 0
           for e2 in [item for item in cj_data if item.event_name=="LOAN_ISSUED"]
           count c2
           where c2 == 0,

           select 'np1' as event, parse(e.event_date) as dt, channel
           for e in [item for item in cj_data if item.event_name=="LOAN_ISSUED"]
           count c1
           where c1 == 0
           let issue_date = parse(e.event_date)
           where not (
                        select item
                        for item in cj_data
            where item.event_name=="LOAN_IS_REPAID" and (parse(item.event_date) - issue_date ).days < 31 )
   ]
   for cj in data
   let cj_data = list( pq_wrap(cj["cj"], schema) )
   let channel = try {[item for item in cj_data if item.event_name=="NEW_USER_DETECTED" ][0].reason} except {None}
   where not channel is None )


res = (select month, year, channel, event, len(e) as cnt
       for e in pq_flatten(events)
       let event = e.event
       group by e.dt.month as month, e.dt.year as year, e.channel as channel, event
       order by year,month)

print(res)

@ericvergnaud
Contributor

ericvergnaud commented Jun 24, 2016

Hi
Please check #1218; it could be the same issue.
Eric

@pavelvelikhov
Author

pavelvelikhov commented Jun 24, 2016

Hi Eric,

I did look at #1218 (mentioned it as a recent fix) and checked out the build where it was fixed. It did affect performance a bit (20-30%), but the very long parse times compared to Java are still there. The numbers given are from the build dated Jun 23 08:54:27 2016 -0700.

@ericvergnaud
Contributor

Hi,
Not sure I can help here.
A couple of comments:

  • Python is 20 to 30 times slower than Java, so I'd say that if it takes 50 ms in Java, 1.5 s in Python is in range
  • if your goal is to run this in Python only, then you should optimize the grammar for Python, not for Java
  • you might find it useful to isolate which piece of the above sample is 'slow'. The expectation that a complex grammar will simply perform well across the board is legitimate, but unrealistic in Python or JavaScript.
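Eric's "isolate the slow piece" suggestion can be mechanized: time each chunk of input separately and rank the results. A minimal sketch; `parse` is a placeholder callable standing in for building the lexer/parser and invoking the start rule (the demo uses a trivial stub):

```python
import time

def time_chunks(parse, chunks):
    """Run parse(src) on each (label, src) pair and rank by duration.

    parse: placeholder for the real parser invocation.
    chunks: list of (label, source) pairs, e.g. one per top-level statement.
    Returns (seconds, label) pairs, slowest first.
    """
    results = []
    for label, src in chunks:
        start = time.perf_counter()
        parse(src)
        results.append((time.perf_counter() - start, label))
    return sorted(results, reverse=True)

# Demo with a stand-in parse function; replace the lambda with a call
# into the generated parser.
chunks = [("import", "import csv"), ("query", "res = (select win ...)")]
ranking = time_chunks(lambda src: len(src), chunks)
```

Splitting the file at top-level statements and feeding each chunk through a harness like this quickly shows whether one construct dominates the total time.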

Eric


@pavelvelikhov
Author

Hi Eric,

Yes, the goal is to parse in Python only. Could you recommend some techniques to optimize the grammar for the Python runtime?
Yeah, isolating a slow part would be helpful, I know. It's just a lot of work, and I don't see a clear strategy for how to do it.

Thanks!
Pavel

@SimonStPeter

I don't buy it. ANTLR is supposed to produce recursive-descent parsers comparable in efficiency to hand-written ones. Python is slow, but it's not that flipping slow. Something is not right here. I know I could manually write one that would beat this hands down.

I don't think it's the parser; I suspect Python is misbehaving. I happen to have generated code left over from looking at this question on Stack Overflow. My first suspicion falls on the fat, mostly-constant expressions like this:

if (((_la) & ~0x3f) == 0 and ((1 << _la) & ((1 << MSSQLParser.T__2) | (1 << MSSQLParser.RETURN) | (1 << MSSQLParser.TRY) | (1 << MSSQLParser.LAMBDA) | (1 << MSSQLParser.NOT) | (1 << MSSQLParser.NONE) | (1 << MSSQLParser.TRUE) | (1 << MSSQLParser.FALSE) | (1 << MSSQLParser.NAME) | (1 << MSSQLParser.STRING_LITERAL) | (1 << MSSQLParser.BYTES_LITERAL) | (1 << MSSQLParser.DECIMAL_INTEGER) | (1 << MSSQLParser.OCT_INTEGER) | (1 << MSSQLParser.HEX_INTEGER) | (1 << MSSQLParser.BIN_INTEGER))) != 0) or ((((_la - 64)) & ~0x3f) == 0 and ((1 << (_la - 64)) & ((1 << (MSSQLParser.FLOAT_NUMBER - 64)) | (1 << (MSSQLParser.IMAG_NUMBER - 64)) | (1 << (MSSQLParser.ELLIPSIS - 64)) | (1 << (MSSQLParser.STAR - 64)) | (1 << (MSSQLParser.OPEN_PAREN - 64)) | (1 << (MSSQLParser.OPEN_BRACK - 64)) | (1 << (MSSQLParser.ADD - 64)) | (1 << (MSSQLParser.MINUS - 64)) | (1 << (MSSQLParser.NOT_OP - 64)) | (1 << (MSSQLParser.OPEN_BRACE - 64)))) != 0):

AFAICS the only variable in there is _la. Machine-generated code is famous for breaking compilers/interpreters, and I think this may be a possibility.
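That suspicion is easy to check in isolation. A minimal sketch of the bitset membership test ANTLR emits, with made-up token constants standing in for the generated MSSQLParser.* values, timed with the stdlib:

```python
import timeit

# Hypothetical token type values; the real ones come from the generated parser.
RETURN, TRY, LAMBDA, NOT, NAME = 5, 9, 12, 17, 40

# Precomputed bitset mirroring the (1 << T) | ... constants in the
# generated condition.
EXPR_START = (1 << RETURN) | (1 << TRY) | (1 << LAMBDA) | (1 << NOT) | (1 << NAME)

def in_expr_start(_la):
    # The shape ANTLR emits: a range check plus one shift-and-mask.
    return (_la & ~0x3F) == 0 and ((1 << _la) & EXPR_START) != 0

# A million evaluations finish quickly on CPython, consistent with the
# later finding in this thread that these expressions are cheap.
elapsed = timeit.timeit(lambda: in_expr_start(NOT), number=1_000_000)
```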

2nd thought is garbage collection. I'd like to know how much GC is being done.

My suggestion is: yes, this may be an ANTLR issue indirectly, but my primary suspect would be Python itself. I suggest you take this up in a Python newsgroup and ask people with Python profiling and internals experience to have a look.

Sorry I can't suggest anything else, very busy ATM.

@jimidle
Collaborator

jimidle commented Jun 26, 2016

If you need speed, then you should consider changing targets. You are flogging a dead horse trying to get performance from Python. If you just need it to be a little faster, then by all means pursue it.

Jim


@SimonStPeter

@jimidle: to say that is to say that the Python target in ANTLR is functionally unusable, in which case it should be deprecated ASAP to stop other people from going down that blind alley.
However, I repeat, I don't believe it. Python is slow but not that slow. Nowhere near. Something else is wrong.
I'd love to dig into it, but I'm starting a new job tomorrow and I'm going to be snowed under for the foreseeable future.
@pavelvelikhov: please see my suggestion about taking it up with Python experts.

@ericvergnaud
Contributor

Hi,
Many people use Python as a wrapper around C++ libraries, so they get the feeling that it's not that slow.
Pure Python is 20-30 times slower than Java; that's not my claim, it's what benchmarks say.
As we've seen recently, there is a possibility that the Python runtimes don't behave like the Java runtime in terms of performance, due to bugs which prevent algorithmic optimizations.
Add to that the fact that the current optimizations were tuned for Java and C#, and there is also a possibility that they hit slow parts of Python.
It will just take time to identify specific patterns that need further Python-specific implementations.
That said, there are plenty of good use cases for a not-so-fast Python parser. People who choose Python in the first place are not looking at performance as a key driver.
So no reason to deprecate it.
Eric


@jimidle
Collaborator

jimidle commented Jun 27, 2016

As Eric says, getting a target to work perfectly does take time. But even if it were perfect, it would not perform like the other targets. There are lots of good uses for Python, and many tasks are not driven by performance, though I feel that Python will get better in this regard as time goes on.


@pinaraf

pinaraf commented Jun 27, 2016

Hi

Do you have a simple archive to download that contains everything needed to reproduce the issue?
To rule out the "Python is slow" argument, I suggest you also try with PyPy. After a few (~100) iterations, once the JIT kicks in, if performance is not significantly better, then there may be something wrong in the ANTLR runtime or generated code. But I agree with SimonStPeter: Python being slow should not account for a performance problem of this magnitude; otherwise a bug report should be filed against the CPython interpreter.
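A sketch of that experiment; the workload lambda below is a stand-in for the real call into the generated parser (hypothetical, e.g. `lambda: parse_source(text)`). On PyPy, comparing early and late iterations shows whether JIT warm-up pays off:

```python
import time

def time_iterations(fn, n):
    """Run fn() n times and return each iteration's wall-clock duration.

    On PyPy the first iterations include JIT warm-up; if the later
    timings are not much lower, the JIT is not helping this code.
    """
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in workload; replace with the real parse call.
timings = time_iterations(lambda: sum(i * i for i in range(10_000)), 100)
warmup, steady = timings[0], min(timings[-10:])
```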

@pavelvelikhov
Author

Hi!

You can grab the whole thing from the git repository: https://github.com/pavelvelikhov/pythonql
Don't have an archive, but git is pretty friendly.

I'll try a few things over the weekend (e.g. removing all the grammar added to the original Python3 grammar) and I'll add a launcher with a profiler.

@pavelvelikhov
Author

Hi!

I've taken out all of my productions that were added on top of the Python3 grammar from the official ANTLR4 grammar repository. The speed is about the same, so the original Python3 grammar parses very slowly as well.

@SimonStPeter

Regarding my earlier thought that the slowness might be caused by the giant expressions: I stripped one down and iterated it 1000 times, updating _la in the loop to prevent result caching, and it took under a second, so it's not that. I'm now even more suspicious of garbage and GC.

@thisiscam

I'm encountering performance issues with ANTLR's Python target as well. I've run my grammar on CPython and PyPy; both were at least a thousand times slower than the Java counterpart.

That drives me to the conclusion that the problem is probably still the inefficiently translated code or the Python runtime code.

As one of the previous comments suggested, people in the Python community who care about speed usually use C/C++ extensions. Considering that we already have a C++ ANTLR target, an ANTLR Python target built as a C++ extension would be very practical and far more useful than the current pure-Python target. I can get hands-on contributing when I have some time later, but for now I'll leave the idea here for anyone who might be interested.

@parrt
Member

parrt commented Dec 10, 2016

@pavelvelikhov hi! heh, can you check again due to recent fix? #1441

@pavelvelikhov
Author

pavelvelikhov commented Dec 13, 2016 via email

@millergarym
Contributor

@pavelvelikhov take a look at the last comment in #1540. I would be interested to know if it's a similar issue.

@parrt
Member

parrt commented Mar 14, 2017

@pavelvelikhov still an issue? I'd like to know if I can close.

@pavelvelikhov
Author

Hi Terence, I checked out an older version of pythonql; a simple test still takes ~7 sec to parse, while the same test parses in about 100 ms in PLY.

@parrt
Member

parrt commented Mar 15, 2017

dang. ok, i'm leaving this open.

@ricardojuanpalmaduran

Is there any progress on this? I'm currently using ANTLR4 to parse some SQLite files (my SQLite grammar comes directly from the ANTLR4 website), and it takes a few seconds to parse the simplest of files.

@ricardojuanpalmaduran

@ericvergnaud is there any plan to improve the performance of the Python parser? I'm curious why PLY would take 100 ms on an example where ANTLR4 takes 7 s, as pointed out by @pavelvelikhov. Such a big difference is difficult to attribute to Python alone.

You recommended tuning grammars to be more Python-friendly. Is there any advice on how to do that? I'm currently using the SQLite grammar you provide on the website, and parsing a simple SQLite file with a few tables takes seconds.

Thanks.

@ericvergnaud
Contributor

Hi,
it's easy to attribute the slowness to Python alone because the exact same algorithm behaves much better in Java, C#, and even JavaScript! This is unfortunately borne out by profilers, which show the exact same number of calls to core parts of the algorithm for a given grammar and input.
Re PLY, a fundamental difference between ANTLR and other parser generators is support for left recursion, which lets developers express grammars in a natural way, as opposed to traditional parser generators.
This comes at a cost: there is an infinite number of valid token sequences, which therefore cannot be precomputed (with the corresponding code pre-generated). Instead they are discovered and registered at runtime. This may seem illogical, but it works because every input is finite.
So beyond the natural slowness of Python, there is the cost of this unique algorithm, where simplicity of use is bought with a performance penalty.
I can think of many ways to improve Python runtime performance, such as optionally dropping left recursion and/or using more optimal data structures (no idea which ones...), but any of these would come at the cost of dropping cross-target compatibility, which is the very reason I provided these runtimes in the first place, so I'm not keen to do this myself.
More broadly, I would tend to argue that anybody looking for performance should not use Python in the first place, but I appreciate this is very defensive, since I have not been able to improve performance to a level which competes with the Java version or PLY.
Re how to improve a given grammar, I would simply profile the parser and count the number of calls to ParserATNSimulator.closure_, which is where parsing decisions are made. Reducing the number of alternatives for a given rule is the way to improve performance, and you can measure the effect by counting those calls (FYI, on all grammars and inputs I've tested, the number of such calls is exactly the same for Java and all the targets I have implemented).
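A minimal sketch of that measurement using the stdlib profiler. The workload here is a stub; with ANTLR you would profile your generated parser's entry rule and look up the `closure_` row:

```python
import cProfile
import io
import pstats

def count_calls(workload, func_name):
    """Profile workload() and return the total call count of every
    profiled function named func_name (e.g. 'closure_')."""
    prof = cProfile.Profile()
    prof.enable()
    workload()
    prof.disable()
    stats = pstats.Stats(prof, stream=io.StringIO())
    total = 0
    # stats.stats maps (filename, lineno, funcname) -> (cc, nc, tt, ct, callers),
    # where nc is the total number of calls.
    for (filename, lineno, name), (cc, nc, tt, ct, callers) in stats.stats.items():
        if name == func_name:
            total += nc
    return total

# Stand-in for the ATN simulator method, just so the demo is runnable;
# in practice the calls come from the antlr4 runtime.
def closure_():
    return 1

n = count_calls(lambda: [closure_() for _ in range(5)], "closure_")
```

Comparing this count before and after a grammar change gives the metric Eric describes, independent of wall-clock noise.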

@ericvergnaud
Contributor

Some simple potential optimisations for SQLite would be as follows:

  • separate the lexer and parser so you can better identify reusable grammar fragments
  • drop the alpha fragments such as fragment A : [aA];, and replace the keyword definitions accordingly: K_ABORT : A B O R T; should read K_ABORT : 'ABORT'; (you can support case insensitivity using a TokenStreamRewriter and converting IDENTIFIER if required)
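A library-agnostic sketch of the case-insensitivity idea: present uppercased characters to the lexer while leaving the original text intact for tokens. With the Python runtime this is commonly done by subclassing the ANTLR InputStream and overriding LA; the standalone class below only shows the shape of that override (the names are illustrative, not the runtime's API):

```python
class UppercasingStream:
    """Character stream that reports uppercased code points via LA()
    while keeping the original text for token extraction."""

    EOF = -1

    def __init__(self, text):
        self.text = text   # original text, preserved for tokens
        self.index = 0     # current position, as in ANTLR streams

    def LA(self, offset):
        # 1-based lookahead, mirroring ANTLR's InputStream.LA contract.
        pos = self.index + offset - 1
        if pos < 0 or pos >= len(self.text):
            return self.EOF
        return ord(self.text[pos].upper())

stream = UppercasingStream("abort")
first = chr(stream.LA(1))  # the lexer would see 'A'; stream.text is unchanged
```

This lets the keyword rules match single-case literals like 'ABORT' while user-visible token text keeps its original casing.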

@ricardojuanpalmaduran

I've removed all the fragments and hardcoded the values of the keywords, and it seems this does not have a significant impact on the number of calls to closure_. To give some numbers: with the change, the number of calls as reported by cProfile is '955807/13579', while without it the number is '1016393/13928' (a ~6% reduction).

Any other advice off the top of your head that might improve the performance of this grammar in Python?

As you pointed out, the number of calls to core methods of the parsing algorithm seems to be the same in both Java and Python. It seems that unless there's a significant improvement to the underlying parsing algorithm (in which case all runtimes would benefit, but I assume that's unlikely to happen), the Python runtime can't do much.

@parrt
Member

parrt commented Jan 4, 2019

I wonder if a cython version of the runtime would help much?

@ricardojuanpalmaduran

ricardojuanpalmaduran commented Jan 4, 2019

I wonder if a cython version of the runtime would help much?

I was wondering the same. I'm not familiar with Cython (I just learned about it two days ago). If I understood the principle correctly, it works well as long as it's capable of compiling things down to C, so the question is how much of the runtime could actually end up compiled as C.

From what I've read, providing type information (with Cython syntax) plays a big role there, but how much can that help with things like managing Python objects? For instance, if part of the runtime's slowness comes from the fact that we're dealing with Python classes, can Cython do some magic that turns those into C code? From what I understood of http://docs.cython.org/en/latest/src/tutorial/cdef_classes.html, that is not the case unless we add Cython syntax to the runtime.

@sharwell
Member

sharwell commented Feb 4, 2019

Is anyone able to break down the profiled performance of adaptivePredict further? ANTLR 4 has a bunch of implementation subtleties that are easy to translate incorrectly from one language (Java) to another. These have impacted various targets in the past, and the more specific we can get with profiler data, the easier it will be to figure out.

@SimonSntPeter

SimonSntPeter commented Feb 4, 2019

@sharwell: I've been looking at the code and the ALL(*) paper, and I'm badly out of my depth, so I've just been experimenting to get an intuitive feel, hence the above. But yes, I've been suspicious of that for a while: closureCheckingStopState and closure_ are getting a mad number of calls, so I've been wondering whether the DFA cache is working (for a 26k input file, sample run: 6.9 million calls in 2.6 secs, roughly 300,000 each for closureCheckingStopState and closure_, and they seem particularly expensive, if I'm reading this right).
Not knowing the code, the ALL(*) algorithm, or how they relate, and parsing not being my area, it's uphill. I'll try to get some further improvement for the users by tweaking the grammar, then I'll start seriously prodding the underlying code to understand it.

@sharwell
Member

sharwell commented Feb 4, 2019

@SimonSntPeter Any semantic predicate at the beginning of any lexer rule will disable the lexer DFA with overwhelming performance impact. The grammar indicated at the beginning of this topic contains such a predicate here:

https://github.com/pythonql/pythonql/blob/8f77df73ba25f5db2f15c7d430d6c75a9822a8d3/PythonQL.g4#L759-L764

It's unlikely anything will make much of a difference without relocating that predicate (to somewhere after the first character) or eliminating it (via modes).

@SimonSntPeter

@sharwell I didn't know that, and thanks, but the grammar here is an SQLite grammar, kindly bundled for me by @ricardojuanpalmaduran (see #1219 (comment) above), and the only non-ANTLR stuff it has is some error reporting, not any predicates AFAICS. I just commented that error bit out and it runs the same.
Playing with the grammar, I can make it run faster by deleting bits, but I just can't see why. It doesn't behave the way a handwritten RD parser would be expected to, and its behaviour is opaque to me, so I'm throwing in the towel at this level and will try to understand the code it outputs.
I'd appreciate some help understanding bits of the ALL(*) paper if that's possible? Basic stuff.
FYI, per the Python profiler, for a run as mentioned above, of the full 3.6 secs adaptivePredict is called 5991 times and takes about 3 seconds.
That's after my 'patch'; before it, the run was 14.9 secs overall and adaptivePredict had 5992 calls accounting for 14.4 secs.

@sharwell
Member

sharwell commented Feb 4, 2019

@SimonSntPeter Thanks, that makes sense. The number of calls to adaptivePredict isn't cause for concern. This is an extremely heavily-used method, even in DFA scenarios. Can you break down the 14.4 seconds by which method(s) adaptivePredict calls?

@SimonSntPeter

Sure. The calls seem to be, for the 14.9 sec original run, where cumtime is cumulative time in that function and all functions it calls, per the Python profiler:

adaptivePredict (5,992 calls, cumtime 14.4 secs)
-- which calls --
execATN (5,992 calls, cumtime 14.3 secs)
-- which calls --
execATNWithFullContext (1,431 calls, cumtime 12.6 secs)
-- which calls --
closure (24,248 calls, cumtime 12.1 secs)
-- which calls --
closureCheckingStopState (1,445,631 calls (24,248 of these recursive), cumtime 12.04 secs)
-- and also --
closure_ (1,378,886 calls (24,248 recursive), cumtime 12.01 secs)

The sequence of calls are what I found tracing through manually.

There's also the following but I don't know where they fit in:
computeReachSet (7,459 calls, cumtime 11.8 secs)
getEpsilonTarget (1,817,881 calls, cumtime 4.2 secs)
computeStartState (468 calls (364 recursive), cumtime 1.1 secs)
ruleTransition (199,361 calls, cumtime 1.1 secs)
hashCodeForConfigSet (483,798 calls, cumtime 0.8 secs)
PredictionContext.merge (90,318 calls (82783 recursive), cumtime 0.74 secs)
PredictionContext.create (209,286 calls, cumtime 0.7 secs)
PredictionMode.getConflictingAltSubsets (7,422 calls, cumtime 0.55 secs)

After that what's not grammar rules is mostly lexer overheads.

I'm not familiar with the Python profiler, so if anything looks implausible, shout and I'll double-check. I'd trust the call counts a bit more than the reported times.

On UK time so will pick up tomorrow.

@SimonSntPeter

NB if you look at @ricardojuanpalmaduran's bundled code, the SQL being parsed is basically two things, a create table statement and a create view statement, each copied and pasted a dozen times, so once you've parsed one of each the rest should be trivial as far as the adaptive machinery goes (= lookups straight from the DFA cache?).

@amykyta3
Contributor

amykyta3 commented Jan 9, 2020

Hey everyone. Posting here since it may be useful to people that find this issue thread.

@parrt and @thisiscam's comments about using Cython inspired me to put together an accelerator module for ANTLR.
It isn't implemented using Cython; instead it uses the ANTLR C++ target as a Python extension. Lexing and parsing are done exclusively in C++, and then an auto-generated visitor is used to rebuild the resulting parse tree in Python. Initial tests show a 5x-25x speedup depending on the grammar and input, and I have a few ideas on how to improve it further.

Here is the code-generator tool: https://github.com/amykyta3/speedy-antlr-tool
And here is a fully-functional example: https://github.com/amykyta3/speedy-antlr-example

It is still a work in progress, and I hope to fill in more documentation details this weekend.

@JameelNabbo

Is this issue sorted?

@parrt
Member

parrt commented Jun 2, 2020

@JameelNabbo nope, not as far as i know.

@data-harmonization

I have a similar behaviour with a query of mine. Unfortunately I'm not allowed to share my g4 file, but I hope the following explanation helps pinpoint where the issue is likely to be.

Behaviour
My input file has one statement of 1246 lines and 9642 tokens that I use as a test case. When I paste the command into the ANTLR Preview in PyCharm (ANTLR plugin), I get results in less than 10 seconds (I assume the Java parser is used here). When I read the same command using the ANTLR BufferedTokenStream and the Python3 ANTLR parser, the following RecursionError is raised after roughly 5 minutes.

    self.closure_(config, configs, closureBusy, collectPredicates, fullCtx, depth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1187: in closure_
    self.closureCheckingStopState(c, configs, closureBusy, continueCollecting, fullCtx, newDepth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1136: in closureCheckingStopState
    self.closure_(config, configs, closureBusy, collectPredicates, fullCtx, depth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1187: in closure_
    self.closureCheckingStopState(c, configs, closureBusy, continueCollecting, fullCtx, newDepth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1136: in closureCheckingStopState
    self.closure_(config, configs, closureBusy, collectPredicates, fullCtx, depth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1143: in closure_
    configs.add(config, self.mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ATNConfigSet.py:91: in add
    merged = merge(existing.context, config.context, rootIsWildcard, mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:263: in merge
    return mergeSingletons(a, b, rootIsWildcard, mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:324: in mergeSingletons
    parent = merge(a.parentCtx, b.parentCtx, rootIsWildcard, mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:278: in merge
    return mergeArrays(a, b, rootIsWildcard, mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:512: in mergeArrays
    merged = ArrayPredictionContext(mergedParents, mergedReturnStates)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:184: in __init__
    super().__init__(calculateListsHashCode(parents, returnStates))
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:77: in calculateListsHashCode
    h = hash((h, calculateHashCode(parent, returnState)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

parent = <antlr4.PredictionContext.ArrayPredictionContext object at 0x132697af0>
returnState = 1124

    def calculateHashCode(parent:PredictionContext, returnState:int):
>       return hash("") if parent is None else hash((hash(parent), returnState))
E       RecursionError: maximum recursion depth exceeded

/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:72: RecursionError

Source of the issue
While there are only 9,642 tokens, the Python parser reads a total of 341,253 tokens. A visualization of this pattern is shown in this graph:

https://acheron.cloud/wp-content/uploads/2021/07/antlr_python_graph.png

In particular, the function "execATN" in ParserATNSimulator.py was going over the tokens again and again.

Does this information help explain why the ANTLR Python parser is slower than its Java counterpart?

@parrt
Copy link
Member

parrt commented Jul 7, 2021

Hi. Yep, unfortunately Python is not particularly efficient with recursion and does not allow a great deal of depth. Maybe increase it? https://docs.python.org/3/library/sys.html#sys.setrecursionlimit
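For anyone hitting the same RecursionError, a minimal sketch of that suggestion follows; the limit value is an arbitrary example, not a recommendation:

```python
import sys

# CPython's default recursion limit is usually 1000. Deeply nested
# PredictionContext merges in the Python runtime can exceed it.
sys.setrecursionlimit(10_000)  # arbitrary example value

# Note: raising the limit only lifts the interpreter's safety check.
# Setting it far beyond what the C stack can actually hold can crash
# the process with a hard stack overflow instead of a RecursionError.
```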

@data-harmonization
Copy link

data-harmonization commented Jul 13, 2021

Thank you for the suggestion! I will certainly have a look at this.

I'm still a bit flabbergasted by the number of tokens ANTLR visits in the "execATN" function. I'm wondering why, and whether the Java build needs to visit the same number of tokens. If I have some time this coming weekend, I will take a look.

@parrt
Copy link
Member

parrt commented Jul 13, 2021

Yep, lookahead can be HUUUUGE. it's the nature of ALL(*).

@data-harmonization
Copy link

Thank you for the clarification! That explains the high number. Makes a lot more sense now!

@KvanTTT
Copy link
Member

KvanTTT commented May 7, 2022

Performance can be improved for Python using constant folding during generation especially for large grammars, see #3698

@SimonStPeter
Copy link

SimonStPeter commented May 7, 2022 via email

@KvanTTT
Copy link
Member

KvanTTT commented May 7, 2022

Any idea how much it would improve a real grammar if implemented?

I don't know how much, but it should help anyway. The code sits on quite a hot path. The simplest case:

grammar Test;
e: T1 | T2 | T3;
T1: 't1';
T2: 't2';
T3: 't3';

The more alternatives a rule has, the more operations are required. For instance, SQL grammars contain a lot of alternatives.
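To make the options being discussed concrete, here is a hedged sketch (with made-up token type values) of the inline expression the Python target generates today, the hoisted-variable variant, and the fully folded one; all three compute the same membership test:

```python
# Hypothetical token type values, analogous to T1/T2/T3 above.
T1, T2, T3 = 1, 2, 3

def inline_check(la):
    # Shape of the generated code today: the mask is recomputed
    # from shifts and ORs on every call.
    return (1 << la) & ((1 << T1) | (1 << T2) | (1 << T3)) != 0

# Hoisting: the mask is evaluated once into a module-level var,
# but each call still pays a LOAD_GLOBAL to fetch it.
_MASK = (1 << T1) | (1 << T2) | (1 << T3)

def hoisted_check(la):
    return (1 << la) & _MASK != 0

def folded_check(la):
    # Folding: the generator emits the precomputed literal directly.
    return (1 << la) & 0b1110 != 0  # 0b1110 == (1<<1)|(1<<2)|(1<<3)
```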

BTW perhaps easier than folding constant exprs it would be just to hoist them into vars evaluated once:

Vars would also have to be stored somewhere. And vars are not as efficient as constant literals, because Python has no notion of constants.

folding is easy but maybe a bit messy, hoisting should be much easier with the same effect, I guess. Would it help here?

Either option is opaque to the end user because of the bitwise operations, so the most efficient implementation should be chosen (folded constants).

@SimonStPeter
Copy link

SimonStPeter commented May 7, 2022 via email

@KvanTTT
Copy link
Member

KvanTTT commented May 7, 2022

I mean, unless I missed it, might it be a 5% speedup or 20 X speedup in a real grammar?

I'll check on a real grammar after implementation.

Global vars. Stick them at the top. The efficiency is slightly less I agree but hoisting const exprs into vars is very easy to do, so very easy to try out I hope.

Actually, implementing constant folding is also very easy, maybe even easier than vars.

@parrt
Copy link
Member

parrt commented May 7, 2022

Doesn't the python compiler do any kind of optimization?

@parrt
Copy link
Member

parrt commented May 7, 2022

oh, what if we started generating cython?

@KvanTTT
Copy link
Member

KvanTTT commented May 7, 2022

Doesn't the python compiler do any kind of optimization?

I haven't checked, but not all users use CPython. Also, I'm not sure CPython optimizes such cases: Python as a language has no concept of a constant, so the interpreter or compiler has no idea whether a var will be changed. If Cython has constants, then the optimization is possible.

oh, what if we started generating cython?

I don't know how many changes are required for compatibility with Cython, or whether such a library could be used from ordinary Python. Actually, I know almost nothing about Cython.

@pinaraf
Copy link

pinaraf commented May 8, 2022

Doesn't the python compiler do any kind of optimization?

Where we go, there is no compiler… :)

The only «compiler» step is the conversion of source code to bytecode, and the bytecode keeps everything; there is almost no optimization possible. In the previously mentioned no_folding_const_literals_test function, CPython 3.10 will optimize it to return 131070. Otherwise, since the interpreter doesn't know what is a constant and what is a variable, it cannot optimize away the variable dereferences, and thus no_folding_const_refs_test has a whopping 128-byte body (instead of 4) and does, for each part, LOAD_CONST, LOAD_GLOBAL, BINARY_LSHIFT. LOAD_GLOBAL has to go through the global dictionary, thus an «expensive» call.
Maybe the "faster cpython" project will help later, cf https://lwn.net/SubscriberLink/893686/c047ba1f808dd27d/ for the latest report on this. While waiting for this, well… folding, replacing const literals, or using Cython are some of the only options.

@KvanTTT
Copy link
Member

KvanTTT commented May 8, 2022

Yet another potential performance improvement: #3703

@KvanTTT
Copy link
Member

KvanTTT commented May 8, 2022

I see a lot of

if token in [PythonQLParser.NOT, PythonQLParser.NONE, PythonQLParser.TRUE, PythonQLParser.FALSE, PythonQLParser.NAME, PythonQLParser.STRING_LITERAL, PythonQLParser.BYTES_LITERAL, PythonQLParser.DECIMAL_INTEGER, PythonQLParser.OCT_INTEGER, PythonQLParser.HEX_INTEGER, PythonQLParser.BIN_INTEGER, PythonQLParser.FLOAT_NUMBER, PythonQLParser.IMAG_NUMBER, PythonQLParser.ELLIPSIS, PythonQLParser.STAR, PythonQLParser.OPEN_PAREN, PythonQLParser.OPEN_BRACK, PythonQLParser.ADD, PythonQLParser.MINUS, PythonQLParser.NOT_OP, PythonQLParser.OPEN_BRACE]:

and

if ((((_la - 41)) & ~0x3f) == 0 and ((1 << (_la - 41)) & ((1 << (PythonQLParser.LAMBDA - 41)) | (1 << (PythonQLParser.NOT - 41)) | (1 << (PythonQLParser.NONE - 41)) | (1 << (PythonQLParser.TRUE - 41)) | (1 << (PythonQLParser.FALSE - 41)) | (1 << (PythonQLParser.NAME - 41)) | (1 << (PythonQLParser.STRING_LITERAL - 41)) | (1 << (PythonQLParser.BYTES_LITERAL - 41)) | (1 << (PythonQLParser.DECIMAL_INTEGER - 41)) | (1 << (PythonQLParser.OCT_INTEGER - 41)) | (1 << (PythonQLParser.HEX_INTEGER - 41)) | (1 << (PythonQLParser.BIN_INTEGER - 41)) | (1 << (PythonQLParser.FLOAT_NUMBER - 41)) | (1 << (PythonQLParser.IMAG_NUMBER - 41)) | (1 << (PythonQLParser.ELLIPSIS - 41)) | (1 << (PythonQLParser.STAR - 41)) | (1 << (PythonQLParser.OPEN_PAREN - 41)) | (1 << (PythonQLParser.OPEN_BRACK - 41)) | (1 << (PythonQLParser.ADD - 41)) | (1 << (PythonQLParser.MINUS - 41)) | (1 << (PythonQLParser.NOT_OP - 41)) | (1 << (PythonQLParser.OPEN_BRACE - 41)))) != 0):

in the generated PythonQLParser.py by @pavelvelikhov. I believe the suggested improvements might improve parser performance (though it looks like the author no longer uses ANTLR).
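As a concrete illustration of what folding would do to the second pattern, here is a hedged, cut-down sketch with only three of the tokens and made-up token type values; both functions are equivalent, but the folded one ships the mask as a single literal:

```python
# Hypothetical token type values for the cut-down example.
LAMBDA, NOT, NONE = 41, 42, 43

def unfolded(_la):
    # Shape of the code generated today: the mask is rebuilt on each call.
    return ((_la - 41) & ~0x3f) == 0 and ((1 << (_la - 41)) &
        ((1 << (LAMBDA - 41)) | (1 << (NOT - 41)) | (1 << (NONE - 41)))) != 0

def folded(_la):
    # (1<<0)|(1<<1)|(1<<2) == 0x7, computable once at generation time.
    return ((_la - 41) & ~0x3f) == 0 and ((1 << (_la - 41)) & 0x7) != 0
```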

@amykyta3
Copy link
Contributor

amykyta3 commented May 8, 2022

@KvanTTT Ive put together a post-processing script that attempts optimizing the constant shift operations you pointed out in #3698
See: https://gist.github.com/amykyta3/8285559e95074c4431c2836d78b36530
This script patches a *Parser.py file and replaces these constant expressions with literals.

I tried it on an existing grammar in one of my projects and didn't see a huge improvement: about a 2% speedup, but that grammar also doesn't branch tokens all that deeply. I wonder whether other grammars would benefit from this more.

EDIT: Added the optimization described in #3703 to the above gist as well

@amykyta3
Copy link
Contributor

amykyta3 commented May 8, 2022

Here's the diff the patch script produces: SystemRDL/systemrdl-compiler@c4d425e

Edited the above comment - I had initially botched my benchmark. Now I'm seeing ~2% improvement in overall parse speed after implementing both optimizations.

@KvanTTT
Copy link
Member

KvanTTT commented May 8, 2022

I wonder if other grammars would benefit from this more.

Not sure, since other targets have constants that can be effectively folded during compilation. They also have switch statements, unlike Python.

Edited above comment - I had initially botched my benchmark. Now am seeing ~2% improvement in overall parse speed after implementing both optimizations

Not impressive. Maybe it's better for other grammars or input files. Anyway, I consider it a useful optimization since it also reduces the size of the generated code and eliminates the very long lines that are confusing.

@KvanTTT
Copy link
Member

KvanTTT commented May 8, 2022

Here's the diff the patch script produces: SystemRDL/systemrdl-compiler@c4d425e

Have you tried an array (or set) of literals, e.g. token in {0, 1, 2, 3}, instead of (1 << token) & 0x20000000f00000003ffffff000000000 != 0? Not sure the mask is the best implementation, since it looks like big numbers (>64 bits) are used.
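A quick, hedged micro-benchmark of the two membership tests being compared: the >64-bit integer mask from the generated code versus a frozenset of the same token types. It only verifies equivalence and prints timings; which wins may vary by Python version.

```python
import timeit

MASK = 0x20000000f00000003ffffff000000000  # mask from the generated code

# Recover the set of token types the mask encodes.
TOKENS = frozenset(i for i in range(MASK.bit_length()) if (MASK >> i) & 1)

def mask_test(tok):
    # Big-int shift and AND, as in the generated parser.
    return (1 << tok) & MASK != 0

def set_test(tok):
    # Hash lookup in a frozenset of token types.
    return tok in TOKENS

# The two tests must agree for every token value.
assert all(mask_test(t) == set_test(t) for t in range(200))

print("mask:", timeit.timeit(lambda: mask_test(100), number=100_000))
print("set: ", timeit.timeit(lambda: set_test(100), number=100_000))
```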
