Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build regexes only once #133

Closed

Conversation

mvorisek
Copy link
Contributor

@mvorisek mvorisek commented Jun 23, 2024

Huge speedup when the Tokenizer is created many times (not cached/reused in user app).

speedup non-first Tokenizer bootstrap
@mvorisek mvorisek force-pushed the fix_perf_build_regexes_once branch from 993d0e0 to 5d88fec Compare June 24, 2024 07:09
@mvorisek mvorisek marked this pull request as ready for review June 24, 2024 07:10
@derrabus derrabus added the enhancement New feature or request label Jun 24, 2024
@derrabus
Copy link
Member

Huge speedup when the Tokenizer is created many times (not cached/reused in user app).

Is there a reason why I would want to destroy and recreate the Tokenizer all the time?

@mvorisek
Copy link
Contributor Author

Is there a reason why I would want to destroy and recreate the Tokenizer all the time?

See https://github.com/atk4/data/blob/6fbe7e23a4/src/Persistence/Sql/Expression.php#L437 - the regex build is quite slow, so if someone use it like this, this PR will help.

@derrabus
Copy link
Member

I'd rather fix the code you've linked, tbh.

@mvorisek
Copy link
Contributor Author

Sure, we already did, and let's fix it here too.

@derrabus
Copy link
Member

No.

@greg0ire greg0ire closed this Jun 24, 2024
@mvorisek
Copy link
Contributor Author

@greg0ire can you please reopen this PR - I did a test with:

for ($i = 0; $i < 100_000; $i++) {
    (new Tokenizer())->tokenize('select 1');
}

and the speed difference is 500x - currently, single Tokenizer bootstrap takes ~1.5 ms!

Creating the regexes from huge string lists is quite slow process and even when the Tokenizer is reused a few times, this PR imrpoves real world performance by a big amount!

Imagine usecases like:

  • the Tokenizer used in many user libs within a single app - impossible to be fixed by user
  • the Tokenizer instanciated in some class that is instanciated many times too - I belive we do not want to recommend assigning the Tokenizer/SqlFormatter into static property because of this
  • SqlFormatter created with multiple highligters - https://github.com/doctrine/sql-formatter/blob/1.4.0/src/SqlFormatter.php#L36 - also impossible to be fixed by user

@greg0ire
Copy link
Member

greg0ire commented Jun 25, 2024

None of these use cases seem realistic.

@mvorisek
Copy link
Contributor Author

mvorisek commented Jun 25, 2024

Please, the improvement is really huge and (new SqlFormatter())->format($query) syntax is very common, actually the popular doctrine/migrations does use it itself - https://github.com/doctrine/migrations/blob/3.7.4/lib/Doctrine/Migrations/Generator/SqlGenerator.php#L58

I hope this provides strong evidence so let me rephrase the question differently, what is againt this PR - the class is final, so the regexes can be easily hold within static property.

@derrabus
Copy link
Member

actually the popular doctrine/migrations does use it itself

We should indeed fix that as well.

Sorry, I certainly don't want a static cache here. And honestly, if you put the tokenizer on a hot path and create and destroy 100k instances of that class instead of reusing a single instance, you had it coming.

@greg0ire
Copy link
Member

greg0ire commented Jun 25, 2024

actually the popular doctrine/migrations does use it itself

We should indeed fix that as well.

Even then, I doubt anybody will notice the resulting performance improvement. Maybe generating the initial migration of a project will get slightly faster, but after that… meh.

@mvorisek
Copy link
Contributor Author

@greg0ire the speed difference is 500x and although the migratins speedup might be only ~100 ms, as long as someone would use it for web, even if for debug only, the unwanted slowdown should be avoided - please, is there really anything againts this PR?

@greg0ire
Copy link
Member

as long as someone would use it for web

What do you mean by that? Do people generate migrations through a web interface nowadays?

@mvorisek
Copy link
Contributor Author

In atk4, we use it to format SQL debug/log queries, witch this feature (and not reusing the SqlFormatter instance), the slowdown can be easily the whole page render.. So given how easy but powerful this PR is, I do not see a major problem by making it fast when not used perfectly.

@derrabus
Copy link
Member

I do not see a major problem

I do see a major problem with static caches. This is a no, please stop wasting our time.

@mvorisek mvorisek deleted the fix_perf_build_regexes_once branch June 25, 2024 21:11
@mvorisek
Copy link
Contributor Author

OK - because of memory kept allocated?

@derrabus
Copy link
Member

actually the popular doctrine/migrations does use it itself

We should indeed fix that as well.

doctrine/migrations#1436

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants