[stdlib] Faster hashing logic in string-keyed dicts (~x2 speed-up) #3427

Open: msaelices (Contributor) wants to merge 9 commits into base: nightly

@msaelices msaelices commented Aug 28, 2024

This PR makes String.__hash__() faster by no longer calling the builtin hash function. The builtin uses SIMD underneath, and for some reason that slows string hashing down (possibly because of unneeded data conversion or a memcpy). The hashing logic was making Mojo string-keyed Dicts slow compared with Python; with this optimization they are much faster.

For now I'm using a very simplified version of the DJBX33A algorithm, and only for the String struct. If the reviewer thinks the approach is valid, I can refine it.
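For readers unfamiliar with DJBX33A (Daniel J. Bernstein's "times 33, add" hash), here is a minimal Python sketch of the textbook form. It is only an illustration, not the Mojo code in this PR: the function name `djbx33a`, the seed 5381, and the 64-bit mask are assumptions taken from the common formulation, and the PR's version may differ in all three.

```python
def djbx33a(data: bytes, seed: int = 5381) -> int:
    """Textbook DJBX33A: multiply the running hash by 33, then add the next byte.

    The seed (5381) and the 64-bit wrap-around are assumptions from the common
    formulation, not values taken from this PR.
    """
    mask = 0xFFFFFFFFFFFFFFFF  # emulate unsigned 64-bit overflow
    h = seed
    for byte in data:
        h = (h * 33 + byte) & mask
    return h


if __name__ == "__main__":
    for word in ("THE", "the", "hash"):
        print(word, djbx33a(word.encode("utf-8")))
```

The loop does one multiply and one add per byte with no vector setup, which is cheap for the short keys typical of string-keyed dicts.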

See #1747

Note: there is a less disruptive approach in #3436. It should give the same 2x speed-up but, unfortunately, it currently fails to compile, possibly because of compiler issue #3437.

Benchmark 1

Same benchmark used in this other optimization: #3071

from collections import Dict
from time import now
from random import *

alias iteration_size = 128 #2048
def main():
    var result: Int=0
    var start = now()
    var stop = now()

    small2 = Dict[Int,Int]()
    start = now()
    for x in range(100):
        for i in range(iteration_size):
            small2[i]=i
        for i in range(iteration_size):
            result += small2[i]
    stop = now()
    print(stop-start, result)

    small3 = Dict[String,String]()
    start = now()
    for x in range(100):
        for i in range(iteration_size):
            small3[str(i)]=str(i)
        for i in range(iteration_size):
            result += len(small3[str(i)])
    stop = now()
    print(stop-start, result)

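Results (first column is elapsed time in nanoseconds, lower is better; first row is Dict[Int,Int], second row is Dict[String,String])

Before: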

187688 812800
17102347 840200

After:

147619 812800
8381432 840200

This is roughly a 2x speed-up for Dict[String, X] (about 17.1 ms before vs. about 8.4 ms after).
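Since the motivation is the gap with Python, a rough Python counterpart of Benchmark 1 can serve as a baseline. This is a sketch that was not part of the measurements above; the function names and the use of `time.perf_counter_ns` are my own choices, and timings will vary by machine.

```python
import time

ITERATION_SIZE = 128  # same value as iteration_size in the Mojo benchmark


def bench_int_keys(result: int = 0) -> tuple[int, int]:
    """Fill and read a dict with int keys; returns (elapsed_ns, result)."""
    small = {}
    start = time.perf_counter_ns()
    for _ in range(100):
        for i in range(ITERATION_SIZE):
            small[i] = i
        for i in range(ITERATION_SIZE):
            result += small[i]
    return time.perf_counter_ns() - start, result


def bench_str_keys(result: int = 0) -> tuple[int, int]:
    """Fill and read a dict with str keys; returns (elapsed_ns, result)."""
    small = {}
    start = time.perf_counter_ns()
    for _ in range(100):
        for i in range(ITERATION_SIZE):
            small[str(i)] = str(i)
        for i in range(ITERATION_SIZE):
            result += len(small[str(i)])
    return time.perf_counter_ns() - start, result


if __name__ == "__main__":
    elapsed, result = bench_int_keys()
    print(elapsed, result)
    elapsed, result = bench_str_keys(result)
    print(elapsed, result)
```

The second column should match the Mojo output, because `result` depends only on the keys and values and not on the timing.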

Benchmark 2

Results (the last column is the time in seconds, lower is better)

Before:

n_wds,n_keys,the,sec
13638,1944,236,0.0083167180000000007
13638,1944,236,0.008287684
13638,1944,236,0.0084045869999999998
13638,1944,236,0.0083934049999999996
13638,1944,236,0.0083934230000000006
13638,1944,236,0.0084422609999999995
13638,1944,236,0.0084124249999999994
13638,1944,236,0.0084025450000000008
13638,1944,236,0.0084028699999999998
13638,1944,236,0.0083881839999999999

After:

n_wds,n_keys,the,sec
13638,1944,236,0.0038483929999999999
13638,1944,236,0.003731285
13638,1944,236,0.003709426
13638,1944,236,0.0037753890000000001
13638,1944,236,0.0038318750000000002
13638,1944,236,0.0043440989999999997
13638,1944,236,0.0039485559999999998
13638,1944,236,0.0040425610000000001
13638,1944,236,0.0040622460000000003
13638,1944,236,0.0041538590000000002

On average this is roughly a 2.1x speed-up in my tests (about 8.4 ms per run before vs. about 3.9 ms after).

Benchmark code used:
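The speed-up figures can be checked directly from the numbers reported above. The following Python snippet is just that arithmetic; the variable names are mine, and the values are copied from the Before/After listings.

```python
from statistics import mean

# Benchmark 1, Dict[String, String] row (nanoseconds).
bench1_before_ns = 17102347
bench1_after_ns = 8381432

# Benchmark 2, last column of each run (seconds).
bench2_before = [
    0.0083167180000000007, 0.008287684, 0.0084045869999999998,
    0.0083934049999999996, 0.0083934230000000006, 0.0084422609999999995,
    0.0084124249999999994, 0.0084025450000000008, 0.0084028699999999998,
    0.0083881839999999999,
]
bench2_after = [
    0.0038483929999999999, 0.003731285, 0.003709426,
    0.0037753890000000001, 0.0038318750000000002, 0.0043440989999999997,
    0.0039485559999999998, 0.0040425610000000001, 0.0040622460000000003,
    0.0041538590000000002,
]

bench1_speedup = bench1_before_ns / bench1_after_ns
bench2_speedup = mean(bench2_before) / mean(bench2_after)
# Most pessimistic pairing: fastest "before" run against slowest "after" run.
bench2_worst_case = min(bench2_before) / max(bench2_after)

if __name__ == "__main__":
    print(f"Benchmark 1 speed-up: {bench1_speedup:.2f}x")
    print(f"Benchmark 2 mean speed-up: {bench2_speedup:.2f}x")
    print(f"Benchmark 2 worst-case speed-up: {bench2_worst_case:.2f}x")
```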

from collections import List, Dict
from time import now


fn get_wds() raises -> List[String]:
    # String shortened because of the GitHub PR description length limit
    # Taken from this: https://github.com/ekbrown/scripting_for_linguists/blob/main/0a0HuaT4Vm7FoYvccyRRQj.txt
    input = String("""
Hey friends, it's your girl Bray. Enjoy Jolene. Welcome to back to her. If you aspire to heal evolve or revolutionize this podcast is for you. Make sure you subscribe and follow us on Instagram at official back to her...
    """)
    return input.upper().split(" ")


fn get_freqs(wds: List[String]) raises -> Dict[String, UInt64]:
    var freqs = Dict[String, UInt64]()
    for wd_ref in wds:
        wd = wd_ref[]
        if wd in freqs:
            freqs[wd] = freqs[wd] + 1
        else:
            freqs[wd] = 1
    return freqs


fn main() raises:
    var wds: List[String] = get_wds()
    var n_wds = len(wds)

    var out_path = "report.csv"
    with open(out_path, "w") as outfile:
        # Header matches the four columns written per row below
        outfile.write(str("n_wds,n_keys,the,sec\n"))
        for _ in range(10):
            var t0 = now()
            var freqs = get_freqs(wds)
            var t1 = now()
            var duration = (t1 - t0) / 1_000_000_000
            var the = freqs["THE"]
            var n_keys = len(freqs.keys())
            var out_str = str(n_wds) + "," + str(n_keys) + "," + str(the) + "," + str(duration) + "\n"
            outfile.write(out_str)
    print("DONE, saved to", out_path)

My machine:

> lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  20
  On-line CPU(s) list:   0-19
Vendor ID:               GenuineIntel
  Model name:            12th Gen Intel(R) Core(TM) i7-12700H
    CPU family:          6
    Model:               154
    Thread(s) per core:  2
    Core(s) per socket:  14
    Socket(s):           1
    Stepping:            3
    CPU max MHz:         4700,0000
    CPU min MHz:         400,0000
    BogoMIPS:            5376.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch
                         _perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm ss
                         e4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tp
                         r_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 x
                         saves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movd
                         iri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   544 KiB (14 instances)
  L1i:                   704 KiB (14 instances)
  L2:                    11,5 MiB (8 instances)
  L3:                    24 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-19
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Faster String.__hash__() algorithm by not using the builtin `hash`
function, which uses SIMD underneath and, for some reason, slows the
hash logic down. The hashing logic was making Mojo string-keyed dicts
slow compared with Python, and with this optimization they should be
faster now.

See modularml#1747

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices requested a review from a team as a code owner August 28, 2024 21:57