This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

Predict 1.9-4.2x faster #1341

Closed · wants to merge 13 commits

Conversation

@kpu (Contributor) commented Jun 22, 2023

I made prediction 1.9x to 4.2x faster than before.

Motivation

I want to use https://tinyurl.com/nllblid218e and similarly parametrized models to run language classification on petabytes of web data.

Methodology

The costliest operation is summing the rows for each model input. I've optimized this in three ways:

  1. addRowToVector was a virtual function call for each row. I've replaced this with one virtual function call per prediction by adding averageRowsToVector to the Matrix interface.
  2. Vector and DenseMatrix were not 64-byte aligned, so the CPU was doing a lot of unaligned memory accesses. I've brought in my own vector replacement that does 64-byte alignment.
  3. I wrote averageRowsToVector with intrinsics for common vector sizes. This works on SSE, AVX, and AVX512F.

See the commit history for a breakdown of the speed improvement from each change; a sketch of the interface change follows below.
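To make the first change concrete, here is a minimal sketch of the idea (with hypothetical signatures; the real fastText classes differ in detail). Instead of one virtual addRowToVector dispatch per input row, Matrix gains a single virtual call per prediction, and DenseMatrix averages all rows in one tight, inlinable loop:

#include <cstdint>
#include <vector>

using Vector = std::vector<float>;  // stand-in for fastText's Vector

class Matrix {
 public:
  virtual ~Matrix() = default;
  // Old hot path: one virtual dispatch per input row.
  virtual void addRowToVector(Vector& x, int32_t i) const = 0;
  // New hot path: one virtual dispatch per prediction.
  virtual void averageRowsToVector(Vector& x,
                                   const std::vector<int32_t>& rows) const = 0;
};

class DenseMatrix : public Matrix {
 public:
  DenseMatrix(int64_t m, int64_t n) : m_(m), n_(n), data_(m * n) {}

  void addRowToVector(Vector& x, int32_t i) const override {
    const float* row = &data_[i * n_];
    for (int64_t j = 0; j < n_; j++) x[j] += row[j];
  }

  // Sum every requested row, then scale once; the inner loop can be
  // inlined and vectorized with no per-row dispatch.
  void averageRowsToVector(Vector& x,
                           const std::vector<int32_t>& rows) const override {
    if (rows.empty()) return;
    for (int32_t i : rows) {
      const float* row = &data_[i * n_];
      for (int64_t j = 0; j < n_; j++) x[j] += row[j];
    }
    const float scale = 1.0f / rows.size();
    for (int64_t j = 0; j < n_; j++) x[j] *= scale;
  }

 private:
  int64_t m_, n_;
  std::vector<float> data_;
};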

Experiments

Test set: docs1000.txt.gz, a sample of random documents from https://data.statmt.org/heafield/classified-fasttext/
CPU: AMD Ryzen 9 7950X 16-Core
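The timings below presumably come from something like the following (an assumed invocation of the standard fastText CLI; the PR does not show the exact command, and docs1000.txt.gz would be decompressed first):

$ time ./fasttext predict lid218e.bin docs1000.txt > /dev/null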

Model https://tinyurl.com/nllblid218e with 256-dimensional vectors
Before
real 0m8.757s
user 0m8.434s
sys 0m0.327s

After
real 0m2.046s
user 0m1.717s
sys 0m0.334s

Model https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin with 16-dimensional vectors
Before
real 0m0.926s
user 0m0.889s
sys 0m0.037s

After
real 0m0.477s
user 0m0.436s
sys 0m0.040s

kpu added 3 commits June 22, 2023 10:35
Previously: one virtual function call per input
Now: one virtual function call per prediction
(At least as far as the Matrix is concerned).

Prediction time on 1000 documents:

Previously:
real    0m8.757s
user    0m8.434s
sys     0m0.327s

Now:
real    0m8.500s
user    0m8.169s
sys     0m0.333s

Same predictions.

Performance impact is mild, but it will allow inlining faster code inside DenseMatrix without breaking the quantized version.

This makes memory operations for addition much faster.

Adds https://github.com/kpu/intgemm/blob/master/intgemm/aligned.h for an aligned vector of PODs. Note this is already under the MIT license.

Timed lid218e.bin on 1000 documents. Same predictions.

Previously:
real    0m8.500s
user    0m8.169s
sys     0m0.333s

Now:
real    0m2.546s
user    0m2.215s
sys     0m0.334s
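For illustration, a rough sketch of what such an aligned POD vector provides (simplified; see the linked aligned.h for the real implementation). Aligning to 64 bytes matches both the cache line and the width of an AVX512 register, so loads and stores can take the aligned fast path:

#include <cstddef>
#include <cstdlib>
#include <new>

template <class T>
class AlignedVector {
 public:
  explicit AlignedVector(std::size_t size)
      : size_(size),
        data_(static_cast<T*>(std::aligned_alloc(64, roundUp(size * sizeof(T))))) {
    if (!data_) throw std::bad_alloc();
  }
  ~AlignedVector() { std::free(data_); }
  AlignedVector(const AlignedVector&) = delete;
  AlignedVector& operator=(const AlignedVector&) = delete;

  T* begin() { return data_; }
  T* end() { return data_ + size_; }
  T& operator[](std::size_t i) { return data_[i]; }
  std::size_t size() const { return size_; }

 private:
  // std::aligned_alloc requires the byte count to be a multiple of the alignment.
  static std::size_t roundUp(std::size_t bytes) { return (bytes + 63) / 64 * 64; }

  std::size_t size_;
  T* data_;
};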
Timed lid218e.bin on 1000 documents.  Same predictions.
Using AMD Ryzen 9 7950X 16-Core Processor (which has AVX512F).

Previously:
real    0m2.546s
user    0m2.215s
sys     0m0.334s

Now:
real    0m2.046s
user    0m1.717s
sys     0m0.334s
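As an illustration of the intrinsics change, an AVX variant of the kernel might look like this (a sketch of the technique only; the PR also carries SSE and AVX512F versions for common vector sizes). It assumes rows is non-empty, dim is a multiple of 8, and the data and output are 64-byte aligned, which the aligned vector above guarantees:

#include <immintrin.h>
#include <cstdint>
#include <vector>

void averageRowsAVX(float* out, const float* data, int64_t dim,
                    const std::vector<int32_t>& rows) {
  const __m256 scale = _mm256_set1_ps(1.0f / rows.size());
  for (int64_t j = 0; j < dim; j += 8) {
    __m256 acc = _mm256_setzero_ps();
    for (int32_t i : rows) {
      // Aligned 8-float load; data is the row-major matrix storage.
      acc = _mm256_add_ps(acc, _mm256_load_ps(data + i * dim + j));
    }
    // Multiply the column-chunk sum by 1/n and store it aligned.
    _mm256_store_ps(out + j, _mm256_mul_ps(acc, scale));
  }
}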
@facebook-github-bot (Contributor)
Hi @kpu!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@kpu (Contributor, Author) commented Jun 22, 2023

cc @Celebio

kpu added 2 commits June 22, 2023 12:39
The compiler was complaining about noexcept on Vector(Vector&&) = default;
@plo- commented Sep 1, 2023

Any update @kpu?

Add string_view to dictionary for fast lookup
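The idea, sketched here with an assumed caller and a stub lookup (the declared signature appears in the build errors below), is that lookups can hash a slice of the input without first copying it into a std::string:

#include <cstdint>
#include <string>
#include <string_view>

// Stand-in for the dictionary lookup declared in src/dictionary.h.
int32_t getId(std::string_view word) { return static_cast<int32_t>(word.size()); }

// Tokenize a line and look up each token with zero per-token allocations.
void lookupTokens(const std::string& line) {
  std::size_t start = 0;
  while (start < line.size()) {
    std::size_t end = line.find(' ', start);
    if (end == std::string::npos) end = line.size();
    std::string_view token(line.data() + start, end - start);
    getId(token);  // no temporary std::string constructed
    start = end + 1;
  }
}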
@facebook-github-bot (Contributor)
@kpuatfb has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ZJaume commented Jan 9, 2024

FasterText is failing in the Python module build because Setuptools does not set the C++17 flag, giving errors like:

src/dictionary.h:75:28: error: ‘string_view’ in namespace ‘std’ does not name a type
   75 |   int32_t getId(const std::string_view) const;
      |                            ^~~~~~~~~~~
src/dictionary.h:75:23: note: ‘std::string_view’ is only available from C++17 onwards
   75 |   int32_t getId(const std::string_view) const;
      |                       ^~~
src/dictionary.h:76:28: error: ‘string_view’ in namespace ‘std’ does not name a type
   76 |   int32_t getId(const std::string_view, uint32_t h) const;

Changing the flag in setup.py fixes it:

diff --git a/setup.py b/setup.py
index 50c166b..8aff0b5 100644
--- a/setup.py
+++ b/setup.py
@@ -98,15 +98,14 @@ def has_flag(compiler, flags):


 def cpp_flag(compiler):
-    """Return the -std=c++[11/14] compiler flag.
-    The c++14 is preferred over c++11 (when it is available).
+    """Return the -std=c++17 compiler flag.
     """
-    standards = ['-std=c++11']
+    standards = ['-std=c++17']
     for standard in standards:
         if has_flag(compiler, [standard]):
             return standard
     raise RuntimeError(
-        'Unsupported compiler -- at least C++11 support '
+        'Unsupported compiler -- at least C++17 support '
         'is needed!'
     )


@facebook-github-bot (Contributor)
@kpuatfb merged this pull request in b733943.

vstakhov added a commit to rspamd/fastText that referenced this pull request Feb 19, 2024
Summary:
I made prediction 1.9x to 4.2x faster than before. (The full motivation, methodology, and timings are as in the pull request description above.)

Pull Request resolved: facebookresearch#1341

Reviewed By: graemenail

Differential Revision: D52134736

Pulled By: kpuatfb

fbshipit-source-id: 42067161f4c968c34612934b48a562399a267f3b
sburman added a commit to sageailabs/fastText that referenced this pull request May 22, 2024
* Replace outdated url in the scripts

Summary: Replace outdated url in the scripts

Reviewed By: piotr-bojanowski

Differential Revision: D43464784

fbshipit-source-id: 51a98a9ad5a0939acd0d578126290909a613938b

* Add documentation about Hugging Face integration (facebookresearch#1335)

Summary:
[Word vectors](https://huggingface.co/facebook/fasttext-en-vectors) for 157 languages are now hosted on the Hugging Face Hub as well as the [language identification model](https://huggingface.co/facebook/fasttext-language-identification). (cc ajoulin)

A newer language model [referred to in the NLLB project](https://github.com/facebookresearch/fairseq/blob/nllb/README.md#lid-model) is not mentioned on the official website, so I updated the doc accordingly.

Pull Request resolved: facebookresearch#1335

Reviewed By: Celebio

Differential Revision: D46507563

Pulled By: jmp84

fbshipit-source-id: 64883a6829c68b968acd980ba77a712b8e7a1365

* Migrate "deeplearning/fastText" from LLVM-12 to LLVM-15

Summary:
fbcode is migrating to LLVM-15 for safer and more up-to-date code and new compiler features. All contbuilds in your directory have passed our build test with LLVM-15, and your directory does not host any packages. This diff will migrate it to LLVM-15.

If you approve of this diff, please use the "Accept & Ship" button. If you have a reason why it should not build with LLVM 15, please make a comment and send it back to the author. Otherwise we will land this on Thursday 06/15/2023.

See the [FAQ post](https://fb.workplace.com/groups/llvm15platform010/posts/749154386769776/)! Please also direct any questions to [this group](https://fb.workplace.com/groups/llvm15platform010).

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Reviewed By: meyering

Differential Revision: D46661531

fbshipit-source-id: 7278fbfcadec2392c94efd6deb710bdd5e9280f8

* Del `(object)` from 200 inc deeplearning/aicamera/trainer/utils/metrics.py

Summary: Python3 makes the use of `(object)` in class inheritance unnecessary. Let's modernize our code by eliminating this.

Reviewed By: itamaro

Differential Revision: D48673901

fbshipit-source-id: 3e0ef05efe886b32a07bb58bd0725fa2ec934c14

* deeplearning, dcp (2972240286315620591)

Reviewed By: r-barnes

Differential Revision: D49677606

fbshipit-source-id: ec5b375177586c76ecccb83a29b562bc6e9961f6

* Add pyproject.toml to comply with PEP-518 (facebookresearch#1292)

Summary:
Adds pyproject.toml to comply with PEP-518, which fixes the building of the library by poetry; see python-poetry/poetry#6113. This is a copy of facebookresearch#1270, but I have signed the CLA.

Pull Request resolved: facebookresearch#1292

Differential Revision: D51601444

Pulled By: alexkosau

fbshipit-source-id: 357d702281ca3519c3640483eba04d124d0744b4

* fix compile error with gcc13 facebookresearch#1281 (facebookresearch#1340)

Summary:
Due to [header dependency changes](https://gcc.gnu.org/gcc-13/porting_to.html#header-dep-changes) in GCC 13, we need to include the <cstdint> header.

Pull Request resolved: facebookresearch#1340

Reviewed By: jmp84

Differential Revision: D51602433

Pulled By: alexkosau

fbshipit-source-id: cc9bffb276cb00f1db8ec97a36784c484ae4563a

* Predict 1.9-4.2x faster (facebookresearch#1341)

Summary:
I made prediction 1.9x to 4.2x faster than before. (The full motivation, methodology, and timings are as in the pull request description above.)

Pull Request resolved: facebookresearch#1341

Reviewed By: graemenail

Differential Revision: D52134736

Pulled By: kpuatfb

fbshipit-source-id: 42067161f4c968c34612934b48a562399a267f3b

* deeplearning/fastText 2/2

Reviewed By: azad-meta

Differential Revision: D53908330

fbshipit-source-id: b2215f0522c32a82cd876633210befefe9317d76

* Delete .circleci directory (facebookresearch#1366)

Summary: Pull Request resolved: facebookresearch#1366

Reviewed By: jailby

Differential Revision: D54850920

Pulled By: bigfootjon

fbshipit-source-id: 9a3eec7b7cb42335a786fb247cb16be9ed3c2d59

* this page intentionally left blank

---------

Co-authored-by: Onur Çelebi <celebio@meta.com>
Co-authored-by: Sheon Han <sheon.han@gmail.com>
Co-authored-by: generatedunixname89002005320047 <generatedunixname89002005320047@meta.com>
Co-authored-by: Richard Barnes <rbarnes@meta.com>
Co-authored-by: generatedunixname89002005287564 <generatedunixname89002005287564@meta.com>
Co-authored-by: Chris Culhane <cfculhane@gmail.com>
Co-authored-by: Cherilyn Buren <88433283+NiuBlibing@users.noreply.github.com>
Co-authored-by: Kenneth Heafield <github@kheafield.com>
Co-authored-by: Jon Janzen <jon@jonjanzen.com>