chore: typo fixes #3488

Merged 1 commit on Nov 30, 2023
6 changes: 3 additions & 3 deletions barretenberg/cpp/src/barretenberg/common/thread.cpp
@@ -9,11 +9,11 @@
* The first implementation was `parallel_for_spawning`. You can read a description of each implementation in the
* relevant source file, but parallel_for_spawning is the simplest approach imaginable.
* Once WASM was working, I checked its performance in native code by running it against the polynomials benchmarks.
-* In doing so, OMP outperformed it significantly (at least for FFT algorithims). This set me on a course to try
+* In doing so, OMP outperformed it significantly (at least for FFT algorithms). This set me on a course to try
* and understand why and to provide a suitable alternative. Ultimately I found solutions that compared to OMP with
* "moody" and "atomic_pool" solutions, although they were not *quite* as fast as OMP. However interestingly, when it
* comes to actual "real world" testing (with proof construction), rather than raw benchmarking, most of the solutions
-* performaed about the same, with OMP *actually slightly worse*. So maybe all this effort was a bit redundant.
+* performed about the same, with OMP *actually slightly worse*. So maybe all this effort was a bit redundant.
* Remember to always do real world testing...
*
* My theory as to why OMP performs so much better in benchmarks is because it runs the tests in a very tight loop,
@@ -85,4 +85,4 @@ void parallel_for(size_t num_iterations, const std::function<void(size_t)>& func
// parallel_for_queued(num_iterations, func);
#endif
#endif
-}
\ No newline at end of file
+}
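The thread.cpp comment above contrasts several `parallel_for` strategies. A minimal sketch of the "spawning" approach it calls the simplest imaginable — split the iteration space across hardware threads, spawn, join — might look like this. This is an illustrative reconstruction under stated assumptions, not barretenberg's actual `parallel_for_spawning`:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Spawning-style parallel_for sketch: each thread owns a contiguous chunk of
// the iteration space, so every index is visited exactly once by one thread.
void parallel_for_spawning(size_t num_iterations, const std::function<void(size_t)>& func)
{
    if (num_iterations == 0) {
        return;
    }
    size_t num_threads = std::max<size_t>(1, std::thread::hardware_concurrency());
    num_threads = std::min(num_threads, num_iterations);
    const size_t chunk = (num_iterations + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (size_t t = 0; t < num_threads; ++t) {
        threads.emplace_back([=, &func]() {
            const size_t start = t * chunk;
            const size_t end = std::min(start + chunk, num_iterations);
            for (size_t i = start; i < end; ++i) {
                func(i); // disjoint ranges: no synchronization needed inside
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
}
```

The per-call thread spawn is exactly what the benchmark discussion is about: a pool-based variant ("atomic_pool", "moody") amortizes that cost, which matters in tight benchmark loops far more than in real proof construction.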
6 changes: 3 additions & 3 deletions barretenberg/cpp/src/barretenberg/ecc/groups/wnaf.hpp
@@ -239,12 +239,12 @@ inline uint64_t get_num_scalar_bits(const uint64_t* scalar)
*
* N.B. IN OUR STDLIB ALGORITHMS THE SKEW VALUE REPRESENTS AN ADDITION NOT A SUBTRACTION (i.e. we add +1 at the end of
* the scalar mul algo we don't sub 1) (this is to eliminate situations which could produce the point at infinity as an
-* output as our circuit logic cannot accomodate this edge case).
+* output as our circuit logic cannot accommodate this edge case).
*
* Credits: Zac W.
*
* @param scalar Pointer to the 128-bit non-montgomery scalar that is supposed to be transformed into wnaf
-* @param wnaf Pointer to output array that needs to accomodate enough 64-bit WNAF entries
+* @param wnaf Pointer to output array that needs to accommodate enough 64-bit WNAF entries
* @param skew_map Reference to output skew value, which if true shows that the point should be added once at the end of
* computation
* @param wnaf_round_counts Pointer to output array specifying the number of points participating in each round
@@ -497,4 +497,4 @@ inline void fixed_wnaf_with_restricted_first_slice(uint64_t* scalar,
// }
} // namespace barretenberg::wnaf

-// NOLINTEND(readability-implicit-bool-conversion)
\ No newline at end of file
+// NOLINTEND(readability-implicit-bool-conversion)
8 changes: 4 additions & 4 deletions barretenberg/cpp/src/barretenberg/ecc/pippenger.md
@@ -26,7 +26,7 @@ So, for example, if the most significant 6 bits of a scalar are `011001` (25), w

At the end of each round, we then 'concatenate' all of the buckets into a sum. Let's represent each bucket accumulator in an array `A[num_buckets]`. The concatenation phase will compute `A[0] + 2A[1] + 3A[2] + 4A[3] + 5A[4] + ... = Sum`.

-Finally, we add each `Sum` point into an overall accumulator. For example, for a set of 254 bit scalars, if we evaluate the most 6 significant bits of each scalar and accumulate the resulting point into `Sum`, we actually need `(2^{248}).Sum` to accomodate for the bit shift.
+Finally, we add each `Sum` point into an overall accumulator. For example, for a set of 254 bit scalars, if we evaluate the most 6 significant bits of each scalar and accumulate the resulting point into `Sum`, we actually need `(2^{248}).Sum` to accommodate for the bit shift.

This final step is similar to the 'double and add' algorithm in a traditional scalar multiplication algorithm - we start at the most significant bit slice and work our way down to minimize the number of doublings. At each round, we multiply the overall accumulator by 2^{bit slice size} - by the time we iterate over every round, we will have performed the total required number of doublings for every 'bit slice' that we add into our accumulator.
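The concatenation `A[0] + 2A[1] + 3A[2] + ...` described above does not require any scalar multiplications: a running suffix sum costs roughly two additions per bucket, since bucket `A[i]` gets re-added on every one of the `i + 1` remaining iterations. A sketch with plain integers standing in for curve points (on the curve, each `+` below is a point addition):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Running-sum concatenation: returns A[0] + 2*A[1] + 3*A[2] + ...
int64_t concatenate_buckets(const std::vector<int64_t>& A)
{
    int64_t running = 0; // holds A[i] + A[i+1] + ... + A[last]
    int64_t sum = 0;     // bucket A[i] ends up counted (i + 1) times in total
    for (size_t i = A.size(); i-- > 0;) {
        running += A[i];
        sum += running;
    }
    return sum;
}
```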

@@ -38,7 +38,7 @@ Total run time = 16 * 2^{18} + 16 * 2^{15} = 18 * 2^{18}. So the aggregate numbe

## The problem with Pippenger's algorithm

-As it is currently implemented, each round will iterate over the points to be added, and add each point into one of the round's buckets. Whilst point access is sequential in memory, bucket access is very much not. In fact, if the points being multiplied are from a zero-knowlege proof, bucket access is literally uniformly randomly distributed and therefore presents the worst-case scenario.
+As it is currently implemented, each round will iterate over the points to be added, and add each point into one of the round's buckets. Whilst point access is sequential in memory, bucket access is very much not. In fact, if the points being multiplied are from a zero-knowledge proof, bucket access is literally uniformly randomly distributed and therefore presents the worst-case scenario.

This makes it difficult to parallelize. It is not possible to simply assign threads a section of points to iterate over, because of race conditions when two threads access the same bucket.
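One way to see the fix this document pursues (the summary below mentions a radix sort over the bit-slices): group point indices by their bucket value first, so each bucket's additions can be performed by a single thread with no races. A simplified grouping pass, illustrative only:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Group point indices by bucket (bit-slice) value. After this pass, threads
// can be assigned disjoint ranges of buckets instead of ranges of points,
// eliminating the write races described above.
std::vector<std::vector<size_t>> group_by_bucket(const std::vector<uint32_t>& slices,
                                                 size_t num_buckets)
{
    std::vector<std::vector<size_t>> buckets(num_buckets);
    for (size_t i = 0; i < slices.size(); ++i) {
        buckets[slices[i]].push_back(i); // point i lands in bucket slices[i]
    }
    return buckets;
}
```

The real implementation would use a flat radix sort rather than per-bucket vectors, trading this sketch's allocations for the sort's run time noted in the summary.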

@@ -69,8 +69,8 @@ Drawbacks of this approach:

## Summary

-By restructuring the memory heirarchy of our pippenger algorithm, we can create a parallelizable version of pippenger. This will significantly simplify the logic of our PLONK prover (instead of allocating threads for batches of multi-exponentations, we can multi-thread individual multi-exponentiations, simplifying our thread logic).
+By restructuring the memory hierarchy of our pippenger algorithm, we can create a parallelizable version of pippenger. This will significantly simplify the logic of our PLONK prover (instead of allocating threads for batches of multi-exponentations, we can multi-thread individual multi-exponentiations, simplifying our thread logic).

This will concretely reduce the number of pippenger rounds of our multi-exponentiations by approximately 1, giving a theoretical 15% speed-up. Some of this will be eaten by the run-time of the radix sort.

-Longer term, this parallelizable algorithm will be significantly easier to adapt for GPUs, using OpenCL.
\ No newline at end of file
+Longer term, this parallelizable algorithm will be significantly easier to adapt for GPUs, using OpenCL.
@@ -27,7 +27,7 @@ inline void compute_gen_permutation_lagrange_base_single(barretenberg::polynomia
// here, 'index' refers to an element of our subgroup H
// we can almost use permutation[i] to directly index our `roots` array, which contains our subgroup elements
// we first have to mask off the 2 high bits, which describe which wire polynomial our permutation maps to (we'll
-// deal with that in a bit) we then have to accomodate for the fact that, `roots` only contains *half* of our
+// deal with that in a bit) we then have to accommodate for the fact that, `roots` only contains *half* of our
// subgroup elements. this is because w^{n/2} = -w and we don't want to perform redundant work computing roots of
// unity

@@ -65,7 +65,7 @@ inline void compute_permutation_lagrange_base_single(barretenberg::polynomial& o
// `permutation[i]` will specify the 'index' that this wire value will map to.
// Here, 'index' refers to an element of our subgroup H.
// We can almost use `permutation[i]` to directly index our `roots` array, which contains our subgroup elements.
-// We first have to accomodate for the fact that `roots` only contains *half* of our subgroup elements. This is
+// We first have to accommodate for the fact that `roots` only contains *half* of our subgroup elements. This is
// because ω^{n/2} = -ω and we don't want to perform redundant work computing roots of unity.

size_t raw_idx = permutation[i].subgroup_index;
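The half-size `roots` table that these comments describe relies on the subgroup identity ω^{n/2} = -1: storing `roots[i] = ω^i` only for `i < n/2` suffices, because the top half is just the negation of the bottom half. A sketch of the lookup, demonstrated in the toy field F_17 with n = 8 and ω = 2 (values assumed for illustration — the real code works over the BN254 scalar field):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Look up w^idx from a table holding only the first n/2 powers of w mod p.
uint64_t get_root(const std::vector<uint64_t>& half_roots, size_t idx, uint64_t p)
{
    const size_t half = half_roots.size(); // n / 2 stored entries
    if (idx < half) {
        return half_roots[idx];
    }
    return p - half_roots[idx - half]; // w^idx = -w^(idx - n/2) mod p
}
```

This halves the memory and the root-of-unity precomputation at the cost of one branch and (at most) one field negation per access.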
@@ -390,7 +390,7 @@ template <typename Flavor> class ECCVMCircuitBuilder {
}
}
for (size_t i = 0; i < precompute_table_state.size(); ++i) {
-// first row is always an empty row (to accomodate shifted polynomials which must have 0 as 1st
+// first row is always an empty row (to accommodate shifted polynomials which must have 0 as 1st
// coefficient). All other rows in the precompute_table_state represent active wnaf gates (i.e.
// precompute_select = 1)
polys.precompute_select[i] = (i != 0) ? 1 : 0;
@@ -701,7 +701,7 @@ std::vector<uint32_t> UltraCircuitBuilder_<Arithmetization>::decompose_into_defa
* This is not strictly required iff num_bits <= target_range_bitnum.
* However, this produces an edge-case where a variable is range-constrained but NOT present in an arithmetic gate.
* This in turn produces an unsatisfiable circuit (see `create_new_range_constraint`). We would need to check for
-* and accomodate/reject this edge case to support not adding addition gates here if not reqiured
+* and accommodate/reject this edge case to support not adding addition gates here if not reqiured
* if (num_bits <= target_range_bitnum) {
* const uint64_t expected_range = (1ULL << num_bits) - 1ULL;
* create_new_range_constraint(variable_index, expected_range);
@@ -3454,4 +3454,4 @@ template class UltraCircuitBuilder_<arithmetization::UltraHonk<barretenberg::fr>
// To enable this we need to template plookup
// template class UltraCircuitBuilder_<grumpkin::fr>;

-} // namespace proof_system
\ No newline at end of file
+} // namespace proof_system
@@ -347,7 +347,7 @@ inline void compute_standard_plonk_lagrange_polynomial(barretenberg::polynomial&
// `permutation[i]` will specify the 'index' that this wire value will map to.
// Here, 'index' refers to an element of our subgroup H.
// We can almost use `permutation[i]` to directly index our `roots` array, which contains our subgroup elements.
-// We first have to accomodate for the fact that `roots` only contains *half* of our subgroup elements. This is
+// We first have to accommodate for the fact that `roots` only contains *half* of our subgroup elements. This is
// because ω^{n/2} = -ω and we don't want to perform redundant work computing roots of unity.

size_t raw_idx = permutation[i].row_index;
@@ -40,7 +40,7 @@ namespace keccak_tables {
*
* We need multiple Rho tables in order to efficiently range-constrain our input slices.
*
-* The maximum number of bits we can accomodate in this lookup table is MAXIMUM_MULTITABLE_BITS (assume this is 8)
+* The maximum number of bits we can accommodate in this lookup table is MAXIMUM_MULTITABLE_BITS (assume this is 8)
* For example take a left-rotation by 1 bit. The right-slice will be a 63-bit integer.
* 63 does not evenly divide 8. i.e. an 8-bit table cannot correctly range-constrain the input slice and we would need
* additional range constraints.
@@ -49,7 +49,7 @@ namespace keccak_tables {
* We can stitch together a lookup table sequence that correctly range constrains the left/right slices for any of our
* 25 rotation values
*
-* @tparam TABLE_BITS The number of bits each lookup table can accomodate
+* @tparam TABLE_BITS The number of bits each lookup table can accommodate
* @tparam LANE_INDEX Required by get_rho_output_table to produce the correct MultiTable
*/
template <size_t TABLE_BITS = 0, size_t LANE_INDEX = 0> class Rho {
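The Rho comment's constraint — that a 63-bit slice cannot be range-constrained by 8-bit tables alone because 63 does not divide evenly by 8 — comes down to the table widths having to sum to exactly the slice width; any shortfall would leave bits unconstrained, and any excess would admit out-of-range values. A sketch of the stitching computation (illustrative only; the real Rho tables also encode the rotation itself):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Decompose a slice of `slice_bits` bits into lookup-table widths of at most
// `max_table_bits` each, with one final narrower table covering the remainder,
// so the widths sum to exactly slice_bits.
std::vector<size_t> table_widths(size_t slice_bits, size_t max_table_bits)
{
    std::vector<size_t> widths;
    while (slice_bits > 0) {
        const size_t w = std::min(slice_bits, max_table_bits);
        widths.push_back(w);
        slice_bits -= w;
    }
    return widths;
}
```

For the 63-bit example above this yields seven 8-bit tables plus one 7-bit table, which is exactly the kind of sequence the `TABLE_BITS` template parameter lets the builder stitch together.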
4 changes: 2 additions & 2 deletions barretenberg/cpp/src/barretenberg/smt_verification/README.md
@@ -71,7 +71,7 @@ To store it on the disk just do

`FFTerm` - the symbolic value that simulates finite field elements.

-`FFTerm` - the symbolic value that simulates integer elements which behave like finite field ones. Usefull, when you want to create range constraints or perform operations like XOR.
+`FFTerm` - the symbolic value that simulates integer elements which behave like finite field ones. Useful, when you want to create range constraints or perform operations like XOR.

`Bool` - simulates the boolean values and mostly will be used only to simulate complex `if` statements if needed.

@@ -250,4 +250,4 @@ void model_variables(Circuit<smt_terms::FFTerm>& c, Solver* s, FFTerm& evaluatio
}
```

-More examples can be found in *.test.cpp files
\ No newline at end of file
+More examples can be found in *.test.cpp files
@@ -110,7 +110,7 @@ uint<Builder, Native> uint<Builder, Native>::operator>>(const size_t shift) cons
*
* We have a special selector configuration in our arithmetic widget that extracts 6.b_x from given the two
* relevant accumulators. The factor of 6 is for efficiency reasons. We need to scale our other gate
-* coefficients by 6 to accomodate this.
+* coefficients by 6 to accommodate this.
**/

if ((shift & 1) == 0) {