This is a parallelized full adder implemented in C++ using the OpenMP library.
This project aims to implement a fast way of calculating the sum of two binary numbers, and to parallelize that implementation using OpenMP directives in C++.
Adding two n-bit numbers with a full adder requires waiting for the carry generated by the addition of the previous (less significant) bit before adding the current bit.
The following diagram illustrates how the Full Adder works:
As we can see, each adder block has to wait for the Cout (carry out) of the previous adder block.
With this implementation it is impossible to parallelize the calculation, because the Critical Path Length of the dependency graph is of size N.
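To make that dependency concrete, here is a minimal sketch (not the project's code) of a sequential ripple-carry adder in C++; note how each iteration consumes the carry produced by the previous one, which is exactly the chain that blocks parallelization:

#include <vector>

// Sequential ripple-carry addition, LSB-first bit vectors (sketch only).
std::vector<bool> rippleCarryAdd(const std::vector<bool>& a,
                                 const std::vector<bool>& b,
                                 bool c0) {
    std::vector<bool> s(a.size());
    bool c = c0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        s[i] = a[i] ^ b[i] ^ c;                      // sum bit
        c = (a[i] && b[i]) || ((a[i] ^ b[i]) && c);  // carry out, needed by bit i+1
    }
    return s;
}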
The new approach uses the carry-lookahead adder algorithm to reduce the dependency between adder blocks. The main idea is to build another circuit that generates the carry out of any given block using only C0, Ai and Bi.
The following diagram illustrates the idea:
The combinational circuit uses the Generate and Propagate logic to calculate the carry out of any given block.
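For reference, here are the standard per-bit Propagate/Generate definitions and the resulting carry recurrence; they match the formulas listed at the end of this README ($+$ is OR, $\cdot$ is AND, $\oplus$ is XOR):

$$p_i = a_i + b_i,\qquad g_i = a_i \cdot b_i,\qquad c_{i+1} = g_i + p_i \cdot c_i,\qquad s_i = a_i \oplus b_i \oplus c_i$$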
We implemented two different approaches to the carry-lookahead algorithm. The first is recursive: we divide the problem into stages, and each stage contains a different set of binary blocks (portions of the initial 'a' and 'b' to be added). After implementing the recursive approach, we found out that only stage 0 is actually needed to make the addition, so the second approach is iterative: a simple for loop initializes the Propagate and Generate values on one hand, and performs the sum on the other.
We're going to use two matrices to hold the values of Propagate and Generate, in order to avoid calculating the P and G of the same block twice. Here's an example of that logic for two 4-bit numbers a=1010 and b=1100:
Both the P and G matrices will be of size [log(N)+1][N], where N is the number of bits in each number (assuming both numbers have the same number of bits).
Here's the content of those two matrices after executing the program for that same example:
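These values can also be derived by hand from the block-combine rules listed at the end of this README (bits shown MSB first; the exact in-memory layout of the matrices depends on the implementation):

Stage 0 (1-bit blocks): P = 1 1 1 0, G = 1 0 0 0
Stage 1 (2-bit blocks): P = 1 0, G = 1 0
Stage 2 (4-bit block): P = 0, G = 1

For example, the stage-1 block covering the two most significant bits gives Pb = 1 . 1 = 1 and Gb = 1 + (1 . 0) = 1.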
Complexity: todo.
Now that we have the Propagate and the Generate of each block, we can compute the sum of each block independently, without waiting for the carry out of the previous block, using the Carry formulas described here.
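Concretely, once all the $p_i$ and $g_i$ values are known, every carry can be expanded so that it depends only on $c_0$ (the standard carry-lookahead expansion):

$$c_1 = g_0 + p_0 c_0$$
$$c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0$$
$$c_i = \sum_{j=0}^{i-1} \Bigl( g_j \prod_{m=j+1}^{i-1} p_m \Bigr) + c_0 \prod_{m=0}^{i-1} p_m$$

where the sum ($\sum$) is a chain of ORs and the product ($\prod$) is a chain of ANDs.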
The iterative approach follows the same idea as the recursive one, but for stage 0 only. So we're going to use two arrays to hold the values of Propagate and Generate. Here's an example of that logic for two 4-bit numbers a=1010 and b=1100:
Both the P and G arrays will be of size [N], where N is the number of bits in each number (assuming both numbers have the same number of bits).
Here's the content of those two arrays after executing the program for that same example:
So the logic for the iterative approach is basically the same as the recursive one, except that we only use stage 0 blocks, with a simple for loop.
In this section we explain some of the terms used in the code:
bool getFirstPatternVal(int k, int N): returns the value of the following section of the Ci formula:
Here, i=k=4 (in our code). You can learn more about the carry-lookahead carry formulas here.
bool getSecondPatternVal(int k, int c0): returns the value of the following section of the Ci formula:
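In terms of the expansion of Ci shown earlier, the first pattern is the Σ gj · Π pm part and the second pattern is the c0 · Π pm part. Here is a minimal sketch of both helpers under that reading, assuming global stage-0 P and G arrays indexed with bit 0 as the least significant (the repo's actual indexing and bodies may differ):

#include <vector>

std::vector<int> P, G;   // sketch: stage-0 Propagate/Generate values, LSB-first

// First pattern: OR over j of ( G[j] AND P[j+1] AND ... AND P[k-1] ).
bool getFirstPatternVal(int k, int N) {
    (void)N;   // unused in this sketch; the repo may use it for MSB-first indexing
    bool result = false;
    for (int j = 0; j < k; ++j) {
        bool term = G[j];
        for (int m = j + 1; m < k; ++m)
            term = term && P[m];
        result = result || term;
    }
    return result;
}

// Second pattern: c0 AND P[0] AND ... AND P[k-1].
bool getSecondPatternVal(int k, int c0) {
    bool term = (c0 != 0);
    for (int m = 0; m < k; ++m)
        term = term && P[m];
    return term;
}

The carry into bit k is then getFirstPatternVal(k, N) || getSecondPatternVal(k, c0), which depends only on c0 and the P/G arrays.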
Note: if you want to change the arguments, you can do so by editing a, b, and N. Important: you should update N accordingly.
Note: N should be a power of 2.
Now that we have a much more decoupled dependency graph, we can benefit from parallelization using OpenMP directives.
We used OMP tasks to parallelize both the Propagate/Generate initialization and the addition process. Here's a portion of how we implemented it in the code:
// main()
#pragma omp parallel
{
    #pragma omp single   // a single thread spawns the task tree
    {
        init_P_and_G(a, b, log2(N), 0, N-1);
    }
}
...
#pragma omp parallel
{
    #pragma omp single
    {
        full_adder(a, b, c0, s, 0, N-1, log2(N), N);
    }
}
// init_P_and_G()
#pragma omp task shared(P) shared(G)
{
    init_P_and_G(a, b, stage - 1, l, (l+r)/2);        // left half
}
#pragma omp task shared(P) shared(G)
{
    init_P_and_G(a, b, stage - 1, (l+r)/2 + 1, r);    // right half
}
// full_adder()
#pragma omp task shared(s)
{
    full_adder(a, b, c0, s, l, (l+r)/2, stage-1, N);      // left half
}
#pragma omp task shared(s)
{
    full_adder(a, b, c0, s, (l+r)/2+1, r, stage-1, N);    // right half
}
For the iterative approach, we can parallelize in two ways: using parallel for or using tasks.
// init_P_and_G()
#pragma omp parallel for
for (int i = n-1; i >= 0; i--) {
    P[i] = a[i] - '0' || b[i] - '0';   // Propagate: ai OR bi
    G[i] = a[i] - '0' && b[i] - '0';   // Generate:  ai AND bi
}
// full_adder()
...
#pragma omp parallel for
for (int i = N-2; i >= 0; i--) {
    // all bits other than the least significant
    int j = N - i - 1;
    ...
}
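The elided loop body above is where each sum bit is produced. Here is a hedged sketch of what it might compute, using the two pattern helpers described earlier (the MSB-first string indexing and the separate handling of the least significant bit are assumptions, not the repo's exact code):

// Sketch only: a, b and s are MSB-first digit strings; j counts bits from the LSB.
s[N-1] = ((a[N-1] - '0') ^ (b[N-1] - '0') ^ c0) + '0';  // LSB uses c0 directly
#pragma omp parallel for
for (int i = N-2; i >= 0; i--) {
    int j = N - i - 1;
    // cj depends only on c0 and the P/G arrays, so iterations are independent:
    int cj = getFirstPatternVal(j, N) || getSecondPatternVal(j, c0);
    s[i] = ((a[i] - '0') ^ (b[i] - '0') ^ cj) + '0';    // sj = aj XOR bj XOR cj
}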
//TODO
The benchmark.sh file is a bash script that executes the different approaches, both the sequential and the parallel ones, and displays the execution time of each.
However, we still have to update it to flush the cache after each execution, to avoid misleading values.
Here's a sample output of the benchmark.sh script:
Here are some important formulas used in this implementation:
Gb = Gl + (Pl . Gr): the Generate of a block equals the Generate of the left side OR (the Propagate of the left side AND the Generate of the right side)
Pb = Pl . Pr: the Propagate of a block equals the Propagate of the left side AND the Propagate of the right side
P = x + y (x OR y): the Propagate of a 1-bit block (x XOR y also works)
G = x . y (x AND y): the Generate of a 1-bit block
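As a compact illustration, the first two rules can be read as a single combine operation on (P, G) pairs; here is a hypothetical helper (not code from this repo):

struct PG { bool P, G; };

// Combine a left (more significant) block with a right (less significant) one.
PG combine(PG left, PG right) {
    return { left.P && right.P,                  // Pb = Pl . Pr
             left.G || (left.P && right.G) };    // Gb = Gl + (Pl . Gr)
}

Applying this combine repeatedly over pairs of blocks is exactly what fills the stage matrices in the recursive approach.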
If you find it useful, please consider giving it a star :)