Link to home
Start Free TrialLog in
Avatar of saturnian
saturnianFlag for United Kingdom of Great Britain and Northern Ireland

asked on

How do a multiply-accumulator work?

Hi there! I'm trying to fill in a table of a systolic architecture made by 4 PE's (Processing Elements).Each PE has a multiply-accumulator inside.I have done the NxN matrix-multiplication table for it and the first 2 clock cycles of the multiply-accumulator table but I cant figure out the output for the rest of the clock cycles.Anyone has an idea how they're going to be?I'm sending a sketch of the architecture and table.
Thanks
Multiply-accumulator-table.jpg
Avatar of Infinity08
Infinity08
Flag of Belgium image

If I interprete the diagram correctly, then PE0 and PE2 can work in parallel (as well as PE1 and PE3 of course).

So, at clock cycle 1, PE0 and PE2 start calculating a00 x b00 resp. a10 x b00. They then pass in the accumulator, and the result is fed as input into PE1 resp. PE3, which can start working as soon as those results are available.

Similarly, as soon as PE0 and PE2 have finished calculating the first multiplication, they can start with the second : a00 x b01 resp. a10 x b01, etc.

In the end, there will be 4 values as output : 2 from the topmost output line, and 2 from the bottommost output line.
Avatar of saturnian

ASKER

So what should be the results on the clock cycle 3 in the multiplier and in the accumulator on all PE's?

The first to cycles were easy to understand but the following I have to add the previous cycle into the accumulator of the next cycle.Is quite confusing.

A systolic architecture works rhythmically so compute and pass date through a system.I just cant understand is the application of the multiply-accumulator inside the PE. I think that the flow of data should go  rhythmically from PE0 to PE3.

The problem is that I have to follow the order by the NxN matrix multiplication above.
Thanks
Well i notice a error on the fist matrix.Should be a00 and a01 and not a00 twice.
Multiply-accumulator-table-corre.jpg
If this is a normal circuit, then all 4 result values can be available after 2 clock cycles. That is, assuming that the multiplier and the accumulator are not controlled by the clock ... Are they ?


>> The first to cycles were easy to understand but the following I have to add the previous cycle into the accumulator of the next cycle.Is quite confusing.

I don't think what you have is right ... As I said in my previous post : my understanding is that PE0,PE1 and PE2,PE3 work in parallel (there is no flow of data from PE0 to PE2, and no flow from PE1 to PE3).


>> A systolic architecture works rhythmically so compute and pass date through a system.

In this case (again, if my understanding is correct), the rhythm would be one set of b values per clock tick. In the first tick, the input will be b00 and b10, and in the next tick b01 and b11.


>> I think that the flow of data should go  rhythmically from PE0 to PE3.

I don't see any flow between PE0 and PE3 ... Am I missing something ?


>> The problem is that I have to follow the order by the NxN matrix multiplication above.

First understand how the circuit works ... Read my first post again if that helps.
>>>>If this is a normal circuit, then all 4 result values can be available after 2 clock cycles. That is, assuming that the multiplier and the accumulator are not controlled by the clock ... Are they ?

They're not controlled by the clock.

>>>>I don't think what you have is right ... As I said in my previous post : my understanding is that PE0,PE1 and PE2,PE3 work in parallel (there is no flow of data from PE0 to PE2, and no flow from PE1 to PE3).

Well I have showed the results first two clock cycles to my teacher and he told me that they were all right.Because the accumulator is going to add the result from the previous multiplication based on the NxN matrix table.But you are rite thou as there is no flow of data from PE0 to PE2, and no flow from PE1 to PE3.

>>>>In this case (again, if my understanding is correct), the rhythm would be one set of b values per clock tick. In the first tick, the input will be b00 and b10, and in the next tick b01 and b11.

Based on the correct results on the first two clock cycles the inputs were (a00 and b00,b01) like I have on the table.

>>>>I don't see any flow between PE0 and PE3 ... Am I missing something ?

You are rite I made a mistake when I said "I think that the flow of data should go  rhythmically from PE0 to PE3."
What I meant was the date is supposed to go  rhythmically from PE0 to PE1, then PE2 and finally to PE3 and not directly from PE0 to PE3 as stated on the previous comment.
Thanks


>> They're not controlled by the clock.

Then why do they need different clock ticks to work ?


>> Well I have showed the results first two clock cycles to my teacher and he told me that they were all right.

Then I must be completely misunderstanding the assignment. Because as far as I can see, at the very least, PE0 and PE2 can work in parallel, and thus BOTH start at the first clock tick.
The fact that two clock ticks are needed to calculate the result of PE0, could indicate that the multiplicator and accumulator are controlled by the clock, but you just said they weren't.
So, I'm lost heh.


>> Based on the correct results on the first two clock cycles the inputs were (a00 and b00,b01) like I have on the table.

Yes, a00 will be on the input line for both ticks, while b00 will be on the other input line for the first tick, and b01 for the second tick. That's the same interpretation I gave earlier.


>> What I meant was the date is supposed to go  rhythmically from PE0 to PE1, then PE2 and finally to PE3

Why the "then" ? Why can't they work in parallel ? They have separate inputs, and no dependency ...
>>>>Then why do they need different clock ticks to work ?

Well now you got me confused.Is a sure thing that they need the clock to work so why they need a different clock cycle to multiply and to accumulate?You are rite.They need the clock to work.

>>>>Then I must be completely misunderstanding the assignment. Because as far as I can see, at the very least, PE0 and PE2 can work in parallel, and thus BOTH start at the first clock tick.
The fact that two clock ticks are needed to calculate the result of PE0, could indicate that the multiplicator and accumulator are controlled by the clock, but you just said they weren't.
So, I'm lost heh.

The tricky thing here is that I cannot put thus circuit with both inputs because I have to follow the multiplication of the matrices.Specially on the NxN matrix multiplication the first row of the first matrix multiply by the first column in the second matrix and so on.But even in this I have my doubts because following the  NxN matrix multiplication principles the fist cycle is correct with a00 x b00 but the second cycle would be a01xb10 and not a00xb01 like my teacher said.Now once more I'm confused :-(
>> You are rite.They need the clock to work.

Ok, then all that's left is to figure out whether PE0 and PE2 can work in parallel or not (I see nothing that prevents that).


>> The tricky thing here is that I cannot put thus circuit with both inputs because I have to follow the multiplication of the matrices.

There are 8 input values, and 4 output values. Each of the PE's takes two input values, and multiplies them, then adds the possible carry from the previous PE. Each of the PE's has one output. The final 4 outputs are generated by PE1 and PE3, each of which generate 2 results.

PE0 will perform two calculations :

        c1 = (a00 * b00) + 0
        c2 = (a00 * b01) + 0

PE1 will also perform two calculations :

        x00 = (a01 * b10) + c1
        x01 = (a01 * b11) + c2

Similarly for PE2 and PE3, with x10 and x11 as end results.

That's how I understand it ...
Well that's a good idea.So if we have the for example a NxN matrix multiplication table with the values of:

 a00=1     a01=2              b00=2     b01=2
                               X                                    = ?
 a10=2     a11=1              b10=1     b11=1

which foolowing the principles of NxN matrix multiplication gives:

a00 x b00 + a01 x b10             a00 x b01 + a01 x b11
a10 x b00 + a11 x b10             a10 x b01 + a11 x b11

Just to have a slightest idea of what is a NxN matrix multiplication http://en.wikipedia.org/wiki/Matrix_multiplication

Don't you think that the inputs in the circuit aren't like they were supposed to be?Probably I did something wrong and the teacher didn't notice, so I'm going on th wrong path.
Thanks, I'm very appreciated.
I forgot to mention the question.Sorry ;-)
What would be the outputs of the first 4 clock cycles based on the inputs that I gave you?Thanks
>> Just to have a slightest idea of what is a NxN matrix multiplication

I know how matrix multiplication works ;)



>> Don't you think that the inputs in the circuit aren't like they were supposed to be?Probably I did something wrong and the teacher didn't notice, so I'm going on th wrong path.

If the multiplicator and accumulator are controlled by the clock, then the first two ticks that you filled in in your original post are correct for PE0 and PE1.

However, I think that PE2 can also start working already (in a similar fashion to PE0). Verify that with your teacher to make sure, but I see no reason not to allow that.
>> What would be the outputs of the first 4 clock cycles based on the inputs that I gave you?Thanks

So, for now let's wait for the teacher's answer regarding PE2, and let's work on getting PE0 and PE1 right.

You already filled out the first two clock cycles correctly for PE0. PE1 can already do something in the second clock cycle though ... it can already start multiplying. The third clock cycle will then be used for the accumulator using the result from PE0's accumulator, as well as the result of PE1's multiplicator.

Give it a try, and post what you think happens ...
Hi infinity08.
Sorry about the delay

>>>If the multiplicator and accumulator are controlled by the clock, then the first two ticks that you filled in in your original post are correct for PE0 and PE1.

I though the same thing.

>>>>However, I think that PE2 can also start working already (in a similar fashion to PE0). Verify that with your teacher to make sure, but I see no reason not to allow that.

I will have a meeting with him tomorrow.Although PE0 and PE2 share at least 2 inputs which is b01 and b00 they're not working in parallel.The data must flow from PE0 to PE1 next PE2 and finally PE3.

>>>>So, for now let's wait for the teacher's answer regarding PE2, and let's work on getting PE0 and PE1 right.You already filled out the first two clock cycles correctly for PE0. PE1 can already do something in the second clock cycle though ... it can already start multiplying. The third clock cycle will then be used for the accumulator using the result from PE0's accumulator, as well as the result of PE1's multiplicator.

So are you saying that PE1 can work straight away on the second clock?
But what would be the inputs? Is just a00xb10 or a01xb11?What about the accumulation from the first PE?
Thanks
>> The data must flow from PE0 to PE1 next PE2 and finally PE3.

Again, I don't see that in the circuit. There is no flow from PE1 to PE2. PE2 has one input in common with PE0, but that doesn't matter, as they can both use the same input at the same time.

Your teacher will be able to cut this knot ;)



>> So are you saying that PE1 can work straight away on the second clock?

PE1 only needs the result from PE0 in the accumulator. The multiplicator just needs the a01 and b10 inputs which are already available. It would make sense to already do the multiplication in the second clock cycle ...
>>>>Again, I don't see that in the circuit. There is no flow from PE1 to PE2. PE2 has one input in common with PE0, but that doesn't matter, as they can both use the same input at the same time.
Your teacher will be able to cut this knot ;)


Is a very interesting point.I will make note of it and I will talk to him tomorrow just to point what you have  told me.

>>>>PE1 only needs the result from PE0 in the accumulator. The multiplicator just needs the a01 and b10 inputs which are already available. It would make sense to already do the multiplication in the second clock cycle

Another very interesting point.
Because soon as he saw the table he told that was OK.But I have to ask him why PE1 cant work straight away on the second cycle.

A matrix by matrix multiplication will generate a Systolic architecture with 4 PE's.
Consider two N x N matrices  A=[Aij] and B=[Bij]. The product C=[Cij]  of A by B is given by                                                                                 C=AB,  
Next i'm sending a systolic architecture based on matrix multiplication
                                                                                                                     

matrix-matrix.jpg
>> Next i'm sending a systolic architecture based on matrix multiplication

This new picture clearly shows that both the multiplicator and accumulator are controlled by the clock (which is one of the questions I asked earlier), so that confirms that what you already have is ok.

Let me know what your teacher says about the parallel operation of the PE's.
OK thanks.I will post again tomorrow after the meeting.
Hi Infinity08:

>>>>This new picture clearly shows that both the multiplicator and accumulator are controlled by the clock (which is one of the questions I asked earlier), so that confirms that what you already have is ok.
Let me know what your teacher says about the parallel operation of the PE's.


I'm sending a new table that I did in accordance from last meeting.
You're rite the PE's are working in parallel so in the first clock cycle all PE's multiply and on the second accumulate from the previous multiplication.
My teacher told that I need only 4 clock cycles to do all multiplications and accumulations.
I did the first two but I'm stuck on the 3rd and 4th.
Can you help me?
Thanks
multiply-accumulator-new-version.jpg
>> My teacher told that I need only 4 clock cycles to do all multiplications and accumulations.

Exactly ... That's what I though too ;)


>> in the first clock cycle all PE's multiply

Not all of them. Just PE0 and PE2. PE1 and PE2 start multiplying in the second clock cycle.

I'll just comment on PE0 and PE1. PE2 and PE3 are completely similar :

PE0 : clock ticks 1 and 2 correct. Now, you're missing one accumulation ... Where would that go ?

PE1 : started too early ... PE1 can start in clock cycle 2 only, because its accumulator will need the output from PE0's accumulator which will be available in clock cycle 3.
Also, the accumulator step is not correct ... You need to accumulate with the output from PE0.


Each PE will perform two multiplications and two accumulations. PE1's accumulator will take as an extra input the output of PE0.

Try to follow what happens. Just look at one calculation, for example x00 = a00*b00 + a01*b10. Follow the flow of the data, and the results of each step, and make sure that the inputs are available when you use them.

If you want, write down how you think it works, and I'll verify it for you ... Once you understand this basic flow, it should be easy to extrapolate to the calculation of the entire matrix multiplication.
>>>>Try to follow what happens. Just look at one calculation, for example x00 = a00*b00 + a01*b10. Follow the flow of the data, and the results of each step, and make sure that the inputs are available when you use them.If you want, write down how you think it works, and I'll verify it for you ... Once you understand this basic flow, it should be easy to extrapolate to the calculation of the entire matrix multiplication.



Hi! This is the table that I worked today.Do you see any mistake?Can you post the results for the 1st and 2nd cycles? ;)
 I could work out the results for the 3rd and 4th.Thanks
Table.jpg
You've got all multiplicators correct ! So, that's a good start.

There are however a few more problems with the accumulators.

For example : clock cycle 3 for PE0 : why do you accumulate with the value (a00*b00) ? That value is the output of PE0's accumulator in the previous clock cycle, and is used as input for PE1's accumulator.

Also, PE1's accumulator in clock cycle 2 can't use the result (a00*b00 + 0) of PE0's accumulator yet, because it isn't available yet ... That result will be available in clock cycle 3. The same is true for the result (a01*b10) of PE1's multiplicator ... That's only available in the third clock cycle.
Hi.What do you think about this new table for PE0 and PE1?
Table2.jpg
Ignoring PE2 and PE3 for now (they'll be completely similar to PE0 and PE1 as you no doubt know) ...

You're getting closer ;) PE0 is completely correct now. As is the multiplicator for PE1.

That just leaves the accumulator for PE1. The clock cycle slots are chosing correctly. What's in them however is not 100% correct.
That accumulator will output two values : first it will output x00, then x01 ... That should be a good hint ;)
Ok I think that the two values will be the multiplyer from PE1 and the accumulator from PE0.Correct?
Another table.
Table3.jpg
You have :

        a00*b01 + a01*b10

in clock cycle 3 for PE1's accumulator.

Two questions :

1) is a00*b01 available yet ?
2) is the calculated value interesting for us ? Always keep in mind what we're trying to calculate !!


Also, I gave a big hint in my previous post : What are x00 and x01 ?
>>>>>1) is a00*b01 available yet ?

I think not! Just on the 3rd cycle

So the value on the PE1 in the 3rd cycle will be a00xb00 + a00xb01, where a00xb00 comes from the accumulator from the 2nd cycle on PE0 and a00xb01 from the 3rd cycle on the PE0.
So do you think is ok?
>> and a00xb01 from the 3rd cycle on the PE0.

This says it all ... it comes from the 3rd cycle, so you can't use it in the 3rd cycle yet ... It can be used in the 4th cycle.

As I said before : keep an eye on what the output should be ... I'll post the hint for the third time :

        The outputs of PE1 are x00 and x01 ? What are x00 and x01 ?

Think about it ;)
So let me try! The 4th cycle on PE should be a00xb01 and the 3rd cycle on PE1 should be a00xb00+0.Correct?
Omg , if it isn't i dunno what to do eheh.
No.

Let's try again ... First, you really have to understand how the circuit works.

There are 6 input lines and 2 output lines. The 2 output lines will generate 4 values, namely x00, x01, x10 and x11, which together form the result of the multiplication of the two input matrices.
Take a look at your own first post to see what values x00, x01, x10 and x11 have, how they're calculated, and what input values they depend on.

I'll take one example :

        x00 = (a00 * b00) + (a01 * b10)

Notice that x00 depends on 4 input values (from 4 different input lines). Also notice that there are two multiplications and one addition needed to calculated x00.

So, the first step will be to calculate (a00 * b00). Now, look at the output of PE0 after clock cycle 2 :

        (a00 * b00) + 0

which is what we need. The + 0 is just there because there is no value to accumulate (no input line for that). Note that the accumulator isn't really necessary in PE0.

The second step is to calculate (a01 * b10). Now look at the output of PE1's multiplicator after clock cycle 2 :

        (a01 * b10)

which is what we need.

So, we now have both (a00 * b00) and (a01 * b10) available right after the second clock cycle. All we have to do now is add them together using PE1's accumulator in clock cycle 3.

Does that make it more clear for you ?
So the 3rd cycle on PE1 the accumulator should be should be a00xb00+a01xb10 and on the 4th clock should be a00xb00+a01xb10  and a01xb11+a00xb01.

There is other thing on my mind and if on the 4th cycle do a00xb00+a01xb10+a01xb11?

In my opinion should be the first one.Correct?
>> So the 3rd cycle on PE1 the accumulator should be should be a00xb00+a01xb10

Yes. That's it.


>> and on the 4th clock should be a00xb00+a01xb10  and a01xb11+a00xb01.

There's just one result (the second). Not both.

So, PE1 will give x00 as output after clock cycle 3, and x01 after clock cycle 4.

Can you now fill out the entire table to make sure that it's correct ?

If there's anything that you don't completely understand or aren't sure about, please ask. I want to make sure that you understand what's happening :)


>> There is other thing on my mind and if on the 4th cycle do a00xb00+a01xb10+a01xb11?

What would be the point of that ? Look at what we want to calculate (multiplication of two matrices) ... Do we need a00xb00+a01xb10+a01xb11 for that ?

In any case, with the circuit as it is now you can't do that. The accumulator input is the output from the previous PE, which in turn is its accumulator's output.
Ok I will fill in a new table with all the results and I wll do the same thing for PE2 and PE3.I will post it later on, so you can check if I'm doing alrite.Thanks
Hi! I have completed the table just like I said.Can you check it for me to see If it got any errors?

So:
c00 = a00 x b00 + a01 x b10
c01 = a00 x b01 + a01 x b11

c10 = a10 x b00 + a11 x b10
c11 = a11 x b11 + a10 x b01

Thanks

table4.jpg
ASKER CERTIFIED SOLUTION
Avatar of Infinity08
Infinity08
Flag of Belgium image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Ok. I think I did something wrong.So the last values in the PE1 and PE3 should be the value in the multiplier that is passing now in the 4th clock into the  accumulator and receiving also the last value from the accumulator from the previous PE.I have emended the table
Correct?
Table5.jpg
That's it indeed !

If you're still unsure about something, let me know, and I'll explain it ...
Btw, note that the 0 slots can be used for multiplication of other matrices. This is called pipelining.
Thank you so much for your help.I'm very appreciated for that.Thanks
Thank you so much for your help.I wouldn't do it if I didn't had your help.Thanks once more for helping me understand.Bye
Glad to be of assistance, saturnian. I hope I helped you understand this :)
You helped me indeed.I'm very appreciated for that.Bye