@ZipCPU
Created March 24, 2017 00:36
<lama> hello
<lama> can somebody please help me, i'm trying to implement the modified booth's algorithm for multiplication, but the vhdl code gives incorrect results like 125 * 12 = 1076605135488 :D i don't know why http://pastebin.com/dKXJbsb5
<lama> i have also written a quick software implementation in c++ so i would be sure that i understand the algorithm and that works:
<lama> http://pastebin.com/4MhLdhtA
<lama> so can somebody see anything wrong with the vhdl code?
<lama> just to make things clear, if i have a two's complement number in std_logic_vector and cast this to unsigned in order to add this vector to some other vector; the casting won't alter the bit pattern, right?
<lama> let's say that slv = 11111011 which is -5 in two's complement and i do unsigned(slv), will this still hold the same binary value? unsigned(slv) + unsigned(1) = 11111100 (-4) ?
<MatthiasM> why do you keep using std_logic_vector or unsigned for signed data?!?
<lama> well, the coprocessor itself uses the signed data type ; but because i got confused about all these bit shifts, i have decided to make it std_logic_vector and handle the two's complement on its own in this very entity
<lama> basically i'm shifting two joined vectors to the right
<lama> and both of them are originally signed
<lama> but when i joined them, then it doesn't make sense anymore
<lama> so i have decided to just make it std_logic_vector so it wouldn't confuse me so much, like is the sign bit implicit and hidden for signed etc...
<lama> :D
<lama> i have found some typos, but still i'm getting bullshit results
<lama> http://pastebin.com/ic2bG6Jn
<lama> this is the algorithm in c++ that i wrote before the vhld and that works http://pastebin.com/4MhLdhtA
<lama> why aren't those two codes equivalent
<lama> ?
<lama> basically the algorithm takes two input operands Q,M ; both of them are zero extended so they have an even number of bits, and further so that if shifted left, the sign wouldn't be lost
<lama> then the Q is LSB zero extended, and this is the "implicit bit"
<lama> then bitstring is formed from ACC & Q
<lama> acc is the same bitwidth as M (and Q without the -1th bit)
<lama> then you will check those last three bits and perform an operation according to the switch statement, then this gets added to the ACC and then the entire bit string is shifted arithmetically to the right by 2 bits; this is repeated n/2 times where n is the bitwidth of M
<lama> but it looks like nothing is shifting very much at all and i don't know why
<lama> the output should be A & Q without the implicit bit
<lama> everything is variable basically, so the fucking isim cannot display it, so i can only print the values
<lama> oh, i guess i have fucked up the ACC assigments
<lama> so now 125 * 12 = 14000
<lama> :D
<lama> i don't understand whats wrong
<lama> so apparently i'm useless, that stupid c++ took me like 10 minutes, but i have already spent 2 hours trying to rewrite it in vhdl
<svenn> You are too intelligent for your intentions
<Asu> i'm a few clicks away from ordering the cyclone v gx starter
<Asu> any objections, or anything important i could have missed?
<scrts> Asu-> do you really need just transceivers?
<scrts> for me, SoC kits seem to be a way better deal
<scrts> e.g. from ebay
<Asu> i want to avoid the ARM cores
<Asu> i would have chosen the de10-nano else
<Asu> the cyclone v has the i/o i ideally want (not specifically the large transceivers), and enough resources
<Asu> starter kit
<scrts> you can always ignore ARM cores on FPGA
<Asu> scrts: what if i want to use everything that is handled by the ARM core within the FPGA?
<scrts> there is nothing handled by the ARM core in the FPGA
<Asu> i mean, on the board
<scrts> you have to instantiate glue logic between ARM core and FPGA for it to handle anything
<Asu> yes, but i rather want to only use the fpga
<scrts> oh from that perspective... the SoCKit board I had had the 0ohm resistors or physical jumpers
<scrts> so I could switch to FPGA and control the peripherals from Nios
<scrts> which was the case when I was setting up clock generator
<scrts> the SoCKit had huge FPGA compared to price
<scrts> somebody calculated the BoM of that board and the FPGA alone in single quantities was close to whole devkit price
<scrts> also had transceivers
<scrts> I've tried SFP loopback on it
<Asu> so... in my case, would the gx starter kit still be a good bet?
<Asu> i am not sure the de10-nano has such switches to handle everything through the fpga
<Asu> jumpers, rather
<MatthiasM> you can use all HPS pins from the FPGA side too
<Asu> so i really can directly use the sd pins of the de10-nano kit from the fpga without interacting with the HPS, for example?
<MatthiasM> Asu: https://www.altera.com/documentation/doq1481305867183.html#yja1481303809832
<Asu> interesting
<MatthiasM> I think the DDR interface is dedicated - so it can only be used via the HPS - but you can just create an HPS element and export the memory controller to the FPGA side
<AsuMagic> whoops, if anyone sent a message since my last one, i didn't receive it
<AsuMagic> i will look into the de10-nano documentation tomorrow if i have enough free time
* AsuMagic is now known as Asu
<Asu> see you
<RavenholmDX> Heyo
<Flea86> Hey
<RavenholmDX> What's shakin'?
<Flea86> Just testing out my newest toy.. how about you?
<RavenholmDX> What's your newest toy Flea86?
<Flea86> RavenholmDX: http://www.fleasystems.com/images/flea_mk2_geos.jpg That :)
<Flea86> Only thing left to test is Ethernet
<RavenholmDX> XT?
<Flea86> XT is an early form of IBM PC (Model 5160)
<Flea86> ie. 4.77MHz 8088 :)
<RavenholmDX> Ah, gotcha
<Flea86> Note this is only a test tho - that x86 SoC occupies around ~22% of that FPGA..
<RavenholmDX> Hot damn
<RavenholmDX> impressive :)
<Flea86> RavenholmDX: I'm tempted to integrate OPL3 to the core, but not immediately
<Flea86> would certainly have tried it already if the x86 was an i386
<RavenholmDX> Flea86 have you seen https://github.com/gtaylormb/opl3_fpga ?
* Now talking on ##fpga
* Topic for ##fpga is: Welcome to ##fpga! Don't ask to ask! Ask and lurk, answers may and will take time! | Be polite and precise! | Recommended Verilog papers at http://www.sunburst-design.com/papers/ | Help ##embedded get off the ground! (embedded programming) | Related channels: ##verilog ##vhdl | FPGA for dummies: http://goo.gl/FHQPPH
* Topic for ##fpga set by scrts!~quassel@unaffiliated/scrts (Wed May 27 16:59:12 2015)
* Channel ##fpga modes: +cn
* Channel ##fpga created on Sat Jul 18 17:44:32 2009
<Flea86> RavenholmDX: Yes I have. That is what I was referring to.
<gaqwas> i've got a question
<gaqwas> LUTs in a typical xilinx ultrascale+ architecture are SRAM-based. Is that correct?
<lain> all xilinx fpgas use sram luts
<lain> (the non-volatile spartan3an series is a standard spartan3a with a serial flash bonded to the fpga inside the package)
<gaqwas> good
<gaqwas> question number two
<gaqwas> is an sram cell of a typical xilinx LUT fundamentally different in any way from a flip-flop in the same architecture? AFAIK both consist of two or more interconnected flip-flops and can hold a single bit of information
<gaqwas> what's the difference in that architecture between a flip-flop and an SRAM cell of a LUT, if any?
<lain> that I don't know :3
<Flea86> gaqwas: I'm with lain.. though from memory the differences are subtle yet distinct.
<gaqwas> thanks guys
<gaqwas> i'd like to know what matthiasm thinks
<gaqwas> ill ask the same question later when he's here
<promach> just to bother you guys with a short, simple question. Could you guys show me (beginner) an existing, simple verilog example for pipelining ?
<promach> https://www.edaplayground.com/x/aHv http://i.imgur.com/B0dTDdh.png pipeline depth of 4
<promach> proof-of-concept
<lama> how do i check if that multiplier works? i managed to get it working, so i put in some numbers and the result is correct, but shouldn't i test the output for all possible combinations? but that is 33 bit x 33 bit, so that means 2^33 * 2^33 - 2^33 combinations i think, i'm not sure how i'm supposed to test that
<lama> how is the hardware generally tested if the input range is so large?
<Zarutian> you could test it with primes that fit into 33 bits. They are nicely randomly sparse. Then you could test by setting the x msb bits to high on both operands and see if it overflows.
<lama> same for the cordic; but there i have decided to make 1000 steps, which would mean that the input angle is being stepped by 0.001570796 radians, and then i compared that to matlab and found that the average error is like -90 dB and that the error is homogeneously spread across the domain, which i think is good
<lama> ok, so i should try to multiply primes up to 33 bits
<lama> is there a list of primes, i suppose there is
<Zarutian> that is just one suggestion. You can do that. And you can go to random.org and get quite a few random numbers to use as operands
<lama> and does the test with primes say anything important? i mean why would you pick primes?
<lama> ah, random, thats a good idea too
<lama> True Random Number Generator
<lama> heh
<lama> how
<lama> do they sample noise or something
<lama> The randomness comes from atmospheric noise
<lama> ah
<Zarutian> why primes? sparseness and seemingly no other pattern.
<lama> ok, great :)
<lama> also what division algorithm would you suggest to implement?
<lama> is there any standard for that
<Zarutian> I do not know. I am no mathematican nor numerical analyst.
<lama> me neither
<Zarutian> there is a divmod one that is constant time. I do not recall what it is called though.
<lama> well, i'll try to properly test the multiplier first :)
<lama> also, the result should never overflow, right?
<lama> 33 bits * 33 bits = 66 bits
<lama> thus the sign bit should be correct all the time, no?
<ZipCPU> lama: Just a thought: Don't use a *true* random number generator, but rather a pseudo-random number generator.
<ZipCPU> Second thought: Always check things like +/- 1, the maximum negative or positive integer, zero, and combinations of these.
<lama> i have tried to search for some primes too and there are still several million of them in the range of 33 bits, isn't that a bit much?
<ZipCPU> Personal opinion? Yes.
<ZipCPU> I would personally use the rand() function, several examples of 33-bit numbers, both signs (assuming this is a signed multiply), and the example I just gave ... rather than every prime.
<lama> so i should just generate 10 000 random numbers and try those along with special numbers like +/- 1 , all ones and all zeros?
<ZipCPU> Are you doing signed multiplies?
<lama> does vhdl have rand() ?
<lama> yes
<lama> thats why the odd 33 bits
<ZipCPU> ??
<ZipCPU> Can you explain that?
<lama> well, because originally i wasn't using two's complement, but rather ancient sign + magnitude, and i wanted a range of +/- 32 bits
<ZipCPU> Ok ... just be aware that you need a little more special sauce to multiply signed numbers together above and beyond what unsigned numbers require.
<lama> but then i figured that the hardware would be too complex, because when adding or subtracting you basically have to "evaluate" the absolute value and check which operand is larger and arrange the order of calculation accordingly
<lama> so i have changed to two's complement but i still wanted to retain the +- 32 bit range ; so i just added a bit
<pticochon2> hi
<lama> i have implemented MBE and this uses two's complement so it natively multiplies unsigned and signed
<lama> because the whole number is always signed
<ZipCPU> I can point you at an example of a 32x32-bit multiply if you would like, one that handles both signed and unsigned operands and uses DSP48's if available.
<ZipCPU> Hi pticochon2!
<pticochon2> o/ ZipCPU
<mfgmfg> my approach is generally something like making a smaller module (say, 16x16) and doing full coverage to verify that my design is sound, then i change the parameter to 32 and test corner cases + randoms to make sure it checks out
<Suzeren> Hi guys! Sorry for disturbing you. Could anyone download XAPP1294 reference design files from Xilinx pleeeeeease? It seems that I'm not able to do it =(
<lama> that would be nice, i can learn more about multiply algorithms that way :)
<ZipCPU> mfgmfg: That approach would not have worked for me, as I was trying to exploit several hardware 16x16 multiplies to create a 32x32 bit multiply.
<lama> that sounds good too , 16 bits, that's 2^16 * 2^16 - 2^16 ; wait that's still like 4 billion numbers :D
<mfgmfg> the correct 'hard' way is to do it with mathematical proofs
<lama> well, the algorithm is proven and works
<ZipCPU> lama: You can find my multiply within https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/cpuops.v
<mfgmfg> 2^32 is not that big
<lama> its the implementation that may be faulty
<lama> well, it will create several tens of GB of results
<mfgmfg> well you don't write to disk if it passes and if it fails, log up to a few hundred then quit
<mfgmfg> like you can do a full coverage test on fp16*fp16 in less than 2 minutes on a decent system with verilator compiled code
<lama> oh so you will do the hardware multiplication and check against VHDL language *
<lama> you write in verilog, i don't know verilog :( i will have to learn verilog as well, don't i
<lama> what algorithm did you use?
<mfgmfg> what do you mean
<ZipCPU> I'd never go that far, guys, come on, to be a reasonable test ... it's got to complete while you are watching it.
<lama> what do i mean what?
<mfgmfg> a watched test never fails
<ZipCPU> mfgmfg: Heheheh ... good one.
<ZipCPU> However, a test that never completes never passes either.
<lama> from what you said i have assumed the following : 1st step you will issue a multiply command to your implemented multiply entity with corresponding operands ; 2nd step you will obtain the result and will perform the same multiply operation but now using the language (software) and compare them , if the results don't match the hardware has a bug and you will log something ; 3rd step load next operands and repeat from step 1
<lama> how else would you check if the result is correct
<mfgmfg> i compared it against multiplication in c++
<ZipCPU> lama: Yes, that's what I would do.
<lama> it takes 16 cycles for the multiply algorithm i have implemented to finish one 33 bits * 33 bits calculation
<lama> yes, or in c++ but then you need to log that thing
<lama> you said that you won't log anything but the bad values, so i assumed that you will use the HDL language itself (which is still a general purpose language) to compute the verification
<lama> which is a better thing to do i guess
<ZipCPU> Why 16 cycles to finish? Why not 6 cycles?
<mfgmfg> if (top->tb->r16_as_f64 != rcpp_bits) printf("a: %.16f b: %.16f r: %.16f rcpp: %.16f\n", a, b, r, rcpp);
<lama> for example currently i'm exporting everything to log in known format which i parse in matlab and check it there ; it calculates and plots the error in dB
<lama> for the domain
<lama> how can i do it in 6 cycles? i would have to drop some pipelining
<mfgmfg> i used Verilator which compiles verilog code to c++ which is then compiled with gcc
<lama> i have 16 iterations
<lama> i have dedicated one cycle per each to save chip area and timing
<lama> i haven't found any algorithm that can do 33 bits in 6 iterations , but i'm not very good at anything, so
<mfgmfg> you can even do it in one cycle but you're at the mercy of the critical path
<mfgmfg> fmax drops like a rock
<ZipCPU> lama: Here's an example that should be close to 6 clocks (7 perhaps) and uses no hardware multiplies: https://gist.github.com/ZipCPU/ed023491163f8568b48334c3b35af309
<ZipCPU> It offers both 33x33 signed and unsigned multiplies, although it does so by taking an absolute value and then possibly negating at the end, so ... that's another 2 clocks.
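The sign-handling trick ZipCPU mentions can be sketched in C++ like this (signed_multiply and unsigned_core are illustrative names, not from his gist): take absolute values, run the unsigned core, and conditionally negate at the end. His Verilog pipelines these steps, which is where the extra ~2 clocks come from.

```cpp
#include <cstdint>

// Stand-in for the unsigned multiplier array (hypothetical name).
uint64_t unsigned_core(uint64_t a, uint64_t b) {
    return a * b;
}

// Signed multiply built on the unsigned core: multiply |a| by |b|,
// then negate the product when exactly one operand was negative.
// (Sketch assumes neither operand is INT64_MIN, whose absolute
// value does not fit in int64_t.)
int64_t signed_multiply(int64_t a, int64_t b) {
    bool negate = (a < 0) != (b < 0);       // signs differ => negative result
    uint64_t ua = a < 0 ? uint64_t(-a) : uint64_t(a);
    uint64_t ub = b < 0 ? uint64_t(-b) : uint64_t(b);
    uint64_t p = unsigned_core(ua, ub);
    return negate ? -int64_t(p) : int64_t(p);
}
```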
<lama> the file itself is computer generated?
<ZipCPU> Yes.
<lama> you generated that code somehow? i don't understand
<ZipCPU> That way I could build a multiply for any bit width.
<ZipCPU> Oh, come on, it's not that hard ... you build a C++ file that builds a Verilog file. That allows you to create a lot of configurability too.
<lama> oh, right
<lama> verilog doesn't have generics?
<lama> or the code will be very different for any bit width
<ZipCPU> It has parameters which are very similar to VHDL generics. I ... just didn't know how to do this with generics.
<ZipCPU> The logic was more complex than I was able to figure out how to do with generics.
<lama> you keep adding things without anything shifting, am i retared
<ZipCPU> No ... there's shifting going on.
<lama> did i say that i can't do verilog
<lama> you have lookup table of some sort
<ZipCPU> You mean the bimpy code? The binary multiply?
<lama> yes
<lama> i'm stupid, i never learned verilog, i'm writing in vhdl, i should learn the verilog
<ZipCPU> Well ... that's based on the fact that most FPGA's are built out of 6-LUT's. Hence, we can do in one clock a 2-bit multiply by an arbitrary length number.
<lama> could you tell me how that thing works
<ZipCPU> Yeah, sure.
<ZipCPU> You know how long multiplication works? You line two values up against each other, multiply the top by each digit of the bottom, and create a big table of things to be added together?
<lama> oh, you are using LUT as memory lookup table where results for multiplication is stored?
<ZipCPU> The bimpy routine does the multiplication. It multiplies the top value by two bits on the bottom row.
<ZipCPU> I'm just doing long multiplication.
<lama> how come that you will do this in 6 cycles?
<ZipCPU> Ok, yeah, here we go. 33/2 intermediate results, yep, now ... you can add pairs for your first clock.
<ZipCPU> 1+2, 3+4, 5+6, etc.
<ZipCPU> On the next clock, you can add the next set of pairs: 1+3, 5+7, 9+11, etc.
<ZipCPU> On the next one, 1+5, 9+13, etc.
<ZipCPU> I mean, if the FPGA can do everything in parallel ... why not do things in parallel?
<lama> and you will generate all those intermediate results in one cycle? i mean i guess you will if each and every 1 bit * 1 bit multiply is independent of the other
<lama> i tried, but i have failed
<ZipCPU> There you go!
<lama> this is my multiplier
<lama> http://pastebin.com/YS60hPVD
<lama> the math coprocessor can compute other things while the multiplication runs, but still, you did it in 6 cycles, that is awesome
<lama> is the code bad?
<ZipCPU> Sorry ... my son just called and is asking for homework help ... ;)
<lama> its FPGA homework too? :)
<ZipCPU> Not yet ... it's been fixing bicycles and running ambulance calls so far.
<ZipCPU> Usually he's asking about calculus and integrals, though.