Sunday 11 August 2013

Object detection, maths, ...

Curiosity got the better of me and I poked around a bit on the code side of my parallella board today.

Following on from a forum post about it, I started thinking about how to fit the Viola & Jones 'haar cascade' object detector onto the epiphany, and I started looking at an assembly version of the inner loop. If I arrange the data appropriately and force some assumptions I can get it down to around 15 instructions per feature test, which is pretty decent. In fact it can even run in the lower 8 registers and so is very compact too (16-bit encodings). I'm actually pretty confident I can get some decent efficiency out of it, although I'm not sure how that will translate to performance yet.
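For illustration, a single feature test looks roughly like the following C sketch - not the assembly version, and the data layout here (precomputed summed-area-table offsets, weights and so on) is just one possible arrangement:

#include <stdint.h>

/* One haar feature: up to 3 weighted rectangles looked up in the
 * summed-area table (SAT), compared against a threshold scaled by the
 * window's standard deviation.  Layout is illustrative only. */
typedef struct {
    int off[3][4];      /* up to 3 rectangles, 4 SAT corner offsets each */
    float weight[3];    /* rectangle weights; 0 marks an unused rectangle
                           (its offsets then all point at element 0) */
    float threshold;    /* feature threshold, pre-scaled by window area */
    float fail, pass;   /* values accumulated into the stage sum */
} feature_t;

static inline float rect_sum(const uint32_t *sat, const int off[4]) {
    /* summed-area-table lookup: A - B - C + D */
    return (float)(sat[off[0]] - sat[off[1]] - sat[off[2]] + sat[off[3]]);
}

static float feature_test(const uint32_t *sat, const feature_t *f,
                          float stddev) {
    float sum = rect_sum(sat, f->off[0]) * f->weight[0]
              + rect_sum(sat, f->off[1]) * f->weight[1]
              + rect_sum(sat, f->off[2]) * f->weight[2];

    return sum < f->threshold * stddev ? f->fail : f->pass;
}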

I have a few ideas for how I can handle the large size of the cascades fairly efficiently - first by keeping the lowest and most frequently accessed stages stored in LDS, and then either relying on how rarely the upper stages are reached to leave them in slower memory, or using some pre-fetch mechanism. 32K is bloody tight though.
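As a very rough sketch of the two-tier idea (all names invented for illustration):

#define N_LOCAL_STAGES 4    /* however many stages actually fit in LDS */

typedef struct {
    int nfeatures;
    float stage_threshold;
    /* ... feature data ... */
} stage_t;

extern const stage_t stages_local[N_LOCAL_STAGES]; /* resident in LDS */
extern const stage_t *stages_external;             /* external memory */

static inline const stage_t *get_stage(int i) {
    return i < N_LOCAL_STAGES ? &stages_local[i] : &stages_external[i];
}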

Although most of the calculations are simple mul + add and some comparisons, there is also a square root required per window. I can probably move the calculation of that to the ARM side (although then I would pretty much have to move all the scaling there too), depending on how fast the epiphany ploughs through the window tests anyway.
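The square root comes from the per-window standard deviation used to normalise the feature thresholds - roughly something like this (again, names and layout are only illustrative):

#include <math.h>
#include <stdint.h>

/* Per-window standard deviation from the integral image (sat) and the
 * squared integral image (sqsat); this is where the sqrt comes in. */
static float window_stddev(const uint32_t *sat, const uint64_t *sqsat,
                           const int corner[4], float invarea) {
    float mean = (float)(sat[corner[0]] - sat[corner[1]]
                       - sat[corner[2]] + sat[corner[3]]) * invarea;
    float sqmean = (float)(sqsat[corner[0]] - sqsat[corner[1]]
                         - sqsat[corner[2]] + sqsat[corner[3]]) * invarea;
    float var = sqmean - mean * mean;

    return var > 0.0f ? sqrtf(var) : 1.0f;
}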

I'm still not quite sure how the data flows between host and cu, but the examples should contain enough information to work that out.

fdiv

So I also looked into - and got pretty much distracted by - floating point divide. Something that is required if, e.g., calculating the square root. Since the epiphany has neither a reciprocal estimate nor a divide instruction, one must implement it oneself.

I think I managed to implement the Newton-Raphson division mechanism from Wikipedia in about 40 instructions. Unfortunately there are a lot of data stalls due to the feedback nature of the algorithm (and me not particularly wanting to hand-schedule the bits where there is leeway), but it runs in about 75 clock cycles with no divide-by-zero or other checks going on. A C implementation of the identical algorithm takes about 123, and using / in C takes about 131 (with -ffast-math; I'm not sure how the IEEE error checks differ). Anyway it's not something VJ needs that much of, so C would probably suffice. Divide seems to drag in a bit of libc too, whereas a stand-alone implementation is very compact.
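For reference, the same algorithm in C looks roughly like this - not the version being timed above, and with no handling of zero, infinity, NaN or denormals:

#include <stdint.h>

/* Newton-Raphson division as per the Wikipedia recipe: estimate 1/d on
 * a scaled divisor, refine, then scale back and multiply by n. */
static float nr_divide(float n, float d) {
    union { float f; uint32_t i; } u = { d };
    uint32_t sign = u.i & 0x80000000u;

    /* scale |d| to d' in [0.5, 1) by forcing the exponent field to -1 */
    int32_t exp = (int32_t)((u.i >> 23) & 0xff) - 126;
    u.i = (u.i & 0x007fffffu) | (126u << 23);
    float dd = u.f;

    /* initial estimate of 1/d': x0 = 48/17 - 32/17 * d' (~4 good bits) */
    float x = 48.0f / 17.0f - (32.0f / 17.0f) * dd;

    /* three Newton-Raphson steps x = x * (2 - d' * x), roughly doubling
     * the number of good bits each time - enough for single precision */
    x = x * (2.0f - dd * x);
    x = x * (2.0f - dd * x);
    x = x * (2.0f - dd * x);

    /* re-apply the sign and the 2^-exp scaling, multiply by the numerator */
    u.i = ((uint32_t)(127 - exp) << 23) | sign;
    return n * x * u.f;
}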

Update: not sure if libm was in external memory either actually, I would have to go back and have a look.

Instruction set

So these two little investigations helped expose me to some of the nuances of the instruction set, such as the lack of a unary NOT - one must use EOR with ~0 instead (which takes 3 instructions: 2 to load ~0 into a register, and the eor itself). The lack of bit-field instructions is no surprise, although that fact still doesn't make the pain of emulating them any more fun.

The offset addressing modes are nice in that the index is scaled to the data-size, which makes the 3-bit version quite a bit more useful than it might otherwise be.

Not having r15 as the PC has an obvious side-effect: no direct way to implement PC-relative addressing and position independent code ... although I presume movfs can be used to load the PC to a register, and that can be used instead.
