Thursday 4 September 2014

ezegpu stuff

Did a bit more playing around on the ezegpu. I think I've hit another dead-end in performance, although I guess I got somewhere reasonable with it.

  • I re-did the controller so it uses async DMA for everything. It didn't make any difference to the performance but the code is far cleaner (a rough sketch of the double-buffered pattern is below, after this list).

  • I tried quite a bit to get the rasteriser going a bit faster but with so few cycles to play with and all the data transfer overheads there wasn't much possible. I got about 5% on one test case by changing a pointer to volatile, which made the compiler use a longword write. I got another 5% by separating that code out and optimising it in assembly language using hardware loops. This adds some extra overhead since the whole function can't then be compiled in-line, but it was still an improvement. However 5% is 40s down to 38s and barely noticeable.

    This is the final rasteriser (hardware) loop I ended up with; a rough C equivalent is sketched after this list. The scheduling might be improvable a tiny bit but I don't think it will make a material difference. This is performing the 3 edge-equation tests; interpolating, testing and updating the z-buffer; and storing the fragment location and interpolated 1/w value.

      d8:   529f 1507       fsub r2,r44,r21        ; z-buffer depth test
      dc:   69ff 090a       orr r3,r18,r19         ; v0 < 0 || v1 < 0
      e0:   480f 4987       fadd r18,r18,r24       ; v0 += edge 0 x
      e4:   6e7f 010a       orr r3,r3,r20          ; v0 < 0 || v1 < 0 || v2 < 0
      e8:   6c8f 4987       fadd r19,r19,r25       ; v1 += edge 1 x
      ec:   a41b a001       add r45,r1,8           ; frag + 1
      f0:   910f 4987       fadd r20,r20,r26       ; v2 += edge 2 x
      f4:   047c 4000       strd r16,[r1]          ; *frag = ( x, 1/w )
      f8:   69ff 000a       orr r3,r2,r3           ; v0 < 0 || v1 < 0 || v2 < 0 || (z buffer test)
      fc:   278f 4907       fadd r17,r17,r23       ; 1/w += 1/w x
     100:   347f 1402       movgte r1,r45          ; frag = !test ? frag + 1 : frag
     104:   947f a802       movgte r44,r21         ; oldzw = !test ? newzw : oldzw
     108:   90dc a500       str r44,[r12,-0x1]     ; zbuffer[-1] = zvalue
     10c:   b58f 4987       fadd r21,r21,r27       ; newz += z/w x increment
     110:   90cc a600       ldr r44,[r12],+0x1     ; oldzw = *zbuffer++;
     114:   009b 4800       add r16,r16,1          ; x += 1
    

    Nothing wasted eh?

    Using rounding mode there are, I think, 4 stalls (I'm not using it, but I should be). One when orring in the result of the z-buffer test and three between the last fadd and the first fsub. Given the whole lot is 10 cycles without the stalls, that doesn't really leave enough room to do any better for the serially-forced sequence of 2 flops + load + store required to implement the z-buffer. To do better I would have to unroll the loop once and use ldrd/strd for the z-buffer, which would let me do two pixels in 30 instructions rather than one in 16 instructions. It seems insignificant, but if the scheduling improved such that the execution time went down to the ideal of 10 cycles and then an additional cycle was lopped off - from 14 to 9 cycles per pixel - that's a definitely not-insignificant 55% faster for this individual loop.

    The requirement of an even-pixel starting location is an added cost though. Ahh shit, I dunno; externalising the edge tests might be more fruitful, but maybe not.

  • I did a bunch more project/CVS stuff: moving things around for consistency, removing some old samples, fixing license headers, and so on.

  • I tried loading another model because I was a bit bored of the star - I got the Candide-3 face model. This hits performance a lot more and the ARM-only code nearly catches up with the 16-core Epiphany. I'm not sure why this is.

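For reference, this is roughly the double-buffered shape the controller code has now: kick off the next transfer before working on the buffer that just arrived. It's only a minimal sketch of the idea, not the actual ezegpu code; dma_start_async(), dma_wait() and process_batch() are hypothetical stand-ins for the SDK's DMA calls and the real work.

    #include <stddef.h>

    #define BATCH_WORDS 256

    /* hypothetical stand-ins, not real SDK calls */
    extern void dma_start_async(void *dst, const void *src, size_t bytes);
    extern void dma_wait(void);
    extern void process_batch(const unsigned int *batch, size_t words);

    void controller_loop(const unsigned int *extmem, size_t nbatches)
    {
        static unsigned int buf[2][BATCH_WORDS];
        int cur = 0;

        if (nbatches == 0)
            return;

        /* prime the pipeline with the first batch */
        dma_start_async(buf[cur], extmem, sizeof buf[cur]);

        for (size_t i = 0; i < nbatches; i++) {
            int next = cur ^ 1;

            dma_wait();                   /* buffer 'cur' has now landed */

            /* queue the next batch while we work on this one */
            if (i + 1 < nbatches)
                dma_start_async(buf[next], extmem + (i + 1) * BATCH_WORDS,
                                sizeof buf[next]);

            process_batch(buf[cur], BATCH_WORDS);
            cur = next;
        }
    }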
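
And since the assembly listing above is fairly dense, here's roughly what that inner loop is doing, written out in C. It's a sketch for clarity only - the names are mine, it ignores the hardware-loop setup and the even-pixel alignment details, and the depth comparison just follows the comments in the listing.

    /* Fragment record stored by the strd: pixel x and interpolated 1/w. */
    struct frag {
        int   x;
        float w;
    };

    /* Rasterise one span: 3 edge tests + z-buffer test per pixel, keeping a
     * fragment only when everything passes.  The fragment is written
     * speculatively and the pointer only advances on a pass, so rejected
     * pixels are simply overwritten by the next one - the same trick the
     * movgte pair does above. */
    static struct frag *raster_span(
        float v0, float v1, float v2,      /* edge equation values         */
        float e0x, float e1x, float e2x,   /* their per-pixel x increments */
        float oow, float oowdx,            /* 1/w and its x increment      */
        float zow, float zowdx,            /* z/w and its x increment      */
        float *zbuffer,                    /* one entry per pixel in span  */
        int x0, int x1,
        struct frag *frag)
    {
        float *zb = zbuffer;               /* post-incremented like r12    */

        for (int x = x0; x < x1; x++) {
            float oldz = *zb;

            frag->x = x;                   /* speculative store (strd)     */
            frag->w = oow;

            /* inside all three edges and the depth test passes */
            if (v0 >= 0 && v1 >= 0 && v2 >= 0 && oldz - zow >= 0) {
                *zb = zow;
                frag++;
            }

            zb++;
            v0  += e0x;
            v1  += e1x;
            v2  += e2x;
            oow += oowdx;
            zow += zowdx;
        }
        return frag;                       /* next free fragment slot      */
    }
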
Although there are some other things I haven't gotten to yet, I've pretty much convinced myself this design is a dead-end now, mostly due to the overhead of fragment transfer and poor system utilisation.

First I will put the rasterisers and fragment shaders back together again: splitting them didn't save nearly as much memory as I'd hoped and made it too difficult to fully utilise the flops due to the work imbalance and the transfer overheads. I'm not sure yet about the controller. I could keep the single controller and gang-schedule groups of 2/3/4 cores from the primitive input - I think some sort of multiplier is necessary here for bandwidth (a rough sketch of what that might look like is below). Or I could use 3 or 4 of the first column of cores for this purpose since they all have fair access to the external RAM.
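
For what it's worth, the gang-scheduled option might end up shaped something like this: one external-memory read per primitive, fanned out to every core in a small gang so the read is effectively multiplied. This is only me thinking out loud; send_primitive() and pick_gang() are hypothetical, and how the gang members split the work between themselves is left open.

    #define GANG_SIZE 4                   /* 2, 3 or 4 cores per gang      */
    #define NGANGS    4                   /* number of gangs               */

    struct prim;                          /* primitive as read from extmem */

    /* hypothetical: write a primitive into a core's on-chip input queue   */
    extern void send_primitive(int core, const struct prim *p);
    /* hypothetical: choose the gang with the most queue space free        */
    extern int pick_gang(void);

    void controller(const struct prim *prims, int nprims)
    {
        for (int i = 0; i < nprims; i++) {
            int gang = pick_gang();

            /* one read from external RAM, GANG_SIZE on-chip writes: the
             * 'multiplier' that keeps external bandwidth under control    */
            for (int j = 0; j < GANG_SIZE; j++)
                send_primitive(gang * GANG_SIZE + j, &prims[i]);
        }
    }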
