Wednesday 23 June 2010

convert_uint4

Got the main algorithm i'm working on converted to run on the GPU today - it doesn't seem to be producing the correct results yet, but it should be doing the right steps. Speed is a bit disappointing so far too. But on linux you're flying pretty blind as to what is going on on the card so I don't know if it's merely an uncoalesced memory access problem, not mapping well to the multiprocessors, or something else. I guess I should work out the FLOPS and see if i'm just pushing any limits.

Because local memory is so limited i'm storing data packed in bytes (I need to re-read the same section of data many multiple times), but the unpacking into vectors is pretty inefficient. Even the built-in functions provided aren't as efficient as they could be:
  uchar4 *src;
uint4 work = convert_uint4(src[0]);

Compiles into more code than:
  uchar4 *src;
uchar4 v = src[0];
uint4 work = ((uint4) { v.x, v.y, v.z, v.t }) & ((uint4)0xff);

Which surprisingly compiles into more code than:
  uint *src;
uint v = src[0];
uint4 work = (uint4) { v.x>>24, (v.y<<8)>>24, (v.z<<16)>>24, (v.t<<24)>24 };

The latter would be pretty nasty with SIMD (without an unpack or shuffle at least), but with at least the AMD GPU it is only a half dozen shifts and and's.

Changing a single convert_float4 to a variation of the last one above gave me a 20% speedup ... which probably means i'm just doing it the wrong way to start with ... If I had more local memory I'd convert to a more convenient working size as I loaded it, but alas I don't have enough for this algorithm.

Next i'm going to look at using image objects to access the memory instead. This gives automatic format conversion and caching, and removing the need to manually cache stuff in local memory allows for more flexibility in parallelisation. Unfortunately JOGL doesn't seem to expose the basic image routines from OpenCL without using OpenGL as well which I was hoping to put off for a bit.

I guess I should really try and get a version that at least gives correct results too.

No comments: