Three more changes to your routine, Petr, and the time goes from 3.4 seconds to 2.6, then 1.68, then 1.47.
The fastest version is this one:
vinceCircleFill 400, 300, 200, _RGB32(0, 255, 0)
x0 = R
y0 = 0
e = 0
y0 = y0 + 1
MEM_LINE x - x0, y + y0, x + x0, y + y0, C
MEM_LINE x - x0, y - y0, x + x0, y - y0, C
e = e + 2 * y0
MEM_LINE x - y0, y - x0, x + y0, y - x0, C
MEM_LINE x - y0, y + x0, x + y0, y + x0, C
x0 = x0 - 1
e = e - 2 * x0
MEM_LINE x - R, y, x + R, y, C
x1 = x1t * 4: x2 = x2t * 4
y1 = y1t * w * 4: y2 = y2t * w * 4
dX = x2 - x1
dY = y2 - y1
x = x1: y = y1
inc = dY \ dX
x = x + 4
y = y + inc
inc = dX \ dY
x = x + inc
y = y + w
d = y1
d = d + w
d = x1
d = d + 4
The first change, as I mentioned above, was to change how we think of X/Y and calculate our offset. It's no longer m.OFFSET + (w * y + x) * 4. Instead, x and y are converted to byte units once, up front (x * 4 and y * w * 4), so inside the loops the offset is simply m.OFFSET + y + x. This changed the speed from 3.4 seconds to 2.6.
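Here's a minimal sketch of that offset arithmetic on its own, with no image or _MEM block involved. The values and the names xb and yb are made up for illustration, and w is assumed to be the image width in pixels, as in the original formula:

```qb64
' Hypothetical values for illustration: an 800-pixel-wide, 32-bit image.
DIM w AS LONG, x AS LONG, y AS LONG
DIM xb AS LONG, yb AS LONG ' xb/yb are illustrative names, not from the routine
w = 800
x = 400: y = 300

' Original style: the full offset math is done for every pixel.
PRINT (w * y + x) * 4 ' 961600

' Prescaled style: convert to byte units once...
xb = x * 4 ' 4 bytes per pixel
yb = y * w * 4 ' w * 4 bytes per row
' ...and from then on the offset is a plain addition.
PRINT yb + xb ' 961600
```

Both lines print the same byte offset; the second form just moves the multiplications out of the per-pixel work.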
The second change, also mentioned above, was replacing the floating-point division (/) with integer division (\). This further improved the speed from 2.6 to 1.68 seconds.
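As a quick illustration of the difference between the two operators (the numbers here are hypothetical, not taken from the test run):

```qb64
DIM dX AS LONG, dY AS LONG
dX = 1600: dY = 6 ' hypothetical deltas

PRINT dX / dY ' floating-point division: result has a fractional part
PRINT dX \ dY ' integer division: 266, fraction discarded
```

Since the result is only ever used as a whole-number step into memory, the fractional part was wasted work anyway.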
The final change came from noticing that the integer division never actually changes INSIDE the loop. dX and dY aren't changing values, so dX \ dY (or dY \ dX) is never going to produce a different result. I calculated the increment ONCE before each loop with inc = dX \ dY, and then used that value inside the loop itself. Doing math ONCE is faster than doing it multiple times. This cut the time from 1.68 seconds to 1.47.
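The same idea in isolation looks roughly like this. It's a stripped-down sketch with hypothetical values and a simple FOR loop standing in for the real drawing loop:

```qb64
DIM dX AS LONG, dY AS LONG, inc AS LONG
DIM x AS LONG, i AS LONG
dX = 1600: dY = 4 ' hypothetical deltas

' Before: the same division is repeated on every pass.
x = 0
FOR i = 1 TO dY
    x = x + dX \ dY
NEXT
PRINT x ' 1600

' After: divide once, then reuse the stored result.
inc = dX \ dY
x = 0
FOR i = 1 TO dY
    x = x + inc
NEXT
PRINT x ' 1600
```

Both loops end up at the same value; the second one just does one division in total instead of one per pass.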
And when you figure we started with a process that took over 10 seconds to begin with, optimizing it down to 1.47 is quite a boost in overall performance! It still doesn't compare to the speed we see from CircleFill, which I plugged in for testing and which took 0.32 seconds to do the same thing, but it's a heckuva change from what it was originally. ;)