Looking at it fresh this morning, I see another way you could speed it up even more, with the use of some creative math.
DO WHILE y <> y2
x = x + (dX / dY)
y = y + 1
_MEMPUT m, m.OFFSET + (w * y + x) * 4, clr
LOOP
If you multiply y1 and y2 by w once, up front, you can take a multiplication out of every pass through the DO LOOPS...
Y1 = Y1 * w: Y2 = Y2 * w <---- first line at the start of the SUB
Y = Y + w <---- in the DO LOOP
_MEMPUT m, m.OFFSET + (y + x) * 4, clr <---- internally in the routine now, Y increments by the _WIDTH amount naturally, so the multiplication by w inside the loop is gone.
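Put together, the idea might look something like the rough sketch below. DrawSteepLine and its parameter names are made up for the example, and it assumes a 32-bit image, y1 < y2, and a line that stays inside the image. It scales into local variables rather than overwriting Y1 and Y2, so the caller's values are left alone.

```qb64
' Sketch only: DrawSteepLine and its parameters are hypothetical names.
' Assumes a 32-bit image, y1 < y2, and every point inside the image.
SCREEN _NEWIMAGE(640, 480, 32)
DIM m AS _MEM
m = _MEMIMAGE(0)
DrawSteepLine m, _WIDTH, 100, 50, 200, 400, _RGB32(255, 255, 0)
_MEMFREE m
SLEEP

SUB DrawSteepLine (m AS _MEM, w AS LONG, x1 AS LONG, y1 AS LONG, x2 AS LONG, y2 AS LONG, clr AS _UNSIGNED LONG)
    DIM x AS SINGLE, y AS LONG, yEnd AS LONG, dX AS LONG, dY AS LONG
    dX = x2 - x1: dY = y2 - y1 ' deltas are taken from the unscaled values
    y = y1 * w: yEnd = y2 * w ' scale Y once; it is now a row offset in pixels
    x = x1
    DO WHILE y <> yEnd
        x = x + (dX / dY)
        y = y + w ' next row: add the width instead of multiplying by it
        _MEMPUT m, m.OFFSET + (y + CLNG(x)) * 4, clr
    LOOP
END SUB
```

The loop condition still works because both y and yEnd are whole multiples of w, so y lands exactly on yEnd after the same number of passes as before.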
*************
Since X and Y are integer values, you can probably change the math to be a little more efficient elsewhere as well:
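Instead of X = X + (DX / DY), make it X = X + (DX \ DY)
Integer division (\) is generally faster than real division (/), so if you can use it and need a program to optimize speed, do so. Just keep in mind that \ throws away the fractional part, so the two versions only draw the same line when DX divides evenly by DY. Whenever DX is smaller than DY, DX \ DY comes out as 0 and X never moves.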
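If you want to see the difference between the two operators for yourself, a few PRINT statements will do it. The numbers here are just illustrative values, not taken from the line routine:

```qb64
PRINT 7 / 2 ' 3.5  real division keeps the fraction
PRINT 8 \ 2 ' 4    integer division, divides evenly
PRINT 7 \ 2 ' 3    the .5 is discarded
PRINT 3 \ 8 ' 0    anything smaller than the divisor gives 0
```

So the swap is safe when the step is a whole number, and it changes the result when it isn't.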
****************
There are lots of little tricks and tweaks which can be used to optimize speed. The main thing you have to be *really* careful of is not to obfuscate the code beyond the point of being able to understand it in the future. Just because you *can* make it faster, it doesn't mean you always *should* -- especially if you alter it so much that you can't figure it out, debug it, or alter it at a later date.
Fast is good, but understanding is better. ;D