@SMcNeill I offer the following explanation of how adding one extra line of code can significantly reduce the run time rather than increase it as expected.
This reply assumes you understand the background information as per the link below
https://www.qb64.org/forum/index.php?topic=2276.msg138538#msg138538
Reply 184
Firstly, a very rough estimate of the size of your code is the size of your program.exe minus the size of a program containing no code (untitled.exe). That puts your actual running code at about 10.2 Kbytes (I get 2094080 - 2083840 = 10240 bytes). If your computer (which runs almost exactly 2x faster than mine) is of similar/related architecture (e.g. Intel i7 or better), then the 10K of .exe code is a candidate for L1.
On my laptop, L1 is 32K, but from my observations Windows background processes are using all processors (4 physical, 8 logical) even when CPU usage (Task Manager) is only, say, 30%. So at any time a fairly large percentage of L1 is being "hogged" by Windows, and you would be lucky to have anywhere near 10 Kbytes of "your" code in L1 (similar considerations apply to L2 and L3).
When, for a clearer understanding, I reduce your program to its minimal requirements (supplied below), the active code (.exe) is 2092032 - 2083840 = 8192 bytes, or about 8.2 Kbytes. Although the difference (10.2 vs 8.2 Kbytes) may not seem much, in relative terms it matters for what fits in L1 (whose usage is being "butchered" by Windows background processes): it appears that even a relatively slight change in .exe code can yield wildly differing timing results (because of Windows).
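As a quick check of the size arithmetic, here is a minimal QB64 sketch. The three file sizes and the 32K L1 figure are the ones quoted in this post; the constant names are my own, and treating the .exe size difference as "code that wants to live in L1" is only the rough estimate used here, not a measurement.

```basic
' Sizes in bytes as quoted in this post; constant names are illustrative only.
CONST BaseExe = 2083840 ' untitled.exe (program containing no code)
CONST FullExe = 2094080 ' the full program
CONST SmallExe = 2092032 ' the reduced program
CONST L1Bytes = 32768 ' 32K L1 as stated for my laptop (assumed figure)

' Difference from the empty program, and that difference as a share of L1
PRINT "full program code:   "; FullExe - BaseExe; "bytes ="; (FullExe - BaseExe) / L1Bytes * 100; "% of L1"
PRINT "reduced program code:"; SmallExe - BaseExe; "bytes ="; (SmallExe - BaseExe) / L1Bytes * 100; "% of L1"
```

The differences come out as 10240 and 8192 bytes, i.e. 31.25% and 25% of a 32K L1.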
So your .exe (the heavily referenced part, 10.2 Kbytes) is only partly resident in L1 (similarly for L2 and L3) at any instant. When the _MemCopy statement executes, some (_WIDTH * _HEIGHT =~ 730,000) memory accesses are processed efficiently at machine level. That is a very large number, which means that L1 (after a setup time) is preloaded with the relevant _MemCopy machine code. Also, the _MemCopy code is probably relatively small (say << 1 Kbyte) and has a high chance of displacing little-used Windows background code that is "lying around still in L1".
Because the _MemCopy machine code is then "resident" in L1, any later references to the m and m1 _MEM blocks (i.e. the rest of your program) would benefit from the MEM code being in L1. In fact, with your test limit value of 100, some 73 million MEM operations (over about 10 seconds) would be sped up simply because _MEMCOPY was used once. HAD all of your 10.2 Kbytes of code been resident in L1, then POSSIBLY there would be no significant difference between (with _MEMCOPY) and (without _MEMCOPY).
I do not know how to measure what is in L1, L2 and L3 at any time, but as far as L1 goes it seems very apparent that Windows is very aggressive in "hogging" it, leaving much less than 10 Kbytes for anything else. ONLY when a highly repetitive "small" amount of code (_MemCopy, in this case > 730,000 iterations) runs is some of the L1 being "hogged" by Windows released for other applications (your program). In the first part of your program (before _MemCopy), it appears that the relevant code is "too large" for what is immediately free in L1, and even a billion iterations of that code might not "displace" the Windows background processes. In conclusion, _MEMCOPY (small machine code) performing 730,000 iterations of memory referencing is the winner.
Now it is interesting that when running the code below, which is a drastically reduced version of your program (and there is NOTHING ACTUALLY TO SEE WHEN RUNNING), the timing differences (with _MEMCOPY) versus (without _MEMCOPY) are relatively minor and may not even warrant discussion. But now we are talking about a smaller program (8.2 Kbytes), which has a much higher chance of being "resident" in L1 than the 10.2 Kbytes of your program. If you ran this shortened program many times, AND with varying applications installed/removed (e.g. other instances of QB64, Edge, etc.), you would get "widely varied results", including sometimes (with _MEMCOPY) taking LONGER than (without _MEMCOPY), and relative timing ratios approaching 3:1. It appears that the "erratic" (i.e. inconsistent) behaviour of Windows background processes is the reason for this.
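Since single timings vary this much from run to run, one way to make the spread visible is to repeat the whole timed block several times and report the fastest and slowest run. The following is only a sketch under my own assumptions, not your program: the buffer size (about 730,000 4-byte elements), the number of repeats, and all names (NumRuns, BufBytes, src, dst, best, worst) are hypothetical, and the timed work is a plain _MEMGET/_MEMPUT walk through the buffer in steps of 4 bytes.

```basic
' Sketch only: sizes, repeat count and names are assumptions, not the original program.
CONST TestLimit = 100
CONST NumRuns = 5 ' how many times the whole timed block is repeated
CONST BufBytes = 2920000 ' about 730,000 elements of 4 bytes each

DIM src AS _MEM, dst AS _MEM
DIM v AS _UNSIGNED LONG
DIM t AS DOUBLE, best AS DOUBLE, worst AS DOUBLE

src = _MEMNEW(BufBytes)
dst = _MEMNEW(BufBytes)
best = 1E+30
worst = 0

FOR r% = 1 TO NumRuns
    t = TIMER(0.001) ' ignores the midnight rollover of TIMER
    FOR trials% = 1 TO TestLimit
        FOR i& = 0 TO BufBytes - 4 STEP 4
            _MEMGET src, src.OFFSET + i&, v ' read 4 bytes from the source block
            _MEMPUT dst, dst.OFFSET + i&, v ' write them to the destination block
        NEXT
    NEXT
    t = TIMER(0.001) - t
    IF t < best THEN best = t
    IF t > worst THEN worst = t
NEXT

PRINT "fastest run:"; best; "s   slowest run:"; worst; "s"
_MEMFREE src
_MEMFREE dst
```

If the ratio of slowest to fastest run is already large with identical code, then a with/without _MEMCOPY difference of similar size cannot be attributed to _MEMCOPY alone. Note that the full test can take a while; lower NumRuns or TestLimit for a quick look.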
As a side note, referring to reply #184 mentioned above, it would appear (without proper study/analysis) that code running from L1 approaches being 10x faster than the first time that code is executed. One has to wait out the setup time for preloading L1 before taking advantage of this.
TestLimit = 100
for trials% = 1 to TestLimit
o = 0
o = o + 4
_MemCopy n1, n1.OFFSET, n1.SIZE To n, n.OFFSET '*** this _MemCopy line added
for trials% = 1 to TestLimit
o = 0
o = o + 4
Print "(without _MemCopy)"; t0# - tb#
print "# of MEM references GRAND TOTAL =";