Author Topic: Somebody explain this one to me... (Read 27088 times)

SMcNeill · « **Reply #15 on:** November 06, 2021, 03:18:09 pm »

Quote from: Petr on November 06, 2021, 04:41:54 am

@SMcNeill

Hi, here is my results after replacing DIM
0.7 seconds (before 15 seconds)
0.6 seconds (the same as before)

Let me express my deep respect for you, because I would not really look for the problem in this. This is an absolutely crucial finding. I am very glad that you have alerted us to this matter. Great job!

The code as written is kind of buggy (it only works with a clear background), but it does a good job illustrating how arrays work in subs/functions. In the main part of the program, they reserve memory when we DIM them. In SUB/FUNCTION they don't -- the memory is allocated dynamically as needed. This difference can produce quite a speed/performance difference, so there may be times when it's better to DIM SHARED an array and let it stay in memory than it is to DIM it and free it repeatedly in a SUB.

It's a shame that STATIC won't work with an array in a sub. At least, I can't get it to work. When I try I just get a crash to desktop.

Just something to keep in the back of the mind from now on for speed optimizations.

Richard · « **Reply #16 on:** December 05, 2021, 01:20:09 am »

@SMcNeill

I offer the following explanation as to how it is possible that by addition of one extra line of code, significant reduction in timing results rather than expected an increase.

This reply assumes you understand the background information as per the link below

https://www.qb64.org/forum/index.php?topic=2276.msg138538#msg138538
Reply 184

Firstly, very roughly when determining the size of your code (i.e. a program containing no code (untitled.exe) subtracted from your program.exe) - your actual running code (.exe) is about 10.2 Kbytes (I get 2094080 -2083840 =~10.2 Kbytes) . If your computer (which runs almost exactly 2x faster than mine) is of similar/related architecture (eg INTEL >= i7) - then the 10K of .exe code is a candidate for L1.

On my laptop, L1 is 32K - but from my observations Windows background processes are using all processors (4 physical, 8 logical) even when CPU useage (Task Manager) is only say 30%. So at any time, a fairly large percentage of L1 is being "hogged" by Windows and you would be lucky to have any where near 10 Kbytes of "your" code in L1 (similar considerations for L2 and L3).

When, for a clearer understanding, I reduce your program (supplied below) to minimal requirements, the active code (.exe) is (2092032 - 2083840) =~ 8.2 Kbytes. To me, although the difference (10.2 - 8.2 Kb) may not seem much, in relative terms for L1 location (which is being "butchered" in useage by Windows background processes) - it is apparent that even a relatively slight change in .exe code can yield wildly differing timing results (because of Windows).

So your.exe (heavily referred part = 10.2 Kbytes) is only partly residing in L1 (similarly for L2 and L3) at any instance of time. When the _MemCopy code is acted on - some (_WIDTH * _HEIGHT =~ 730,000) memory accesses are efficiently at machine level being processed, which is a very large number, which means L1 (after a setup time) is preloaded with the relevant _MemCopy machine Code. Also probably the _MemCopy is relatively small (say << 1K bytes) and has a high chance of 'displacing' not-so-much utilized Windows background processes that are "lying around still in L1". By virtue of the fact that _MemCopy machine code is "resident" in L1, any future references to m and m1 _MEM BLOCKS (i.e. the rest of your program) would benefit from MEM code being in L1. In fact, with your test limit value of 100, some 73 million MEM operations (over about 10 seconds) would have speed improvement simply because the fact that _MEMCOPY was utilized once. HAD all of your 10.2 Kbyte of code been resident in L1 then POSSIBLY there may not be any significant between (with _MEMCOPY) versus (without _MEMCOPY). I do not know how to measure what is L1, L2, L3 at any time but as far as L1 goes it seems very apparent that windows is very aggressive in "hogging" L1, leaving much less than 10K bytes for anything else, and ONLY when a highly repetitive "small" amount of code (_MemCopy), in this case >730,000 iterations, is when some of L1 being "hogged" by windows is released for other applications (your program). In the first running part of your program (before _MemCopy), it is apparent that the relevant code is "too large" for what is immediately free in L1 and even a billion iterations of that code set may not "displace Windows background processes". In conclusion, _MEMCOPY (machine code small) performing 730000 iterations of memory referencing, is the winner.

Now it is interesting when running the code below, which is a drastically reduced version of your program (and there is NOTHING ACTUALLY TO SEE WHEN RUNNING), that the timing differences (with _MEMCOPY) versus (without _MEMCOPY) are relatively minor and may not even warrant discussion. But now we are taking about a smaller program (8.2 Kbyte) which has a much higher chance of being "resident" in L1 than the 10.2 Kbyte (your program). Now if you ran this shortened program many times, AND also with varying applications installed/removed (eg other instances of QB64, EDGE, etc) - then you will get "widely varied results" including sometimes (with _MEMCOPY) taking LONGER than (without _MEMCOPY) and relative timing ratios approaching 3:1. It appears that the "erratic behaviour" (i.e. non-consistent) of Windows background processes is the reasoning of this.

As a side note, referring to the reply #184 mentioned above, it would appear (without proper study/analysis) that anything running in L1 is approaching being 10x faster than the first time code is being performed. One has to wait for the setup times for preloading L1 before taking advantage of this.

Code: QB64: [Select]

Screen _NewImage(1024, 720, 32):Do: Loop Until _ScreenExists:_ScreenMove _Middle
TestLimit = 100
Dim TempGrid(0 To _Width - 1, 0 To _Height - 1) As Long
dim TempGridn(0 To _Width - 1, 0 To _Height - 1) As Long
Dim As _MEM m, m1: m = _Mem(TempGrid()): m1 = _MemImage(0)
Dim As _MEM n, n1: n = _Mem(TempGridn()): n1 = _MemImage(0)
Dim o As _Offset:o = 0
 
tb#=timer
 
for trials%=1 to TestLimit
    o = 0
    Do Until o >= m.SIZE
        If _MemGet(m, m.OFFSET + o, _Unsigned Long) <> 0 Then _MemPut m1, m1.OFFSET + o, kolor As _UNSIGNED LONG
        o = o + 4
    Loop
next
 
t0#=timer:
 
_MemCopy n1, n1.OFFSET, n1.SIZE To n, n.OFFSET '*** this _MemCopy line added
for trials%=1 to TestLimit
    o = 0
    Do Until o >= n.SIZE
        If _MemGet(n, n.OFFSET + o, _Unsigned Long) <> 0 Then _MemPut n1, n1.OFFSET + o, kolor As _UNSIGNED LONG
        o = o + 4
    Loop
next:
 
t2#=timer
 
_MemFree m: _MemFree m1:_MemFree n:_MemFree n1
Print  "(without   _MemCopy)"; t0# - tb#
Print  "(including _MemCopy)"; t2# - t0#:print
print "# of MEM references GRAND TOTAL =";
print using "###,###,###";_WIDTH*_HEIGHT*Testlimit;
 
 
 
 

TempodiBasic · « **Reply #17 on:** December 05, 2021, 12:16:41 pm »

Wow you're following the path of crucial difference in results of translator into C++.
No bad BASIC code but bad final C++ code.
The issue moves from coder's ability to ability of translator into C++.

SMcNeill · « **Reply #18 on:** December 05, 2021, 03:08:17 pm »

The issue for us isn't one to do with L1 or L2 memory. It has to do with how QB64 allocates memory

DIM foo(1000), if used in the main module, creates and reserves space in memory foryour 1001 elements from the moment you DIM it.

DIM foo(1000), if used in a SUB or FUNCTION, doesn't create or reserve any space, until you put values into each element.

And the difference in these two behaviors is one which affects performance quite a bit, if called repeatedly. It's this difference that creates our difference in speed and performance.

And, I suppose it makes sense. DIM, in the main module, creates a permanent array. We never free it, unless we do so manually with ERASE or CLEAR, or whatnot. All variables and such, however, end up being temporary when created for a SUB/FUNCTION, unless we specifically declare it as STATIC.

It's the STATIC vs DYNAMIC creation of memory which is the difference in speed in the xamples here..

Richard · « **Reply #19 on:** January 04, 2022, 08:00:45 am »

@SMcNeill

I understand your reply regarding differences in memory allocation for arrays (Main_Program) versus (SUBS) and how this affects timings in using these techniques for array work.

I attempted to time the big DIM array

Dim TempGrid(0 To _Width - 1, 0 To _Height - 1) As Long

for the two "passes" of your circle program and did not see any significant changes (minor ones exist) between your two methods as far as DIM being setup.

I attempted to slightly modify your program (my alterations shown with "*** at the end of each relevant line) and is supplied below.

Perhaps you are correct - in which case there is a a coding problem in my alterations to your program.
Would you be kind enough to point out any coding problems in my alterations to establish correct timing results to support your last reply?

Code: QB64: [Select]

 
DECLARE DYNAMIC LIBRARY "kernel32"                                             '***
    SUB QueryPerformanceCounter (BYVAL lpPerformanceCount AS _UNSIGNED _OFFSET)'***
    SUB QueryPerformanceFrequency (BYVAL lpFrequency AS _UNSIGNED _OFFSET)     '***
END DECLARE                                                                    '***
DIM SHARED PerformanceCount AS _INTEGER64                                      '***
DIM SHARED Frequency AS _INTEGER64                                             '***
DIM SHARED StartingCount AS _INTEGER64                                         '***
DIM SHARED dElapsedTime AS DOUBLE                                              '***
DIM SHARED dElapsedTimeOLD AS DOUBLE                                           '***
DIM SHARED CumulativeTime AS DOUBLE                                            '***
QueryPerformanceFrequency (_OFFSET(Frequency))                                 '***
QueryPerformanceCounter (_OFFSET(StartingCount))                               '***
CumulativeTime#=0                                                              '***
 
 
Screen _NewImage(1024, 720, 32)
Do: Loop Until _ScreenExists
_ScreenMove _Middle
 
 
TestLimit = 100
Radius = 350
 
Randomize Timer
t# = Timer
For i = 1 To TestLimit
    'Cls , &HFF000000 + Int(Rnd * &HFFFFFF)
    k& = &HFF000000 + Int(Rnd * &HFFFFFF)
 
    Circle (512, 360), Radius, k&
    Fill 512, 360, k&, 0
Next
 
t1# = Timer
CumulativeTime_T1_T#=CumulativeTime#                                           '***
CumulativeTime#=0                                                              '***
 
For i = 1 To TestLimit
    'Cls , &HFF000000 + Int(Rnd * &HFFFFFF)
    k& = &HFF000000 + Int(Rnd * &HFFFFFF)
 
    Circle (512, 360), Radius, k&
    Fill 512, 360, k&, -1
Next
t2# = Timer
CumulativeTime_T2_T1#=CumulativeTime#                                          '***
 
Print Using "###.#### seconds for Fill."; t1# - t#;
COLOR &HFF00FF00~&                                                             '***
print "             ";                                                        '***
print using "#,###";CumulativeTime_T1_T#;                                      '***
print " microseconds to DIM array"                                            '***
COLOR &HFFFFFFFF~&
 
 
Print Using "###.#### seconds for Fill Method -1."; t2# - t1#;
COLOR &HFF00FF00~&                                                             '***
print "   ";                                                                  '***
print using "#,###";CumulativeTime_T2_T1#;                                     '***
print " microseconds to DIM array"                                            '***
COLOR &HFFFFFFFF~&                                                             '***
 
 
 
 
Sub Fill (Passx, Passy, kolor As Long, METHOD)
$Checking:Off
B = _Blend
_DontBlend
 
 
TIMERe:dElapsedTimeOLD = dElapsedTime                                          '***
Dim TempGrid(0 To _Width - 1, 0 To _Height - 1) As Long
TIMERe                                                                         '***
CumulativeTime#=CumulativeTime#+(dElapsedTime - dElapsedTimeOLD)*1000000       '***
 
 
Dim As _MEM m, m1: m = _Mem(TempGrid()): m1 = _MemImage(0)
Dim o As _Offset
If METHOD Then
    _MemCopy m1, m1.OFFSET, m1.SIZE To m, m.OFFSET
End If
o = 0
Do Until o >= m.SIZE
    If _MemGet(m1, m1.OFFSET + o, _Unsigned Long) = kolor Then
        _MemPut m, m.OFFSET + o, -1 As LONG
        'Else
        '_MemPut m, m.OFFSET + o, 0 As LONG
    End If
    o = o + 4
Loop
 
TempGrid(Passx, Passy) = 1
startx = Passx: finishx = Passx
starty = Passy: finishy = Passy
 
Do Until finished
    pass = pass + 1
    finished = -1
    For x = startx To finishx
        For y = starty To finishy
            'If TempGrid(x, y) <> 0 Then Print x, y, TempGrid(x, y): Sleep
 
            If TempGrid(x, y) = pass Then
                tempx = x
                Do Until tempx = 0
                    If TempGrid(tempx - 1, y) = 0 Then
                        TempGrid(tempx - 1, y) = pass + 1
                        tempx = tempx - 1
                        If tempx < startx Then startx = tempx
                        finished = 0
                    Else
                        tempx = 0
                    End If
                Loop
 
                tempx = x
                Do Until tempx = _Width - 1
                    If TempGrid(tempx + 1, y) = 0 Then
                        TempGrid(tempx + 1, y) = pass + 1
                        tempx = tempx + 1
                        If tempx > finishx Then finishx = tempx
                        finished = 0
                    Else
                        tempx = _Width - 1
                    End If
                Loop
 
                tempy = y
                Do Until tempy = 0
                    If TempGrid(x, tempy - 1) = 0 Then
                        TempGrid(x, tempy - 1) = pass + 1
                        tempy = tempy - 1
                        If tempy < starty Then startx = tempy
                        finished = 0
                    Else
                        tempy = 0
                    End If
                Loop
 
                tempy = y
                Do Until tempy = _Height - 1
                    If TempGrid(x, tempy + 1) = 0 Then
                        TempGrid(x, tempy + 1) = pass + 1
                        tempy = tempy + 1
                        If tempy > finishy Then finishy = tempy
                        finished = 0
                    Else
                        tempy = _Height - 1
                    End If
                Loop
            End If
    Next y, x
Loop
 
o = 0
Do Until o >= m.SIZE
    If _MemGet(m, m.OFFSET + o, _Unsigned Long) <> 0 Then _MemPut m1, m1.OFFSET + o, kolor As _UNSIGNED LONG
    o = o + 4
Loop
_MemFree m: _MemFree m1
$Checking:On
If B Then _Blend
End Sub
 
'***
SUB TIMERe                                                                     '***
QueryPerformanceCounter (_OFFSET(PerformanceCount))                            '***
dElapsedTime = (CDBL(PerformanceCount - StartingCount)) / Frequency            '***
END SUB                                                                        '***
 
 

News:

Author Topic: Somebody explain this one to me... (Read 27088 times)

SMcNeill

Re: Somebody explain this one to me...

Richard

Re: Somebody explain this one to me...

TempodiBasic

Re: Somebody explain this one to me...

SMcNeill

Re: Somebody explain this one to me...

Richard

Re: Somebody explain this one to me...