Author Topic: Somebody explain this one to me...  (Read 8159 times)

0 Members and 1 Guest are viewing this topic.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Somebody explain this one to me...
« Reply #15 on: November 06, 2021, 03:18:09 pm »
@SMcNeill

Hi, here is my results after replacing DIM
0.7 seconds (before 15 seconds)
0.6 seconds (the same as before)

Let me express my deep respect for you, because I would not really look for the problem in this. This is an absolutely crucial finding. I am very glad that you have alerted us to this matter. Great job!

The code as written is kind of buggy (it only works with a clear background), but it does a good job illustrating how arrays work in subs/functions.  In the main part of the program, they reserve memory when we DIM them.  In SUB/FUNCTION they don't -- the memory is allocated dynamically as needed.  This difference can produce quite a speed/performance difference, so there may be times when it's better to DIM SHARED an array and let it stay in memory than it is to DIM it and free it repeatedly in a SUB.

It's a shame that STATIC won't work with an array in a sub.  At least, I can't get it to work.  When I try I just get a crash to desktop. 

Just something to keep in the back of the mind from now on for speed optimizations.
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Richard

  • Seasoned Forum Regular
  • Posts: 364
    • View Profile
Re: Somebody explain this one to me...
« Reply #16 on: December 05, 2021, 01:20:09 am »
@SMcNeill





I offer the following explanation as to how it is possible that by addition of one extra line of code, significant reduction in timing results rather than expected an increase.


This reply assumes you understand the background information as per the link below



https://www.qb64.org/forum/index.php?topic=2276.msg138538#msg138538

Reply 184



Firstly, very roughly when determining the size of your code (i.e. a program containing no code (untitled.exe) subtracted from your program.exe) - your actual running code (.exe) is about 10.2 Kbytes (I get 2094080 -2083840 =~10.2 Kbytes) . If your computer (which runs almost exactly 2x faster than mine) is of similar/related architecture (eg INTEL >= i7) - then the 10K of .exe code is a candidate for L1.

On my laptop, L1 is 32K - but from my observations Windows background processes are using all processors (4 physical, 8 logical) even when CPU useage (Task Manager) is only say 30%. So at any time, a fairly large percentage of L1 is being "hogged" by Windows and you would be lucky to have any where near 10 Kbytes of "your" code in L1 (similar considerations for L2 and L3).


When, for a clearer understanding, I reduce your program (supplied below) to minimal requirements, the active code (.exe) is (2092032 - 2083840) =~ 8.2 Kbytes. To me, although the difference (10.2 - 8.2 Kb) may not seem much, in relative terms for L1 location (which is being "butchered" in useage by Windows background processes) - it is apparent that even a relatively slight change in .exe code can yield wildly differing timing results (because of Windows).


So your.exe (heavily referred part = 10.2 Kbytes) is only partly residing in L1 (similarly for L2 and L3) at any instance of time. When the _MemCopy code is acted on - some (_WIDTH * _HEIGHT  =~ 730,000) memory accesses are efficiently at machine level being processed, which is a very large number, which means L1 (after a setup time) is preloaded with the relevant _MemCopy machine Code. Also probably the _MemCopy is relatively small (say << 1K bytes) and has a high chance of 'displacing' not-so-much utilized Windows background processes that are "lying around still in L1". By virtue of the fact that _MemCopy machine code is "resident" in L1, any future references to m and m1 _MEM BLOCKS  (i.e. the rest of your program) would benefit from MEM code being in L1. In fact, with your test limit value of 100, some 73 million MEM operations (over about 10 seconds) would have speed improvement simply because the fact that _MEMCOPY was utilized once. HAD all of your 10.2 Kbyte of code been resident in L1 then POSSIBLY there may not be any significant between (with _MEMCOPY) versus (without _MEMCOPY). I do not know how to measure what is L1, L2, L3 at any time but as far as L1 goes it seems very apparent that windows is very aggressive in "hogging" L1, leaving much less than 10K bytes for anything else, and ONLY when a highly repetitive "small" amount of code (_MemCopy), in this case >730,000 iterations, is when some of L1 being "hogged" by windows is released for other applications (your program). In the first running part of your program (before _MemCopy), it is apparent that the relevant code is "too large" for what is immediately free in L1 and even a billion iterations of that code set may not "displace Windows background processes". In conclusion, _MEMCOPY (machine code small) performing 730000 iterations of memory referencing, is the winner.


Now it is interesting when running the code below, which is a drastically reduced version of your program (and there is NOTHING ACTUALLY TO SEE WHEN RUNNING), that the timing differences (with _MEMCOPY) versus (without _MEMCOPY) are relatively minor and may not even warrant discussion. But now we are taking about a smaller program (8.2 Kbyte) which has a much higher chance of being "resident" in L1 than the 10.2 Kbyte (your program). Now if you ran this shortened program many times, AND also with varying applications installed/removed (eg other instances of QB64, EDGE, etc) - then you will get "widely varied results" including sometimes (with _MEMCOPY) taking LONGER than (without _MEMCOPY) and relative timing ratios approaching 3:1. It appears that the "erratic behaviour" (i.e. non-consistent) of Windows background processes is the reasoning of this.


As a side note, referring to the reply #184 mentioned above, it would appear (without proper study/analysis) that anything running in L1 is approaching being 10x faster than the first time code is being performed. One has to wait for the setup times for preloading L1 before taking advantage of this.



Code: QB64: [Select]
  1. TestLimit = 100
  2. Dim TempGrid(0 To _Width - 1, 0 To _Height - 1) As Long
  3. dim TempGridn(0 To _Width - 1, 0 To _Height - 1) As Long
  4. Dim As _MEM m, m1: m = _Mem(TempGrid()): m1 = _MemImage(0)
  5. Dim As _MEM n, n1: n = _Mem(TempGridn()): n1 = _MemImage(0)
  6. Dim o As _Offset:o = 0
  7.  
  8. tb#=timer
  9.  
  10. for trials%=1 to TestLimit
  11.     o = 0
  12.     Do Until o >= m.SIZE
  13.         If _MemGet(m, m.OFFSET + o, _Unsigned Long) <> 0 Then _MemPut m1, m1.OFFSET + o, kolor As _UNSIGNED LONG
  14.         o = o + 4
  15.     Loop
  16.  
  17. t0#=timer:
  18.  
  19. _MemCopy n1, n1.OFFSET, n1.SIZE To n, n.OFFSET '*** this _MemCopy line added
  20. for trials%=1 to TestLimit
  21.     o = 0
  22.     Do Until o >= n.SIZE
  23.         If _MemGet(n, n.OFFSET + o, _Unsigned Long) <> 0 Then _MemPut n1, n1.OFFSET + o, kolor As _UNSIGNED LONG
  24.         o = o + 4
  25.     Loop
  26.  
  27. t2#=timer
  28.  
  29. Print  "(without   _MemCopy)"; t0# - tb#
  30. Print  "(including _MemCopy)"; t2# - t0#:print
  31. print "# of MEM references GRAND TOTAL =";
  32. print using "###,###,###";_WIDTH*_HEIGHT*Testlimit;
  33.  
  34.  
  35.  
  36.  

Offline TempodiBasic

  • Forum Resident
  • Posts: 1792
    • View Profile
Re: Somebody explain this one to me...
« Reply #17 on: December 05, 2021, 12:16:41 pm »
Wow you're following the path of crucial difference in results of translator into C++.
No  bad BASIC code but bad final  C++ code.
The issue moves from coder's ability  to ability of translator into C++.
Programming isn't difficult, only it's  consuming time and coffee

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Somebody explain this one to me...
« Reply #18 on: December 05, 2021, 03:08:17 pm »
The issue for us isn't one to do with L1 or L2 memory.  It has to do with how QB64 allocates memory

DIM foo(1000), if used in the main module, creates and reserves space in memory foryour 1001 elements from the moment you DIM it.

DIM foo(1000), if used in a SUB or FUNCTION, doesn't create or reserve any space, until you put values into each element.

And the difference in these two behaviors is one which affects performance quite a bit, if called repeatedly.  It's this difference that creates our difference in speed and performance. 

And, I suppose it makes sense.  DIM, in the main module, creates a permanent array.  We never free it, unless we do so manually with ERASE or CLEAR, or whatnot.  All variables and such, however, end up being temporary when created for a SUB/FUNCTION, unless we specifically declare it as STATIC.

It's the STATIC vs DYNAMIC creation of memory which is the difference in speed in the xamples here..
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Richard

  • Seasoned Forum Regular
  • Posts: 364
    • View Profile
Re: Somebody explain this one to me...
« Reply #19 on: January 04, 2022, 08:00:45 am »
@SMcNeill


I understand your reply regarding differences in memory allocation for arrays (Main_Program)  versus  (SUBS) and how this affects timings in using these techniques for array work.


I attempted to time the big DIM array


Dim TempGrid(0 To _Width - 1, 0 To _Height - 1) As Long


for the two "passes"  of your circle program and did not see any significant changes (minor ones exist) between your two methods as far as DIM being setup.

I attempted to slightly modify your program (my alterations shown with "*** at the end of each relevant line) and is supplied below.

Perhaps you are correct - in which case there is a a coding problem in my alterations to your program.
Would you be kind enough to point out any coding problems in my alterations to establish correct timing results to support your last reply?



Code: QB64: [Select]
  1.  
  2. DECLARE DYNAMIC LIBRARY "kernel32"                                             '***
  3.     SUB QueryPerformanceCounter (BYVAL lpPerformanceCount AS _UNSIGNED _OFFSET)'***
  4.     SUB QueryPerformanceFrequency (BYVAL lpFrequency AS _UNSIGNED _OFFSET)     '***
  5. END DECLARE                                                                    '***
  6. DIM SHARED PerformanceCount AS _INTEGER64                                      '***
  7. DIM SHARED Frequency AS _INTEGER64                                             '***
  8. DIM SHARED StartingCount AS _INTEGER64                                         '***
  9. DIM SHARED dElapsedTime AS DOUBLE                                              '***
  10. DIM SHARED dElapsedTimeOLD AS DOUBLE                                           '***
  11. DIM SHARED CumulativeTime AS DOUBLE                                            '***
  12. QueryPerformanceFrequency (_OFFSET(Frequency))                                 '***
  13. QueryPerformanceCounter (_OFFSET(StartingCount))                               '***
  14. CumulativeTime#=0                                                              '***
  15.  
  16.  
  17. Screen _NewImage(1024, 720, 32)
  18.  
  19.  
  20. TestLimit = 100
  21. Radius = 350
  22.  
  23. t# = Timer
  24. For i = 1 To TestLimit
  25.     'Cls , &HFF000000 + Int(Rnd * &HFFFFFF)
  26.     k& = &HFF000000 + Int(Rnd * &HFFFFFF)
  27.  
  28.     Circle (512, 360), Radius, k&
  29.     Fill 512, 360, k&, 0
  30.  
  31. t1# = Timer
  32. CumulativeTime_T1_T#=CumulativeTime#                                           '***
  33. CumulativeTime#=0                                                              '***
  34.  
  35. For i = 1 To TestLimit
  36.     'Cls , &HFF000000 + Int(Rnd * &HFFFFFF)
  37.     k& = &HFF000000 + Int(Rnd * &HFFFFFF)
  38.  
  39.     Circle (512, 360), Radius, k&
  40.     Fill 512, 360, k&, -1
  41. t2# = Timer
  42. CumulativeTime_T2_T1#=CumulativeTime#                                          '***
  43.  
  44. Print Using "###.#### seconds for Fill."; t1# - t#;
  45. COLOR &HFF00FF00~&                                                             '***
  46. print "             ";                                                        '***
  47. print using "#,###";CumulativeTime_T1_T#;                                      '***
  48. print " microseconds to DIM array"                                            '***
  49. COLOR &HFFFFFFFF~&
  50.  
  51.  
  52. Print Using "###.#### seconds for Fill Method -1."; t2# - t1#;
  53. COLOR &HFF00FF00~&                                                             '***
  54. print "   ";                                                                  '***
  55. print using "#,###";CumulativeTime_T2_T1#;                                     '***
  56. print " microseconds to DIM array"                                            '***
  57. COLOR &HFFFFFFFF~&                                                             '***
  58.  
  59.  
  60.  
  61.  
  62. Sub Fill (Passx, Passy, kolor As Long, METHOD)
  63. B = _Blend
  64.  
  65.  
  66. TIMERe:dElapsedTimeOLD = dElapsedTime                                          '***
  67. Dim TempGrid(0 To _Width - 1, 0 To _Height - 1) As Long
  68. TIMERe                                                                         '***
  69. CumulativeTime#=CumulativeTime#+(dElapsedTime - dElapsedTimeOLD)*1000000       '***
  70.  
  71.  
  72. Dim As _MEM m, m1: m = _Mem(TempGrid()): m1 = _MemImage(0)
  73. If METHOD Then
  74.     _MemCopy m1, m1.OFFSET, m1.SIZE To m, m.OFFSET
  75. o = 0
  76. Do Until o >= m.SIZE
  77.     If _MemGet(m1, m1.OFFSET + o, _Unsigned Long) = kolor Then
  78.         _MemPut m, m.OFFSET + o, -1 As LONG
  79.         'Else
  80.         '_MemPut m, m.OFFSET + o, 0 As LONG
  81.     End If
  82.     o = o + 4
  83.  
  84. TempGrid(Passx, Passy) = 1
  85. startx = Passx: finishx = Passx
  86. starty = Passy: finishy = Passy
  87.  
  88. Do Until finished
  89.     pass = pass + 1
  90.     finished = -1
  91.     For x = startx To finishx
  92.         For y = starty To finishy
  93.             'If TempGrid(x, y) <> 0 Then Print x, y, TempGrid(x, y): Sleep
  94.  
  95.             If TempGrid(x, y) = pass Then
  96.                 tempx = x
  97.                 Do Until tempx = 0
  98.                     If TempGrid(tempx - 1, y) = 0 Then
  99.                         TempGrid(tempx - 1, y) = pass + 1
  100.                         tempx = tempx - 1
  101.                         If tempx < startx Then startx = tempx
  102.                         finished = 0
  103.                     Else
  104.                         tempx = 0
  105.                     End If
  106.                 Loop
  107.  
  108.                 tempx = x
  109.                 Do Until tempx = _Width - 1
  110.                     If TempGrid(tempx + 1, y) = 0 Then
  111.                         TempGrid(tempx + 1, y) = pass + 1
  112.                         tempx = tempx + 1
  113.                         If tempx > finishx Then finishx = tempx
  114.                         finished = 0
  115.                     Else
  116.                         tempx = _Width - 1
  117.                     End If
  118.                 Loop
  119.  
  120.                 tempy = y
  121.                 Do Until tempy = 0
  122.                     If TempGrid(x, tempy - 1) = 0 Then
  123.                         TempGrid(x, tempy - 1) = pass + 1
  124.                         tempy = tempy - 1
  125.                         If tempy < starty Then startx = tempy
  126.                         finished = 0
  127.                     Else
  128.                         tempy = 0
  129.                     End If
  130.                 Loop
  131.  
  132.                 tempy = y
  133.                 Do Until tempy = _Height - 1
  134.                     If TempGrid(x, tempy + 1) = 0 Then
  135.                         TempGrid(x, tempy + 1) = pass + 1
  136.                         tempy = tempy + 1
  137.                         If tempy > finishy Then finishy = tempy
  138.                         finished = 0
  139.                     Else
  140.                         tempy = _Height - 1
  141.                     End If
  142.                 Loop
  143.             End If
  144.     Next y, x
  145.  
  146. o = 0
  147. Do Until o >= m.SIZE
  148.     If _MemGet(m, m.OFFSET + o, _Unsigned Long) <> 0 Then _MemPut m1, m1.OFFSET + o, kolor As _UNSIGNED LONG
  149.     o = o + 4
  150.  
  151. '***
  152. SUB TIMERe                                                                     '***
  153. QueryPerformanceCounter (_OFFSET(PerformanceCount))                            '***
  154. dElapsedTime = (CDBL(PerformanceCount - StartingCount)) / Frequency            '***
  155. END SUB                                                                        '***
  156.  
  157.