Author Topic: v1.5 128 bit (and beyond) math * COMPILER ERROR * (Read 151383 times)

Richard · « **Reply #180 on:** September 06, 2021, 12:44:25 am »

In preparation for my QB64 coding for 128+bit math - I have run a number of modules, each being on purpose similar in coding style (essential difference between modules is the variable elements used and the memory allocated for same).

The objective was to simulate INC (increment, +1) but with the eventual aim of scalability (i.e. to go beyond 128 bits). The modules all were scaled down to 64 bits only (to allow in the near future easy cross-checking with INTEGER64 standard code) and as tested were not used for anything useful (i.e. an "empty" FOR NEXT loop would be the QB64 code equivalent). The test results (shown below) were for 16 INC's in succession and the "trial" was repeated 128 times.

The methodology used was purely bit-wise (i.e. base 2) - and consequently one bit only at a time was stored in the respective variable elements, for instance an INTEGER64 array stored only one bit value per array element (~2% memory efficient).

The overall aim was to determine which element variable and memory allocation combination would produce distinctive calculation time gains.

In the process of this development I gained some experience in using MEM blocks and UDT.

[ You are not allowed to view this attachment ]

Notes

In the tabulated values section, inc = 0 should read inc = 16
The horizontal axis is a linear time scale (0 to 3 seconds)
The vertical axis is a linear count (one count per pixel) for the respective modules.
Some modules are known to be in error
() designates QB64 arrays being used - one element per bit
memory and screen are the memory allocations as per using MEM coding
screen_0 and screen_2 are the old DOS memory allocations (&HB800 and &HA000 addresses) and NOT being used by MEM
udt as per user defined type
direct means that allocation unit is "hard-coded" (i.e. a03% variable used instead of a%(03) array element)
TIMER_empty_loops is actually a module without any code in the loop (here 16x) - to estimate overhead in timing of modules
TIMER_null is two time calls difference (i.e the resolution of the timing process for two successive enquiries)
The tabulated values represent minimum, average, maximum timings for the collection of 128 trials per module
The associated Histogram shows all data points above a line of half-intensity (line = span minimum to maximum) - no data point is repeated on the exact same pixel (and always above the line).
The program was run without computer connection to internet, devices etc or essentially any other software running.
All modules are essentially the same coding style, only the variable elements used differ - the "heart" of each module - INC - was only 3 lines of functional QB64 code equivalent per iteration.
All modules were tested in "SLOW" mode, i.e. to roughly simulate in actual useage - so for instance at each iteration of INC, the binary string representation was displayed as text on a graphics screen, select case block used to redirect as necessary depending on a particular bit field value.
The timings (unit = second) do not include any setup overheads (zeroing etc) - and assumed the initial value (for the INC) was preloaded and valid.

Eventually it is planned to use resources as mentioned/written by @SMcNeill , @luke, @bplus and @jack as the program develops past INC to the more common modes. These other resources would be used to cross-check my results.

Richard · « **Reply #181 on:** October 03, 2021, 10:26:24 pm »

This reply is really meant for any relatively new programmers.

QB64 Number Space

This is a first attempt to summarize the IDE_help/Wiki/internet/computer_science_math regarding the QB64 Number Space.

For the purpose of this reply - the QB64 Number Space are those numbers that can be explicitly entered and accurately retrieved by QB64. Examples of explicit values are 1000 0.5 0.333 1E3. Examples of implicit numbers are 1/3 1.000000089E99 1.0000000000000000000001.

A simple (not complete) survey of explicit numbers (for QB64 x64) comparing _SINGLE _DOUBLE _FLOAT variables is tabulated in the screenshot. Also tabulated are elevated C floats (to 19 and 24 decimal places). For reference _INTEGER64 equivalents are included.

One intention of the survey was to exactly define what the contiguous range of integers are available for the various real variable types (i.e. to say e.g. going from 0 1 2 3 ... ? without missing any integers). Because of the way when having a small number of bits (32, 64, 80) the internal binary representation is governed by certain rules - simply put - when going to relatively high integer values - instead of the LSD for a contiguous series of numbers being 0 1 2 3 4 5 (LSD = Least Significant Digit) the resolution (granularity) is decreased to say 2 2 2 2 6 6 (for the LSD) for the exact same numbers originally desired.

An attempt was made to color code the LSD (least significant digits). Though not complete - think of RED digits (do not use), YELLOW (use with caution) and GREEN (probably >99.999% OK). Of course a word of caution is that when have a "string of trailing 0's or 9's - round off errors and rounding methods are to be considered.

So the output provided is not complete (and possibly some minor errors from my programming attempts exist).

[ You are not allowed to view this attachment ]

Many thanks to @jack for his help in my starting to learn C (and to use C code within QB64 programs).

Edit: when have a Function of 1,000 + 1 say, it means that

c! = 1000! + 1!
c# = 1000# + 1#
c## = 1000## + 1##

etc - i.e. only consistent variable types is used (so no e.g. c! = 1000# + 1## )

SMcNeill · « **Reply #182 on:** October 04, 2021, 09:58:56 am »

I think you have a few typos in there Richard... Or else math is broken bad.

9,999,969 + 1 = 9,999,997.....

Is that just a typo in the 9,999,996, or what’s going on there? I can't tell from the screenshot alone.

Richard · « **Reply #183 on:** October 28, 2021, 03:19:48 am »

@SMcNeill

Thanks for pointing out the typo's.

Attached is the preliminary work in trying to establish the first transition point in going from Number Space Ordinary (NSO) where all positive integers are representable in _DOUBLE TO where the granularity is larger than 1 (i.e. resulting in "skipped" integers).

[ You are not allowed to view this attachment ]

Richard · « **Reply #184 on:** November 27, 2021, 07:36:13 am »

(INC by bit enumeration) versus (&h...~&& + 1~&&)

Previously compared timings of various modules (differing essentially by variable_type/memory_allocation) by looping of same (about 100x) to obtain better timing results for comparison purposes (to establish which modules resulted in significant timing/resource gains). This effectively was a comparison of

(INC by bit enumeration) versus (FOR temp = 1 to #loops:...(no code here)...:NEXT)

This approach (i.e. having say 100x loops) was not technically correct approach as hardware caching would bias results towards measurements of loop times for code that eventually was stored in cache (in particular level 1 cache).

Current timings below are for once performed INC and measuring with 100 nano second resolution timer. Because QB64 code (&h...~&& + 1~&&) is very fast in comparison to highly scalable (i.e. many more) bit coding - resulting in the QB64 code not being measurable - 100 replications of (&hFFFFFFFFFFFFFF00~&& + 1~&&) were hard-coded (i.e. not in a FOR-NEXT LOOP). The result of this is the last module on the graph below had an apparent measurable result (all the other modules were in the sense only 1 replication and were somewhat measurable with 100 nano-second timer).

At this stage, without proper analysis etc, very roughly (&hFFFFFFFFFFFFFF00~&& + 1~&&) is about 100x FASTER than any (INC by bit enumeration) as used here. The requirement was set that all (INC by bit enumeration) methods were practically identical - i.e. the "heart" of each iteration was 3 lines of BASIC code and essentially the only difference between the various modules was the choice of variable_types/memory_used.

Regarding Cache memory.

My computer (roughly 8 years old) has the following Cache sizes:-

L1 cache size 32K byte
L2 cache size 256 Kbyte
L3 cache size 6144 Kbyte

L1 is actually on the same silicon die as the processor itself and is the fastest of the three (it runs at the same speed as the microprocessor).

Though not specified, typically L2 cache runs at half the speed as L1 and typically is NOT located on the same die, but rather located on a separated die which may be INSIDE the CPU module or in another module (possibly "double-decking" the CPU Integrated circuit module i.e. in parallel and physically located underneath the main CPU chip - so no additional PCB space is being used). Apparently to manufacture L1 cache is very expensive, so the L2 running at half-speed (of L1) is an economic advantage.

L3 is located somewhat away from the CPU but closer than the standard RAM memory - it would run faster to some extent when comparing to accessing from RAM.

L4 might either be the RAM memory if PCIe hard drive M.2 is NOT present, or SATA SSD, or SATA HDD and may classified based on whether the performed .exe (and similarly .dll etc) are actually loaded entirely into memory (not all .exe's are fully loaded into memory when running). In either case L4 is slower than L3.

There is hardware and timing overheads to load up the cache - apparently to load up L1 cache seems to be about 5 clock cycles to pre-load cache with 16 bytes - and obviously only relatively small code blocks (<32 Kbytes) would ever, and after significant time delay, be pre-loaded into L1 cache. On the other hand, the much larger L3 cache may be constantly mirrored with all the program code (up to 6 Mbytes) but at significant speed performance losses.

It is hard enough to model/understand how program execution speed increases as the various levels of cache are utilized - things become much more complicated when Windows takes priority over your desired running program .exe, resulting in cache (all levels) resources being spread over more than one application (i.e. your .exe and various Windows background processes).

On my computer (which has 4 physical processors i.e. 8 logical processors) even when overall system CPU useage is say 30% - ALL CPUs are simultaneously doing Windows background processes. There never seems to be any time when a CPU is "dedicated" to my program .exe. @Pete sums it up as Window Priority.

The Graph below

In preparation for my QB64 coding for 128+bit math - I have run a number of modules, each being on purpose similar in coding style (essential difference between modules is the variable elements used and the memory allocated for same).

The objective was to simulate INC (increment, +1) but with the eventual aim of scalability (i.e. to go beyond 128 bits). The modules all were scaled down to 64 bits only (to allow in the near future easy cross-checking with INTEGER64 standard code) and as tested were not used for anything useful i.e. (INC by bit enumeration) versus (&h...~&& + 1~&&)(QB64 code equivalent). The test results (shown below) were for single iterations (i.e. a total of one replication) and the "trial" was repeated 128 times.

The methodology used was purely bit-wise (i.e. base 2) - and consequently one bit only at a time was stored in the respective variable elements, for instance an INTEGER64 array stored only one bit value per array element (~2% memory efficient), similarly using FLOAT (0.5% memory efficient).

The overall aim was to determine which element variable and memory allocation combination would produce distinctive calculation time gains.

Notes

Fundamental time unit = micro-seconds (resolution 100 nanoseconds, uncalibrated)
The horizontal axis is a linear time scale (0 to 64 micro-seconds),(i.e. in the zone immediately left of any printed text)
The vertical axis is a linear count (one count per pixel) for the respective modules.

() designates QB64 arrays being used - one element per bit
memory and screen are the memory allocations as per using MEM coding
screen_0 and screen_2 are the old DOS memory allocations (&HB800 and &HA000 addresses) and NOT being used by MEM
UDT as per user defined type
direct means that allocation unit is "hard-coded" (i.e. a03% variable used instead of a%(03) array element)
TIMER_empty_loops is actually a module without any code in the iteration (i.e. replication) - to estimate overhead in timing of modules
TIMER_null is two time calls difference (i.e. the resolution of the timing process for two successive enquiries)
The tabulated values represent minimum, average, maximum timings for the collection of 128 trials per module
The 1st, 2nd, 3rd, 4th, and 5th timing values are listed in the text on the RHS (right hand side).

The plotted 1st point is indicated with a "+" (8-pixel branch length)
The plotted 2nd point is indicated with a "+" (4-pixel branch length)
The plotted 3rd point is indicated with a "+" (2-pixel branch length)
The plotted 4th point is indicated with a "+" (1-pixel branch length)

The associated Histogram shows all data points above a line of half-intensity (line = span minimum to maximum) - no data point is repeated on the exact same pixel (and always above the line).
The program was run without computer connection to internet, devices etc or essentially any other software running.
All modules are essentially the same coding style, only the variable elements used differ - the "heart" of each module - INC - was only 3 lines of functional QB64 code equivalent per iteration.

The timings (unit = micro-second) do not include any setup overheads (zeroing etc) - and assumed the initial value (for the INC) was preloaded and valid.

Since only 64 micro-seconds is spanning the un-texted portion of the display and since a timing resolution of 100 nanoseconds (uncalibrated) is in effect - the resulting "histogram" is distinctly "heavily quantized" (i.e. "gaps" = resolution steps in timings appear).

Cache memory effects

If Windows was not present then quite simply would expect:-

1st timing to be longest - MyProgram.exe running via L4 cache (in my case NVME PCIe M.2 SSD)
2nd timing to be shorter (and to the left) - MyProgram.exe running via L3 cache
3rd timing may be shorter still - MyProgram.exe POSSIBLY running via L2 cache
unknown - MyProgram.exe running via L1 cache

BUT Windows background processes (Window Priority) are continuously running over ALL CPU's simultaneously. Timing alterations are unpredictable (and would appear to be resulting in RANDOM timing elongations).

This is illustrated by the horizontal line (base line for histograms for each module of code) extending to the right of the 1st timing point and further illustrated by any change in Right-Left ordering of 1st, 2nd, 3rd, 4th and 5th points i.e. Windows "adds" to the timings. It is further illustrated by the lack of "bias skew" - when the L1 cache for the program code is envoked, the histogram points will disproportionally "clump" to the LHS (left-hand side) of the histogram (so NO "Bell-shaped" traditional statistical distribution curves).

Sadly, one might question (in general across all programs) whether there is any point in trying to achieve maximum speed performance increases since "Windows messes things up with timings" and is unpredictable and it seems cannot be switched off (at an individual CPU level).

[ You are not allowed to view this attachment ]

Richard · « **Reply #185 on:** December 26, 2021, 08:34:59 am »

Possible WIKI error (in Variable Type ranges) - case study unsigned integer components for

single 3.402823 e+38 (2^128) OK
double 1.797693134862310 e+308 (2^1024) OK
float 1.18 e+4932 (2^16384) ???

The following code displays what I get when trying to access the range for FLOAT (for positive whole number component)

Code: QB64: [Select]

a!=3.402823e+38 'single  
print "single";a!
print
 
a# = 1.797693134862310e+308    
print "double";a#
print
 
a## = 1.79769313486231580788856f308 
print "float ";a##
print
 
a## = 1.79769313486231580788857f308 
print "float ";a##
 

Does anyone get a higher positive range value for FLOAT than illustrated above?

@SpriggsySpriggs

re your replies

Quote

« Reply #113 on: May 01, 2021, 05:17:03 am »

What is the reason for needing such a large number right now?

Quote

« Reply #115 on: May 02, 2021, 12:47:22 am »

I highly doubt we're doing that in QB64, though. It's not really being used for any serious applications. Simple graphics demos and games.

You question the reason for large numbers (128 bits) yet about 128 bits is needed to exactly represent the positive whole number component of QB64 single variables (refer table at start of this reply). SINGLE types are used commonly in programming tasks and because of limited range, QB64 has DOUBLE.

128 bit math would lessen the impact of the number space issues with the current variable types - it does not even begin to widen the range of the existing QB64 variable types.

jack · « **Reply #186 on:** December 26, 2021, 11:54:26 am »

Richard you keep forgetting that QB64's print statement converts the Float argument to double, the only thing that halfway works with Float is print using

Code: QB64: [Select]

$NoPrefix
$Console:Only
Dest Console
 
a## = 1.79769313486231580788857F308
a## = a## * a##
Print Using "##.##################^^^^"; a##
 

output

Code: [Select]

% 3.231700607131100371F+616

except for the unneeded % the output is correct, if you want better output for the Float type then use the C runtime library as I have shown you in the past

jack · « **Reply #187 on:** December 26, 2021, 12:54:57 pm »

print using works ok with float that don't have a large exponent but in the following example the exponent is wrong

Code: QB64: [Select]

$NoPrefix
$Console:Only
Dest Console
Option Explicit
 
Dim As Float i, f
 
f = 1##
For i = 1## To 1750##
    f = f * i
Next
Print Using "##.##################^^^^^"; f
 

output

Code: [Select]

% 2.098318982648383492F+4969

correct value is

Code: [Select]

  2.098318982648383492F+4917

Richard · « **Reply #188 on:** January 24, 2022, 09:59:23 am »

This reply is really meant for any relatively new programmers.

QB64 comparison of output formatting

@jack earlier mentioned about how print and using formats may not give the same numerical results. To attempt to do "proper" comparisons of some of the variable types of QB64, I have now standardized on including:-

STR$() (which is often used as in PRINT i.e. a$=STR$(a%) gives identical output result as PRINT a%)

PRINT USING "######.######" (say) for fixed format printing

PRINT USING "##.######^^^^^" (say) for scientific notation

and Hexadecimal format.

By simultaneously displaying these four formats, it is easier to see when the various formats begin to break, in the study of the QB64 number space set.

Relatively new programmers might be tempted to "dial up maximum resolution" for the output streams (which on purpose I did below to highlight the problems associated) - however without due "care", this will typically result in numerical artifacts. Of course, it is up to the programmer/program_user/intended_program_market/standards/government_regulations (if applicable) which would be the deciding factor on how to handle this.

At present, the attachment below has not been edited for "errors" (as numerical artifacts) with say "red boxes" etc - so in the meantime consider it a QUIZ - to identify/confirm any errors (say pick one set (function) , and as far as you are comfortable to analyze - use any readily available techniques (your own QB64 program, other programming languages, calculators, wiki/google etc) to "support" your answer.

Because of obvious formatting problems - just "ignore" the last dozen or so lines in the attachment.

The attachment is a "practice run" in preparation for further QB64 number space study.

Richard · « **Reply #189 on:** February 23, 2022, 08:56:09 am »

Changed compiling options for enumeration of INC in QB64.

File -O2.png is for gcc compiler flag settings.

File Standard.png is the default QB64 compiler settings.

QB64 INC is now for only one iteration (in line with all other modules).

Richard · « **Reply #190 on:** March 22, 2022, 11:46:31 am »

256 bit INC

Using techniques developed for 64 bit INC (i.e. +1), tests performed on various variable/memory combinations for 256 bit INC.

Updated to include a large "X" to clearly show where the timings end for a particular module.

Also, on LHS, shows a linear-linear graph TIME vs TRIAL#. Because of Windows background tasks - there is a wide spread of timings per module.

Included are QB64 code results, using INTEGER64 unsigned variables, for timing comparisons.

News: