Print Page - Binary Search Method

Title: Binary Search Method
Post by: SMcNeill on January 24, 2019, 08:09:35 pm

A simple demostration of how to implement and use a binary search method:

Code: QB64: [Select]

SCREEN _NEWIMAGE(800, 600, 32)
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
DIM SHARED WordList(1 TO 466544) AS STRING
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
 
 
 
DO
    _KEYCLEAR
    INPUT "Give me any word"; search$
    search$ = LTRIM$(RTRIM$(search$))
    PRINT "Searching for "; search$
    IF search$ = "" THEN END
    index = FindIndex(search$)
    IF index THEN
        PRINT "Word was found at position "; index; " in "; SearchTimes; "passes."
    ELSE
        PRINT "Word was not in list."
        PRINT "Previous word was ==> "; WordList(LastIndex - 1)
        PRINT "Next word was ======> "; WordList(LastIndex + 1)
        PRINT "Search took"; SearchTimes; "passes."
    END IF
    PRINT
LOOP
 
 
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 'lbound(wordlist)
    max = 370099 'ubound(wordlist)
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (min + max) \ 2
        compare = _STRICMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap + 1
        ELSEIF compare < 0 THEN
            max = gap - 1
        ELSE
            FindIndex = gap
            found = -1
        END IF
        IF max - min <= 1 THEN LastIndex = gap: found = -1 'it's not in the list
        PRINT min, max, search$, WordList(gap), compare
        SLEEP
    LOOP
END FUNCTION
 

The word list has 466544 words in it, so it takes a few seconds to load into memory. Searching for a word in the list however, and finding its index, is almost instantaneous. At *MOST*, this will make 19 passes to either find, or eliminate a word from the list.

Now, let's apply the same list to STx's hash tables as written here: https://www.qb64.org/forum/index.php?topic=1001.msg101972#msg101972

Code: QB64: [Select]

DIM SHARED HashTableSize AS LONG
HashTableSize = 300007 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
 
 
DIM Counter(300007) 'So we can count how deep some of the tables go
 
PRINT "Loading dictionary..."
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
DO WHILE NOT EOF(1)
    LINE INPUT #1, a$
    d = HashFunc(a$) ' Calculate the hash value (array address) of the word on hand.
    Counter(d) = Counter(d) + 1
LOOP
CLOSE #1
 
FOR i = 1 TO 300007
    IF Counter(i) > max THEN max = Counter(i)
NEXT
 
PRINT "Lists go up to "; max; " levels deep."
 
 
PRINT "Done."
 
END
 
 
 
FUNCTION HashFunc (a AS STRING) ' input string
    DIM sum AS DOUBLE
    sum = HashTableSize
    FOR k = 1 TO LEN(a)
        sum = sum + k * COS(ASC(MID$(a, k, 1))) ' Without the linear factor of k, permutations have same hash values.
    NEXT
    sum = ABS(VAL(ReplaceSubString$(STR$(sum), ".", "")))
    sum = sum MOD HashTableSize
    HashFunc = sum
END FUNCTION
 
FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
    j = INSTR(a, b)
    IF j > 0 THEN
        r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
    ELSE
        r$ = a
    END IF
    ReplaceSubString$ = r$
END FUNCTION

Notice that the hash table has entries which go up to 10 levels deep in it, so we first need to hash to the proper address and then check the list for up to 10 different entries before finding the right one.

19 loops MAX in this case for the Binary Search Method, verses 11 (hash + 10 levels) MAX for the Hash Table...

... so it seems as if the Hash Table wins in this case. (Also depending on how efficient the method is for looking up the 10 levels deep info is; some commands have more overhead than others.)

BUT...

Hash Tables require your data in memory PLUS the table itself being in memory.

All the Binary Search Method requires is for the data to be memory (and sorted). (And if memory constraints are a true issue, it works quite well directly with Random Access Files, as you can simply use the index number to reference them with GET #1, index, word$.)

Either way, it seems to be a nice improvement over Pete's old method which he talks about here: https://www.qb64.org/forum/index.php?topic=1001.msg101983#msg101983

And, as for Stx's question:

Quote

17 steps? 18 steps? Where did these numbers come from? How do I interpret them in O(n) notation?

That was covered here: https://www.qb64.org/forum/index.php?topic=1001.msg101981#msg101981

Quote

A max of N searches, where 2^N > Index. ;)

2^18 = 262,144
INDEX = 466544 words
2^19 = 524,288

So our max number of searches is 2^N > Index... Or 19 in this case.

Quote

I think the burden's on you to do the speed test m8 (for a large data set with lots of performance demand).

I think the above offers a fairly decent comparison of the two methods. Feel free to run them and actually time them in various performance tests if you want.

One large advantage to the binary search is lesser memory usage and simplicity to implement.

A large advantage to hash tables is the fact that your data doesn't need to be sorted to search properly.

Both seem equally useful to me, and neither should be completely dismissed out-of-hand. ;)

NOTE: Feel free to remove (or comment out) the following statements, if you want, in the code above. It's mainly there just to help highlight the search process.

Code: QB64: [Select]

        PRINT min, max, search$, WordList(gap), compare
        SLEEP

Title: Re: Binary Search Method
Post by: STxAxTIC on January 24, 2019, 08:23:04 pm

Nice and honest writeup.

An actual speed test will be to rattle off say 50,000 consecutive lookups. To say it "feels instaneous" for one word at a time is a bit pedestrian. Let's talk Pete into making the test.

EDIT: Just looked up the average conversion time for binary sort is O(log n). That solves the speed question. On binary sort's best day you have O(n), which is incidentally what you get on hash's worst day.

You're right about the memory argument. Good thing it's no longer 1973 and we have been cramming gigabytes into keychains since 2002. Hashing is safe, memory-wise. (Speed is still coveted though.)

Title: Re: Binary Search Method
Post by: SMcNeill on January 24, 2019, 08:42:05 pm

Code: QB64: [Select]

SCREEN _NEWIMAGE(800, 600, 32)
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
DIM SHARED WordList(1 TO 466544) AS STRING
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
 
DIM RandomWords(50000) AS STRING
FOR i = 1 TO 50000
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
 
t# = TIMER
FOR i = 1 TO 50000
    index = FindIndex(RandomWords(i))
NEXT
PRINT USING "###.### seconds lookup"; TIMER - t#
 
END
 
 
 
 
DO
    _KEYCLEAR
    INPUT "Give me any word"; search$
    search$ = LTRIM$(RTRIM$(search$))
    PRINT "Searching for "; search$
    IF search$ = "" THEN END
    index = FindIndex(search$)
    IF index THEN
        PRINT "Word was found at position "; index; " in "; SearchTimes; "passes."
    ELSE
        PRINT "Word was not in list."
        PRINT "Previous word was ==> "; WordList(LastIndex - 1)
        PRINT "Next word was ======> "; WordList(LastIndex + 1)
        PRINT "Search took"; SearchTimes; "passes."
    END IF
    PRINT
LOOP
 
 
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 'lbound(wordlist)
    max = 370099 'ubound(wordlist)
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (min + max) \ 2
        compare = _STRICMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap + 1
        ELSEIF compare < 0 THEN
            max = gap - 1
        ELSE
            FindIndex = gap
            found = -1
        END IF
        IF max - min <= 1 THEN LastIndex = gap: found = -1 'it's not in the list
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
END FUNCTION

0.055 seconds on my PC to find 50,000 random words. Personally, I’d call that “almost instantaneous”, but what do I know? My naming sense sucks. LOL

Now you just need to speed test your method to lookup the same 50,000 words. (Which is why theres no RANDOMIZER TIMER in there.)

Title: Re: Binary Search Method
Post by: Pete on January 24, 2019, 08:50:53 pm

Hey I'm still over in the other thread with my new method that makes you guys look like losers! Sorry, I gave brain cells today at the Head Cross clinic and for a few hours after that procedure, I find myself channeling Ted.

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 24, 2019, 08:53:52 pm

Nevermind. It’s not worth the effort. Use it or not; there it is.

Title: Re: Binary Search Method
Post by: Pete on January 24, 2019, 09:12:57 pm

Not alphabetizing a file for a binary search would be as stupid as not using the Dewey Decimal system for stacking books. The books in here somewhere, just start doing a shelf by shelf search. So I don't think we should be considering that as part of this discussion. This is about an organized word list vs a hash table, which alphabetizing in the hash method is of absolutely no consequence.

In my 25 year old example, the speed was adequate because I organized the data and sub-divided the list into different alphabetized folders. If I had to loop through 50,000 random words in a single list with the slower INPUT AS method, I'd still be waiting to see if I spelled dumb-ass correctly.

Pete

Title: Re: Binary Search Method
Post by: STxAxTIC on January 24, 2019, 09:14:15 pm

Steve, the big-O notation is designed to specifically not compare apples to oranges. Dicscrete calculus will be a lesson for another time... I just cant right now, my facepalm hurts.

Title: Re: Binary Search Method
Post by: SMcNeill on January 24, 2019, 09:17:35 pm

Use it or not; there it is.

I’d still love to see a similar speed test for your method. I’m curious how big a difference it could be.

Title: Re: Binary Search Method
Post by: STxAxTIC on January 24, 2019, 09:24:47 pm

It's open source. Have a ball. (Might wanna optimize it first.)

Title: Re: Binary Search Method
Post by: Pete on January 24, 2019, 09:29:23 pm

Quote from: STxAxTIC on January 24, 2019, 09:14:15 pm

Steve, the big-O notation is designed to specifically not compare apples to oranges. Dicscrete calculus will be a lesson for another time... I just cant right now, my facepalm hurts.

Dicscrete calculus, Bill? I think you'd better chose a search method and get that spell checker up and working for craps sake.

If you mean discrete calculus, is that one of those math courses offered at night for active singles? If so, what is the probably of completing the course without contracting a nasty case of standard deviations?

Pete :D

Title: Re: Binary Search Method
Post by: STxAxTIC on January 24, 2019, 10:21:24 pm

Haha, not sure who promised I was making a spell checker. The dictionizer thing I posted is still a working product... or... prototype. I consider the whole case closed pretty much.

Title: Re: Binary Search Method
Post by: bplus on January 25, 2019, 12:16:04 am

Steve there is a bug in your FindIndex function:

Code: QB64: [Select]

SCREEN _NEWIMAGE(800, 600, 32)
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
DIM SHARED WordList(1 TO 466544) AS STRING
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
DIM RandomWords(50000) AS STRING
FOR i = 1 TO 50000
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
PRINT "Steve's 50000 word lookup with binary search"
t# = TIMER
FOR i = 1 TO 10
    index = FindIndex(RandomWords(i))
    IF index THEN PRINT WordList(index) ELSE PRINT RandomWords(i) + " not found!"
NEXT
PRINT USING "###.### seconds lookup"; TIMER - t#
 
 
'pete's method
OPEN "466544 Word List.txt" FOR BINARY AS #1
word$ = SPACE$(LOF(1))
GET #1, , word$
CLOSE #1
PRINT "Pete's 50000 word lookup using INSTR to get position in string of word."
t# = TIMER
FOR i = 1 TO 10
    place = INSTR(word$, RandomWords(i) + CHR$(13) + CHR$(10))
    PRINT MID$(word$, place, LEN(RandomWords(i)))
NEXT
PRINT USING "####.### seconds lookup"; TIMER - t#
 
END
 
 
 
 
DO
    _KEYCLEAR
    INPUT "Give me any word"; search$
    search$ = LTRIM$(RTRIM$(search$))
    PRINT "Searching for "; search$
    IF search$ = "" THEN END
    index = FindIndex(search$)
    IF index THEN
        PRINT "Word was found at position "; index; " in "; SearchTimes; "passes."
    ELSE
        PRINT "Word was not in list."
        PRINT "Previous word was ==> "; WordList(LastIndex - 1)
        PRINT "Next word was ======> "; WordList(LastIndex + 1)
        PRINT "Search took"; SearchTimes; "passes."
    END IF
    PRINT
LOOP
 
 
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 'lbound(wordlist)
    max = 370099 'ubound(wordlist)
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (min + max) \ 2
        compare = _STRICMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap + 1
        ELSEIF compare < 0 THEN
            max = gap - 1
        ELSE
            FindIndex = gap
            found = -1
        END IF
        IF max - min <= 1 THEN LastIndex = gap: found = -1 'it's not in the list
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
END FUNCTION
 
 

[ This attachment cannot be displayed inline in 'Print Page' view ]

Title: Re: Binary Search Method
Post by: SMcNeill on January 25, 2019, 02:08:45 am

A couple of little things was working against the original running as intended:

Code: QB64: [Select]

SCREEN _NEWIMAGE(800, 600, 32)
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
DIM SHARED WordList(466545) AS STRING
PRINT "Loading library"
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
 
PRINT "Sorting"
Sort WordList()
 
PRINT "Looking up"
DIM RandomWords(50000) AS STRING
FOR i = 1 TO 50000
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
PRINT "Steve's 50000 word lookup with binary search"
t# = TIMER
FOR i = 1 TO 10
    index = FindIndex(RandomWords(i))
    PRINT "Searching for: "; RandomWords(i),
    IF index THEN PRINT WordList(index) ELSE PRINT "NOT FOUND!"
NEXT
PRINT USING "###.### seconds lookup"; TIMER - t#
 
 
'pete's method
 
'FOR i = 1 TO 466545
'    adding brackets to words to stop a false match, such as "dog" being found in "dogfood"
'    word$ = word$ + "{" + WordList(i) + "}" 'build up the search list
'    IF i < 50001 THEN RandomWords(i) = "{" + RandomWords(i) + "}"
'NEXT
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
word$ = SPACE$(LOF(1))
GET #1, , word$
CLOSE #1
 
 
 
PRINT "Pete's 50000 word lookup using INSTR to get position in string of word."
t# = TIMER
FOR i = 1 TO 10
    place = INSTR(word$, RandomWords(i) + CHR$(13) + CHR$(10))
    PRINT "Searching for: "; RandomWords(i),
    IF place THEN PRINT MID$(word$, place, LEN(RandomWords(i))) ELSE PRINT "NOT FOUND!"
NEXT
PRINT USING "####.### seconds lookup"; TIMER - t#
 
END
 
 
 
 
DO
    _KEYCLEAR
    INPUT "Give me any word"; search$
    search$ = LTRIM$(RTRIM$(search$))
    PRINT "Searching for "; search$
    IF search$ = "" THEN END
    index = FindIndex(search$)
    IF index THEN
        PRINT WordList(index); " was found at position "; index; " in "; SearchTimes; "passes."
    ELSE
        PRINT "Word was not in list."
        PRINT "Previous word was ==> "; WordList(LastIndex - 1)
        PRINT "Next word was ======> "; WordList(LastIndex + 1)
        PRINT "Search took"; SearchTimes; "passes."
    END IF
    PRINT
LOOP
 
 
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 'lbound(wordlist)
    max = 466544 'ubound(wordlist)
 
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (max + min) \ 2
        'IF gap = oldgap THEN gap = gap + 1
        compare = _STRCMP(search$, WordList(gap))
        IF compare > 0 THEN
            oldmin = min
            min = gap
        ELSEIF compare < 0 THEN
            oldmax = max
            max = gap
        ELSE
            FindIndex = gap
            found = -1
            EXIT FUNCTION
        END IF
        oldgap = gap
        IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
END FUNCTION
 
SUB Sort (Array() AS STRING)
    'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
    'It's more than fast enough for our needs here I th ink.  ;)
    gap = UBOUND(array)
    DO
        gap = 10 * gap \ 13
        IF gap < 1 THEN gap = 1
        i = 0
        swapped = 0
        DO
            IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
                SWAP Array(i), Array(i + gap)
                swapped = -1
            END IF
            i = i + 1
        LOOP UNTIL i + gap > UBOUND(Array)
    LOOP UNTIL swapped = 0 AND gap = 1
END SUB
 

First issue: The data in the wordlist isn't fully alphabetical... O_o!! It needed a quick sort so we can search it properly. Just look at the first few lines; it should've been obvious, but I overlooked it:

Quote

2
1080
&c
10-point
10th
11-point
12-point
16-point
18-point
1st

Since when does "2" come before "1"??

Sorting routine added, and bug was squashed.

Glitch 2:

OPEN "466544 Word List.txt" FOR BINARY AS #1
max = 370099 'ubound(wordlist)

There's a small difference in those two numbers... (I'd originally was using a smaller file, before I found this one on the web and grabbed it for the larger data set.)

There's no way we could find a ton of words in the list, while only checking about 3/4 of them...

Glitch fixed.

Glitch #3:

ELSE
FindIndex = gap
found = -1
END IF
IF max - min <= 1 THEN LastIndex = gap: found = -1 'it's not in the list

We're missing one important piece of program flow here:

ELSE
FindIndex = gap
found = -1
EXIT FUNCTION
END IF
IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list

We'd find the proper result, and then if the gap was small enough between the last numbers, we'd lie and claim it wasn't actually in the list...

Glitch fixed.

All should be working as advertised now, in the code above. ;)

Sidenote: Your implementation of Pete's method is glitched. "dog" would be found if "dogfood" was included in the list, while "dog" actually wasn't.

I was going to do a fix for the issue, but you'd need to take a nap while it was running:

Code: QB64: [Select]

FOR i = 1 TO 466545 ' adding brackets to words to stop a false match, such as "dog" being found in "dogfood"
    word$ = word$ + "{" + WordList(i) + "}" 'build up the search list
    IF i < 50001 THEN RandomWords(i) = "{" + RandomWords(i) + "}"
NEXT

I tried to rebuild the already loaded and sorted list with brackets delimiting the words, but I didn't have the patience to sit and listen to my PC try to sound like an airplane with all the fans going full throttle for who knows how long....

Title: Re: Binary Search Method
Post by: luke on January 25, 2019, 05:42:29 am

And just because we can, here's a recursive binary sort:

Code: QB64: [Select]

DEFLNG A-Z
 
OPEN "words" FOR INPUT AS #1
DIM SHARED WordList(99170) AS STRING
DO UNTIL EOF(1)
    LINE INPUT #1, WordList(count)
    count = count + 1
LOOP
 
DO
    INPUT "> ", search$
    IF search$ = "" THEN END
    index = find(search$, 0, UBOUND(wordlist))
    IF index >= 0 THEN
        PRINT "Found "; WordList(index); " at"; index
    ELSE
        PRINT "Not found"
    END IF
LOOP
 
FUNCTION find (word$, start, finish)
    size = finish - start + 1
    mid = size \ 2
    cmp = _STRCMP(WordList(start + mid), word$)
    SELECT CASE cmp
        CASE 0
            find = start + mid
        CASE IS > 0
            IF size = 1 THEN find = -1 ELSE find = find(word$, start, start + mid - 1)
        CASE IS < 0
            IF size = 1 THEN find = -1 ELSE find = find(word$, start + mid + 1, finish)
    END SELECT
END FUNCTION

(My word list had all the words beginning with a capital letter listed first, so I'm using _STRCMP)

Title: Re: Binary Search Method
Post by: SMcNeill on January 25, 2019, 09:18:49 am

Quote from: SMcNeill on January 24, 2019, 09:17:35 pm

Use it or not; there it is.

I’d still love to see a similar speed test for your method. I’m curious how big a difference it could be.

Quote from: STxAxTIC on January 24, 2019, 09:24:47 pm

It's open source. Have a ball. (Might wanna optimize it first.)

So, since Stx wasn't willing to do a speed comparison, I did:

Code: QB64: [Select]

DIM SHARED HashTableSize AS LONG
HashTableSize = 300007 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
 
DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{"
RB = "}"
 
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
 
DIM SHARED WordList(466545) AS STRING
PRINT "Loading library"
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
 
Sort WordList()
 
i = 0
FOR i = 1 TO 466545
    b$ = WordList(i) 'the word to store
    c$ = LTRIM$(RTRIM$(STR$(i))) 'to store the index
    d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
    EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB + LB + c$ + RB
NEXT
CLOSE #1
PRINT "Done creating Hash Table."
 
' Done developing fast lookup tool. Now time for an application.
 
PRINT "Looking up"
DIM RandomWords(50000) AS STRING
FOR i = 1 TO 50000
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
 
t# = TIMER
FOR i = 1 TO 50000
    'PRINT "Searching for: "; RandomWords(i),
    l$ = Lookup$(RandomWords(i))
    IF l$ <> "" THEN
        l = INSTR(l$, " ")
        word$ = LEFT$(l$, l - 1)
        index = VAL(MID$(l$, l + 1))
        'PRINT WordList(index) 'to show that we got the index back successfully
        'ELSE
        'PRINT "NOT FOUND!"
    END IF
NEXT
PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
 
t# = TIMER
FOR i = 1 TO 50000
    index = FindIndex(RandomWords(i))
    'PRINT "Searching for: "; RandomWords(i),
    'IF index THEN 'to show that we got the index back successfully
    '    PRINT WordList(index)
    'ELSE
    '    PRINT "NOT FOUND!"
    'END IF
NEXT
PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
 
PRINT
PRINT "And just to compare results -- the first 15 words:"
FOR i = 1 TO 10
    'PRINT "Searching for: "; RandomWords(i),
    l$ = Lookup$(RandomWords(i))
    IF l$ <> "" THEN
        l = INSTR(l$, " ")
        word$ = LEFT$(l$, l - 1)
        index = VAL(MID$(l$, l + 1))
        PRINT WordList(index), 'to show that we got the index back successfully
    ELSE
        PRINT "NOT FOUND!",
    END IF
    index = FindIndex(RandomWords(i))
    'PRINT "Searching for: "; RandomWords(i),
    IF index THEN 'to show that we got the index back successfully
        PRINT WordList(index)
    ELSE
        PRINT "NOT FOUND!"
    END IF
NEXT
 
 
 
 
FUNCTION Lookup$ (a AS STRING)
    r$ = ""
    b$ = EnglishDictionary(HashFunc(a))
    c$ = ""
    d$ = ""
    IF b$ <> "" THEN
        DO WHILE c$ <> a
            c$ = ReturnBetween(b$, LB, RB)
            IF c$ = "" THEN EXIT DO
            b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
            d$ = ReturnBetween(b$, LB, RB)
        LOOP
    END IF
    r$ = a + "  " + d$
    Lookup$ = r$
END FUNCTION
 
FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' input string, left bracket, right bracket
    i = INSTR(a, b)
    j = INSTR(a, c)
    f = LEN(c)
    ReturnBetween$ = MID$(a, i + f, j - (i + f))
END FUNCTION
 
FUNCTION HashFunc (a AS STRING) ' input string
    DIM sum AS DOUBLE
    sum = HashTableSize
    FOR k = 1 TO LEN(a)
        sum = sum + k * COS(ASC(MID$(a, k, 1))) ' Without the linear factor of k, permutations have same hash values.
    NEXT
    sum = ABS(VAL(ReplaceSubString$(STR$(sum), ".", "")))
    sum = sum MOD HashTableSize
    HashFunc = sum
END FUNCTION
 
FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
    j = INSTR(a, b)
    IF j > 0 THEN
        r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
    ELSE
        r$ = a
    END IF
    ReplaceSubString$ = r$
END FUNCTION
 
 
SUB Sort (Array() AS STRING)
    'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
    'It's more than fast enough for our needs here I th ink.  ;)
    gap = UBOUND(array)
    DO
        gap = 10 * gap \ 13
        IF gap < 1 THEN gap = 1
        i = 0
        swapped = 0
        DO
            IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
                SWAP Array(i), Array(i + gap)
                swapped = -1
            END IF
            i = i + 1
        LOOP UNTIL i + gap > UBOUND(Array)
    LOOP UNTIL swapped = 0 AND gap = 1
END SUB
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 'lbound(wordlist)
    max = 466544 'ubound(wordlist)
 
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (max + min) \ 2
        compare = _STRCMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap
        ELSEIF compare < 0 THEN
            max = gap
        ELSE
            FindIndex = gap
            found = -1
            EXIT FUNCTION
        END IF
        IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
END FUNCTION
 
 

In both search instances, we use the same word list, the same 50,000 word search list, and we return both the word we're searching for and its index position in the list.

Now, I imagine that Stx will come along and tell me I'm doing something wrong with the routine he provided, but the screenshot below is the results I get. If there's something wrong with things, I'm more than happy to learn from whatever the issue is:

Title: Re: Binary Search Method
Post by: STxAxTIC on January 25, 2019, 09:34:54 am

As predicted, it's not apples-to-apples.

Gotta replace my hash function which uses a cosine, and then a completely separate recursive string function just to remove the decimal point. The ReplaceSubString function should be gone from the hash function - use Lukes suggestion instead.

Also your wordlist of 400k+ entries exceeded the value of the hash table size of 300007. You wired it for collisions in other words. Not sure if you noticed or if that's an accidental straw man.

This isn't worth beans until you get an honest comparison.

EDIT

LOL and I just noticed you're using the hash table wrong. Why on earth are you storing the index (as a collision) next to the word? This is a speed test - trim that fat!

... And once we see where this test finally goes, let's push the record limit to the maximum and see how each one does with like, a billion words.

Title: Re: Binary Search Method
Post by: Pete on January 25, 2019, 10:01:20 am

Bill, you made my point about the problems with hashing. With my methods and Steve's method, you don't need a complicated pre-plan, which may need to be overhauled or adjusted someday. Just alphabetize the list, which a one-time never change it algorithm.

I do love the hashing method and it is a damn clever fast and efficient way to set up a list to be searched and indexed. I suppose the merits in any of these methods depends on the use. For instance, indexing doesn't mean squat in a spell check routine, so I need that feature like a fish needs a bicycle. That's one reason why even my quick and dirty QB64 instr() method is appealing to me. So while you keep searching for the Holly Grail of hash, Steve and I just grab a couple of Dixie cups and drink up.

In any event, good luck with your hash refinements and thank you for starting this awesome topic for discussion, originally in the, "Dictionizer" thread, and to Steve for continuing it here. Guys, we need more critical thinking threads like these around here.

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 25, 2019, 10:02:36 am

Quote from: STxAxTIC on January 25, 2019, 09:34:54 am

As predicted, it's not apples-to-apples.

Gotta replace my hash function which uses a cosine, and then a completely separate recursive string function just to remove the decimal point. The ReplaceSubString function should be gone from the hash function - use Lukes suggestion instead.

Also your wordlist of 400k+ entries exceeded the value of the hash table size of 300007. You wired it for collisions in other words. Not sure if you noticed or if that's an accidental straw man.

This isn't worth beans until you get an honest comparison.

I told you you were comparing apples to peanut butter...

The Binary Search is a "search and done" method for data retrieval. The Hash Table is an "instant jump to a spot in memory and then parse a much smaller data set" method for data retrieval. As long as you're going to have collisions and need to search for those, you have to take that search time into consideration.

Using your suggestions of "Luke's method" + "Table Larger Than List", we end up with the following code:

Code: QB64: [Select]

DIM SHARED HashTableSize AS LONG
HashTableSize = 611953 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
 
DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{"
RB = "}"
 
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
 
DIM SHARED WordList(466545) AS STRING
PRINT "Loading library"
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
 
Sort WordList()
 
i = 0
FOR i = 1 TO 466545
    b$ = WordList(i) 'the word to store
    c$ = LTRIM$(RTRIM$(STR$(i))) 'to store the index
    d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
    EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB + LB + c$ + RB
NEXT
CLOSE #1
PRINT "Done creating Hash Table."
 
' Done developing fast lookup tool. Now time for an application.
 
PRINT "Looking up"
DIM RandomWords(50000) AS STRING
FOR i = 1 TO 50000
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
 
t# = TIMER
FOR i = 1 TO 50000
    'PRINT "Searching for: "; RandomWords(i),
    l$ = Lookup$(RandomWords(i))
    'IF l$ <> "" THEN
    'l = INSTR(l$, " ")
    'word$ = LEFT$(l$, l - 1)
    'index = VAL(MID$(l$, l + 1))
    'PRINT WordList(index) 'to show that we got the index back successfully
    'ELSE
    'PRINT "NOT FOUND!"
    'END IF
NEXT
PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
 
t# = TIMER
FOR i = 1 TO 50000
    index = FindIndex(RandomWords(i))
    'PRINT "Searching for: "; RandomWords(i),
    'IF index THEN 'to show that we got the index back successfully
    '    PRINT WordList(index)
    'ELSE
    '    PRINT "NOT FOUND!"
    'END IF
NEXT
PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
 
PRINT
PRINT "And just to compare results -- the first 15 words:"
FOR i = 1 TO 10
    'PRINT "Searching for: "; RandomWords(i),
    l$ = Lookup$(RandomWords(i))
    IF l$ <> "" THEN
        l = INSTR(l$, " ")
        word$ = LEFT$(l$, l - 1)
        index = VAL(MID$(l$, l + 1))
        PRINT WordList(index), 'to show that we got the index back successfully
    ELSE
        PRINT "NOT FOUND!",
    END IF
    index = FindIndex(RandomWords(i))
    'PRINT "Searching for: "; RandomWords(i),
    IF index THEN 'to show that we got the index back successfully
        PRINT WordList(index)
    ELSE
        PRINT "NOT FOUND!"
    END IF
NEXT
 
 
 
 
FUNCTION Lookup$ (a AS STRING)
    r$ = ""
    b$ = EnglishDictionary(HashFunc(a))
    c$ = ""
    d$ = ""
    IF b$ <> "" THEN
        DO WHILE c$ <> a
            c$ = ReturnBetween(b$, LB, RB)
            IF c$ = "" THEN EXIT DO
            b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
            d$ = ReturnBetween(b$, LB, RB)
        LOOP
    END IF
    r$ = a + "  " + d$
    Lookup$ = r$
END FUNCTION
 
FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' input string, left bracket, right bracket
    i = INSTR(a, b)
    j = INSTR(a, c)
    f = LEN(c)
    ReturnBetween$ = MID$(a, i + f, j - (i + f))
END FUNCTION
 
FUNCTION HashFunc (s$)
    DIM hash~&, i
    hash~& = 5381
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i)) MOD 611953
    NEXT i
    HashFunc = hash~&
END FUNCTION
 
 
FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
    j = INSTR(a, b)
    IF j > 0 THEN
        r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
    ELSE
        r$ = a
    END IF
    ReplaceSubString$ = r$
END FUNCTION
 
 
SUB Sort (Array() AS STRING)
    'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
    'It's more than fast enough for our needs here I th ink.  ;)
    gap = UBOUND(array)
    DO
        gap = 10 * gap \ 13
        IF gap < 1 THEN gap = 1
        i = 0
        swapped = 0
        DO
            IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
                SWAP Array(i), Array(i + gap)
                swapped = -1
            END IF
            i = i + 1
        LOOP UNTIL i + gap > UBOUND(Array)
    LOOP UNTIL swapped = 0 AND gap = 1
END SUB
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 'lbound(wordlist)
    max = 466544 'ubound(wordlist)
 
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (max + min) \ 2
        compare = _STRCMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap
        ELSEIF compare < 0 THEN
            max = gap
        ELSE
            FindIndex = gap
            found = -1
            EXIT FUNCTION
        END IF
        IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
END FUNCTION
 
 

Faster than previously by about half, but still about twice as slow as the binary search method...

Regardless; I think the results speak to show that when your data is already sorted, doing a binary search of it is rather quite efficient.

Anyway, I think I've did as you asked:

Quote

I think the burden's on you to do the speed test m8 (for a large data set with lots of performance demand).

If it's *wrong* at this point, I think the burden's on you to show us where and how.

Title: Re: Binary Search Method
Post by: STxAxTIC on January 25, 2019, 10:11:06 am

Nah, I don't need to prove shit to you guys again on this. Fun talk though. Someone who takes the time to sift through this convo, and the way Steve split it among two threads, will get a real chuckle.

If you really want a *win* on this Steve, adjust it for a dataset size typical of a respectable database. A few billion if you can. There's the bait. You might even be right. Try a billion or so.

EDIT:

shit car's been running. lemme fix this up later

Title: Re: Binary Search Method
Post by: Pete on January 25, 2019, 10:29:42 am

Check your refrigerator. If it's running too, the chances are it's having an affair with your car. You'll probably be back soon, because history proves frigid relationships don't last too long.

Seriously Bill, billions of data entries and a hash algorithm to accommodate that project? What, thousands of collisions that need to be sorted? I would think at that level, you would at least need to combine our methods to get adequate results.

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 25, 2019, 11:15:27 am

Quote from: STxAxTIC on January 25, 2019, 10:11:06 am

...adjust it for a dataset size typical of a respectable database. A few billion if you can. There's the bait. You might even be right. Try a billion or so.

Here's you a list with 10 million "words" in it, but it'll easily scale to whatever size you want (as long as you have the memory limits to run it):

Code: QB64: [Select]

DEFLNG A-Z
DIM SHARED HashTableSize AS LONG
HashTableSize = 10000019 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
 
DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{"
RB = "}"
 
PRINT "Reserving memory for Hashtable..."
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
 
 
CONST limit = 10000000
DIM SHARED WordList(limit) AS STRING
 
PRINT "Generating List..."
FOR i = 1 TO limit
    WordList(i) = STR$(i) 'generate a list of "words"
NEXT
 
PRINT "Sorting list to alphabeticlize it..."
Sort WordList() 'to put it in alphabetic order and not numeric
 
PRINT "Creating Hash Table..."
FOR i = 1 TO limit
    b$ = WordList(i) 'the word to store
    d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
    EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB
NEXT
PRINT "Done creating Hash Table."
 
' Done developing fast lookup tool. Now time for an application.
 
PRINT "Generating search list..."
DIM RandomWords(limit) AS STRING
RandomWords() = WordList() 'lets just look up all the damn words
 
 
PRINT "Searching using Hash Table"
t# = TIMER
FOR i = 1 TO limit
    l$ = Lookup$(RandomWords(i))
NEXT
PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
 
PRINT "Searching using Binary Search"
t# = TIMER
FOR i = 1 TO limit
    index = FindIndex(RandomWords(i))
    'PRINT i, index, "?"; RandomWords(index)
    'SLEEP
NEXT
PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
 
 
FUNCTION Lookup$ (a AS STRING)
    r$ = ""
    b$ = EnglishDictionary(HashFunc(a))
    c$ = ""
    d$ = ""
    IF b$ <> "" THEN
        DO WHILE c$ <> a
            c$ = ReturnBetween(b$, LB, RB)
            IF c$ = "" THEN EXIT DO
            b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
            'd$ = ReturnBetween(b$, LB, RB)
        LOOP
    END IF
    r$ = a '+ "  " + d$
    Lookup$ = r$
END FUNCTION
 
FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' input string, left bracket, right bracket
    i = INSTR(a, b)
    j = INSTR(a, c)
    f = LEN(c)
    ReturnBetween$ = MID$(a, i + f, j - (i + f))
END FUNCTION
 
FUNCTION HashFunc (s$)
    DIM hash~&, i
    hash~& = 5381
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i)) MOD HashTableSize
    NEXT i
    HashFunc = hash~&
END FUNCTION
 
 
FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
    j = INSTR(a, b)
    IF j > 0 THEN
        r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
    ELSE
        r$ = a
    END IF
    ReplaceSubString$ = r$
END FUNCTION
 
 
SUB Sort (Array() AS STRING)
    'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
    'It's more than fast enough for our needs here I th ink.  ;)
    gap = UBOUND(array)
    DO
        gap = 10 * gap \ 13
        IF gap < 1 THEN gap = 1
        i = 0
        swapped = 0
        DO
            IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
                SWAP Array(i), Array(i + gap)
                swapped = -1
            END IF
            i = i + 1
        LOOP UNTIL i + gap > UBOUND(Array)
    LOOP UNTIL swapped = 0 AND gap = 1
END SUB
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 'lbound(wordlist)
    max = limit + 1
 
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (max + min) \ 2
        compare = _STRCMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap
        ELSEIF compare < 0 THEN
            max = gap
        ELSE
            FindIndex = gap
            found = -1
            EXIT FUNCTION
        END IF
        IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
        'PRINT min, max, search$, WordList(gap), compare
        'SLEEP
    LOOP
END FUNCTION
 

One word of caution though: This won't run in QB64x32...

You wanted a hash table size larger than the data set ("Also your wordlist of 400k+ entries exceeded the value of the hash table size of 300007. You wired it for collisions in other words."), and simply reserving space for it in memory takes up about 2.6GB of RAM, as you can see just from the snippet below:

Code: QB64: [Select]

DEFLNG A-Z
DIM SHARED HashTableSize AS LONG
HashTableSize = 10000019 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
 
DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{"
RB = "}"
 
PRINT "Reserving memory for Hashtable..."
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.

Once the data is all filled and in memory, it requires 3,418.6MB to run without crashing.

(And, for what it's worth, the binary search still beats the hash table for speed, on my PC, with one taking 6 seconds, the other taking 4.5.)

Title: Re: Binary Search Method
Post by: _vince on January 25, 2019, 11:56:03 am

Quote from: Pete on January 24, 2019, 09:29:23 pm

Quote from: STxAxTIC on January 24, 2019, 09:14:15 pm
Steve, the big-O notation is designed to specifically not compare apples to oranges. Dicscrete calculus will be a lesson for another time... I just cant right now, my facepalm hurts.

Dicscrete calculus, Bill? I think you'd better chose a search method and get that spell checker up and working for craps sake.

If you mean discrete calculus, is that one of those math courses offered at night for active singles? If so, what is the probably of completing the course without contracting a nasty case of standard deviations?

Pete :D

Either way, the dic will secrete

Title: Re: Binary Search Method
Post by: Pete on January 25, 2019, 12:01:38 pm

They say comedy is contagious, but I'm not sure anyone wants to catch it that way.

Pete :D

Title: Re: Binary Search Method
Post by: Ed Davis on January 25, 2019, 12:04:24 pm

Well, I thought I would give this a try.

I took the original code, and tried to make it "apples to apples" as much as possible.

Note that for the hash function proper, no need to do a mod on each iteration. One at the end is enough. Also, I redid the hash lookup, as it was doing lots of extra work.

On my machine, hashing wins: .109 vs .219. About twice as fast.

Code: QB64: [Select]

DIM SHARED HashTableSize AS LONG
HashTableSize = 500009' Best to use a big prime number. Bigger examples are 611953 and 1014729.
 
DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{"
RB = "}"
 
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
 
DIM SHARED WordList(466545) AS STRING
PRINT "Loading library"
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
 
print "Sorting wordlist"
Sort WordList()
 
print "Creating Hash Table"
i = 0
FOR i = 1 TO ubound(wordlist)
    b$ = WordList(i) 'the word to store
    d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
    if d < 0 then
        print "hash out of range"
        end
    end if
    EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB
NEXT
CLOSE #1
 
' Done developing fast lookup tool. Now time for an application.
 
const ITER = 75000
 
PRINT "Creating "; ITER; " random word lookup table"
DIM RandomWords(ITER) AS STRING
FOR i = 1 TO ITER
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
 
Print "Search via Hashing...       ";
t# = TIMER
FOR i = 1 TO ITER
    l$ = Lookup$(RandomWords(i))
    if l$ <> RandomWords(i) then
        print "HashTable error - searched: ", RandomWords(i), " found: ", l$
        end
    end if
NEXT
PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
 
Print "search via Binary Search... ";
t# = TIMER
FOR i = 1 TO ITER
    index = FindIndex(RandomWords(i))
    if WordList(index) <> RandomWords(i) then
        print "BinarySearch error - searched: ", RandomWords(i), " found: ", WordList(index)
    end if
NEXT
PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
 
PRINT
PRINT "And just to compare results -- the first 10 words:"
FOR i = 1 TO 10
    'PRINT "Searching for: "; RandomWords(i),
    l$ = Lookup$(RandomWords(i))
    IF l$ <> "" THEN
        l = INSTR(l$, " ")
        word$ = LEFT$(l$, l - 1)
        index = VAL(MID$(l$, l + 1))
        PRINT WordList(index), 'to show that we got the index back successfully
    ELSE
        PRINT "NOT FOUND!",
    END IF
    index = FindIndex(RandomWords(i))
    'PRINT "Searching for: "; RandomWords(i),
    IF index THEN 'to show that we got the index back successfully
        PRINT WordList(index)
    ELSE
        PRINT "NOT FOUND!"
    END IF
NEXT
 
FUNCTION Lookup$ (s AS STRING)
    haystack$ = EnglishDictionary(HashFunc(s))
    needle$ = LB + s + RB
    p = instr(haystack$, needle$)
    if p = 0 then
        Lookup$ = ""
    else
        Lookup$ = mid$(haystack$, p + 1, LEN(needle$) - 2)
    end if
END FUNCTION
 
FUNCTION HashFunc (s AS STRING) ' input string
    DIM hash~&, i
 
    hash~& = 5381
 
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i))
    NEXT i
 
    HashFunc = hash~& mod HashTableSize
END FUNCTION
 
SUB Sort (Array() AS STRING)
    'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
    'It's more than fast enough for our needs here I think.  ;)
    gap = UBOUND(array)
    DO
        gap = 10 * gap \ 13
        IF gap < 1 THEN gap = 1
        i = 0
        swapped = 0
        DO
            IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
                SWAP Array(i), Array(i + gap)
                swapped = -1
            END IF
            i = i + 1
        LOOP UNTIL i + gap > UBOUND(Array)
    LOOP UNTIL swapped = 0 AND gap = 1
END SUB
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = lbound(wordlist)
    max = ubound(wordlist)
 
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (max + min) \ 2
        compare = _STRCMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap
        ELSEIF compare < 0 THEN
            max = gap
        ELSE
            FindIndex = gap
            found = -1
            EXIT FUNCTION
        END IF
        IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
END FUNCTION
 
 

Many years ago (1990 to be precise), when I was needing something like this for a project I was doing in C, I ran extensive tests on different search functions - avltree, binarytree, binary search, hashing (I tested 135 different hash functions), custom lists tuned to the data, and a few more.

I reran some of those tests this morning.

In C at least, with my code (e.g., a custom binary search, instead of the built-in bsearch), hashing is about 2 times faster than binary searching. So pretty much what I found for the code above.

Per the literature, it is even faster to use a power of 2 for the table size, rather than a prime for the table size, and then and off the appropriate bits instead of doing the mod at the end of the hash function. In my tests with C, I did find this to be true, although it was just slightly faster, not strikingly so. But of course, when you have potentially many millions of lookups, then every bit of speed counts. Also note that the power of 2 table size assumes a high quality hash function. Otherwise, it is best to stick with a prime number as the table size.

Anyway, cool topic! Thanks to everyone who shared something on this!

Title: Re: Binary Search Method
Post by: SMcNeill on January 25, 2019, 12:23:35 pm

I’m getting perfectly equal time tests on my PC, with both taking 0.164 seconds to run. Either way, both methods are generally fast enough to hold up and work in most situations.

The biggest advantage to the hash list is that you dont need sorted data, whereas a binary search has lower memory requirements. My advice is to choose whichever you prefer; either will usually work in most cases.

Title: Re: Binary Search Method
Post by: SMcNeill on January 25, 2019, 01:32:58 pm

Taking Ed's improvements, I played around and tweaked the hash routine even further, making it a much faster program:

Code: QB64: [Select]

DIM SHARED HashTableSize AS LONG
HashTableSize = 500009 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
 
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
 
DIM SHARED WordList(466545) AS STRING
PRINT "Loading library"
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
 
PRINT "Creating Hash Table"
i = 0
FOR i = 1 TO UBOUND(wordlist)
    b$ = WordList(i) 'the word to store
    d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
    IF d < 0 THEN
        PRINT "hash out of range"
        END
    END IF
    EnglishDictionary(d) = EnglishDictionary(d) + CHR$(LEN(b$)) + b$
NEXT
CLOSE #1
 
' Done developing fast lookup tool. Now time for an application.
 
 
 
CONST ITER = 75000
 
PRINT "Creating "; ITER; " random word lookup table"
DIM RandomWords(ITER) AS STRING
FOR i = 1 TO ITER
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
 
PRINT "Search via Hashing...       ";
t# = TIMER
FOR i = 1 TO ITER
    l$ = Lookup$(RandomWords(i))
    IF l$ <> RandomWords(i) THEN
        PRINT "HashTable error - searched: ", RandomWords(i), " found: ", l$
        END
    END IF
NEXT
PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
 
FUNCTION Lookup$ (s AS STRING)
    haystack$ = EnglishDictionary(HashFunc(s))
    size = 1: l = 1
    DO UNTIL needle$ = s OR size = 0
        size = ASC(haystack$, l)
        needle$ = MID$(haystack$, l + 1, size)
        l = l + size + 1
    LOOP
    IF size > 0 THEN Lookup$ = needle$
END FUNCTION
 
FUNCTION HashFunc (s AS STRING) ' input string
    DIM hash~&, i
    hash~& = 5381
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i))
    NEXT i
    HashFunc = hash~& MOD HashTableSize
END FUNCTION

What's the main difference here?

EnglishDictionary(d) = EnglishDictionary(d) + CHR$(LEN(b$)) + b$

We're not storing our data separated by delimiters. Instead, we're manually recording the size of each entry and using it to directly retrieve our information. (Side note: This is a simple implementation of those dreaded "linked lists" which STx is so obsessive over. It's hard to believe he didn't implement one from the start, since he's the resident guru on the subject for us!)

By going this route, we don't need to rely on INSTR to search for the various entries for us, giving us two distinct advantages:

1) It's much faster to "jump" from word to word, than it is to search byte by byte for a match.

2) It's less likely to generate any false positives, as we're retrieving the exact string from memory and comparing. _INSTR("dogfood", "dog") is a match -- there's a "dog" in "dogfood", but "dog" <> "dogfood" if we compare directly.

Like most code, when it comes to speed, the devil's in the details.

Title: Re: Binary Search Method
Post by: Pete on January 25, 2019, 01:48:36 pm

I've already pointed out INSTR requires use of a delimiter like a comma.

dog - dogfood is easily solved that way...

wordlist$ = ",bill,cat,dog,dogfood,pete,steve,"
a = INSTR(wordlist$,",dog,")

dogfood would not be returned, just dog. Besides... dog food is two words. :P

Hey, did you try my INSTR search and frequency sample? It's dog gone fast!
https://www.qb64.org/forum/index.php?topic=1001.msg102017#msg102017

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 25, 2019, 02:04:03 pm

Quote from: Pete on January 25, 2019, 01:48:36 pm

I've already pointed out INSTR requires use of a delimiter like a comma.

dog - dogfood is easily solved that way...

wordlist$ = ",bill,cat,dog,dogfood,pete,steve,"
a = INSTR(wordlist$,",dog,")

dogfood would not be returned, just dog. Besides... dog food is two words. :P

Hey, did you try my INSTR search and frequency sample? It's dog gone fast!
https://www.qb64.org/forum/index.php?topic=1001.msg102017#msg102017

Pete

That's the point I was making with the previous post -- instead of using INSTR at all, which requires you to SEARCH the string for your words, store the length in front of each entry instead.

wordlist$ = "4bill3cat3dog7dogfood4pete5steve"

here you read 4, then get 4 characters for a string -- "bill"
then read 3, get 3 characters for "cat"...
then read 3, get 4 characters for "dog" -- MATCH!

Your method works like this:

wordlist$ = ",bill,cat,dog,dogfood,pete,steve,"
a = INSTR(wordlist$,",dog,")

byte 1 -- ",bill" <> ",dog"
byte 2 -- "bill," <> ",dog,"
byte 3 -- "ill,c" <> ",dog,"
byte 4 -- "ll,ca" <> ",dog,"
... and on byte by byte until ",dog," = ",dog,"

That's a 33 character string, and ",dog," is 5 characters, so there's up to 28 checks that the computer is doing behind the scene to look for the first occurrence of your string...

compared to a max of 6 comparisons if we simple read each word all at once.

When dealing with a process which may be repeated in multiple loops, over and over, it ends up making a difference in performance. ;)

Title: Re: Binary Search Method
Post by: Pete on January 25, 2019, 02:12:07 pm

But that's what seed in INSTR() is for, so you skip from delimiter to delimiter.

See the same post and the second code box, which uses seed% to deal with lookup frequency within a random word list.

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 25, 2019, 07:26:23 pm

Quote from: Pete on January 25, 2019, 02:12:07 pm

But that's what seed in INSTR() is for, so you skip from delimiter to delimiter.

See the same post and the second code box, which uses seed% to deal with lookup frequency within a random word list.

Pete

Seed gives you a starting point, but you still check byte by byte for a match...

a$ = “,1,2,3,4,5,3,7,8,”

seed = INSTR(seed + 1, a$, “,3,”)

Run first, it does a compare at byte 1, then byte 2, then 3... at byte 5 it finds “,3,” — 5 comparisons.

Then you have:

seed = seed + LEN(“,3,”) ‘5 + 3
seed = INSTR(seed, a$, “,3,”)

Now you start at byte 8 and search byte by byte until you get to byte 11... 5 more comparisons.

INSTR by its nature is a byte by byte compare routine (it has to be, or else how would it know where the search string appears at? A random guess?)

wordlist$ = "4bill3cat3dog7dogfood4pete5steve0”

Placing the size first lets you advance by words, not by bytes.

See the difference, and why one is faster than the other?

Title: Re: Binary Search Method
Post by: STxAxTIC on January 26, 2019, 06:00:08 pm

For some fun trivia, the post (I think three) above this one reminds me of the heartbleed bug:

https://xkcd.com/1354/ (https://xkcd.com/1354/)

Title: Re: Binary Search Method
Post by: Pete on January 27, 2019, 12:51:11 pm

Quote from: SMcNeill on January 25, 2019, 07:26:23 pm

Quote from: Pete on January 25, 2019, 02:12:07 pm
But that's what seed in INSTR() is for, so you skip from delimiter to delimiter.

See the same post and the second code box, which uses seed% to deal with lookup frequency within a random word list.

Pete

Seed gives you a starting point, but you still check byte by byte for a match...

a$ = “,1,2,3,4,5,3,7,8,”

seed = INSTR(seed + 1, a$, “,3,”)

Run first, it does a compare at byte 1, then byte 2, then 3... at byte 5 it finds “,3,” — 5 comparisons.

Then you have:

seed = seed + LEN(“,3,”) ‘5 + 3
seed = INSTR(seed, a$, “,3,”)

Now you start at byte 8 and search byte by byte until you get to byte 11... 5 more comparisons.

INSTR by its nature is a byte by byte compare routine (it has to be, or else how would it know where the search string appears at? A random guess?)

wordlist$ = "4bill3cat3dog7dogfood4pete5steve0”

Placing the size first lets you advance by words, not by bytes.

See the difference, and why one is faster than the other?

So you are proposing a frequency search be done something like this?

Code: QB64: [Select]

s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
 
' Find the whole word dog and frequency.
FOR i = 1 TO LEN(s$)
    index = VAL(MID$(s$, i, 1)) ' Good for words 9 letters or less.
    x$ = MID$(s$, i + 1, index)
    i = i + index
    IF x$ = "dog" THEN PRINT x$
NEXT

Title: Re: Binary Search Method
Post by: SMcNeill on January 27, 2019, 01:12:26 pm

Quote from: Pete on January 27, 2019, 12:51:11 pm

So you are proposing a frequency search be done something like this?

Code: QB64: [Select]
s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"

' Find the whole word dog and frequency.
FOR i = 1 TO LEN(s$)
index = VAL(MID$(s$, i, 1)) ' Good for words 9 letters or less.
x$ = MID$(s$, i + 1, index)
i = i + index
IF x$ = "dog" THEN PRINT x$
NEXT

Aye. No need to use INSTR for your search (it's inherently slower by design); just read each word and compare directly.

The only real difference I'd use would be ASCII characters instead of numeric ones, as we can deal and process those much faster.

s$ = CHR$(4) + "bill" + CHR$(3) + "cat" + CHR$(3) + "dog" + .... (so on)

Then just use index = ASC(s$, i).

Gives return values up to 255 characters in length, and processes faster than your VAL(MID$())

Title: Re: Binary Search Method
Post by: Pete on January 27, 2019, 01:31:03 pm

Well I use ASCII characters in files as delimiters all the time, but in this case there are a few considerations.

1) ASCII characters can't be letters so there is no way to separate them from the letters in the word list. Of course if the word list is limited in size, at least the extended ASCII character set could be used for words up to 128 characters. Ah, the good ol' days when supercalifragilisticexpialidocious was the longest word in the English language. Now it's: pneumonoultramicroscopicsilicovolcanoconiosis for all you sandblasting fans. So it looks like going with the extended character set would be fine.

2) Characters like CHR$(7) do not get printed in a string, so a workaround is needed in these cases. Again, I'd go with the extended set and just subtract 127.

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 27, 2019, 01:48:46 pm

I guess that all depends on what you need to do with your list...

If you need to print it, then use _CONTROLCHR OFF:

_CONTROLCHR OFF
FOR I = 0 TO 255: PRINT CHR$(I): NEXT

Normally, though, the size would be something you hide from your user, so they never see it at all.

To build s$, you’d often do something like:

DATA bill, cat, dog

DO
READ temp$
s$ = CHR$(LEN(temp$)) + temp$
LOOP

s$ = s$ + CHR$(0) ‘to designate end of list

And to print it, you’d print it much like you did the 3 “dog” above, though you might want to add a nice comma between words just to make your list look pretty. ;)

Title: Re: Binary Search Method
Post by: Pete on January 27, 2019, 02:06:53 pm

So your method would be...

Code: QB64: [Select]

 ' Convert data format first...
_CONTROLCHR OFF
s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
'convert format...
FOR i = 1 TO LEN(s$)
    index = VAL(MID$(s$, i, 1))
    x$ = MID$(s$, i + 1, index)
    i = i + index
    news$ = news$ + CHR$(index) + x$
NEXT
s$ = news$ + CHR$(0)
 
' Now your part..
FOR i = 1 TO LEN(s$)
    index = ASC(s$, i) ' Good for words 9 letters or less.
    x$ = MID$(s$, i + 1, index)
    i = i + index
    IF x$ = "dog" THEN PRINT CHR$(index), x$
NEXT
 

Look right?

Note: The board wouldn't allow pasted ascii characters for the index in the code. It dropped them out of the s$ variable so I re-posted with a conversion method.

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 27, 2019, 02:27:37 pm

Looks good. As you can see in the post under Ed’s, it’s almost exactly what I did for the hash table — speeding it up 300% + on my PC. https://www.qb64.org/forum/index.php?topic=1003.msg102044#msg102044

The longer the average words in your list, the greater the difference you’ll see in speed vs using INSTR as you were before.

Title: Re: Binary Search Method
Post by: Pete on January 27, 2019, 04:25:10 pm

OK, I like that method because it's clever, and I learned something new about a QB64 underscore keyword I haven't used before but as for speed, we need to work on that, because I get the INSTR() method as faster when a test is set up as follows...

Code: QB64: [Select]

WIDTH 120, 25
_SCREENMOVE 0, 0
_CONTROLCHR OFF
s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
'convert format...
FOR i = 1 TO LEN(s$)
    index = VAL(MID$(s$, i, 1))
    x$ = MID$(s$, i + 1, index)
    i = i + index
    news$ = news$ + CHR$(index) + x$
NEXT
 
FOR i = 1 TO 15 ' Concatenate to a string size of 3178497
    news$ = news$ + news$
NEXT
s$ = news$: news$ = ""
 
PRINT " Start timer...": PRINT
t# = TIMER
FOR i = 1 TO LEN(s$)
    index = ASC(s$, i) ' Good for words 9 letters or less.
    x$ = MID$(s$, i + 1, index)
    i = i + index
    IF x$ = "dog" THEN k = k + 1
NEXT
PRINT TIMER - t#;
PRINT "seconds to find frequency of dog in search INDEX method. Frequency ="; k; "Length of string ="; LEN(s$)
 
k = 0: _DELAY 1: PRINT
 
news$ = ",bill,cat,dog,dogfood,pete,steve,bill,cat,dog,dogfood,pete,steve,bill,cat,dog,dogfood,pete,steve,"
FOR i = 1 TO 15
    news$ = news$ + news$
NEXT
s$ = news$: news$ = ""
 
PRINT " Start timer...": PRINT
t# = TIMER
DO
    seed& = INSTR(seed&, s$, ",dog,")
    IF seed& = 0 THEN EXIT DO
    x$ = MID$(s$, seed& + 1, LEN("dog"))
    k = k + 1
    seed& = seed& + LEN(",dog,")
LOOP
 
PRINT TIMER - t#;
PRINT "seconds to find frequency of dog in search INSTR method. Frequency ="; k; "Length of string ="; LEN(s$)
 

My belief is that seed advances INSTR past the byte by byte evaluation, much like your indexing method, but faster. I know we need to make sure the comparison is fair, so if you see anything that is skewing the results, let's work it out.

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 28, 2019, 03:13:08 am

I had to do a little pondering to sort out what the heck is going on to make INSTR faster than jumping with the index and size, and what I've came up with is the same old conclusion I've gathered in the past: There's a lot of overhead in QB64 functions.

x$ = MID$(s$, i + 1, index) <--MID$ is slower than one would like.

index = ASC(s$, i) ' Good for words 9 letters or less. <--ASC is faster than MID$, but still nothing to write home about.

IF x$ = "dog" THEN k = k + 1 <--And then you do a direct string compare....

VS:

seed& = INSTR(seed&, s$, ",dog,") <--Check the result
IF seed& = 0 THEN EXIT DO
x$ = MID$(s$, seed& + 1, LEN("dog")) <--WTH is this line in here for? It actually does NOTHING for the check...
k = k + 1
seed& = seed& + LEN(",dog,")

So, for the comparison to be as fair as possible, we need to remove as much overhead from the program as we can:

Code: QB64: [Select]

 SCREEN _NEWIMAGE(800, 600, 32)
_SCREENMOVE 0, 0
_CONTROLCHR OFF
 
DEFLNG A-Z
 
CONST Limit = 100 'number of words in our list
CONST Repitition = 10000 'number of times to search the lists
 
DIM word(Limit) AS STRING, index AS _UNSIGNED _BYTE
 
 
'make an array of the proper sizes becauseQB64 dosesn't
'understand (AS STRING * variable) for a data type
'so we're setting an array for strings of size 1  to 20, just as a quick placeholder for use with mem.
DIM Strings(1 TO 20) AS STRING
FOR i = 1 TO 20: Strings(i) = SPACE$(i): NEXT
 
DIM m AS _MEM
 
 
FOR SearchSize = 1 TO 20 STEP 5 'run the list multiple times for various size search strings
 
    'Generate suitable search string
    search1$ = STRING$(SearchSize, "A")
    search2$ = "," + search1$ + "," 'comma before and after
 
    FOR i = 1 TO Limit: word(i) = "": NEXT 'reset old list
 
 
    FOR WordNumber = 1 TO Limit 'the number of words
        IF WordNumber MOD 10 = 1 THEN 'every 10th word, no matter what, is one we want to look for
            word(WordNumber) = search1$
        ELSE 'otherwise, make the word junk
            FOR i = 1 TO INT(RND * 15) + 1 'up to 15 characters of junk in the spam "words"
                word(WordNumber) = word(WordNumber) + CHR$(INT(RND * 26) + 97)
            NEXT
        END IF
    NEXT
 
    'Words are now generated.  Now let's form our two similar lists for searching.
    list1$ = "": list2$ = ","
 
    'Size/Word list
    FOR i = 1 TO Limit: list1$ = list1$ + CHR$(LEN(word(i))) + word(i): NEXT
 
    'Comma Delimited list
    FOR i = 1 TO Limit: list2$ = list2$ + word(i) + ",": NEXT
 
    'Wordlists are now built.
 
    IF _MEMEXISTS(m) THEN _MEMFREE m 'free the old mem block if we need to, from the previous run
    m = _MEMNEW(LEN(list1$)): _MEMPUT m, m.OFFSET, list1$
 
    template$ = "##.### seconds to find frequency of " + search1$ + "."
    template$ = template$ + "  Frequency = ###"
 
 
    _DELAY 1 'a delay so we can watch the tests
    PRINT
    PRINT "Running Speed Tests on "; search1$
    PRINT "SIZE/WORD: ";
 
    t# = TIMER(0.0001)
    FOR z = 1 TO Repitition
        k = 0: l = LEN(search1$): i = 1
        DO UNTIL i > LEN(list1$)
            $CHECKING:OFF
            index = _MEMGET(m, m.OFFSET + i - 1, _UNSIGNED _BYTE) 'Get the index directly, skip ASC function call
            IF index = l THEN 'if lengths don't even match, we don't need to compare words
                'just jump to the next one
                _MEMGET m, m.OFFSET + i, Strings(index) 'get the string direct from memory (like MID$ in Pete's demo)
                IF Strings(index) = search1$ THEN k = k + 1
            END IF
            $CHECKING:ON
            i = i + index + 1
        LOOP
    NEXT
    t1# = TIMER(0.0001) - t#
    PRINT USING template$; t1#, k
 
    PRINT "INSTR:";
    t# = TIMER(0.0001)
    FOR z = 1 TO Repitition
        k = 0
        DO
            seed& = INSTR(seed&, list2$, search2$)
            IF seed& = 0 THEN EXIT DO
            k = k + 1
            seed& = seed& + LEN(search2$)
        LOOP
    NEXT
    t1# = TIMER(0.0001) - t#
    PRINT USING template$; t1#, k
NEXT

Things are running so quickly here, we're having to run a search on a list of 100 words, 10000 times, to generate any significant times for comparison.

CONST Limit = 100 'number of words in our list
CONST Repitition = 10000 'number of times to search the lists

I was curious if the length of the search$ made any real difference, and it doesn't really seem to affect much from my testing.

Notable changes in this and your routine:

index = _MEMGET(m, m.OFFSET + i - 1, _UNSIGNED _BYTE) 'Get the index directly, skip ASC function call
IF index = l THEN 'if lengths don't even match, we don't need to compare words
'just jump to the next one
_MEMGET m, m.OFFSET + i, Strings(index) 'get the string direct from memory (like MID$ in Pete's demo)
IF Strings(index) = search1$ THEN k = k + 1
END IF

We strip out the use of ASC, MID$, and don't even bother to get the word if the two lengths don't match...

And, to keep it fair:

DO
seed& = INSTR(seed&, list2$, search2$)
IF seed& = 0 THEN EXIT DO
k = k + 1
seed& = seed& + LEN(search2$)
LOOP

I basically stripped out that extra line where you were calculating x$ for some odd reason with your search...

Times on my PC are basically 0.015 seconds for SIZE/STRING storage and lookup, verses 0.025 seconds for INSTR/DELIMITED storage and lookup.

Logic says jumping and skipping searches would be faster than searching byte by byte, but once we start tacking on the overhead associated with our function calls, the gap closes quickly. Without using _MEM to replace ASC and MID$, I couldn't seem to top the speed of INSTR.. the overhead was just too great.

Which now leaves me pondering -- why do the changes I made to the hash table in the post after Ed's make it so much quicker than the previous versions on my machine? I guess that'll be a mystery to sit and study on tomorrow. For now, the bed is calling my name...

Title: Re: Binary Search Method
Post by: Pete on January 28, 2019, 04:10:22 am

No strange reason for including x$ = MID$(s$, seed& + 1, LEN("dog")) in the INSTR version. I put it there because it existed in your version. I didn't want it to run faster just because it was left out, even though in the INSTR version it doesn't need to be there to count frequency.

What I don't get is Marks test. INSTR shouldn't ever run as slow as he reported.

Pete

Title: Re: Binary Search Method
Post by: STxAxTIC on January 28, 2019, 06:49:26 am

Alright, been a few days since I actually ready at any of these threads, and at a quick glance... sigh...

Steve, did you really insinuate that I'm somehow "obsessive over" ... what was it... linked lists? That kind of jab is SO uninformed and misled, just like much of... well... everything you say. Let me remind everyone else that you had no idea what you were talking about with respect to linked lists until I took it upon myself to school you. Do you remember that from like, just days ago? If yer gonna name-drop, open a new thread and we can hash it out there. No pun intended. Actually pun intended. Something tells me you had no clue about hashing til my demo, either.

Alright, don't mind my injection. Get back to you were discussing here - talking about searching databases using INSTR, rediscovering obvious properties of QB64, etc etc... bravo guys. Keep up this awesomeness.

Title: Re: Binary Search Method
Post by: SMcNeill on January 28, 2019, 08:38:24 am

Quote from: STxAxTIC on January 28, 2019, 06:49:26 am

Steve, did you really insinuate that I'm somehow "obsessive over" ... what was it... linked lists? That kind of jab is SO uninformed and misled, just like much of... well... everything you say. Let me remind everyone else that you had no idea what you were talking about with respect to linked lists until I took it upon myself to school you.

Sure, I remember. You didn’t like the NAME of a library which links lists together, so you decided to school me over my naming sense. Programmers name their code and libraries in a way that they can easily relate to and remember what they are, which is why we don’t see Library Foo1254378 floating around very damn often. For a set of code that links the behavioral properties of two lists, I still insist “Linked List Library” is fine.

I wonder if you’re the type to also cry like hell when someone talks about computing on an Apple, claiming, “But that ain’t no red, juicy fruit!”

Quote

Do you remember that from like, just days ago? If yer gonna name-drop, open a new thread and we can hash it out there. No pun intended. Actually pun intended. Something tells me you had no clue about hashing til my demo, either.

Nope. Never heard of it. I didn’t know we had hash tables inside QB64.bas, or that idet$ is basically a Linked List. I didn’t realize that binary search works with odd distribution lists like: “Apple, Bear, Sandcastle, Sasquatch, Sediment, Segment, Sip, Siren, Soap, Soda, Suds, Supper, Zebra, Zoo, Zoology”. I certainly didn’t know that larger tables resulted in more collisions in a hash table, rendering them less efficient as they grow, or else I never would’ve pushed for bigger datasets to test with — like, “I think the burden's on you to do the speed test m8 (for a large data set with lots of performance demand)”, or “adjust it for a dataset size typical of a respectable database. A few billion if you can.”

Why, if it wasn’t for you “schooling” me so well, I never could’ve worked for years in the database field and written programs to decode various formats to make them available in QB64; never could’ve written my own database library; and never could’ve gotten paid for hashing out real world solutions to billion record issues...

Quote

Alright, don't mind my injection. Get back to you were discussing here - talking about searching databases using INSTR, rediscovering obvious properties of QB64, etc etc... bravo guys. Keep up this awesomeness.

Thanks for your permission. You don’t know how close I was to giving it all up and learning physics, before you gave me permission to carry on. I truly appreciate it.

Title: Re: Binary Search Method
Post by: Pete on January 28, 2019, 10:58:30 am

Bill, I was going to expand my obvious discussions of QB64 INSTR() usage into theoretical ways it can be used to formulate words. Is it OK with you if I call that STRING$ Theory?

Pete :D

Title: Re: Binary Search Method
Post by: Dimster on January 28, 2019, 11:42:03 am

I have number of data bases which are just numbers. At present when searching I convert them to strings. I gather neither Hashing or Binary offer a way to search without the conversion step?

Title: Re: Binary Search Method
Post by: bplus on January 28, 2019, 11:47:15 am

Quote from: Pete on January 28, 2019, 04:10:22 am

No strange reason for including x$ = MID$(s$, seed& + 1, LEN("dog")) in the INSTR version. I put it there because it existed in your version. I didn't want it to run faster just because it was left out, even though in the INSTR version it doesn't need to be there to count frequency.

What I don't get is Marks test. INSTR shouldn't ever run as slow as he reported.

Pete

Well it did. Though both methods were flawed in my test, I think what the results point to is still valid ie INSTR is great for short strings but when the strings get to a certain length the Binary Search method will come into it's own as the hash method would with truly large amounts of data.

Pete, I suspect you haven't run INSTR search on nearly half a million word string, with 50,000 lookups in a row.

PS the time was not really that bad except when comparing to Binary Search Method.

Title: Re: Binary Search Method
Post by: Pete on January 28, 2019, 12:55:27 pm

Well you need to run my long string INSTR() test then, because it is fast. That's why seeding was made part of the keyword. Without seeding, I agree, INSTR is too slow, as in your example. With seeding, I can't see any reason to jump to a more complicated way of constructing data to parse it. Too many collisions with hash tables and more and more difficulty trying to figure out ways around that. Words over 9-characters and Steve's indexing method has to be adjusted to read and interpret more than one number representing ascii character. That's not hard, but takes a bit more processing time to achieve.

Anyway, let's just look at that simple INSTR() method, without seeding, for a one time place lookup with 50 loops and that long-ascii string I made up for the seeding example.

Code: QB64: [Select]

CLS
WIDTH 80, 55
_SCREENMOVE 0, 0
 
_CONTROLCHR OFF
s1$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
'convert format...
FOR i = 1 TO LEN(s1$)
    index = VAL(MID$(s1$, i, 1))
    x$ = MID$(s1$, i + 1, index)
    i = i + index
    news$ = news$ + CHR$(index) + x$
NEXT
 
FOR i = 1 TO 15 ' Concatenate to a string size of 3178497
    news$ = news$ + news$
NEXT
 
s1$ = news$ + CHR$(1) + "*": news$ = ""
 
news$ = ",bill,cat,dog,dogfood,pete,steve,bill,cat,dog,dogfood,pete,steve,bill,cat,dog,dogfood,pete,steve,"
FOR i = 1 TO 15
    news$ = news$ + news$
NEXT
s2$ = news$ + "*,": news$ = ""
 
word$ = "*": x = 1
LOCATE , x: PRINT "INDEX Search"
t# = TIMER
FOR j = 1 TO 50
    FOR i = 1 TO LEN(s1$)
        index = ASC(s1$, i) ' Good for words 9 letters or less.
        x$ = MID$(s1$, i + 1, index)
        IF x$ = word$ THEN EXIT FOR
        i = i + index
    NEXT
    LOCATE , x: PRINT j, i
NEXT
PRINT USING "###.### seconds lookup"; TIMER - t#
 
word$ = ",*,": x = 40
LOCATE 1, x: PRINT "INSTR() Search"
t# = TIMER
FOR i = 1 TO 50
    place = INSTR(s2$, word$)
    LOCATE , x: PRINT i, place
NEXT
LOCATE , x: PRINT USING "###.### seconds lookup"; TIMER - t#
 
END
 

INSTR() won that footrace, which is why I'm finding it hard to understand how it failed so miserably in your use of it. I mean it appears here that INSTR(), however it reads the string, reads it faster than skipping by an index and looking for a word match. Now if Steve's function was applied to it, maybe it would be different, but this is why I was asking Steve if the FOR/NEXT index example I came up with would be a fair comparison.

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 28, 2019, 01:12:24 pm

Quote from: bplus on January 28, 2019, 11:47:15 am

Well it did. Though both methods were flawed in my test, I think what the results point to is still valid ie INSTR is great for short strings but when the strings get to a certain length the Binary Search method will come into it's own as the hash method would with truly large amounts of data.

Pete, I suspect you haven't run INSTR search on nearly half a million word string, with 50,000 lookups in a row.

PS the time was not really that bad except when comparing to Binary Search Method.

All 3 methods are valid tools for a programmer to use.

INSTR is the quickest and simplest to implement, and like a Bubblesort, is more than sufficient for many applications. Its ease of implementation makes it a desirable work horse.

Binary Search works great for organized lists of data, or indexed data. In many cases, it’s easier to use than a Hash Table, and fast enough to run in almost all applications. The memory requirements are lower and it works perfectly fine with external files — all things to make it a desirable process to learn to implement.

To see the usefulness of indexes with Binary Search, think of of database with 3 fields — Name, Age, Phone. Fill it with 10,000,000 records... Now, who’s the youngest person on the list? Oldest? Who is listed 1000th alphabetically?

If you maintain indexes of the data, that information is instantly available, even if the data itself is unsorted.

The DataRecord:
Joe, 86, 555-1234
Pete, 62, 555-2345
Mark, 16, 555-9786

Name Index: 1, 3, 2
Age Index: 3, 2, 1
Phone Index: 1, 2, 3

So youngest is DataRecord(Age(1)).
Oldest is DataRecord(Ubound(Age))
DataRecord(Name(1000)) tells us who is listed 1000th alphabetically.

And for a look up, we just binary search the proper index... Instead of IF search$ = DataRecord(RecordToCheck), we’d make that condition IF search$ = DataRecord(IndexDesired(RecordToCheck)), in our binary search routine.

Indexing and Binary Search make for a powerful organizational toolset.

Hash Tables are, of course, the fastest method for looking things up (provided there’s not a ton of collisions to process.). Pete’s old spell check routine is actually a form of a hash table; he just didn’t know it. He was hashing by the first and then second letters of his word, similar to:

Hash = 100* (ASC(word,1) - 64) + _CEIL((ASC(word,2) -64) / 13)

A word starting with AA would hash to 100 * 1 + _CEIL(1 / 13) = 101
A word starting with AM would hash to 100 * 1 + _CEIL(13 / 13) = 101
A word starting with AN would hash to 100.* 1 + _CEIL(14 / 13) = 102

Letters from Aa-Am in one list, An to Az in the next...

Pete was using a form of a hash table; he just wasn’t using a very efficient one. (Which is why I said a simple binary search would have probably been better in his case.)

A good hash table has almost instant look ups, but it also requires more memory. It can require finding the best formula for data distribution and lookup, but it’s still an invaluable tool for programming.

All have their merits and drawbacks. You just need to choose which suits your current needs the most.

Title: Re: Binary Search Method
Post by: Pete on January 28, 2019, 03:22:05 pm

So we are finally in agreement that my way is the best way! :D :D :D

Actually I did realize my old form of alphabetizing with folders was a type of hash process, but I always considered it a bastardized one. What I wanted was a hash method without collisions, but after a brief look at permutations, I punted on that idea. Numbers way too big for words with over 20 characters.

QBasic ran interpreted, which slows things down. QuickBASIC was faster and you could use integers and include/exclude libraries to speed things up more but I don't think I ever got a WP in QB that could outdo Windows Word for speed of spell checking. I assumed those developers came up with a hash table loaded into memory, but I could be wrong. Whatever they used, it was damn fast.

Pete

Title: Re: Binary Search Method
Post by: SMcNeill on January 30, 2019, 03:01:29 pm

Quote from: Dimster on January 28, 2019, 11:42:03 am

I have number of data bases which are just numbers. At present when searching I convert them to strings. I gather neither Hashing or Binary offer a way to search without the conversion step?

Sorry Dimster; in all the brouhaha, I somehow missed your post earlier. There’s no reason at all to convert them. Just search them as they are. (I’ll post a demo soon for you.)

Title: Re: Binary Search Method
Post by: SMcNeill on January 30, 2019, 03:38:51 pm

Using the SET Library I just posted here ( https://www.qb64.org/forum/index.php?topic=1015.0 ), we see a quick demo of using a binary search method on numbers, without needing to convert them to strings:

Code: QB64: [Select]

'$INCLUDE:'SET.BI'
 
DIM NumberArray(0 TO 100) AS LONG
DIM m AS _MEM: m = _MEM(NumberArray())
 
RANDOMIZE TIMER
 
FOR i = 0 TO 100
    NumberArray(i) = i 'toss a set of sorted numbers into the array
    'For ease of reference, the numbers are going to start out being the exact same as the array index.
NEXT
PRINT "Values = Index"
PRINT "The number 12 is in the "; BinaryMemSearch(12, m); " spot."
PRINT "The number 37 is in the "; BinaryMemSearch(37, m); " spot."
 
'And now to multiply those values by 100 so we can see we're not just parroting the index:
 
FOR i = 0 TO 100
    NumberArray(i) = i * 100 'toss a set of sorted numbers into the array
    'For ease of reference, the numbers are going to start out being the exact same as the array index.
NEXT
PRINT
PRINT "Values = Index * 100"
PRINT "The number 1200 is in the "; BinaryMemSearch(1200, m); " spot."
PRINT "The number 3700 is in the "; BinaryMemSearch(3700, m); " spot."
PRINT
PRINT "PRESS <ANY KEY> TO CONTINUE"
 
 
DO
    SLEEP
    CLS
    'And now to randomize some numbers
    DO
        WeHave50Somewhere = 0
        FOR i = 1 TO 20
            r = INT(RND * 100)
            NumberArray(i) = r
            IF r = 50 THEN WeHave50Somewhere = -1
        NEXT
    LOOP UNTIL WeHave50Somewhere
 
    MemSortSome m, 1, 20 'sort those 20 numbers so we can search them
    PRINT "Index", "Number"
    FOR i = 1 TO 20
        PRINT "      "; i, NumberArray(i)
    NEXT
    index = BinaryMemSearchSome(50, m, 1, 20) 'find 50 amongst the records from 1 to 20
    LOCATE index + 1, 1: PRINT "HERE=>"
    LOCATE 23, 1
    PRINT "PRESS <ANY KEY> TO SEARCH A NEW DATASET, OR <ESC> TO QUIT>"
LOOP UNTIL _KEYHIT = 27
 
'$INCLUDE:'SET.BM'

Now this makes use of a _MEM routine which I wrote to do the binary searching for us, but that's just so I can have one routine work properly with all data types, and utilize the speed of the _MEM commands.

If you look into the SET.BM file, you can see the actual subs there, and in this case what we're doing is basically:

Code: QB64: [Select]

        CASE 4 'LONG
            DO
                gap = (max + min) \ 2: IF oldgap = gap THEN gap = gap + 1
                IF search > _MEMGET(m, m.OFFSET + gap * m.ELEMENTSIZE, LONG) THEN
                    min = gap
                ELSEIF search < _MEMGET(m, m.OFFSET + gap * m.ELEMENTSIZE, LONG) THEN
                    max = gap
                ELSE
                    BinaryMemSearchSome = gap
                    EXIT FUNCTION
                END IF
                oldgap = gap: IF max - min <= 1 THEN BinaryMemSearchSome = -gap: found = -1 'it's not in the list
            LOOP UNTIL found

It's just a simple binary compare of the array, comparing our search value vs the actual value. There's no actual reason to need to convert it to a string, at all. (In fact, we're not actually converting it here. We're just referencing it via the _MEM offsets.)

Title: Re: Binary Search Method
Post by: RonB on January 30, 2019, 07:48:16 pm

The absolute fastest TLU is a "direct" lookup - one where the argument IS the entry number (or index) of the entry, itself. For example, if you needed to validate the 5-digit U.S. zip code for 360 million address records, you could first construct a table (ZipCodeTable) in which there is an array entry for every possible zip code, whether valid or not. That's 100000 entries (numbered from zero to 99999). The entry for each VALID zip code will contain the number '1' (for valid). The entry for each INVALID zip code will contain the number '0' (for invalid). This TLU table requires 100000 entries (which, I believe, means that QB64 will generate code to maintain an array index table containing pointers for each of those 100000 entries). Processing is simple (assuming that you have already verified that the RecordZipCode was 5 numeric digits):
zipcode& = VAL(RecordZipCode)
IF ZipCodeTable(zipcode&) = '1' THEN process as valid ELSE process as invalid END IF.

Alternatively, you could create a single string of 100000 bytes, with each corresponding byte set to either 1 or 0. This TLU "table" (really a byte-string) requires 100000 bytes (no QB64 generated array index table). Processing is simple:
zipcode& = VAL(RecordZipCode)
IF MID$(ZipCodeTable,zipcode&,1) = '1' THEN process as valid ELSE process as invalid END IF.

I'm not sure whether QB64 would support a bitstring of 100000 bits; and while the "table" would take only 12500 bytes, it may take a lot longer to obtain the result.

Always trade offs. Bottom line: pick the TLU method that works best based on THE ACTUAL (or expected) DATA, depending on your requirements - array size, speed, simplicity, reliability, data placement (file, memory, web), etc. One size does NOT fit all.

Title: Re: Binary Search Method
Post by: codeguy on May 20, 2020, 05:40:09 pm

Another super useful data structure is a binary tree, in this example, implemented as an array. It includes searching, traversal and even presents output in sorted order. Two caveats: for sorted input or monotonically repeating datasets, the performance is terrible for reasons I haven't patience to discuss.

Code: [Select]

'treesortI64.bas
OPTION _EXPLICIT
TYPE minmaxrec
    min AS _INTEGER64
    max AS _INTEGER64
END TYPE
'**************************************
'* anyone claiming you need c/c++ to implement trees is telling you CRAP
'* This is a bit more complex than the standard non-copying version, but it is still
'* respectably fast. General complexity for TreeSort() is O(NLogN), EXCEPT when
'* presented with elements already sorted. One way to avoid this is to KnuthShuffle
'* the input first. Skipped in this implementation, but there is no reason you
'* can't do it prior to TreeSort(). Code modified/added from my repository. This
'* version allows multiple same-value nodes
'* Modified/added 26 March 2018.
'* Further modifed for _INTEGER64 sized indexes 20 May 2020
'**************************************
SUB TreeSortUsingBST (CGSortLibArr() AS DOUBLE, start AS _INTEGER64, finish AS _INTEGER64, order AS INTEGER)
    DIM TSAmmrec AS minmaxrec
    'GetMinMaxArray CGSortLibArr(), start, finish, TSAmmrec
    'if  CGSortLibArr(TSAmmrec.max) = CGSortLibArr(TSAmmrec.min) then exit sub
    'IF delta# = 0 THEN 'already sorted because they're all equal
    '    EXIT SUB
    'END IF
    DIM NilValue: NilValue = LBOUND(CGSortLibArr) - 1
    DIM x AS _INTEGER64
    DIM pointer AS _INTEGER64
    DIM free: free = 2
    TYPE TreeNode
        value AS DOUBLE
        left AS _INTEGER64
        right AS _INTEGER64
    END TYPE
    DIM tree(start + 1 TO finish + 1) AS TreeNode
    FOR x = start + 1 TO finish + 1
        tree(x).value = 0
        tree(x).left = NilValue
        tree(x).right = NilValue
    NEXT
    tree(1).value = CGSortLibArr(1 - 1)
    IF order = 1 THEN
        FOR x = 2 TO finish
            pointer = 1
            DO
                IF CGSortLibArr(x - 1) < tree(pointer).value THEN
                    IF tree(pointer).left = NilValue THEN
                        tree(pointer).left = free
                        tree(free).value = CGSortLibArr(x - 1)
                        free = free + 1
                        EXIT DO
                    ELSE
                        pointer = tree(pointer).left
                    END IF
                ELSE
                    IF tree(pointer).right = NilValue THEN
                        tree(pointer).right = free
                        tree(free).value = CGSortLibArr(x - 1)
                        free = free + 1
                        EXIT DO
                    ELSE
                        pointer = tree(pointer).right
                    END IF
                END IF
            LOOP
        NEXT x
    ELSE
        FOR x = 2 TO finish
            pointer = 1
            DO
                IF CGSortLibArr(x - 1) > tree(pointer).value THEN
                    IF tree(pointer).left = NilValue THEN
                        tree(pointer).left = free
                        tree(free).value = CGSortLibArr(x - 1)
                        free = free + 1
                        EXIT DO
                    ELSE
                        pointer = tree(pointer).left
                    END IF
                ELSE
                    IF tree(pointer).right = NilValue THEN
                        tree(pointer).right = free
                        tree(free).value = CGSortLibArr(x - 1)
                        free = free + 1
                        EXIT DO
                    ELSE
                        pointer = tree(pointer).right
                    END IF
                END IF
            LOOP
        NEXT x
    END IF
    DIM depth AS _INTEGER64
    depth = start + 1
    Traverse_tree CGSortLibArr(), start + 1, depth, tree(), NilValue
    ERASE tree
END SUB

SUB Traverse_tree (CGSortLibArr() AS DOUBLE, NextPtr&, depth&, tree() AS TreeNode, NilValue&)
    IF tree(NextPtr&).left <> NilValue& THEN
        Traverse_tree CGSortLibArr(), tree(NextPtr&).left, depth&, tree(), NilValue&
    END IF
    CGSortLibArr(depth& - 1) = tree(NextPtr&).value
    depth& = depth& + 1
    IF tree(NextPtr&).right <> NilValue& THEN Traverse_tree CGSortLibArr(), tree(NextPtr&).right, depth&, tree(), NilValue&
END SUB

Title: Re: Binary Search Method
Post by: codeguy on June 17, 2020, 01:59:03 am

One caveat: BINARY search as it's coded fails on descending lists, SO a minor change is necessary. Presented is code that works successfully for ascending and descending order.

Code: [Select]

'BinarySearchTestI64
OPTION _EXPLICIT
CONST OrderDescending = -1
CONST OrderAscending = 1
CONST OrderMonotonic = 0

DIM LowerBoundTestArry AS _UNSIGNED _INTEGER64
DIM UpperBoundTestArry AS _UNSIGNED _INTEGER64: UpperBoundTestArry = 0
DIM MainI AS _UNSIGNED _INTEGER64: MainI = LowerBoundTestArry
DIM MainWhere AS _UNSIGNED _INTEGER64: MainWhere = LowerBoundTestArry
DIM MaiTestStart AS SINGLE
DIM MainTestEnd AS SINGLE
DIM mainOrderchoice AS INTEGER: mainOrderchoice = OrderDescending
DIM MainRepPercentage AS DOUBLE: MainRepPercentage = 0
DO
    UpperBoundTestArry = UpperBoundTestArry + 1
    REDIM CgSortLbAr(LowerBoundTestArry TO UpperBoundTestArry) AS DOUBLE
    FillArray CgSortLbAr(), LowerBoundTestArry, UpperBoundTestArry, mainOrderchoice, MainRepPercentage
    MaiTestStart = TIMER(.001)
    FOR MainI = LowerBoundTestArry TO UpperBoundTestArry
        cgBinarysearch CgSortLbAr(), LowerBoundTestArry, UpperBoundTestArry, CgSortLbAr(MainI), MainWhere
        IF CgSortLbAr(MainI) = CgSortLbAr(MainWhere) THEN
        ELSE
            STOP
        END IF
    NEXT
    MainTestEnd = TIMER(.001)
    PRINT UpperBoundTestArry - LowerBoundTestArry; MainTestEnd - MaiTestStart
    UpperBoundTestArry = UpperBoundTestArry * 2
LOOP UNTIL UpperBoundTestArry > 67108863
PRINT UpperBoundTestArry

SUB FillArray (CgSortLibArr() AS DOUBLE, start AS _UNSIGNED _INTEGER64, finish AS _UNSIGNED _INTEGER64, OrderChoice AS INTEGER, Repetitions AS DOUBLE)
    DIM FillArray_RepeatStart AS _UNSIGNED _INTEGER64
    IF OrderChoice = OrderMonotonic THEN
        FOR FillArray_RepeatStart = start TO finish
            CgSortLibArr(FillArray_RepeatStart) = 0
        NEXT
    ELSE
        IF Repetitions > 0 THEN
            IF Repetitions < 100 THEN
                FillArray_RepeatStart = finish - INT((Repetitions / 100) * (finish - start))
            ELSE
                FillArray_RepeatStart = finish
            END IF

            DIM FillArray_u AS _UNSIGNED _INTEGER64: FillArray_u = start
            DO
                IF FillArray_u > finish - 1 THEN
                    EXIT DO
                ELSE
                    IF FillArray_u > FillArray_RepeatStart THEN
                        CgSortLibArr(FillArray_u) = CgSortLibArr(FillArray_RepeatStart)
                    END IF
                END IF
                FillArray_u = FillArray_u + 1
            LOOP
        ELSE
            SELECT CASE OrderChoice
                CASE OrderDescending
                    FOR FillArray_RepeatStart = start TO finish
                        CgSortLibArr(FillArray_RepeatStart) = (finish - FillArray_RepeatStart) / (finish - start + 1)
                    NEXT
                CASE OrderAscending
                    FOR FillArray_RepeatStart = start TO finish
                        CgSortLibArr(FillArray_RepeatStart) = (FillArray_RepeatStart) / (finish - start + 1)
                    NEXT

                CASE ELSE
                    FOR FillArray_RepeatStart = start TO finish
                        CgSortLibArr(FillArray_RepeatStart) = 0
                    NEXT
            END SELECT
        END IF
    END IF
END SUB

SUB cgBinarysearch (CgSortLibArr() AS DOUBLE, start AS _UNSIGNED _INTEGER64, finish AS _UNSIGNED _INTEGER64, what AS DOUBLE, where AS _UNSIGNED _INTEGER64)
    DIM BinarySearch_low AS _INTEGER64
    DIM BinarySearch_high AS _INTEGER64
    SELECT CASE CgSortLibArr(start)
        CASE IS <= CgSortLibArr(finish)
            BinarySearch_low = start
            BinarySearch_high = finish
            DO
                where = BinarySearch_low + (BinarySearch_high - BinarySearch_low) \ 2
                IF CgSortLibArr(where) < what THEN
                    BinarySearch_low = where + 1
                ELSEIF CgSortLibArr(where) > what THEN
                    BinarySearch_high = where - 1
                ELSE
                    EXIT SUB
                END IF
            LOOP WHILE BinarySearch_low < BinarySearch_high
            where = BinarySearch_low
        CASE ELSE
            BinarySearch_low = finish
            BinarySearch_high = start
            DO
                where = BinarySearch_high + (BinarySearch_low - BinarySearch_high) \ 2
                IF CgSortLibArr(where) > what THEN
                    BinarySearch_high = where + 1
                ELSEIF CgSortLibArr(where) < what THEN
                    BinarySearch_low = where - 1
                ELSE
                    EXIT SUB
                END IF
            LOOP WHILE BinarySearch_low > BinarySearch_high
            where = BinarySearch_high
    END SELECT
END SUB

Title: Re: Binary Search Method
Post by: Dav on June 19, 2020, 06:47:15 pm

As always that some nice coding, @codeguy.

By the way....I just remembered something. Yes, this code it's off topic, but so what. Have a good one!

- Dav

Code: QB64: [Select]

DIM SHARED HBC&, HBCM$
 
:::::::::::::::::
:::: HBCDATA::::: 'Make HBC variable data
:::::::::::::::::
 
SCREEN _NEWIMAGE(600, 600, 32): CLS , _RGB(255, 255, 255)
 
FOR y = _HEIGHT TO 0 STEP -1
    _PUTIMAGE (0, y)-(_WIDTH, y + _HEIGHT), HBC&: _DISPLAY
    IF y = _HEIGHT THEN PLAY HBCM$
    _LIMIT 60
NEXT: _DELAY 3
FOR c = 0 TO -300 STEP -1
    _PUTIMAGE (x - c, y - c)-(_WIDTH + c, _HEIGHT + c), HBC&: _DISPLAY
    _LIMIT 120
NEXT
 
:::::::::::::::::
:::: SYSTEM::::::
:::::::::::::::::
 
 
SUB HBCDATA
    v& = _NEWIMAGE(246, 264, 32)
    DIM m AS _MEM: m = _MEMIMAGE(v&)
    A$ = A$ + "haIkM^6cfUMFGoU_#81nVVa3QH`h7j78#DX29D]MB`VXQ?IRT0Y=6>h<Ej87"
    A$ = A$ + "JSU4JRUQDBKZdF5]cdQ8PU48E>Zb1[1JXSYUFjdQJ[Ha#DB78N8`7D3D>HWf"
    A$ = A$ + "ifWOkgWO?_oOG_[mN_fggkk3?gg[[T[LOL_g[mJOmojdjJ]FKfdY>eY>eY>e"
    A$ = A$ + "Y>MGjWhWHcV7nQg\iaOlSM;YCWJdWhCL\K1CC_[geV=OOOOKflkmkM\K9kGj"
    A$ = A$ + "iN^Um\PokM^cmhWnYgO]]>EWhI6mkb]B8;S<mcl<kWfeSmHgEVh[i[Y]o;o_"
    A$ = A$ + "39n2<]b]]d6_]Bo2o2gkn\EK_87eae7NRWEO3O3Kfl#?dmgo3cgm4?ad7?hW"
    A$ = A$ + "lH1O_fDfVVR#FKYbM[1mjNMgkjbk?GY\o5oZKQ:aeMJo#HCN:\L=N:\MYOFL"
    A$ = A$ + "^GC:?oOaOaCoOiggWjG6S#6g[;f^?GYln\EL=?7lHnRnRf_]_;E2<mDHIjgi"
    A$ = A$ + "i5LnoVkiQhYi`MK3UjkWc7gSUoNYlnYNN0f5:_?a^#;dfHS_C;ShIA9Fn[h["
    A$ = A$ + "h^fH;S7U?gZoCLlk[L6]4O[cfkQ<njd?QcI:eegJL>Y<gBR=2mSmHaKSBm]l"
    A$ = A$ + "<JZlLaHCTHoS5Q=jF\1Wi^j#S_f5kAXoQk;lUjDWf6oQBI^iRASn2`nYnSFm"
    A$ = A$ + ";H:bl1_[man\jDCOCZkLZlkTHjS]_U43`HHEjW1?WoWmYoNVgOdeTS7kf83R"
    A$ = A$ + ">EdGV]kCI\MINDJeOXF_ohlUnmT<bSKKK^\?NK>GeNFAKM=l94M4kjiI>ORh"
    A$ = A$ + "loene^>mYeXUN?#oY]S?o?okZ_Wf=g7S93\_`e8SD;g4I^`Ke?cBmV[]\2dJ"
    A$ = A$ + "8_DNnFJn_07gR?DUHJ`#ia>ecC>FdcR<ANo>FO<L=oD]=`chd_R<>`BNZWg>"
    A$ = A$ + "N[CY;ckiMM\ACOR\=am3bVlKUfmVc_9>7LL]T3LegEVK=\oU?C=_MIKM>Vn5"
    A$ = A$ + "bIoLnR`cCJc;MLf;adLn]_2m=;agdDNcaA<UMVkLL8UGB:ScI>mVjgK63eD7"
    A$ = A$ + "CJ;4k:#InMV2oTiKga09U5d?ZD6[VOEe^o:]aacllIeHjgJ9W#I]0\[fQZQ["
    A$ = A$ + ";m_8ITkhoEC_O[jRQZU3^YlkZ5Vk5N>ZlMJ?Y\_T[efjcKihc[o;ic5iiRG9"
    A$ = A$ + "moPcYoodVoDTeWcQBN9YDFUfNZo]V=QJjN[A9N:mG9_?6c^AYnFJSiaAKbnh"
    A$ = A$ + "ijGbW1PkCM3L=ba_<aAceV6c6BNmfeaT=kW;S=L:^VoQ9>J^ag:o_fW^D<=m"
    A$ = A$ + "edg>Vm\lJTb1SioL[ScL=K9UG_F`Td>BIo<nVF6W`BK772iUTBK_K3?FLOic"
    A$ = A$ + "VD_Hf_GkIN]W[l<e_[DV=aeYmJd_GZc?SGUcCY=0O>FnlZd?QYbgL:;\[iEY"
    A$ = A$ + "VLMFCnNnieblBFN?\TaO:o_8gUfMhl>V<JfNWkN?m7CiPba[OYSCCjc<?7Bi"
    A$ = A$ + "TBoEV2GG:3Vf^BKhSIO]4COXTGBZFLAe\K3VPnFh\]D[oN<IX\?Y6^VW;edE"
    A$ = A$ + "keSO[lN?egai_F^gU=f:n]BkIIKC<D]lc=U]SlL\[elAYLH;a2FV3dal5I>o"
    A$ = A$ + "?CmO9nKZl4U]GjOWSBI4?WY?18;Ta3<W?ZUiMYdG`bnVa`eLLUjFbkiDNX\="
    A$ = A$ + "E;nQ3UlB9UGoF7OTJamUDR3cnTBOibS=]9Rn^JhbbfGZOWncba#Q_[FL4e^?"
    A$ = A$ + "cl4ai=U];O>?6Unn\[eWHYOS]Z;O\W=im#[W3>6kGZY;L\fK;a9F6SJjC#jc"
    A$ = A$ + "K;bmiaFcnHVKPYZ9UJiT=k3[5S^4?OlLO<TG:_OWCoE=:kK:K?S=7B\_?aIl"
    A$ = A$ + "NT7Z5CdDm_BiaUiiTc[][BoAZEkdYOCelM?ON?Fn4:mmJG7cRd>iBNfTm_f7"
    A$ = A$ + "\4kN#:GVbW]T[m\?K>bW;jGOicobfB]n07cTBIUbh4B<MZoaWASMNBMIY=TB"
    A$ = A$ + "oEAV:kcBKD7:iUTbkaFlMB:K?Un7VjQe6CR;`FiHadW?Cc__DGM]a]]4?;Uh"
    A$ = A$ + "JeSKmAPlgHi5To[b796[Do2g?UnQ\6SGAih`e2EF3G:[dZ]D^G:eeWmKCAic"
    A$ = A$ + "dFR5>aen<Xfe9OnUhe<W;C9[Tb9UgOd>iIKNNZECeYmN^?[iC_H8n]SQlBBK"
    A$ = A$ + "C=NGnL_4GGVCi4C[oEeb?BImDTG3]E`cbJj4ZQW[eNhlDjcF9ccbBONcn9O6"
    A$ = A$ + "Dj[Nb[Q?EUgWcAUmYLO:=7^>eCFjWPhQiXU>7Ylo>E>?P:7oaBo_hhZEo>YO"
    A$ = A$ + "a`YOJSUKOj3:7o]BOh]=FcgOo]SQlBBieIKWOVUhjDN<ko:m_bLhbmDiccDO"
    A$ = A$ + "XnADkIaDhIYbl=>V>nYV74eR[Z\MTn__6nDE:o>5EJ3X<oFUiP;o]<W4R]:k"
    A$ = A$ + "KJQVBVXTZiC`DSiiHmk^N<D63#INC;Um:m[9WgjieSnUdojBmKLNhJEnm73i"
    A$ = A$ + "UTBInFaeUm9U?3[M?]dJK\VmUbhbKE7G9f`h0:[CRiXJnm3SNkLLAF[lJVjL"
    A$ = A$ + "]>IbLDhH3CL?Unn<EN1Wb^W7Gnm]>VZ;1GGJKH>olAn9em#OCjCN=O<bW?e8"
    A$ = A$ + "[m^JnWEJ3IKZGd3]lBBie[eJU]<^_J?3Y?7\SaR\DZfLhM:kkCA]Jc[5bh0#"
    A$ = A$ + "N0lTb3Y^^eZf2lI_aXD[^::ia`dii[faImm:UoF[jmbcmLhjBK3kjlDY\N9D"
    A$ = A$ + "gef^>?Y?0kjJggQDN9YD7N[?ofWSiVdJ]nhDj3gnRfVnaUBeTMCl<b=cY_Sf"
    A$ = A$ + "VeF^b\c=VO]BV7UFNN]dLWg2YNO`9Wj[E2kKi5e;]DmJFWhk;J]`e;M<Eg6:"
    A$ = A$ + "SMMOYWC2iE`19^L]G7SP#^HOR=X?3kY[ilQjf2M8UG>eXiZkUFY<^[m5^nH]"
    A$ = A$ + "fkeYKWdU\lbJQ[?4[ECInl>emo[C7OjBFNIKbWG=JK67_U#Uca\>eYYX;MiU"
    A$ = A$ + "U<V9CA[UMoaXFGWRjDWP^dUGLLJgeeO_LnQ]?FWXcJ`iBcWZ>]LZ;_\Ndn<g"
    A$ = A$ + "^Kkh9dY;CZ;_dY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>"
    A$ = A$ + "eY>eYePHNif[]`>eYc7:W[mkSjH_CMZC7NjBN^f_=dnNmLYCMJ=X4G_?F7k>"
    A$ = A$ + "WXLocJ\eJd>eYC1:WWhkce=^c1:G3PGkm9S>eYeTBLmUfjhgBYeJ]dYCMJOC"
    A$ = A$ + "7Re2ic5J]F;MjDWFKRLnhj6mBg[;UL_3lBJNjG^O4eY>U4h1`#;Mo522l4if"
    A$ = A$ + "2ldBGCg;g7]JM?h]T:gkeQGbmRg3^>T?eNCF;Tk=Bef3RGC:G?h_dFk]kdh4"
    A$ = A$ + "b^Uk_e;aFGini6L[[;oLMZ]OMTkKg]4_hHk3M]^nFBkXfN3l^\_KD^7Z^[4j"
    A$ = A$ + "M[=^0ih6L9^^oeYjD9VN9hQaf7e0[>Wm^L_aR[GJS>aecEC:iNi3_26=UeWS"
    A$ = A$ + "b[[km?C53PkgkC=NB;9>2_nCmobkWYf[Fg7kc9Mjf7Ub;h3gBFK8cmjH>Flg"
    A$ = A$ + "\V]GgOehJ1ckCo^iHdgM[KdlgW;?#imPj04K>WnYL][^fNWI]H0be5gaJKiN"
    A$ = A$ + "8gDf#KM_BKZe=oef_P>MkVbl]hNlL^o1>5^6<L=L9DiN?oLLY?oIKXUah:]i"
    A$ = A$ + "3Uf_Wb7nbm\_lnMZH0bSI\fFj3dH]Qb[oHnlW?WZYWZEo2jdU1UfMBOMA>L<"
    A$ = A$ + "O<K9Fl<?>V;=c55LNle^>UkcbCAYmFmkgSOZL816>]1Vif:]QFR9[]G5?EKJ"
    A$ = A$ + ":KjUgWSU_aLoD[V]oN\eMBZdFM;DRIVbFD:kG:CG6;LYm^Ja[?5UnPCLZ9VN"
    A$ = A$ + "<KYUkieUkc2Ik_ln;aHeRk_fnfK=LMimiDke3c]gE=WO1MjbQBicFbiKYm=m"
    A$ = A$ + "9\6f9?gUbQC=G>:7K[Fl3?S]DldB`delg8_fU]a<WJUgKYN_<_OUhjJj^6;g"
    A$ = A$ + "PIObHnOXo?milFWBIgiT76K\XZIO0IKaoeR5dO[dFM6_^k3kniabaK]ld1VI"
    A$ = A$ + "<ZfhKGREbL1FkNLZLDUhelnAoS=?jUjQ6c7llLHL6MZCCA]>W::cGN6KL]S;"
    A$ = A$ + "omBm5IL[I\P9N#LiDaabi]<N0aBe8`CUi^_\M0V_fhme2^6<IN>]FHBL;7Gn"
    A$ = A$ + "id>LY>2>OeJ;;]NOjdU7UbWe\7R=jDfBlfHj3:U][5_HJ3VSU[AN<UfJConi"
    A$ = A$ + "oUGKo^dnG9nWfDI=c0f?K7L=bOWgWWcY6[HRO9cmHim#ZC<?ofOGZCR[EiiK"
    A$ = A$ + ">O`AG9m;mH[k4D:K[\6bE8ODJ;<]3VbMlohocgUHSaR_]VLnDfJ;c[LfNE>^"
    A$ = A$ + "lo0>5ilbL1ThYaJ7LO#kN:OIbn]B^F=[R_5X;XFMY>eh1JMm<VO3#UiXKOG_"
    A$ = A$ + "ZMjdWZUgfJf8:c=4f0Wjh#FM\LFEkHa^nDSC>fd5G`[PCJ97je`[im^Sm6WC"
    A$ = A$ + "`?Y?^YN]JDY?bcM?<6EKl5bfKZ3]fi?SEI\oCW^lXJihF\iDiB[lh\NbJA^B"
    A$ = A$ + "OSQGCi`Bo<hmdVfVHBCKVSAdf1G]6c;LlS`L5V]gbj]YTD_#joBWjTTbXmi]"
    A$ = A$ + "km?^n7cjgZej#]CMjBWdokY\aVhYSY=`UDWnMZCGbTaSR_[SiON6kla2?]=e"
    A$ = A$ + "2HWjdU<I<h>O<C\KF78]>?Eg7DV_nMInOgY>MY#SU3jBnH^6:UiT\?OZkDWJ"
    A$ = A$ + "S:73YF67j35UeTGg7l>eYUC>VK>oAgfa5K]KCmego>eYc?Z_G3fY>eY>eY>e"
    A$ = A$ + "Y>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY;1jWk"
    A$ = A$ + "gioiNiLnDoZOd=_egc_lVOfO]oSKn0oAo9EoOoWNQ7TGS[iCldO`mbmEWjdY"
    A$ = A$ + ">QL?HPGo?noV1l`JBLNoFnnoG\i[k=nF6H>oO\?cWm^lEHg7Pmg2N9g7Ln1;"
    A$ = A$ + "k[L?][hJ>nmAOCWjD[TbP_jWl7L#>\5IJnON<NL[IK1\=_Wc=HLNOE<mE<fc"
    A$ = A$ + "ijcoUSM<kkeXgaCo\373hI>>eG1_]]o_gO^OU1mAo=Oe_^]nldY>][4bNR9D"
    A$ = A$ + "6O:BIO]aaaPob[19^F\=HJaZ?Pmj[n\oo47]4LMJGUS36L9OgBYDGW]k>eYS"
    A$ = A$ + "11ND\<bh:W?5>EoEcooLf[GROklo4Wb[3fZ[h3NJKUf3h9S7_eH]cf_nT_=n"
    A$ = A$ + "?kaUjjjfZk4bD8CXoWK3QlD[nA;9FSS1l`?`kogLBiJaBROdGii\akeYelBa"
    A$ = A$ + "o3Lm=aFG9VJmmfcW_GKk;Y?#oEffno<^LeE<GOBW>o9aeKS\UfI<O^;ihe_A"
    A$ = A$ + "RcdH=iea>NKWZ70l=nL>U<\n4X_^l:GSJWKaHHWn[n[oENS=Knn4OWaD[^\]"
    A$ = A$ + "9nJ_^j;ofVW0kgEGG7Gg98a;R?GB>I1gX\]HSFU?iH=N#HR]L:BM7N]4;=EK"
    A$ = A$ + "=m;eS_fa`e6Ll;o[me>PYUo;lGiU?lmI^c0oX>9ea#kW?g:^F<Xj7hJ\]i:`"
    A$ = A$ + "k9miZ5OHjdi?9N<aV]:GXn0ckSHWUNm=OEH;5<6nTKkA<XHdUV;LUe=OBL^;"
    A$ = A$ + "9nN`FPUO9o=nj^Qo[mboK=lMPgBL=ggHSefRhbF`eL^lJUjBhm`Z_H9mON?2"
    A$ = A$ + "gS]^C#ILW;9FKT?mh`?Eae;aN_fMeWG]kYo_e8aD]:oJ^]WB70oNJW>]Gc[d"
    A$ = A$ + ">o_lI^KmWTnRKkFMK=eNN7o#3WG>^GmCmOH#?fKjWoI7dg`m>ofSm?n]=iiX"
    A$ = A$ + "L<ihmL?dYcG:mG]eoOVCFhil7<cQUHjL\XKiJWiNFo`e>]nVGkl#kSO[ELM6"
    A$ = A$ + "ojH];J3RQUaFm3mWiQ6No_a_ec]iiNn=3HKagL^6`VGkoM[i;3<;G;`dLo2V"
    A$ = A$ + "FL=]En=`ncM?Qo4YMNJ;Mo_?OY<o_]P_BlL[eaH>fAJOGK]]4S87_H9>>U[Q"
    A$ = A$ + "aGL[i[\]TnBX=mY^N_kgg7m6oLCOL;9`\PQo2OAO93hTGa[h^<_W_kANTGgV"
    A$ = A$ + "OW?i_m0^FFmH>Nc;a_iGm[ikJhlnHoSng=P]UY?Pk9n]oYoho[WmL#k3M4`R"
    A$ = A$ + "[mn\S_?=Y<^bUBVOD]KG;o_e^N8;\4iQL<GW;^^oh:SY;bQP[eNEjCZbX9V5"
    A$ = A$ + "IMTUAV6\6o^ao2f\T`V7fdNd7iOiVGc[iIflBOYo37n_PgCRSmUl`?lV_R_h"
    A$ = A$ + "oRKnNonn2_QOTGi;I`N=WW?eOhO`Vo_Oj=3l_oWjNS56]Oc]F;n?Pn1JkL<o"
    A$ = A$ + "3NCo`?0^VfkGnGf;Kccn\S7c0m9LOoml]n:flOmGj9gl9Ooo[6hOnOagk06W"
    A$ = A$ + "kUilS_C7NZ\^QGjaVS[3fo0[dJmIVi5O9f^clHEN<PY_7oI^1SWnS:^>]KQ<"
    A$ = A$ + "NZW1L=O7_nclCmC_i7k9oA6\[bg1fdc;G3`QOg?nS?PAnQO\ok3<hKn_8oW]"
    A$ = A$ + "]?n7kS_iUmGo;MP;aeoUocnWOhooWLU_gbL^EWTn2X]N`]CAlkfVC<]jdhL2"
    A$ = A$ + "V]VN:YghgcS?X3jWm]o]?PW1K3CkTkI`ecI__C7NBLM>?OFB\bnocHNWcGa<"
    A$ = A$ + "g;YoLFO6CML83InMAVTS?9adOf[2olColOj1\1OVSCoA5G[]K\U96F2L<WO\"
    A$ = A$ + "ObkmcPC0_`iV[1<H_gfKh=_i;m?gOk=O;OcOX1Lmgf;nKM0WSlOZWimo1n03"
    A$ = A$ + "hTGi;m_jmQ[1WkiVf<fYiEn<HW1MBG[?cHcW2?2a_KlkUjbXOh?eOjGdTaFc"
    A$ = A$ + "aCKRfjkiMomO3^FKflKdW`mDW>]8mQ^\N5WS<N?[AQ\VS7;7D#V7Z4;[\gPM"
    A$ = A$ + "U:aT:aaQL4n6Sl8_Z=akIoLcVGhJLmWmi_V_jmlO0GS=_BoA1G3FAL=H:`Y_"
    A$ = A$ + "hoBOUKNk_UWJ`N=_2>5kGPSOQ[_>`lInM`eJ_6o`iho#_o?l0n4m0loXM0^^"
    A$ = A$ + "dN=O6KfXoP[777ffimd^M\]X]VaQ?5aaS>1^WCL=nEQo4LNBo9:9^fdF;m17"
    A$ = A$ + "6L>]iibiFW>lD>6C>N:]F7dV?e\VV5G?E^BiH:7O7`Z6k8HE9m55B\3bYJkP"
    A$ = A$ + "?S=7>6Tae>=d0fnJL=H:\YPL<h]BKGUhJS;5L9hK^>l:GK`MlkH;3L3aV2fP"
    A$ = A$ + "OWc=f[O[_nOZ1<^h?lG5KKXGPHX_cM^cV?W[h?_?g_PkSigTiclO^ceo?>G>"
    A$ = A$ + ">gLN0?aeVk;`Od6hmdVX]AKDMId6nUN_o^3?KlNVf:W7noSAd^CO`C6\>hjY"
    A$ = A$ + "R=_C7N:7CbQa=Y0G?gLd97khL>7iL9N<kePg5;W<O7bI6_7Hd<>C`VJoh]mZ"
    A$ = A$ + "N]gHcP?[];`d_`ef]e7IH>Nc3=f^n9O_og6``bR[ilP>2`PR_0G0nVf6j#\="
    A$ = A$ + "b[`lKlmb?d3mE=lJRGmk5GBKhKj[oeOce36m2Z31V_So3\7GN^]MUG;N?O7j"
    A$ = A$ + "OX=b[L?`aR^6^?1Kcm<W;mk_6agciH0knSnMn0fZYOWSO^H1jdQU<oCUnA;^"
    A$ = A$ + "^<o`UTnCWe2M]a;Zfeead1i>]IS?X8[X_QJkE6l:f6dOEU_8f_KT?OQ?m0ka"
    A$ = A$ + "PodHW7I?>WgLM_B>7LLR7eN9_go3h>`EP3CF\IR;1c8C>jlmRC1GEbh;<\hG"
    A$ = A$ + "In>o??eolO]=oTo0oF3_jaHKH8^U[^fN=`n:<];m?Pk5H_?imN_Co1m;Hc^4"
    A$ = A$ + "OKl2dGGj3>fZ97OL=kdY7UcOP4Gg:n<[gJcoUeM<o6f5#Vaa\1VaS5IA`#JW"
    A$ = A$ + "6I7H`\>F?R==oDPGAf3L<o?Te4GcaJ^\07[_a63>G7a^l:be8cSN5aXRa0k0"
    A$ = A$ + ">:I`D9^B\NF3XjO\aCPm#h4CEJ7GO\Co_e7LIaU96eoShd1Kc5n1X>7e]`[f"
    A$ = A$ + "f;]mc[dGP^0bO0ncSn1^>oX_Y_^k3CCo>miPiki;kdT\6\be#SL=iY5:WGcV"
    A$ = A$ + "7<Uicl<MglBoI6`fPi#VY<gA`jOWhdK`eK^[]G\E`o1<=<HJ^>h3[HHUo5Ca"
    A$ = A$ + "_?T_<bgVf0_bGl4c2WbnN?06]llUfnd_LI[;[TcL4Jo[ea:mMI<1bFo7eZgd"
    A$ = A$ + "\^#\nC]MV]_dn<];^_Zic0aJCn0BMAZSQWO9^VH]aF=OOW>=9W[#Yl6b9l:b"
    A$ = A$ + "E:[FFK5e8m_6I86S7TI#FRhm`nYa93^6KhP=;[=3c;<O?HO`aX;`h[1[Sl4b"
    A$ = A$ + "KPW5S[=W1K`Q_YH_6fhEa6FJW]<FHa1gP5^>>1>Fj?ll1^cikI[<joFja\]\"
    A$ = A$ + "i7alWkmLN_2>6l\HmdW4n>cWPmeX35KelnNLeWnTc]HU3\^7\^VdGI`DP97S"
    A$ = A$ + "\^?#6baj1i3T=070fTLCkHbbi5mnUhJSc6\Xa6CnbdF<7CVoHd3TfQBK]jGL"
    A$ = A$ + "Y?`Uaf?DcW4WL\69=5feh>bicm]16MBlL2mK9>gljYO=RciIPe]>mk6c3??j"
    A$ = A$ + "d];2ILNnR]#lkBkS6cKj?=<H?`08KPnOadbHG0<3o?aeFOFLN[Q[ikY=PM1]"
    A$ = A$ + "IX?_jWKnZfFDnDoB=VCWSS>g<g5kPL=e^>mE[5V;WC:7B\>fciJCoTi58cO_"
    A$ = A$ + "i0Tkk<goXKX7Om]32oUeFVSA\eO0j]1?J<_FSVlNlIVWe4[F=L=W_QaQji_g"
    A$ = A$ + "HFah0S=0`aUnPWn5KNYe6Si?:c1VhG77Wm9>XLMGI]H^?hNbL?N8afdWIoF="
    A$ = A$ + "L>?kikAo]aGYO?MKgWf4nk2fYF^BLL=6d?OmH9I^W=^GdUWeKU<aJ3^VHZ5G"
    A$ = A$ + "ch?aiRSY<G`ZG9abJ3AL\fPAFk#:oWlJ7_lfj?Qh?jWX?YeSSj8]UoWeV^mm"
    A$ = A$ + "6c3nPI^jVZVejdaU\V3=GciHFnd_[_SKZK9`bP[cJS46OV73GklI`JY6MlI]"
    A$ = A$ + "OPm07kf<?dILa6_O[bS]RU^]5SL9kL8M9n_ClB`;Y_TnNea2kHZiL#IZjF[C"
    A$ = A$ + "7GRlT0>[\NPim>7T6NnMM]OhLI26oZ[i;>O6OWig1KK<c>VYXgGK>I<blMYl"
    A$ = A$ + "dJR5d6Ti7l#Q]MNK\6W;ce]SEfQDGRiH`LOPNgNMROjBh;^imL\J<2K]>VbJ"
    A$ = A$ + "cfH\;m5WJU6k2jOM>NDF?DUa9O8T;=oGLmfflDEj7nLnUkJC2lJdn=6hSAlh"
    A$ = A$ + "9nVW]D7khcEW>M9`KPO=_II=1Km8<TKkX6?]F`6R?o:K]hk=\jd1oJlbI^[e"
    A$ = A$ + "7^MAfBiKdOLXTWM]FcJ8I^a5CfaAKK^VHKfi7c=eMfei\^UhZM<<]F5F[hIh"
    A$ = A$ + "iWi#K^iGFW>N4fS1ojhLUnRK]NRMM\ACncQI>5b[hC]ib6Kg4oUe]\jf=NJn"
    A$ = A$ + "_[QLUaR_VSidL<G[bjIANX>FFAojM>YDVgZ\FaM\mV;?R6oaBcafLm5d7Tc3"
    A$ = A$ + "_>MjAijdSnRk[j;>a:ClDilJbiPXe[\cinb_O]aOHgIKUAe?fFoo^V>06Sh="
    A$ = A$ + "biXQeSYeQ9SkkHW7lmge:\Yd1THJJ[D[8okO7o?^Q]neM\WPiH<fKieDOFY>"
    A$ = A$ + "Daemh[?m9dm2n]LNgJ]1klN2<;fW#nFL\eX\HI[3ce4;?6nFi>`FYOielmeH"
    A$ = A$ + "D;aElmRi07?\n8NmHjX=J`dPUB<6c_IaJdOHNS5_U[U`il6;c_\SA\[]Jj30"
    A$ = A$ + "j<hlVG_a`fI^7UKaFnB`fUhJ^OjSSmY?Q^G`_>GN`N=HJW_EH?7o[M^;WbP>"
    A$ = A$ + "77g5iVDg#;iIbea8TQ7Z3V:nV?gh3kj>HYoeikE?F9V^6[mk1oE:WkDG_n6V"
    A$ = A$ + "eS]ck2c]VmWlKcM]B\=GCeAXojlM[9f>IWk[ma`ndW#g;H:b;^[=MV[6bm5H"
    A$ = A$ + "0aBnmFWe[U\bBbWVno:FAl2OOFSJUcYSQa3ojjUF\Ta_jlfDo\oWm6oe7eVI"
    A$ = A$ + ";h>>nL>EGnOhgCOiiEWcXY?1?c?`WhVk?R7bO_L>UVLRggEKeUOWS]A?WIWn"
    A$ = A$ + "4?SL<XP1SiLRbib\iaaWePeGKK1ehLM=#Fkd^64Tf_d>Xa2khUG>_7d6jfRM"
    A$ = A$ + "G:[OkCIoE<I^f:`m1iT<WGFiL6_6^^D?gfJg^Dg]nFem1odW0CSMPBkcI\Ke"
    A$ = A$ + "lcFoaFbhcZLAJ7]VMJWWail;fH7mmR9;I\32Wh9HamiJY#=<V7Obk:VFKcSM"
    A$ = A$ + "^<FI_7=7M>6Q>N2`>O;mH73G3ce<Si6mK;Ml^bW?n<Y_6WMJCF;Y^VHQ=JNf"
    A$ = A$ + "WSQC]WcI]NfZ\QbF8_R\U[Unil2FKaILcY_fYm\BKebCIC\lc:kW[Y8VOK7O"
    A$ = A$ + "8SoeeHUBOZ[I76OWikClZigScAILiN_JnbEo7hjbl\K<4Io#=>e;AO]GWblk"
    A$ = A$ + "Uc?milKgeMVnLandUL=T7i7RYfeEG`[CU7<[CRFS]N8oY>70_6?3C]Yik;Sc"
    A$ = A$ + "T_CIbYTM[Q[5oZ=jTCkf83[o0>NL^F:H>ZYmkJIRn_DR;EGCYoc^^nJn[Lo`"
    A$ = A$ + ";?NJ?ZK[\>Gbf#bNNXmd:^>agFCPiJVBiW[Q_EGCOl]>]9NnPMJSSN]Z7\BF"
    A$ = A$ + "kE^OgPLEb96gG[9CF3S?6^>c9\n3H=ZUiNVkeiZK<Ze<HnOL\L>ljhhKE6W#"
    A$ = A$ + "6oRS[6_BoKI<m^?QDKL][E[ZXSY<WQ]R[YFO]oEMZIo=_WkYJem3OGOMCidV"
    A$ = A$ + "Lm;aeZ[i\A_;\Sk3fZ;ad9f^LNQUL;hJn<G3a^jGI^GQ`gYMiUDOWP_H]QSS"
    A$ = A$ + "GL=j>dfFiJiX[]D9fdj5OHM?>JgP?Mo1\UfRko7h3#jO#ILeVc<NU?COC9NN"
    A$ = A$ + "<>?Ng[bL]<]CW^4aEK=nJMN^?`d`J;;mo>acY=Sa`c]P[efYkmc[im#=keP["
    A$ = A$ + "M?W=][?eiaHB0CbjX8g363oBdchlQ[L]:cH2bJO9[9EmI:k[Coc;mQ2OMBOK"
    A$ = A$ + "jhjCKRjkfmG2\W>WLTe=eB`26?Fj3NY?gig;kl2;m9O^LnJ\i^Oc]?d?=4gb"
    A$ = A$ + "eSOVi`NYenQf\M\W4g]=S3E^nVhHOkhRagQ^=laPe5U\>N<FjT;eaV\a_H\g"
    A$ = A$ + "h;NO^GOJBj7^[3?R3[9oHNbBKjImKV[MfiWci]#:W8N>c=;b>9VfOSGaoaBk"
    A$ = A$ + "ARUeFTc[QmilGLH=#oZfTeNVfJGJ]ihNiiJ=O<\MIlfjGO=M5>?#:Wo:>^Ji"
    A$ = A$ + "cWJHMWKm]\GQeYS3iJZPSKE[fS5;Ua]UeZUf0DfXdWKUkljiHPinBYhjLmj`"
    A$ = A$ + "JTdaS=[]T3aJS4nPSl\nPkhTebH3]_I[UMc7WS]fBdG<d?G\o4DjWUl`H2n`"
    A$ = A$ + "?lP?OMjdSbm]AdofbcOe1X_K]bYLQfccc[n7R?gXW`HTMn=Ue#M>W9hHI?UK"
    A$ = A$ + "W`5G?NaC]N5`eZL\WW3G_[[E:]jkQeZReJcfP[QceHJm7ScKVo37k>gg3kdY"
    A$ = A$ + "7ac5gkkbW]>V^injDbJKH9HJm]3\K]jGcH5cHUeoF`DJk>ac[9kNT`Ho^[O4"
    A$ = A$ + "iloPg?ehh_;[ghj_L;HCjC4GG6Om^dWH=2Uc;EG[IkaGOJCT;8eobcb<NGcG"
    A$ = A$ + "RSEB^WF=fHFEj7^\edARG;idN\e`BioYEKcTk\biGh^Rk]6_XFB<fjQn^mdJ"
    A$ = A$ + "UF[f[5oiLnd?_[[?:>_c]6cDGO?WhWn4[m_>??\nAMM[L:K_6W\[O7V3[bjb"
    A$ = A$ + "8iUPaFSe\We4GkjSFFKY8S_fSWff`6[RnGE>Gdgen#GcKH]ciSn;ld7KAf>e"
    A$ = A$ + "8YlPb4>g0V2GOcjCH<oQM^:WkiL[9fjDPMmGH^eiGd3<\n<\3[_I;QM=SP[i"
    A$ = A$ + "J6W2??IMdQe=7G;ZaF#g?lCObeekbLL3feji^4OG63l]IFO\GceiCmQMODG^"
    A$ = A$ + "7Bfe7;`e?kCm=O3kNK:[17oOn=kc][CARH\cLL3Ffa^bh\cJkfaJjHRYcj5M"
    A$ = A$ + "H?dkZhYgfjAI\e7YMUGZ>2[1eCUm]0m1o=mZOig7VN<nomoiW^Sa?A8ZQO\Y"
    A$ = A$ + "D>Nc>6cihHVoFihDeiMWM?;n#_?l9G^6:`HNQ]JFK\J1G;g]QOl97kc<NJ]5"
    A$ = A$ + "J\fV7V<OIMLmi5G;?8X_WHdF2V^S_?n4inPLMUe?8\HIO>K>Peglif3;Fcem"
    A$ = A$ + "\MQM_h2fiKUilBON7SY?`eL]ab5hm]VkOMndLMA#>W;gilYn#^Wgg2GFoo^^"
    A$ = A$ + "XD6GM7OOjCD7PTGT\N`\^AbLmhhjTS3mDhF[IYmYLXSSMRE]NYL>Zi[K;OCm"
    A$ = A$ + "D^3WSL?3Efem?GWVE_1n>ijjP_VL;k_8[9>gfN[aGgBlgMj`#8;I\eg<WX^2"
    A$ = A$ + "[GnL>GWa>dig5kNFcIJ;ea[5LgM^cM6HaPlm4G8h7^G43A=Sam;[=<^n`DRM"
    A$ = A$ + "PI]N[TYVK[mNnoNnTijThFJ3`^G:`J#HZcaiEgMVPck]L<5;mWH=gK04KcJD"
    A$ = A$ + "kofo9oMG5\=SCFgfmnWLO]?]?=6^m#__iTeoARQE6G<;h8`CR99GPg6Hd?<\"
    A$ = A$ + "_aOEk6M<\?Xam1jNL_=Cm3b3kMPaJ0m3<G=:l?XLmUHY?3\NDF;L]L7KOH8T"
    A$ = A$ + "c`_L]a];haUH??a_h;Phfdn4]7>W8c3FhHS713:VSfVfDa71\53SMJJ_7bf5"
    A$ = A$ + "jdL_AKX]5[Sjin:N^W7<5N?ae[9V>]MgY33iNEmfPYKIm<DLO^7i8WkcGHkJ"
    A$ = A$ + "]\mZOcK[>1>>G7e1k:>ff:g?Sio\kC:7KMAL?THNJkY]NSk?g3VheQeMjm4^"
    A$ = A$ + "^kGnQShi^c9k]a_]Y\9kndC^?NYocK[\ZagZmCTG=VFSWe_3idUP]iLVfSC6"
    A$ = A$ + "KbL]I>#`nW;WkYR?Vf1oWS=nN\n#_GegN[E>`lkCL=<e]F?WJkGbmcWU>?=Z"
    A$ = A$ + "QUMO2AOZg6\E:oMSoUAlUYOjlKLna^d^h_>GWY`XJ7FLmknmmA7hF>>HJWW:"
    A$ = A$ + "aD<Fo\ScU?k8gI;]nDji;odQ<?J`UhdF`ig\n8M]?N^oQdZ?e6kJY\FJcm]m"
    A$ = A$ + "6Nck5<Q[Oj]P=e?L`fPYMM=T_[eS?ai7KllDm9^g;bnb8aQPmkm56_S]gOTn"
    A$ = A$ + "RWeUbDeQBY]iURUQca?2KQ7jL>AkM9HA^OAVG<]f\Y_H9W7dSL8bQ_c]7S1b"
    A$ = A$ + "eX\UXSTLN`i1\=W7NUafN]ahmlY]O8l]LZak:i4?c[UfJ[aPIih#VZlk>THI"
    A$ = A$ + "TCGRmiJHKZO4GOQ1GSLO[7oQ>NKcMVi0<73MmmXec5WSBlMg^mY=AmUJL[CQ"
    A$ = A$ + "W=GZP=Vbn:fS#>PL=Ym2]gM8U[CiKJ;96cm?Db_gB`fX7cmiWi`dV_I^6Lo_"
    A$ = A$ + "]iI`LLI>1iLjhMcc1dG16]Uc?o7noVc2jGhn8ei>5ceaL[_fhk>fNM9NNYnl"
    A$ = A$ + "a\?KMQ<6N6\<bI>nY8_\Donf7lHa#3^lIOV=3<fM=?k<n^K[=MRM1mIF3XNn"
    A$ = A$ + "QM\UWZ_`JJ[\][meKf3C:bSXifPG5kgBL=I^9]VmJ=fOeMca\FhKbCNO<`FG"
    A$ = A$ + "RWA8KhHIVi7M:kcXkGmg>OP>56oFia`XUh^JlmUCn[aUX_B\Seee#]MoNnUW"
    A$ = A$ + "lLiNWG=OaAoGjS\eC^hF>o4[4W7OegCKTO?7;Je5`WGR>E^GiH5OgZOj<GAh"
    A$ = A$ + "H8OKk:n^S]GGbakba0>c7F=mlJKfS1iYSA=M:fSf3jHb_O<<=O?fXW3G_f<h"
    A$ = A$ + "j<W2dOUfSM<8bJ8QG76?_Wl7ln>G4O?_jn9RhOnN`kVg\]A7;nHiNYlBRORo"
    A$ = A$ + ";i#OGWCHmJGIm8O^Te<N9>U?[ML]=_]eHb^bJGBfh4BI\YRAn#SYeGLjg\=B"
    A$ = A$ + "OWfV;o_h?Q[ciRi7kljJZ6e;3WCadL]f5OW#gXi3_EocY]RMkM]nbki8Og9F"
    A$ = A$ + "GaAFGO^;U5dnLj[mJJKfak2k]IMTUi:biRUejX]3W;5Y_5cE;H<gW>6hJKkY"
    A$ = A$ + "^#n>^?1GFcNL[<m1iL]Tn\ejITkC<]T?OlM2Kf[aHQalh^CK7UHJT_:S7caG"
    A$ = A$ + "N<KhK;ci1IP\F6eGQBL8cYaL^=knl8eD3W;S?efFF?Yel1WkaS1^6oT97TUn"
    A$ = A$ + "hhLfZUHn;Iao3kAfGkcbn8o6]Fon[9^6^C;WHmSE<K=Ma3k_K[PoLUhITS=n"
    A$ = A$ + "Ml=5I1`Y`jW?_J]MULY<]fXX]2>GKRdNIOY56kc7KKe9^^FOY]K[AFnLRcMM"
    A$ = A$ + "<^LMOajXEMK[ELAYnM^6]D?`lkh3>n]_VcOk>e>aLddhY[5k4h?[CYe#FAOi"
    A$ = A$ + "M>4BNT;]?QL?hlF\K1NeH3dn=jY\nCijI]ShnH=g;7;<=<hbYlWUnLS1A<>g"
    A$ = A$ + "SnIcSVc?D\;VcaeJnK]DM^n\bi^Vc]=7gbi`dkSiiIOL]KS`gKc]bH?[`6Jj"
    A$ = A$ + "Kk^8[hHfhndeHhF`PPGW3OHnV\Ng0?[_RVONn?R[e>iadF=<igZeJdCoX0kb"
    A$ = A$ + "L`h]mZN]3mU\oKlok3oA6HN?h9n=nO;Y^cclIXN3gOWhga7PiZG5[a5`cI]T"
    A$ = A$ + "_f[93`mlS??1>`l`>fc<NN][hjLLLZIONKI\?klVTJUCfi5]kagJ[NH]Sh:m"
    A$ = A$ + "9kjeemRk]mhaGJcggDmKLO16i_oGeOWQn#aaSa<V_P]JilF>G\]F01OgZOIP"
    A$ = A$ + "eilXo2ZkI]abMoaG6Q?[8g<E<cjoUhJW30lm]j[Wnb3_FHJJk3i#n:LZi>3V"
    A$ = A$ + "gceAL=fk`>][Y3[ahEknGi^P[WCOZmeXgRniWkmMGkb`ij_6HI]McW1GGjO_"
    A$ = A$ + "e7>G?[YgL<=hj<ecbLL:Ll^03kiQg_?W[GMKf;SLO`^eLQihHjhKUhkLm6[F"
    A$ = A$ + "^deOO]=HL^^F8ImAR<I9fKH<L_jk]6[=6cL<f7b]mcNgi2MI^dmJ>TC]E#WB"
    A$ = A$ + "=L=]AgK;K5GGK\4Cfjn2o[ClTedEnIdOik;?WIlj>Ge7[>bMl9Ln^kLl8SY7"
    A$ = A$ + "6kbPUYfCf7nKg]I_m4?oKaO`B\]_gea0T5bLlDcWmdVQidBM1`^OLkholL?o"
    A$ = A$ + "VKVoS>nGJSciR]W?[5;[_Zbad2IBloO=`d`^^OFnm^_Eg2^fJl6kWSEK^>>4"
    A$ = A$ + "JC4<TaF3KmSBOQeemHj9hoALG`T[DH>NaafoG^N8Z>4`akSiVif`meIUa9di"
    A$ = A$ + "mcnfof7diZNndV[[CAJ?N9nM_=\[;?V3?mU06iFl3]<NBm=Uh?YmK>n:adHS"
    A$ = A$ + "O=[GD\E?6^NZj1[T9ONV_ZJm7>>5P[#FNZH[Y?0\gHm]foZ_dVcjD_Qb1RQ?"
    A$ = A$ + "fhgi`fMjn9ZOWWnMmM\iYonOE3n#ac#cUIZ[fWcn\>GgKAn0>GWLc_ce;QB>"
    A$ = A$ + "S=4l:fP\V3EFCK?R_[QWii:nWBkbleXo7RWQiLWcARiH\[gR_d\^[Gj;^kG]"
    A$ = A$ + ";bWn[`JXgY<GE>o#L>V`mfDhJWc5fgVihfi`ARSClkYRMhUbmh\_7AOacnSn"
    A$ = A$ + "M^i3lD?fVOagagh`[`Pcan=?OiETClI>fd5OK<M9GZg_d6#nnd^P<hI^6L]D"
    A$ = A$ + "n3M7POYm2;N6IF\=iiTS5l]j<hk5GSllBcC7H9fW4J96keHlbbldSOgijhVc"
    A$ = A$ + "k;`VL?WS]E9kHLEnl8kW>Tifj#`mj;mN4nO36n3n>ok]iGmYn6_Qico:?aSM"
    A$ = A$ + "O<hMd10ca<Po_f6_<b9RoG2caaaS^5h1o7^Ra62bPV3HadHO6IKUM=gMY\]h"
    A$ = A$ + "j]5C[mjF7;<gW=f5L=fUcJ=9e=bnRXkIaTC^Y6[;n]dg:jSAgli4>^Kc^>A^"
    A$ = A$ + "64O:^nINb7Mccl4?j3PYCVo;l7nYNdQnB>^?bCm83<OWonH<o7hlHhL`[XWP"
    A$ = A$ + "fSa3P<YnXkHhH^P7\=Omg?V=;]e_]aCS_e\O4fZ]f]=OLDK:^F[`m\H#Soem"
    A$ = A$ + "`HgORAG3OoHf\A?XiXnHSe>TL?oIgeF]HJaelJR[1[i[gP7ObGe0N\\?=]gg"
    A$ = A$ + "2;^fS?an8CKM:I^`<G#Ho2<kLeU1Hn]e>mfb4CmfHcfllS^:RkahP4C[]jL_"
    A$ = A$ + "7gJh\6ffn\mMMQLZaGjSiEV[\4GGj;NJO]4G_c<WW:<?O`N=bT`ho=HDS37i"
    A$ = A$ + "mi`dj;:HR3EMRR?1KKmcah:1V6mIl<1VgkLPAlLR[1[SN[a\Ikh2L^kkMgVm"
    A$ = A$ + "M9R19mo>]EG3O_:HhFaeGKo6iJcKVSm6HlF`cbH;SaoNYSgdfbXoHKm37L]H"
    A$ = A$ + "#N^`[PcMO9>gGb;afSi?^i4oB3GOYVo<]EWa:?6Vn#R[CO2Df6\\Sa6f\K5C"
    A$ = A$ + "W[oh3c_[7j[JWZecFHGCcfVS5m>R1]>2`FLjkMb^?UC\oL_E;^T;Ao`Q_4m5"
    A$ = A$ + "7KeTW9`dPUNVgiS>9Vn#S[YMXO4Inc#V7k?]R[A>Wa3RjXJY[a8;en\k5gkb"
    A$ = A$ + "LHAke>VRlN>Oe`dUhKbW6aJGVOl;alV9OYUGLm1_6^N_QOGPOh9^Ffak2IDl"
    A$ = A$ + "eX5L=fZZ]nP]f<hh]5;CmSjakJSRHJRYN<KeUhJnOh_>gcFg>>>GkjJ5h]En"
    A$ = A$ + "B;6Kc1^h86G;]CN`aeS`X[aanRgCk4<=HkC=LmBe1Tka6>_c1Oc[gDk8GiG="
    A$ = A$ + "nG?7V^<FK>>Rk`H[9nlS=njHaGBnRGJ[FL]HKOOb7M<nCN?L=fOV[^XcL6M\"
    A$ = A$ + "ae<Vd>_cL=oemjNWk:^^6I=UiJBNihJe2KMVR>1cgha6OM<i;5Oa7bGfeemA"
    A$ = A$ + "RY_13?bh>MDafGg686KbMF[Sa5aG36J9e2n^J;Va^fm90]733>Vj;Qf2[c3^"
    A$ = A$ + "^<1>Do`Mo_ciVUcGeUP]=7JL=?4kW]WjlU2^6KMDSVSJKn46GKMZdBmXTi=b"
    A$ = A$ + "ed<RgL=`_>?XBk`ROM]>emeJ`_\fBmIo\KflYOn?m`[lIWWI^NmSMMW;7R[i"
    A$ = A$ + "jP=lY6W[a6g;J;FG^7K\eaT_46_;m17kegDK95f\?UaeHWfLW]TajRSaiAL["
    A$ = A$ + "e3:<fFCl[cSZ43kLo6lhP]hOmoG3hfohGhc\i?ic\I073o2_`mN?o_Qieae["
    A$ = A$ + "UPD;;NnciZ6HK6G?\oJmSf2K]SOZ?_:khjeQ\F`]na^1Go>W1GOTS_F\]SY;"
    A$ = A$ + "i2ZeLWE6_]S5D=OXa7HWS7iNGW[mW^VAPOdPMQoDoQo1KnmoDgeF<_2F6Vg3"
    A$ = A$ + "66B<=o7>6``XG0KePYilR=iaZ7DSeN9i?K`Wnmdj4h]9nB8gIL?>FMQF>?<>"
    A$ = A$ + "ih[`eXCbaa^eLWU<7Skm[9V>WoD6S;OWa23f=]m>7ko5l<<4_biamO4`fPYM"
    A$ = A$ + "MAgmPSZi9H;\IcoM=f?\>7hcIBKeFW9RQ_eP[_R]NYGJmYDRA0KG^^ViJFP["
    A$ = A$ + "]KRU5?R]f1ljc?3oI_O<]<O7[eJI^b0GS>4J?CE7M^f]^DKf3[KaGhi37nL>"
    A$ = A$ + "gIH[VlU9^VkgK[hJW;8L?];HK`DD;J9nfmW0\Yc_9VD<nL\DRWQAgPeZ^nOk"
    A$ = A$ + "n38i#KZe71HccmBb=>gSGRe#J7GOk2GS_hL?\dakJZa0[fhKK^ZMml5<YhCX"
    A$ = A$ + "4o>7QMNHGKfdoKgo20[gBkUeRUUj;>aUkl3kHS]jhjm3ajK2H2_G[QY?eaed"
    A$ = A$ + "^Y=B^P\V;fVlWEjOK=obaF]k;F>>IVW<ag]#^6RjJJ\[aiHgfm99cmmD]EV_"
    A$ = A$ + ":]Vo<[[DG3c^dc;ni:^fe_<`e9>if8^6Vk3WgV;I\\[a^>W3f6LCinY9fE]f"
    A$ = A$ + "`LMF2\=LYOgI\eHSVa^fc=HIn\k[\^FQK\ecQ[aN=Hjal5gifU[AB3aILV]f"
    A$ = A$ + "U]]liJMRW[O23eJm]LL][U:4kh#MZ\Pi^iDf^]>aH>JMO[IB?kM7;:7o;n?^"
    A$ = A$ + "7BRMMbI^[IXlnL]CWGM_o4okM_ePf^[=KC]n3biRH[ca=O<OaikY]iNfea6?"
    A$ = A$ + "MZ`W[SQ]hjL^ND>GW_=Q[9o0XSbjDaacIGaf`POMMhUHO9_eD?HD_9`>_=\6"
    A$ = A$ + "?5[1KMSZ]Mg?N1CWegY[3Q_i_b_Y1lJ]iGYfX1ccaP]J`]eRafib5]k>Nnb1"
    A$ = A$ + "Gkj=j]ML]<e?S[mJJgF\m^4_mmUG][gohL=8JHOlkJ\m7lInSfl;mNoM_1cn"
    A$ = A$ + "FO?ob3hM\cc[?aCo17d50Wn?c[FSWdNObGnSNOf\C\]ie3M0l:HGe=TnRgmi"
    A$ = A$ + "n27GOUm]L>LM[5GK=b5[[RiNDR[MY[1f6faVbiC9nBC>\P1GW<HLgW>MmF#L"
    A$ = A$ + "XHj4G3fEl[f\M]#gl2bmSjF4WkN4>f]MnJO\a>Wbli:^Vj\jL3GW<gGTG#G?"
    A$ = A$ + "2dngJP[a69W?`?I<^PEM>A:cW;m=GOX[Q[aN=hJ]ICk?mm0IB7C?g;SL_7cm"
    A$ = A$ + "WS>Vnb4GWceSBLM=\mA7g^0Oa[ECkL?H]V^6aMkNB9H#HN_hjBlZHFablO6F"
    A$ = A$ + "Ko[H_6LYWC?_lO1GGFCgiNZS[A[mJo^S[MOiYfljhf:^N0KclEiJnnF;C_bf"
    A$ = A$ + "M6gmfjK>ikFOT;aee`dgPI_67_4L=G3e3E3_jgiNZba6[LKR?Wae3[aAElQm"
    A$ = A$ + "f<^>ili8OFMIY]kUR_Mm:?aej;nDhjBkeY_kVCld7P<oIMoZkhjFXdNmU2^>"
    A$ = A$ + "IdWQm=G3UF2nfm]>no96L:o`M_0cj9SSgm`bbmUdJhj^oeMLM;dU2^>_7Bob"
    A$ = A$ + "im^gPjn1HVK=\9GKOg5F\FRodGlBLM^f3ijZ6\eBFJkgiSAiiem8_nhE]^li"
    A$ = A$ + "HmVUhjL]DhB1GkmX[YK^?PjniG]a>fjlbHN;aOj;>_Z_g3e3jEfgM=<J#WaE"
    A$ = A$ + "OVmOXdf?OOiiCffcUNmL_fliHmQWaGO^Ug\4?Gnm^>5K^d4[G^gNkh8EF?Z>"
    A$ = A$ + "G>cH[CL]SeTnEknX3mgPUa70[=DaiedA<V_h7K\`i4gaegn`e]bUf`cmbkdg"
    A$ = A$ + "LafOK_h_e9aQ>Vb^gKka2V6NHOmmJkgj3N=O_[I_^fHMei]W?7Y>^NLLMnNR"
    A$ = A$ + "9>m9Wj<Y5\HRcioZ=JaeHg>?gcMNjaH_^lijjKFR[?WS_NYhjB<>nTcJ#Pa9"
    A$ = A$ + "KMRDc?lY\eVc=;\KckmgVkLXOhh7#?6k>^N::aee63k>^nN[QaFSiDSI>O;J"
    A$ = A$ + "1CG3KSMjL?k^UcUhjaZk\>_LnL<Gh#CQ[_EFOhk2?2VVnUL]J`lQ]DLMR]Mm"
    A$ = A$ + "B8W_F;1GO9__J]VliH>cPL]BYfHHOYQ[cj>eeG2g;Aim67lHe>B[Lj?mBRA_"
    A$ = A$ + "lhjeNf^cW[4[9GP[5K?5VnL7G?6kJiXSi5HjQmYf[l7OK`eRAUGbaUSQF_fc"
    A$ = A$ + "f=Viahi<I]HdaeeIU387dJ_DKVK;^NKIaeZKYS[gNnL=fJ9g_[OPlQoDgbFW"
    A$ = A$ + "QgcHKae3c1[]8_I[1S?kR]kc_jMS?GS]FRjN_JmQgae3\nQWf[g5O`g5Ogcl"
    A$ = A$ + "fakkSSmfaW[S_EBgP[3O>EV^S[_KOBIle;9OG[5chV[mJ[o]N=U^MlihLhZT"
    A$ = A$ + "L_ij6<m5Jn`W2G3WhjaVcEKS=hJOGb^_4il<i>gi>g`>?#jaI_<nLg7L87[["
    A$ = A$ + "KfoNjhjZhJgCodNmf4OM9Nei>BF7i9^]EVcEgWlfh;1O`QjhjUS[7K^L=6>>"
    A$ = A$ + "]g^]HgihnN]EKlUP?hBVCl>^NNLM>?]ZQ]7cWi31gg_]VW_4l1GbH\ciTL7G"
    A$ = A$ + "O?L]i3?m3?S_>mWn#SUkf]KW_T\E3ij=\i5_S[7gN]fY?5``SamJA[>O9I[F"
    A$ = A$ + "jnl5_S[O0\=hJ`;7K<K;LOmBj1i;UlUEBj;N?njjl]9LMgOl7T_4]E3INakh"
    A$ = A$ + "j7TiNgm1XS=NMYnSO\aCW2lUZ]J8`eLokJ0D7GO?<=aWLKbF]L_6b_\]E;Y="
    A$ = A$ + "knlijN<[1aP=>fHdM1KO9k?ninlfZEbiRM7GO?L=SYea6O^;lU\ohGjfZUbi"
    A$ = A$ + "]I7Go8g<nF7K\i^JcnBdOl;]a[N::[A5Ggc^Tae<NA7KLiJ`\7PM\aIM<maU"
    A$ = A$ + "`VmU<^6i2_o_]k3^lUdlenB>ogCAFoI\>LVHkS=Nk#QYY^A9^JWKV7K<iJaG"
    A$ = A$ + ":aIgcEfh4jl\fB;]GO>6g=g7>WfhEGKQ?fHaefVmiNLfM<mdTaIWhJjgdodc"
    A$ = A$ + "EL]HKZ_jc5O`_D\IgSY^=Rn9`fPSe>fU0^>GgS>fh`^=kfHFgPkSYEkD>O#:"
    A$ = A$ + "cSfi>^nHS1gG<nQL\aQM<maU`OLWoVl:eIjiX=KeII\eWSnPWf\?WV_GM<mf"
    A$ = A$ + "AJc6<]LI][L\aU[9fVH[gWeRR[g:<W[8NGHdSL8[oUcQh\a>M??IK?IMSO9P"
    A$ = A$ + "[iNKOj3^[aP\7MB^;Bn#V[^Kke=NggkeQLM6oL6GclHSa]NO=nFLN1?;knMX"
    A$ = A$ + "<HK`J72\m]mlWe]C_N4HKU7MM#k\R_jNJOHcdeC#RKGfaC3>gWENVWhA?HiR"
    A$ = A$ + "ofX_hMkd[?E>7]ceh`S>VL5HdGinecJRO0W>hgmEn2H=l1^FM9UlQbOlKKi5"
    A$ = A$ + "_SYgOD6[][o2CEWYicUS=^M>N][MDme5<=mGY]j1LM5<m#\0Gi?^kgfkCL=i"
    A$ = A$ + "]kHSEKUaNB7C_O9afF?J`eZSlJ?O>fHgi\G_VaFci2L<j;biA3lHfZQYMP^P"
    A$ = A$ + "mMna_]Tk\NMTMh8bK1n59f^4G?fcXS=fM:6\dJio[H6kK\Oinbch8hJno[IK"
    A$ = A$ + "I>o9>fhfah^Og77bJ87KB:on<_c7MbWE7K\kDlJ>^e6CMZ_k1^o[PYYoSo?m"
    A$ = A$ + "[7RiMbY8^^?^d7ObJ8gl1=gc\S=fM:N]FKD8^E7gPL>ReaeccmhX?M8]KKLT"
    A$ = A$ + "?0F>UMgShBc75]1`7SiWkn4GK=JVcG6?oMLmdfXO^gg7iH;:gY22kfV;]dGl"
    A$ = A$ + "FaDk:[=AaecMMmoW_O=FgA9VHlW_LOAjD5GCkeJI5mA7SlXemindV:WcfE]I"
    A$ = A$ + "^W`dbe6o8o=a<b4k0H#heJol`m6;]G_DL]i=k#Q[7Sa_5[OMjKFkmk;lg6iU"
    A$ = A$ + "^?gg>8NFiJE\eMjQ6K?V=#WC78WR]9\EQ\[kWN[1N8[[\beJVUP[9nl31^Vm"
    A$ = A$ + "#`Ga[hEN3GfWhn2<O7gKk:nFlLW^mAX36obca1;[SUE7GGP;hJI]c0n0o[Mn"
    A$ = A$ + "BiLYRgoBNh7NHMj4ii7iANM3b^k26QSGKek2^VoonHL^nbobNIKnZojOUgLo"
    A$ = A$ + "2oZO=OGKNSO??nmaOg?nS?lm`lO:al;a6>b08;d];On#R]ci2FJ3M]`e^GkX"
    A$ = A$ + "n3`4JCFL\HIT91WS\[<bb_]g`KN#>6iFiURK#?a#MXDP[GZOhL<kJ>ln2OAO"
    A$ = A$ + "93]7`SLOimYhfU`9VWn>j3Yo1LNilhVWgHC^SU?_Yd_lB\m3hGn2mG7M3`RY"
    A$ = A$ + "MNBQLWfVAn3IKT?1k:GJSZDVggSmlFaBlOcal[dgTF`eL<>_^FJ\1hC=]1\a"
    A$ = A$ + ">7>MKaenIjCY_5odYofJ8igiCnKO\5ijd1Tn=n]N^=o<oDodeafcHoZ<FI]A"
    A$ = A$ + "Ocgm4?j=hITcPaF5fY43G:OFCFM:6iH\1>5^RoC^^[DjCj3<_F79fJd;djL0"
    A$ = A$ + "?]7_4LjB]I?6FW[_mgM<mU9acMlA;]Og2>^T5G?T_i[SWfmaFTadn\nGGJ7N"
    A$ = A$ + "<kcCa6c9f4W2SU[YYK3^VOK^eQ9d_\=him5S?2ONO^O13__CG^TcgCagSRYO"
    A$ = A$ + "bkRS5;8^f??\N:Nei`HAQAN7LHV?Xil[\5LMR]afdHhJ`[SQ][]^?F3GCnX6"
    A$ = A$ + ";nNeIMZ`fg#O2mooloR_gS]hEW>bThK[E=Ui7Zcl[l]FLM]abJH<N_jHB<=f"
    A$ = A$ + "\[5SlDhj]Qe6E=<WeVCj;NVO_T]Vj;GgDZUc<\A[^ZS=F^Fo9adS_Bd6kDWP"
    A$ = A$ + "8gJJofl>E^72Un_:V0iO7[E`dUnL?W=VM5KGjC>O?jZbk7Im]8gOa;acbUaF"
    A$ = A$ + "G;G1WB\]9dmPn^NLfMZTZ=fg?`j0fEfY=?ITcH\C>FN^K1K_]bcUaKciZfmS"
    A$ = A$ + "HmBVkcJLR[1W\=]eF^gG:fM\hBXo7M?6g?i8]CMJ<BkN^NMVHJ778mmcLTE3"
    A$ = A$ + "OWbVC=6GK3SoWR1G;O#IL3d?labAfBm]H^O_5\\OORSYoWLVmiLEKUG[YK[C"
    A$ = A$ + "MJ:BKLH;7lYSOJ93VPJ[2I?[Y<H[eZ]QiHiLhiTcOihN]VaFnno0OPKZ1>a>"
    A$ = A$ + "N_Ue8gHhmUhGReFFbIMVI>a\MHK1Vn1n?_mgbC]iIOfnhGgYMWnhO\?n#nGa"
    A$ = A$ + "nf?fCnS<8KUn6:N6OUCIbJLZG8e9`k;UW]F^Bl5GgmA^PiNTk?^ghj`e]Fk1"
    A$ = A$ + "VkCJ_bN?hmQ<o?I`[Nl9GJ7fHWY_UnJJCXkYkcMW>64f?ZaPGn#_o?l`[8K2"
    A$ = A$ + "GnknKlilg?f4]5dQlFO7o#3HMeM=66ELJ9k_EjOSeO:jBhjhmM73gY>MhYDO"
    A$ = A$ + "5fB;InNHlG6VokYR^Z>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY>eY"
    A$ = A$ + ">eY>eY>eY>=?moga%%L2": btemp$ = ""
    HBCM$ = "MBO2L4G8G8AG>C<B2G8G8AG>DC2<G8G8>GEC<BA2>P8F8F8ECDC2"
    FOR i& = 1 TO LEN(A$) STEP 4: B$ = MID$(A$, i&, 4)
        IF INSTR(1, B$, "%") THEN
            FOR C% = 1 TO LEN(B$): F$ = MID$(B$, C%, 1)
                IF F$ <> "%" THEN C$ = C$ + F$
            NEXT: B$ = C$: END IF: FOR j = 1 TO LEN(B$)
            IF MID$(B$, j, 1) = "#" THEN
        MID$(B$, j) = "@": END IF: NEXT
        FOR t% = LEN(B$) TO 1 STEP -1
            B& = B& * 64 + ASC(MID$(B$, t%)) - 48
            NEXT: X$ = "": FOR t% = 1 TO LEN(B$) - 1
            X$ = X$ + CHR$(B& AND 255): B& = B& \ 256
    NEXT: btemp$ = btemp$ + X$: NEXT
    btemp$ = _INFLATE$(btemp$)
    _MEMPUT m, m.OFFSET, btemp$: _MEMFREE m
    HBC& = _COPYIMAGE(v&): _FREEIMAGE v&
END SUB
 
 

Title: Re: Binary Search Method
Post by: bplus on June 19, 2020, 11:47:22 pm

Wow Dav, you can remember that? :)

Title: Re: Binary Search Method
Post by: Sanmayce on December 11, 2021, 03:02:34 am

Interesting topic, it's nice seeing fellow members sharing their views.

A decade ago, this very problem (finding a word among MANY words) caught my attention, as a result, I wrote Leprechaun - the x-gram ripper. It is simply the fastest (on Internet, at least) tool for ripping x-gram lists (wordlists in particular).

According to my experience, Binary-Search is not the answer, nor lookup table hashing (per se).

Indeed "a billion words" as @STxAxTIC said, makes the game brutal, in such scenarios the really good solutions scream, so far, my biggest rip was up to 600+ million distinct words when ripping the ten years span corpus of reddit JSON comments being 300+ billion words strong:

The answer is simply using the fastest 32bit (the hash output, not the instruction set) hasher combined with B-trees as a collision handler for each slot.
By the way, hate to see B-trees misnamed, the real name is Bayer-trees, not that crappy lore stating Broad/Boeing and so on.

When collisions grow then searching/dealing with them FAST, can be done nicely (but hard to code, it took me several months) with B-trees, MOSTLY because when you tailor your physical RAM memory footprint to be small, that is, the in-memory hash table(s) to be, say, with only 1<<20 slots then Dr. Bayer comes solving the DEGENERATION of Binary-Trees, or the nasty Linked Lists used in many Hash schemes.

My point, with Hash+Bayertrees duo, there is no limit of number of keys, nor DEGRADATION, on top of that, the duo can be tweaked/tailored according to your memory needs.

Oh, as for speed, on my laptop with i5-7200U, ripping rate (i.e. how many words are recognized/found per second) was around 12,000,000 lookups-per-second till the end of the file - STEADILY, when ripping the English Wikipedia XML dump ~70GB in size.

Title: Re: Binary Search Method
Post by: Sanmayce on December 12, 2021, 12:30:40 am

Quote from: Pete on January 28, 2019, 03:22:05 pm

What I wanted was a hash method without collisions, but after a brief look at permutations, I punted on that idea. Numbers way too big for words with over 20 characters.

"PUNTED", hah-hah, didn't know that word, don't leave your donkey behind in the mud Pete, as the Bulgarian proverb says. FAST Hashing without collisions is possible.
My Gumbotron (source code in my Masakari thread) is a 128bit hash (using either 128bit or 256bit instructions) suitable for such tasks, it has only one weakness - the range 1..15 of key lengths, therefore it has to be prefixed with some bytes, up to 4, anyway 20bytes or 160bit is kinda the standard. For keys with lengths 16 or bigger, for what it has been designed, it screams.

Didn't have enough time to make the complete benchmark, yet some results, all the 466544 lines have been hashed (with Gumbotron_YMM) in 8 miliseconds:

Code: QB64: [Select]

C:\qb64>dir
 
12/11/2021  08:42         5,329,510 466544 Word List.txt
12/12/2021  03:26        18,661,760 Gumbotron_466544 Word List.txt
 
C:\qb64>"QuickSortExternal_4+GB.distinct.txt"
 
C:\qb64>type "Gumbotron_466544 Word List.txt"|more
0x01-66D12A342B6CCE8B-76DD8BBCDC717EA2
0x04-4DEB4F7BD2C7B41A-EFA92545B3FDB9AF
0x02-156418E71C1387D0-C56894090FE9BCBB
0x08-47210BB6197940FF-77FDA06EA6D5CE3E
0x04-59B0EE4F929901C0-8166DE482670F770
0x08-EB52ABD995A0EE6E-495DC42CCA812D52
0x08-D1ED8F252634F052-A7564B142920ADEB
0x08-590505A51FC4A007-1742348B51C5C667
0x08-23748F08CBDDA10A-B90F700EDB31A73B
...
 
C:\qb64>"Sandokan_QuickSortExternal_Deduplicated_4+GB_64bit_Intel.exe" "Gumbotron_466544 Word List.txt" /fast /ascend 300
Sandokan_QuickSortExternal_4+GB r.3+, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 18,661,760
Counting lines ...
Lines encountered: 466,544
Longest line (including CR if present): 39
Allocated memory for pointers-to-lines in MB: 3
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 17 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 466,544 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
/ RightEnd: 000,000,353,900; NumberOfSplittings: 0,000,053,177; Done: 100% ...
NumberOfComparisons: 10,162,539
The time to sort 466,544 items via Quicksort+Insertionsort was 938 clocks.
Performance: 10,822,725 Comparisons_128B_long-Per-Second i.e 21,645,450 RandomReads_128B_long-Per-Second.
Dumping the sorted data (Regime=2)...
| Done 100% ...
Dumped 466,544 lines.
OK! Incoming and resultant file's sizes match.
Dumping the sorted data [deduplicated] ...
Dumped 466,544 distinct lines.
Dump time: 181 clocks.
Total time: 1,227 clocks.
Performance: 15,184 bytes/clock.
Done successfully.
 
C:\qb64>dir
 
12/12/2021  03:27                 0 QuickSortExternal_4+GB.distinct.txt
12/12/2021  03:27        18,661,760 QuickSortExternal_4+GB.txt
 
C:\qb64>
 

My old console tool Sandokan (something like 'sort') sorts the hashed lines of '466544 Word List.txt' and reports/dumps the duplicates in another file, just for visual purpose.
In fact, the hash is 16+1 bytes long, so whether the 466544 hashes are in external RAM or internal RAM changes nothing, Binary Search applied on fixed length keys (prefixed by the length of the key/hash) will be fast enough, considering the avoidance of using B-trees for maximum speed.
Those 400,000+ keys are not a serious benchmark, though, since binary logarithm of 466544 is 19 - it cannot stress properly the search, moreover the first run is not indicative since the keys are not cached yet, also 17*466544 < 8MB will be cached well or not depending on L3 cache of the CPU, more older CPUs are up to 4MB LLC, whereas latest Intel and AMD have 30MB, so it has to be considered if such a benchmark is to be ... modern enough.

Title: Re: Binary Search Method
Post by: Sanmayce on December 14, 2021, 07:43:25 pm

Having an English wordlist containing 15+ million distinct words, taken from 40 million titles, is worth benchmarking.

"As of October 2019, Google celebrated 15 years of Google Books and provided the number of scanned books as more than 40 million titles."
Source: https://en.wikipedia.org/wiki/Google_Books

The attached (ugh, the archive exceeded 20MB) wordlist is ripped from "Google_Books_Ngram_Viewer_Exports_2020-02-17_eng-1-ngrams" corpus.
It is a must-have PLAYGROUND when comes to Binary-Search etudes.

How to make it, yourself:

Step #1: Download from http://storage.googleapis.com/books/ngrams/books/20200217/eng/eng-1-ngrams_exports.html 1-gram corpora;
Step #2: Decompress *.gz and "copy *24 Google_Books_Ngram_Viewer_Exports_2020-02-17_eng-1-ngrams.txt /b" them into one file;
Step #3: Trim all the "years" statistics i.e. crop only the first field (up to the first TAB character);
Step #4: Rip (make a wordlist) the trimmed file;
Step #5: Sort (for better compression) the ripped file;
Step #6: The resultant "Google_Books_Ngram_Viewer_Exports_2020-02-17_eng-1-ngrams.wordlist.txt" houses 15,074,900 distinct words with length 1..31 characters.

The whole package (including "Google_Books_Ngram_Viewer_Exports_2020-02-17_eng-1-ngrams.wordlist.txt"):
www.sanmayce.com/Google_Books_Ngram_Viewer_datasets_v3_WORDLIST.7z (32.0 MB)

Maybe, a somewhat decent Binary-Search benchmark would be to "spell-check" the KJV Bible (~4MB text) using Google's wordlist, it amounts to Binary-Searching 793,877 words into 15,074,900 wordlist ...

Title: Re: Binary Search Method
Post by: Sanmayce on December 18, 2021, 02:47:28 am

Finally, we have a decent benchmark utilizing the Binary-Search in a speedy manner, it is a good start for all kind of experiments since it is creating as a result the .HTM counterpart (with all unfamiliar words in BOLD, all OED distinct words are used as a spell-check wordlist) of our .TXT file.

It throws all the words of KJV Bible at OED 2nd edition, which is 700,000+ words against the 1,000,000+ OED words:

The 64bit binary and the QB64 source code are attached, also the OED wordlist.

Spell-checking KJV Bible under 2 seconds, can we do better?

Text Only | Text with Attachments

QB64.org Forum

Active Forums => Programs => Topic started by: SMcNeill on January 24, 2019, 08:09:35 pm