Author Topic: Binary Search Method (Read 33480 times)

STxAxTIC · « **Reply #15 on:** January 25, 2019, 09:34:54 am »

As predicted, it's not apples-to-apples.

Gotta replace my hash function, which uses a cosine and then a completely separate recursive string function just to remove the decimal point. The ReplaceSubString function should be gone from the hash function - use Luke's suggestion instead.

Also your wordlist of 400k+ entries exceeded the value of the hash table size of 300007. You wired it for collisions in other words. Not sure if you noticed or if that's an accidental straw man.
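To put a number on that mismatch: a table's load factor is entries divided by slots, and once it passes 1.0 some collisions are guaranteed no matter how good the hash is. A quick sketch, assuming the 466,544-word count taken from the word list's filename (the variable names are illustrative, not from the test program):

```qb64
' Load factor = entries / slots. Above 1.0, collisions are unavoidable.
DIM entries AS LONG, slots AS LONG
entries = 466544 ' assumed word count, from the word list's filename
slots = 300007 ' the hash table size used in the test
PRINT USING "Load factor: #.###"; entries / slots
PRINT "Entries that must share a slot, at minimum:"; entries - slots
```

At roughly 1.56 entries per slot, at least 166,537 words have to land in a slot that is already occupied.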

This isn't worth beans until you get an honest comparison.

EDIT

LOL and I just noticed you're using the hash table wrong. Why on earth are you storing the index (as a collision) next to the word? This is a speed test - trim that fat!

... And once we see where this test finally goes, let's push the record limit to the maximum and see how each one does with like, a billion words.

Pete · « **Reply #16 on:** January 25, 2019, 10:01:20 am »

Bill, you made my point about the problems with hashing. With my methods and Steve's method, you don't need a complicated pre-plan, which may need to be overhauled or adjusted someday. Just alphabetize the list, which is a one-time, never-change-it algorithm.

I do love the hashing method and it is a damn clever, fast, and efficient way to set up a list to be searched and indexed. I suppose the merits of any of these methods depend on the use. For instance, indexing doesn't mean squat in a spell check routine, so I need that feature like a fish needs a bicycle. That's one reason why even my quick and dirty QB64 INSTR() method is appealing to me. So while you keep searching for the Holy Grail of hash, Steve and I just grab a couple of Dixie cups and drink up.

In any event, good luck with your hash refinements and thank you for starting this awesome topic for discussion, originally in the, "Dictionizer" thread, and to Steve for continuing it here. Guys, we need more critical thinking threads like these around here.

Pete

SMcNeill · « **Reply #17 on:** January 25, 2019, 10:02:36 am »

Quote from: STxAxTIC on January 25, 2019, 09:34:54 am

As predicted, it's not apples-to-apples.

Gotta replace my hash function, which uses a cosine and then a completely separate recursive string function just to remove the decimal point. The ReplaceSubString function should be gone from the hash function - use Luke's suggestion instead.

Also your wordlist of 400k+ entries exceeded the value of the hash table size of 300007. You wired it for collisions in other words. Not sure if you noticed or if that's an accidental straw man.

This isn't worth beans until you get an honest comparison.

I told you you were comparing apples to peanut butter...

The Binary Search is a "search and done" method for data retrieval. The Hash Table is an "instant jump to a spot in memory and then parse a much smaller data set" method for data retrieval. As long as you're going to have collisions and need to search for those, you have to take that search time into consideration.

Using your suggestions of "Luke's method" + "Table Larger Than List", we end up with the following code:

Code: QB64: [Select]

DIM SHARED HashTableSize AS LONG
HashTableSize = 611953 ' Best to use a big prime number. 611953 is prime; 1014729 (3 * 338243) is not.
 
DIM SHARED LB AS STRING ' Make sure that bracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{"
RB = "}"
 
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
 
DIM SHARED WordList(466545) AS STRING
PRINT "Loading library"
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
 
Sort WordList()
 
i = 0
FOR i = 1 TO 466545
    b$ = WordList(i) 'the word to store
    c$ = LTRIM$(RTRIM$(STR$(i))) 'to store the index
    d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
    EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB + LB + c$ + RB
NEXT
CLOSE #1
PRINT "Done creating Hash Table."
 
' Done developing fast lookup tool. Now time for an application.
 
PRINT "Looking up"
DIM RandomWords(50000) AS STRING
FOR i = 1 TO 50000
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
 
t# = TIMER
FOR i = 1 TO 50000
    'PRINT "Searching for: "; RandomWords(i),
    l$ = Lookup$(RandomWords(i))
    'IF l$ <> "" THEN
    'l = INSTR(l$, " ")
    'word$ = LEFT$(l$, l - 1)
    'index = VAL(MID$(l$, l + 1))
    'PRINT WordList(index) 'to show that we got the index back successfully
    'ELSE
    'PRINT "NOT FOUND!"
    'END IF
NEXT
PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
 
t# = TIMER
FOR i = 1 TO 50000
    index = FindIndex(RandomWords(i))
    'PRINT "Searching for: "; RandomWords(i),
    'IF index THEN 'to show that we got the index back successfully
    '    PRINT WordList(index)
    'ELSE
    '    PRINT "NOT FOUND!"
    'END IF
NEXT
PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
 
PRINT
PRINT "And just to compare results -- the first 10 words:"
FOR i = 1 TO 10
    'PRINT "Searching for: "; RandomWords(i),
    l$ = Lookup$(RandomWords(i))
    IF l$ <> "" THEN
        l = INSTR(l$, " ")
        word$ = LEFT$(l$, l - 1)
        index = VAL(MID$(l$, l + 1))
        PRINT WordList(index), 'to show that we got the index back successfully
    ELSE
        PRINT "NOT FOUND!",
    END IF
    index = FindIndex(RandomWords(i))
    'PRINT "Searching for: "; RandomWords(i),
    IF index THEN 'to show that we got the index back successfully
        PRINT WordList(index)
    ELSE
        PRINT "NOT FOUND!"
    END IF
NEXT
 
 
 
 
FUNCTION Lookup$ (a AS STRING)
    r$ = ""
    b$ = EnglishDictionary(HashFunc(a))
    c$ = ""
    d$ = ""
    IF b$ <> "" THEN
        DO WHILE c$ <> a
            c$ = ReturnBetween(b$, LB, RB)
            IF c$ = "" THEN EXIT DO
            b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
            d$ = ReturnBetween(b$, LB, RB)
        LOOP
    END IF
    r$ = a + "  " + d$
    Lookup$ = r$
END FUNCTION
 
FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' input string, left bracket, right bracket
    i = INSTR(a, b)
    j = INSTR(a, c)
    f = LEN(c)
    ReturnBetween$ = MID$(a, i + f, j - (i + f))
END FUNCTION
 
FUNCTION HashFunc (s$)
    DIM hash~&, i
    hash~& = 5381
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i)) MOD 611953
    NEXT i
    HashFunc = hash~&
END FUNCTION
 
 
FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
    j = INSTR(a, b)
    IF j > 0 THEN
        r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
    ELSE
        r$ = a
    END IF
    ReplaceSubString$ = r$
END FUNCTION
 
 
SUB Sort (Array() AS STRING)
    'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
    'It's more than fast enough for our needs here I think.  ;)
    gap = UBOUND(array)
    DO
        gap = 10 * gap \ 13
        IF gap < 1 THEN gap = 1
        i = 0
        swapped = 0
        DO
            IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
                SWAP Array(i), Array(i + gap)
                swapped = -1
            END IF
            i = i + 1
        LOOP UNTIL i + gap > UBOUND(Array)
    LOOP UNTIL swapped = 0 AND gap = 1
END SUB
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 'lbound(wordlist)
    max = 466544 'ubound(wordlist)
 
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (max + min) \ 2
        compare = _STRCMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap
        ELSEIF compare < 0 THEN
            max = gap
        ELSE
            FindIndex = gap
            found = -1
            EXIT FUNCTION
        END IF
        IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
END FUNCTION
 
 

Faster than previously by about half, but still about twice as slow as the binary search method...

Regardless; I think the results speak to show that when your data is already sorted, doing a binary search of it is rather quite efficient.
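One caveat about FindIndex as posted: it only reports a miss when max - min drops below 1, but because min = gap never moves past the midpoint, a search can sit at max - min = 1 forever when the word sorts above WordList(min) and never matches. A minimal sketch of a variant that always terminates, using inclusive bounds (FindIndexSafe and its parameters are hypothetical names, not part of the code above):

```qb64
DIM Words(1 TO 5) AS STRING
Words(1) = "ant": Words(2) = "bee": Words(3) = "cat": Words(4) = "dog": Words(5) = "eel"
PRINT FindIndexSafe&("dog", Words(), 1, 5) ' a hit: prints the index
PRINT FindIndexSafe&("cow", Words(), 1, 5) ' a miss: prints 0 instead of looping

' Hypothetical helper: returns the index of search$ in a sorted array,
' or 0 when the word is absent. first and last are inclusive bounds.
FUNCTION FindIndexSafe& (search$, Words() AS STRING, first AS LONG, last AS LONG)
    DIM lo AS LONG, hi AS LONG, middle AS LONG, compare AS LONG
    lo = first
    hi = last
    DO WHILE lo <= hi
        middle = lo + (hi - lo) \ 2
        compare = _STRCMP(search$, Words(middle))
        IF compare > 0 THEN
            lo = middle + 1 ' step past the midpoint so the range always shrinks
        ELSEIF compare < 0 THEN
            hi = middle - 1
        ELSE
            FindIndexSafe& = middle
            EXIT FUNCTION
        END IF
    LOOP
    FindIndexSafe& = 0 ' not found
END FUNCTION
```

Because each pass discards the midpoint itself, the range shrinks on every iteration and the loop ends as soon as lo passes hi.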

Anyway, I think I've done as you asked:

Quote

I think the burden's on you to do the speed test m8 (for a large data set with lots of performance demand).

If it's *wrong* at this point, I think the burden's on you to show us where and how.

STxAxTIC · « **Reply #18 on:** January 25, 2019, 10:11:06 am »

Nah, I don't need to prove shit to you guys again on this. Fun talk though. Someone who takes the time to sift through this convo, and the way Steve split it among two threads, will get a real chuckle.

If you really want a *win* on this Steve, adjust it for a dataset size typical of a respectable database. A few billion if you can. There's the bait. You might even be right. Try a billion or so.

EDIT:

shit car's been running. lemme fix this up later

Pete · « **Reply #19 on:** January 25, 2019, 10:29:42 am »

Check your refrigerator. If it's running too, the chances are it's having an affair with your car. You'll probably be back soon, because history proves frigid relationships don't last too long.

Seriously Bill, billions of data entries and a hash algorithm to accommodate that project? What, thousands of collisions that need to be sorted? I would think at that level, you would at least need to combine our methods to get adequate results.

Pete

SMcNeill · « **Reply #20 on:** January 25, 2019, 11:15:27 am »

Quote from: STxAxTIC on January 25, 2019, 10:11:06 am

...adjust it for a dataset size typical of a respectable database. A few billion if you can. There's the bait. You might even be right. Try a billion or so.

Here's a list with 10 million "words" in it, and it'll easily scale to whatever size you want (as long as you have the memory to run it):

Code: QB64: [Select]

DEFLNG A-Z
DIM SHARED HashTableSize AS LONG
HashTableSize = 10000019 ' Best to use a big prime number.
 
DIM SHARED LB AS STRING ' Make sure that bracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{"
RB = "}"
 
PRINT "Reserving memory for Hashtable..."
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
 
 
CONST limit = 10000000
DIM SHARED WordList(limit) AS STRING
 
PRINT "Generating List..."
FOR i = 1 TO limit
    WordList(i) = STR$(i) 'generate a list of "words"
NEXT
 
PRINT "Sorting list to alphabetize it..."
Sort WordList() 'to put it in alphabetic order and not numeric
 
PRINT "Creating Hash Table..."
FOR i = 1 TO limit
    b$ = WordList(i) 'the word to store
    d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
    EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB
NEXT
PRINT "Done creating Hash Table."
 
' Done developing fast lookup tool. Now time for an application.
 
PRINT "Generating search list..."
DIM RandomWords(limit) AS STRING
RandomWords() = WordList() 'lets just look up all the damn words
 
 
PRINT "Searching using Hash Table"
t# = TIMER
FOR i = 1 TO limit
    l$ = Lookup$(RandomWords(i))
NEXT
PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
 
PRINT "Searching using Binary Search"
t# = TIMER
FOR i = 1 TO limit
    index = FindIndex(RandomWords(i))
    'PRINT i, index, "?"; RandomWords(index)
    'SLEEP
NEXT
PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
 
 
FUNCTION Lookup$ (a AS STRING)
    r$ = ""
    b$ = EnglishDictionary(HashFunc(a))
    c$ = ""
    d$ = ""
    IF b$ <> "" THEN
        DO WHILE c$ <> a
            c$ = ReturnBetween(b$, LB, RB)
            IF c$ = "" THEN EXIT DO
            b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
            'd$ = ReturnBetween(b$, LB, RB)
        LOOP
    END IF
    r$ = a '+ "  " + d$
    Lookup$ = r$
END FUNCTION
 
FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' input string, left bracket, right bracket
    i = INSTR(a, b)
    j = INSTR(a, c)
    f = LEN(c)
    ReturnBetween$ = MID$(a, i + f, j - (i + f))
END FUNCTION
 
FUNCTION HashFunc (s$)
    DIM hash~&, i
    hash~& = 5381
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i)) MOD HashTableSize
    NEXT i
    HashFunc = hash~&
END FUNCTION
 
 
FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
    j = INSTR(a, b)
    IF j > 0 THEN
        r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
    ELSE
        r$ = a
    END IF
    ReplaceSubString$ = r$
END FUNCTION
 
 
SUB Sort (Array() AS STRING)
    'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
    'It's more than fast enough for our needs here I think.  ;)
    gap = UBOUND(array)
    DO
        gap = 10 * gap \ 13
        IF gap < 1 THEN gap = 1
        i = 0
        swapped = 0
        DO
            IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
                SWAP Array(i), Array(i + gap)
                swapped = -1
            END IF
            i = i + 1
        LOOP UNTIL i + gap > UBOUND(Array)
    LOOP UNTIL swapped = 0 AND gap = 1
END SUB
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 'lbound(wordlist)
    max = limit + 1
 
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (max + min) \ 2
        compare = _STRCMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap
        ELSEIF compare < 0 THEN
            max = gap
        ELSE
            FindIndex = gap
            found = -1
            EXIT FUNCTION
        END IF
        IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
        'PRINT min, max, search$, WordList(gap), compare
        'SLEEP
    LOOP
END FUNCTION
 

One word of caution though: This won't run in QB64x32...

You wanted a hash table size larger than the data set ("Also your wordlist of 400k+ entries exceeded the value of the hash table size of 300007. You wired it for collisions in other words."), and simply reserving space for it in memory takes up about 2.6GB of RAM, as you can see just from the snippet below:

Code: QB64: [Select]

DEFLNG A-Z
DIM SHARED HashTableSize AS LONG
HashTableSize = 10000019 ' Best to use a big prime number.
 
DIM SHARED LB AS STRING ' Make sure that bracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{"
RB = "}"
 
PRINT "Reserving memory for Hashtable..."
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.

Once the data is all filled and in memory, it requires 3,418.6MB to run without crashing.

(And, for what it's worth, the binary search still beats the hash table for speed on my PC: the hash table takes 6 seconds, the binary search 4.5.)

_vince · « **Reply #21 on:** January 25, 2019, 11:56:03 am »

Quote from: Pete on January 24, 2019, 09:29:23 pm

Quote from: STxAxTIC on January 24, 2019, 09:14:15 pm
Steve, the big-O notation is designed to specifically not compare apples to oranges. Dicscrete calculus will be a lesson for another time... I just cant right now, my facepalm hurts.

Dicscrete calculus, Bill? I think you'd better choose a search method and get that spell checker up and working for craps sake.

If you mean discrete calculus, is that one of those math courses offered at night for active singles? If so, what is the probability of completing the course without contracting a nasty case of standard deviations?

Pete :D

Either way, the dic will secrete

Pete · « **Reply #22 on:** January 25, 2019, 12:01:38 pm »

They say comedy is contagious, but I'm not sure anyone wants to catch it that way.

Pete :D

Ed Davis · « **Reply #23 on:** January 25, 2019, 12:04:24 pm »

Well, I thought I would give this a try.

I took the original code, and tried to make it "apples to apples" as much as possible.

Note that for the hash function proper, no need to do a mod on each iteration. One at the end is enough. Also, I redid the hash lookup, as it was doing lots of extra work.

On my machine, hashing wins: .109 vs .219. About twice as fast.

Code: QB64: [Select]

DIM SHARED HashTableSize AS LONG
HashTableSize = 500009 ' Best to use a big prime number. A bigger example is 611953.
 
DIM SHARED LB AS STRING ' Make sure that bracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{"
RB = "}"
 
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
 
DIM SHARED WordList(466545) AS STRING
PRINT "Loading library"
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
 
print "Sorting wordlist"
Sort WordList()
 
print "Creating Hash Table"
i = 0
FOR i = 1 TO ubound(wordlist)
    b$ = WordList(i) 'the word to store
    d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
    if d < 0 then
        print "hash out of range"
        end
    end if
    EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB
NEXT
CLOSE #1
 
' Done developing fast lookup tool. Now time for an application.
 
const ITER = 75000
 
PRINT "Creating "; ITER; " random word lookup table"
DIM RandomWords(ITER) AS STRING
FOR i = 1 TO ITER
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
 
Print "Search via Hashing...       ";
t# = TIMER
FOR i = 1 TO ITER
    l$ = Lookup$(RandomWords(i))
    if l$ <> RandomWords(i) then
        print "HashTable error - searched: ", RandomWords(i), " found: ", l$
        end
    end if
NEXT
PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
 
Print "search via Binary Search... ";
t# = TIMER
FOR i = 1 TO ITER
    index = FindIndex(RandomWords(i))
    if WordList(index) <> RandomWords(i) then
        print "BinarySearch error - searched: ", RandomWords(i), " found: ", WordList(index)
    end if
NEXT
PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
 
PRINT
PRINT "And just to compare results -- the first 10 words:"
FOR i = 1 TO 10
    'PRINT "Searching for: "; RandomWords(i),
    l$ = Lookup$(RandomWords(i))
    IF l$ <> "" THEN
        PRINT l$, 'Lookup$ now returns the word itself; no index is stored in the table
    ELSE
        PRINT "NOT FOUND!",
    END IF
    index = FindIndex(RandomWords(i))
    'PRINT "Searching for: "; RandomWords(i),
    IF index THEN 'to show that we got the index back successfully
        PRINT WordList(index)
    ELSE
        PRINT "NOT FOUND!"
    END IF
NEXT
 
FUNCTION Lookup$ (s AS STRING)
    haystack$ = EnglishDictionary(HashFunc(s))
    needle$ = LB + s + RB
    p = instr(haystack$, needle$)
    if p = 0 then
        Lookup$ = ""
    else
        Lookup$ = mid$(haystack$, p + 1, LEN(needle$) - 2)
    end if
END FUNCTION
 
FUNCTION HashFunc (s AS STRING) ' input string
    DIM hash~&, i
 
    hash~& = 5381
 
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i))
    NEXT i
 
    HashFunc = hash~& mod HashTableSize
END FUNCTION
 
SUB Sort (Array() AS STRING)
    'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
    'It's more than fast enough for our needs here I think.  ;)
    gap = UBOUND(array)
    DO
        gap = 10 * gap \ 13
        IF gap < 1 THEN gap = 1
        i = 0
        swapped = 0
        DO
            IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
                SWAP Array(i), Array(i + gap)
                swapped = -1
            END IF
            i = i + 1
        LOOP UNTIL i + gap > UBOUND(Array)
    LOOP UNTIL swapped = 0 AND gap = 1
END SUB
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = lbound(wordlist)
    max = ubound(wordlist)
 
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (max + min) \ 2
        compare = _STRCMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap
        ELSEIF compare < 0 THEN
            max = gap
        ELSE
            FindIndex = gap
            found = -1
            EXIT FUNCTION
        END IF
        IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
END FUNCTION
 
 

Many years ago (1990 to be precise), when I was needing something like this for a project I was doing in C, I ran extensive tests on different search functions - avltree, binarytree, binary search, hashing (I tested 135 different hash functions), custom lists tuned to the data, and a few more.

I reran some of those tests this morning.

In C at least, with my code (e.g., a custom binary search, instead of the built-in bsearch), hashing is about 2 times faster than binary searching. So pretty much what I found for the code above.

Per the literature, it is even faster to use a power of 2 for the table size, rather than a prime, and then AND off the appropriate bits instead of doing the mod at the end of the hash function. In my tests with C, I did find this to be true, although it was just slightly faster, not strikingly so. But of course, when you have potentially many millions of lookups, then every bit of speed counts. Also note that the power of 2 table size assumes a high quality hash function. Otherwise, it is best to stick with a prime number as the table size.
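A sketch of the masking idea, assuming the same multiply-by-33 hash used above and a table of 2^19 = 524,288 slots, the first power of two above 466,544 (TableSize and HashMasked are illustrative names, not from the posted program):

```qb64
' Sketch only: reduce the hash with a bit mask instead of MOD.
CONST TableSize = 524288 ' 2 ^ 19
PRINT HashMasked&("dog")
PRINT HashMasked&("dogfood")

FUNCTION HashMasked& (s AS STRING)
    DIM hash AS _UNSIGNED LONG, i AS LONG
    hash = 5381
    FOR i = 1 TO LEN(s)
        hash = (hash * 33) XOR ASC(s, i)
    NEXT i
    HashMasked& = hash AND (TableSize - 1) ' keeps the low 19 bits: 0 to 524287
END FUNCTION
```

With a power-of-two size, `hash AND (TableSize - 1)` gives the same result as `hash MOD TableSize` for a non-negative hash, without the division. The mask only looks at the low bits, which is why the quality of the hash function matters more here than with a prime table size.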

Anyway, cool topic! Thanks to everyone who shared something on this!

SMcNeill · « **Reply #24 on:** January 25, 2019, 12:23:35 pm »

I’m getting perfectly equal time tests on my PC, with both taking 0.164 seconds to run. Either way, both methods are generally fast enough to hold up and work in most situations.

The biggest advantage to the hash list is that you don't need sorted data, whereas a binary search has lower memory requirements. My advice is to choose whichever you prefer; either will usually work in most cases.

SMcNeill · « **Reply #25 on:** January 25, 2019, 01:32:58 pm »

Taking Ed's improvements, I played around and tweaked the hash routine even further, making it a much faster program:

Code: QB64: [Select]

DIM SHARED HashTableSize AS LONG
HashTableSize = 500009 ' Best to use a big prime number. A bigger example is 611953.
 
DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
 
DIM SHARED WordList(466545) AS STRING
PRINT "Loading library"
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
 
PRINT "Creating Hash Table"
i = 0
FOR i = 1 TO UBOUND(wordlist)
    b$ = WordList(i) 'the word to store
    d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
    IF d < 0 THEN
        PRINT "hash out of range"
        END
    END IF
    EnglishDictionary(d) = EnglishDictionary(d) + CHR$(LEN(b$)) + b$
NEXT
CLOSE #1
 
' Done developing fast lookup tool. Now time for an application.
 
 
 
CONST ITER = 75000
 
PRINT "Creating "; ITER; " random word lookup table"
DIM RandomWords(ITER) AS STRING
FOR i = 1 TO ITER
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
 
PRINT "Search via Hashing...       ";
t# = TIMER
FOR i = 1 TO ITER
    l$ = Lookup$(RandomWords(i))
    IF l$ <> RandomWords(i) THEN
        PRINT "HashTable error - searched: ", RandomWords(i), " found: ", l$
        END
    END IF
NEXT
PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
 
FUNCTION Lookup$ (s AS STRING)
    haystack$ = EnglishDictionary(HashFunc(s))
    size = 1: l = 1
    DO UNTIL needle$ = s OR size = 0
        size = ASC(haystack$, l)
        needle$ = MID$(haystack$, l + 1, size)
        l = l + size + 1
    LOOP
    IF size > 0 THEN Lookup$ = needle$
END FUNCTION
 
FUNCTION HashFunc (s AS STRING) ' input string
    DIM hash~&, i
    hash~& = 5381
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i))
    NEXT i
    HashFunc = hash~& MOD HashTableSize
END FUNCTION

What's the main difference here?

EnglishDictionary(d) = EnglishDictionary(d) + CHR$(LEN(b$)) + b$

We're not storing our data separated by delimiters. Instead, we're manually recording the size of each entry and using it to directly retrieve our information. (Side note: This is a simple implementation of those dreaded "linked lists" which STx is so obsessive over. It's hard to believe he didn't implement one from the start, since he's the resident guru on the subject for us!)

By going this route, we don't need to rely on INSTR to search for the various entries for us, giving us two distinct advantages:

1) It's much faster to "jump" from word to word, than it is to search byte by byte for a match.

2) It's less likely to generate any false positives, as we're retrieving the exact string from memory and comparing. INSTR("dogfood", "dog") is a match -- there's a "dog" in "dogfood", but "dog" <> "dogfood" if we compare directly.

Like most code, when it comes to speed, the devil's in the details.

Pete · « **Reply #26 on:** January 25, 2019, 01:48:36 pm »

I've already pointed out INSTR requires use of a delimiter like a comma.

dog - dogfood is easily solved that way...

wordlist$ = ",bill,cat,dog,dogfood,pete,steve,"
a = INSTR(wordlist$,",dog,")

dogfood would not be returned, just dog. Besides... dog food is two words. :P

Hey, did you try my INSTR search and frequency sample? It's dog gone fast!
https://www.qb64.org/forum/index.php?topic=1001.msg102017#msg102017

Pete

SMcNeill · « **Reply #27 on:** January 25, 2019, 02:04:03 pm »

Quote from: Pete on January 25, 2019, 01:48:36 pm

I've already pointed out INSTR requires use of a delimiter like a comma.

dog - dogfood is easily solved that way...

wordlist$ = ",bill,cat,dog,dogfood,pete,steve,"
a = INSTR(wordlist$,",dog,")

dogfood would not be returned, just dog. Besides... dog food is two words. :P

Hey, did you try my INSTR search and frequency sample? It's dog gone fast!
https://www.qb64.org/forum/index.php?topic=1001.msg102017#msg102017

Pete

That's the point I was making with the previous post -- instead of using INSTR at all, which requires you to SEARCH the string for your words, store the length in front of each entry instead.

wordlist$ = "4bill3cat3dog7dogfood4pete5steve"

here you read 4, then get 4 characters for a string -- "bill"
then read 3, get 3 characters for "cat"...
then read 3, get 3 characters for "dog" -- MATCH!

Your method works like this:

wordlist$ = ",bill,cat,dog,dogfood,pete,steve,"
a = INSTR(wordlist$,",dog,")

byte 1 -- ",bill" <> ",dog,"
byte 2 -- "bill," <> ",dog,"
byte 3 -- "ill,c" <> ",dog,"
byte 4 -- "ll,ca" <> ",dog,"
... and on byte by byte until ",dog," = ",dog,"

That's a 33-character string, and ",dog," is 5 characters, so there are up to 29 starting positions for the computer to check behind the scenes while it looks for the first occurrence of your string...

compared to a max of 6 comparisons if we simply read each word all at once.

When dealing with a process which may be repeated in multiple loops, over and over, it ends up making a difference in performance. ;)
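The word-by-word walk can be sketched as follows. It assumes each entry is stored as one length byte plus the word, the same CHR$(LEN(b$)) + b$ layout used in the hash table code earlier; Pack$, FindPrefixed and steps are names made up for the illustration:

```qb64
DIM bucket AS STRING, steps AS LONG, found AS LONG
bucket = Pack$("bill") + Pack$("cat") + Pack$("dog") + Pack$("dogfood") + Pack$("pete") + Pack$("steve")

found = FindPrefixed&(bucket, "dog", steps)
PRINT "dog: position"; found; "after"; steps; "word comparisons"
found = FindPrefixed&(bucket, "do", steps)
PRINT "do: position"; found; "(0 means not found) after"; steps; "word comparisons"

FUNCTION Pack$ (w AS STRING)
    Pack$ = CHR$(LEN(w)) + w ' one length byte, so words are limited to 255 characters
END FUNCTION

FUNCTION FindPrefixed& (bucket AS STRING, target AS STRING, steps AS LONG)
    DIM p AS LONG, size AS LONG
    p = 1: steps = 0
    DO WHILE p <= LEN(bucket)
        size = ASC(bucket, p) ' read the length byte
        steps = steps + 1
        IF MID$(bucket, p + 1, size) = target THEN
            FindPrefixed& = p
            EXIT FUNCTION
        END IF
        p = p + size + 1 ' jump straight to the next length byte
    LOOP
    FindPrefixed& = 0
END FUNCTION
```

Here "dog" is found on the third comparison, while "do" walks all six entries and comes back with 0, where a plain INSTR(bucket, "do") would have reported a hit inside "dog".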

Pete · « **Reply #28 on:** January 25, 2019, 02:12:07 pm »

But that's what seed in INSTR() is for, so you skip from delimiter to delimiter.

See the same post and the second code box, which uses seed% to deal with lookup frequency within a random word list.

Pete

SMcNeill · « **Reply #29 on:** January 25, 2019, 07:26:23 pm »

Quote from: Pete on January 25, 2019, 02:12:07 pm

But that's what seed in INSTR() is for, so you skip from delimiter to delimiter.

See the same post and the second code box, which uses seed% to deal with lookup frequency within a random word list.

Pete

Seed gives you a starting point, but you still check byte by byte for a match...

a$ = ",1,2,3,4,5,3,7,8,"

seed = INSTR(seed + 1, a$, ",3,")

Run first, it does a compare at byte 1, then byte 2, then 3... at byte 5 it finds ",3," -- 5 comparisons.

Then you have:

seed = seed + LEN(",3,") '5 + 3
seed = INSTR(seed, a$, ",3,")

Now you start at byte 8 and search byte by byte until you get to byte 11... 4 more comparisons.

INSTR by its nature is a byte by byte compare routine (it has to be, or else how would it know where the search string appears at? A random guess?)

wordlist$ = "4bill3cat3dog7dogfood4pete5steve0"

Placing the size first lets you advance by words, not by bytes.

See the difference, and why one is faster than the other?
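For reference, the seed-stepping loop under discussion looks roughly like this (a sketch using the sample string above; hits is just a counter added for illustration):

```qb64
DIM a AS STRING, seed AS LONG, hits AS LONG
a = ",1,2,3,4,5,3,7,8,"
seed = 0
DO
    seed = INSTR(seed + 1, a, ",3,") ' resume one byte past the previous match
    IF seed = 0 THEN EXIT DO ' no further occurrence
    hits = hits + 1
    PRINT "match at byte"; seed
LOOP
PRINT hits; "matches"
```

This version advances the seed by one byte rather than by LEN(",3,"), so two matches that share a delimiter, as in ",3,3,", are both found. Either way, INSTR still scans byte by byte from the seed.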
