Author Topic: Binary Search Method  (Read 15512 times)

0 Members and 1 Guest are viewing this topic.

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Binary Search Method
« Reply #15 on: January 25, 2019, 09:34:54 am »
As predicted, it's not apples-to-apples.

Gotta replace my hash function which uses a cosine, and then a completely separate recursive string function just to remove the decimal point. The ReplaceSubString function should be gone from the hash function - use Lukes suggestion instead.

Also your wordlist of 400k+ entries exceeded the value of the hash table size of 300007. You wired it for collisions in other words. Not sure if you noticed or if that's an accidental straw man.

This isn't worth beans until you get an honest comparison.

EDIT

LOL and I just noticed you're using the hash table wrong. Why on earth are you storing the index (as a collision) next to the word? This is a speed test - trim that fat!

... And once we see where this test finally goes, let's push the record limit to the maximum and see how each one does with like, a billion words.
« Last Edit: January 25, 2019, 09:53:26 am by STxAxTIC »
You're not done when it works, you're done when it's right.

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #16 on: January 25, 2019, 10:01:20 am »
Bill, you made my point about the problems with hashing. With my methods and Steve's method, you don't need a complicated pre-plan, which may need to be overhauled or adjusted someday. Just alphabetize the list, which a one-time never change it algorithm.

I do love the hashing method and it is a damn clever fast and efficient way to set up a list to be searched and indexed. I suppose the merits in any of these methods depends on the use. For instance, indexing doesn't mean squat in a spell check routine, so I need that feature like a fish needs a bicycle. That's one reason why even my quick and dirty QB64 instr() method is appealing to me. So while you keep searching for the Holly Grail of hash, Steve and I just grab a couple of Dixie cups and drink up.

In any event, good luck with your hash refinements and thank you for starting this awesome topic for discussion, originally in the, "Dictionizer" thread, and to Steve for continuing it here. Guys, we need more critical thinking threads like these around here.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #17 on: January 25, 2019, 10:02:36 am »
As predicted, it's not apples-to-apples.

Gotta replace my hash function which uses a cosine, and then a completely separate recursive string function just to remove the decimal point. The ReplaceSubString function should be gone from the hash function - use Lukes suggestion instead.

Also your wordlist of 400k+ entries exceeded the value of the hash table size of 300007. You wired it for collisions in other words. Not sure if you noticed or if that's an accidental straw man.

This isn't worth beans until you get an honest comparison.

I told you you were comparing apples to peanut butter...

The Binary Search is a "search and done" method for data retrieval.  The Hash Table is an "instant jump to a spot in memory and then parse a much smaller data set" method for data retrieval.  As long as you're going to have collisions and need to search for those, you have to take that search time into consideration.

Using your suggestions of "Luke's method" + "Table Larger Than List", we end up with the following code:

Code: QB64: [Select]
  1. DIM SHARED HashTableSize AS LONG
  2. HashTableSize = 611953 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
  3.  
  4. DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
  5. LB = "{"
  6. RB = "}"
  7.  
  8. DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
  9.  
  10. OPEN "466544 Word List.txt" FOR BINARY AS #1
  11.  
  12. DIM SHARED WordList(466545) AS STRING
  13. PRINT "Loading library"
  14.     count = count + 1
  15.     LINE INPUT #1, WordList(count)
  16.  
  17. Sort WordList()
  18.  
  19. i = 0
  20. FOR i = 1 TO 466545
  21.     b$ = WordList(i) 'the word to store
  22.     c$ = LTRIM$(RTRIM$(STR$(i))) 'to store the index
  23.     d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
  24.     EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB + LB + c$ + RB
  25. PRINT "Done creating Hash Table."
  26.  
  27. ' Done developing fast lookup tool. Now time for an application.
  28.  
  29. PRINT "Looking up"
  30. DIM RandomWords(50000) AS STRING
  31. FOR i = 1 TO 50000
  32.     c = INT(RND * 466544) + 1
  33.     RandomWords(i) = WordList(c)
  34.  
  35. t# = TIMER
  36. FOR i = 1 TO 50000
  37.     'PRINT "Searching for: "; RandomWords(i),
  38.     l$ = Lookup$(RandomWords(i))
  39.     'IF l$ <> "" THEN
  40.     'l = INSTR(l$, " ")
  41.     'word$ = LEFT$(l$, l - 1)
  42.     'index = VAL(MID$(l$, l + 1))
  43.     'PRINT WordList(index) 'to show that we got the index back successfully
  44.     'ELSE
  45.     'PRINT "NOT FOUND!"
  46.     'END IF
  47. PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
  48.  
  49. t# = TIMER
  50. FOR i = 1 TO 50000
  51.     index = FindIndex(RandomWords(i))
  52.     'PRINT "Searching for: "; RandomWords(i),
  53.     'IF index THEN 'to show that we got the index back successfully
  54.     '    PRINT WordList(index)
  55.     'ELSE
  56.     '    PRINT "NOT FOUND!"
  57.     'END IF
  58. PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
  59.  
  60. PRINT "And just to compare results -- the first 15 words:"
  61. FOR i = 1 TO 10
  62.     'PRINT "Searching for: "; RandomWords(i),
  63.     l$ = Lookup$(RandomWords(i))
  64.     IF l$ <> "" THEN
  65.         l = INSTR(l$, " ")
  66.         word$ = LEFT$(l$, l - 1)
  67.         index = VAL(MID$(l$, l + 1))
  68.         PRINT WordList(index), 'to show that we got the index back successfully
  69.     ELSE
  70.         PRINT "NOT FOUND!",
  71.     END IF
  72.     index = FindIndex(RandomWords(i))
  73.     'PRINT "Searching for: "; RandomWords(i),
  74.     IF index THEN 'to show that we got the index back successfully
  75.         PRINT WordList(index)
  76.     ELSE
  77.         PRINT "NOT FOUND!"
  78.     END IF
  79.  
  80.  
  81.  
  82.  
  83. FUNCTION Lookup$ (a AS STRING)
  84.     r$ = ""
  85.     b$ = EnglishDictionary(HashFunc(a))
  86.     c$ = ""
  87.     d$ = ""
  88.     IF b$ <> "" THEN
  89.         DO WHILE c$ <> a
  90.             c$ = ReturnBetween(b$, LB, RB)
  91.             IF c$ = "" THEN EXIT DO
  92.             b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
  93.             d$ = ReturnBetween(b$, LB, RB)
  94.         LOOP
  95.     END IF
  96.     r$ = a + "  " + d$
  97.     Lookup$ = r$
  98.  
  99. FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' input string, left bracket, right bracket
  100.     i = INSTR(a, b)
  101.     j = INSTR(a, c)
  102.     f = LEN(c)
  103.     ReturnBetween$ = MID$(a, i + f, j - (i + f))
  104.  
  105. FUNCTION HashFunc (s$)
  106.     DIM hash~&, i
  107.     hash~& = 5381
  108.     FOR i = 1 TO LEN(s$)
  109.         hash~& = ((hash~& * 33) XOR ASC(s$, i)) MOD 611953
  110.     NEXT i
  111.     HashFunc = hash~&
  112.  
  113.  
  114. FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
  115.     j = INSTR(a, b)
  116.     IF j > 0 THEN
  117.         r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
  118.     ELSE
  119.         r$ = a
  120.     END IF
  121.     ReplaceSubString$ = r$
  122.  
  123.  
  124. SUB Sort (Array() AS STRING)
  125.     'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
  126.     'It's more than fast enough for our needs here I th ink.  ;)
  127.     gap = UBOUND(array)
  128.     DO
  129.         gap = 10 * gap \ 13
  130.         IF gap < 1 THEN gap = 1
  131.         i = 0
  132.         swapped = 0
  133.         DO
  134.             IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
  135.                 SWAP Array(i), Array(i + gap)
  136.                 swapped = -1
  137.             END IF
  138.             i = i + 1
  139.         LOOP UNTIL i + gap > UBOUND(Array)
  140.     LOOP UNTIL swapped = 0 AND gap = 1
  141.  
  142. FUNCTION FindIndex (search$)
  143.     SHARED SearchTimes, LastIndex
  144.     SearchTimes = 0
  145.     min = 1 'lbound(wordlist)
  146.     max = 466544 'ubound(wordlist)
  147.  
  148.     DO UNTIL found
  149.         SearchTimes = SearchTimes + 1
  150.         gap = (max + min) \ 2
  151.         compare = _STRCMP(search$, WordList(gap))
  152.         IF compare > 0 THEN
  153.             min = gap
  154.         ELSEIF compare < 0 THEN
  155.             max = gap
  156.         ELSE
  157.             FindIndex = gap
  158.             found = -1
  159.             EXIT FUNCTION
  160.         END IF
  161.         IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
  162.         ' PRINT min, max, search$, WordList(gap), compare
  163.         ' SLEEP
  164.     LOOP
  165.  
  166.  

Faster than previously by about half, but still about twice as slow as the  binary search method...

Regardless; I think the results speak to show that when your data is already sorted, doing a binary search of it is rather quite efficient. 

Anyway, I think I've did as you asked:

Quote
I think the burden's on you to do the speed test m8 (for a large data set with lots of performance demand).

If it's *wrong* at this point, I think the burden's on you to show us where and how.
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Binary Search Method
« Reply #18 on: January 25, 2019, 10:11:06 am »
Nah, I don't need to prove shit to you guys again on this. Fun talk though. Someone who takes the time to sift through this convo, and the way Steve split it among two threads, will get a real chuckle.

If you really want a *win* on this Steve, adjust it for a dataset size typical of a respectable database. A few billion if you can. There's the bait. You might even be right. Try a billion or so.

EDIT:

shit car's been running. lemme fix this up later
« Last Edit: January 25, 2019, 10:17:01 am by STxAxTIC »
You're not done when it works, you're done when it's right.

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #19 on: January 25, 2019, 10:29:42 am »
Check your refrigerator. If it's running too, the chances are it's having an affair with your car. You'll probably be back soon, because history proves frigid relationships don't last too long.

Seriously Bill, billions of data entries and a hash algorithm to accommodate that project?  What, thousands of collisions that need to be sorted? I would think at that level, you would at least need to combine our methods to get adequate results.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #20 on: January 25, 2019, 11:15:27 am »
...adjust it for a dataset size typical of a respectable database. A few billion if you can. There's the bait. You might even be right. Try a billion or so.

Here's you a list with 10 million "words" in it, but it'll easily scale to whatever size you want (as long as you have the memory limits to run it):

Code: QB64: [Select]
  1. DEFLNG A-Z
  2. DIM SHARED HashTableSize AS LONG
  3. HashTableSize = 10000019 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
  4.  
  5. DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
  6. LB = "{"
  7. RB = "}"
  8.  
  9. PRINT "Reserving memory for Hashtable..."
  10. DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
  11.  
  12.  
  13. CONST limit = 10000000
  14. DIM SHARED WordList(limit) AS STRING
  15.  
  16. PRINT "Generating List..."
  17. FOR i = 1 TO limit
  18.     WordList(i) = STR$(i) 'generate a list of "words"
  19.  
  20. PRINT "Sorting list to alphabeticlize it..."
  21. Sort WordList() 'to put it in alphabetic order and not numeric
  22.  
  23. PRINT "Creating Hash Table..."
  24. FOR i = 1 TO limit
  25.     b$ = WordList(i) 'the word to store
  26.     d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
  27.     EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB
  28. PRINT "Done creating Hash Table."
  29.  
  30. ' Done developing fast lookup tool. Now time for an application.
  31.  
  32. PRINT "Generating search list..."
  33. DIM RandomWords(limit) AS STRING
  34. RandomWords() = WordList() 'lets just look up all the damn words
  35.  
  36.  
  37. PRINT "Searching using Hash Table"
  38. t# = TIMER
  39. FOR i = 1 TO limit
  40.     l$ = Lookup$(RandomWords(i))
  41. PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
  42.  
  43. PRINT "Searching using Binary Search"
  44. t# = TIMER
  45. FOR i = 1 TO limit
  46.     index = FindIndex(RandomWords(i))
  47.     'PRINT i, index, "?"; RandomWords(index)
  48.     'SLEEP
  49. PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
  50.  
  51.  
  52. FUNCTION Lookup$ (a AS STRING)
  53.     r$ = ""
  54.     b$ = EnglishDictionary(HashFunc(a))
  55.     c$ = ""
  56.     d$ = ""
  57.     IF b$ <> "" THEN
  58.         DO WHILE c$ <> a
  59.             c$ = ReturnBetween(b$, LB, RB)
  60.             IF c$ = "" THEN EXIT DO
  61.             b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
  62.             'd$ = ReturnBetween(b$, LB, RB)
  63.         LOOP
  64.     END IF
  65.     r$ = a '+ "  " + d$
  66.     Lookup$ = r$
  67.  
  68. FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' input string, left bracket, right bracket
  69.     i = INSTR(a, b)
  70.     j = INSTR(a, c)
  71.     f = LEN(c)
  72.     ReturnBetween$ = MID$(a, i + f, j - (i + f))
  73.  
  74. FUNCTION HashFunc (s$)
  75.     DIM hash~&, i
  76.     hash~& = 5381
  77.     FOR i = 1 TO LEN(s$)
  78.         hash~& = ((hash~& * 33) XOR ASC(s$, i)) MOD HashTableSize
  79.     NEXT i
  80.     HashFunc = hash~&
  81.  
  82.  
  83. FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
  84.     j = INSTR(a, b)
  85.     IF j > 0 THEN
  86.         r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
  87.     ELSE
  88.         r$ = a
  89.     END IF
  90.     ReplaceSubString$ = r$
  91.  
  92.  
  93. SUB Sort (Array() AS STRING)
  94.     'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
  95.     'It's more than fast enough for our needs here I th ink.  ;)
  96.     gap = UBOUND(array)
  97.     DO
  98.         gap = 10 * gap \ 13
  99.         IF gap < 1 THEN gap = 1
  100.         i = 0
  101.         swapped = 0
  102.         DO
  103.             IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
  104.                 SWAP Array(i), Array(i + gap)
  105.                 swapped = -1
  106.             END IF
  107.             i = i + 1
  108.         LOOP UNTIL i + gap > UBOUND(Array)
  109.     LOOP UNTIL swapped = 0 AND gap = 1
  110.  
  111. FUNCTION FindIndex (search$)
  112.     SHARED SearchTimes, LastIndex
  113.     SearchTimes = 0
  114.     min = 1 'lbound(wordlist)
  115.     max = limit + 1
  116.  
  117.     DO UNTIL found
  118.         SearchTimes = SearchTimes + 1
  119.         gap = (max + min) \ 2
  120.         compare = _STRCMP(search$, WordList(gap))
  121.         IF compare > 0 THEN
  122.             min = gap
  123.         ELSEIF compare < 0 THEN
  124.             max = gap
  125.         ELSE
  126.             FindIndex = gap
  127.             found = -1
  128.             EXIT FUNCTION
  129.         END IF
  130.         IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
  131.         'PRINT min, max, search$, WordList(gap), compare
  132.         'SLEEP
  133.     LOOP
  134.  

One word of caution though:  This won't run in QB64x32...

You wanted a hash table size larger than the data set ("Also your wordlist of 400k+ entries exceeded the value of the hash table size of 300007. You wired it for collisions in other words."), and simply reserving space for it in memory takes up about 2.6GB of RAM, as you can see just from the snippet below:

Code: QB64: [Select]
  1. DEFLNG A-Z
  2. DIM SHARED HashTableSize AS LONG
  3. HashTableSize = 10000019 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
  4.  
  5. DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
  6. LB = "{"
  7. RB = "}"
  8.  
  9. PRINT "Reserving memory for Hashtable..."
  10. DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.

Once the data is all filled and in memory, it requires 3,418.6MB to run without  crashing.

(And, for what it's worth, the binary search still beats the hash table for speed, on my PC, with one taking 6 seconds, the other taking 4.5.)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline _vince

  • Seasoned Forum Regular
  • Posts: 422
    • View Profile
Re: Binary Search Method
« Reply #21 on: January 25, 2019, 11:56:03 am »
Steve, the big-O notation is designed to specifically not compare apples to oranges. Dicscrete calculus will be a lesson for another time... I just cant right now, my facepalm hurts.

Dicscrete calculus, Bill? I think you'd better chose a search method and get that spell checker up and working for craps sake.

If you mean discrete calculus, is that one of those math courses offered at night for active singles? If so, what is the probably of completing the course without contracting a nasty case of standard deviations?

Pete :D

Either way, the dic will secrete

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #22 on: January 25, 2019, 12:01:38 pm »
They say comedy is contagious, but I'm not sure anyone wants to catch it that way.

Pete :D
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline Ed Davis

  • Newbie
  • Posts: 40
    • View Profile
Re: Binary Search Method
« Reply #23 on: January 25, 2019, 12:04:24 pm »
Well, I thought I would give this a try.

I took the original code, and tried to make it "apples to apples" as much as possible.

Note that for the hash function proper, no need to do a mod on each iteration.  One at the end is enough.  Also, I redid the hash lookup, as it was doing lots of extra work.

On my machine, hashing wins: .109 vs .219.  About twice as fast.

Code: QB64: [Select]
  1. DIM SHARED HashTableSize AS LONG
  2. HashTableSize = 500009' Best to use a big prime number. Bigger examples are 611953 and 1014729.
  3.  
  4. DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
  5. LB = "{"
  6. RB = "}"
  7.  
  8. DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
  9.  
  10. OPEN "466544 Word List.txt" FOR BINARY AS #1
  11.  
  12. DIM SHARED WordList(466545) AS STRING
  13. PRINT "Loading library"
  14.     count = count + 1
  15.     LINE INPUT #1, WordList(count)
  16.  
  17. print "Sorting wordlist"
  18. Sort WordList()
  19.  
  20. print "Creating Hash Table"
  21. i = 0
  22. FOR i = 1 TO ubound(wordlist)
  23.     b$ = WordList(i) 'the word to store
  24.     d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
  25.     if d < 0 then
  26.         print "hash out of range"
  27.         end
  28.     end if
  29.     EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB
  30.  
  31. ' Done developing fast lookup tool. Now time for an application.
  32.  
  33. const ITER = 75000
  34.  
  35. PRINT "Creating "; ITER; " random word lookup table"
  36. DIM RandomWords(ITER) AS STRING
  37. FOR i = 1 TO ITER
  38.     c = INT(RND * 466544) + 1
  39.     RandomWords(i) = WordList(c)
  40.  
  41. Print "Search via Hashing...       ";
  42. t# = TIMER
  43. FOR i = 1 TO ITER
  44.     l$ = Lookup$(RandomWords(i))
  45.     if l$ <> RandomWords(i) then
  46.         print "HashTable error - searched: ", RandomWords(i), " found: ", l$
  47.         end
  48.     end if
  49. PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
  50.  
  51. Print "search via Binary Search... ";
  52. t# = TIMER
  53. FOR i = 1 TO ITER
  54.     index = FindIndex(RandomWords(i))
  55.     if WordList(index) <> RandomWords(i) then
  56.         print "BinarySearch error - searched: ", RandomWords(i), " found: ", WordList(index)
  57.     end if
  58. PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
  59.  
  60. PRINT "And just to compare results -- the first 10 words:"
  61. FOR i = 1 TO 10
  62.     'PRINT "Searching for: "; RandomWords(i),
  63.     l$ = Lookup$(RandomWords(i))
  64.     IF l$ <> "" THEN
  65.         l = INSTR(l$, " ")
  66.         word$ = LEFT$(l$, l - 1)
  67.         index = VAL(MID$(l$, l + 1))
  68.         PRINT WordList(index), 'to show that we got the index back successfully
  69.     ELSE
  70.         PRINT "NOT FOUND!",
  71.     END IF
  72.     index = FindIndex(RandomWords(i))
  73.     'PRINT "Searching for: "; RandomWords(i),
  74.     IF index THEN 'to show that we got the index back successfully
  75.         PRINT WordList(index)
  76.     ELSE
  77.         PRINT "NOT FOUND!"
  78.     END IF
  79.  
  80. FUNCTION Lookup$ (s AS STRING)
  81.     haystack$ = EnglishDictionary(HashFunc(s))
  82.     needle$ = LB + s + RB
  83.     p = instr(haystack$, needle$)
  84.     if p = 0 then
  85.         Lookup$ = ""
  86.     else
  87.         Lookup$ = mid$(haystack$, p + 1, LEN(needle$) - 2)
  88.     end if
  89.  
  90. FUNCTION HashFunc (s AS STRING) ' input string
  91.     DIM hash~&, i
  92.  
  93.     hash~& = 5381
  94.  
  95.     FOR i = 1 TO LEN(s$)
  96.         hash~& = ((hash~& * 33) XOR ASC(s$, i))
  97.     NEXT i
  98.  
  99.     HashFunc = hash~& mod HashTableSize
  100.  
  101. SUB Sort (Array() AS STRING)
  102.     'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
  103.     'It's more than fast enough for our needs here I think.  ;)
  104.     gap = UBOUND(array)
  105.     DO
  106.         gap = 10 * gap \ 13
  107.         IF gap < 1 THEN gap = 1
  108.         i = 0
  109.         swapped = 0
  110.         DO
  111.             IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
  112.                 SWAP Array(i), Array(i + gap)
  113.                 swapped = -1
  114.             END IF
  115.             i = i + 1
  116.         LOOP UNTIL i + gap > UBOUND(Array)
  117.     LOOP UNTIL swapped = 0 AND gap = 1
  118.  
  119. FUNCTION FindIndex (search$)
  120.     SHARED SearchTimes, LastIndex
  121.     SearchTimes = 0
  122.     min = lbound(wordlist)
  123.     max = ubound(wordlist)
  124.  
  125.     DO UNTIL found
  126.         SearchTimes = SearchTimes + 1
  127.         gap = (max + min) \ 2
  128.         compare = _STRCMP(search$, WordList(gap))
  129.         IF compare > 0 THEN
  130.             min = gap
  131.         ELSEIF compare < 0 THEN
  132.             max = gap
  133.         ELSE
  134.             FindIndex = gap
  135.             found = -1
  136.             EXIT FUNCTION
  137.         END IF
  138.         IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
  139.         ' PRINT min, max, search$, WordList(gap), compare
  140.         ' SLEEP
  141.     LOOP
  142.  
  143.  

Many years ago (1990 to be precise), when I was needing something like this for a project I was doing in C, I ran extensive tests on different search functions - avltree, binarytree, binary search, hashing (I tested 135 different hash functions), custom lists tuned to the data, and a few more.

I reran some of those tests this morning.

In C at least, with my code (e.g., a custom binary search, instead of the built-in bsearch), hashing is about 2 times faster than binary searching.  So pretty much what I found for the code above.

Per the literature, it is even faster to use a power of 2 for the table size, rather than a prime for the table size, and then and off the appropriate bits instead of doing the mod at the end of the hash function.  In my tests with C, I did find this to be true, although it was just slightly faster, not strikingly so.  But of course, when you have potentially many millions of lookups, then every bit of speed counts.  Also note that the power of 2 table size assumes a high quality hash function.  Otherwise, it is best to stick with a prime number as the table size.

Anyway, cool topic!  Thanks to everyone who shared something on this!

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #24 on: January 25, 2019, 12:23:35 pm »
I’m getting perfectly equal time tests on my PC, with both taking 0.164 seconds to run.  Either way, both methods are generally fast enough to hold up and work in most situations.

The biggest advantage to the hash list is that you dont need sorted data, whereas a binary search has lower memory requirements.  My advice is to choose whichever you prefer; either will usually work in most cases.
Speed Test.jpg
* Speed Test.jpg (Filesize: 37.81 KB, Dimensions: 646x429, Views: 326)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #25 on: January 25, 2019, 01:32:58 pm »
Taking Ed's improvements, I played around and tweaked the hash routine even further, making it a  much faster program:

Code: QB64: [Select]
  1. DIM SHARED HashTableSize AS LONG
  2. HashTableSize = 500009 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
  3.  
  4. DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
  5.  
  6. OPEN "466544 Word List.txt" FOR BINARY AS #1
  7.  
  8. DIM SHARED WordList(466545) AS STRING
  9. PRINT "Loading library"
  10.     count = count + 1
  11.     LINE INPUT #1, WordList(count)
  12.  
  13. PRINT "Creating Hash Table"
  14. i = 0
  15. FOR i = 1 TO UBOUND(wordlist)
  16.     b$ = WordList(i) 'the word to store
  17.     d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
  18.     IF d < 0 THEN
  19.         PRINT "hash out of range"
  20.         END
  21.     END IF
  22.     EnglishDictionary(d) = EnglishDictionary(d) + CHR$(LEN(b$)) + b$
  23.  
  24. ' Done developing fast lookup tool. Now time for an application.
  25.  
  26.  
  27.  
  28. CONST ITER = 75000
  29.  
  30. PRINT "Creating "; ITER; " random word lookup table"
  31. DIM RandomWords(ITER) AS STRING
  32. FOR i = 1 TO ITER
  33.     c = INT(RND * 466544) + 1
  34.     RandomWords(i) = WordList(c)
  35.  
  36. PRINT "Search via Hashing...       ";
  37. t# = TIMER
  38. FOR i = 1 TO ITER
  39.     l$ = Lookup$(RandomWords(i))
  40.     IF l$ <> RandomWords(i) THEN
  41.         PRINT "HashTable error - searched: ", RandomWords(i), " found: ", l$
  42.         END
  43.     END IF
  44. PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
  45.  
  46. FUNCTION Lookup$ (s AS STRING)
  47.     haystack$ = EnglishDictionary(HashFunc(s))
  48.     size = 1: l = 1
  49.     DO UNTIL needle$ = s OR size = 0
  50.         size = ASC(haystack$, l)
  51.         needle$ = MID$(haystack$, l + 1, size)
  52.         l = l + size + 1
  53.     LOOP
  54.     IF size > 0 THEN Lookup$ = needle$
  55.  
  56. FUNCTION HashFunc (s AS STRING) ' input string
  57.     DIM hash~&, i
  58.     hash~& = 5381
  59.     FOR i = 1 TO LEN(s$)
  60.         hash~& = ((hash~& * 33) XOR ASC(s$, i))
  61.     NEXT i
  62.     HashFunc = hash~& MOD HashTableSize

What's the main difference here?

EnglishDictionary(d) = EnglishDictionary(d) + CHR$(LEN(b$)) + b$

We're not storing our data separated by delimiters.  Instead, we're manually recording the size of each entry and using it to directly retrieve our information.  (Side note: This is a simple implementation of those dreaded "linked lists" which STx is so obsessive over.  It's hard to believe he didn't implement one from the start, since he's the resident guru on the subject for us!)

By going this route, we don't need to rely on INSTR to search for the various entries for us, giving us two distinct advantages:

1) It's much faster to "jump" from word to word, than it is to search byte by byte for a match.

2) It's less likely to generate any false positives, as we're retrieving the exact string from memory and comparing.  _INSTR("dogfood", "dog") is a match -- there's a "dog" in "dogfood", but "dog" <> "dogfood" if we compare directly.

Like most code, when it comes to speed, the devil's in the details.
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #26 on: January 25, 2019, 01:48:36 pm »
I've already pointed out INSTR requires use of a delimiter like a comma.

dog - dogfood is easily solved that way...

wordlist$ = ",bill,cat,dog,dogfood,pete,steve,"
a = INSTR(wordlist$,",dog,")

dogfood would not be returned, just dog. Besides... dog food is two words. :P

Hey, did you try my INSTR search and frequency sample? It's dog gone fast!
https://www.qb64.org/forum/index.php?topic=1001.msg102017#msg102017

Pete
« Last Edit: January 25, 2019, 01:50:39 pm by Pete »
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #27 on: January 25, 2019, 02:04:03 pm »
I've already pointed out INSTR requires use of a delimiter like a comma.

dog - dogfood is easily solved that way...

wordlist$ = ",bill,cat,dog,dogfood,pete,steve,"
a = INSTR(wordlist$,",dog,")

dogfood would not be returned, just dog. Besides... dog food is two words. :P

Hey, did you try my INSTR search and frequency sample? It's dog gone fast!
https://www.qb64.org/forum/index.php?topic=1001.msg102017#msg102017

Pete

That's the point I was making with the previous post -- instead of using INSTR at all, which requires you to SEARCH the string for your words, store the length in front of each entry instead.

wordlist$ = "4bill3cat3dog7dogfood4pete5steve"

here you read 4, then get 4 characters for a string -- "bill"
then read 3, get 3 characters for "cat"...
then read 3, get 4 characters for "dog" -- MATCH!

Your method works like this:

wordlist$ = ",bill,cat,dog,dogfood,pete,steve,"
a = INSTR(wordlist$,",dog,")

byte 1 -- ",bill" <> ",dog"
byte 2 -- "bill," <> ",dog,"
byte 3 -- "ill,c" <> ",dog,"
byte 4 -- "ll,ca" <> ",dog,"
... and on byte by byte until ",dog," = ",dog,"

That's a 33 character string, and ",dog," is 5 characters, so there's up to 28 checks that the computer is doing behind the scene to look for the first occurrence of your string...

compared to a max of 6 comparisons if we simple read each word all at once. 

When dealing with a process which may be repeated in multiple loops, over and over, it ends up making a difference in performance.  ;)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #28 on: January 25, 2019, 02:12:07 pm »
But that's what seed in INSTR() is for, so you skip from delimiter to delimiter.

See the same post and the second code box, which uses seed% to deal with lookup frequency within a random word list.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #29 on: January 25, 2019, 07:26:23 pm »
But that's what seed in INSTR() is for, so you skip from delimiter to delimiter.

See the same post and the second code box, which uses seed% to deal with lookup frequency within a random word list.

Pete

Seed gives you a starting point, but you still check byte by byte for a match...

a$ = “,1,2,3,4,5,3,7,8,”

seed = INSTR(seed + 1, a$, “,3,”)

Run first, it does a compare at byte 1, then byte 2, then 3... at byte 5 it finds “,3,” — 5 comparisons.

Then you have:

seed = seed + LEN(“,3,”)  ‘5 + 3
seed = INSTR(seed, a$, “,3,”)

Now you start at byte 8 and search byte by byte until you get to byte 11... 5 more comparisons.

INSTR by its nature is a byte by byte compare routine (it has to be, or else how would it know where the search string appears at?  A random guess?)



wordlist$ = "4bill3cat3dog7dogfood4pete5steve0”

Placing the size first lets you advance by words, not by bytes

See the difference, and why one is faster than the other?
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!