Author Topic: Binary Search Method  (Read 18403 times)

0 Members and 1 Guest are viewing this topic.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Binary Search Method
« on: January 24, 2019, 08:09:35 pm »
A simple demostration of how to implement and use a binary search method:

Code: QB64: [Select]
  1. SCREEN _NEWIMAGE(800, 600, 32)
  2.  
  3. OPEN "466544 Word List.txt" FOR BINARY AS #1
  4. DIM SHARED WordList(1 TO 466544) AS STRING
  5.     count = count + 1
  6.     LINE INPUT #1, WordList(count)
  7.  
  8.  
  9.  
  10.     _KEYCLEAR
  11.     INPUT "Give me any word"; search$
  12.     search$ = LTRIM$(RTRIM$(search$))
  13.     PRINT "Searching for "; search$
  14.     IF search$ = "" THEN END
  15.     index = FindIndex(search$)
  16.     IF index THEN
  17.         PRINT "Word was found at position "; index; " in "; SearchTimes; "passes."
  18.     ELSE
  19.         PRINT "Word was not in list."
  20.         PRINT "Previous word was ==> "; WordList(LastIndex - 1)
  21.         PRINT "Next word was ======> "; WordList(LastIndex + 1)
  22.         PRINT "Search took"; SearchTimes; "passes."
  23.     END IF
  24.     PRINT
  25.  
  26.  
  27.  
  28. FUNCTION FindIndex (search$)
  29.     SHARED SearchTimes, LastIndex
  30.     SearchTimes = 0
  31.     min = 1 'lbound(wordlist)
  32.     max = 370099 'ubound(wordlist)
  33.     DO UNTIL found
  34.         SearchTimes = SearchTimes + 1
  35.         gap = (min + max) \ 2
  36.         compare = _STRICMP(search$, WordList(gap))
  37.         IF compare > 0 THEN
  38.             min = gap + 1
  39.         ELSEIF compare < 0 THEN
  40.             max = gap - 1
  41.         ELSE
  42.             FindIndex = gap
  43.             found = -1
  44.         END IF
  45.         IF max - min <= 1 THEN LastIndex = gap: found = -1 'it's not in the list
  46.         PRINT min, max, search$, WordList(gap), compare
  47.         SLEEP
  48.     LOOP
  49.  

The word list has 466544 words in it, so it takes a few seconds to load into memory.  Searching for a word in the list however, and finding its index, is almost instantaneous.  At *MOST*, this will make 19 passes to either find, or eliminate a word from the list.

Now, let's apply the same list to STx's hash tables as written here: https://www.qb64.org/forum/index.php?topic=1001.msg101972#msg101972

Code: QB64: [Select]
  1. DIM SHARED HashTableSize AS LONG
  2. HashTableSize = 300007 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
  3.  
  4.  
  5. DIM Counter(300007) 'So we can count how deep some of the tables go
  6.  
  7. PRINT "Loading dictionary..."
  8.  
  9. OPEN "466544 Word List.txt" FOR BINARY AS #1
  10.     LINE INPUT #1, a$
  11.     d = HashFunc(a$) ' Calculate the hash value (array address) of the word on hand.
  12.     Counter(d) = Counter(d) + 1
  13.  
  14. FOR i = 1 TO 300007
  15.     IF Counter(i) > max THEN max = Counter(i)
  16.  
  17. PRINT "Lists go up to "; max; " levels deep."
  18.  
  19.  
  20. PRINT "Done."
  21.  
  22.  
  23.  
  24.  
  25. FUNCTION HashFunc (a AS STRING) ' input string
  26.     DIM sum AS DOUBLE
  27.     sum = HashTableSize
  28.     FOR k = 1 TO LEN(a)
  29.         sum = sum + k * COS(ASC(MID$(a, k, 1))) ' Without the linear factor of k, permutations have same hash values.
  30.     NEXT
  31.     sum = ABS(VAL(ReplaceSubString$(STR$(sum), ".", "")))
  32.     sum = sum MOD HashTableSize
  33.     HashFunc = sum
  34.  
  35. FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
  36.     j = INSTR(a, b)
  37.     IF j > 0 THEN
  38.         r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
  39.     ELSE
  40.         r$ = a
  41.     END IF
  42.     ReplaceSubString$ = r$

Notice that the hash table has entries which go up to 10 levels deep in it, so we first need to hash to the proper address and then check the list for up to 10 different entries before finding the right one.

19 loops MAX in this case for the Binary Search Method, verses 11 (hash + 10 levels) MAX for the Hash Table...

... so it seems as if the Hash Table wins in this case. (Also depending on how efficient the method is for looking up the 10 levels deep info is; some commands have more overhead than others.)

BUT...

Hash Tables require your data in memory PLUS the table itself being in memory. 

All the Binary Search Method requires is for the data to be memory (and sorted).  (And if memory constraints are a true issue, it works quite well directly with Random Access Files, as you can simply use the index number to reference them with GET #1, index, word$.)



Either way, it seems to be a nice improvement over Pete's old method which he talks about here: https://www.qb64.org/forum/index.php?topic=1001.msg101983#msg101983

And, as for Stx's question:
Quote
17 steps? 18 steps? Where did these numbers come from? How do I interpret them in O(n) notation?

That was covered here: https://www.qb64.org/forum/index.php?topic=1001.msg101981#msg101981

Quote
A max of N searches, where 2^N > Index.  ;)

2^18 = 262,144
INDEX = 466544 words
2^19 = 524,288

So our max number of searches is 2^N > Index... Or 19 in this case. 

Quote
I think the burden's on you to do the speed test m8 (for a large data set with lots of performance demand).

I think the above offers a fairly decent comparison of the two methods.  Feel free to run them and actually time them in various performance tests if you want. 

One large advantage to the binary search is lesser memory usage and simplicity to implement.

A large advantage to hash tables is the fact that your data doesn't need to be sorted to search properly.

Both seem equally useful to me, and neither should be completely dismissed out-of-hand.  ;)

NOTE: Feel free to remove (or comment out) the following statements, if you want, in the code above.  It's mainly there just to help highlight the search process.

Code: QB64: [Select]
  1.         PRINT min, max, search$, WordList(gap), compare
  2.         SLEEP

* 466544 Word List.txt (Filesize: 5.08 MB, Downloads: 13141)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Binary Search Method
« Reply #1 on: January 24, 2019, 08:23:04 pm »
Nice and honest writeup.

An actual speed test will be to rattle off say 50,000 consecutive lookups. To say it "feels instaneous" for one word at a time is a bit pedestrian. Let's talk Pete into making the test.

EDIT: Just looked up the average conversion time for binary sort is O(log n). That solves the speed question. On binary sort's best day you have O(n), which is incidentally what you get on hash's worst day.

You're right about the memory argument. Good thing it's no longer 1973 and we have been cramming gigabytes into keychains since 2002. Hashing is safe, memory-wise.  (Speed is still coveted though.)
« Last Edit: January 24, 2019, 08:31:34 pm by STxAxTIC »
You're not done when it works, you're done when it's right.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #2 on: January 24, 2019, 08:42:05 pm »
Code: QB64: [Select]
  1. SCREEN _NEWIMAGE(800, 600, 32)
  2.  
  3. OPEN "466544 Word List.txt" FOR BINARY AS #1
  4. DIM SHARED WordList(1 TO 466544) AS STRING
  5.     count = count + 1
  6.     LINE INPUT #1, WordList(count)
  7.  
  8. DIM RandomWords(50000) AS STRING
  9. FOR i = 1 TO 50000
  10.     c = INT(RND * 466544) + 1
  11.     RandomWords(i) = WordList(c)
  12.  
  13. t# = TIMER
  14. FOR i = 1 TO 50000
  15.     index = FindIndex(RandomWords(i))
  16. PRINT USING "###.### seconds lookup"; TIMER - t#
  17.  
  18.  
  19.  
  20.  
  21.  
  22.     _KEYCLEAR
  23.     INPUT "Give me any word"; search$
  24.     search$ = LTRIM$(RTRIM$(search$))
  25.     PRINT "Searching for "; search$
  26.     IF search$ = "" THEN END
  27.     index = FindIndex(search$)
  28.     IF index THEN
  29.         PRINT "Word was found at position "; index; " in "; SearchTimes; "passes."
  30.     ELSE
  31.         PRINT "Word was not in list."
  32.         PRINT "Previous word was ==> "; WordList(LastIndex - 1)
  33.         PRINT "Next word was ======> "; WordList(LastIndex + 1)
  34.         PRINT "Search took"; SearchTimes; "passes."
  35.     END IF
  36.     PRINT
  37.  
  38.  
  39.  
  40. FUNCTION FindIndex (search$)
  41.     SHARED SearchTimes, LastIndex
  42.     SearchTimes = 0
  43.     min = 1 'lbound(wordlist)
  44.     max = 370099 'ubound(wordlist)
  45.     DO UNTIL found
  46.         SearchTimes = SearchTimes + 1
  47.         gap = (min + max) \ 2
  48.         compare = _STRICMP(search$, WordList(gap))
  49.         IF compare > 0 THEN
  50.             min = gap + 1
  51.         ELSEIF compare < 0 THEN
  52.             max = gap - 1
  53.         ELSE
  54.             FindIndex = gap
  55.             found = -1
  56.         END IF
  57.         IF max - min <= 1 THEN LastIndex = gap: found = -1 'it's not in the list
  58.         ' PRINT min, max, search$, WordList(gap), compare
  59.         ' SLEEP
  60.     LOOP

0.055 seconds on my PC to find 50,000 random words.  Personally, I’d call that “almost instantaneous”, but what do I know?  My naming sense sucks.  LOL

Now you just need to speed test your method to lookup the same 50,000 words.  (Which is why theres no RANDOMIZER TIMER in there.)
« Last Edit: January 24, 2019, 08:44:04 pm by SMcNeill »
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #3 on: January 24, 2019, 08:50:53 pm »
Hey I'm still over in the other thread with my new method that makes you guys look like losers! Sorry, I gave brain cells today at the Head Cross clinic and for a few hours after that procedure, I find myself channeling Ted.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #4 on: January 24, 2019, 08:53:52 pm »
Nevermind.  It’s not worth the effort.  Use it or not; there it is.
« Last Edit: January 24, 2019, 09:14:47 pm by SMcNeill »
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #5 on: January 24, 2019, 09:12:57 pm »
Not alphabetizing a file for a binary search would be as stupid as not using the Dewey Decimal system for stacking books. The books in here somewhere, just start doing a shelf by shelf search. So I don't think we should be considering that as part of this discussion. This is about an organized word list vs a hash table, which alphabetizing in the hash method is of absolutely no consequence.

In my 25 year old example, the speed was adequate because I organized the data and sub-divided the list into different alphabetized folders. If I had to loop through 50,000 random words in a single list with the slower INPUT AS method, I'd still be waiting to see if I spelled dumb-ass correctly.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Binary Search Method
« Reply #6 on: January 24, 2019, 09:14:15 pm »
Steve, the big-O notation is designed to specifically not compare apples to oranges. Dicscrete calculus will be a lesson for another time... I just cant right now, my facepalm hurts.
« Last Edit: January 24, 2019, 09:16:05 pm by STxAxTIC »
You're not done when it works, you're done when it's right.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #7 on: January 24, 2019, 09:17:35 pm »
Use it or not; there it is.

I’d still love to see a similar speed test for your method.  I’m curious how big a difference it could be.
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Binary Search Method
« Reply #8 on: January 24, 2019, 09:24:47 pm »
It's open source. Have a ball. (Might wanna optimize it first.)
You're not done when it works, you're done when it's right.

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #9 on: January 24, 2019, 09:29:23 pm »
Steve, the big-O notation is designed to specifically not compare apples to oranges. Dicscrete calculus will be a lesson for another time... I just cant right now, my facepalm hurts.

Dicscrete calculus, Bill? I think you'd better chose a search method and get that spell checker up and working for craps sake.

If you mean discrete calculus, is that one of those math courses offered at night for active singles? If so, what is the probably of completing the course without contracting a nasty case of standard deviations?

Pete :D
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Binary Search Method
« Reply #10 on: January 24, 2019, 10:21:24 pm »
Haha, not sure who promised I was making a spell checker. The dictionizer thing I posted is still a working product... or... prototype. I consider the whole case closed pretty much.
You're not done when it works, you're done when it's right.

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Binary Search Method
« Reply #11 on: January 25, 2019, 12:16:04 am »
Steve there is a bug in your FindIndex function:
Code: QB64: [Select]
  1. SCREEN _NEWIMAGE(800, 600, 32)
  2.  
  3. OPEN "466544 Word List.txt" FOR BINARY AS #1
  4. DIM SHARED WordList(1 TO 466544) AS STRING
  5.     count = count + 1
  6.     LINE INPUT #1, WordList(count)
  7. DIM RandomWords(50000) AS STRING
  8. FOR i = 1 TO 50000
  9.     c = INT(RND * 466544) + 1
  10.     RandomWords(i) = WordList(c)
  11. PRINT "Steve's 50000 word lookup with binary search"
  12. t# = TIMER
  13. FOR i = 1 TO 10
  14.     index = FindIndex(RandomWords(i))
  15.     IF index THEN PRINT WordList(index) ELSE PRINT RandomWords(i) + " not found!"
  16. PRINT USING "###.### seconds lookup"; TIMER - t#
  17.  
  18.  
  19. 'pete's method
  20. OPEN "466544 Word List.txt" FOR BINARY AS #1
  21. word$ = SPACE$(LOF(1))
  22. GET #1, , word$
  23. PRINT "Pete's 50000 word lookup using INSTR to get position in string of word."
  24. t# = TIMER
  25. FOR i = 1 TO 10
  26.     place = INSTR(word$, RandomWords(i) + CHR$(13) + CHR$(10))
  27.     PRINT MID$(word$, place, LEN(RandomWords(i)))
  28. PRINT USING "####.### seconds lookup"; TIMER - t#
  29.  
  30.  
  31.  
  32.  
  33.  
  34.     _KEYCLEAR
  35.     INPUT "Give me any word"; search$
  36.     search$ = LTRIM$(RTRIM$(search$))
  37.     PRINT "Searching for "; search$
  38.     IF search$ = "" THEN END
  39.     index = FindIndex(search$)
  40.     IF index THEN
  41.         PRINT "Word was found at position "; index; " in "; SearchTimes; "passes."
  42.     ELSE
  43.         PRINT "Word was not in list."
  44.         PRINT "Previous word was ==> "; WordList(LastIndex - 1)
  45.         PRINT "Next word was ======> "; WordList(LastIndex + 1)
  46.         PRINT "Search took"; SearchTimes; "passes."
  47.     END IF
  48.     PRINT
  49.  
  50.  
  51.  
  52. FUNCTION FindIndex (search$)
  53.     SHARED SearchTimes, LastIndex
  54.     SearchTimes = 0
  55.     min = 1 'lbound(wordlist)
  56.     max = 370099 'ubound(wordlist)
  57.     DO UNTIL found
  58.         SearchTimes = SearchTimes + 1
  59.         gap = (min + max) \ 2
  60.         compare = _STRICMP(search$, WordList(gap))
  61.         IF compare > 0 THEN
  62.             min = gap + 1
  63.         ELSEIF compare < 0 THEN
  64.             max = gap - 1
  65.         ELSE
  66.             FindIndex = gap
  67.             found = -1
  68.         END IF
  69.         IF max - min <= 1 THEN LastIndex = gap: found = -1 'it's not in the list
  70.         ' PRINT min, max, search$, WordList(gap), compare
  71.         ' SLEEP
  72.     LOOP
  73.  
  74.  
Binary lookup bug.PNG

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #12 on: January 25, 2019, 02:08:45 am »
A couple of little things was working against the original running as intended:

Code: QB64: [Select]
  1. SCREEN _NEWIMAGE(800, 600, 32)
  2.  
  3. OPEN "466544 Word List.txt" FOR BINARY AS #1
  4. DIM SHARED WordList(466545) AS STRING
  5. PRINT "Loading library"
  6.     count = count + 1
  7.     LINE INPUT #1, WordList(count)
  8.  
  9. PRINT "Sorting"
  10. Sort WordList()
  11.  
  12. PRINT "Looking up"
  13. DIM RandomWords(50000) AS STRING
  14. FOR i = 1 TO 50000
  15.     c = INT(RND * 466544) + 1
  16.     RandomWords(i) = WordList(c)
  17. PRINT "Steve's 50000 word lookup with binary search"
  18. t# = TIMER
  19. FOR i = 1 TO 10
  20.     index = FindIndex(RandomWords(i))
  21.     PRINT "Searching for: "; RandomWords(i),
  22.     IF index THEN PRINT WordList(index) ELSE PRINT "NOT FOUND!"
  23. PRINT USING "###.### seconds lookup"; TIMER - t#
  24.  
  25.  
  26. 'pete's method
  27.  
  28. 'FOR i = 1 TO 466545
  29. '    adding brackets to words to stop a false match, such as "dog" being found in "dogfood"
  30. '    word$ = word$ + "{" + WordList(i) + "}" 'build up the search list
  31. '    IF i < 50001 THEN RandomWords(i) = "{" + RandomWords(i) + "}"
  32. 'NEXT
  33.  
  34. OPEN "466544 Word List.txt" FOR BINARY AS #1
  35. word$ = SPACE$(LOF(1))
  36. GET #1, , word$
  37.  
  38.  
  39.  
  40. PRINT "Pete's 50000 word lookup using INSTR to get position in string of word."
  41. t# = TIMER
  42. FOR i = 1 TO 10
  43.     place = INSTR(word$, RandomWords(i) + CHR$(13) + CHR$(10))
  44.     PRINT "Searching for: "; RandomWords(i),
  45.     IF place THEN PRINT MID$(word$, place, LEN(RandomWords(i))) ELSE PRINT "NOT FOUND!"
  46. PRINT USING "####.### seconds lookup"; TIMER - t#
  47.  
  48.  
  49.  
  50.  
  51.  
  52.     _KEYCLEAR
  53.     INPUT "Give me any word"; search$
  54.     search$ = LTRIM$(RTRIM$(search$))
  55.     PRINT "Searching for "; search$
  56.     IF search$ = "" THEN END
  57.     index = FindIndex(search$)
  58.     IF index THEN
  59.         PRINT WordList(index); " was found at position "; index; " in "; SearchTimes; "passes."
  60.     ELSE
  61.         PRINT "Word was not in list."
  62.         PRINT "Previous word was ==> "; WordList(LastIndex - 1)
  63.         PRINT "Next word was ======> "; WordList(LastIndex + 1)
  64.         PRINT "Search took"; SearchTimes; "passes."
  65.     END IF
  66.     PRINT
  67.  
  68.  
  69.  
  70. FUNCTION FindIndex (search$)
  71.     SHARED SearchTimes, LastIndex
  72.     SearchTimes = 0
  73.     min = 1 'lbound(wordlist)
  74.     max = 466544 'ubound(wordlist)
  75.  
  76.     DO UNTIL found
  77.         SearchTimes = SearchTimes + 1
  78.         gap = (max + min) \ 2
  79.         'IF gap = oldgap THEN gap = gap + 1
  80.         compare = _STRCMP(search$, WordList(gap))
  81.         IF compare > 0 THEN
  82.             oldmin = min
  83.             min = gap
  84.         ELSEIF compare < 0 THEN
  85.             oldmax = max
  86.             max = gap
  87.         ELSE
  88.             FindIndex = gap
  89.             found = -1
  90.             EXIT FUNCTION
  91.         END IF
  92.         oldgap = gap
  93.         IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
  94.         ' PRINT min, max, search$, WordList(gap), compare
  95.         ' SLEEP
  96.     LOOP
  97.  
  98. SUB Sort (Array() AS STRING)
  99.     'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
  100.     'It's more than fast enough for our needs here I th ink.  ;)
  101.     gap = UBOUND(array)
  102.     DO
  103.         gap = 10 * gap \ 13
  104.         IF gap < 1 THEN gap = 1
  105.         i = 0
  106.         swapped = 0
  107.         DO
  108.             IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
  109.                 SWAP Array(i), Array(i + gap)
  110.                 swapped = -1
  111.             END IF
  112.             i = i + 1
  113.         LOOP UNTIL i + gap > UBOUND(Array)
  114.     LOOP UNTIL swapped = 0 AND gap = 1
  115.  

First issue:  The data in the wordlist isn't fully alphabetical...  O_o!!   It needed a quick sort so we can search it properly.  Just look at the first few lines; it should've been obvious, but I overlooked it:

Quote
2
1080
&c
10-point
10th
11-point
12-point
16-point
18-point
1st

Since when does "2" come before "1"??

Sorting routine added, and bug was squashed.



Glitch 2:

OPEN "466544 Word List.txt" FOR BINARY AS #1
max = 370099 'ubound(wordlist)

There's a small difference in those two numbers...  (I'd originally was using a smaller file, before I found this one on the web and grabbed it for the larger data set.)

There's no way we could find a ton of words in the list, while only checking about 3/4 of them...

Glitch fixed.



Glitch #3:

        ELSE
            FindIndex = gap
            found = -1
        END IF
        IF max - min <= 1 THEN LastIndex = gap: found = -1 'it's not in the list

We're missing one important piece of program flow here: 

        ELSE
            FindIndex = gap
            found = -1
            EXIT FUNCTION
        END IF
        IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list

We'd find the proper result, and then if the gap was small enough between the last numbers, we'd lie and claim it wasn't actually in the list...

Glitch fixed.



All should be working as advertised now, in the code above.  ;)

Sidenote:  Your implementation of Pete's method is glitched.  "dog" would be found if "dogfood" was included in the list, while "dog" actually wasn't.

I was going to do a fix for the issue, but you'd need to take a nap while it was running:

Code: QB64: [Select]
  1. FOR i = 1 TO 466545 ' adding brackets to words to stop a false match, such as "dog" being found in "dogfood"
  2.     word$ = word$ + "{" + WordList(i) + "}" 'build up the search list
  3.     IF i < 50001 THEN RandomWords(i) = "{" + RandomWords(i) + "}"

I tried to rebuild the already loaded and sorted list with brackets delimiting the words, but I didn't have the patience to sit and listen to my PC try to sound like an airplane with all the fans going full throttle for who knows how long....
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline luke

  • Administrator
  • Seasoned Forum Regular
  • Posts: 324
    • View Profile
Re: Binary Search Method
« Reply #13 on: January 25, 2019, 05:42:29 am »
And just because we can, here's a recursive binary sort:
Code: QB64: [Select]
  1. DEFLNG A-Z
  2.  
  3. OPEN "words" FOR INPUT AS #1
  4. DIM SHARED WordList(99170) AS STRING
  5.     LINE INPUT #1, WordList(count)
  6.     count = count + 1
  7.  
  8.     INPUT "> ", search$
  9.     IF search$ = "" THEN END
  10.     index = find(search$, 0, UBOUND(wordlist))
  11.     IF index >= 0 THEN
  12.         PRINT "Found "; WordList(index); " at"; index
  13.     ELSE
  14.         PRINT "Not found"
  15.     END IF
  16.  
  17. FUNCTION find (word$, start, finish)
  18.     size = finish - start + 1
  19.     mid = size \ 2
  20.     cmp = _STRCMP(WordList(start + mid), word$)
  21.     SELECT CASE cmp
  22.         CASE 0
  23.             find = start + mid
  24.         CASE IS > 0
  25.             IF size = 1 THEN find = -1 ELSE find = find(word$, start, start + mid - 1)
  26.         CASE IS < 0
  27.             IF size = 1 THEN find = -1 ELSE find = find(word$, start + mid + 1, finish)
  28.     END SELECT

(My word list had all the words beginning with a capital letter listed first, so I'm using _STRCMP)

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #14 on: January 25, 2019, 09:18:49 am »
Use it or not; there it is.

I’d still love to see a similar speed test for your method.  I’m curious how big a difference it could be.

It's open source. Have a ball. (Might wanna optimize it first.)

So, since Stx wasn't willing to do a speed comparison, I did:

Code: QB64: [Select]
  1. DIM SHARED HashTableSize AS LONG
  2. HashTableSize = 300007 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
  3.  
  4. DIM SHARED LB AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s).
  5. LB = "{"
  6. RB = "}"
  7.  
  8. DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
  9.  
  10. OPEN "466544 Word List.txt" FOR BINARY AS #1
  11.  
  12. DIM SHARED WordList(466545) AS STRING
  13. PRINT "Loading library"
  14.     count = count + 1
  15.     LINE INPUT #1, WordList(count)
  16.  
  17. Sort WordList()
  18.  
  19. i = 0
  20. FOR i = 1 TO 466545
  21.     b$ = WordList(i) 'the word to store
  22.     c$ = LTRIM$(RTRIM$(STR$(i))) 'to store the index
  23.     d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
  24.     EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB + LB + c$ + RB
  25. PRINT "Done creating Hash Table."
  26.  
  27. ' Done developing fast lookup tool. Now time for an application.
  28.  
  29. PRINT "Looking up"
  30. DIM RandomWords(50000) AS STRING
  31. FOR i = 1 TO 50000
  32.     c = INT(RND * 466544) + 1
  33.     RandomWords(i) = WordList(c)
  34.  
  35. t# = TIMER
  36. FOR i = 1 TO 50000
  37.     'PRINT "Searching for: "; RandomWords(i),
  38.     l$ = Lookup$(RandomWords(i))
  39.     IF l$ <> "" THEN
  40.         l = INSTR(l$, " ")
  41.         word$ = LEFT$(l$, l - 1)
  42.         index = VAL(MID$(l$, l + 1))
  43.         'PRINT WordList(index) 'to show that we got the index back successfully
  44.         'ELSE
  45.         'PRINT "NOT FOUND!"
  46.     END IF
  47. PRINT USING "###.### seconds lookup using Hash Table"; TIMER - t#
  48.  
  49. t# = TIMER
  50. FOR i = 1 TO 50000
  51.     index = FindIndex(RandomWords(i))
  52.     'PRINT "Searching for: "; RandomWords(i),
  53.     'IF index THEN 'to show that we got the index back successfully
  54.     '    PRINT WordList(index)
  55.     'ELSE
  56.     '    PRINT "NOT FOUND!"
  57.     'END IF
  58. PRINT USING "###.### seconds lookup using Binary Search"; TIMER - t#
  59.  
  60. PRINT "And just to compare results -- the first 15 words:"
  61. FOR i = 1 TO 10
  62.     'PRINT "Searching for: "; RandomWords(i),
  63.     l$ = Lookup$(RandomWords(i))
  64.     IF l$ <> "" THEN
  65.         l = INSTR(l$, " ")
  66.         word$ = LEFT$(l$, l - 1)
  67.         index = VAL(MID$(l$, l + 1))
  68.         PRINT WordList(index), 'to show that we got the index back successfully
  69.     ELSE
  70.         PRINT "NOT FOUND!",
  71.     END IF
  72.     index = FindIndex(RandomWords(i))
  73.     'PRINT "Searching for: "; RandomWords(i),
  74.     IF index THEN 'to show that we got the index back successfully
  75.         PRINT WordList(index)
  76.     ELSE
  77.         PRINT "NOT FOUND!"
  78.     END IF
  79.  
  80.  
  81.  
  82.  
  83. FUNCTION Lookup$ (a AS STRING)
  84.     r$ = ""
  85.     b$ = EnglishDictionary(HashFunc(a))
  86.     c$ = ""
  87.     d$ = ""
  88.     IF b$ <> "" THEN
  89.         DO WHILE c$ <> a
  90.             c$ = ReturnBetween(b$, LB, RB)
  91.             IF c$ = "" THEN EXIT DO
  92.             b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
  93.             d$ = ReturnBetween(b$, LB, RB)
  94.         LOOP
  95.     END IF
  96.     r$ = a + "  " + d$
  97.     Lookup$ = r$
  98.  
  99. FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' input string, left bracket, right bracket
  100.     i = INSTR(a, b)
  101.     j = INSTR(a, c)
  102.     f = LEN(c)
  103.     ReturnBetween$ = MID$(a, i + f, j - (i + f))
  104.  
  105. FUNCTION HashFunc (a AS STRING) ' input string
  106.     DIM sum AS DOUBLE
  107.     sum = HashTableSize
  108.     FOR k = 1 TO LEN(a)
  109.         sum = sum + k * COS(ASC(MID$(a, k, 1))) ' Without the linear factor of k, permutations have same hash values.
  110.     NEXT
  111.     sum = ABS(VAL(ReplaceSubString$(STR$(sum), ".", "")))
  112.     sum = sum MOD HashTableSize
  113.     HashFunc = sum
  114.  
  115. FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
  116.     j = INSTR(a, b)
  117.     IF j > 0 THEN
  118.         r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
  119.     ELSE
  120.         r$ = a
  121.     END IF
  122.     ReplaceSubString$ = r$
  123.  
  124.  
  125. SUB Sort (Array() AS STRING)
  126.     'The dice sorting routine, optimized to use _MEM and a comb sort algorithm.
  127.     'It's more than fast enough for our needs here I th ink.  ;)
  128.     gap = UBOUND(array)
  129.     DO
  130.         gap = 10 * gap \ 13
  131.         IF gap < 1 THEN gap = 1
  132.         i = 0
  133.         swapped = 0
  134.         DO
  135.             IF _STRCMP(Array(i), Array(i + gap)) > 0 THEN
  136.                 SWAP Array(i), Array(i + gap)
  137.                 swapped = -1
  138.             END IF
  139.             i = i + 1
  140.         LOOP UNTIL i + gap > UBOUND(Array)
  141.     LOOP UNTIL swapped = 0 AND gap = 1
  142.  
  143. FUNCTION FindIndex (search$)
  144.     SHARED SearchTimes, LastIndex
  145.     SearchTimes = 0
  146.     min = 1 'lbound(wordlist)
  147.     max = 466544 'ubound(wordlist)
  148.  
  149.     DO UNTIL found
  150.         SearchTimes = SearchTimes + 1
  151.         gap = (max + min) \ 2
  152.         compare = _STRCMP(search$, WordList(gap))
  153.         IF compare > 0 THEN
  154.             min = gap
  155.         ELSEIF compare < 0 THEN
  156.             max = gap
  157.         ELSE
  158.             FindIndex = gap
  159.             found = -1
  160.             EXIT FUNCTION
  161.         END IF
  162.         IF max - min < 1 THEN LastIndex = gap: found = -1 'it's not in the list
  163.         ' PRINT min, max, search$, WordList(gap), compare
  164.         ' SLEEP
  165.     LOOP
  166.  
  167.  

In both search instances, we use the same word list, the same 50,000 word search list, and we return both the word we're searching for and its index position in the list.

Now, I imagine that Stx will come along and tell me I'm doing something wrong with the routine he provided, but the screenshot below is the results I get.  If there's something wrong with things, I'm more than happy to learn from whatever the issue is:
Speed Test.jpg
* Speed Test.jpg (Filesize: 39.99 KB, Dimensions: 646x429, Views: 342)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!