Author Topic: Dictionizer: Useful utility and study of Hashing  (Read 9599 times)

0 Members and 1 Guest are viewing this topic.

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #15 on: January 24, 2019, 11:45:32 pm »
The question there, Pete, is what spot does the word contain in your list?  If just knowing it’s there is good enough, that might work, but if you need the word index, it’s insufficient.  ;)

Steve you had nothing to worry about.
Code: QB64: [Select]
  1. SCREEN _NEWIMAGE(800, 600, 32)
  2.  
  3. OPEN "466544 Word List.txt" FOR BINARY AS #1
  4. DIM SHARED WordList(1 TO 466544) AS STRING
  5.     count = count + 1
  6.     LINE INPUT #1, WordList(count)
  7. DIM RandomWords(50000) AS STRING
  8. FOR i = 1 TO 50000
  9.     c = INT(RND * 466544) + 1
  10.     RandomWords(i) = WordList(c)
  11. PRINT "Steve's 50000 word lookup with binary search"
  12. t# = TIMER
  13. FOR i = 1 TO 50000
  14.     index = FindIndex(RandomWords(i))
  15. PRINT USING "###.### seconds lookup"; TIMER - t#
  16.  
  17.  
  18. 'pete's method
  19. OPEN "466544 Word List.txt" FOR BINARY AS #1
  20. word$ = SPACE$(LOF(1))
  21. GET #1, , word$
  22. PRINT "Pete's 50000 word lookup using INSTR to get position in string of word."
  23. t# = TIMER
  24. FOR i = 1 TO 50000
  25.     place = INSTR(word$, RandomWords(i) + CHR$(13) + CHR$(10))
  26. PRINT USING "###.### seconds lookup"; TIMER - t#
  27.  
  28.  
  29.  
  30.  
  31.  
  32.     _KEYCLEAR
  33.     INPUT "Give me any word"; search$
  34.     search$ = LTRIM$(RTRIM$(search$))
  35.     PRINT "Searching for "; search$
  36.     IF search$ = "" THEN END
  37.     index = FindIndex(search$)
  38.     IF index THEN
  39.         PRINT "Word was found at position "; index; " in "; SearchTimes; "passes."
  40.     ELSE
  41.         PRINT "Word was not in list."
  42.         PRINT "Previous word was ==> "; WordList(LastIndex - 1)
  43.         PRINT "Next word was ======> "; WordList(LastIndex + 1)
  44.         PRINT "Search took"; SearchTimes; "passes."
  45.     END IF
  46.     PRINT
  47.  
  48.  
  49.  
  50. FUNCTION FindIndex (search$)
  51.     SHARED SearchTimes, LastIndex
  52.     SearchTimes = 0
  53.     min = 1 'lbound(wordlist)
  54.     max = 370099 'ubound(wordlist)
  55.     DO UNTIL found
  56.         SearchTimes = SearchTimes + 1
  57.         gap = (min + max) \ 2
  58.         compare = _STRICMP(search$, WordList(gap))
  59.         IF compare > 0 THEN
  60.             min = gap + 1
  61.         ELSEIF compare < 0 THEN
  62.             max = gap - 1
  63.         ELSE
  64.             FindIndex = gap
  65.             found = -1
  66.         END IF
  67.         IF max - min <= 1 THEN LastIndex = gap: found = -1 'it's not in the list
  68.         ' PRINT min, max, search$, WordList(gap), compare
  69.         ' SLEEP
  70.     LOOP
  71.  
  72.  
Binary vrs INSTR.PNG

Ha, ha, ha, had me fooled too! ;-))

Though the INSTR method would find a word in collisions fast enough, I imagine.
« Last Edit: January 24, 2019, 11:55:42 pm by bplus »

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Dictionizer: Useful utility and study of Hashing
« Reply #16 on: January 25, 2019, 12:21:52 am »
Makes sense when you think about what INSTR does:  compares character by character until it finds a match....

PRINT INSTR (“1234567890”, “0”)  <—- this would need to start at position 1 and compare each byte in the string to where “0” appears first — if it appears at all.

So take 466544 words of average length 5, and you have a string with 2 million bytes in it...  That’s up to 2 million comparisons to find a match...

To be honest, I’m surprised all it took was 2.5 minutes or so to finish. 
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #17 on: January 25, 2019, 02:30:43 am »
For my needs, INSTR() works great. Actually, I have used my method to pare words out of very large html page. I just don't look up words 50,000 times! So to find just one word in a 50,000 word list with an average of 6 letters per word would be nearly instantaneous...

Code: QB64: [Select]
  1. x$ = ","
  2.     x$ = x$ + CHR$(RND * 26 + 65)
  3.     IF INT(RND * 8) = 5 THEN x$ = x$ + ","
  4.     IF LEN(x$) > 150000 AND flag = 0 THEN flag = 1: PRINT "Half finished...": x$ = x$ + ",ATE,"
  5. LOOP UNTIL LEN(x$) > 300000
  6.  
  7. PRINT " Press any key to run speed test..."
  8.  
  9.  
  10. t# = TIMER
  11. PRINT "ATE is at position:"; INSTR(x$, ",ATE,") + 1
  12. PRINT USING "###.### seconds lookup"; TIMER - t#

What does take a minute is putting the 300000+ character string together.

Parsing out frequency would require seeding the instr() function but with seeding, it is also very fast.

It's a quick and dirty low tech solution if you don't mind clogging up a lot of memory to use it.

Edit: Wow, frequency tests are lightening fast, too.

Code: QB64: [Select]
  1. x$ = ","
  2.     x$ = x$ + CHR$(RND * 26 + 65)
  3.     IF INT(RND * 500) = 25 THEN x$ = x$ + ",ATE,"
  4. LOOP UNTIL LEN(x$) > 300000
  5.  
  6. PRINT " Press any key to run speed test..."
  7.  
  8.  
  9. t# = TIMER
  10.     ii = ii + 1
  11.     seed& = INSTR(seed&, x$, ",ATE,")
  12.     IF seed& = 0 THEN EXIT DO
  13.     PRINT ii; seed&; MID$(x$, seed& + 1, LEN("ATE"))
  14.     seed& = seed& + LEN(",ATE,")
  15. PRINT USING "###.### seconds lookup"; TIMER - t#

Pete
« Last Edit: January 25, 2019, 03:02:44 am by Pete »
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline _vince

  • Seasoned Forum Regular
  • Posts: 422
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #18 on: January 25, 2019, 11:50:16 am »
Pete the Builder

I believe QB64 actually uses hash tables.  I remember when that was a giant milestone for galleon in improving compilation time - hashing for variable (sub, function, array, etc) names though I am not familiar with exactly how it was done. Here's a reasonable list of applications that even mentions programming language interpreters/compilers: https://en.wikipedia.org/wiki/Hash_table#Uses. Maybe Steve can further increase compilation time by replacing it with his binary search?

@STx the dictionary example may be a poor example of practical hash table use unless Steve is just totally messing with you with this one-upping you with the binary search, who knows. The best and simplest example would be "associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like Perl, Ruby, Python, and PHP" (from the wiki).  Here's an example:

there's an array called "cat":

cat$(0) = "Benedict"
cat$(1) = "maine coon"
cat$(2) = "rats"
cat$(3) = "35 kg"

It's a little confusing to remember that I'd have to call cat$(1) to get it's breed type, a better representation is:

cat_hash$("name") = "Benedict"
cat_hash$("breed") = "maine coon"
cat_hash$("food") = "rats"
cat_hash$("mass") = "35 kg"

now all you need is a hashing function to resolve the following:

"name" -> 0
"breed" -> 1
"food" -> 2
"mass" -> 3

as you can imagine, this is extremely useful in tables and databases.  Shame on Steve for spamming this otherwise informative thread with unrelated and misleading nonsense.
bob.jpeg
* bob.jpeg (Filesize: 22.39 KB, Dimensions: 300x300, Views: 318)
« Last Edit: January 25, 2019, 02:08:49 pm by _vince »

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #19 on: January 25, 2019, 12:05:18 pm »
"Yes he can!"

Just don't get on his bad side... Angry Pete

https://thumbs.gfycat.com/ReasonableEmbellishedHalcyon-size_restricted.gif
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Dictionizer: Useful utility and study of Hashing
« Reply #20 on: January 25, 2019, 12:31:42 pm »
@STx the dictionary example may be a poor example of practical hash table use unless Steve is just totally messing with you...

Now we all know I wouldn’t do anything like that.  :P
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline _vince

  • Seasoned Forum Regular
  • Posts: 422
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #21 on: January 25, 2019, 02:10:59 pm »
Edited above

Offline _vince

  • Seasoned Forum Regular
  • Posts: 422
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #22 on: January 25, 2019, 02:13:49 pm »
Perhaps a dictionary implementation is as follows:

hashing function:

"apple" -> 3
"africa" -> 1
"xylophone" -> 2
"zebra" -> 0

then you simply store your dictionary as follows

dict$(0) = "horse-like animal"
dict$(1) = "continent"
dict$(2) = "musical instrument"
dict$(3) = "type of fruit"

obviously it does not have to be in order at all, and words can resolve to arbitrary numbers which are just more conveniences of hash tables
« Last Edit: January 25, 2019, 02:17:32 pm by _vince »