QB64.org Forum

Active Forums => Programs => Topic started by: STxAxTIC on January 23, 2019, 08:21:21 pm

Title: Dictionizer: Useful utility and study of Hashing
Post by: STxAxTIC on January 23, 2019, 08:21:21 pm
Hello all,

This post is about my latest mechanization - the Dictionizer. Since the topic of alternative data structures comes up once in a while, I figured I would showcase the magic of "hashing", and how this idea can (and should) replace traditional arrays for many problems.

Beginning Remarks
This discussion often mingles with, but is completely separate from, the idea of linked lists that we beat to death on an adjacent thread. What I explain below doesn't reference that stuff at all, but the starting place is the same - traditional arrays are very misused, and with a little cleverness we may cut lookup and looping times to a minimum.

Intro to Hashing
Much like an ordinary array, a "hash table" is a rigid structure for holding data, where each data element has an address. Unlike an array however, an element can't just have *any* address, but it needs a very specific address. Where does the address come from? A data element's address is calculated using the data itself, not manually assigned. The calculation is performed by a hash function.

So if you want to store a list of names Alice, Bob, Candice, Daniel, Ezequiel, and so on, you must come up with a number for each name. For absolute simplicity, let our hash function create a number from each data element using the ASCII value of the LAST letter of the name.

Code: QB64: [Select]
FUNCTION ExampleHashFunction (a AS STRING)
    ExampleHashFunction = ASC(RIGHT$(a, 1))
END FUNCTION

For this situation, we have ASC("e")= 101, ASC("b")= 98, ASC("l")= 108. This generates the hash table:

Alice -- 101
Bob -- 98
Candice -- 101
Daniel -- 108
Ezequiel -- 108

Keep in mind that a traditional array would probably use addresses 1, 2, 3, 4, 5. Using our hash function, though, we cram all five names into the addresses 98, 101, and 108. Setting aside the fact that certain elements have the same address (called a "collision"), observe what just happened: looking up elements in a hash table does not involve looping or otherwise scanning through the elements. You simply supply the name of the element (from now on called a key), and the hash function calculates the address from the key.
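In Python terms (an illustrative translation, not the QB64 above), the whole scheme is one line of arithmetic per key:

```python
# Toy hash function from the post: the address is the ASCII code
# of the LAST letter of the key.
def example_hash(name):
    return ord(name[-1])

# Each name computes its own address; no scanning involved.
addresses = {name: example_hash(name) for name in
             ["Alice", "Bob", "Candice", "Daniel", "Ezequiel"]}
# addresses: {'Alice': 101, 'Bob': 98, 'Candice': 101, 'Daniel': 108, 'Ezequiel': 108}
```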

Handling Collisions
Hash table construction must account for collisions, which occur when two keys generate the same hash value. This is handled by simply concatenating the data at that address, with appropriate measures taken to later parse the combined elements. For our example, the finished hash table looks like:

98 -- {Bob}
101 -- {Alice}{Candice}
108 -- {Daniel}{Ezequiel}
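The same buckets in Python, using a list per address instead of the {Alice}{Candice} string concatenation (an illustration; the Dictionizer itself stores delimited strings):

```python
def example_hash(name):
    return ord(name[-1])  # same toy hash: ASCII of the last letter

# Build the table: every colliding key is appended to its bucket.
table = {}
for name in ["Alice", "Bob", "Candice", "Daniel", "Ezequiel"]:
    table.setdefault(example_hash(name), []).append(name)

def lookup(name):
    # Hash once, then scan only the short bucket at that address.
    return name in table.get(example_hash(name), [])
```

Collisions only cost a scan of one small bucket, never a pass over the whole table.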

...

Application
Okay, I've realized this forum space isn't the proper venue to spill out three more pages on hashing theory, so I'll cut right to the chase. What is the Dictionizer? This program first loads the entire Oxford English dictionary (and its quirks) into a big hash table in memory, allowing extremely fast lookup of words. (Ask for a word, and you zoom right to its address without stepping through the whole list.) Next, the program scans through a user-specified text file to construct a list of every unique word. Finally, every one of those words is looked up in the hash table, and the output is a beefy CSV that contains the words and their definitions. It's crude at this point, but it can of course be polished.
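In outline the pipeline is: load the dictionary into a hash table, pull the unique words out of the user's document, look each one up, and write a CSV. A compressed Python sketch of that flow (Python's built-in dict stands in for the hash table, and the regex tokenizer is my simplification of the punctuation scrubbing in the real code):

```python
import csv
import re

def dictionize(dictionary, text, out_path):
    # Unique words of the document, in order of first appearance.
    words = list(dict.fromkeys(re.findall(r"[a-z']+", text.lower())))
    # One hashed lookup per word; unknown words get a blank definition.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for word in words:
            writer.writerow([word, dictionary.get(word, "")])
```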

Instructions
Download and compile dictionizer.bas to an exe (code box below).
Make sure dic.txt is in the same directory as the exe.
Drag+drop a text file (there are two supplied) to generate the output.

You'll immediately see where the dictionary falls short, but it's fine for a prototype.
As a bonus, attached is a histogram of the hash function output across the whole data space. The idea is that the plot should show no trends and not be too tall in any one place.

Code: QB64: [Select]
' Deal with file names.
c$ = COMMAND$ ' Drag-and-drop (or command-line) file name.
IF c$ = "" THEN
    c$ = "doc.txt"
    PRINT "No input file specified. Assuming " + c$ + "."
END IF
file$ = c$
j = INSTR(c$, ".")
IF j > 0 THEN
    fileout$ = LEFT$(file$, j - 1) + "-out.csv"
ELSE
    fileout$ = file$ + "-out.csv"
END IF

DIM SHARED HashTableSize AS LONG
HashTableSize = 300007 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.

DIM SHARED LB AS STRING ' Make sure the bracketing sequences do not appear in the data source, otherwise use (a) special character(s).
DIM SHARED RB AS STRING
LB = "{/"
RB = "/}"

DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.

' Analysis/debugging tool:
' Prepare a plot of hash values versus frequency of occurrence (histogram).
OPEN "histo.txt" FOR OUTPUT AS #2

' Open Oxford English Dictionary and load all information into a hash table (the array EnglishDictionary).
PRINT "Loading dictionary..."
OPEN "dic.txt" FOR BINARY AS #1
i = 0
DO WHILE NOT EOF(1)
    LINE INPUT #1, a$
    j = INSTR(a$, "  ")
    k = INSTR(a$, " , ")
    IF k > 0 AND j = 0 THEN j = k ' Handles (the) word(s like) "To , "
    IF j > 0 THEN
        i = i + 1
        b$ = LEFT$(a$, j - 1) ' Extract the base word.
        IF RIGHT$(b$, 1) = "1" THEN b$ = LEFT$(b$, LEN(b$) - 1) ' Remove trailing `1' from words with multiple definitions.
        b$ = LCASE$(b$) ' Convert to lowercase.
        c$ = RIGHT$(a$, LEN(a$) - j - 1)
        IF b$ = "usage" THEN
            ' Append previous entry with "Usage" information from source.
            ' The source was originally formatted such that "Usage" parses exactly as a dictionary word.
            EnglishDictionary(d) = LEFT$(EnglishDictionary(d), LEN(EnglishDictionary(d)) - LEN(RB)) + "... " + b$ + ": " + c$ + RB
        ELSE
            d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
            ' Store the word and definition in the following format: {/word/}{/definition/}
            ' Collisions are appended to the right and parsed later during lookup: {/word1/}{/definition1/}{/word2/}{/definition2/}
            EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB + LB + c$ + RB
            PRINT #2, d ' Record the hash value (analysis/debugging).
        END IF
    END IF
LOOP
CLOSE #1
CLOSE #2 ' Close histogram data file (analysis/debugging).
PRINT "Done."

' Done developing fast lookup tool. Now time for an application.

' Open user-specified text document and make a list of unique words (without counting).
WordList$ = ""
PRINT "Loading user document..."
OPEN file$ FOR BINARY AS #1
OPEN fileout$ FOR OUTPUT AS #2
DO WHILE NOT EOF(1)
    LINE INPUT #1, a$
    c$ = a$ + " " ' Trailing space so the last word on the line is processed too.
    DO WHILE LEN(c$) > 0
        j = INSTR(c$, " ")
        IF j > 0 THEN
            b$ = LEFT$(c$, j - 1) ' Extract the base word.
            b$ = LTRIM$(RTRIM$(b$))
            b$ = LCASE$(b$) ' Convert to lowercase.
            c$ = RIGHT$(c$, LEN(c$) - j)
            ' Remove punctuation and stray marks.
            TheNaughtyList$ = "`1234567890=~!@#$%^&*()_+[]\{}|;:,.<>/? " + CHR$(34) + CHR$(10) + CHR$(13) ' Ignores hyphen and single quote, as in: all-that's
            FOR k = 1 TO LEN(TheNaughtyList$)
                b$ = ReplaceSubString$(b$, MID$(TheNaughtyList$, k, 1), "")
            NEXT
            ' Add to word list in format: {/word1/}{/word2/}{/word3/}...
            IF INSTR(WordList$, LB + b$ + RB) = 0 THEN
                WordList$ = WordList$ + LB + b$ + RB
                PRINT b$ + CHR$(9) + Lookup$(b$)
                PRINT #2, b$ + CHR$(9) + Lookup$(b$)
            END IF
        ELSE
            b$ = c$
            c$ = ""
        END IF
    LOOP
LOOP
CLOSE #1
CLOSE #2
PRINT "Done."

FUNCTION Lookup$ (a AS STRING)
    r$ = ""
    b$ = EnglishDictionary(HashFunc(a))
    c$ = ""
    d$ = ""
    IF b$ <> "" THEN
        DO WHILE c$ <> a
            c$ = ReturnBetween$(b$, LB, RB)
            IF c$ = "" THEN EXIT DO
            b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
            d$ = ReturnBetween$(b$, LB, RB)
        LOOP
    END IF
    r$ = a + "  " + d$
    Lookup$ = r$
END FUNCTION

FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' input string, left bracket, right bracket
    i = INSTR(a, b)
    j = INSTR(a, c)
    f = LEN(b)
    ReturnBetween$ = MID$(a, i + f, j - (i + f))
END FUNCTION

FUNCTION HashFunc (a AS STRING) ' input string
    DIM sum AS DOUBLE
    sum = HashTableSize
    FOR k = 1 TO LEN(a)
        sum = sum + k * COS(ASC(MID$(a, k, 1))) ' Without the linear factor of k, permutations would have the same hash value.
    NEXT
    sum = ABS(VAL(ReplaceSubString$(STR$(sum), ".", "")))
    sum = sum MOD HashTableSize
    HashFunc = sum
END FUNCTION

FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
    j = INSTR(a, b)
    IF j > 0 THEN
        r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
    ELSE
        r$ = a
    END IF
    ReplaceSubString$ = r$
END FUNCTION
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: luke on January 24, 2019, 07:24:26 am
If you'd like something that doesn't require transcendental functions, here's my go-to hashing function. Stolen shamelessly from http://www.cse.yorku.ca/~oz/hash.html
Code: QB64: [Select]
FUNCTION htable_hash~& (s$, mod_size)
    DIM hash~&, i
    hash~& = 5381
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i)) MOD mod_size
    NEXT i
    htable_hash~& = hash~&
END FUNCTION
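The same djb2 variant in Python, for anyone comparing (a direct translation of luke's function, including his choice to reduce modulo the table size on every step):

```python
def htable_hash(s, mod_size):
    h = 5381
    for ch in s:
        # Multiply by 33, XOR in the next character, reduce each step.
        h = ((h * 33) ^ ord(ch)) % mod_size
    return h
```

Unlike the last-letter toy hash, this mixes every character, so permutations like "ab" and "ba" land in different buckets.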
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: _vince on January 24, 2019, 09:01:02 am
Nice post, stx.  Hey luke did you know there are new bitshifting functions/statements in QB64?
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: Pete on January 24, 2019, 10:10:00 am
I feel like I just saw the trailer to a good show, and then the company forgot to release the movie! Loved the introduction to hashing... but then it ended. Ahhhhhhh! I'd have been happy to read the following three pages if you wrote/included them.

When I made my word processor with spell check, I found the fastest way I knew of at the time was to divide the dictionary into a-z, with one further division by the second letter of the word, a-l and m-z. This cut down on the search a lot, and even back in the 1990s systems were fast enough that a two-page report could be completely spell checked in around 5 seconds.

Pete
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: SMcNeill on January 24, 2019, 10:25:04 am
When I made my word processor with spell check, I found the fastest way I knew of at the time was to divide the dictionary into a-z, with one further division by the second letter of the word, a-l and m-z. This cut down on the search a lot, and even back in the 1990s systems were fast enough that a two-page report could be completely spell checked in around 5 seconds.

Pete

Wouldn’t a simple binary search have been faster?  Since a dictionary is stored alphabetically, you just need to search in halves...

A
B
C
D
E
F
G
H
I
J

Where is C in the list?

Start in the middle at E.  E > C so search before it.
B is in the middle of A to D.  B < C so search after it.
C is the middle of C to D.  C = C.  Found it!

A max of N searches, where 2^N > Index.  ;)
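Steve's walkthrough, sketched in Python with a comparison counter (the A..J list and target C from the post take exactly three probes):

```python
def binary_search(items, target):
    """Return (index or -1, probes used) for target in a sorted list."""
    lo, hi, probes = 0, len(items) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == target:
            return mid, probes
        elif items[mid] < target:
            lo = mid + 1  # target is after the midpoint
        else:
            hi = mid - 1  # target is before the midpoint
    return -1, probes

letters = [chr(c) for c in range(ord("A"), ord("J") + 1)]  # A through J
index, probes = binary_search(letters, "C")  # probes E, then B, then C
```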
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: Pete on January 24, 2019, 11:30:30 am
Well let's see.

What I came up then was something like this...

Code: QB64: [Select]
LINE INPUT "word: "; w$

a1$ = "spellck\" + MID$(w$, 1, 1)
a2$ = LCASE$(MID$(w$, 2, 1))
IF a2$ > "l" THEN a2$ = a1$ + "mz\" ELSE a2$ = a1$ + "al\"

PRINT "Folder: " + a2$ + " Search for: " + w$
END ' Stop demo here because folders and word lists not included.

' The spell check would go something like this. No caps in this example, and other considerations like hyphens are not included here.
OPEN a2$ FOR INPUT AS #1
l$ = LCASE$(MID$(w$, 1, 2))
DO UNTIL EOF(1)
    LINE INPUT #1, x$
    x1$ = MID$(LCASE$(x$), 1, 2)
    IF x$ = w$ THEN EXIT DO
    IF x1$ > l$ THEN flag% = -1: EXIT DO ' Search word is larger than current record entry.
LOOP

IF flag% THEN
    flag% = 0
    PRINT "Word not in database."
END IF

So my spellcheck folder contained 52 sub-folders. If memory serves, it was a 50,000 word spell check, so a sub-folder would be approximately 1,000 records. It worked fast enough so I never put together a way to index the entries past that.

Pete
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: STxAxTIC on January 24, 2019, 03:19:46 pm
Hi again all and thanks for the replies. When checking this thread last night I saw that more people looked at this histogram than ran the code, which made me wonder... anyway...

@luke
Thanks for the suggestion! I've been denying myself the pleasure(?) of deriving or pasting a better hash function into the code, but for some reason the scaled cosine (for lack of a better description) worked. I think I'll be stealing yours though in the future. (Btw, http://www.cse.yorku.ca/~oz/hash.html is an awesome quick read!)

@vince
Honored to receive your positive feedback!

@pete
I'm taking the preview-movie comparison as a compliment (thanks)! Unfortunately, the fabled "three pages on hashing theory" are only floating around in my mind and on the Internet, but not in a form that I've typed up yet. I've been considering converting some of my juicier posts into articles though. (Case in point, when I first posted qXed the community focus was 100% on cursor bugs and 0% on the mechanism.) If released as a PDF or HTML instead of a forum post, I think plenty of works would self-represent a bit better.

Rambling on that chord for a second, I think time has shown that QB64 and its users don't encourage projects with thick stacks of libraries and $include files by different authors. Our works don't really propagate by means of being $included into other projects, but we all fantasize that they do. The absence of evidence for this is staggering. I think the thickest program I ever wrote was an InForm implementation of Sprezzo, and the number of authors swelled to a whopping 3 (includes Luke via falcon.h). I dare say we'll never have projects out there with 5 or 6 largely-contributing authors. Thinking about how to properly "write up" a post reminds me that we propagate our ideas by teaching them to the coder next to us rather than saying "include my junk in your shitstack". I think if any one post has a big point to make, a forum probably doesn't have the right tooling.

@Steve and Pete (again)
Funny, hashing could have solved this for you back in the 90's. There are no exotic QB64 keywords here - nearly all of my code (and perhaps all of this code) is QB45-ready, plus or minus memory limits. I like Steve's suggestion if we agree to only use uniformly-filled dictionaries. You can see why a weird list like

Apple
Bear
Sandcastle
Sasquatch
Sediment
Segment
Sip
Siren
Soap
Soda
Suds
Supper
Zebra
Zoo
Zoology

... might confound the method.


Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: Pete on January 24, 2019, 03:43:08 pm
I've always considered QB forums just fun places to talk about stuff, like talking to yourself in some respect. I think the majority of QB coders are more the reinvent-the-wheel types, because that's just more fun, although less productive. Of course, one benefit from doing all the code your own way is that if you ever want to customize it, you can, because you completely understand it.

I recall when I first looked into C in 1995, because I figured it was going to be a language that would survive, I ran into a brick wall with this use-other-people's-stuff approach. I needed a custom-made key input routine for word processing, but when I asked for help, all I got was: use a library! So in some ways these gurus were ahead of me, because they knew the libraries, but I was ahead of them in that they couldn't program one of those libraries to save their collective souls. I managed to make my own custom key input routine without their forum help, but after that project I just concluded C wasn't for me.

I never got into the rules and regs of other languages much. No real education in making hash tables, linked lists, file indexing, etc. I've always been one who can think up a way to get one what I want done. Find the angle, so to speak. When it comes to coding, I consider myself more of a builder than a programmer. Well, now I wonder if I explained my coding style or my avatar selection by these comments. Oh well.

... and yes Bill, that was meant as a compliment... but you weren't supposed to take it. So give it back!:D

A raven walks into a crowbar... Oh well, a story for another time.

Pete
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: SMcNeill on January 24, 2019, 04:06:24 pm
Quote
I like Steve's suggestion if we agree to only use uniformly-filled dictionaries. You can see why a weird list like

Apple
Bear
Sandcastle
Sasquatch
Sediment
Segment
Sip
Siren
Soap
Soda
Suds
Supper
Zebra
Zoo
Zoology

... might confound the method.

If the data is sorted, the strangeness of its contents doesn't matter.  There are 15 items in that list; you can find any of them in at most 4 comparisons.  (Double the list and it's still at most 5.)

Binary searching can be even more effective than hash tables, if you have a poor hash algorithm and lots of collisions in the tables — like if you were to use your "ASCII value of the LAST letter" method on Pete's 50,000 word dictionary.  The data for "101" (E) might have 10,000 entries in it to look through after the hash is done...

A binary search, however, would *always* get the result in less than 17 tries.  (18 maybe; I was trying to do the math in my head and just woke up...)

Contents of the data don't matter, as it's the index which you compare against and move up/down in increments of half. 

16 items..  guess...
Wrong?  You just eliminated 8...  guess...
Wrong?  You just eliminated 4... 
2...
1...  RIGHT!
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: STxAxTIC on January 24, 2019, 05:36:12 pm
(I think you noticed the data I listed was already alphabetical but in case anyone missed it - it was.)

You baited me on purpose, right? Because duuuuuuude, of course a shitty hash implementation will perform below a well-made alternative method. It's like you're saying a well-designed sledge hammer rolls down the road more elegantly than cars, for sufficiently badly designed cars. Yes, yes we know.

Beyond straw man stuff, please don't tell me you're defending *any* kind of iterative search against a hash method. 17 steps? 18 steps? Where did these numbers come from? How do I interpret them in O(n) notation? Both numbers are more than a dozen too large anyway. The method I demonstrate above gets to the address in one step, and in extremely rare cases might have to deal with a collision more than 6 layers deep. (Actual example, not theory.)

I think the burden's on you to do the speed test, m8 (for a large data set with lots of performance demand). Feel free to use the method above. Use Luke's hash function if it's faster.

EDIT:

To clarify - how does comparing only indices actually sort or find any real data?

Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: Pete on January 24, 2019, 08:09:28 pm
So let's say I put all 50 states in an RA word file, in alphabetical order. With Steve's concept, I get the LOF() of that file, divide it in half, and start a search (GET) at that record number. If the word at that address matches, I'm done. If it is higher or lower, I travel 1/2 in that direction, say record 13 or 38, and try again. If that does not match, continue x amount of times, and if that does not find it, I go record by record for a short distance to match the word.

Now with Bill's idea, the word list can be in any order. I just put together a hash algorithm and put the words into a table. Some records are single words. Other words have "collisions," so I can separate them by a comma. Now I pop them into my RA word base by the hash number, the length of the RA file to be determined by the largest record entry. I now find the word by applying the hash algorithm to it, get that record number, and if comma(s) are present, use an INSTR() function to see if the word I'm checking is in that record.

Sound correct, guys? If so, with Steve's method, the program uses more GET and compare loops. With Bill's method, the word itself has to be looped through to come up with the hash index, but then it only takes one GET to find the record and one INSTR() to find the word within that record. I will say that to alphabetize a large word list and use Steve's method is easier to implement. To make a good hash algorithm, one that produces a good distribution of words in records, would be more of a challenge.

Pete

So Steve's idea...

Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: SMcNeill on January 24, 2019, 08:11:24 pm
So Steve's idea...

Just shared here: https://www.qb64.org/forum/index.php?topic=1003.0
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: Pete on January 24, 2019, 08:45:56 pm
LOL - I am psychic. I had a hunch you were about to post something as I put my reply, above, together.

You know, you mentioned in the other thread about memory, which now isn't the problem it was in QB circa 1990. So I was thinking: the hell with all the techno crap, just use that memory!

First, separate your word list with commas and put the whole glob into a binary file as one giant string. Now...

OPEN "wordlist.txt" FOR BINARY AS #1
word$ = SPACE$(LOF(1))
GET #1, , word$
CLOSE #1

IF INSTR(word$, "," + mytypedword$ + ",") THEN PRINT "yep, found it." ELSE PRINT "word not found."

I mean, what's faster than one INSTR() lookup? And the list is already (and quickly) loaded into memory at the start of the program.
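Pete's one-big-string membership test, in Python terms: a single substring search against a delimiter-wrapped blob. (Wrapping both the blob and the probe in the delimiter is what stops "ate" from matching inside "late".)

```python
def in_wordlist(blob, word):
    # blob is the whole comma-separated word list as one string.
    return ("," + word + ",") in ("," + blob + ",")

words = "late,gate,ate,slate"
```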

Pete

Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: SMcNeill on January 24, 2019, 08:58:24 pm
The question there, Pete, is what spot the word occupies in your list.  If just knowing it's there is good enough, that might work, but if you need the word index, it's insufficient.  ;)
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: Pete on January 24, 2019, 09:21:09 pm
True on the index issue, although I never needed that. I built some kind of replacement-words algorithm from stuff I studied online and figured out for myself, so that was better than just grabbing the surrounding words from just before and just after the point where the search fell out for a word not found.

Pete

When it comes to coding, I can get away with murder if I don't end up killing myself in the process.
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: bplus on January 24, 2019, 11:45:32 pm
The question there, Pete, is what spot the word occupies in your list.  If just knowing it's there is good enough, that might work, but if you need the word index, it's insufficient.  ;)

Steve you had nothing to worry about.
Code: QB64: [Select]
SCREEN _NEWIMAGE(800, 600, 32)

OPEN "466544 Word List.txt" FOR BINARY AS #1
DIM SHARED WordList(1 TO 466544) AS STRING
DO WHILE NOT EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1

DIM RandomWords(50000) AS STRING
FOR i = 1 TO 50000
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
PRINT "Steve's 50000 word lookup with binary search"
t# = TIMER
FOR i = 1 TO 50000
    index = FindIndex(RandomWords(i))
NEXT
PRINT USING "###.### seconds lookup"; TIMER - t#

'Pete's method
OPEN "466544 Word List.txt" FOR BINARY AS #1
word$ = SPACE$(LOF(1))
GET #1, , word$
CLOSE #1
PRINT "Pete's 50000 word lookup using INSTR to get position in string of word."
t# = TIMER
FOR i = 1 TO 50000
    place = INSTR(word$, RandomWords(i) + CHR$(13) + CHR$(10))
NEXT
PRINT USING "###.### seconds lookup"; TIMER - t#

DO
    _KEYCLEAR
    INPUT "Give me any word"; search$
    search$ = LTRIM$(RTRIM$(search$))
    PRINT "Searching for "; search$
    IF search$ = "" THEN END
    index = FindIndex(search$)
    IF index THEN
        PRINT "Word was found at position"; index; "in"; SearchTimes; "passes."
    ELSE
        PRINT "Word was not in list."
        PRINT "Previous word was ==> "; WordList(LastIndex - 1)
        PRINT "Next word was ======> "; WordList(LastIndex + 1)
        PRINT "Search took"; SearchTimes; "passes."
    END IF
    PRINT
LOOP

FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = 1 ' LBOUND(WordList)
    max = 466544 ' UBOUND(WordList)
    DO UNTIL found
        SearchTimes = SearchTimes + 1
        gap = (min + max) \ 2
        compare = _STRICMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap + 1
        ELSEIF compare < 0 THEN
            max = gap - 1
        ELSE
            FindIndex = gap
            found = -1
        END IF
        IF max - min <= 1 THEN LastIndex = gap: found = -1 ' It's not in the list.
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
END FUNCTION

Ha, ha, ha, had me fooled too! ;-))

Though the INSTR method would find a word in collisions fast enough, I imagine.
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: SMcNeill on January 25, 2019, 12:21:52 am
Makes sense when you think about what INSTR does:  compares character by character until it finds a match....

PRINT INSTR("1234567890", "0")  <-- this would need to start at position 1 and compare each byte in the string until it finds where "0" appears first — if it appears at all.

So take 466544 words of average length 5, and you have a string with 2 million bytes in it...  That’s up to 2 million comparisons to find a match...

To be honest, I’m surprised all it took was 2.5 minutes or so to finish. 
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: Pete on January 25, 2019, 02:30:43 am
For my needs, INSTR() works great. Actually, I have used my method to pare words out of a very large HTML page. I just don't look up words 50,000 times! So finding just one word in a 50,000 word list, with an average of 6 letters per word, would be nearly instantaneous...

Code: QB64: [Select]
x$ = ","
DO
    x$ = x$ + CHR$(RND * 26 + 65)
    IF INT(RND * 8) = 5 THEN x$ = x$ + ","
    IF LEN(x$) > 150000 AND flag = 0 THEN flag = 1: PRINT "Half finished...": x$ = x$ + ",ATE,"
LOOP UNTIL LEN(x$) > 300000

PRINT " Press any key to run speed test..."
SLEEP

t# = TIMER
PRINT "ATE is at position:"; INSTR(x$, ",ATE,") + 1
PRINT USING "###.### seconds lookup"; TIMER - t#

What does take a minute is putting the 300000+ character string together.

Parsing out frequency would require seeding the INSTR() function, but with seeding it is also very fast.

It's a quick and dirty low tech solution if you don't mind clogging up a lot of memory to use it.

Edit: Wow, frequency tests are lightning fast, too.

Code: QB64: [Select]
x$ = ","
DO
    x$ = x$ + CHR$(RND * 26 + 65)
    IF INT(RND * 500) = 25 THEN x$ = x$ + ",ATE,"
LOOP UNTIL LEN(x$) > 300000

PRINT " Press any key to run speed test..."
SLEEP

t# = TIMER
seed& = 1
DO
    ii = ii + 1
    seed& = INSTR(seed&, x$, ",ATE,")
    IF seed& = 0 THEN EXIT DO
    PRINT ii; seed&; MID$(x$, seed& + 1, LEN("ATE"))
    seed& = seed& + LEN(",ATE,")
LOOP
PRINT USING "###.### seconds lookup"; TIMER - t#
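The seeded-INSTR loop translates almost one-to-one to Python's str.find with a start offset (a sketch of the counting idea; matches are non-overlapping because we jump past each hit):

```python
def count_hits(blob, probe):
    count, seed = 0, 0
    while True:
        seed = blob.find(probe, seed)  # like INSTR(seed&, blob$, probe$)
        if seed == -1:
            return count
        count += 1
        seed += len(probe)             # continue just past this match
```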

Pete
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: _vince on January 25, 2019, 11:50:16 am
Pete the Builder

I believe QB64 actually uses hash tables.  I remember when that was a giant milestone for Galleon in improving compilation time - hashing for variable (sub, function, array, etc.) names, though I am not familiar with exactly how it was done. Here's a reasonable list of applications that even mentions programming language interpreters/compilers: https://en.wikipedia.org/wiki/Hash_table#Uses. Maybe Steve can further increase compilation time by replacing it with his binary search?

@STx the dictionary example may be a poor example of practical hash table use, unless Steve is just totally messing with you, one-upping you with the binary search, who knows. The best and simplest example would be "associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like Perl, Ruby, Python, and PHP" (from the wiki).  Here's an example:

there's an array called "cat":

cat$(0) = "Benedict"
cat$(1) = "maine coon"
cat$(2) = "rats"
cat$(3) = "35 kg"

It's a little confusing to remember that I'd have to call cat$(1) to get its breed; a better representation is:

cat_hash$("name") = "Benedict"
cat_hash$("breed") = "maine coon"
cat_hash$("food") = "rats"
cat_hash$("mass") = "35 kg"

now all you need is a hashing function to resolve the following:

"name" -> 0
"breed" -> 1
"food" -> 2
"mass" -> 3

as you can imagine, this is extremely useful in tables and databases.  Shame on Steve for spamming this otherwise informative thread with unrelated and misleading nonsense.
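For the record, vince's cat_hash is exactly what interpreted languages ship as a built-in; in Python, the interpreter hashes the key string behind the scenes:

```python
cat_hash = {
    "name": "Benedict",
    "breed": "maine coon",
    "food": "rats",
    "mass": "35 kg",
}
# Keys are hashed to slots internally; no index numbers to memorize.
breed = cat_hash["breed"]  # "maine coon"
```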
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: Pete on January 25, 2019, 12:05:18 pm
"Yes he can!"

Just don't get on his bad side... Angry Pete

https://thumbs.gfycat.com/ReasonableEmbellishedHalcyon-size_restricted.gif
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: SMcNeill on January 25, 2019, 12:31:42 pm
@STx the dictionary example may be a poor example of practical hash table use unless Steve is just totally messing with you...

Now we all know I wouldn’t do anything like that.  :P
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: _vince on January 25, 2019, 02:10:59 pm
Edited above
Title: Re: Dictionizer: Useful utility and study of Hashing
Post by: _vince on January 25, 2019, 02:13:49 pm
Perhaps a dictionary implementation is as follows:

hashing function:

"apple" -> 3
"africa" -> 1
"xylophone" -> 2
"zebra" -> 0

then you simply store your dictionary as follows

dict$(0) = "horse-like animal"
dict$(1) = "continent"
dict$(2) = "musical instrument"
dict$(3) = "type of fruit"

obviously it does not have to be in order at all, and words can resolve to arbitrary numbers - just more of the conveniences of hash tables
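Putting vince's two halves together (the word-to-index hash and the storage array), a minimal chained table might look like this in Python (a generic sketch using a djb2-style hash, not how any particular interpreter does it):

```python
class HashDict:
    """Tiny chained hash table: hash the word to a bucket, keep
    (word, definition) pairs in that bucket to survive collisions."""

    def __init__(self, size=101):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _index(self, word):
        h = 5381
        for ch in word:  # djb2-style mixing, as in luke's post
            h = ((h * 33) ^ ord(ch)) % self.size
        return h

    def put(self, word, definition):
        self.buckets[self._index(word)].append((word, definition))

    def get(self, word):
        # Scan only the one bucket the word hashes to.
        for w, d in self.buckets[self._index(word)]:
            if w == word:
                return d
        return None

dict_demo = HashDict()
dict_demo.put("apple", "type of fruit")
dict_demo.put("zebra", "horse-like animal")
```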