Author Topic: Dictionizer: Useful utility and study of Hashing (Read 16046 times)

bplus · « **Reply #15 on:** January 24, 2019, 11:45:32 pm »

Quote from: SMcNeill on January 24, 2019, 08:58:24 pm

The question there, Pete, is what spot does the word contain in your list? If just knowing it’s there is good enough, that might work, but if you need the word index, it’s insufficient. ;)

Steve you had nothing to worry about.

Code: QB64:

SCREEN _NEWIMAGE(800, 600, 32)
 
OPEN "466544 Word List.txt" FOR BINARY AS #1
DIM SHARED WordList(1 TO 466544) AS STRING
DO UNTIL EOF(1)
    count = count + 1
    LINE INPUT #1, WordList(count)
LOOP
CLOSE #1
DIM RandomWords(50000) AS STRING
FOR i = 1 TO 50000
    c = INT(RND * 466544) + 1
    RandomWords(i) = WordList(c)
NEXT
PRINT "Steve's 50000 word lookup with binary search"
t# = TIMER
FOR i = 1 TO 50000
    index = FindIndex(RandomWords(i))
NEXT
PRINT USING "###.### seconds lookup"; TIMER - t#
 
 
'pete's method
OPEN "466544 Word List.txt" FOR BINARY AS #1
word$ = SPACE$(LOF(1))
GET #1, , word$
CLOSE #1
PRINT "Pete's 50000 word lookup using INSTR to get position in string of word."
t# = TIMER
FOR i = 1 TO 50000
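    'Caveat: this also matches the tail of a longer word (e.g. "cat" + CRLF inside "bobcat" + CRLF),
    'so place is not always the start of the word's own entry.
    place = INSTR(word$, RandomWords(i) + CHR$(13) + CHR$(10))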
NEXT
PRINT USING "###.### seconds lookup"; TIMER - t#
 
END
 
 
 
 
DO
    _KEYCLEAR
    INPUT "Give me any word"; search$
    search$ = LTRIM$(RTRIM$(search$))
    PRINT "Searching for "; search$
    IF search$ = "" THEN END
    index = FindIndex(search$)
    IF index THEN
        PRINT "Word was found at position "; index; " in "; SearchTimes; "passes."
    ELSE
        PRINT "Word was not in list."
        PRINT "Previous word was ==> "; WordList(LastIndex - 1)
        PRINT "Next word was ======> "; WordList(LastIndex + 1)
        PRINT "Search took"; SearchTimes; "passes."
    END IF
    PRINT
LOOP
 
 
 
FUNCTION FindIndex (search$)
    SHARED SearchTimes, LastIndex
    SearchTimes = 0
    min = LBOUND(WordList)
    max = UBOUND(WordList) 'was hard-coded to 370099, which skipped the end of the 466544-word list
    DO WHILE min <= max 'keep going until the range is empty, so the last candidates still get checked
        SearchTimes = SearchTimes + 1
        gap = (min + max) \ 2
        compare = _STRICMP(search$, WordList(gap))
        IF compare > 0 THEN
            min = gap + 1
        ELSEIF compare < 0 THEN
            max = gap - 1
        ELSE
            FindIndex = gap
            EXIT FUNCTION
        END IF
        ' PRINT min, max, search$, WordList(gap), compare
        ' SLEEP
    LOOP
    LastIndex = gap 'range is empty, so it's not in the list
END FUNCTION
 
 

Ha, ha, ha, had me fooled too! ;-))

Though the INSTR method would find a word in collisions fast enough, I imagine.
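
One way to read that, sketched under assumptions that are not from anyone's posted code (the bucket layout, the `Slot&` hash, the sample words and the table size of 7 are all made up for illustration): keep each hash slot as a comma-delimited string of the words that landed there, and let INSTR search only that one short string.

```basic
CONST Buckets = 7 ' assumed: deliberately tiny so that some words must share a slot
DIM SHARED Bucket(0 TO Buckets - 1) AS STRING

AddWord "apple": AddWord "africa": AddWord "xylophone": AddWord "zebra"
AddWord "horse": AddWord "fruit": AddWord "animal": AddWord "continent"

PRINT "zebra in table? "; HasWord%("zebra") ' -1 means yes
PRINT "pete in table?  "; HasWord%("pete") ' 0 means no
END

' Hypothetical hash: fold the character codes into 0 .. Buckets - 1.
FUNCTION Slot& (w$)
    DIM h AS LONG, i AS LONG
    FOR i = 1 TO LEN(w$)
        h = (h * 31 + ASC(w$, i)) MOD Buckets
    NEXT
    Slot& = h
END FUNCTION

' Append the word to its bucket, keeping a comma on both sides of every entry.
SUB AddWord (w$)
    DIM s AS LONG
    s = Slot&(w$)
    IF Bucket(s) = "" THEN Bucket(s) = ","
    Bucket(s) = Bucket(s) + w$ + ","
END SUB

' INSTR only scans the one short bucket string, not the whole word list.
FUNCTION HasWord% (w$)
    HasWord% = INSTR(Bucket(Slot&(w$)), "," + w$ + ",") > 0
END FUNCTION
```

Eight words in seven buckets guarantees at least one collision, and the delimiter on both sides keeps a short word from matching the tail of a longer one in the same bucket.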

SMcNeill · « **Reply #16 on:** January 25, 2019, 12:21:52 am »

Makes sense when you think about what INSTR does: compares character by character until it finds a match....

PRINT INSTR("1234567890", "0") <-- this would need to start at position 1 and compare each byte in the string until it reaches the first "0", if it appears at all.

So take 466544 words of average length 5, and you have a string with over 2 million bytes in it (more once you count the CR/LF after each word)... That’s up to 2 million or more comparisons to find a match...

To be honest, I’m surprised all it took was 2.5 minutes or so to finish.
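
For scale, here is a minimal sketch (an illustration, not something that was run in the thread) that counts how many passes a binary search can need on a 466544-entry list. Each pass discards half of the remaining range, so the worst case is the number of times the list size can be halved before nothing is left.

```basic
DIM n AS LONG, passes AS LONG
n = 466544 ' number of words in the list
DO WHILE n > 0
    n = n \ 2 ' one comparison throws away half of what is left
    passes = passes + 1
LOOP
PRINT "Worst-case binary search passes:"; passes
```

It reports 19 passes, so a lookup compares against at most 19 words, while a single INSTR call on the whole file may have to walk through millions of bytes.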

Pete · « **Reply #17 on:** January 25, 2019, 02:30:43 am »

For my needs, INSTR() works great. Actually, I have used my method to parse words out of a very large HTML page. I just don't look up words 50,000 times! So to find just one word in a 50,000 word list with an average of 6 letters per word would be nearly instantaneous...

Code: QB64:

x$ = ","
DO
    x$ = x$ + CHR$(INT(RND * 26) + 65) 'INT keeps the code in 65..90 (A-Z); without it CHR$ rounds 90.5 and up to "["
    IF INT(RND * 8) = 5 THEN x$ = x$ + ","
    IF LEN(x$) > 150000 AND flag = 0 THEN flag = 1: PRINT "Half finished...": x$ = x$ + ",ATE,"
LOOP UNTIL LEN(x$) > 300000
 
PRINT x$
PRINT
PRINT " Press any key to run speed test..."
 
SLEEP: PRINT
 
t# = TIMER
PRINT "ATE is at position:"; INSTR(x$, ",ATE,") + 1
PRINT USING "###.### seconds lookup"; TIMER - t#

What does take a minute is putting the 300000+ character string together.

Parsing out frequency would require seeding the INSTR() function with a start position, but with seeding it is also very fast.

It's a quick and dirty low tech solution if you don't mind clogging up a lot of memory to use it.

Edit: Wow, frequency tests are lightning fast, too.

Code: QB64:

x$ = ","
DO
    x$ = x$ + CHR$(INT(RND * 26) + 65) 'INT keeps the code in 65..90 (A-Z); without it CHR$ rounds 90.5 and up to "["
    IF INT(RND * 500) = 25 THEN x$ = x$ + ",ATE,"
LOOP UNTIL LEN(x$) > 300000
 
PRINT x$
PRINT
PRINT " Press any key to run speed test..."
 
SLEEP: PRINT
 
t# = TIMER
DO
    ii = ii + 1
    seed& = INSTR(seed&, x$, ",ATE,")
    IF seed& = 0 THEN EXIT DO
    PRINT ii; seed&; MID$(x$, seed& + 1, LEN("ATE"))
    seed& = seed& + LEN(",ATE,")
LOOP
PRINT USING "###.### seconds lookup"; TIMER - t#

Pete

_vince · « **Reply #18 on:** January 25, 2019, 11:50:16 am »

Pete the Builder

I believe QB64 actually uses hash tables. I remember when that was a giant milestone for Galleon in improving compilation time: hashing for variable (sub, function, array, etc.) names, though I am not familiar with exactly how it was done. Here's a reasonable list of applications that even mentions programming language interpreters/compilers: https://en.wikipedia.org/wiki/Hash_table#Uses. Maybe Steve can further increase compilation time by replacing it with his binary search?

@STx the dictionary example may be a poor example of practical hash table use, unless Steve is just totally messing with you by one-upping you with the binary search, who knows. The best and simplest example would be "associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like Perl, Ruby, Python, and PHP" (from the wiki). Here's an example:

there's an array called "cat":

cat$(0) = "Benedict"
cat$(1) = "maine coon"
cat$(2) = "rats"
cat$(3) = "35 kg"

It's a little confusing to remember that I'd have to call cat$(1) to get its breed type; a better representation is:

cat_hash$("name") = "Benedict"
cat_hash$("breed") = "maine coon"
cat_hash$("food") = "rats"
cat_hash$("mass") = "35 kg"

now all you need is a hashing function to resolve the following:

"name" -> 0
"breed" -> 1
"food" -> 2
"mass" -> 3

as you can imagine, this is extremely useful in tables and databases. Shame on Steve for spamming this otherwise informative thread with unrelated and misleading nonsense.
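
A minimal sketch of that resolving step, under assumptions that are not from the post: the table size of 11, the `HashSlot&` function and the `PutItem` / `GetItem$` names are hypothetical. A real hash will not hand out 0, 1, 2, 3 in order the way the list above does; keys land wherever the function puts them, and when two keys want the same slot the second one steps forward to the next free slot (linear probing).

```basic
CONST TableSize = 11 ' assumed small prime; must stay larger than the number of keys stored
DIM SHARED HashKey(0 TO TableSize - 1) AS STRING
DIM SHARED HashValue(0 TO TableSize - 1) AS STRING

PutItem "name", "Benedict"
PutItem "breed", "maine coon"
PutItem "food", "rats"
PutItem "mass", "35 kg"

PRINT "breed = "; GetItem$("breed")
PRINT "mass  = "; GetItem$("mass")
END

' Hypothetical hash: fold the character codes into 0 .. TableSize - 1.
FUNCTION HashSlot& (k$)
    DIM h AS LONG, i AS LONG
    FOR i = 1 TO LEN(k$)
        h = (h * 31 + ASC(k$, i)) MOD TableSize
    NEXT
    HashSlot& = h
END FUNCTION

SUB PutItem (k$, v$)
    DIM slot AS LONG
    slot = HashSlot&(k$)
    ' Linear probing: move on while this slot is taken by a different key.
    ' (Loops forever if the table is completely full; this sketch never fills it.)
    DO WHILE HashKey(slot) <> "" AND HashKey(slot) <> k$
        slot = (slot + 1) MOD TableSize
    LOOP
    HashKey(slot) = k$
    HashValue(slot) = v$
END SUB

FUNCTION GetItem$ (k$)
    DIM slot AS LONG
    slot = HashSlot&(k$)
    ' Follow the same probe path; an empty slot means the key was never stored.
    DO WHILE HashKey(slot) <> ""
        IF HashKey(slot) = k$ THEN GetItem$ = HashValue(slot): EXIT FUNCTION
        slot = (slot + 1) MOD TableSize
    LOOP
    GetItem$ = ""
END FUNCTION
```

The key is stored next to the value so that a lookup can tell its own entry apart from another key that collided into the same slot.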

Pete · « **Reply #19 on:** January 25, 2019, 12:05:18 pm »

"Yes he can!"

Just don't get on his bad side... Angry Pete

https://thumbs.gfycat.com/ReasonableEmbellishedHalcyon-size_restricted.gif

SMcNeill · « **Reply #20 on:** January 25, 2019, 12:31:42 pm »

Quote from: _vince on January 25, 2019, 11:50:16 am

@STx the dictionary example may be a poor example of practical hash table use unless Steve is just totally messing with you...

Now we all know I wouldn’t do anything like that. :P

_vince · « **Reply #21 on:** January 25, 2019, 02:10:59 pm »

Edited above

_vince · « **Reply #22 on:** January 25, 2019, 02:13:49 pm »

Perhaps a dictionary implementation is as follows:

hashing function:

"apple" -> 3
"africa" -> 1
"xylophone" -> 2
"zebra" -> 0

then you simply store your dictionary as follows

dict$(0) = "horse-like animal"
dict$(1) = "continent"
dict$(2) = "musical instrument"
dict$(3) = "type of fruit"

Obviously it does not have to be in order at all, and words can resolve to arbitrary numbers, which are just more conveniences of hash tables.
