Author Topic: Dictionizer: Useful utility and study of Hashing  (Read 8387 times)


Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Dictionizer: Useful utility and study of Hashing
« on: January 23, 2019, 08:21:21 pm »
Hello all,

This post is about my latest mechanization - the Dictionizer. Since the topic of alternative data structures comes up once in a while, I figured I would showcase the magic of "hashing", and how this idea can (and should) replace traditional arrays for many problems.

Beginning Remarks
This discussion often mingles with, but is completely separate from, the idea of linked lists that we beat to death on an adjacent thread. What I explain below doesn't reference that stuff at all, but the starting place is the same - traditional arrays are very misused, and with a little cleverness we may cut lookup and looping times to a minimum.

Intro to Hashing
Much like an ordinary array, a "hash table" is a rigid structure for holding data, where each data element has an address. Unlike an array, however, an element can't just have *any* address - it needs a very specific one. Where does the address come from? A data element's address is calculated from the data itself, not manually assigned. The calculation is performed by a hash function.

So if you want to store a list of names Alice, Bob, Candice, Daniel, Ezequiel, and so on, you must come up with a number for each name. For absolute simplicity, let our hash function create a number from each data element using the ASCII value of the LAST letter of the name.

Code: QB64: [Select]
FUNCTION ExampleHashFunction (a AS STRING)
    ExampleHashFunction = ASC(RIGHT$(a, 1))
END FUNCTION

For this situation, we have ASC("e")= 101, ASC("b")= 98, ASC("l")= 108. This generates the hash table:

Alice -- 101
Bob -- 98
Candice -- 101
Daniel -- 108
Ezequiel -- 108

Keep in mind that a traditional array would probably use addresses 1, 2, 3, 4, 5. Using our hash function though, we cram all five names into the addresses 98, 101, and 108. Setting aside the fact that certain elements share an address (called a "collision"), observe what just happened: looking up an element in a hash table involves no looping or scanning whatsoever. You simply supply the name of the element, from now on called a key, and the hash function calculates the address from the key.
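To see the no-scanning point in running code, here's a minimal standalone sketch (a toy example, repeating the hash function from above so the snippet runs on its own; collisions are ignored for the moment):

```basic
' Store each name at the address computed from the name itself.
DIM Table(255) AS STRING

Table(ExampleHashFunction("Bob")) = "Bob" ' lands at address 98
Table(ExampleHashFunction("Alice")) = "Alice" ' lands at address 101

' Lookup: no loop over the elements - just recompute the address from the key.
PRINT Table(ExampleHashFunction("Bob"))

FUNCTION ExampleHashFunction (a AS STRING)
    ExampleHashFunction = ASC(RIGHT$(a, 1))
END FUNCTION
```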

Handling Collisions
Hash table construction must account for collisions, which occur when two keys generate the same hash value. This is handled by simply concatenating the data at that address, with appropriate measures taken to later parse the combined elements. For our example, the finished hash table looks like:

98 -- {Bob}
101 -- {Alice}{Candice}
108 -- {Daniel}{Ezequiel}
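In code, appending a colliding entry is just string concatenation at the computed address. A minimal sketch with the toy hash (single-character brackets here for brevity):

```basic
' Pile colliding names into the same slot, bracketed for later parsing.
DIM Table(255) AS STRING
DIM i AS INTEGER
DATA Alice,Bob,Candice,Daniel,Ezequiel
FOR i = 1 TO 5
    READ n$
    Table(ASC(RIGHT$(n$, 1))) = Table(ASC(RIGHT$(n$, 1))) + "{" + n$ + "}"
NEXT
PRINT Table(101) ' {Alice}{Candice}
PRINT Table(108) ' {Daniel}{Ezequiel}
```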

...

Application
Okay, I've realized this forum space isn't the proper venue to spill out three more pages on hashing theory, so I'll cut right to the chase. What is the Dictionizer? The program first loads the entire Oxford English Dictionary (and its quirks) into a big hash table in memory, allowing extremely fast lookup of words. (Ask for a word, and zoom right to its address without stepping through the whole list.) Next, the program scans through a user-specified text file to construct a list of every unique word. Finally, each of those words is looked up in the hash table, and the output is a beefy CSV that contains the words and their definitions. Crude at this point, but it can be polished, of course.

Instructions
Download and compile dictionizer.bas to an exe (code box below).
Make sure dic.txt is in the same directory as the exe.
Drag+drop a text file (there are two supplied) to generate the output.

You'll immediately see where the dictionary falls short, but it's fine for a prototype.
As a bonus, attached is a histogram of the hash function output across the whole data space. The idea is the plot should show no trends and not be too tall in any place.

Code: QB64: [Select]
c$ = COMMAND$ ' Input file comes from the command line (drag+drop onto the exe).

' Deal with file names.
IF c$ = "" THEN
    c$ = "doc.txt"
    PRINT "No input file specified. Assuming " + c$ + "."
END IF
file$ = c$
j = INSTR(c$, ".")
IF j > 0 THEN
    fileout$ = LEFT$(file$, j - 1) + "-out.csv"
ELSE
    fileout$ = file$ + "-out.csv"
END IF

DIM SHARED HashTableSize AS LONG
HashTableSize = 300007 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.

DIM SHARED LB AS STRING, RB AS STRING ' Make sure the bracketing sequences do not appear in the data source, otherwise use (a) special character(s).
LB = "{/"
RB = "/}"

DIM SHARED EnglishDictionary(HashTableSize) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.

' Analysis/debugging tool:
' Prepare a plot of hash values versus frequency of occurrence (histogram).
OPEN "histo.txt" FOR OUTPUT AS #2

' Open Oxford English Dictionary and load all information into a hash table (the array EnglishDictionary).
PRINT "Loading dictionary..."
OPEN "dic.txt" FOR BINARY AS #1
i = 0
DO WHILE NOT EOF(1)
    LINE INPUT #1, a$
    j = INSTR(a$, "  ")
    k = INSTR(a$, " , ")
    IF k > 0 AND j = 0 THEN j = k ' Handles (the) word(s like) "To , "
    IF j > 0 THEN
        i = i + 1
        b$ = LEFT$(a$, j - 1) ' Extract the base word.
        IF RIGHT$(b$, 1) = "1" THEN b$ = LEFT$(b$, LEN(b$) - 1) ' Remove trailing `1' from words with multiple definitions.
        b$ = LCASE$(b$) ' Convert to lowercase.
        c$ = RIGHT$(a$, LEN(a$) - j - 1)
        IF b$ = "usage" THEN
            ' Append previous entry with "Usage" information from source.
            ' The source was originally formatted such that "Usage" parses exactly as a dictionary word.
            EnglishDictionary(d) = LEFT$(EnglishDictionary(d), LEN(EnglishDictionary(d)) - LEN(RB)) + "... " + b$ + ": " + c$ + RB
        ELSE
            d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
            ' Store the word and definition in the following format: {/word/}{/definition/}
            ' Collisions are appended to the right and parsed later during lookup: {/word1/}{/definition1/}{/word2/}{/definition2/}
            EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB + LB + c$ + RB
            PRINT #2, d ' Record the hash value (analysis/debugging).
        END IF
    END IF
LOOP
CLOSE #1
CLOSE #2 ' Close histogram data file (analysis/debugging).
PRINT "Done."

' Done developing fast lookup tool. Now time for an application.

' Open user-specified text document and make a list of unique words (without counting).
WordList$ = ""
PRINT "Loading user document..."
OPEN file$ FOR BINARY AS #1
OPEN fileout$ FOR OUTPUT AS #2
DO WHILE NOT EOF(1)
    LINE INPUT #1, a$
    c$ = a$ + " " ' Trailing space so the last word on the line is processed too.
    DO WHILE LEN(c$) > 0
        j = INSTR(c$, " ")
        IF j > 0 THEN
            b$ = LEFT$(c$, j - 1) ' Extract the base word.
            b$ = LTRIM$(RTRIM$(b$))
            b$ = LCASE$(b$) ' Convert to lowercase.
            c$ = RIGHT$(c$, LEN(c$) - j)
            ' Remove punctuation and stray marks.
            TheNaughtyList$ = "`1234567890=~!@#$%^&*()_+[]\{}|;:,.<>/? " + CHR$(34) + CHR$(10) + CHR$(13) ' Ignores hyphen and single quote, as in: all-that's
            FOR k = 1 TO LEN(TheNaughtyList$)
                b$ = ReplaceSubString$(b$, MID$(TheNaughtyList$, k, 1), "")
            NEXT
            ' Add to word list in format: {/word1/}{/word2/}{/word3/}...
            IF INSTR(WordList$, LB + b$ + RB) = 0 THEN
                WordList$ = WordList$ + LB + b$ + RB
                PRINT b$ + CHR$(9) + Lookup$(b$)
                PRINT #2, b$ + CHR$(9) + Lookup$(b$)
            END IF
        ELSE
            b$ = c$
            c$ = ""
        END IF
    LOOP
LOOP
CLOSE #1
CLOSE #2
PRINT "Done."
END

FUNCTION Lookup$ (a AS STRING)
    r$ = ""
    b$ = EnglishDictionary(HashFunc(a))
    c$ = ""
    d$ = ""
    IF b$ <> "" THEN
        DO WHILE c$ <> a
            c$ = ReturnBetween$(b$, LB, RB)
            IF c$ = "" THEN EXIT DO
            b$ = RIGHT$(b$, LEN(b$) - LEN(LB + c$ + RB))
            d$ = ReturnBetween$(b$, LB, RB)
        LOOP
    END IF
    r$ = a + "  " + d$
    Lookup$ = r$
END FUNCTION

FUNCTION ReturnBetween$ (a AS STRING, b AS STRING, c AS STRING) ' Input string, left bracket, right bracket.
    i = INSTR(a, b)
    j = INSTR(a, c)
    f = LEN(b)
    ReturnBetween$ = MID$(a, i + f, j - (i + f))
END FUNCTION

FUNCTION HashFunc (a AS STRING) ' Input string.
    DIM sum AS DOUBLE
    sum = HashTableSize
    FOR k = 1 TO LEN(a)
        sum = sum + k * COS(ASC(MID$(a, k, 1))) ' Without the linear factor of k, permutations of the same letters would have the same hash value.
    NEXT
    sum = ABS(VAL(ReplaceSubString$(STR$(sum), ".", "")))
    sum = sum MOD HashTableSize
    HashFunc = sum
END FUNCTION

FUNCTION ReplaceSubString$ (a AS STRING, b AS STRING, c AS STRING)
    j = INSTR(a, b)
    IF j > 0 THEN
        r$ = LEFT$(a, j - 1) + c + ReplaceSubString$(RIGHT$(a, LEN(a) - j + 1 - LEN(b)), b, c)
    ELSE
        r$ = a
    END IF
    ReplaceSubString$ = r$
END FUNCTION
* dic.txt (Filesize: 4.29 MB, Downloads: 899)
* NateTheSnake.txt (Filesize: 55.72 KB, Downloads: 253)
* Terence McKenna.txt (Filesize: 2.07 KB, Downloads: 265)
* histo.png (Filesize: 14.7 KB, Dimensions: 1366x649, Views: 330)
« Last Edit: January 23, 2019, 08:48:55 pm by STxAxTIC »
You're not done when it works, you're done when it's right.

Offline luke

  • Administrator
  • Seasoned Forum Regular
  • Posts: 324
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #1 on: January 24, 2019, 07:24:26 am »
If you'd like something that doesn't require transcendental functions, here's my go-to hashing function. Stolen shamelessly from http://www.cse.yorku.ca/~oz/hash.html
Code: QB64: [Select]
FUNCTION htable_hash~& (s$, mod_size)
    DIM hash~&, i
    hash~& = 5381
    FOR i = 1 TO LEN(s$)
        hash~& = ((hash~& * 33) XOR ASC(s$, i)) MOD mod_size
    NEXT i
    htable_hash~& = hash~&
END FUNCTION
« Last Edit: January 24, 2019, 07:25:28 am by luke »

Offline _vince

  • Seasoned Forum Regular
  • Posts: 422
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #2 on: January 24, 2019, 09:01:02 am »
Nice post, stx.  Hey luke did you know there are new bitshifting functions/statements in QB64?

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #3 on: January 24, 2019, 10:10:00 am »
I feel like I just saw the trailer to a good show, and then the company forgot to release the movie! Loved the introduction to hashing... but then it ended. Ahhhhhhh! I'd have been happy to read the following three pages if you wrote/included them.

When I made my word processor with spell check, I found the fastest way I knew of at the time was to divide the dictionary a-z, with one further division by the second letter of the word, a-l and m-z. This cut down on the search a lot, and even back in the 1990s systems were fast enough that a two-page report could be completely spell checked in around 5 seconds.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Dictionizer: Useful utility and study of Hashing
« Reply #4 on: January 24, 2019, 10:25:04 am »
Quote
When I made my word processor with spell check, I found the fastest way I knew of at the time was to divide the dictionary a-z, with one further division by the second letter of the word, a-l and m-z. This cut down on the search a lot, and even back in the 1990s systems were fast enough that a two-page report could be completely spell checked in around 5 seconds.

Pete

Wouldn’t a simple binary search have been faster?  Since a dictionary is stored alphabetically, you just need to search in halves...

A
B
C
D
E
F
G
H
I
J

Where is C in the list?

Start in the middle at E.  E > C so search before it.
B is in the middle of A to D.  B < C so search after it.
C is the middle of C to D.  C = C.  Found it!

A max of N searches, where 2^N > Index.  ;)
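In code, the halving looks something like this (a minimal sketch of the idea, not taken from any actual spell checker):

```basic
' Binary search over a sorted list: halve the remaining range with each guess.
DIM d(1 TO 10) AS STRING
DIM lo AS INTEGER, hi AS INTEGER, m AS INTEGER
DATA A,B,C,D,E,F,G,H,I,J
FOR i = 1 TO 10: READ d(i): NEXT
target$ = "C"
lo = 1: hi = 10
DO WHILE lo <= hi
    m = (lo + hi) \ 2
    IF d(m) = target$ THEN PRINT "Found at slot"; m: EXIT DO
    IF d(m) > target$ THEN hi = m - 1 ELSE lo = m + 1
LOOP
```

Tracing it by hand reproduces the walkthrough above: it compares against E, then B, then C - three comparisons to find C at slot 3.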
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #5 on: January 24, 2019, 11:30:30 am »
Well let's see.

What I came up then was something like this...

Code: QB64: [Select]
LINE INPUT "word: "; w$

a1$ = "spellck\" + MID$(w$, 1, 1)
a2$ = LCASE$(MID$(w$, 2, 1))
IF a2$ > "l" THEN a2$ = a1$ + "mz\" ELSE a2$ = a1$ + "al\"

PRINT "Folder: " + a2$ + " Search for: " + w$
END ' Stop demo here because folders and word lists not included.

' The spell check would go something like this. No caps in this example, and other considerations like hyphens not included here.
OPEN a2$ FOR INPUT AS #1
l$ = LCASE$(MID$(w$, 1, 2))
DO UNTIL EOF(1)
    LINE INPUT #1, x$
    x1$ = MID$(LCASE$(x$), 1, 2)
    IF x$ = w$ THEN EXIT DO
    IF x1$ > l$ THEN flag% = -1: EXIT DO ' Record entry is past the search word; it's not in the list.
LOOP
CLOSE #1

IF flag% THEN
    flag% = 0
    PRINT "Word not in database."
END IF

So my spellcheck folder contained 52 sub-folders. If memory serves, it was a 50,000-word spell check, so a sub-folder would hold approximately 1,000 records. It worked fast enough that I never put together a way to index the entries past that.

Pete
« Last Edit: January 24, 2019, 11:41:48 am by Pete »
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #6 on: January 24, 2019, 03:19:46 pm »
Hi again all and thanks for the replies. When checking this thread last night I saw that more people looked at this histogram than ran the code, which made me wonder... anyway...

@luke
Thanks for the suggestion! I've been denying myself the pleasure(?) of deriving or pasting a better hash function into the code, but for some reason the scaled cosine (for lack of a better description) worked. I think I'll be stealing yours though in the future. (Btw, http://www.cse.yorku.ca/~oz/hash.html is an awesome quick read!)

@vince
Honored to receive your positive feedback!

@pete
I'm taking the preview-movie comparison as a compliment (thanks)! Unfortunately, the fabled "three pages on hashing theory" are only floating around in my mind and on the Internet, but not in a form that I've typed up yet. I've been considering converting some of my juicier posts into articles though. (Case in point, when I first posted qXed the community focus was 100% on cursor bugs and 0% on the mechanism.) If released as a PDF or HTML instead of a forum post, I think plenty of works would self-represent a bit better.

Rambling on that chord for a second, I think time has shown that QB64 and its users don't encourage projects with thick stacks of libraries and $include files by different authors. Our works don't really propagate by means of being $included into other projects, but we all fantasize that they do. The absence of evidence for this is staggering. I think the thickest program I ever wrote was an InForm implementation of Sprezzo, and the number of authors swelled to a whopping 3 (includes Luke via falcon.h). I dare say we'll never have projects out there with 5 or 6 largely-contributing authors. Thinking about how to properly "write up" a post reminds me that we propagate our ideas by teaching them to the coder next to us rather than saying "include my junk in your shitstack". I think if any one post has a big point to make, a forum probably doesn't have the right tooling.

@Steve and Pete (again)
Funny, hashing could have solved this for you back in the 90's. There are no exotic QB64 keywords here - nearly all of my code (and perhaps all of this code) is QB45-ready, plus or minus memory limits. I like Steve's suggestion if we agree to only use uniformly-filled dictionaries. You can see why a weird list like

Apple
Bear
Sandcastle
Sasquatch
Sediment
Segment
Sip
Siren
Soap
Soda
Suds
Supper
Zebra
Zoo
Zoology

... might confound the method.


« Last Edit: January 24, 2019, 03:33:37 pm by STxAxTIC »
You're not done when it works, you're done when it's right.

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #7 on: January 24, 2019, 03:43:08 pm »
I've always considered QB forums just fun places to talk about stuff, like talking to yourself in some respects. I think the majority of QB coders are more the reinvent-the-wheel types, because that's just more fun, although less productive. Of course, one benefit of doing all the code your own way is that if you ever want to customize it, you can, because you completely understand it.

I recall when I first looked into C in 1995, because I figured it was going to be a language that would survive, I ran into a brick wall with this use-other-people's-stuff approach. I needed a custom-made key input routine for word processing, but when I asked for help, all I got was: use a library! So in some ways these gurus were ahead of me, because they knew the libraries, but I was ahead of them in that they couldn't program one of those libraries to save their collective souls. I managed to make my own custom key input routine without their forum help, but after that project I just concluded C wasn't for me.

I never got into the rules and regs of other languages much. No real education in making hash tables, linked lists, file indexing, etc. I've always been one who can think up a way to get what I want done. Find the angle, so to speak. When it comes to coding, I consider myself more of a builder than a programmer. Well, now I wonder if I explained my coding style or my avatar selection by these comments. Oh well.

... and yes Bill, that was meant as a compliment... but you weren't supposed to take it. So give it back!:D

A raven walks into a crowbar... Oh well, a story for another time.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Dictionizer: Useful utility and study of Hashing
« Reply #8 on: January 24, 2019, 04:06:24 pm »
Quote
I like Steve's suggestion if we agree to only use uniformly-filled dictionaries. You can see why a weird list like

Apple
Bear
Sandcastle
Sasquatch
Sediment
Segment
Sip
Siren
Soap
Soda
Suds
Supper
Zebra
Zoo
Zoology

... might confound the method.

If the data is sorted, the strangeness of its contents doesn't matter. There are 15 items in that list, and you can find any of them in less than 5 comparisons. (Double the list and you'd still find any of them in at most 5 comparisons.)

Binary searching can be even more effective than hash tables, if you have a poor hash algorithm and lots of collisions in the table — like if you were to use your "ASCII value of the LAST letter" method on Pete's 50,000 word dictionary. The data for "101" (E) might have 10,000 entries to look through after the hash is done...

A binary search, however, would *always* get the result in less than 17 tries.  (18 maybe; I was trying to do the math in my head and just woke up...)

Contents of the data doesn’t matter, as it’s the index which you compare against and move up/down in increments of half. 

16 items..  guess...
Wrong?  You just eliminated 8...  guess...
Wrong?  You just eliminated 4... 
2...
1...  RIGHT!
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #9 on: January 24, 2019, 05:36:12 pm »
(I think you noticed the data I listed was already alphabetical but in case anyone missed it - it was.)

You baited me on purpose, right? Because duuuuuuude, of course a shitty hash implementation will perform below a well-made alternative method. It's like you're saying a well-designed sledge hammer rolls down the road more elegantly than cars, for sufficiently badly designed cars. Yes, yes we know.

Beyond straw man stuff, please don't tell me you're defending *any* kind of iterative search against a hash method. 17 steps? 18 steps? Where did these numbers come from? How do I interpret them in O(n) notation? Both numbers are more than a dozen too large anyway. The method I demonstrate above gets to the address in one step, and in extremely rare cases might have to deal with a collision more than 6 layers deep. (Actual example, not theory.)

I think the burden's on you to do the speed test, m8 (for a large data set with lots of performance demand). Feel free to use the method above. Use Luke's hash function if it's faster.

EDIT:

To clarify - how does comparing only indices actually sort or find any real data?

« Last Edit: January 24, 2019, 05:55:29 pm by STxAxTIC »
You're not done when it works, you're done when it's right.

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #10 on: January 24, 2019, 08:09:28 pm »
So let's say I put all 50 states in an RA (random access) word file, in alphabetical order. With Steve's concept, I get the LOF() of that file, divide it in half, and start a search (GET) at that record number. If the word at that address matches, I'm done. If it is higher or lower, I travel 1/2 in that direction, say record 13 or 38, and try again. If that does not match, continue x amount of times, and if that does not find it, I go record by record for a short distance to match the word.

Now with Bill's idea, the word list can be in any order. I just put together a hash algorithm and put the words into a table. Some records are single words; other words have "collisions", so I separate those by commas. Now I pop them into my RA word base by the hash number, the record length of the RA file to be determined by the largest record entry. I then find the word by applying the hash algorithm to it, GET that record number, and if comma(s) are present, use an INSTR() function to see if the word I'm checking is in that record.

Sound correct, guys? If so, with Steve's method, the program uses more GET and compare loops. With Bill's method, the word itself has to be looped through to come up with the hash index, but then it only takes one GET to find the record and one INSTR() to find the word within that record. I will say alphabetizing a large word list and using Steve's method is easier to implement. Making a good hash algorithm, to produce a good distribution of words across records, would be more of a challenge.
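For what it's worth, that hash-and-GET lookup might be sketched like this (hypothetical file name and record length; assumes the HashFunc from the Dictionizer listing above, with a much smaller HashTableSize for a 50-state file):

```basic
' Hash the word to a record number, GET that one record, then INSTR
' through its comma-separated contents. One GET, one INSTR.
CONST RecLen = 128 ' Fixed record length, sized to the largest entry.
DIM rec AS STRING * RecLen
w$ = "texas"
OPEN "states.dat" FOR RANDOM AS #1 LEN = RecLen
GET #1, HashFunc(w$) + 1, rec ' Records are 1-based.
IF INSTR("," + RTRIM$(rec) + ",", "," + w$ + ",") THEN
    PRINT "Yep, found it."
ELSE
    PRINT "Word not found."
END IF
CLOSE #1
```

The commas wrapped around both the record and the search word keep INSTR() from matching a word that's merely a substring of another entry in the record.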

Pete

So Steve's idea...

Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #12 on: January 24, 2019, 08:45:56 pm »
LOL - I am psychic. I had a hunch you were about to post something as I put my reply, above, together.

You know, you mentioned in the other thread about memory, which now isn't the problem it was in QB circa 1990. So I was thinking: to hell with all the techno crap, just use that memory!

First, separate your word list with commas and put the whole glob into a binary file as one giant string. Now...

OPEN "wordlist.txt" FOR BINARY AS #1
word$ = SPACE$(LOF(1))
GET #1, , word$
CLOSE #1

if instr(word$, mytypedword$ + ",") then print "yep, found it." else print "word not found."

I mean, what's faster than one instr() lookup? And the list is already (and quickly) loaded into memory at the start of the program.

Pete

« Last Edit: January 24, 2019, 08:53:08 pm by Pete »
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Dictionizer: Useful utility and study of Hashing
« Reply #13 on: January 24, 2019, 08:58:24 pm »
The question there, Pete, is what spot the word occupies in your list. If just knowing it's there is good enough, that might work, but if you need the word's index, it's insufficient. ;)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Dictionizer: Useful utility and study of Hashing
« Reply #14 on: January 24, 2019, 09:21:09 pm »
True on the index issue, although I never needed that. I built some kind of replacement-words algorithm from stuff I studied online and figured out for myself, which beat just grabbing the surrounding words from either side of the point where the search fell out for a word not found.

Pete

When it comes to coding, I can get away with murder if I don't end up killing myself in the process.
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/