Hello all,
This post is about my latest mechanization - the Dictionizer. Since the topic of alternative data structures comes up once in a while, I figure I would showcase the magic of "hashing", and how this idea can (and should) replace traditional arrays for many problems.
Beginning RemarksThis discussion often mingles with, but is completely separate from, the idea of linked lists that we beat to death on an adjacent thread. What I explain below doesn't reference that stuff at all, but the starting place is the same - traditional arrays are very misused, and with a little cleverness we may cut lookup and looping times to a minimum.
Intro to HashingMuch like an ordinary array, a "hash table" is a rigid structure for holding data, where each data element has an address. Unlike an array however, an element can't just have *any* address, but it needs a very specific address. Where does the address come from?
A data element's address is calculated using the data itself, not manually assigned. The calculation is performed by a hash function.So if you want to store a list of names Alice, Bob, Candice, Daniel, Ezequiel, and so on, you must come up with a number for each name. For absolute simplicity, let our very simple hash function create a number from each data element using the ASCII value of the LAST letter of the name.
For this situation, we have ASC("e")= 101, ASC("b")= 98, ASC("l")= 108. This generates the hash table:
Alice -- 101
Bob -- 98
Candice -- 101
Daniel -- 108
Ezequiel -- 108
Keep in mind that a traditional array would probably use addresses 1,2,3,4,5. Using our hash function though, we cram all five names into the addresses 98, 101, and 108. Setting aside the fact that certain elements have the same address (called a "collision"), observe what just happened:
Looking up elements in a hash table does not involve looping or otherwise scanning through the elements. That is, you simply supply the name of the element, for now on called a
key, and the hash function calculates the address from the key.
Handling CollisionsHash table construction must account for
collisions, which occur when two keys generate the same hash value. This is handled by simply concatenating the data at that address, with appropriate measures taken to later parse the combined elements. For our example, the finished hash table looks like:
98 -- {Bob}
101 -- {Alice}{Candice}
108 -- {Daniel}{Ezequiel}
...
ApplicationOkay, I've realized this forum space isn't the proper venue to spill out three more pages on hashing theory. I'll cut right to the chase. What is the Dictionizer? This program first loads the entire Oxford English dictionary (and its quirks) into a big hash table in memory, allowing extremely fast lookup of words. (Ask for a word, and zoom right to the address without stepping through the whole list.) Next, the program scans through a user-specified text file to construct a list of every unique word. Finally, every one of those words is looked up from the hash table, and the output is a beefy CSV that contains the words and their definitions. Crude at this point, but can be polished of course.
InstructionsDownload and compile dictionizer.bas to an exe (code box below).
Make sure dic.txt is in the same directory as the exe.
Drag+drop a text file (there are two supplied) to generate the output.
You'll immediately see where the dictionary falls short, but it's fine for a prototype.
As a bonus, attached is a histogram of the hash function output across the whole data space. The idea is the plot should show no trends and not be too tall in any place.
' Deal with file names.
c$ = "doc.txt"
PRINT "No input file specified. Assuming " + c$
+ "." file$ = c$
fileout$
= LEFT$(file$
, j
- 1) + "-out.csv" fileout$ = file$ + "-out.csv"
HashTableSize = 300007 ' Best to use a big prime number. Bigger examples are 611953 and 1014729.
DIM SHARED LB
AS STRING ' Make sure that bcracketing sequences do not appear in the data source, otherwise use (a) special character(s). LB = "{/"
RB = "/}"
DIM SHARED EnglishDictionary
(HashTableSize
) AS STRING ' Hash table size does not need to equal the size of the source dictionary itself.
' Analysis/debuggig tool:
' Prepare a plot of hash values versus frequency of occurence (histogram).
' Open Oxford English Dictionary and load all information into a hash table (the array EnglishDictionary).
PRINT "Loading dictionary..." i = 0
IF k
> 0 AND j
= 0 THEN j
= k
' Handles (the) word(s like) "To , " i = i + 1
b$
= LEFT$(a$
, j
- 1) ' Extract the base word. IF RIGHT$(b$
, 1) = "1" THEN b$
= LEFT$(b$
, LEN(b$
) - 1) ' Remove trailing `1' from words with multiple definitions. b$
= LCASE$(b$
) ' Convert to lowercase. ' Append previous entry with "Usage" information from source.
' The source was originally formatted such that "Usage" parses exactly as a dictionary word.
EnglishDictionary
(d
) = LEFT$(EnglishDictionary
(d
), LEN(EnglishDictionary
(d
)) - LEN(RB
)) + "... " + b$
+ ": " + c$
+ RB
d = HashFunc(b$) ' Calculate the hash value (array address) of the word on hand.
' Store the word and definition in the followig format: {/word/}{/definition/}
' Collisions are appended to the right and parsed later during lookup: {/word1/}{/definition1/}{/word2/}{/definition2/}
EnglishDictionary(d) = EnglishDictionary(d) + LB + b$ + RB + LB + c$ + RB
PRINT #2, d
' Record the hash value (analysis/debugging). CLOSE #2 ' Close histogram data file (analysis/debugging).
' Done developing fast lookup tool. Now time for an application.
' Open user-specified text document and make a list of unique words (without counting).
WordList$ = ""
PRINT "Loading user document..." c$ = a$
b$
= LEFT$(c$
, j
- 1) ' Extract the base word. b$
= LCASE$(b$
) ' Convert to lowercase. ' Remove punctuation and stray marks.
TheNaughtyList$
= "`1234567890=~!@#$%^&*()_+[]\{}|;:,.<>/? " + CHR$(34) + CHR$(10) + CHR$(13) ' ignores hyphen and single quote as in: all-that's b$
= ReplaceSubString$
(b$
, MID$(TheNaughtyList$
, k
, 1), "") ' Add to word list in format: {/word1/}{/word2/}{/word3/}...
WordList$ = WordList$ + LB + b$ + RB
b$ = c$
c$ = ""
r$ = ""
b$ = EnglishDictionary(HashFunc(a))
c$ = ""
d$ = ""
c$ = ReturnBetween(b$, LB, RB)
d$ = ReturnBetween(b$, LB, RB)
r$ = a + " " + d$
Lookup$ = r$
ReturnBetween$
= MID$(a
, i
+ f
, j
- (i
+ f
))
sum = HashTableSize
sum
= sum
+ k
* COS(ASC(MID$(a
, k
, 1))) ' Without the linear factor of k, permutations have same hash values. sum
= ABS(VAL(ReplaceSubString$
(STR$(sum
), ".", ""))) sum
= sum
MOD HashTableSize
HashFunc = sum
r$
= LEFT$(a
, j
- 1) + c
+ ReplaceSubString$
(RIGHT$(a
, LEN(a
) - j
+ 1 - LEN(b
)), b
, c
) r$ = a
ReplaceSubString$ = r$