Thanks for the kind words, Codeguy. ;)
I've worked with datasets like this thousands of times in the past, so I've learned a few tricks for making them run efficiently. The above was "speedy enough" for most needs, but there are methods quite a bit faster we could employ -- if we wanted to put forth the effort and alter our data somewhat.
The absolute fastest method I can imagine is dividing our data into a tree structure...
For example, let's start with this tree:
A
AA
AAA
Those three are the first three entries on our list. By "treeing" our data, we say, "If I don't have A in the search phrase, then I can't have anything below A."
Eliminate "A" and we eliminate EVERYTHING with an A. Our search list just dropped 50k words.
If we have A, but not AA, we've eliminated all words with AA from our search list...
It's a "cascading elimination" scheme and it's efficient, and fast, as heck!
The main issue with it is generating the lookup table to begin with... Your data would need to be stored in a manner similar to this:
A (the eliminator), 52154 (number of words containing it), 2,3,4,5,6.... (list of word numbers)
AA (next eliminator), 2154 (number of words containing it), 3,44,67,87,... (list of word numbers)
**********************
It would bloat our data file considerably, depending on how many "eliminators" we want to use (why use anything longer than two characters? Words get more unique the longer they become.), but it'd reduce our list of possible words to check by huge chunks at a time...
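If it helps, here's a rough sketch (Python again, hypothetical names, not how my actual data file is built) of generating a table in that "eliminator, count, word list" layout, capped at two-character eliminators as discussed:

from collections import Counter

def build_eliminator_table(words, max_repeat=2):
    # Maps an eliminator like "A" or "AA" to the word numbers containing it.
    table = {}
    for index, word in enumerate(words):
        for letter, count in Counter(word.upper()).items():
            # "A" = one or more A's, "AA" = two or more, capped at max_repeat.
            for repeat in range(1, min(count, max_repeat) + 1):
                table.setdefault(letter * repeat, []).append(index)
    return table

def write_table(table, path):
    # One line per eliminator: eliminator, count, comma-separated word numbers.
    with open(path, "w") as f:
        for eliminator in sorted(table):
            indices = table[eliminator]
            f.write(f"{eliminator}, {len(indices)}, {','.join(map(str, indices))}\n")

Building the table is a one-time cost; after that, each search only ever touches the lists whose eliminators survive.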
Fastest method I can think of, at the moment anyway. ;)
(And if you look back at my previous code in message #18, you can see I was already generating lists for single letters which we could use for elimination purposes.)