Author Topic: Binary Search Method (Read 34448 times)

STxAxTIC · « **Reply #30 on:** January 26, 2019, 06:00:08 pm »

For some fun trivia, the post (I think three) above this one reminds me of the Heartbleed bug:

https://xkcd.com/1354/

Pete · « **Reply #31 on:** January 27, 2019, 12:51:11 pm »

Quote from: SMcNeill on January 25, 2019, 07:26:23 pm

Quote from: Pete on January 25, 2019, 02:12:07 pm
But that's what seed in INSTR() is for, so you skip from delimiter to delimiter.

See the same post and the second code box, which uses seed% to deal with lookup frequency within a random word list.

Pete

Seed gives you a starting point, but you still check byte by byte for a match...

a$ = ",1,2,3,4,5,3,7,8,"

seed = INSTR(seed + 1, a$, ",3,")

On the first run, it does a compare at byte 1, then byte 2, then 3... at byte 5 it finds ",3,": 5 comparisons.

Then you have:

seed = seed + LEN(",3,") '5 + 3
seed = INSTR(seed, a$, ",3,")

Now you start at byte 8 and search byte by byte until you get to byte 11... 4 more comparisons.

INSTR by its nature is a byte-by-byte compare routine (it has to be, or else how would it know where the search string appears? A random guess?)

wordlist$ = "4bill3cat3dog7dogfood4pete5steve0"

Placing the size first lets you advance by words, not by bytes.

See the difference, and why one is faster than the other?

So you are proposing a frequency search be done something like this?

Code: QB64: [Select]

s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
 
' Find the whole word dog and frequency.
FOR i = 1 TO LEN(s$)
    index = VAL(MID$(s$, i, 1)) ' Good for words 9 letters or less.
    x$ = MID$(s$, i + 1, index)
    i = i + index
    IF x$ = "dog" THEN PRINT x$
NEXT
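
To get an actual frequency count rather than a printout, the same walk could be sketched roughly as below. This is only a sketch under the same assumptions as above (every word is 1 to 9 letters long and the list ends with "0"); the names target$ and hits are made up for illustration.

```qb64
' Sketch: count whole-word matches in a digit-prefixed list.
' Assumes every word is 1 to 9 letters long and the list ends with "0".
s$ = "4bill3cat3dog7dogfood4pete5steve0"
target$ = "dog" ' hypothetical name for the word being counted
hits = 0
i = 1
DO WHILE i <= LEN(s$)
    index = VAL(MID$(s$, i, 1)) ' length digit for this word
    IF index = 0 THEN EXIT DO ' "0" marks the end of the list
    IF MID$(s$, i + 1, index) = target$ THEN hits = hits + 1
    i = i + index + 1 ' jump over the digit and the word
LOOP
PRINT target$; " found"; hits; "time(s)"
```

Using a DO loop with an explicit i avoids changing the FOR counter inside the loop body, and "dogfood" is not counted because the whole word is compared.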

SMcNeill · « **Reply #32 on:** January 27, 2019, 01:12:26 pm »

Quote from: Pete on January 27, 2019, 12:51:11 pm

So you are proposing a frequency search be done something like this?

Code: QB64: [Select]
s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"

' Find the whole word dog and frequency.
FOR i = 1 TO LEN(s$)
    index = VAL(MID$(s$, i, 1)) ' Good for words 9 letters or less.
    x$ = MID$(s$, i + 1, index)
    i = i + index
    IF x$ = "dog" THEN PRINT x$
NEXT

Aye. No need to use INSTR for your search (it's inherently slower by design); just read each word and compare directly.

The only real change I'd make would be to use ASCII character codes instead of numeric digits for the lengths, as we can process those much faster.

s$ = CHR$(4) + "bill" + CHR$(3) + "cat" + CHR$(3) + "dog" + .... (so on)

Then just use index = ASC(s$, i).

That handles word lengths up to 255 characters, and it processes faster than your VAL(MID$()).
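
Put together, the idea might look something like this minimal sketch. It assumes a CHR$(0) byte marks the end of the list and that no word is longer than 255 characters.

```qb64
' Sketch: length bytes stored with CHR$, read back with ASC(s$, i).
' Assumes CHR$(0) ends the list and no word exceeds 255 characters.
s$ = CHR$(4) + "bill" + CHR$(3) + "cat" + CHR$(3) + "dog" + CHR$(0)
i = 1
DO WHILE i <= LEN(s$)
    index = ASC(s$, i) ' length of the next word, 0 to 255
    IF index = 0 THEN EXIT DO ' CHR$(0) marks the end of the list
    PRINT MID$(s$, i + 1, index) ' the word itself
    i = i + index + 1 ' skip the length byte and the word
LOOP
```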

Pete · « **Reply #33 on:** January 27, 2019, 01:31:03 pm »

Well I use ASCII characters in files as delimiters all the time, but in this case there are a few considerations.

1) ASCII characters can't be letters so there is no way to separate them from the letters in the word list. Of course if the word list is limited in size, at least the extended ASCII character set could be used for words up to 128 characters. Ah, the good ol' days when supercalifragilisticexpialidocious was the longest word in the English language. Now it's: pneumonoultramicroscopicsilicovolcanoconiosis for all you sandblasting fans. So it looks like going with the extended character set would be fine.

2) Characters like CHR$(7) do not get printed in a string, so a workaround is needed in these cases. Again, I'd go with the extended set and just subtract 127.
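
A minimal sketch of that "subtract 127" idea, assuming word lengths of 1 to 128 so that the stored byte always lands in the extended range 128 to 255 (word$ and entry$ are illustrative names only):

```qb64
' Sketch: store each length as CHR$(127 + length), decode with ASC - 127.
' Assumes word lengths of 1 to 128, so the stored byte is 128 to 255.
word$ = "dogfood"
entry$ = CHR$(127 + LEN(word$)) + word$ ' CHR$(134) + "dogfood"
index = ASC(entry$, 1) - 127 ' back to 7
PRINT index; MID$(entry$, 2, index)
```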

Pete

SMcNeill · « **Reply #34 on:** January 27, 2019, 01:48:46 pm »

I guess that all depends on what you need to do with your list...

If you need to print it, then use _CONTROLCHR OFF:

_CONTROLCHR OFF
FOR I = 0 TO 255: PRINT CHR$(I): NEXT

Normally, though, the size would be something you hide from your user, so they never see it at all.

To build s$, you’d often do something like:

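DATA bill, cat, dog, END

DO
    READ temp$
    IF temp$ = "END" THEN EXIT DO 'sentinel so READ never runs out of DATA
    s$ = s$ + CHR$(LEN(temp$)) + temp$ 'append to s$ rather than overwrite it
LOOP

s$ = s$ + CHR$(0) 'to designate end of list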

And to print it, you’d print it much like you did the 3 “dog” above, though you might want to add a nice comma between words just to make your list look pretty. ;)
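
As a rough sketch of that comma-separated printout, assuming the list uses CHR$ length bytes and ends with CHR$(0) (pretty$ is just an illustrative name):

```qb64
' Sketch: print a size/word list with commas between the words.
' Assumes CHR$ length bytes and a CHR$(0) end-of-list marker.
s$ = CHR$(4) + "bill" + CHR$(3) + "cat" + CHR$(3) + "dog" + CHR$(0)
pretty$ = ""
i = 1
DO WHILE i <= LEN(s$)
    index = ASC(s$, i) ' hidden length byte, never shown to the user
    IF index = 0 THEN EXIT DO
    IF pretty$ <> "" THEN pretty$ = pretty$ + ", "
    pretty$ = pretty$ + MID$(s$, i + 1, index)
    i = i + index + 1
LOOP
PRINT pretty$ ' bill, cat, dog
```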

Pete · « **Reply #35 on:** January 27, 2019, 02:06:53 pm »

So your method would be...

Code: QB64: [Select]

 ' Convert data format first...
_CONTROLCHR OFF
s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
'convert format...
FOR i = 1 TO LEN(s$)
    index = VAL(MID$(s$, i, 1))
    x$ = MID$(s$, i + 1, index)
    i = i + index
    news$ = news$ + CHR$(index) + x$
NEXT
s$ = news$ + CHR$(0)
 
' Now your part..
FOR i = 1 TO LEN(s$)
    index = ASC(s$, i) ' Good for word lengths up to 255.
    x$ = MID$(s$, i + 1, index)
    i = i + index
    IF x$ = "dog" THEN PRINT CHR$(index), x$
NEXT
 

Look right?

Note: The board wouldn't allow pasted ASCII characters for the index in the code. It dropped them out of the s$ variable, so I re-posted with a conversion method.

Pete

SMcNeill · « **Reply #36 on:** January 27, 2019, 02:27:37 pm »

Looks good. As you can see in the post under Ed’s, it’s almost exactly what I did for the hash table — speeding it up 300% + on my PC. https://www.qb64.org/forum/index.php?topic=1003.msg102044#msg102044

The longer the average words in your list, the greater the difference you’ll see in speed vs using INSTR as you were before.

Pete · « **Reply #37 on:** January 27, 2019, 04:25:10 pm »

OK, I like that method because it's clever, and I learned something new about a QB64 underscore keyword I hadn't used before. As for speed, though, we need to work on that, because the INSTR() method comes out faster when I set up a test as follows...

Code: QB64: [Select]

WIDTH 120, 25
_SCREENMOVE 0, 0
_CONTROLCHR OFF
s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
'convert format...
FOR i = 1 TO LEN(s$)
    index = VAL(MID$(s$, i, 1))
    x$ = MID$(s$, i + 1, index)
    i = i + index
    news$ = news$ + CHR$(index) + x$
NEXT
 
FOR i = 1 TO 15 ' Concatenate to a string size of 3178496 (97 bytes doubled 15 times)
    news$ = news$ + news$
NEXT
s$ = news$: news$ = ""
 
PRINT " Start timer...": PRINT
t# = TIMER
FOR i = 1 TO LEN(s$)
    index = ASC(s$, i) ' Good for word lengths up to 255.
    x$ = MID$(s$, i + 1, index)
    i = i + index
    IF x$ = "dog" THEN k = k + 1
NEXT
PRINT TIMER - t#;
PRINT "seconds to find frequency of dog in search INDEX method. Frequency ="; k; "Length of string ="; LEN(s$)
 
k = 0: _DELAY 1: PRINT
 
news$ = ",bill,cat,dog,dogfood,pete,steve,bill,cat,dog,dogfood,pete,steve,bill,cat,dog,dogfood,pete,steve,"
FOR i = 1 TO 15
    news$ = news$ + news$
NEXT
s$ = news$: news$ = ""
 
PRINT " Start timer...": PRINT
t# = TIMER
DO
    seed& = INSTR(seed&, s$, ",dog,")
    IF seed& = 0 THEN EXIT DO
    x$ = MID$(s$, seed& + 1, LEN("dog"))
    k = k + 1
    seed& = seed& + LEN(",dog,")
LOOP
 
PRINT TIMER - t#;
PRINT "seconds to find frequency of dog in search INSTR method. Frequency ="; k; "Length of string ="; LEN(s$)
 

My belief is that seed advances INSTR past the byte by byte evaluation, much like your indexing method, but faster. I know we need to make sure the comparison is fair, so if you see anything that is skewing the results, let's work it out.

Pete

SMcNeill · « **Reply #38 on:** January 28, 2019, 03:13:08 am »

I had to do a little pondering to sort out what the heck is going on to make INSTR faster than jumping with the index and size, and what I've come up with is the same old conclusion I've gathered in the past: There's a lot of overhead in QB64 functions.

x$ = MID$(s$, i + 1, index) <--MID$ is slower than one would like.

index = ASC(s$, i) ' Good for word lengths up to 255. <--ASC is faster than MID$, but still nothing to write home about.

IF x$ = "dog" THEN k = k + 1 <--And then you do a direct string compare....

VS:

seed& = INSTR(seed&, s$, ",dog,") <--Check the result
IF seed& = 0 THEN EXIT DO
x$ = MID$(s$, seed& + 1, LEN("dog")) <--WTH is this line in here for? It actually does NOTHING for the check...
k = k + 1
seed& = seed& + LEN(",dog,")

So, for the comparison to be as fair as possible, we need to remove as much overhead from the program as we can:

Code: QB64: [Select]

 SCREEN _NEWIMAGE(800, 600, 32)
_SCREENMOVE 0, 0
_CONTROLCHR OFF
 
DEFLNG A-Z
 
CONST Limit = 100 'number of words in our list
CONST Repitition = 10000 'number of times to search the lists
 
DIM word(Limit) AS STRING, index AS _UNSIGNED _BYTE
 
 
'make an array of the proper sizes because QB64 doesn't
'understand (AS STRING * variable) for a data type
'so we're setting an array for strings of size 1 to 20, just as a quick placeholder for use with mem.
DIM Strings(1 TO 20) AS STRING
FOR i = 1 TO 20: Strings(i) = SPACE$(i): NEXT
 
DIM m AS _MEM
 
 
FOR SearchSize = 1 TO 20 STEP 5 'run the list multiple times for various size search strings
 
    'Generate suitable search string
    search1$ = STRING$(SearchSize, "A")
    search2$ = "," + search1$ + "," 'comma before and after
 
    FOR i = 1 TO Limit: word(i) = "": NEXT 'reset old list
 
 
    FOR WordNumber = 1 TO Limit 'the number of words
        IF WordNumber MOD 10 = 1 THEN 'every 10th word, no matter what, is one we want to look for
            word(WordNumber) = search1$
        ELSE 'otherwise, make the word junk
            FOR i = 1 TO INT(RND * 15) + 1 'up to 15 characters of junk in the spam "words"
                word(WordNumber) = word(WordNumber) + CHR$(INT(RND * 26) + 97)
            NEXT
        END IF
    NEXT
 
    'Words are now generated.  Now let's form our two similar lists for searching.
    list1$ = "": list2$ = ","
 
    'Size/Word list
    FOR i = 1 TO Limit: list1$ = list1$ + CHR$(LEN(word(i))) + word(i): NEXT
 
    'Comma Delimited list
    FOR i = 1 TO Limit: list2$ = list2$ + word(i) + ",": NEXT
 
    'Wordlists are now built.
 
    IF _MEMEXISTS(m) THEN _MEMFREE m 'free the old mem block if we need to, from the previous run
    m = _MEMNEW(LEN(list1$)): _MEMPUT m, m.OFFSET, list1$
 
    template$ = "##.### seconds to find frequency of " + search1$ + "."
    template$ = template$ + "  Frequency = ###"
 
 
    _DELAY 1 'a delay so we can watch the tests
    PRINT
    PRINT "Running Speed Tests on "; search1$
    PRINT "SIZE/WORD: ";
 
    t# = TIMER(0.0001)
    FOR z = 1 TO Repitition
        k = 0: l = LEN(search1$): i = 1
        DO UNTIL i > LEN(list1$)
            $CHECKING:OFF
            index = _MEMGET(m, m.OFFSET + i - 1, _UNSIGNED _BYTE) 'Get the index directly, skip ASC function call
            IF index = l THEN 'if lengths don't even match, we don't need to compare words
                'just jump to the next one
                _MEMGET m, m.OFFSET + i, Strings(index) 'get the string direct from memory (like MID$ in Pete's demo)
                IF Strings(index) = search1$ THEN k = k + 1
            END IF
            $CHECKING:ON
            i = i + index + 1
        LOOP
    NEXT
    t1# = TIMER(0.0001) - t#
    PRINT USING template$; t1#, k
 
    PRINT "INSTR:";
    t# = TIMER(0.0001)
    FOR z = 1 TO Repitition
        k = 0
        DO
            seed& = INSTR(seed&, list2$, search2$)
            IF seed& = 0 THEN EXIT DO
            k = k + 1
            seed& = seed& + LEN(search2$)
        LOOP
    NEXT
    t1# = TIMER(0.0001) - t#
    PRINT USING template$; t1#, k
NEXT

Things are running so quickly here, we're having to run a search on a list of 100 words, 10000 times, to generate any significant times for comparison.

CONST Limit = 100 'number of words in our list
CONST Repitition = 10000 'number of times to search the lists

I was curious if the length of the search$ made any real difference, and it doesn't really seem to affect much from my testing.

Notable changes between this and your routine:

index = _MEMGET(m, m.OFFSET + i - 1, _UNSIGNED _BYTE) 'Get the index directly, skip ASC function call
IF index = l THEN 'if lengths don't even match, we don't need to compare words
'just jump to the next one
_MEMGET m, m.OFFSET + i, Strings(index) 'get the string direct from memory (like MID$ in Pete's demo)
IF Strings(index) = search1$ THEN k = k + 1
END IF

We strip out the use of ASC, MID$, and don't even bother to get the word if the two lengths don't match...

And, to keep it fair:

DO
seed& = INSTR(seed&, list2$, search2$)
IF seed& = 0 THEN EXIT DO
k = k + 1
seed& = seed& + LEN(search2$)
LOOP

I basically stripped out that extra line where you were calculating x$ for some odd reason with your search...

Times on my PC are basically 0.015 seconds for SIZE/STRING storage and lookup, versus 0.025 seconds for INSTR/DELIMITED storage and lookup.

Logic says jumping and skipping searches would be faster than searching byte by byte, but once we start tacking on the overhead associated with our function calls, the gap closes quickly. Without using _MEM to replace ASC and MID$, I couldn't seem to top the speed of INSTR.. the overhead was just too great.

Which now leaves me pondering -- why do the changes I made to the hash table in the post after Ed's make it so much quicker than the previous versions on my machine? I guess that'll be a mystery to sit and study on tomorrow. For now, the bed is calling my name...

Pete · « **Reply #39 on:** January 28, 2019, 04:10:22 am »

No strange reason for including x$ = MID$(s$, seed& + 1, LEN("dog")) in the INSTR version. I put it there because it existed in your version. I didn't want it to run faster just because it was left out, even though in the INSTR version it doesn't need to be there to count frequency.

What I don't get is Mark's test. INSTR shouldn't ever run as slow as he reported.

Pete

STxAxTIC · « **Reply #40 on:** January 28, 2019, 06:49:26 am »

Alright, it's been a few days since I actually read any of these threads, and at a quick glance... sigh...

Steve, did you really insinuate that I'm somehow "obsessive over" ... what was it... linked lists? That kind of jab is SO uninformed and misled, just like much of... well... everything you say. Let me remind everyone else that you had no idea what you were talking about with respect to linked lists until I took it upon myself to school you. Do you remember that from like, just days ago? If yer gonna name-drop, open a new thread and we can hash it out there. No pun intended. Actually pun intended. Something tells me you had no clue about hashing til my demo, either.

Alright, don't mind my injection. Get back to what you were discussing here - talking about searching databases using INSTR, rediscovering obvious properties of QB64, etc etc... bravo guys. Keep up this awesomeness.

SMcNeill · « **Reply #41 on:** January 28, 2019, 08:38:24 am »

Quote from: STxAxTIC on January 28, 2019, 06:49:26 am

Steve, did you really insinuate that I'm somehow "obsessive over" ... what was it... linked lists? That kind of jab is SO uninformed and misled, just like much of... well... everything you say. Let me remind everyone else that you had no idea what you were talking about with respect to linked lists until I took it upon myself to school you.

Sure, I remember. You didn’t like the NAME of a library which links lists together, so you decided to school me over my naming sense. Programmers name their code and libraries in a way that they can easily relate to and remember what they are, which is why we don’t see Library Foo1254378 floating around very damn often. For a set of code that links the behavioral properties of two lists, I still insist “Linked List Library” is fine.

I wonder if you’re the type to also cry like hell when someone talks about computing on an Apple, claiming, “But that ain’t no red, juicy fruit!”

Quote

Do you remember that from like, just days ago? If yer gonna name-drop, open a new thread and we can hash it out there. No pun intended. Actually pun intended. Something tells me you had no clue about hashing til my demo, either.

Nope. Never heard of it. I didn’t know we had hash tables inside QB64.bas, or that idet$ is basically a Linked List. I didn’t realize that binary search works with odd distribution lists like: “Apple, Bear, Sandcastle, Sasquatch, Sediment, Segment, Sip, Siren, Soap, Soda, Suds, Supper, Zebra, Zoo, Zoology”. I certainly didn’t know that larger tables resulted in more collisions in a hash table, rendering them less efficient as they grow, or else I never would’ve pushed for bigger datasets to test with — like, “I think the burden's on you to do the speed test m8 (for a large data set with lots of performance demand)”, or “adjust it for a dataset size typical of a respectable database. A few billion if you can.”

Why, if it wasn’t for you “schooling” me so well, I never could’ve worked for years in the database field and written programs to decode various formats to make them available in QB64; never could’ve written my own database library; and never could’ve gotten paid for hashing out real world solutions to billion record issues...

Quote

Alright, don't mind my injection. Get back to what you were discussing here - talking about searching databases using INSTR, rediscovering obvious properties of QB64, etc etc... bravo guys. Keep up this awesomeness.

Thanks for your permission. You don’t know how close I was to giving it all up and learning physics, before you gave me permission to carry on. I truly appreciate it.

Pete · « **Reply #42 on:** January 28, 2019, 10:58:30 am »

Bill, I was going to expand my obvious discussions of QB64 INSTR() usage into theoretical ways it can be used to formulate words. Is it OK with you if I call that STRING$ Theory?

Pete :D

Dimster · « **Reply #43 on:** January 28, 2019, 11:42:03 am »

I have a number of databases which are just numbers. At present when searching I convert them to strings. I gather neither hashing nor binary search offers a way to search without the conversion step?
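
For what it's worth, a binary search can compare numbers directly as long as the data is kept sorted. The following is only a sketch under that assumption; the array contents and the names nums, target, lo, hi, middle and found are made up for illustration.

```qb64
' Sketch: binary search on a sorted numeric array, no string conversion.
' Assumes the array is already sorted in ascending order.
DEFLNG A-Z
DIM nums(1 TO 8) AS LONG
FOR i = 1 TO 8: READ nums(i): NEXT
DATA 3,8,15,23,42,57,91,120

target = 42 ' hypothetical value to look up
lo = 1: hi = 8: found = 0
DO WHILE lo <= hi
    middle = (lo + hi) \ 2 ' integer midpoint of the current range
    IF nums(middle) = target THEN
        found = middle: EXIT DO
    ELSEIF nums(middle) < target THEN
        lo = middle + 1 ' target can only be in the upper half
    ELSE
        hi = middle - 1 ' target can only be in the lower half
    END IF
LOOP
IF found THEN PRINT target; "is at index"; found ELSE PRINT target; "not found"
```

With these eight values the loop probes indexes 4, 6 and 5 before landing on 42, so three numeric comparisons replace any conversion to strings.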

bplus · « **Reply #44 on:** January 28, 2019, 11:47:15 am »

Quote from: Pete on January 28, 2019, 04:10:22 am

No strange reason for including x$ = MID$(s$, seed& + 1, LEN("dog")) in the INSTR version. I put it there because it existed in your version. I didn't want it to run faster just because it was left out, even though in the INSTR version it doesn't need to be there to count frequency.

What I don't get is Mark's test. INSTR shouldn't ever run as slow as he reported.

Pete

Well it did. Though both methods were flawed in my test, I think what the results point to is still valid, i.e. INSTR is great for short strings, but when the strings get to a certain length the Binary Search method will come into its own, as the hash method would with truly large amounts of data.

Pete, I suspect you haven't run INSTR search on nearly half a million word string, with 50,000 lookups in a row.

PS the time was not really that bad except when comparing to Binary Search Method.
