Author Topic: Binary Search Method  (Read 18343 times)

0 Members and 1 Guest are viewing this topic.

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Binary Search Method
« Reply #30 on: January 26, 2019, 06:00:08 pm »
For some fun trivia, the post (I think three) above this one reminds me of the heartbleed bug:

https://xkcd.com/1354/
You're not done when it works, you're done when it's right.

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #31 on: January 27, 2019, 12:51:11 pm »
But that's what seed in INSTR() is for, so you skip from delimiter to delimiter.

See the same post and the second code box, which uses seed% to deal with lookup frequency within a random word list.

Pete

Seed gives you a starting point, but you still check byte by byte for a match...

a$ = “,1,2,3,4,5,3,7,8,”

seed = INSTR(seed + 1, a$, “,3,”)

Run first, it does a compare at byte 1, then byte 2, then 3... at byte 5 it finds “,3,” — 5 comparisons.

Then you have:

seed = seed + LEN(“,3,”)  ‘5 + 3
seed = INSTR(seed, a$, “,3,”)

Now you start at byte 8 and search byte by byte until you get to byte 11... 5 more comparisons.

INSTR by its nature is a byte by byte compare routine (it has to be, or else how would it know where the search string appears at?  A random guess?)



wordlist$ = "4bill3cat3dog7dogfood4pete5steve0”

Placing the size first lets you advance by words, not by bytes

See the difference, and why one is faster than the other?

So you are proposing a frequency search be done something like this?

Code: QB64: [Select]
  1. s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
  2.  
  3. ' Find the whole word dog and frequency.
  4. FOR i = 1 TO LEN(s$)
  5.     index = VAL(MID$(s$, i, 1)) ' Good for words 9 letters or less.
  6.     x$ = MID$(s$, i + 1, index)
  7.     i = i + index
  8.     IF x$ = "dog" THEN PRINT x$
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #32 on: January 27, 2019, 01:12:26 pm »
So you are proposing a frequency search be done something like this?

Code: QB64: [Select]
  1. s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
  2.  
  3. ' Find the whole word dog and frequency.
  4. FOR i = 1 TO LEN(s$)
  5.     index = VAL(MID$(s$, i, 1)) ' Good for words 9 letters or less.
  6.     x$ = MID$(s$, i + 1, index)
  7.     i = i + index
  8.     IF x$ = "dog" THEN PRINT x$

Aye.  No need to use INSTR for your search (it's inherently slower by design); just read each word and compare directly.

The only real difference I'd use would be ASCII characters instead of numeric  ones, as we can deal and process those much faster.

s$ = CHR$(4) + "bill" + CHR$(3) + "cat" + CHR$(3) + "dog" + ....  (so on)

Then just use index = ASC(s$, i).

Gives return values up to 255 characters in length, and processes faster than your VAL(MID$())
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #33 on: January 27, 2019, 01:31:03 pm »
Well I use ASCII characters in files as delimiters all the time, but in this case there are a few considerations.

1) ASCII characters can't be letters so there is no way to separate them from the letters in the word list. Of course if the word list is limited in size, at least the extended ASCII character set could be used for words up to 128 characters. Ah, the good ol' days when supercalifragilisticexpialidocious was the longest word in the English language. Now it's: pneumonoultramicroscopicsilicovolcanoconiosis for all you sandblasting fans. So it looks like going with the extended character set would be fine.

2) Characters like CHR$(7) do not get printed in a string, so a workaround is needed in these cases. Again, I'd go with the extended set and just subtract 127.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #34 on: January 27, 2019, 01:48:46 pm »
I guess that all depends on what you need to do with your list... 

If you need to print it, then use _CONTROLCHR OFF:

_CONTROLCHR OFF
FOR I = 0 TO 255: PRINT CHR$(I): NEXT



Normally, though, the size would be something you hide from your user, so they never see it at all.

To build s$, you’d often do something like:

DATA bill, cat, dog

DO
   READ temp$
   s$ = CHR$(LEN(temp$)) + temp$
LOOP

s$ = s$ + CHR$(0) ‘to designate end of list



And to print it, you’d print it much like you did the 3 “dog” above, though you might want to add a nice comma between words just to make your list look pretty.  ;)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #35 on: January 27, 2019, 02:06:53 pm »
So your method would be...

Code: QB64: [Select]
  1.  ' Convert data format first...
  2. s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
  3. 'convert format...
  4. FOR i = 1 TO LEN(s$)
  5.     index = VAL(MID$(s$, i, 1))
  6.     x$ = MID$(s$, i + 1, index)
  7.     i = i + index
  8.     news$ = news$ + CHR$(index) + x$
  9. s$ = news$ + CHR$(0)
  10.  
  11. ' Now your part..
  12. FOR i = 1 TO LEN(s$)
  13.     index = ASC(s$, i) ' Good for words 9 letters or less.
  14.     x$ = MID$(s$, i + 1, index)
  15.     i = i + index
  16.     IF x$ = "dog" THEN PRINT CHR$(index), x$
  17.  

Look right?

Note: The board wouldn't allow pasted ascii characters for the index in the code. It dropped them out of the s$ variable so I re-posted with a conversion method.

Pete
« Last Edit: January 27, 2019, 02:16:41 pm by Pete »
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #36 on: January 27, 2019, 02:27:37 pm »
Looks good.  As you can see in the post under Ed’s, it’s almost exactly what I did for the hash table — speeding it up 300% + on my PC.  https://www.qb64.org/forum/index.php?topic=1003.msg102044#msg102044

The longer the average words in your list, the greater the difference you’ll see in speed vs using INSTR as you were before. 
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #37 on: January 27, 2019, 04:25:10 pm »
OK, I like that method because it's clever, and I learned something new about a QB64 underscore keyword I haven't used before but as for speed, we need to work on that, because I get the INSTR() method as faster when a test is set up as follows...

Code: QB64: [Select]
  1. WIDTH 120, 25
  2. s$ = "4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve4bill3cat3dog7dogfood4pete5steve0"
  3. 'convert format...
  4. FOR i = 1 TO LEN(s$)
  5.     index = VAL(MID$(s$, i, 1))
  6.     x$ = MID$(s$, i + 1, index)
  7.     i = i + index
  8.     news$ = news$ + CHR$(index) + x$
  9.  
  10. FOR i = 1 TO 15 ' Concatenate to a string size of 3178497
  11.     news$ = news$ + news$
  12. s$ = news$: news$ = ""
  13.  
  14. PRINT " Start timer...": PRINT
  15. t# = TIMER
  16. FOR i = 1 TO LEN(s$)
  17.     index = ASC(s$, i) ' Good for words 9 letters or less.
  18.     x$ = MID$(s$, i + 1, index)
  19.     i = i + index
  20.     IF x$ = "dog" THEN k = k + 1
  21. PRINT TIMER - t#;
  22. PRINT "seconds to find frequency of dog in search INDEX method. Frequency ="; k; "Length of string ="; LEN(s$)
  23.  
  24. k = 0: _DELAY 1: PRINT
  25.  
  26. news$ = ",bill,cat,dog,dogfood,pete,steve,bill,cat,dog,dogfood,pete,steve,bill,cat,dog,dogfood,pete,steve,"
  27. FOR i = 1 TO 15
  28.     news$ = news$ + news$
  29. s$ = news$: news$ = ""
  30.  
  31. PRINT " Start timer...": PRINT
  32. t# = TIMER
  33.     seed& = INSTR(seed&, s$, ",dog,")
  34.     IF seed& = 0 THEN EXIT DO
  35.     x$ = MID$(s$, seed& + 1, LEN("dog"))
  36.     k = k + 1
  37.     seed& = seed& + LEN(",dog,")
  38.  
  39. PRINT TIMER - t#;
  40. PRINT "seconds to find frequency of dog in search INSTR method. Frequency ="; k; "Length of string ="; LEN(s$)
  41.  

My belief is that seed advances INSTR past the byte by byte evaluation, much like your indexing method, but faster. I know we need to make sure the comparison is fair, so if you see anything that is skewing the results, let's work it out.

Pete
« Last Edit: January 27, 2019, 04:46:26 pm by Pete »
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #38 on: January 28, 2019, 03:13:08 am »
I had to do a little pondering to sort out what the heck is going on to make INSTR faster than jumping with the index and size, and what I've came up with is the same old conclusion I've gathered in the past:  There's a lot of overhead in QB64 functions.

    x$ = MID$(s$, i + 1, index)    <--MID$ is slower than one would like.

    index = ASC(s$, i) ' Good for words 9 letters or less.   <--ASC is faster than MID$, but still nothing to write home about.

    IF x$ = "dog" THEN k = k + 1  <--And then you do a direct string compare....

VS:

   seed& = INSTR(seed&, s$, ",dog,")  <--Check the result
    IF seed& = 0 THEN EXIT DO
    x$ = MID$(s$, seed& + 1, LEN("dog"))  <--WTH is this line in here for?  It actually does NOTHING for the check...
    k = k + 1
    seed& = seed& + LEN(",dog,")

So, for the comparison to be as fair as possible, we need to remove as much overhead from the program as we can:

Code: QB64: [Select]
  1.  SCREEN _NEWIMAGE(800, 600, 32)
  2.  
  3. DEFLNG A-Z
  4.  
  5. CONST Limit = 100 'number of words in our list
  6. CONST Repitition = 10000 'number of times to search the lists
  7.  
  8. DIM word(Limit) AS STRING, index AS _UNSIGNED _BYTE
  9.  
  10.  
  11. 'make an array of the proper sizes becauseQB64 dosesn't
  12. 'understand (AS STRING * variable) for a data type
  13. 'so we're setting an array for strings of size 1  to 20, just as a quick placeholder for use with mem.
  14. DIM Strings(1 TO 20) AS STRING
  15. FOR i = 1 TO 20: Strings(i) = SPACE$(i): NEXT
  16.  
  17.  
  18.  
  19. FOR SearchSize = 1 TO 20 STEP 5 'run the list multiple times for various size search strings
  20.  
  21.     'Generate suitable search string
  22.     search1$ = STRING$(SearchSize, "A")
  23.     search2$ = "," + search1$ + "," 'comma before and after
  24.  
  25.     FOR i = 1 TO Limit: word(i) = "": NEXT 'reset old list
  26.  
  27.  
  28.     FOR WordNumber = 1 TO Limit 'the number of words
  29.         IF WordNumber MOD 10 = 1 THEN 'every 10th word, no matter what, is one we want to look for
  30.             word(WordNumber) = search1$
  31.         ELSE 'otherwise, make the word junk
  32.             FOR i = 1 TO INT(RND * 15) + 1 'up to 15 characters of junk in the spam "words"
  33.                 word(WordNumber) = word(WordNumber) + CHR$(INT(RND * 26) + 97)
  34.             NEXT
  35.         END IF
  36.     NEXT
  37.  
  38.     'Words are now generated.  Now let's form our two similar lists for searching.
  39.     list1$ = "": list2$ = ","
  40.  
  41.     'Size/Word list
  42.     FOR i = 1 TO Limit: list1$ = list1$ + CHR$(LEN(word(i))) + word(i): NEXT
  43.  
  44.     'Comma Delimited list
  45.     FOR i = 1 TO Limit: list2$ = list2$ + word(i) + ",": NEXT
  46.  
  47.     'Wordlists are now built.
  48.  
  49.     IF _MEMEXISTS(m) THEN _MEMFREE m 'free the old mem block if we need to, from the previous run
  50.     m = _MEMNEW(LEN(list1$)): _MEMPUT m, m.OFFSET, list1$
  51.  
  52.     template$ = "##.### seconds to find frequency of " + search1$ + "."
  53.     template$ = template$ + "  Frequency = ###"
  54.  
  55.  
  56.     _DELAY 1 'a delay so we can watch the tests
  57.     PRINT
  58.     PRINT "Running Speed Tests on "; search1$
  59.     PRINT "SIZE/WORD: ";
  60.  
  61.     t# = TIMER(0.0001)
  62.     FOR z = 1 TO Repitition
  63.         k = 0: l = LEN(search1$): i = 1
  64.         DO UNTIL i > LEN(list1$)
  65.             $CHECKING:OFF
  66.             index = _MEMGET(m, m.OFFSET + i - 1, _UNSIGNED _BYTE) 'Get the index directly, skip ASC function call
  67.             IF index = l THEN 'if lengths don't even match, we don't need to compare words
  68.                 'just jump to the next one
  69.                 _MEMGET m, m.OFFSET + i, Strings(index) 'get the string direct from memory (like MID$ in Pete's demo)
  70.                 IF Strings(index) = search1$ THEN k = k + 1
  71.             END IF
  72.             $CHECKING:ON
  73.             i = i + index + 1
  74.         LOOP
  75.     NEXT
  76.     t1# = TIMER(0.0001) - t#
  77.     PRINT USING template$; t1#, k
  78.  
  79.     PRINT "INSTR:";
  80.     t# = TIMER(0.0001)
  81.     FOR z = 1 TO Repitition
  82.         k = 0
  83.         DO
  84.             seed& = INSTR(seed&, list2$, search2$)
  85.             IF seed& = 0 THEN EXIT DO
  86.             k = k + 1
  87.             seed& = seed& + LEN(search2$)
  88.         LOOP
  89.     NEXT
  90.     t1# = TIMER(0.0001) - t#
  91.     PRINT USING template$; t1#, k

Things are running so quickly here, we're having to run a search on a list of 100 words, 10000 times, to generate any significant times for comparison.

CONST Limit = 100 'number of words in our list
CONST Repitition = 10000 'number of times to search the lists

I was curious if the length of the search$ made any real difference, and it doesn't really seem to affect much from my testing. 



Notable changes in this and your routine:

            index = _MEMGET(m, m.OFFSET + i - 1, _UNSIGNED _BYTE) 'Get the index directly, skip ASC function call
            IF index = l THEN 'if lengths don't even match, we don't need to compare words
                'just jump to the next one
                _MEMGET m, m.OFFSET + i, Strings(index) 'get the string direct from memory (like MID$ in Pete's demo)
                IF Strings(index) = search1$ THEN k = k + 1
            END IF

We strip out the use of ASC, MID$, and don't even bother to get the word if the two lengths don't match...

And, to keep it fair:

        DO
            seed& = INSTR(seed&, list2$, search2$)
            IF seed& = 0 THEN EXIT DO
            k = k + 1
            seed& = seed& + LEN(search2$)
        LOOP

I basically stripped out that extra line where you were calculating x$ for some odd reason with your search...


Times on my PC are basically 0.015 seconds for SIZE/STRING storage and lookup, verses 0.025 seconds for INSTR/DELIMITED storage and lookup.

Logic says jumping and skipping searches would be faster than searching byte by byte, but once we start tacking on the overhead associated with our function calls, the gap closes quickly.  Without using _MEM to replace ASC and MID$, I couldn't seem to top the speed of INSTR.. the overhead was just too great.

Which now leaves me pondering -- why do the changes I made to the hash table in the post after Ed's make it so much quicker than the previous versions on my machine?  I guess that'll be a mystery to sit and study on tomorrow.  For now, the bed is calling my name...
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #39 on: January 28, 2019, 04:10:22 am »
No strange reason for including x$ = MID$(s$, seed& + 1, LEN("dog")) in the INSTR version. I put it there because it existed in your version. I didn't want it to run faster just because it was left out, even though in the INSTR version it doesn't need to be there to count frequency.

What I don't get is Marks test. INSTR shouldn't ever run as slow as he reported.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Binary Search Method
« Reply #40 on: January 28, 2019, 06:49:26 am »
Alright, been a few days since I actually ready at any of these threads, and at a quick glance... sigh...

Steve, did you really insinuate that I'm somehow "obsessive over" ... what was it... linked lists? That kind of jab is SO uninformed and misled, just like much of... well... everything you say. Let me remind everyone else that you had no idea what you were talking about with respect to linked lists until I took it upon myself to school you. Do you remember that from like, just days ago? If yer gonna name-drop, open a new thread and we can hash it out there. No pun intended. Actually pun intended. Something tells me you had no clue about hashing til my demo, either.

Alright, don't mind my injection. Get back to you were discussing here - talking about searching databases using INSTR, rediscovering obvious properties of QB64, etc etc... bravo guys. Keep up this awesomeness.
« Last Edit: January 28, 2019, 06:58:07 am by STxAxTIC »
You're not done when it works, you're done when it's right.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Binary Search Method
« Reply #41 on: January 28, 2019, 08:38:24 am »
Steve, did you really insinuate that I'm somehow "obsessive over" ... what was it... linked lists? That kind of jab is SO uninformed and misled, just like much of... well... everything you say. Let me remind everyone else that you had no idea what you were talking about with respect to linked lists until I took it upon myself to school you.

Sure, I remember.  You didn’t like the NAME of a library which links lists together, so you decided to school me over my naming sense.  Programmers name their code and libraries in a way that they can easily relate to and remember what they are, which is why we don’t see Library Foo1254378 floating around very damn often.  For a set of code that links the behavioral properties of two lists, I still insist “Linked List Library” is fine. 

I wonder if you’re the type to also cry like hell when someone talks about computing on an Apple, claiming, “But that ain’t no red, juicy fruit!”

Quote
Do you remember that from like, just days ago? If yer gonna name-drop, open a new thread and we can hash it out there. No pun intended. Actually pun intended. Something tells me you had no clue about hashing til my demo, either.

Nope.  Never heard of it.  I didn’t know we had hash tables inside QB64.bas, or that idet$ is basically a Linked List.  I didn’t realize that binary search works with odd distribution lists like: “Apple, Bear, Sandcastle, Sasquatch, Sediment, Segment, Sip, Siren, Soap, Soda, Suds, Supper, Zebra, Zoo, Zoology”.  I certainly didn’t know that larger tables resulted in more collisions in a hash table, rendering them less efficient as they grow, or else I never would’ve pushed for bigger datasets to test with — like, “I think the burden's on you to do the speed test m8 (for a large data set with lots of performance demand)”, or  “adjust it for a dataset size typical of a respectable database. A few billion if you can.”

Why, if it wasn’t for you “schooling” me so well, I never could’ve worked for years in the database field and written programs to decode various formats to make them available in QB64; never could’ve written my own database library; and never could’ve gotten paid for hashing out real world solutions to billion record issues... 

Quote
Alright, don't mind my injection. Get back to you were discussing here - talking about searching databases using INSTR, rediscovering obvious properties of QB64, etc etc... bravo guys. Keep up this awesomeness.

Thanks for your permission.  You don’t know how close I was to giving it all up and learning physics, before you gave me permission to carry on.  I truly appreciate it.
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Binary Search Method
« Reply #42 on: January 28, 2019, 10:58:30 am »
Bill, I was going to expand my obvious discussions of QB64 INSTR() usage into theoretical ways it can be used to formulate words. Is it OK with you if I call that STRING$ Theory?

Pete :D
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline Dimster

  • Forum Resident
  • Posts: 500
    • View Profile
Re: Binary Search Method
« Reply #43 on: January 28, 2019, 11:42:03 am »
I have number of data bases which are just numbers. At present when searching I convert them to strings. I gather neither Hashing or Binary offer a way to search without the conversion step?

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Binary Search Method
« Reply #44 on: January 28, 2019, 11:47:15 am »
No strange reason for including x$ = MID$(s$, seed& + 1, LEN("dog")) in the INSTR version. I put it there because it existed in your version. I didn't want it to run faster just because it was left out, even though in the INSTR version it doesn't need to be there to count frequency.

What I don't get is Marks test. INSTR shouldn't ever run as slow as he reported.

Pete

Well it did. Though both methods were flawed in my test, I think what the results point to is still valid ie INSTR is great for short strings but when the strings get to a certain length the Binary Search method will come into it's own as the hash method would with truly large amounts of data.

Pete, I suspect you haven't run INSTR search on nearly half a million word string, with 50,000 lookups in a row.

PS the time was not really that bad except when comparing to Binary Search Method.
« Last Edit: January 28, 2019, 11:58:33 am by bplus »