Author Topic: Split and join strings  (Read 10041 times)

0 Members and 1 Guest are viewing this topic.

Offline luke

  • Administrator
  • Seasoned Forum Regular
  • Posts: 324
    • View Profile
Split and join strings
« on: February 15, 2019, 04:11:07 am »
Given a string of words separated by spaces (or any other character), splits it into an array of the words. I've no doubt many people have written a version of this over the years and no doubt there's a million ways to do it, but I thought I'd put mine here so we have at least one version. There's also a join function that does the opposite array -> single string.

Code is hopefully reasonably self explanatory with comments and a little demo. Note, this is akin to Python/JavaScript split/join, PHP explode/implode.

Code: QB64: [Select]
  1. redim words$(0)
  2.  
  3. original$ = "The rain   in Spain  "
  4. print "Original string: "; original$
  5.  
  6. split original$, " ", words$()
  7.  
  8. print "Words:"
  9. for i = lbound(words$) to ubound(words$)
  10.     print words$(i)
  11.  
  12. print "Joined with commas: ";join$(words$(), ",")
  13.  
  14. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  15. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  16. '
  17. 'delimiter$ must be one character long.
  18. 'result$() must have been REDIMmed previously.
  19. sub split(in$, delimiter$, result$())
  20.     redim result$(-1)
  21.     start = 1
  22.     do
  23.         while mid$(in$, start, 1) = delimiter$
  24.             start = start + 1
  25.             if start > len(in$) then exit sub
  26.         wend
  27.         finish = instr(start, in$, delimiter$)
  28.         if finish = 0 then finish = len(in$) + 1
  29.         redim _preserve result$(0 to ubound(result$) + 1)
  30.         result$(ubound(result$)) = mid$(in$, start, finish - start)
  31.         start = finish + 1
  32.     loop while start <= len(in$)
  33.  
  34. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  35. function join$(in$(), delimiter$)
  36.     result$ = in$(lbound(in$))
  37.     for i = lbound(in$) + 1 to ubound(in$)
  38.         result$ = result$ + delimiter$ + in$(i)
  39.     next i
  40.     join$ = result$

Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
    • View Profile
Re: Split and join strings
« Reply #1 on: February 15, 2019, 04:29:43 am »
Here are my two cents,

words/components can be seperated by arbitrary number of whitespace (TAB/SPACE) and "quoted" sections are taken as one word/component. It also is satisfied with any defined dynamic array, as it REDIMs it as needed.

EDIT:
Old code removed, goto https://qb64forum.alephc.xyz/index.php?topic=4142 for the latest update.
« Last Edit: January 31, 2022, 06:12:32 pm by RhoSigma »
My Projects:   https://qb64forum.alephc.xyz/index.php?topic=809
GuiTools - A graphic UI framework (can do multiple UI forms/windows in one program)
Libraries - ImageProcess, StringBuffers (virt. files), MD5/SHA2-Hash, LZW etc.
Bonus - Blankers, QB64/Notepad++ setup pack

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Split and join strings
« Reply #2 on: February 15, 2019, 01:38:44 pm »
And my 2 cents:
My Split sub handles more than one char delimiters, comes in handy for Splitting a file into an array of lines with the delimiter set as Chr($13)+chr$(10) for a .bas for instance. It also handles the space delimiter specially by removing all double spaces before splitting.

Oh, also, I like to use long strings for arrays. For that you need to be able to leave empty spots open for strings eg see the days of week example.

Code: QB64: [Select]
  1. 'split test.bas for qb64 bplus 2018-05-07
  2. ' I think I want to replace my inefficient Wrd function
  3.  
  4. '2018-08-25 reworked for space delimiters and more variable declares
  5. '2019-02-15 add Luke's version to compare
  6. ntests = 5
  7. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  8.  
  9. a(0) = ""
  10. d(0) = " "
  11. a(1) = " test test    test " 'good no error!
  12. d(1) = " "
  13. a(2) = " test"
  14. d(2) = " "
  15. a(3) = "3d,z6d,z1 10 #d,z5"
  16. d(3) = ",z"
  17. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  18. d(4) = ", "
  19.  
  20. FOR test = 0 TO ntests - 1
  21.     PRINT: PRINT "splitting {"; a(test); "} with delimeter {"; d(test); "}"
  22.     REDIM myarr(0) AS STRING '<<<<< REDIM forces the creation of a dynamic/resizable array
  23.     Split a(test), d(test), myarr()
  24.     amax = UBOUND(myarr)
  25.     FOR i = 0 TO amax
  26.         PRINT i; ":"; myarr(i)
  27.     NEXT i
  28.     INPUT "press enter for next test... "; wate$
  29.  
  30. ' how about a quick file reader test?
  31. PRINT: INPUT "Press enter for file test, any other + enter quits! "; wate$
  32. IF LEN(wate$) THEN END
  33.  
  34. 'other wise continue
  35. OPEN "Split test.bas" FOR BINARY AS #1 '<<< this file name!!!
  36. ftext$ = SPACE$(LOF(1))
  37. GET #1, , ftext$
  38. Split ftext$, CHR$(13) + CHR$(10), myarr()
  39. FOR i = 0 TO UBOUND(myarr)
  40.     PRINT myarr(i)
  41.     IF i MOD 20 = 19 THEN PRINT: INPUT "press enter for more "; wate$
  42. PRINT "the end"
  43. END ' end program
  44.  
  45.  
  46. 'the space delimiter is such a special case perhaps I should develope a single split for that alone?
  47.  
  48.  
  49. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  50. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  51.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  52.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  53.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  54.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  55.  
  56.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  57.  
  58.     'special case if delim is space, probably want to remove all excess space
  59.     IF delim = " " THEN
  60.         copy = RTRIM$(LTRIM$(copy))
  61.         p = INSTR(copy, "  ")
  62.         WHILE p > 0
  63.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  64.             p = INSTR(copy, "  ")
  65.         WEND
  66.     END IF
  67.     curpos = 1
  68.     arrpos = 0
  69.     dpos = INSTR(curpos, copy, delim)
  70.     DO UNTIL dpos = 0
  71.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  72.         arrpos = arrpos + 1
  73.         REDIM _PRESERVE arr(arrpos + 1) AS STRING
  74.         curpos = dpos + LEN(delim)
  75.         dpos = INSTR(curpos, copy, delim)
  76.     LOOP
  77.     arr(arrpos) = MID$(copy, curpos)
  78.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  79.  
  80.  
  81. ' Luke 2019-02-15
  82. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  83. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  84. '
  85. 'delimiter$ must be one character long.
  86. 'result$() must have been REDIMmed previously.
  87. SUB Lsplit (in$, delimiter$, result$())
  88.     REDIM result$(-1)
  89.     start = 1
  90.     DO
  91.         WHILE MID$(in$, start, 1) = delimiter$
  92.             start = start + 1
  93.             IF start > LEN(in$) THEN EXIT SUB
  94.         WEND
  95.         finish = INSTR(start, in$, delimiter$)
  96.         IF finish = 0 THEN finish = LEN(in$) + 1
  97.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  98.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  99.         start = finish + 1
  100.     LOOP WHILE start <= LEN(in$)
  101.  
  102. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  103. FUNCTION join$ (in$(), delimiter$)
  104.     result$ = in$(LBOUND(in$))
  105.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  106.         result$ = result$ + delimiter$ + in$(i)
  107.     NEXT i
  108.     join$ = result$
  109.  
  110.  

I would like to test RhoSigma's but looks like a couple of specific delimiters are used.

I think I had a reason not to redim the arr inside the Split sub, can't recall it now. It does seem more convenient to handle it inside the sub.

Append: you only have to redim once outside the Split sub before the call to it, to let QB64 know you are using that name as a dynamic array.
« Last Edit: February 15, 2019, 03:16:31 pm by bplus »

Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
    • View Profile
Re: Split and join strings
« Reply #3 on: February 15, 2019, 04:03:16 pm »
I would like to test RhoSigma's but looks like a couple of specific delimiters are used.

??? - I hope you doesn't speak about the CHR$(34) in the example lines, it's a quote (").

The ParseLine&() function is a rather simple (yet incomplete) command line parsing function, whitespace separates options/words, whitespace in this case means any number of TABs and/or SPACEs, and quoted parts are taken as is (ie. incl. whitespace) as you usually would need it on a command line for filenames which contain spaces or any chars with special meanings in the command interpreter suche as %()<>& etc..
My function does otherwise not allow for any user defined delimiters.
My Projects:   https://qb64forum.alephc.xyz/index.php?topic=809
GuiTools - A graphic UI framework (can do multiple UI forms/windows in one program)
Libraries - ImageProcess, StringBuffers (virt. files), MD5/SHA2-Hash, LZW etc.
Bonus - Blankers, QB64/Notepad++ setup pack

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #4 on: February 15, 2019, 08:09:23 pm »
Here's a split routine which I'll toss into the mess as well:

Code: QB64: [Select]
  1. CONST ntests = 5
  2. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  3. REDIM results(0) AS STRING
  4.  
  5. a(0) = ""
  6. d(0) = " "
  7. a(1) = " test test    test " 'good no error!
  8. d(1) = " "
  9. a(2) = " test"
  10. d(2) = " "
  11. a(3) = "3d,z6d,z1 10 #d,z5"
  12. d(3) = ",z"
  13. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  14. d(4) = ", "
  15.  
  16. FOR i = 0 TO ntests - 1
  17.     PRINT "Splitting: "; a(i)
  18.     SteveSplit a(i), d(i), results()
  19.     FOR j = 1 TO UBOUND(results)
  20.         PRINT j, results(j)
  21.     NEXT
  22.     SLEEP
  23.  
  24. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  25.     STATIC count AS LONG
  26.     count = count + 1
  27.     u = UBOUND(storage_array)
  28.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  29.     i = INSTR(text$, delimiter$)
  30.     IF i THEN
  31.         storage_array(count) = LEFT$(text$, i - 1)
  32.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  33.     ELSE
  34.         storage_array(count) = text$
  35.         REDIM _PRESERVE storage_array(count) AS STRING
  36.         count = 0
  37.     END IF

I was even nice and named it "SteveSplit" so folks can test for speed and compare results if they wish verses the other split routines.

One thing to note:  This *doesn't* strip off any extra leading/trailing spaces.  Why would it, if they're delimiters in our data?

Let's say we have the data of the following: 

1,2,,,5,6

When we INPUT it from a file, we get data of:
"1"
"2"
""
""
"5"
"6"

Those commas are valid delimiters of null strings.  If we're using a space as a delimiter, then shouldn't it also follow the same behavior of the comma?  If the user doesn't want to include/process null strings, then let them ignore them elsewhere in their code.  As far as our data is concerned, they're valid split points, in my opinion.
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #5 on: February 16, 2019, 02:14:25 am »
And a timed comparison of the three routines:

Code: QB64: [Select]
  1. CONST ntests = 5
  2. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  3. REDIM results1(0) AS STRING
  4. REDIM results2(0) AS STRING
  5. REDIM results3(0) AS STRING
  6.  
  7. CONST Limit = 1000000
  8.  
  9. a(0) = ""
  10. d(0) = " "
  11. a(1) = " test test    test " 'good no error!
  12. d(1) = " "
  13. a(2) = " test"
  14. d(2) = " "
  15. a(3) = "3d,z6d,z1 10 #d,z5"
  16. d(3) = ",z"
  17. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  18. d(4) = ", "
  19.  
  20. FOR i = 0 TO ntests - 1
  21.     t# = TIMER
  22.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  23.         SteveSplit a(i), d(i), results1()
  24.     NEXT
  25.     t1# = TIMER
  26.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  27.         REDIM results2(0) AS STRING
  28.         Split a(i), d(i), results2()
  29.     NEXT
  30.     t2# = TIMER
  31.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  32.         Lsplit a(i), d(i), results3()
  33.     NEXT
  34.     t3# = TIMER
  35.     PRINT "TEST #"; i; " -- Splitting: "; CHR$(34); a(i); CHR$(34); " with "; CHR$(34); d(i); CHR$(34)
  36.     PRINT USING "###.####     ###.####     ###.####"; t1# - t#, t2# - t1#, t3# - t2#
  37.  
  38.     FOR j = 1 TO UBOUND(results1)
  39.         PRINT j, CHR$(34); results1(j); CHR$(34),
  40.         IF j <= UBOUND(results2) + 1 THEN PRINT CHR$(34); results2(j - 1); CHR$(34),
  41.         IF j <= UBOUND(results3) + 1 THEN PRINT CHR$(34); results3(j - 1); CHR$(34),
  42.         PRINT
  43.     NEXT
  44.     SLEEP
  45.  
  46.  
  47.  
  48.  
  49. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  50. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  51.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  52.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  53.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  54.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  55.  
  56.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  57.  
  58.     'special case if delim is space, probably want to remove all excess space
  59.     IF delim = " " THEN
  60.         copy = RTRIM$(LTRIM$(copy))
  61.         p = INSTR(copy, "  ")
  62.         WHILE p > 0
  63.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  64.             p = INSTR(copy, "  ")
  65.         WEND
  66.     END IF
  67.     curpos = 1
  68.     arrpos = 0
  69.     dpos = INSTR(curpos, copy, delim)
  70.     DO UNTIL dpos = 0
  71.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  72.         arrpos = arrpos + 1
  73.         REDIM _PRESERVE arr(arrpos + 1) AS STRING
  74.         curpos = dpos + LEN(delim)
  75.         dpos = INSTR(curpos, copy, delim)
  76.     LOOP
  77.     arr(arrpos) = MID$(copy, curpos)
  78.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  79.  
  80.  
  81. ' Luke 2019-02-15
  82. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  83. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  84. '
  85. 'delimiter$ must be one character long.
  86. 'result$() must have been REDIMmed previously.
  87. SUB Lsplit (in$, delimiter$, result$())
  88.     REDIM result$(-1)
  89.     start = 1
  90.     DO
  91.         WHILE MID$(in$, start, 1) = delimiter$
  92.             start = start + 1
  93.             IF start > LEN(in$) THEN EXIT SUB
  94.         WEND
  95.         finish = INSTR(start, in$, delimiter$)
  96.         IF finish = 0 THEN finish = LEN(in$) + 1
  97.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  98.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  99.         start = finish + 1
  100.     LOOP WHILE start <= LEN(in$)
  101.  
  102. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  103. FUNCTION join$ (in$(), delimiter$)
  104.     result$ = in$(LBOUND(in$))
  105.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  106.         result$ = result$ + delimiter$ + in$(i)
  107.     NEXT i
  108.     join$ = result$
  109.  
  110.  
  111.  
  112.  
  113.  
  114.  
  115.  
  116.  
  117.  
  118.  
  119.  
  120.  
  121.  
  122.  
  123.  
  124.  
  125.  
  126.  
  127. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  128.     STATIC count AS LONG
  129.     count = count + 1
  130.     u = UBOUND(storage_array)
  131.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  132.     i = INSTR(text$, delimiter$)
  133.     IF i THEN
  134.         storage_array(count) = LEFT$(text$, i - 1)
  135.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  136.     ELSE
  137.         storage_array(count) = text$
  138.         REDIM _PRESERVE storage_array(count) AS STRING
  139.         count = 0
  140.     END IF

On my machine, the test speeds are as follows for:

               SteveSplit , Split (bplus), LSplit (Luke)
TEST #0: 0.1650 , 0.4395, 0.3296
TEST #1: 2.3076, 2.3076, 1.5386
TEST #2: 0.4399, 0.5493, 0.5493
TEST #3: 1.0439, 1.2637, 1.5381 (False results)
TEST #4: 1.9229, 2.1431, 2.6924 (False results)

In tests 0, 2, 3, 4, SteveSplit ran fastest. 
In test 1, LSplit was the fastest routine.

The difference in the speeds in Test #1 is a general philosophy of *how* we behave with the act of splitting.  For my routine, we generate several null strings, delimited by spaces.  For everyone else, the extra spaces are simply ignored and removed.

Luke's routine is set to only use a single character as a string delimiter, so for test 3 and 4, it produces false results as we have a 2-character delimiter.

******************

As to WHY my routine tosses out those extra null strings, it's to simply uniformly answer the question of, "How would we behave if if used a FIND AND REPLACE to change all the spaces to commas?"

Instead of a(1) = " test test    test ", what results would we expect if a(1) = ",test,test,,,,test," and we used a comma as a delimiter instead of a space?  Would we not then count the characters between each comma as as null-string?

I'd think so, and if two commas side-by-side designate a null-string result between them, then I feel like two spaces side-by-side should do the same for space delimited data.  This concept also allows me to perfectly reproduce the original data when using a join function, without automatically losing those spaces which may (or may not) have been important for formatting or data purposes.

If I don't want the extra spaces, all I need to do is ignore those null-strings. 



https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Split and join strings
« Reply #6 on: February 16, 2019, 09:26:49 am »
Steve:
Quote
The difference in the speeds in Test #1 is a general philosophy of *how* we behave with the act of splitting.  For my routine, we generate several null strings, delimited by spaces.  For everyone else, the extra spaces are simply ignored and removed.

Luke's routine is set to only use a single character as a string delimiter, so for test 3 and 4, it produces false results as we have a 2-character delimiter.

******************

As to WHY my routine tosses out those extra null strings, it's to simply uniformly answer the question of, "How would we behave if if used a FIND AND REPLACE to change all the spaces to commas?"

Instead of a(1) = " test test    test ", what results would we expect if a(1) = ",test,test,,,,test," and we used a comma as a delimiter instead of a space?  Would we not then count the characters between each comma as as null-string?

I'd think so, and if two commas side-by-side designate a null-string result between them, then I feel like two spaces side-by-side should do the same for space delimited data.  This concept also allows me to perfectly reproduce the original data when using a join function, without automatically losing those spaces which may (or may not) have been important for formatting or data purposes.

If I don't want the extra spaces, all I need to do is ignore those null-strings. 

Well in defense of my routine, you have the best of both worlds:
1. If you want to ignore spaces and just use any amount of spaces to separate items then use a space delimiter,
2. otherwise use any other delimiter of any length!

There is a real need for option #1 like when use string a number of numbers together, you won't need to worry about trimming them which would otherwise load the array with unwanted null strings. Having to ignore unwanted null strings would likely pose a needless burden on an app that depends upon the order placement of the items. I have in mind my simple little interpreter that used no punctuation only spaces as a delimiter.

Once you have a good general Split routine working, it would be a snap to modify and optimize for the particular application.
Likely it would need one delimiter or 1 character only delimiters or it may go the n-spaces route. Piece of cake to drop the unneeded general options to speed the thing up for the app. For this reason, Steve your method of REDIM _PRESERVE in large batches of say 1000, instead of at every new item, that is a good idea that I will incorporate in my general Split routine.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #7 on: February 16, 2019, 10:28:33 am »
Generally, if I want to strip out extra spaces, I’ll just run the string through a find-replace routine to change “  “ to “ “ before running the split routine.  It’s a fast process and saves time overall to do the whole process at once, instead of checking/trimming every loop inside the split routine.

And if you notice, my little routine only resizes the array larger when needed, and since REDIM _PRESERVE is a slow process, it’s better to oversized it by 1000 (or more) elements and then resize down when finished, than it is to resize it to the proper size for each word.  It’s very rare that I’ll ever REDIM _PRESERVE Array(Limit + 1).  REDIM _PRESERVE Array(Limit + 1000) is generally the smallest increment I’ll use in a routine, and then I’ll resize to free unneeded memory when done.  ;)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Split and join strings
« Reply #8 on: February 16, 2019, 11:21:02 am »
Generally, if I want to strip out extra spaces, I’ll just run the string through a find-replace routine to change “  “ to “ “ before running the split routine.  It’s a fast process and saves time overall to do the whole process at once, instead of checking/trimming every loop inside the split routine.

And if you notice, my little routine only resizes the array larger when needed, and since REDIM _PRESERVE is a slow process, it’s better to oversized it by 1000 (or more) elements and then resize down when finished, than it is to resize it to the proper size for each word.  It’s very rare that I’ll ever REDIM _PRESERVE Array(Limit + 1).  REDIM _PRESERVE Array(Limit + 1000) is generally the smallest increment I’ll use in a routine, and then I’ll resize to free unneeded memory when done.  ;)

Well Steve, I did strip out all extra spaces (if more than one) before running through the Split code (only if the delimiter was a space) and I did notice increasing the array by chucks instead of by 1 and already mentioned I was going to use that time saver tip!

One more interesting point about your method is that it is recursive which is a favorite technique I like to see employed. But I have heard that recursive techniques aren't as efficient ultimately as non recursive ones ie for every recursive routine there exists a more efficient non recursive one. (The proof of that might be interesting!) If true, then there is a faster version still needing to be revealed. :)
« Last Edit: February 16, 2019, 11:24:42 am by bplus »

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Re: Split and join strings
« Reply #9 on: February 16, 2019, 01:17:11 pm »
I have had to trim extra spaces a lot in html parsing. I made something simple years ago that I still use for that. It's so simple, I'll just code it here...

Code: QB64: [Select]
  1. WIDTH 120, 25
  2. a$ = "   This   is a test      of   eliminating     multiple         spaces    in    a text    line.   "
  3. DO UNTIL INSTR(a$, "  ") = 0
  4.     a$ = MID$(a$, 1, INSTR(a$, "  ") - 1) + MID$(a$, INSTR(a$, "  ") + 1)
  5. a$ = RTRIM$(LTRIM$(a$))
  6.  

I probably should have read previous posts before putting it up. It may have no relevance, but at least it prints text with an apolitical message!

Replacement, splitting, concatenating are all very useful. Good luck with these additions, as I assume this is more stuff for the tool box forum.

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Split and join strings
« Reply #10 on: February 17, 2019, 12:44:45 pm »
Ha, ha, ha! Steve pulled a fast one! ;)

With such trivial tests his Split shines but check out a test with 10,000 items to Split and the new Split1000 sub:
Code: QB64: [Select]
  1. 'split test.bas for qb64 bplus 2018-05-07
  2. ' directly below is Steve's Timed test orig tests commented out
  3. ' 2019-02-17 modified by B+ with a new Split1000 sub and a SERIOUS String to Split!
  4.  
  5. '=================================================================== steve;s speed test
  6. CONST ntests = 6
  7. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  8. REDIM results1(0) AS STRING
  9. REDIM results2(0) AS STRING
  10. REDIM results3(0) AS STRING
  11.  
  12. CONST Limit = 100
  13.  
  14. 'trivial tests just to test accuracy of split
  15. a(0) = ""
  16. d(0) = " "
  17. a(1) = " test test    test " 'good no error!
  18. d(1) = " "
  19. a(2) = " test"
  20. d(2) = " "
  21. a(3) = "3d,z6d,z1 10 #d,z5"
  22. d(3) = ",z"
  23. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  24. d(4) = ", "
  25.  
  26. 'lets get a serious test in here!! test a 10,000 random number string
  27. FOR i = 1 TO 10000
  28.     s$ = s$ + STR$(RND)
  29. a(5) = s$
  30. d(5) = " "
  31.  
  32. FOR i = 0 TO ntests - 1
  33.     CLS
  34.     t# = TIMER
  35.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  36.         SteveSplit a(i), d(i), results1()
  37.     NEXT
  38.     t1# = TIMER
  39.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  40.         REDIM results2(0) AS STRING
  41.         Split1000 a(i), d(i), results2()
  42.     NEXT
  43.     t2# = TIMER
  44.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  45.         Lsplit a(i), d(i), results3()
  46.     NEXT
  47.     t3# = TIMER
  48.     PRINT "TEST #"; i; " -- Splitting: "; CHR$(34); MID$(a(i), 1, 80); CHR$(34); " with "; CHR$(34); d(i); CHR$(34)
  49.     PRINT: PRINT "Test names:", "SteveSplit", "Split1000", "Lsplit"
  50.     PRINT "Times:",
  51.     PRINT USING "###.####     ###.####     ###.####"; t1# - t#, t2# - t1#, t3# - t2#
  52.     PRINT: PRINT "First Items in Results arrays (up to 10):"
  53.     FOR j = 1 TO 10
  54.         p = 0
  55.         IF j <= UBOUND(results1) THEN p = 1: PRINT j, CHR$(34); MID$(results1(j), 1, 15); CHR$(34),
  56.         IF j <= UBOUND(results2) + 1 THEN p = 1: PRINT CHR$(34); results2(j - 1); CHR$(34),
  57.         IF j <= UBOUND(results3) + 1 THEN p = 1: PRINT CHR$(34); results3(j - 1); CHR$(34),
  58.         IF p THEN PRINT
  59.     NEXT
  60.     PRINT: INPUT "Press enter for next test... "; wate$
  61.  
  62.  
  63.  
  64. '' ================================= My Old Split test Code
  65. 'the space delimiter is such a special case perhaps I should develope a single split for that alone?
  66. ''2018-08-25 reworked for space delimiters and more variable declares
  67. ''2019-02-15 add Luke's version to compare
  68. 'ntests = 5
  69. 'DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  70.  
  71. 'a(0) = ""
  72. 'd(0) = " "
  73. 'a(1) = " test test    test " 'good no error!
  74. 'd(1) = " "
  75. 'a(2) = " test"
  76. 'd(2) = " "
  77. 'a(3) = "3d,z6d,z1 10 #d,z5"
  78. 'd(3) = ",z"
  79. 'a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  80. 'd(4) = ", "
  81. 'REDIM myarr(0) AS STRING '<<<<< REDIM forces the creation of a dynamic/resizable array
  82. 'FOR test = 0 TO ntests - 1
  83. '    PRINT: PRINT "splitting {"; a(test); "} with delimeter {"; d(test); "}"
  84. '    Split1000 a(test), d(test), myarr()
  85. '    amax = UBOUND(myarr)
  86. '    FOR i = 0 TO amax
  87. '        PRINT i; ":"; myarr(i)
  88. '    NEXT i
  89. '    INPUT "press enter for next test... "; wate$
  90. 'NEXT
  91.  
  92. '' how about a quick file reader test?
  93. 'PRINT: INPUT "Press enter for file test, any other + enter quits! "; wate$
  94. 'IF LEN(wate$) THEN END
  95. 'CLS
  96.  
  97. ''other wise continue
  98. 'OPEN "Split test.bas" FOR BINARY AS #1 '<<< this file name!!!
  99. 'ftext$ = SPACE$(LOF(1))
  100. 'GET #1, , ftext$
  101. 'CLOSE #1
  102. 'Split ftext$, CHR$(13) + CHR$(10), myarr()
  103. 'FOR i = 0 TO UBOUND(myarr)
  104. '    PRINT myarr(i)
  105. '    IF i MOD 20 = 19 THEN PRINT: INPUT "press enter for more "; wate$
  106. 'NEXT
  107. 'PRINT "the end"
  108. 'END ' end program
  109.  
  110.  
  111.  
  112. '
  113. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  114. SUB Split1000 (mystr AS STRING, delim AS STRING, arr() AS STRING)
  115.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  116.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  117.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  118.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  119.  
  120.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  121.  
  122.     'special case if delim is space, probably want to remove all excess space
  123.     IF delim = " " THEN
  124.         copy = RTRIM$(LTRIM$(copy))
  125.         p = INSTR(copy, "  ")
  126.         WHILE p > 0
  127.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  128.             p = INSTR(copy, "  ")
  129.         WEND
  130.     END IF
  131.     curpos = 1
  132.     arrpos = 0
  133.     dpos = INSTR(curpos, copy, delim)
  134.     DO UNTIL dpos = 0
  135.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  136.         arrpos = arrpos + 1
  137.         IF arrpos > UBOUND(arr) THEN REDIM _PRESERVE arr(UBOUND(arr) + 1000) AS STRING
  138.         curpos = dpos + LEN(delim)
  139.         dpos = INSTR(curpos, copy, delim)
  140.     LOOP
  141.     arr(arrpos) = MID$(copy, curpos)
  142.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  143.  
  144.  
  145. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  146. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  147.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  148.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  149.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  150.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  151.  
  152.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  153.  
  154.     'special case if delim is space, probably want to remove all excess space
  155.     IF delim = " " THEN
  156.         copy = RTRIM$(LTRIM$(copy))
  157.         p = INSTR(copy, "  ")
  158.         WHILE p > 0
  159.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  160.             p = INSTR(copy, "  ")
  161.         WEND
  162.     END IF
  163.     curpos = 1
  164.     arrpos = 0
  165.     dpos = INSTR(curpos, copy, delim)
  166.     DO UNTIL dpos = 0
  167.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  168.         arrpos = arrpos + 1
  169.         REDIM _PRESERVE arr(arrpos + 1000) AS STRING
  170.         curpos = dpos + LEN(delim)
  171.         dpos = INSTR(curpos, copy, delim)
  172.     LOOP
  173.     arr(arrpos) = MID$(copy, curpos)
  174.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  175.  
  176.  
  177. ' Luke 2019-02-15
  178. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  179. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  180. '
  181. 'delimiter$ must be one character long.
  182. 'result$() must have been REDIMmed previously.
  183. SUB Lsplit (in$, delimiter$, result$())
  184.     REDIM result$(-1)
  185.     start = 1
  186.     DO
  187.         WHILE MID$(in$, start, 1) = delimiter$
  188.             start = start + 1
  189.             IF start > LEN(in$) THEN EXIT SUB
  190.         WEND
  191.         finish = INSTR(start, in$, delimiter$)
  192.         IF finish = 0 THEN finish = LEN(in$) + 1
  193.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  194.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  195.         start = finish + 1
  196.     LOOP WHILE start <= LEN(in$)
  197.  
  198. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  199. FUNCTION join$ (in$(), delimiter$)
  200.     result$ = in$(LBOUND(in$))
  201.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  202.         result$ = result$ + delimiter$ + in$(i)
  203.     NEXT i
  204.     join$ = result$
  205.  
  206. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  207.     STATIC count AS LONG
  208.     count = count + 1
  209.     u = UBOUND(storage_array)
  210.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  211.     i = INSTR(text$, delimiter$)
  212.     IF i THEN
  213.         storage_array(count) = LEFT$(text$, i - 1)
  214.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  215.     ELSE
  216.         storage_array(count) = text$
  217.         REDIM _PRESERVE storage_array(count) AS STRING
  218.         count = 0
  219.     END IF
  220.  
  221.  

I knew his recursive method was curse worthy. ;-)))
Splite Timed Test #2 now with Split1000.PNG
* Splite Timed Test #2 now with Split1000.PNG (Filesize: 13.86 KB, Dimensions: 645x387, Views: 413)
« Last Edit: February 17, 2019, 12:58:02 pm by bplus »

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #11 on: February 17, 2019, 01:49:10 pm »
25 seconds doesn’t seem right at all.  I’ll do some digging to see what’s up later and report back.  ;)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #12 on: February 17, 2019, 02:46:48 pm »
Try this non-recursive version and see how it performs for you:

Code: QB64: [Select]
  1. 'split test.bas for qb64 bplus 2018-05-07
  2. ' directly below is Steve's Timed test orig tests commented out
  3. ' 2019-02-17 modified by B+ with a new Split1000 sub and a SERIOUS String to Split!
  4.  
  5. '=================================================================== steve;s speed test
  6. CONST ntests = 6
  7. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  8. REDIM results1(0) AS STRING
  9. REDIM results2(0) AS STRING
  10. REDIM results3(0) AS STRING
  11.  
  12. CONST Limit = 100
  13. CONST NoNull = -1
  14.  
  15. 'trivial tests just to test accuracy of split
  16. a(0) = ""
  17. d(0) = " "
  18. a(1) = " test test    test " 'good no error!
  19. d(1) = " "
  20. a(2) = " test"
  21. d(2) = " "
  22. a(3) = "3d,z6d,z1 10 #d,z5"
  23. d(3) = ",z"
  24. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  25. d(4) = ", "
  26.  
  27. 'lets get a serious test in here!! test a 10,000 random number string
  28. FOR i = 1 TO 10000
  29.     s$ = s$ + STR$(RND)
  30. a(5) = s$
  31. d(5) = " "
  32.  
  33. FOR i = 0 TO ntests - 1
  34.     CLS
  35.     t# = TIMER
  36.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  37.         SteveSplit2 a(i), d(i), results1(), NoNull
  38.     NEXT
  39.     t1# = TIMER
  40.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  41.         REDIM results2(0) AS STRING
  42.         Split1000 a(i), d(i), results2()
  43.     NEXT
  44.     t2# = TIMER
  45.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  46.         Lsplit a(i), d(i), results3()
  47.     NEXT
  48.     t3# = TIMER
  49.     PRINT "TEST #"; i; " -- Splitting: "; CHR$(34); MID$(a(i), 1, 80); CHR$(34); " with "; CHR$(34); d(i); CHR$(34)
  50.     PRINT: PRINT "Test names:", "SteveSplit", "Split1000", "Lsplit"
  51.     PRINT "Times:",
  52.     PRINT USING "###.####     ###.####     ###.####"; t1# - t#, t2# - t1#, t3# - t2#
  53.     PRINT: PRINT "First Items in Results arrays (up to 10):"
  54.     FOR j = 1 TO 10
  55.         p = 0
  56.         PRINT j,
  57.         IF j <= UBOUND(results1) THEN PRINT CHR$(34); MID$(results1(j), 1, 15); CHR$(34), ELSE PRINT ,
  58.         IF j <= UBOUND(results2) + 1 THEN PRINT CHR$(34); results2(j - 1); CHR$(34), ELSE PRINT ,
  59.         IF j <= UBOUND(results3) + 1 THEN PRINT CHR$(34); results3(j - 1); CHR$(34),
  60.         PRINT
  61.     NEXT
  62.     PRINT: INPUT "Press enter for next test... "; wate$
  63.  
  64.  
  65.  
  66.  
  67. '
  68. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  69. SUB Split1000 (mystr AS STRING, delim AS STRING, arr() AS STRING)
  70.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  71.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  72.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  73.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  74.  
  75.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  76.  
  77.     'special case if delim is space, probably want to remove all excess space
  78.     IF delim = " " THEN
  79.         copy = RTRIM$(LTRIM$(copy))
  80.         p = INSTR(copy, "  ")
  81.         WHILE p > 0
  82.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  83.             p = INSTR(copy, "  ")
  84.         WEND
  85.     END IF
  86.     curpos = 1
  87.     arrpos = 0
  88.     dpos = INSTR(curpos, copy, delim)
  89.     DO UNTIL dpos = 0
  90.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  91.         arrpos = arrpos + 1
  92.         IF arrpos > UBOUND(arr) THEN REDIM _PRESERVE arr(UBOUND(arr) + 1000) AS STRING
  93.         curpos = dpos + LEN(delim)
  94.         dpos = INSTR(curpos, copy, delim)
  95.     LOOP
  96.     arr(arrpos) = MID$(copy, curpos)
  97.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  98.  
  99.  
  100. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  101. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  102.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  103.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  104.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  105.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  106.  
  107.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  108.  
  109.     'special case if delim is space, probably want to remove all excess space
  110.     IF delim = " " THEN
  111.         copy = RTRIM$(LTRIM$(copy))
  112.         p = INSTR(copy, "  ")
  113.         WHILE p > 0
  114.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  115.             p = INSTR(copy, "  ")
  116.         WEND
  117.     END IF
  118.     curpos = 1
  119.     arrpos = 0
  120.     dpos = INSTR(curpos, copy, delim)
  121.     DO UNTIL dpos = 0
  122.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  123.         arrpos = arrpos + 1
  124.         REDIM _PRESERVE arr(arrpos + 1000) AS STRING
  125.         curpos = dpos + LEN(delim)
  126.         dpos = INSTR(curpos, copy, delim)
  127.     LOOP
  128.     arr(arrpos) = MID$(copy, curpos)
  129.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  130.  
  131.  
  132. ' Luke 2019-02-15
  133. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  134. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  135. '
  136. 'delimiter$ must be one character long.
  137. 'result$() must have been REDIMmed previously.
  138. SUB Lsplit (in$, delimiter$, result$())
  139.     REDIM result$(-1)
  140.     start = 1
  141.     DO
  142.         WHILE MID$(in$, start, 1) = delimiter$
  143.             start = start + 1
  144.             IF start > LEN(in$) THEN EXIT SUB
  145.         WEND
  146.         finish = INSTR(start, in$, delimiter$)
  147.         IF finish = 0 THEN finish = LEN(in$) + 1
  148.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  149.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  150.         start = finish + 1
  151.     LOOP WHILE start <= LEN(in$)
  152.  
  153. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  154. FUNCTION join$ (in$(), delimiter$)
  155.     result$ = in$(LBOUND(in$))
  156.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  157.         result$ = result$ + delimiter$ + in$(i)
  158.     NEXT i
  159.     join$ = result$
  160.  
  161. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  162.     STATIC count AS LONG
  163.     count = count + 1
  164.     u = UBOUND(storage_array)
  165.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  166.     i = INSTR(text$, delimiter$)
  167.     IF i THEN
  168.         storage_array(count) = LEFT$(text$, i - 1)
  169.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  170.     ELSE
  171.         storage_array(count) = text$
  172.         REDIM _PRESERVE storage_array(count) AS STRING
  173.         count = 0
  174.     END IF
  175.  
  176. SUB SteveSplit2 (text$, delimiter$, storage_array() AS STRING, Options AS INTEGER)
  177.     IF Options AND 1 THEN text$ = LTRIM$(text$)
  178.     IF Options AND 2 THEN text$ = RTRIM$(text$)
  179.     count = 1: oldi = 1
  180.     l = LEN(delimiter$)
  181.     u = UBOUND(storage_array)
  182.     IF u < 1 THEN REDIM _PRESERVE storage_array(1000) AS STRING
  183.     DO
  184.         i = INSTR(oldi, text$, delimiter$)
  185.         IF i THEN
  186.             length = i - oldi
  187.             u = UBOUND(storage_array)
  188.             storage_array(count) = MID$(text$, oldi, length)
  189.             IF (Options AND 4) AND (LEN(storage_array(count)) = 0) THEN
  190.                 count = count - 1 'remove null-strings.
  191.             END IF
  192.             oldi = i + l
  193.             i = oldi
  194.             count = count + 1
  195.             IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  196.         END IF
  197.     LOOP UNTIL i = 0
  198.     storage_array(count) = MID$(text$, oldi)
  199.     REDIM _PRESERVE storage_array(count) AS STRING
  200.  

Added feature:  This now has a simple flag which you can use to make null$ a non-acceptable result for your storage_array, so it'll behave exactly as the other routines.  Just change CONST NoNull = -1  to 0 and see how it toggles between the two methods.

(And both are the fastest splitters yet, so now it might not sound so bad to say, "I pulled a fast one."  ;D )

((And, if you like keeping the null-strings inside a split routine, as I generally do, you can remove those IF checks and make it even speedier. ))



https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Split and join strings
« Reply #13 on: February 17, 2019, 03:13:10 pm »
Performs great Steve!

They are now running neck and neck on the big test and I agree without the IF check for options, SteveSplit2 should be faster yet.
So that means I might have another fix to speed up Split1000 because for my code, the delimiter check for space is a sort of Option check done before running the main part of the Split code.

« Last Edit: February 17, 2019, 03:15:16 pm by bplus »

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Split and join strings
« Reply #14 on: February 17, 2019, 04:44:29 pm »
One tiny mod to Split1000 and another test that shows it out paces SteveSplit2 when delimiter isn't a space (which is what slows down Split1000 specially clear now that the limit is increased), probably because of the extra decision about Options in SteveSplit2.
Code: QB64: [Select]
  1. 'split test.bas for qb64 bplus 2018-05-07
  2. ' directly below is Steve's Timed test orig tests commented out
  3. ' 2019-02-17 modified by B+ with a new Split1000 sub and a SERIOUS String to Split!
  4.  
  5. '=================================================================== steve;s speed test
  6. CONST ntests = 7
  7. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  8. REDIM results1(0) AS STRING
  9. REDIM results2(0) AS STRING
  10. REDIM results3(0) AS STRING
  11.  
  12. CONST Limit = 1000
  13. CONST NoNull = -1
  14.  
  15. 'trivial tests just to test accuracy of split
  16. a(0) = ""
  17. d(0) = " "
  18. a(1) = " test test    test " 'good no error!
  19. d(1) = " "
  20. a(2) = " test"
  21. d(2) = " "
  22. a(3) = "3d,z6d,z1 10 #d,z5"
  23. d(3) = ",z"
  24. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  25. d(4) = ", "
  26.  
  27. 'lets get a serious test in here!! test a 10,000 random number string
  28. FOR i = 1 TO 10000
  29.     s$ = s$ + STR$(RND)
  30. a(5) = s$
  31. d(5) = " "
  32. FOR i = 1 TO 10000
  33.     IF i = 1 THEN s$ = STR$(RND * 1000 \ 1) ELSE s$ = s$ + "," + STR$(RND * 1000 \ 1)
  34. a(6) = s$
  35. d(6) = ", "
  36.  
  37.  
  38. FOR i = 0 TO ntests - 1
  39.     CLS
  40.     t# = TIMER
  41.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  42.         SteveSplit2 a(i), d(i), results1(), NoNull
  43.     NEXT
  44.     t1# = TIMER
  45.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  46.         REDIM results2(0) AS STRING
  47.         Split1000 a(i), d(i), results2()
  48.     NEXT
  49.     t2# = TIMER
  50.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  51.         Lsplit a(i), d(i), results3()
  52.     NEXT
  53.     t3# = TIMER
  54.     PRINT "TEST #"; i; " -- Splitting: "; CHR$(34); MID$(a(i), 1, 80); CHR$(34); " with "; CHR$(34); d(i); CHR$(34)
  55.     PRINT: PRINT "Test names:", "SteveSplit2", "Split1000", "Lsplit"
  56.     PRINT "Times:",
  57.     PRINT USING "###.####     ###.####     ###.####"; t1# - t#, t2# - t1#, t3# - t2#
  58.     PRINT: PRINT "First Items in Results arrays (up to 10):"
  59.     FOR j = 1 TO 10
  60.         PRINT j,
  61.         IF j <= UBOUND(results1) THEN PRINT CHR$(34); MID$(results1(j), 1, 15); CHR$(34), ELSE PRINT "      ",
  62.         IF j <= UBOUND(results2) + 1 THEN PRINT CHR$(34); results2(j - 1); CHR$(34), ELSE PRINT "      ",
  63.         IF j <= UBOUND(results3) + 1 THEN PRINT CHR$(34); results3(j - 1); CHR$(34), ELSE PRINT "      ",
  64.         PRINT
  65.     NEXT
  66.     PRINT: INPUT "Press enter for next test... "; wate$
  67.  
  68.  
  69.  
  70. '' ================================= My Old Split test Code
  71. 'the space delimiter is such a special case perhaps I should develope a single split for that alone?
  72. ''2018-08-25 reworked for space delimiters and more variable declares
  73. ''2019-02-15 add Luke's version to compare
  74. 'ntests = 5
  75. 'DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  76.  
  77. 'a(0) = ""
  78. 'd(0) = " "
  79. 'a(1) = " test test    test " 'good no error!
  80. 'd(1) = " "
  81. 'a(2) = " test"
  82. 'd(2) = " "
  83. 'a(3) = "3d,z6d,z1 10 #d,z5"
  84. 'd(3) = ",z"
  85. 'a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  86. 'd(4) = ", "
  87. 'REDIM myarr(0) AS STRING '<<<<< REDIM forces the creation of a dynamic/resizable array
  88. 'FOR test = 0 TO ntests - 1
  89. '    PRINT: PRINT "splitting {"; a(test); "} with delimeter {"; d(test); "}"
  90. '    Split1000 a(test), d(test), myarr()
  91. '    amax = UBOUND(myarr)
  92. '    FOR i = 0 TO amax
  93. '        PRINT i; ":"; myarr(i)
  94. '    NEXT i
  95. '    INPUT "press enter for next test... "; wate$
  96. 'NEXT
  97.  
  98. '' how about a quick file reader test?
  99. 'PRINT: INPUT "Press enter for file test, any other + enter quits! "; wate$
  100. 'IF LEN(wate$) THEN END
  101. 'CLS
  102.  
  103. ''other wise continue
  104. 'OPEN "Split test.bas" FOR BINARY AS #1 '<<< this file name!!!
  105. 'ftext$ = SPACE$(LOF(1))
  106. 'GET #1, , ftext$
  107. 'CLOSE #1
  108. 'Split ftext$, CHR$(13) + CHR$(10), myarr()
  109. 'FOR i = 0 TO UBOUND(myarr)
  110. '    PRINT myarr(i)
  111. '    IF i MOD 20 = 19 THEN PRINT: INPUT "press enter for more "; wate$
  112. 'NEXT
  113. 'PRINT "the end"
  114. 'END ' end program
  115.  
  116. '
  117. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  118. SUB Split1000 (mystr AS STRING, delim AS STRING, arr() AS STRING)
  119.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  120.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  121.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  122.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  123.  
  124.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  125.  
  126.     'special case if delim is space, probably want to remove all excess space
  127.     IF delim = " " THEN
  128.         copy = RTRIM$(LTRIM$(copy))
  129.         p = INSTR(copy, "  ")
  130.         WHILE p > 0
  131.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  132.             p = INSTR(copy, "  ")
  133.         WEND
  134.     END IF
  135.     curpos = 1
  136.     arrpos = 0
  137.     LD = LEN(delim) 'mod
  138.     dpos = INSTR(curpos, copy, delim)
  139.     DO UNTIL dpos = 0
  140.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  141.         arrpos = arrpos + 1
  142.         IF arrpos > UBOUND(arr) THEN REDIM _PRESERVE arr(UBOUND(arr) + 1000) AS STRING
  143.         curpos = dpos + LD
  144.         dpos = INSTR(curpos, copy, delim)
  145.     LOOP
  146.     arr(arrpos) = MID$(copy, curpos)
  147.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  148.  
  149.  
  150. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  151. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  152.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  153.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  154.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  155.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  156.  
  157.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  158.  
  159.     'special case if delim is space, probably want to remove all excess space
  160.     IF delim = " " THEN
  161.         copy = RTRIM$(LTRIM$(copy))
  162.         p = INSTR(copy, "  ")
  163.         WHILE p > 0
  164.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  165.             p = INSTR(copy, "  ")
  166.         WEND
  167.     END IF
  168.     curpos = 1
  169.     arrpos = 0
  170.     dpos = INSTR(curpos, copy, delim)
  171.     DO UNTIL dpos = 0
  172.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  173.         arrpos = arrpos + 1
  174.         REDIM _PRESERVE arr(arrpos + 1000) AS STRING
  175.         curpos = dpos + LEN(delim)
  176.         dpos = INSTR(curpos, copy, delim)
  177.     LOOP
  178.     arr(arrpos) = MID$(copy, curpos)
  179.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  180.  
  181.  
  182. ' Luke 2019-02-15
  183. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  184. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  185. '
  186. 'delimiter$ must be one character long.
  187. 'result$() must have been REDIMmed previously.
  188. SUB Lsplit (in$, delimiter$, result$())
  189.     REDIM result$(-1)
  190.     start = 1
  191.     DO
  192.         WHILE MID$(in$, start, 1) = delimiter$
  193.             start = start + 1
  194.             IF start > LEN(in$) THEN EXIT SUB
  195.         WEND
  196.         finish = INSTR(start, in$, delimiter$)
  197.         IF finish = 0 THEN finish = LEN(in$) + 1
  198.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  199.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  200.         start = finish + 1
  201.     LOOP WHILE start <= LEN(in$)
  202.  
  203. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  204. FUNCTION join$ (in$(), delimiter$)
  205.     result$ = in$(LBOUND(in$))
  206.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  207.         result$ = result$ + delimiter$ + in$(i)
  208.     NEXT i
  209.     join$ = result$
  210.  
  211. SUB SteveSplit2 (text$, delimiter$, storage_array() AS STRING, Options AS INTEGER)
  212.     IF Options AND 1 THEN text$ = LTRIM$(text$)
  213.     IF Options AND 2 THEN text$ = RTRIM$(text$)
  214.     count = 1: oldi = 1
  215.     l = LEN(delimiter$)
  216.     u = UBOUND(storage_array)
  217.     IF u < 1 THEN REDIM _PRESERVE storage_array(1000) AS STRING
  218.     DO
  219.         i = INSTR(oldi, text$, delimiter$)
  220.         IF i THEN
  221.             length = i - oldi
  222.             u = UBOUND(storage_array)
  223.             storage_array(count) = MID$(text$, oldi, length)
  224.             IF (Options AND 4) AND (LEN(storage_array(count)) = 0) THEN
  225.                 count = count - 1 'remove null-strings.
  226.             END IF
  227.             oldi = i + l
  228.             i = oldi
  229.             count = count + 1
  230.             IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  231.         END IF
  232.     LOOP UNTIL i = 0
  233.     storage_array(count) = MID$(text$, oldi)
  234.     REDIM _PRESERVE storage_array(count) AS STRING
  235.  
  236. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  237.     STATIC count AS LONG
  238.     count = count + 1
  239.     u = UBOUND(storage_array)
  240.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  241.     i = INSTR(text$, delimiter$)
  242.     IF i THEN
  243.         storage_array(count) = LEFT$(text$, i - 1)
  244.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  245.     ELSE
  246.         storage_array(count) = text$
  247.         REDIM _PRESERVE storage_array(count) AS STRING
  248.         count = 0
  249.     END IF
  250.  
  251.  
Split test another long string test.PNG
* Split test another long string test.PNG (Filesize: 13.6 KB, Dimensions: 643x436, Views: 332)
« Last Edit: February 17, 2019, 04:50:21 pm by bplus »