Topic: Split Versus Tokenize

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Split Versus Tokenize
« on: February 01, 2022, 11:33:24 am »
https://qb64forum.alephc.xyz/index.php?topic=4618.msg140260#msg140260
Quote
I think my way of splitting a string into an array is slightly better:

I say Split has at least two advantages!
1. Less LOC
2. Preserves blank lines from a .bas file.

And I am pretty sure that for small files there is no significant time difference. I'm not even sure Tokenize is faster, or how big a file you would need to see it. I leave that to believers of Tokenize. ;-))
Code: QB64:
_Title "Split versus Tokenize" ' b+ 2022-02-01

' Is Tokenize really worth the extra LOC compared to Split?
' No! if you want to preserve blank lines.

'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

'            save this file as "Split versus Tokenize.bas"

'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

' read the whole file into one string
Open "Split versus Tokenize.bas" For Binary As #1
buf$ = Space$(LOF(1))
Get #1, , buf$
Close #1

deli$ = Chr$(13) + Chr$(10)

ReDim LoadMe$(1 To 1)

startSplit = Timer(.001)
Split buf$, deli$, LoadMe$()
splitTime = Timer(.001) - startSplit

For i = LBound(LoadMe$) To 10
    Print i, LoadMe$(i)
Next
'For i = UBound(LoadMe$) - 5 To UBound(LoadMe$)
'    Print i, LoadMe$(i)
'Next
Print "Time for Split was:"; splitTime

' reset for tokenize
ReDim LoadMe$(1 To 1)
startTokenize = Timer(.001)
tokenize buf$, deli$, LoadMe$()
TokenizeTime = Timer(.001) - startTokenize
For i = LBound(LoadMe$) To 10
    Print i, LoadMe$(i)
Next
'For i = UBound(LoadMe$) - 5 To UBound(LoadMe$)
'    Print i, LoadMe$(i)
'Next
Print "Time for Tokenize was:"; TokenizeTime ' << edit: had wrong variable in here

' note: I buggered this twice now, FOR base 1 array REDIM MyArray (1 to 1) AS ... the (1 to 1) is not same as (1) which was the Blunder!!!
'notes: REDIM the array(0) to be loaded before calling Split '<<<< IMPORTANT dynamic array and empty, can use any lbound though
'This SUB will take a given N delimited string, and delimiter$ and create an array of N+1 strings using the LBOUND of the given dynamic array to load.
'notes: the loadMeArray() needs to be dynamic string array and will not change the LBOUND of the array it is given.  rev 2019-08-27
Sub Split (SplitMeString As String, delim As String, loadMeArray() As String)
    Dim curpos As Long, arrpos As Long, LD As Long, dpos As Long 'fix use the Lbound the array already has
    curpos = 1: arrpos = LBound(loadMeArray): LD = Len(delim)
    dpos = InStr(curpos, SplitMeString, delim)
    Do Until dpos = 0
        loadMeArray(arrpos) = Mid$(SplitMeString, curpos, dpos - curpos)
        arrpos = arrpos + 1
        ' grow in chunks of 1000 so ReDim is not called for every item
        If arrpos > UBound(loadMeArray) Then ReDim _Preserve loadMeArray(LBound(loadMeArray) To UBound(loadMeArray) + 1000) As String
        curpos = dpos + LD
        dpos = InStr(curpos, SplitMeString, delim)
    Loop
    loadMeArray(arrpos) = Mid$(SplitMeString, curpos)
    ReDim _Preserve loadMeArray(LBound(loadMeArray) To arrpos) As String 'get the ubound correct
End Sub

Sub tokenize (toTokenize As String, delimiters As String, StorageArray() As String)
    Declare CustomType Library
        Function strtok%& (ByVal str As _Offset, delimiters As String)
    End Declare
    Dim As _Offset tokenized
    Dim As String tokCopy: tokCopy = toTokenize + Chr$(0)
    Dim As String delCopy: delCopy = delimiters + Chr$(0)
    Dim As _Unsigned Long lowerbound: lowerbound = LBound(StorageArray)
    Dim As _Unsigned Long i: i = lowerbound
    tokenized = strtok(_Offset(tokCopy), delCopy)
    While tokenized <> 0
        ReDim _Preserve StorageArray(lowerbound To UBound(StorageArray) + 1)
        StorageArray(i) = pointerToString(tokenized)
        tokenized = strtok(0, delCopy)
        i = i + 1
    Wend
    ReDim _Preserve StorageArray(UBound(StorageArray) - 1)
End Sub

Function pointerToString$ (pointer As _Offset)
    Declare CustomType Library
        Function strlen%& (ByVal ptr As _Unsigned _Offset)
    End Declare
    Dim As _Offset length: length = strlen(pointer)
    If length Then
        Dim As _MEM pString: pString = _Mem(pointer, length)
        Dim As String ret: ret = Space$(length)
        _MemGet pString, pString.OFFSET, ret
        _MemFree pString
    End If
    pointerToString = ret
End Function

Edit: had the wrong time variable when showing the Tokenize time. Ha! I was wondering how they were coming out exactly the same each time I tested.
« Last Edit: February 01, 2022, 01:14:15 pm by bplus »

Offline _vince

  • Seasoned Forum Regular
  • Posts: 422
Re: Split Versus Tokenize
« Reply #1 on: February 01, 2022, 11:58:49 am »
nice, bplus.  Those tokenize and pointertostring functions should be banned.  Looks like a C compiler swallowed VB.NET -- not BASIC at all!

Offline bplus

Re: Split Versus Tokenize
« Reply #2 on: February 01, 2022, 12:02:22 pm »
Quote
Looks like a C compiler swallowed VB.NET

LOL

I will bet if we really need a speed bump for some app, we might have to resort to this stuff?

Offline Cobalt

  • QB64 Developer
  • Forum Resident
  • Posts: 878
  • At 60 I become highly radioactive!
Re: Split Versus Tokenize
« Reply #3 on: February 01, 2022, 12:28:25 pm »
Quote
LOL
I will bet if we really need a speed bump for some app, we might have to resort to this stuff?

At that point it's time to go to a non-translated language. If that much speed is needed, better start learning machine code and using ASM.
Granted after becoming radioactive I only have a half-life!

Offline bplus

Re: Split Versus Tokenize
« Reply #4 on: February 01, 2022, 01:16:56 pm »
Also weird is how some times are negative!?!

Does that mean if we run the program a lot we will go back in time?
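
A likely explanation for the negative times, offered as an assumption rather than something tested in this thread: startSplit and startTokenize are undeclared, so they default to Single. A Single holds roughly 7 significant digits, which is not enough to keep a Timer value in the tens of thousands of seconds accurate to the millisecond. The stored start time can round up past the next reading, and the difference goes negative. A minimal sketch that keeps the readings in Double variables (startTime and elapsed are illustrative names):

```qb64
' Sketch: keep Timer(.001) readings in Double variables.
Dim startTime As Double, elapsed As Double

startTime = Timer(.001)
For i = 1 To 1000000: x = x + 1: Next ' stand-in for the work being timed
elapsed = Timer(.001) - startTime
If elapsed < 0 Then elapsed = elapsed + 86400 ' Timer wraps back to 0 at midnight
Print "Elapsed seconds:"; elapsed
```

The midnight check covers the one case where a negative difference is expected even with Double variables.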

Offline SpriggsySpriggs

  • Forum Resident
  • Posts: 1145
  • Larger than life
    • GitHub
Re: Split Versus Tokenize
« Reply #5 on: February 01, 2022, 01:37:21 pm »
The reason you weren't preserving blank lines is because you are using tokenize incorrectly. You should pass it either CHR$(13) or CHR$(10) (can't remember which). If you pass both, it'll split by both. It can take a list of delimiters.
Shuwatch!

Offline bplus

Re: Split Versus Tokenize
« Reply #6 on: February 01, 2022, 02:48:17 pm »
Quote
The reason you weren't preserving blank lines is because you are using tokenize incorrectly. You should pass it either CHR$(13) or CHR$(10) (can't remember which). If you pass both, it'll split by both. It can take a list of delimiters.

Both are used as one delimiter for .bas files.

So you are saying I cannot use ", " as 1 delimiter for Tokenize either, and I must use only single char delimiters?

Advantage #3 for Split = variable length delimiters. :-))

Offline SpriggsySpriggs

Re: Split Versus Tokenize
« Reply #7 on: February 01, 2022, 02:49:57 pm »
You're intentionally misinterpreting what I'm saying but that's ok. If you prefer to remain ignorant then by all means, please do.

Offline bplus

Re: Split Versus Tokenize
« Reply #8 on: February 01, 2022, 02:51:50 pm »
Quote
You're intentionally misinterpreting what I'm saying but that's ok. If you prefer to remain ignorant then by all means, please do.

What did I misinterpret? .bas files really are split by Chr$(13) + Chr$(10); you need both in one delimiter.

Offline bplus

Re: Split Versus Tokenize
« Reply #9 on: February 01, 2022, 02:58:20 pm »
@SpriggsySpriggs

If you want, I will say Tokenize has advantage of multiple single char delimiters, that must be how you "list" them in a single delimiter string...

Actually if that were true Tokenize should be splitting into even more blank lines?

I suspect Tokenize ignores blank lines? Maybe?

Easy enough to test....
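
For reference, C's strtok treats its delimiter argument as a set of single characters and skips any run of them, so it never returns an empty token. That would account for the missing blank lines. Below is a rough pure-QB64 sketch of that behaviour; CountTokens is a made-up helper for illustration and is not part of the tokenize code above.

```qb64
' Sketch only: count tokens the way strtok would see them.
s$ = "a" + Chr$(10) + Chr$(10) + "b" ' two LFs in a row = one blank line
Print CountTokens&(s$, Chr$(10)) ' prints 2: the empty field is skipped

' CountTokens is a hypothetical helper, not part of the thread's code.
Function CountTokens& (text As String, delims As String)
    Dim i As Long, inToken As Long, n As Long
    For i = 1 To Len(text)
        If InStr(delims, Mid$(text, i, 1)) > 0 Then
            inToken = 0 ' any delimiter character ends the current token
        ElseIf inToken = 0 Then
            inToken = -1: n = n + 1 ' first character of a new token
        End If
    Next
    CountTokens& = n
End Function
```

Split on the same string would give three elements ("a", "", "b"), because it looks for the whole delimiter string and keeps the empty field between two matches.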

Offline bplus

Re: Split Versus Tokenize
« Reply #10 on: February 01, 2022, 03:13:40 pm »
4th advantage of Split = it preserves the Lbound of the storage array.


Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • Steve’s QB64 Archive Forum
Re: Split Versus Tokenize
« Reply #11 on: February 01, 2022, 03:35:36 pm »
Quote
What did I misinterpret? .bas files really are split by Chr$(13) + Chr$(10); you need both in one delimiter.

This isn't quite true.  :P

Files are delimited by their line endings, which vary from OS to OS.

Old versions of Windows used CHR$(13) + CHR$(10) (CRLF) as the line ending.
Linux uses CHR$(10) for line endings.
Old Macs used CHR$(13) for line endings.
New Macs have swapped over to CHR$(10) line endings, if I remember correctly.
New versions of Windows have gotten with the program: they still generally default to the old CHR$(13) + CHR$(10) line endings, but they also now allow you to import and export your line endings however the heck you want!

It's your OS which determines what types of line endings your files have; not the BAS extension itself.  At the end of the day, all a BAS file is, is a TXT file with a little more descriptive extension on it.  ;)
« Last Edit: February 01, 2022, 04:13:53 pm by SMcNeill »
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!
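
Since the line ending depends on where the file came from, one option is to pick the delimiter from the data before calling Split. The sketch below is only an illustration under that assumption; DetectEOL$ is a hypothetical helper name, and the sample text is made up.

```qb64
' Sketch: choose the delimiter from the data instead of assuming CRLF.
buf$ = "one" + Chr$(10) + "two" + Chr$(10) + "three" ' LF-only sample text
deli$ = DetectEOL$(buf$)
Print "Delimiter length:"; Len(deli$); " first byte:"; Asc(deli$)

' DetectEOL$ is a hypothetical helper, not part of the thread's code.
Function DetectEOL$ (text As String)
    If InStr(text, Chr$(13) + Chr$(10)) > 0 Then
        DetectEOL$ = Chr$(13) + Chr$(10) ' Windows-style CRLF
    ElseIf InStr(text, Chr$(10)) > 0 Then
        DetectEOL$ = Chr$(10) ' LF only (Linux, newer Macs)
    ElseIf InStr(text, Chr$(13)) > 0 Then
        DetectEOL$ = Chr$(13) ' CR only (old Macs)
    Else
        DetectEOL$ = Chr$(10) ' no line break found; the text is a single line
    End If
End Function
```

CRLF is checked first because a CRLF file also contains both single characters. The result can be passed straight to Split as its delim argument.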

Offline bplus

Re: Split Versus Tokenize
« Reply #12 on: February 01, 2022, 04:04:54 pm »
Screw files, then. Here is a test that proves tokenize ignores blank lines and changes the lbound of the load array to 0:
Code: QB64:
_Title "Tokenize blank line test" ' b+ 2022-02-01
' build a 10 line string where only lines 1, 4, 7 and 10 have text
b$ = "Line 1"
For i = 2 To 10
    If i = 4 Or i = 7 Or i = 10 Then
        b$ = b$ + Chr$(10) + "Line" + Str$(i)
    Else
        b$ = b$ + Chr$(10)
    End If
Next

deli$ = Chr$(10)

ReDim LoadMe$(1 To 1)
startSplit = Timer(.001)
Split b$, deli$, LoadMe$()
splitTime = Timer(.001) - startSplit
For i = LBound(LoadMe$) To UBound(LoadMe$)
    Print i, LoadMe$(i)
Next
Print "Time for Split was:"; splitTime

' reset for tokenize
ReDim LoadMe$(1 To 1)
startTokenize = Timer(.001)
tokenize b$, deli$, LoadMe$()
TokenizeTime = Timer(.001) - startTokenize
For i = LBound(LoadMe$) To UBound(LoadMe$)
    Print i, LoadMe$(i)
Next
Print "Time for Tokenize was:"; TokenizeTime

' note: I buggered this twice now, FOR base 1 array REDIM MyArray (1 to 1) AS ... the (1 to 1) is not same as (1) which was the Blunder!!!
'notes: REDIM the array(0) to be loaded before calling Split '<<<< IMPORTANT dynamic array and empty, can use any lbound though
'This SUB will take a given N delimited string, and delimiter$ and create an array of N+1 strings using the LBOUND of the given dynamic array to load.
'notes: the loadMeArray() needs to be dynamic string array and will not change the LBOUND of the array it is given.  rev 2019-08-27
Sub Split (SplitMeString As String, delim As String, loadMeArray() As String)
    Dim curpos As Long, arrpos As Long, LD As Long, dpos As Long 'fix use the Lbound the array already has
    curpos = 1: arrpos = LBound(loadMeArray): LD = Len(delim)
    dpos = InStr(curpos, SplitMeString, delim)
    Do Until dpos = 0
        loadMeArray(arrpos) = Mid$(SplitMeString, curpos, dpos - curpos)
        arrpos = arrpos + 1
        ' grow in chunks of 1000 so ReDim is not called for every item
        If arrpos > UBound(loadMeArray) Then ReDim _Preserve loadMeArray(LBound(loadMeArray) To UBound(loadMeArray) + 1000) As String
        curpos = dpos + LD
        dpos = InStr(curpos, SplitMeString, delim)
    Loop
    loadMeArray(arrpos) = Mid$(SplitMeString, curpos)
    ReDim _Preserve loadMeArray(LBound(loadMeArray) To arrpos) As String 'get the ubound correct
End Sub

Sub tokenize (toTokenize As String, delimiters As String, StorageArray() As String)
    Declare CustomType Library
        Function strtok%& (ByVal str As _Offset, delimiters As String)
    End Declare
    Dim As _Offset tokenized
    Dim As String tokCopy: tokCopy = toTokenize + Chr$(0)
    Dim As String delCopy: delCopy = delimiters + Chr$(0)
    Dim As _Unsigned Long lowerbound: lowerbound = LBound(StorageArray)
    Dim As _Unsigned Long i: i = lowerbound
    tokenized = strtok(_Offset(tokCopy), delCopy)
    While tokenized <> 0
        ReDim _Preserve StorageArray(lowerbound To UBound(StorageArray) + 1)
        StorageArray(i) = pointerToString(tokenized)
        tokenized = strtok(0, delCopy)
        i = i + 1
    Wend
    ReDim _Preserve StorageArray(UBound(StorageArray) - 1)
End Sub

Function pointerToString$ (pointer As _Offset)
    Declare CustomType Library
        Function strlen%& (ByVal ptr As _Unsigned _Offset)
    End Declare
    Dim As _Offset length: length = strlen(pointer)
    If length Then
        Dim As _MEM pString: pString = _Mem(pointer, length)
        Dim As String ret: ret = Space$(length)
        _MemGet pString, pString.OFFSET, ret
        _MemFree pString
    End If
    pointerToString = ret
End Function

Offline SMcNeill

Re: Split Versus Tokenize
« Reply #13 on: February 01, 2022, 04:17:53 pm »
I'd say the redim issue is probably here:

        ReDim _Preserve StorageArray(lowerbound To UBound(StorageArray) + 1)
        StorageArray(i) = pointerToString(tokenized)
        tokenized = strtok(0, delCopy)
        i = i + 1
    Wend
    ReDim _Preserve StorageArray(UBound(StorageArray) - 1)


Notice the difference between the first and last lines? The last line doesn't have the lowerbound set in the ReDim statement.
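
A small sketch of that difference in isolation, assuming no Option Base statement is in effect (demo is an illustrative array name):

```qb64
' Sketch: how the lower bound behaves under ReDim _Preserve.
ReDim demo(1 To 3) As String
demo(1) = "first"

ReDim _Preserve demo(1 To 2) As String ' "low To high" form keeps the lower bound
Print LBound(demo); UBound(demo)

ReDim _Preserve demo(2) As String ' upper bound only: lower bound falls back to the default
Print LBound(demo); UBound(demo)
```

The second pair of bounds should start at 0, which matches what the test in Reply #12 reported for the load array.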

Offline bplus

Re: Split Versus Tokenize
« Reply #14 on: February 01, 2022, 04:30:20 pm »
Yep! Fixed by adding lowerbound to the final ReDim in tokenize:
Code: QB64:
_Title "Tokenize blank line test" ' b+ 2022-02-01
' build a 10 line string where only lines 1, 4, 7 and 10 have text
b$ = "Line 1"
For i = 2 To 10
    If i = 4 Or i = 7 Or i = 10 Then
        b$ = b$ + Chr$(10) + "Line" + Str$(i)
    Else
        b$ = b$ + Chr$(10)
    End If
Next

deli$ = Chr$(10)

ReDim LoadMe$(1 To 1)
startSplit = Timer(.001)
Split b$, deli$, LoadMe$()
splitTime = Timer(.001) - startSplit
For i = LBound(LoadMe$) To UBound(LoadMe$)
    Print i, LoadMe$(i)
Next
Print "Time for Split was:"; splitTime

' reset for tokenize
ReDim LoadMe$(1 To 1)
startTokenize = Timer(.001)
tokenize b$, deli$, LoadMe$()
TokenizeTime = Timer(.001) - startTokenize
For i = LBound(LoadMe$) To UBound(LoadMe$)
    Print i, LoadMe$(i)
Next
Print "Time for Tokenize was:"; TokenizeTime

' note: I buggered this twice now, FOR base 1 array REDIM MyArray (1 to 1) AS ... the (1 to 1) is not same as (1) which was the Blunder!!!
'notes: REDIM the array(0) to be loaded before calling Split '<<<< IMPORTANT dynamic array and empty, can use any lbound though
'This SUB will take a given N delimited string, and delimiter$ and create an array of N+1 strings using the LBOUND of the given dynamic array to load.
'notes: the loadMeArray() needs to be dynamic string array and will not change the LBOUND of the array it is given.  rev 2019-08-27
Sub Split (SplitMeString As String, delim As String, loadMeArray() As String)
    Dim curpos As Long, arrpos As Long, LD As Long, dpos As Long 'fix use the Lbound the array already has
    curpos = 1: arrpos = LBound(loadMeArray): LD = Len(delim)
    dpos = InStr(curpos, SplitMeString, delim)
    Do Until dpos = 0
        loadMeArray(arrpos) = Mid$(SplitMeString, curpos, dpos - curpos)
        arrpos = arrpos + 1
        ' grow in chunks of 1000 so ReDim is not called for every item
        If arrpos > UBound(loadMeArray) Then ReDim _Preserve loadMeArray(LBound(loadMeArray) To UBound(loadMeArray) + 1000) As String
        curpos = dpos + LD
        dpos = InStr(curpos, SplitMeString, delim)
    Loop
    loadMeArray(arrpos) = Mid$(SplitMeString, curpos)
    ReDim _Preserve loadMeArray(LBound(loadMeArray) To arrpos) As String 'get the ubound correct
End Sub

Sub tokenize (toTokenize As String, delimiters As String, StorageArray() As String)
    Declare CustomType Library
        Function strtok%& (ByVal str As _Offset, delimiters As String)
    End Declare
    Dim As _Offset tokenized
    Dim As String tokCopy: tokCopy = toTokenize + Chr$(0)
    Dim As String delCopy: delCopy = delimiters + Chr$(0)
    Dim As _Unsigned Long lowerbound: lowerbound = LBound(StorageArray)
    Dim As _Unsigned Long i: i = lowerbound
    tokenized = strtok(_Offset(tokCopy), delCopy)
    While tokenized <> 0
        ReDim _Preserve StorageArray(lowerbound To UBound(StorageArray) + 1)
        StorageArray(i) = pointerToString(tokenized)
        tokenized = strtok(0, delCopy)
        i = i + 1
    Wend
    ReDim _Preserve StorageArray(lowerbound To UBound(StorageArray) - 1) ' <<< fix with lowerbound
End Sub

Function pointerToString$ (pointer As _Offset)
    Declare CustomType Library
        Function strlen%& (ByVal ptr As _Unsigned _Offset)
    End Declare
    Dim As _Offset length: length = strlen(pointer)
    If length Then
        Dim As _MEM pString: pString = _Mem(pointer, length)
        Dim As String ret: ret = Space$(length)
        _MemGet pString, pString.OFFSET, ret
        _MemFree pString
    End If
    pointerToString = ret
End Function

Score is now Split 3 and Tokenize 1  :)
« Last Edit: February 01, 2022, 04:42:09 pm by bplus »