Author Topic: Versatile String parsing function  (Read 9998 times)


Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
Versatile String parsing function
« on: August 27, 2021, 11:35:04 am »
I guess every developer sooner or later needs a parsing function like this, whether it's to split a simple text line into its individual words, quickly read CSV data into an array, break up a path specification into its folder names, or get the individual options of a given command line or of a URL query string.

Obviously, such a function must be able to recognize several separator chars and must be able to suppress the splitting of components inside quoted sections. A special feature of this function is the option to use different chars for opening and closing quotes, which e.g. allows you to read out sections in parentheses or brackets.

The following short example program will demonstrate some of the possible uses. A detailed function description is provided in the HTML Documentation available for download below the example code block.

An example using the new ParseLine function:
Save as: ParseExample.bas (or whatever)
Code: QB64:
_TITLE "ParseExample"
'=== Full description for the ParseLine&() function is available
'=== in the separate HTML document.
'=====================================================================
WIDTH 100, 32
REDIM a$(3 TO 4) 'result array (at least one element)

'=== e$ = example description
'=== s$ = used separators (max. 5 chars)
'=== q$ = used quotes (max. 2 chars) (empty = regular ")
'=== l$ = test line to parse
'=====================================================================
e$ = "empty lines or those containing defined separators only, won't give a result"
s$ = " ,.": q$ = ""
l$ = "      ,. , ., ., . ,.,., .,,,,,. ,., "
GOSUB doFunc
e$ = "a simple text line, using space, comma and period as separators and regular quoting"
s$ = " ,.": q$ = ""
l$ = "Hello World, just want to say,greetings " + CHR$(34) + "to all" + CHR$(34) + " from RhoSigma."
GOSUB doFunc
e$ = "now a complex space separated test line with regular quoting and empty quotes"
s$ = " ": q$ = ""
l$ = "     " + CHR$(34) + "  ABC  " + CHR$(34) + " 123 " + CHR$(34) + CHR$(34) + " " + CHR$(34) + CHR$(34) + "X Y Z" + CHR$(34) + CHR$(34)
GOSUB doFunc
e$ = "same space separated test line with reordered quoting and empty quotes"
s$ = " ": q$ = ""
l$ = "       ABC   123" + CHR$(34) + CHR$(34) + CHR$(34) + " X Y Z " + CHR$(34) + CHR$(34) + CHR$(34) + "345  "
GOSUB doFunc
e$ = "again the space separated test line with regular quoting and an unfinished (EOL) quote"
s$ = " ": q$ = ""
l$ = "       ABC   123" + CHR$(34) + " " + CHR$(34) + " X Y Z " + CHR$(34) + " " + CHR$(34) + CHR$(34) + "345  "
GOSUB doFunc
e$ = "an opening quote at EOL is in fact an empty quote, it adds another empty array element"
s$ = " ": q$ = ""
l$ = "  " + CHR$(34) + "a final open quote is empty" + CHR$(34) + "   " + CHR$(34)
GOSUB doFunc
'-----------------------------
'-----------------------------
e$ = "a SUB declaration line using parentheses as TWO char quoting"
s$ = " ": q$ = "()"
l$ = "SUB RectFill (lin%, col%, hei%, wid%, fg%, bg%, ch$)"
GOSUB doFunc
e$ = "same SUB line with many extra parentheses, showing that TWO char quoting avoids nesting"
s$ = " ": q$ = "()"
l$ = "SUB RectFill (lin%, col%, ((hei%)), wid%, (fg%), bg%, (ch$))"
GOSUB doFunc
'-----------------------------
'-----------------------------
e$ = "space separated command line with regular quoting"
s$ = " ": q$ = ""
l$ = "--testfile " + CHR$(34) + "C:\My Folder\My File.txt" + CHR$(34) + " --testmode --output logfile.txt"
GOSUB doFunc
e$ = "space and/or equal sign separated command line with regular quoting"
s$ = " =": q$ = ""
l$ = "--testfile=" + CHR$(34) + "C:\My Folder\My File.txt" + CHR$(34) + " --testmode --output=logfile.txt"
GOSUB doFunc
e$ = "space and/or equal sign separated command line with alternative ONE char quoting"
s$ = " =": q$ = "|"
l$ = "--testfile=|C:\My Folder\My File.txt| --testmode --output=logfile.txt"
GOSUB doFunc
e$ = "space and/or equal sign separated command line with alternative TWO char quoting"
s$ = " =": q$ = "{}"
l$ = "--testfile={C:\My Folder\My File.txt} --testmode --output=logfile.txt"
GOSUB doFunc
'-----------------------------
'-----------------------------
e$ = "parsing a filename using (back)slashes as separators but NO spaces"
s$ = "\/": q$ = ""
l$ = "C:\My Folder\My File.txt"
GOSUB doFunc
e$ = "for quoted filenames the quoting char(s) must be separators instead of quotes (see source)"
'NOTE: a char cannot be used as separator and quote at the same time
s$ = "\/" + CHR$(34): q$ = "*" '* is not allowed in filenames, so it's perfect to knock out the regular quote here
l$ = CHR$(34) + "C:\My Folder\My File.txt" + CHR$(34)
GOSUB doFunc
'=====================================================================
END 'stop here, so execution does not fall through into doFunc

'-- This GOSUB subroutine will execute the examples from above and
'-- print the given inputs and function results.
doFunc:
COLOR 12: PRINT "square brackets just used to better visualize the start and end of strings ..."
COLOR 14: PRINT "Example: ";: COLOR 10: PRINT e$
PRINT "given input to function:"
PRINT "------------------------"
COLOR 14: PRINT "      Line: ";: COLOR 12: PRINT "[";: COLOR 7: PRINT l$;: COLOR 12: PRINT "]"
COLOR 14: PRINT "Separators: ";: COLOR 12: PRINT "[";: COLOR 7: PRINT s$;: COLOR 12: PRINT "]"
COLOR 14: PRINT "    Quotes: ";: COLOR 12: PRINT "[";: COLOR 7: PRINT q$;: COLOR 12: PRINT "]";: COLOR 3: PRINT " (empty = " + CHR$(34) + ")     "
COLOR 14: PRINT "     Array: ";: COLOR 7: PRINT "LBOUND ="; LBOUND(a$), "UBOUND ="; UBOUND(a$)
res& = ParseLine&(l$, s$, q$, a$(), 0)
PRINT "result of function call (new UBOUND or -1 for nothing to parse):"
PRINT "----------------------------------------------------------------"
COLOR 14: PRINT "Result: ";: COLOR 7: PRINT res&
IF res& > 0 THEN
    COLOR 15
    PRINT "array dump:"
    PRINT "-----------"
    FOR x& = LBOUND(a$) TO UBOUND(a$)
        COLOR 14: PRINT "Index:";: COLOR 7: PRINT x&,
        COLOR 14: PRINT "Content: ";: COLOR 12: PRINT "[";: COLOR 7: PRINT a$(x&);: COLOR 12: PRINT "]"; TAB(80);
        COLOR 14: PRINT "Length:";: COLOR 7: PRINT LEN(a$(x&))
    NEXT x&
    PRINT
END IF
PRINT "press any key ...": SLEEP
RETURN

'--- Full description available in separate HTML document.
'---------------------------------------------------------------------
FUNCTION ParseLine& (inpLine$, sepChars$, quoChars$, outArray$(), minUB&)
'--- option _explicit requirements ---
DIM ilen&, icnt&, slen%, s1%, s2%, s3%, s4%, s5%, q1%, q2%
DIM oalb&, oaub&, ocnt&, flag%, ch%, nest%, spos&, epos&
'--- so far return nothing ---
ParseLine& = -1
'--- init & check some runtime variables ---
ilen& = LEN(inpLine$): icnt& = 1
IF ilen& = 0 THEN EXIT FUNCTION
slen% = LEN(sepChars$)
IF slen% > 0 THEN s1% = ASC(sepChars$, 1)
IF slen% > 1 THEN s2% = ASC(sepChars$, 2)
IF slen% > 2 THEN s3% = ASC(sepChars$, 3)
IF slen% > 3 THEN s4% = ASC(sepChars$, 4)
IF slen% > 4 THEN s5% = ASC(sepChars$, 5)
IF slen% > 5 THEN slen% = 5 'max. 5 chars, ignore the rest
IF LEN(quoChars$) > 0 THEN q1% = ASC(quoChars$, 1): ELSE q1% = 34
IF LEN(quoChars$) > 1 THEN q2% = ASC(quoChars$, 2): ELSE q2% = q1%
oalb& = LBOUND(outArray$): oaub& = UBOUND(outArray$): ocnt& = oalb&
'--- skip preceding separators ---
plSkipSepas:
flag% = 0
WHILE icnt& <= ilen& AND NOT flag%
    ch% = ASC(inpLine$, icnt&)
    SELECT CASE slen%
        CASE 0: flag% = -1
        CASE 1: flag% = ch% <> s1%
        CASE 2: flag% = ch% <> s1% AND ch% <> s2%
        CASE 3: flag% = ch% <> s1% AND ch% <> s2% AND ch% <> s3%
        CASE 4: flag% = ch% <> s1% AND ch% <> s2% AND ch% <> s3% AND ch% <> s4%
        CASE 5: flag% = ch% <> s1% AND ch% <> s2% AND ch% <> s3% AND ch% <> s4% AND ch% <> s5%
    END SELECT
    icnt& = icnt& + 1
WEND
IF NOT flag% THEN 'nothing else? - then exit
    IF ocnt& > oalb& GOTO plEnd
    EXIT FUNCTION 'nothing parsed at all, keep the -1 result
END IF
'--- redim to clear array on 1st word/component ---
IF ocnt& = oalb& THEN REDIM outArray$(oalb& TO oaub&)
'--- expand array, if required ---
plNextWord:
IF ocnt& > oaub& THEN
    oaub& = oaub& + 10
    REDIM _PRESERVE outArray$(oalb& TO oaub&)
END IF
'--- get current word/component until next separator ---
flag% = 0: nest% = 0: spos& = icnt& - 1
WHILE icnt& <= ilen& AND NOT flag%
    IF ch% = q1% AND nest% = 0 THEN
        nest% = 1
    ELSEIF ch% = q1% AND nest% > 0 THEN
        nest% = nest% + 1
    ELSEIF ch% = q2% AND nest% > 0 THEN
        nest% = nest% - 1
    END IF
    ch% = ASC(inpLine$, icnt&)
    SELECT CASE slen%
        CASE 0: flag% = (nest% = 0 AND (ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 1: flag% = (nest% = 0 AND (ch% = s1% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 2: flag% = (nest% = 0 AND (ch% = s1% OR ch% = s2% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 3: flag% = (nest% = 0 AND (ch% = s1% OR ch% = s2% OR ch% = s3% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 4: flag% = (nest% = 0 AND (ch% = s1% OR ch% = s2% OR ch% = s3% OR ch% = s4% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 5: flag% = (nest% = 0 AND (ch% = s1% OR ch% = s2% OR ch% = s3% OR ch% = s4% OR ch% = s5% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
    END SELECT
    icnt& = icnt& + 1
WEND
epos& = icnt& - 1
IF ASC(inpLine$, spos&) = q1% THEN spos& = spos& + 1
outArray$(ocnt&) = MID$(inpLine$, spos&, epos& - spos&)
ocnt& = ocnt& + 1
'--- more words/components following? ---
IF flag% AND ch% = q1% AND nest% = 0 GOTO plNextWord
IF flag% GOTO plSkipSepas
IF (ch% <> q1%) AND (ch% <> q2% OR nest% = 0) THEN outArray$(ocnt& - 1) = outArray$(ocnt& - 1) + CHR$(ch%)
'--- final array size adjustment, then exit ---
plEnd:
IF ocnt& - 1 < minUB& THEN ocnt& = minUB& + 1
REDIM _PRESERVE outArray$(oalb& TO (ocnt& - 1))
ParseLine& = ocnt& - 1
END FUNCTION

As the UTF-8 encoding of the HTML Documentation must be preserved, it is packed into a 7-zip archive file attached below. The archive also contains the example from the codebox above.
* ParseLine.7z (Filesize: 5.97 KB, Downloads: 273)
My Projects:   https://qb64forum.alephc.xyz/index.php?topic=809
GuiTools - A graphic UI framework (can do multiple UI forms/windows in one program)
Libraries - ImageProcess, StringBuffers (virt. files), MD5/SHA2-Hash, LZW etc.
Bonus - Blankers, QB64/Notepad++ setup pack

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Re: Versatile String parsing function
« Reply #1 on: August 27, 2021, 11:50:09 am »
Nice advance from split! :)

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Re: Versatile String parsing function
« Reply #2 on: August 27, 2021, 12:13:20 pm »
@RhoSigma

Is this the parser you are using with your GUI tools?

Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
Re: Versatile String parsing function
« Reply #3 on: August 27, 2021, 12:25:01 pm »
Quote from: bplus on August 27, 2021, 12:13:20 pm
@RhoSigma
Is this the parser you are using with your GUI tools?

In fact, it is based on that parser. The features of this new version are mostly driven by the various needs I became aware of while working on GuiTools.

In future versions of GuiTools I'll replace the old parser with this new version.


Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
Re: Versatile String parsing function
« Reply #4 on: August 28, 2021, 09:13:24 am »
Tips & Tricks

If you need to specify control chars as separators, let's say tabs, linefeeds, carriage returns and formfeeds, then you can of course do it the bulky way by passing CHR$(9)+CHR$(10)+CHR$(13)+CHR$(12) as the sepChars$ function argument, but writing it as MKL$(&H090A0D0C) instead would be much smarter in that case. As the order of the given separator chars doesn't matter, it does not hurt that MKL$ internally reverses the given codes into little endian order.

Accordingly, MKI$ can be used for just two separator chars. However, if you need an odd number of chars, e.g. 3, then you must at least go the MKI$+CHR$ way. Or you may use MKL$ nevertheless with only 3 hex bytes (which implies that the 4th byte is zero), and hope that your input line does not contain any CHR$(0).
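
The following short sketch illustrates the trick from this post. It only uses built-in QB64 functions; the variable names (a$, b$, c$, ok%) are arbitrary choices for this illustration. It prints the char codes that MKL$ actually stores and checks that every char of the bulky CHR$ notation is present in the MKL$ string.

```basic
'--- bulky notation versus the MKL$ shortcut ---
a$ = CHR$(9) + CHR$(10) + CHR$(13) + CHR$(12)
b$ = MKL$(&H090A0D0C) 'stored little endian, i.e. codes 12, 13, 10, 9

'--- show the codes in the order they are stored ---
FOR i% = 1 TO LEN(b$)
    PRINT ASC(b$, i%);
NEXT i%
PRINT

'--- same set of chars, just in a different order? ---
ok% = -1
FOR i% = 1 TO LEN(a$)
    IF INSTR(b$, MID$(a$, i%, 1)) = 0 THEN ok% = 0
NEXT i%
IF ok% THEN PRINT "both strings contain the same separator chars"

'--- three separators: MKI$ for two of them plus one CHR$ ---
c$ = MKI$(&H0A0D) + CHR$(12) 'codes 13, 10, 12
PRINT LEN(c$)
```

Either a$ or b$ could then be passed as the sepChars$ argument, as the function does not care about the order of the separators.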

Offline euklides

  • Forum Regular
  • Posts: 128
Re: Versatile String parsing function
« Reply #5 on: August 28, 2021, 10:44:37 am »
Complicated program for a fairly simple problem:
1) define the allowed characters,
2) put one non-allowed character at the end of the string,
3) read all the characters in the string, collect the retained characters into a word and, when you find a non-allowed character, store the word in an incremented variable...

And your program takes ! or ? as a word?

For an HTML text, take only the text within >   <
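
For comparison, the three steps listed in this post could look roughly like the sketch below. This is only one possible reading of the description, not euklides' actual program; the function name SimpleWords&, its parameters and the set of allowed characters are hypothetical choices made for this illustration, and the result array is assumed to start at index 1.

```basic
'--- small test of the simple word splitter ---
REDIM w$(1 TO 1) 'dynamic array, resized by the function
n& = SimpleWords&("Why not yes? Because: no!", w$())
FOR i& = 1 TO n&
    PRINT "["; w$(i&); "]"
NEXT i&
END

'--- returns the number of words found (hypothetical helper) ---
FUNCTION SimpleWords& (text$, words$())
'1) define the allowed characters
allowed$ = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
'2) one non-allowed char at the end flushes the last word
t$ = text$ + " "
cnt& = 0: word$ = ""
'3) scan all chars and collect the words
FOR p& = 1 TO LEN(t$)
    c$ = MID$(t$, p&, 1)
    IF INSTR(allowed$, c$) > 0 THEN
        word$ = word$ + c$ 'retained char, extend the current word
    ELSEIF LEN(word$) > 0 THEN
        cnt& = cnt& + 1 'non-allowed char ends the current word
        IF cnt& > UBOUND(words$) THEN REDIM _PRESERVE words$(1 TO cnt& + 9)
        words$(cnt&) = word$: word$ = ""
    END IF
NEXT p&
SimpleWords& = cnt&
END FUNCTION
```

Such a splitter is fine for plain word extraction, but it has no notion of quoted sections, which is the main difference to ParseLine&.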


« Last Edit: August 28, 2021, 10:47:30 am by euklides »
Why not yes ?

Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
Re: Versatile String parsing function
« Reply #6 on: August 28, 2021, 11:25:42 am »
Quote from: euklides on August 28, 2021, 10:44:37 am
And your program takes ! or ? as a word  ?

From this alone, I'm 99% sure you didn't even take a closer look at the example, let alone run it or look into the function description.

The function is exactly how I wanted it to be, it behaves exactly as I wanted it to behave.
You're welcome to use this function, or to code a better one and post it here, so we can compare.

Offline euklides

  • Forum Regular
  • Posts: 128
Re: Versatile String parsing function
« Reply #7 on: August 30, 2021, 02:46:18 am »
Ok
I work with some non-commercial epubs and have my own little program to extract the text from them (it's HTML text, as you know) and the words (used for word compilation, spell checking, counting the number of different words and measuring the complexity of the text) [in French].

Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
Re: Versatile String parsing function
« Reply #8 on: August 30, 2021, 05:33:07 am »
I see, maybe I was somewhat imprecise about my function's purpose. It's not just intended for extracting words from a text, but also for processing as many different kinds of data as possible, e.g. splitting CSV data into its single values, extracting argument lists from SUBs/FUNCs in program sources, and even parsing binary patterns would be possible.

I need this function in many different programs for many different purposes, so I made it fit all my needs, and that's why I call it "versatile".

Of course, for just splitting text into words, a simpler implementation would do too. But the good thing about my implementation is that it can do a lot of different parsing and can still be used for simple word splitting, given the right parameters.
« Last Edit: August 30, 2021, 05:35:05 am by RhoSigma »

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • Steve’s QB64 Archive Forum
Re: Versatile String parsing function
« Reply #9 on: August 30, 2021, 11:10:52 am »
I haven’t tried this yet, Rho, but let me ask: Does it have a “cluster” feature?  (I’ll explain; I’m not really certain what to call what I’m looking for…)

For example, my data might look like:

“Smith, Joe”, “123 Frog Lane”, “New York, New York”
Or
>Smith, Joe<,etc…

Both are CSV data, but the first uses quotes to cluster data together, ignoring the commas, while the second uses >< as a delimiter.

Somewhere around here, I’ve got a function that offers such delimiting ability via arrays, but I found it a little too complex to remember how it worked without going back and having to reread my documentation every time I plugged it into a program.  If yours can handle various data cases, like above, I’ll probably just use it to plug into my projects in the future, rather than having to create a simpler version of the over-engineered beast I have currently.

Ideally, it’d need/offer:

Multiple start/stop cluster symbols.  (Quotes, or parentheses, or brackets, etc.)

Non-cluster symbols.  (Say triple quotes to indicate a regular quote in the data.  “””Sexy Beast””” would actually represent “Sexy Beast” and those quotes are non-grouping data.  Or “5 >> 3” would represent “5 > 3” if >< were delimiters…)

Multiple data separators.  Comma, CHR$(13) + CHR$(10), CHR$(13), CHR$(10) might *all* indicate separation of data.

Data inclusionary/exclusionary criteria.  For example if a line starts with “DATA 123, 456, 789”, can I exclude that “DATA “ from my results?  Or a line that starts with REM or ‘ and ends with a CRLF?  Can I start extraction only on lines that begin with “DATA” and end with a CRLF, while ignoring lines that don’t?

I’ve got a function here somewhere that handles such use cases, but it’s bulky and relies on the use of multiple arrays for each parameter.  I’ve really got to tone it down to something simpler for the future, and was curious just how robust this was, overall.
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
Re: Versatile String parsing function
« Reply #10 on: August 30, 2021, 12:39:33 pm »
Quote from: SMcNeill
“Smith, Joe”, “123 Frog Lane”, “New York, New York”
>Smith, Joe<,etc…

Easy:
1.) specify , as sepChars$ and an empty quoChars$ (implies the regular " quote)
2.) specify , as sepChars$ and >< as quoChars$

Quote from: SMcNeill
Multiple start/stop cluster symbols.  (Quotes, or parentheses, or brackets, etc.)

I'm not sure what you mean by "multiple". You may use different quoting start/stop chars, but not e.g. () and <> in the same call. Quoting works in "one char" mode, like the regular " or any other char of your choice, e.g. |, * etc., even control chars. Quoting is also possible in "two chars" mode, like your ><, or (), [], {}, /\ or whatever, but only one pair at a time.

Quote from: SMcNeill
Non-cluster symbols.  (Say triple quotes to indicate a regular quote in the data.  “””Sexy Beast””” would actually represent “Sexy Beast” and those quotes are non-grouping data.

Yes and no. With regular quoting this would create 3 array entries, empty/Sexy Beast/empty (you would simply ignore the empty ones).

Quote from: SMcNeill
Or “5 >> 3” would represent “5 > 3” if >< were delimiters…)

Yes, as "two chars" quoting doesn't nest, the 1st > would open the quote and the 2nd, 3rd... > would be taken as literals. Likewise, < would be taken as a literal until a matching number is reached to close the quote.

Quote from: SMcNeill
Multiple data separators.  Comma, CHR$(13) + CHR$(10), CHR$(13), CHR$(10) might *all* indicate separation of data.

Yes, currently up to 5 chars (easily expandable with your skills).

Quote from: SMcNeill
Data inclusionary/exclusionary criteria.  For example if a line starts with “DATA 123, 456, 789”, can I exclude that “DATA “ from my results?  Or a line that starts with REM or ‘ and ends with a CRLF?  Can I start extraction only on lines that begin with “DATA” and end with a CRLF, while ignoring lines that don’t?

Conditionally. If you know that DATA always comes first, then simply ignore the 1st array element returned.

Also, if too many conditions come together, it's always possible to do the parsing in 2 or more passes.
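
To make the answers above more concrete, here is a minimal sketch of the corresponding calls. It assumes that the ParseLine& function from the first post is pasted below this code (it is not repeated here); the array name f$, the skip& counter, the showResult label and the sample lines are just illustrative choices. The first two input lines are written without spaces after the separating commas, because only the comma is given in sepChars$.

```basic
'NOTE: append the ParseLine& function from the first post to this code
REDIM f$(1 TO 1) 'result array, resized by ParseLine&

'1.) regular quoting: comma as separator, empty quoChars$ means "
l$ = CHR$(34) + "Smith, Joe" + CHR$(34) + "," + CHR$(34) + "123 Frog Lane" + CHR$(34)
n& = ParseLine&(l$, ",", "", f$(), 0)
skip& = 0: GOSUB showResult

'2.) "two chars" quoting: > opens and < closes a quoted section
l$ = ">Smith, Joe<,>123 Frog Lane<"
n& = ParseLine&(l$, ",", "><", f$(), 0)
skip& = 0: GOSUB showResult

'3.) DATA line: space and comma as separators, then ignore the
'    1st array element (the DATA keyword) when using the result
l$ = "DATA 123, 456, 789"
n& = ParseLine&(l$, " ,", "", f$(), 0)
skip& = 1: GOSUB showResult
END

'--- print all returned elements, except the first skip& ones ---
showResult:
IF n& > 0 THEN
    FOR i& = LBOUND(f$) + skip& TO n&
        PRINT "["; f$(i&); "]"
    NEXT i&
ELSE
    PRINT "nothing to parse"
END IF
PRINT
RETURN
```

The same pattern extends to the multi-pass idea: the elements returned by a first call can be fed into further ParseLine& calls with different separators or quotes.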