Author Topic: Versatile String parsing function (Read 18501 times)

RhoSigma · « **on:** August 27, 2021, 11:35:04 am »

I guess every developer sooner or later needs a parsing function like this, whether it's to split a simple text line into its individual words, quickly read CSV data into an array, break up a path specification into its folder names, or get the individual options of a given command line or of a URL query string.

Obviously such a function must be able to recognize several separator chars and must be able to suppress the splitting of components in quoted sections. Special to this function is the ability to optionally use different chars for opening and closing quotes, which e.g. allows you to read out sections in parentheses or brackets.

The following short example program will demonstrate some of the possible uses. A detailed function description is provided in the HTML Documentation available for download below the example code block.

An example using the new ParseLine function:
Save as: ParseExample.bas (or whatever)

Code: QB64:

_TITLE "ParseExample"
'=== Full description for the ParseLine&() function is available
'=== in the separate HTML document.
'=====================================================================
WIDTH 100, 32
REDIM a$(3 TO 4) 'result array (at least one element)
 
 
 
'=== e$ = example description
'=== s$ = used separators (max. 5 chars)
'=== q$ = used quotes (max. 2 chars) (empty = regular ")
'=== l$ = test line to parse
'=====================================================================
e$ = "empty lines or those containing defined separators only, won't give a result"
s$ = " ,.": q$ = ""
l$ = "      ,. , ., ., . ,.,., .,,,,,. ,., "
GOSUB doFunc
e$ = "a simple text line, using space, comma and period as separators and regular quoting"
s$ = " ,.": q$ = ""
l$ = "Hello World, just want to say,greetings " + CHR$(34) + "to all" + CHR$(34) + " from RhoSigma."
GOSUB doFunc
e$ = "now a complex space separated test line with regular quoting and empty quotes"
s$ = " ": q$ = ""
l$ = "     " + CHR$(34) + "  ABC  " + CHR$(34) + " 123 " + CHR$(34) + CHR$(34) + " " + CHR$(34) + CHR$(34) + "X Y Z" + CHR$(34) + CHR$(34)
GOSUB doFunc
e$ = "same space separated test line with reordered quoting and empty quotes"
s$ = " ": q$ = ""
l$ = "       ABC   123" + CHR$(34) + CHR$(34) + CHR$(34) + " X Y Z " + CHR$(34) + CHR$(34) + CHR$(34) + "345  "
GOSUB doFunc
e$ = "again the space separated test line with regular quoting and an unfinished (EOL) quote"
s$ = " ": q$ = ""
l$ = "       ABC   123" + CHR$(34) + " " + CHR$(34) + " X Y Z " + CHR$(34) + " " + CHR$(34) + CHR$(34) + "345  "
GOSUB doFunc
e$ = "an opening quote at EOL is in fact an empty quote, it adds another empty array element"
s$ = " ": q$ = ""
l$ = "  " + CHR$(34) + "a final open quote is empty" + CHR$(34) + "   " + CHR$(34)
GOSUB doFunc
'-----------------------------
'-----------------------------
e$ = "a SUB declaration line using parentheses as TWO char quoting"
s$ = " ": q$ = "()"
l$ = "SUB RectFill (lin%, col%, hei%, wid%, fg%, bg%, ch$)"
GOSUB doFunc
e$ = "same SUB line with many extra parentheses, showing that TWO char quoting avoids nesting"
s$ = " ": q$ = "()"
l$ = "SUB RectFill (lin%, col%, ((hei%)), wid%, (fg%), bg%, (ch$))"
GOSUB doFunc
'-----------------------------
'-----------------------------
e$ = "space separated command line with regular quoting"
s$ = " ": q$ = ""
l$ = "--testfile " + CHR$(34) + "C:\My Folder\My File.txt" + CHR$(34) + " --testmode --output logfile.txt"
GOSUB doFunc
e$ = "space and/or equal sign separated command line with regular quoting"
s$ = " =": q$ = ""
l$ = "--testfile=" + CHR$(34) + "C:\My Folder\My File.txt" + CHR$(34) + " --testmode --output=logfile.txt"
GOSUB doFunc
e$ = "space and/or equal sign separated command line with alternative ONE char quoting"
s$ = " =": q$ = "|"
l$ = "--testfile=|C:\My Folder\My File.txt| --testmode --output=logfile.txt"
GOSUB doFunc
e$ = "space and/or equal sign separated command line with alternative TWO char quoting"
s$ = " =": q$ = "{}"
l$ = "--testfile={C:\My Folder\My File.txt} --testmode --output=logfile.txt"
GOSUB doFunc
'-----------------------------
'-----------------------------
e$ = "parsing a filename using (back)slashes as separators but NO spaces"
s$ = "\/": q$ = ""
l$ = "C:\My Folder\My File.txt"
GOSUB doFunc
e$ = "for quoted filenames the quoting char(s) must be separators instead of quotes (see source)"
'NOTE: a char cannot be used as separator and quote at the same time
s$ = "\/" + CHR$(34): q$ = "*" '* is not allowed in filenames, so it's perfect to knock out the regular quote here
l$ = CHR$(34) + "C:\My Folder\My File.txt" + CHR$(34)
GOSUB doFunc
'=====================================================================
SYSTEM
 
 
 
'-- This GOSUB subroutine will execute the examples from above and
'-- print the given inputs and function results.
doFunc:
CLS
COLOR 12: PRINT "square brackets just used to better visualize the start and end of strings ..."
PRINT
COLOR 14: PRINT "Example: ";: COLOR 10: PRINT e$
PRINT
COLOR 15
PRINT "given input to function:"
PRINT "------------------------"
COLOR 14: PRINT "      Line: ";: COLOR 12: PRINT "[";: COLOR 7: PRINT l$;: COLOR 12: PRINT "]"
COLOR 14: PRINT "Separators: ";: COLOR 12: PRINT "[";: COLOR 7: PRINT s$;: COLOR 12: PRINT "]"
COLOR 14: PRINT "    Quotes: ";: COLOR 12: PRINT "[";: COLOR 7: PRINT q$;: COLOR 12: PRINT "]";: COLOR 3: PRINT " (empty = " + CHR$(34) + ")     "
PRINT
COLOR 14: PRINT "     Array: ";: COLOR 7: PRINT "LBOUND ="; LBOUND(a$), "UBOUND ="; UBOUND(a$)
PRINT
res& = ParseLine&(l$, s$, q$, a$(), 0)
COLOR 15
PRINT "result of function call (new UBOUND or -1 for nothing to parse):"
PRINT "----------------------------------------------------------------"
COLOR 14: PRINT "Result: ";: COLOR 7: PRINT res&
PRINT
IF res& > 0 THEN
    COLOR 15
    PRINT "array dump:"
    PRINT "-----------"
    FOR x& = LBOUND(a$) TO UBOUND(a$)
        COLOR 14: PRINT "Index:";: COLOR 7: PRINT x&,
        COLOR 14: PRINT "Content: ";: COLOR 12: PRINT "[";: COLOR 7: PRINT a$(x&);: COLOR 12: PRINT "]"; TAB(80);
        COLOR 14: PRINT "Length:";: COLOR 7: PRINT LEN(a$(x&))
    NEXT x&
    PRINT
END IF
PRINT "press any key ...": SLEEP
RETURN
 
 
 
 
 
'--- Full description available in separate HTML document.
'---------------------------------------------------------------------
FUNCTION ParseLine& (inpLine$, sepChars$, quoChars$, outArray$(), minUB&)
'--- option _explicit requirements ---
DIM ilen&, icnt&, slen%, s1%, s2%, s3%, s4%, s5%, q1%, q2%
DIM oalb&, oaub&, ocnt&, flag%, ch%, nest%, spos&, epos&
'--- so far return nothing ---
ParseLine& = -1
'--- init & check some runtime variables ---
ilen& = LEN(inpLine$): icnt& = 1
IF ilen& = 0 THEN EXIT FUNCTION
slen% = LEN(sepChars$)
IF slen% > 0 THEN s1% = ASC(sepChars$, 1)
IF slen% > 1 THEN s2% = ASC(sepChars$, 2)
IF slen% > 2 THEN s3% = ASC(sepChars$, 3)
IF slen% > 3 THEN s4% = ASC(sepChars$, 4)
IF slen% > 4 THEN s5% = ASC(sepChars$, 5)
IF slen% > 5 THEN slen% = 5 'max. 5 chars, ignore the rest
IF LEN(quoChars$) > 0 THEN q1% = ASC(quoChars$, 1): ELSE q1% = 34
IF LEN(quoChars$) > 1 THEN q2% = ASC(quoChars$, 2): ELSE q2% = q1%
oalb& = LBOUND(outArray$): oaub& = UBOUND(outArray$): ocnt& = oalb&
'--- skip preceding separators ---
plSkipSepas:
flag% = 0
WHILE icnt& <= ilen& AND NOT flag%
    ch% = ASC(inpLine$, icnt&)
    SELECT CASE slen%
        CASE 0: flag% = -1
        CASE 1: flag% = ch% <> s1%
        CASE 2: flag% = ch% <> s1% AND ch% <> s2%
        CASE 3: flag% = ch% <> s1% AND ch% <> s2% AND ch% <> s3%
        CASE 4: flag% = ch% <> s1% AND ch% <> s2% AND ch% <> s3% AND ch% <> s4%
        CASE 5: flag% = ch% <> s1% AND ch% <> s2% AND ch% <> s3% AND ch% <> s4% AND ch% <> s5%
    END SELECT
    icnt& = icnt& + 1
WEND
IF NOT flag% THEN 'nothing else? - then exit
    IF ocnt& > oalb& GOTO plEnd
    EXIT FUNCTION
END IF
'--- redim to clear array on 1st word/component ---
IF ocnt& = oalb& THEN REDIM outArray$(oalb& TO oaub&)
'--- expand array, if required ---
plNextWord:
IF ocnt& > oaub& THEN
    oaub& = oaub& + 10
    REDIM _PRESERVE outArray$(oalb& TO oaub&)
END IF
'--- get current word/component until next separator ---
flag% = 0: nest% = 0: spos& = icnt& - 1
WHILE icnt& <= ilen& AND NOT flag%
    IF ch% = q1% AND nest% = 0 THEN
        nest% = 1
    ELSEIF ch% = q1% AND nest% > 0 THEN
        nest% = nest% + 1
    ELSEIF ch% = q2% AND nest% > 0 THEN
        nest% = nest% - 1
    END IF
    ch% = ASC(inpLine$, icnt&)
    SELECT CASE slen%
        CASE 0: flag% = (nest% = 0 AND (ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 1: flag% = (nest% = 0 AND (ch% = s1% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 2: flag% = (nest% = 0 AND (ch% = s1% OR ch% = s2% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 3: flag% = (nest% = 0 AND (ch% = s1% OR ch% = s2% OR ch% = s3% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 4: flag% = (nest% = 0 AND (ch% = s1% OR ch% = s2% OR ch% = s3% OR ch% = s4% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
        CASE 5: flag% = (nest% = 0 AND (ch% = s1% OR ch% = s2% OR ch% = s3% OR ch% = s4% OR ch% = s5% OR ch% = q1%)) OR (nest% = 1 AND ch% = q2%)
    END SELECT
    icnt& = icnt& + 1
WEND
epos& = icnt& - 1
IF ASC(inpLine$, spos&) = q1% THEN spos& = spos& + 1
outArray$(ocnt&) = MID$(inpLine$, spos&, epos& - spos&)
ocnt& = ocnt& + 1
'--- more words/components following? ---
IF flag% AND ch% = q1% AND nest% = 0 GOTO plNextWord
IF flag% GOTO plSkipSepas
IF (ch% <> q1%) AND (ch% <> q2% OR nest% = 0) THEN outArray$(ocnt& - 1) = outArray$(ocnt& - 1) + CHR$(ch%)
'--- final array size adjustment, then exit ---
plEnd:
IF ocnt& - 1 < minUB& THEN ocnt& = minUB& + 1
REDIM _PRESERVE outArray$(oalb& TO (ocnt& - 1))
ParseLine& = ocnt& - 1
END FUNCTION
 
 

As the UTF-8 encoding of the HTML Documentation must be preserved, it is packed into a 7-zip archive file attached below. The archive also contains the example from the codebox above.

bplus · « **Reply #1 on:** August 27, 2021, 11:50:09 am »

Nice advance from split! :)

bplus · « **Reply #2 on:** August 27, 2021, 12:13:20 pm »

@RhoSigma

Is this the parser you are using with your GUI tools?

RhoSigma · « **Reply #3 on:** August 27, 2021, 12:25:01 pm »

Quote from: bplus on August 27, 2021, 12:13:20 pm

@RhoSigma
Is this the parser you are using with your GUI tools?

In fact it is based on that parser. The features of this new version are mostly driven by the different needs I became aware of in GuiTools.

In future versions of GuiTools I'll replace the old parser with this new version.

RhoSigma · « **Reply #4 on:** August 28, 2021, 09:13:24 am »

Tips & Tricks

If you need to specify control chars as separators, let's say tabs, linefeeds, carriage returns and formfeeds, then you can of course do it the bulky way by passing CHR$(9)+CHR$(10)+CHR$(13)+CHR$(12) as the sepChars$ function argument, but writing it as MKL$(&H090A0D0C) instead is much more compact. As the order of the given separator chars doesn't matter, it does not hurt that MKL$ internally reverses the given codes into little endian order.

Accordingly, MKI$ can be used for just two separator chars. However, if you need an odd number of chars, e.g. 3, then you must at least go the MKI$+CHR$ way. Or you may use MKL$ nevertheless with only 3 hex bytes (this implies the 4th hex byte is zero), and hope that your input line does not have any CHR$(0) in it.
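A quick way to check the byte order is to print the character codes of the resulting strings. This is only a sketch; the variable names are made up for the illustration, and it relies on nothing but the built-in MKL$, MKI$ and ASC functions.

```qb64
'--- Sketch: compare the MKL$/MKI$ shortcut with the CHR$ chain.
'--- All variable names here are hypothetical.
DIM longWay$, shortWay$, three$, x%
longWay$ = CHR$(9) + CHR$(10) + CHR$(13) + CHR$(12)
shortWay$ = MKL$(&H090A0D0C) 'stored little endian: 0C 0D 0A 09
PRINT "Same length:"; LEN(longWay$) = LEN(shortWay$) '-1 means true
FOR x% = 1 TO LEN(shortWay$)
    PRINT ASC(shortWay$, x%); 'codes in stored order: 12 13 10 9
NEXT x%
PRINT
'--- three separators: MKI$ for two chars plus one CHR$
three$ = MKI$(&H090A) + CHR$(13) 'LF and TAB from MKI$, then CR
PRINT LEN(three$) '3
```

Both strings hold the same four characters, only in a different order, which makes no difference when they are used as a separator set.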

euklides · « **Reply #5 on:** August 28, 2021, 10:44:37 am »

Complicated program for a fairly simple problem:
1) define the allowed characters,
2) put one non-allowed character at the end of the string
3) read all the characters in the string, note the retained characters forming a word and, when you find a non-allowed character, place the word into an incremented variable...

And your program takes ! or ? as a word ?

For a html text, take only text within > <
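For comparison, the three steps listed above could be sketched roughly as follows. This is an illustration only, not euklides' actual program: the allowed set, the test text and all variable names are assumptions, and unlike ParseLine& it has no notion of quoting.

```qb64
'--- Sketch of the allowed-characters approach (all names hypothetical).
DIM allowed$, txt$, word$, c$, i&, n&
REDIM words$(1 TO 100)
'1) define the allowed characters
allowed$ = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
'2) append one non-allowed char so the last word gets flushed too
txt$ = "Hello World! How are you?" + " "
'3) scan every char, collecting the allowed ones into the current word
FOR i& = 1 TO LEN(txt$)
    c$ = MID$(txt$, i&, 1)
    IF INSTR(allowed$, c$) > 0 THEN
        word$ = word$ + c$ 'retained char, extend current word
    ELSEIF LEN(word$) > 0 THEN
        n& = n& + 1 'non-allowed char ends the word
        IF n& > UBOUND(words$) THEN REDIM _PRESERVE words$(1 TO n& + 100)
        words$(n&) = word$: word$ = ""
    END IF
NEXT i&
FOR i& = 1 TO n&: PRINT i&; "["; words$(i&); "]": NEXT i&
```

With this whitelist neither ! nor ? ends up in a word. ParseLine& works the other way round: everything that is not a listed separator belongs to a component, so punctuation must be named in sepChars$ if it should be dropped.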

RhoSigma · « **Reply #6 on:** August 28, 2021, 11:25:42 am »

Quote from: euklides on August 28, 2021, 10:44:37 am

And your program takes ! or ? as a word ?

From this alone, I'm 99% sure you didn't even take a closer look at the example, let alone run it or look into the function description.

The function is exactly how I wanted it to be, it behaves exactly as I wanted it to behave.
You're welcome to use this function, or to code a better one and post it here, so we can compare.

euklides · « **Reply #7 on:** August 30, 2021, 02:46:18 am »

Ok
I work with some non-commercial epubs and have my own little program to extract the text from them (it's HTML text, as you know) and the words (used for word compilation, spell check, number of different words and complexity of the text) [in French].

RhoSigma · « **Reply #8 on:** August 30, 2021, 05:33:07 am »

I see, maybe I was somewhat imprecise about my function's purpose. It's not just intended for word extraction from a text, but also for processing as many different kinds of data as possible, e.g. splitting CSV data into its single values or extracting argument lists from SUBs/FUNCs in program sources; even parsing of binary patterns would be possible.

I need this function in many different programs for many different purposes, so I made it to fit for all my needs, and that's why I call it "versatile".

Of course, for splitting text into words alone, a simpler implementation would do too. But the good thing about my implementation is that it can do a lot of different parsing, and it's still possible to use it for simple word splitting, given the right parameters.

SMcNeill · « **Reply #9 on:** August 30, 2021, 11:10:52 am »

I haven’t tried this yet, Rho, but let me ask: Does it have a “cluster” feature? (I’ll explain; I’m not really certain what to call what I’m looking for…)

For example, my data might look like:

“Smith, Joe”, “123 Frog Lane”, “New York, New York”
Or
>Smith, Joe<,ect…

Both are CSV data, but the first uses quotes to cluster data together, ignoring the commas, while the second uses >< as a delimiter.

Somewhere around here, I’ve got a function that offers such delimiting ability via arrays, but I found it a little too complex to remember how it worked without going back and having to reread my documentation every time I plugged it into a program. If yours can handle various data cases, like above, I’ll probably just use it to plug into my projects in the future, rather than having to create a simpler version of the over-engineered beast I have currently.

Ideally, it’d need/offer:

Multiple start/stop cluster symbols. (Quotes, or parentheses, or brackets, etc.)

Non-cluster symbols. (Say triple quotes to indicate a regular quote in the data. “””Sexy Beast””” would actually represent “Sexy Beast” and those quotes are non-grouping data. Or “5 >> 3” would represent “5 > 3” if >< were delimiters…)

Multiple data separators. Comma, CHR$(13) + CHR$(10), CHR$(13), CHR$(10) might *all* indicate separation of data.

Data inclusionary/exclusionary criteria. For example if a line starts with “DATA 123, 456, 789”, can I exclude that “DATA “ from my results? Or a line that starts with REM or ‘ and ends with a CRLF? Can I start extraction only on lines that begin with “DATA” and end with a CRLF, while ignoring lines that don’t?

I’ve got a function here somewhere that handles such use cases, but it’s bulky and relies on the use of multiple arrays for each parameter. I’ve really got to tone it down to something simpler for the future, and was curious just how robust this was, overall.

RhoSigma · « **Reply #10 on:** August 30, 2021, 12:39:33 pm »

Quote from: SMcNeill on August 30, 2021, 11:10:52 am

“Smith, Joe”, “123 Frog Lane”, “New York, New York”

>Smith, Joe<,ect…

easy:
1.) specify , as sepChars$ and empty quoChars$ (implies regular " quote)
2.) specify , as sepChars$ and >< as quoChars$
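The two calls could look roughly like this. It is only a sketch: it assumes the ParseLine& function from the first post is appended below the code, the array and variable names are made up, and the test lines are shortened versions of the quoted data. In the first call a space is added to sepChars$ as well (an assumption, not part of the answer above), because with the comma alone the blank after each comma would start a component of its own.

```qb64
'--- Sketch: the two CSV variants. Append the ParseLine& function
'--- from the first post below this code before running it.
'--- r$(), n& and x& are hypothetical names.
REDIM r$(0 TO 0) 'must be a dynamic array, ParseLine& resizes it
DIM n&, x&
'1.) regular " quotes (empty quoChars$), comma and space as separators
n& = ParseLine&(CHR$(34) + "Smith, Joe" + CHR$(34) + ", " + CHR$(34) + "123 Frog Lane" + CHR$(34), ", ", "", r$(), 0)
FOR x& = LBOUND(r$) TO n&: PRINT "["; r$(x&); "]": NEXT x&
'2.) > as opening and < as closing quote, comma as separator
n& = ParseLine&(">Smith, Joe<,>123 Frog Lane<", ",", "><", r$(), 0)
FOR x& = LBOUND(r$) TO n&: PRINT "["; r$(x&); "]": NEXT x&
```

Tracing the posted code by hand, both calls should deliver the two elements Smith, Joe and 123 Frog Lane, with the comma inside the name kept because it sits in a quoted section.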

Quote from: SMcNeill on August 30, 2021, 11:10:52 am

Multiple start/stop cluster symbols. (Quotes, or parentheses, or brackets, etc.)

not sure about your "multiple": you may use different quoting start/stop chars, but not e.g. () and <> in the same call. Quoting works in "one char" mode, like the regular " or any other char of your choice, e.g. |, * etc., even control chars.
Quoting is also possible in "two chars" mode, like your >< or (), [], {}, /\ or whatever, but only one pair at a time.

Quote from: SMcNeill on August 30, 2021, 11:10:52 am

Non-cluster symbols. (Say triple quotes to indicate a regular quote in the data. “””Sexy Beast””” would actually represent “Sexy Beast” and those quotes are non-grouping data.

yes/no, it would create 3 array entries, empty/Sexy Beast/empty, if regular quoting is used (you would simply ignore the empty ones)

Quote from: SMcNeill on August 30, 2021, 11:10:52 am

Or “5 >> 3” would represent “5 > 3” if >< were delimiters…)

yes, as "two chars" quoting doesn't nest, the 1st > would open the quote and the 2nd, 3rd... > would be taken as literal; any < would also be literal until the matching number of closing chars is reached, and that one closes the quote.
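A small sketch of that behavior, again assuming the ParseLine& function from the first post is appended below the code; the test line and the variable names are made up for the illustration.

```qb64
'--- Sketch: extra opening/closing chars inside a "two chars" quote.
'--- Append the ParseLine& function from the first post below this code.
'--- r$() and n& are hypothetical names.
REDIM r$(0 TO 0)
DIM n&
n& = ParseLine&(">a >b< c<,next", ",", "><", r$(), 0)
PRINT "["; r$(0); "]" 'the inner > and < stay in the text
PRINT "["; r$(1); "]"
```

Tracing the posted code by hand, the first element should come out as a >b< c and the second as next: the inner pair is only counted, never split on, and it is not removed from the result either. Doubled chars are therefore kept as they are rather than collapsed into one.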

Quote from: SMcNeill on August 30, 2021, 11:10:52 am

Multiple data separators. Comma, CHR$(13) + CHR$(10), CHR$(13), CHR$(10) might *all* indicate separation of data.

yes, currently up to 5 chars (easily expandable with your skills)
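One possible way to lift the limit, sketched here as an assumption and not as the author's method, is to replace the unrolled s1%..s5% comparisons with a single INSTR lookup. IsSep% is a hypothetical helper that is not part of the posted function.

```qb64
'--- Sketch: separator test for any number of separator chars.
'--- IsSep% and the variables below are hypothetical names.
DIM seps$, t$, i&
seps$ = " ,.;:!?" 'seven separators, more than the built-in five
t$ = "a,b;c"
FOR i& = 1 TO LEN(t$)
    PRINT MID$(t$, i&, 1); IsSep%(ASC(t$, i&), seps$) '-1 = separator, 0 = not
NEXT i&

'--- returns -1 (true) if the char code ch% occurs in sepChars$
FUNCTION IsSep% (ch%, sepChars$)
    IsSep% = INSTR(sepChars$, CHR$(ch%)) > 0
END FUNCTION
```

Inside ParseLine& such a helper could stand in for the SELECT CASE slen% blocks. The lookup costs a string search per char, so the unrolled comparisons of the original are probably faster for short separator lists.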

Quote from: SMcNeill on August 30, 2021, 11:10:52 am

Data inclusionary/exclusionary criteria. For example if a line starts with “DATA 123, 456, 789”, can I exclude that “DATA “ from my results? Or a line that starts with REM or ‘ and ends with a CRLF? Can I start extraction only on lines that begin with “DATA” and end with a CRLF, while ignoring lines that don’t?

conditional, if you know DATA is always first, then simply ignore the 1st array element returned

Also, if too many conditions come together, then it's always possible to do the parsing in 2 or more passes.
