Author Topic: Extract numerics from text string  (Read 3946 times)

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Extract numerics from text string
« on: November 07, 2019, 12:18:54 pm »
Hello,

I am looking for an easy way to extract real numbers from text strings. Got any ideas?

BTW, this is what is needed for an EVAL function parsing.

Actually, the bugger is a negative sign for a number, the negative sign for an exponent, the negative sign for a subtraction operation plus, of course, the negative sign for a hyphen.

I am guessing you have to look both ways before crossing the line with a judgement.

And what if you have negative signs to the left of you and hyphens to the right, stuck in the middle...


Must be lunch time, sugar is getting low.




« Last Edit: November 07, 2019, 12:41:21 pm by bplus »

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
Re: Extract numerics from text string
« Reply #1 on: November 07, 2019, 12:41:22 pm »
Quote
Hello,

I am looking for an easy way to extract real numbers from text strings. Got any ideas?

BTW, this is what is needed for an EVAL function parsing.

Actually, the bugger is a negative sign for a number, the negative sign for an exponent, the negative sign for a subtraction operation plus, of course, the negative sign for a hyphen.

And don’t forget the E and D for scientific notation.  My math evaluator does all this sort of stuff, somewhere in it, but I’d have to study back up to see where/how it does it.  It’s been years since I wrote the thing! 
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
Re: Extract numerics from text string
« Reply #2 on: November 07, 2019, 12:43:33 pm »
From the same era Steve mentions, I offer sxript as a self-consistent example. Click there to find:

<----
You're not done when it works, you're done when it's right.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
Re: Extract numerics from text string
« Reply #3 on: November 07, 2019, 12:52:37 pm »
Let me see if I can remember the steps to do this...

Start parsing the string from left to right.  Look for “+-0123456789.”  These can be the start of a value.

After the start, if it’s a sign, look for additional signs:  - - or + - or something odd.  For the number, only worry about the last sign you find.  In 0 - -1 (zero minus negative one), only the last sign belongs to the number.  The others are math operators.

Once you find the last sign, look for “0123456789.”  They’re the only valid input acceptable after a sign.  If the next character isn’t one, it’s not a number.

Once you have a valid digit (or period), continue to collect digits and periods, until you reach an invalid character.  Only one period is allowed.  If you get more, reject the input.

At this point, you now have most numbers.  You *could* stop here.  If you need scientific notation, look for the “E” or “D”, then the second sign, then digits...
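
A minimal sketch of the steps above, written in Python rather than QB64 so it is easy to run and test. The names `scan_numbers` and `to_float` are made up for illustration, and accepting lowercase `e`/`d` as exponent marks is an assumption, not something stated in the post.

```python
SIGNS = "+-"
DIGITS = "0123456789"
EXPONENT_MARKS = "EeDd"  # assumption: lowercase marks are accepted too


def scan_numbers(text):
    """Return the numeric literals in text, left to right, as strings."""
    found = []
    i, n = 0, len(text)
    while i < n:
        if text[i] not in SIGNS + DIGITS + ".":
            i += 1
            continue
        start = i
        # In a run of signs only the last one belongs to the number;
        # the earlier ones are left for the caller to treat as operators.
        while i < n and text[i] in SIGNS:
            i += 1
        if i > start:
            start = i - 1
        # Collect digits and periods; at most one period is allowed.
        j, digits, periods = i, 0, 0
        while j < n and (text[j] in DIGITS or text[j] == "."):
            if text[j] == ".":
                periods += 1
            else:
                digits += 1
            j += 1
        if digits == 0 or periods > 1:
            i = j  # not a number: skip everything examined so far
            continue
        # Optional exponent: E or D, an optional sign, then digits.
        k = j
        if k < n and text[k] in EXPONENT_MARKS:
            k += 1
            if k < n and text[k] in SIGNS:
                k += 1
            m = k
            while m < n and text[m] in DIGITS:
                m += 1
            if m > k:
                j = m  # accept the exponent only if digits follow it
        found.append(text[start:j])
        i = j
    return found


def to_float(token):
    """Convert a scanned token to float; Python only knows E, so map D to E."""
    return float(token.upper().replace("D", "E"))


print(scan_numbers("0 - -1"))               # ['0', '-1']
print(scan_numbers("I use -5 and 57, 34"))  # ['-5', '57', '34']
```

Note that a sign directly in front of a digit is always attached to the number here, so "5-3" scans as 5 and -3; deciding whether that minus is really a subtraction is left to the expression parser.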

Offline Petr

  • Forum Resident
  • Posts: 1720
  • The best code is the DNA of the hops.
Re: Extract numerics from text string
« Reply #4 on: November 07, 2019, 01:15:41 pm »
My basic solution:  (integers only)

Code: QB64:
s$ = "I use -5 and 57, 34"

REDIM nrs(0) AS INTEGER 'dynamic array, so REDIM _PRESERVE can grow it
FOR h = 1 TO LEN(s$)
    oldch = ch
    ch = ASC(s$, h)
    'digits are ASCII 48 ("0") to 57 ("9"); 45 is "-"
    IF oldch = 45 AND ch >= 48 AND ch <= 57 THEN num$ = "-" + CHR$(ch)
    IF ch >= 48 AND ch <= 57 AND oldch <> 45 THEN num$ = num$ + CHR$(ch)

    'a non-digit, or the end of the string, finishes the current number
    IF ch < 48 OR ch > 57 OR h = LEN(s$) THEN
        IF oldch <> 45 OR h = LEN(s$) THEN
            IF LEN(num$) THEN
                unrs = UBOUND(nrs) + 1
                REDIM _PRESERVE nrs(unrs) AS INTEGER
                nrs(unrs) = VAL(num$)
                num$ = ""
            END IF
        END IF
    END IF
NEXT

PRINT "Found numbers:"; UBOUND(nrs)
FOR l = 1 TO UBOUND(nrs)
    PRINT nrs(l)
NEXT

Upgrade for single numbers...

Code: QB64:
s$ = "I use -5 and 57, 67, -3.14"

REDIM nrs(0) AS SINGLE
FOR h = 1 TO LEN(s$)
    oldch = ch
    ch = ASC(s$, h)
    IF ch = 46 THEN num$ = num$ + "." '46 is "."
    'digits are ASCII 48 ("0") to 57 ("9"); 45 is "-"
    IF oldch = 45 AND ch >= 48 AND ch <= 57 THEN num$ = "-" + CHR$(ch)
    IF ch >= 48 AND ch <= 57 AND oldch <> 45 THEN num$ = num$ + CHR$(ch)

    IF ch <> 46 THEN
        'a non-digit, or the end of the string, finishes the current number
        IF ch < 48 OR ch > 57 OR h = LEN(s$) THEN
            IF oldch <> 45 OR h = LEN(s$) THEN
                IF LEN(num$) THEN
                    unrs = UBOUND(nrs) + 1
                    REDIM _PRESERVE nrs(unrs) AS SINGLE
                    nrs(unrs) = VAL(num$)
                    num$ = ""
                END IF
            END IF
        END IF
    END IF
NEXT

PRINT "Found numbers:"; UBOUND(nrs)
FOR l = 1 TO UBOUND(nrs)
    PRINT nrs(l)
NEXT
« Last Edit: November 07, 2019, 01:29:05 pm by Petr »

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Re: Extract numerics from text string
« Reply #5 on: November 07, 2019, 02:43:45 pm »
So far Petr's ahead in my eyes; he has given me code to test and maybe modify into a SUB or FUNCTION.

Steve is next best (or best; after all, I just asked for ideas) because he has outlined an idea that seems plausible. I am kind of worried about a run of signs like +--+... to the left of a number (if memory serves, I think you do have to look at the right side of the character too), but this was also brought up by another great EVAL builder at JB. I think my EVAL assumes every + encountered means an add operation.

STx is third best because he is making me read and, worse, making me dig up the material to read as well  ;-))
If I am lazy, it is definitely in this regard.

Yeah, the E's and D's make this thing a real chore. I guess that is partly why I am a fan of integers and discrete maths.

« Last Edit: November 07, 2019, 02:46:44 pm by bplus »

Offline Petr

  • Forum Resident
  • Posts: 1720
  • The best code is the DNA of the hops.
Re: Extract numerics from text string
« Reply #6 on: November 08, 2019, 11:07:05 am »
Hi BPlus, I cleaned up the source code a little (I think SELECT CASE is better suited to this use) and made a SUB of it, which returns an array of numbers.


Code: QB64:
REDIM nrs(0) AS SINGLE
NumFromString "Sub NumFromString return all numbers 45 from -3.14 this 000 string to 1234567890, .65, 0.001, -0.0021 array7", nrs()
FOR L = 1 TO UBOUND(nrs)
    PRINT nrs(L)
NEXT

SUB NumFromString (s AS STRING, nrs() AS SINGLE)
    t$ = s$ + "a" 'sentinel on a local copy, in case the last character in the string is a digit
    FOR R = 1 TO LEN(t$)
        oldCh = Ch
        Ch = ASC(t$, R)
        SELECT CASE Ch
            CASE 48 TO 57: num$ = num$ + CHR$(Ch) 'digit
            CASE 45: num$ = "-" 'minus sign starts a new number
            CASE 46: IF oldCh >= 48 AND oldCh <= 57 THEN num$ = num$ + "." ELSE num$ = "0."
            CASE IS < 48, IS > 57 'any other character ends the current number
                'note: a lone hyphen or period in ordinary text is still stored as 0
                IF LEN(num$) THEN
                    U = UBOUND(nrs)
                    REDIM _PRESERVE nrs(U + 1) AS SINGLE
                    nrs(U + 1) = VAL(num$)
                    num$ = ""
                END IF
        END SELECT
    NEXT
END SUB

Basically, it is just a small modification of a program I wrote to make some work easier. When practicing spelling with my children, it was necessary to hide the letters i, I and y, Y in the text. This is a very similar but simpler problem:

Code: QB64:
'FUNCTION that searches a string for "Y", "y", "I", "i" and replaces each with "_" (for testing kids)

Text$ = "White House, Washington, Yellowstone, Hollywood, Smokies, Permission"
PRINT Ifinder$(Text$)

FUNCTION Ifinder$ (text AS STRING)
    FOR T = 1 TO LEN(text$)
        ch = ASC(text$, T)
        SELECT CASE ch
            CASE 73, 105, 89, 121: c$ = "_" 'I, i, Y, y
            CASE ELSE
                c$ = CHR$(ch)
        END SELECT
        result$ = result$ + c$ 'build the output in a local variable
    NEXT
    Ifinder$ = result$
END FUNCTION

This is just the core of the program. Of course, the real one uses UNICODE functions for Czech, and the text is not a fixed string but is copied from the clipboard. Simple and very effective.
« Last Edit: November 08, 2019, 11:15:38 am by Petr »

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Re: Extract numerics from text string
« Reply #7 on: November 08, 2019, 10:19:14 pm »
Here is a check that should allow exponents along with ordinary integers and reals, but not dates or times (a different category that might also be checked by separate IsDate and IsTime functions).

Code: QB64:
_TITLE "Find numbers in text" 'b+ 2019-11-08
'a slightly different approach
s$(0) = "Sub NumFromString return all numbers 45 from -3.14 this 000 string to 1234567890, .65, 0.001, -0.0021 array7"
s$(1) = "You can reach me at +22.17 degrees latitude, -44.96 degrees"
s$(2) = "I use -5 and 0.0001, 34, 3.19216457E-46, -9, 00.00.00, -4.987651223847D+49, 12:32:59, 11/09/2019"

REDIM wrds$(1 TO 1)
FOR i = 0 TO 2
    PRINT "For Text string = "; s$(i)
    replace1for1 s$(i), ",", " "
    PRINT "Replace commas for spaces: "; s$(i)
    Split s$(i), " ", wrds$()
    PRINT: PRINT "Numbers found"
    FOR j = 1 TO UBOUND(wrds$)
        IF IsNumber(wrds$(j)) THEN PRINT wrds$(j)
    NEXT
    PRINT "press any to wakeup next..."
    SLEEP
    CLS
NEXT
PRINT "Testing is done."

SUB replace1for1 (s$, char1$, new1$)
    'replace every occurrence of the single character char1$ with new1$
    p = INSTR(s$, char1$)
    WHILE p > 0
        MID$(s$, p, 1) = new1$
        p = INSTR(p + 1, s$, char1$)
    WEND
END SUB

FUNCTION IsNumber% (s$)
    c$ = s$ 'make a copy of s$ because will tear it up
    'has at most 1 decimal point
    '- or + immediately right of E or D
    'neg at start
    'all other chars are digits
    'maybe + too at start would be OK

    IF _TRIM$(s$) = "" THEN EXIT FUNCTION
    p = INSTR(c$, ".")
    IF p > 0 THEN
        IF INSTR(p + 1, c$, ".") > 0 THEN EXIT FUNCTION 'not 2 decimals
        c$ = MID$(c$, 1, p - 1) + MID$(c$, p + 1) 'get rid of decimal
        'PRINT "After decimal = "; c$
    END IF
    p = INSTR(c$, "E")
    IF p > 0 THEN
        IF MID$(c$, p + 1, 1) = "-" OR MID$(c$, p + 1, 1) = "+" THEN
            c$ = MID$(c$, 1, p - 1) + MID$(c$, p + 2) 'get rid of E+ or E-
            'PRINT "After E fix = "; c$
        ELSE
            EXIT FUNCTION
        END IF
    END IF
    p = INSTR(c$, "D")
    IF p > 0 THEN
        'a D without a sign stays in c$ and fails the digit check below
        IF MID$(c$, p + 1, 1) = "-" OR MID$(c$, p + 1, 1) = "+" THEN
            c$ = MID$(c$, 1, p - 1) + MID$(c$, p + 2) 'get rid of D+ or D-
            'PRINT "After D fix = "; c$
        END IF
    END IF
    p = INSTR(c$, "-")
    IF p > 0 THEN
        IF INSTR(p + 1, c$, "-") THEN EXIT FUNCTION
        IF p <> 1 THEN EXIT FUNCTION
        c$ = MID$(c$, 1, p - 1) + MID$(c$, p + 1) 'get rid of -
        'PRINT "After - = "; c$
    END IF
    p = INSTR(c$, "+")
    IF p > 0 THEN
        IF INSTR(p + 1, c$, "+") THEN EXIT FUNCTION
        IF p <> 1 THEN EXIT FUNCTION
        c$ = MID$(c$, 1, p - 1) + MID$(c$, p + 1) 'get rid of +
        'PRINT "After + = "; c$
    END IF
    IF LEN(c$) = 0 THEN EXIT FUNCTION 'nothing left, e.g. s$ was just "-" or "."
    'all the rest better be digits
    FOR i = 1 TO LEN(c$)
        IF ASC(c$, i) < 48 OR ASC(c$, i) > 57 THEN EXIT FUNCTION
    NEXT
    IsNumber% = -1
END FUNCTION

SUB Split (SplitMeString AS STRING, delim AS STRING, loadMeArray() AS STRING)
    DIM curpos AS LONG, arrpos AS LONG, LD AS LONG, dpos AS LONG 'fix use the Lbound the array already has
    curpos = 1: arrpos = LBOUND(loadMeArray): LD = LEN(delim)
    dpos = INSTR(curpos, SplitMeString, delim)
    DO UNTIL dpos = 0
        loadMeArray(arrpos) = MID$(SplitMeString, curpos, dpos - curpos)
        arrpos = arrpos + 1
        IF arrpos > UBOUND(loadMeArray) THEN REDIM _PRESERVE loadMeArray(LBOUND(loadMeArray) TO UBOUND(loadMeArray) + 1000) AS STRING
        curpos = dpos + LD
        dpos = INSTR(curpos, SplitMeString, delim)
    LOOP
    loadMeArray(arrpos) = MID$(SplitMeString, curpos)
    REDIM _PRESERVE loadMeArray(LBOUND(loadMeArray) TO arrpos) AS STRING 'get the ubound correct
END SUB

Actually, this has me thinking about finding numbers between (), {} or [], after I found a problem with commas at the end of numbers.
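
One way the split-then-check idea might be extended to brackets is sketched below in Python rather than QB64. It treats whitespace, commas and all three bracket pairs as separators, then keeps only the tokens that match a pattern close to the IsNumber rules: optional leading sign, at most one decimal point, and an E or D exponent with an explicit sign. The names `SEPARATORS`, `NUMBER` and `numbers_in` are hypothetical, and the pattern is a little stricter than IsNumber because it insists on at least one digit.

```python
import re

# Assumption: numbers are delimited by whitespace, commas, or brackets.
SEPARATORS = re.compile(r"[\s,()\[\]{}]+")

# Optional sign, digits with at most one decimal point,
# then an optional E/D exponent that must carry its own sign.
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[ED][+-]\d+)?")


def numbers_in(text):
    """Return the tokens of text that are numbers in their entirety."""
    return [tok for tok in SEPARATORS.split(text) if NUMBER.fullmatch(tok)]


print(numbers_in("(12), [3.5] {-7}, end."))  # ['12', '3.5', '-7']
```

Because each token has to match as a whole, entries such as 00.00.00, 12:32:59 and 11/09/2019 are rejected without any special date or time handling.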

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
Re: Extract numerics from text string
« Reply #8 on: November 08, 2019, 10:55:34 pm »
Quote
Actually, this has me thinking about finding numbers between (), {} or [], after I found a problem with commas at the end of numbers.

You’ll drive yourself crazy, unless you set a hard standard for what’s acceptable and what’s not for this type of tool.

For example: “1,234,567”

Is the above 1234567 or 1, and 234, and 567?  One number separated by commas, or 3 numbers?  What if we write it as “$12,345.67”?  Is the minus in “-$123” part of the number, or a subtraction sign?  Is “11-10-2018” a date, or math between three values?  How about “11 - 10 - 2018”?

Once you start trying to decide IsDate, IsTime, IsMoney, and such, you’ll end up with a lot of overlap in those functions.  If you look at it properly, “12:13” is a time, but it’s also 2 numbers just like “12 + 13” is — that “:” might designate “the ratio of 12 to 13”, such as in odds.  “The bet on the football game is 12:13, favoring the home team.”

Instead of saying date, time, and such aren’t numbers, you might be better served to set flags for what they might be.

“12:13” could give 3 numbers — 12, 13, 12:13...

Or you could pop up a dialog box and ask the user, “This value (11 - 10 - 2018) could be multiple things.  How would you like to define it?  [DATE] [MULTIPLE NUMBERS]”....
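
A rough sketch of the "set flags" idea, in Python rather than QB64. Instead of answering yes or no, the hypothetical `interpretations` function returns every reading a token could have, so the caller (or the user) can pick. The flag names and the patterns in `PATTERNS` are illustrative assumptions, not a complete set of rules for dates, times or money.

```python
import re

# Each flag is a reading the token might have; the flags are not exclusive.
PATTERNS = {
    "number": r"[+-]?\d+(?:\.\d+)?",
    "grouped number": r"[+-]?\d{1,3}(?:,\d{3})+(?:\.\d+)?",
    "money": r"[+-]?\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?",
    "time": r"\d{1,2}:\d{2}(?::\d{2})?",
    "ratio": r"\d+:\d+",
    "date": r"\d{1,2}[-/]\d{1,2}[-/]\d{4}",
}


def interpretations(token):
    """Return the set of flags describing what token might be."""
    flags = {name for name, pat in PATTERNS.items() if re.fullmatch(pat, token)}
    # More than one run of digits means it could also be several numbers.
    if len(re.findall(r"\d+(?:\.\d+)?", token)) > 1:
        flags.add("multiple numbers")
    return flags


print(sorted(interpretations("12:13")))       # ['multiple numbers', 'ratio', 'time']
print(sorted(interpretations("11-10-2018")))  # ['date', 'multiple numbers']
```

A token that comes back with more than one flag is exactly the case where asking the user makes sense.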

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Extract numerics from text string
« Reply #9 on: November 09, 2019, 12:28:49 am »
Hi Steve,

All good points. At least I got some practice verifying numbers with exponents; that part is perfected now.