Author Topic: removing tags from html  (Read 4387 times)

Offline random1

  • Newbie
  • Posts: 86
removing tags from html
« on: August 02, 2021, 02:11:14 pm »
Hi All

I need help removing HTML tags. I download a web page that has the data I need
using wget.exe, then load the content of the page into a string. I have most of it working,
except I can't seem to remove the tabs. There are many online tools that will do the job, but
I would like to automate the process so that everything works from within my QB64 program.

R1
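
For reference, the download-and-load step described above might look roughly like the sketch below. It assumes wget.exe is on the PATH; the URL and the file name page.html are placeholders, not anything taken from the post.

```qb64
' Sketch only: url$ and file$ are hypothetical placeholders.
url$ = "https://example.com/data.html"
file$ = "page.html"

' -q suppresses wget's progress output, -O names the output file.
SHELL _HIDE "wget.exe -q -O " + file$ + " " + url$

IF _FILEEXISTS(file$) THEN
    ' Read the whole file into a$ in one GET.
    OPEN file$ FOR BINARY AS #1
    a$ = SPACE$(LOF(1))
    GET #1, 1, a$
    CLOSE #1
    IF LEN(a$) = 0 THEN
        PRINT "The download produced an empty file."
    ELSE
        PRINT "Loaded"; LEN(a$); "bytes"
    END IF
ELSE
    PRINT "Could not find "; file$
END IF
```

The empty-file check is there because wget can leave a zero-length output file behind when a download fails.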
 

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
Re: removing tags from html
« Reply #1 on: August 02, 2021, 11:58:58 pm »
"Basic"lly, you parse out tags something like this...

Code: QB64:
REM Load your entire data mined html file at once.
OPEN "thisismyfile.html" FOR BINARY AS #1
a$ = SPACE$(LOF(1))
GET #1, 1, a$
CLOSE #1

REM Example of how to parse what got loaded. Let's say a$ was the below...

a$ = "<html><title>my html parser</title><body><h1>heading 1</h1><br><br><h2>heading 2</h2></body></html>"

REM Now parse out the tags...
seed = 1 ' INSTR positions are 1-based, so start at the first character.
DO
    tagEnd = INSTR(seed, a$, ">") ' end of the current tag
    IF tagEnd = 0 THEN EXIT DO
    tagStart = INSTR(tagEnd, a$, "<") ' start of the next tag
    IF tagStart = 0 THEN EXIT DO ' no more tags follow
    parse$ = MID$(a$, tagEnd + 1, tagStart - tagEnd - 1)
    IF LEN(parse$) THEN PRINT parse$
    seed = tagStart
LOOP

Now if you just run the code, it will make an empty file called "thisismyfile.html", which is okay; you can save your actual html file under that name later. Anyway, the rest of the code uses the example a$ string I included and parses out the tags.

Give it a try, and modify it as needed. In the QB64 Wiki, look up INSTR and how the first parameter, "Seed" works.
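
To illustrate the start-position ("Seed") parameter, here is a minimal sketch that walks through a made-up sample string and reports every ">" it finds. The string and the variable names are just for illustration.

```qb64
' Hypothetical sample: list every position of ">" in a$.
a$ = "<b>bold</b>"
seed = 1 ' first position INSTR should look at
DO
    p = INSTR(seed, a$, ">")
    IF p = 0 THEN EXIT DO ' INSTR returns 0 when nothing more is found
    PRINT "Found > at position"; p
    seed = p + 1 ' resume the search just past this match
LOOP
```

For this sample it reports positions 3 and 11. Moving seed one past each match is what keeps the loop from finding the same character forever.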

Good luck,

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline random1

  • Newbie
  • Posts: 86
Re: removing tags from html
« Reply #2 on: August 03, 2021, 01:29:52 am »
Pete

I already have something very similar. What I need help with is a method that will remove
horizontal tab characters so that my data formats correctly.

R1     

FellippeHeitor

  • Guest
Re: removing tags from html
« Reply #3 on: August 03, 2021, 01:33:08 am »
The file strings.bas is included with QB64. With it you can, for example, replace tabs, CHR$(9), with spaces:

Code: QB64:
a$ = StrReplace$(a$, CHR$(9), SPACE$(4))
'$include:'source/utilities/strings.bas'

Or just remove them altogether with:

Code: QB64:
a$ = StrReplace$(a$, CHR$(9), "")
'$include:'source/utilities/strings.bas'

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
Re: removing tags from html
« Reply #4 on: August 03, 2021, 02:10:25 am »
Which should be something like:

Code: QB64:
a$ = STRING$(12, CHR$(9)) + "123456"
REM Now parse out the tab characters...
DO UNTIL INSTR(a$, CHR$(9)) = 0
    a$ = MID$(a$, 1, INSTR(a$, CHR$(9)) - 1) + MID$(a$, INSTR(a$, CHR$(9)) + 1)
LOOP


Offline random1

  • Newbie
  • Posts: 86
Re: removing tags from html
« Reply #5 on: August 03, 2021, 11:33:29 am »
Thanks, worked like a charm. I guess I need to get out more; this is the first I've heard of
the strings.bas option.

R1 

FellippeHeitor

  • Guest
Re: removing tags from html
« Reply #6 on: August 03, 2021, 11:41:06 am »
That file is actually used by qb64.bas, and it's been there for a good while, but it's indeed not something we advertise. Beginning with v1.5 you will also find a lite version of INI-Manager in the utilities folder - it's also used internally by QB64, but it's there for anyone to use as an $INCLUDE.

Offline random1

  • Newbie
  • Posts: 86
Re: removing tags from html
« Reply #7 on: August 03, 2021, 04:37:13 pm »
I need help with another problem: I need to convert an ASCII string to binary. I could bang it out,
but wondered if there is an existing method I could use. To help explain my needs, I want to
compare two strings; one is binary and the other is plain text. It's used to exit a loop
when the two strings match. I have an old sub somewhere but wanted to see if something
newer is out there.

r1

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Re: removing tags from html
« Reply #8 on: August 04, 2021, 12:23:31 pm »
I was curious if I could bang it out:
Code: QB64:
_Title "Str2Bin$ Function" ' b+ 2021-08-04
'For i = 65 To 122 ' looks OK
'    Print i; Chr$(i); " "; AscBin$(i); Val("&b" + AscBin$(i)),
'Next

Do
    Input "Please enter string to convert to binary (just enter quits) "; s$
    Print Str2Bin$(s$)
Loop Until s$ = ""

Function Str2Bin$ (s$)
    ' Build the result in a local string, then assign it to the function name.
    If s$ = "" Then bits$ = "0" Else bits$ = String$(8 * Len(s$), "0")
    For i = 1 To Len(s$)
        'Print Mid$(s$, i, 1), AscBin$(Asc(s$, i))
        Mid$(bits$, 8 * (i - 1) + 1, 8) = AscBin$(Asc(s$, i))
    Next
    Str2Bin$ = bits$
End Function

Function AscBin$ (integerBase10 As Integer) 'any integer < 256  ie all ascii
    Dim j As Integer
    bits$ = String$(8, "0")
    For j = 0 To 7 ' bit 0 is the rightmost character, bit 7 the leftmost
        If (integerBase10 And 2 ^ j) > 0 Then Mid$(bits$, 8 - j, 1) = "1"
    Next
    AscBin$ = bits$
End Function
« Last Edit: August 04, 2021, 12:30:13 pm by bplus »

Offline random1

  • Newbie
  • Posts: 86
Re: removing tags from html
« Reply #9 on: August 04, 2021, 05:34:14 pm »
Thanks for the reply, but my issue turned out to be a mistake on my part.
I was clearing the string before the program entered the loop, which is why
the program was not exiting the loop until it ran out of data to process. I
felt kind of stupid when I found it. On large projects I normally code the
individual small parts as standalones, and once they are working I add them
to the main code. This keeps me from having to compile the entire program
while debugging. Again, thanks for the reply; I will add it to my toolbox.

R1   


   

Offline George McGinn

  • Global Moderator
  • Forum Regular
  • Posts: 210
Re: removing tags from html
« Reply #10 on: August 11, 2021, 12:31:52 am »
@bplus, I like this small program, but I modified your code to add a few small things I wanted (I know you just banged this out, so I just polished it a bit). I share my efforts below.

First, I changed the INPUT to LINE INPUT so that you can use commas, etc. They do translate into valid binary numbers (a comma is 00101100).
I also added a function that splits the binary string into 8-bit groups, each followed by a space. Now you can copy the results and paste them into this website to convert the binary back to your ASCII input string: https://www.rapidtables.com/convert/number/binary-to-ascii.html

And I did a little clean up due to my additions.



Code: QB64:
_TITLE "Str2Bin$ Function" ' b+ 2021-08-04
'$CONSOLE:ONLY
'count =  0
'FOR i = 33 TO 126 ' looks OK
'    count = count + 1
'    IF count = 5 THEN
'        PRINT TAB(5); i; CHR$(i); " "; AscBin$(i); VAL("&b" + AscBin$(i))
'        count = 0
'    ELSE
'        PRINT TAB(5); i; CHR$(i); " "; AscBin$(i); VAL("&b" + AscBin$(i)),
'    END IF
'NEXT i

DO
    LINE INPUT "Please enter string to convert to binary (just enter quits) "; q$
    IF q$ <> "" THEN
        BinString$ = Str2Bin$(q$)
        PRINT SplitBin$(BinString$)
    END IF
LOOP UNTIL q$ = ""

FUNCTION Str2Bin$ (s$)
    ' Build the result in a local string, then assign it to the function name.
    IF s$ = "" THEN bits$ = "0" ELSE bits$ = STRING$(8 * LEN(s$), "0")
    FOR i = 1 TO LEN(s$)
        'Print Mid$(s$, i, 1), AscBin$(Asc(s$, i))
        MID$(bits$, 8 * (i - 1) + 1, 8) = AscBin$(ASC(s$, i))
    NEXT
    Str2Bin$ = bits$
END FUNCTION

FUNCTION AscBin$ (integerBase10 AS INTEGER) 'any integer < 256  ie all ascii
    DIM j AS INTEGER
    bits$ = STRING$(8, "0")
    FOR j = 0 TO 7 ' bit 0 is the rightmost character, bit 7 the leftmost
        IF (integerBase10 AND 2 ^ j) > 0 THEN MID$(bits$, 8 - j, 1) = "1"
    NEXT
    AscBin$ = bits$
END FUNCTION

FUNCTION SplitBin$ (s$)
    ' Copy s$ one character at a time, adding a space after every 8th bit.
    strlen = LEN(s$)
    FOR i = 1 TO strlen
        qString$ = qString$ + MID$(s$, i, 1)
        IF (i MOD 8) = 0 THEN qString$ = qString$ + " "
    NEXT i

    SplitBin$ = qString$
END FUNCTION

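
Going the other way without a website could be sketched in QB64 as well. Bin2Str$ below is a hypothetical helper, not part of either posted program. It relies on VAL accepting the "&B" prefix, the same way the commented-out test loop above uses it.

```qb64
' Hypothetical helper: turn "01001000 01101001" (spaces optional) back into text.
PRINT Bin2Str$("01001000 01101001")

FUNCTION Bin2Str$ (b$)
    ' Keep only the 0 and 1 characters so spaced and unspaced input both work.
    FOR i = 1 TO LEN(b$)
        c$ = MID$(b$, i, 1)
        IF c$ = "0" OR c$ = "1" THEN bits$ = bits$ + c$
    NEXT
    ' Every complete group of 8 bits becomes one character.
    FOR i = 1 TO LEN(bits$) - 7 STEP 8
        result$ = result$ + CHR$(VAL("&B" + MID$(bits$, i, 8)))
    NEXT
    Bin2Str$ = result$
END FUNCTION
```

The sample call prints Hi (01001000 is 72, "H"; 01101001 is 105, "i"). Any leftover bits that do not fill a whole byte are ignored.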
____________________________________________________________________
George McGinn
Theoretical/Applied Computer Scientist
Member: IEEE, IEEE Computer Society
Technical Council on Software Engineering
IEEE Standards Association
American Association for the Advancement of Science (AAAS)

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Re: removing tags from html
« Reply #11 on: August 11, 2021, 08:04:23 am »
Hi @George McGinn,

I can understand the need for Line Input, but why put a space between chars? The bin$ function zero-fills each char to 8 bits, i.e. every 8 0s or 1s amount to one char in the Bin$ string, so you know how many chars are in the string by dividing its length by 8.

What is the need to separate them with spaces?

Offline George McGinn

  • Global Moderator
  • Forum Regular
  • Posts: 210
Re: removing tags from html
« Reply #12 on: August 11, 2021, 09:43:33 am »
Hi @bplus,

The reason I put a space between each character or binary set is that when I use tools like the website I posted and paste in the entire string, it says the binary number is too large.

Also, for convenience, 8 bits are usually grouped into a single block, conventionally called a byte. And you can combine 2 bytes to form a word.

Even in most HEX dumps, the values are grouped into their byte equivalents. I needed it that way for my own purposes, but I wanted to share it as another option.

Either way is fine, I'm just used to seeing and using binary numbers that are in byte (8-bit) strings. If you are converting characters into HEX or BINARY, it just makes sense to me to put a space between each character for clarity (and those who are just learning binary).

And yes, I am capable of drawing lines after every 8th bit, but visually, putting a space between bytes is easier on the eyes, no?

EDIT: Also, when I was a systems programmer, we used binary in word form, or 16 bits. It's also ingrained within me. (Today's processors commonly use 32 or 64 bit words).
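
As a small illustration of the byte grouping used in hex dumps, here is a sketch that prints a string as space-separated two-digit hex bytes. The sample string is arbitrary.

```qb64
' Sketch: print each character of s$ as a two-digit hex byte.
s$ = "QB64"
FOR i = 1 TO LEN(s$)
    h$ = HEX$(ASC(s$, i))
    IF LEN(h$) < 2 THEN h$ = "0" + h$ ' pad single-digit values to two digits
    PRINT h$; " ";
NEXT
PRINT
```

For "QB64" this prints 51 42 36 34, one group per character, which is the same idea as putting a space after every 8 bits in the binary version.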
 

« Last Edit: August 11, 2021, 09:46:15 am by George McGinn »

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Re: removing tags from html
« Reply #13 on: August 11, 2021, 01:16:45 pm »
Thanks @George McGinn

Since the start of this subject about converting Text to Bin$, I have been curious as to what purpose it might be applied.

I know the conversion of numbers works on similar principles but there the reason is obvious.