Topic: Retrieving web page titles from a list of URLs? (bonus if from Excel xlsx file)

Offline madscijr

  • Seasoned Forum Regular
  • Posts: 295
    • View Profile
Some background: A looong time ago (in the days of Windows XP) I was able to do this in Excel VBA, using an IE Web browser object / control, and it worked. I think I had it working in Windows 10 a few years back, but at some point Microsoft tightened their security and started pushing Edge, so now Windows tries to open Edge whenever my code instantiates an IE browser, and none of the scripting works. The browser keeps opening a yes/no popup asking if I want to allow scripts to run, and as soon as you click Yes another one opens, endlessly, which stops the VBA from continuing because the browser object never stops being busy.

Googling about this, my impression is that I'd have to use Selenium going forward if I want to script a Web browser with VBA these days. I know nothing about Selenium and really don't have time right now to be getting into a whole new thing.

I'm sure I could find some Python examples to do this, but I am not too familiar or comfortable with Python. Besides, I just don't LIKE Python as much as BASIC! (whether vba or qb)

I'm wondering how this can be done in our beloved QB64? Given a string with a Web URL (http or https), how can we navigate to the URL, pull the contents of the <title> tag, and return that in a string?
Actually I want to do this for a bunch of URLs, stored in a text file.

Even better, read the URLs from column X on sheet Y of an Excel workbook (a native xlsx file, not CSV).

Even better, write the title back to the given row of the Excel file at column Z.

Any takers?

Any examples, pointers, guidance, help or advice would be much appreciated!
« Last Edit: September 03, 2021, 06:04:53 pm by madscijr »

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
    • View Profile
Lots of ways to do that. Here's one.

Use cURL or wget to download the webpage. You can SHELL to it in QB64.
Use OPEN ... FOR BINARY to grab the html contents of that page, and place it as one gigantic string in QB64.
Do an INSTR() search for <title> and </title> to grab the title from the title tag part of the giant string.
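
The INSTR() search in the last step can be sketched as a small helper. This is only a minimal sketch: GetTitle$ is a hypothetical name (not a QB64 built-in), and it assumes the whole page is already sitting in one string. It returns an empty string when either tag is missing, and it also copes with a <title> tag that carries attributes.

```qb64
' Main program: try the helper on a small in-memory page.
PRINT GetTitle$("<html><head><TITLE>  Hello, world </TITLE></head></html>") ' prints Hello, world

' Hypothetical helper: returns the text between <title ...> and </title>,
' or an empty string when either tag cannot be found.
FUNCTION GetTitle$ (html$)
    DIM start AS LONG, finish AS LONG
    lower$ = LCASE$(html$) ' search a lowercase copy so <TITLE> matches too
    start = INSTR(lower$, "<title")
    IF start = 0 THEN EXIT FUNCTION
    start = INSTR(start, lower$, ">") ' end of the opening tag
    IF start = 0 THEN EXIT FUNCTION
    start = start + 1 ' first character of the title text
    finish = INSTR(start, lower$, "</title")
    IF finish = 0 THEN EXIT FUNCTION
    ' Cut from the original string so the title keeps its own capitalization.
    GetTitle$ = LTRIM$(RTRIM$(MID$(html$, start, finish - start)))
END FUNCTION
```

Checking each INSTR() result for 0 before using it avoids the "Illegal function call" error that MID$ raises when it is handed a negative length.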

Pete



Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline madscijr

Thanks Pete!
I'm an ignoramus, are cURL or wget built in to Windows 10? (Guess I'll google that!)
Is there a non-shell (ie native QB64) way to retrieve the contents of a Web URL?

Thanks again, I at least now have something to try...

Offline Pete

I haven't tried Indy's Windows API FTP library routine, yet, but I imagine it would allow QB64 to do this, without using SHELL.

https://www.qb64.org/forum/index.php?topic=3263.0
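
For plain http pages there is also QB64's built-in _OPENCLIENT, which needs neither SHELL nor an external library. The following is only a rough sketch under stated assumptions: the host and path are placeholders, the server must answer over unencrypted HTTP on port 80 (this approach does not handle https, and many sites simply redirect http requests to https), and the loop gives up after roughly ten seconds.

```qb64
' Assumption: plain HTTP on port 80 only. Host and path are hypothetical placeholders.
host$ = "example.com"
path$ = "/"
crlf$ = CHR$(13) + CHR$(10)

client& = _OPENCLIENT("TCP/IP:80:" + host$)
IF client& = 0 THEN PRINT "Could not connect to "; host$: END

' A minimal HTTP/1.0 request; the blank line at the end finishes the headers.
request$ = "GET " + path$ + " HTTP/1.0" + crlf$ + "Host: " + host$ + crlf$ + crlf$
PUT #client&, , request$

page$ = ""
tries = 0
DO
    _DELAY .05
    GET #client&, , chunk$ ' returns whatever has arrived so far (may be empty)
    page$ = page$ + chunk$
    IF INSTR(LCASE$(page$), "</title>") THEN EXIT DO ' we have all we need
    tries = tries + 1
LOOP UNTIL tries > 200 ' about 10 seconds
CLOSE #client&

s& = INSTR(LCASE$(page$), "<title>")
e& = INSTR(LCASE$(page$), "</title>")
IF s& > 0 AND e& > s& THEN
    PRINT MID$(page$, s& + 7, e& - s& - 7)
ELSE
    PRINT "No title found (the site may require https)."
END IF
```

Since most sites are https these days, the SHELL approaches are probably the more practical route for a real list of URLs.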

Now in the old XP days, when Internet Explorer used to keep the cache pages in an easy to find temporary-internet folder, and under the name of the webpage, non-encrypted, you just needed to do a SHELL call to the website with IE, and then open the html file in the temporary-internet folder, to parse it. Those days, as you mentioned, are gone.

Now a Mickey Mouse way around that would be to do a SHELL to Firefox with the URL and, when the page opens, use a Windows API SENDKEY routine to simulate the following keyboard presses:

Ctrl+U opens a new tab with the html source code for the page you are viewing.
Ctrl+A to select all text.
Ctrl+C to copy text to clipboard.

Now, just parse the _CLIPBOARD$ contents in QB64. Actually, _SCREENPRINT in QB64 would do Ctrl+U, Ctrl+A and Ctrl+C...

SHELL to URL in FireFox
_SCREENPRINT CHR$(21) ' Ctrl+U
_SCREENPRINT CHR$(1) ' Ctrl + A
_SCREENPRINT CHR$(3) ' Ctrl + C
Parse _CLIPBOARD$ to find the title.

You may need a _DELAY .1 after each.

Anyway, someone might have a more elegant solution, so stay tuned.

Pete

Offline Pete

The above in code would be...

Code: QB64:
SHELL _DONTWAIT "Firefox http://www.qb64.org/wiki/SCREENPRINT"
_DELAY 5 ' Give the page time to load (adjust as needed)...
_SCREENPRINT CHR$(21) ' Ctrl+U
_DELAY 2 ' Give time for the new tab to open...
_SCREENPRINT CHR$(1) ' Ctrl + A
_SCREENPRINT CHR$(3) ' Ctrl + C
_DELAY .5 ' Give the clipboard time to fill...
a$ = _CLIPBOARD$ ' The page source copied by Ctrl + C
x$ = MID$(a$, INSTR(LCASE$(a$), "<title>") + 7)
x$ = MID$(x$, 1, INSTR(LCASE$(x$), "</title>") - 1)
PRINT x$

Offline Pete

This was new to me, so for fun, I looked into using POWERSHELL! Minor caution: it makes or overwrites a file called mypsfile.html in whatever folder you run this in. If for some weird reason you have a file by that name in your local QB64 folder, it would get overwritten by running this code.

Code: QB64:
current_dir$ = _CWD$
url$ = "http://www.qb64.org/wiki/SCREENPRINT"
output$ = current_dir$ + "\mypsfile.html"

' Single quotes keep PowerShell from splitting a URL or path that contains spaces.
x$ = "Invoke-WebRequest -Uri '" + url$ + "' -OutFile '" + output$ + "'"
PRINT "Using Windows PowerShell. Please wait. This will take a few seconds...": PRINT
SHELL _HIDE "powershell " + x$
DO
    i = i + 1
    IF _FILEEXISTS(output$) THEN
        OPEN output$ FOR BINARY AS #1
        a$ = SPACE$(LOF(1))
        GET #1, , a$
        CLOSE #1
        x$ = MID$(a$, INSTR(LCASE$(a$), "<title>") + 7)
        x$ = MID$(x$, 1, INSTR(LCASE$(x$), "</title>") - 1)
        PRINT x$
        EXIT DO
    ELSE
        ' File hasn't been downloaded yet.
    END IF
    _DELAY 6
LOOP UNTIL i = 10


Pete
« Last Edit: September 03, 2021, 07:51:20 pm by Pete »

Offline Pete

Same approach with cURL...

Again, check the file names to be sure you can use the code without overwriting a file you have, with the same name I used in these examples, mypsfile.html, mycurlfile.html, and mywgetfile.html.

Code: QB64:
curl$ = "c:\curl\bin\curl -o"
url$ = "http://www.qb64.org/wiki/SCREENPRINT"
output$ = _CWD$ + "\mycurlfile.html"

SHELL _HIDE curl$ + " " + output$ + "  " + url$

IF _FILEEXISTS(output$) THEN
    OPEN output$ FOR BINARY AS #1
    a$ = SPACE$(LOF(1))
    GET #1, , a$
    CLOSE #1
    x$ = MID$(a$, INSTR(LCASE$(a$), "<title>") + 7)
    x$ = MID$(x$, 1, INSTR(LCASE$(x$), "</title>") - 1)
    PRINT x$
ELSE
    PRINT "File not found: "; output$
END IF


or with wget...

Code: QB64:
wget$ = "c:\wget\wget -O "
url$ = "http://www.qb64.org/wiki/SCREENPRINT"
output$ = _CWD$ + "\mywgetfile.html"

SHELL _HIDE wget$ + " " + output$ + "  " + url$

IF _FILEEXISTS(output$) THEN
    OPEN output$ FOR BINARY AS #1
    a$ = SPACE$(LOF(1))
    GET #1, , a$
    CLOSE #1
    x$ = MID$(a$, INSTR(LCASE$(a$), "<title>") + 7)
    x$ = MID$(x$, 1, INSTR(LCASE$(x$), "</title>") - 1)
    PRINT x$
ELSE
    PRINT "File not found: "; output$
END IF

Pete
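
The original question was about a whole list of URLs in a text file, so here is a rough sketch of how the cURL approach might be looped. Everything named here is an assumption for illustration: urls.txt (one URL per line), titles.txt (tab-separated output), the scratch file page.tmp, and the helper GetTitle$. It also assumes a curl executable can be found on the PATH (recent Windows 10 builds ship one); otherwise put the full path in the SHELL line.

```qb64
q$ = CHR$(34) ' a double quote, for wrapping paths and URLs
listfile$ = "urls.txt" ' hypothetical input file: one URL per line
temp$ = _CWD$ + "\page.tmp" ' hypothetical scratch file, overwritten for every URL

IF _FILEEXISTS(listfile$) = 0 THEN PRINT "File not found: "; listfile$: END

OPEN listfile$ FOR INPUT AS #1
OPEN "titles.txt" FOR OUTPUT AS #2 ' hypothetical output file
DO UNTIL EOF(1)
    LINE INPUT #1, url$
    url$ = LTRIM$(RTRIM$(url$))
    IF LEN(url$) THEN
        IF _FILEEXISTS(temp$) THEN KILL temp$ ' never re-read the previous page
        ' curl flags: -s silent, -L follow redirects, -o write to a file
        SHELL _HIDE "curl -s -L -o " + q$ + temp$ + q$ + " " + q$ + url$ + q$
        title$ = ""
        IF _FILEEXISTS(temp$) THEN
            OPEN temp$ FOR BINARY AS #3
            IF LOF(3) > 0 THEN
                a$ = SPACE$(LOF(3))
                GET #3, , a$
                title$ = GetTitle$(a$)
            END IF
            CLOSE #3
        END IF
        PRINT #2, url$; CHR$(9); title$ ' URL, tab, title
        PRINT url$; " -> "; title$
    END IF
LOOP
CLOSE

' Hypothetical helper: text between <title ...> and </title>, or "" if not found.
FUNCTION GetTitle$ (html$)
    DIM start AS LONG, finish AS LONG
    lower$ = LCASE$(html$)
    start = INSTR(lower$, "<title")
    IF start = 0 THEN EXIT FUNCTION
    start = INSTR(start, lower$, ">")
    IF start = 0 THEN EXIT FUNCTION
    start = start + 1
    finish = INSTR(start, lower$, "</title")
    IF finish = 0 THEN EXIT FUNCTION
    GetTitle$ = LTRIM$(RTRIM$(MID$(html$, start, finish - start)))
END FUNCTION
```

A tab-separated titles.txt can be pasted straight into a spreadsheet, which sidesteps writing to the xlsx file directly.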

Offline madscijr

Wow, thanks for putting all the time into this, lol.

I'm not too crazy about PowerShell but if it works, why not. (In a perfect world, MS would use QB as the official Windows scripting language!)

Not at my computer so don't know if I have Firefox installed, been a long time since I used it.
But I think I'll give ALL of your methods a try when I get back. Thanks again!

Now, has anyone been able to use QB64 to read/write a native Excel XLSX file? :-D
« Last Edit: September 04, 2021, 11:45:20 am by madscijr »

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
XLSX files are nothing more than compressed ZIP files renamed.  (Same with docx, cbz, epub, and a ton of other file formats.)

Don’t believe it?  Copy one and rename the extension from *.xlsx to *.zip.

Since this trick works, and since Windows can mimic reading a zip file like a directory, you can use that to read the internal contents of the xlsx file.

Step 1: Copy a back up of the file.
Step 2: Rename the extension on the back up to zip format
Step 3: Get a directory listing of the zip file’s contents
Step 4: Open “your_backup.zip/desired_file.xml” for input…
Step 5: Do whatever you need to do with your data…
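
The steps above might be sketched in QB64 roughly as follows. This is an illustration under several assumptions, not a tested recipe: it assumes the archive has to be extracted to a real folder before QB64 can OPEN the files inside it, it uses PowerShell's Expand-Archive for the extraction (which wants a .zip extension, hence the copy), and book.xlsx, book_backup.zip and book_unzipped are hypothetical names. It only lists the text runs stored in xl\sharedStrings.xml; the cells themselves live in files such as xl\worksheets\sheet1.xml, where a cell marked t="s" holds an index into the shared strings.

```qb64
q$ = CHR$(34) ' a double quote, for wrapping paths
book$ = _CWD$ + "\book.xlsx" ' hypothetical workbook
zip$ = _CWD$ + "\book_backup.zip" ' hypothetical backup copy
folder$ = _CWD$ + "\book_unzipped" ' hypothetical extraction folder

IF _FILEEXISTS(book$) = 0 THEN PRINT "Workbook not found: "; book$: END

' Steps 1 and 2: make a backup copy with a .zip extension.
SHELL _HIDE "copy /y " + q$ + book$ + q$ + " " + q$ + zip$ + q$

' Steps 3 and 4: extract the archive so its XML files can be opened normally.
SHELL _HIDE "powershell -NoProfile -Command Expand-Archive -Path '" + zip$ + "' -DestinationPath '" + folder$ + "' -Force"

shared$ = folder$ + "\xl\sharedStrings.xml"
IF _FILEEXISTS(shared$) = 0 THEN PRINT "Not found: "; shared$: END

OPEN shared$ FOR BINARY AS #1
xml$ = SPACE$(LOF(1))
GET #1, , xml$
CLOSE #1

' Step 5: print every <t>...</t> text run (XML entities such as &amp; are left as-is).
p& = 1
DO
    s& = INSTR(p&, xml$, "<t")
    IF s& = 0 THEN EXIT DO
    c$ = MID$(xml$, s& + 2, 1)
    IF c$ = ">" OR c$ = " " THEN ' a real <t> tag, not <sst> or similar
        s& = INSTR(s&, xml$, ">") + 1 ' first character after the opening tag
        e& = INSTR(s&, xml$, "</t>")
        IF e& = 0 THEN EXIT DO
        PRINT MID$(xml$, s&, e& - s&)
        p& = e& + 4
    ELSE
        p& = s& + 2 ' some other tag that merely starts with "<t"
    END IF
LOOP
```

Writing titles back would mean editing the sheet XML and re-zipping the folder, which is considerably more work than reading.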
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline madscijr

I tried running the PowerShell code, but it didn't work.
It's not throwing any errors, and Windows Security didn't report blocking anything.
Do I need to change any Windows 10 settings or do anything else to get it to work?

Offline Pete

Just to be certain, I copied and pasted the POWERSHELL code I posted into an untitled QB64 IDE, did an F5 run, and it worked.

It took 21 seconds to return the results and complete. This is why I use cURL instead of POWERSHELL: it's so much faster.

It may be that your Win 10 does not have POWERSHELL. You could try the Windows key, search for powershell, and see if it returns a result. You could also run one line of QB64 code...

SHELL "powershell"

to see if the POWERSHELL window app opens. If not, you don't have it. (Note in my code, the _HIDE in the shell statement hides the POWERSHELL app, so you wouldn't be seeing it, anyway. It's just too ugly!) On a side note, someone did post a great text to speech routine that uses POWERSHELL. I love that one.

Pete

Offline SpriggsySpriggs

  • Forum Resident
  • Posts: 1145
  • Larger than life
    • View Profile
    • GitHub
Glad to see PowerShell getting more use these days. It wasn't too popular of a suggestion on the forum a year or two ago.
Shuwatch!