Author Topic: Wrote code that will extract the text of Wikipedia articles into an array.


Offline loudar

  • Newbie
  • Posts: 73
  • improve it bit by bit.
Wrote this little block of code that will download any Wikipedia article and extract the raw text from it. It can be used, e.g., for your own dictionaries or encyclopedias, or for processing the information. Feel free to use it, toy around with it, and update me on it :>

Figuring out how to shell out to curl was a bit of a hassle, but now it works! :D

Code: QB64:
DIM word$(10), rawline$(1000) 'explicit bounds; implicit arrays in QB64 only allow indexes 0 to 10
c = 1
word$(c) = "QB64" 'replace with the word you want to have extracted; could be used in a loop to extract more than one word
shellcmd$ = "start powershell -Command " + CHR$(34) + "curl https://en.wikipedia.org/wiki/" + word$(c) + " -o temp.txt" + CHR$(34)
SHELL _HIDE shellcmd$ 'starts PowerShell directly with the command, because curl doesn't work from the CMD

SLEEP 5 'waits for the file to be created; polling _FILEEXISTS("temp.txt") would be a better way to determine when it exists

OPEN "temp.txt" FOR INPUT AS #1
x = 0
DO
    LINE INPUT #1, textline$ 'textline$ rather than line$, since LINE is a reserved word
    IF LEFT$(textline$, 3) = "<p>" THEN 'only reads paragraphs, as this is the safest method for getting the raw text; feel free to toy around with other tags as well
        x = x + 1
        p = 0
        DO
            p = p + 1
            IF MID$(textline$, p, 1) = "<" THEN 'for now just skips every HTML tag to get clean text
                DO
                    p = p + 1
                LOOP UNTIL MID$(textline$, p, 1) = ">" OR p >= LEN(textline$) 'the length check guards against an unclosed tag running past the end of the line
            ELSE
                rawline$(x) = rawline$(x) + MID$(textline$, p, 1) 'saves the respective paragraph as a raw line, accessible with x as the paragraph count within the article
            END IF
        LOOP UNTIL p >= LEN(textline$)
        'PRINT rawline$(x) 'uncomment this line to print the raw data on the screen and check for bugs in the raw lines
        'here is the space for processing of each line
    END IF
LOOP UNTIL EOF(1)
CLOSE #1
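One thing the tag stripper above leaves behind is HTML character entities such as &amp; or &#39;, which will still appear verbatim in rawline$(). A minimal cleanup sketch for the most common ones (the helper names Replace$ and DecodeEntities$ are my own, not from the post above):

Code: QB64:
FUNCTION Replace$ (haystack$, find$, repl$)
    r$ = haystack$
    i = INSTR(r$, find$)
    DO WHILE i > 0
        r$ = LEFT$(r$, i - 1) + repl$ + MID$(r$, i + LEN(find$))
        i = INSTR(i + LEN(repl$), r$, find$) 'continue searching after the replacement
    LOOP
    Replace$ = r$
END FUNCTION

FUNCTION DecodeEntities$ (s$)
    t$ = s$
    t$ = Replace$(t$, "&lt;", "<")
    t$ = Replace$(t$, "&gt;", ">")
    t$ = Replace$(t$, "&quot;", CHR$(34))
    t$ = Replace$(t$, "&#39;", "'")
    t$ = Replace$(t$, "&amp;", "&") 'decode &amp; last so "&amp;lt;" doesn't get decoded twice
    DecodeEntities$ = t$
END FUNCTION

Calling rawline$(x) = DecodeEntities$(rawline$(x)) in the processing spot would clean each paragraph.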
Check out what I do besides coding: http://loudar.myportfolio.com/

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
This is interesting and a nice, simple start. Thanks for sharing.

Marked as best answer by loudar on May 17, 2020, 01:49:03 am

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
This is very similar to how I extracted the wiki keywords from our QB64 wiki here: https://www.qb64.org/forum/index.php?topic=756.msg6457#msg6457

If you’re not downloading secure pages (https instead of http), you shouldn’t need to shell out to PowerShell and curl to fetch the page before parsing it.  An example of using QB64 for the whole process can be viewed via the link above.
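For reference, a minimal sketch of that native approach, using _OPENCLIENT against a plain-http host (example.com is a stand-in here; Wikipedia itself redirects everything to https, which a raw TCP socket can't handle):

Code: QB64:
host$ = "example.com" 'stand-in for a plain-http site
client = _OPENCLIENT("TCP/IP:80:" + host$)
IF client = 0 THEN PRINT "Connection failed": END
crlf$ = CHR$(13) + CHR$(10)
req$ = "GET / HTTP/1.0" + crlf$ + "Host: " + host$ + crlf$ + crlf$ 'bare-bones request; the blank line ends the headers
PUT #client, , req$
page$ = ""
t! = TIMER
DO
    GET #client, , chunk$ 'pulls in whatever bytes have arrived so far
    page$ = page$ + chunk$
LOOP UNTIL TIMER - t! > 3 'crude timeout; a proper reader would honor the Content-Length header
CLOSE client
PRINT LEFT$(page$, 500) 'response headers first, then the HTML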
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
I download HTML pages into a single variable. In fact, I read the file with the binary method you posted some years back. It's a bit better than the one the IMan posted 15+ years ago at QBF. It's easy to parse after that. I haven't done wiki pages, so I don't know if I'd handle them the same way or not.
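Presumably the whole-file read Pete means is along these lines, pulling the entire page into one string with a single binary GET (a sketch, not Pete's exact code):

Code: QB64:
OPEN "temp.txt" FOR BINARY AS #1
whole$ = SPACE$(LOF(1)) 'pre-size one string to the length of the whole file
GET #1, 1, whole$ 'one read pulls the entire page in
CLOSE #1

Parsing then becomes INSTR work on whole$ instead of line-by-line reads.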

Wget has been improved upon, so I believe https is now supported. It sucks that I can't be sure, since it was well over a year ago that I last did any of this, but I think I'm right, because I did download the latest version. Before that, I stuck with cURL, because you could associate it with Firefox and easily download pages from https sites. That said, I like the use of PowerShell here. There is a whole new world of possibilities thanks to PowerShell, so it's nice to see another of its uses demoed.

On behalf of all the QB sheeple, thanks for shearing.

Pete

 
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline loudar

  • Newbie
  • Posts: 73
  • improve it bit by bit.
Quote from: SMcNeill
If you’re not downloading secure pages (https instead of http), you shouldn’t need to shell out to PowerShell and curl to fetch the page before parsing it.

I tried it without PowerShell and it didn't work, neither via QB64 nor manually in the CMD. That's why I chose to go with it; it seemed like the only, and the simplest, option! :D
Check out what I do besides coding: http://loudar.myportfolio.com/

Offline MLambert

  • Forum Regular
  • Posts: 115
Hi,
I do a lot of screen scraping and I use curl.

I just shell to it: SHELL "curl.exe etc..."

I don't understand why PowerShell is needed.
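A minimal sketch of that direct route, assuming curl.exe is on the PATH (it ships with Windows 10 1803 and later). Since SHELL waits for the command to finish, no SLEEP is needed afterwards:

Code: QB64:
word$ = "QB64"
'-s silences the progress meter, -L follows redirects, -o names the output file
SHELL _HIDE "curl.exe -s -L -o temp.txt https://en.wikipedia.org/wiki/" + word$
IF _FILEEXISTS("temp.txt") THEN PRINT "Downloaded OK" ELSE PRINT "Download failed"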

Mike