Author Topic: Wrote code that will extract the text of Wikipedia articles into an array. (Read 6695 times)

loudar · « **on:** July 27, 2019, 07:06:53 am »

Wrote this little block of code that will download any Wikipedia article and extract the raw text from it. This can be used for e.g. own dictionaries, encyclopedias or processing of information. Feel free to use and toy around with and update me on it :>

Figuring out how to shell out curl was a bit of a hassle, but now it works! :D

Code: QB64: [Select]

word$(c) = "QB64" 'replace with the word you want to have extracted, could be used in a loop to extract more than one word
shellcmd$ = "start powershell -Command " + CHR$(34) + "curl https://en.wikipedia.org/wiki/" + word$(c) + " -o temp.txt" + CHR$(34)
SHELL _HIDE shellcmd$ 'starts powershell from the CMD directly with the command because curl doesn't work from CMD
 
SLEEP 5 'waits for the file to be created, there is probably a better way to determine when the file is created
 
OPEN "temp.txt" FOR INPUT AS #1
x = 0
DO
    LINE INPUT #1, line$
    IF MID$(line$, 1, 3) = "<p>" THEN 'only reads paragraphs, as this is is the safest method for getting the raw text, feel free to toy around with other things as well
        x = x + 1
        p = 0
        DO
            p = p + 1
            IF MID$(line$, p, 1) = "<" THEN 'for now just excludes every html command to get a clean text
                DO
                    p = p + 1
                LOOP UNTIL MID$(line$, p, 1) = ">"
            ELSE
                rawline$(x) = rawline$(x) + MID$(line$, p, 1) 'saves the respective paragraph as a raw line, accessible with parameter x as count for paragraphs in an article
            END IF
        LOOP UNTIL p = LEN(line$)
        'PRINT rawline$(x) 'uncomment this line to print the raw data on the screen and check for eventual bugs in the raw lines
        'here is the space for processing of each line
    END IF
LOOP UNTIL EOF(1) = -1

bplus · « **Reply #1 on:** July 27, 2019, 09:43:17 am »

This is interesting and nice simple start, thanks for sharing.

SMcNeill · « **Reply #2 on:** July 27, 2019, 10:46:42 am »

This is very similar to how I extracted the wiki keywords from our QB64 wiki here: https://www.qb64.org/forum/index.php?topic=756.msg6457#msg6457

If you’re not downloading secure pages (https instead of http), you shouldn’t need to powershell and use curl to get the page to parse it. An example of using QB64 for the whole process can be viewed via the link above.

Pete · « **Reply #3 on:** July 27, 2019, 12:05:32 pm »

I download html pages into a single variable. In fact, I read the file with the binary method you posted so years back. It's a bit better than the one the IMan posted 15+ years ago, at QBF. Easy to parse it after that. I haven't done Wiki pages, so I don't know if I'd handle them the same way or not.

Wget has been improved upon, so I believe https is now supported. It sucks I can't be sure, because it was well over a year ago I did any of this stuff, but I think I'm correct, because I downloaded the latest version, wheres before, I stuck to using cURL, because you could associate it with Firefox and easily download pages from https sites; however, I like the use of Powershell here. There is a whole new world of possibilities thanks to Powershell, so it's nice to see another one of its uses demoed.

On behalf of all the QB sheeple, thanks for shearing.

Pete

loudar · « **Reply #4 on:** July 28, 2019, 08:25:30 am »

Quote from: SMcNeill on July 27, 2019, 10:46:42 am

If you’re not downloading secure pages (https instead of http), you shouldn’t need to powershell and use curl to get the page to parse it.

I tried it without powershell and it didn't work, neither via QB or manually in the CMD. That's why I chose to go with it, seemed like the only and the simplest option! :D

MLambert · « **Reply #5 on:** August 12, 2019, 11:52:37 pm »

Hi,
I do a lot of screen scraping and I use curl.

I just shell to it .... shell "curl.exe etc..."

I don't understand why powershell is needed.

Mike

News:

Author Topic: Wrote code that will extract the text of Wikipedia articles into an array. (Read 6695 times)

loudar

Wrote code that will extract the text of Wikipedia articles into an array.

bplus

Re: Wrote code that will extract the text of Wikipedia articles into an array.

SMcNeill

Re: Wrote code that will extract the text of Wikipedia articles into an array.

Pete

Re: Wrote code that will extract the text of Wikipedia articles into an array.

loudar

Re: Wrote code that will extract the text of Wikipedia articles into an array.

MLambert

Re: Wrote code that will extract the text of Wikipedia articles into an array.