Author Topic: Converting .pdf to plain text (Read 5800 times)

Richard · « **on:** December 11, 2020, 11:14:40 pm »

I recently downloaded a PDF from Microsoft (5000+ pages) and found it quite difficult to use: the document has no page numbers, and when viewing it in MS Edge I could not search for keywords.

Is there any existing code I could use as a shortcut to generate a PLAIN TEXT file from the PDF, which I could then simply read (cut, paste, etc.) in NOTEPAD or similar?

Alternatively, a free offline converter (that works) would be OK. Many Google searches turned up dead links or programs that did not work for me.

Pete · « **Reply #1 on:** December 11, 2020, 11:28:55 pm »

SHELL the pdf file open.

Use _SCREENPRINT CHR$(1) to select all the pdf text.

Use _SCREENPRINT CHR$(3) to copy the pdf text.

Do an OPEN "temp.tmp" FOR BINARY AS #1.

PUT _CLIPBOARD$ to the file.

Close the file.

SHELL the temp.tmp file in Notepad.

--------------------------------------------------

Pete

SMcNeill · « **Reply #2 on:** December 11, 2020, 11:31:52 pm »

If you have Calibre, it converts between various formats easily, and can do pdf to txt.

https://calibre-ebook.com/

If you don’t have it... it’s free. ;)
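For anyone who would rather drive the conversion from a program than from the Calibre GUI: Calibre also installs a command-line tool, ebook-convert, which takes an input file and an output file. A minimal sketch of calling it from QB64, assuming ebook-convert is on the PATH; both file paths are placeholders, not anything from this thread's setup.

```qb64
' Sketch only: run Calibre's ebook-convert from QB64.
' Assumes ebook-convert is on the PATH; both paths are placeholders.
src$ = "A:\a.pdf"
dst$ = "A:\a_calibre.txt"
q$ = CHR$(34) ' double quote, so paths containing spaces survive

' Remove an older result so the check below reflects this run only
IF _FILEEXISTS(dst$) THEN KILL dst$

' SHELL without _DONTWAIT blocks until the converter exits
SHELL "ebook-convert " + q$ + src$ + q$ + " " + q$ + dst$ + q$

IF _FILEEXISTS(dst$) THEN
    PRINT "Created "; dst$
ELSE
    PRINT "No output file was created"
END IF
```

The output format is picked from the extension of the output file name, so a .txt name asks for plain text.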

Richard · « **Reply #3 on:** December 12, 2020, 01:53:18 am »

@Pete

Thanks for your code. Below is my working code (based on your coding)

Code: QB64:

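SCREEN _NEWIMAGE(640, 480, 32)
delay2% = 15 ' seconds between steps: 15 on the first run, 5 was enough on reruns
_DELAY delay2% ' initial pause before launching the viewer

SHELL _DONTWAIT "A:\a.pdf" ' opens up in MS EDGE latest version
_DELAY delay2% ' wait for the pdf to finish loading

s& = _SCREENIMAGE: _SOURCE s& ' desktop snapshot (image handles are LONG); not used by the conversion itself

_SCREENPRINT CHR$(1) ' Ctrl A select all
_DELAY delay2% ' wait until the whole document is highlighted
_SCREENPRINT CHR$(3) ' Ctrl C copy to clipboard
_DELAY delay2% ' wait for the copy to complete

a$ = _CLIPBOARD$ ' read the clipboard directly instead of pasting with _SCREENPRINT CHR$(22) (Ctrl V)
IF _FILEEXISTS("a:\a.txt") THEN KILL "a:\a.txt" ' BINARY mode does not truncate, so remove any older, longer file first
OPEN "a:\a.txt" FOR BINARY AS #1
PUT #1, , a$
CLOSE #1
END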

I needed to have delays.

For my MS pdf document (A.pdf = 5616 pages = 29,209,536 bytes), the first time the above program runs it needs delay2% = 15 to convert correctly; on subsequent reruns delay2% = 5 works OK. The converted file (A.txt) is 5,547,404 bytes for 135,049 text lines. A.pdf opened up automatically in MS EDGE (latest version), and I experimented with delay2% until it was visible that the select all was highlighting all the text. (I had divided up the display into 4 quadrants - QB64 IDE - MS EDGE - File explorer - +1)

EDIT: Is there any way to determine when Windows has finished, say, highlighting (select all), etc., rather than me experimenting with different delay times (or using unnecessarily long ones)? No error messages pop up when the delay is too short; I just get a totally wrong result (whatever the clipboard last held).

Now to try out Steve(tm)'s solution.

Pete · « **Reply #4 on:** December 12, 2020, 02:28:06 am »

I thought about telling you you'll need delays, but I figured if you could put the pseudocode together, you'd figure it out, and you did. ;)

As for a way to tell when it's finished, this is a trick I thought of off the top of my head...

Make a loop.

In the loop, poll the _CLIPBOARD$, something like a = LEN(_CLIPBOARD$), and compare it with olda, the length from the previous pass (set olda = a at the end of each pass). Fiddle with some small delays in the loop and when a = olda, you know the complete text has been loaded and copied, and the loop can be exited. Of course olda has to be > 0 so it doesn't just exit at the start if the first polling hasn't even copied any text yet.
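A minimal sketch of that polling loop, assuming the pdf viewer already has focus and the text is already selected. The variable names, the half-second poll interval and the 60 second timeout are all made up for illustration and untested against a 5000+ page pdf. Clearing the clipboard first is an extra step not mentioned above; it is there so that leftover text from an earlier copy cannot be mistaken for the new one.

```qb64
' Sketch only: wait for the clipboard length to stop changing
' instead of using a fixed _DELAY. All values below are guesses.
pollDelay! = 0.5 ' seconds between polls
timeout! = 60 ' give up after this many seconds

_CLIPBOARD$ = "" ' assumption: clear stale text before copying
_SCREENPRINT CHR$(3) ' Ctrl C in the focused pdf viewer

olda& = 0 ' length seen on the previous pass
waited! = 0
settled% = 0 ' becomes -1 (true) once the length repeats
DO
    _DELAY pollDelay!
    waited! = waited! + pollDelay!
    a& = LEN(_CLIPBOARD$)
    ' same non-zero length on two passes in a row: treat the copy as done
    IF a& > 0 AND a& = olda& THEN settled% = -1: EXIT DO
    olda& = a&
LOOP UNTIL waited! >= timeout!

IF settled% THEN
    PRINT "Clipboard settled at"; a&; "bytes"
ELSE
    PRINT "Timed out waiting for the clipboard"
END IF
```

The separate settled% flag is needed because olda& is updated at the end of every pass, so comparing a& with olda& after the loop would not tell a real match from a timeout.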

Pete

Richard · « **Reply #5 on:** December 12, 2020, 03:35:15 am »

@Steve (tm)

Thanks for your suggestion regarding Calibre, which I have now downloaded and installed.

With my MS .pdf (A.pdf above), Calibre seems to take many minutes to convert, and of about 5 attempts to convert to plain text, only attempt #2 produced an output file.

Comparison Pete(tm) vs Steve(tm) for A.pdf (29,209,536 bytes):

Pete(tm): 5,547,404 bytes (135,049 lines)

Steve(tm): 6,228,362 bytes (352,272 lines)

In the Calibre output, every alternate line is a BLANK LINE.
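If the blank lines get in the way, they can be filtered out afterwards. A minimal sketch, with hypothetical file names, that copies a text file and drops every line that is empty or contains only spaces. Note that it removes all blank lines, so any genuine paragraph breaks disappear too.

```qb64
' Sketch only: copy a text file, dropping empty or space-only lines.
' Both file names are hypothetical.
inFile$ = "a:\a_calibre.txt"
outFile$ = "a:\a_calibre_noblank.txt"

OPEN inFile$ FOR INPUT AS #1
OPEN outFile$ FOR OUTPUT AS #2
kept& = 0: dropped& = 0
DO UNTIL EOF(1)
    LINE INPUT #1, l$
    IF LEN(LTRIM$(RTRIM$(l$))) > 0 THEN
        PRINT #2, l$ ' keep lines that have visible text
        kept& = kept& + 1
    ELSE
        dropped& = dropped& + 1
    END IF
LOOP
CLOSE #1, #2
PRINT kept&; "lines kept,"; dropped&; "blank lines dropped"
```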

Steve, could you find out whether there is a pdf size limit beyond which Calibre becomes unstable? I think my example might be too big for Calibre!
