Topic: Converting .pdf to plain text


Offline Richard

  • Seasoned Forum Regular
  • Posts: 364
Converting .pdf to plain text
« on: December 11, 2020, 11:14:40 pm »
I recently downloaded a PDF from Microsoft (5000+ pages) and found it quite difficult to use: the document has no page numbers, and when viewing it in MS Edge I could not search for keywords.

Is there any existing code I could use, to save writing a program from scratch, that generates a PLAIN TEXT file from the PDF, which I could then simply open in, say, NOTEPAD to read (cut / paste etc.)?

Alternatively, a free offline converter (that works) would be OK. Many Google searches turned up dead links or programs that did not work for me.

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
Re: Converting .pdf to plain text
« Reply #1 on: December 11, 2020, 11:28:55 pm »
SHELL the pdf file open.

Use _SCREENPRINT CHR$(1) (Ctrl+A) to select all the pdf text.

Use _SCREENPRINT CHR$(3) (Ctrl+C) to copy the selected text to the clipboard.

Do an OPEN "temp.tmp" FOR BINARY AS #1.

PUT _CLIPBOARD$ to the file.

Close the file.

SHELL the temp.tmp file in Notepad.

--------------------------------------------------

Pete
Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
Re: Converting .pdf to plain text
« Reply #2 on: December 11, 2020, 11:31:52 pm »
If you have Calibre, it converts between various formats easily, and can do pdf to txt.

https://calibre-ebook.com/

If you don’t have it...  it’s free.  ;)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Richard

  • Seasoned Forum Regular
  • Posts: 364
Re: Converting .pdf to plain text
« Reply #3 on: December 12, 2020, 01:53:18 am »
@Pete

Thanks for your code. Below is my working code (based on yours).

Code: QB64:
SCREEN _NEWIMAGE(640, 480, 32)
delay2% = 15 ' seconds to wait between steps
_DELAY delay2%

SHELL _DONTWAIT "A:\a.pdf" ' opens up in MS EDGE latest version
_DELAY delay2%

_SCREENPRINT CHR$(1) ' Ctrl A select all
_DELAY delay2%
_SCREENPRINT CHR$(3) ' Ctrl C copy to clipboard
_DELAY delay2%

OPEN "a:\a.txt" FOR BINARY AS #1
a$ = _CLIPBOARD$ ' read the copied text directly (instead of _SCREENPRINT CHR$(22), Ctrl V)
PUT #1, , a$ ' write the text to the file
CLOSE #1

I needed to have delays.

For my MS pdf document (A.pdf = 5616 pages = 29,209,536 bytes), the first run of the above program needs delay2% = 15 to convert correctly; on subsequent reruns delay2% = 5 works OK. The converted file (A.txt) is 5,547,404 bytes and 135,049 text lines. A.pdf opened automatically in MS EDGE (latest version), and I experimented with delay2% until I could see that the select all was highlighting all the text. (I had divided up the display into 4 quadrants - QB64 IDE - MS EDGE - File explorer - +1)


EDIT: Is there any way to determine when Windows has finished, say, highlighting (select all), etc., rather than experimenting with different delay times (or using unnecessarily long ones)? No error messages pop up; I just get a totally wrong result (the last remembered clipboard contents).

Now to try out the Steve(tm) solution.
« Last Edit: December 12, 2020, 02:06:55 am by Richard »

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
Re: Converting .pdf to plain text
« Reply #4 on: December 12, 2020, 02:28:06 am »
I thought about telling you that you'd need delays, but I figured if you could put the pseudocode together, you'd figure it out, and you did. ;)

As for a way to tell when it's finished, this is a trick I thought of off the top of my head...

Make a loop.

In the loop, poll _CLIPBOARD$: something like olda = a, then a = LEN(_CLIPBOARD$), so olda always holds the length from the previous pass. Fiddle with some small delays in the loop, and when a = olda, you know the complete text has been loaded and copied, and the loop can be exited. Of course olda has to be > 0 so it doesn't just exit at the start if the first polling hasn't even copied any text yet.
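
A minimal sketch of that loop, not tested against the 5616-page pdf. It assumes the clipboard is emptied before the copy (an addition that is not in Richard's program, so that stale text is not mistaken for the new copy) and that polling every half second for up to about a minute is acceptable. The variable names, the poll interval and the retry limit are all illustrative. It would stand in for the _SCREENPRINT CHR$(3) line and the _DELAY that follows it:

```qb64
' Sketch only: wait for the clipboard to fill instead of using a fixed delay.
_CLIPBOARD$ = "" ' assumption: empty the clipboard so old text is not picked up
_SCREENPRINT CHR$(3) ' Ctrl C copy to clipboard

olda& = 0 ' length seen on the previous poll
tries% = 0
DO
    _DELAY .5 ' small delay between polls (illustrative value)
    a& = LEN(_CLIPBOARD$)
    ' exit once the length is non-zero and unchanged since the last poll
    IF a& > 0 AND a& = olda& THEN EXIT DO
    olda& = a&
    tries% = tries% + 1
LOOP UNTIL tries% >= 120 ' give up after roughly 60 seconds

IF a& = 0 THEN PRINT "Clipboard is still empty - the copy did not complete." ELSE PRINT a&; "bytes on the clipboard."
```

This only detects when the copy has landed on the clipboard. It cannot tell whether the select all had finished before Ctrl C was sent, so the delay before the copy is still needed.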


Offline Richard

  • Seasoned Forum Regular
  • Posts: 364
Re: Converting .pdf to plain text
« Reply #5 on: December 12, 2020, 03:35:15 am »
@Steve (tm)

Thanks for your suggestion regarding calibre - which I now have downloaded and installed.

With my MS .pdf (A.pdf above), calibre seems to take many minutes to convert, and of about 5 attempts to convert to plain text, only attempt #2 produced an output file.

Comparison Pete(tm) vs Steve(tm) for A.pdf (29,209,536 bytes):

Pete(tm):  5,547,404 bytes (135,049 lines)
Steve(tm): 6,228,362 bytes (352,272 lines)

The calibre output has every alternate line as a BLANK LINE.

Steve, could you establish whether there is a pdf size limit above which calibre becomes unstable? I think my example might be too big for calibre!