QB64.org Forum

Active Forums => QB64 Discussion => Topic started by: MLambert on February 21, 2020, 04:18:14 am

Title: Is there some way to speed up reading a text file
Post by: MLambert on February 21, 2020, 04:18:14 am
Hi,

Is there some way to increase the speed of an INPUT #1, A$, B$ etc.?

Maybe increase the read buffer size?

Thks,

Mike
Title: Re: Is there some way to speed up reading a text file
Post by: TerryRitchie on February 21, 2020, 05:36:45 am
See this thread

https://www.qb64.org/forum/index.php?topic=2183.0
Title: Re: Is there some way to speed up reading a text file
Post by: FellippeHeitor on February 21, 2020, 06:56:17 am
Hmm, no. Last I checked, binary mode would speed up LINE INPUT reads, but still not allow INPUT reads, as those remained exclusive to INPUT mode. I could be wrong.
Title: Re: Is there some way to speed up reading a text file
Post by: SMcNeill on February 21, 2020, 07:31:30 am
Fastest way is always to just read the whole file at once and then parse it.

OPEN "yourfile.txt" FOR BINARY AS #1
text$ = SPACE$(LOF(1))
GET #1, , text$
CLOSE

'Then parse text$ using appropriate CRLF and comma separators.
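
For example, a minimal sketch of that parsing step (the comma-separated record layout is an assumption for illustration; quoted or embedded commas are not handled):

Code: QB64:
' Walk text$ record by record (CRLF-terminated), then field by field (comma-separated).
CRLF$ = CHR$(13) + CHR$(10)
recstart& = 1
DO
    recend& = INSTR(recstart&, text$, CRLF$)
    IF recend& = 0 THEN EXIT DO ' a final record with no trailing CRLF is ignored here
    record$ = MID$(text$, recstart&, recend& - recstart&)
    recstart& = recend& + 2
    fstart& = 1
    DO
        fend& = INSTR(fstart&, record$, ",")
        IF fend& = 0 THEN field$ = MID$(record$, fstart&) ELSE field$ = MID$(record$, fstart&, fend& - fstart&)
        ' ... use field$ here ...
        fstart& = fend& + 1
    LOOP UNTIL fend& = 0
LOOP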
Title: Re: Is there some way to speed up reading a text file
Post by: Pete on February 21, 2020, 10:28:25 am
The thread Terry posted has some examples and explanation. What Steve posted is what I use to load the entire contents of a file all at once. I think he came up with that one a couple of years back. It's great for loading HTML pages. Anyway, if you do load the entire contents, be aware, as Steve pointed out, of the line-control characters. Specifically, all stored text lines terminate in CHR$(13) + CHR$(10). So if I were loading an entire text file into my word processor app, I might want to parse out those characters. Something like...

DO UNTIL INSTR(a$, CHR$(13) + CHR$(10)) = 0
    a$ = MID$(a$, 1, INSTR(a$, CHR$(13) + CHR$(10)) - 1) + MID$(a$, INSTR(a$, CHR$(13) + CHR$(10)) + 2)
LOOP

Now my a$ variable is free of those control characters.

However, if you want to use those characters to read lines, it would go something like this...

Code: QB64:
' You will need to make and name a text file "tmp.tmp" in your local QB64 directory to run this example.
IF NOT _FILEEXISTS("tmp.tmp") THEN PRINT "File not found.": END
OPEN "tmp.tmp" FOR BINARY AS #1
x$ = SPACE$(LOF(1))
GET #1, 1, x$
CLOSE #1

DO UNTIL INSTR(x$, CHR$(13) + CHR$(10)) = 0
    a$ = MID$(x$, 1, INSTR(x$, CHR$(13) + CHR$(10)) - 1)
    x$ = MID$(x$, LEN(a$) + 3)
    PRINT a$
LOOP

Parse out,

Pete

Title: Re: Is there some way to speed up reading a text file
Post by: TerryRitchie on February 21, 2020, 01:42:29 pm
My thinking was the poster could read the entire file in, then parse it out as needed.
Title: Re: Is there some way to speed up reading a text file
Post by: Pete on February 21, 2020, 02:16:50 pm
Ah Terry, that's what Steve and I are talking about. Am I missing something here?

Anyway, QB64 BINARY LINE INPUT is so fast, I really cannot see any appreciable time difference between using it and loading the entire file and then parsing it out. Unless you want something towards the end of a very large file; then, sure, that's faster.

Pete
Title: Re: Is there some way to speed up reading a text file
Post by: bplus on February 21, 2020, 03:03:53 pm
I would use BINARY LINE INPUT unless I had to parse other stuff too, in which case I would use this:
Code: QB64:
FUNCTION fLineCnt (txtFile$, arr() AS STRING)
    DIM filecount%, b$
    filecount% = 0
    IF _FILEEXISTS(txtFile$) THEN
        OPEN txtFile$ FOR BINARY AS #1
        b$ = SPACE$(LOF(1))
        GET #1, , b$
        CLOSE #1
        REDIM _PRESERVE arr(1 TO 1) AS STRING
        Split b$, CHR$(13) + CHR$(10), arr()
        filecount% = UBOUND(arr)
    END IF
    fLineCnt = filecount% 'this function returns the number of lines loaded, 0 means the file did not exist
END FUNCTION

'notes: REDIM the array(0) to be loaded before calling Split '<<<< IMPORTANT: a dynamic, empty array; can use any LBOUND though
'This SUB will take a given N-delimited string and delimiter$ and create an array of N+1 strings, using the LBOUND of the given dynamic array to load.
'notes: the loadMeArray() needs to be a dynamic string array; Split will not change the LBOUND of the array it is given. rev 2019-08-27
SUB Split (SplitMeString AS STRING, delim AS STRING, loadMeArray() AS STRING)
    DIM curpos AS LONG, arrpos AS LONG, LD AS LONG, dpos AS LONG 'use the LBOUND the array already has
    curpos = 1: arrpos = LBOUND(loadMeArray): LD = LEN(delim)
    dpos = INSTR(curpos, SplitMeString, delim)
    DO UNTIL dpos = 0
        loadMeArray(arrpos) = MID$(SplitMeString, curpos, dpos - curpos)
        arrpos = arrpos + 1
        IF arrpos > UBOUND(loadMeArray) THEN REDIM _PRESERVE loadMeArray(LBOUND(loadMeArray) TO UBOUND(loadMeArray) + 1000) AS STRING
        curpos = dpos + LD
        dpos = INSTR(curpos, SplitMeString, delim)
    LOOP
    loadMeArray(arrpos) = MID$(SplitMeString, curpos)
    REDIM _PRESERVE loadMeArray(LBOUND(loadMeArray) TO arrpos) AS STRING 'get the UBOUND correct
END SUB
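
A quick usage sketch for the pair above ("tmp.tmp" is just an example name). One caveat: filecount% is an INTEGER, so the returned count would overflow past 32767 lines.

Code: QB64:
' Load every line of "tmp.tmp" into lines$() and print them.
REDIM lines$(0) ' must be a dynamic, empty string array before the call
n% = fLineCnt("tmp.tmp", lines$())
PRINT n%; "lines loaded"
IF n% > 0 THEN
    FOR i% = 1 TO n%
        PRINT lines$(i%)
    NEXT
END IF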
Title: Re: Is there some way to speed up reading a text file
Post by: MLambert on February 25, 2020, 05:50:06 am
Thks everyone for the input.

Loading the file into memory is impractical as there are millions of transactions.

Now, reading the file as binary is interesting, but I would then have to break down each record into 400+ fields. I know that this is memory work, but I don't know if I would gain any time here.

I thought that maybe there was a way to increase the input buffer size of the input file.

Mike
Title: Re: Is there some way to speed up reading a text file
Post by: bplus on February 25, 2020, 02:07:03 pm
Are the records fixed length?
Title: Re: Is there some way to speed up reading a text file
Post by: Pete on February 25, 2020, 03:16:24 pm
Jumping ahead to what BPlus, I think, is thinking... Why not remake this file into a RANDOM ACCESS file? At least you can index those within your program. Going to an indexed point is a lot faster than sifting through a file record by record from the start.

Are we clear, though, that OPEN "myfile" FOR BINARY AS #1 works like OPEN "myfile" FOR INPUT AS #1, except that FOR BINARY reads records with LINE INPUT #1 much faster than FOR INPUT does? In QBasic, we never could do LINE INPUT # with FOR BINARY. That is a special addition in QB64. It simply makes sequential file reading much, much faster than the traditional OPEN FOR INPUT QBasic reading method.
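
In other words, the pattern is simply this (a sketch, with an assumed file name):

Code: QB64:
' LINE INPUT against a file opened FOR BINARY: QB64 reads this far faster
' than the same loop against a file opened FOR INPUT.
OPEN "myfile" FOR BINARY AS #1
DO UNTIL EOF(1)
    LINE INPUT #1, a$
    ' ... examine a$ here ...
LOOP
CLOSE #1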

Also, instead of all at once, you could load the file in chunks. Something like a$ = SPACE$(1000000): GET #1, 1, a$ ... parse it, and then GET #1, 1000001, a$ ... etc.
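
A rough sketch of that chunked approach (the file name and chunk size are placeholders; note that a record can straddle two chunks, which Steve's buffering scheme further down handles):

Code: QB64:
' Read a file in fixed-size pieces instead of all at once.
CONST CHUNK = 1000000
DIM fpos AS _INTEGER64 ' byte position; may exceed the LONG range on huge files
OPEN "myfile" FOR BINARY AS #1
fpos = 1
DO WHILE fpos <= LOF(1)
    n& = CHUNK
    IF fpos + n& - 1 > LOF(1) THEN n& = LOF(1) - fpos + 1 ' short final chunk
    a$ = SPACE$(n&)
    GET #1, fpos, a$
    ' ... parse a$ here; beware of records split across chunk borders ...
    fpos = fpos + n&
LOOP
CLOSE #1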

Pete
Title: Re: Is there some way to speed up reading a text file
Post by: MLambert on February 25, 2020, 06:15:57 pm
The records are variable length.

I understand about random access ... but the files need to be sorted and then processed sequentially: batch processing with key control breaks.

I wrote my own 'database' logic with random access, but because of updating and deleting of the data this became too hard with the volume of data to be processed, so I now use MySQL for that part of the processing.

In regard to binary reads, my question is: do I save processing time, given all of the string manipulation I must perform to unpack the data into variable-length fields?

Reading the data in blocks would have to be in control-key block lengths, and these blocks may be 2 records or 2,000,000 records.

It is a statistical application that needs to process vast amounts of data to produce results. For example, from 4 fields I produce 1,300 different calculations.

Each record read may have 150 of these 4-field groupings.

By the way, I have used C++ and QB64 beats it hands down. When I have a year or two, I will try assembler.

Thanks,

Mike
Title: Re: Is there some way to speed up reading a text file
Post by: Pete on February 25, 2020, 10:21:58 pm
If I had a gun to my head and had to decide on the spot, I'd use OPEN "myfile" FOR BINARY AS #1: LINE INPUT #1, a$ ... and check each record for what I was after. I really don't think loading in chunks and parsing them, as complex as this issue appears to be, would be any faster than using this QB64 BINARY file-reading method.

Oh, looky thar at me aveetar. I has two guns to my head already!

 - Sam
Title: Re: Is there some way to speed up reading a text file
Post by: EricE on February 26, 2020, 01:26:08 am
We need some quantitative data.
Here is a rough program that reads a text file into memory and then searches for CR/LF pairs in order to find the lines it contains.
Then the disk file is opened and the LINE INPUT statement is used to read each line it contains.

The text file used is "War and Peace", 3359548 bytes in size, containing 66055 lines of text.

On my computer I got the following results.
Reading the file into memory takes so little time it cannot be measured using the TIMER function (0 seconds duration).
Reading all the lines when the file is in memory required only 0.055 seconds.
Reading all the lines when the file is on disk using the LINE INPUT statement required 15.820 seconds.

Code: QB64:
' "War and Peace" test
' "http://www.gutenberg.org/files/2600/2600-0.txt"

file$ = "2600-0.txt"
CRLF$ = CHR$(13) + CHR$(10)

'---- read the whole file into memory
starttime! = TIMER
fin% = FREEFILE
OPEN file$ FOR BINARY AS fin%
filesize& = LOF(fin%)
FileBuffer$ = SPACE$(filesize&)
GET fin%, , FileBuffer$
CLOSE fin%
endtime! = TIMER
PRINT "READING INTO MEMORY", filesize&, endtime! - starttime!

'---- find lines by scanning the in-memory buffer for CRLF
linecount& = 0
bytecount& = 0
starttime! = TIMER
WHILE bytecount& < filesize&
    CrlfPos& = INSTR(bytecount& + 1, FileBuffer$, CRLF$)
    fileline$ = MID$(FileBuffer$, bytecount& + 1, CrlfPos& - bytecount& - 1)
    ' PRINT fileline$
    linecount& = linecount& + 1
    bytecount& = CrlfPos& + 1
WEND
endtime! = TIMER

PRINT "FILE IN MEMORY", bytecount&, linecount&, endtime! - starttime!

'---- read lines from disk with LINE INPUT
fin% = FREEFILE
OPEN file$ FOR INPUT AS fin%
linecount& = 0
bytecount& = 0
starttime! = TIMER
DO UNTIL EOF(fin%)
    LINE INPUT #fin%, fileline$ 'read entire text file line
    linecount& = linecount& + 1
    bytecount& = bytecount& + LEN(fileline$) + 2 ' include ending CR,LF characters
LOOP
endtime! = TIMER
CLOSE fin%
PRINT "FILE LINE INPUT", bytecount&, linecount&, endtime! - starttime!
'----

Title: Re: Is there some way to speed up reading a text file
Post by: MLambert on February 26, 2020, 03:16:37 am
Thks again for the help.

In regard to binary reads: no one has answered my concern about the time spent unpacking the variables and extracting the data, compared to the 'normal' INPUT #1, A$, B$ etc. That would help me decide whether the binary read is worthwhile.

Also, as previously explained, my files are huge and cannot be read into memory ... say 3,000,000 records at maybe 600 characters long is a lot of memory. If I use virtual memory then I am up for page swapping etc., and again I ask the question: how does this compare in processing speed?

I appreciate the input ... but maybe someone who wrote the QB64 code can tell me if I can increase the read buffer size?

Thks all,

Mike

Title: Re: Is there some way to speed up reading a text file
Post by: EricE on February 26, 2020, 04:14:07 am
Hi MLambert,

As for the internal implementation of QB64's read buffer, I will leave that to the QB64 developers.

A very large file does not need to be completely loaded into memory; only chunks of it need be read into a buffer for processing.
The application program would implement its own read buffer.
3,000,000 records of 600 characters each should be no problem for a QB64 program to handle.
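
For illustration, an application-managed read buffer might look something like this (a sketch, not from the original post: the 1 MB refill size is arbitrary, the STATIC buffer limits it to one file at a time, and the caller must decide when to stop, since an empty return could be a blank line or end of file):

Code: QB64:
' Hands back one CRLF-terminated line per call, refilling a private
' buffer from disk in large GETs rather than reading byte by byte.
FUNCTION BufLine$ (fnum%)
    STATIC buf$
    STATIC nextpos AS _INTEGER64
    DIM remain AS _INTEGER64
    IF nextpos = 0 THEN nextpos = 1 ' first call
    DO
        p& = INSTR(buf$, CHR$(13) + CHR$(10))
        IF p& THEN ' a complete line is buffered: hand it back
            BufLine$ = LEFT$(buf$, p& - 1)
            buf$ = MID$(buf$, p& + 2)
            EXIT FUNCTION
        END IF
        remain = LOF(fnum%) - nextpos + 1
        IF remain <= 0 THEN ' disk exhausted: return whatever is left
            BufLine$ = buf$
            buf$ = ""
            EXIT FUNCTION
        END IF
        n& = 1048576 ' refill with up to 1 MB per GET
        IF remain < n& THEN n& = remain
        chunk$ = SPACE$(n&)
        GET fnum%, nextpos, chunk$
        nextpos = nextpos + n&
        buf$ = buf$ + chunk$
    LOOP
END FUNCTION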

The actual processing would depend on the format of the data to be read: the A$ and B$ in your "INPUT #1, A$, B$".

In my opinion a solution might be very close here if we have a little more information about the data to be processed.

But it will be good to get input from the QB64 developers as well.
 
Title: Re: Is there some way to speed up reading a text file
Post by: SMcNeill on February 26, 2020, 09:43:20 am
Since these are variable-length fields, in varying numbers per record, of extremely unpredictable size, here's what I'd do:

1) Open the file for binary and read what I consider a reasonable buffer. 100,000,000 bytes is a good size, if you're dealing with GB totals...

2) Use _INSTRREV to find the last CRLF in that 100 MB buffer. This is the point where you break off parsing and where the next pass's buffer begins, so you don't split a record.

3) Parse the data using commas and CRLF characters as the delimiters.

4) Once parsing up to that last CRLF position is done, repeat the process from the position you found in step 2, until the whole multi-GB file is handled (a sketch follows below).
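
Along those lines, a rough sketch of the four steps (the file name and buffer size are placeholders, and it assumes every chunk contains at least one CRLF):

Code: QB64:
' Chunked read of a huge file, breaking each pass at its last complete line.
CONST BUFSIZE = 100000000 ' 100 MB per pass
DIM fpos AS _INTEGER64, flen AS _INTEGER64
CRLF$ = CHR$(13) + CHR$(10)
OPEN "hugefile.txt" FOR BINARY AS #1
flen = LOF(1)
fpos = 1
DO WHILE fpos <= flen
    n& = BUFSIZE
    IF fpos + n& - 1 > flen THEN n& = flen - fpos + 1 ' short final chunk
    buf$ = SPACE$(n&)
    GET #1, fpos, buf$
    IF fpos + n& - 1 < flen THEN ' not the last pass: cut at the final CRLF
        buf$ = LEFT$(buf$, _INSTRREV(buf$, CRLF$) + 1)
    END IF
    ' ... parse buf$ on CRLF and comma delimiters here ...
    fpos = fpos + LEN(buf$) ' resume right after the last complete line
LOOP
CLOSE #1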



INPUT # reads sequentially from the disk a single byte at a time. It's SLOOOOOOOOOW. Binary files read at the size of your disk clusters/sectors (usually 4096+ bytes per pass nowadays), so read times are a FRACTION of using INPUT #. Parsing from memory is much faster than parsing from disk, and will do the job in a fraction of the time.

Personally, I don't see why you can't read the whole file in one go and then process it. 3,000,000 records of 600 characters is 1.8 GB, and most machines can handle that readily. I have a few applications which use 22 GB of RAM to load extensive datasets all at once into memory, and have never had an issue with them on 32 GB of total system RAM... As long as your PC has enough memory, I don't see why you'd have any problem just loading in one go and then parsing.
Title: Re: Is there some way to speed up reading a text file
Post by: MLambert on February 28, 2020, 04:48:42 am
I don't read into memory because one day it will blow the limits.

The rest of the advice sounds great and I will test the binary read.

Thanks again everyone.

Mike