Author Topic: Is there someway to speed up reading a text file  (Read 5559 times)


Offline MLambert

  • Forum Regular
  • Posts: 115
Is there someway to speed up reading a text file
« on: February 21, 2020, 04:18:14 am »
Hi,

Is there some way to increase the speed of INPUT #1, A$, B$, etc.?

Maybe increase the read buffer size ?

Thks,

Mike

Offline TerryRitchie

  • Seasoned Forum Regular
  • Posts: 495
  • Semper Fidelis
Re: Is there someway to speed up reading a text file
« Reply #1 on: February 21, 2020, 05:36:45 am »
In order to understand recursion, one must first understand recursion.

FellippeHeitor

  • Guest
Re: Is there someway to speed up reading a text file
« Reply #2 on: February 21, 2020, 06:56:17 am »
Hmm, no. Last I checked, binary mode would speed up LINE INPUT reads, but still not allow INPUT reads, as those remained exclusive to INPUT mode. I could be wrong.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • Steve’s QB64 Archive Forum
Re: Is there someway to speed up reading a text file
« Reply #3 on: February 21, 2020, 07:31:30 am »
Fastest way is always to just read the whole file at once and then parse it.

OPEN "yourfile.txt" FOR BINARY AS #1
text$ = SPACE$(LOF(1))
GET #1, , text$
CLOSE

'Then parse text$ using appropriate CRLF and comma separators.
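A minimal sketch of that parse step (assuming comma-delimited records terminated by CRLF; the field handling is left as a comment) could look like:

Code: QB64:
crlf$ = CHR$(13) + CHR$(10)
start& = 1
DO
    p& = INSTR(start&, text$, crlf$)
    IF p& = 0 THEN EXIT DO 'no more complete records
    record$ = MID$(text$, start&, p& - start&)
    'split record$ on commas here, using INSTR in the same style
    start& = p& + 2 'skip past the CR and LF
LOOP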
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
Re: Is there someway to speed up reading a text file
« Reply #4 on: February 21, 2020, 10:28:25 am »
The thread Terry posted has some examples and explanation. What Steve posted is what I use to load the entire contents of a file all at once; I think he came up with that one a couple of years back. It's great for loading HTML pages. Anyway, if you do load the entire contents, be aware, as Steve pointed out, of the line control characters. Specifically, all stored text lines terminate in CHR$(13) + CHR$(10). So if I were loading an entire text file into my word processor app, I might want to parse out those characters. Something like...

DO UNTIL INSTR(a$, CHR$(13) + CHR$(10)) = 0
    a$ = MID$(a$, 1, INSTR(a$, CHR$(13) + CHR$(10)) - 1) + MID$(a$, INSTR(a$, CHR$(13) + CHR$(10)) + 2)
LOOP

Now my a$ variable is free of those control characters.

However, if you want to use those characters to read lines, it would go something like this...

Code: QB64:
' You will need to create a text file named "tmp.tmp" in your local QB64 directory to run this example.
IF NOT _FILEEXISTS("tmp.tmp") THEN PRINT "File not found.": END
OPEN "tmp.tmp" FOR BINARY AS #1
x$ = SPACE$(LOF(1))
GET #1, 1, x$
CLOSE #1

DO UNTIL INSTR(x$, CHR$(13) + CHR$(10)) = 0
    a$ = MID$(x$, 1, INSTR(x$, CHR$(13) + CHR$(10)) - 1)
    x$ = MID$(x$, LEN(a$) + 3)
    PRINT a$
LOOP

Parse out,

Pete

Want to learn how to write code on cave walls? https://www.tapatalk.com/groups/qbasic/qbasic-f1/

Offline TerryRitchie

  • Seasoned Forum Regular
  • Posts: 495
  • Semper Fidelis
Re: Is there someway to speed up reading a text file
« Reply #5 on: February 21, 2020, 01:42:29 pm »
My thinking was the poster could read the entire file in then parse it out as needed.

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
Re: Is there someway to speed up reading a text file
« Reply #6 on: February 21, 2020, 02:16:50 pm »
Ah Terry, that's what Steve was talking about. Am I missing something here?

Anyway, QB64's BINARY LINE INPUT is so fast that I really can't see any appreciable time difference between using it and loading the entire file and then parsing it out. Unless you want something toward the end of a very large file; then, sure, the load-it-all approach is faster.

Pete

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
Re: Is there someway to speed up reading a text file
« Reply #7 on: February 21, 2020, 03:03:53 pm »
I would use BINARY LINE INPUT unless I had to parse other stuff too; then I would use this:
Code: QB64:
FUNCTION fLineCnt (txtFile$, arr() AS STRING)
    DIM filecount%, b$
    filecount% = 0
    IF _FILEEXISTS(txtFile$) THEN
        OPEN txtFile$ FOR BINARY AS #1
        b$ = SPACE$(LOF(1))
        GET #1, , b$
        CLOSE #1
        REDIM _PRESERVE arr(1 TO 1) AS STRING
        Split b$, CHR$(13) + CHR$(10), arr()
        filecount% = UBOUND(arr)
    END IF
    fLineCnt = filecount% 'this function returns the number of lines loaded; 0 means the file did not exist
END FUNCTION

'notes: REDIM the array to be loaded before calling Split '<<<< IMPORTANT: must be a dynamic, empty array; any LBOUND works
'This SUB takes a given N-delimited string and delimiter$ and creates an array of N+1 strings, using the LBOUND of the given dynamic array to load.
'notes: loadMeArray() needs to be a dynamic string array; the SUB will not change the LBOUND of the array it is given. rev 2019-08-27
SUB Split (SplitMeString AS STRING, delim AS STRING, loadMeArray() AS STRING)
    DIM curpos AS LONG, arrpos AS LONG, LD AS LONG, dpos AS LONG 'fix: use the LBOUND the array already has
    curpos = 1: arrpos = LBOUND(loadMeArray): LD = LEN(delim)
    dpos = INSTR(curpos, SplitMeString, delim)
    DO UNTIL dpos = 0
        loadMeArray(arrpos) = MID$(SplitMeString, curpos, dpos - curpos)
        arrpos = arrpos + 1
        IF arrpos > UBOUND(loadMeArray) THEN REDIM _PRESERVE loadMeArray(LBOUND(loadMeArray) TO UBOUND(loadMeArray) + 1000) AS STRING
        curpos = dpos + LD
        dpos = INSTR(curpos, SplitMeString, delim)
    LOOP
    loadMeArray(arrpos) = MID$(SplitMeString, curpos)
    REDIM _PRESERVE loadMeArray(LBOUND(loadMeArray) TO arrpos) AS STRING 'get the UBOUND correct
END SUB

Offline MLambert

  • Forum Regular
  • Posts: 115
Re: Is there someway to speed up reading a text file
« Reply #8 on: February 25, 2020, 05:50:06 am »
Thks everyone for the input.

Loading the file into memory is impracticable as there are millions of transactions.

Now, reading the file as binary is interesting, but I would then have to break down each record into 400+ fields, and while I know that is in-memory work, I don't know if I would gain any time here.

I thought that maybe there was a way to increase the input buffer size of the input file.

Mike

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Is there someway to speed up reading a text file
« Reply #9 on: February 25, 2020, 02:07:03 pm »
Are records fixed length?

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
Re: Is there someway to speed up reading a text file
« Reply #10 on: February 25, 2020, 03:16:24 pm »
Jumping ahead to what BPlus, I think, is thinking... Why not remake this file into a RANDOM ACCESS file? At least you can index those, within your program. Going to an indexed point is a lot faster than sifting through a file record by record from the start.

Are we clear, though, that OPEN "myfile" FOR BINARY AS #1 works the same as OPEN "myfile" FOR INPUT AS #1, except that FOR BINARY reads records much faster with LINE INPUT #1 than FOR INPUT does? In QBasic, we could never use LINE INPUT # with FOR BINARY; that is a special addition in QB64. It simply makes sequential file reading much, much faster than the traditional OPEN FOR INPUT QBasic reading method.
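As a minimal sketch of that pattern (the file name is a placeholder):

Code: QB64:
OPEN "myfile.txt" FOR BINARY AS #1
DO UNTIL EOF(1)
    LINE INPUT #1, record$ 'same statement as in INPUT mode, but much faster in BINARY mode
    'process record$ here
LOOP
CLOSE #1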

Also, you could load the file in chunks. Something like a$ = SPACE$(1000000): GET #1, 1, a$ ... parse it, and then GET #1, 1000001, a$ ... etc.
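A rough sketch of that chunked approach (chunk size and file name are arbitrary; note that a record can straddle a chunk boundary, which the parser would have to handle):

Code: QB64:
CONST CHUNK = 1000000
OPEN "myfile.txt" FOR BINARY AS #1
pos& = 1
DO WHILE pos& <= LOF(1)
    n& = LOF(1) - pos& + 1 'bytes remaining
    IF n& > CHUNK THEN n& = CHUNK
    a$ = SPACE$(n&)
    GET #1, pos&, a$ 'read the next chunk starting at pos&
    'parse a$ here; carry any partial record at the end over to the next chunk
    pos& = pos& + n&
LOOP
CLOSE #1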

Pete

Offline MLambert

  • Forum Regular
  • Posts: 115
Re: Is there someway to speed up reading a text file
« Reply #11 on: February 25, 2020, 06:15:57 pm »
The records are variable length.

I understand about random access ... but the files need to be sorted and then processed sequentially.... batch processing with key control breaks.

I wrote my own 'database' logic with random accesses but because of updating and deleting of the data this became too hard with the volume of data to be processed so I now use mysql for that part of the processing.

In regards to binary reads, my question is: do I save processing time, given all of the string manipulation I must perform to unpack the data into variable-length fields?

Reading the data in blocks would have to be in control-key block lengths, and these blocks may be 2 records or 2,000,000 records.

It is a statistical application that needs to process vast amounts of data to produce results. For example, from 4 fields I produce 1300 different calculations.

Each record read may have 150 of these 4 field groupings.

By the way I have used C++ and QB64 beats it hands down. When I have a year or two I will try assembler.

Thanks,

Mike

Offline Pete

  • Forum Resident
  • Posts: 2361
  • Cuz I sez so, varmint!
Re: Is there someway to speed up reading a text file
« Reply #12 on: February 25, 2020, 10:21:58 pm »
If I had a gun to my head, and had to decide on the spot, I'd use OPEN "myfile" FOR BINARY AS #1: LINE INPUT #1, a$ ... and check each record for what I was after. I really don't think loading in chunks and parsing them, as complex as this issue appears to be, would be any faster than using this QB64 BINARY file reading method.

Oh, looky thar at me aveetar. I has two guns to my head already!

 - Sam

Offline EricE

  • Forum Regular
  • Posts: 114
Re: Is there someway to speed up reading a text file
« Reply #13 on: February 26, 2020, 01:26:08 am »
We need some quantitative data.
Here is a rough program that reads a text file into memory and then searches for CR/LF pairs to find the lines it contains.
Then the disk file is opened and the LINE INPUT function is used to read each line it contains.

The text file used is "War and Peace" and is of size 3359548 bytes. There are 66055 lines of text contained in this file.

On my computer I got the following results.
Reading the file into memory takes so little time it cannot be measured using the TIMER function (0 seconds duration).
Reading all the lines when the file is in memory required only 0.055 seconds.
Reading all the lines when the file is on disk using the LINE INPUT function required 15.820 seconds.

Code: QB64:
' "War and Peace" test
' "http://www.gutenberg.org/files/2600/2600-0.txt"

file$ = "2600-0.txt"
CRLF$ = CHR$(13) + CHR$(10)

'----
starttime! = TIMER
fin% = FREEFILE
OPEN file$ FOR BINARY AS fin%
filesize& = LOF(fin%)
FileBuffer$ = SPACE$(filesize&)
GET fin%, , FileBuffer$
CLOSE fin%
endtime! = TIMER
PRINT "READING INTO MEMORY", filesize&, endtime! - starttime!

'----
linecount& = 0
bytecount& = 0
starttime! = TIMER
WHILE bytecount& < filesize&
    CrlfPos& = INSTR(bytecount& + 1, FileBuffer$, CRLF$)
    fileline$ = MID$(FileBuffer$, bytecount& + 1, CrlfPos& - bytecount& - 1)
    ' PRINT fileline$
    linecount& = linecount& + 1
    bytecount& = CrlfPos& + 1
WEND
endtime! = TIMER

PRINT "FILE IN MEMORY", bytecount&, linecount&, endtime! - starttime!

'----
fin% = FREEFILE
OPEN file$ FOR INPUT AS fin%
linecount& = 0
bytecount& = 0
starttime! = TIMER
DO UNTIL EOF(fin%)
    LINE INPUT #fin%, fileline$ 'read an entire text file line
    linecount& = linecount& + 1
    bytecount& = bytecount& + LEN(fileline$) + 2 'include ending CR,LF characters
LOOP
endtime! = TIMER
CLOSE fin%
PRINT "FILE LINE INPUT", bytecount&, linecount&, endtime! - starttime!
'----


Offline MLambert

  • Forum Regular
  • Posts: 115
Re: Is there someway to speed up reading a text file
« Reply #14 on: February 26, 2020, 03:16:37 am »
Thks again for the help.

In regards to binary reads: no one has answered my concern about the time spent unpacking the variables and extracting the data, compared to the 'normal' input of INPUT #1, A$, B$, etc., which would help me decide whether the binary read is worthwhile looking at.

Also, as previously explained, my files are huge and cannot be read into memory: say 3,000,000 records at maybe 600 characters long is a lot of memory. If I use virtual memory then I am up for page swapping, etc., and again I ask the question: how does this compare in processing speed?

I appreciate the input ... but maybe someone who wrote the QB64 code can tell me if I can increase the read buffer size ?

Thks all,

Mike