Topic: Looking for a speedy way to search through the contents of text files

hanness · « **on:** February 13, 2020, 07:56:32 pm »

As a part of a program that I have written, I parse through a number of log files looking for errors. I know that the error messages that I am looking for will always contain the word "Error" starting at the 22nd character of the line.

Up until now, the way I handle this is that I open the log file, read a line of text, check to see if the word "Error" appears starting at the 22nd character of the line, then proceed to the next line of text. I repeat this process until I reach the end of the file.

The problem with this method is that it is slow. The log files I am searching can be 40MB+ each and I may have to parse 25+ log files.

My question: Rather than reading in one line at a time, is there any way to search the entire file at once in QB64? My thinking is that I would search the entire file, then if I find the word "Error" I would check to see if this is located starting at the 22nd character of the line. My hope is that if such a method exists, it might be faster than reading in one line at a time.

FellippeHeitor · « **Reply #1 on:** February 13, 2020, 08:16:34 pm »

Maybe not exactly what you're after, but here are a couple of tips:
1- Are you opening that file FOR INPUT? You can keep using LINE INPUT but open the file FOR BINARY in the OPEN line. That'll speed things up greatly.

-= OR =-

2- You can read the whole file at once like this:

Code: QB64:

OPEN "myBigFile.txt" FOR BINARY AS #1
a$ = SPACE$(LOF(1))
GET #1, 1, a$
CLOSE #1

Then you can parse it, like this:

Code: QB64:

foundError = INSTR(a$, "Error")
IF foundError THEN
    'At this point, the variable foundError contains the first occurrence, if any, of the word "Error" in the file.
    DO
        'Do something with the data, probably using LEFT$, RIGHT$, MID$
        'then find the next occurrence of "Error":
        foundError = INSTR(foundError + 1, a$, "Error") 
    LOOP WHILE foundError > 0 'the loop instruction will only go back to the beginning of the block if foundError > 0, that is, if we still have "Error" in this file
END IF

Pete · « **Reply #2 on:** February 13, 2020, 10:10:31 pm »

I'd say Fell and I use the same system, so rather than post what I do, which would just be different variable names, I'll try to explain a bit what is happening here.

First, the entire file gets loaded into memory with a BINARY file read. To load the whole file at once, we find the length of the file, LOF(1), as used in Fell's example, and fill a variable with that many spaces: a$ = SPACE$(LOF(1)). If you are not familiar with the GET statement, it is asking to start at the first byte in the file, the "1" in the middle, and it grabs as many bytes as defined by the length of the variable, which in this case means the entire file. So those SPACEs are no longer spaces, now they are your complete file contents.

To parse that mother, we need to step through it and find each instance of the search term, "Error". The INSTR() function has this cool seed feature. Although most of the time we see INSTR() with 2 parameters, the variable followed by the character or characters to search for, the optional seed as the first parameter tells INSTR() where in the string to start searching. Fell seeds it with the previous find +1, so the search starts one character past the "E" of the last "Error", essentially reading the string (file) as rror... (and all the rest of the stuff) until the next "Error" term is found in the loop. This continues until there is no "Error" left past the seed. At that point INSTR() returns zero, and therefore the loop is exited.
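To see the seed in action without touching a file, here is a minimal sketch that runs the same loop over a short in-memory string. The sample text and the variable names hit& and search$ are made up for illustration; they are not from Fell's example.

```qb64
' Made-up sample text standing in for the contents of a loaded file.
a$ = "Error one, then another Error, and a final Error."
search$ = "Error"

hit& = INSTR(a$, search$) ' two-parameter form: search from the start
DO WHILE hit& > 0
    PRINT "Found at position"; hit&
    ' three-parameter form: resume one character past the previous find
    hit& = INSTR(hit& + 1, a$, search$)
LOOP
```

Seeding with hit& + 1 rather than hit& matters: starting again at the same position would find the same "Error" forever.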

Happy parsing!

Pete

hanness · « **Reply #3 on:** February 13, 2020, 11:38:33 pm »

Excellent! Thanks for the suggestions. I'm going to try these to see what kind of results I get.

hanness · « **Reply #4 on:** February 14, 2020, 01:22:23 am »

Looking at the responses I received, I think I have a framework for figuring this out. The one monkey wrench in the works is that the word "Error" can appear in other places in the file, but I want only occurrences where that word appears at precisely the 22nd character of any line of text.

Looking at the file in detail I can see that each line of text ends with ASCII 13 10 (I guess that's basically carriage return / line feed). So that will let me break up the huge string into individual lines which I can then parse.

Do you really think that opening it for BINARY and doing all this processing would still be faster than opening for INPUT and doing individual LINE INPUT statements?
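The line-splitting idea above could be sketched roughly like this, walking the big string by index instead of cutting it apart. This is only an illustration: the sample text (including its timestamp layout) is invented, and lineStart&, lineEnd& and lineNumber& are hypothetical names. It assumes every line ends in CHR$(13) + CHR$(10), as observed in the log files.

```qb64
crlf$ = CHR$(13) + CHR$(10)

' Invented sample standing in for a file loaded with GET.
' "Error" starts at character 22 on the first line only.
a$ = "2020-02-13 19:56:32  Error: disk full" + crlf$
a$ = a$ + "2020-02-13 19:56:33  Info: no Error here" + crlf$
a$ = a$ + "short line" + crlf$

lineStart& = 1
lineNumber& = 0
DO WHILE lineStart& <= LEN(a$)
    lineNumber& = lineNumber& + 1
    lineEnd& = INSTR(lineStart&, a$, crlf$)
    IF lineEnd& = 0 THEN lineEnd& = LEN(a$) + 1 ' last line has no CR/LF
    ' Only test lines long enough to hold "Error" at characters 22 to 26.
    IF lineEnd& - lineStart& >= 26 THEN
        IF MID$(a$, lineStart& + 21, 5) = "Error" THEN
            PRINT lineNumber&; MID$(a$, lineStart&, lineEnd& - lineStart&)
        END IF
    END IF
    lineStart& = lineEnd& + 2 ' skip past the CR/LF pair
LOOP
```

Because the loop only moves two indexes and never rebuilds a$, the work per line stays small even when the string holds a 40MB file.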

SMcNeill · « **Reply #5 on:** February 14, 2020, 01:37:04 am »

Quote from: hanness on February 14, 2020, 01:22:23 am

Looking at the responses I received, I think I have a framework for figuring this out. The one monkey wrench in the works is that the word "Error" can appear in other places in the file, but I want only occurrences where that word appears at precisely the 22nd character of any line of text.

Looking at the file in detail I can see that each line of text ends with ASCII 13 10 (I guess that's basically carriage return / line feed). So that will let me break up the huge string into individual lines which I can then parse.

Do you really think that opening it for BINARY and doing all this processing would still be faster than opening for INPUT and doing individual LINE INPUT statements?

Open it for binary with Line Input.

OPEN "datafile.txt" FOR BINARY AS #1
DO UNTIL EOF(1)
    LINE INPUT #1, text$
    IF MID$(text$, 22, 5) = "Error" THEN
        'Do your error stuff
    END IF
LOOP
CLOSE #1

Files opened for INPUT read a single byte at a time to allow synchronized READ/WRITE access. Files opened for BINARY read by disk sector size (usually 512 bytes at a time), so they require MUCH fewer disk reads and perform several hundred times faster.
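If you'd rather measure the difference on your own logs than take it on faith, a rough timing sketch could look like the one below. The file name is a placeholder and hits& and started! are illustrative names. The file must already exist, since OPEN ... FOR INPUT raises an error on a missing file. Treat the numbers as a ballpark only: the second pass may be helped by the operating system's file cache, and plain TIMER has limited resolution.

```qb64
' Placeholder file name; point this at one of your own log files.
logFile$ = "datafile.txt"

' Pass 1: classic FOR INPUT with LINE INPUT.
hits& = 0
started! = TIMER
OPEN logFile$ FOR INPUT AS #1
DO UNTIL EOF(1)
    LINE INPUT #1, text$
    IF MID$(text$, 22, 5) = "Error" THEN hits& = hits& + 1
LOOP
CLOSE #1
PRINT "FOR INPUT: "; hits&; "hits in"; TIMER - started!; "seconds"

' Pass 2: same loop, but the file is opened FOR BINARY.
hits& = 0
started! = TIMER
OPEN logFile$ FOR BINARY AS #1
DO UNTIL EOF(1)
    LINE INPUT #1, text$
    IF MID$(text$, 22, 5) = "Error" THEN hits& = hits& + 1
LOOP
CLOSE #1
PRINT "FOR BINARY:"; hits&; "hits in"; TIMER - started!; "seconds"
```

Both passes should report the same hit count; if they differ, the comparison is not measuring the same work.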

RhoSigma · « **Reply #6 on:** February 14, 2020, 01:40:21 am »

If you want to go a bit deeper, your problem sounds like a perfect application for using this: https://www.qb64.org/forum/index.php?topic=2101.msg113328#msg113328

Pete · « **Reply #7 on:** February 14, 2020, 02:17:02 am »

QB64 has remarkable speed when opening a file for BINARY, even if you read it line by line, like in Steve's example. Now you could parse out the eol characters CHR$(13) + CHR$(10) and get the same results with something like this, too....

CAUTION This demo makes a file named tmpx123.tmp in the directory you run this program in. On the rare chance you already have a file named tmpx123.tmp in that folder DO NOT RUN THIS PROGRAM until you change the file name or your file will be overwritten with this test data.

Code: QB64:

a$(1) = "This is a test of a file parsing algorithm."
a$(2) = "If the word......... Error #1"
a$(3) = "This is the second.. Error #2"
a$(4) = "Here is some more text on a line."
a$(5) = "That concludes this simple presentation"
a$(6) = "To search the word.. Error #3 at position 22 of any line."
OPEN "tmpx123.tmp" FOR OUTPUT AS #1
FOR i = 1 TO 6
    PRINT #1, a$(i)
NEXT
CLOSE #1
 
' Now we can see if the search term "Error" is at position 22 of any line, even when loading the entire file at once.
ff = FREEFILE
OPEN "tmpx123.tmp" FOR BINARY AS #ff
x$ = SPACE$(LOF(ff))
GET #ff, 1, x$
CLOSE #ff
 
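i = 0 ' Variable just to keep track of which line has the search term in it...
search$ = "Error"
DO WHILE LEN(x$) > 0 ' Stop once the whole file has been consumed.
    i = i + 1
    eol& = INSTR(x$, CHR$(13) + CHR$(10))
    IF eol& = 0 THEN eol& = LEN(x$) + 1 ' The last line may have no trailing CR/LF.
    a$ = LEFT$(x$, eol& - 1) ' Current line, without its line ending.
    x$ = MID$(x$, eol& + 2) ' Drop the line and its CR/LF from the front of the buffer.
    IF MID$(a$, 22, LEN(search$)) = search$ THEN PRINT i, a$ ' Looks for the word "Error" 22 characters in on any line.
LOOP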

Pete
