Author Topic: Fastest way to compare two files?  (Read 3945 times)

Offline Cobalt

  • QB64 Developer
  • Forum Resident
  • Posts: 878
  • At 60 I become highly radioactive!
    • View Profile
Fastest way to compare two files?
« on: September 20, 2018, 02:57:21 pm »
Anybody out there have a fast way to compare two files?
I have tried reading larger blocks into a string (a$ = SPACE$(256): GET #5, , a$) and comparing those, but that was actually slower than comparing the files one LONG integer at a time. The fastest I've found at the moment is using _INTEGER64s to read 8 bytes at a time when comparing larger files, but it still takes a long time.
Granted after becoming radioactive I only have a half-life!

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Fastest way to compare two files?
« Reply #1 on: September 20, 2018, 03:48:58 pm »
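OPEN file1$ FOR BINARY AS #1
OPEN file2$ FOR BINARY AS #2
t1$ = SPACE$(LOF(1)) ' buffers sized to hold each whole file
t2$ = SPACE$(LOF(2))
GET #1, 1, t1$ ' read each file in one go, starting at byte 1
GET #2, 1, t2$
CLOSE #1, #2

IF t1$ = t2$ THEN PRINT "It's an EXACT duplicate" ELSE PRINT "It's not..."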


For long files, I'd simply compare LOF(1) and LOF(2) for a pretest.  If they're not a match then there's no reason to compare the contents.
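
For files too big to hold in memory twice, the same idea can be done block by block. This is only a rough sketch, not a tested or timed routine: FilesIdentical%, CHUNK and the two file names are made-up names for illustration, the 1 MB block size is an arbitrary guess, and it assumes both files already exist (OPEN ... FOR BINARY would create a missing file).

```qb64
' Sketch: LOF pretest first, then compare the contents block by block.
' The file names below are hypothetical placeholders.
IF FilesIdentical%("first.bin", "second.bin") THEN
    PRINT "It's an EXACT duplicate"
ELSE
    PRINT "It's not..."
END IF
END

FUNCTION FilesIdentical% (file1$, file2$)
    CONST CHUNK = 1048576 ' bytes per read; arbitrary, tune as needed
    DIM remaining AS _INTEGER64, n AS LONG
    DIM f1 AS INTEGER, f2 AS INTEGER, same AS INTEGER

    f1 = FREEFILE
    OPEN file1$ FOR BINARY AS #f1
    f2 = FREEFILE ' asked for only after f1 is in use
    OPEN file2$ FOR BINARY AS #f2

    same = (LOF(f1) = LOF(f2)) ' pretest: -1 if the sizes match, 0 if not
    remaining = LOF(f1)
    DO WHILE same AND remaining > 0
        IF remaining < CHUNK THEN n = remaining ELSE n = CHUNK
        a$ = SPACE$(n)
        b$ = SPACE$(n)
        GET #f1, , a$ ' sequential reads continue where the last one ended
        GET #f2, , b$
        IF a$ <> b$ THEN same = 0 ' stop at the first differing block
        remaining = remaining - n
    LOOP

    CLOSE #f1, #f2
    FilesIdentical% = same
END FUNCTION
```

The loop bails out at the first block that differs, so two files that diverge early never get read to the end.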
« Last Edit: September 20, 2018, 03:50:41 pm by SMcNeill »
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
    • View Profile
Re: Fastest way to compare two files?
« Reply #2 on: September 20, 2018, 04:08:27 pm »
Steve's way seems to be the fastest, just from my logical understanding (not tested). However, if you'd rather go with a chunked test (e.g. _INTEGER64), then besides the test "IF aa&& = bb&&" you could also try "xx&& = aa&& XOR bb&&" and then check xx&& for zero: if it is zero, the chunk is identical; if it's not zero, it's different. No idea if "=" is faster than "XOR" or vice versa, it's just a thought.
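
Roughly what the XOR idea could look like over whole files, as an untested sketch. The file names are placeholders, and the handling of the 0 to 7 leftover bytes at the end is my own assumption about how to finish off files whose size is not a multiple of 8.

```qb64
' Sketch: compare two files 8 bytes at a time using XOR.
DIM aa AS _INTEGER64, bb AS _INTEGER64
DIM ta AS _UNSIGNED _BYTE, tb AS _UNSIGNED _BYTE
DIM whole AS _INTEGER64, tail AS INTEGER, same AS INTEGER

OPEN "first.bin" FOR BINARY AS #1 ' hypothetical file names
OPEN "second.bin" FOR BINARY AS #2

same = (LOF(1) = LOF(2)) ' different sizes can never be an exact match
whole = LOF(1) \ 8 ' number of complete 8-byte chunks
tail = LOF(1) MOD 8 ' leftover bytes after the last full chunk

DO WHILE same AND whole > 0
    GET #1, , aa
    GET #2, , bb
    IF (aa XOR bb) <> 0 THEN same = 0 ' any set bit means the chunks differ
    whole = whole - 1
LOOP

DO WHILE same AND tail > 0 ' finish the last few bytes one at a time
    GET #1, , ta
    GET #2, , tb
    IF ta <> tb THEN same = 0
    tail = tail - 1
LOOP

CLOSE #1, #2
IF same THEN PRINT "It's an EXACT duplicate" ELSE PRINT "It's not..."
```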
My Projects:   https://qb64forum.alephc.xyz/index.php?topic=809
GuiTools - A graphic UI framework (can do multiple UI forms/windows in one program)
Libraries - ImageProcess, StringBuffers (virt. files), MD5/SHA2-Hash, LZW etc.
Bonus - Blankers, QB64/Notepad++ setup pack

Offline TempodiBasic

  • Forum Resident
  • Posts: 1792
    • View Profile
Re: Fastest way to compare two files?
« Reply #3 on: September 20, 2018, 05:02:56 pm »
Hi Cobalt

I agree with Steve's idea, so the first comparison is on the length of the files; the second, IMHO, is on the creation date (DD-MM-YYYY) and, if it is important, the hours and minutes of creation; the third step is to compare the internal contents.

so in pseudocode
Code: QB64:
' Pseudocode: the date/hours creation helpers stand in for DLL calls
OPEN file1$ FOR BINARY AS #1
OPEN file2$ FOR BINARY AS #2
Length1 = LOF(1)
Length2 = LOF(2)
IF Length1 = Length2 THEN
    IF date_creation_File1 (DLL function) = date_creation_File2 (DLL function) THEN
        IF Hours_creation_File1 (DLL function) = Hours_creation_File2 (DLL function) THEN
            EXIT function/procedure/subroutine
        ELSE
            ' here we compare the contents of the two files....
            ' Following Steve's way
            t1$ = SPACE$(Length1)
            t2$ = SPACE$(Length2)
            GET #1, 1, t1$
            GET #2, 1, t2$
            IF t1$ = t2$ THEN PRINT "It's an EXACT duplicate" ELSE PRINT "It's not..."

            ' OR
            ' RhoSigma's way
            ' here you imagine that aa&& and bb&& are the contents of the two files to compare
            xx&& = aa&& XOR bb&&
            IF xx&& = 0 THEN PRINT "It's an EXACT duplicate" ELSE PRINT "It's not..."
        END IF
    END IF
END IF

As for the DLL calls to get the time and date of file creation, here is a sample from the QB64 wiki that can be adapted to this goal:

Code: QB64:
' File Times
' ----------

CONST GENERIC_READ = -&H80000000
CONST GENERIC_WRITE = &H40000000
CONST FILE_SHARE_READ = &H1
CONST FILE_SHARE_WRITE = &H2
CONST OPEN_EXISTING = &H3
CONST INVALID_HANDLE_VALUE = -1

' Windows only: all of these functions live in kernel32.dll
DECLARE DYNAMIC LIBRARY "kernel32"
  FUNCTION CreateFileA%& (BYVAL lpFileName AS _OFFSET, BYVAL dwDesiredAccess AS _UNSIGNED LONG, BYVAL dwShareMode AS _UNSIGNED LONG, BYVAL lpSecurityAttributes AS _OFFSET, BYVAL dwCreationDisposition AS _UNSIGNED LONG, BYVAL dwFlagsAndAttributes AS _UNSIGNED LONG, BYVAL hTemplateFile AS _OFFSET)
  FUNCTION CloseHandle& (BYVAL hObject AS _OFFSET)
  FUNCTION GetFileTime& (BYVAL hFile AS _OFFSET, BYVAL lpCreationTime AS _OFFSET, BYVAL lpLastAccessTime AS _OFFSET, BYVAL lpLastWriteTime AS _OFFSET)
  FUNCTION SetFileTime& (BYVAL hFile AS _OFFSET, BYVAL lpCreationTime AS _OFFSET, BYVAL lpLastAccessTime AS _OFFSET, BYVAL lpLastWriteTime AS _OFFSET)
  FUNCTION FileTimeToLocalFileTime& (BYVAL lpFileTime AS _OFFSET, BYVAL lpLocalFileTime AS _OFFSET)
  FUNCTION LocalFileTimeToFileTime& (BYVAL lpLocalFileTime AS _OFFSET, BYVAL lpFileTime AS _OFFSET)
  FUNCTION FileTimeToSystemTime& (BYVAL lpFileTime AS _OFFSET, BYVAL lpSystemTime AS _OFFSET)
  FUNCTION SystemTimeToFileTime& (BYVAL lpSystemTime AS _OFFSET, BYVAL lpFileTime AS _OFFSET)
  FUNCTION GetLastError& ()
END DECLARE

TYPE FILETIME
  dwLowDateTime AS _UNSIGNED LONG
  dwHighDateTime AS _UNSIGNED LONG
END TYPE

' Field order must match the Windows SYSTEMTIME structure
TYPE SYSTEMTIME
  wYear AS _UNSIGNED INTEGER
  wMonth AS _UNSIGNED INTEGER
  wDayOfWeek AS _UNSIGNED INTEGER
  wDay AS _UNSIGNED INTEGER
  wHour AS _UNSIGNED INTEGER
  wMinute AS _UNSIGNED INTEGER
  wSecond AS _UNSIGNED INTEGER
  wMilliseconds AS _UNSIGNED INTEGER
END TYPE

DIM CreateDate AS FILETIME
DIM ModifyDate AS FILETIME
DIM AccessDate AS FILETIME

DIM systime AS SYSTEMTIME

DIM FileName AS STRING
DIM FileHandle AS _OFFSET

FileName = "readme.txt" + CHR$(0) '<<<<<< Existing file in QB64 folder. Use existing file path!

FileHandle = CreateFileA%&(_OFFSET(FileName), GENERIC_READ, FILE_SHARE_READ OR FILE_SHARE_WRITE, 0, OPEN_EXISTING, 0, 0)
IF FileHandle <> INVALID_HANDLE_VALUE THEN
  ' Argument order is creation, last access, last write (see the declaration above)
  IF GetFileTime&(FileHandle, _OFFSET(CreateDate), _OFFSET(AccessDate), _OFFSET(ModifyDate)) THEN
    PRINT HEX$(CreateDate.dwLowDateTime) + HEX$(CreateDate.dwHighDateTime)
    PRINT HEX$(ModifyDate.dwLowDateTime) + HEX$(ModifyDate.dwHighDateTime)
    PRINT HEX$(AccessDate.dwLowDateTime) + HEX$(AccessDate.dwHighDateTime)
    PRINT
    IF FileTimeToSystemTime&(_OFFSET(CreateDate), _OFFSET(systime)) THEN
      PRINT "Creation time, in GMT, in decimal:"
      PRINT "Year:"; systime.wYear
      PRINT "Month:"; systime.wMonth, "("; MID$("JanFebMarAprMayJunJulAugSepOctNovDec", (systime.wMonth * 3) - 2, 3); ")"
      PRINT "DayOfWeek:"; systime.wDayOfWeek, "("; MID$("SunMonTueWedThuFriSat", (systime.wDayOfWeek * 3) + 1, 3); ")"
      PRINT "Day"; systime.wDay
      PRINT "Hour"; systime.wHour
      PRINT "Minute"; systime.wMinute
      PRINT "Second"; systime.wSecond
      PRINT "Milliseconds"; systime.wMilliseconds
    ELSE
      PRINT "FileTimeToSystemTime failed. Error: 0x" + LCASE$(HEX$(GetLastError&))
    END IF
  ELSE
    PRINT "GetFileTime failed. Error: 0x" + LCASE$(HEX$(GetLastError&))
  END IF
  IF CloseHandle&(FileHandle) = 0 THEN
    PRINT "CloseHandle failed. Error: 0x" + LCASE$(HEX$(GetLastError&))
    END
  END IF
ELSE
  PRINT "CreateFileA failed. Error: 0x" + LCASE$(HEX$(GetLastError&))
  END
END IF

' Code courtesy of Michael Calkins

or here https://docs.microsoft.com/it-it/dotnet/api/system.io.file.getcreationtime?redirectedfrom=MSDN&view=netframework-4.7.2#System_IO_File_GetCreationTime_System_String_

I hope it is useful
Programming isn't difficult, only it's  consuming time and coffee

Offline Cobalt

  • QB64 Developer
  • Forum Resident
  • Posts: 878
  • At 60 I become highly radioactive!
    • View Profile
Re: Fastest way to compare two files?
« Reply #4 on: September 20, 2018, 07:50:38 pm »
Quote
OPEN file1$ FOR BINARY AS #1
OPEN file2$ FOR BINARY AS #2
t1$ = SPACE$(LOF(1))
t2$ = SPACE$(LOF(2))
GET #1, 1, t1$
GET #2, 1, t2$

IF t1$ = t2$ THEN PRINT "It's an EXACT duplicate" ELSE PRINT "It's not..."

For long files, I'd simply compare LOF(1) and LOF(2) for a pretest.  If they're not a match then there's no reason to compare the contents.

That works up until audio and video files, which can be 98% matches with different file sizes and still be the exact same video or audio, if someone just trimmed a few seconds off (a side effect of downloading playlists). There might be a few different bytes at the beginning or the end from the trim, but the bulk is exactly the same. That's why I tried setting a$ to 256, 1024 or even LOF(#)/100 bytes, but it turned out slower than reading LONG integer by LONG integer on files of 150MB and up. At the moment I skip anything bigger than 50MB because compare times can exceed 1200 seconds a cycle. It stores them but just flags them as same file name.

It just struck me as odd that comparing the files with strings at 256 or 1024 bytes at a time was actually taking longer than reading them 4 bytes at a time with a LONG integer. Within the first 20000 records scanned it finds just over 80000 duplicates, mostly from various copies of QB64, I'm sad to say. There often seem to be multiple exact copies of files within the same version of QB64 too.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Fastest way to compare two files?
« Reply #5 on: September 20, 2018, 07:58:49 pm »
Quote
that works up until audio and video files which can be 98% matches but different file sizes and be the exact same video or audio. if someone just trimmed a few seconds off.(side effect of downloading playlists) there might be a few different byte in the beginning or the end from the trim but the bulk is exactly the same.

Unless you're writing a program to check for percent that matches, you'll always get the same result, whether you check 4 bytes at once, or the whole file contents.

Let's say we have 2 files:

"1234567890"
"1234567890  "       <---- see the extra spaces?

Read those 1 byte, 2  bytes, or all at once, and they're not going to say, "Matches.  File 2 just has a few extra spaces, or File 1 was trimmed somewhat...."

Sounds like what you need is a way to compare percent matches, and then flag them for manual comparison so you can check for yourself if the contents are close enough to be considered "duplicate" by you.
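
One very simple way to get a percent figure, as an untested sketch: count the bytes that match at the same offset and divide by the size of the longer file. The file names, the CHUNK size and that definition of "percent match" are all assumptions of this example. Note that it only helps when the trim was at the END of the file; a few seconds cut from the beginning shifts every later byte, and this kind of offset-by-offset count would then report almost no match.

```qb64
' Sketch: percentage of bytes that are equal at the same offset.
CONST CHUNK = 65536 ' bytes per read; arbitrary
DIM matches AS _INTEGER64, done AS _INTEGER64
DIM shorter AS _INTEGER64, longer AS _INTEGER64
DIM n AS LONG, i AS LONG

OPEN "first.bin" FOR BINARY AS #1 ' hypothetical file names
OPEN "second.bin" FOR BINARY AS #2

IF LOF(1) < LOF(2) THEN
    shorter = LOF(1): longer = LOF(2)
ELSE
    shorter = LOF(2): longer = LOF(1)
END IF

DO WHILE done < shorter ' only the overlapping part can match
    IF shorter - done < CHUNK THEN n = shorter - done ELSE n = CHUNK
    a$ = SPACE$(n)
    b$ = SPACE$(n)
    GET #1, , a$
    GET #2, , b$
    IF a$ = b$ THEN
        matches = matches + n ' whole block identical, no need to loop
    ELSE
        FOR i = 1 TO n
            IF ASC(a$, i) = ASC(b$, i) THEN matches = matches + 1
        NEXT
    END IF
    done = done + n
LOOP
CLOSE #1, #2

' Bytes beyond the end of the shorter file count as mismatches
IF longer > 0 THEN PRINT 100 * matches / longer; "% of bytes match"
```

Anything above some threshold of your choosing could then be flagged for the manual check.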

Offline codeguy

  • Forum Regular
  • Posts: 174
    • View Profile
Re: Fastest way to compare two files?
« Reply #6 on: September 20, 2018, 10:30:40 pm »
Why not use INSTR() if you want to see if some or all of the contents of file 1 is contained in file 2?
If the lengths of the two files match and INSTR() returns 1, they are exactly the same. If the lengths differ and INSTR() returns a value of 1 or more, file 1 is contained in file 2, starting at that position. If it returns 0, file 1 is not contained in file 2.
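
A minimal sketch of that INSTR() check, assuming both files are non-empty and small enough to load into strings whole. The file names are placeholders.

```qb64
' Sketch: is file 1 contained somewhere inside file 2?
OPEN "first.bin" FOR BINARY AS #1 ' hypothetical file names
OPEN "second.bin" FOR BINARY AS #2
t1$ = SPACE$(LOF(1))
t2$ = SPACE$(LOF(2))
GET #1, 1, t1$
GET #2, 1, t2$
CLOSE #1, #2

p& = INSTR(t2$, t1$) ' position of file 1's contents in file 2, or 0

IF p& = 0 THEN
    PRINT "File 1 is not contained in file 2"
ELSEIF LEN(t1$) = LEN(t2$) THEN
    PRINT "It's an EXACT duplicate" ' same length and found, so p& is 1
ELSE
    PRINT "File 1 is contained in file 2, starting at byte"; p&
END IF
```

This would catch a file trimmed at either end, but only if the remaining bytes are completely unchanged.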