Segmented File Hashing Utility

HarrySpoofer · #1 08-19-2023, 17:07

This is a small command-line utility for MS Windows, written in C in Visual Studio 2019, that creates multiple MD5 hashes of a single file. This technique is also known as "segmented hashing".

USAGE: SegmentedHash.exe FileToHash NumberOfHashSegments OffsetRange e.g.: 1BEEF-20000

For example:

SegmentedHash.exe BigFile.bin 100

...calculates 100 consecutive MD5 hashes of the entire BigFile.bin file. In other words, it divides BigFile.bin into 100 segments of equal size (give or take one byte when the file size is not a multiple of 100) and calculates an MD5 hash of each segment.

SegmentedHash.exe BigFile.bin 100 642c06f40-642f509a6

...calculates 100 consecutive MD5 hashes of the part of BigFile.bin starting at the hexadecimal offset 642c06f40 and ending at the offset 642f509a6 (inclusive).
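The way an inclusive offset window is cut into segments can be sketched in portable C. This is a minimal illustration, not the utility's own code: the names `Segment`, `segment_bounds` and `print_segments` are hypothetical, and giving the first (size % n) segments one extra byte is an assumption about the split the source aims for.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One hash segment, as an inclusive range of file offsets. */
typedef struct {
    uint64_t first;
    uint64_t last;
} Segment;

/*
 * Return segment number `index` (0-based) when the inclusive window
 * [win_first, win_last] is cut into `n` consecutive segments.
 * Assumption: the first (size % n) segments are one byte longer.
 */
Segment segment_bounds(uint64_t win_first, uint64_t win_last,
                       uint64_t n, uint64_t index)
{
    assert(win_first <= win_last);
    uint64_t size = win_last - win_first + 1;
    assert(n >= 1 && n <= size && index < n);

    uint64_t base = size / n;      /* size of the shorter segments */
    uint64_t extra = size % n;     /* how many segments get one more byte */
    uint64_t longer_before = index < extra ? index : extra;

    Segment s;
    s.first = win_first + index * base + longer_before;
    s.last = s.first + base + (index < extra ? 1 : 0) - 1;
    return s;
}

/* Print the offset ranges in the same hex layout as the utility's table. */
void print_segments(uint64_t win_first, uint64_t win_last, uint64_t n)
{
    for (uint64_t i = 0; i < n; i++) {
        Segment s = segment_bounds(win_first, win_last, n, i);
        printf("%016llX-%016llX\n",
               (unsigned long long)s.first, (unsigned long long)s.last);
    }
}
```

Calling `print_segments(0x642c06f40, 0x642f509a6, 100)` would list the 100 ranges for the example above. Both sides of a transfer must use the same split for their hash lists to be comparable.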

The file offsets can also be specified in an open form, e.g.:
-1CB2 means from the beginning of the file (offset 0) up to the file offset 0x1cb2, and 1CB2- means from the file offset 0x1cb2 up to the end of the file.
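The three range forms can be parsed along these lines. This is a hedged sketch in portable C rather than the utility's parser: `parse_offset_range` is a hypothetical name, and unlike the posted source it rejects reversed ranges and a bare number without a dash instead of trying to repair them.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "FIRST-LAST", "-LAST" or "FIRST-" (hexadecimal) into an inclusive
 * offset window inside a file of `file_size` bytes.
 * Returns 0 on success, -1 on malformed input or an empty window.
 */
int parse_offset_range(const char *text, uint64_t file_size,
                       uint64_t *first, uint64_t *last)
{
    assert(text && first && last);
    if (file_size == 0)
        return -1;

    char *end;
    uint64_t lo = 0;                 /* open start: beginning of file */
    uint64_t hi = file_size - 1;     /* open end: last byte of file */

    if (*text != '-') {
        if (!isxdigit((unsigned char)*text))
            return -1;
        lo = strtoull(text, &end, 16);
        text = end;
    }
    if (*text != '-')
        return -1;                   /* the dash is mandatory in this sketch */
    text++;

    if (*text != '\0') {
        if (!isxdigit((unsigned char)*text))
            return -1;
        hi = strtoull(text, &end, 16);
        if (*end != '\0')
            return -1;               /* trailing garbage */
    }

    if (hi > file_size - 1)
        hi = file_size - 1;          /* clamp to the last byte of the file */
    if (lo > hi)
        return -1;

    *first = lo;
    *last = hi;
    return 0;
}
```

With a 1 MiB file, "-1CB2" yields the window 0 to 0x1cb2 and "1CB2-" yields 0x1cb2 to 0xfffff.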

The hashing algorithm can be changed by altering the line: #define ALGORITHM

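Other possible hash algorithms are: CALG_SHA1, CALG_SHA_256, CALG_SHA_512, etc. Note that CALG_SHA_256 and CALG_SHA_512 also need PROV_RSA_AES instead of PROV_RSA_FULL in the CryptAcquireContext call. CALG_3DES and CALG_AES_128 are cipher identifiers rather than hash algorithms, so CryptCreateHash does not accept them.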

Note: This utility does not write any files. It only reads the file in BUFSIZE chunks (see the source in SegmentedHash.cpp) and calculates the hashes.
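The chunked reading of one segment can be sketched in portable C. This is an illustration under stated assumptions, not the posted source: `read_segment`, `chunk_fn`, `Counter` and `count_chunk` are hypothetical names, the callback stands in for CryptHashData, and plain fseek limits the offsets to what fits in a long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define CHUNK 4096  /* mirrors BUFSIZE in the posted source */

/* Callback that consumes one chunk; stands in for the hash update call. */
typedef void (*chunk_fn)(void *state, const unsigned char *data, size_t len);

/*
 * Feed the bytes at offsets [first, last] (inclusive) of `fp` to `update`
 * in pieces of at most CHUNK bytes. Nothing is written to the file.
 * Returns 0 on success, -1 on a failed seek or a short read.
 */
int read_segment(FILE *fp, long first, long last, chunk_fn update, void *state)
{
    assert(fp && update && first >= 0 && first <= last);
    unsigned char buf[CHUNK];
    long next = first;

    if (fseek(fp, first, SEEK_SET) != 0)
        return -1;

    while (next <= last) {
        long left = last - next + 1;
        size_t want = left < CHUNK ? (size_t)left : (size_t)CHUNK;
        size_t got = fread(buf, 1, want, fp);
        if (got != want)
            return -1;  /* segment runs past end of file, or I/O error */
        update(state, buf, got);
        next += (long)got;
    }
    return 0;
}

/* Hypothetical consumer for illustration: counts bytes and chunks seen. */
typedef struct {
    unsigned long bytes;
    unsigned long chunks;
} Counter;

void count_chunk(void *state, const unsigned char *data, size_t len)
{
    (void)data;
    Counter *c = (Counter *)state;
    c->bytes += len;
    c->chunks++;
}
```

A real port would pass a callback that updates an MD5 context from whatever hashing library is available on the target platform.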

Q: WHAT IS THIS USEFUL FOR?

A: Scenario: You have been downloading a 16TB file over a slow FTP connection for a week but several bytes of the file came over corrupted.
This utility allows you to detect which bytes did not transfer correctly without doing the full 16TB file compare / re-download. This is done by running the SegmentedHash utility on the FTP server AND on the FTP client machine and comparing only the hashes of that big file before and after the transfer. Once a mismatching hash is identified, you can narrow down the search to a smaller range of file offsets and find the corrupted bytes. Just several kB of hashes need to be transferred and compared to find the culprit in a huge file. Once this is done the correct bytes can be downloaded anew and used to patch the huge downloaded corrupted file.
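The comparison step in this scenario can be sketched as follows. It assumes each row of both listings has the "FIRST-LAST<TAB>HASH" layout printed by the utility; `find_mismatches` and `row_offset_range` are hypothetical helper names, not part of the posted source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Compare two hash listings row by row.
 * Stores up to `max_out` indices of differing rows in `out` (may be NULL)
 * and returns the total number of differing rows.
 */
size_t find_mismatches(const char *const *server, const char *const *client,
                       size_t rows, size_t *out, size_t max_out)
{
    assert(server && client);
    size_t found = 0;

    for (size_t i = 0; i < rows; i++) {
        if (strcmp(server[i], client[i]) != 0) {
            if (out && found < max_out)
                out[found] = i;
            found++;
        }
    }
    return found;
}

/*
 * Copy the "FIRST-LAST" part of a row into `range`, so it can be passed
 * as the OffsetRange argument of the next, narrower run.
 * Returns 0 on success, -1 if the row has no tab or the buffer is too small.
 */
int row_offset_range(const char *row, char *range, size_t range_size)
{
    const char *tab = strchr(row, '\t');
    if (!tab)
        return -1;

    size_t len = (size_t)(tab - row);
    if (len + 1 > range_size)
        return -1;

    memcpy(range, row, len);
    range[len] = '\0';
    return 0;
}
```

Each round shrinks the suspect region by a factor of NumberOfHashSegments. Repeating the run on the mismatching range on both machines narrows it down to a span small enough to download again.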

Code:

#include <stdio.h>
#include <stdlib.h>    // malloc, free, wcstoull, _wtoi
#include <assert.h>
#include <windows.h>
#include <Wincrypt.h>

#define ALGORITHM CALG_MD5
#define BUFSIZE 4096

DWORD PrintHash(HCRYPTHASH hHash)
{
    DWORD cbData = sizeof(DWORD);
    PBYTE pbData = NULL;
    DWORD cHashSize;
    CHAR Digits[] = "0123456789abcdef";
    DWORD dwStatus = 0;

    
    // Fail if the call fails OR the returned size field has an unexpected width
    if (!CryptGetHashParam(hHash, HP_HASHSIZE, (PBYTE)&cHashSize, &cbData, 0) || (cbData != sizeof(cHashSize)))
        goto ErrorExit;

    pbData = (PBYTE)malloc(cHashSize);

    if ((pbData) && (CryptGetHashParam(hHash, HP_HASHVAL, pbData, &cHashSize, 0)))
    {
        for (DWORD i = 0; i < cHashSize; i++)
        {
            printf("%c%c", Digits[pbData[i] >> 4], Digits[pbData[i] & 0xf]);
        }
        printf("\n");
        goto Exit;
    }

ErrorExit:
    dwStatus = GetLastError();
    printf("ERROR: CryptGetHashParam failed: %08x\n", dwStatus);
Exit:
    if (pbData) free(pbData);
    return dwStatus;
    
}

LONGLONG mySetFilePointer(HANDLE hFile, LONGLONG distance, DWORD MoveMethod)
{
    LARGE_INTEGER dist;

    dist.QuadPart = distance;
    dist.LowPart = SetFilePointer(hFile, dist.LowPart, &dist.HighPart, MoveMethod);

    if (dist.LowPart == INVALID_SET_FILE_POINTER && GetLastError() != NO_ERROR)
    {
        dist.QuadPart = -1;
    }

    return dist.QuadPart;
}


int wmain(int argc, wchar_t* argv[])
{
    LARGE_INTEGER fsize;
    ULONGLONG NextToRead = 0;
    ULONGLONG SegFirst=0;
    ULONGLONG SegLast = 0;
    ULONGLONG SegSize;
    ULONGLONG WindowFirst = 0;
    ULONGLONG WindowLast = 0;
    ULONGLONG WindowSize;
    ULONGLONG nSegments = 1;
    ULONGLONG Remainder = 0;
    DWORD dwStatus = 0;
    BOOL bResult = FALSE;
    HCRYPTPROV hProv = 0;
    HCRYPTHASH hHash = 0;
    HANDLE hFile = NULL;
    BYTE Bufffer[BUFSIZE];
    DWORD cbRead = 0;
    wchar_t* EndPtr;
    ULONGLONG tmp;

    LPCWSTR filename = argv[1];
    // Logic to check usage goes here.

    if (argc < 2)
    {
        printf("USAGE: SegmentedHash.exe FileToHash NumberOfHashSegments OffsetRange  e.g.: 1BEEF-20000\n");
        return -1;
    }

    hFile = CreateFileW(filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);

    if (INVALID_HANDLE_VALUE == hFile)
    {
        dwStatus = GetLastError();
        printf("Error opening file %ls\nError: %08x\n", filename, dwStatus);
        return dwStatus;
    }

    if (!GetFileSizeEx(hFile, &fsize))
    {
        dwStatus = GetLastError();   // read the error code before CloseHandle can overwrite it
        CloseHandle(hFile);
        printf("Error obtaining file size %ls\nError: %08x\n", filename, dwStatus);
        return dwStatus;
    }
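    if (fsize.QuadPart == 0)
    {
        // An empty file would make WindowSize zero and cause a division by zero below
        printf("Error: file %ls is empty, nothing to hash\n", filename);
        CloseHandle(hFile);
        return -1;
    }
    WindowLast = fsize.QuadPart-1;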

    if (argc > 2)
        nSegments = max (_wtoi(argv[2]), 1);

    if (argc > 3)
    {
        if (*argv[3] == L'-')
            WindowLast = wcstoull(argv[3] + 1, NULL, 16);  //_wcstoui64
        else
        {
            WindowFirst = wcstoull(argv[3], &EndPtr, 16);  //_wcstoui64
            if (*EndPtr == L'-')
            {
                if (*(EndPtr + 1) == L'\0')
                    WindowLast = (ULONGLONG)fsize.QuadPart - 1;
                else
                    WindowLast = wcstoull(EndPtr + 1, NULL, 16);  //_wcstoui64
            }
        }
    }

    if (WindowFirst > WindowLast)
    {
        // Swap the two offsets so that WindowFirst <= WindowLast
        tmp = WindowFirst;
        WindowFirst = WindowLast;
        WindowLast = tmp;
    }

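    // Clamp both ends to the file so that WindowSize cannot underflow
    WindowLast = min(WindowLast, (ULONGLONG)fsize.QuadPart-1);
    WindowFirst = min(WindowFirst, WindowLast);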
    WindowSize = WindowLast - WindowFirst + 1;
    nSegments = min(nSegments, WindowSize);

    Remainder = WindowSize % nSegments;
    SegSize = WindowSize / nSegments + (Remainder > 0);   

    // Get handle to the crypto provider
    if (!CryptAcquireContext(&hProv, NULL, NULL, PROV_RSA_FULL, CRYPT_VERIFYCONTEXT))
    {
        dwStatus = GetLastError();
        printf("ERROR: CryptAcquireContext failed: %08x\n", dwStatus);
        CloseHandle(hFile);
        return dwStatus;
    }
 
    if (!CryptCreateHash(hProv, ALGORITHM, 0, 0, &hHash))
    {
        dwStatus = GetLastError();
        printf("ERROR: CryptCreateHash failed: %08x\n", dwStatus);
        CloseHandle(hFile);
        CryptReleaseContext(hProv, 0);
        return dwStatus;
    }

    printf("\nCalculating hashes for %llu segments of the file %ls\nfrom offset %016llx to %016llx (inclusive)\n\n", nSegments, filename, WindowFirst, WindowLast);
    printf("|       File Offset Range       |\t|           MD5 Hash           |\n");
    printf("|-------------------------------|\t|------------------------------|\n");
    

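    if (mySetFilePointer(hFile, WindowFirst, FILE_BEGIN) < 0)
    {
        dwStatus = GetLastError();
        printf("ERROR: SetFilePointer failed: %08x\n", dwStatus);
        goto Exit;
    }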
    NextToRead = WindowFirst;
    SegFirst = WindowFirst;
    SegLast = min(WindowFirst+SegSize-1, WindowLast);

    while (bResult = ReadFile(hFile, Bufffer, (DWORD)min(BUFSIZE, SegLast - NextToRead+1), &cbRead, NULL))
    {
        assert(cbRead == min(BUFSIZE, SegLast - NextToRead+1));
        assert(NextToRead + cbRead - 1 <= WindowLast);

        NextToRead += cbRead;
        
        if ( (cbRead > 0) && (!CryptHashData(hHash, Bufffer, cbRead, 0)) )
        {
            dwStatus = GetLastError();
            printf("ERROR: CryptHashData failed: %08x\n", dwStatus);
            goto Exit;
        }
      
        if ( (NextToRead > SegLast) || (cbRead == 0) )
        {
            printf("%016llX-%016llX\t", SegFirst, SegLast);
            PrintHash(hHash);

            if ((NextToRead > WindowLast) || (cbRead == 0))
                break;

            CryptDestroyHash(hHash);

            if (!CryptCreateHash(hProv, ALGORITHM, 0, 0, &hHash))
            {
                dwStatus = GetLastError();
                printf("ERROR: CryptCreateHash failed: %08x\n", dwStatus);
                goto Exit;
            }

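            // Only the first Remainder segments are one byte longer, so shrink
            // SegSize before computing the bounds of the next segment.
            if (Remainder == 1)
                SegSize--;

            if (Remainder > 0)
                Remainder--;

            SegFirst = min(SegLast + 1, WindowLast);
            SegLast = min(SegFirst + SegSize - 1, WindowLast);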
        }
    }

    if (!bResult)
    {
        dwStatus = GetLastError();
        printf("ERROR: ReadFile failed: %08x\n", dwStatus);
        goto Exit;
    }

    dwStatus = 0;

Exit:
    CryptDestroyHash(hHash);
    CryptReleaseContext(hProv, 0);
    CloseHandle(hFile);

    return dwStatus;
}

sendersu · #2 08-19-2023, 17:37

Nice piece of code!
Is it possible to make it cross-platform so that it is available for Linux flavors as well?

chants · #3 09-04-2023, 00:23

Similar to how the torrent protocol works. Unfortunately, FTP and HTTP don't have this functionality built into the protocol, so without something like SSH access to the server this isn't going to help much, for example when downloading from public servers.

Practically, I don't think a segment of more than 2 megabytes without its own hash is a good idea. Companies like Microsoft give a single MD5 or SHA-1 hash for multi-gigabyte ISOs, for example, yet a set of hashes for 2 MB segments would still be a very small amount of data. Not sure why, in this day and age, this is not solved. Bandwidth efficiency for large downloads remains an annoying issue in various contexts.
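The size of such a hash list is easy to estimate. The sketch below is illustrative arithmetic only: `hash_list_bytes` is a hypothetical helper, and it counts raw digest bytes (16 per MD5) without any offsets or text formatting.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Size in bytes of the raw digest list for a file cut into fixed-size
 * segments. The last segment may be shorter but still needs one digest.
 */
uint64_t hash_list_bytes(uint64_t file_size, uint64_t segment_size,
                         uint64_t digest_bytes)
{
    assert(segment_size > 0);
    uint64_t segments = file_size / segment_size
                      + (file_size % segment_size != 0 ? 1 : 0);
    return segments * digest_bytes;
}
```

For a 4 GiB ISO with 2 MiB segments this gives 2048 digests, or 32 KiB of MD5 data. For the 16 TiB file from the first post it gives 8,388,608 digests, or 128 MiB.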

Abdul Moeed · #4 09-16-2023, 21:14

Quote:

Originally Posted by HarrySpoofer

Q: WHAT IS THIS USEFUL FOR?

A: Scenario: You have been downloading a 16TB file over a slow FTP connection for a week but several bytes of the file came over corrupted.
This utility allows you to detect which bytes did not transfer correctly without doing the full 16TB file compare / re-download. This is done by running the SegmentedHash utility on the FTP server AND on the FTP client machine and comparing only the hashes of that big file before and after the transfer.

This is a great concept, but unless you own or at least have access to the FTP server, how on Earth would you be able to run that utility on it?
This means that the FTP server owner also must know about this tool and then run this tool on their server side.

Or it is useful only in cases when you are transferring files between your own servers...

Rebe · #5 09-17-2023, 21:35

That's a good point. Is there a compiled version handy?

h8er · #6 10-11-2023, 23:16

Hello, nice tool. Another good option for something like this is PAR2 / PAR3, which support error recovery too.

https://en.wikipedia.org/wiki/Parchive
