Exetools  

03-02-2010, 05:56
pp2
[Solution] How to check whether data is compressed or ciphered

Some time ago I wondered whether compressed data can be distinguished from enciphered data. This is useful in data and file analysis, among other cases. Maybe you will find this information useful too.

First, how can we answer, in theory, whether data is packed or encrypted? Strictly speaking, packed data blocks do not use all possible code combinations (otherwise the data could not be unpacked), but with the best compressors the unused share is less than 0.01% of all possible combinations. So only statistics over large amounts of data can help us. Example:
compressed data
entropy for 8-bit elements: 0.999833729
ciphered data
entropy for 8-bit elements: 0.999999867

As you can see, the difference in entropy is almost zero, so entropy is not a reliable criterion for telling such blocks apart. For some blocks, the entropy of compressed data is almost the same as that of ciphered data.
A better approach is to calculate the chi-squared statistic for the block of data.
Compare the results for the same blocks of data:
compressed data
chi-squared 0.001830034
ciphered data
chi-squared 0.000001432

Here the values differ by a factor of roughly 1300! But why? Because data ciphered with a good cipher will contain all possible code combinations, and compressed data will not. The chi-squared statistic exposes these unused codes, which produces such a difference.
OK, how do we calculate the 8-bit entropy and chi-squared?
Suppose the elements array holds the count of each byte value in the data block (for 8-bit entropy), i.e. element 0 holds the number of 0x00 bytes in the block, element 1 the number of 0x01 bytes, and so on, and quantity is the total number of elements counted. Here is pseudocode for calculating the entropy value:
Code:
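#include <math.h>

/* Assumed to be defined elsewhere: elements[i] is the number of times
   value i occurs in the block, quantity is the total number of elements. */
extern unsigned long elements[];
extern unsigned long quantity;

/* Normalized entropy: 0.0 for constant data, 1.0 for perfectly uniform data. */
long double GetEntropy(unsigned int bits)
{
    unsigned long i;
    long double result = 0.0L, p;
    for (i = 0; i < (1UL << bits); i++)
    {
        if (elements[i] == 0)
            continue;   /* p * log(p) is taken as 0 when p == 0 */
        p = (long double)elements[i] / (long double)quantity;
        result += p * log2l(p);
    }
    return -result / (long double)bits;
}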
And now, using the same elements array, we can calculate the chi-squared statistic:
Code:
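/* Uses the same elements[] and quantity as GetEntropy(). */
long double GetChiSquared(unsigned int bits)
{
    unsigned long i;
    long double result = 0.0L, diff;
    /* Expected count per value if the data were uniformly distributed. */
    const long double expected = (long double)quantity / (long double)(1UL << bits);
    for (i = 0; i < (1UL << bits); i++)
    {
        diff = (long double)elements[i] - expected;
        result += diff * diff / expected;
    }
    return result / (long double)quantity;   /* normalize by block size */
}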
Drawbacks: we cannot prove with 100% certainty that a file is compressed or ciphered if the data size is too small. As the data size grows, the evidence gets stronger. My own investigations show that random blocks larger than 50 MB (compressed by modern archivers or ciphered with popular block ciphers) can be distinguished with 99% reliability (yes, you may need some additional magic, which is left as a home exercise for the most curious; hint: do not use only 8-bit values).
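One possible reading of the hint about not using only 8-bit values is to run the same chi-squared calculation over 16-bit elements. The sketch below is only a guess at that direction, not the "additional magic" itself. The function name chi_squared16 is hypothetical, and it assumes non-overlapping big-endian byte pairs, ignoring a trailing odd byte:

```c
#include <stddef.h>
#include <stdlib.h>

/* Chi-squared against a uniform distribution over 16-bit elements,
   divided by the number of elements. Returns -1.0 if the buffer holds
   fewer than 2 bytes or if the histogram cannot be allocated. */
double chi_squared16(const unsigned char *data, size_t len)
{
    const size_t symbols = 65536;     /* distinct 16-bit values */
    size_t total = len / 2;           /* number of 16-bit elements */
    if (total == 0)
        return -1.0;

    unsigned long *counts = calloc(symbols, sizeof *counts);
    if (counts == NULL)
        return -1.0;

    /* Histogram of non-overlapping byte pairs (assumed big-endian). */
    for (size_t i = 0; i + 1 < len; i += 2)
        counts[((size_t)data[i] << 8) | data[i + 1]]++;

    double expected = (double)total / (double)symbols;
    double sum = 0.0;
    for (size_t s = 0; s < symbols; s++) {
        double d = (double)counts[s] - expected;
        sum += d * d / expected;
    }

    free(counts);
    return sum / (double)total;
}
```

With 65536 bins the block has to be much larger before the counts mean anything, which fits the remark that the evidence only gets strong for big blocks.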

Happy coding

Tags
chi-square, entropy

