#1
Tool to scan files for common byte sequences
I am looking for a tool that loads a set of files and finds common byte sequences among them. Does such a tool exist?
For example, if every file contains the sequence 0x01 0x02 0x03 0x04 0x05, then the tool should find this common string and print it.
#2
Quote:
http://gnuwin32.sourceforge.net/packages/gsar.htm
https://wingrep.codeplex.com/
https://www.fileseek.ca/
Or you can just use Notepad++ and its "Find in Files" menu option.
Last edited by Stingered; 01-22-2018 at 02:31. Reason: link change
#3
Those are for searching for a given string.
I pasted a screenshot of the prototype here: https://i.imgur.com/8IxxjE6.png. It shows that the string 0x00 0x04 0x00 0xE8 0x02 0x00 is common to 8 files in the sample set. And here it is, viewed in a hex editor: https://i.imgur.com/I06WEu7.png.
#4
Quote:
Would something like this work? Code:
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>

#define BUF_SIZE 65536

int getnibble(char c)
{
    if (!isxdigit((unsigned char)c))
        return -1;
    c = toupper(c);
    return (c > '9' ? c - 'A' + 10 : c - '0');
}

int main(int argc, char** argv)
{
    if (argc != 3) {
        printf("Usage:\n%s <filename> <hex>\n", argv[0]);
        return 1;
    }

    char* filename = argv[1];
    char* hexchars = argv[2];

    int len = strlen(hexchars);
    if (len == 0 || len % 2) {
        printf("Error: Odd number of hex chars\n");
        return 1;
    }
    len /= 2; /* len = number of bytes in pattern */

    /* parse hexchars into real bytes */
    char* pattern = malloc(len);
    char* p = pattern;
    while (*hexchars) {
        int h = getnibble(*hexchars++);
        int l = getnibble(*hexchars++);
        if (h < 0 || l < 0) {
            printf("Error: invalid hex\n");
            free(pattern);
            return 1;
        }
        *p++ = (char)((h << 4) + l);
    }

    FILE* f = fopen(filename, "rb");
    if (f) {
        char* buf = malloc(BUF_SIZE);
        long offset = 0;       /* file offset of buf[0] */
        char* q;               /* search result */
        int bytessearched;     /* bytes already searched in this block */

        /* carry the last len-1 bytes of each block into the next one so
           a needle straddling a block boundary is not missed */
        int avail = fread(buf, 1, BUF_SIZE, f);
        while (avail >= len) {
            bytessearched = 0;
            /* look for the start byte, then confirm the full pattern */
            while ((q = memchr(buf + bytessearched, *pattern,
                               avail - len + 1 - bytessearched)) != NULL) {
                if (memcmp(q, pattern, len) == 0)
                    printf("Found at %lx\n",
                           (unsigned long)(offset + (q - buf)));
                bytessearched = q - buf + 1;
            }
            /* copy the tail of the buffer over the head */
            memmove(buf, buf + avail - (len - 1), len - 1);
            offset += avail - (len - 1);
            int r = fread(buf + len - 1, 1, BUF_SIZE - (len - 1), f);
            if (r == 0)
                break;
            avail = len - 1 + r;
        }
        free(buf);
        fclose(f);
    }
    free(pattern);
    return 0;
}
#5
Quote:
#6
Last ditch effort:
http://www.vxsearch.com/search_files_by_binary_patterns.html
Windows app, 30-day trial download: http://www.vxsearch.com/downloads.html
#7
Yes, I made a prototype. But it turns out such a tool is of no use to anyone, so I will not continue to develop it.
I think the idea is quite simple, but it seems not many people understood it. Binary (or text) diff tools that compare a pair of files are not really the same thing at all. The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.
The Following User Says Thank You to dila For This Useful Post:
Stingered (01-22-2018)
#8
The problem with this task, i.e. the common substring problem, is that it has high complexity, so it takes a lot of difficult heuristic tricks to get it below O(n^2); otherwise it is too slow or uses too much memory. I have not seen any tools that do this. It would work with pictures, video, or audio as well, to find matching image sections, video subclips, etc. It really would be quite useful. I am quite certain we are talking about an NP-hard problem; please see:
Quote:
And there are proofs, I believe, that the related shortest common superstring problem is NP-hard. See for example Quote:
#9
Quote:
#10
Quote:
#11
Grep just looks for regexes, so its complexity is that of regex pattern matching. You are asking about a very general and arbitrary common substring problem. They are not really the same issue at all.
This would be very useful, but it has a really problematic size vs. speed tradeoff and would need some kind of limiting parameters, like you are getting at. The NP-hard issue can be sidestepped through heuristics and a domain-specific approach. Nonetheless, I doubt you will find such a tool for general cases.
The Following User Says Thank You to chants For This Useful Post:
Stingered (01-23-2018)
#12
Quote:
I believe I read that the Metasploit framework includes some heuristics for this kind of search, but I could find no specific tool. I too agree that a tool like this would be very useful.
The Following User Says Thank You to Stingered For This Useful Post:
dila (01-23-2018)
#13
Quote:
#14
Archivers solve a similar problem.
What if you took the algorithm out of some open-source archiver?
The Following User Says Thank You to dosprog For This Useful Post:
dila (02-16-2018)
#15
Yes, I figured the problem was similar to building a dictionary of common sequences, which you'd then substitute with shorter codes corresponding to the dictionary entries.
As we discussed, it doesn't sound like a perfect solution is possible, but some heuristics would work. You mentioned compression, which does exactly this kind of operation; pick your favourite algorithm. (I won't make the code for my tool available, since it was a rushed prototype and I don't think there is any chance of anyone getting all the necessary libs to compile it.)