Exetools  

Go Back   Exetools > General > General Discussion

Notices

Reply
 
Thread Tools Display Modes
  #1  
Old 01-22-2018, 01:24
dila dila is offline
Friend
 
Join Date: Jan 2010
Posts: 60
Rept. Given: 12
Rept. Rcvd 32 Times in 14 Posts
Thanks Given: 35
Thanks Rcvd at 74 Times in 20 Posts
dila Reputation: 32
Tool to scan files for common byte sequences

I am looking for a tool that loads a set of files and will find common byte sequences between them. Does such a tool exist?

For example, if each file contains the sequence 0x01 0x02 0x03 0x04 0x05, then the tool will find this common string and print it.
Reply With Quote
  #2  
Old 01-22-2018, 02:30
Stingered Stingered is offline
Friend
 
Join Date: Dec 2017
Posts: 256
Rept. Given: 0
Rept. Rcvd 2 Times in 2 Posts
Thanks Given: 296
Thanks Rcvd at 179 Times in 89 Posts
Stingered Reputation: 2
Quote:
Originally Posted by dila View Post
I am looking for a tool that loads a set of files and will find common byte sequences between them. Does such a tool exist?

For example, if each file contains the sequence 0x01 0x02 0x03 0x04 0x05, then the tool will find this common string and print it.
A quick search came across these:

http://gnuwin32.sourceforge.net/packages/gsar.htm

https://wingrep.codeplex.com/

https://www.fileseek.ca/

Or you can just use Notepad++ and use the "Find in Files" menu option.

Last edited by Stingered; 01-22-2018 at 02:31. Reason: link change
Reply With Quote
  #3  
Old 01-22-2018, 03:46
dila dila is offline
Friend
 
Join Date: Jan 2010
Posts: 60
Rept. Given: 12
Rept. Rcvd 32 Times in 14 Posts
Thanks Given: 35
Thanks Rcvd at 74 Times in 20 Posts
dila Reputation: 32
This is for searching for a given string.

I paste a screenshot of the prototype here: https://i.imgur.com/8IxxjE6.png. It shows that the string 0x00 0x04 0x00 0xE8 0x02 0x00 is common to 8 files out of the sample set.

And here it is, viewed in a hex editor: https://i.imgur.com/I06WEu7.png.
Reply With Quote
  #4  
Old 01-22-2018, 04:16
Stingered Stingered is offline
Friend
 
Join Date: Dec 2017
Posts: 256
Rept. Given: 0
Rept. Rcvd 2 Times in 2 Posts
Thanks Given: 296
Thanks Rcvd at 179 Times in 89 Posts
Stingered Reputation: 2
Quote:
Originally Posted by dila View Post
This is for searching for a given string.

I paste a screenshot of the prototype here: https://i.imgur.com/8IxxjE6.png. It shows that the string 0x00 0x04 0x00 0xE8 0x02 0x00 is common to 8 files out of the sample set.

And here it is, viewed in a hex editor: https://i.imgur.com/I06WEu7.png.
Hmmm... So you're looking for something like the OD command in unix (except for the addition of multiple file search)?

Would something like this work?

Code:
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>

#define BUF_SIZE 65536

int getnibble(char c)
{
	c = toupper(c);
	return (c > '9' ? c - 'A' + 10 : c - '0');
}

void main(int argc, char** argv)
{
	if (argc != 3)
	{
		printf(
			"Usage:\n"
			"%s <filename> <hex>\n",
			argv[0]
		);
		return;
	}

	char* filename = argv[1];
	char* hexchars = argv[2];
	
	int len = strlen(hexchars);
	if (len % 2)
	{
		printf("Error: Odd number of hex chars\n");
		return;
	}
	len /= 2; // len = number of bytes in pattern
	
	// parse hexchars to real bytes
	char* pattern = (char*)malloc(len);
	char* p = pattern;
	while (*hexchars)
	{
		int h = getnibble(*hexchars++);
		int l = getnibble(*hexchars++);
		
		if (h > 16 || l > 16)
		{
			printf("Error: invalid hex\n");
			free(pattern);
			return;
		}
		
		*p++ = (h << 4) + l;
	}
	
	// Open the file
	FILE* f = fopen(filename, "rb");
	if (f)
	{
		char* buf = (char*)malloc(BUF_SIZE);
		
		// we want to read in less than the whole buffer each time to avoid 
		// missing the needle when it's halfway across a boundary
		int readsize = BUF_SIZE - len; 
		
		int amtread;
		int offset = 0;
		char* p; // search result
		int bytessearched; // how many bytes we've already searched in this block
		
		// read in the first block in full
		amtread = fread(buf, 1, BUF_SIZE, f);
		while (amtread != 0)
		{
			// search for the start byte
			bytessearched = 0;
			while ((p = (char*)memchr(buf + bytessearched, *pattern, amtread - len - bytessearched)) != NULL)
			{
				if (memcmp(p, pattern, len) == 0)
				{
					printf("Found at %x\n", offset + p - buf);
				}
				bytessearched = p - buf + 1;
			}
			
			// copy the tail of the buffer over the head
			memmove(buf, buf + BUF_SIZE - len, len);
			
			// read in the next block
			amtread = fread(buf + len, 1, BUF_SIZE - len, f);
			offset += BUF_SIZE - len;
		}
		
		free(buf);
	}
	fclose(f);
	
	free(pattern);
}
And then just create a batch file for the multiple file search?
Reply With Quote
  #5  
Old 01-22-2018, 04:32
Stingered Stingered is offline
Friend
 
Join Date: Dec 2017
Posts: 256
Rept. Given: 0
Rept. Rcvd 2 Times in 2 Posts
Thanks Given: 296
Thanks Rcvd at 179 Times in 89 Posts
Stingered Reputation: 2
Quote:
Originally Posted by dila View Post
This is for searching for a given string.

I paste a screenshot of the prototype here: https://i.imgur.com/8IxxjE6.png. It shows that the string 0x00 0x04 0x00 0xE8 0x02 0x00 is common to 8 files out of the sample set.

And here it is, viewed in a hex editor: https://i.imgur.com/I06WEu7.png.
Ah... so you've already written something, sry (I see in the first image what you mean). Can't find anything that does this.
Reply With Quote
  #6  
Old 01-22-2018, 05:15
Stingered Stingered is offline
Friend
 
Join Date: Dec 2017
Posts: 256
Rept. Given: 0
Rept. Rcvd 2 Times in 2 Posts
Thanks Given: 296
Thanks Rcvd at 179 Times in 89 Posts
Stingered Reputation: 2
Last ditch effort:

http://www.vxsearch.com/search_files_by_binary_patterns.html

Windows app, 30-day trial download:

http://www.vxsearch.com/downloads.html
Reply With Quote
  #7  
Old 01-22-2018, 06:22
dila dila is offline
Friend
 
Join Date: Jan 2010
Posts: 60
Rept. Given: 12
Rept. Rcvd 32 Times in 14 Posts
Thanks Given: 35
Thanks Rcvd at 74 Times in 20 Posts
dila Reputation: 32
Yes, i made a prototype. But it turns out such a tool is of no use to anyone, so i will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood. Binary (or text) difference tools that compare a pair of files is not really the same at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.
Reply With Quote
The Following User Says Thank You to dila For This Useful Post:
Stingered (01-22-2018)
  #8  
Old 01-22-2018, 07:09
chants chants is offline
VIP
 
Join Date: Jul 2016
Posts: 725
Rept. Given: 35
Rept. Rcvd 48 Times in 30 Posts
Thanks Given: 666
Thanks Rcvd at 1,050 Times in 475 Posts
chants Reputation: 48
The problem with this task - e.g. common substrings problem, is that is a high complexity so that it requires a lot of difficult heuristic tricks to get it below O(n^2) otherwise it is too slow or uses too much memory. I have not seen any tools to do this. It would work with pictures or videos or audio as well - to find matching image sections, video subclips, etc. But really, it would be quite useful. I am quite certain we are talking an NP-hard problem please see:
Quote:
https://en.wikipedia.org/wiki/Longest_common_substring_problem
Although this problem is less complex, you did not specify only the longest common substring but all common substrings. Yes the suffix tree and other tricks can solve this one faster.

And there are proofs I believe that shortest common substring is NP-hard.
See for example
Quote:
https://docs.lib.purdue.edu/cgi/viewcontent.cgi?httpsredir=1&article=2225&context=cstech
Reply With Quote
The Following 2 Users Say Thank You to chants For This Useful Post:
dila (01-23-2018), Stingered (01-23-2018)
  #9  
Old 01-22-2018, 13:03
Stingered Stingered is offline
Friend
 
Join Date: Dec 2017
Posts: 256
Rept. Given: 0
Rept. Rcvd 2 Times in 2 Posts
Thanks Given: 296
Thanks Rcvd at 179 Times in 89 Posts
Stingered Reputation: 2
Quote:
Originally Posted by chants View Post
The problem with this task - e.g. common substrings problem, is that is a high complexity so that it requires a lot of difficult heuristic tricks to get it below O(n^2) otherwise it is too slow or uses too much memory. I have not seen any tools to do this. It would work with pictures or videos or audio as well - to find matching image sections, video subclips, etc. But really, it would be quite useful. I am quite certain we are talking an NP-hard problem please see:

Although this problem is less complex, you did not specify only the longest common substring but all common substrings. Yes the suffix tree and other tricks can solve this one faster.

And there are proofs I believe that shortest common substring is NP-hard.
See for example
I mean, yes, the same for grep (in general).
Reply With Quote
  #10  
Old 01-22-2018, 13:07
Stingered Stingered is offline
Friend
 
Join Date: Dec 2017
Posts: 256
Rept. Given: 0
Rept. Rcvd 2 Times in 2 Posts
Thanks Given: 296
Thanks Rcvd at 179 Times in 89 Posts
Stingered Reputation: 2
Quote:
Originally Posted by dila View Post
Yes, i made a prototype. But it turns out such a tool is of no use to anyone, so i will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood. Binary (or text) difference tools that compare a pair of files is not really the same at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.
Actually, I think it could be depending one what you are searching for. obviously it is not a simple search, but very specific (over a large spectrum). Does not mean it isn't useful. I was intrigued by the thought.
Reply With Quote
  #11  
Old 01-22-2018, 14:08
chants chants is offline
VIP
 
Join Date: Jul 2016
Posts: 725
Rept. Given: 35
Rept. Rcvd 48 Times in 30 Posts
Thanks Given: 666
Thanks Rcvd at 1,050 Times in 475 Posts
chants Reputation: 48
Grep is just looking for regex's so its complexity is that of pattern matching of regex's. Now you are asking a very general and arbitrary common substring problem. They are not the same issue really at all.

This would be very useful, but it has a really problematic size vs speed tradeoff and would need some kind of limiting parameters like you are getting at. The NP-hard issue can be side stepped through heuristics and domain specific approach. Nonetheless, I doubt you will find such a tool for general cases.
Reply With Quote
The Following User Says Thank You to chants For This Useful Post:
Stingered (01-23-2018)
  #12  
Old 01-23-2018, 02:33
Stingered Stingered is offline
Friend
 
Join Date: Dec 2017
Posts: 256
Rept. Given: 0
Rept. Rcvd 2 Times in 2 Posts
Thanks Given: 296
Thanks Rcvd at 179 Times in 89 Posts
Stingered Reputation: 2
Quote:
Originally Posted by chants View Post
Grep is just looking for regex's so its complexity is that of pattern matching of regex's. Now you are asking a very general and arbitrary common substring problem. They are not the same issue really at all.

This would be very useful, but it has a really problematic size vs speed tradeoff and would need some kind of limiting parameters like you are getting at. The NP-hard issue can be side stepped through heuristics and domain specific approach. Nonetheless, I doubt you will find such a tool for general cases.
Good explanation (both posts). There was an old (privately written back in the early 90's) tool that I used for function and string searches (waaay before this particular code base was being indexed). And even that tool could take all day to search through (the code base was very large). Not exactly the same example, but I see your point regarding the size vs speed you described.

I believe I read that the metasploit framework included some heuristics for this kind of search, but I could find no specific tool.

I too agree that a tool like this would be very useful.
Reply With Quote
The Following User Says Thank You to Stingered For This Useful Post:
dila (01-23-2018)
  #13  
Old 01-24-2018, 12:06
ontryit ontryit is offline
Friend
 
Join Date: Nov 2011
Posts: 172
Rept. Given: 127
Rept. Rcvd 17 Times in 14 Posts
Thanks Given: 411
Thanks Rcvd at 70 Times in 43 Posts
ontryit Reputation: 17
Quote:
Originally Posted by dila View Post
Yes, i made a prototype. But it turns out such a tool is of no use to anyone, so i will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood. Binary (or text) difference tools that compare a pair of files is not really the same at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.
You can share you prototype tool with its source code here, may be it will be useful for someone. Thx
Reply With Quote
  #14  
Old 02-14-2018, 03:29
dosprog dosprog is online now
Friend
 
Join Date: Feb 2018
Posts: 114
Rept. Given: 0
Rept. Rcvd 17 Times in 16 Posts
Thanks Given: 33
Thanks Rcvd at 146 Times in 74 Posts
dosprog Reputation: 17
Tool to scan files for common byte sequences

Similar problem solve the archivers.
That if try to take out the algorithm from some open-source archiver?
Reply With Quote
The Following User Says Thank You to dosprog For This Useful Post:
dila (02-16-2018)
  #15  
Old 02-16-2018, 16:46
dila dila is offline
Friend
 
Join Date: Jan 2010
Posts: 60
Rept. Given: 12
Rept. Rcvd 32 Times in 14 Posts
Thanks Given: 35
Thanks Rcvd at 74 Times in 20 Posts
dila Reputation: 32
Yes, I figured the problem was similar to building a dictionary of common sequences, which you'd then substitute with shorter codes corresponding to the dictionary entries.

As we discussed, it doesn't sound like a perfect solution is possible, but some heuristics would work. You mentioned compression which would do exactly this kind of operation - pick your favourite algorithm.

(I won't make the code available for my tool, since it was a rushed prototype and I don't think there any chance of anyone getting all the necessary libs to compile it.)
Reply With Quote
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Is there any tool to replace the files packed in the NullSoft Install System package? BlackWhite General Discussion 4 09-02-2018 00:27


All times are GMT +8. The time now is 08:19.


Always Your Best Friend: Aaron, JMI, ahmadmansoor, ZeNiX, chessgod101
( 1998 - 2024 )