decompiling back to C++? - Page 3

#31 08-17-2004, 06:01

Quote:

Originally Posted by br00t_4_c

You can reconstruct source code from a disassembled binary executable that may well closely resemble the original source code but as Sarge very astutely mentioned variable and function names will be mangled, comments will be lost, etc.

Yes, that is the realistic view. This information is simply lost during compilation. Assuming there is no debug info, just the compiled, stripped .exe, we can't do anything about it.
I am sure, however, that even such a source with names like variable1, variable2 etc. can be a great help for anyone who wants to understand the original ideas behind the code.
Don't forget that the other alternative is facing a huge, unorganized list of assembly functions.
Some information that I am sure can be (at least partly) recognized when the optimization doesn't hide it:
C++ specific:
- virtual tables
- ctors, dtors
- inheritance relationships
- dynamic_casts
- class sizes
- stack objects
- global objects
- member functions
- member pointers, member function pointers
- heap allocations
C specific:
- switch statements
- loops
- function calls
Assuming we have a tool that collects all this information and it is built into a debugger (OllyDbg for example), just imagine what a help it could be.
OllyDbg supports writing comments next to the code. If this tool also supported naming the recognized structures, complete parts of the original code could be reconstructed.
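
To make the "virtual tables" item concrete, here is a minimal sketch of what a virtual call usually looks like once it is written out in C-like form, which is roughly the shape such a tool would have to recognize. All names (ShapeVTable, Rect, Square, area_of) are hypothetical, and the "vptr as hidden first member" layout is only what mainstream compilers tend to do; the C++ standard does not promise it.

```cpp
// Hand-written stand-in for a class hierarchy with one virtual function.
struct Shape;                                   // forward declaration

struct ShapeVTable {
    int (*area)(const Shape* self);             // slot 0: virtual int area() const
};

struct Shape {
    const ShapeVTable* vptr;                    // assumed: hidden first member
};

struct Rect   { Shape base; int w; int h; };    // base subobject comes first
struct Square { Shape base; int side; };

int rect_area(const Shape* s) {
    const Rect* r = reinterpret_cast<const Rect*>(s);
    return r->w * r->h;
}

int square_area(const Shape* s) {
    const Square* q = reinterpret_cast<const Square*>(s);
    return q->side * q->side;
}

// One table per class, shared by every object of that class.
const ShapeVTable rect_vtable{rect_area};
const ShapeVTable square_vtable{square_area};

// These play the role of the ctors: their telltale job is storing the vptr.
Rect   make_rect(int w, int h) { return Rect{Shape{&rect_vtable}, w, h}; }
Square make_square(int side)   { return Square{Shape{&square_vtable}, side}; }

// The "virtual call": load vptr, load slot, call indirectly.
int area_of(const Shape* s) { return s->vptr->area(s); }
```

A recognizer could start from the constant tables of code pointers, then find the functions that store a table address into offset 0 of an object (ctor and dtor candidates), and group objects by which table they point to.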

Quote:

Originally Posted by br00t_4_c

Maybe if there was a decompiler that incorporated some kick-ass artificial intelligence that could magically analyze and emulate the personality and proclivities of the developer who wrote the code

Creating utopias
If we had such AI the programs probably wouldn't be written by humans. Humans would just assist defining the target conditions.
Then the abstraction level would be farther from the assembly level, and that AI would still not be enough. But there would be recognizable patterns in the created code and a tool could be created to display them. A lot of info would be lost, but with some patience complete parts of the original code (or target conditions) could be reconstructed manually.
Back to the ground
As the coder is (probably) a person, only another person is smart enough to recognize his/her thoughts. The automatically recognizable patterns should be shown, these are the language elements (loops, function calls etc.), but the rest should be left to the user. I know that there are some coding patterns that could be easily recognized, but it is the reverse engineer who should recognize and mark them. The best tool doesn't do everything, but what it does, it does in a reliable way you can build on.

If anyone were interested in writing the OllyDbg plugin outlined above, I would give further details on the possible recognition of the mentioned structures with pleasure.

#32 08-17-2004, 17:11

In my opinion it's possible,
but what would you convert it for?
Without understanding the algorithm, a program is like a dead body.

I don't know any program to do that, but anyone can convert it manually.

Each C++ compiler has its own way to generate code from C++ to C and then from C to asm.

So if you want to convert a program's code to C++ you must:
+ convert machine code to asm - IDA Pro can handle almost all of it
+ convert this asm to C - the hardest step @@!!!@@
+ depending on which compiler generated the code, convert this C code to C++ code - the easiest step.

hope this can help you.
regards

#33 08-18-2004, 01:06

@mihaliczaj:

On the subject of decompiler development, it would be interesting to develop an application that could transform a disassembly into source code by making statistical comparisons of a given disassembly (of unknown source) with disassembly from known source code. I'm thinking of something along the lines of a "genetic" decompiler, that is to say one that over time would be capable of some form of "learning" and would generate a more and more accurate reproduction of the original source code. If a sufficiently large database of disassembly-to-source mappings could be generated for a given set of languages, and a suitable set of "evolutionary" pattern recognition algorithms could be developed (i.e. algorithms that are either selected or discarded based on how accurately they approximate known source code from the corresponding disassembly), we might arrive at a decompiler that reconstructs source code in a manner that is true to the original. I do however believe that such an application goes way, way beyond the scope of what can be accomplished with the Olly plugin API. I think it would be more appropriate to develop it as a stand-alone app. But that's just my opinion. Holler back if you want to discuss this further.
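
The "statistical comparison" part of this idea can be pictured as a nearest-neighbour lookup. The following is only a toy sketch under loud assumptions: the database, the use of mnemonic bigrams, and the Jaccard score are all choices made here for illustration, not anything proposed in the post, and the learning/evolution step is left out entirely.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Mnemonics = std::vector<std::string>;

// A known pair: disassembly (mnemonics only) and a label for its source.
struct KnownSample {
    Mnemonics   disassembly;
    std::string source;
};

// Collect adjacent mnemonic pairs as a crude "shape" of the code.
std::set<std::pair<std::string, std::string>> bigrams(const Mnemonics& m) {
    std::set<std::pair<std::string, std::string>> out;
    for (std::size_t i = 0; i + 1 < m.size(); ++i) out.emplace(m[i], m[i + 1]);
    return out;
}

// Jaccard similarity: size of intersection over size of union, in [0, 1].
double similarity(const Mnemonics& a, const Mnemonics& b) {
    const auto sa = bigrams(a);
    const auto sb = bigrams(b);
    if (sa.empty() && sb.empty()) return 1.0;
    std::size_t common = 0;
    for (const auto& g : sa) common += sb.count(g);
    return static_cast<double>(common) / (sa.size() + sb.size() - common);
}

// Source label of the most similar known sample, or "" for an empty database.
std::string best_match(const Mnemonics& unknown, const std::vector<KnownSample>& db) {
    double best = -1.0;
    std::string result;
    for (const auto& s : db) {
        const double score = similarity(unknown, s.disassembly);
        if (score > best) { best = score; result = s.source; }
    }
    return result;
}
```

An "evolutionary" variant would keep several competing similarity functions and retain the ones that score best on samples whose source is already known.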

Shub-Nigurrath · #34 08-18-2004, 02:15

Hi,
have you ever heard of Desquirr? An IDA plugin, free, sourceforge, which tries to do exactly what you are talking about..

hxxp://desquirr.sourceforge.net/desquirr/

Quote:

Desquirr is a decompiler plugin for IDA Pro. It is currently capable of simple data flow analysis of binaries with Intel x86 machine code.

It might give some nice ideas, besides Boomerang, another free decompiling tool already mentioned in this forum..

#35 08-18-2004, 06:12

Quote:

Originally Posted by br00t_4_c

...by making statistical comparisons of a given disassembly...
...a "genetic" decompiler...
...some form of "learning"...

It can be a good approach, but there are some uncertain points. It is hard to measure the result, and it is hard to code the heuristics that recreate the original code.

What I imagine is a tool that first determines the compiler as precisely as possible, and then, knowing the compilation 'habits' of that compiler configuration, looks for information that can be retrieved in a reliable way. Otherwise we would just see false imaginings of our software; I mean it would very easily mislead itself.
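
As a rough illustration of the "determine the compiler first" step: on 32-bit x86 the classic prologue `push ebp; mov ebp, esp` has two valid encodings, and it is commonly reported that MSVC tends to emit `55 8B EC` while GCC tends to emit `55 89 E5`. The sketch below just counts both. The function names and the "msvc-like"/"gcc-like" labels are hypothetical, and a single heuristic like this is far too weak to rely on by itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Count (possibly overlapping) occurrences of a byte pattern in a code buffer.
std::size_t count_pattern(const std::vector<std::uint8_t>& code,
                          const std::vector<std::uint8_t>& pat) {
    if (pat.empty() || code.size() < pat.size()) return 0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i + pat.size() <= code.size(); ++i) {
        bool match = true;
        for (std::size_t j = 0; j < pat.size(); ++j) {
            if (code[i + j] != pat[j]) { match = false; break; }
        }
        if (match) ++hits;
    }
    return hits;
}

// Guess the toolchain from which prologue encoding dominates.
std::string guess_compiler(const std::vector<std::uint8_t>& code) {
    const std::size_t msvc_like = count_pattern(code, {0x55, 0x8B, 0xEC});
    const std::size_t gcc_like  = count_pattern(code, {0x55, 0x89, 0xE5});
    if (msvc_like == gcc_like) return "unknown";   // no evidence either way
    return msvc_like > gcc_like ? "msvc-like" : "gcc-like";
}
```

A real tool would combine many such signals (runtime library signatures, exception handling tables, linker version fields) before trusting any compiler-specific pattern.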

It is simply a false goal to get back the original source code.
What we can reach is a level between C and C++ from where we can see correspondences that were otherwise invisible. Having this information we can reproduce additional C++ parts, moving that level more and more towards C++, but I think it cannot be automated.
The best would be a disassembled C++ source code explorer tool that would enable the user to reorganize the source code.
The problem is that a simple function can be either:
* a global function
* a static function
* a class level function (static member function)
* a member function
In the asm code there is no difference.
Exploring the already visible C++ parts an experienced coder who knows the software very well may be able to make such decisions and may be able to give reasonable names for the entities.
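
A small sketch of why these cases blur together: a member function is, in effect, a function with a hidden `this` parameter, and a static member function has no hidden parameter at all. The names below are hypothetical. One caveat, stated as an assumption about typical toolchains: the calling convention can still leave a hint (32-bit MSVC, for example, usually passes `this` in ECX), so "no difference" holds for the logic but not always for the register usage.

```cpp
struct Counter {
    int value;
    int add(int n) { value += n; return value; }   // member function, hidden `this`
    static int twice(int n) { return 2 * n; }       // static member, no `this`
};

// What a decompiler might emit for Counter::add:
// the hidden `this` shows up as an ordinary first parameter.
int Counter_add(Counter* self, int n) {
    self->value += n;
    return self->value;
}

// A plain global function with the same shape as Counter::twice.
int twice(int n) { return 2 * n; }
```

Deciding that `Counter_add` "belongs" to a class is exactly the kind of judgement left to the person exploring the code.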

#36 08-18-2004, 06:27

Quote:

Originally Posted by phoenixodin2

Without understanding the algorithm, a program is like a dead body.
I don't know any program to do that, but anyone can convert it manually.

Yes, that is the point. This is what I mean. You should do this manually, but a program can help a lot with it. I think it is impossible to give an always-working automatic solution, but it is possible to give one that assists you in recreating code that has an equivalent structure.

Quote:

Originally Posted by phoenixodin2

Each C++ compiler has its own way to generate code from C++ to C and then from C to asm.

This was true for a long time, but when new language features were added it didn't work any more.
I referred to an excellent book about the details in a previous post.

Quote:

Originally Posted by phoenixodin2

+ convert this asm to C - the hardest step @@!!!@@

This is not that hard, there are existing tools for this.
C is very near to asm.

Quote:

Originally Posted by phoenixodin2

+ depending on which compiler generated the code, convert this C code to C++ code - the easiest step.

This is the hardest.
The problem is that in C++ you don't just write what the computer should do (as you do in C), but at the same time you organize the code to mirror your thoughts. There is hidden information.
Just think about public, protected and private. They are a kind of documentation and error prevention; you cannot find anything about them in the generated code. This is why the assistance of a coder is needed to recognize such things.
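
A minimal sketch of the access-specifier point, with hypothetical type names: two types holding the same data, one fully public and one with private members, end up with the same bytes in memory. The size equality is what mainstream compilers produce for this simple case, not a guarantee taken from the posts above.

```cpp
#include <cstring>

struct PointOpen {             // everything public
    int x;
    int y;
};

class PointGuarded {           // same data, but private
public:
    PointGuarded(int x, int y) : x_(x), y_(y) {}
    int x() const { return x_; }
    int y() const { return y_; }
private:
    int x_;
    int y_;
};

// Access control is checked at compile time and leaves no trace in the layout.
static_assert(sizeof(PointOpen) == sizeof(PointGuarded), "layouts expected to match");

// Compare the raw object bytes, which is all a disassembler ever sees.
bool same_bytes(const PointOpen& a, const PointGuarded& b) {
    return std::memcmp(&a, &b, sizeof(PointOpen)) == 0;
}
```

Whether a recovered field was private can therefore only be guessed from usage, for example when only the class's own functions ever touch it.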

#37 08-18-2004, 06:43

Quote:

Originally Posted by Shub-Nigurrath

...have you ever heard of Desquirr?...
...Boomerang, another free decompiling tool...

There are a lot of tools that create pseudo-C code from asm, because less information is lost there.
Ok, usually C code is also C++ code, but this is not what we want.
We know that the software has been written in C++, so we would like to get back the extra information that is in the source code.

There are some key points (vptr, dynamic_cast, static initialization...) that make it possible to discover some of the original C++ structures.

I haven't heard about a tool that tries this.

/*
Just an example of code that behaves differently in an old version of C and in C++:

Code:

int a = 2 //**/ 2
;

*/
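
For anyone who wants to check the snippet above: a C++ compiler (and a C compiler from C99 on) treats `//` as the start of a line comment, so the rest of the line is ignored and `a` is 2. A C89 compiler has no line comments and instead reads `/` followed by the empty comment `/**/`, giving `2 / 2`, so `a` is 1. The helper name below is hypothetical.

```cpp
// Compiled as C++, everything after "//" on the next line is a comment.
int a = 2 //**/ 2
;

// True when the translation unit was compiled with "//" comment support.
bool compiled_with_line_comments() { return a == 2; }
```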

#38 09-22-2004, 05:39

Hi all,

I found this thread by accident, maybe it's not that old..

My first question:
Why are you afraid of optimizations? I think that a decompiler shouldn't look for compiler-specific structures or patterns in code. Instead, it should read the _semantics_ of the code, the meaning of it. Everything that the program does is written down in the code. It will do the same whether it is optimized or not (or if not, then the difference is not important from the result's point of view). Imagine a simple intermediate language that is close to asm but has simpler constructs: mov, cmp, jumps, simple arithmetic and logical functions. The transformation from asm to this language is trivial. Then, since you know the meaning of the C constructs, you can automatically find them in this language; you just have to map them there. Of course, there will be different constructs that mean the same thing, but why should we care? It is like writing a loop with a for, or with a while and some init code.
I think that this approach could be used for this purpose.
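
Here is a very small sketch of that idea, assuming a made-up intermediate language with a handful of opcodes (the `Op`, `Instr` and `Loop` names are hypothetical). It recognizes one C construct, the loop, purely from meaning: any jump that goes backwards closes a loop, no matter which compiler produced it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical intermediate language: close to asm, but with few opcodes.
enum class Op { Mov, Add, Cmp, JumpIfLess, Jump, Ret };

struct Instr {
    Op  op;
    int target;   // destination index for jumps, -1 for everything else
};

struct Loop {
    std::size_t head;   // first instruction of the loop
    std::size_t tail;   // the jump that goes back to head
};

// A jump whose target is at or before its own position closes a loop.
std::vector<Loop> find_loops(const std::vector<Instr>& code) {
    std::vector<Loop> loops;
    for (std::size_t i = 0; i < code.size(); ++i) {
        const Instr& in = code[i];
        const bool is_jump = in.op == Op::Jump || in.op == Op::JumpIfLess;
        if (is_jump && in.target >= 0 && static_cast<std::size_t>(in.target) <= i) {
            loops.push_back({static_cast<std::size_t>(in.target), i});
        }
    }
    return loops;
}
```

Whether the result is then printed as a `for` or as a `while` with some init code is a presentation choice; the detected structure is the same.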

The second... not a question:
Someone wrote about a C++ vector or Boost template and how you could reverse them. Well, as you may know, all the template stuff is like #defines in a more advanced way. You can even instruct the compiler to give you the source after substituting everything but before compiling it. If you have the source of the templates that are possibly used in the code you wanna decompile, then you can parse the decompiled source, look for the constructs that could have been created with a template, and simply transform them back.
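
To illustrate what "transforming back" would have to match against: in the binary there is no template, only one ordinary function per instantiation. The sketch below uses a hypothetical `clamp_to` template and a hypothetical decompiler-style name `sub_401000` for the recovered `int` version; both are made up for the example.

```cpp
// Generic source, as it might appear in a library header.
template <typename T>
T clamp_to(T value, T low, T high) {
    if (value < low)  return low;
    if (high < value) return high;
    return value;
}

// Explicit instantiations: ask the compiler to emit these two concrete functions.
template int    clamp_to<int>(int, int, int);
template double clamp_to<double>(double, double, double);

// What a decompiler might hand back for the int instantiation.
int sub_401000(int value, int low, int high) {
    if (value < low)  return low;
    if (high < value) return high;
    return value;
}
```

Matching `sub_401000` to `clamp_to<int>` means comparing it with each plausible instantiation of each candidate template, and inlining can remove the function boundary altogether, so this is likely harder than the macro analogy suggests.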

Ok, it's not that easy, I know. I just thought a lot about this thing, but was too lazy to code anything... I just want to argue with you, maybe we'll come up with useful ideas in this discussion.

kp

#39 09-23-2004, 03:33

IMHO, you can certainly get something.... It is possible to extract some high-level constructs from the binary... And this information can certainly be presented as C++.... Now, you will certainly not get any variable names, no classes/methods, just plain functions (it may be possible to get some information for virtual functions, but every time the method is totally resolved at compile time, the information will not be available)... I don't think information on structures can be automatically extracted...

So what you will most likely get is a set of functions (which is not likely to be the same set of functions/methods used in the original code), accessing local and global variables, with imprecise types (some types can be inferred when the variable is used with known functions, but a variable is mostly a memory location.. can be anything).
The interesting information that can be extracted is part of the high-level constructs used in the function bodies: loops, tests... Some research projects are already capable of extracting such information, which is then used to perform some optimization directly on the binary.

Anyway, to answer the original question, i would answer yes: it is IMHO possible to decompile to C++... For the most part, you can write assembly and C in C++, so it's not a big deal. However, you will get something which is more C than C++, and which may bear only small resemblance to the original code. And doing that would already need quite some time....

#40 09-23-2004, 18:44

I'm glad you mentioned optimizing. I already thought about a method.. If you make an abstract representation of the semantics of the code (as I wrote in the first part of my previous post) then you can even transform it back to asm. Since you can describe asm with this language, you can create many representations back and choose the one that is most optimal for you. You just have to be sure that you transformed everything.
I'm not sure I can describe my idea well... so the algorithm could be like:
describe the asm instructions with these abstract statements (there could be more than one representation, of course). Then you look for these sequences in the abstract representation (it doesn't have to be strictly a sequence, there can be holes that will be represented by other instructions). Then you will have many, many asm representations that do exactly the same thing. So you can make a function that computes an optimality value for each, and choose the best.
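
The last step of that algorithm (score the equivalent candidates and keep the best) might look like the sketch below. Everything here is an assumption made for illustration: the `Candidate` type, the cost numbers, and above all the equivalence check, which only samples a few inputs and therefore proves nothing.

```cpp
#include <functional>
#include <initializer_list>
#include <string>
#include <vector>

// One candidate asm sequence for an abstract statement.
struct Candidate {
    std::string             text;       // asm text, for display only
    int                     cost;       // made-up "optimality value", lower is better
    std::function<int(int)> semantics;  // what the sequence computes
};

// Cheap agreement check against a reference on a few sample inputs.
// Sampling is not a proof of equivalence; a real tool would need one.
bool agrees(const Candidate& c, const std::function<int(int)>& reference) {
    for (int x : {-7, -1, 0, 1, 2, 1000}) {
        if (c.semantics(x) != reference(x)) return false;
    }
    return true;
}

// Cheapest candidate that agrees with the reference, or nullptr if none does.
const Candidate* pick_best(const std::vector<Candidate>& cands,
                           const std::function<int(int)>& reference) {
    const Candidate* best = nullptr;
    for (const auto& c : cands) {
        if (agrees(c, reference) && (!best || c.cost < best->cost)) best = &c;
    }
    return best;
}
```

For the statement "double the value", candidates such as `add eax, eax` and `imul eax, 2` would both pass the check, and the cost function decides between them.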

I don't know whether there is research in this direction, maybe I didn't find anything new

kp

#41 09-24-2004, 03:12

I saw an article describing something similar while working on the automatic generation of multimedia instructions (MMX, SSE and the like). While most works tend to focus on generating optimized binary (or source code using special constructs to direct the use of multimedia opcodes), one group was studying instead the possibility of applying the same kind of optimization directly on the binaries, thus using an approach similar to the one you picture.

Unfortunately, i can't find the references of the article i read... I'll post it if i find it later... In any case, it should be referenced on citeseer (hxxp://www.citeseer.com), and referenced through keywords like 'SIMD', 'vectorization' or 'multimedia instruction set', so you may want to try to look there if you are interested.

#42 09-28-2004, 17:21

It is computationally infeasible to decompile arbitrary bytecode output back to C++, except in certain (isolated) special instances, such as targeting a subset of C++, and targeting a specific architecture and compiler. In the case of VMs, such as Java and Flash, it is possible due to the (relatively limited) nature of the output bytecode. In the case of C++, the myriad of possible compilation options, optimization, debugging code, etc. complicates matters enormously; what's more, there is significant data loss through discarded variable names in the interest of size, etc. Decompilation is thus an NP-complete problem that can only be partially solved through the use of O(n^2) algorithms; that this is verifiable has been demonstrated on several occasions in respected computational and information theory journals.

Though many have striven to produce such a result, the room for leeway is slight, and it (generally) does not work on arbitrary programs, but only on specially constructed ones in a very limited way. The entire methodology is flawed; the most effective way to obtain lost source code is to glean understanding from the underlying assembly code and then recreate it using existing programming knowledge.

Sincerely,
-archaios

#43 10-04-2004, 02:14

Yes, I know decompiling is impossible...
But check this location:

hxxp://www.backerstreet.com/cg/work.htm

There are several decompiling projects and info there. Maybe some of them are useful.

gypsy

#44 10-08-2004, 10:55

You can look at uncc, which can decompile to the C language; I am working on it to enhance its abilities.
You can find uncc on the net easily.

#45 10-11-2004, 08:30

It may be rewriting, but not decompiling.
