With inline assembler, keep in mind that it heavily "corrupts" the optimization of the surrounding C code (well, at least it always did with MSVC; I don't know about Intel, but I would guess it's the same). When the compiler reaches the asm block, it's a "black box" to it, so it dumps all the register values into local variables, emits the assembler block, and then loads the register values back.
So it's better to write the whole function in inline assembler rather than just a part of it. With only a part in assembler, the result might be worse than keeping it all in C, because the rest of the function is optimized much worse than it would be without the asm block.
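To make the "whole function" idea concrete, here is a rough sketch, assuming 32-bit MSVC and its `__asm` block syntax. The function name `sum_ints` is made up for illustration, and the plain C branch is only there so the snippet still builds with other compilers or targets. It shows the shape of the approach and is not a measured comparison.

```c
/* Hypothetical helper: sums n ints starting at p.
   On 32-bit MSVC the whole loop lives in one __asm block, so no
   optimized C code is interleaved with the assembler.
   Everywhere else the plain C version is used. */
int sum_ints(const int *p, int n)
{
#if defined(_MSC_VER) && defined(_M_IX86)
    int total;
    __asm {
        mov  ecx, n          /* loop counter */
        mov  edx, p          /* current element pointer */
        xor  eax, eax        /* running sum */
        test ecx, ecx
        jle  sum_done        /* nothing to do for n <= 0 */
    sum_loop:
        add  eax, [edx]
        add  edx, 4
        dec  ecx
        jnz  sum_loop
    sum_done:
        mov  total, eax
    }
    return total;
#else
    int total = 0;
    int i;
    for (i = 0; i < n; ++i)
        total += p[i];
    return total;
#endif
}
```

Calling `sum_ints` on an array holding 1, 2, 3, 4 with `n = 4` gives 10 on either branch. Whether the assembler version is actually faster than what the optimizer produces from the C loop is something you would have to measure.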
It would be interesting to know what led Microsoft to ban inline assembler in x64...