Well, I probably shouldn't have written "whole function"... let's say the whole block of code where performance actually matters.
Sure, you can do the basic parameter checking in C, declare the local variables... and then use inline asm to do the real work. That way you don't have to worry about stack allocation and you still have access to the fields of C structures... and x64 asm is basically the same as x86 (more and wider registers, mostly the same instructions), so I wouldn't call it difficult.
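To make that concrete, here is a rough sketch of the split I mean. It assumes GCC or Clang extended inline asm (AT&T syntax) on x86-64; other compilers use a different inline asm syntax or don't support it at all on x64, so there is a plain C fallback. The `struct buffer` type and the `buffer_sum` function are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure, only here for illustration. */
struct buffer {
    const uint32_t *data;
    size_t len;
};

/* Sums buf->data[0 .. len-1] into *out.
 * Returns 0 on success, -1 on bad arguments. */
int buffer_sum(const struct buffer *buf, uint64_t *out)
{
    /* Basic parameter checking stays in C. */
    if (buf == NULL || out == NULL || (buf->data == NULL && buf->len != 0))
        return -1;

    /* Locals are ordinary C variables: the compiler decides whether they
     * live on the stack or in registers, and the struct fields are read
     * with normal C syntax. */
    const uint32_t *p = buf->data;
    size_t n = buf->len;
    uint64_t sum = 0;

#if defined(__GNUC__) && defined(__x86_64__)
    if (n != 0) {
        uint64_t tmp;
        __asm__ volatile(
            "1:\n\t"
            "movl (%[p]), %k[tmp]\n\t"  /* 32-bit load, zero-extended to 64 bits */
            "addq %[tmp], %[sum]\n\t"   /* accumulate in a 64-bit register */
            "addq $4, %[p]\n\t"         /* advance to the next element */
            "decq %[n]\n\t"             /* count down, sets ZF at zero */
            "jnz 1b\n\t"
            : [sum] "+r"(sum), [p] "+r"(p), [n] "+r"(n), [tmp] "=&r"(tmp)
            :
            : "cc", "memory");
    }
    /* The asm loop must have consumed every element. */
    assert(n == 0);
#else
    /* Portable fallback for other compilers and architectures. */
    for (size_t i = 0; i < n; i++)
        sum += p[i];
#endif

    *out = sum;
    return 0;
}
```

The asm block only names C variables through its operand list, so the compiler picks the registers and there is no manual stack handling. A caller would fill in a `struct buffer`, call `buffer_sum(&b, &total)`, and check for a 0 return before using `total`.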