Optimizing a C function call using 64-bit MASM

vengy

Currently using this 64-bit MASM code to call a C runtime function such as memcmp(). I recall this convention was from a GoAsm article on optimizations.

              memcmp          PROTO;:QWORD,:QWORD,:QWORD
              PUSH            RSP
              PUSH            QWORD PTR [RSP]
              AND             SPL,0F0h
              MOV             R8,R11
              MOV             RDX,R10
              MOV             RCX,RAX
              SUB             RSP,32
              CALL            memcmp
              LEA             RSP,[RSP+40]
              POP             RSP

Is this a valid optimized version below?

              memcmp          PROTO;:QWORD,:QWORD,:QWORD
              PUSH            RSP
              PUSH            QWORD PTR [RSP]
              AND             RSP,-16        ; new
              MOV             R8,R11
              MOV             RDX,R10
              MOV             RCX,RAX
              LEA             RSP,[RSP-32]   ; new
              CALL            memcmp
              LEA             RSP,[RSP+40]
              POP             RSP                  
              

The justification for replacing

              AND             SPL,0F0h
              

with

              AND             RSP,-16
              

is that it avoids invoke partial register updates. Understanding fastcall stack frame

Replacing

              SUB             RSP,32
              

with

              LEA             RSP,[RSP-32]

is that ensuing instructions do not depend on the flags being updated by the subtraction
then not updating the flags will be more efficient as well.
Why does GCC emit "lea" instead of "sub" for subtraction?

In this case, are there other optimization tricks too?

Peter Cordes

AND yes, the original code was silly and not saving any code-size (SPL takes a REX prefix, too, like 64-bit operand-size).

LEA - pointless and a waste of code-size: x86 CPUs already avoid false dependencies on FLAGS via register renaming; that's necessary to efficiently run normal x86 code which is full of instructions like add, sub, and, etc. Compilers would use lea much more heavily if that wasn't the case. The answer on that linked Q&A is wrong and should be downvoted / deleted. The only danger is on a few less-common CPUs (Pentium 4 and Silvermont for different reasons) from instructions like inc that only write some flags. (INC instruction vs ADD 1: Does it matter?). Even the cost of inc on Silvermont-family is pretty minor, just an extra uop but not during decode, so it doesn't stall.

add is not slower than lea on any CPUs, either itself or in its influence on later instructions. (Except in-order Atom pre-Silvermont, where lea ran earlier in the pipeline than add (on an actual AGU), so it could be better or worse depending on where data was coming from / going to). You'd only use lea in some cases like an adc loop where you actually need to keep CF unchanged so next iteration can read it. i.e. to not mess up a true dependency (RAW), nothing to do with avoiding a false (WAW) output dependency. (See Problems with ADC/SBB and INC/DEC in tight loops on some CPUs - note that cases where adc / inc / adc creates a partial-flag stall are cases where add would cause a correctness problem, so I'm not counting that as a case where add would make later instructions faster.)


You probably don't need to save the old RSP; the ABI requires 16-byte stack alignment before a call, and that includes your caller (unless you're getting called from code that doesn't follow the ABI, so you don't have known RSP alignment relative to a 16-byte boundary).

Normally you'd just do sub rsp, 40 like a compiler would, to realign RSP and reserve space for the shadow space. (And you'd do this at the top/bottom of the function, not around every call, along with saving/restoring call-preserved registers).


(In practice memcmp is unlikely to care about stack alignment, unless it needs to save/restore some more XMM regs. The Windows x64 calling convention unwisely only has 6 call-clobbered x/ymm registers, and that might be slightly tight depending on how much loop unrolling they do in a hand-written(?) memcmp.)

And even if you did need to handle an unknown incoming RSP alignment, saving RSP to two different locations for pop rsp is still not a very efficient way to go about it. Normally you'd just use RBP to make a traditional frame pointer to clean up with mov rsp, rbp / pop rbp, which works regardless of unknown adjustment to RSP. e.g. even in functions that use alloca (or in asm, that do an unknown number of pushes or variable-sized sub rsp, which is effectively the same thing as and rsp, -16).

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related