C - AMD64 -- nopw assembly instruction?
In some compiler output, I'm trying to understand how the machine-code encoding of the nopw instruction works:
00000000004004d0 <main>:
  4004d0: eb fe                   jmp    4004d0 <main>
  4004d2: 66 66 66 66 66 2e 0f    nopw   %cs:0x0(%rax,%rax,1)
  4004d9: 1f 84 00 00 00 00 00
There is some discussion of "nopw" at http://john.freml.in/amd64-nopl. Can anybody explain the meaning of 4004d2-4004e0? From looking at the opcode list, it seems that 66 .. codes are multi-byte expansions. I feel I could probably get a better answer here than I would on my own, unless I tried to grok the opcode list for a few hours.
That asm output is for the following (insane) code in C, which optimizes down to a simple infinite loop:
long i = 0;
main() { recurse(); }
recurse() { i++; recurse(); }
When compiled with gcc -O2, the compiler recognizes the infinite recursion and turns it into an infinite loop; it does this so well, in fact, that it loops in main() without ever calling the recurse() function.
Editor's note: padding between functions with NOPs isn't specific to infinite loops. Here's a set of functions with a range of lengths of NOPs, on the Godbolt compiler explorer.
The 0x66 bytes are an "operand-size override" prefix. Having more than one of these is equivalent to having one.
The 0x2e is a 'null prefix' in 64-bit mode (it's a CS: segment override otherwise, which is why it shows up in the assembly mnemonic).
0x0f 0x1f is a 2-byte opcode for a NOP that takes a ModRM byte.
0x84 is a ModRM byte which in this case codes for an addressing mode that uses 5 more bytes (mod=10, rm=100: a SIB byte followed by a 32-bit displacement).
Some CPUs are slow to decode instructions with many prefixes (e.g. more than three), so a ModRM byte that specifies a SIB + disp32 is a much better way to use up an extra 5 bytes than 5 more prefix bytes.
From the AMD K8 decoders section of Agner Fog's microarch PDF:
Each of the instruction decoders can handle three prefixes per clock cycle. This means that three instructions with three prefixes each can be decoded in the same clock cycle. An instruction with 4 - 6 prefixes takes an extra clock cycle to decode.
Essentially, those bytes are part of one long NOP instruction that will never be executed anyway. It's in there to ensure that the next function is aligned on a 16-byte boundary, because the compiler emitted a .p2align 4 directive, so the assembler padded with a NOP. gcc's default for x86 is -falign-functions=16. For NOPs that will be executed, the optimal choice of long NOP depends on the microarchitecture. For a microarchitecture that chokes on many prefixes, like Intel Silvermont or AMD K8, two NOPs with 3 prefixes each might have decoded faster.
The blog article the question linked to ( http://john.freml.in/amd64-nopl ) explains why the compiler uses a complicated single NOP instruction instead of a bunch of single-byte 0x90 NOP instructions.
You can find the details on the instruction encoding in AMD's tech ref documents, mainly in the "AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions". I'm sure Intel's technical references for the x64 architecture will have the same information (and might even be more understandable).