c - AMD64 -- nopw assembly instruction? -


In this compiler output, I'm trying to understand how the machine-code encoding of the nopw instruction works:

    00000000004004d0 <main>:
      4004d0:       eb fe                   jmp    4004d0 <main>
      4004d2:       66 66 66 66 66 2e 0f    nopw   %cs:0x0(%rax,%rax,1)
      4004d9:       1f 84 00 00 00 00 00

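There is some discussion about "nopw" at http://john.freml.in/amd64-nopl. Can anybody explain the meaning of 4004d2-4004e0? From looking at the opcode list, it seems that the 66 .. codes are multi-byte expansions. I feel I could probably get a better answer here than I would on my own, unless I tried to grok the opcode list for a few hours.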


That asm output is for the following (insane) code in C, which optimizes down to a simple infinite loop:

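    long i = 0;

    void recurse(void);   /* forward declaration so main() can call it */

    int main(void) {
        recurse();
    }

    void recurse(void) {
        i++;
        recurse();        /* unconditional self-call: infinite recursion */
    }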

When compiled with gcc -O2, the compiler recognizes the infinite recursion and turns it into an infinite loop; it does this so well, in fact, that it loops in main() without ever calling the recurse() function.


Editor's note: padding functions with NOPs isn't specific to infinite loops. Here's a set of functions with a range of lengths of NOPs, on the Godbolt compiler explorer.

The 0x66 bytes are an "operand-size override" prefix. Having more than one of these is equivalent to having one.

The 0x2e is a 'null prefix' in 64-bit mode (it's a CS: segment override otherwise, which is why it shows up in the assembly mnemonic).

0x0f 0x1f is a 2-byte opcode for a NOP that takes a ModRM byte.

0x84 is the ModRM byte, which in this case encodes an addressing mode that uses 5 more bytes (a SIB byte plus a 32-bit displacement).
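The byte-by-byte breakdown above can be checked mechanically. The following is a minimal sketch, not a general x86 decoder: it only recognizes the prefixes discussed here (0x66 and 0x2e), the 0f 1f opcode, and the one ModRM form (mod=10, rm=100) that is followed by a SIB byte and a disp32. The `struct long_nop` type and the `decode_long_nop` function are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the fields of one long NOP. */
struct long_nop {
    size_t  n_prefixes;         /* number of 0x66 / 0x2e bytes before the opcode */
    uint8_t mod, reg, rm;       /* the three ModRM fields */
    uint8_t scale, index, base; /* SIB fields; scale is the multiplier 1/2/4/8 */
    int32_t disp;               /* 32-bit displacement, little-endian in memory */
    size_t  length;             /* total instruction length in bytes */
};

/* Returns 0 on success, -1 if the bytes are not the NOP form handled here. */
static int decode_long_nop(const uint8_t *p, size_t n, struct long_nop *out)
{
    size_t i = 0;
    assert(p != NULL && out != NULL);

    /* Skip the operand-size (0x66) and CS-override (0x2e) prefixes. */
    while (i < n && (p[i] == 0x66 || p[i] == 0x2e))
        i++;
    out->n_prefixes = i;

    /* Two-byte opcode 0f 1f, then ModRM, SIB and 4 displacement bytes. */
    if (n - i < 2 + 1 + 1 + 4 || p[i] != 0x0f || p[i + 1] != 0x1f)
        return -1;
    i += 2;

    uint8_t modrm = p[i++];
    out->mod = modrm >> 6;
    out->reg = (modrm >> 3) & 7;
    out->rm  = modrm & 7;
    if (out->mod != 2 || out->rm != 4)  /* only the SIB + disp32 form */
        return -1;

    uint8_t sib = p[i++];
    out->scale = (uint8_t)(1u << (sib >> 6));
    out->index = (sib >> 3) & 7;
    out->base  = sib & 7;

    uint32_t d = (uint32_t)p[i] | ((uint32_t)p[i + 1] << 8) |
                 ((uint32_t)p[i + 2] << 16) | ((uint32_t)p[i + 3] << 24);
    out->disp = (int32_t)d;
    i += 4;

    out->length = i;
    return 0;
}
```

Fed the 14 bytes starting at 4004d2, this reports 6 prefixes, mod=2, reg=0, rm=4, scale 1 with index and base both register 0 (rax), a displacement of 0, and a total length of 14, which matches the `%cs:0x0(%rax,%rax,1)` operand in the disassembly.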

Some CPUs are slow to decode instructions with many prefixes (e.g. more than three), so a ModRM byte that specifies a SIB + disp32 is a much better way to use up an extra 5 bytes than 5 more prefix bytes.

On the AMD K8 decoders, from Agner Fog's microarch pdf:

Each of the instruction decoders can handle three prefixes per clock cycle. This means that three instructions with three prefixes each can be decoded in the same clock cycle. An instruction with 4 - 6 prefixes takes an extra clock cycle to decode.
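As a toy model of the quoted rule only (not a simulation of the real decoder), the penalty can be written as a small function. The name `k8_extra_decode_cycles` is hypothetical, and the function deliberately returns -1 for more than 6 prefixes, because the quote says nothing about that case.

```c
/* Extra decode cycles on AMD K8 according to the quoted rule:
 * up to 3 prefixes cost nothing extra, 4 to 6 cost one extra cycle.
 * Returns -1 for counts the quote does not cover (more than 6). */
static int k8_extra_decode_cycles(unsigned n_prefixes)
{
    if (n_prefixes <= 3)
        return 0;
    if (n_prefixes <= 6)
        return 1;
    return -1;
}
```

The nopw in the question carries six prefix bytes (five 0x66 and one 0x2e), so under this rule it would fall in the 4 - 6 bucket and cost one extra decode cycle if it were ever executed on K8.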


Essentially, those bytes are one long NOP instruction that will never be executed anyway. It's in there to ensure that the next function is aligned on a 16-byte boundary, because the compiler emitted a .p2align 4 directive, so the assembler padded with a NOP. gcc's default for x86 is -falign-functions=16. For NOPs that will be executed, the optimal choice of long NOP depends on the microarchitecture. On a microarchitecture that chokes on many prefixes, like Intel Silvermont or AMD K8, two NOPs with 3 prefixes each might have decoded faster.
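The amount of padding follows directly from the address where main's code ends. A minimal sketch of the arithmetic, assuming a hypothetical helper `p2align_padding` that takes an address and the power-of-two exponent used by .p2align:

```c
#include <assert.h>
#include <stdint.h>

/* Number of padding bytes needed to move addr up to the next multiple
 * of 2^p2, which is what a ".p2align p2" directive asks for.
 * Returns 0 when addr is already aligned. */
static uint64_t p2align_padding(uint64_t addr, unsigned p2)
{
    assert(p2 < 64);
    uint64_t align = (uint64_t)1 << p2;
    return (align - (addr & (align - 1))) & (align - 1);
}
```

The 2-byte jmp occupies 4004d0-4004d1, so padding starts at 0x4004d2. `p2align_padding(0x4004d2, 4)` gives 14, which is exactly the length of the single nopw, and the next function then starts at 0x4004e0.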

The blog article the question linked to ( http://john.freml.in/amd64-nopl ) explains why the compiler uses a complicated single NOP instruction instead of a bunch of single-byte 0x90 NOP instructions.

You can find the details on the instruction encoding in AMD's tech ref documents, mainly in the "AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions". I'm sure Intel's technical references for the x64 architecture will have the same information (and might even be more understandable).

