The Intel 80386, part 16: Code walkthrough

This post has been republished via RSS; it originally appeared at: MSDN Blogs.

Let's put into practice what we've learned so far
by walking through a simple function and studying
its disassembly.



#define _lock_str(s)                    _lock(s+_STREAM_LOCKS)

#define _unlock_str(s)                  _unlock(s+_STREAM_LOCKS)
extern FILE _iob[];
int fclose(FILE *stream)

{

    int result = EOF;
    if (stream->_flag & _IOSTRG) {

        stream->_flag = 0;

    } else {

        int index = stream - _iob;

        _lock_str(index);

        result = _fclose_lk(stream);

        _unlock_str(index);

    }
    return result;

}

This is a function from the C runtime library,
so the functions use the __cdecl
calling convention.
This means that the parameters are pushed
right-to-left,
and the caller is responsible for cleaning
them from the stack.



_fclose:

    push    ebx

    push    esi

    push    edi

This code was compiled back in the days when
frame pointer omission was fashionable.
The function does not create a traditional
stack frame with the ebp register
acting as frame pointer.

The 80386 calling convention says that
the ebx,
esi,
the edi,
and
ebp registers must be
preserved across the call.



    mov     esi,dword ptr [esp+10h] ; esi = stream

We will be using the stream
variable a lot, so we'll load it into a register
for convenient access.



; int result = EOF;

    mov     edi,0FFFFFFFFh          ; edi = result = EOF

The other variable is result,
which we will keep in the edi register,
and we set it to its initial value of −1.
This is a straight MOV instruction,
which is five-byte encoding (one opcode byte
plus a four-byte immediate).
A smaller encoding would have been
or edi, -1,
which uses two bytes for the opcode and
one for the 8-bit signed immediate.
But the smaller encoding comes at a perfornance
cost because it creates a false dependency
on the edi register.
(Mind you, the 80386 did not have out-of-order
execution, so dependencies really aren't a factor yet.)



; if (stream->_flag & _IOSTRG) {

    test    byte ptr [esi+0Ch],40h  ; is this a string?

    je      not_string              ; N: then need a true flush

Even though _flag is a 32-bit field,
we use a byte test to save code size.
This takes advantage of the fact that testing
a single bit can be done by testing a single bit
in a 32-bit field,
or by testing a single bit in an 8-bit subfield.
The _flag field is at offset 0Ch,
and the value of _IOSTRG is
0x40,
so the bit we want is in the first byte.

We learned some time ago that this size
optimization defeats the

store-to-load forwarder,
but the 80386 didn't have a store-to-load forwarder,
so that wasn't really a factor.



; stream->_flag = 0;

    mov     dword ptr [esi+0Ch],0

Again, the compiler chooses a full 32-bit immediate
instead of using a smaller instruction.
An alternative would have been
and dword ptr [esi+0Ch], 0,
using a sign-extended 8-bit immediate instead
of a 32-bit immediate,
but at a cost of incurring a read-modify-write
rather than simply a write.



; return result;

    mov     eax,edi                 ; eax = return value

    pop     edi

    pop     esi

    pop     ebx

    ret

The compiler chose to inline the common
return instruction into this
branch of the if statement.
The value being returned is in the result
variable, which we had enregistered in
the edi register.
The return value goes in the eax
register, so we move it there.
And then we restore the registers we had
saved on the stack
and return to the caller.
Since this function uses the
__cdecl calling convention,
the function does no stack cleanup;
it is the caller's responsibility to clean the stack.

nop

This nop instruction is padding to
bring the next instruction, a jump target,
to an address that is a multiple of 16.
The 80386 fetches instructions in 16-byte chunks,
and putting jump targets at the start of a 16-byte
chunk means that all of the fetched bytes are
potentially executable.



not_string:

; int index = stream - _iob;

    mov     ebx,esi                 ; ebx = stream

    sub     ebx,77E243F0h           ; ebx = stream - _iob (byte offset)

    sar     ebx,5                   ; ebx = stream - _iob (element offset)

This sequence of instructions calculates the value
for the index local variable,
which the compiler chose to enregister in the
ebx register.
We start with the value in the esi
register, which is the stream variable.
Next, we subtract the offset of the _iob
variable, which is a global variable, so its
address looks like a constant in the code stream.
We then take that byte offset and shift it
right by 5, which means dividing by 32,
which is the size of a FILE structure
in this particular implementation.
The result now sits in the ebx register.



; _lock_str(index) ⇒ _lock(index+_STREAM_LOCKS)

    add     ebx,19h                 ; add _STREAM_LOCKS

    push    ebx                     ; the sole parameter

    call    _lock                   ; call the function

    add     esp,4                   ; clean stack arguments

The _lock_str macro is a wrapper around
the _lock function.
We add STREAM_LOCKS,
which happens to be 25, or 0x19,
and the push it onto the stack as the sole parameter
for the _lock function.
Since this is a __cdecl function,
it is the caller's responsibility to clean the stack,
so we add 4 (the number of bytes of parameters)
to the esp register to drop them from the stack.



; result = _fclose_lk(stream)

    push    esi                     ; the sole parameter

    call    _fclose_lk              ; call the function

    add     esp,4                   ; clean stack arguments

    mov     edi,eax                 ; save in edi = result

Another function call:
We push the sole parameter, call the function,
and clean the stack.
The return value was placed in the eax
register, so we move it into the edi
register, which we saw represents the result
variable.



; _unlock_str(index) ⇒_unlock(index+_STREAM_LOCKS)

    push    ebx                     ; the sole parameter

    call    _unlock                 ; call the function

    add     esp,4                   ; clean stack arguments

The compiler realized it could pull out the common subexpression
s+_STREAM_LOCKS and stored the value of
that subexpression in the ebx register.
It could therefore push the precomputed value
(helpfully saved in the ebx register)
as the parameter for the _lock function.



; return result;

    mov     eax,edi                 ; eax = return value

    pop     edi

    pop     esi

    pop     ebx

    ret

And this is the same code we saw last time.
The return value (result) is moved
to the eax register, which is where
the __cdecl calling convention
places it.
We then restore the registers we had saved at entry
and return to our caller,
leavving our caller to clean the stack parameters.

The resulting function size is 81 bytes.

Okay, now let's see how we could optimize this
function further.
Let's look closely at the calculation of
index + _STREAM_LOCKS.



    mov     ebx,esi                 ; ebx = stream

    sub     ebx,77E243F0h           ; ebx = stream - _iob (byte offset)

    sar     ebx,5                   ; ebx = stream - _iob (element offset)

    add     ebx,19h                 ; add _STREAM_LOCKS

The first thing you might think of is combining the
first two instructions into a single LEA
instruction:



    lea     ebx,[esi+881dbc10h]     ; ebx = stream - _iob (byte offset)

The LEA instruction lets us perform
an addition operation in a single instruction
by taking advantage of the effective address computation
circuitry in the memory unit.
The operation we want to perform is subtraction of
a constant,
which we can transform into an addition of the negative
of that constant.

Unfortunately, the trick doesn't work in this case
because the "constant" is a relocatable address,
and there is no loader fixup type for "negative of
the address of a variable."

But all is not lost.
There's another trick we could use:
Fold in the subsequent addition.

`ebx`	= ((`esi` − 77E243F0h) >> 5) + 19h
	= ((`esi` − 77E243F0h) >> 5) + (320h >> 5)
	= (`esi` − 77E243F0h + 320h) >> 5
	= (`esi` − 77E240D0h) >> 5

Another way to do this calculation:

`adjusted_index`	`= stream - _iob + 0x19`
	`= stream - (_iob - 0x19)`
	`= stream - &_iob[-0x19]`

Either way, the result is this:



    mov     ebx,esi                 ; ebx = stream

    sub     ebx,77E240D0h           ; ebx = stream - &_iob[-0x19] (byte offset)

    sar     ebx,5                   ; ebx = stream - &_iob[-0x19] (element offset)

Another observation is that
stream
and
result
do not have overlapping useful lifetimes.
The useful lifetime of result
doesn't start until it receives the value from
_fclose_lk.
Prior to that, its value is known at compile time
to be EOF,
so there's no need to devote a register to it.

And we can combine the
add esp, 4 with the subsequent push
(which decrements the esp register)
by simply storing the new value into the top-of-stack slot.

The case of a string-based stream does not use the
ebx register, so we can use a technique
know as shrink-wrapping, where we start with
one stack frame, and then expand it to a larger
one on certain code paths.
In this case, we start by saving only
the esi register,
and then later save the ebx register
only if we realize that we need it.

A simple size/speed optimization (in favor of size)
is to use the pop instruction to pop a value
off the stack (and ignore it).
This replaces a three-byte add esp,4
with a one-byte register pop.

A very aggressive size optimization would be to replace
the two-byte instructions
mov eax, r or
mov r, eax
with the one-byte
xchg eax, r instruction.
This assumes you need to move the value into or out of
the eax register and you don't care about the
source any more.

Finally, a string-based stream is quite
uncommon (and certainly the case of closing
a string-based stream),
so we'll make that the out-of-line case,
and we won't bother optimizing the fetch of the
jump target for the same reason.



_fclose:

    push    esi                     ; save register

    mov     esi,dword ptr [esp+0Ch] ; esi = stream

    test    byte ptr [esi+0Ch],40h  ; Is this an _IOSTRG?

    jnz     is_string                  
    push    ebx                     ; shrink-wrap

    mov     ebx,esi

    sub     ebx,77E240D0h           ; ebx = stream - &_iob[-0x19] (byte offset)

    sar     ebx,5                   ; ebx = index + _STREAM_LOCKS

    push    ebx

    call    _lock                   ; call the function
    mov    [esp],esi                ; parameter for _fclose_lk

    call   _fclose_lk               ; close the stream
    mov    [esp],ebx                ; parameter for _unlock

    mov    ebx,eax                  ; ebx = result

    call   _unlock
    pop    eax                      ; clean the stack once

    mov    eax,ebx                  ; eax = result

    pop    ebx

    pop    esi

    ret
is_string:

    mov    dword ptr [esi+0Ch],0    ; stream->_flag = 0

    or     eax,-1                   ; return EOF

    pop    esi

    ret

This reduces the function size to 65 bytes.

Yet another trick is to pre-push the parameters
for multiple function calls.



_fclose:

    mov     ecx,dword ptr [esp+8]   ; ecx = stream

    test    byte ptr [ecx+0Ch],40h  ; Is this an _IOSTRG?

    jnz     is_string               ; Y: handle strings out of line
    push    ebx                     ; shrink-wrap

    mov     ebx,ecx

    sub     ebx,77E240D0h           ; ebx = stream - &_iob[-0x19] (byte offset)

    sar     ebx,5                   ; ebx = index + _STREAM_LOCKS

    push    ecx                     ; push for _fclose_lk

    push    ebx                     ; push for _lock

    call    _lock                   ; call the function

    pop     eax                     ; discard arg to _lock

    call    _fclose_lk              ; close the stream

    mov     dword ptr [esp],ebx     ; parameter for _unlock

    mov     ebx,eax                 ; save result

    call    _unlock

    pop     eax                     ; discard arg to _unlock

    mov     eax,ebx                 ; recover result

    pop     ebx

    ret
is_string:

    mov    dword ptr [ecx+0Ch],0    ; stream->_flag = 0

    or     eax,-1                   ; return EOF

    ret

This brings us down to 57 bytes.

If we abandon the idea of
enregistering the result,
we can do this:



_fclose:

    mov     ecx,dword ptr [esp+8]   ; ecx = stream

    test    byte ptr [ecx+0Ch],40h  ; Is this an _IOSTRG?

    jnz     is_string               ; Y: handle strings out of line
    mov     eax,ecx

    sub     eax,77E240D0h           ; ebx = stream - &_iob[-0x19] (byte offset)

    sar     eax,5                   ; ebx = index + _STREAM_LOCKS

    push    ecx                     ; garbage (for future result)

    push    eax                     ; push for _unlock

    push    ecx                     ; push for _fclose_lk

    push    eax                     ; push for _lock

    call    _lock                   ; call the function

    pop     eax                     ; discard arg to _lock

    call    _fclose_lk              ; close the stream

    mov     dword ptr [esp+0Ch],eax ; save result

    pop     eax                     ; discard arg to _lock

    call    _unlock

    pop     eax                     ; discard arg to _unlock

    pop     eax                     ; recover result

    ret
is_string:

    mov    dword ptr [ecx+0Ch],0    ; stream->_flag = 0

    or     eax,-1                   ; return EOF

    ret

But this comes out to 59 bytes.

Leave a Reply Cancel reply