• Welcome to Jose's Read Only Forum 2023.
 

Reverse-Polish (Postfix) Notaton and X86 Assembler

Started by Charles Pegge, August 04, 2007, 01:19:26 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Charles Pegge


This shows how easily it translates.



FreeBasic

' POSTFIX or REVERSE-POLISH NOTATION AND x86 ASSEMBLER
' Charles E V Pegge
' 04 Aug 2007

' FREEBASIC


' FLOATING POINT EXAMPLE


dim as double ans, a=1, b=2, c=3, d=4

' ans=(a+b)*(c+d)      ' conventional notation
                       '
' a b + c d + * to ans ' postfix notation
'======================'
asm
fld   qword ptr [a]   ' push A onto FPU stack
fadd qword ptr  [b]   ' add  B
fld   qword ptr [c]   ' push C onto FPU stack
fadd  qword ptr [d]   ' add D
fmulp           st(1) ' multiply the 2 results and pop the stack
fstp qword ptr  [ans] ' store the result and pop the stack
end asm                '
'======================'

print ans

Petr Schreiber

#1
Hi Charles,

nice samples. RPN is very nice.
Here my two tries:

First for PB/WIN 8.03:

#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG

  #REGISTER NONE

  LOCAL a, b, c, d AS DOUBLE
  LOCAL ans AS DOUBLE
 
  a = 1
  b = 2
  c = 3
  d = 4
 

  ' ans=(a+b)*(c+d)      ' conventional notation
  ' a b + c d + * to ans ' postfix notation
  '======================'
  ! fld a                ; push A onto FPU stack
  ! fadd b               ; add  B
  ! fld c                ; push C onto FPU stack
  ! fadd d               ; add D
  ! fmulp st(1), st(0)   ; multiply the 2 results and pop the stack
  ! fstp ans             ; store the result and pop the stack
  '======================'

  MSGBOX STR$(ans)

END FUNCTION


Second for HP-42S ;D

RCL "A"
RCL+ "B"
RCL "B"
RCL+ "D"
*
STO "ANS"



Thanks,
Petr
AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Charles Pegge

Thanks Petr,

Here is the same thing using long integers on the CPU.

The first block shows a very literal translation of RPN into x86 assembler. This is not as futile as it looks because this strategy may be applied to expressions of almost unlimited complexity on just about any CPU that can add, multiply and has a stack. But when there are general purpose registers available then optimising the code can make a big difference as you can see in the second block.

FreeBasic


' LONG INTEGER EXAMPLE


dim as long ans, a=1, b=2, c=3, d=4

' ans=(a+b)*(c+d)      ' conventional notation
                       '
' a b + c d + * to ans ' postfix notation
'

'========================='
' RIGID INTERPRETATION
' no regs used for storing
' only for doing operations
'========================='
asm                       '
push dword ptr [a]       ' load A onto stack
push dword ptr [b]       ' load B onto stack
'-------------------------'
pop  ecx                 ' pop operand
pop  eax                 ' pop accummulator
add  eax,ecx             ' do op
push eax                 ' leave result on stack
'-------------------------'
push dword ptr [c]       ' load C onto stack
push dword ptr [d]       ' load D onto stack
'-------------------------'
pop  ecx                 ' pop operand
pop  eax                 ' pop accummulator
add  eax,ecx             ' do op
push eax                 ' leave result on stack
'-------------------------'
pop  ecx                 ' pop operand
pop  eax                 ' pop accummulator
mul  ecx                 ' do op
push eax                 ' leave result on stack
'-------------------------'
pop  eax                 ' pop result
mov [ans],eax            ' store it
end asm                   '
'========================='

print ans


' OPTIMISED INTERPRETATION
'========================'
asm                      '
mov  ecx, [a]           ' load A into ecx accummulator
add  ecx, [b]           ' add  B to it
mov  eax, [c]           ' load C into into eax accummulator
add  eax, [d]           ' add  D to it
mul  ecx                ' multiply the two accumulators (result in eax)
mov  [ans],eax          ' save result
end asm                  '
'========================'

print ans



Petr Schreiber

#3
Hi,

here comes again PB/WIN version

#COMPILE EXE
#DIM ALL

DECLARE FUNCTION GetTickCount LIB "KERNEL32.DLL" ALIAS "GetTickCount" () AS DWORD

FUNCTION PBMAIN () AS LONG
 
  #REGISTER NONE
  LOCAL t1, t2 AS DWORD
  LOCAL i AS LONG
  LOCAL a, b, c, d AS LONG
  LOCAL ans AS LONG

  a = 1
  b = 2
  c = 3
  d = 4

  ' ans=(a+b)*(c+d)      ' conventional notation
                         '
  ' a b + c d + * to ans ' postfix notation
  '

  '========================='
  ' RIGID INTERPRETATION
  ' no regs used for storing
  ' only for doing operations
  '========================='
  t1 = GetTickCount
  FOR i = 1 TO 1000000000
 
    ! push a                   ' load A onto stack
    ! push b                   ' load B onto stack
    '-------------------------'
    ! pop  ecx                 ' pop operand
    ! pop  eax                 ' pop accummulator
    ! add  eax,ecx             ' do op
    ! push eax                 ' leave result on stack
    '-------------------------'
    ! push c                   ' load C onto stack
    ! push d                   ' load D onto stack
    '-------------------------'
    ! pop  ecx                 ' pop operand
    ! pop  eax                 ' pop accummulator
    ! add  eax,ecx             ' do op
    ! push eax                 ' leave result on stack
    '-------------------------'
    ! pop  ecx                 ' pop operand
    ! pop  eax                 ' pop accummulator
    ! mul  ecx                 ' do op
    ! push eax                 ' leave result on stack
    '-------------------------'
    ! pop  eax                 ' pop result
    ! mov  ans,eax             ' store it
  NEXT
  t2 = GetTickCount

  MSGBOX "ASM General:"+STR$((t2-t1)/1000)+"s, result:"+STR$(ans)

  ' OPTIMISED INTERPRETATION
  '========================'
                     '
  t1 = GetTickCount
  FOR i = 1 TO 1000000000
                     
    ! mov  ecx, a             ' load A into ecx accummulator
    ! add  ecx, b             ' add  B to it
    ! mov  eax, c             ' load C into into eax accummulator
    ! add  eax, d             ' add  D to it
    ! mul  ecx                ' multiply the two accumulators (result in eax)
    ! mov  ans, eax           ' save result

  NEXT
  t2 = GetTickCount


  MSGBOX "ASM Optimized:"+STR$((t2-t1)/1000)+"s, result:"+STR$(ans)
 
  PB without ASM "suggestions"
  '========================' 
  t1 = GetTickCount
  FOR i = 1 TO 1000000000
    ans = (a+b)*(c+d)
  NEXT
  t2 = GetTickCount

  MSGBOX "PB Classic:"+STR$((t2-t1)/1000)+"s, result:"+STR$(ans)

END FUNCTION


I presume I cannot use EBX as PB uses it already ?
I have added #REGISTER NONE to not get in troubles too :)

It is interesting to watch the speed difference.
General approach takes ~12.7 seconds for 1 000 000 000 iterations, while optimized just 3.5 seconds, which is quite the same as PB does by optimizing high-level expression.


Thanks,
Petr
AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Theo Gottwald

Have a Hint on EBX for you. Not really from me, but directly from Bob Zale, copied from the PB Forum:

QuoteActually, the register preservation requirements for EBX/ESI/EDI only apply to assembler code which precedes BASIC statements within a sub/function. PowerBASIC assumes that it can use these 3 registers to store data from one statement to the next. If there are no more statements, it's of no consequence. PowerBASIC must preserve registers between sub/function calls, and does so automatically. Of course, ESP and EBP must always be preserved.
One note... never change a segment register. Win32 in unforgiving.

Source: PB-Forum

Petr Schreiber

AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Charles Pegge

#6
Thanks for taking those measurements Petr. It shows the cost of generalised methods without  optimisation.  Of course once you start using interpreter techniques, the time penalty is much more extreme.

I am not sure when PB requires the EBX and ESI registers to be left unaltered in between lines, or for referencing global and static variables, arrays etc.
Pushing them only takes 1 clock each as does popping them.

One potential GOTCHA, Theo, is in loading one of these variables into a reserved register which is normally used to index them, causing a GPF when you try to access the next variable.

There is a simple and cheap method to protect against this, costing 3 clocks per variable loaded instead of 1


dim a as global long, b as global long, c as global long
'....
! push a     ' push contents of all the variables to be used
! push b     '
! push c     '
! pop edi    ' now pop then into the target registers
! pop esi    '
! pop ebx   '
'....         '



Kent Sarikaya

This is way beyond me, but in looking through the code I can see sort of how it is going... I don't understand how it knows which operand to use? It is never pushed or put into memory as the variable values are? How does it know to add instead of subtract?

Theo Gottwald

#8
I am not sure when PB requires the EBX and ESI registers to be left unaltered in between lines,

I can help you on this Charles. Reading my own post from before, I can tell you this:

PowerBasic uses these registers:

EBX - internal usage
ESI/EDI - for REGISTER variables, if you do not use #REGISTER NONE

Therefore they need to be preserved and restored if you have BASIC CODE and ASM Code intermixed.

As we see in the Disassembly, the EBP-Register is used to adress local datastructures and variables.

Example:
MOV DWORD PTR [EBP+FFFFFF6C], ESI

Would load a local variable with a LONG or DWORD Value.

Charles Pegge

Thanks Theo, we have to be aware of shifting sands - different compilers.

Kent,
As Push'n Pop is Last-in First-out  buffering, so the above snippet stacks the values of the variables a,b,c which we want to use in the registers . EDI ESI and EBX, may be used to index these variables, so we must not overwrite them while obtaining the values of a,b and c.

Once their values are on the stack, we can safely pop them into these registers.
The stack is indexed by means of the stack pointer ESP. which is automatically decremented before PUSHing and incremented after POPping. but this stack pointer register is seldom altered by us directly.

In this example, the values of a, b and c end up in ebx, esi and edi respectively.

Of course, the three registers we have just overwritten cannot be used
to access further BASIC variables unless we have taken steps to preserve them with another shell of pushes and pops around our assembler code.

However, it is safe to assume that LOCAL variables, including function parameters are indexed only by the EBP register, which is set up for us by PB at the beginning of the function, and it is not normally used for anything else unless you are desperate for a spare reg to make your algorithm zing.

Kent Sarikaya

Thanks for the explanation Charles and Theo. Amazing how you guys keep all of which compiler does what etc. The ASM files look different from compiler to compiler too.

Donald Darden

It's not usually pointed out, but RPN notation works well with ASM because it reduces complex instructions to their simplest form, wirking first with values (or variables), then performing the specified operation.  If you took a typical ASM operation, like ADD eax,a, and wrote it this way:  eax,a ADD, it would easily show the similarity in structure.

More specifically, the use of RPN favors a stack structure, but if we can limit or
confine our efforts to the general purpose registers, we will benefit because the registers are more quickly accessed.  Aside from the fact that some instructions are specific to certain registers, the use of the stack involves RAM memory, which has to be referenced via the stack pointer, which in term has to incremented or decremented, and this makes it quite a bit slower overall.

Another consideration with RPN is that it favors the retention and reuse of
intermediate results.  You can store them somewhere temporarily, then reintroduce them when needed, which can save computational time and effort.
One example of this capability is when multiplying a number by ten by using the
shift-and-add capabilities instead.  Let me describe the general principle and you can then play with it.

Begin with some integer and call it x.  Copy the original x into EAX.    Then copy it from EAX into a second register.  Then shift EAX left two places, which
effectively multiplies your x value by four.  Add the value of x from the temporary register back into EAX, so that the total is now equivalent to five times x.  Then shift EAX left one more place, and your 5x will become 10x.  This
will work for any integer value of x, as long as the value when multipled does not result in a register overflow.  You can rearrange the shifts and adds slightly and get the same result.  For instance, start with a left shift of 1 bit, save the
temporary result, shift left two more places to the left, and add in the temporary result again.  There you are.

Unfortunately, there is no clever method available for dividing by ten, which
would be extremely useful for convirting base 2 numbers to base 10 (decimal).

Charles Pegge

#12
division

Yes the only way to divide a number (unless its a simple power of 2) is by
successive approximation/ This makes both CPU and FPU work very hard, taking more than 40 clocks, about 4 times longer than multiplying.

So whenever possible replace dividing factors with multiplying factors eg:
a*001 instead of a/1000.

multiplying EAX by ten

 ! shl eax,1
! mov ecx,eax
! shl eax,2
! add eax,ecx


This takes 4 clocks on the CPU instead of 10 using MUL, however floating point multiplications on the FPU are about 3 clocks!


RPN notation

A SHL DUP SHL SHL +

pretending that the EAX and ECX registers are part of a stack.



PS: the new Intel chips do things on half clocks so take my figures as being relative.

Donald Darden

While the FPU is highly optimized, there is still a time penalty for conversion of numbers to floating point form and back.  Since the PB compilers are optimized for
use with Longs (and will generally revert to floating point form for other numeric
types), your best bet for fastest code is to attempt to use Longs wherever possible.