Trying to Analyze a Simple PB/CC Program

Donald Darden · June 10, 2007, 12:03:58 AM


'==================== Original PB/CC Program ========================
FUNCTION PBMAIN
  aa$="This is a test"
  ! mov al, al
  a& = LEN(aa$)
  ! mov ah, ah
END FUNCTION

Above is a small PB/CC program I used to try to figure out how PowerBasic
encodes the LEN() function in assembly code. As you will see, learning the answer is somewhat difficult. I used IDA Pro Freeware version 4.3 to take the
compiled output, which is an EXE file, and disassemble it into a new file.
I indicate which EXE file to process, and select the PE format for loading the file.

On the next screen just click OK and accept the defaults. In short order, the IDA Pro interface will display the disassembled code in its own window. It is commented and in color, but it is probably best rendered into an ASM file at this point.
You can explore the various options and menus later. Right now, under File,
pick "Produce" and select "ASM" file. You can use Alt+F10 as an alternative.
Save the file where it is convenient for you. Now you can exit IDA Pro, but you must
choose whether or not to save the database it created. Your call.

Using any text editor, you can next look at the produced ASM file. If you used the above code example, you can search for "mov al, al", which should take you right to the corresponding code. This is the extract that I made below:


'======== Related Extracts From Produced ASM File From Compiled EXE =======
Code found in ! MOV pair:
		mov al, al              ; my designated lead flag
		mov edx, [ebp+var_8C]   ; this must point to AA$
		call sub_4019B5
		call sub_40199D
		mov esi, eax            ; this must be REGISTER ref to A&
		mov ah, ah              ; my designated trail flag

There are two called subs here, and they also appear in the ASM file. You can
easily find them by searching for each by name:

                       ...  
sub_4019B5	proc near ; CODE XREF: sub_4010CB+34 p
		push esi                        ; save ESI contents on stack
		sub dword ptr [ebp-78h], 4      ; subtract 4 from a value held between ESP and EBP
		mov esi, [ebp-78h]              ; move that value to ESI
		or edx, 80000000h               ; set the sign bit for some reason
		mov [ebp+esi-5Ch], edx          ; save this at location EBP+ESI-5Ch
		pop esi                         ; restore ESI contents from stack
		retn                            ; exit this sub
sub_4019B5	endp

sub_40181D	proc near ; CODE XREF: .text:00401857 p
					; sub_4018B1+23 p ...
		and esi, 7FFFFFFFh              ; now clear the sign bit in ESI
		jz short loc_401829             ; if the result is zero, there is no valid string address
		mov ecx, [esi-4]                ; string length is stored 4 bytes BELOW the strptr ref
		retn

loc_401829:                             ; CODE XREF: sub_40181D+6 j
		mov ecx, esi                    ; ESI is zero here, so the length in ECX becomes 0
		mov esi, offset unk_4020BC      ; make ESI point to something else
		retn                            ; exit this sub
sub_40181D	endp

The process that PowerBasic employed is not very clear, is it? The way I
presently see it, PowerBasic sets the EBP pointer to work in two directions;
any passed parameters are located on the stack above the point marked by
EBP, and then ESP is set lower to allow a region for local and static variable use
between ESP and EBP. All positioning above and below EBP is by offsets that
PowerBASIC knows, based on allocations made as the program is analyzed and compiled.
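
The negative EBP offsets in sub_4019B5 fit this picture. As a rough illustration only, here is a C model of that sub that follows the listing instruction by instruction. The frame size, the position of EBP inside it, and all the names (frame_mem, frame_load, frame_store, model_sub_4019B5) are hypothetical. The idea that [ebp-78h] holds a running byte offset into a small table of temporaries is a guess, not something the disassembly confirms.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    FRAME_BYTES = 0x100,  /* hypothetical size of the simulated stack region */
    EBP_INDEX   = 0xC0    /* hypothetical position of EBP inside that region */
};

static uint8_t frame_mem[FRAME_BYTES];

/* Read the 32-bit value at [ebp + offset] in the simulated frame. */
uint32_t frame_load(int32_t offset)
{
    uint32_t value;
    int32_t index = EBP_INDEX + offset;
    assert(index >= 0 && index <= FRAME_BYTES - 4);
    memcpy(&value, &frame_mem[index], 4);
    return value;
}

/* Write a 32-bit value at [ebp + offset] in the simulated frame. */
void frame_store(int32_t offset, uint32_t value)
{
    int32_t index = EBP_INDEX + offset;
    assert(index >= 0 && index <= FRAME_BYTES - 4);
    memcpy(&frame_mem[index], &value, 4);
}

/* Model of sub_4019B5. The push/pop of ESI is implicit: esi is a C local. */
void model_sub_4019B5(uint32_t edx)
{
    uint32_t esi;
    frame_store(-0x78, frame_load(-0x78) - 4);  /* sub dword ptr [ebp-78h], 4 */
    esi = frame_load(-0x78);                    /* mov esi, [ebp-78h]         */
    edx |= 0x80000000u;                         /* or edx, 80000000h          */
    frame_store((int32_t)esi - 0x5C, edx);      /* mov [ebp+esi-5Ch], edx     */
}
```

In this model, each call moves the value at [ebp-78h] down by 4 and stores the tagged EDX at a correspondingly lower address, all inside the region below EBP.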

It also appears that in this instance, since A& and AA$ are being treated as local variables, the length of AA$ is stored 4 bytes below the address that the string pointer refers to. If true, this would be the reverse of the order used with PB/DOS. That needs to be checked further.
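
To make the "length 4 bytes below the pointer" idea concrete, here is a minimal C sketch of a length-prefixed string. It assumes the layout suggested by the "mov ecx, [esi-4]" line and nothing more. The helper names (lp_alloc, lp_len, lp_free) are hypothetical, and this is not PowerBasic's actual runtime code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate a length-prefixed string laid out as
   [4-byte length][characters][NUL].
   Returns a pointer to the first character, not to the prefix. */
char *lp_alloc(const char *text)
{
    uint32_t len = (uint32_t)strlen(text);
    uint8_t *block = malloc(4 + (size_t)len + 1);
    if (block == NULL)
        return NULL;
    memcpy(block, &len, 4);                    /* length sits just below the data */
    memcpy(block + 4, text, (size_t)len + 1);  /* characters plus terminating NUL */
    return (char *)(block + 4);
}

/* Counterpart of "mov ecx, [esi-4]": read the length stored
   4 bytes below the data pointer. */
uint32_t lp_len(const char *data)
{
    uint32_t len;
    assert(data != NULL);
    memcpy(&len, data - 4, 4);
    return len;
}

/* Release a string created by lp_alloc. */
void lp_free(char *data)
{
    if (data != NULL)
        free(data - 4);
}
```

Calling lp_len on a string built from "This is a test" returns 14 without scanning the characters, which is the attraction of keeping the length next to the data.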

Note the tendency by PowerBasic to use ESI as the primary register for some of
the processing. A lot of assembly programmers build their reliance around the
EAX register, and perhaps PowerBasic's approach facilitates the idea of keeping
ESI as the first alternative register for a memory variable. It would be worth
noting what coding changes happen if we add a #REGISTER NONE to the PowerBasic program and recompile it.

While there are several things that are not clear at this point, it does appear
that PowerBasic uses the sign bit to signal something about string variables. Perhaps whether they are valid or not, or the type of string variable involved.
The thing I find most confusing here though, is that ESI, EDX, ECX all seem to
have specific roles, but if the length of the string is returned by the last sub in ECX, why is the very next statement in the main body assigning EAX to ESI? That sort of throws me at this point.
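
One way to read the sign-bit handling is as a tag carried on the string address, which sub_40181D strips before using it. The C sketch below models that sub over a small simulated memory. The memory array, its size, the address standing in for unk_4020BC, and the function names are all hypothetical. It only mirrors the instructions shown in the listing and does not explain the EAX question.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAG_BIT    0x80000000u
#define MEM_BYTES  64u   /* hypothetical size of the simulated memory   */
#define EMPTY_ADDR 4u    /* hypothetical stand-in for offset unk_4020BC */

static uint8_t mem[MEM_BYTES];

/* Store a 4-byte length followed by the characters at prefix_addr.
   Returns the simulated address of the first character. */
uint32_t mem_put_string(uint32_t prefix_addr, const char *text)
{
    uint32_t len = (uint32_t)strlen(text);
    assert(prefix_addr + 4u + len <= MEM_BYTES);
    memcpy(&mem[prefix_addr], &len, 4);
    memcpy(&mem[prefix_addr + 4u], text, len);
    return prefix_addr + 4u;
}

/* Model of sub_40181D: ESI is passed by pointer, ECX is the return value. */
uint32_t model_sub_40181D(uint32_t *esi)
{
    uint32_t ecx;
    *esi &= ~TAG_BIT;                   /* and esi, 7FFFFFFFh         */
    if (*esi == 0u) {                   /* jz short loc_401829        */
        ecx = *esi;                     /* mov ecx, esi (zero here)   */
        *esi = EMPTY_ADDR;              /* mov esi, offset unk_4020BC */
        return ecx;
    }
    assert(*esi >= 4u && *esi <= MEM_BYTES);
    memcpy(&ecx, &mem[*esi - 4u], 4);   /* mov ecx, [esi-4]           */
    return ecx;
}
```

In this model, a tagged address of a real string comes back untagged with its length in ECX. A tagged zero comes back as length 0 with ESI redirected to a fixed location, so the caller always ends up with a usable pointer and length pair.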

Anyway, it is an attempt, and perhaps it will help you get started with your own
analysis of your PowerBasic and ASM code.

Theo Gottwald · June 10, 2007, 12:20:18 PM

Donald --
I have just made small changes for better readability, though there are still some special characters left in the labels.
It seems that the IDA Pro printout cannot be pasted 1:1 into the forum, as the forum does not understand the formatting.
Very nice example.

Quote: It would be worth noting what coding changes happen if we add a #REGISTER NONE to the PowerBasic program and recompile it.

I bet ESI is not gonna be used.
What you see is the general #REGISTER ALL which takes your a& into ESI.
I believe that if you do NOT explicitly specify which variables to put into registers, PB takes the first two that fit, in order of declaration.

Quote: While there are several things that are not clear at this point, it does appear

You started with the most difficult things: "things with strings".
You may not have much chance to optimize performance when dynamic strings are used, because most of the time is not spent in your code, but in the string subprograms.

About what happens internally, see here: http://www.codeproject.com/string/cppstringguide2.asp
As far as I currently know, dynamic strings in PowerBasic are the BSTR from C.

For a start, I'd suggest beginning with numeric subprograms; there you can most often follow what's going on in the code.

Donald Darden · June 23, 2007, 01:21:03 AM

It's funny, but I just realized that I most often think of resorting to assembly code
when I am interested in dealing with strings or structures. For plain old number
crunching or input/output, I just let Basic code do its thing. The Console Screen
represents a structure, and I sometimes write routines to make it easier to update it or get information from it, or to quickly change its appearance.

So naturally I am more interested in how to pass references between the high
level code and assembly code most effectively. Starting with numbers is really
trivial - you either have a pointer to the number or the number is passed by value,
and other than knowing how many bytes are in the number, or if it is unsigned or
signed, what else can you say? You can point out that there are several
numeric types, and Floating Point numbers are a special challenge in Assembly
language, but I have to ask why bother? I don't see how you can expect to do
anything with Floating Point numbers that is going to be much faster than just
letting the compiler handle them.

Now as Theo pointed out, the compiler may be using floating point computations
where you know that integer would be sufficient, but once you figure out how
to cause the compiler to use integer where possible rather than using unnecessary conversions and the slow floating point unit, you are effectively
done. You are not likely to be doing much F.P. in assembly, are you?

Understanding Assembly coding is sometimes trying to see a more effective way
to do something that is slow or awkward the way it is, or to attempt to do
something that doesn't fit well with a higher level construct. But to just recode
high level language into assembly is not really very rewarding in terms of work
completed or results gained. Strings are challenging because they represent
a sizeable amount of information, even whole files, and we often want to do
things with that information that require parsing it, extracting and converting
portions of it, and creating new structures or invoking key decisions based on
the information we recover. Because of the sheer volume of information we have to deal with, the time required to process it can become critical. Thus, it can
deserve special attention when it comes to assembly coding.

Another area that often benefits from assembly coding is the handling of real
time events, which would include game playing and other interactive situations.
But that is not my area of focus, so any discussion of writing game routines and
such will be done by someone else.
