PowerBASIC dynamic strings memory usage

Started by Theo Gottwald, August 04, 2007, 11:19:42 AM


José Roca

Until PowerBASIC adds native unicode support, it is of some importance to low-level COM programmers to know whether PB dynamic strings always end with one null byte or two. Why? Because there is another kind of string, the null-terminated unicode string, not currently supported by PB. Currently, if we don't want to use SysAllocString / SysFreeString, we have to use a double-null-terminated dynamic string and use STRPTR to pass a pointer to the string data. If PB dynamic strings end with a single null, we have to add another null to the string; if they end with two nulls, we don't need to add anything; if we aren't sure, we always have to add a null to be safe.

My guess is that instead of:

| 2 | 0 | 0 | 0 | H | I | 0 | 0 |

The resulting string will be:

| 2 | 0 | 0 | 0 | H | I | 0 | x |

Where x will be uninitialized data. Sometimes it will be a null character and sometimes not.
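To play it safe, the string can be converted and terminated explicitly before taking its address. A minimal sketch (SomeUnicodeApi is only a placeholder for whatever call expects a pointer to a null-terminated unicode string):

FUNCTION PBMAIN() AS LONG
    LOCAL sAnsi AS STRING
    LOCAL sWide AS STRING

    sAnsi = "HI"
    ' UCODE$ converts the ansi text to unicode (two bytes per character)
    sWide = UCODE$(sAnsi)
    ' A unicode terminator is two null bytes, so append them explicitly
    ' instead of relying on whatever PB places after the string data.
    sWide = sWide & CHR$(0) & CHR$(0)
    ' STRPTR returns the address of the string data, which can then be
    ' passed where an LPCWSTR is expected, e.g.:
    ' hr = SomeUnicodeApi(STRPTR(sWide))
END FUNCTION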

BTW the adoption of unicode is unavoidable. Since Windows NT, the Windows API is all unicode, the ansi functions being wrappers for the unicode ones. Therefore, there is a speed penalty when using the Windows API ansi functions, because the operating system has to convert the strings to unicode, call the unicode version of the function, and convert the results back to ansi. One day, ansi and asciiz strings will be a thing of the past.
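For example, the wide version of an API can be called directly with such a string, skipping the ansi wrapper. The declare below is only a sketch of how that might be written in PB; MessageBoxW is the real exported name, but the parameter types shown are my assumption:

DECLARE FUNCTION MessageBoxW LIB "USER32.DLL" ALIAS "MessageBoxW" (BYVAL hWnd AS DWORD, BYVAL lpText AS DWORD, BYVAL lpCaption AS DWORD, BYVAL uType AS DWORD) AS LONG

FUNCTION PBMAIN() AS LONG
    LOCAL sText    AS STRING
    LOCAL sCaption AS STRING
    LOCAL lResult  AS LONG

    ' Build double-null-terminated unicode strings and pass their addresses
    sText    = UCODE$("Hello") & CHR$(0) & CHR$(0)
    sCaption = UCODE$("Demo")  & CHR$(0) & CHR$(0)
    lResult  = MessageBoxW(0, STRPTR(sText), STRPTR(sCaption), 0)
END FUNCTION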


Dominic Mitchell

My observations and the formula I posted don't support that.
They back up what Bruce McKinney said about SysAllocStringByteLen cramming two ANSI characters into each wide character. That is why there are at least two or more nulls at the end of a PowerBASIC dynamic string.  I do agree with you on playing it safe.
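A quick sketch of that behavior, with the PB declares written from the documented OleAut32 prototypes:

DECLARE FUNCTION SysAllocStringByteLen LIB "OLEAUT32.DLL" ALIAS "SysAllocStringByteLen" (BYVAL pszAnsi AS DWORD, BYVAL cbLen AS DWORD) AS DWORD
DECLARE FUNCTION SysStringLen LIB "OLEAUT32.DLL" ALIAS "SysStringLen" (BYVAL pBstr AS DWORD) AS DWORD
DECLARE SUB SysFreeString LIB "OLEAUT32.DLL" ALIAS "SysFreeString" (BYVAL pBstr AS DWORD)

FUNCTION PBMAIN() AS LONG
    LOCAL sAnsi AS STRING
    LOCAL pBstr AS DWORD

    sAnsi = "HIHO"                                            ' 4 ansi bytes
    pBstr = SysAllocStringByteLen(STRPTR(sAnsi), LEN(sAnsi))
    ' The BSTR length prefix stores the byte count, so the 4 ansi bytes
    ' occupy only 2 wide-character slots: two ANSI characters are crammed
    ' into each wide character, and SysStringLen(pBstr) reports 2 here.
    SysFreeString pBstr
END FUNCTION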
Dominic Mitchell
Phoenix Visual Designer
http://www.phnxthunder.com

Donald Darden

I might point out that adding another NULL byte to a dynamic string for the purposes of passing it to a Windows API does not free you from the problem of whether a new string space has to be allocated and a new instance of the string constructed.  Nor does it mean that the packed byte code in the string will be expanded to wide (16-bit) format.  In fact, the most common method of handling dynamic strings before passing them as arguments in API calls is to allocate an ASCIIZ string of sufficient length for the purpose and ask PowerBASIC to assign the dynamic string to it:  aa = d$.
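Something like this (the buffer size is arbitrary):

FUNCTION PBMAIN() AS LONG
    LOCAL d    AS STRING
    LOCAL zBuf AS ASCIIZ * 260

    d = "some dynamic string built at run time"
    ' PB copies the dynamic string into the fixed buffer and null-terminates it
    zBuf = d
    ' zBuf can now be passed to any API declared with an ASCIIZ parameter
END FUNCTION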

For converting wide code to byte code and vice versa, the current PowerBASIC compilers include the ACODE$() and UCODE$() intrinsic functions.  While it is no major task to write equivalent functions in BASIC or ASM for other dialects, being able to use native functions simplifies the process.
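A minimal round trip with the two intrinsics (the byte counts in the comments assume plain ascii text):

FUNCTION PBMAIN() AS LONG
    LOCAL sAnsi AS STRING
    LOCAL sWide AS STRING

    sAnsi = "Hello"            ' 5 bytes of ansi text
    sWide = UCODE$(sAnsi)      ' 10 bytes: each character followed by a null byte
    sAnsi = ACODE$(sWide)      ' back to 5 bytes of ansi text
    ' LEN reports the byte count of a dynamic string, so LEN(sWide) = 10
    ' and LEN(sAnsi) = 5 here.
END FUNCTION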

I'm not sure that Unicode (16-bit) will ever really replace byte (8-bit) code in common usage.  Unless you are prepared to interpret code values beyond the standard set, you will be hard pressed to justify the rework required, and if you are expecting to continue using the standard set, then what advantage is there in doubling string lengths with every other byte value being set to null?

Microsoft's commitment to Unicode has always been half-hearted, and only progresses in spurts, likely to appease supporters of other languages rather than from any domestic need.  If I wanted to write multilingual applications or process data in various tongues, then I might be a proponent of Unicode, but just as the airline industry has found it necessary to adopt a single language for tower and pilot, I expect English to stay at the forefront of business and computer communications for a long time to come.  It may be that the Chinese will come to dominate the world and force us all to learn Mandarin or something, but then Unicode will hardly be adequate for the range of symbols then required.

Point is, Microsoft does not decide for me what I will use in my own code.  If I need to interface to COM or the APIs, then I am only interested in what form my data has to be in for the purpose of the various calls.  I do not plan to write my program in a style that compromises my other goals just because someone else thinks that it is the thing to do.

José Roca

 
The purpose of adding native support to the compiler is to allow you to work with it transparently, without having to constantly use UCODE$, ACODE$ or their API counterparts. Two new datatypes can be added, e.g. WSTRING and WCHAR, and the compiler will handle them transparently, so you will work with them as you are doing with STRING and ASCIIZ.
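Something along these lines, purely hypothetical syntax to illustrate the idea; WSTRING does not exist in the current compilers:

FUNCTION PBMAIN() AS LONG
    ' Hypothetical: WSTRING is a proposed data type, not current PB syntax
    LOCAL wText AS WSTRING
    LOCAL sText AS STRING

    wText = "Hello"       ' the compiler would store the text as unicode
    sText = wText         ' and convert to ansi on assignment, no UCODE$/ACODE$ needed
END FUNCTION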

Edwin Knoppert

>so you will work with them as you are doing with STRING and ASCIIZ

Hmm, I always wonder how one should program.
In C# (or VB6) the string is set like a$ = "hello".
a$'s contents are in unicode, but it looks more like a translation from "hello" to "h e l l o " to me.
So if I were Chinese, how would I benefit from using unicode?
Do I need to enter special character codes?

In C they use L"hello" to make it unicode; this makes it clearer to me that one uses ansi notation which will be converted to unicode.
To make it a unicode string they could use "h\0e\0l\0l\0o\0".

I would not mind supporting unicode in my tools, but it seems I would have conversions from ansi to unicode all the time.
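In PB terms the interleaved-null notation and the conversion give the same bytes for plain ascii text, as far as I can tell; a quick sketch:

FUNCTION PBMAIN() AS LONG
    LOCAL sWide1 AS STRING
    LOCAL sWide2 AS STRING

    sWide1 = UCODE$("hello")
    sWide2 = "h" & CHR$(0) & "e" & CHR$(0) & "l" & CHR$(0) & "l" & CHR$(0) & "o" & CHR$(0)
    ' Both strings hold the same ten bytes, so sWide1 = sWide2 is true;
    ' the conversion only matters for characters outside the ansi range.
END FUNCTION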



Donald Darden

It really does depend on where you are coming from.  Jose has devoted a lot of time to learning how to extend his reach by using APIs, COM, Variants, Objects, and so on, and has really helped show others how to get there through his examples, wrappers, conversions, and explanations.  For his purposes, the use of wide character strings serves his other goals best.  But if you are English-centric, and the bulk of your code is to process data using PowerBASIC, then you might easily decide that byte strings as provided by PowerBASIC are the best way to go.

ACODE$() and UCODE$() provide the means to go from one form to another.  If your compiler recognizes WStrings, then you do not need to specify the conversion step; the compiler will call the necessary function automatically as part of the assign operation.  But this is not magic: there is still a time penalty whenever any type of conversion is required.  However, it usually takes a sizeable number of conversions before the difference can be noted.  You will have observed that it is a common practice to repeat some operations many thousands of times just to get a measure of the time differences between different approaches.  In actual use, this is hardly a real factor, because the time involved is actually quite short.
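The usual way to make the cost visible at all is to repeat the conversion in a loop; a rough sketch (the loop count and string length are arbitrary):

FUNCTION PBMAIN() AS LONG
    LOCAL i     AS LONG
    LOCAL sAnsi AS STRING
    LOCAL sWide AS STRING
    LOCAL t     AS DOUBLE

    sAnsi = STRING$(100, "A")          ' a 100-byte test string
    t = TIMER
    FOR i = 1 TO 100000
        sWide = UCODE$(sAnsi)          ' one conversion per pass
    NEXT
    t = TIMER - t
    ' t now holds the elapsed seconds for 100,000 conversions;
    ' a single conversion is a tiny fraction of that.
END FUNCTION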

I think that the real complaint about ACODE$() and UCODE$() is that this method may seem a bit kludgy or awkward.  It also avoids consideration of any special coding that Unicode could support, that would not be reflected in the standard character set.  But that is a choice for the programmer or the client to make.  Some may temporize by selecting a specific font that supports symbols not available in the standard set.  Same byte code, but perhaps the symbols above code value 127 will be shown differently.  How such code values are then interpreted would have to be taken into account when writing the supporting program.

Note that this is an evolving area, with a lot of existing art in place.  This would make it hard for anyone to come along at this point and mandate that from now on, everything should be done this way or that.  It has happened, such as when the original IBM EBCDIC code was largely superseded by ASCII, but if you deal with mainframe datasets, then EBCDIC code is still very much alive and in use.  Even in systems where it has largely been done away with, it may still be an option for data transfers or storage.  That is at least a 50 year period in which the superseded code has continued to endure.  And that is just one example of the endurance of coding methods.

Theo Gottwald

I'd say that PureBasic also already has native unicode support.
It looks to me as if this will become the new standard; in that case everything else will be outdated at some point.