• Welcome to Jose's Read Only Forum 2023.
 

FreeBASIC CWstr

Started by Juergen Kuehlwein, April 09, 2018, 11:39:00 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Jeff Marshall

I've started a discussion at https://www.freebasic.net/forum/viewtopic.php?f=17&t=27569, which includes a back link to here.

Juergen Kuehlwein

Jeff,


QuoteFYI, hides the problem, doesn't solve it.

i agree! It makes the compiler work, but it doesn´t SOLVE the problem of the following code. Investigating this bug further shows, that "chars" after
chars = fb_wstr_CalcDiff( src, p ) + 1;

gets a value of -2147483648 where you would expect zero (p being one position "before" src should result in -1, +1 should make zero). So it seems "fb_wstr_CalcDiff" doesn´t return the expected result, if p < src.


Adding
  if (p < src)
    return NULL;


before makes it exit and return NULL, when src is all spaces. This would be expected. But "fb_wstr_CalcDiff" is used in other places as well, i didn´t look, if this (end < ini) could be a problem elsewhere too. Hope this helps...


JK

Juergen Kuehlwein

Investigating this case still further, i can say that end < ini cannot occur in all other places, where fb_wstr_CalcDiff is called.

I understand the underlying logic of the code in strw_trim.c and as far as i can tell, it is correct. What isn´t correct is the result of fb_wstr_CalcDiff in case p < src. Fb_wstr_CalcDiff should return the difference between the pointers p and scr in wide characters. The corresponding code is:

return ((intptr_t)p - (intptr_t)src) / sizeof( FB_WCHAR );

As said before, this delivers correct results as long as p>src, but for p<src it fails to deliver the expected result. I´m not an expert in C coding at all, but i keep asking myself - why this compilcated? Wouldn´t

chars = p - src + 1;

do instead of fb_wstr_CalcDiff? ... and according to my tests it does, even if p<src.


JK

Jeff Marshall

JK, thanks for investigating.  You have the right idea.

I think your expression: chars = p - src + 1;, is valid and safe.  gcc should be doing the right thing with the pointer arithmetic.

fb_wstr_CalcDiff provides some self documentation as to what the code is supposed to be doing, so might be good to keep the function.   

/* Calculate the number of characters between two pointers. */
static __inline__ ssize_t fb_wstr_CalcDiff( const FB_WCHAR *ini, const FB_WCHAR *end )
{
return end - ini;
}


For your interest:
fb_wstr_SkipCharRev function itself can return a pointer to one element before the first element, which is actually undefined behaviour in C, but often works because of how the C compiler handles it.  So I think I will make a small change in the logic for fb_wstr_SkipCharRev to avoid that and to get rid of the '+1' needed after the function is called.  The result will be that fb_wstr_CalcDiff won't ever see "end" pointer less than "ini" pointer to begin with. 

And, in the the original expression:
return ((intptr_t)end - (intptr_t)ini) / sizeof( FB_WCHAR );
the symptoms of the problem we are seeing is in how gcc optimizes the expression by translating a division by a power of 2 to a shift instructions.  fbc does this too, but in gcc, the actual optimization is different depending on type of the divisor:

((int)end - (int)ini) / sizeof(FB_WCHAR)      '' gcc optimizes to SHR instruction (wrong!)
((int)end - (int)ini) / (int)sizeof(FB_WCHAR) '' gcc optimizes to SAR instruction (correct)

SHR instruction doesn't preserve the sign and so fails for negative values, as would be the case in (end < ini).

I don't see any reason why fb_wstr_CalcDiff must cast pointers to integers.  But that code is nearly the same since 2005, so maybe related to an old version of gcc, I don't know.

Juergen Kuehlwein

Jeff,

fb_wstr_CalcDiff provides some self documentation as to what the code is supposed to be doing, so might be good to keep the function.

this is, why i like line comments (comments added at the end of a line) so much. It doesn´t distract the eye when reading the code, you can easily ignore the text in green (or whatever color you prefer for comments) especially if it is properly aligned. But in case you don´t understand what´s going on and why, such a comment will save you a lot of time, if you must revisit this code location later on.

My memory is still working quite well, but i´m not sure, if i can recall all of i know right now about this code in let´s say in 6 or twelve month. Things, which are obvious right now, will become obscure over time ...

chars = p - src + 1;                                                //calculate the difference in "characters"

solves this problem for good.

Interspersed comments should be used for important information. Too many of them make the code hard to read. But those aligned comments on the right side don´t do any harm and are quite useful for explaining things, which are of lower priority but nevertheless useful for understanding the code flow.

Analyzing someone else´s code is always a challenge. Adding comments to places, where you finally understand, what the code does, helps preserving your effort. In other words i would very much appreciate, if you accepted adding line comments to the sources. I think many people interested in participating would benefit from more readable sources. As you can see from the work i have done, it´s not impossible to understand the sources, but in order to get the whole picture more comments and explanations would help a lot.

Please think about it,


JK

Jeff Marshall

The name alone, fb_wstr_CalcDiff, without any other comment or information is the documentation.  Any time this name appears, it can be known that the expression is taking difference of 2 wstr pointers.

Juergen Kuehlwein

#201
Well, just coding "p - src" would make this one-line function "fb_wstr_CalcDiff" obsolete and would additionally save us the overhead of a function call. The downside is, that the information provided by the descriptive function name is lost then too - a comment would help here.

I don´t want to argue, but i think you get my point!


Just another topic we need a decision for is STRPTR. The way i coded it requires USTRING (or any clone of it) to have the data pointer in first place of the member variables. We could just make this a requirement. But what if someone doesn´t comply to this rule? Unstable, erratic and maybe crashing code would be the (undesirable) result.

So a much cleaner solution would be to make STRPTR an overloadable operator. Then specific code for returning a WSTRING PTR for the STRPTR operator had to be supplied or a compiler error (Invalid data type) would be thrown.

What do you think?


JK


later: in the meantime i know how to do it (STRPTR as overloaded operator) - works like a charm. I really think it´s better than relying on the data´s position.

Jeff Marshall

> just coding "p - src" would make this one-line function "fb_wstr_CalcDiff" obsolete and would additionally save us the overhead of a function call

true, fb_wstr_CalcDiff has all the properties of a function: a name, parameters, type checking, a typed return value.  Except this one (like many of the functions in fb_unicode.h) is an inline function.  gcc will optimize fb_wstr_CalcDiff in line with code where it is used just 2 assembler instructions.

> So a much cleaner solution would be to make STRPTR an overloadable operator.

I think an overloadable operator was one of my thoughts also a few posts ago.  But, in the end though, STRPTR is a lot like a cast() as wstring ptr.  And one of the bugs from sf.net talks about allowing both cast() byref as wstring, cast() as wstring ptr, etc, all in one UDT, changing how fbc ranks string type matching to allow more user defined conversions.  Which made me think if we should have separate WSTRPTR & STRPTR, like we have separate WSTR and STR.

Juergen Kuehlwein

gcc optimization - ok, i don´t know anything about it, except that it exists.

I will have my own set of sources (with comments) and write a little utility for stripping these comments before pushing a file to the repository. This way we can have both, i can have as much comments as i like and the sources are kept "clean". Merging or rebasing will be more cumbersome then for me, but this is my problem.


WSTRPTR - i don´t see an absolute need for this, because in case of an STRPTR operator it is clear, that i want a pointer and it isn´t a matter of fbc ranking, but a matter of being defined as operator or not in an UDT. On the other hand, there already are some W... functions in FB, and WSTRPTR would make clear what kind of pointer is expected. So STRPTR alone would be sufficient, but WSTRPTR would be a clearer syntax - the more i think about it, the more i like it!

Quoteto allow more user defined conversions

I think in case of STRING/WSTRING this can cause only trouble. Where is the need for returning a STRING AND a WSTRING from the same UDT? Internally it´s an either - or, either handle the data as STRING - or handle the data as WSTRING. If the data internally is a WSTRING, converting it to STRING might spoil this data, because this isn´t a lossless conversion. The other way round would work, but it could easily be done by the user too. So allowing for returning both, might be a problem, because it can cause hard to find (WSTRING to STRING conversion under the hood) errors.


Let me know, if you want to have the code changes for making STRPTR an overloadable operator. There isn´t very much to do, just a few more lines here and there. Currently i cannot just push the sources, because i tried many things in many different places, so it isn´t obvious, what is for what.


JK

Jeff Marshall

In the end, I made STRPTR(wstring) return a WSTRING PTR.  I couldn't think of any reason it should not.  To quote the fb wiki, "Note that when passed a Wstring, Operator Strptr still returns a Zstring Ptr, which may not be the desired result.".  My thought is, would it EVER be the desired result.  So I think this is a safe change.

Besides, STR(ustring), and WSTR(ustring) can still be used for specific conversions.

I feel like I have made progress over last several weekends.  I wrote a lot of tests.  The MID assignment statement was a weird one, and I spent  some time on that one.  Maybe it's bugged, I don't know, it has some peculiar behaviours.  For now I left myself a note about it.

I implemented WSTR(ustring) in a way that is close to WSTR(wstring).  When testing ustring = WSTR(ustring), I came across an issue in your implementation of the DWSTR in ustring.bi; DECLARE OPERATOR LET (BYREF pwszStr AS WSTRING PTR) clears the buffer.  And if pwszStr ptr actually points to itself, the buffer is cleared before the contents are read.  Some kind of memory-move would be better.

I think the remaining parts for me are:
- LEFT/RIGHT, which involves fixing a string related bug
- SELECT CASE, which is probably OK
- IIF, which is probably OK, with efficient logic decision
- SWAP, which I need to think about a bit.
- Parameter passing, which is probably OK, so just needs the tests written.

I think I will create a pull request for the work done so far to date.  I'll expand on next steps over at fb.net some time this weekend.

Juergen Kuehlwein

Jeff,


i read your post in FB-forum (please read my post there too) first and replied there first. So what´s new is the problem with MID (could you supply code, where it shows "some peculiar behaviors"?) and that ustring = WSTR(ustring) fails. To be honest i never tested that, because such code doesn´t make much sense. On the other hand it´s not forbidden and therefore it should work! I will have a look.

José, when you read this, could you have a look too?

As far as i read your changes i like your approach better than mine. My goal was to show, that it is possible, and to make it work (somehow). Your code (no wonder) is a more logical development of the compiler.


JK

   

Jeff Marshall

#206
> So what´s new is the problem with MID (could you supply code, where it shows "some peculiar behaviors"?)

In general, the WSTRING in rtlib support is not optimized for speed.  So for most wstring functions, there is no version that will take a length parameter, and length is always calculated with a wstrlen().  Except for MID() assignment, which always assumes the buffer is large enough, if it's a pointer (not fixed length type).  Creates some situations to watch out for:

sub print_wstr( byref s as wstring )
var n = len(s)
print "    " & left( str(s) & space(30), 30);
for i as integer = 0 to n
print hex( s[i], 4 ) & " ";
next
print
end sub

scope
print "show a string"
dim w as wstring * 16 = "abcdefgh"
print_wstr w
print
end scope

scope
print "mid(wstring*n,,) overwrites null terminator"
'' initializer doesn't clear string
dim w as wstring * 16 = "Q"
print_wstr w

'' overwrites null terminator and we get garbage
mid( w, 2, 1 ) = "X"
print_wstr w
print
end scope

scope
print "mid(wstring,,) overwrites null terminator"
dim w as wstring * 16 = "R"

'' try a pointer, it doesn have fixed length limit
dim p as wstring ptr = @w
print_wstr *p

'' overwrites null terminator and we get garbage
mid( *p, 2, 1 ) = "Y"
print_wstr *p
print
end scope

scope
print "mid(wstring,) writes data beyond end of string"
dim t as wstring * 16
dim w as wstring * 16 = "0123456789abcde"
print_wstr w

dim p as wstring ptr = @w
mid( *p, 16 ) = "qrst"

print_wstr w
print_wstr t
print
end scope


The rtlib needs work with new functions to handle string length parameters.  Which should give better speed and allow for embedded null characters.  Both of which is needed for the eventual addition of builtin dyanmic wstring. 

Jeff Marshall

> ustring = WSTR(ustring) fails

In one of your tests (tests/ustring/wstr.bas) you have:

      dim w as wstring * 50 = wchr( 1234 )
      dim u as ustring = wchr( 1234 )
      w = wstr(w)
      u = wstr(u)


In your original fbc ustring code, rtlToWstr() doesn't do anything if the argument is a "USTRING".  So it just resolves to u = u and so it calls a let operator that does check buffer address:

PRIVATE OPERATOR DWSTR.Let (BYREF cws AS DWSTR)
  IF m_pBuffer = cws.m_pBuffer THEN EXIT OPERATOR   ' // Ignore cws = cws
  this.Clear
  this.Add(cws)
END OPERATOR


In my implementation of WSTR(udt), it actually does the conversion, and so calls:

PRIVATE OPERATOR DWSTR.Let (BYREF pwszStr AS WSTRING PTR)
  this.Clear
  IF pwszStr = 0 THEN EXIT OPERATOR
  this.Add(*pwszStr)
END OPERATOR

if pwszStr points to m_pBuffer, (or first couple of bytes, looking at Clear), So the string just gets erased before the data is copied.

José Roca

Quote
The rtlib needs work with new functions to handle string length parameters.  Which should give better speed and allow for embedded null characters.  Both of which is needed for the eventual addition of builtin dyanmic wstring.

You seem to be thinking about BSTRings, that carry its length with them. They're a different data type. FreeBasic's WSTRING is a null terminated unicode string (maybe it should have been named ZWSTRING or WSTRINGZ to avoid confussions); therefore, they can't have embedded nulls.

Juergen Kuehlwein

#209
Jeff,

so, if you  add this line:
IF m_pBuffer = cast(ubyte ptr, pwszStr) THEN EXIT OPERATOR   'Ignore self-assign
to:

PRIVATE OPERATOR DWSTR.Let (BYREF pwszStr AS WSTRING PTR)
  IF m_pBuffer = cast(ubyte ptr, pwszStr) THEN EXIT OPERATOR   'Ignore self-assign

  this.Clear
  IF pwszStr = 0 THEN EXIT OPERATOR
  this.Add(*pwszStr)
END OPERATOR


it should work again. Could you please test?


JK