Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22843 Enable constructing UnicodeString from literal in fixed time #3106

Merged
merged 1 commit into from
Sep 9, 2024

Conversation

roubert
Copy link
Member

@roubert roubert commented Aug 15, 2024

When passing a string literal to any of the legacy constructors that take just a plain pointer to a UTF-16 string it becomes necessary to iterate through the string to find its length, even though this length was known to the compiler (which just has no way of passing it on to the constructor).

But when calling the new templated string view constructor instead it becomes possible for the compiler to use the known length of a string literal to directly create a string view of the correct size and pass this on to the constructor.

By replacing the legacy constructors with the new constructor this is made the default behaviour.

Because non-templated functions take priority it's necessary to replace the legacy constructors or else the new constructor would never be called for any data type that can implicitly decay into a pointer (and this includes string literals). For the time being, this is gated on U_HIDE_DRAFT_API which will select either the old or the new constructors and even though the old constructors technically get deleted the new constructor is made possible to call in identical ways so that no changes to any call sites are necessary.

There is however a snag there and that's an old promise that nullptr is valid input and identical to the default constructor (ie. the empty string) but this is explicitly not valid input for string views, so this requires a backward compatibility shim to catch any nullptr before any attempt is made to use it to create a string view (which would crash most implementations of the standard library).

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22843
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@roubert roubert marked this pull request as draft August 15, 2024 21:35
@roubert
Copy link
Member Author

roubert commented Aug 15, 2024

Please take a look at this.

I have intentionally not done anything about the documentation yet, thinking that it'll be a better use of time to first make sure that we all agree that this change should be made at all, and if so, that the implementation is sound.

As can be seen directly in this PR no existing call sites should need to be changed, the updated new templated constructor will accept all existing callers with legacy behaviour kept intact, so I think that the change could be deemed acceptable for the ICU API stability promises. But it must be mentioned that the old constructors really do get removed so in case anyone anywhere has code that does something like taking the address of one of the old constructors then that code will cease to work after this change. As the change is gated on U_HIDE_DRAFT_API there should be ample time to find and address any such unusual usage, if any actually exists at all.

@roubert roubert marked this pull request as ready for review August 15, 2024 22:21
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, I like this a lot. While it adds yet more template magic, it also magically makes existing call sites that use literals faster.

One question/concern below.

Once we have an answer for the explicit-constructor question, this wants to be proposed as an API change via email to icu-design.
(I am not too worried about people getting the address of one of the old constructors.)

@eggrobin what do you think?
@richgillam heads-up...

icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to do the same treatment for the other functions that have overloads for both raw-pointer and template-S?

icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
@roubert
Copy link
Member Author

roubert commented Aug 19, 2024

Are you planning to do the same treatment for the other functions that have overloads for both raw-pointer and template-S?

I haven't made any such plans for I haven't really looked at those functions at all (so I don't know if there's anything that would need to be done there).

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/common/unicode/unistr.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @roubert, the ICU 76 feature freeze is in less than 2 weeks. It would be great if you could finish your work here. Please take a look at the other new functions that take some ConvertibleToU16StringView, and have raw-pointer overloads, to see if they should get the same treatment.

When you have a complete set of changes, remember to write an API change proposal; you could reply to my proposal with an addendum, like I just did.

icu4c/source/common/unicode/unistr.h Outdated Show resolved Hide resolved
@markusicu
Copy link
Member

PS: Please don't squash until after approval, except when you need to rebase.

@roubert
Copy link
Member Author

roubert commented Aug 26, 2024

@eggrobin, @richgillam as you were the reviewers of PR #3076, I'd like to know your opinion on these proposed changes here.

@eggrobin
Copy link
Member

I'd like to know your opinion on these proposed changes here.

I have opined. Apologies for the delayed review, I was on vacation last week.
Other than the points discussed above, this seems like an improvement (and reduces the number of constructors, except for the draft transition period).

@markusicu
Copy link
Member

Reminder...

Are you planning to do the same treatment for the other functions that have overloads for both raw-pointer and template-S?

I haven't made any such plans for I haven't really looked at those functions at all (so I don't know if there's anything that would need to be done there).

@roubert
Copy link
Member Author

roubert commented Aug 29, 2024

Reminder...

I just think that it's a good idea to 1) make sure that we all agree that this should be done at all and 2) agree on how exactly to do it, before spending any time on doing any further changes.

@markusicu
Copy link
Member

Suggestion:

  • drop the additional handling of uint16_t in char16ptr.h
  • keep the addition of toU16StringViewNullable
  • I don't feel strongly about keeping or reverting the changes from checking U_SIZEOF_WCHAR_T==2 in the template logic vs. in the preprocessor
  • keep the replacement of (fn(ptr) + fn(const S&) --> fn(const S &)+nullable

Note: I looked at the other overloads that I added, and AFAICT none of them has a sibling overload that just takes a pointer, so this might be it for now.

@roubert
Copy link
Member Author

roubert commented Aug 30, 2024

@richgillam heads-up...

If there's any reason for why any of this wouldn't work for Apple, now would be a very good time to learn about that.

When passing a string literal to any of the legacy constructors that
take just a plain pointer to a UTF-16 string it becomes necessary to
iterate through the string to find its length, even though this length
was known to the compiler (which just has no way of passing it on to the
constructor).

But when calling the new templated string view constructor instead it
becomes possible for the compiler to use the known length of a string
literal to directly create a string view of the correct size and pass
this on to the constructor.

By replacing the legacy constructors with the new constructor this is
made the default behaviour.
@roubert
Copy link
Member Author

roubert commented Sep 3, 2024

Are you planning to do the same treatment for the other functions that have overloads for both raw-pointer and template-S?

I've now searched through the entire file without finding any such functions.

@roubert
Copy link
Member Author

roubert commented Sep 3, 2024

OK, it seems like we've reached some kind of silent consensus here, so I've now updated the documentation comments and made this PR ready to merge (if the proposed API change gets approved by the TC).

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/common/unicode/char16ptr.h is different
  • icu4c/source/common/unicode/unistr.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
icu4c/source/common/unicode/unistr.h Show resolved Hide resolved
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API change was approved in the ICU-TC meeting yesterday.

@roubert roubert merged commit 6c9d39a into unicode-org:main Sep 9, 2024
99 checks passed
@roubert roubert deleted the 22843-ctor branch September 9, 2024 11:51
roubert added a commit to roubert/icu that referenced this pull request Sep 9, 2024
As was pointed out in PR unicode-org#3106, the standard doesn't mandate the
existence of std::char_traits<uint16_t> and libc++ did actually
deprecate it in release 18 and delete it in release 19.

As ICU4C uses this for backward compatibility for old code that still
depends on setting -DUCHAR_TYPE=uint16_t it'd be nice to keep using it
as long as possible but once it's gone from the standard library there
isn't really much useful that ICU4C could do about that so conditionally
compiling the code when the standard library version is old enough
should suffice for keeping backward compatibility for yet some time.

llvm/llvm-project@cce062d
roubert added a commit that referenced this pull request Sep 10, 2024
As was pointed out in PR #3106, the standard doesn't mandate the
existence of std::char_traits<uint16_t> and libc++ did actually
deprecate it in release 18 and delete it in release 19.

As ICU4C uses this for backward compatibility for old code that still
depends on setting -DUCHAR_TYPE=uint16_t it'd be nice to keep using it
as long as possible but once it's gone from the standard library there
isn't really much useful that ICU4C could do about that so conditionally
compiling the code when the standard library version is old enough
should suffice for keeping backward compatibility for yet some time.

llvm/llvm-project@cce062d
@richgillam
Copy link
Contributor

@richgillam heads-up...

If there's any reason for why any of this wouldn't work for Apple, now would be a very good time to learn about that.

See my comment above. Apple does still define UChar as uint16_t. I tried taking that out a while ago and got all kinds of bug reports. But it sounds like maybe Robin's objections are theoretical, at least in Apple's case, so maybe we leave what Fredrik is doing the way it is?

@richgillam
Copy link
Contributor

See my comment above. Apple does still define UChar as uint16_t. I tried taking that out a while ago and got all kinds of bug reports. But it sounds like maybe Robin's objections are theoretical, at least in Apple's case, so maybe we leave what Fredrik is doing the way it is?

...And now I get to the bottom and am reminded that this was already merged. I lost track-- did we decide to leave the uint_16 stuff in there?

@markusicu
Copy link
Member

I lost track-- did we decide to leave the uint_16 stuff in there?

It's in, but #3144 makes it conditional, removing it starting with libc++ 18. Making a string_view over uint16_t generates warnings with libc++ 18 and no longer works at all with libc++ 19.

It should continue to work if you use a standard library that still supports std::basic_string_view<uint16_t> -- but the standard does not condone it.

@richgillam
Copy link
Contributor

I lost track-- did we decide to leave the uint_16 stuff in there?

It's in, but #3144 makes it conditional, removing it starting with libc++ 18. Making a string_view over uint16_t generates warnings with libc++ 18 and no longer works at all with libc++ 19.

It should continue to work if you use a standard library that still supports std::basic_string_view<uint16_t> -- but the standard does not condone it.

Okay. Sounds like it's time for me to figure out how to get everybody to change to using char16_t...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants