Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Fix FORMAT function to better comply with Microsoft SQL Server specification #86225

Merged
merged 5 commits into from
May 18, 2022

Conversation

luigidellaquila
Copy link
Contributor

Based on the documentation, FORMAT() function should comply with Microsoft SQL Server Format specification.

Current implementation uses Java DateTimeFormat as it is, only with replacement of basic pattern characters that are slightly different between the two specs, but it's not fully compliant.

The goal of this PR is to better translate Microsoft datetime patterns to the equivalent in Java, and addresses the cases that are not currently covered, ie.

  • allowed characters in Microsoft spec that are not allowed in Java are now quoted and returned correctly
  • quoting syntax (that is different in the two specs) is properly translated
  • some further fixes to patterns translation (eg. Y and y in Java are equivalent, while in MS spec Y is just a normal character, not interpreted as a pattern)

Fixes #66560

@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Apr 27, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@@ -136,39 +136,39 @@ SELECT ISO_DAY_OF_WEEK(birth_date) AS d, SUM(salary) s FROM test_emp GROUP BY d

d:i | s:l
---------------+---------------
7 |386466
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for this, it's annoying for me as well...
IntelliJ keeps trimming blank spaces in csv-spec files on save, and there is apparently no way to avoid it, even disabling it from the IDE settings (not sure if it's a mix of IntelliJ + checkstyle + OSX).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luigidellaquila this is probably due to .editorconfig. You could add some further config to it, which worked when I tested it:

diff --git a/.editorconfig b/.editorconfig
index 92111b6c074..00dc2f6fa83 100644
--- a/.editorconfig
+++ b/.editorconfig
@@ -225,3 +225,6 @@ indent_size = 2
 
 [*.{xsd,xml}]
 indent_size = 4
+
+[*.{csv,sql}-spec]
+trim_trailing_whitespace = false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @pugnascotia, I'll try it

@luigidellaquila
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.
Also, I suggest changing the title of the PR to Fix FORMAT function to better comply with Microsoft SQL Server specification. Not all the characteristics of MS format are supported.

// manage patterns that are different from MS to Java and have to be translated
boolean replaced = false;
for (String[] item : MS_TO_JAVA_PATTERNS) {
if (i + item[0].length() <= pattern.length() && item[0].equals(pattern.substring(i, i + item[0].length()))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract item[0].length() outside if and call it only once.

@@ -67,6 +86,105 @@ protected Function<TemporalAccessor, String> formatterFor(String pattern) {
}
};

private static String msToJavaPattern(String pattern) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method should be unit tested. There are many edge cases with translating one format to another and we need to be extra careful with them and make sure everything is covered. Make this method package protected and add tests for it.

StringBuilder partialQuotedString = new StringBuilder();

boolean originQuoting = false;
boolean targetQuoting = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better named as shouldQuoteTarget?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, the naming is bad and misleading.
What I mean here is:

  • originQuoting: the character in the original (MS) pattern was intended to be quoted (eg. previous char was a backslash or is inside a quoting block).
  • targetQuoting: last character in the generated (java) pattern has to be quoted, so I'm building a portion of pattern that will then be quoted. Unfortunately I cannot quote characters one by one, I have to accumulate portions of string that have to be quoted and then quote them properly

I'll try to find better names

StringBuilder result = new StringBuilder(pattern.length());
StringBuilder partialQuotedString = new StringBuilder();

boolean originQuoting = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better named as isStartQuote?

Comment on lines 180 to 182
result.append("'");
result.append(partialQuotedString.toString().replaceAll("'", "''"));
result.append("'");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract this as a small method and use it here and above as well.

);
assertEquals(
"2019-09-04 04:10:37.12345678 AM",
new Format(Source.EMPTY, dateTime, l("YYYY-MM-dd HH:mm:ss.ffffffff t"), zoneId).makePipe().asProcessor().process(null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also add a test with YYYY to show that the MS pattern is case-sensitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added above

@luigidellaquila luigidellaquila changed the title SQL: Fix FORMAT function to comply with Microsoft SQL Server specification SQL: Fix FORMAT function to better comply with Microsoft SQL Server specification Apr 28, 2022
- code refactoring
- further unit tests
- fix .editorconfig to avoid trim of trailing whitespaces in csv-spec files
@cla-checker-service
Copy link

cla-checker-service bot commented Apr 28, 2022

💚 CLA has been signed

Copy link
Contributor

@Luegg Luegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there more breaking changes apart from Y vs. y?

For Y vs y I would suggest to keep supporting upper case Y and potentially emit a deprecation warning instead of breaking current behavior. Since even the docs uses Y in the example it seems quite likely that users also do so.

@luigidellaquila
Copy link
Contributor Author

@elasticmachine run CLA

@luigidellaquila
Copy link
Contributor Author

luigidellaquila commented Apr 28, 2022

@Luegg yes, there are many breaking changes here, like A, N, G, VV...
Actually, with the original implementation, only nine MS patterns were translated to Java (see diff), all the rest of the characters were just copied and interpreted with Java semantics. As you can see, there are maybe ten or more characters that Java DateTimeFormatters interprets as valid date patterns and that MS just ignores.

IMHO it could still make sense to treat Y as an exception, since it's pretty common, but probably we should not maintain the compatibility with all the other cases.

@astefan
Copy link
Contributor

astefan commented Apr 28, 2022

For Y vs y I would suggest to keep supporting upper case Y and potentially emit a deprecation warning instead of breaking current behavior. Since even the docs uses Y in the example it seems quite likely that users also do so.

I have mixed feelings about this. This is a bug and should be treated like it in my opinion: changing the output means fixing a faulty behavior in the first place. On the other hand, we could do as you suggest and be lenient towards such changes, but this could introduce a precedent: every bug fix should also consider any (if it wouldn't be a bug fix) introduced breaking changes and add deprecation warnings and so on. @bpintea @costin wdyt?

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Left some comments around the coding style to replace the majority of if .. continue with else if.
Regarding the changes, I don't think we need to be lenient - this function is specific to SQL Server and should act accordingly. The fact that Y has been supported instead of y is a bug - if folks want to use the Java formatter rules, they can do so using the regular DATE_FORMATTER or simply change their queries to use y instead. It is annoying however it has been a bug that affected also our documentation.
In short, I would just apply the changes without any special treatment of the previous behavior.

originalCharacterQuoted = false;
lastTargetCharacterQuoted = true;
partialQuotedString.append(c);
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use else if instead of continue through-out this method.

@Luegg
Copy link
Contributor

Luegg commented Apr 28, 2022

Interesting, I wasn't aware that we support 3 different formats for date formatting and assumed that's the main one.

In any case, there is one more occurrence of the upper case Y in the docs:

image

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bpintea
Copy link
Contributor

bpintea commented May 16, 2022

I'm wondering just how close to the MS specification we will want to get?
I'm thinking of the treatment of %, which has this single custom format specifier semantic, or single " characters, which are just piped through by MS's FORMAT, but swallowed by current implementation, or double tick '', that will be piped through by MS's FORMAT, but treated as a single ' by DateTimeFormatter.

@luigidellaquila
Copy link
Contributor Author

I'm wondering just how close to the MS specification we will want to get?

@bpintea MS and Java formats are not different only in the syntax, but also on some concepts. One of them is custom format specifiers, that don't seem to have an equivalent in Java.
The goal here is to have has much compliance with MS specs as possible, especially on the allowed characters (see #66560), while preserving current implementation based on Java DateTimeFormatter.
Some concepts, like those you mentioned, would require a more complex effort (probably a complete rewrite of the formatter), this is why I decided to leave them out of the scope of the PR

// and then translate the pattern
result.append(item[1]);
characterProcessed = true;
i += (fragmentLength - 1); // fast-forward, because the replaced pattern could be longer than one character
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the parentheses shouldn't be needed. Just in case any other changes will be needed.
Also, I'm generally in favour of using a while loop and explicitly incrementing by needed amount (vs. decrementing due to for).

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@luigidellaquila luigidellaquila merged commit f69c739 into elastic:master May 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/SQL SQL querying >bug Team:QL (Deprecated) Meta label for query languages team v8.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL: incorrect FORMAT behaviour
8 participants