-
Notifications
You must be signed in to change notification settings - Fork 13.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-34111][table] Add support for json_quote, json_unquote, address PR feedback #24156 #24967
[FLINK-34111][table] Add support for json_quote, json_unquote, address PR feedback #24156 #24967
Conversation
…ack from PR apache#24156 Co-authored-by: Anupam Aggarwal <[email protected]> Co-authored-by: Jeyhun Karimov <[email protected]>
…e reference to utf8mb4
2896372
to
48b570f
Compare
@flinkbot run azure |
Thanks for the contribution @anupamaggarwal I found one issue with symbols encoded outside of BMP e.g. SELECT '≠'; This works fine SELECT json_quote('≠'); it fails as
I tend to think that here
need to work with codepoints rather than chars At the same time query SELECT json_unquote('≠'); works fine. I also checked same queries for MySQL and they work perfectly fine |
…quote(Formatter) due to wrong (char) object type
thanks for your review @snuyanzin ! |
@@ -377,6 +377,12 @@ string: | |||
- sql: SUBSTR(string, integer1[, integer2]) | |||
table: STRING.substr(INTEGER1[, INTEGER2]) | |||
description: Returns a substring of string starting from position integer1 with length integer2 (to the end by default). | |||
- sql: JSON_QUOTE(string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I forgot to mention: also need to update chinese doc.
It's ok to put there english version of description however it's better to keep them synced
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
We can also create a Jira with a special tag to document gaps in the Chinese documentation: https://flink.apache.org/how-to-contribute/contribute-documentation/#chinese-documentation-translation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created JIRA https://issues.apache.org/jira/browse/FLINK-35800 (and added a comment to original JIRA https://issues.apache.org/jira/browse/FLINK-34111)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creation of jira
I would still vote for having filled chinese doc with english description which then could be translated within the jira issue you've created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure @snuyanzin, fixed in cbcb044
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm during testing I noticed 2 strange things
- if escaped quote is alone or with something not UTF-8 bytes related then everything is ok e.g.
SELECT JSON_UNQUOTE('"\""'), JSON_UNQUOTE('"\"\t"');
it returns
" "
which is ok, however once we add utf-8 bytes like
SELECT JSON_UNQUOTE('"\"\ufffa"');
it starts printing backslash as well which is NOT OK
2.It seems current implementation allows to have multiple double quotes
e.g. this query is still working (in MySQL it fails )
SELECT JSON_UNQUOTE('"1""1"');
From one side MySQL doesn't allow this from another side I think it is still ok if it returns valid result.
However the main issue is that it fails (trace below) if we use such approach for invalid utf e.g.
SELECT JSON_UNQUOTE('"1""\uzzzz"');
at the same time this query works fine
SELECT JSON_UNQUOTE('"1\uzzzz"');
Caused by: org.apache.flink.util.FlinkRuntimeException: java.lang.NumberFormatException: For input string: "zzzz" under radix 16
at org.apache.flink.table.runtime.functions.scalar.JsonUnquoteFunction.eval(JsonUnquoteFunction.java:118)
at StreamExecCalc$3.processElement(Unknown Source)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:425)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:520)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:110)
at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:100)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:113)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:71)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:338)
Caused by: java.lang.NumberFormatException: For input string: "zzzz" under radix 16
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
at java.base/java.lang.Integer.parseInt(Integer.java:668)
at org.apache.flink.table.runtime.functions.scalar.JsonUnquoteFunction.fromUnicodeLiteral(JsonUnquoteFunction.java:99)
at org.apache.flink.table.runtime.functions.scalar.JsonUnquoteFunction.unquote(JsonUnquoteFunction.java:78)
at org.apache.flink.table.runtime.functions.scalar.JsonUnquoteFunction.eval(JsonUnquoteFunction.java:112)
... 11 more
UPD: i think it is better to follow existing's vendor's way like MySQL and do not allow more than 2 non-escape double quotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @anupamaggarwal!
I have just a few high-level comments so far.
Will dig a bit deeper into this soon.
Thank you, Fabian
...able-planner/src/test/java/org/apache/flink/table/planner/functions/JsonFunctionsITCase.java
Show resolved
Hide resolved
...-planner/src/test/scala/org/apache/flink/table/planner/expressions/ScalarFunctionsTest.scala
Outdated
Show resolved
Hide resolved
thanks @snuyanzin for your review!
As you pointed out this should ideally throw an exception (as mysql), however IS JSON returns true in this case (IIUC this should be a bug? since
If we fix this by throwing an exception (preferable), it could be in-consistent with what IS JSON considers as valid json? Same is the case with
However in this case we throw an exception (from json_unquote) since \uzzzz is not a valid literal For
IS JSON returns FALSE and we return the input string back per documentation.
Hmm, I tested this through the sql-client.sh, and I get |
Thanks for the explanation
that's the main point, since in related docs of this PR is written [1]
So based on the doc instead of exception I would expect same value as it was in input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @anupamaggarwal!
I've found another case that we are not handling correctly.
Luckily, it doesn't depend on IS_JSON
and easy to fix :-D
Thanks, Fabian
...able-planner/src/test/java/org/apache/flink/table/planner/functions/JsonFunctionsITCase.java
Outdated
Show resolved
Hide resolved
...ntime/src/main/java/org/apache/flink/table/runtime/functions/scalar/JsonUnquoteFunction.java
Outdated
Show resolved
Hide resolved
b0637b0
to
f8bb3e6
Compare
f8bb3e6
to
04db96e
Compare
...ntime/src/main/java/org/apache/flink/table/runtime/functions/scalar/JsonUnquoteFunction.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the quote checks and additional test cases.
I found one more thing that we should fix.
Thank you, Fabian
...ntime/src/main/java/org/apache/flink/table/runtime/functions/scalar/JsonUnquoteFunction.java
Outdated
Show resolved
Hide resolved
...ntime/src/main/java/org/apache/flink/table/runtime/functions/scalar/JsonUnquoteFunction.java
Outdated
Show resolved
Hide resolved
} catch (Throwable t) { | ||
// ignore | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should swallow any kind of exception or error.
Let's catch the errors that we expect (IllegalArgumentException, etc.) and handle those by returning the original input.
Unexpected exceptions should be forwarded because they might be related to other issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @fhueske, @snuyanzin makes sense, fixed
...ntime/src/main/java/org/apache/flink/table/runtime/functions/scalar/JsonUnquoteFunction.java
Outdated
Show resolved
Hide resolved
One more diff between MySQL and Flink behavior, however this I'm not sure need to fix since at least from my point of view MySQL's behavior is questionable SELECT JSON_UNQUOTE('"\\u0022\\u005c\\u005c\\u0075\\u0030\\u0030\\u0061\\u0061\\u0022"');
SELECT JSON_UNQUOTE(JSON_UNQUOTE('"\\u0022\\u005c\\u005c\\u0075\\u0030\\u0030\\u0061\\u0061\\u0022"')); returns
Flink SELECT JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"');
SELECT JSON_UNQUOTE(JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"')); returns
|
d0cc2cc
to
7546d9c
Compare
thanks @snuyanzin, fixed in 7546d9c Hmm, that's strange, when I tested
through sql-client I get the same result as MySql I didn't entirely understand why this would be unexpected though
which seems ok to me, unless I missed something? Added some tests in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @anupamaggarwal.
I've double checked the behavior of
SELECT JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"');
SELECT JSON_UNQUOTE(JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'));
and can confirm that it is the same as in MySQL.
I think this PR is ready to be merged (just one improvement of the docs).
What do you think @snuyanzin?
docs/data/sql_functions_zh.yml
Outdated
description: Quotes a string as a JSON value by wrapping it with double quote characters, escaping interior quote and special characters ('"', '\', '/', 'b', 'f', 'n', 'r', 't'), and returning the result as a string. If the argument is NULL, the function returns NULL. | ||
- sql: JSON_UNQUOTE(string) | ||
table: STRING.JsonUnquote() | ||
description: Unquotes JSON value, unescapes escaped special characters ('"', '\', '/', 'b', 'f', 'n', 'r', 't', 'u' hex hex hex hex), and returns the result as a string. If the argument is NULL, returns NULL. If the value starts and ends with double quotes but is not a valid JSON string literal, the value is passed through unmodified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description: Unquotes JSON value, unescapes escaped special characters ('"', '\', '/', 'b', 'f', 'n', 'r', 't', 'u' hex hex hex hex), and returns the result as a string. If the argument is NULL, returns NULL. If the value starts and ends with double quotes but is not a valid JSON string literal, the value is passed through unmodified. | |
description: Unquotes JSON value, unescapes escaped special characters ('"', '\', '/', 'b', 'f', 'n', 'r', 't', 'u' hex hex hex hex), and returns the result as a string. If the argument is NULL, returns NULL. If the value does not start and ends with double quotes or if it starts and ends with double quotes but is not a valid JSON string literal, the value is passed through unmodified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 37fa2e7
yes, probably there is something wrong with my mysql... |
@anupamaggarwal can you please rebase to the latest master to be sure that it still passes ci ? (I tend to think that current failure is not related to the changes in PR) |
@flinkbot run azure |
Thanks @snuyanzin I pulled the latest from master on my branch and the build is passing now (It seems CI is a bit flaky since one of the runs post merge failed but the second went through). The failure didn't seem related to this change though. |
Cool, thanks for your contribution @anupamaggarwal if there is no objections I'm going to merge later today or tomorrow |
Thank you both @anupamaggarwal and @snuyanzin! |
What is the purpose of the change
Adds implementation of
json_quote
andjson_unquote
functions.PR is a continuation of #24156 and addresses feedback received on the original PR.
Brief change log / changes compared to PR 24156
Diff compared to #24156 - link
json_quote
, now with escaping logicjson_unquote
to unescape special charactersOpen questions
json_unquote
logic forIS JSON
is used but in certain casesIS JSON
behaviour does not seem to be consistent with Json spec as documented here https://www.json.org/json-en.html""""
should not be a valid json (which is also flagged by JsonLint) butselect '""""' IS JSON
returns trueReferences
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (no)Documentation