
[FLINK-34111][table] Add support for json_quote, json_unquote, address PR feedback #24156 #24967

Closed

Conversation

anupamaggarwal
Contributor

@anupamaggarwal anupamaggarwal commented Jun 20, 2024

What is the purpose of the change

Adds implementations of the json_quote and json_unquote functions.
This PR is a continuation of #24156 and addresses feedback received on the original PR.

Brief change log / changes compared to PR 24156

Diff compared to #24156 - link

  • Changes the implementation of json_quote to add escaping logic
  • Changes the implementation of json_unquote to unescape special characters
  • Delegates valid-JSON checking to SqlJsonUtils#isJsonValue
  • Adds/updates some additional tests
  • Updates the documentation to remove the reference to MySQL's utf8mb4 encoding
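The escaping logic described above can be sketched as follows. This is a minimal illustration with a hypothetical class and method name, not the actual JsonQuoteFunction code; it escapes the characters listed in the function's documentation.

```java
// Minimal sketch of JSON string quoting: wrap the input in double quotes and
// escape interior special characters. Hypothetical helper, not the actual
// Flink implementation.
public class JsonQuoteSketch {
    public static String quote(String input) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '/':  sb.append("\\/"); break;
                case '\b': sb.append("\\b"); break;
                case '\f': sb.append("\\f"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

For example, an input containing a double quote or a newline comes back wrapped in quotes with those characters escaped.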

Open questions

  • High-level behavior for invalid JSON in json_unquote (pass the original input through as is vs. throwing an exception)
  • To check JSON validity, the json_unquote logic uses IS JSON, but in certain cases the IS JSON behavior does not seem to be consistent with the JSON spec as documented at https://www.json.org/json-en.html
    • According to the JSON spec, """" should not be valid JSON (which is also flagged by JsonLint), but select '""""' IS JSON returns true
      • For an invalid JSON string, MySQL's json_unquote throws an exception (however, we won't do the same)

References

  • postgres implementation for escape

Verifying this change

  • Added ITs in JsonFunctionsITCase

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (docs)

@flinkbot
Collaborator

flinkbot commented Jun 20, 2024

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-runs the last Azure build

@anupamaggarwal
Contributor Author

@flinkbot run azure

@anupamaggarwal anupamaggarwal marked this pull request as ready for review June 21, 2024 02:38
@anupamaggarwal anupamaggarwal changed the title [FLINK-34111] [table] Add support for json_quote, json_unquote, address PR feedback #24156 [FLINK-34111][table] Add support for json_quote, json_unquote, address PR feedback #24156 Jun 21, 2024
@snuyanzin
Contributor

snuyanzin commented Jul 4, 2024

Thanks for the contribution @anupamaggarwal
I checked it, and for most cases it works well.

I found one issue with symbols encoded outside of the BMP (Basic Multilingual Plane).

e.g.

SELECT '';

This works fine
now if we try it with newly added functions

SELECT json_quote('');

it fails as

Caused by: java.util.IllegalFormatConversionException: x != java.lang.Character
	at java.base/java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:4426)
	at java.base/java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2938)
	at java.base/java.util.Formatter$FormatSpecifier.print(Formatter.java:2892)
	at java.base/java.util.Formatter.format(Formatter.java:2673)
	at java.base/java.util.Formatter.format(Formatter.java:2609)
	at java.base/java.lang.String.format(String.java:2897)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.formatStr(JsonQuoteFunction.java:77)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.quote(JsonQuoteFunction.java:68)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.eval(JsonQuoteFunction.java:88)
	at StreamExecCalc$3.processElement(Unknown Source)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
	at ...

I tend to think that here

	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.formatStr(JsonQuoteFunction.java:77)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.quote(JsonQuoteFunction.java:68)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.eval(JsonQuoteFunction.java:88)

need to work with codepoints rather than chars
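The failing conversion can be reproduced in plain Java: `%x` rejects a boxed `Character`, so the escaping code has to pass an `int` (and iterate UTF-16 units or code points). A minimal sketch with a hypothetical helper name, not the actual JsonQuoteFunction code:

```java
public class CodePointEscapeSketch {
    // String.format("\\u%04x", someChar) throws
    // IllegalFormatConversionException: x != java.lang.Character,
    // because %x only accepts integral types. Casting to int works, and a
    // supplementary character simply becomes two \uXXXX surrogate escapes.
    public static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char unit = s.charAt(i);
            if (unit > 0x7F) {
                sb.append(String.format("\\u%04x", (int) unit));
            } else {
                sb.append(unit);
            }
        }
        return sb.toString();
    }
}
```

For U+1F600 (outside the BMP), this produces the surrogate-pair escapes \ud83d\ude00 instead of throwing.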

At the same time query

SELECT json_unquote('');

works fine.

I also checked the same queries against MySQL and they work fine.

…quote(Formatter) due to wrong (char) object type
@anupamaggarwal
Contributor Author

thanks for your review @snuyanzin !
Fixed in 4267018

@@ -377,6 +377,12 @@ string:
- sql: SUBSTR(string, integer1[, integer2])
table: STRING.substr(INTEGER1[, INTEGER2])
description: Returns a substring of string starting from position integer1 with length integer2 (to the end by default).
- sql: JSON_QUOTE(string)
Contributor

One thing I forgot to mention: we also need to update the Chinese docs.
It's OK to put the English version of the description there, but it's better to keep them synced.

Contributor

@fhueske fhueske Jul 9, 2024

Good point!
We can also create a Jira with a special tag to document gaps in the Chinese documentation: https://flink.apache.org/how-to-contribute/contribute-documentation/#chinese-documentation-translation

Contributor Author

Contributor

Thanks for creating the jira.
I would still vote for filling the Chinese doc with the English description, which could then be translated as part of the jira issue you've created.

Contributor Author

sure @snuyanzin, fixed in cbcb044

Contributor

@snuyanzin snuyanzin left a comment

hm, during testing I noticed 2 strange things

  1. If the escaped quote is alone, or combined with something that is not a unicode escape, then everything is OK, e.g.

SELECT JSON_UNQUOTE('"\""'), JSON_UNQUOTE('"\"\t"');

returns

"                              "

which is OK. However, once we add a unicode escape like

SELECT JSON_UNQUOTE('"\"\ufffa"');

it starts printing the backslash as well, which is NOT OK.

  2. It seems the current implementation allows multiple double quotes,
e.g. this query still works (in MySQL it fails):

SELECT JSON_UNQUOTE('"1""1"');

On one hand MySQL doesn't allow this; on the other hand I think it is still OK if it returns a valid result.
However, the main issue is that it fails (trace below) if we use such input with an invalid unicode escape, e.g.

 SELECT JSON_UNQUOTE('"1""\uzzzz"');

while at the same time this query works fine:

SELECT JSON_UNQUOTE('"1\uzzzz"');
Caused by: org.apache.flink.util.FlinkRuntimeException: java.lang.NumberFormatException: For input string: "zzzz" under radix 16
	at org.apache.flink.table.runtime.functions.scalar.JsonUnquoteFunction.eval(JsonUnquoteFunction.java:118)
	at StreamExecCalc$3.processElement(Unknown Source)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
	at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:425)
	at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:520)
	at org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:110)
	at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:100)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:113)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:71)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:338)
Caused by: java.lang.NumberFormatException: For input string: "zzzz" under radix 16
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
	at java.base/java.lang.Integer.parseInt(Integer.java:668)
	at org.apache.flink.table.runtime.functions.scalar.JsonUnquoteFunction.fromUnicodeLiteral(JsonUnquoteFunction.java:99)
	at org.apache.flink.table.runtime.functions.scalar.JsonUnquoteFunction.unquote(JsonUnquoteFunction.java:78)
	at org.apache.flink.table.runtime.functions.scalar.JsonUnquoteFunction.eval(JsonUnquoteFunction.java:112)
	... 11 more

UPD: I think it is better to follow an existing vendor's behavior, like MySQL's, and not allow more than 2 non-escaped double quotes.
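The stricter behavior suggested here could look roughly like this. It is a sketch with hypothetical names, not the Flink implementation; it assumes invalid literals are passed through unmodified rather than throwing, and it handles only BMP \uXXXX escapes.

```java
public class JsonUnquoteSketch {
    // Minimal sketch of strict unquoting: an unescaped interior double quote
    // or a malformed \uXXXX escape makes the literal invalid, and the
    // original input is returned unchanged.
    public static String unquote(String s) {
        if (s.length() < 2 || s.charAt(0) != '"' || s.charAt(s.length() - 1) != '"') {
            return s;
        }
        StringBuilder sb = new StringBuilder();
        int i = 1, end = s.length() - 1;
        while (i < end) {
            char c = s.charAt(i++);
            if (c == '"') {
                return s; // unescaped quote inside the literal, e.g. '"1""1"'
            } else if (c == '\\') {
                if (i >= end) return s;
                char esc = s.charAt(i++);
                switch (esc) {
                    case '"':  sb.append('"');  break;
                    case '\\': sb.append('\\'); break;
                    case '/':  sb.append('/');  break;
                    case 'b':  sb.append('\b'); break;
                    case 'f':  sb.append('\f'); break;
                    case 'n':  sb.append('\n'); break;
                    case 'r':  sb.append('\r'); break;
                    case 't':  sb.append('\t'); break;
                    case 'u':
                        if (i + 4 > end) return s;
                        try {
                            // BMP only in this sketch
                            sb.append((char) Integer.parseInt(s.substring(i, i + 4), 16));
                        } catch (NumberFormatException e) {
                            return s; // e.g. '"1\uzzzz"' passes through unmodified
                        }
                        i += 4;
                        break;
                    default:
                        return s;
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this shape, '"1""1"' and '"1\uzzzz"' both come back unchanged instead of one of them throwing.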

Contributor

@fhueske fhueske left a comment

Thanks for working on this @anupamaggarwal!

I have just a few high-level comments so far.
Will dig a bit deeper into this soon.

Thank you, Fabian

@anupamaggarwal
Contributor Author

thanks @snuyanzin for your review!
I think both of these cases are related to the inconsistent behavior (IIUC) of the IS JSON function, which the json_unquote implementation uses for checking JSON validity. I had observed something similar earlier (see the Open questions section of the PR description).

SELECT JSON_UNQUOTE('"1""1"');

As you pointed out, this should ideally throw an exception (as in MySQL); however, IS JSON returns true in this case (IIUC this should be a bug, since 1""1 is not valid JSON?)

select '"1""1"' is json; //true 

If we fix this by throwing an exception (preferable), it could be inconsistent with what IS JSON considers valid JSON.
Should we care about compatibility with the IS JSON behavior here, or just fix this and file a JIRA for IS JSON?

The same is the case with

SELECT JSON_UNQUOTE('"1""\uzzzz"')
select '"1""\uzzzz"' is JSON; //true

However, in this case we throw an exception (from json_unquote) since \uzzzz is not a valid literal.

For

SELECT JSON_UNQUOTE('"1\uzzzz"')

IS JSON returns FALSE and we return the input string back, per the documentation.

SELECT JSON_UNQUOTE('""\ufffa"');
it starts printing backslash as well which is NOT OK

Hmm, I tested this through the sql-client.sh, and I get " which seems expected? (as ""\ufffa" is a valid JSON )

@snuyanzin
Contributor

Thanks for the explanation

However in this case we throw an exception (from json_unquote) since \uzzzz is not a valid literal

that's the main point, since the related docs of this PR say [1]

If the value starts and ends with double quotes but is not a valid JSON string literal, the value is passed through unmodified.

So based on the doc, instead of an exception I would expect the same value as the input

[1] https://github.com/apache/flink/pull/24967/files#diff-539fb22ee6aeee4cf07230bb4155500c6680c4cc889260e2c58bfa9d63fb7de5R385

Contributor

@fhueske fhueske left a comment

Thanks for the update @anupamaggarwal!

I've found another case that we are not handling correctly.
Luckily, it doesn't depend on IS_JSON and is easy to fix :-D

Thanks, Fabian

Contributor

@fhueske fhueske left a comment

Thanks for adding the quote checks and additional test cases.
I found one more thing that we should fix.

Thank you, Fabian

Comment on lines 47 to 49
} catch (Throwable t) {
// ignore
}
Contributor

I don't think we should swallow any kind of exception or error.
Let's catch the errors that we expect (IllegalArgumentException, etc.) and handle those by returning the original input.
Unexpected exceptions should be forwarded because they might be related to other issues.
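The suggested pattern, sketched with hypothetical names (not the actual JsonUnquoteFunction code): catch only the expected exception type and fall back to the original input, while letting everything else propagate.

```java
public class NarrowCatchSketch {
    // Hypothetical strict parser that throws on an invalid literal.
    static String strictUnquote(String s) {
        if (s.length() < 2 || s.charAt(0) != '"' || s.charAt(s.length() - 1) != '"') {
            throw new IllegalArgumentException("not a quoted literal");
        }
        return s.substring(1, s.length() - 1);
    }

    // Catch only the expected exception and return the original input;
    // unexpected errors (e.g. a NullPointerException from a bug, or an
    // OutOfMemoryError) still propagate instead of being swallowed by
    // catch (Throwable t).
    public static String unquoteOrPassThrough(String input) {
        try {
            return strictUnquote(input);
        } catch (IllegalArgumentException e) {
            return input;
        }
    }
}
```

Note that NumberFormatException is a subclass of IllegalArgumentException, so a single catch clause covers both the structural check and hex-parsing failures here.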

Contributor Author

thanks @fhueske, @snuyanzin makes sense, fixed

@snuyanzin
Contributor

One more diff between MySQL and Flink behavior; however, I'm not sure this one needs fixing, since at least from my point of view MySQL's behavior is questionable.
MySQL

SELECT  JSON_UNQUOTE('"\\u0022\\u005c\\u005c\\u0075\\u0030\\u0030\\u0061\\u0061\\u0022"');
SELECT  JSON_UNQUOTE(JSON_UNQUOTE('"\\u0022\\u005c\\u005c\\u0075\\u0030\\u0030\\u0061\\u0061\\u0022"'));

returns

"\\u00aa"
\u00aa  

Flink

SELECT  JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"');
SELECT  JSON_UNQUOTE(JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'));

returns

"\\u00aa"
ª 

@anupamaggarwal
Contributor Author

thanks @snuyanzin, fixed in 7546d9c

Hmm, that's strange, when I tested

SELECT JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"');
SELECT JSON_UNQUOTE(JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'));

through the sql-client I get the same result as MySQL

I didn't entirely understand why this would be unexpected though
For example, if I test (using sql-client) with

select json_unquote('"\u00aa"'); I get ª
whereas
select json_unquote('"\\u00aa"'); returns \u00aa

which seems ok to me, unless I missed something?

Added some tests in ScalarFunctionsTest
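The double-unescaping discussed above can be traced with a small sketch. It is a hypothetical helper handling only '\\', '\uXXXX', and the surrounding quotes, not the Flink implementation: the first round decodes \u005c\u005c into two backslashes, and only the second round collapses them into one and leaves u00aa as literal text.

```java
public class DoubleUnescapeSketch {
    // Decode one level of escaping inside an already-unquoted string.
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c == '\\' && i < s.length()) {
                char esc = s.charAt(i++);
                if (esc == 'u' && i + 4 <= s.length()) {
                    sb.append((char) Integer.parseInt(s.substring(i, i + 4), 16));
                    i += 4;
                } else if (esc == '\\') {
                    sb.append('\\');
                } else {
                    sb.append(esc); // other escapes elided in this sketch
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // One round of json_unquote-like processing: strip the outer quotes,
    // then unescape the interior once.
    public static String jsonUnquoteOnce(String s) {
        if (s.length() >= 2 && s.charAt(0) == '"' && s.charAt(s.length() - 1) == '"') {
            return unescape(s.substring(1, s.length() - 1));
        }
        return s;
    }
}
```

Applying jsonUnquoteOnce to '"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"' yields "\\u00aa" (quotes included), and a second application yields \u00aa, matching the MySQL results quoted above.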

Contributor

@fhueske fhueske left a comment

Thanks for the update @anupamaggarwal.

I've double checked the behavior of

SELECT JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"');
SELECT JSON_UNQUOTE(JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'));

and can confirm that it is the same as in MySQL.

I think this PR is ready to be merged (just one improvement of the docs).
What do you think @snuyanzin?

description: Quotes a string as a JSON value by wrapping it with double quote characters, escaping interior quote and special characters ('"', '\', '/', 'b', 'f', 'n', 'r', 't'), and returning the result as a string. If the argument is NULL, the function returns NULL.
- sql: JSON_UNQUOTE(string)
table: STRING.JsonUnquote()
description: Unquotes JSON value, unescapes escaped special characters ('"', '\', '/', 'b', 'f', 'n', 'r', 't', 'u' hex hex hex hex), and returns the result as a string. If the argument is NULL, returns NULL. If the value starts and ends with double quotes but is not a valid JSON string literal, the value is passed through unmodified.
Contributor

Suggested change
description: Unquotes JSON value, unescapes escaped special characters ('"', '\', '/', 'b', 'f', 'n', 'r', 't', 'u' hex hex hex hex), and returns the result as a string. If the argument is NULL, returns NULL. If the value starts and ends with double quotes but is not a valid JSON string literal, the value is passed through unmodified.
description: Unquotes JSON value, unescapes escaped special characters ('"', '\', '/', 'b', 'f', 'n', 'r', 't', 'u' hex hex hex hex), and returns the result as a string. If the argument is NULL, returns NULL. If the value does not start and end with double quotes, or if it starts and ends with double quotes but is not a valid JSON string literal, the value is passed through unmodified.

Contributor Author

Fixed in 37fa2e7

@snuyanzin
Contributor

yes, probably there is something wrong with my mysql...
thanks for checking
and thanks for addressing feedback

@snuyanzin
Contributor

@anupamaggarwal can you please rebase onto the latest master to make sure it still passes CI? (I tend to think the current failure is not related to the changes in this PR)

@anupamaggarwal
Contributor Author

@flinkbot run azure

@anupamaggarwal
Contributor Author

anupamaggarwal commented Aug 8, 2024

Thanks @snuyanzin, I pulled the latest from master onto my branch and the build is passing now (it seems CI is a bit flaky, since one of the post-merge runs failed but the second went through). The failure didn't seem related to this change though.

@snuyanzin
Contributor

Cool, thanks for your contribution @anupamaggarwal
thanks for the review @fhueske

If there are no objections, I'm going to merge later today or tomorrow.

@fhueske
Contributor

fhueske commented Aug 9, 2024

Thank you both @anupamaggarwal and @snuyanzin!

@snuyanzin snuyanzin closed this in 9457b87 Aug 11, 2024