[Enhancement](multi-catalog) Rewrite S3URI to remove tricky virtualbucket mechanism and support different uri styles by flags. #38064
Closed
… bucket mechanism and support different uri styles by flags. (apache#33858)

Many domestic cloud vendors are compatible with the S3 protocol. However, early versions of the S3 client only generate path style HTTP requests (aws/aws-sdk-java-v2#763) when they encounter endpoints that do not start with `s3`, while some cloud vendors only support virtual host style HTTP requests.

Therefore, Doris used `forceVirtualHosted` in `S3URI` to convert the location into a virtual hosted path and implemented it through path style. For example, the S3 URI `s3://my-bucket/data/file.txt` is eventually parsed into:

- virtualBucket: my-bucket
- Bucket: data (the bucket must be set, otherwise the S3 client reports an error). This step is particularly tricky because of the limitations of the S3 client.
- Key: file.txt

Path style mode is then used to generate an HTTP request that resembles a virtual host request: the endpoint is set to virtualBucket + original endpoint, and the bucket and key are set as above.

**However, the bucket and key here are inconsistent with the original S3 concepts; the AWS client just happens to be able to generate an HTTP request similar to a virtual host request through path style mode.**

After apache#30799, the AWS SDK version was upgraded from 2.17.257 to 2.20.131. The current AWS S3 client can already generate virtual host style HTTP requests by default for third-party endpoints, so apache#31111 had to set the path style option to let the S3 client keep working with Doris' virtual bucket mechanism.

**Finally, the virtual bucket mechanism is too confusing and tricky, and we no longer need it with the new version of the S3 client.**

Rewrite `S3URI` to remove the tricky virtual bucket mechanism and support different URI styles by flags.

This class represents a fully qualified location in S3 for input/output operations, expressed as a URI.

#### For AWS S3, URI common styles:

- AWS Client Style (Hadoop S3 Style): `s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88`
- Virtual Host Style: `https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
- Path Style: `https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`

For the common styles above, `isPathStyle` controls whether the location is parsed as path style or virtual host style. Virtual host style is currently the mainstream and recommended approach, so the default value of `isPathStyle` is false.

#### Other Styles:

- Virtual Host AWS Client (Hadoop S3) Mixed Style: `s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
- Path AWS Client (Hadoop S3) Mixed Style: `s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`

These two styles are selected by combining `isPathStyle` and `forceParsingByStandardUri`:

- Virtual Host AWS Client (Hadoop S3) Mixed Style: `isPathStyle = false && forceParsingByStandardUri = true`
- Path AWS Client (Hadoop S3) Mixed Style: `isPathStyle = true && forceParsingByStandardUri = true`

When the incoming location is URL-encoded, the encoded string is returned: both `getKey()` and `getQueryParams()` return the encoded form.
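The flag combinations described above can be illustrated with a minimal sketch. This is not the actual `org.apache.doris.common.util.S3URI` code: the class name `S3LocationSketch`, its fields, and the `parse` method are hypothetical, and only the `s3` scheme is treated as the AWS client style here. It assumes the rules stated in the description: an `s3://` location is read as `bucket/key` unless `forceParsingByStandardUri` is set, and otherwise `isPathStyle` decides whether the bucket comes from the first path segment or from the first host label.

```java
import java.net.URI;

// Hypothetical sketch of flag-driven parsing; not the real Doris S3URI class.
final class S3LocationSketch {
    final String bucket;
    final String key;       // kept in its raw (still URL-encoded) form
    final String endpoint;  // null for the plain AWS client (Hadoop S3) style

    private S3LocationSketch(String bucket, String key, String endpoint) {
        this.bucket = bucket;
        this.key = key;
        this.endpoint = endpoint;
    }

    static S3LocationSketch parse(String location, boolean isPathStyle,
                                  boolean forceParsingByStandardUri) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme();
        String authority = uri.getRawAuthority();
        String path = uri.getRawPath();
        if (scheme == null || authority == null || path == null || path.length() < 2) {
            throw new IllegalArgumentException("Invalid S3 location: " + location);
        }
        String rest = path.substring(1); // drop the leading '/'

        // AWS client (Hadoop S3) style: s3://bucket/key
        if (scheme.equalsIgnoreCase("s3") && !forceParsingByStandardUri) {
            return new S3LocationSketch(authority, rest, null);
        }
        if (isPathStyle) {
            // Path style: <endpoint>/<bucket>/<key>
            int slash = rest.indexOf('/');
            if (slash <= 0 || slash == rest.length() - 1) {
                throw new IllegalArgumentException("Missing bucket or key: " + location);
            }
            return new S3LocationSketch(rest.substring(0, slash), rest.substring(slash + 1), authority);
        }
        // Virtual host style: <bucket>.<endpoint>/<key>
        int dot = authority.indexOf('.');
        if (dot <= 0) {
            throw new IllegalArgumentException("Missing bucket in host: " + location);
        }
        return new S3LocationSketch(authority.substring(0, dot), rest, authority.substring(dot + 1));
    }
}
```

Under these assumptions, `parse("s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt", true, true)` takes the path style branch, while the same call with both flags false would treat `s3.us-west-1.amazonaws.com` as the bucket. Query parameter handling is omitted to keep the sketch short.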
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website.
run buildall
TPC-H: Total hot run time: 49609 ms
TPC-DS: Total hot run time: 203802 ms
ClickBench: Total hot run time: 31.2 s
Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
…es when writing to s3. Cherry-pick apache#35645.

```
org.apache.doris.common.UserException: errCode = 2, detailMessage = java.net.URISyntaxException: Illegal character in path at index 114: oss://xxxxxxxxxxx/hive/tpcds1000_partition_oss/call_center/cc_call_center_sk=1/cc_mkt_class=A bit narrow forms matter animals. Consist/cc_market_manager=Daniel Weller/cc_rec_end_date=2001-12-31/f6b5ff4253414b06-9fd365ef68e5ddc5_133f02fb-a7e0-4109-9100-fb748a28259e-0.zlib.orc
    at org.apache.doris.common.util.S3URI.validateUri(S3URI.java:134) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.common.util.S3URI.parseUri(S3URI.java:120) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.common.util.S3URI.<init>(S3URI.java:116) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.common.util.S3URI.create(S3URI.java:108) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.fs.obj.S3ObjStorage.deleteObject(S3ObjStorage.java:194) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.fs.remote.ObjFileSystem.delete(ObjFileSystem.java:150) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.fs.remote.SwitchingFileSystem.delete(SwitchingFileSystem.java:92) ~[doris-fe.jar:1.2-
```

Hadoop partition names encode some special characters but not spaces, which differs from URI encoding. As a result, constructing a URI directly from such a location fails with the error above.

The fix is to parse the location with a regular expression and then pass each part separately to the URI constructor, which encodes each part.
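The approach described above can be sketched as follows. This is an illustration only, not the code from the cherry-picked change: the class name `LenientUriSketch` and its `parse` method are hypothetical, and the pattern used is the generic URI-splitting regular expression from RFC 3986, Appendix B. The multi-argument `java.net.URI` constructor quotes illegal characters such as spaces in each component.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split the location with a regex, then let the
// multi-argument URI constructor encode each component.
final class LenientUriSketch {
    // Generic URI-splitting pattern from RFC 3986, Appendix B.
    // Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
    private static final Pattern URI_PATTERN =
            Pattern.compile("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    static URI parse(String location) {
        Matcher m = URI_PATTERN.matcher(location);
        if (!m.matches()) {
            throw new IllegalArgumentException("Cannot split location: " + location);
        }
        try {
            // This constructor quotes illegal characters (for example spaces)
            // in each component instead of rejecting them.
            return new URI(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9));
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

One caveat of this sketch: the multi-argument constructors always quote the percent character, so a location that already contains percent-encoded sequences ends up with them encoded a second time in the raw form. `getPath()` still returns the original text in that case, because it decodes one level.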
run buildall
TPC-H: Total hot run time: 49668 ms
TPC-DS: Total hot run time: 203242 ms
ClickBench: Total hot run time: 30.3 s
Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Cherry-pick #33858