Reading from S3 compatible storage #18802

Open
robertdj opened this issue Sep 18, 2024 · 1 comment
Labels
enhancement New feature or an improvement of an existing feature

Comments

@robertdj

Description

I'm trying to use Polars to read a Parquet file stored in DigitalOcean Spaces, which is S3-compatible storage.
It works with the boto3 package, but I can't make it work with Polars.

I have set aws_access_key_id and aws_secret_access_key in ~/.aws/credentials.
With those credentials I can list the buckets with boto3.

import polars as pl

import boto3

session = boto3.Session()
client = session.client(
    "s3",
    region_name="fra1",
    endpoint_url="https://fra1.digitaloceanspaces.com",
)

client.list_buckets()

Note that the endpoint_url is specified.

In Spaces I have a bucket called mybucket containing a file called test.parquet. (Apparently aws_region should be set to us-east-1 for DigitalOcean.)

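# aws_access_key_id and aws_secret_access_key hold the same values as in ~/.aws/credentials
storage_options = {
    "aws_access_key_id": aws_access_key_id,
    "aws_secret_access_key": aws_secret_access_key,
    "aws_region": "us-east-1",
}
source = "s3://mybucket/test.parquet"
# Pass the options explicitly; otherwise the dict above is never used
pl.read_parquet(source, storage_options=storage_options)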

I get this error:

ComputeError: Generic S3 error: Client error with status 403 Forbidden: No Body

If I spell out the source more fully, including the Spaces hostname:

source = "s3://cache.fra1.digitaloceanspaces.com/mybucket/test.parquet"

I get a different error, which suggests that the endpoint is hard-coded to s3.amazonaws.com:

ComputeError: error sending request for url (https://fra1.digitaloceanspaces.com.s3.amazonaws.com/)
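
To illustrate why the second attempt ends up at an amazonaws.com host: in an s3:// URL the first component after the scheme is the bucket name, so the Spaces hostname gets treated as the bucket. The sketch below only mirrors the shape of the URL in the error message. The helper names and the default host are assumptions for illustration, not Polars internals.

```python
from urllib.parse import urlsplit


def split_s3_url(url: str) -> tuple[str, str]:
    """Split an s3:// URL into (bucket, key).

    The component right after the scheme is the bucket; the rest is the key.
    """
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parts.netloc, parts.path.lstrip("/")


def virtual_hosted_url(bucket: str, key: str, host: str = "s3.amazonaws.com") -> str:
    """Hypothetical helper: virtual-hosted-style URL against a default AWS host."""
    return f"https://{bucket}.{host}/{key}"


bucket, key = split_s3_url(
    "s3://cache.fra1.digitaloceanspaces.com/mybucket/test.parquet"
)
print(bucket)  # the Spaces hostname, not "mybucket"
print(key)
print(virtual_hosted_url(bucket, key))
```

So putting the hostname into the s3:// URL cannot select a different endpoint; it only changes which name is used as the bucket.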
@robertdj
Author

It turns out that I can make this work if I use PyArrow:

import pyarrow.dataset as ds
import pyarrow.fs as fs

pyfs = fs.S3FileSystem(endpoint_override="https://fra1.digitaloceanspaces.com")
pyds = ds.dataset(source="mybucket/test.parquet", filesystem=pyfs, format="parquet")
df = pl.scan_pyarrow_dataset(pyds).collect()

But it would be nice if it worked directly with Polars :-)
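
For comparison, a direct attempt might pass the endpoint through storage_options, the same way boto3 receives endpoint_url above. This is an untested sketch: it assumes Polars forwards the "aws_endpoint_url" key to its underlying object-store layer, and the helper names and placeholder credentials are made up for illustration.

```python
def spaces_storage_options(access_key_id, secret_access_key, endpoint_url,
                           region="us-east-1"):
    """Build a storage_options dict for an S3-compatible service.

    Assumption: "aws_endpoint_url" is honoured by the S3 backend Polars uses.
    """
    return {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
        "aws_region": region,
        "aws_endpoint_url": endpoint_url,
    }


def read_from_spaces(source, storage_options):
    """Read a Parquet file with Polars using the given storage options."""
    import polars as pl  # imported here so the dict helper works without Polars

    return pl.read_parquet(source, storage_options=storage_options)


# Placeholder credentials; substitute the values from ~/.aws/credentials.
options = spaces_storage_options(
    "MY_KEY_ID", "MY_SECRET", "https://fra1.digitaloceanspaces.com"
)
# df = read_from_spaces("s3://mybucket/test.parquet", options)
```

Whether this resolves the 403 on DigitalOcean Spaces is not confirmed here; the PyArrow route above is the one known to work.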
