DB2 sql_database source | dlt queries wrong system tables for iseries #1858

Closed
jeff-skoldberg-gmds opened this issue Sep 23, 2024 · 6 comments · Fixed by #1867
Assignees: rudolfix
Labels: community (This issue came from slack community workspace) · question (Further information is requested)

Comments

@jeff-skoldberg-gmds

dlt version

1.0.0

Describe the problem

I'm connecting to Db2 for iSeries using this basic code:

import dlt
from dlt.sources.sql_database import sql_database  # core source in dlt 1.0


def load_select_tables_from_database() -> None:
    # Define the pipeline
    pipeline = dlt.pipeline(
        pipeline_name="iseries",
        dataset_name="iseries",
        destination=dlt.destinations.duckdb("data.db"),
    )
    # credentials are picked up from .dlt/secrets.toml
    source = sql_database(reflection_level="minimal").with_resources("MY_SCHEMA.MY_TABLE")
    pipeline.run(source)

I get this traceback:

(pyodbc.ProgrammingError) ('42S02', '[42S02] [IBM][System i Access ODBC Driver][DB2 for i5/OS]SQL0204 - TABLES in SYSCAT type *FILE not found. (-204) (SQLPrepare)')
[SQL: SELECT "SYSCAT"."TABLES"."TABNAME" 
FROM "SYSCAT"."TABLES" 
WHERE "SYSCAT"."TABLES"."TYPE" = ? AND "SYSCAT"."TABLES"."TABSCHEMA" IS NULL ORDER BY "SYSCAT"."TABLES"."TABNAME"]
[parameters: ('T',)]
(Background on this error at: https://sqlalche.me/e/20/f405)
pyodbc.ProgrammingError: ('42S02', '[42S02] [IBM][System i Access ODBC Driver][DB2 for i5/OS]SQL0204 - TABLES in SYSCAT type *FILE not found. (-204) (SQLPrepare)')

ChatGPT suggested the following:
This new error (SQL0204 - TABLES in SYSCAT type *FILE not found) indicates that the query you're running is targeting the wrong system catalog or metadata structure. IBM Db2 for i5/OS (also known as Db2 for iSeries or AS/400) uses a different system catalog structure compared to Db2 LUW (Linux/Unix/Windows), which is why the SYSCAT.TABLES catalog might not exist on your iSeries system.
The error happens because SQLAlchemy is trying to query the SYSCAT.TABLES catalog, which exists in Db2 LUW but does not exist in Db2 for i5/OS (IBM i).
Db2 for iSeries (IBM i) uses a different system catalog, which stores metadata differently (typically in the QSYS2 library).
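A quick way to confirm this theory (a hedged sketch; the connection string and schema name below are placeholders, not values from this issue) is to query the QSYS2 catalog directly over the same pyodbc connection:

from sqlalchemy import create_engine, text

# placeholder odbc_connect-style URL for the IBM i Access ODBC Driver
engine = create_engine(
    "db2+pyodbc:///?odbc_connect=DRIVER%3D%7BIBM+i+Access+ODBC+Driver%7D%3BSYSTEM%3Dmy-host%3BUID%3DUSER%3BPWD%3DSECRET%3B"
)

with engine.connect() as conn:
    # Db2 for i keeps table metadata in QSYS2.SYSTABLES rather than SYSCAT.TABLES (Db2 LUW)
    rows = conn.execute(text(
        "SELECT TABLE_NAME FROM QSYS2.SYSTABLES "
        "WHERE TABLE_SCHEMA = 'MY_SCHEMA' FETCH FIRST 10 ROWS ONLY"
    ))
    for row in rows:
        print(row)

If that returns rows while SYSCAT.TABLES errors out, the problem is purely which catalog SQLAlchemy reflects against.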


What can we do to resolve this?

Expected behavior

No response

Steps to reproduce

Run the pipeline pasted above while connected to iSeries AS/400.

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

sql_database

dlt destination

No response

Other deployment details

duckdb until this works...

Additional information

No response

@rudolfix (Collaborator)

Hi @jeff-skoldberg-gmds
Here's how we connect to DB2 (running in a container, community edition):

from dlt.sources.sql_database import sql_table

unsw_table = sql_table(
    "db2+ibm_db://db2inst1:xxxx@localhost:50000/dlt_data",  # connection string
    "unsw_flow",  # table name
    "db2inst1",  # schema
    chunk_size=100000,
    backend="pyarrow",
)

where "db2inst1" is name of the schema and also the name of the user (which is also the name of server instance). dlt_data is the name of the database.
The error you see is coming from SQLAlchemy that is not able to reflect the database. Since you omit the schema argument in sql_database I guess it connects using some kind of public user/schema that does not have access to the TABLES
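If a missing schema is indeed the problem, a minimal sketch of passing it explicitly would look like this (schema and table names are the placeholders from the report; credentials come from secrets.toml):

import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(
    schema="MY_SCHEMA",  # reflect this schema instead of the driver's default
    reflection_level="minimal",
).with_resources("MY_TABLE")  # resources are keyed by table name once the schema is set

pipeline = dlt.pipeline(
    pipeline_name="iseries",
    dataset_name="iseries",
    destination=dlt.destinations.duckdb("data.db"),
)
pipeline.run(source)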

@rudolfix rudolfix added the community (This issue came from slack community workspace) and question (Further information is requested) labels on Sep 23, 2024
@rudolfix rudolfix self-assigned this Sep 23, 2024
@jeff-skoldberg-gmds (Author) commented Sep 23, 2024

I should have mentioned, when I connect using the ibm_db connection method, I get a licensing problem, described here.

Traceback:

Pipeline execution failed at stage extract when processing package 1727094437.8127725 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe RMSLHP: extraction of resource RMSLHP in generator table_rows caused an exception: (ibm_db_dbi.ProgrammingError) ibm_db_dbi::ProgrammingError: [IBM][CLI Driver] SQL1598N  An attempt to connect to the database server failed because of a licensing problem.  SQLSTATE=42968 SQLCODE=-1598
(Background on this error at: https://sqlalche.me/e/14/f405)
Exception: [IBM][CLI Driver] SQL1598N  An attempt to connect to the database server failed because of a licensing problem.  SQLSTATE=42968 SQLCODE=-1598

The IT department said I must use a specific driver that avoids the licensing problem: pyodbc with the "IBM i Access ODBC Driver".

I'm able to read tables using sqlalchemy + pyodbc (not using dlt):

from sqlalchemy import create_engine, text

connection_string = f'db2+pyodbc://{username}:{password}@{hostname}:{port}/{database}?driver=IBM%20i%20Access%20ODBC%20Driver'
engine = create_engine(connection_string)
with engine.connect() as connection:
    result = connection.execute(text("SELECT * FROM schema.tbl FETCH FIRST 1 ROWS ONLY"))
    for row in result:
        print(row)

This works.

Bringing this into dlt, I set it up like this:

[iseries.sources.sql_database]
credentials = "db2+pyodbc:///?odbc_connect=DRIVER%3D%7BIBM+i+Access+ODBC+Driver%7D%3BSYSTEM%3D192.foo.bar.baz%3BUID%3DAME%3BPWD%3DPW%3BPORT%3D446%3BDATABASE%3Ddbname%3B"

The connection is definitely successful, but it fails at the schema reflection step: because this is iSeries AS/400, it uses different system tables. It expects tables in the QSYS2 schema, not SYSCAT, as described here.

Then I discovered another adapter:
https://github.com/IBM/sqlalchemy-ibmi

In theory this should tell SQLAlchemy the correct system tables to use for schema reflection, but it will not let you pass a database parameter, which dlt requires.
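If the database requirement gets relaxed, a rough, untested sketch of combining the ibmi dialect with dlt might look like this (the "ibmi://" URL shape without a database path segment and passing a ready-made Engine as credentials are assumptions to verify; all names are placeholders):

import dlt
from sqlalchemy import create_engine
from dlt.sources.sql_database import sql_database

# sqlalchemy-ibmi reflects against the IBM i catalogs; note there is no database path segment
engine = create_engine("ibmi://USER:SECRET@my-iseries-host")

source = sql_database(credentials=engine, schema="MY_SCHEMA", reflection_level="minimal")

pipeline = dlt.pipeline(pipeline_name="iseries", dataset_name="iseries", destination="duckdb")
pipeline.run(source.with_resources("MY_TABLE"))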


Summary

At this point I understand it is a SQLAlchemy issue and not a dlt issue. But since I can at least query tables with SQLAlchemy + pyodbc outside of dlt, I would hope there is a way to do it with dlt. Schema reflection, however, will always fail because of the type of Db2 instance.

Any further suggestions?

@rudolfix (Collaborator) commented Sep 23, 2024

Thanks for this info.

  1. The database requirement in the base connection is too strict and we can make it optional. We'll be doing a bugfix update on Wednesday; maybe that could help you use the ibmi dialect. Alternatively, you can "hack" the installed dlt package directly:
     in dlt/common/configuration/specs/connection_string_credentials.py, change database to database: Optional[str] = None
  2. Your example is just selecting from a table, not doing database reflection:
engine = create_engine(connection_string)
with engine.connect() as connection:
    result = connection.execute(text("SELECT * FROM schema.tbl FETCH FIRST 1 ROWS ONLY"))
    for row in result:
        print(row)

You'd need to do something like

table = sa.Table(name, MetaData(schema=schema), autoload_with=engine)

and see if it populates. There's still a chance the schema is wrong... Also try it with the ibmi dialect; if that works there, the (1) fix will also work for you. (A self-contained sketch of this reflection check follows the docs sample below.)
  3. If everything else fails: if your tables are not so big that you need the pyarrow backend, you can implement your own SQLAlchemy source. In fact, your code above is 90% of what you need. Here's a sample from our docs that avoids reflection and derives the schema purely from the data:

import dlt
from sqlalchemy import create_engine

# Use any SQL database supported by SQLAlchemy, below we use a public
# MySQL instance to get data.
# NOTE: you'll need to install pymysql with `pip install pymysql`
# NOTE: loading data from public mysql instance may take several seconds
engine = create_engine("mysql+pymysql://[email protected]:4497/Rfam")

with engine.connect() as conn:
    # Select genome table, stream data in batches of 100 elements
    query = "SELECT * FROM genome LIMIT 1000"
    rows = conn.execution_options(yield_per=100).exec_driver_sql(query)

    pipeline = dlt.pipeline(
        pipeline_name="from_database",
        destination="duckdb",
        dataset_name="genome_data",
    )

    # Convert the rows into dictionaries on the fly with a map function
    load_info = pipeline.run(map(lambda row: dict(row._mapping), rows), table_name="genome")

print(load_info)
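For point 2, a self-contained version of that reflection check (a hedged sketch; the connection string, schema, and table name are the placeholders from earlier in this thread) could look like:

import sqlalchemy as sa

engine = sa.create_engine(
    "db2+pyodbc:///?odbc_connect=DRIVER%3D%7BIBM+i+Access+ODBC+Driver%7D%3BSYSTEM%3Dmy-host%3BUID%3DUSER%3BPWD%3DSECRET%3B"
)

# reflection similar to what sql_database performs during extraction
metadata = sa.MetaData(schema="MY_SCHEMA")
table = sa.Table("MY_TABLE", metadata, autoload_with=engine)

print([c.name for c in table.columns])  # if this raises SQL0204 again, reflection itself is the blocker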

@jeff-skoldberg-gmds (Author)

@rudolfix, this is fantastic; it gives me a few things to try. I will update you.
I did realize the SQLAlchemy example I gave was working because it was not doing schema reflection; I just wanted to prove I can connect. :)

Thanks again.

@jeff-skoldberg-gmds (Author)

YES! This works:

"hack" the installed dlt package directly:
dlt/common/configuration/specs/connection_string_credentials.py - change database to database: Optional[str] = None
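Roughly, the workaround amounts to relaxing that one field; a simplified illustration (not the verbatim dlt source, which has more fields and different base classes):

from typing import Optional

class ConnectionStringCredentialsSketch:
    drivername: Optional[str] = None
    username: Optional[str] = None
    password: Optional[str] = None
    host: Optional[str] = None
    port: Optional[int] = None
    database: Optional[str] = None  # the line the workaround changes: database is no longer required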

@jeff-skoldberg-gmds (Author)

Thanks @rudolfix for putting in this fix! I was wondering: when will the fix be available on PyPI so I can install it with pip install dlt --upgrade?
I'm just planning out how I deploy my function in Azure, and I guess Docker will have to clone the repo for now instead of installing from requirements?
