DB2 sql_database source | dlt queries wrong system tables for iseries #1858

Closed
jeff-skoldberg-gmds opened this issue Sep 23, 2024 · 6 comments · Fixed by #1867
Assignees: rudolfix
Labels: community (This issue came from slack community workspace) · question (Further information is requested)

Comments

@jeff-skoldberg-gmds

dlt version

1.0.0

Describe the problem

I'm connecting to Db2 for iSeries using this basic code:

import dlt
from dlt.sources.sql_database import sql_database  # core source in dlt 1.0


def load_select_tables_from_database() -> None:
    # Define the pipeline
    pipeline = dlt.pipeline(
        pipeline_name="iseries",
        dataset_name="iseries",
        destination=dlt.destinations.duckdb("data.db"),
    )
    # credentials are picked up from .dlt/secrets.toml
    source = sql_database(reflection_level="minimal").with_resources("MY_SCHEMA.MY_TABLE")
    pipeline.run(source)

I get this traceback:

(pyodbc.ProgrammingError) ('42S02', '[42S02] [IBM][System i Access ODBC Driver][DB2 for i5/OS]SQL0204 - TABLES in SYSCAT type *FILE not found. (-204) (SQLPrepare)')
[SQL: SELECT "SYSCAT"."TABLES"."TABNAME" 
FROM "SYSCAT"."TABLES" 
WHERE "SYSCAT"."TABLES"."TYPE" = ? AND "SYSCAT"."TABLES"."TABSCHEMA" IS NULL ORDER BY "SYSCAT"."TABLES"."TABNAME"]
[parameters: ('T',)]
(Background on this error at: https://sqlalche.me/e/20/f405)
pyodbc.ProgrammingError: ('42S02', '[42S02] [IBM][System i Access ODBC Driver][DB2 for i5/OS]SQL0204 - TABLES in SYSCAT type *FILE not found. (-204) (SQLPrepare)')

ChatGPT suggested the following:
This new error (SQL0204 - TABLES in SYSCAT type *FILE not found) indicates that the query you're running is targeting the wrong system catalog or metadata structure. IBM Db2 for i5/OS (also known as Db2 for iSeries or AS/400) uses a different system catalog structure compared to Db2 LUW (Linux/Unix/Windows), which is why the SYSCAT.TABLES catalog might not exist on your iSeries system.
The error happens because SQLAlchemy is trying to query the SYSCAT.TABLES catalog, which exists in Db2 LUW but does not exist in Db2 for i5/OS (IBM i).
Db2 for iSeries (IBM i) uses a different system catalog, which stores metadata differently (typically in the QSYS2 library).
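A quick way to confirm this theory (a hedged sketch; the connection string and schema name below are placeholders, not values from this issue) is to query the QSYS2 catalog directly over the same pyodbc connection:

from sqlalchemy import create_engine, text

# placeholder odbc_connect-style URL for the IBM i Access ODBC Driver
engine = create_engine(
    "db2+pyodbc:///?odbc_connect=DRIVER%3D%7BIBM+i+Access+ODBC+Driver%7D%3BSYSTEM%3Dmy-host%3BUID%3DUSER%3BPWD%3DSECRET%3B"
)

with engine.connect() as conn:
    # Db2 for i keeps table metadata in QSYS2.SYSTABLES rather than SYSCAT.TABLES (Db2 LUW)
    rows = conn.execute(text(
        "SELECT TABLE_NAME FROM QSYS2.SYSTABLES "
        "WHERE TABLE_SCHEMA = 'MY_SCHEMA' FETCH FIRST 10 ROWS ONLY"
    ))
    for row in rows:
        print(row)

If that returns rows while SYSCAT.TABLES errors out, the problem is purely which catalog SQLAlchemy reflects against.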


What can we do to resolve this?

Expected behavior

No response

Steps to reproduce

Run the pipeline pasted above while connected to iSeries AS/400.

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

sql_database

dlt destination

No response

Other deployment details

duckdb until this works...

Additional information

No response

@rudolfix (Collaborator)

Hi @jeff-skoldberg-gmds
Here's how we connect to DB2 (running in a container, community edition):

from dlt.sources.sql_database import sql_table

unsw_table = sql_table(
    "db2+ibm_db://db2inst1:xxxx@localhost:50000/dlt_data",  # connection string
    "unsw_flow",  # table name
    "db2inst1",  # schema
    chunk_size=100000,
    backend="pyarrow",
)

where "db2inst1" is name of the schema and also the name of the user (which is also the name of server instance). dlt_data is the name of the database.
The error you see is coming from SQLAlchemy that is not able to reflect the database. Since you omit the schema argument in sql_database I guess it connects using some kind of public user/schema that does not have access to the TABLES
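If a missing schema is indeed the problem, a minimal sketch of passing it explicitly would look like this (schema and table names are the placeholders from the report; credentials come from secrets.toml):

import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(
    schema="MY_SCHEMA",  # reflect this schema instead of the driver's default
    reflection_level="minimal",
).with_resources("MY_TABLE")  # resources are keyed by table name once the schema is set

pipeline = dlt.pipeline(
    pipeline_name="iseries",
    dataset_name="iseries",
    destination=dlt.destinations.duckdb("data.db"),
)
pipeline.run(source)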

@rudolfix rudolfix added the community (This issue came from slack community workspace) and question (Further information is requested) labels on Sep 23, 2024
@rudolfix rudolfix self-assigned this Sep 23, 2024
@jeff-skoldberg-gmds (Author) commented Sep 23, 2024

I should have mentioned, when I connect using the ibm_db connection method, I get a licensing problem, described here.

Traceback:

Pipeline execution failed at stage extract when processing package 1727094437.8127725 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe RMSLHP: extraction of resource RMSLHP in generator table_rows caused an exception: (ibm_db_dbi.ProgrammingError) ibm_db_dbi::ProgrammingError: [IBM][CLI Driver] SQL1598N  An attempt to connect to the database server failed because of a licensing problem.  SQLSTATE=42968 SQLCODE=-1598
(Background on this error at: https://sqlalche.me/e/14/f405)
Exception: [IBM][CLI Driver] SQL1598N  An attempt to connect to the database server failed because of a licensing problem.  SQLSTATE=42968 SQLCODE=-1598

The IT department said I must use a specific driver that avoids the licensing problem: pyodbc with the "IBM i Access ODBC Driver".

I'm able to read tables using sqlalchemy + pyodbc (not using dlt):

from sqlalchemy import create_engine, text

connection_string = f'db2+pyodbc://{username}:{password}@{hostname}:{port}/{database}?driver=IBM%20i%20Access%20ODBC%20Driver'
engine = create_engine(connection_string)
with engine.connect() as connection:
    result = connection.execute(text("SELECT * FROM schema.tbl FETCH FIRST 1 ROWS ONLY"))
    for row in result:
        print(row)

This works.

Bringing this into dlt, I set it up like this:

[iseries.sources.sql_database]
credentials = "db2+pyodbc:///?odbc_connect=DRIVER%3D%7BIBM+i+Access+ODBC+Driver%7D%3BSYSTEM%3D192.foo.bar.baz%3BUID%3DAME%3BPWD%3DPW%3BPORT%3D446%3BDATABASE%3Ddbname%3B"

The connection is definitely successful, but it fails at the schema reflection step: because this is iSeries AS/400, it uses different system tables. It expects tables in the QSYS2 schema, not SYSCAT, as described here.

Then I discovered another adapter:
https://github.com/IBM/sqlalchemy-ibmi

In theory this should tell SQLAlchemy the correct system tables to use for schema reflection, but it will not let you pass a database parameter, which dlt requires.
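If the database requirement gets relaxed, a rough, untested sketch of combining the ibmi dialect with dlt might look like this (the "ibmi://" URL shape without a database path segment and passing a ready-made Engine as credentials are assumptions to verify; all names are placeholders):

import dlt
from sqlalchemy import create_engine
from dlt.sources.sql_database import sql_database

# sqlalchemy-ibmi reflects against the IBM i catalogs; note there is no database path segment
engine = create_engine("ibmi://USER:SECRET@my-iseries-host")

source = sql_database(credentials=engine, schema="MY_SCHEMA", reflection_level="minimal")

pipeline = dlt.pipeline(pipeline_name="iseries", dataset_name="iseries", destination="duckdb")
pipeline.run(source.with_resources("MY_TABLE"))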


Summary

At this point I understand it is a SQLAlchemy issue and not a dlt issue. But since I can at least query tables with SQLAlchemy + pyodbc outside of dlt, I would hope there is a way to do it with dlt. Schema reflection, however, will always fail because of the type of Db2 instance.

Any further suggestions?

@rudolfix (Collaborator) commented Sep 23, 2024

Thanks for this info.

  1. The database requirement in the base connection is too strict and we can make it optional. We'll be doing a bugfix update on Wednesday; maybe that could help you use the ibmi dialect. Alternatively, you can "hack" the installed dlt package directly:
     in dlt/common/configuration/specs/connection_string_credentials.py, change database to database: Optional[str] = None
  2. Your example is just selecting from a table, not doing database reflection:
engine = create_engine(connection_string)
with engine.connect() as connection:
    result = connection.execute(text("SELECT * FROM schema.tbl FETCH FIRST 1 ROWS ONLY"))
    for row in result:
        print(row)

You'd need to do something like

table = sa.Table(name, MetaData(schema=schema), autoload_with=engine)

and see if it populates. There's still a chance the schema is wrong... Also try it with the ibmi dialect; if that works there, the (1) fix will also work for you. (A self-contained sketch of this reflection check follows the docs sample below.)
  3. If everything else fails: if your tables are not so big that you need the pyarrow backend, you can implement your own SQLAlchemy source. In fact, your code above is 90% of what you need. Here's a sample from our docs that avoids reflection and derives the schema purely from the data:

import dlt
from sqlalchemy import create_engine

# Use any SQL database supported by SQLAlchemy, below we use a public
# MySQL instance to get data.
# NOTE: you'll need to install pymysql with `pip install pymysql`
# NOTE: loading data from public mysql instance may take several seconds
engine = create_engine("mysql+pymysql://[email protected]:4497/Rfam")

with engine.connect() as conn:
    # Select genome table, stream data in batches of 100 elements
    query = "SELECT * FROM genome LIMIT 1000"
    rows = conn.execution_options(yield_per=100).exec_driver_sql(query)

    pipeline = dlt.pipeline(
        pipeline_name="from_database",
        destination="duckdb",
        dataset_name="genome_data",
    )

    # Convert the rows into dictionaries on the fly with a map function
    load_info = pipeline.run(map(lambda row: dict(row._mapping), rows), table_name="genome")

print(load_info)
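For point 2, a self-contained version of that reflection check (a hedged sketch; the connection string, schema, and table name are the placeholders from earlier in this thread) could look like:

import sqlalchemy as sa

engine = sa.create_engine(
    "db2+pyodbc:///?odbc_connect=DRIVER%3D%7BIBM+i+Access+ODBC+Driver%7D%3BSYSTEM%3Dmy-host%3BUID%3DUSER%3BPWD%3DSECRET%3B"
)

# reflection similar to what sql_database performs during extraction
metadata = sa.MetaData(schema="MY_SCHEMA")
table = sa.Table("MY_TABLE", metadata, autoload_with=engine)

print([c.name for c in table.columns])  # if this raises SQL0204 again, reflection itself is the blocker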

@jeff-skoldberg-gmds (Author)

@rudolfix, this is fantastic; it gives me a few things to try. I will update you.
I did realize the SQLAlchemy example I gave was working because it was not doing schema reflection; I just wanted to prove I can connect. :)

Thanks again.

@jeff-skoldberg-gmds (Author)

YES! This works:

"hack" the installed dlt package directly:
dlt/common/configuration/specs/connection_string_credentials.py - change database to database: Optional[str] = None
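Roughly, the workaround amounts to relaxing that one field; a simplified illustration (not the verbatim dlt source, which has more fields and different base classes):

from typing import Optional

class ConnectionStringCredentialsSketch:
    drivername: Optional[str] = None
    username: Optional[str] = None
    password: Optional[str] = None
    host: Optional[str] = None
    port: Optional[int] = None
    database: Optional[str] = None  # the line the workaround changes: database is no longer required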

@jeff-skoldberg-gmds (Author)

Thanks @rudolfix for putting in this fix! I was wondering: when will the fix be available on PyPI so I can install it with pip install dlt --upgrade?
I'm just planning out how I deploy my function in Azure, and I guess Docker will have to clone the repo for now instead of installing from requirements?
