# beginners-need-help
b
Also, in the context of reading from SQL to populate a dataset using https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pandas.SQLQueryDataSet.html#kedro.extras.datasets.pandas.SQLQueryDataSet, is there a correct way of specifying (hopefully relative) paths to the config?
d
Hi @User can you paste your existing catalog entry?
I think your desired state is possible, but we do discourage sharing credentials across the team
b
Copy code
@project_root>> ls conf/local/credentials/
client-cert.pem
client-key.pem
server-ca.pem
credentials.yml
d
hi, not the credentials file
the catalog.yml you are using
b
So right, I don't have a functional one
d
yeah, your working one
and the structure of your credentials key would be useful too
b
Copy code
(head -n1 && tail -n1) < client-key.pem
-----BEGIN RSA PRIVATE KEY-----
<BASE64 ENCODED KEY HERE>
-----END RSA PRIVATE KEY-----
Copy code
(head -n1 && tail -n1) < client-cert.pem
-----BEGIN CERTIFICATE-----
<MORE BASE64>
-----END CERTIFICATE-----
Copy code
(head -n1 && tail -n1) < server-ca.pem
-----BEGIN CERTIFICATE-----
<BASE64 AGAIN>
-----END CERTIFICATE-----
d
Okay so I've never done this before
b
Currently working 'engine' code:
Copy code
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import re
from sqlalchemy.sql.sqltypes import Text
from sqlalchemy import or_

# connection URL and absolute paths to the SSL credential files
global_link = "postgresql://postgres:passwd@ip/user"
ROOT = "/absolute/path/to/project/root"
engine = create_engine(
    global_link,
    connect_args={
        "sslmode": "verify-ca",
        "sslrootcert": f"{ROOT}/conf/local/server-ca.pem",
        "sslcert": f"{ROOT}/conf/local/client-cert.pem",
        "sslkey": f"{ROOT}/conf/local/client-key.pem",
    },
)

Base = automap_base()
Base.prepare(engine, reflect=True)
session = Session(engine)
I'm on mobile idk if the wrapping is as bad on your end as it is on mine
Sorry in advance
d
yeah it's hard to read
b
5 minutes and I'll have a better one
d
but I think the answer isn't going to be ideal
I think you'll need to define a custom dataset that implements this
if you look at how this is implemented, we simply wrap the `pd.read_sql_table` function in the `load()` method (see the sketch after this message)
so my angle of attack would be the following:
1. See if we can get the `pd.read_sql_table` method working outside of Kedro?
2. Define a custom dataset that implements your working version
happy to help you get it working
but to my knowledge it doesn't exist today
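A rough sketch of the wrapping described above (not the actual kedro source; the function and arguments here are purely illustrative):
Copy code
import pandas as pd

# schematic of what the kedro dataset's load() boils down to: the "sql" and
# "con" values come from the catalog entry and credentials
def load(load_args):
    return pd.read_sql_query(**load_args)

# e.g. load({"sql": "SELECT 1", "con": "postgresql://postgres:passwd@ip/user"})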
b
yes
wait no
read_sql_query works
d
we create the engine this way
b
the trouble is that the create_engine command in my case needs to take arguments for the location of these three credential files
d
so that's possibly a working directory issue
b
exactly
I was wondering if kedro had a project root relative way to keep track of credential files
not just their values
and being perfectly honest, I'm still fuzzy on the rest
the above code works fine though, and successfully works w/ pandas
Copy code
pd.read_sql(query.statement, query.session.bind)
for any query via sqlalchemy
d
So I still think we need to define a custom dataset
but I think this approach may be easiest
the SSL context in this case isn't done for you, you have to create it yourself via the `ssl` library
and you can then manage the working directory better
not an ideal solution I'm sorry
but I'm not sure what else to suggest
the credentials in general are light touch in Kedro as we prefer things to be managed at an environment level
I'm going to ask someone else to look at this thread for a second opinion
b
interested in what this means
d
as in you can do the pem file reading yourself
rather than passing to another library
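A minimal sketch of doing the pem reading yourself with the `ssl` library, assuming the files live under conf/local/credentials/ and that the process is started from the project root. Whether the resulting context can be handed to the database driver depends on the DBAPI (for example pg8000 accepts an `ssl_context` connect argument, while psycopg2 instead takes the file paths via `sslmode`/`sslrootcert`/`sslcert`/`sslkey`):
Copy code
import ssl
from pathlib import Path

# assumption: the working directory is the project root
cred_dir = Path.cwd() / "conf" / "local" / "credentials"

# build the SSL context by hand from the pem files
ssl_context = ssl.create_default_context(cafile=str(cred_dir / "server-ca.pem"))
ssl_context.load_cert_chain(
    certfile=str(cred_dir / "client-cert.pem"),
    keyfile=str(cred_dir / "client-key.pem"),
)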
b
would this mean deferring credential management to something like pipenv would be best practice or am I misunderstanding?
ty!
keeping track of the location of the files would then also logically be my responsibility rather than Kedro's?
d
No I mean with things like SSL it's not ideal, but with AWS for example it's nice to just assign IAM roles outside of Kedro
I need to think of the best way to do SSL
annoyingly we support it well for flat files via fsspec
but this is (to my knowledge) the first time it's come up
so in any case a custom dataset is needed
b
okay
I will look into it
thanks for the pointers!
it's a start for sure
d
Please shout once you've got something and I'll do my best to help you through it
b
absolutely
a
Following on from what @User said... both `SQLTableDataSet` and `SQLQueryDataSet` are wrappers for pandas `pd.read_sql_table(**load_args)` and `pd.read_sql_query(**load_args)`. These call `create_engine` at some point (which is why we don't call it explicitly in the kedro datasets). These pandas functions take an argument `con`, which as well as being a string can be a SQLAlchemy connection: https://pandas.pydata.org/docs/reference/api/pandas.read_sql_query.html Hence what you need to do is inject the appropriate connection object into `load_args["con"]`. This would require a custom dataset but should be easy to implement. Subclass `SQLQueryDataSet` and then do something like this:
Copy code
# catalog.yml
dataset_name:
  type: path.to.your.CustomSQLQueryDataSet
  credentials: ssl_credentials

# credentials.yml
ssl_credentials:
  url: postgresql://postgres:passwd@ip/user
  connect_args:
    sslmode: verify-ca
    sslrootcert: client-cert.pem
    ...
`SQLQueryDataSet.__init__` currently does
Copy code
self._load_args["con"] = self._save_args["con"] = credentials["con"]
Your custom one would instead need to do something like
Copy code
self._load_args["con"] = self._save_args["con"] = create_engine(**credentials)
Now there is a bit of a problem with how we deal with filepaths. In theory you can specify the full filepath as `sslrootcert: /absolute/path/to/project/root/conf/local/credentials/client-cert.pem`, as you do in your example. The problem with this is that it's a bit hacky because:
1. it hard-codes the environment to `local`. This is where credentials should always reside anyway, so not such a huge problem, just a bit ugly
2. it hard-codes your project path in. This you can easily get around by specifying a relative path in your credentials yaml file and doing some path manipulation with something like `pathlib.Path.cwd()` in your custom dataset (rough sketch below)
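A minimal sketch of such a subclass, assuming a kedro 0.17-style `SQLQueryDataSet`, the `ssl_credentials` entry shown above, and that `kedro run` is launched from the project root; the class name and path handling are illustrative, not tested:
Copy code
from pathlib import Path
from typing import Any, Dict

import pandas as pd
from kedro.extras.datasets.pandas import SQLQueryDataSet
from sqlalchemy import create_engine


class CustomSQLQueryDataSet(SQLQueryDataSet):
    def __init__(self, sql: str, credentials: Dict[str, Any] = None, load_args: Dict[str, Any] = None):
        # satisfy the parent's validation with the plain URL string first
        super().__init__(sql=sql, credentials={"con": credentials["url"]}, load_args=load_args)
        # then build the engine ourselves, resolving the pem paths relative to
        # the project root instead of hard-coding absolute paths
        connect_args = dict(credentials.get("connect_args", {}))
        cred_dir = Path.cwd() / "conf" / "local" / "credentials"  # assumption: cwd is the project root
        for key in ("sslrootcert", "sslcert", "sslkey"):
            if key in connect_args:
                connect_args[key] = str(cred_dir / connect_args[key])
        self._engine = create_engine(credentials["url"], connect_args=connect_args)

    def _load(self) -> pd.DataFrame:
        # use our engine directly rather than relying on the parent's handling of "con"
        load_args = {k: v for k, v in self._load_args.items() if k != "con"}
        return pd.read_sql_query(con=self._engine, **load_args)
The catalog entry would then point `type:` at this class, as in the yaml above.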
This is why @User instead suggested it might be cleaner to create the SSL context with the `ssl` library instead of using kedro's credentials.yml file - this avoids the above dirtiness with filepaths
Also worth noting that at least on MySQL you can specify SSL args directly in the connection URL: https://docs.sqlalchemy.org/en/14/dialects/mysql.html#ssl-connections
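For reference, a hypothetical URL along those lines; the exact query-string parameter names (`ssl_ca`/`ssl_cert`/`ssl_key`) are driver-dependent assumptions, so check the linked docs for your DBAPI:
Copy code
from sqlalchemy import create_engine

# hypothetical MySQL example; relative paths assume the working directory is the project root
engine = create_engine(
    "mysql+pymysql://user:passwd@host/dbname"
    "?ssl_ca=conf/local/credentials/server-ca.pem"
    "&ssl_cert=conf/local/credentials/client-cert.pem"
    "&ssl_key=conf/local/credentials/client-key.pem"
)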
b
After a couple of days I'm still having some difficulty with getting sqlalchemy `create_engine` to accept "ssl_context" as an arg for "con". I'm on sqlalchemy v1.4.29 and I'm getting a programming error. Could I get some pointers with regards to this?
a
Please could you make a little repo on github containing your current code so we can see? I'll try and play around with it there