b

brewski

01/03/2022, 10:45 AM
Also, in the context of reading from sql to populate a dataset using https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pandas.SQLQueryDataSet.html#kedro.extras.datasets.pandas.SQLQueryDataSet, is there a correct way of specifying (hopefully relative) paths to the config?
d

datajoely

01/03/2022, 4:13 PM
Hi @User can you paste your existing catalog entry?
I think your desired state is possible, but we do discourage sharing credentials across the team
b

brewski

01/04/2022, 2:08 PM
@project_root>> ls conf/local/credentials/
client-cert.pem  client-key.pem  server-ca.pem  credentials.yml
d

datajoely

01/04/2022, 2:08 PM
hi not the credentials file
the catalog.yml you are using
b

brewski

01/04/2022, 2:08 PM
So right, I don't have a functional one
d

datajoely

01/04/2022, 2:08 PM
yeah, your working one
and the structure of your credentials key would be useful too
b

brewski

01/04/2022, 2:22 PM
(head -n1 && tail -n1) < client-key.pem
-----BEGIN RSA PRIVATE KEY-----
<BASE64 ENCODED KEY HERE>
-----END RSA PRIVATE KEY-----

(head -n1 && tail -n1) < client-cert.pem
-----BEGIN CERTIFICATE-----
<MORE BASE64>
-----END CERTIFICATE-----

(head -n1 && tail -n1) < server-ca.pem
-----BEGIN CERTIFICATE-----
<BASE64 AGAIN>
-----END CERTIFICATE-----
d

datajoely

01/04/2022, 2:26 PM
Okay so I've never done this before
b

brewski

01/04/2022, 2:26 PM
Currently working 'engine' code:
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import re
from sqlalchemy.sql.sqltypes import Text
from sqlalchemy import or_

global_link = "postgresql://postgres:passwd@ip/user"
ROOT = "/absolute/path/to/project/root"

# client cert/key and server CA are passed straight through to the DB driver
engine = create_engine(
    global_link,
    connect_args={
        "sslmode": "verify-ca",
        "sslrootcert": f"{ROOT}/conf/local/server-ca.pem",
        "sslcert": f"{ROOT}/conf/local/client-cert.pem",
        "sslkey": f"{ROOT}/conf/local/client-key.pem",
    },
)

# reflect the existing schema and open a session
Base = automap_base()
Base.prepare(engine, reflect=True)
session = Session(engine)
I'm on mobile idk if the wrapping is as bad on your end as it is on mine
Sorry in advance
d

datajoely

01/04/2022, 2:27 PM
yeah it's hard to read
b

brewski

01/04/2022, 2:27 PM
5 minutes and I'll have a better one
d

datajoely

01/04/2022, 2:27 PM
but I think the answer isn't going to be ideal
I think you'll need to define a custom dataset that implements this
if you look at how this is implemented, we simply wrap the pd.read_sql_table function in the load() method
so my angle of attack would be the following:
1. See if we can get the pd.read_sql_table method working outside of Kedro
2. Define a custom dataset that implements your working version
happy to help you get it working
but to my knowledge it doesn't exist today
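A minimal sketch of step 1 outside of Kedro, reusing the connect args from the engine snippet above (the placeholder URL, paths and the SELECT 1 query are assumptions):

import pandas as pd
from sqlalchemy import create_engine

ROOT = "/absolute/path/to/project/root"  # same placeholder as in the snippet above
engine = create_engine(
    "postgresql://postgres:passwd@ip/user",
    connect_args={
        "sslmode": "verify-ca",
        "sslrootcert": f"{ROOT}/conf/local/server-ca.pem",
        "sslcert": f"{ROOT}/conf/local/client-cert.pem",
        "sslkey": f"{ROOT}/conf/local/client-key.pem",
    },
)

# if this returns a DataFrame, pandas is happy with the SSL-enabled engine
print(pd.read_sql_query("SELECT 1 AS ok", con=engine))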
b

brewski

01/04/2022, 2:31 PM
yes
wait no
read_sql_query works
d

datajoely

01/04/2022, 2:32 PM
we create the engine this way
b

brewski

01/04/2022, 2:32 PM
the trouble is that the create_engine command in my case needs to take arguments for the location of these three credential files
d

datajoely

01/04/2022, 2:33 PM
so that's possibly a working directory issue
b

brewski

01/04/2022, 2:33 PM
exactly
I was wondering if kedro had a project-root-relative way to keep track of credential files
not just their values
and being perfectly honest, I'm still fuzzy on the rest
the above code works fine though, and successfully works w/ pandas pd.read_sql(query.statement, query.session.bind) for any query via sqlalchemy
d

datajoely

01/04/2022, 2:36 PM
So I still think we need to define a custom dataset
but I think this approach may be easiest
the SSL context in this case isn't done for you; you have to create it yourself via the ssl library
and you can then manage the working directory better
not an ideal solution, I'm sorry
but I'm not sure what else to suggest
credentials in general are light touch in Kedro, as we prefer things to be managed at an environment level
I'm going to ask someone else to look at this thread for a second opinion
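A rough sketch of building that context yourself with the ssl library, assuming the pem files sit under conf/local/credentials/ as in the ls output above and that the working directory is the project root:

import ssl
from pathlib import Path

# resolve the pem files relative to the project root instead of hard-coding absolute paths
cert_dir = Path.cwd() / "conf" / "local" / "credentials"

ssl_context = ssl.create_default_context(cafile=str(cert_dir / "server-ca.pem"))
ssl_context.load_cert_chain(
    certfile=str(cert_dir / "client-cert.pem"),
    keyfile=str(cert_dir / "client-key.pem"),
)
ssl_context.check_hostname = False  # verify-ca semantics: trust the CA, don't check the hostname

# the context would then be handed to a DB driver that accepts one,
# e.g. connect_args={"ssl_context": ssl_context} with pg8000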
b

brewski

01/04/2022, 2:37 PM
interested in what this means
d

datajoely

01/04/2022, 2:38 PM
as in you can do the pem file reading yourself
rather than passing to another library
b

brewski

01/04/2022, 2:39 PM
would this mean deferring credential management to something like pipenv would be best practice, or am I misunderstanding?
ty!
keeping track of the location of the files would then also logically be my responsibility rather than Kedro's?
d

datajoely

01/04/2022, 2:40 PM
No, I mean with things like SSL it's not ideal, but with AWS for example it's nice to just assign IAM roles outside of Kedro
I need to think of the best way to do SSL
annoyingly we support it well for flat files via fsspec
but this is (to my knowledge) the first time it's come up
so in any case a custom dataset is needed
b

brewski

01/04/2022, 2:41 PM
okay
I will look into it
thanks for the pointers!
it's a start for sure
d

datajoely

01/04/2022, 2:41 PM
Please shout once you've got something and I'll do my best to help you through it
b

brewski

01/04/2022, 2:42 PM
absolutely
a

antony.milne

01/04/2022, 3:27 PM
Following on from what @User said... both SQLTableDataSet and SQLQueryDataSet are wrappers for pandas pd.read_sql_table(**load_args) and pd.read_sql_query(**load_args). These call create_engine at some point (which is why we don't call it explicitly in the kedro datasets). These pandas functions take an argument con, which as well as being a string can be a SQLAlchemy connection: https://pandas.pydata.org/docs/reference/api/pandas.read_sql_query.html Hence what you need to do is inject the appropriate connection object into load_args["con"]. This would require a custom dataset but should be easy to implement. Subclass SQLQueryDataSet and then do something like this:
# catalog.yml
dataset_name:
  type: path.to.your.CustomSQLQueryDataSet
  credentials: ssl_credentials

# credentials.yml
ssl_credentials:
  url: postgresql://postgres:passwd@ip/user
  connect_args:
    sslmode: verify-ca
    sslrootcert: server-ca.pem
    ...
SQLQueryDataSet.__init__ currently does self._load_args["con"] = self._save_args["con"] = credentials["con"]. Your custom one would instead need to do something like self._load_args["con"] = self._save_args["con"] = create_engine(**credentials)
Now there is a bit of a problem with how we deal with filepaths. In theory you can specify the full filepath as sslrootcert: /absolute/path/to/project/root/conf/local/credentials/server-ca.pem as you do in your example. The problem with this is that it's a bit hacky because: 1. it hard codes the environment to local. This is where credentials should always reside anyway, so not such a huge problem, just a bit ugly. 2. it hard codes your project path in. You can easily get around this by specifying a relative path in your credentials yaml file and doing some path manipulation with something like pathlib.Path.cwd() in your custom dataset
This is why @User instead suggested it might be cleaner to create the SSL context via the ssl library instead of using kedro's credentials.yml file - this avoids the above dirtiness with filepaths
Also worth noting that at least on MySQL you can specify SSL args directly in the connection URL: https://docs.sqlalchemy.org/en/14/dialects/mysql.html#ssl-connections
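Pulling those pieces together, a rough sketch of what the custom dataset might look like. The class name matches the catalog example above; the conf/local/credentials path, the Path.cwd() resolution and the 0.17.x SQLQueryDataSet internals are assumptions, not a tested implementation:

from pathlib import Path

from kedro.extras.datasets.pandas import SQLQueryDataSet
from sqlalchemy import create_engine


class CustomSQLQueryDataSet(SQLQueryDataSet):
    """SQLQueryDataSet that builds its own engine with SSL connect args.

    Expects credentials shaped like the ssl_credentials example above:
    a url plus a connect_args mapping with sslmode/sslrootcert/sslcert/sslkey.
    """

    def __init__(self, sql, credentials, load_args=None):
        url = credentials["url"]
        connect_args = dict(credentials.get("connect_args", {}))

        # resolve the relative pem paths against conf/local/credentials,
        # assuming kedro is run from the project root
        cert_dir = Path.cwd() / "conf" / "local" / "credentials"
        for key in ("sslrootcert", "sslcert", "sslkey"):
            if key in connect_args:
                connect_args[key] = str(cert_dir / connect_args[key])

        # let the parent class set everything up with the plain url...
        super().__init__(sql=sql, credentials={"con": url}, load_args=load_args)

        # ...then swap in a fully configured engine so pd.read_sql_query uses SSL
        self._load_args["con"] = create_engine(url, connect_args=connect_args)

The catalog entry would also need the sql key that SQLQueryDataSet already requires.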
b

brewski

01/10/2022, 12:11 AM
After a couple of days I'm still having some difficulty with getting sqlalchemy create_engine to accept "ssl_context" as an arg for "con". I'm on sqlalchemy v1.4.29 and I'm getting a programming error. Could I get some pointers with regards to this?
a

antony.milne

01/10/2022, 7:41 AM
Please could you make a little repo on github containing your current code so we can see? I'll try and play around with it there