Title
#beginners-need-help
brewski

brewski

01/03/2022, 10:45 AM
Also, in the context of reading from sql to populate a dataset using [[https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pandas.SQLQueryDataSet.html#kedro.extras.datasets.pandas.SQLQueryDataSet]] is there a correct way of specifying (hopefully relative) paths to the config?
datajoely

datajoely

01/03/2022, 4:13 PM
Hi @User can you paste your existing catalog entry?
4:13 PM
I think what your desired state is possible but we do discourage sharing credentials across the team
brewski

brewski

01/04/2022, 2:08 PM
@project_root>> ls conf/local/credentials/ client-cert.pem 
client-key.pem   server-ca.pem
credentials.yml
datajoely

datajoely

01/04/2022, 2:08 PM
hi not the credentials file
2:08 PM
the catlaog.yml you are using
brewski

brewski

01/04/2022, 2:08 PM
So right, I don't have a functional one
datajoely

datajoely

01/04/2022, 2:08 PM
yeah you're working one
2:09 PM
and the structure of your credentials key would be useful too
brewski

brewski

01/04/2022, 2:22 PM
(head -n1 && tail -n1) <client-key.pem                             -----BEGIN RSA PRIVATE KEY-----
<BASE64 ENCODED KEY HERE>
-----END RSA PRIVATE KEY-----%
2:23 PM
(head -n1 && tail -n1) <client-cert.pem                            -----BEGIN CERTIFICATE-----                 
<MORE BASE64>
-----END CERTIFICATE-----%
2:24 PM
(head -n1 && tail -n1) <server-ca.pem
-----BEGIN CERTIFICATE-----
<BASE64 AGAIN>
-----END CERTIFICATE-----%
datajoely

datajoely

01/04/2022, 2:26 PM
Okay so I've never done this before
brewski

brewski

01/04/2022, 2:26 PM
Currently working 'engine' code:
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import re
from sqlalchemy.sql.sqltypes import Text
from sqlalchemy import or_

global_link = "postgresql://postgres:passwd@ip/user"
ROOT = "/absolute/path/to/project/root"
engine = create_engine(
    global_link,
    connect_args={
        "sslmode": "verify-ca",
        "sslrootcert": f"{ROOT}/conf/local/server-ca.pem",
        "sslcert": f"{ROOT}/conf/local/client-cert.pem",
        "sslkey": f"{ROOT}/conf/local/client-key.pem",
    },
)

Base = automap_base()
Base.prepare(engine, reflect=True)
session = Session(engine)
2:27 PM
I'm on mobile idk if the wrapping is as bad on your end as it is on mine
2:27 PM
Sorry in advance
datajoely

datajoely

01/04/2022, 2:27 PM
yeah it's hard to read
brewski

brewski

01/04/2022, 2:27 PM
5 minutes and I'll have a better one
datajoely

datajoely

01/04/2022, 2:27 PM
but I think the answer isn't going to be ideal
2:27 PM
I think you'll need to define a custom dataset that implements this
2:28 PM
if you look at how this implemented
2:28 PM
we simply wrap the
pd.read_sql_table
function
2:28 PM
in the
load()
method
2:28 PM
message has been deleted
2:29 PM
so my angle to attack would be the following
2:29 PM
1. See if we can get the
pd.read_sql_table
method working outside of Kedro?
2:29 PM
2. Define a custom dataset that implements your working version
2:29 PM
happy to help you get it working
2:30 PM
but to my knowledge it doesn't exist today
brewski

brewski

01/04/2022, 2:31 PM
yes
2:31 PM
wait no
2:32 PM
read_sql_query works
datajoely

datajoely

01/04/2022, 2:32 PM
message has been deleted
2:32 PM
we create the enigne this way
brewski

brewski

01/04/2022, 2:32 PM
the trouble is is that the create_engine command in my case needs to take arguments for the location of these three credential files
datajoely

datajoely

01/04/2022, 2:33 PM
so thats possibly a working directory issue
brewski

brewski

01/04/2022, 2:33 PM
exactly
2:34 PM
I was wondering if kedro had a project root relative way to keep track of credential files
2:34 PM
not just their values
2:34 PM
and being perfectly honest, I'm still fuzzy on the rest
2:35 PM
the above code works fine though, and successfully works w/ pandas
pd.read_sql(query.statement, query.session.bind)
for any query via sqlalchemy
datajoely

datajoely

01/04/2022, 2:36 PM
So I still think we need to define a custom dataset
2:36 PM
but I think this approach may be easiest
2:36 PM
message has been deleted
2:36 PM
the SSL context in this case isn't done for you you have to create it yourself via the
ssl
library
2:36 PM
and you can then manage the working directory better
2:37 PM
not an ideal solution I'm sorry
2:37 PM
but I'm not sure what else to suggest
2:37 PM
the credentials in general are light touch in Kedro as we prefer things to be managed at an environment level
2:37 PM
I'm going to ask someone else to look at this thread for a second opinion
brewski

brewski

01/04/2022, 2:37 PM
interested in what this means
datajoely

datajoely

01/04/2022, 2:38 PM
as in you can do the pem file reading yourself
2:38 PM
rather than passing to another library
brewski

brewski

01/04/2022, 2:39 PM
would this mean deferring credential management to something like pipenv would be best practice or am I misunderstanding?
2:39 PM
ty!
2:40 PM
keeping track of the location of the files would then also logically be my responsibility rather than kedros?
datajoely

datajoely

01/04/2022, 2:40 PM
No I mean with things like SSL it's not idea, but with AWS for example it's nice to just assign IAM roles outside of Kedro
2:40 PM
I need to think of the best way to do SSL
2:41 PM
annoyingly we support it well for flat files via fsspec
2:41 PM
but this is (to my knowledge) the first time it's come up
2:41 PM
so in any case a custom dataset is needed
brewski

brewski

01/04/2022, 2:41 PM
okay
2:41 PM
I will look into it
2:41 PM
thanks for the pointers!
2:41 PM
it's a start for sure
datajoely

datajoely

01/04/2022, 2:41 PM
Please shout once you've got something and I'll do my best to help you though it
brewski

brewski

01/04/2022, 2:42 PM
absolutely
antony.milne

antony.milne

01/04/2022, 3:27 PM
Following on from what @User said... both
SQLTableDataSet
and
SQLQueryDataSet
are wrappers for pandas
pd.read_sql_table(**load_args)
and
pd.read_sql_query(**load_args)
. These call
create_engine
at some point (which is why we don't explicitly in the kedro datasets). These pandas functions take an argument
con
, which as well being a string can be a SQLAlchemy connection https://pandas.pydata.org/docs/reference/api/pandas.read_sql_query.html Hence what you need to do is inject into
load_args["con"]
the appropriate connection object. This would require a custom dataset but should be easy to implement. Subclass
SQLQueryDataSet
and then do something like this:
3:40 PM
# catalog.yml
dataset_name:
  type: path.to.your.CustomSQLQueryDataSet
  credentials: ssl_credentials

# credentials.yml
ssl_credentials:
  url: postgresql://postgres:passwd@ip/user
  connect_args:
    sslmode: verify-ca
    sslrootcert: client-cert.pem
    ...
SQLQueryDataSet.__init__
currently does
self._load_args["con"] = self._save_args["con"] = credentials["con"]
Your custom one would instead need to do something like
._load_args["con"] = self._save_args["con"] = create_engine(**credentials)
3:44 PM
Now there is a bit of a problem with how we deal with filepaths. In theory you can specify the full filepath as
sslrootcert:/absolute/path/to/project/root/conf/local/credentials/client-cert.pem
as you do in your example. The problem with this is that it's a bit hacky because:1. it hard codes the environment to
local
. This is where credentials should always reside anyway so not such a huge problem, just a bit ugly 2. it hard codes your project path in. This you can easily get around it by just specifying a relative path in your credentials yaml file and doing some path manipulation with something like
pathlib.Path.cwd()
in your custom dataset
3:45 PM
This is why @User instead suggested it might be cleaner to create the SSL context by the
ssl
library instead of using kedro's credentials.yml file - this avoids the above dirtiness with filepaths
3:45 PM
Also worth noting that at least on MySQL you can specify SSL args directly in the connection URL: https://docs.sqlalchemy.org/en/14/dialects/mysql.html#ssl-connections
brewski

brewski

01/10/2022, 12:11 AM
After a couple of days I'm still having some difficulty with getting sqlalchemy
create_engine
to accept "ssl_context" as a arg for "con". I'm on sqlalchemy v1.4.29 and I'm getting a programming error. Could I get some pointers with regards to this?
antony.milne

antony.milne

01/10/2022, 7:41 AM
Please could you make a little repo on github containing your current code so we can see? I'll try and play around with it there