# beginners-need-help
b
Also, in the context of reading from SQL to populate a dataset using https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pandas.SQLQueryDataSet.html#kedro.extras.datasets.pandas.SQLQueryDataSet, is there a correct way of specifying (hopefully relative) paths to the config?
d
Hi @User can you paste your existing catalog entry?
I think your desired state is possible, but we do discourage sharing credentials across the team
b
Copy code
@project_root>> ls conf/local/credentials/
client-cert.pem
client-key.pem
server-ca.pem
credentials.yml
d
hi, not the credentials file
the catalog.yml you are using
b
So right, I don't have a functional one
d
yeah, your working one
and the structure of your credentials key would be useful too
b
Copy code
(head -n1 && tail -n1) < client-key.pem
-----BEGIN RSA PRIVATE KEY-----
<BASE64 ENCODED KEY HERE>
-----END RSA PRIVATE KEY-----
Copy code
(head -n1 && tail -n1) < client-cert.pem
-----BEGIN CERTIFICATE-----
<MORE BASE64>
-----END CERTIFICATE-----
Copy code
(head -n1 && tail -n1) < server-ca.pem
-----BEGIN CERTIFICATE-----
<BASE64 AGAIN>
-----END CERTIFICATE-----
d
Okay so I've never done this before
b
Currently working 'engine' code:
Copy code
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import re
from sqlalchemy.sql.sqltypes import Text
from sqlalchemy import or_

# connection URL and absolute paths to the SSL credential files
global_link = "postgresql://postgres:passwd@ip/user"
ROOT = "/absolute/path/to/project/root"
engine = create_engine(
    global_link,
    connect_args={
        "sslmode": "verify-ca",
        "sslrootcert": f"{ROOT}/conf/local/server-ca.pem",
        "sslcert": f"{ROOT}/conf/local/client-cert.pem",
        "sslkey": f"{ROOT}/conf/local/client-key.pem",
    },
)

Base = automap_base()
Base.prepare(engine, reflect=True)
session = Session(engine)
I'm on mobile idk if the wrapping is as bad on your end as it is on mine
Sorry in advance
d
yeah it's hard to read
b
5 minutes and I'll have a better one
d
but I think the answer isn't going to be ideal
I think you'll need to define a custom dataset that implements this
if you look at how this is implemented, we simply wrap the `pd.read_sql_table` function in the `load()` method (see the sketch after this message)
so my angle of attack would be the following:
1. See if we can get the `pd.read_sql_table` method working outside of Kedro?
2. Define a custom dataset that implements your working version
happy to help you get it working
but to my knowledge it doesn't exist today
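A rough sketch of the wrapping described above (not the actual kedro source; the function and arguments here are purely illustrative):
Copy code
import pandas as pd

# schematic of what the kedro dataset's load() boils down to: the "sql" and
# "con" values come from the catalog entry and credentials
def load(load_args):
    return pd.read_sql_query(**load_args)

# e.g. load({"sql": "SELECT 1", "con": "postgresql://postgres:passwd@ip/user"})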
b
yes
wait no
read_sql_query works
d
we create the engine this way
b
the trouble is that the create_engine command in my case needs to take arguments for the location of these three credential files
d
so that's possibly a working directory issue
b
exactly
I was wondering if kedro had a project root relative way to keep track of credential files
not just their values
and being perfectly honest, I'm still fuzzy on the rest
the above code works fine though, and successfully works w/ pandas
Copy code
pd.read_sql(query.statement, query.session.bind)
for any query via sqlalchemy
d
So I still think we need to define a custom dataset
but I think this approach may be easiest
the SSL context in this case isn't done for you, you have to create it yourself via the `ssl` library
and you can then manage the working directory better
not an ideal solution I'm sorry
but I'm not sure what else to suggest
the credentials in general are light touch in Kedro as we prefer things to be managed at an environment level
I'm going to ask someone else to look at this thread for a second opinion
b
interested in what this means
d
as in you can do the pem file reading yourself
rather than passing to another library
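A minimal sketch of doing the pem reading yourself with the `ssl` library, assuming the files live under conf/local/credentials/ and that the process is started from the project root. Whether the resulting context can be handed to the database driver depends on the DBAPI (for example pg8000 accepts an `ssl_context` connect argument, while psycopg2 instead takes the file paths via `sslmode`/`sslrootcert`/`sslcert`/`sslkey`):
Copy code
import ssl
from pathlib import Path

# assumption: the working directory is the project root
cred_dir = Path.cwd() / "conf" / "local" / "credentials"

# build the SSL context by hand from the pem files
ssl_context = ssl.create_default_context(cafile=str(cred_dir / "server-ca.pem"))
ssl_context.load_cert_chain(
    certfile=str(cred_dir / "client-cert.pem"),
    keyfile=str(cred_dir / "client-key.pem"),
)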
b
would this mean deferring credential management to something like pipenv would be best practice or am I misunderstanding?
ty!
keeping track of the location of the files would then also logically be my responsibility rather than Kedro's?
d
No I mean with things like SSL it's not ideal, but with AWS for example it's nice to just assign IAM roles outside of Kedro
I need to think of the best way to do SSL
annoyingly we support it well for flat files via fsspec
but this is (to my knowledge) the first time it's come up
so in any case a custom dataset is needed
b
okay
I will look into it
thanks for the pointers!
it's a start for sure
d
Please shout once you've got something and I'll do my best to help you through it
b
absolutely
a
Following on from what @User said... both `SQLTableDataSet` and `SQLQueryDataSet` are wrappers for pandas `pd.read_sql_table(**load_args)` and `pd.read_sql_query(**load_args)`. These call `create_engine` at some point (which is why we don't call it explicitly in the kedro datasets). These pandas functions take an argument `con`, which as well as being a string can be a SQLAlchemy connection: https://pandas.pydata.org/docs/reference/api/pandas.read_sql_query.html Hence what you need to do is inject the appropriate connection object into `load_args["con"]`. This would require a custom dataset but should be easy to implement. Subclass `SQLQueryDataSet` and then do something like this:
Copy code
# catalog.yml
dataset_name:
  type: path.to.your.CustomSQLQueryDataSet
  credentials: ssl_credentials

# credentials.yml
ssl_credentials:
  url: postgresql://postgres:passwd@ip/user
  connect_args:
    sslmode: verify-ca
    sslrootcert: client-cert.pem
    ...
`SQLQueryDataSet.__init__` currently does
Copy code
self._load_args["con"] = self._save_args["con"] = credentials["con"]
Your custom one would instead need to do something like
Copy code
self._load_args["con"] = self._save_args["con"] = create_engine(**credentials)
Now there is a bit of a problem with how we deal with filepaths. In theory you can specify the full filepath as `sslrootcert: /absolute/path/to/project/root/conf/local/credentials/client-cert.pem`, as you do in your example. The problem with this is that it's a bit hacky because:
1. it hard-codes the environment to `local`. This is where credentials should always reside anyway, so not such a huge problem, just a bit ugly
2. it hard-codes your project path in. This you can easily get around by specifying a relative path in your credentials yaml file and doing some path manipulation with something like `pathlib.Path.cwd()` in your custom dataset (rough sketch below)
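A minimal sketch of such a subclass, assuming a kedro 0.17-style `SQLQueryDataSet`, the `ssl_credentials` entry shown above, and that `kedro run` is launched from the project root; the class name and path handling are illustrative, not tested:
Copy code
from pathlib import Path
from typing import Any, Dict

import pandas as pd
from kedro.extras.datasets.pandas import SQLQueryDataSet
from sqlalchemy import create_engine


class CustomSQLQueryDataSet(SQLQueryDataSet):
    def __init__(self, sql: str, credentials: Dict[str, Any] = None, load_args: Dict[str, Any] = None):
        # satisfy the parent's validation with the plain URL string first
        super().__init__(sql=sql, credentials={"con": credentials["url"]}, load_args=load_args)
        # then build the engine ourselves, resolving the pem paths relative to
        # the project root instead of hard-coding absolute paths
        connect_args = dict(credentials.get("connect_args", {}))
        cred_dir = Path.cwd() / "conf" / "local" / "credentials"  # assumption: cwd is the project root
        for key in ("sslrootcert", "sslcert", "sslkey"):
            if key in connect_args:
                connect_args[key] = str(cred_dir / connect_args[key])
        self._engine = create_engine(credentials["url"], connect_args=connect_args)

    def _load(self) -> pd.DataFrame:
        # use our engine directly rather than relying on the parent's handling of "con"
        load_args = {k: v for k, v in self._load_args.items() if k != "con"}
        return pd.read_sql_query(con=self._engine, **load_args)
The catalog entry would then point `type:` at this class, as in the yaml above.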
This is why @User instead suggested it might be cleaner to create the SSL context with the `ssl` library instead of using kedro's credentials.yml file - this avoids the above dirtiness with filepaths
Also worth noting that at least on MySQL you can specify SSL args directly in the connection URL: https://docs.sqlalchemy.org/en/14/dialects/mysql.html#ssl-connections
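For reference, a hypothetical URL along those lines; the exact query-string parameter names (`ssl_ca`/`ssl_cert`/`ssl_key`) are driver-dependent assumptions, so check the linked docs for your DBAPI:
Copy code
from sqlalchemy import create_engine

# hypothetical MySQL example; relative paths assume the working directory is the project root
engine = create_engine(
    "mysql+pymysql://user:passwd@host/dbname"
    "?ssl_ca=conf/local/credentials/server-ca.pem"
    "&ssl_cert=conf/local/credentials/client-cert.pem"
    "&ssl_key=conf/local/credentials/client-key.pem"
)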
b
After a couple of days I'm still having some difficulty with getting sqlalchemy `create_engine` to accept "ssl_context" as an arg for "con". I'm on sqlalchemy v1.4.29 and I'm getting a programming error. Could I get some pointers with regards to this?
a
Please could you make a little repo on github containing your current code so we can see? I'll try and play around with it there