Powered by Linen
beginners-need-help
  • d

    datajoely

    01/14/2022, 12:56 PM
    Option 2 is probably better. If you have access to PyCharm, you can do amazing things like inspect DataFrames visually.
  • d

    datajoely

    01/14/2022, 12:57 PM
    3. Consider data-testing tools like Pandera or Great Expectations so that you can enforce things like schema, cardinality, and missing values.
  • g

    ggerog

    01/14/2022, 1:02 PM
    thanks! yea I am using Great Expectations on the data side. I was also aiming to write a test suite more than anything. So, I was thinking of running things like this and then having a battery of unit tests for each node.
    with KedroSession.create("library") as session:
        context = session.load_context()
        print(context.params)
        session.run(<node1 options>)
        run.node1_tests()
        session.run(<node2 options>)
        ... so on
    Does that seem reasonable?
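One way to read the "battery of unit tests for each node" idea: a Kedro node wraps an ordinary Python function, so that function can also be tested on its own, without creating a session. A minimal sketch, assuming a hypothetical node function `drop_incomplete_rows` (not from this thread) and plain lists of dicts standing in for DataFrames:

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def drop_incomplete_rows(rows: List[Row]) -> List[Row]:
    """Hypothetical node function: keep only rows with no missing (None) values."""
    return [row for row in rows if all(value is not None for value in row.values())]


def test_drop_incomplete_rows() -> None:
    """Unit test that calls the node function directly on small in-memory data."""
    raw = [
        {"id": 1, "price": 10.0},
        {"id": 2, "price": None},
    ]
    cleaned = drop_incomplete_rows(raw)
    assert cleaned == [{"id": 1, "price": 10.0}]
    # The input list is left untouched, so the node behaves as a pure function.
    assert len(raw) == 2


test_drop_incomplete_rows()
```

Tests written this way can run under any test runner and do not depend on the catalog or on the order in which the pipeline executes.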
  • d

    datajoely

    01/14/2022, 1:03 PM
    It does - but I have an alternative I can dig up
  • g

    ggerog

    01/14/2022, 1:03 PM
    Cool, thanks a lot for the help!
  • d

    datajoely

    01/14/2022, 1:52 PM
    Testing Kedro
  • j

    jaweiss2305

    01/16/2022, 3:13 PM
    Just curious: how are other folks scheduling Kedro runs? Cron jobs?
  • d

    datajoely

    01/16/2022, 4:13 PM
    Kedro in production
  • c

    czix

    01/17/2022, 9:55 PM
    If a node returns a dict with multiple keys, and the dataset expects a dict, how do I avoid the node's output being unpacked before it reaches the dataset?
  • r

    RRoger

    01/18/2022, 3:33 AM
    Is this available for `pandas.SQLQueryDataSet`? i.e. like using the SSH Tunnel feature (e.g. in DBeaver)
  • d

    datajoely

    01/18/2022, 10:13 AM
    So Kedro uses SQLAlchemy under the hood. I think you'll need to extend or implement a custom dataset that handles the sort of SSH tunnelling demonstrated here: https://gist.github.com/danallison/7217d76d944ea4d8dabd0ba3041ebefc
    The other alternative is to run Kedro on the remote host, like so: https://stackoverflow.com/a/31508516/2010808
  • d

    datajoely

    01/18/2022, 10:14 AM
    Currently we don't support SSH SQL connections out of the box, but we do support SSH for file-based formats via fsspec
  • m

    martinlarsalbert

    01/19/2022, 4:15 PM
    I define a list `ids` in `globals.yml`. I then want to write a Jinja2 loop over this list in `catalog.yml`. How do I do that? I've tried `{% for id in $ids %}` and `{% for id in ${ids} %}`, but neither works...
  • d

    datajoely

    01/19/2022, 4:20 PM
    Hi @User, we currently don't expose Jinja variables via `globals.yml`; this is on the backlog. I think the easiest way to make these variables available is to customise how you register `TemplatedConfigLoader` in `hooks.py`, or perhaps to subclass and extend `TemplatedConfigLoader`.
  • d

    datajoely

    01/19/2022, 4:23 PM
    So we currently do this behind the scenes using a library called `anyconfig`; the flag `ac_template=True` enables Jinja.
  • d

    datajoely

    01/19/2022, 4:24 PM
    `anyconfig.load` actually takes an extra parameter called `paths` where you can declare `jinja2` templates.
  • d

    datajoely

    01/19/2022, 4:24 PM
    so in short - not super easy today, but it's possible if you define your own config loader
  • m

    martinlarsalbert

    01/19/2022, 4:25 PM
    Great and thanks for the fast reply again. Perhaps this should be added to the docs? (That you can only print stuff from the globals.yml with Jinja2, not apply logic to it etc.)
  • d

    datajoely

    01/19/2022, 4:26 PM
    We're currently designing a whole configuration overhaul, so it's part of a wider piece of work. We do have a note on the Jinja stuff at the moment that says it's a bit of a hack (IIRC).
  • d

    datajoely

    01/19/2022, 4:27 PM
    Within the team there is a strong, healthy debate on whether config should be strictly declarative or not.
  • r

    Rroger

    01/19/2022, 9:44 PM
    What is exit code `1073741819 (0xC0000005)`? It happens when using `ThreadRunner`. Using `SequentialRunner` is fine. Is it because too many processes are trying to access the same dataset simultaneously?
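For reference, `0xC0000005` is the Windows status code for an access violation (`STATUS_ACCESS_VIOLATION`), meaning the process touched memory it was not allowed to. The decimal and hex forms in the message are the same 32-bit value read as signed versus unsigned. The sketch below assumes the decimal was originally reported as `-1073741819` and lost its minus sign somewhere along the way; the helper names are illustrative only.

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a (possibly negative) exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


# Assumption: the reported decimal code was negative.
signed_code = -1073741819

print(f"0x{to_unsigned32(signed_code):08X}")  # prints 0xC0000005
print(to_signed32(0xC0000005))                # prints -1073741819
```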
  • d

    datajoely

    01/19/2022, 9:45 PM
    Are you using `ThreadRunner` with Spark or Dask? For Python pipelines you should be using `ParallelRunner`.
  • r

    Rroger

    01/19/2022, 9:47 PM
    I'm using `ThreadRunner` with Pandas. I tried using `ParallelRunner` a while ago and got an error about lambda functions.
  • d

    datajoely

    01/19/2022, 9:49 PM
    Can you try with `ParallelRunner` here? We can help you work through the errors. Due to the way Python works, you don't get true concurrency with threads, so we have to use multi-processing.
  • d

    datajoely

    01/19/2022, 9:49 PM
    Some things can't be split into different processes, so it isn't 100% guaranteed to work
  • d

    datajoely

    01/19/2022, 9:49 PM
    In some cases it's good to break your pipeline into different parts and run them independently
  • d

    datajoely

    01/19/2022, 9:50 PM
    using the `&&` operator in your CLI command
  • r

    Rroger

    01/19/2022, 10:08 PM
    Using `ParallelRunner` leads to `TypeError: cannot pickle 'module' object`.
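As noted above, `ParallelRunner` uses multi-processing, and objects passed between processes have to be picklable. The sketch below uses only the standard library to reproduce both failures mentioned in this thread (a lambda, and an object that holds a reference to a module) and to report which attribute is responsible. `is_picklable`, `find_unpicklable_attrs` and `Helper` are hypothetical names for illustration, not part of Kedro.

```python
import math
import pickle
from typing import Any, Dict


def is_picklable(obj: Any) -> bool:
    """Return True if pickle can serialise obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


def find_unpicklable_attrs(obj: Any) -> Dict[str, str]:
    """Map attribute name to type name for each instance attribute that fails to pickle."""
    return {
        name: type(value).__name__
        for name, value in vars(obj).items()
        if not is_picklable(value)
    }


class Helper:
    """Hypothetical object that a node might close over or receive as input."""

    def __init__(self) -> None:
        self.threshold = 0.5              # plain float: picklable
        self.backend = math               # module reference: not picklable
        self.transform = lambda x: x * 2  # lambda: not picklable


print(find_unpicklable_attrs(Helper()))
```

Running a check like this against the objects a failing node touches is one way to find out which one is of type `module`.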
  • d

    datajoely

    01/19/2022, 10:09 PM
    Any idea what type of object it is when it fails?
  • d

    datajoely

    01/19/2022, 10:09 PM
    as type 'module' is quite confusing