advanced-need-help
  •

    datajoely

    07/13/2021, 4:16 PM
    Alternatively, defensive, promise-based testing with tools like Great Expectations and Pandera is increasingly common.
  •

    Ben H

    07/13/2021, 5:00 PM
    Hey @User . A few guidelines I try to stick to:
    1. Unit test everything that is not inside `pipeline.py`.
       * Treat these like you would any other unit test.
       * Use tiny, targeted data per test to keep your test suite fast. Single-row datasets usually work fine for unit tests, though occasionally you may stretch to 10 randomly generated rows.
    2. Integration test everything that is inside `pipeline.py`.
       * This is where you typically stitch functions together. You already know the functions work individually from your unit tests, so completely disregard any form of unit testing for pipelines; think of them as an integration instead.
       * Provide free inputs.
       * Test on final outputs.
       * Completely ignore intermediates - you already unit tested them.
    3. System test the entire pipeline.
       * Also known as "end-to-end" testing, or "automated acceptance" testing.
       * Your test runner will loosely look like:
         (cd source_proj && kedro pipeline package my_pipeline)
         (cd test_dir && kedro new --config /path/to/test/config)
         (cd test_dir/test_proj && kedro pipeline pull source_proj/dist/my_pipeline*.whl)
         ... any other data / config / pipelines that need setting up ...
         kedro run
    4. Never ever use `catalog.yml`, `parameters.yml`, or `data/` files in your tests.
       * For unit tests you'll likely want to try out many variations of parameters, so you can't anyway!
       * Use the Kedro code API instead.
    5. Make use of pytest features; they make life a lot easier.
       * Use fixtures for setting up default catalogs and parameters (top tip: pytest has a built-in fixture `tmp_path` - use it in your catalog entries).
       * `conftest.py` is a really useful file.
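    Point 5 can be sketched as follows - a minimal example assuming pytest is installed. The `scale` function, the parameter names, and the plain dicts standing in for Kedro's catalog and parameters are all invented for illustration:

    ```python
    # Minimal sketch: pytest fixtures for default parameters and tiny data,
    # plus tmp_path for file outputs. All names here are hypothetical.
    import csv
    import pytest

    def scale(rows, factor):
        """A tiny 'node' function: multiply each value by a parameter."""
        return [{"value": r["value"] * factor} for r in rows]

    @pytest.fixture
    def params():
        # Default parameters shared by many tests; individual tests can override.
        return {"factor": 2}

    @pytest.fixture
    def tiny_data():
        # Single-row dataset keeps the test fast, per guideline 1.
        return [{"value": 21}]

    def test_scale(tiny_data, params, tmp_path):
        out = scale(tiny_data, params["factor"])
        assert out == [{"value": 42}]
        # tmp_path is pytest's built-in temporary-directory fixture; point any
        # file-based outputs at it instead of the project's data/ folder.
        with open(tmp_path / "out.csv", "w", newline="") as f:
            csv.DictWriter(f, fieldnames=["value"]).writerows(out)
    ```

    Defaults live in the fixtures (typically in `conftest.py`), so each test only spells out what it varies.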
  •

    Arnaldo

    07/13/2021, 6:43 PM
    Hi, @User your points 4 and 5 are the most interesting for me. That's what I was thinking about. Thanks for those guidelines!
    > Use the Kedro code API instead
    One more question about this: how do you use the Kedro code API to provide data for your tests?
  •

    Arnaldo

    07/13/2021, 6:45 PM
    @User
  •

    datajoely

    07/13/2021, 6:50 PM
    I think that’s a point about making your own DataCatalog and Pipeline objects rather than using a full session (I may be wrong on this one)
  •

    user

    07/13/2021, 7:39 PM
    Hi guys, how are you? I'll try to add support for Trino SQL queries and tables in datasets.pandas. I took a look at the Kedro source code and noticed that the SQL datasets should work for every type of database, but right now trino-sqlalchemy only works for basic cases: if we need to set something like SSL settings, the connection URL doesn't support it and we need to use the native Trino connection class instead. I'm wondering if it makes sense to add this for now, and exactly where I should add it, because I can only imagine something like datasets.pandas.trino_sql_dataset, but I don't know if this approach makes sense for the project.
  •

    datajoely

    07/13/2021, 9:44 PM
    In this situation our normal approach is to extend a new class specific to the situation. There is another user, for example, who implemented a PrestoDataSet in a very similar way. I do want to look into improving our SQL support in general, and supporting more SQLAlchemy extensions feels like something we should look into.
  •

    datajoely

    07/13/2021, 9:45 PM
    Additionally, if you have PySpark available, the JDBC support is often better via SparkJDBCDataSet.
  •

    user

    07/13/2021, 9:47 PM
    About the Presto implementation: I searched the code and found nothing.
  •

    noklam

    07/14/2021, 6:05 AM
    https://kedro.readthedocs.io/en/latest/07_extend_kedro/02_hooks.html?highlight=grafana#add-observability-to-your-pipeline Is there any setup instruction for this? I wasn't sure how I can see the logs after adding the hooks.
  •

    datajoely

    07/14/2021, 6:31 AM
    Sorry - there is a GitHub issue where I discussed this with the person opening a PR.
  •

    Mad Hatter

    07/14/2021, 7:19 AM
    Is there any way to identify at which specific node it failed? Error: Cycle(s) detected; toposort only works on acyclic graphs
  •

    datajoely

    07/14/2021, 8:12 AM
    Hi @User, is this on the Python side or the viz side? This comes down to the point that the A in DAG stands for acyclic, so you can't have any routes which loop back on themselves. https://en.wikipedia.org/wiki/Directed_acyclic_graph
  •

    Mad Hatter

    07/14/2021, 9:17 AM
    Viz side, when loading the page and when it renders.
  •

    Mad Hatter

    07/14/2021, 9:17 AM
    Thanks, but this I know - I want to identify the node which is causing this.
  •

    datajoely

    07/14/2021, 9:19 AM
    So you are using Kedro-Viz in an unsupported way - so I would look to write something on your end that detects this situation at the point you generate the graph. You could use networkX for this: https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.cycles.find_cycle.html
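    As a sketch of the suggestion above: networkX's find_cycle works, and if you prefer the standard library, Python 3.9+'s graphlib will also report the offending cycle. The edge mapping below is a hypothetical stand-in for whatever you build from your generated JSON:

    ```python
    # Locate the cycle with the standard library's graphlib (Python 3.9+);
    # networkx.find_cycle, as suggested above, works just as well.
    from graphlib import TopologicalSorter, CycleError

    # Hypothetical node -> predecessors mapping built from a kedro-viz JSON.
    # c depends on e, e on d, d on c: a cycle among c, d, e.
    edges = {"a": [], "b": ["a"], "c": ["b", "e"], "d": ["c"], "e": ["d"]}

    try:
        list(TopologicalSorter(edges).static_order())
    except CycleError as err:
        # The second element of args holds the offending path.
        print("cycle:", err.args[1])
    ```

    Running this check at the point you generate the JSON pinpoints the nodes involved before Viz ever sees the graph.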
  •

    Mad Hatter

    07/14/2021, 9:20 AM
    I just want to generate the POC; post that, I will work according to the coding guidelines which have been mentioned in the documents. Thanks for this.
  •

    Mad Hatter

    07/14/2021, 9:23 AM
    I am directly generating the JSON which is saved by kedro-viz. I used demo.mock.json for reference, and I was able to generate the graphs when there were fewer nodes, but when the count increased I got this error; before this it was a chonky-pipeline error.
  •

    Mad Hatter

    07/14/2021, 9:23 AM
    If the POC is approved I get to work on it as an enterprise project, so I also have to ask: what is this tool's policy for enterprise licenses?
  •

    datajoely

    07/14/2021, 9:27 AM
    It's Apache 2.0, so you are free to use it for commercial purposes: https://github.com/quantumblacklabs/kedro-viz/blob/main/LICENSE.md
  •

    Ben H

    07/14/2021, 9:59 AM
    Let's say we have the following node and python functions we want to test:
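    (The code snippet attached to this message did not survive the export. A plausible stand-in, with all names invented: a pure function, plus a comment showing how a pipeline.py entry would wrap it in a node.)

    ```python
    # Hypothetical node function of the kind being discussed.
    def split_names(names, separator):
        """Split each full name into a (first, last) tuple."""
        return [tuple(n.split(separator, 1)) for n in names]

    # In pipeline.py this would be registered roughly as:
    #   node(split_names, inputs=["raw_names", "params:separator"],
    #        outputs="split_names")
    ```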
  •

    Ben H

    07/14/2021, 10:00 AM
    Then our test may look something like:
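    (The test snippet attached to this message was also lost. A hedged sketch of what it may have looked like; `split_names` is an invented stand-in, defined inline so the example is self-contained:)

    ```python
    # Hypothetical function under test.
    def split_names(names, separator):
        return [tuple(n.split(separator, 1)) for n in names]

    def test_split_names():
        # Tiny, targeted input keeps the suite fast (guideline 1 above).
        assert split_names(["Ada Lovelace"], " ") == [("Ada", "Lovelace")]

    test_split_names()
    ```

    No `catalog.yml`, `parameters.yml`, or `data/` files involved: the inputs and parameters are constructed directly in code (guideline 4).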
  •

    Mad Hatter

    07/14/2021, 12:01 PM
    If anyone has connected pipelines, could someone send the kedro-viz --save-file JSON?
  •

    datajoely

    07/14/2021, 12:02 PM
    what do you mean connected pipelines?
  •

    Mad Hatter

    07/14/2021, 1:22 PM
    https://kedro.readthedocs.io/en/0.16.5/06_nodes_and_pipelines/02_pipelines.html#connecting-existing-pipelines
  •

    Mad Hatter

    07/14/2021, 1:23 PM
    this
  •

    datajoely

    07/14/2021, 1:29 PM
    So on the Python side we have algebra that allows you to add pipelines together into one:
    Pipeline([a,b]) + Pipeline([c,d]) = Pipeline([a,b,c,d])
    This will make one big pipeline, but it must still be acyclic to run, on both the Kedro and Viz side.
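    A runnable sketch of that algebra, assuming Kedro is installed; the function, node, and dataset names are invented:

    ```python
    # Adding two Pipelines yields one pipeline containing all the nodes.
    from kedro.pipeline import Pipeline, node

    def f(x):
        return x  # trivial pass-through function for illustration

    p1 = Pipeline([node(f, "a", "b", name="n1"), node(f, "b", "c", name="n2")])
    p2 = Pipeline([node(f, "c", "d", name="n3"), node(f, "d", "e", name="n4")])

    big = p1 + p2  # one pipeline with all four nodes, still a DAG
    assert len(big.nodes) == 4
    ```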
  •

    Arnaldo

    07/14/2021, 1:36 PM
    amazing example, @User. Many thanks for that! @User @User
  •

    Arnaldo

    07/14/2021, 1:36 PM
    @User
  •

    user

    07/14/2021, 1:37 PM
    Ow, it's me. Sorry. Hahah