beginners-need-help
  • z

    Zhee

    11/04/2021, 1:37 PM
    Hello everyone. I have a question about documentation best practices. The autogenerated documentation is great for focusing on pipelines and the package, but I would also like to document my datasets in more detail. (I generated a static site on GitHub Pages.) What would be the best approach to add details about data sources (meaning, content, etc.)? The data catalog could be a good place to start, but the YAML file doesn't give us access to description fields or anything like that (or maybe I missed something), or perhaps it's simply not the right place for that kind of information. Should I add details in a custom part of the Sphinx project, or could they be incorporated somewhere else? What do you usually do when documenting your data sources?
  • d

    datajoely

    11/04/2021, 6:57 PM
    I haven't forgotten about this @Zhee! I will write up a proper response tomorrow
  • d

    datajoely

    11/05/2021, 10:22 AM
    Hi @User - I can write up my thoughts now. tl;dr - out of the box we don't do all we could in this area, and I personally want to limit the YAML our users have to write.
    - Looking at the industry, the best things I've seen are what Great Expectations (https://docs.greatexpectations.io/docs/tutorials/getting_started/check_out_data_docs/) and dbt (https://docs.getdbt.com/docs/building-a-dbt-project/documentation) are able to do.
    - Practically, it's not a lot of work to take a Kedro DataCatalog object via the Python API, feed it into GE and generate this sort of documentation. It's not something we offer as a first-party integration (yet), but we would like to some day.
    - Something we've been keen to do for a long time is extend kedro-viz to have some sort of 'catalog manager', which would be the natural place for this to live. It's not under active development, but if users start shouting that they'd like it, it gets more weight on the backlog 🙂
    - Finally, the most structured way of doing this today is to use Sphinx and the built-in kedro build-docs command to generate static docs. This is mostly there for Python API docs, but everything on the Kedro docs (https://kedro.readthedocs.io/en/stable/) is made this way, so you can steal how we do it too. I think it would be pretty neat to write a script that uses the DataCatalog Python API to create data documentation stubs which you then fill in with human-readable descriptions.
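A minimal sketch of that last idea, assuming a Kedro 0.17.x project (the package name my_project and the output path docs/source/data_catalog.md are placeholders): it lists catalog entries via the public DataCatalog API and writes a Markdown stub per dataset for a human to fill in.

```python
# generate_catalog_stubs.py - run from the project root (Kedro 0.17.x API assumed).
from pathlib import Path

from kedro.framework.session import KedroSession

with KedroSession.create("my_project") as session:  # "my_project" is a placeholder package name
    catalog = session.load_context().catalog

lines = ["# Data catalog", ""]
for name in sorted(catalog.list()):
    if name == "parameters" or name.startswith("params:"):
        continue  # skip parameter entries, document only datasets
    lines += [f"## {name}", "", "TODO: describe the meaning, content and source of this dataset.", ""]

out = Path("docs/source/data_catalog.md")  # placeholder output location
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text("\n".join(lines))
```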
  • z

    Zhee

    11/05/2021, 1:52 PM
    Thank you @User for sharing your thoughts on this. I agree that Sphinx and the build-docs command are already a strong foundation for extra documentation, and they are flexible enough to do great things efficiently. I will study your last point and go in that direction!
  • d

    datajoely

    11/05/2021, 1:59 PM
    💪 Shout if you need any pointers - for the record, we don't like Sphinx but feel the alternatives aren't worth migrating to yet
  • b

    Barros

    11/05/2021, 5:10 PM
    I have a question: does IncrementalDataSet write data if the file already exists, like PartitionedDataSet does?
  • d

    datajoely

    11/05/2021, 5:10 PM
    It should function the same IIRC
  • d

    datajoely

    11/05/2021, 5:11 PM
    But best to test with a dummy file to be sure
  • d

    datajoely

    11/05/2021, 5:11 PM
    Behind the scenes it is a subclass
  • b

    Barros

    11/05/2021, 5:12 PM
    I wanted to implement a node that somehow checks whether the file exists and, if so, does nothing in the IO. Is there a default way to do this?
  • b

    Barros

    11/05/2021, 5:13 PM
    I thought about giving the dataset as both input and output, so I have the keys in both the load() and save() methods, but Kedro complains that it cannot be both
  • d

    datajoely

    11/05/2021, 5:13 PM
    So hooks are the best way to add this functionality https://kedro.readthedocs.io/en/latest/07_extend_kedro/02_hooks.html
  • d

    datajoely

    11/05/2021, 5:14 PM
    This sort of conditional logic isn't well supported out of the box but people do implement it
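For orientation only, a hedged sketch of what a project hook could look like here (Kedro 0.17.x style; the dataset name my_output_dataset and the registration in settings.py are assumptions): it checks, just before a save, whether the output already exists. A hook by itself cannot cancel the write, so the actual skip would still live in a custom dataset or in the node logic.

```python
# hooks.py - a sketch, not a drop-in solution. "my_output_dataset" is a hypothetical catalog entry.
import logging

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class ExistingOutputHooks:
    """Detect when an output that is about to be saved already exists."""

    @hook_impl
    def after_catalog_created(self, catalog):
        # keep a reference to the catalog so dataset-level hooks can query it
        self._catalog = catalog

    @hook_impl
    def before_dataset_saved(self, dataset_name, data):
        if dataset_name == "my_output_dataset" and self._catalog.exists(dataset_name):
            # A hook cannot skip the save itself; this only surfaces the condition.
            logger.warning("%s already exists and is about to be overwritten", dataset_name)


# In settings.py (Kedro 0.17.x):
# HOOKS = (ExistingOutputHooks(),)
```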
  • b

    Barros

    11/05/2021, 5:15 PM
    Makes sense
  • b

    Barros

    11/05/2021, 5:15 PM
    I have never written my own hooks
  • b

    Barros

    11/05/2021, 5:15 PM
    Let me see
  • z

    Zemeio

    11/05/2021, 10:59 PM
    Hey guys. I am trying to make a configuration where I can switch between the folders my pipelines run on, but can also explicitly set a pipeline to run on test data. To do that I was building a configuration like this with the templated config loader:
    ```yaml
    # globals.yml
    env:
      base:
        folder: "data/prod"
      test:
        folder: "data/test"
    
    folders:
      # Base folders, where the main pipelines are run
      raw: "${env.base.folder}/01_raw"
      int: "${env.base.folder}/01_intermediate"
      # Test folders, where smaller subsets of the data designed for testing reside, and the test pipelines run
      test_raw: "${env.test.folder}/01_raw"
      test_int: "${env.test.folder}/01_intermediate"
    ```
    However, when I try to run this it does not resolve the value from env.base. Does the templated config loader only apply templating to catalog.yml? Is the only official way to do this to use Jinja templates?
  • d

    datajoely

    11/06/2021, 11:51 AM
    What you're trying to do is best achieved by what we call configuration environments: https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#additional-configuration-environments You can create a mirror structure in conf/prod and conf/test, then get Kedro to resolve which one you want at run time with kedro run --env=prod, or with the environment variable export KEDRO_ENV=prod
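As a sketch of how the chosen environment is picked up programmatically (Kedro 0.17.x API assumed; the package name my_project and pipeline name sample_to_test are placeholders), the same catalog keys live in conf/base and are overridden in conf/test with the smaller test paths:

```python
# Programmatic equivalent of `kedro run --env=test` (Kedro 0.17.x API assumed).
from kedro.framework.session import KedroSession

# "my_project" and "sample_to_test" are placeholder names.
with KedroSession.create("my_project", env="test") as session:
    # config is resolved from conf/test first, falling back to conf/base
    session.run(pipeline_name="sample_to_test")
```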
  • z

    Zemeio

    11/06/2021, 1:25 PM
    Thank you for the reply. I do want to use the envs, but I also want pipelines (or nodes) that sample my prod data down to the test data, so I need a pipeline that goes from one env to the other. The way I thought to achieve this is by having a setting that always points to the test data (test) and one that can point to either the test data or the prod data (base). In the test environment, base would point to the test folder, so I can run things on a smaller dataset; in prod, base would point to the huge datasets in the cloud.
  • z

    Zemeio

    11/06/2021, 1:25 PM
    Hence why I had two folders that would be in the same catalog (instead of using parameters or envs)
  • d

    datajoely

    11/08/2021, 9:38 AM
    prod & test env resolution
  • m

    Matheus Serpa

    11/09/2021, 9:59 PM
    Hello guys! We wrote some data pipeline code with Kedro and are getting stuck on a "circular dependency error." We read semente_table to remove duplicates from the source dataset and then load the data back into semente_table itself. Any suggestions on how to deal with this issue?
  • d

    datajoely

    11/10/2021, 9:47 AM
    Hi @User - this is failing because you create a cycle with semente_table - Kedro doesn't know which one to write first!
  • d

    datajoely

    11/10/2021, 9:48 AM
    Therefore you should create a new dataset as an output of load_data, called something like processed_semente_table
  • d

    datajoely

    11/10/2021, 9:48 AM
    This also means your pipeline is reproducible (assuming the raw data is not dynamic) as you will never be overwriting the source data
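A hedged sketch of the shape of that fix (the id column and the function body are illustrative, not from the thread): the node reads both the CSV and the SQL table but writes to the new processed_semente_table entry, so the dependency graph stays acyclic.

```python
from kedro.pipeline import Pipeline, node


def remove_duplicates(semente_csv, semente_table):
    """Keep only the CSV rows whose 'id' is not already in the SQL table ('id' is illustrative)."""
    return semente_csv[~semente_csv["id"].isin(semente_table["id"])]


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                remove_duplicates,
                inputs=["semente_csv", "semente_table"],
                outputs="processed_semente_table",  # new dataset, so no cycle back to semente_table
                name="remove_duplicates_node",
            )
        ]
    )
```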
  • m

    Matheus Serpa

    11/10/2021, 1:34 PM
    I get it. Thanks for your support @User. I have another question (if you don't mind 🙂). What would be good practice for loading the data from processed_semente_table into semente_table (which is a SQL table)? In summary, the steps are: 1) read a CSV file with semente_data; 2) read semente_table from the SQL DB; 3) remove duplicates (comparing the CSV with the SQL DB); 4) insert the new data into semente_table. We also tried the following: instead of reading semente_table in step 2, we read a semente_query with only the columns used to detect duplicates, and then eliminate the semente_table cycle.
  • d

    datajoely

    11/10/2021, 1:49 PM
    Are duplicates defined by an ID column? I think we may want to do some sort of UPSERT operation here.
  • i

    Isaac89

    11/10/2021, 1:51 PM
    Hi! I used the ParallelRunner to execute a pipeline and got "ParallelRunner does not support output to externally created MemoryDataSets". Does that mean MemoryDataSets are not compatible with the ParallelRunner? (I'm using Kedro 0.17.0)
  • d

    datajoely

    11/10/2021, 1:52 PM
    Let me check that error - I've never seen it before, but it will be rooted in the fact that parallel processes in Python can't share memory.
  • d

    datajoely

    11/10/2021, 1:54 PM
    Are these standard memory datasets or ones you've created in your complex iteration thingy @User?