beginners-need-help
  • z

    Zhee

    11/04/2021, 1:37 PM
    Hello everyone. I have a question about documentation best practices. The autogenerated documentation is great for focusing on pipelines and the package, but I would also like to document my datasets in more detail. (I generated a static site on GitHub Pages.) What would be the best approach to add details about data sources (meaning, content, etc.)? The data catalog could be a good place to start, but the YAML file doesn't give us access to description fields or anything like that (or maybe I missed something), or perhaps it's simply not the right place for that kind of information. Should I add details in a custom part of the Sphinx project, or could they be incorporated somewhere else? What do you usually do when documenting your data sources?
  • d

    datajoely

    11/04/2021, 6:57 PM
    I haven't forgotten about this @Zhee! I will write up a proper response tomorrow
  • d

    datajoely

    11/05/2021, 10:22 AM
    Hi @User - I can write up my thoughts now. tl;dr - out of the box we don't do all we could in this area, and I personally want to limit the YAML our users have to write.
    - Looking at the industry, the best things I've seen are what Great Expectations (https://docs.greatexpectations.io/docs/tutorials/getting_started/check_out_data_docs/) and dbt (https://docs.getdbt.com/docs/building-a-dbt-project/documentation) are able to do.
    - Practically, it's not a lot of work to take a Kedro DataCatalog object via the Python API, feed it into GE and generate this sort of documentation. It's not something we offer as a first-party integration (yet), but we would like to some day.
    - Something we've been keen to do for a long time is extend kedro-viz to have some sort of 'catalog manager', which would be the natural place for this to live. It's not under active development, but if users start shouting that they'd like it, it gets more weight on the backlog 🙂
    - Finally, the most structured way of doing this today is to use Sphinx and the built-in kedro build-docs command to generate static docs. This is mostly there for Python API docs, but everything on the Kedro docs (https://kedro.readthedocs.io/en/stable/) is made this way, so you can steal how we do it too. I think it would be pretty neat to write a script that uses the DataCatalog Python API to create data documentation stubs which you then fill in with human-readable descriptions.
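A minimal sketch of that last idea, assuming a Kedro 0.17.x project (the package name my_project and the output path docs/source/data_catalog.md are placeholders): it lists catalog entries via the public DataCatalog API and writes a Markdown stub per dataset for a human to fill in.

```python
# generate_catalog_stubs.py - run from the project root (Kedro 0.17.x API assumed).
from pathlib import Path

from kedro.framework.session import KedroSession

with KedroSession.create("my_project") as session:  # "my_project" is a placeholder package name
    catalog = session.load_context().catalog

lines = ["# Data catalog", ""]
for name in sorted(catalog.list()):
    if name == "parameters" or name.startswith("params:"):
        continue  # skip parameter entries, document only datasets
    lines += [f"## {name}", "", "TODO: describe the meaning, content and source of this dataset.", ""]

out = Path("docs/source/data_catalog.md")  # placeholder output location
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text("\n".join(lines))
```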
  • z

    Zhee

    11/05/2021, 1:52 PM
    Thank you @User for sharing your thoughts on this. I agree that Sphinx and the build-docs command are already a strong foundation for extra documentation, and they are flexible enough to do great things efficiently. I will study your last point and go in that direction!
  • d

    datajoely

    11/05/2021, 1:59 PM
    💪 Shout if you need any pointers - for the record, we don't like Sphinx but feel the alternatives aren't worth migrating to yet
  • b

    Barros

    11/05/2021, 5:10 PM
    I have a question: does IncrementalDataSet write data if the file already exists, like PartitionedDataSet does?
  • d

    datajoely

    11/05/2021, 5:10 PM
    It should function the same IIRC
  • d

    datajoely

    11/05/2021, 5:11 PM
    But best to test with a dummy file to be sure
  • d

    datajoely

    11/05/2021, 5:11 PM
    Behind the scenes it is a subclass
  • b

    Barros

    11/05/2021, 5:12 PM
    I wanted to implement a node that somehow checks whether the file exists and, if so, does nothing in the IO. Is there a default way to do this?
  • b

    Barros

    11/05/2021, 5:13 PM
    I thought about giving the dataset as both input and output, so I have the keys in both the load() and save() methods, but Kedro complains that it cannot be both
  • d

    datajoely

    11/05/2021, 5:13 PM
    So hooks are the best way to add this functionality https://kedro.readthedocs.io/en/latest/07_extend_kedro/02_hooks.html
  • d

    datajoely

    11/05/2021, 5:14 PM
    This sort of conditional logic isn't well supported out of the box but people do implement it
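For orientation only, a hedged sketch of what a project hook could look like here (Kedro 0.17.x style; the dataset name my_output_dataset and the registration in settings.py are assumptions): it checks, just before a save, whether the output already exists. A hook by itself cannot cancel the write, so the actual skip would still live in a custom dataset or in the node logic.

```python
# hooks.py - a sketch, not a drop-in solution. "my_output_dataset" is a hypothetical catalog entry.
import logging

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class ExistingOutputHooks:
    """Detect when an output that is about to be saved already exists."""

    @hook_impl
    def after_catalog_created(self, catalog):
        # keep a reference to the catalog so dataset-level hooks can query it
        self._catalog = catalog

    @hook_impl
    def before_dataset_saved(self, dataset_name, data):
        if dataset_name == "my_output_dataset" and self._catalog.exists(dataset_name):
            # A hook cannot skip the save itself; this only surfaces the condition.
            logger.warning("%s already exists and is about to be overwritten", dataset_name)


# In settings.py (Kedro 0.17.x):
# HOOKS = (ExistingOutputHooks(),)
```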
  • b

    Barros

    11/05/2021, 5:15 PM
    Makes sense
  • b

    Barros

    11/05/2021, 5:15 PM
    I have never written my own hooks
  • b

    Barros

    11/05/2021, 5:15 PM
    Let me see
  • z

    Zemeio

    11/05/2021, 10:59 PM
    Hey guys. I am trying to make a configuration where I can switch between the folders my pipelines run on, but can also explicitly set a pipeline to run on test data. To do that I was building a configuration like this with the templated config loader:
    ```yaml
    # globals.yml
    env:
      base:
        folder: "data/prod"
      test:
        folder: "data/test"
    
    folders:
      # Base folders, where the main pipelines are run
      raw: "${env.base.folder}/01_raw"
      int: "${env.base.folder}/01_intermediate"
      # Test folders, where smaller subsets of the data designed for testing reside, and the test pipelines run
      test_raw: "${env.test.folder}/01_raw"
      test_int: "${env.test.folder}/01_intermediate"
    ```
    However, when I try to run this it does not resolve the value from env.base. Does the templated config loader only apply templating to catalog.yml? Is the only official way to do this to use Jinja templates?
  • d

    datajoely

    11/06/2021, 11:51 AM
    What you're trying to do is best achieved by what we call configuration environments: https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#additional-configuration-environments You can create a mirror structure in conf/prod and conf/test, then get Kedro to resolve which one you want at run time with kedro run --env=prod, or with the environment variable export KEDRO_ENV=prod
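As a sketch of how the chosen environment is picked up programmatically (Kedro 0.17.x API assumed; the package name my_project and pipeline name sample_to_test are placeholders), the same catalog keys live in conf/base and are overridden in conf/test with the smaller test paths:

```python
# Programmatic equivalent of `kedro run --env=test` (Kedro 0.17.x API assumed).
from kedro.framework.session import KedroSession

# "my_project" and "sample_to_test" are placeholder names.
with KedroSession.create("my_project", env="test") as session:
    # config is resolved from conf/test first, falling back to conf/base
    session.run(pipeline_name="sample_to_test")
```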
  • z

    Zemeio

    11/06/2021, 1:25 PM
    Thank you for the reply. I do want to use the envs, but I also want pipelines (or nodes) that sample my prod data down to the test data, so I need a pipeline that goes from one env to the other. The way I thought to achieve this is by having a setting that always points to the test data (test) and one that can point to either the test data or the prod data (base). In the test environment, base would point to the test folder, so I can run things on a smaller dataset; in prod, base would point to the huge datasets in the cloud.
  • z

    Zemeio

    11/06/2021, 1:25 PM
    Hence why I had two folders that would be in the same catalog (instead of using parameters or envs)
  • d

    datajoely

    11/08/2021, 9:38 AM
    prod & test env resolution
  • m

    Matheus Serpa

    11/09/2021, 9:59 PM
    Hello guys! We wrote some data pipeline code with Kedro and are getting stuck on a "circular dependency error." We read semente_table to remove duplicates from the source dataset and then load the data back into semente_table itself. Any suggestions on how to deal with this issue?
  • d

    datajoely

    11/10/2021, 9:47 AM
    Hi @User - this is failing because you create a cycle with semente_table - Kedro doesn't know which one to write first!
  • d

    datajoely

    11/10/2021, 9:48 AM
    Therefore you should create a new dataset as an output of load_data, called something like processed_semente_table
  • d

    datajoely

    11/10/2021, 9:48 AM
    This also means your pipeline is reproducible (assuming the raw data is not dynamic) as you will never be overwriting the source data
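A hedged sketch of the shape of that fix (the id column and the function body are illustrative, not from the thread): the node reads both the CSV and the SQL table but writes to the new processed_semente_table entry, so the dependency graph stays acyclic.

```python
from kedro.pipeline import Pipeline, node


def remove_duplicates(semente_csv, semente_table):
    """Keep only the CSV rows whose 'id' is not already in the SQL table ('id' is illustrative)."""
    return semente_csv[~semente_csv["id"].isin(semente_table["id"])]


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                remove_duplicates,
                inputs=["semente_csv", "semente_table"],
                outputs="processed_semente_table",  # new dataset, so no cycle back to semente_table
                name="remove_duplicates_node",
            )
        ]
    )
```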
  • m

    Matheus Serpa

    11/10/2021, 1:34 PM
    I get it. Thanks for your support @User. I have another question (if you don't mind 🙂). What would be good practice for loading the data from processed_semente_table into semente_table (which is a SQL table)? In summary, the steps are: 1) read a CSV file with semente_data; 2) read semente_table from the SQL DB; 3) remove duplicates (comparing the CSV with the SQL DB); 4) insert the new data into semente_table. We also tried the following: instead of reading semente_table in step 2, we read a semente_query with only the columns used to detect duplicates, and then eliminate the semente_table cycle.
  • d

    datajoely

    11/10/2021, 1:49 PM
    Are duplicates defined by an ID column? I think we may want to do some sort of UPSERT operation here.
  • i

    Isaac89

    11/10/2021, 1:51 PM
    Hi! I used the ParallelRunner to execute a pipeline and got "ParallelRunner does not support output to externally created MemoryDataSets". Does that mean MemoryDataSets are not compatible with the ParallelRunner? (I'm using Kedro 0.17.0)
  • d

    datajoely

    11/10/2021, 1:52 PM
    Let me check that error - I've never seen it before, but it will be rooted in the fact that parallel processes in Python can't share memory.
  • d

    datajoely

    11/10/2021, 1:54 PM
    Are these standard memory datasets or ones you've created in your complex iteration thingy @User?