NC (12/14/2021, 2:58 PM)

jcharles (12/14/2021, 6:08 PM)

Rroger (12/15/2021, 11:07 PM)
When I run kedro viz, I get "there is no active Kedro session". Does anyone know how to make it work? I had managed to run it successfully before; I'm just not sure what has changed since then.

datajoely (12/16/2021, 8:47 AM)

czix (12/16/2021, 1:26 PM)

datajoely (12/16/2021, 1:48 PM)
```python
from typing import Tuple

def my_node_func() -> Tuple[int, int]:
    return tuple([1, 2])
```
czix (12/16/2021, 2:03 PM)
```python
Pipeline([
    node(func=my_node_func, inputs=None, outputs=["a", "b"]),
])
```
Or am I wrong?
datajoely (12/16/2021, 2:06 PM)

czix (12/16/2021, 2:08 PM)
outputs=["ab"]?

datajoely (12/16/2021, 2:10 PM)
outputs="a", and "a" would then store the whole tuple.
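The distinction above can be sketched without Kedro at all. The helper below mimics, in plain Python, how a node's return value is bound to its declared output names: a list of names unpacks the returned tuple element-wise, while a single name stores the whole return value. This is an illustrative sketch only, not Kedro's actual internals.

```python
from typing import Any, Dict, List, Union

def map_outputs(result: Any, outputs: Union[str, List[str], None]) -> Dict[str, Any]:
    """Bind a node result to output dataset names (illustrative sketch).

    - outputs is a list  -> the returned tuple is unpacked element-wise
    - outputs is a str   -> the whole return value is stored under that name
    - outputs is None    -> nothing is stored
    """
    if outputs is None:
        return {}
    if isinstance(outputs, str):
        # Single name: the tuple itself becomes the dataset's value
        return {outputs: result}
    if len(outputs) != len(result):
        raise ValueError("number of outputs must match the returned tuple length")
    # List of names: one element per dataset
    return dict(zip(outputs, result))

print(map_outputs((1, 2), ["a", "b"]))  # {'a': 1, 'b': 2}
print(map_outputs((1, 2), "a"))         # {'a': (1, 2)}
```

So with outputs=["a", "b"] the two elements land in two datasets, while outputs="a" makes dataset "a" hold the tuple itself, as described above.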
czix (12/16/2021, 2:11 PM)

datajoely (12/16/2021, 2:15 PM)

datajoely (12/16/2021, 2:17 PM)
In catalog.yml:
```yaml
a:
  type: MemoryDataSet
  copy_mode: copy
```
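For context on copy_mode: a MemoryDataSet can hand data to each consumer as a deep copy, a shallow copy, or the very same object (assign). The difference is just standard Python copy semantics, sketched here with the stdlib copy module (illustrative only, not Kedro code):

```python
import copy

data = {"rows": [1, 2, 3]}

deep = copy.deepcopy(data)   # like copy_mode: deepcopy - fully independent
shallow = copy.copy(data)    # like copy_mode: copy - new dict, shared inner list
assign = data                # like copy_mode: assign - the same object

# Mutate the original after "loading" it three ways
data["rows"].append(4)

print(deep["rows"])     # [1, 2, 3]     (unaffected)
print(shallow["rows"])  # [1, 2, 3, 4]  (inner list is shared)
print(assign is data)   # True
```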
czix (12/16/2021, 2:27 PM)

Dhaval (12/16/2021, 4:23 PM)

datajoely (12/16/2021, 4:29 PM)

Dhaval (12/16/2021, 4:30 PM)

Rroger (12/16/2021, 9:21 PM)

Rroger (12/17/2021, 12:54 AM)

datajoely (12/17/2021, 7:35 AM)
- kedro.extras.datasets.pandas.SQLQueryDataSet
- kedro.extras.datasets.pandas.SQLTableDataSet
- kedro.extras.datasets.spark.SparkJDBCDataSet
- kedro.extras.datasets.spark.SparkHiveDataSet
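As an illustration of the first of those, a pandas.SQLQueryDataSet catalog entry looks roughly like this, with the connection string supplied via credentials (the dataset name, query, and connection string below are made up for the example):

```yaml
# catalog.yml (illustrative entry)
shuttles_query:
  type: pandas.SQLQueryDataSet
  sql: SELECT * FROM shuttles
  credentials: postgres

# credentials.yml
postgres:
  con: postgresql://user:password@localhost:5432/mydb
```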
RRoger (12/17/2021, 9:25 PM)
In data_ingestion/pipeline.py:
```python
node(
    name="upload_to_db",
    func=lambda x: x,
    inputs="shuttles",
    outputs="shuttles_table",
),
```
In catalog_01_raw.yml:
```yaml
shuttles_table:
  type: pandas.SQLTableDataSet
  table_name: shuttles
  credentials: postgres
  save_args:
    if_exists: replace
```
But the log shows that shuttles_table is a MemoryDataSet:
```
2021-12-18 08:24:36,993 - kedro.pipeline.node - INFO - Running node: <lambda>([shuttles]) -> [data_ingestion.shuttles_table]
2021-12-18 08:24:36,993 - kedro.io.data_catalog - INFO - Saving data to `data_ingestion.shuttles_table` (MemoryDataSet)...
```
And the table is not created in the database.
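One detail visible in the log above: the node saves to data_ingestion.shuttles_table, not shuttles_table. When a modular pipeline adds a namespace, Kedro prefixes the dataset names with it, and any dataset name not found in the catalog falls back to a MemoryDataSet. If that is what is happening here, renaming the catalog entry to the namespaced name should make the SQLTableDataSet take effect (a guess based only on the log lines above):

```yaml
data_ingestion.shuttles_table:
  type: pandas.SQLTableDataSet
  table_name: shuttles
  credentials: postgres
  save_args:
    if_exists: replace
```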
RRoger (12/17/2021, 9:37 PM)
I didn't realise that creating another function (new_ingestion_pipeline) just to add a namespace to an existing Pipeline is how it's done.
RRoger (12/17/2021, 9:53 PM)
If a node (node B) is dependent on a previous node (node A) having uploaded to a database (e.g. some_table as a pandas.SQLTableDataSet) and I use some_table as the input for B, does B automatically try to download some_table into memory (if not already in memory)? I would not like the data downloaded if:
- the data is large, hence most of the pipeline time is spent on downloading
- B's code is to run SQL queries without ever requiring the data locally
datajoely (12/18/2021, 8:48 AM)

datajoely (12/18/2021, 8:48 AM)

datajoely (12/18/2021, 8:49 AM)

fanzipei (12/19/2021, 3:36 AM)

datajoely (12/19/2021, 10:03 AM)
fanzipei (12/19/2021, 1:15 PM)
```yaml
example_iris_data_gz:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/iris.csv.gz
  load_args:
    header: null
    compression: gzip
  save_args:
    index: null
    compression: gzip
```
Then I add a node which loads example_iris_data and exports example_iris_data_gz. Here is the new node:
```python
def compression(df):
    return df
```
and I added it to the pipeline as:
```python
node(
    compression,
    'example_iris_data',
    'example_iris_data_gz',
    name='compression',
)
```
Then I run:
```
kedro run --from-nodes='compression'
```
There is a warning message:
```
C:\Users\fanzi\anaconda3\envs\kedro\lib\site-packages\pandas\io\common.py:609: RuntimeWarning: compression has no effect when passing a non-binary object as input.
  ioargs = _get_filepath_or_buffer(
```
Finally I get an iris.csv.gz file that is actually only a text file.
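That RuntimeWarning means pandas was handed an already-open, text-mode file object, so its compression setting was silently skipped and plain text was written. A quick way to verify whether a .gz file is genuinely gzip-compressed is to check for the gzip magic bytes 1f 8b at the start of the file. A stdlib-only sketch (the filenames are throwaway temp files, not the project's paths):

```python
import gzip
import os
import tempfile

def is_gzipped(path: str) -> bool:
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

tmpdir = tempfile.mkdtemp()
real_gz = os.path.join(tmpdir, "real.csv.gz")
fake_gz = os.path.join(tmpdir, "fake.csv.gz")

# Written through gzip.open: genuinely compressed
with gzip.open(real_gz, "wt") as f:
    f.write("sepal_length,sepal_width\n5.1,3.5\n")

# Written through a plain text handle: just a text file with a .gz name,
# which is what the warning above says happened
with open(fake_gz, "wt") as f:
    f.write("sepal_length,sepal_width\n5.1,3.5\n")

print(is_gzipped(real_gz))  # True
print(is_gzipped(fake_gz))  # False
```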
datajoely (12/19/2021, 7:00 PM)