Schoolmeister
12/20/2021, 2:42 PM
kedro run still works
* This is the output from the terminal when starting kedro jupyter lab; everything seems OK: https://gist.github.com/michaeltoqua/95055dbb439a6a9240e61d5680c0aec5
* Even though I'm on 0.17.4, I've just tried what you suggested, and it seems to work!

datajoely
12/20/2021, 2:42 PM

datajoely
12/20/2021, 2:43 PM
kedro project rename command to make this easier

Schoolmeister
12/20/2021, 2:44 PM

Schoolmeister
12/20/2021, 2:44 PM

datajoely
12/20/2021, 2:45 PM
.ipython folder at the root of the project? But I'm out of ideas!

Schoolmeister
12/20/2021, 2:46 PM

datajoely
12/20/2021, 2:46 PM

datajoely
12/20/2021, 2:46 PM

deepyaman
12/20/2021, 9:03 PM
kedro run --runner kedro_dask_example.runner.DaskRunner, but it's also not that interesting. To use the distributed scheduler, you can run dask-scheduler and PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786 in a couple of terminal windows, and then run the pipeline. I change the default value for client_args to {"address": "127.0.0.1:8786"} for this, because I'm lazy (but you can of course construct the runner the normal way).
2. If somebody has familiarity with Dask, a review of how I get the Client would be very helpful. I think worker_client in _DaskDataSet is correct, but I'm not sure if I should be using Client.current() the way I am in DaskRunner. I think worker_client is unnecessary here, since it all runs on the scheduler, and Client.as_current seems to be for a use case where you already have a client object and want to use it, but I can't find much documentation around this and most of my understanding comes from reading the distributed source.
3. I'll try to work on a first version of tracking load counts and releasing datasets tonight. My plan is to do it in the simplest way possible, in the as_completed loop. However, this feels a bit inefficient, as a dataset really could be released on its final load (rather than waiting for the node to finish running). I think this would require a distributed counter that _DaskDataSet instances could modify... is this even smart?
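A minimal sketch of the scheduler/client wiring described above, with made-up node names; this is not the actual kedro_dask_example implementation:

from distributed import Client, as_completed

def run_node(name: str) -> str:
    # stand-in for executing a single Kedro node on a worker
    return name

# connect to the scheduler started by dask-scheduler; this matches the
# client_args default of {"address": "127.0.0.1:8786"} described above
client = Client(address="127.0.0.1:8786")

futures = [client.submit(run_node, n) for n in ["node_a", "node_b"]]
for future in as_completed(futures):
    finished = future.result()
    # point 3 above: this loop is where per-dataset load counts could be
    # decremented and fully consumed datasets released
    print(f"{finished} completed")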
datajoely
12/21/2021, 10:50 AM

Schoolmeister
12/23/2021, 9:01 AM
PartitionedDataset? I like using it, since it's a nice way to deal with structured data residing in a folder structure, but I'm not happy with the way the pipelines handle them.
Simplified, our data folder structure looks something like the following:
└───data
    ├───01_raw
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    ├───02_intermediate
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    └───03_primary
        └───data_type1
            ├───entity1.csv
            ├───entity2.csv
            ├───...
            └───entityX.csv
Each entity's data gets improved as it moves down through the layers. These entities are handled separately from one another. Ideally there would be a pipeline that takes in a single pd.DataFrame containing one entity's data and transforms it from the raw layer to the primary layer. In that sense, it is a horizontal execution that can be done in parallel for each entity. However, as far as I can tell, using a PartitionedDataset forces the pipeline and its nodes to accept a Dict as input. Now parallelizing becomes harder, as the pipeline stages have become vertical: every entity in the dictionary must be processed before moving on to the next stage.
Is there any way around this? We'd like to keep using DataFrames as inputs and outputs rather than Dicts, since DataFrames also convey some semantic information about what exactly the pipeline does.
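For context on the constraint being described: a node fed by a PartitionedDataset receives a dict mapping partition ids to load callables, so the per-entity loop lives inside the node. A minimal sketch, with illustrative names:

from typing import Callable, Dict

import pandas as pd

def clean_all_entities(
    partitions: Dict[str, Callable[[], pd.DataFrame]]
) -> Dict[str, pd.DataFrame]:
    # each value is a zero-argument function that loads one entity's CSV
    cleaned = {}
    for entity_id, load in partitions.items():
        cleaned[entity_id] = load().dropna()  # entities are processed one by one
    return cleaned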
datajoely
12/23/2021, 12:03 PM
PartitionedDataSet class for your own purposes?

datajoely
12/23/2021, 12:03 PM

deepyaman
12/23/2021, 11:54 PM
Assuming the transformation is the same for 01_raw/data_type1/entityM.csv -> 02_intermediate/data_type1/entityM.csv -> 03_primary/data_type1/entityM.csv as for 01_raw/data_type1/entityN.csv -> 02_intermediate/data_type1/entityN.csv -> 03_primary/data_type1/entityN.csv, it sounds to me like you want a modular pipeline that does this transformation, which you then reuse. This lets the transformation for each entity happen at the node level, which makes it easier to parallelize.
The route you described requires parallelization to occur within nodes, which runs into the blocking problem you describe. It's also less Kedronic, since you're encroaching on the runner's responsibility.
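A sketch of what such a reusable single-entity pipeline could look like; the function and dataset names are made up, and this is one possible definition of the single_entity_pipeline referenced in the snippet further down:

import pandas as pd
from kedro.pipeline import Pipeline, node

def raw_to_intermediate(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()  # placeholder cleaning step

def intermediate_to_primary(df: pd.DataFrame) -> pd.DataFrame:
    return df.reset_index(drop=True)  # placeholder refinement step

single_entity_pipeline = Pipeline(
    [
        node(raw_to_intermediate, "raw_data", "intermediate_data"),
        node(intermediate_to_primary, "intermediate_data", "primary_data"),
    ]
)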
j c h a r l e s
12/24/2021, 8:35 AM

Schoolmeister
12/24/2021, 1:01 PM
PartitionedDataset dict and maps each key to an input of the modular pipeline, is that the right way to look at it?

deepyaman
12/26/2021, 6:12 AM
PartitionedDataSet? One area where Kedro could be a lot better, IMO, is how it handles PartitionedDataSet and how that can be read/consumed in unpartitioned form. As it stands, you can do this unpacking, as you describe, but:
1. it requires some sort of dynamic behavior to get the list of partitions
2. an unpacking pipeline really doesn't do anything, given the data already exists in split form; you're just making another set of catalog entries that point to the same data
I would probably leave the dynamic behavior to when you're constructing the pipeline, like:
from kedro.pipeline import Pipeline, pipeline

all_entities_pipeline = Pipeline([])
for i in range(NUM_ENTITIES):
    all_entities_pipeline += pipeline(single_entity_pipeline, namespace=f"entity{i}")
This is still not perfect if you want a lot of catalog entries for each entity; you'd probably need to look at using templating in that case.
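As an alternative to templating, one possible (unofficial) pattern is to generate the per-entity catalog entries programmatically; the names and paths below are illustrative:

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog

NUM_ENTITIES = 3  # illustrative

catalog = DataCatalog()
for i in range(NUM_ENTITIES):
    # matches the dataset names produced by pipeline(..., namespace=f"entity{i}")
    catalog.add(
        f"entity{i}.raw_data",
        CSVDataSet(filepath=f"data/01_raw/data_type1/entity{i}.csv"),
    )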
j c h a r l e s
01/03/2022, 9:45 AM
Files I save to S3 are uploaded with content-type: binary/octet-stream, regardless of the content type of my files. I'd like to preserve their types rather than coercing them to binary/octet-stream.

j c h a r l e s
01/03/2022, 10:29 AM
my_catalog_data:
  filepath: s3://path.../data.html
  type: text.TextDataSet
  fs_args:
    open_args_save:
      ContentType: "text/html"
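A quick sanity check that the upload kept its content type (a hedged sketch; the bucket and key are placeholders for the truncated path above):

import boto3

s3 = boto3.client("s3")
head = s3.head_object(Bucket="my-bucket", Key="data.html")  # placeholder bucket/key
print(head["ContentType"])  # expect "text/html" after the fs_args change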
j c h a r l e s
01/03/2022, 10:31 AM

datajoely
01/03/2022, 4:13 PM

user
01/06/2022, 4:42 PM

austin-hilberg
01/10/2022, 5:13 PM
The kedro-starter-pandas-iris repo doesn't have a prompts.yml, and so using a local clone of the repo as a starter doesn't allow any input. Is this an oversight, or was that intentional?

datajoely
01/10/2022, 5:42 PM

austin-hilberg
01/10/2022, 5:44 PM

datajoely
01/10/2022, 5:45 PM

datajoely
01/10/2022, 5:45 PM
kedro new --starter=pandas-iris

datajoely
01/10/2022, 5:46 PM

austin-hilberg
01/10/2022, 5:55 PM