advanced-need-help
  • datajoely (12/20/2021, 2:34 PM)
    And does `kedro run` still work?
  • Schoolmeister (12/20/2021, 2:42 PM)
    * `kedro run` still works
    * This is the output from the terminal when starting `kedro jupyter lab`; everything seems OK: https://gist.github.com/michaeltoqua/95055dbb439a6a9240e61d5680c0aec5
    * Even though I'm on 0.17.4, I've just tried what you suggested, and it seems to work!
  • datajoely (12/20/2021, 2:42 PM)
    Okay - I'm not entirely sure what went wrong, but I'm glad we've got a working solution
  • datajoely (12/20/2021, 2:43 PM)
    Maybe we should include a `kedro project rename` command to make this easier
  • Schoolmeister (12/20/2021, 2:44 PM)
    My initial guess was that something was cached, but I don't have enough knowledge about IPython and/or Kedro to know whether that's possible. Removing all the caching dirs I could find sure didn't help.
  • Schoolmeister (12/20/2021, 2:44 PM)
    I'll just use the temporary solution in the meantime then - thanks!
  • datajoely (12/20/2021, 2:45 PM)
    If it is caching, maybe poke around the `.ipython` folder at the root of the project? But I'm out of ideas!
  • Schoolmeister (12/20/2021, 2:46 PM)
    Yes, that's what I did, but I couldn't find anything. Renaming/removing the dir didn't help either.
  • datajoely (12/20/2021, 2:46 PM)
    ¯\_(ツ)_/¯
  • datajoely (12/20/2021, 2:46 PM)
    Okay - shout if you have any other issues!
  • deepyaman (12/20/2021, 9:03 PM)
    Not sure if this counts as "needs help," but I'm trying to put together a guide for running Kedro pipelines with Dask (i.e. for distributed node execution), and I cleaned up somebody else's work into https://github.com/deepyaman/kedro-dask-example this weekend. All of the real work is contained in https://github.com/deepyaman/kedro-dask-example/blob/develop/src/kedro_dask_example/runner/dask_runner.py. As far as help goes:
    1. If anybody wants to try it out themselves and see if it works (or doesn't) for them, any feedback is much appreciated! The easiest thing is to just `kedro run --runner kedro_dask_example.runner.DaskRunner`, but that's also not that interesting. To use the distributed scheduler, you can run `dask-scheduler` and `PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786` in a couple of terminal windows, and then run the pipeline. I change the default value for `client_args` to `{"address": "127.0.0.1:8786"}` for this, because I'm lazy (but you can of course construct the runner the normal way).
    2. If somebody has familiarity with Dask, a review of how I get the `Client` would be very helpful. I think `worker_client` in `_DaskDataSet` is correct, but I'm not sure if I should be using `Client.current()` the way I am in `DaskRunner`. I think `worker_client` is unnecessary there, since it all runs on the scheduler, and `Client.as_current` seems to be for the case where you already have a client object and want to use it, but I can't find much documentation around this and most of my understanding comes from reading the `distributed` source.
    3. I'll try to work on a first version of tracking load counts and releasing datasets tonight. My plan is to do it in the simplest way possible, in the `as_completed` loop. However, this feels a bit inefficient, as a dataset really could have been released on its final load (rather than waiting for the node to finish running). I think this would require a distributed counter that `_DaskDataSet` instances could modify... is this even smart?
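    [A minimal sketch of the Client handling being discussed - the class and method bodies here are assumptions for illustration, not the actual kedro-dask-example code:]

        from distributed import Client, worker_client

        class DaskRunner:
            """Runner that creates one Client up front and reuses it."""

            def __init__(self, client_args=None):
                # Creating a Client registers it as the "current" client,
                # so Client.current() elsewhere can retrieve it.
                Client(**(client_args or {"address": "127.0.0.1:8786"}))

            def submit(self, func, *args):
                # Runner code executes client-side, so Client.current()
                # (rather than worker_client) suffices here.
                return Client.current().submit(func, *args)

        class _DaskDataSet:
            """Dataset whose load may run inside a Dask worker task."""

            def __init__(self, name):
                self._name = name

            def _load(self):
                # worker_client() is the supported way to obtain a client
                # from within a task running on a worker.
                with worker_client() as client:
                    return client.get_dataset(self._name)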
  • datajoely (12/21/2021, 10:50 AM)
    dask on kedro
  • Schoolmeister (12/23/2021, 9:01 AM)
    Hey, can someone give me some advice on building pipelines when using `PartitionedDataset`? I like using it, since it's a nice way to deal with structured data residing in a folder structure, but I'm not happy with the way the pipelines handle them. Simplified, our data folder structure is somewhat like the following:

        └───data
            ├───01_raw
            │   └───data_type1
            │       ├───entity1.csv
            │       ├───entity2.csv
            │       ├───...
            │       └───entityX.csv
            ├───02_intermediate
            │   └───data_type1
            │       ├───entity1.csv
            │       ├───entity2.csv
            │       ├───...
            │       └───entityX.csv
            └───03_primary
                └───data_type1
                    ├───entity1.csv
                    ├───entity2.csv
                    ├───...
                    └───entityX.csv

    Each entity's data gets improved the further it moves down the layers. These entities are handled separately from one another. Ideally there is a pipeline that takes in a single `pd.DataFrame` containing the entity's data and transforms it from the raw layer to the primary layer. In that sense, it is a horizontal execution that can be done in parallel for each entity. However, as far as I can tell, using a `PartitionedDataset` forces the pipeline and pipeline nodes to accept a `Dict` as input. Now parallelizing becomes harder, as the pipeline stages have become vertical: each entity in the dictionary must be processed before moving on to the next stage. Is there any way around this? We'd like to keep using DataFrames as inputs and outputs rather than Dicts, since DataFrames also carry some semantic information about what exactly the pipeline does.
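    [To make the constraint concrete: a node consuming a `PartitionedDataSet` receives a dict mapping partition ids to load callables, so per-entity logic ends up wrapped in a loop inside the node. A minimal sketch, where `transform` stands in for the real per-entity logic:]

        from typing import Callable, Dict

        import pandas as pd

        def transform(df: pd.DataFrame) -> pd.DataFrame:
            """Stand-in for the real per-entity transformation."""
            return df.dropna()

        def process_partitions(
            partitions: Dict[str, Callable[[], pd.DataFrame]]
        ) -> Dict[str, pd.DataFrame]:
            # PartitionedDataSet loads lazily: each value is a callable
            # returning that partition's DataFrame. Returning a dict
            # writes one partition per key on save.
            return {pid: transform(load()) for pid, load in partitions.items()}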
  • datajoely (12/23/2021, 12:03 PM)
    They are dicts by design - perhaps this is a situation where it would make sense to inherit from and extend the `PartitionedDataSet` class for your own purposes?
  • datajoely (12/23/2021, 12:03 PM)
    Is the only issue how the data is loaded?
  • deepyaman (12/23/2021, 11:54 PM)
    > These entities are handled separately from one another.
    Assuming that the same (or similar, perhaps differently parametrized) transformation logic can be used for `01_raw/data_type1/entityM.csv` -> `02_intermediate/data_type1/entityM.csv` -> `03_primary/data_type1/entityM.csv` as for `01_raw/data_type1/entityN.csv` -> `02_intermediate/data_type1/entityN.csv` -> `03_primary/data_type1/entityN.csv`, it sounds to me like you want a modular pipeline that does this transformation, which you then reuse. This allows the transformation for each entity to occur at the node level, which makes it easier to parallelize. The route you described requires parallelization to occur within nodes, which runs into the blocking problem you describe. It's also less Kedronic, since you're encroaching on the runner's responsibility.
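    [A minimal sketch of such a reusable single-entity pipeline - the function and dataset names are assumptions, not from the thread:]

        import pandas as pd
        from kedro.pipeline import Pipeline, node

        def clean(raw: pd.DataFrame) -> pd.DataFrame:
            # Hypothetical raw -> intermediate step.
            return raw.dropna()

        def enrich(intermediate: pd.DataFrame) -> pd.DataFrame:
            # Hypothetical intermediate -> primary step.
            return intermediate.assign(processed=True)

        # One entity's raw -> intermediate -> primary flow; dataset names
        # are left unprefixed so the pipeline can be namespaced per entity.
        single_entity_pipeline = Pipeline(
            [
                node(clean, inputs="raw", outputs="intermediate"),
                node(enrich, inputs="intermediate", outputs="primary"),
            ]
        )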
  • j c h a r l e s (12/24/2021, 8:35 AM)
    Definitely let me know your eventual solution - curious to know, as I would like to implement something similar.
  • Schoolmeister (12/24/2021, 1:01 PM)
    Yes, the transformation logic is indeed the same. How exactly would an implementation using a modular pipeline work? I would expect that an earlier pipeline "unpacks" the `PartitionedDataset` dict and maps each key to an input of the modular pipeline - is that the right way to look at it?
  • deepyaman (12/26/2021, 6:12 AM)
    Do you absolutely need to use `PartitionedDataSet`? One area Kedro could be a lot better IMO is in how it handles `PartitionedDataSet`, and how that can be read/consumed in the unpartitioned form. As it stands, you can do this unpacking, as you describe, but:
    1. it requires some sort of dynamic behavior to get the list of partitions
    2. an unpacking pipeline really doesn't do anything, given the data already exists in a split form - you're just making another set of catalog entries that point to the same data
    I would probably leave the dynamic behavior to when you're constructing the pipeline, like:

        from kedro.pipeline import Pipeline, pipeline

        all_entities_pipeline = Pipeline([])
        for i in range(NUM_ENTITIES):
            all_entities_pipeline += pipeline(single_entity_pipeline, namespace=f"entity{i}")

    Still not perfect if you want a lot of catalog entries for each entity; you probably need to look at using templating in that case.
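    [One sketch of the templating idea, using plain YAML anchors in the catalog; the entries assume the namespaced dataset names (e.g. `entity1.raw`) produced by the loop above, and rely on Kedro skipping catalog keys that start with an underscore:]

        _csv: &csv
          type: pandas.CSVDataSet

        entity1.raw:
          <<: *csv
          filepath: data/01_raw/data_type1/entity1.csv

        entity1.primary:
          <<: *csv
          filepath: data/03_primary/data_type1/entity1.csv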
  • j c h a r l e s (01/03/2022, 9:45 AM)
    Does anyone know how to override the content type for files uploaded to S3? It seems like all my files are being uploaded with `content-type: binary/octet-stream`, regardless of the content type of my files. I'd like to preserve their types rather than coercing them to binary/octet-stream.
  • j c h a r l e s (01/03/2022, 10:29 AM)
    Solution was adding these fs args & save args to the catalog:

        my_catalog_data:
          filepath: s3://path.../data.html
          type: text.TextDataSet
          fs_args:
            open_args_save:
              ContentType: "text/html"
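    [A quick way to verify the fix, as a sketch - the bucket and key are placeholders:]

        import boto3

        s3 = boto3.client("s3")
        head = s3.head_object(Bucket="my-bucket", Key="path/to/data.html")
        print(head["ContentType"])  # should now be "text/html"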
  • j c h a r l e s (01/03/2022, 10:31 AM)
    Guessed this by looking at the s3fs source: https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L1296
  • datajoely (01/03/2022, 4:13 PM)
    Yup that's correct!
  • user (01/06/2022, 4:42 PM)
    Why doesn't my Kedro starter prompt for input? https://stackoverflow.com/questions/70610418/why-doesnt-my-kedro-starter-prompt-for-input
  • austin-hilberg (01/10/2022, 5:13 PM)
    The `kedro-starter-pandas-iris` repo doesn't have a `prompts.yml`, and so using a local clone of the repo as a starter doesn't allow any input. Is this an oversight, or was that intentional?
  • datajoely (01/10/2022, 5:42 PM)
    Hi Austin - which repo are you referring to? We don't have different repos for each starter - it's one monorepo with different folders: https://github.com/kedro-org/kedro-starters/tree/main/pandas-iris
  • austin-hilberg (01/10/2022, 5:44 PM)
    🤦 I was looking at a third-party fork. Thank you for pointing out my error.
  • datajoely (01/10/2022, 5:45 PM)
    No worries! Good luck
  • datajoely (01/10/2022, 5:45 PM)
    That being said - just checking you know you can do this:
        kedro new --starter=pandas-iris
  • datajoely (01/10/2022, 5:46 PM)
    and it will do it for you