beginners-need-help
  • d

    datajoely

    12/10/2021, 10:29 AM
    Yeah it's very easy to do from a technical point of view - the harder problem IMO is how to do it in a user-friendly way. We get some feedback that catalogs are getting too big already, so I don't want to balloon them further without good reason
  • z

    Zemeio

    12/10/2021, 10:32 AM
    I see. Yeah, maybe the correct thing would be to split the documentation from the technical definition. Either that or start having a file for each dataset, so you expect it to be bigger.
  • d

    datajoely

    12/10/2021, 10:32 AM
    Yeah I think that makes sense - but we'd still need to test it etc 🙂
  • z

    Zemeio

    12/10/2021, 10:34 AM
    Kedro actually already supports having one file for each dataset, no customization or jinja required. So, if I do make this change I will have to abuse that =). Thanks for the help.
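    For reference, that works because the default config patterns match both catalog* files and anything nested under a catalog*/ folder, so each dataset can live in its own YAML file under conf/base/catalog/. A minimal sketch of how those files get merged, assuming the list-of-paths ConfigLoader form shown in the credentials snippet further down (the constructor arguments differ between Kedro versions - both forms appear later in this thread):

    from kedro.config import ConfigLoader

    # conf/base/catalog.yml and any conf/base/catalog/<one_file_per_dataset>.yml
    # both match the default "catalog*" / "catalog*/**" patterns and are merged
    # into a single dictionary of dataset definitions.
    conf_loader = ConfigLoader(["conf/base", "conf/local"])
    catalog_conf = conf_loader.get("catalog*", "catalog*/**")
    print(sorted(catalog_conf))  # one key per dataset, regardless of file layout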
  • e

    ende

    12/10/2021, 11:26 PM
    Is it conceivable to have multiple sets of dependencies (in the form of say different virtual environments even) within a single kedro project? Or is that so ridiculous I should just purge the silly thought from my head? 😛
  • d

    datajoely

    12/11/2021, 5:45 PM
    So it is possible for source code, but not for execution. We do support a requirements.txt living within a particular modular pipeline folder for pull/push packaging
  • d

    datajoely

    12/11/2021, 5:46 PM
    But it is an advanced pattern and I’d only suggest you do it if there are no other options
  • r

    Rroger

    12/13/2021, 10:15 PM
    I have a node that just executes SQL. Where would be the best place to store the script? In parameters?
  • j

    j c h a r l e s

    12/13/2021, 10:34 PM
    Random follow up question - is it possible or common that a data source is a SQL table?
  • j

    j c h a r l e s

    12/13/2021, 10:43 PM
    I'm trying to find an example of how to use credentials for a kedro project. Where is the recommended place to initialize something like an api client for pipelines?
    from kedro.config import ConfigLoader, MissingConfigException
    
    conf_paths = ["conf/base", "conf/local"]
    conf_loader = ConfigLoader(conf_paths)
    
    try:
        credentials = conf_loader.get("credentials*", "credentials*/**")
    except MissingConfigException:
        credentials = {}
  • j

    j c h a r l e s

    12/13/2021, 10:43 PM
    Do I initialize this in a node? in a pipeline? In a utility library file somewhere? It seems like I can run this code... anywhere? Curious what this group recommends
  • j

    j c h a r l e s

    12/13/2021, 10:47 PM
    Generally, do you have each node initialize things like api clients? Or do you create context objects which store these types of things, and pass them through to each node?
  • j

    j c h a r l e s

    12/13/2021, 11:42 PM
    I ended up using the following to get the ConfigLoader to run as expected:
    conf_loader = ConfigLoader("conf", "local")
    credentials = conf_loader.get("credentials*", "credentials*/**")
    Also, I am calling this directly from a helper function within a specific node for now
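    A minimal sketch of that pattern - credentials loaded in a small helper, client built inside the node that needs it. MyApiClient and the "my_api" key are placeholders, and the ConfigLoader call copies the form that worked above (other Kedro versions take a list of paths instead):

    from kedro.config import ConfigLoader, MissingConfigException

    class MyApiClient:
        """Placeholder for whatever client library is actually in use."""
        def __init__(self, token=None):
            self.token = token

    def _load_credentials():
        # Same call as above; falls back to empty credentials if none are configured.
        try:
            return ConfigLoader("conf", "local").get("credentials*", "credentials*/**")
        except MissingConfigException:
            return {}

    def build_client():
        creds = _load_credentials().get("my_api", {})  # hypothetical credentials key
        return MyApiClient(token=creds.get("token"))

    def enrich_entities(entities_df):
        """Node function: the client is created here rather than passed between nodes."""
        client = build_client()
        return entities_df  # real work using `client` would go here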
  • j

    j c h a r l e s

    12/13/2021, 11:48 PM
    Another question - is there any concept of appending output data as a run continues? I have a node that iterates over a large amount of data (a list of entities in a raw dataset). For each entity, it takes between 15 minutes and 1 hour to process that specific entity. I have a node that is looping over each entity, processing it, and then merging all of the entities together. What's the best way to split up this node so that I can process each entity as its own node? The number of desired nodes would be one node per entity in the original dataset.
  • r

    Rroger

    12/14/2021, 12:09 AM
    As suggested in https://discord.com/channels/778216384475693066/846330075535769601/914271328037658635, you can create a loop to create many nodes.
  • j

    j c h a r l e s

    12/14/2021, 12:13 AM
    So it's possible to use a data set as an initial input, and then create nodes dynamically based on the contents of the dataframe?
  • j

    j c h a r l e s

    12/14/2021, 12:14 AM
    I can have a node that outputs a list of nodes as a result? e.g.
  • j

    j c h a r l e s

    12/14/2021, 12:16 AM
    I think I'm a bit confused about how to do this... I assumed any output of a computation on a dataset would be another dataset and couldn't really be a list of nodes?
  • j

    j c h a r l e s

    12/14/2021, 12:18 AM
    from kedro.pipeline import Pipeline, node

    def create_pipeline(**kwargs):
        build_list_of_entity_nodes = node(
            func=generate_nodes_from_entities,
            inputs="entities_csv",
            outputs="list_of_entity_nodes",
        )
        ##
        #  How do I run my_expensive_entity_function across this list of nodes,
        #  which is determined through the contents of entities_csv?
        return Pipeline([build_list_of_entity_nodes])
  • r

    Rroger

    12/14/2021, 12:44 AM
    This is what I have in mind; there could be a better way. Rather than outputting a list of nodes, I think you need to first create the list of nodes and then pass that list into the Pipeline object. Each node in the list would have a specific entity as input, which means you'd also have to split the csv yourself into separate entities.
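    A minimal sketch of that idea, assuming the entity list is known when the pipeline is built (Kedro constructs the DAG before any node runs, so it can't come from a node output); all names below are illustrative:

    import pandas as pd
    from kedro.pipeline import Pipeline, node

    ENTITIES = ["entity_a", "entity_b", "entity_c"]  # known up front, e.g. from a small config file

    def process_entity(entity_df):
        return entity_df  # stand-in for the expensive per-entity work

    def merge_entities(*entity_dfs):
        return pd.concat(entity_dfs)

    def create_pipeline(**kwargs):
        entity_nodes = [
            node(
                func=process_entity,
                inputs=f"{name}_raw",            # one catalog entry per entity
                outputs=f"{name}_processed",
                name=f"process_{name}",
            )
            for name in ENTITIES
        ]
        merge = node(
            func=merge_entities,
            inputs=[f"{name}_processed" for name in ENTITIES],
            outputs="all_entities",
            name="merge_entities",
        )
        return Pipeline(entity_nodes + [merge])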
  • j

    j c h a r l e s

    12/14/2021, 12:58 AM
    Seems like this pattern is discouraged in the two places I’ve found so far online: https://stackoverflow.com/questions/68253274/kedro-create-a-dynamic-node and
  • j

    j c h a r l e s

    12/14/2021, 12:59 AM
    Although Jinja2 is a very powerful and extremely flexible template engine, which comes with a wide range of features, we do not recommend using it to template your configuration unless absolutely necessary. The flexibility of dynamic configuration comes at a cost of significantly reduced readability and much higher maintenance overhead. We believe that, for the majority of analytics projects, dynamically compiled configuration does more harm than good.
  • j

    j c h a r l e s

    12/14/2021, 1:07 AM
    It seems like there should be some way of outputting a partitioned dataset?
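    A PartitionedDataSet output is indeed one way to do this: the node returns a dictionary whose keys become partition names, and Kedro writes one partition per key when it saves. A rough sketch, assuming a catalog entry of type PartitionedDataSet for the node's output and an illustrative entity_id column:

    import pandas as pd

    def _process_one(entity_df: pd.DataFrame) -> pd.DataFrame:
        return entity_df  # stand-in for the 15-minute-per-entity work

    def process_all_entities(entities_df: pd.DataFrame) -> dict:
        """Output for a PartitionedDataSet: one partition per entity.

        Returning callables (rather than DataFrames) defers each partition's work
        until Kedro saves it, so results are written incrementally instead of all
        being held in memory at once.
        """
        return {
            str(entity_id): (lambda group=group: _process_one(group))
            for entity_id, group in entities_df.groupby("entity_id")
        }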
  • r

    Rroger

    12/14/2021, 2:00 AM
    If a node doesn’t have an output, can this node be made into a dependency of another node? E.g. node1 executes a SQL script in a db producing table1, then node2 executes another SQL script on table1.
  • d

    datajoely

    12/14/2021, 10:13 AM
    Hi @User and @User we have some updates in 0.17.6:
    - We have updated pandas.SQLQueryDataSet so that it takes a reference to a SQL file
    - We have introduced spark.DeltaTableDataSet and some tutorials on how to do 'out of DAG' operations, which may translate here: https://kedro.readthedocs.io/en/latest/11_tools_integration/01_pyspark.html#spark-and-delta-lake-interaction
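    For the earlier SQL questions, a hedged sketch of pointing that dataset at a .sql file - the exact argument name for the file reference may differ by release (check the API docs), and the connection string is a placeholder:

    from kedro.extras.datasets.pandas import SQLQueryDataSet

    # Assumed usage: reference a SQL file instead of an inline query string.
    query_ds = SQLQueryDataSet(
        filepath="queries/active_users.sql",                    # hypothetical path
        credentials={"con": "postgresql://user:pass@host/db"},  # placeholder connection
    )
    df = query_ds.load()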
  • d

    datajoely

    12/14/2021, 10:14 AM
    I'm not sure if all of your questions are answered - perhaps if we split them into different threads we can talk about them in detail
  • n

    NC

    12/14/2021, 2:51 PM
    Is there a way for a node to output a parameter that can be used by a subsequent node?
  • d

    datajoely

    12/14/2021, 2:52 PM
    Could you just return the parameter as an output and re-use it in the next node?
  • d

    datajoely

    12/14/2021, 2:53 PM
    you don't need to declare it in the catalog, but you need to use the same key name in the pipeline definition
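    In other words, something like the sketch below, where the computed value simply flows between nodes as an ordinary in-memory dataset; the dataset and column names are illustrative:

    from kedro.pipeline import Pipeline, node

    def compute_threshold(training_df):
        # Return the "parameter" as a normal output; outputs not declared in the
        # catalog default to MemoryDataSet, so nothing extra is needed there.
        return float(training_df["score"].quantile(0.95))

    def apply_threshold(scoring_df, threshold):
        return scoring_df[scoring_df["score"] >= threshold]

    def create_pipeline(**kwargs):
        return Pipeline([
            node(compute_threshold, inputs="training_df", outputs="score_threshold"),
            node(apply_threshold, inputs=["scoring_df", "score_threshold"], outputs="filtered_df"),
        ])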
  • n

    NC

    12/14/2021, 2:54 PM
    Right, that’s how I’ve been doing it so far. Just was wondering if there is a way to make it explicitly a parameter and referred to later as ‘params:xxx’.