advanced-need-help
  • u

    user

    03/12/2022, 6:30 PM
    Hi kedro community, first of all thanks for the great tool! I am playing around with the deployment options, in particular the Prefect deployment. I noticed that when using the `register_flow.py` script, the datasets in the catalog object are named as in the file, whereas in the nodes the input and output datasets are namespaced. Therefore, when running that flow it creates only memory datasets, because it assumes none of the datasets exist in the catalog. If I change `register_flow.py` so that it does not create MemoryDataSets for everything, the `run_node` function does not work, as the node inputs and catalog names don't match up and the save/load calls fail (it tries to load a namespaced dataset that it can't find in the catalog). Is there a way to obtain either a namespaced catalog or a pipeline object where the inputs/outputs of the nodes are not namespaced, so that the `run_node` function works properly? 🙂
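    For reference, this is roughly what the registration step in the Prefect guide does (a sketch, assuming `pipeline` and `catalog` have already been loaded from the project context in `register_flow.py`, Kedro 0.17.x APIs); the mismatch described above arises when the namespaced names the nodes reference don't line up with the entries in catalog.yml:

        from kedro.io import MemoryDataSet

        # Dataset names referenced by the (namespaced) nodes but absent from the catalog
        unregistered = pipeline.data_sets() - set(catalog.list())
        for dataset_name in unregistered:
            # Register them under the namespaced name so run_node's load/save calls resolve
            catalog.add(dataset_name, MemoryDataSet())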
  • u

    user

    03/13/2022, 11:21 AM
    Could it be that 0.17.7 does not work with the starters anymore? In 0.17.6 the spaceflights starter still works fine, but in 0.17.7 the new namespacing seems to break the starter (pipelines and their nodes now include namespacing, but the catalog does not).
  • d

    datajoely

    03/13/2022, 11:56 AM
    Prefect deployment
  • a

    antony.milne

    03/14/2022, 9:54 AM
    I've just checked this and it seems to work ok to me. This is what your 0.17.7 spaceflights starter project should look like: https://github.com/AntonyMilneQB/test/tree/spaceflights/spaceflights. It contains the two namespaced catalog entries in catalog.yml: https://github.com/AntonyMilneQB/test/blob/spaceflights/spaceflights/conf/base/catalog.yml#L76-L86
  • d

    datajoely

    03/14/2022, 9:56 AM
    I think the problem is specifically when following the prefect tutorial
  • a

    antony.milne

    03/14/2022, 9:57 AM
    ahhh ok
  • p

    PhillyCheeseCake

    03/14/2022, 11:28 AM
    Hello Everyone, I'm new to kedro and have had success with a use-case for tracking performance of a traditional ML model with variable architectures, making changes to input parameters, and saving the reporting results. I'm looking now to use the same data but applied to fundamentally different architectures and with different evaluation criteria. Specifically, I'd like to be able to use deep learning frameworks and hand-crafted algorithms; the hand-crafted algorithms are simple mathematical operations wrapped in a class to be deployed to firmware. What are the best practices with respect to kedro for this to 1) scale easily and 2) integrate easily with kedro-mlflow in the future?
    My current data flow is: data load -> preprocessing -> feature calculations -> model training -> evaluation, where model training contains the model specifications. As I understand it, I have the following options:
    1) route which nodes to use within the model training pipeline using parameters, e.g. a parameter such as architecture_type that routes the data flow accordingly
    2) determine node logic via parameters which specify the architecture (similar to above)
    3) give each fundamentally different architecture its own pipeline (1. traditional ML, 2. deep learning, 3. hand-crafted algos), routed at the pipeline registry level
    4) implement modular pipelines for these 3 cases
    My judgement is that options 1 and 2 do not scale well, are not good practice and seem ridiculous. Option 4 is attractive, but I don't know whether the modular pipeline framework will be sufficiently flexible; furthermore, it seems from reading other posts in here that this may complicate tracking runs with mlflow (multiple models being saved within the same run). Thus, I'm leaning towards option 3 to start, and if I need additional granularity I can make modular pipelines within those 3 categories. Would really appreciate any kind of feedback, clarification or advice. Thanks!
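    A minimal sketch of what option 3 could look like at the pipeline registry level; the module and pipeline names below are illustrative, not taken from the project described above:

        # src/my_project/pipeline_registry.py -- illustrative module path
        from typing import Dict

        from kedro.pipeline import Pipeline

        from my_project.pipelines import deep_learning, hand_crafted, traditional_ml


        def register_pipelines() -> Dict[str, Pipeline]:
            """One pipeline per architecture family, selected with `kedro run --pipeline <name>`."""
            return {
                "traditional_ml": traditional_ml.create_pipeline(),
                "deep_learning": deep_learning.create_pipeline(),
                "hand_crafted": hand_crafted.create_pipeline(),
                "__default__": traditional_ml.create_pipeline(),
            }

    Each family can later be broken into modular pipelines internally without changing how it is routed here.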
  • d

    datajoely

    03/14/2022, 12:37 PM
    Large scale application of Kedro
  • w

    Walber Moreira

    03/16/2022, 12:58 AM
    Night, guys! Does anyone know the optimal way to solve the use case below?
    1. Node X creates a list of endpoints
    2. >>> get the list and make requests for each endpoint <<<
    We were using node X and then making the loop of requests in node Y, but doing this creates an impure node… Should I create a custom dataset that receives an Iterable as input?
  • d

    datajoely

    03/16/2022, 8:43 AM
    So we don't typically encourage this, because we separate the responsibilities of IO from the node; in practice the node shouldn't know how the data gets loaded/saved.
  • d

    datajoely

    03/16/2022, 8:48 AM
    I'm trying to think about the best way to do this:
    - A custom dataset with no save method (a bit like our `APIDataSet`) could work
    - You could maybe do things in a `before_pipeline_run` hook and `catalog.add` your new data
    - You could do this outside of Kedro and simply set up your catalog to expect data at a certain location
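    As a rough sketch of the second option (a `before_pipeline_run` hook that fetches the data and adds it to the catalog), assuming the endpoint list is known up front rather than produced by node X; the hook name, endpoint list and dataset name are illustrative:

        import requests
        from kedro.framework.hooks import hook_impl
        from kedro.io import MemoryDataSet

        ENDPOINTS = ["https://example.com/api/a", "https://example.com/api/b"]  # placeholder list


        class EndpointResponsesHook:
            @hook_impl
            def before_pipeline_run(self, run_params, pipeline, catalog):
                # Fetch every endpoint before the run and expose the results as a dataset
                responses = [requests.get(url, timeout=30).json() for url in ENDPOINTS]
                catalog.add("endpoint_responses", MemoryDataSet(data=responses))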
  • w

    Walber Moreira

    03/16/2022, 12:33 PM
    Yeah, I’m inclined towards the first option, it seems more stable, clear and scalable. The third option is out of the question because our deployment uses `session.run()` on Databricks.
  • d

    Deep

    03/24/2022, 7:07 AM
    Hello kedro team. I have a silly question. Take for example I'm running 3 nodes in a pipeline. All three of them are importing large datasets. Now is there a way to release the memory utilised by node 1 after it finishes executing?
  • d

    datajoely

    03/24/2022, 7:11 AM
    So no stupid questions! Kedro relies on Python to do its garbage collection, and in most cases this will be as good as doing it manually. You could try implementing an "after node run" hook if you wanted to do something explicitly.
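    A minimal sketch of that kind of hook, assuming that releasing the catalog's cached copies plus an explicit gc.collect() is enough for this case:

        import gc

        from kedro.framework.hooks import hook_impl


        class ReleaseMemoryHook:
            @hook_impl
            def after_node_run(self, node, catalog, inputs):
                # Drop the catalog's cached copies of this node's inputs, then force a GC pass.
                # Memory is only really freed if nothing else still references the data.
                for dataset_name in node.inputs:
                    catalog.release(dataset_name)
                gc.collect()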
  • k

    kalah

    03/24/2022, 7:15 AM
    Hi guys, I am an AWS Data Engineer with just a week's worth of experience with Kedro. I have been assigned a task to deploy one of our Kedro pipelines, which has 67 nodes and 4 pipelines, to AWS Step Functions. I was able to successfully deploy my Kedro pipeline as AWS Step Functions using the instructions provided here: https://kedro.readthedocs.io/en/latest/10_deployment/10_aws_step_functions.html . But when I try to run the Step Functions state machine, the execution fails with a parallel processing error, as AWS Lambda does not support parallel processing:

        {
          "resourceType": "lambda",
          "resource": "invoke",
          "error": "OSError",
          "cause": {
            "errorMessage": "[Errno 38] Function not implemented",
            "errorType": "OSError",
            "requestId": "6f1de43c-8aff-4294-a888-1a905c8fb7eb",
            "stackTrace": [
              ..............
              " File \"/home/app/kedro/framework/session/store.py\", line 76, in ShelveStore\n _lock = Lock()\n",
              " File \"/usr/local/lib/python3.8/multiprocessing/context.py\", line 68, in Lock\n return Lock(ctx=self.get_context())\n",
              " File \"/usr/local/lib/python3.8/multiprocessing/synchronize.py\", line 162, in __init__\n SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)\n",
              " File \"/usr/local/lib/python3.8/multiprocessing/synchronize.py\", line 57, in __init__\n sl = self._semlock = _multiprocessing.SemLock(\n"
            ]
          }
        }

    Can someone help resolve this? I believe this is because of `_convert_kedro_pipeline_to_step_functions_state_machine(self)` in the deploy.py file provided in the documentation. Any help would be much appreciated. Thanks
  • d

    datajoely

    03/24/2022, 7:20 AM
    So are you trying to run the pipeline in parallel mode? Can you try and see if it works in sequential mode
  • k

    kalah

    03/24/2022, 7:33 AM
    The deploy code provided by Kedro in the deploy.py file itself groups the nodes and then assigns them to run in parallel mode:

        def _convert_kedro_pipeline_to_step_functions_state_machine(self) -> None:
            """Convert Kedro pipeline into an AWS Step Functions State Machine"""
            definition = sfn.Pass(self, "Start")

            for i, group in enumerate(self.pipeline.grouped_nodes, 1):
                group_name = f"Group {i}"
                sfn_state = sfn.Parallel(self, group_name)
                for node in group:
                    sfn_task = self._convert_kedro_node_to_sfn_task(node)
                    sfn_state.branch(sfn_task)

                definition = definition.next(sfn_state)

            sfn.StateMachine(
                self,
                self.project_name,
                definition=definition,
                timeout=core.Duration.seconds(5 * 60),
            )

    Is there a specific reason why the code is set up to deploy the nodes in parallel mode in Lambda, when it is known that AWS Lambda does not support parallel processing?
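    For what it's worth, one way to try datajoely's sequential suggestion would be to chain every task instead of branching inside `sfn.Parallel`; this is only a sketch against the snippet above, not the documented deployment, and it does not address whether the Lambda error has another cause:

        def _convert_kedro_pipeline_to_step_functions_state_machine(self) -> None:
            """Sketch: chain every node sequentially instead of using sfn.Parallel."""
            definition = sfn.Pass(self, "Start")

            for group in self.pipeline.grouped_nodes:
                for node in group:
                    # Each node becomes its own state, executed one after another
                    definition = definition.next(self._convert_kedro_node_to_sfn_task(node))

            sfn.StateMachine(
                self,
                self.project_name,
                definition=definition,
                timeout=core.Duration.seconds(5 * 60),
            )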
  • d

    datajoely

    03/24/2022, 9:53 AM
    AWS Step functions
  • u

    user

    03/27/2022, 7:29 AM
    How do I run a local script using GitHub Actions? https://stackoverflow.com/questions/71634369/how-do-i-run-a-local-script-using-github-actions
  • p

    PhillyCheeseCake

    03/27/2022, 7:52 PM
    Hey all, how might I go about dynamically combining the results of multiple modular pipeline instances into a single dataset file when the number of pipeline results is variable? E.g. I have 4 pipeline instances: A, B, C and D. I need to combine the results of A & B into a dataset based on shared properties and keep C and D separate. Any clean way of doing this natively? In other words, how can I get a node to take a variable number of inputs?
  • d

    datajoely

    03/27/2022, 7:57 PM
    We typically err on the side of keeping things explicit. There are a couple of techniques you can use - you can make use of kwargs/args in your function definition so you can provide a variable number of inputs. I sort of have an example here: https://github.com/datajoely/modular-spaceflights/blob/main/src/modular_spaceflights/pipelines/feature_engineering/pipeline.py#L41 You may also want to get creative with hooks.
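    A minimal sketch of the *args approach, with illustrative dataset names:

        import pandas as pd

        from kedro.pipeline import node


        def combine_results(*partial_results: pd.DataFrame) -> pd.DataFrame:
            """Concatenate however many result tables are wired into this node."""
            return pd.concat(partial_results, ignore_index=True)


        # Wire up A and B today; adding more inputs later only means extending this list.
        combine_a_and_b = node(
            combine_results,
            inputs=["results_a", "results_b"],
            outputs="combined_results",
        )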
  • p

    PhillyCheeseCake

    03/29/2022, 10:11 AM
    Thanks a ton for all the help, my pipeline is now functional on a basic level, modular, and already quite beautiful.
  • w

    williamc

    03/29/2022, 7:12 PM
    Regarding `TensorFlowModelDataset`: has anyone noticed some strange behaviour when trying to save (overwrite) a model that already exists? Instead of overwriting the old version, it copies the entire temporary folder into the old model folder. I have replicated the issue in two different Kedro projects (0.17.5).
  • d

    datajoely

    03/29/2022, 9:03 PM
    So looking at the implementation it does look a little funny - what would be the expected result in this case?
  • g

    Galileo-Galilei

    03/29/2022, 9:04 PM
    Might it be related: https://github.com/kedro-org/kedro/issues/696 ? I had a couple of issues with `TensorFlowModelDataset` in the past and ended up writing my own, but I can't remember the details.
  • d

    datajoely

    03/29/2022, 9:05 PM
    It looks related!
  • g

    Galileo-Galilei

    03/29/2022, 9:08 PM
    (This is one of my team members, and I think we had a couple of issues back then.) We have used something custom since this issue was raised, but I did not have time to investigate further. We also had some issues with tensorflow itself; it's been a while, sorry I can't help more.
  • w

    williamc

    03/29/2022, 9:11 PM
    Yep it's related. The old model still exists after saving, and the new one is in a subdirectory. It's like the `fsspec.put` operation copies the root tmp directory that `TensorFlowModelDataset` creates into the dataset's filepath.
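    A hedged sketch of the kind of workaround a custom dataset could apply: remove the existing model directory before copying the new SavedModel over, so the new version cannot end up nested inside the old one. This is an assumption about a fix, not the upstream TensorFlowModelDataset code, and fsspec's directory `put` semantics have varied across versions, which is part of the confusion here:

        import tempfile

        import fsspec
        import tensorflow as tf


        def save_model(model: tf.keras.Model, filepath: str) -> None:
            fs, _, (path,) = fsspec.get_fs_token_paths(filepath)
            if fs.exists(path):
                # Drop the previous SavedModel directory so the new copy is not nested inside it
                fs.rm(path, recursive=True)
            with tempfile.TemporaryDirectory() as tmp_dir:
                model.save(tmp_dir)  # write the SavedModel locally first
                fs.put(f"{tmp_dir}/", path, recursive=True)  # then copy it to the target path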
  • d

    datajoely

    03/29/2022, 9:19 PM
    Thank you - would you mind copying that to the issue I just reopened? I'm on my phone so won't do it justice!
  • w

    williamc

    03/29/2022, 9:21 PM
    Will do