advanced-need-help
  • u

    user

    03/12/2022, 6:30 PM
    Hi kedro community, first of all thanks for the great tool! I am playing around with the deployment options, in particular the Prefect deployment. I noticed that when using the `register_flow.py` script, the datasets in the catalog object are named as in the file, whereas in the nodes the input and output datasets are namespaced. Therefore, when running that flow it creates only memory datasets, because it assumes none of the datasets exist in the catalog. If I change `register_flow.py` so that it does not create MemoryDataSets for everything, the `run_node` function does not work, as the node inputs and catalog names don't match up and the save/load calls fail (it tries to load a namespaced dataset that it can't find in the catalog). Is there a way to obtain either a namespaced catalog or a pipeline object where the inputs/outputs of the nodes are not namespaced, so that the `run_node` function works properly? 🙂
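    For reference, this is roughly what the registration step in the Prefect guide does (a sketch, assuming `pipeline` and `catalog` have already been loaded from the project context in `register_flow.py`, Kedro 0.17.x APIs); the mismatch described above arises when the namespaced names the nodes reference don't line up with the entries in catalog.yml:

        from kedro.io import MemoryDataSet

        # Dataset names referenced by the (namespaced) nodes but absent from the catalog
        unregistered = pipeline.data_sets() - set(catalog.list())
        for dataset_name in unregistered:
            # Register them under the namespaced name so run_node's load/save calls resolve
            catalog.add(dataset_name, MemoryDataSet())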
  • u

    user

    03/13/2022, 11:21 AM
    Could it be that 0.17.7 does not work with the starters anymore? In 0.17.6 the spaceflights starter still works fine, but in 0.17.7 the new namespacing seems to break the starter (pipelines and their nodes now include namespacing, but the catalog does not).
  • d

    datajoely

    03/13/2022, 11:56 AM
    Prefect deployment
  • a

    antony.milne

    03/14/2022, 9:54 AM
    I've just checked this and it seems to work ok to me. This is what your 0.17.7 spaceflights starter project should look like: https://github.com/AntonyMilneQB/test/tree/spaceflights/spaceflights. It contains the two namespaced catalog entries in catalog.yml: https://github.com/AntonyMilneQB/test/blob/spaceflights/spaceflights/conf/base/catalog.yml#L76-L86
  • d

    datajoely

    03/14/2022, 9:56 AM
    I think the problem is specifically when following the prefect tutorial
  • a

    antony.milne

    03/14/2022, 9:57 AM
    ahhh ok
  • p

    PhillyCheeseCake

    03/14/2022, 11:28 AM
    Hello Everyone, I'm new to kedro and have had success with a use-case for tracking performance of a traditional ML model with variable architectures, making changes to input parameters, and saving the reporting results. I'm looking now to use the same data but applied to fundamentally different architectures and with different evaluation criteria. Specifically, I'd like to be able to use deep learning frameworks and hand-crafted algorithms; the hand-crafted algorithms are simple mathematical operations wrapped in a class to be deployed to firmware. What are the best practices with respect to kedro for this to 1) scale easily and 2) integrate easily with kedro-mlflow in the future?
    My current data flow is: data load -> preprocessing -> feature calculations -> model training -> evaluation, where model training contains the model specifications. As I understand it, I have the following options:
    1) route which nodes to use within the model training pipeline using parameters, e.g. a parameter such as architecture_type that routes the data flow accordingly
    2) determine node logic via parameters which specify the architecture (similar to above)
    3) give each fundamentally different architecture its own pipeline (1. traditional ML, 2. deep learning, 3. hand-crafted algos), routed at the pipeline registry level
    4) implement modular pipelines for these 3 cases
    My judgement is that options 1 and 2 do not scale well, are not good practice and seem ridiculous. Option 4 is attractive, but I don't know whether the modular pipeline framework will be sufficiently flexible; furthermore, it seems from reading other posts in here that this may complicate tracking runs with mlflow (multiple models being saved within the same run). Thus, I'm leaning towards option 3 to start, and if I need additional granularity I can make modular pipelines within those 3 categories. Would really appreciate any kind of feedback, clarification or advice. Thanks!
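    A minimal sketch of what option 3 could look like at the pipeline registry level; the module and pipeline names below are illustrative, not taken from the project described above:

        # src/my_project/pipeline_registry.py -- illustrative module path
        from typing import Dict

        from kedro.pipeline import Pipeline

        from my_project.pipelines import deep_learning, hand_crafted, traditional_ml


        def register_pipelines() -> Dict[str, Pipeline]:
            """One pipeline per architecture family, selected with `kedro run --pipeline <name>`."""
            return {
                "traditional_ml": traditional_ml.create_pipeline(),
                "deep_learning": deep_learning.create_pipeline(),
                "hand_crafted": hand_crafted.create_pipeline(),
                "__default__": traditional_ml.create_pipeline(),
            }

    Each family can later be broken into modular pipelines internally without changing how it is routed here.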
  • d

    datajoely

    03/14/2022, 12:37 PM
    Large scale application of Kedro
  • w

    Walber Moreira

    03/16/2022, 12:58 AM
    Night, guys! Does anyone know the optimal way to solve the use case below?
    1. Node X creates a list of endpoints
    2. >>> get the list and make requests for each endpoint <<<
    We were using node X and then making the loop of requests in node Y, but doing this creates an impure node… Should I create a custom dataset that receives an Iterable as input?
  • d

    datajoely

    03/16/2022, 8:43 AM
    So we don't typically encourage this, because we separate the responsibilities of IO from the node; in practice the node shouldn't know how the data gets loaded/saved.
  • d

    datajoely

    03/16/2022, 8:48 AM
    I'm trying to think about the best way to do this:
    - A custom dataset with no save method (a bit like our `APIDataSet`) could work
    - You could maybe do things in a `before_pipeline_run` hook and `catalog.add` your new data
    - You could do this outside of Kedro and simply set up your catalog to expect data at a certain location
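    As a rough sketch of the second option (a `before_pipeline_run` hook that fetches the data and adds it to the catalog), assuming the endpoint list is known up front rather than produced by node X; the hook name, endpoint list and dataset name are illustrative:

        import requests
        from kedro.framework.hooks import hook_impl
        from kedro.io import MemoryDataSet

        ENDPOINTS = ["https://example.com/api/a", "https://example.com/api/b"]  # placeholder list


        class EndpointResponsesHook:
            @hook_impl
            def before_pipeline_run(self, run_params, pipeline, catalog):
                # Fetch every endpoint before the run and expose the results as a dataset
                responses = [requests.get(url, timeout=30).json() for url in ENDPOINTS]
                catalog.add("endpoint_responses", MemoryDataSet(data=responses))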
  • w

    Walber Moreira

    03/16/2022, 12:33 PM
    Yeah, I’m inclined towards the first option, it seems more stable, clear and scalable. The third option is out of the question because our deployment uses `session.run()` on Databricks.
  • d

    Deep

    03/24/2022, 7:07 AM
    Hello kedro team. I have a silly question. Take for example I'm running 3 nodes in a pipeline. All three of them are importing large datasets. Now is there a way to release the memory utilised by node 1 after it finishes executing?
  • d

    datajoely

    03/24/2022, 7:11 AM
    So no stupid questions! Kedro relies on Python to do its garbage collection, and in most cases this will be as good as doing it manually. You could try implementing an "after node run" hook if you wanted to do something explicitly.
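    A minimal sketch of that kind of hook, assuming that releasing the catalog's cached copies plus an explicit gc.collect() is enough for this case:

        import gc

        from kedro.framework.hooks import hook_impl


        class ReleaseMemoryHook:
            @hook_impl
            def after_node_run(self, node, catalog, inputs):
                # Drop the catalog's cached copies of this node's inputs, then force a GC pass.
                # Memory is only really freed if nothing else still references the data.
                for dataset_name in node.inputs:
                    catalog.release(dataset_name)
                gc.collect()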
  • k

    kalah

    03/24/2022, 7:15 AM
    Hi guys, I am an AWS Data Engineer with just a week's worth of experience with Kedro. I have been assigned a task to deploy one of our Kedro pipelines, which has 67 nodes and 4 pipelines, to AWS Step Functions. I was able to successfully deploy my Kedro pipeline as AWS Step Functions using the instructions provided here: https://kedro.readthedocs.io/en/latest/10_deployment/10_aws_step_functions.html . But when I try to run the Step Functions state machine, the execution fails with a parallel processing error, as AWS Lambda does not support parallel processing:

        {
          "resourceType": "lambda",
          "resource": "invoke",
          "error": "OSError",
          "cause": {
            "errorMessage": "[Errno 38] Function not implemented",
            "errorType": "OSError",
            "requestId": "6f1de43c-8aff-4294-a888-1a905c8fb7eb",
            "stackTrace": [
              ..............
              " File \"/home/app/kedro/framework/session/store.py\", line 76, in ShelveStore\n _lock = Lock()\n",
              " File \"/usr/local/lib/python3.8/multiprocessing/context.py\", line 68, in Lock\n return Lock(ctx=self.get_context())\n",
              " File \"/usr/local/lib/python3.8/multiprocessing/synchronize.py\", line 162, in __init__\n SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)\n",
              " File \"/usr/local/lib/python3.8/multiprocessing/synchronize.py\", line 57, in __init__\n sl = self._semlock = _multiprocessing.SemLock(\n"
            ]
          }
        }

    Can someone help resolve this? I believe this is because of `_convert_kedro_pipeline_to_step_functions_state_machine(self)` in the deploy.py file provided in the documentation. Any help would be much appreciated. Thanks
  • d

    datajoely

    03/24/2022, 7:20 AM
    So are you trying to run the pipeline in parallel mode? Can you try and see if it works in sequential mode
  • k

    kalah

    03/24/2022, 7:33 AM
    The deploy code provided by Kedro in the deploy.py file itself groups the nodes and then assigns them to run in parallel mode:

        def _convert_kedro_pipeline_to_step_functions_state_machine(self) -> None:
            """Convert Kedro pipeline into an AWS Step Functions State Machine"""
            definition = sfn.Pass(self, "Start")

            for i, group in enumerate(self.pipeline.grouped_nodes, 1):
                group_name = f"Group {i}"
                sfn_state = sfn.Parallel(self, group_name)
                for node in group:
                    sfn_task = self._convert_kedro_node_to_sfn_task(node)
                    sfn_state.branch(sfn_task)

                definition = definition.next(sfn_state)

            sfn.StateMachine(
                self,
                self.project_name,
                definition=definition,
                timeout=core.Duration.seconds(5 * 60),
            )

    Is there a specific reason why the code is set up to deploy the nodes in parallel mode in Lambda, when it is known that AWS Lambda does not support parallel processing?
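    For what it's worth, one way to try datajoely's sequential suggestion would be to chain every task instead of branching inside `sfn.Parallel`; this is only a sketch against the snippet above, not the documented deployment, and it does not address whether the Lambda error has another cause:

        def _convert_kedro_pipeline_to_step_functions_state_machine(self) -> None:
            """Sketch: chain every node sequentially instead of using sfn.Parallel."""
            definition = sfn.Pass(self, "Start")

            for group in self.pipeline.grouped_nodes:
                for node in group:
                    # Each node becomes its own state, executed one after another
                    definition = definition.next(self._convert_kedro_node_to_sfn_task(node))

            sfn.StateMachine(
                self,
                self.project_name,
                definition=definition,
                timeout=core.Duration.seconds(5 * 60),
            )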
  • d

    datajoely

    03/24/2022, 9:53 AM
    AWS Step functions
  • u

    user

    03/27/2022, 7:29 AM
    How do I run a local script using GitHub Actions? https://stackoverflow.com/questions/71634369/how-do-i-run-a-local-script-using-github-actions
  • p

    PhillyCheeseCake

    03/27/2022, 7:52 PM
    Hey all, how might I go about dynamically combining the results of multiple modular pipeline instances into a single dataset file when the number of pipeline results is variable? E.g. I have 4 pipeline instances: A, B, C and D. I need to combine the results of A & B into a dataset based on shared properties and keep C and D separate. Any clean way of doing this natively? In other words, how can I get a node to take a variable number of inputs?
  • d

    datajoely

    03/27/2022, 7:57 PM
    We typically err on the side of keeping things explicit. There are a couple of techniques you can use - you can make use of kwargs/args in your function definition so you can provide a variable number of inputs. I sort of have an example here: https://github.com/datajoely/modular-spaceflights/blob/main/src/modular_spaceflights/pipelines/feature_engineering/pipeline.py#L41 You may also want to get creative with hooks.
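    A minimal sketch of the *args approach, with illustrative dataset names:

        import pandas as pd

        from kedro.pipeline import node


        def combine_results(*partial_results: pd.DataFrame) -> pd.DataFrame:
            """Concatenate however many result tables are wired into this node."""
            return pd.concat(partial_results, ignore_index=True)


        # Wire up A and B today; adding more inputs later only means extending this list.
        combine_a_and_b = node(
            combine_results,
            inputs=["results_a", "results_b"],
            outputs="combined_results",
        )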
  • p

    PhillyCheeseCake

    03/29/2022, 10:11 AM
    Thanks a ton for all the help, my pipeline is now functional on a basic level, modular, and already quite beautiful.
  • w

    williamc

    03/29/2022, 7:12 PM
    Regarding `TensorFlowModelDataset`: has anyone noticed some strange behaviour when trying to save (overwrite) a model that already exists? Instead of overwriting the old version, it copies the entire temporary folder into the old model folder. I have replicated the issue in two different Kedro projects (0.17.5).
  • d

    datajoely

    03/29/2022, 9:03 PM
    So looking at the implementation it does look a little funny - what would be the expected result in this case?
  • g

    Galileo-Galilei

    03/29/2022, 9:04 PM
    Might it be related: https://github.com/kedro-org/kedro/issues/696 ? I had a couple of issues with `TensorFlowModelDataset` in the past and ended up writing my own, but I can't remember the details.
  • d

    datajoely

    03/29/2022, 9:05 PM
    It looks related!
  • g

    Galileo-Galilei

    03/29/2022, 9:08 PM
    (This is one of my team members, and I think we had a couple of issues back then.) We have used something custom since this issue was raised, but I did not have time to investigate further. We also had some issues with tensorflow itself; it's been a while, sorry I can't help more.
  • w

    williamc

    03/29/2022, 9:11 PM
    Yep it's related. The old model still exists after saving, and the new one is in a subdirectory. It's like the `fsspec.put` operation copies the root tmp directory that `TensorFlowModelDataset` creates into the dataset's filepath.
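    A hedged sketch of the kind of workaround a custom dataset could apply: remove the existing model directory before copying the new SavedModel over, so the new version cannot end up nested inside the old one. This is an assumption about a fix, not the upstream TensorFlowModelDataset code, and fsspec's directory `put` semantics have varied across versions, which is part of the confusion here:

        import tempfile

        import fsspec
        import tensorflow as tf


        def save_model(model: tf.keras.Model, filepath: str) -> None:
            fs, _, (path,) = fsspec.get_fs_token_paths(filepath)
            if fs.exists(path):
                # Drop the previous SavedModel directory so the new copy is not nested inside it
                fs.rm(path, recursive=True)
            with tempfile.TemporaryDirectory() as tmp_dir:
                model.save(tmp_dir)  # write the SavedModel locally first
                fs.put(f"{tmp_dir}/", path, recursive=True)  # then copy it to the target path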
  • d

    datajoely

    03/29/2022, 9:19 PM
    Thank you - would you mind copying that to the issue I just reopened? I'm on my phone so won't do it justice!
  • w

    williamc

    03/29/2022, 9:21 PM
    Will do