advanced-need-help
  • d

    Deep

    03/07/2022, 1:42 PM
    Is there any fix for the same? Thanks.
  • d

    Deep

    03/07/2022, 1:44 PM
    Hey @datajoely
  • d

    datajoely

    03/07/2022, 1:45 PM
Hi @User - Spark will do that because, by definition, different partitions run on different nodes
  • d

    datajoely

    03/07/2022, 1:45 PM
IIRC you might have some luck adding `df.coalesce(1)` to the last part of your node before it gets returned, that might generate only one file
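A minimal sketch of that suggestion, assuming a node that returns a PySpark DataFrame (the function name and transformation are illustrative):

```python
from pyspark.sql import DataFrame


def preprocess_data(df: DataFrame) -> DataFrame:
    # ... whatever transformations the node already does ...
    cleaned = df.dropna()
    # Collapse to a single partition so Spark writes one part file on save
    return cleaned.coalesce(1)
```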
  • d

    datajoely

    03/07/2022, 1:46 PM
behind the scenes we're just doing `df.write.options(**kwargs).save(filename)`
  • d

    Deep

    03/07/2022, 1:46 PM
so when returning, instead of `return df` I should do `return df.coalesce(1)`?
  • d

    datajoely

    03/07/2022, 1:47 PM
    it's worth giving it a try
  • d

    datajoely

    03/07/2022, 1:47 PM
    looking at these docs
  • d

    datajoely

    03/07/2022, 1:47 PM
    https://sparkbyexamples.com/spark/spark-write-dataframe-single-csv-file/
  • d

    Deep

    03/07/2022, 1:47 PM
    Sure I'll give it a try
  • d

    datajoely

    03/07/2022, 1:47 PM
they say after that you may have to run a couple of `dbutils` file movement commands
  • d

    datajoely

    03/07/2022, 1:47 PM
    I think coalesce will write a single file within a folder, but still within a folder
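If you do end up needing that cleanup, a rough sketch might look like this. Note `dbutils` is only available on Databricks, and the paths here are illustrative:

```python
# coalesce(1) still produces a *folder* containing a single part-*.csv,
# so locate that part file, move it to a plain file path, and drop the folder.
folder = "dbfs:/data/08_reporting/output.csv"   # folder Spark wrote to
part_file = next(
    f.path for f in dbutils.fs.ls(folder) if f.name.startswith("part-")
)
dbutils.fs.mv(part_file, "dbfs:/data/08_reporting/output_single.csv")
dbutils.fs.rm(folder, True)  # recursively remove the now-empty folder
```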
  • d

    Deep

    03/07/2022, 1:48 PM
    Right
  • d

    Deep

    03/07/2022, 1:48 PM
    Would you guys be releasing any future updates to maybe tackle this?
  • d

    datajoely

    03/07/2022, 1:57 PM
I don't think this is a common enough issue for us to change the `SparkDataSet` implementation itself - we try to mirror the underlying API as much as possible. There are two simple ways to add this yourself. The easiest is to subclass `SparkDataSet` and override the `save()` method: you can copy our implementation and simply add those two lines from the screenshot below to the operation. You can see how to create a custom dataset here https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html There is also a route to doing this with a hook (https://kedro.readthedocs.io/en/latest/07_extend_kedro/02_hooks.html) but I think the dataset is easier
  • d

    datajoely

    03/07/2022, 2:14 PM
    does that make sense @User ? happy to coach you through it
  • d

    Deep

    03/07/2022, 2:16 PM
    Thanks @datajoely. This makes sense. I'll try and implement this method.
  • d

    Deep

    03/08/2022, 5:47 AM
$ kedro viz
2022-03-08 11:15:12,792 - kedro.framework.cli.hooks.manager - INFO - Registered CLI hooks from 1 installed plugin(s): kedro-telemetry-0.1.3
2022-03-08 11:15:14,928 - kedro_telemetry.plugin - INFO - You have opted into product usage analytics.
2022-03-08 11:15:16,887 - kedro_viz.integrations.pypi - INFO - Checking for update...
2022-03-08 11:15:17,227 - kedro.framework.session.store - INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
The system cannot find the path specified.
  • d

    datajoely

    03/08/2022, 10:16 AM
    this is unusual - is this all that gets exported?
  • d

    datajoely

    03/08/2022, 10:16 AM
    does the pipeline run without viz?
  • d

    Deep

    03/08/2022, 10:17 AM
    Yes
  • d

    Deep

    03/08/2022, 10:18 AM
    Found the error, it was related to Java environment variable.
  • d

    datajoely

    03/08/2022, 10:20 AM
    Okay good - yeah on reflection that's not coming from Kedro's logging
  • s

    Schoolmeister

    03/08/2022, 1:11 PM
How do you guys handle a variable number of outputs? For example, I want to write a leave-one-out cross-validation split node that accepts as input a dataset and a cross-validation configuration (containing the indexes where to split) and outputs, for each fold, a train/validation split. Each of these outputs should then be fed into a subsequent node that does inference. The "tricky" thing here, I think, is that the number of folds varies depending on the dataset.
  • s

    Schoolmeister

    03/08/2022, 1:26 PM
Inspired by the spaceflights modular pipelines code, I'd like to do something like this. But how do I get those `train_data` and `validation_data` outputs to create the subsequent pipelines with?
```python
cv_split_pipe = Pipeline(
    [
        node(
            func=nodes.cv_split,
            inputs=["data", "params:fold_config"],
            # train_data and validation_data are lists, one entry per fold
            outputs=["train_data", "validation_data"],
        )
    ]
)

# get the train_data and validation_data outputs somehow
train_data = []
validation_data = []

# build modular pipelines, one per fold
pipelines = []
for i, (train_set, validation_set) in enumerate(zip(train_data, validation_data)):
    pipelines.append(
        pipeline(
            pipe=new_inference_pipeline(),
            inputs=[train_set, validation_set],
            outputs={"y_pred": f"y_pred_{i}"},
        )
    )
final_pipeline = sum(pipelines)
```
  • d

    datajoely

    03/08/2022, 1:42 PM
@User honestly the easiest solution is to return the same number of outputs consistently, but make them return `None` if not necessary
  • d

    datajoely

    03/08/2022, 1:42 PM
    we can get into all sorts of creative workarounds but I think it's hard to get away from that view of things
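A minimal sketch of that suggestion, assuming an upper bound on the number of folds (all names here are illustrative, not Kedro API):

```python
from typing import Optional, Tuple

import pandas as pd

MAX_FOLDS = 5  # assumed upper bound on the number of CV folds


def cv_split(
    data: pd.DataFrame, fold_config: dict
) -> Tuple[Optional[pd.DataFrame], ...]:
    """Always return MAX_FOLDS train sets followed by MAX_FOLDS validation
    sets; folds that don't exist for this dataset come back as None."""
    folds = fold_config["folds"]  # e.g. a list of (train_idx, val_idx) pairs
    trains, vals = [], []
    for i in range(MAX_FOLDS):
        if i < len(folds):
            train_idx, val_idx = folds[i]
            trains.append(data.iloc[train_idx])
            vals.append(data.iloc[val_idx])
        else:
            trains.append(None)  # fold not needed for this dataset
            vals.append(None)
    return tuple(trains + vals)
```

The node declaration would then list a fixed set of output names in the same order, e.g. `[f"train_data_{i}" for i in range(MAX_FOLDS)] + [f"validation_data_{i}" for i in range(MAX_FOLDS)]`, and the downstream inference nodes have to tolerate `None` inputs.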
  • w

    williamc

    03/08/2022, 3:10 PM
    FWIW this is what I ended up writing. I side-stepped the issue I was having by delegating the spark dataset read to PySpark: https://gist.github.com/williamcaicedo/f5379a668fe0f59f5dcd02f57bffa369
  • d

    datajoely

    03/08/2022, 3:55 PM
Really neat - thanks for the update