advanced-need-help
  • d

    Deep

    03/07/2022, 1:42 PM
    Is there any fix for the same? Thanks.
  • d

    Deep

    03/07/2022, 1:44 PM
    Hey @datajoely
  • d

    datajoely

    03/07/2022, 1:45 PM
Hi @User - Spark will do that because, by definition, different partitions run on different nodes
  • d

    datajoely

    03/07/2022, 1:45 PM
IIRC you might have some luck adding `df.coalesce(1)` to the last part of your node before it gets returned, that might generate only one file
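A minimal sketch of that suggestion, assuming a node that returns a PySpark DataFrame (the function name and transformation are illustrative):

```python
from pyspark.sql import DataFrame


def preprocess_data(df: DataFrame) -> DataFrame:
    # ... whatever transformations the node already does ...
    cleaned = df.dropna()
    # Collapse to a single partition so Spark writes one part file on save
    return cleaned.coalesce(1)
```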
  • d

    datajoely

    03/07/2022, 1:46 PM
behind the scenes we're just doing `df.write.options(**kwargs).save(filename)`
  • d

    Deep

    03/07/2022, 1:46 PM
so when returning, instead of `return df` I should do `return df.coalesce(1)`?
  • d

    datajoely

    03/07/2022, 1:47 PM
    it's worth giving it a try
  • d

    datajoely

    03/07/2022, 1:47 PM
    looking at these docs
  • d

    datajoely

    03/07/2022, 1:47 PM
    https://sparkbyexamples.com/spark/spark-write-dataframe-single-csv-file/
  • d

    Deep

    03/07/2022, 1:47 PM
    Sure I'll give it a try
  • d

    datajoely

    03/07/2022, 1:47 PM
they say after that you may have to run a couple of `dbutils` file movement commands
  • d

    datajoely

    03/07/2022, 1:47 PM
    I think coalesce will write a single file within a folder, but still within a folder
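If you do end up needing that cleanup, a rough sketch might look like this. Note `dbutils` is only available on Databricks, and the paths here are illustrative:

```python
# coalesce(1) still produces a *folder* containing a single part-*.csv,
# so locate that part file, move it to a plain file path, and drop the folder.
folder = "dbfs:/data/08_reporting/output.csv"   # folder Spark wrote to
part_file = next(
    f.path for f in dbutils.fs.ls(folder) if f.name.startswith("part-")
)
dbutils.fs.mv(part_file, "dbfs:/data/08_reporting/output_single.csv")
dbutils.fs.rm(folder, True)  # recursively remove the now-empty folder
```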
  • d

    Deep

    03/07/2022, 1:48 PM
    Right
  • d

    Deep

    03/07/2022, 1:48 PM
    Would you guys be releasing any future updates to maybe tackle this?
  • d

    datajoely

    03/07/2022, 1:57 PM
I don't think this is a common enough issue for us to change the `SparkDataSet` implementation itself - we try to mirror the underlying API as much as possible. There are two simple ways to add this yourself. The easiest is to subclass `SparkDataSet` and override the `save()` method: you can copy our implementation and simply add those two lines from the screenshot below to the operation. You can see how to create a custom dataset here https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html There is also a route to doing this with a hook (https://kedro.readthedocs.io/en/latest/07_extend_kedro/02_hooks.html) but I think the dataset is easier
  • d

    datajoely

    03/07/2022, 2:14 PM
    does that make sense @User ? happy to coach you through it
  • d

    Deep

    03/07/2022, 2:16 PM
    Thanks @datajoely. This makes sense. I'll try and implement this method.
  • d

    Deep

    03/08/2022, 5:47 AM
$ kedro viz
2022-03-08 11:15:12,792 - kedro.framework.cli.hooks.manager - INFO - Registered CLI hooks from 1 installed plugin(s): kedro-telemetry-0.1.3
2022-03-08 11:15:14,928 - kedro_telemetry.plugin - INFO - You have opted into product usage analytics.
2022-03-08 11:15:16,887 - kedro_viz.integrations.pypi - INFO - Checking for update...
2022-03-08 11:15:17,227 - kedro.framework.session.store - INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
The system cannot find the path specified.
  • d

    datajoely

    03/08/2022, 10:16 AM
    this is unusual - is this all that gets exported?
  • d

    datajoely

    03/08/2022, 10:16 AM
    does the pipeline run without viz?
  • d

    Deep

    03/08/2022, 10:17 AM
    Yes
  • d

    Deep

    03/08/2022, 10:18 AM
    Found the error, it was related to Java environment variable.
  • d

    datajoely

    03/08/2022, 10:20 AM
    Okay good - yeah on reflection that's not coming from Kedro's logging
  • s

    Schoolmeister

    03/08/2022, 1:11 PM
How do you guys handle a variable number of outputs? For example, I want to write a leave-one-out cross-validation split node that accepts as input a dataset and a cross-validation configuration (containing the indexes where to split) and outputs, for each fold, a train/validation split. Each of these outputs should then be fed into a subsequent node that does inference. The "tricky" thing here, I think, is that the number of folds varies depending on the dataset.
  • s

    Schoolmeister

    03/08/2022, 1:26 PM
Inspired by the spaceflights modular pipelines code, I'd like to do something like this. But how do I get those `train_data` and `validation_data` outputs to create the subsequent pipelines with?
```python
cv_split_pipe = Pipeline(
    [
        node(
            func=nodes.cv_split,
            inputs=["data", "params:fold_config"],
            # train_data and validation_data are lists, one entry per fold
            outputs=["train_data", "validation_data"],
        )
    ]
)

# get the train_data and validation_data outputs somehow
train_data = []
validation_data = []

# build modular pipelines, one per fold
pipelines = []
for i, (train_set, validation_set) in enumerate(zip(train_data, validation_data)):
    pipelines.append(
        pipeline(
            pipe=new_inference_pipeline(),
            inputs=[train_set, validation_set],
            outputs={"y_pred": f"y_pred_{i}"},
        )
    )
final_pipeline = sum(pipelines)
```
  • d

    datajoely

    03/08/2022, 1:42 PM
@User honestly the easiest solution is to return the same number of outputs consistently, but make them return `None` if not necessary
  • d

    datajoely

    03/08/2022, 1:42 PM
    we can get into all sorts of creative workarounds but I think it's hard to get away from that view of things
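A minimal sketch of that suggestion, assuming an upper bound on the number of folds (all names here are illustrative, not Kedro API):

```python
from typing import Optional, Tuple

import pandas as pd

MAX_FOLDS = 5  # assumed upper bound on the number of CV folds


def cv_split(
    data: pd.DataFrame, fold_config: dict
) -> Tuple[Optional[pd.DataFrame], ...]:
    """Always return MAX_FOLDS train sets followed by MAX_FOLDS validation
    sets; folds that don't exist for this dataset come back as None."""
    folds = fold_config["folds"]  # e.g. a list of (train_idx, val_idx) pairs
    trains, vals = [], []
    for i in range(MAX_FOLDS):
        if i < len(folds):
            train_idx, val_idx = folds[i]
            trains.append(data.iloc[train_idx])
            vals.append(data.iloc[val_idx])
        else:
            trains.append(None)  # fold not needed for this dataset
            vals.append(None)
    return tuple(trains + vals)
```

The node declaration would then list a fixed set of output names in the same order, e.g. `[f"train_data_{i}" for i in range(MAX_FOLDS)] + [f"validation_data_{i}" for i in range(MAX_FOLDS)]`, and the downstream inference nodes have to tolerate `None` inputs.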
  • w

    williamc

    03/08/2022, 3:10 PM
    FWIW this is what I ended up writing. I side-stepped the issue I was having by delegating the spark dataset read to PySpark: https://gist.github.com/williamcaicedo/f5379a668fe0f59f5dcd02f57bffa369
  • d

    datajoely

    03/08/2022, 3:55 PM
Really neat - thanks for the update