Powered by Linen
beginners-need-help
  • n

    noklam

    05/24/2022, 11:00 AM
    I am not sure I understand the problem here?
  • n

    noklam

    05/24/2022, 11:02 AM
    You can always use a more functional approach like map, but a for loop is fine too.
  • n

    noklam

    05/24/2022, 11:04 AM
    I think you mentioned your problem is with a big dataset. The problem with pandas is that it is memory hungry, especially during I/O and certain operations. Using the chunk args helps to mitigate this by loading and processing only a small batch of data at a time and stitching the results together at the end. If the new dataset already iterates through the entire dataset before you start applying any transformation logic, then it doesn't help your memory problem.
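A minimal sketch of the chunked approach described above, assuming the "chunk args" refer to the `chunksize` argument of `pandas.read_csv`. The in-memory CSV and the `sum_values_in_chunks` helper are hypothetical stand-ins for a real large file and real transformation logic.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a file too large to load at once.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))


def sum_values_in_chunks(source, chunksize):
    """Aggregate the 'value' column one chunk at a time."""
    total = 0
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file into memory.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += int(chunk["value"].sum())
    return total


total = sum_values_in_chunks(io.StringIO(csv_text), chunksize=3)
```

Only one chunk is held in memory at a time, which is why the benefit disappears if the whole dataset is materialised before the transformation starts.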
  • m

    Mackson

    05/24/2022, 11:08 AM
    message has been deleted
  • n

    noklam

    05/24/2022, 11:18 AM
    I can understand the logic, could you repeat what the question is?
  • d

    datajoely

    05/24/2022, 11:22 AM
    Oh, you shouldn't do the writing yourself in your node; this is what PartitionedDataSet is for.
  • m

    Mackson

    05/24/2022, 11:24 AM
    What type of DataSet can deal with a huge dataset that does not fit in memory?
  • d

    datajoely

    05/24/2022, 11:25 AM
    So if you return callables to PartitionedDataSet, it will do the saving in a lazy way.
  • d

    datajoely

    05/24/2022, 11:25 AM
    There are examples in the docs
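A rough sketch of the "return callables" idea, assuming the node's output is mapped to a PartitionedDataSet in the catalog. The `split_by_group` function, the `group` column and the partition names are hypothetical. The final loop only imitates what the dataset would do at save time; it is not Kedro code.

```python
import pandas as pd


def split_by_group(df):
    """Return {partition_id: callable}; each callable builds one partition on demand."""

    def make_builder(key):
        # Bind `key` now so every callable filters its own group.
        return lambda: df[df["group"] == key].reset_index(drop=True)

    return {f"group_{key}": make_builder(key) for key in sorted(df["group"].unique())}


df = pd.DataFrame({"group": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})
partitions = split_by_group(df)

# Stand-in for the save step: each partition is built only when its callable is invoked.
for name, build in partitions.items():
    part = build()
    print(name, len(part))
```

In this toy version the source frame is already in memory. The memory saving comes when each callable loads or computes its own partition rather than slicing an existing frame.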
  • n

    noklam

    05/24/2022, 11:29 AM
    I would probably add that choosing a format other than CSV is better, unless you have to stick with CSV.
  • m

    Mackson

    05/24/2022, 11:30 AM
    Which one do you recommend?
  • n

    noklam

    05/24/2022, 11:31 AM
    Parquet/feather? @datajoely will be the right person to answer that
  • m

    Mackson

    05/24/2022, 11:32 AM
    @noklam @datajoely thanks a LOT!
  • d

    datajoely

    05/24/2022, 2:32 PM
    Team Parquet here!
  • n

    noklam

    05/24/2022, 2:44 PM
    Just curious if you have any experience with arrow. It's something on my radar but I have never used it 😅
  • d

    datajoely

    05/24/2022, 2:46 PM
    Arrow is the modern engine that Parquet can leverage, so they're complementary rather than substitutes. This article by the creator of Pandas on the topic is one of my favourites: https://wesmckinney.com/blog/apache-arrow-pandas-internals/
  • n

    noklam

    05/24/2022, 2:54 PM
    Yes, I have read this article a couple of times. I think you are referring to Arrow as an in-memory columnar data structure in this context. Feather is kind of like the mapping of this data structure onto disk as storage, but I haven't seen it used widely.
  • l

    Lazy2PickName

    05/25/2022, 2:01 PM
    Hi all, if I want to pass a parameter to my node I can do so by writing:
    node(
        func=foo,
        inputs=['input', 'params:parameter'],
        outputs='output'
    ),
    Is there something similar I can do to pass a credential from credentials.yml to my node?
  • d

    datajoely

    05/25/2022, 2:03 PM
    We don't support this by design. You could put a parameters.yml in your local folder, but typically your nodes should focus on data flow, not IO.
  • l

    Lazy2PickName

    05/25/2022, 2:04 PM
    thanks
  • w

    waylonwalker

    05/25/2022, 4:06 PM
    Hey kedroids, looking for a code style opinion here, is there any way to pass inputs directly into pd.concat without an extra function?
    node(lambda *frames: pd.concat(frames), ["cars", "cars"], "two_cars")
    How would you concatenate pandas dataframes?
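One possible alternative to the lambda is a small named function, sketched here with plain pandas so that it runs without Kedro. The `concat_frames` name, the sample `cars` frame and the `ignore_index=True` choice are assumptions for illustration; the original lambda keeps the source indexes.

```python
import pandas as pd


def concat_frames(*frames):
    """Concatenate any number of DataFrames row-wise."""
    # ignore_index=True renumbers the rows; drop it to keep the original indexes.
    return pd.concat(frames, ignore_index=True)


cars = pd.DataFrame({"model": ["a", "b"], "mpg": [21.0, 30.5]})
two_cars = concat_frames(cars, cars)
```

Because it accepts `*frames`, the function could stand in for the lambda in the node definition above with the same list of inputs.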
  • a

    antony.milne

    05/25/2022, 4:17 PM
    Cunning pd.concat
  • h

    hello_world

    05/26/2022, 3:39 PM
    Hello, how can I catch a DataSetError? I tried "except kedro.io.data_catalog.DataSetError" but then I get NameError: name 'kedro' is not defined...
  • d

    datajoely

    05/26/2022, 3:41 PM
    I think you just need to do
    from kedro.io.data_catalog import DataSetError
    before you do this
  • d

    datajoely

    05/26/2022, 3:42 PM
    + do
    except DataSetError
    rather than the full classpath
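The NameError comes from Python rather than Kedro: an except clause can only name a class that has been imported into the current module. A minimal sketch of the same pattern, using a standard-library exception as a stand-in for DataSetError so that it runs without Kedro (the `load_or_default` helper is hypothetical):

```python
import json

# Stand-in for `from kedro.io.data_catalog import DataSetError`:
# the exception class must be bound to a name before `except` can refer to it.
from json import JSONDecodeError


def load_or_default(text, default=None):
    """Return parsed JSON, or `default` if the text is not valid JSON."""
    try:
        return json.loads(text)
    except JSONDecodeError:  # the short name works because it was imported above
        return default
```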
  • h

    hello_world

    05/26/2022, 3:43 PM
    it worked, thank you!
  • j

    JA_next

    05/26/2022, 10:49 PM
    I have a quick question on experiment logging: for example, if I want to generate 10 models, I understand I can log all 10 hyperparameters in the JSON. How can I save the models as 10 pickle files and have them appended to the config.yaml file?
  • d

    datajoely

    05/27/2022, 9:30 AM
    Hey, so there are a few things here:
    - To reuse the same pipeline multiple times, I would recommend reading our modular pipeline docs: https://kedro.readthedocs.io/en/stable/nodes_and_pipelines/modular_pipelines.html
    - To log parameters, you should use the experiment tracking features: https://kedro.readthedocs.io/en/stable/tutorial/set_up_experiment_tracking.html
    - To save pickles, you can find an example in the catalog docs, just search for pickle.PickleDataSet: https://kedro.readthedocs.io/en/stable/data/data_catalog.html
  • n

    noklam

    05/27/2022, 9:37 AM
    This reminds me of a question that I used to have. When doing a hyperparameter search, I need to train K models (a dynamic number, could be 10 or 20), then later in the pipeline pick the best model to continue with. What's the best way to handle the K entries in the catalog?
  • d

    datajoely

    05/27/2022, 10:23 AM
    I think a hook, but it's not ideal.