beginners-need-help
  • adrian (06/03/2022, 3:14 PM)
    Hello :) I was wondering whether someone could help me with my current Kedro question: I have a function which takes two args, a and b, and I would like to wrap it as a node. The problem I am facing is that a is meant to be a dataset from my catalog, but b is meant to be a list of literal strings. In the pipeline.py file, when defining the node, I can't work out how to define the inputs kwarg.
  • adrian (06/03/2022, 3:41 PM)
    I found a fix for now: I add a node before the node in question that outputs the hard-coded strings I need as a MemoryDataSet... I'll see if I can use the params: syntax instead. Not sure whether this allows me to pass sequences of str literals.
  • JA_next (06/03/2022, 4:26 PM)
    What about defining b in the parameters YAML?
  • adrian (06/03/2022, 4:28 PM)
    So, I want b to be a tuple of strings, or a list of strings. How can I define this in the parameters YAML? The examples I find online only look like nested dicts.
  • bgereke (06/03/2022, 4:31 PM)
    You're going to want something like:
    ```yaml
    yam_str:
      - yam_1
      - yam_2
      - yam_n
    ```
    You can then pass params:yam_str to your pipeline node and the arg will receive a list like ["yam_1", "yam_2", "yam_n"].
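    For reference, a minimal sketch of the node wiring this implies; the dataset names and the function are hypothetical, not from the thread:
    ```python
    # pipeline.py -- sketch only; "my_input_data"/"my_output_data" are made up
    from kedro.pipeline import Pipeline, node


    def make_yams(a, b):
        # a: loaded from the catalog entry "my_input_data"
        # b: the list ["yam_1", "yam_2", "yam_n"] from parameters.yml
        ...


    def create_pipeline(**kwargs):
        return Pipeline(
            [
                node(
                    func=make_yams,
                    inputs=["my_input_data", "params:yam_str"],
                    outputs="my_output_data",
                ),
            ]
        )
    ```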
  • adrian (06/03/2022, 4:37 PM)
    Thank you so much! It's exactly what I needed. It worked
  • bgereke (06/03/2022, 4:40 PM)
    awesome!
  • datajoely (06/06/2022, 9:37 AM)
    @bgereke thanks for helping out! You have been upgraded to status 🙂
  • vivecalindahl (06/08/2022, 10:16 AM)
    Hi! Is there a fundamental reason why versioning is not supported for PartitionedDataSet, or is it more a matter of some added complexity in implementing it? We're considering using the Kedro versioning feature, and it's a slight annoyance that one of the datasets we use regularly needs to be managed differently.
  • datajoely (06/08/2022, 10:32 AM)
    I think we've always been wary of the combinatorial complexity. One approach people have opted for is to use S3 or Delta table versioning provided by the filesystem rather than Kedro.
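    For context, Kedro-managed versioning is the versioned flag on a catalog entry, which PartitionedDataSet does not accept; a hedged sketch with made-up names and paths:
    ```yaml
    # catalog.yml -- names and paths are illustrative
    model_input:
      type: pandas.CSVDataSet
      filepath: s3://my-bucket/model_input.csv
      versioned: true  # Kedro-managed versioning works here...

    partitions:
      type: PartitionedDataSet  # ...but has no `versioned` flag; one workaround
      path: s3://my-bucket/partitions/  # is S3 object versioning on the bucket
      dataset: pandas.CSVDataSet
    ```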
  • datajoely (06/08/2022, 10:33 AM)
    We could also add it, but it's not something users have been demanding very loudly.
  • noklam (06/08/2022, 10:53 AM)
    I used to have the same problem, and I love Delta versioning (though it may be quite difficult for Kedro to handle), since versioning the entire directory is quite inefficient, and I gave up.
  • vivecalindahl (06/08/2022, 2:32 PM)
    Just to be clear, by "use S3", do you mean using an S3 bucket with versioning enabled? I'd love to hear what people use and like in practice. I know of DVC of course, but we weren't 100% convinced we needed to go there. @noklam You "used to have the same problem", meaning you use Delta table versioning?
  • noklam (06/08/2022, 11:35 PM)
    No, in my case it wasn't an input but some intermediate file, so I ended up not versioning it at all.
  • inigohrey (06/09/2022, 10:24 AM)
    Hi, is there any "dry-run"-like functionality within Kedro? Sometimes we want to test-run a pipeline without overwriting node outputs. Maybe with a hook we could modify the dataset types. The functionality I'm talking about would also be achievable by renaming the inputs and outputs, since that would force all the datasets to be MemoryDataSets, but it isn't very clean.
  • inigohrey (06/09/2022, 10:30 AM)
    Potentially we could also define a debug environment and redefine each dataset we don't want to overwrite as a MemoryDataSet, since we would still need the original input datasets defined (see the sketch below). But I'm wondering if anybody has had a similar need and has found an efficient way to work this way.
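    A hedged sketch of that environment idea, assuming a dataset named model_output that the base catalog persists to disk:
    ```yaml
    # conf/debug/catalog.yml -- overrides the base entry when running
    # `kedro run --env debug`; "model_output" is a made-up dataset name
    model_output:
      type: MemoryDataSet
    ```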
  • inigohrey (06/09/2022, 10:41 AM)
    https://github.com/kedro-org/kedro/issues/1160#issuecomment-1023966910 This is similar to what we want, just the inverse: the OP wanted to save additional datasets when debugging and not when running normally. The difference for what I am looking for is that the base catalog.yml is always loaded, so we would have to explicitly redefine the datasets as MemoryDataSets one by one, right?
  • datajoely (06/09/2022, 11:57 AM)
    There is a dry runner example here: https://kedro.readthedocs.io/en/stable/nodes_and_pipelines/run_a_pipeline.html#custom-runners
  • inigohrey (06/09/2022, 1:18 PM)
    Thanks @datajoely! I think that lists all the nodes without actually running them, but maybe we could adapt it to replace certain tagged nodes' inputs/outputs. Though I don't know which would be 1. the most "kedro-friendly" way or 2. the easiest way, since the method in the issue I linked would require us to make changes in two separate catalogs whenever anything changes, while the custom runner requires more development and additional testing.
  • datajoely (06/09/2022, 2:17 PM)
    @antony.milne is this correct? I'm pretty sure DryRunner executes, it just doesn't save.
  • inigohrey (06/09/2022, 2:53 PM)
    I think a combination of this DataCatalog method and the after_catalog_created hook might be what we're looking for.
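    A hedged sketch of that combination; which datasets to swap, and the hook class name, are illustrative:
    ```python
    # hooks.py -- re-register selected datasets as MemoryDataSets after the
    # catalog is created, so a test run never writes them to disk
    from kedro.framework.hooks import hook_impl
    from kedro.io import DataCatalog, MemoryDataSet


    class DryRunHooks:
        @hook_impl
        def after_catalog_created(self, catalog: DataCatalog) -> None:
            for name in ["model_output"]:  # hypothetical dataset names
                catalog.add(name, MemoryDataSet(), replace=True)
    ```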
  • inigohrey (06/09/2022, 2:54 PM)
    I might be missing something, but the implementation in that example and its description both point to it only listing the nodes without actually running them.
  • noklam (06/09/2022, 3:46 PM)
    No, it doesn't execute. From the docstring itself:
    > """``DryRunner`` is an ``AbstractRunner`` implementation. It can be used to list which
    > nodes would be run without actually executing anything.
    > """
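    For reference, the linked docs example is roughly this shape (paraphrased, not verbatim; the exact _run signature varies across Kedro versions):
    ```python
    from kedro.io import AbstractDataSet, DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline
    from kedro.runner.runner import AbstractRunner


    class DryRunner(AbstractRunner):
        """Lists which nodes would be run, without executing anything."""

        def create_default_data_set(self, ds_name: str) -> AbstractDataSet:
            return MemoryDataSet()

        def _run(self, pipeline: Pipeline, catalog: DataCatalog, session_id=None):
            # Only logs the plan; no node is ever executed, hence "dry run"
            self._logger.info(
                "Actual run would execute %d nodes:\n%s",
                len(pipeline.nodes),
                pipeline.describe(),
            )
    ```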
  • datajoely (06/09/2022, 3:47 PM)
    ah my mistake
  • wulfcrona (06/13/2022, 9:04 AM)
    Hello, I've run into a weird issue with partitioned datasets on AWS S3: for some reason Kedro adds a non-existent file to the load dict (with key ''), and this causes a load error. A simple key check solves it, but I'd rather solve it properly. Any ideas on what might be causing this?
    {'': <bound method AbstractVersionedDataSet.load of >, '20220609112232.csv': <bound method AbstractVersionedDataSet.load of >}
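    The key check mentioned could look like this in the consuming node (a sketch; the function name is made up):
    ```python
    import pandas as pd


    def concat_partitions(partitions: dict) -> pd.DataFrame:
        # Each value is a load callable supplied by PartitionedDataSet;
        # skip the spurious '' key that shows up in some S3 listings
        frames = [load() for key, load in partitions.items() if key]
        return pd.concat(frames, ignore_index=True)
    ```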
  • jaweiss2305 (06/14/2022, 5:13 PM)
    Hey, does anyone know how to add credentials for pandas BigQuery? https://github.com/kedro-org/kedro/blob/develop/kedro/extras/datasets/pandas/gbq_dataset.py I am trying to add the service account file to credentials.yml (not having much luck).
  • noklam (06/15/2022, 10:05 AM)
    It should have the same API as pandas. Could you connect to the data with just pd.read_gbq?
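    That is, a quick connectivity check along these lines (project and table names are placeholders):
    ```python
    # Requires pandas-gbq; assumes credentials are discoverable, e.g. via the
    # GOOGLE_APPLICATION_CREDENTIALS environment variable
    import pandas as pd

    df = pd.read_gbq(
        "SELECT * FROM my_bq_dataset.my_table LIMIT 5",
        project_id="my-gcp-project",
    )
    print(df.head())
    ```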
  • noklam (06/15/2022, 10:05 AM)
    Does the doc help? https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.pandas.GBQTableDataSet.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html
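    A hedged catalog sketch based on those docs (all names are made up). One common approach for a service-account file, not confirmed in this thread, is to leave credentials unset and export GOOGLE_APPLICATION_CREDENTIALS pointing at the JSON key:
    ```yaml
    # catalog.yml -- illustrative names only
    vehicles:
      type: pandas.GBQTableDataSet
      dataset: my_bq_dataset
      table_name: my_table
      project: my-gcp-project
      # credentials: gbq_creds  # or rely on GOOGLE_APPLICATION_CREDENTIALS
    ```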
  • JA_next (06/15/2022, 5:17 PM)
    Question: catalog.add() does not seem to change the local config file.
    Description:
    1. In a Jupyter notebook, import the existing catalog.yaml to load all data (works).
    2. Use catalog.add and save a new dataframe (works: the data is in catalog.list() and the DF can be found on disk).
    3. Check the local catalog.yaml file: the new item cannot be found.
  • avan-sh (06/15/2022, 6:41 PM)
    That is definitely the expected outcome. The catalog object available from a Jupyter notebook is only for development, and any additions from the notebook are not reflected automatically in the local file. You'll have to manually add the config to the catalog.yaml or parameters.yaml file.
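    That is, persisting the dataset means adding an entry by hand along these lines (type and path are illustrative):
    ```yaml
    # conf/base/catalog.yml
    my_new_dataframe:
      type: pandas.CSVDataSet
      filepath: data/02_intermediate/my_new_dataframe.csv
    ```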