Powered by Linen
beginners-need-help
  • WolVez (03/15/2022, 4:02 PM)
    Would love that.
  • lbonini (03/15/2022, 8:44 PM)
    Hello guys! Is there an easy way to export a file from the catalog to a dynamic URL like:
    ```yaml
    example_csv: # Input
      type: pandas.ParquetDataSet
      filepath: "s3://my_bucket/yyyy-mm-dd/*.parquet"
    ```
    I need to save it inside a `today()` folder... I was wondering about doing this with `TemplatedConfigLoader`, using environment variables, but I need to know if it is simpler than I expect...
  • datajoely (03/15/2022, 8:46 PM)
    Hello, you may be able to use `PartitionedDataSet` to do this with pandas. Spark will do this automatically.
  • lbonini (03/15/2022, 8:51 PM)
    Is there another way to do this?
  • datajoely (03/15/2022, 8:54 PM)
    I think you just wrap your dataset one level higher
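To illustrate what "wrapping one level higher" could look like: a minimal sketch (function and key naming are assumptions, not Kedro API) of a node that returns a dict keyed by today's date, so that a `PartitionedDataSet` writes each key as a separate partition under the dataset's base path:

```python
from datetime import date


def partition_by_today(df):
    """Hypothetical node output for a PartitionedDataSet: each dict key
    becomes a partition id appended to the dataset's base path, so keying
    by today's date yields a dated file/folder per run."""
    return {date.today().isoformat(): df}
```

With a catalog entry whose `path` is `s3://my_bucket` and `filename_suffix: "_part.parquet"`, the key would end up as something like `s3://my_bucket/2022-03-15_part.parquet`.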
  • DIVINE (03/15/2022, 8:54 PM)
    Hello! What's the best practice to update micro-packages between projects? For example, let's say I develop a pipeline P-1 in a project P-A, then micro-package it into a project P-B, and then I find out that I need to update the pipeline (for example, for a bugfix). Should I update the micro-package manually and then publish it? Should I do it in project P-A or in project P-B?
  • datajoely (03/15/2022, 8:57 PM)
    Hello! This is an advanced question, but I have a couple of thoughts here. (1) You could designate one of these the "lead" project, where consuming projects just take things as-is. (2) You can set up CI rules to let you know when your versions get stale compared to the artifactory/repo where the 'latest' lives.
  • datajoely (03/15/2022, 8:58 PM)
    Either way, it's somewhat of a team workflow question rather than one I think we have a specific view on.
  • antony.milne (03/16/2022, 9:32 AM)
    FYI @User we are currently thinking about improvements to our interactive Jupyter notebook workflow, how users should debug pipelines, etc. Sounds very relevant to what you're trying to do so please do leave suggestions, comments, feature requests etc. here 🙂 https://github.com/kedro-org/kedro/issues/1075
  • vivecalindahl (03/16/2022, 10:05 AM)
    Hi all! I'd like to parameterize my data catalog using an environment variable. For instance, say I have an environment variable `FILEPATH=/path/to/file/`, and in my data catalog I'd want something like
    ```yaml
    example_iris_data:
      type: pandas.CSVDataSet
      filepath: "${params:filepath}"
    ```
    Then I could run Kedro as `kedro run --params filepath=$FILEPATH`. I'm aware of the `TemplatedConfigLoader` (https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#template-configuration). But is there a way of skipping the extra loop through the config file `globals.yaml`? I haven't been able to figure out how to do what I outlined above, i.e. using the parameters dict directly. Basically, it would be nice to provide the env variable directly, without first creating a `globals.yaml` containing a single variable.
  • avan-sh (03/16/2022, 12:54 PM)
    The `register_config_loader` hook (https://kedro.readthedocs.io/en/latest/kedro.framework.hooks.specs.RegistrationSpecs.html#kedro.framework.hooks.specs.RegistrationSpecs.register_config_loader) also gets all the extra_params you'll be passing from the command line. You can use that to add filepath to globals_dict, and your filepath placeholder will then be `${filepath}`. LMK if I'm not clear; I can put together a quick snippet later if you want.
  • datajoely (03/16/2022, 12:57 PM)
    @User yeah this is possible by tweaking the config loader like @User suggests - you also make env variables available as $vars this way: https://discord.com/channels/778216384475693066/846330075535769601/950324995752620052
  • vivecalindahl (03/16/2022, 1:15 PM)
    A snippet would help to understand that approach, as I'm not sure how to modify the hook. I just tried @User's suggestion, and that works nicely, so my question is answered. Thanks!
  • avan-sh (03/16/2022, 1:25 PM)
    An approach like this should work as well. One extra note: your runs might fail if you don't have a default value, even when you're not using those items.
    ```python
    from typing import Any, Dict, Iterable

    from kedro.config import TemplatedConfigLoader
    from kedro.framework.hooks import hook_impl


    class ProjectHooks:
        @hook_impl
        def register_config_loader(
            self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any]
        ) -> TemplatedConfigLoader:
            globals_dict = {}
            # extra_params can be empty when no --params are passed,
            # so keep a default value for the placeholder
            globals_dict["filepath"] = (extra_params or {}).get("filepath", "default_value")
            return TemplatedConfigLoader(conf_paths, globals_dict=globals_dict)
    ```
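For reference, a catalog entry consuming that `filepath` global would look roughly like this (the dataset name and value are placeholders, following the question above):

```yaml
# catalog.yml
example_iris_data:
  type: pandas.CSVDataSet
  filepath: "${filepath}"
```

and the run command would then be `kedro run --params filepath=$FILEPATH`, with no `globals.yaml` needed for this variable.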
  • vivecalindahl (03/16/2022, 1:25 PM)
    I'll give that alt a go too.
  • lbonini (03/16/2022, 1:53 PM)
    Why am I getting access denied?
    ```yaml
    # catalog.yml
    example_data:
      type: PartitionedDataSet
      dataset: pandas.ParquetDataSet
      credentials: dev_s3
      path: s3://bucket/path/to/folder
      filename_suffix: "_part.parquet"
    ```
    ```yaml
    # credentials.yml
    dev_s3:
      client_kwargs:
        aws_access_key_id: xxxx
        aws_secret_access_key: xxx
    ```
  • datajoely (03/16/2022, 1:53 PM)
    If the error is coming from boto3 underneath, then it's a credentials issue, not Kedro.
  • lbonini (03/16/2022, 2:07 PM)
    I was using the same credentials to read, but noticed they don't have permission to write 😅
  • gui42 (03/17/2022, 2:23 AM)
    Guys, how are you all? I've been using Kedro for the past few months and I've really been enjoying it 😄
  • gui42 (03/17/2022, 2:23 AM)
    Quick question: is there a way of logging dataset sizes elegantly, rather than writing a logging.info statement per node?
  • datajoely (03/17/2022, 9:17 AM)
    This is the right time to use hooks! It would be very similar to the memory profiling example here: https://kedro.readthedocs.io/en/latest/07_extend_kedro/02_hooks.html#hooks-examples
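A rough sketch of such a hook, under stated assumptions: `after_dataset_loaded`/`after_dataset_saved` are the relevant Kedro hook specs; the `@hook_impl` decorator and registration are noted in comments so the sketch stays self-contained, and the size heuristic (`describe_size`) is an invented helper, not a Kedro API:

```python
import logging
import sys

logger = logging.getLogger(__name__)


def describe_size(data):
    """Rough size estimate (a hypothetical helper): pandas DataFrames
    report exact byte counts; anything else gets a shallow estimate."""
    try:
        return f"{int(data.memory_usage(deep=True).sum())} bytes"
    except AttributeError:
        return f"~{sys.getsizeof(data)} bytes (shallow)"


class DatasetSizeHooks:
    # In a real project, decorate both methods with @hook_impl
    # (from kedro.framework.hooks) and register the class as a project hook.
    def after_dataset_loaded(self, dataset_name, data):
        logger.info("Loaded %s (%s)", dataset_name, describe_size(data))

    def after_dataset_saved(self, dataset_name, data):
        logger.info("Saved %s (%s)", dataset_name, describe_size(data))
```

This logs a size line for every dataset in one place, instead of a `logging.info` call inside each node.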
  • Schoolmeister (03/17/2022, 9:37 AM)
    Can you reference parameters in a list by their index? For example, if I had a file `parameters.yml` with the following contents:
    ```yaml
    folds:
      timeseries1:
        - start: 2021-08-24 15:00:00+00:00
          end: 2021-10-22 03:05:00+00:00
        - start: 2021-10-22 03:10:00+00:00
          end: 2021-12-28 05:00:00+00:00
        - start: 2021-12-28 05:05:00+00:00
          end: 2022-01-28 12:00:00+00:00
    ```
    Can I reference the second [start, end] pair? I've tried something like `params:folds.timeseries1.1` or `params:folds.timeseries1[1]`, but that doesn't work.
  • datajoely (03/17/2022, 9:38 AM)
    Unfortunately not. You would have to either do the indexing in Python or, if you really wanted to, make the keys the index yourself.
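The first option, sketched minimally (node name and signature are hypothetical): pass the whole `params:folds` dict into the node and do the list indexing there:

```python
def select_fold(folds, index):
    """Hypothetical node: receives the whole params:folds dict and
    picks one [start, end] window by list position."""
    return folds["timeseries1"][index]


# Mirroring the parameters.yml above:
folds = {
    "timeseries1": [
        {"start": "2021-08-24 15:00:00+00:00", "end": "2021-10-22 03:05:00+00:00"},
        {"start": "2021-10-22 03:10:00+00:00", "end": "2021-12-28 05:00:00+00:00"},
        {"start": "2021-12-28 05:05:00+00:00", "end": "2022-01-28 12:00:00+00:00"},
    ]
}
second = select_fold(folds, 1)  # the second [start, end] pair
```

The second option is to re-key the YAML with explicit names (e.g. `fold_0:`, `fold_1:`, ...) so that a dotted reference like `params:folds.timeseries1.fold_1` resolves.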
  • Schoolmeister (03/17/2022, 9:40 AM)
    Ok, thanks for the extremely quick reply. I'll simply use the second method.
  • lbonini (03/17/2022, 3:52 PM)
    Hello, I was just wondering if it's normal for Kedro to execute hooks.py twice during `kedro run`.
  • datajoely (03/17/2022, 3:57 PM)
    It is not! IIRC there is an old version where this happened. What version are you running?
  • lbonini (03/17/2022, 3:57 PM)
    '0.17.7'
  • datajoely (03/17/2022, 4:04 PM)
    So that's unexpected. Are you using custom hooks? What logging message do you see?
  • lbonini (03/17/2022, 4:05 PM)
    I just print the conf paths and logs show it twice... I'll check the plugins and reinstall everything
• datajoely (03/17/2022, 4:08 PM)
Wait, how are you printing the conf paths? That should be done for you.
• lbonini (03/17/2022, 4:09 PM)
```python
class ProjectHooks:

    @hook_impl
    def register_config_loader(
        self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any],
    ) -> TemplatedConfigLoader:
        print(conf_paths)
```
• datajoely (03/17/2022, 4:10 PM)
ah gotcha
so a normal kedro run should log like this
I think that the config loader is loaded directly after that once
Okay, I wasn't aware of this
we actually register the config loader a few times
it's used for a couple of things behind the scenes, and each plug-in gets access to it
so if you have viz and telemetry installed, that's twice already
taught me something I didn't know today
• lbonini (03/17/2022, 4:18 PM)
look:
three times now 😅
• datajoely (03/17/2022, 4:19 PM)
yeah it's valid
there is also very little performance penalty
as it doesn't do anything on its own
it needs to be called by other things
• lbonini (03/17/2022, 4:20 PM)
downgraded to 0.17.6
• datajoely (03/17/2022, 4:20 PM)
yeah it's not a bug
• lbonini (03/17/2022, 4:20 PM)
ok, just wanted to be sure it was normal