beginners-need-help
  • d

    datajoely

    12/02/2021, 3:48 PM
    I think if you copy this it will fix it
  • d

    datajoely

    12/02/2021, 3:48 PM
    https://github.com/quantumblacklabs/kedro/pull/1058/files
  • p

    Piesky

    12/02/2021, 4:07 PM
    Unfortunately, after implementing these lines in
    conf/base/logging.yml
    the issue still persists
  • d

    datajoely

    12/02/2021, 4:09 PM
    Weird! I'm still waiting to see if @User has any idea.
  • d

    datajoely

    12/02/2021, 4:09 PM
    The only other thing I can suggest - put a breakpoint and inspect which loggers are in scope
  • d

    datajoely

    12/02/2021, 4:09 PM
    and then see if any aren't covered by logging.yml
  • p

    Piesky

    12/02/2021, 4:31 PM
The only handler is a StreamHandler. Also,
    info.log
    is not created until program termination, whether successful or not. It looks like the StreamHandler is maybe flushed to this file at the end?
  • d

    datajoely

    12/02/2021, 4:33 PM
    I'm really not sure - someone else will pick this up when they're next online
  • p

    Piesky

    12/02/2021, 4:34 PM
    No problem, thanks for the help and suggestions anyway!
  • a

    antony.milne

    12/02/2021, 5:02 PM
Hi @User, this happens because as soon as you execute a
    kedro
    command, **before conf/base/logging.yml is read**, kedro sets some default logging according to this config: https://github.com/quantumblacklabs/kedro/blob/master/kedro/config/logging.yml. As you can see, this includes
    info_file_handler
    which is what writes to info.log. If you're surprised by that, you're not the only one 😀 I only just realised this a couple of weeks ago and I don't think anyone else on the team was aware of it either! See https://github.com/quantumblacklabs/kedro/pull/1024
  • d

    datajoely

    12/02/2021, 5:02 PM
    Ah I got the wrong PR - I remembered you fixed that!
  • a

    antony.milne

    12/02/2021, 5:07 PM
Now if you want to make it so that the default kedro logging handlers aren't loaded at all, that isn't immediately straightforward without editing the kedro in your site-packages 😬 I think we are unlikely to fix this in kedro itself for a while, because (as I recently discovered) the way logging works is surprisingly intricate and tricky to fix properly. This is annoying, I know - I'd definitely like to improve this, but since no one has really noticed or complained about it before, it will be quite a low-priority thing I'm afraid, sorry! Let me think of some ways you might be able to hack something together for now that will prevent the creation of info.log without needing to modify kedro itself.
  • a

    antony.milne

    12/02/2021, 6:19 PM
OK, so it is possible to do this, but it's very awkward... In order to intercept the logging early enough in the process you'll need to use the (little-known and rarely used)
    kedro.init
    entrypoint, which means you'll need to make a pip-installable plugin. Here's a minimal example: https://github.com/AntonyMilneQB/kedro-disable-logging You can install it with
    pip install git+https://github.com/AntonyMilneQB/kedro-disable-logging.git
    This example is very brute force in that it calls
    logging.disable
    . You can definitely make it less aggressive and just remove the handlers you don't want instead in
    plugin.disable_logging
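    For illustration, the plugin boils down to something like this (a rough sketch; the file names and layout here are illustrative, not necessarily what the linked repo uses):
    python
    # plugin.py - loaded via the kedro.init entrypoint, so it runs before
    # kedro applies its default logging config
    import logging


    def disable_logging(*args, **kwargs):
        # brute force: suppress every log record, which stops
        # info_file_handler from ever writing anything to info.log
        logging.disable(logging.CRITICAL)
    with the entrypoint declared in setup.py:
    python
    # setup.py - registers the function above under the kedro.init group
    from setuptools import setup

    setup(
        name="kedro-disable-logging",
        py_modules=["plugin"],
        entry_points={"kedro.init": ["disable_logging = plugin:disable_logging"]},
    )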
  • i

    Ian Whalen

    12/03/2021, 12:12 AM
    Hey all! I'm working with
    APIDataSet
    and am having some issues getting the
    auth
    keyword argument working. My definition looks like this:
    yaml
    my_api:
      type: api.APIDataSet
      url: ${API_URL}
      auth:
        - "${USERNAME}"
        - "${PASSWORD}"
    but requests expects auth to be a
    tuple
    or
    HTTPBasicAuth
    . Sending in a list like this gives back
    'list' object not callable
    . I also tried giving auth
    !!python/tuple ["${USERNAME}, ${PASSWORD}"]
    but no dice there either since pyyaml is using the safe loader and tuples aren't allowed. Any ideas?
  • d

    datajoely

    12/03/2021, 8:58 AM
    Hi @User that tuple syntax will be available in the next version but we don't support it today. The quickest solution is to subclass and customise.
  • a

    antony.milne

    12/03/2021, 10:32 AM
For context: https://github.com/quantumblacklabs/kedro/issues/1011 🙂 But yeah, if it's just one case where you want to use a tuple then I'd recommend making your own
    APIDataSet
    that handles that. If you don't want to subclass it then you could instead hack together a hook that converts
    dataset._request_args["auth"]
    to a tuple
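    Roughly, such a hook could look like this (an untested sketch; it pokes at the private _data_sets and _request_args attributes, so treat it as a hack):
    python
    # hooks.py - coerce a list-style auth into the tuple that requests expects
    from kedro.extras.datasets.api import APIDataSet
    from kedro.framework.hooks import hook_impl


    class TupleAuthHooks:
        @hook_impl
        def after_catalog_created(self, catalog):
            for dataset in catalog._data_sets.values():  # private attribute
                if isinstance(dataset, APIDataSet):
                    auth = dataset._request_args.get("auth")
                    if isinstance(auth, list):
                        dataset._request_args["auth"] = tuple(auth)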
  • a

    antony.milne

    12/03/2021, 10:39 AM
This is a great question. I actually helped develop something exactly like this a while ago. I'd like to write it up properly in the docs as it's a great example of how you can use modular pipelines. We wanted a forecasting model that predicts something for 2020, and then the output of that is used as an input for the same pipeline running for 2021, which in turn feeds the pipeline for 2022, and so on for 10 years… It can be done with something like this in your pipeline registry:
    python
    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline

    base_pipeline = Pipeline([node(func, "input_data", "output_data")])
    # in reality base_pipeline would have many nodes

    all_pipelines = {}
    for year in range(2020, 2030):
        all_pipelines[f"year_{year}"] = pipeline(
            base_pipeline,
            outputs={"output_data": f"year_{year+1}.input_data"},
            namespace=f"year_{year}"
        )

    all_pipelines["all_years"] = sum(all_pipelines.values())
  • d

    dmb23

    12/03/2021, 12:58 PM
Quick follow-up on such a use of non-linear pipelines: I recently tried to find a way to make e.g. the years in such a construct (2020, 2030) accessible as config parameters, but did not find a good solution. I also understand from the many discussions on dynamic pipelines spread over GitHub and this Discord that defining the pipeline only at runtime like this is something you would rather discourage based on your experience, though I'm not entirely sure about that. Does anyone have any experience with defining pipeline structures in response to config like that?
  • a

    antony.milne

    12/03/2021, 2:22 PM
See https://github.com/quantumblacklabs/kedro/issues/750#issuecomment-823068296 (I know you're aware there have already been discussions on it... but this is really exactly what you're talking about :D). In short:
    * my controversial personal opinion: I sympathise with the need to generate pipelines on the fly and don't object to it
    * don't mix these "meta-parameters" with conf/base/parameters; instead put them in a new file like meta_parameters.yml - see the sketch below (or use .py or whatever suits you - if you're not using
    ConfigLoader
    there's no reason to be restricted to yaml if you don't want to)
    * environment variables are also quite good for this sort of thing, e.g. you'd do
    YEAR_START=2020 YEAR_END=2030 kedro run
    and then in the pipeline_registry you'd do
    range(int(os.getenv("YEAR_START")), int(os.getenv("YEAR_END")))
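    For example, a rough sketch of the meta_parameters.yml route (file name, location and keys are just placeholders):
    yaml
    # conf/base/meta_parameters.yml (hypothetical file)
    year_start: 2020
    year_end: 2030
    and then in pipeline_registry.py, reusing the year loop from the example above:
    python
    # pipeline_registry.py - sketch only; func is the same node function
    # used in the base_pipeline example earlier in the thread
    from pathlib import Path

    import yaml
    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline


    def register_pipelines():
        meta = yaml.safe_load(Path("conf/base/meta_parameters.yml").read_text())

        base_pipeline = Pipeline([node(func, "input_data", "output_data")])

        all_pipelines = {}
        for year in range(meta["year_start"], meta["year_end"]):
            all_pipelines[f"year_{year}"] = pipeline(
                base_pipeline,
                outputs={"output_data": f"year_{year + 1}.input_data"},
                namespace=f"year_{year}",
            )
        all_pipelines["all_years"] = sum(all_pipelines.values())
        return all_pipelines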
  • d

    dmb23

    12/03/2021, 2:28 PM
Thanks a lot, that's quite interesting to read! (and I didn't know this gem was hiding in that discussion, so thanks for the link 😄)
  • d

    dmb23

    12/03/2021, 2:29 PM
And as always: thanks a lot for everything you are doing with Kedro in general and this community in particular! I really appreciate all the great work and effort! ❤️
  • a

    antony.milne

    12/03/2021, 2:41 PM
    Thank you, I really appreciate that. Thanks for asking such good questions and being part of the community!
  • i

    Ian Whalen

    12/03/2021, 6:33 PM
I'm back! Another question on using APIDataSet (which is now my own subclass that just casts auth to a tuple if it's iterable). I want to do this in my `catalog.yml`:
    yaml
    api:
      type: MyAPIDataSet
      url: ...
      auth: 
        - ${username}
        - ${password}
where my credentials.yml has:
    yaml
    username: me
    password: my_password
    I also tried:
    catalog.yml
    yaml
    api:
      type: MyAPIDataSet
      url: ...
      auth: my_auth
    credentials.yml
    yaml
    my_auth:
      - me
      - my_password
    Am I not understanding how
    credentials.yml
    is supposed to work? Or is it just wonky when working with APIDataSet?
  • i

    Ian Whalen

    12/03/2021, 7:07 PM
So it was a little of both! I wasn't aware that credentials.yml values only ever get filled into
    credentials
    keys. I fixed this by adding a
    credentials
    kwarg to my subclass. Here's my class if anyone is interested:
    python
    from typing import Any, Dict, Iterable, List, Union
    from requests.auth import AuthBase
    from kedro.extras.datasets.api import APIDataSet
    
    class AuthorizableAPIDataSet(APIDataSet):
        def __init__(
            self,
            url: str,
            method: str = "GET",
            data: Any = None,
            params: Dict[str, Any] = None,
            headers: Dict[str, Any] = None,
            auth: Union[Iterable[str], AuthBase] = None,
            json: Union[List, Dict[str, Any]] = None,
            timeout: int = 60,
            credentials: Union[Iterable[str], AuthBase] = None,
        ) -> None:
            if credentials is not None and auth is not None:
                raise ValueError("Cannot specify both auth and credentials.")
    
            auth = credentials or auth
    
            if isinstance(auth, Iterable):
                auth = tuple(auth)
    
            super().__init__(
                url=url,
                method=method,
                data=data,
                params=params,
                headers=headers,
                auth=auth,
                json=json,
                timeout=timeout,
            )
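    With that class, the catalog entry from before would look something like this (the module path for the custom dataset is illustrative):
    yaml
    # catalog.yml
    api:
      type: my_project.extras.datasets.AuthorizableAPIDataSet
      url: https://example.com/api
      credentials: my_auth
    yaml
    # credentials.yml
    my_auth:
      - me
      - my_password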
  • t

    T.Komikado

    12/06/2021, 6:57 AM
Hello! I'm using a partitioned dataset as a node output. Is there any way to automatically clean up the output directory of the partitioned dataset before Kedro saves new data into it? I ask because old files that aren't overwritten cause errors if I forget to remove them before a new run.
  • a

    antony.milne

    12/06/2021, 9:42 AM
Hello! Currently the answer is no, you need to delete the files manually... But in the soon-to-be-released 0.17.6 the answer is yes! There will be a new
    overwrite
    option for `PartitionedDataSet` which does exactly that. By default this will be set to
    false
    (the current behaviour) but if you set it to
    true
    in your catalog.yml file it will delete all the old files before saving new data.
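    As a sketch, once 0.17.6 is out the catalog entry would look something along these lines (the dataset name, path and underlying dataset type are placeholders):
    yaml
    my_partitions:
      type: PartitionedDataSet
      path: data/07_model_output/partitions
      dataset: pandas.CSVDataSet
      overwrite: true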
  • a

    antony.milne

    12/06/2021, 9:43 AM
    If you need this right now it's actually pretty easy to add - just make your own custom
    MyPartitionedDataSet
    class which has this code in it: https://github.com/quantumblacklabs/kedro/blob/ae80b129cb4f1973a554af964d48d8af0c355bb9/kedro/io/partitioned_data_set.py
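    Something along these lines, roughly (an untested sketch based on the linked file, not the exact kedro implementation; it relies on the private _filesystem and _normalized_path attributes):
    python
    # my_partitioned_data_set.py - delete old partitions before saving new ones
    from kedro.io import PartitionedDataSet


    class MyPartitionedDataSet(PartitionedDataSet):
        def _save(self, data):
            # wipe the whole partition directory, then save as usual
            if self._filesystem.exists(self._normalized_path):
                self._filesystem.rm(self._normalized_path, recursive=True)
            super()._save(data)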
  • a

    antony.milne

    12/06/2021, 9:43 AM
    If you do give it a go then it would be a useful test actually to see if it's working before we release it!!
  • t

    T.Komikado

    12/06/2021, 11:25 AM
    Thank you very much for your help. I have tried the above code, and it works well! I'm looking forward to the 0.17.6 release!
  • i

    Isaac89

    12/09/2021, 2:49 PM
Hi! I was reading in the documentation that, thanks to fsspec, SSH can also be used. Does anyone have an example of a data catalog entry using SSH? How is the path to the file defined in the SSH case?
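    (For anyone finding this later: an untested sketch of what such an entry might look like, following the general fsspec pattern - the SFTP filesystem needs paramiko installed, and the host, path and credential names below are placeholders.)
    yaml
    # catalog.yml
    remote_data:
      type: pandas.CSVDataSet
      filepath: sftp:///path/on/remote/host/data.csv
      credentials: cluster_credentials
    yaml
    # credentials.yml
    cluster_credentials:
      username: my_username
      host: remote-host.example.com
      port: 22
      password: my_password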