Powered by Linen
advanced-need-help
  • r

    rafael.gildin

    09/24/2022, 9:50 PM
    Thank you @antheas ! I’ll try this out
  • f

    farisology

    09/26/2022, 9:20 AM
    I am trying to orchestrate my pipelines via Airflow (deployed on an EC2 instance) and I can't find helpful info apart from Astronomer's. Is there any resource for my case? I would appreciate any assistance.
  • d

    datajoely

    09/26/2022, 9:22 AM
    We have a native plugin if you want to convert Kedro nodes to Airflow tasks: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow
  • d

    datajoely

    09/26/2022, 9:23 AM
    But in many cases the granularity of a Kedro node is finer than that of an Airflow task, and you may just want to put a whole pipeline in a single task
  • f

    farisology

    09/26/2022, 9:23 AM
    Thank you, will explore this further
  • r

    rafael.gildin

    09/27/2022, 2:54 PM
    kedrosequential_runner.py at main · ked...
  • v

    Vici

    09/27/2022, 3:12 PM
    Custom DataSet for larger than memory data -- dask, SQL, other?
  • u

    user

    10/03/2022, 1:34 PM
    How to run a Kedro pipeline interactively, like a function: https://stackoverflow.com/questions/73936203/how-to-run-a-kedro-pipeline-interactively-like-a-fuction
  • b

    Barros

    10/03/2022, 3:13 PM
    I have achieved this using sklearn - I just pass the object name as a string in the parameters.yml and then I parse it with getattr() to get the specific object
  • b

    Barros

    10/03/2022, 3:14 PM
    I just have to import the module first
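The getattr() pattern Barros describes can be sketched as follows. This is a minimal sketch, assuming a parameters.yml entry that holds a class name as a string; `instantiate_from_name` and the example names are illustrative, not from the thread.

```python
# Sketch of resolving an object from a string name stored in parameters.yml.
# The module must be importable first ("I just have to import the module first").
import importlib


def instantiate_from_name(module_name: str, class_name: str, **kwargs):
    """Import a module by name, then resolve and instantiate a class on it."""
    module = importlib.import_module(module_name)  # the "import the module first" step
    cls = getattr(module, class_name)              # string -> class object
    return cls(**kwargs)


# e.g. with sklearn installed (illustrative, not executed here):
# model = instantiate_from_name("sklearn.ensemble", "RandomForestClassifier", n_estimators=10)
```

Because the class is resolved at run time, whatever module is named in the parameters must be importable in the environment running the pipeline.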
  • s

    Seth

    10/04/2022, 12:11 PM
    Let's say, for simplicity, I have a pipeline with a single node that takes dataset A as input and outputs dataset B. How do I handle a situation where I want to run this pipeline each day (with Airflow, for example), using the previous day's dataset B as the new day's input (as dataset A)? Ideally I'd like a single versioned dataset, where I overwrite the input dataset with a new version. However, Kedro doesn't allow the same dataset as both input and output of a single node.
  • n

    noklam

    10/04/2022, 12:21 PM
    This is the default behaviour of Kedro's versioned datasets: if you don't specify a version, it fetches the latest. Overwriting inputs is not a good idea because you risk getting a corrupted file. The whole point of an orchestrator is handling dependencies and retry logic, and you can't have that if you overwrite the data.
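The versioning behaviour noklam describes maps to a single catalog flag. A minimal sketch, assuming a pandas CSV dataset (the entry name and filepath are illustrative):

```yaml
# catalog.yml
my_dataset:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/my_dataset.csv
  versioned: true
```

With `versioned: true`, each save writes a new timestamped version under the filepath, and a load without an explicit version fetches the latest one.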
  • s

    Seth

    10/04/2022, 12:29 PM
    Makes sense and I agree. However, how do you set up the data catalog if the input file should be the same as the output file, since it depends on the output of the same node (executed a day before) and the data catalog is time-independent?
  • n

    noklam

    10/04/2022, 12:32 PM
    You can have two catalog entries pointing to the same location, but I strongly advise against doing this as there are likely better solutions.
  • s

    Seth

    10/04/2022, 12:33 PM
    That is indeed what I'm currently doing, but it doesn't seem like the correct solution.
  • a

    antheas

    10/05/2022, 3:36 PM
    It sounds like what you're doing is taking yesterday's dataset, adding new data, and saving it again. You are mutating your raw data, which is not a good idea: you could corrupt it if your code crashes.

    If the overhead is not big, I would save each day's data into a separate timestamped file, then use Kedro with an aggregation node that merges all the files into one dataset (by using a PartitionedDataset, for example). The timestamped file doesn't have to be made with Kedro, and you can consider it immutable and back it up with e.g. S3 versioning.

    Otherwise, you can use environment variables with TemplatedConfigLoader, so that A's filename uses yesterday's timestamp and B's uses today's. That way you also keep a history of your datasets in case something goes wrong. However, if something goes wrong and you don't notice for a few days, you would have to revert and lose all those days' data.

    You could also combine both approaches if the overhead is too big, and start your aggregation node with, say, last month's dataset as a base and only add this month's days, in case your dataset is not append-only and writes would scale with the number of days. Then, even if something catastrophic happens, you still have all the daily data backed up, so you can reconstruct your dataset given enough time. This parallels write-ahead logging (WAL), which you might find insightful: https://en.wikipedia.org/wiki/Write-ahead_logging
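The TemplatedConfigLoader approach antheas mentions could look like the following catalog sketch, assuming globals named `yesterday_ts` and `today_ts` are supplied to the loader (e.g. computed from environment variables in settings.py). All names here are hypothetical:

```yaml
# catalog.yml
dataset_a:
  type: pandas.CSVDataSet
  filepath: data/01_raw/daily_${yesterday_ts}.csv

dataset_b:
  type: pandas.CSVDataSet
  filepath: data/01_raw/daily_${today_ts}.csv
```

Each run then reads yesterday's file and writes today's, so a full history of daily files accumulates instead of one file being overwritten in place.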
  • u

    user

    10/07/2022, 8:07 AM
    How to change the kedro configuration environment in jupyter notebook? https://stackoverflow.com/questions/73984069/how-to-change-the-kedro-configuration-environment-in-jupyter-notebook
  • u

    user

    10/07/2022, 2:04 PM
    Is there a way to include an Azure Databricks Lakehouse query as a DataCatalog dataset in kedro? https://stackoverflow.com/questions/73988370/is-there-a-way-to-include-an-azure-databricks-lakehouse-query-as-a-datacatalog-d
  • u

    user

    10/07/2022, 4:46 PM
    import fsspec throws error (AttributeError: 'EntryPoints' object has no attribute 'get') https://stackoverflow.com/questions/73990243/import-fsspec-throws-error-attributeerror-entrypoints-object-has-no-attribut
  • u

    user

    10/08/2022, 6:36 PM
    Kedro on Databricks: Cannot import SparkDataset https://stackoverflow.com/questions/73999551/kedro-on-databricks-cannot-import-sparkdataset
  • w

    williamc

    10/10/2022, 2:31 AM
    Let's say I have a Kedro project named kpro and I've abstracted a bunch of stuff into kpro.coollib. If I package it up with kedro package and distribute it, do I just need to do import kpro.coollib on my destination machine to use it? In this case I'm not interested in running my pipelines, just the library code. Thanks!
  • n

    noklam

    10/10/2022, 8:04 AM
    Yes, that works out of the box. But if you don't want to ship the pipeline code, you may want to exclude it by modifying setup.py; if you need more details, check out the standard Python docs about packaging libraries.
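Excluding the pipeline code, as noklam suggests, can be done with `exclude` patterns for setuptools' find_packages() in setup.py. The snippet below is a self-contained sketch that builds a throwaway package tree just to show which packages survive the exclusion; the `kpro.pipelines` layout is an assumption about where the pipeline code lives:

```python
# Demonstrates the `exclude` patterns you would pass to find_packages() in
# setup.py, e.g. packages=find_packages(exclude=["kpro.pipelines", "kpro.pipelines.*"]).
import os
import tempfile

from setuptools import find_packages


def make_pkg(root: str, dotted: str) -> None:
    """Create an importable package directory with an __init__.py."""
    path = os.path.join(root, *dotted.split("."))
    os.makedirs(path, exist_ok=True)
    open(os.path.join(path, "__init__.py"), "w").close()


with tempfile.TemporaryDirectory() as src:
    for pkg in ["kpro", "kpro.coollib", "kpro.pipelines", "kpro.pipelines.data_science"]:
        make_pkg(src, pkg)
    # The pipeline packages are filtered out; the library code is kept.
    kept = find_packages(where=src, exclude=["kpro.pipelines", "kpro.pipelines.*"])

print(sorted(kept))  # -> ['kpro', 'kpro.coollib']
```

Note that `kpro.pipelines.*` is needed in addition to `kpro.pipelines`, since exclude patterns do not apply recursively to subpackages on their own.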
  • n

    nickolas da rocha machado

    10/10/2022, 2:47 PM
    Is anyone having problems with adlfs + partitioned dataset + parallel runner? Apparently, the dataset can't retrieve partitions from blob storage when using this combination. In my tests, it might be something related to asyncio calls inside the adlfs glob function.
    ```python
    # pipeline_registry.py
    pipelines["partitioned"] = Pipeline([node(print, 'partitioned', None)])
    ```
    ```yml
    # catalog.yml
    partitioned:
      type: PartitionedDataSet
      dataset: pandas.CSVDataSet
      path: abfs://...dfs.core.windows/...
      credentials: lab
      filename_suffix: .csv
    ```
    ```log
    [10/10/22 14:39:54] INFO     Kedro project
    [10/10/22 14:39:55] INFO     Loading data from 'partitioned' (PartitionedDataSet)...
    ```
  • n

    noklam

    10/10/2022, 2:49 PM
    I don't have any experience with this, but can you describe what problems you encountered? What's the error?
  • n

    nickolas da rocha machado

    10/10/2022, 2:50 PM
    The dataset gets stuck when reading data from blob storage
  • n

    nickolas da rocha machado

    10/10/2022, 2:51 PM
    it doesn't happen when saving, but when loading, it gets stuck
  • n

    nickolas da rocha machado

    10/10/2022, 2:53 PM
    I tried to debug Kedro to find which method it stops in, and it stops in PartitionedDataSet._list_partitions. It doesn't look like a Kedro problem but an adlfs one, since execution only hangs when reaching its glob method.
  • n

    nickolas da rocha machado

    10/10/2022, 3:05 PM
    I tried it on my Windows machine twice, and it worked. For some reason, this is only occurring on Linux
  • o

    Onéira

    10/11/2022, 8:55 AM
    Hello, I have seen in 0.18.3 that kedro build-docs etc. will be deprecated in 0.19. Are there other commands planned to replace them, or a new procedure to build the docs or requirements?
  • m

    Merel

    10/11/2022, 10:21 AM
    Hi @Onéira , the commands will not be replaced, but we will give guidance on alternative workflows. For kedro build-docs specifically we have a follow-up task to provide a good alternative: https://github.com/kedro-org/kedro/issues/1618 Another thought we have for the commands is to move them to a community-supported plugin. Please leave your thoughts on https://github.com/kedro-org/kedro/issues/1622 if this sounds useful to you 🙂