advanced-need-help
  • Onéira · 10/11/2022, 1:41 PM
    Thanks Merel for your feedback, I will check the document you mentioned ☺️
  • farisology · 10/12/2022, 7:10 AM
    I have been introducing my team to Kedro, and multiple questions were raised. I hope you can help me answer these:
    1. When `kedro run` executes on a local machine, it logs `running kedro session start`. Does this mean it runs a local server?
    2. If we are running a Kedro DAG, does that mean Kedro will start a local server in our Airflow? What ports will it use?
    3. kedro-airflow has been archived; what does this mean for the support of the project?
    4. Since there is limited documentation on running Kedro in stand-alone Airflow, does this mean kedro-airflow will only be a sustainable approach with Astronomer going forward?
    5. Why is there a Kedro operator instead of a Python or base operator?
  • Merel · 10/12/2022, 9:36 AM
    Hi @farisology I'll try my best to answer your questions:
    1. No, Kedro does not use a server to run, unless you set it up that way yourself. Out of curiosity: what version of Kedro are you using? That log line is not in our latest version, and I'm not sure it ever was... perhaps your team added this for debugging?
    2. Are you using the `kedro-airflow` plugin for this?
    3. You can definitely run Kedro on stand-alone Airflow as well. The Kedro maintainer team tries to keep documentation and deployment guides up to date, but we don't always manage to write about all possible ways of deploying Kedro projects. Let us know where we can help here!
    4. I don't really understand your question; is this related to Airflow/`kedro-airflow` again?
  • lancechua · 10/12/2022, 11:42 AM
    Pattern-wise, does a `GenericDataSet` class factory make sense? I'm looking to write a class factory with the following signature:
    ```python
    def create_generic_dataset(
        name: str,
        load: Callable[[Path, ...], pandas.DataFrame],
        save: Callable[[pandas.DataFrame, Path, ...], None],
    ):
        ...
    ```
    Or is subclassing `AbstractDataSet` still the preferred method?
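For readers following along: the subclassing route referred to here looks roughly like this in Kedro 0.18. A minimal sketch; the `MyCSVDataSet` name and the pandas CSV load/save are illustrative, not something from the thread:

```python
from pathlib import Path
from typing import Any, Dict

import pandas as pd
from kedro.io import AbstractDataSet  # Kedro 0.18 import path


class MyCSVDataSet(AbstractDataSet):
    """Illustrative custom dataset wrapping pandas CSV I/O."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        # Called by the catalog on catalog.load("...")
        return pd.read_csv(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        # Called by the catalog on catalog.save("...", data)
        data.to_csv(self._filepath, index=False)

    def _describe(self) -> Dict[str, Any]:
        # Used in logs and error messages
        return {"filepath": str(self._filepath)}
```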
  • datajoely · 10/12/2022, 11:43 AM
    Is this just for pandas? We already have a `pandas.GenericDataSet`
  • lancechua · 10/12/2022, 11:47 AM
    It's inspired by the pandas generic dataset actually, but the idea is to be able to pass arbitrary load / save callables that follow the required signature.
  • datajoely · 10/12/2022, 11:48 AM
    What other targets would you imagine here? The rationale for the pandas one was more about us having a best-effort way of not having to create a new implementation whenever they release something new, like XML.
  • datajoely · 10/12/2022, 11:49 AM
    The generic-ish datasets I've been planning on pitching/developing have been around Snowpark and Ibis.
  • datajoely · 10/12/2022, 11:49 AM
    In summary, I'm intrigued by your pitch, but would love to learn more about what pain points it would solve.
  • datajoely · 10/12/2022, 11:50 AM
    Perhaps raising a feature request on GitHub is the best way to pitch this, kick the tyres, and get an idea of what the steering committee would accept as a contribution.
  • lancechua · 10/12/2022, 11:59 AM
    One that comes to mind is supporting xarray NetCDF files or arviz InferenceData files, which have fairly standard load / save APIs.
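For context, the "fairly standard load / save APIs" mentioned here are roughly the following round-trips (file paths are illustrative):

```python
import arviz as az
import xarray as xr

# xarray NetCDF round-trip
ds = xr.open_dataset("data/01_raw/example.nc")       # load
ds.to_netcdf("data/02_intermediate/example.nc")      # save

# arviz InferenceData round-trip
idata = az.from_netcdf("data/01_raw/trace.nc")       # load
idata.to_netcdf("data/02_intermediate/trace.nc")     # save
```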
  • datajoely · 10/12/2022, 11:59 AM
    Super interesting, and something we would likely want to support.
  • datajoely · 10/12/2022, 12:01 PM
    I wonder if there is a slightly less generic middle ground: what about a common superclass that we can do trivial implementations of? I think there is an argument that having explicit names is good for newbies, and that we should do generics only where they have a common parent (i.e. pandas) or are actually generic natively (e.g. Spark).
  • lancechua · 10/12/2022, 12:28 PM
    A less generic middle ground could be to have a generic dataset for each library, then just specify the reader / writer methods, I guess. The other route is to make implementing a dataset as simple as providing load / save functions with the right signature. I will try to create a feature request in the next few days.
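For reference, the existing `pandas.GenericDataSet` mentioned earlier already follows that per-library pattern: the catalog entry names a `file_format` and the dataset dispatches to the matching pandas reader/writer. A hypothetical catalog entry, assuming Kedro 0.18:

```yaml
my_table:
  type: pandas.GenericDataSet
  file_format: feather
  filepath: data/01_raw/my_table.feather
```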
  • noklam · 10/12/2022, 12:31 PM
    There has been some interest in xarray NetCDF before. I am not particularly familiar with it, but as I understand it, this is a common data format in the scientific community. https://github.com/kedro-org/kedro/issues/1346
  • noestl · 10/12/2022, 1:25 PM
    Hello, I have a rather nebulous issue with Kedro which I'm curious if you've ever encountered. I'm trying to modify the runner so that everything runs on Batch, following the documentation on this page: https://kedro.readthedocs.io/en/latest/deployment/aws_batch.html Everything seems quite straightforward, but one of the steps is to add a custom CLI in the same location as the settings.py file, using the following template: https://kedro.readthedocs.io/en/latest/development/commands_reference.html#customise-or-override-project-specific-kedro-commands Again, not very complicated according to the docs, and I found that the file is resolved by the main. On the other hand, when I execute the `kedro run` command, the two files are ignored. So my question is: have you written a custom CLI and encountered this problem?
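For reference, a minimal sketch of the kind of project-level `cli.py` (placed next to `settings.py`) that the linked template describes, assuming Kedro 0.18; the group name and option are illustrative, and the documented template is more complete:

```python
# src/<package_name>/cli.py (sketch; sits next to settings.py)
import click
from kedro.framework.session import KedroSession


@click.group(name="my_project")  # hypothetical group name
def cli():
    """Project-specific commands; Kedro discovers this group at runtime."""


@cli.command()
@click.option("--pipeline", "pipeline_name", default=None, help="Pipeline to run.")
def run(pipeline_name):
    """Override of `kedro run`, e.g. to swap in a custom AWS Batch runner."""
    with KedroSession.create() as session:
        session.run(pipeline_name=pipeline_name)
```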
  • DIVINE · 10/12/2022, 4:12 PM
    Hello, I have some questions about datasets that depend on parameters... For example, if I have a `SQLQueryDataSet` where the SQL query has a WHERE statement, is there a way to change the value of the WHERE clause depending on a parameter in conf/base/parameters.yml?
  • datajoely · 10/12/2022, 4:12 PM
    You can do this with globals.yml
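A minimal sketch of the globals.yml approach, assuming Kedro 0.18's `TemplatedConfigLoader` (dataset, variable, and credential names are illustrative):

```python
# settings.py: enable config templating against globals.yml
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}
```

```yaml
# conf/base/globals.yml
country: FR

# conf/base/catalog.yml: ${country} is substituted at config load time
orders:
  type: pandas.SQLQueryDataSet
  sql: SELECT * FROM orders WHERE country = '${country}'
  credentials: db_credentials
```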
  • DIVINE · 10/12/2022, 4:18 PM
    Great, I just looked at the documentation, thank you
  • farisology · 10/13/2022, 3:26 AM
    1. Sorry for my mistake: my team confused that log line with the session.py that appears on the side (none of my team has put their hands on Kedro yet, so some questions are wrong assumptions perhaps).
    2. Yes, I use the kedro-airflow plugin.
    3. I am struggling with this part. I have tried following most of what I can find online, but nothing comes close to making the Kedro DAG run (generic directional tips that don't translate into an operational tutorial, perhaps). I am starting to think we shouldn't run Kedro pipelines in Airflow.
    4. Yes, this is related to the kedro-airflow auto-generated DAG.
    Here is a summary of what I am trying to do:
    1. I want to use Kedro to make the data science team write production-ready code from the get-go.
    2. As an MLOps engineer, I want to automate the process so that the Kedro pipeline can be made into a DAG without much friction and can be orchestrated.
    3. This Kedro DAG should be able to run stand-alone: conf files and data should be read from a bucket (or from a DWH or Redis), not from local storage. We cannot be pushing data and files that clutter the repo, hence leveraging bucket storage.
    What's your take on my approach?
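As background for point 2 above, the kedro-airflow plugin's documented workflow is roughly the following sketch (the default output folder is taken from the plugin's README):

```bash
pip install kedro-airflow
kedro airflow create   # generates an Airflow DAG file from the project, under airflow_dags/
kedro package          # builds a wheel of the project to install where the Airflow workers run
# then copy the generated DAG into Airflow's dags/ folder and install the wheel
```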
  • user · 10/13/2022, 8:15 AM
    Kedro template configuration does not load globals.yml configuration into catalog.yml for Jupyter Lab: https://stackoverflow.com/questions/74052421/kedro-template-configuration-does-not-load-globals-yml-configuration-into-catalo
  • gonecaving · 10/13/2022, 9:17 AM
    Hi all, I've been exploring the Kedro framework over the past week or two, and have been building out a proof-of-concept pipeline for a data project in the company I work for. What I'd like to do next, and where I'm looking for some guidance/pointers, is to see if it's possible to build a pipeline where the structure of the DAG is defined in data. I'm hoping it's possible to build a set of templates, and then enable a less technical user to join them together via some sort of data input (JSON/YAML/...) from the data catalog. I've done this sort of thing in Airflow before, but really like the concept of connected pipelines and the use of a data catalog in Kedro. It feels like register_pipelines might be the place to do this, but it doesn't seem to take any args? So, any tips?
  • user · 10/13/2022, 2:43 PM
    TypeError: __init__() got an unexpected keyword argument 'config_loader': https://stackoverflow.com/questions/74057556/typeerror-init-got-an-unexpected-keyword-argument-config-loader
  • PetitLepton · 10/13/2022, 7:54 PM
    Hi fellows, I created a custom dataset to handle templated (Jinja) SQL queries. This kind of dataset would be useful for several projects. How would you deal with sharing the dataset across different Kedro projects? The documentation for custom datasets talks about extending kedro.extras, but I don't think this little implementation is general enough.
  • user · 10/14/2022, 8:58 AM
    kedro PartitionedDataSet lazy writing to spare memory? https://stackoverflow.com/questions/74066621/kedro-partitioneddataset-lazy-writting-to-spare-memory
  • noklam · 10/14/2022, 9:10 AM
    I think what you need is to distribute these shared datasets as a Python package. You can use a Kedro project or just a plain Python package, since it won't have any pipeline in it.
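A sketch of what that could look like; the package and class names are hypothetical. Kedro resolves the catalog `type:` field as a dotted import path, so any installed package works:

```yaml
# conf/base/catalog.yml in any project, after `pip install my-kedro-datasets`
report_data:
  type: my_kedro_datasets.TemplatedSQLQueryDataSet  # hypothetical shared class
  # dataset-specific arguments here (sql template path, credentials, ...)
```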
  • PetitLepton · 10/14/2022, 9:12 AM
    Thanks @noklam, that's what I had in mind. 👍
  • noklam · 10/14/2022, 9:24 AM
    Do you mean constructing the pipeline dynamically on the fly? You can achieve this via hooks, but it isn't something we encourage in general. Part of the reason is that we think a static pipeline is easier to understand and reproduce; when you hide too many switches in the pipeline, it becomes difficult to reason about.
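For completeness, a sketch of the data-driven pipeline idea from the earlier question, reading a spec inside `register_pipelines()` (which indeed takes no arguments). The spec file, the `TEMPLATES` lookup, and the package name are all hypothetical, and per the caveat above this pattern is not generally encouraged:

```python
# src/<package_name>/pipeline_registry.py (illustrative sketch)
from pathlib import Path
from typing import Dict

import yaml
from kedro.pipeline import Pipeline, node

from my_package.nodes import TEMPLATES  # hypothetical: maps step names to functions


def register_pipelines() -> Dict[str, Pipeline]:
    # Spec like: [{func: clean, inputs: raw_data, outputs: clean_data}, ...]
    spec = yaml.safe_load(Path("conf/base/pipeline_spec.yml").read_text())
    nodes = [
        node(TEMPLATES[step["func"]], step["inputs"], step["outputs"])
        for step in spec
    ]
    return {"__default__": Pipeline(nodes)}
```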
  • PetitLepton · 10/14/2022, 6:43 PM
    Hi, I bumped into an unexpected behavior and would like your feedback on whether I did something wrong. I built a custom dataset which inherits from SQLQueryDataSet. The latter has a parameter `filepath`. In the former, I first added a parameter `templated_filepath`, which I then pass into SQLQueryDataSet as `filepath=templated_filepath`. Here is the catch: if I use the custom dataset in a notebook with something like `data/01_raw/blabla` as `templated_filepath`, the catalog breaks because the full path includes `notebooks/`, i.e. it uses the directory of the notebook to build the complete path. But if, instead of calling the parameter `templated_filepath`, I use the genuine name `filepath`, then everything is fine. Is there some magic around the name `filepath` when the catalog is parsed to create the datasets?
  • PetitLepton · 10/15/2022, 9:18 AM
    About paths in the catalog