beginners-need-help
  • khern (11/24/2021, 12:27 PM)
    Hello, Arnaldo! thanks for your response, oh I didn't! Okay I'll try. Thank you! 🙂
  • RRoger (11/25/2021, 10:35 AM)
    Newbie here. Can SageMaker be used in Kedro for generic processing (not fitting models)?
  • datajoely (11/25/2021, 10:35 AM)
    Yes - but you need to package the project or use Kedro in IPython mode
  • datajoely (11/25/2021, 10:40 AM)
    we also have docs here! https://kedro.readthedocs.io/en/latest/10_deployment/09_aws_sagemaker.html
  • datajoely (11/25/2021, 10:40 AM)
    also docs on the AWS side https://aws.amazon.com/blogs/opensource/using-kedro-pipelines-to-train-amazon-sagemaker-models/
  • Bastian (11/25/2021, 4:05 PM)
    Hi everyone! We have a workflow that seems to be a little tricky to achieve: every night, we combine raw data from today and intermediate data from yesterday to produce (the same) intermediate data for today. For this, we use two PartitionedDataSet catalog entries that point to the same path: one so we can access the old data and one so we can write out the updated data. While it feels hacky, it seems to work; however, we do have an issue when we run this for the very first time: since there is no old data present, the PartitionedDataSet crashes while loading. We could work around this by using an IncrementalDataSet, but then we always load all the partitions, which would mean loading a year's worth of data when we only need a day. We found this issue, https://github.com/quantumblacklabs/kedro/issues/394, that seems to be related to what we want to do.
  • datajoely (11/25/2021, 4:07 PM)
    Hello - I think as it stands this isn't possible out of the box. I need to check with the team why we made this a hard design requirement, but I would recommend you subclass PartitionedDataSet and override it for your own purposes.
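    A minimal sketch of what that subclass could look like, assuming the first-run crash comes from PartitionedDataSet raising a DataSetError when the path contains no partitions yet (import paths are the Kedro 0.17.x ones; the subclass name is made up):

```python
from kedro.io import DataSetError, PartitionedDataSet


class ForgivingPartitionedDataSet(PartitionedDataSet):
    """PartitionedDataSet that tolerates an empty path,
    e.g. on the very first nightly run when there is no old data yet."""

    def _load(self):
        try:
            return super()._load()
        except DataSetError:
            # No partitions found yet: behave as if yesterday's data were empty.
            return {}
```

    You would then reference the subclass in the catalog by its full import path in the `type` field of the dataset entry.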
  • RRoger (11/25/2021, 11:55 PM)
    I want to load data from a database and then save to multiple formats locally (CSV, Parquet). Is there a way to do this nicely? I tried putting a list of outputs in a node, but it threw an error. I know that transcoding allows the same dataset to be loaded in multiple ways; I want to go the other way around.
  • datajoely (11/26/2021, 10:05 AM)
    So out of the box you need to create two catalog entries as 'targets'; transcoding is really designed for outputs that share the same filepath, i.e. pandas and Spark pointing to the same Parquet file. There is also a third-party plugin, kedro-accelerator. It's not officially supported, but it includes a TeePlugin that I think allows you to do what you want: https://github.com/deepyaman/kedro-accelerator
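    For the out-of-the-box route, a minimal sketch (the dataset names "db_table", "table_csv" and "table_parquet" are hypothetical catalog entries): one node loads the table and returns the same dataframe twice, and each output is bound to its own catalog entry (e.g. a pandas.CSVDataSet and a pandas.ParquetDataSet).

```python
from kedro.pipeline import Pipeline, node


def fan_out(df):
    # Return the same dataframe twice so that each output can be written
    # by a different catalog entry (one CSV, one Parquet).
    return df, df


pipeline = Pipeline(
    [
        node(fan_out, inputs="db_table", outputs=["table_csv", "table_parquet"]),
    ]
)
```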
  • Onéira (11/26/2021, 8:47 PM)
    Hello, I am starting to use Kedro, and I was wondering if it could help me with a Kaggle challenge. Is there a "Kaggle DataCatalog" where I could get the data to feed a pipeline? I am actually trying to find a way to develop a pipeline which I could run both in my IDE and as a Kaggle notebook...
  • datajoely (11/26/2021, 8:48 PM)
    Hello - I'm not aware of any Kaggle integrations with Kedro, but it's a cool idea!
  • datajoely (11/26/2021, 8:50 PM)
    In theory you can run Kedro in ipython mode in most notebooks https://kedro.readthedocs.io/en/stable/11_tools_integration/02_ipython.html#ipython-extension
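    Roughly what that looks like in a notebook cell (the extension's import path has moved between Kedro releases, so check the linked docs for the path matching your installed version; the project path and dataset name below are placeholders):

```python
# Load the Kedro IPython extension (path shown is the 0.17.x one),
# then point it at the project to get `catalog`, `context` and `session`.
%load_ext kedro.extras.extensions.ipython
%reload_kedro /path/to/your/kedro/project

catalog.list()                        # datasets defined in the project catalog
df = catalog.load("example_dataset")  # hypothetical dataset name
```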
  • Onéira (11/26/2021, 8:54 PM)
    Thanks for the very quick answer 😄 Was worth asking. I am always struggling to code in notebooks since I am so used to my IDE. However, if I copy my code into a notebook, the data are not all located in the same place, so I need to update everything; it would be awesome to have a way to abstract the Kaggle data into a catalog or something like that. However, I understand that it is not a feature everyone would use on the job 😆
  • Onéira (11/26/2021, 8:54 PM)
    I will check if it can help!
  • datajoely (11/26/2021, 8:56 PM)
    It's something for us to think about, I think - that being said, the Kedro team strongly believes notebooks are for prototyping and communication, and the IDE is for pipelines.
  • Onéira (11/26/2021, 8:59 PM)
    I don't like notebooks myself, I find them too messy; my colleagues are desperate to see me do everything in scripts/functions 🤣. I am still trying to understand how to use Kaggle GPUs; at the moment I have only managed it through notebooks. *sigh, going back to trying*
  • Rroger (11/27/2021, 9:40 PM)
    If there are a lot of columns that could be processed separately, e.g. imputation, mapping values, etc., then it would be better to have separate nodes for these, right? The upside is modularity (as you mentioned), which also means the nodes could be run in parallel. The downside is that there would be a lot of nodes: process_col_A, process_col_B, ..., process_col_ZZ.
  • datajoely (11/27/2021, 9:47 PM)
    Ah, good question - in this case, if it's the same logic over and over again, I'd look to keep it in one node for readability, or use a loop to create many nodes.
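    A rough sketch of the loop approach (the column list, dataset names and the processing function are made up for illustration); functools.partial keeps the per-column functions module-level and picklable, and each node gets an explicit name because partials have no __name__:

```python
from functools import partial

from kedro.pipeline import Pipeline, node


def process_column(df, column):
    # Hypothetical per-column step, e.g. imputation or value mapping.
    out = df[[column]].copy()
    out[column] = out[column].fillna(0)
    return out


COLUMNS = ["col_a", "col_b", "col_c"]  # illustrative column names

pipeline = Pipeline(
    [
        node(
            partial(process_column, column=col),
            inputs="raw_table",
            outputs=f"processed_{col}",
            name=f"process_{col}",
        )
        for col in COLUMNS
    ]
)
```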
  • brewski (11/27/2021, 10:09 PM)
    Hello - I'm looking to migrate an existing Dask data science project into Kedro, to help with structuring the code and with transparency for non-technical folks. Are there any known best practices for this use case?
  • Rroger (11/28/2021, 2:46 AM)
    What are the practical differences between the ThreadRunner and the ParallelRunner? 1. I know that the ParallelRunner won't take lambda functions. In this case, how do I deal with the nodes that use identity lambda functions (lambda x: x)? 2. I tried running with the ThreadRunner and it did run the nodes at the same time and shortened the end-to-end runtime. Are there situations in which I shouldn't use the ThreadRunner?
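    On point 1, a hedged workaround: replace the identity lambdas with a named, module-level function so that the node can be pickled by the ParallelRunner (the function and dataset names below are made up).

```python
def passthrough(x):
    """Identity node function; module-level and named so that
    ParallelRunner can pickle it, unlike `lambda x: x`."""
    return x


# Used in place of the lambda when defining the node:
# node(passthrough, inputs="some_dataset", outputs="some_dataset_copy")

# The runner is chosen at run time, e.g.:
#   kedro run --runner=ThreadRunner
#   kedro run --runner=ParallelRunner
```

    On point 2, broadly: the ThreadRunner runs nodes in threads inside one process, so it helps most when nodes are I/O-bound or call libraries that release the GIL (Spark, databases), while the ParallelRunner uses separate processes for CPU-bound Python work at the cost of requiring picklable nodes and datasets.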
  • sri (11/28/2021, 12:16 PM)
    Is it possible to use Kedro-Viz to write to a PNG file without opening the browser interface? I have issues opening localhost:port in the browser in my dev environment. I would just like it to dump a PNG file from the command line or programmatically.
  • sri (11/28/2021, 3:32 PM)
    How do I give a JSON file as parameters instead of parameters.yml?
  • datajoely (11/28/2021, 6:55 PM)
    Parallel vs Thread runner
  • datajoely (11/28/2021, 6:58 PM)
    Viz write PNG as CLI command
  • datajoely (11/28/2021, 7:00 PM)
    JSON params instead of YAML
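    For that JSON-parameters question, a sketch under the assumption that Kedro's ConfigLoader accepts .json alongside .yml/.yaml for files matching the parameters* pattern (the constructor shown takes a list of conf paths, as in Kedro 0.17.x; the file contents are illustrative):

```python
# conf/base/parameters.json (hypothetical contents):
# {"model_options": {"test_size": 0.2, "random_state": 3}}

from kedro.config import ConfigLoader

# ConfigLoader matches "parameters*" against every supported file extension,
# so a parameters.json is picked up the same way parameters.yml would be.
conf_loader = ConfigLoader(["conf/base", "conf/local"])
parameters = conf_loader.get("parameters*", "parameters*/**")

print(parameters["model_options"]["test_size"])
```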
  • datajoely (11/28/2021, 7:04 PM)
    Best practice for Kedrofying an existing project
  • sri (11/29/2021, 9:48 AM)
    What is the best approach when you read the same tables from two different kinds of databases (Oracle DB, Hive, etc.)? All the data prep steps are the same except the SQL that initially reads into a dataframe. How do I make this configurable with Kedro?
  • datajoely (11/29/2021, 9:57 AM)
    You would likely have to define multiple pandas.SQLTableDataSet catalog references with different config.
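    Sketched in Python rather than catalog YAML (the table name and connection strings below are made up): the same table is declared twice, and only the SQLAlchemy connection string differs, so every downstream prep node can stay identical and simply take a different input dataset.

```python
from kedro.extras.datasets.pandas import SQLTableDataSet

# Same table, two different engines; only the connection string changes.
oracle_customers = SQLTableDataSet(
    table_name="customers",
    credentials={"con": "oracle+cx_oracle://user:password@oracle-host:1521/service"},
)
hive_customers = SQLTableDataSet(
    table_name="customers",
    credentials={"con": "hive://user@hive-host:10000/default"},
)
```

    In the catalog this becomes two entries of type pandas.SQLTableDataSet whose credentials point at the two databases, which is where the repetition-saving tips below come in.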
  • datajoely (11/29/2021, 9:58 AM)
    Two things that can help you stop repeating yourself:
    - YAML anchors: https://support.atlassian.com/bitbucket-cloud/docs/yaml-anchors/
    - Jinja2: https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#jinja2-support
  • sri (11/29/2021, 10:38 AM)
    Is it possible to have SQL code with a date range picked from config with these approaches?