advanced-need-help
  • d

    datajoely

    11/03/2021, 1:15 PM
    So in truth Kedro isn't great at this - what we have seen people do is define a custom SQL dataset that applies a date-range query; those results are then saved to a partitioned/incremental dataset
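    For illustration, a hedged sketch of that pattern - a read-only custom dataset that pushes a date-range filter into the SQL query. This is not a built-in Kedro dataset; the class, table and column names here are hypothetical:
    ```python
    # Hypothetical custom dataset: loads only rows inside a date range, so a
    # downstream node can write the result to a partitioned/incremental dataset.
    import pandas as pd
    from kedro.io import AbstractDataSet


    class DateRangeSQLDataSet(AbstractDataSet):
        def __init__(self, table: str, con: str, start_date: str, end_date: str):
            self._table = table  # table to query (hypothetical name)
            self._con = con      # SQLAlchemy connection string
            self._start = start_date
            self._end = end_date

        def _load(self) -> pd.DataFrame:
            query = (
                f"SELECT * FROM {self._table} "
                f"WHERE event_date BETWEEN '{self._start}' AND '{self._end}'"
            )
            return pd.read_sql(query, self._con)

        def _save(self, data: pd.DataFrame) -> None:
            raise NotImplementedError("read-only dataset")

        def _describe(self) -> dict:
            return dict(table=self._table, start=self._start, end=self._end)
    ```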
  • z

    Zemeio

    11/04/2021, 11:42 AM
    Do you happen to know an example of this?
  • d

    datajoely

    11/04/2021, 11:43 AM
    This is a similar question, I think: https://discord.com/channels/778216384475693066/778998585454755870/897056922690256896
  • d

    datajoely

    11/04/2021, 11:43 AM
    If you do settle on a neat implementation we'd really appreciate a PR into the main project
  • z

    Zemeio

    11/04/2021, 11:48 AM
    Unfortunately, due to priorities this is not going to happen in the near future, since I am mostly looking into this in preparation for an issue I know my team will face. When I do get to build it, though, I can try to tidy it up.
  • d

    datajoely

    11/04/2021, 11:49 AM
    Absolutely - part of the reason we haven't got a first-party way of doing this is that there are so many varied requirements for this sort of thing
  • d

    doc4evah

    11/04/2021, 6:42 PM
    Hey guys! I'm getting a weird error when running a pipeline. For context, I'm running a translation model (which may or may not leverage multithreading). When I run it, this is the output:
    ```bash
    [1]    99263 segmentation fault  kedro run --pipeline translate
    /Users/durc1211/.asdf/installs/python/3.8.9/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    ```
    And the process just stops.. Any ideas? Could not find much online or in the kedro repo..
  • d

    datajoely

    11/04/2021, 6:42 PM
    If the underlying model uses parallelism you can't use the parallel runner with it
  • d

    doc4evah

    11/04/2021, 6:44 PM
    Is that something I can configure?
  • d

    doc4evah

    11/04/2021, 6:50 PM
    Because I've tried to run it with all of the `--runner <runner-name>` configurations, and it still fails..
  • d

    datajoely

    11/04/2021, 6:57 PM
    Oh, in which case it's likely not a Kedro issue but part of the underlying library - it's end of day here in London so I can't help much more tonight. My only advice is to set up breakpoints and step through with the debugger to try and work out where/why it fails
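    A minimal sketch of that advice, if it helps: drop a breakpoint() into the suspect node and run from a terminal (the node body here is a hypothetical stand-in for the real translation call):
    ```python
    # hypothetical node used to illustrate stepping through with pdb
    def translate_node(texts):
        breakpoint()  # pauses here under `kedro run`; step with n/s, continue with c
        return [t.upper() for t in texts]  # stand-in for the real translation call


    if __name__ == "__main__":
        print(translate_node(["hello", "world"]))
    ```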
  • d

    doc4evah

    11/05/2021, 6:50 PM
    @User I'm still somewhat convinced that it might be the two libraries interacting badly... Submitted an issue anyway 🙂 https://github.com/argosopentech/argos-translate/issues/213
  • d

    datajoely

    11/06/2021, 11:47 AM
    I've commented on that issue - it's really bizarre, I can ask one of the devs to look into it
  • d

    doc4evah

    11/06/2021, 1:57 PM
    I have a feeling it's something in 3.8. Can you give me some simple instructions to run kedro with 3.9 support? (I know you're gonna release it soon)
  • a

    antony.milne

    11/08/2021, 10:24 AM
    kedro 0.17.5 with Python 3.9 unofficially probably works fine, actually. The main gotchas are:
    - if you're using `pandas.ExcelDataSet` then you will need to add `engine: openpyxl` to the `load_args`
    - you might get some library version conflicts, which you'll need to fiddle with a bit
    Probably the easiest way is to install kedro + its dependencies + your project dependencies and then install Python 3.9 (rather than starting out with Python 3.9 first). I just tried this out and managed to get the spaceflights starter running on kedro 0.17.5 and Python 3.9 just by fiddling with the requirements a bit, so it should definitely be possible!
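    For reference, a minimal catalog.yml sketch of that first gotcha (the dataset name and filepath are hypothetical):
    ```yaml
    # hypothetical catalog entry; the fix is the engine key under load_args
    shuttles:
      type: pandas.ExcelDataSet
      filepath: data/01_raw/shuttles.xlsx
      load_args:
        engine: openpyxl
    ```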
  • a

    antony.milne

    11/08/2021, 10:25 AM
    The other thing you can try here is running the nodes/pipeline outside kedro just as pure Python functions. This should help to see whether this is a problem involving the kedro runner at all
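    Concretely, that might look something like this (the module path, function and file names are hypothetical placeholders for your own project):
    ```python
    # call the node function as plain Python, bypassing the Kedro runner entirely
    import pandas as pd

    from my_project.pipelines.translation.nodes import translate_text  # hypothetical

    texts = pd.read_csv("data/01_raw/texts.csv")  # load the node's input by hand
    translated = translate_text(texts)  # if this also segfaults, it isn't the runner
    print(translated.head())
    ```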
  • s

    simon_myway

    11/08/2021, 1:11 PM
    Hi Joel, I've been working on a custom dataset to suit my needs. To that end, I explored the kedro dataset architecture in some depth and learned about the release function. I noticed that release will not be called on datasets which are the pipeline's first inputs or last outputs - see https://github.com/quantumblacklabs/kedro/blob/master/kedro/runner/sequential_runner.py#L72 - could you please explain what is preventing the release of those datasets?
  • d

    datajoely

    11/08/2021, 1:25 PM
    I'm not 100% sure - I'm going to ask the team. My initial guess is that it's a memory optimisation for when you need to reuse the data multiple times. I would say that you could define an `after_node_run` hook if you wanted to do this explicitly
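    A hedged sketch of such a hook, assuming you simply want release() called on every output a node produces (kedro 0.17.x hook API):
    ```python
    from kedro.framework.hooks import hook_impl


    class ExplicitReleaseHooks:
        @hook_impl
        def after_node_run(self, node, catalog, outputs):
            # the stock runners skip release() for the pipeline's free
            # inputs/outputs; this forces it for everything a node just wrote
            for dataset_name in outputs:
                catalog.release(dataset_name)
    ```
    You would then register it via the HOOKS tuple in settings.py (0.17.x).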
  • d

    datajoely

    11/08/2021, 1:56 PM
    What issues are you experiencing as a result of this?
  • s

    simon_myway

    11/08/2021, 2:07 PM
    Reading the docs on IncrementalDataset here: https://kedro.readthedocs.io/en/stable/05_data/02_kedro_io.html#incremental-dataset-confirm - "Partitions that are created externally during the run will also not affect the dataset loads and won’t appear in the list of loaded partitions until the next run or until the release() method is called on the dataset object." So I was curious about this release function and when it is called. The issue is that if I override the release function, it may or may not be called depending on whether the dataset is a pipeline input/output
  • d

    datajoely

    11/08/2021, 2:09 PM
    So I have two reactions:
    - I'd love to understand your implementation here, to see if this is the best way of solving the problem
    - You can actually define your own runner class
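    On the second point, a hedged sketch (this leans on kedro 0.17.x internals, which - as noted below - may change): subclass SequentialRunner and release the free inputs/outputs the stock runner leaves alone:
    ```python
    from kedro.runner import SequentialRunner


    class EagerReleaseRunner(SequentialRunner):
        def _run(self, pipeline, catalog, run_id=None):
            super()._run(pipeline, catalog, run_id)
            # release the datasets the default implementation never releases
            for name in pipeline.inputs() | pipeline.outputs():
                catalog.release(name)
    ```
    You could then select it with kedro run --runner my_project.runners.EagerReleaseRunner (the module path here is hypothetical).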
  • d

    datajoely

    11/08/2021, 2:14 PM
    I should also say the runner will be overhauled in a future version - it's something we want to rewrite when we get a chance
  • s

    simon_myway

    11/08/2021, 5:11 PM
    Thanks! I eventually moved the logic to an `after_node_run` hook - happy to share the implementation I came up with once I have an MVP
  • d

    datajoely

    11/08/2021, 5:11 PM
    Amazing! Great that the hook approach worked - hooks are our preferred approach for custom logic
  • u

    user

    11/08/2021, 8:01 PM
    Adding parameters in Kedro Pipeline https://stackoverflow.com/questions/69889151/adding-parameters-in-kedro-pipeline
  • i

    Isaac89

    11/09/2021, 8:39 AM
    Hi! I'm running a pipeline on an HPC, scheduling jobs with slurm, and I would like to run the same pipeline with different inputs. Until now I have always used the TemplatedConfigLoader to achieve this, changing the variables in globals.yml, but now I would like to run the pipeline for all the inputs in parallel, taking advantage of slurm. The problem I've observed is that all runs share the same catalog and globals, so if the jobs start at the same time, the catalog rendered for one run may be the wrong one. Is there a way to use a catalog stored in a different folder, passing the path from the CLI, so that each pipeline run uses a different globals.yml? Or what would be the best practice to avoid using a wrongly rendered catalog or the wrong globals.yml? Thanks!
  • d

    datajoely

    11/09/2021, 8:40 AM
    You’re looking for modular pipelines! I’m currently working on revamped docs as we speak but the tutorial is up there
  • d

    datajoely

    11/09/2021, 8:41 AM
    Modular pipelines allow you to reuse the same pipeline as multiple instances under different namespaces, with the ability to override certain inputs, outputs and parameters
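    A minimal sketch of what this looks like in code (kedro 0.17.x modular pipeline helper; the node and dataset names are hypothetical):
    ```python
    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline


    def preprocess(df):
        return df


    base = Pipeline([node(preprocess, inputs="raw_data", outputs="clean_data")])

    # two instances of the same pipeline; the namespace prefixes the dataset
    # names, so each cohort reads and writes its own catalog entries
    cohort_a = pipeline(base, namespace="cohort_a")
    cohort_b = pipeline(base, namespace="cohort_b")
    ```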
  • i

    Isaac89

    11/09/2021, 10:07 AM
    Hi! Concerning the answer to my last question: I saw your answer suggesting the use of modular pipelines, but in my case I'm always using the same pipeline, just changing the input files. To give more context: I have a pipeline which I will call PIP. The catalog is rendered using the variable cohort in globals.yml. I want to run the pipeline using different values for the variable cohort. This works fine as long as I run it sequentially: update globals.yml -> kedro run, for each cohort. If I run this in parallel, the steps update globals -> kedro run happen concurrently for all the cohorts, so globals.yml can be overwritten by chance. Therefore I would like to be able to keep globals.yml somewhere else, or provide the variable from the command line, to avoid this problem. Is there any way to achieve that?
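    One way to sidestep the shared file entirely is Kedro's configuration environments: give each cohort its own conf/<env>/globals.yml and select it with kedro run --env <cohort>, so parallel jobs never touch the same globals file. A hedged sketch of the config-loader registration this assumes (kedro 0.17.x):
    ```python
    from kedro.config import TemplatedConfigLoader
    from kedro.framework.hooks import hook_impl


    class ProjectHooks:
        @hook_impl
        def register_config_loader(self, conf_paths):
            # conf_paths covers conf/base plus conf/<env>, so a globals.yml in
            # e.g. conf/cohort_a/ overrides the base one for that run only
            return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")
    ```
    Each slurm job would then run e.g. kedro run --env cohort_a against its own environment folder.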
  • u

    user

    11/12/2021, 9:39 AM
    Azure Data Lake Storage Gen2 (ADLS Gen2) as a data source for Kedro pipeline https://stackoverflow.com/questions/69940562/azure-data-lake-storage-gen2-adls-gen2-as-a-data-source-for-kedro-pipeline