advanced-need-help
  • r

    Raakesh S

    08/06/2022, 9:54 PM
    However, not sure if any conflicts could arise with kedro[pandas, pyspark] being 0.17.4 and the project being created in 0.17.7. There aren't any major upgrades from 0.17.4 to 0.17.7
  • f

    Flow

    08/06/2022, 10:35 PM
    Not sure, but I remember that it validates the version in pyproject.toml or an equivalent file, so it's hard-coded. Therefore, if it runs with the 0.17.4 version for some reason, that might cause the failure
  • m

    Matthias Roels

    08/08/2022, 12:43 PM
    Question: is it possible to modify a pipeline definition using a hook? I would like to add a couple of nodes to a pipeline based on run time information.
  • d

    datajoely

    08/08/2022, 1:18 PM
    Anything is possible! But this feels a little fragile
  • m

    Matthias Roels

    08/08/2022, 1:37 PM
    Can you explain why it is fragile?
  • d

    datajoely

    08/08/2022, 2:00 PM
    We don't typically like dynamic pipelines since they make reproducibility difficult.
  • d

    datajoely

    08/08/2022, 2:00 PM
    And by extension, non-determinism makes things difficult to debug.
  • m

    Matthias Roels

    08/08/2022, 7:28 PM
    Got it! Our biggest struggle is that there is no clean way to add conditional logic in the pipeline (logic that depends on the KEDRO_ENV). And on top of that, we want to be able to construct pipelines more dynamically, using logic that generates additional nodes at registration time.
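    One way to express that kind of conditional logic is to branch on the environment inside register_pipelines(). The sketch below is only an illustration of the idea, assuming KEDRO_ENV is used as the switch and using made-up module names; it is not an established Kedro pattern.
    import os
    from typing import Dict

    from kedro.pipeline import Pipeline

    # hypothetical pipeline modules, purely for illustration
    from my_project.pipelines import base, prod_extras


    def register_pipelines() -> Dict[str, Pipeline]:
        # KEDRO_ENV is read from the environment, defaulting to "local"
        env = os.environ.get("KEDRO_ENV", "local")
        pipe = base.create_pipeline()
        if env == "prod":
            # append extra nodes only for the production environment
            pipe = pipe + prod_extras.create_pipeline()
        return {"__default__": pipe}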
  • a

    antheas

    08/09/2022, 12:27 PM
    @Matthias Roels I had a similar problem. Kedro is written in such a way that both the catalog and the pipelines must be static
  • a

    antheas

    08/09/2022, 12:29 PM
    I use multiple datasets and multiple algorithms and want to run the same pipeline on them, but they have a different number of tables with different names, and each algorithm has different IO
  • a

    antheas

    08/09/2022, 12:30 PM
    The way I solved it was by carefully namespacing my catalog and creating it dynamically based on the combination of datasets and algs. So, when I load up kedro ipython I can access all of my datasets
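    One way to create catalog entries like this dynamically is from an after_catalog_created hook. A rough sketch, with made-up dataset/algorithm names and file paths (whether this matches the actual setup described above is not stated):
    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.framework.hooks import hook_impl
    from kedro.io import DataCatalog

    DATASETS = ["cars", "faces"]        # hypothetical dataset names
    ALGORITHMS = ["yolov5", "resnet"]   # hypothetical algorithm names


    class DynamicCatalogHooks:
        @hook_impl
        def after_catalog_created(self, catalog: DataCatalog) -> None:
            for ds in DATASETS:
                for alg in ALGORITHMS:
                    # register one namespaced entry per combination so they
                    # are all available from kedro ipython and the pipelines
                    catalog.add(
                        f"{ds}.{alg}.raw_table",
                        CSVDataSet(filepath=f"data/01_raw/{ds}/{alg}/table.csv"),
                    )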
  • a

    antheas

    08/09/2022, 12:32 PM
    and then enumerating all combinations of datasets and algs in my registry. My pipelines have names of the form <dataset>.<algorithm>, or <dataset>.<algorithm>.<stage> if I want to run only part of the pipeline. Example: cars.yolov5.measure
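    A minimal sketch of what that registry enumeration could look like, assuming a single measure pipeline factory and catalog entries namespaced to match (module paths and names below are placeholders):
    from typing import Dict

    from kedro.pipeline import Pipeline, pipeline

    # hypothetical pipeline factory
    from my_project.pipelines.measure import create_pipeline as measure_pipeline

    DATASETS = ["cars", "faces"]        # hypothetical dataset names
    ALGORITHMS = ["yolov5", "resnet"]   # hypothetical algorithm names


    def register_pipelines() -> Dict[str, Pipeline]:
        pipelines: Dict[str, Pipeline] = {"__default__": Pipeline([])}
        for ds in DATASETS:
            for alg in ALGORITHMS:
                # the namespace prefixes node and dataset names with
                # "<dataset>.<algorithm>", so catalog entries must match
                pipelines[f"{ds}.{alg}.measure"] = pipeline(
                    measure_pipeline(), namespace=f"{ds}.{alg}"
                )
        return pipelines
    A single combination can then be run with, for example, kedro run --pipeline=cars.yolov5.measure.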
  • a

    antheas

    08/09/2022, 12:58 PM
    I have 32 pipelines and kedro launches in around ~2s if that's useful to you
  • d

    datajoely

    08/09/2022, 1:12 PM
    I'd love to showcase a demo of this - this sounds super neat
  • b

    Barros

    08/10/2022, 8:36 PM
    Hi guys, I'd like to know a way to build a dataset class kind of like a PartitionedDataSet, but where the filenames are fixed. Thanks in advance!
  • d

    datajoely

    08/10/2022, 9:13 PM
    Could you elaborate more on what you mean?
  • b

    Barros

    08/10/2022, 9:39 PM
    In the sense that instead of PartitionedDataSet searching the directory for every file with a certain suffix, it searches a list of pre-determined filenames. If the load method finds the specified list of filenames, it loads them, just like PartitionedDataSet. If there are files missing, it can return empty or raise an error. I tried overloading the __init__ method of PartitionedDataSet but I think there should be a better way
  • d

    datajoely

    08/10/2022, 9:40 PM
    Subclassing to do this sort of custom functionality is very much encouraged
  • b

    Barros

    08/10/2022, 9:41 PM
    It makes sense to use a PartitionedDataSet instance, but maybe it would also be easier to write a specific dataset that does this
  • b

    Barros

    08/10/2022, 9:41 PM
    I want to write something that is easier for someone who reads the code to maintain
  • b

    Barros

    08/10/2022, 9:41 PM
    The things that I imagine are quite complicated
  • b

    Barros

    08/10/2022, 9:42 PM
    I think the best way is subclassing too but maybe there was something already done
  • b

    Barros

    08/10/2022, 9:46 PM
    I will do something and if I have problems I'll ask here again. Thank you @datajoely
  • m

    Matthias Roels

    08/11/2022, 12:50 PM
    Quick question: in mid-June, Spark 3.3 was released. I noticed that Kedro 0.18.x is not yet compatible with this release because of the strict Delta Lake 1.x dependency. Will that be resolved soon?
  • d

    datajoely

    08/11/2022, 12:57 PM
    Thanks for raising this - I've just raised a PR to resolve this so it will be in the next version.
  • j

    javier.16

    08/11/2022, 2:50 PM
    We use Kedro for all of our use cases, and we are assessing whether to add a feature store like Feast to manage all the features that we use. Has anyone had experience integrating the Kedro data catalog for getting historical features and online features, using the same dataset in the catalog? Or would it take a big development effort on the Kedro datasets side?
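    For the historical-features side, one conceivable shape is a small read-only custom dataset wrapping Feast's offline retrieval, so a catalog entry only needs a feature repo path and a feature list. This is just a sketch under those assumptions, not an existing integration; online features and entity-dataframe handling would need more design work.
    from typing import Any, Dict, List

    import pandas as pd
    from feast import FeatureStore
    from kedro.io import AbstractDataSet


    class FeastHistoricalFeaturesDataSet(AbstractDataSet):
        """Hypothetical read-only dataset returning historical features from Feast."""

        def __init__(self, repo_path: str, features: List[str], entity_sql: str):
            self._repo_path = repo_path
            self._features = features
            # a SQL entity dataframe only works for some offline stores;
            # a pandas entity_df would be needed otherwise
            self._entity_sql = entity_sql

        def _load(self) -> pd.DataFrame:
            store = FeatureStore(repo_path=self._repo_path)
            job = store.get_historical_features(
                entity_df=self._entity_sql, features=self._features
            )
            return job.to_df()

        def _save(self, data: pd.DataFrame) -> None:
            raise NotImplementedError("Historical feature retrieval is read-only")

        def _describe(self) -> Dict[str, Any]:
            return {"repo_path": self._repo_path, "features": self._features}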
  • a

    antheas

    08/12/2022, 11:08 PM
    It's a research prototype for data synthesis. I will keep it in mind when I get close to publishing the source code
  • a

    antheas

    08/12/2022, 11:19 PM
    You can use the anchor/merge (extension) syntax for YAML files to define your datasets:
    _dataset_csv: &dataset_csv
      type: pandas.CSVDataSet
      layer: raw
      filepath: ""
    
    dataset_2021_2:
      <<: *dataset_csv
      filepath: ${base_location}/raw/x/t/z.csv.gz
    Then you can instantiate your pipeline with a node that has as an input a dictionary {n: n for n in your_dataset_names}. You can even use jinja2 to template it. After that you can just dump everything in a parquet file, assuming it fits in RAM. If you don't like the fact that this causes duplication (you need to define your dataset names both in your catalog and in your code), then you can instantiate your datasets using the after_catalog_created hook in Python. Of course, this assumes that you don't need the partitioned features like lazy loading.

    I did check out the PartitionedDataSet class; it's just ~300 lines. What you want should be manageable by overriding __init__, adding a list param for your datasets and then calling the super().__init__() method. Then you can use that list in _list_partitions() to return your files instead of a directory listing.
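    Roughly, that subclass could look like the sketch below. The class and argument names are made up, and the private attributes and methods (_normalized_path, _filesystem, _list_partitions) follow kedro 0.18.x internals, so they may differ between versions:
    from typing import Any, Dict, List

    from kedro.io import PartitionedDataSet


    class FixedFilesDataSet(PartitionedDataSet):
        """Hypothetical PartitionedDataSet that loads a fixed list of filenames."""

        def __init__(self, path: str, dataset: Any, filenames: List[str], **kwargs):
            super().__init__(path=path, dataset=dataset, **kwargs)
            self._filenames = filenames

        def _list_partitions(self) -> List[str]:
            # build full paths for the pre-determined files instead of
            # listing the directory, and fail loudly if any are missing
            partitions = [f"{self._normalized_path}/{name}" for name in self._filenames]
            missing = [p for p in partitions if not self._filesystem.exists(p)]
            if missing:
                raise FileNotFoundError(f"Missing partitions: {missing}")
            return partitions
    The load behaviour (a dictionary of partition ids to load callables) then stays the same as the stock PartitionedDataSet.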
  • a

    antheas

    08/12/2022, 11:22 PM
    or you could just stick your files in their own directory and rename them so PartitionedDataSet works without changes 🤷‍♂️