beginners-need-help
  • datajoely
    02/18/2022, 3:53 PM
    or use Plotly because it's way easier
  • Isaac89
    02/18/2022, 3:56 PM
    ok, thanks!
  • Isaac89
    02/21/2022, 10:19 AM
    Hi! Is there any way to save the catalog to config? I saw the from_config method, but no to_config one. Thanks!
  • datajoely
    02/21/2022, 10:21 AM
    I don't think this exists today - it's something we've thought about before
  • Isaac89
    02/21/2022, 10:25 AM
    I think it would be nice to be able to save the state of the catalog being used, especially if something is modified by the hooks.
  • datajoely
    02/21/2022, 10:26 AM
    Yeah, not many people have asked for it, so we haven't built it before. It feels quite doable.
  • antony.milne
    02/21/2022, 12:54 PM
    You could mimic something like this with:
    ```python
    datasets = {dataset: dataset._describe() for dataset in catalog.datasets}
    yaml.dump(datasets)
    ```
    This won't give exactly the reverse of from_config, but it will go some of the way. Possibly str(dataset) might be useful too.
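    (Editorial aside: a runnable variant of the sketch above, assuming a standard DataCatalog. Note that _get_dataset() and _describe() are private Kedro APIs, so this may break between versions; keying by dataset name keeps the result YAML-serializable.)
    ```python
    import yaml

    # Best-effort "to_config": map each dataset name to its description.
    # str() keeps yaml.dump from choking on non-primitive values (e.g. paths).
    datasets = {
        name: {key: str(value) for key, value in catalog._get_dataset(name)._describe().items()}
        for name in catalog.list()
    }
    print(yaml.dump(datasets, default_flow_style=False))
    ```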
  • Isaac89
    02/21/2022, 1:12 PM
    Thanks for your suggestion! I will definitely try it today.
  • RRoger
    02/22/2022, 4:54 AM
    Happy 2sday everyone! I have a SQL YAML question. One of the items in my parameters.yml is a long, complicated SQL string:
    ```yaml
    sql_param: >
      SELECT
      ...
    ```
    The weird thing is that running it through the node produces errors. If I copy and paste the SQL string into a DB tool, it runs fine. Is YAML or Kedro treating some characters in the SQL script differently? Is there a better way to make a node run SQL? Can I use pandas.SQLQueryDataSet even though the query is not supposed to return anything?
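    (Editorial note: a likely culprit is the block-scalar style. > is a folded scalar, so YAML joins the lines with single spaces; anything line-sensitive in the SQL, most commonly a -- comment, then swallows the rest of the folded statement. A literal block, |, preserves newlines:)
    ```yaml
    # Folded scalar: newlines become spaces, so the "-- note" comment
    # comments out everything after it on the single folded line.
    sql_folded: >
      SELECT id, name  -- note
      FROM users
      WHERE active = true

    # Literal block: newlines are preserved, so the comment ends at the
    # end of its own line and the rest of the query survives.
    sql_literal: |
      SELECT id, name  -- note
      FROM users
      WHERE active = true
    ```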
  • grizzlyfederica
    02/22/2022, 8:53 AM
    Hi folks. I have some 50 GB of data in a data warehouse (Redshift). I'd like the initial processing (e.g. raw -> primary) to be done IN the warehouse, to avoid heavy I/O out of the warehouse just to run simple SQL queries in Python. How would one best handle such a scenario? I can see two approaches: 1. run the SQL query in a Kedro node with a dummy input and output to put it in the right place in the DAG; 2. run the SQL query outside of Kedro, e.g. in an orchestrator like Airflow, and do SQL -> kedro_pipeline().
  • datajoely
    02/22/2022, 9:33 AM
    So this isn't an area where Kedro excels. In general we only have one decent way of doing remote execution on a SQL database, and that's via Spark and its predicate-pushdown features. This isn't ideal in all cases because it adds overhead, but it's the most Pythonic way of doing things. Option 1 unfortunately happens via our pandas datasets, which are sub-optimal for big datasets. Option 2 feels like a better solution - perhaps it's even a chance to use dbt for the munging and Kedro for the parts that need to live in Python.
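    (Editorial sketch of option 1, with hypothetical names: a node that executes a statement inside the warehouse via SQLAlchemy and returns a dummy marker so Kedro can place it in the DAG. An illustration, not a built-in Kedro pattern.)
    ```python
    import sqlalchemy

    def run_in_warehouse(dummy_input: str, connection_string: str, sql: str) -> str:
        """Execute `sql` in the warehouse; no data leaves the database."""
        engine = sqlalchemy.create_engine(connection_string)
        with engine.begin() as conn:  # begin() commits on success
            conn.execute(sqlalchemy.text(sql))
        return "done"  # dummy output so downstream nodes depend on this step
    ```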
  • williamc
    02/22/2022, 8:54 PM
    Hi, if I try to specify a custom separator (tabs, so '\t') for a pandas.CSVDataSet, I get a "delimiter" must be a 1-character string error. Is there another way to accomplish this?
  • datajoely
    02/22/2022, 9:00 PM
    Can you post your YAML?
  • williamc
    02/22/2022, 9:04 PM
    ```yaml
    movie_titles:
      type: pandas.CSVDataSet
      filepath: s3:${s3_bucket}data/07_model_output/tars_movie_titles.tsv
      save_args:
        index: False
        header: True
        sep: '\t'
      versioned: True
    ```
  • datajoely
    02/22/2022, 9:06 PM
    Can you try removing the quotes from the sep argument?
  • williamc
    02/22/2022, 9:44 PM
    It still fails with the same error
  • datajoely
    02/22/2022, 10:00 PM
    Okay, worked it out - it's a YAML escaping thing:
    ```yaml
    sep: "\t"
    ```
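    (Editorial note on why this works: in YAML, single quotes keep '\t' as the two literal characters backslash + t, while double quotes process escape sequences, so "\t" is a real tab. The corrected save_args from the entry above:)
    ```yaml
    save_args:
      index: False
      header: True
      sep: "\t"    # double quotes: YAML expands \t to an actual tab character
    ```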
  • williamc
    02/22/2022, 10:03 PM
    Awesome, thanks!
  • Edak
    02/23/2022, 4:38 AM
    This may be a silly question, but when writing a test suite, how do you go about generating input data for said tests? My current engagement is dealing with a large dataset that takes a couple of minutes to load into memory, so using the actual data isn't a good option. Are there any good examples of projects that do this well when the source data is too large?
  • deepyaman
    02/23/2022, 5:57 AM
    @User For what sort of tests? If we're talking unit tests, it makes sense to hand-craft the input data. On the other hand, for end-to-end pipeline tests, consider using a data mocker to generate fake data (if you have things like data security restrictions), or simply subsampling your source data. Note that creating test data can be a pretty big challenge in and of itself, depending on how many tables you need to mock and how realistic they need to be.
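    (Editorial sketch of the hand-crafted unit-test approach, with a hypothetical node function aggregate_sales: a few rows built by hand stand in for the real, slow-to-load dataset.)
    ```python
    import pandas as pd

    def aggregate_sales(sales: pd.DataFrame) -> pd.DataFrame:
        """Sum the amount column per region (the node under test)."""
        return sales.groupby("region", as_index=False)["amount"].sum()

    def test_aggregate_sales():
        # Three hand-written rows exercise the grouping logic just as well
        # as the multi-gigabyte production table would.
        sales = pd.DataFrame({"region": ["EU", "EU", "US"], "amount": [1.0, 2.0, 5.0]})
        result = aggregate_sales(sales)
        assert result.loc[result["region"] == "EU", "amount"].item() == 3.0
        assert result.loc[result["region"] == "US", "amount"].item() == 5.0
    ```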
  • datajoely
    02/23/2022, 8:17 AM
    I would also note that there are few cases where an enormous dataset in a test tells you anything about the correctness of your code.
  • potterhead
    02/23/2022, 12:00 PM
    A pipeline returns a dictionary of dataframes, where the key is the identifier/name of the value (a pandas dataframe). I wish to write the dataframes to CSV. How should I go about it?
  • datajoely
    02/23/2022, 12:01 PM
    Hello, are you using the Kedro code API, or the project template, which exposes the YAML API and allows you to save catalog entries?
  • potterhead
    02/23/2022, 12:04 PM
    I was looking for a way to specify the output in a catalog.yml.
  • datajoely
    02/23/2022, 12:05 PM
    Ah, okay - we have lots of examples in the docs, but essentially you need to add a catalog entry with the key you want to save.
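    (Editorial illustration, with hypothetical names: a plain entry persists a single dataframe output, while kedro.io.PartitionedDataSet suits a node that returns a dict of dataframes, saving each key as its own CSV.)
    ```yaml
    # Plain entry: saves a single dataframe output named "movie_scores".
    movie_scores:
      type: pandas.CSVDataSet
      filepath: data/07_model_output/movie_scores.csv

    # Partitioned entry: for a node returning {name: dataframe},
    # each dict key becomes its own file under `path`.
    scored_tables:
      type: PartitionedDataSet
      path: data/07_model_output/scored_tables
      dataset: pandas.CSVDataSet
      filename_suffix: ".csv"
    ```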
  • datajoely
    02/23/2022, 12:05 PM
    I'll post the catalog docs, but I'd highly recommend following the tutorial project, as this is covered there.
  • datajoely
    02/23/2022, 12:06 PM
    https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html
    https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html
  • potterhead
    02/23/2022, 12:06 PM
    I'm in the process of going through the tutorials. Thanks!
  • datajoely
    02/23/2022, 12:06 PM
    Good luck!
  • pypeaday
    02/23/2022, 7:10 PM
    To start a conversation about credentials management - a global file or cloud-native support - would it be best to do it here, or keep it localized to the GitHub issues I just opened?