advanced-need-help
  • jaweiss2305 (02/26/2022, 1:06 PM)
    Anyone use Kedro and Airflow, successfully?
  • datajoely (02/26/2022, 2:38 PM)
    Have you tried using the Kedro-Airflow plugin? It was co-developed with Astronomer.
  • jaweiss2305 (02/26/2022, 3:43 PM)
    I did. I even purchased Astrocloud. I was curious whether anyone else has used it and what success they had.
  • Arnaldo (02/26/2022, 4:51 PM)
    I used it, @User. It worked nicely!
  • jaweiss2305 (02/26/2022, 4:56 PM)
    Are you using it with Astrocloud? Do you have any sample code, by chance?
  • Arnaldo (02/26/2022, 4:57 PM)
    No, I'm not using it with Astrocloud.
  • datajoely (02/26/2022, 7:30 PM)
    Any feedback on the experience, positive or negative, is always useful.
  • FelicioV (03/02/2022, 5:42 PM)
    Hello, I'm trying to implement a few pipelines in Kedro 0.17.7 that have a lot of inputs of moderate complexity. It can be summarised roughly as reading a few dozen sheets across a few hundred Excel spreadsheets. To do so, I'm using `PartitionedDataSet`s with `pandas.ExcelDataSet` and specifying `load_args` such as `sheet_name`, `names` and `dtype`. It works like a charm, but I'm worried about the size of `catalog/ingest.yml`. I've been searching for a way to split that catalog yml into a few files, maybe along business-oriented segments, but I have had no luck with it. Is there an intended way to do such a thing? If no intended way is implemented, I've been thinking (not really tried it, though) of messing with the `register_catalog` hook on the `ProjectHooks` class. Am I making any sense? Thanks!
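    A minimal sketch of the `register_catalog` route mentioned above, assuming Kedro 0.17.x defaults (class and file names are illustrative). In 0.17.x the default catalog config patterns already include `catalog*` and `catalog*/**`, so splitting the catalog into `conf/base/catalog/ingest.yml`, `conf/base/catalog/reporting.yml`, and so on should be picked up without any custom code; the hook below is essentially the stock implementation, which builds one `DataCatalog` from the merged config:

    ```python
    # A rough sketch, not the official recipe: Kedro 0.17.x registration hook that
    # builds the catalog from whatever the ConfigLoader collected. Every YAML file
    # matching the catalog patterns (e.g. conf/base/catalog/ingest.yml,
    # conf/base/catalog/reporting.yml) is merged into the `catalog` dict passed in,
    # so one file per business segment works while still producing a single catalog.
    from kedro.framework.hooks import hook_impl
    from kedro.io import DataCatalog
    from kedro.versioning import Journal


    class ProjectHooks:
        @hook_impl
        def register_catalog(
            self,
            catalog,          # dict merged from all matching catalog YAML files
            credentials,
            load_versions,
            save_version,
            journal: Journal,
        ) -> DataCatalog:
            return DataCatalog.from_config(
                catalog, credentials, load_versions, save_version, journal
            )
    ```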
  • FelicioV (03/02/2022, 6:07 PM)
    It just occurred to me that maybe this question belongs in beginners-need-help?
  • datajoely (03/02/2022, 6:08 PM)
    managing catalog complexity
  • waylonwalker (03/03/2022, 4:32 PM)
    Does anyone have a good Kedro DataSet for chunked sql queries? I'm imagining something that will save all the chunks as individual outputs.
  • datajoely (03/03/2022, 4:57 PM)
    So do you want to save as chunks or read as chunks? The `chunksize` argument of `pd.read_sql_table` should work in `load_args`.
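    For what it's worth, a minimal sketch of the load side, with a made-up table name and connection string. One caveat worth flagging: when `chunksize` is set, `pd.read_sql_table` returns an iterator of DataFrames rather than a single DataFrame, so that is what the downstream node will receive:

    ```python
    # Illustrative only: the table name and connection string are placeholders.
    from kedro.extras.datasets.pandas import SQLTableDataSet

    chunked_input = SQLTableDataSet(
        table_name="my_table",                   # hypothetical table
        credentials={"con": "sqlite:///my.db"},  # hypothetical SQLAlchemy URL
        load_args={"chunksize": 10_000},
    )

    # With chunksize set, load() yields an iterator of DataFrames, one per chunk.
    for chunk in chunked_input.load():
        print(chunk.shape)
    ```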
  • datajoely (03/03/2022, 4:58 PM)
    In terms of saving, I can't think of an easy way of doing that to SQL, but it's possible via `PartitionedDataSet` and a file system.
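    A minimal sketch of that `PartitionedDataSet` idea, with illustrative paths and names: a node returns a dict of partition id to DataFrame, and each entry is written out as its own parquet file:

    ```python
    # Illustrative only: the path, naming scheme and chunk source are assumptions.
    from kedro.extras.datasets.pandas import ParquetDataSet
    from kedro.io import PartitionedDataSet

    chunked_output = PartitionedDataSet(
        path="data/02_intermediate/my_table_chunks",  # could equally be an s3:// path
        dataset=ParquetDataSet,
        filename_suffix=".parquet",
    )


    def to_partitions(chunks):
        """Turn an iterable of DataFrame chunks into {partition_id: DataFrame}."""
        return {f"part_{i:05d}": chunk for i, chunk in enumerate(chunks)}


    # e.g. with chunks coming from a chunked SQL read:
    # chunked_output.save(to_partitions(chunked_input.load()))
    ```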
  • waylonwalker (03/03/2022, 5:08 PM)
    save as chunks
  • datajoely (03/03/2022, 5:08 PM)
    to the same table?
  • waylonwalker (03/03/2022, 5:08 PM)
    Oh wait: reading as a table and saving parquet chunks would be nice, but the whole thing might be OK too.
  • datajoely (03/03/2022, 5:09 PM)
    But it also looks like `chunksize` is an argument of `DataFrame.to_sql`, so you can use it in `pandas.SQLTableDataSet`.
  • jaweiss2305 (03/03/2022, 6:25 PM)
    I tried to use the chunksize and couldn't figure it out (using pandas.SQLTableDataSet). I would be interested in seeing a hello world code snippet once you get it to work.
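    A minimal, untested sketch of what that hello world might look like, with placeholder table name and connection string; `pandas.SQLTableDataSet` passes `save_args` through to `DataFrame.to_sql`, which accepts `chunksize`:

    ```python
    # Illustrative only: table name and connection string are placeholders.
    from kedro.extras.datasets.pandas import SQLTableDataSet

    chunked_sql_output = SQLTableDataSet(
        table_name="my_output_table",
        credentials={"con": "sqlite:///my.db"},
        save_args={"chunksize": 10_000, "if_exists": "append"},
    )

    # chunked_sql_output.save(df)  # df is inserted in batches of 10,000 rows
    ```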
  • waylonwalker (03/03/2022, 6:26 PM)
    I was asking for a friend, I am not sure if they are going to go with this solution or not.
  • williamc (03/04/2022, 11:16 PM)
    I've been debugging this issue for a couple of days now, and I'm about to lose my sanity 😁 So, I'm dealing with Spark dataframes and TensorFlow. To have them talk, I usually save my dataframes as csv and then read them into a TensorFlow dataset with a call to `tf.data.experimental.make_csv_dataset`. In this particular case I have a node at the end of one of my pipelines saving the dataframe to S3 (`spark.SparkDataSet`), and I have written a custom dataset (essentially copied most of the code from `TensorFlowModelDataset`) that does the reading at the beginning of the next pipeline. The maddening issue I haven't been able to solve is that, if I run both pipelines with the `--from-nodes` option, the run fails because my call to `self._fs.get()` returns an empty result. I have verified that the dataframe is being correctly written to my S3 bucket, but a call to `self._fs.ls(load_path)` comes back empty as well. If, after my failed run, I run just the second pipeline, everything works as expected: `self._fs.get()` returns my csv files and I'm able to load my data into a TF dataset and train my model without issue. Does anybody have any idea what I'm doing wrong?
  • williamc (03/04/2022, 11:16 PM)
    For the sake of completeness, here's my implementation of the `_load` method:

    ```python
    def _load(self) -> tf.data.Dataset:
        logger = logging.getLogger('TensorFlowCSVDataSet')
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        logger.info(f'remote path: {load_path}')
        logger.info(self._fs.ls(load_path))
        logger.info(f'remote path contents: {self._fs.ls(load_path)}')
        self._tmp_data_dir = tempfile.TemporaryDirectory(prefix=self._tmp_prefix)
        logger.info(f'local path: {self._tmp_data_dir.name}')
        self._fs.get(load_path + '/*.parquet', self._tmp_data_dir.name + '/', recursive=True)
        ds_dir = list(Path(self._tmp_data_dir.name).iterdir())
        logger.info(f'local path contents: {ds_dir}')
        ds = (tf.data.experimental
              .make_csv_dataset(file_pattern=f'{self._tmp_data_dir.name}/*.csv',
                                **self._load_args)).unbatch()
        return ds
    ```
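    Purely a guess from the symptoms described above (empty listing within the same run, fine on a fresh run): fsspec/s3fs caches directory listings per filesystem instance, so a listing taken before Spark finished writing can stick around. A cheap thing to try inside `_load`, sketched standalone here with a made-up bucket path:

    ```python
    # A sketch only, not a confirmed fix: clear fsspec's cached listing before ls/get.
    import fsspec

    fs = fsspec.filesystem("s3")              # same protocol the dataset resolves to
    load_path = "my-bucket/path/to/csv_dir"   # hypothetical path

    fs.invalidate_cache(load_path)            # drop any cached listing for this path
    print(fs.ls(load_path, refresh=True))     # refresh=True is s3fs-specific
    ```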
  • williamc (03/04/2022, 11:16 PM)
    Thanks in advance
  • avan-sh (03/05/2022, 1:26 AM)
    Just to debug the issue, can you add an "if empty, sleep a second or two and try again" block? I don't see any reason this should happen, as they added strong consistency for S3 more than a year ago.
  • williamc (03/05/2022, 3:27 AM)
    I went one step further and added a while loop to wait for `self._fs.ls(load_path)` to find something in the S3 bucket where my csv dataframe is, but no luck.
  • datajoely (03/05/2022, 11:31 AM)
    Hi @User, this looks really confusing. Two ideas come to mind: (1) use a breakpoint to inspect the objects at write time and verify that things look correct; (2) a quick Google suggests this library may be worth exploring: https://petastorm.readthedocs.io/en/latest/index.html#
  • williamc (03/05/2022, 12:14 PM)
    Hi, I've held off on Petastorm because in their examples you've got to work within a context manager (https://petastorm.readthedocs.io/en/latest/readme_include.html#spark-dataset-converter-api), while I'd like to encapsulate the loading logic inside a custom dataset and cleanly return the resulting `tf.data.Dataset` object. According to their docs, "when exiting the context, the reader of the dataset will be closed". Re breakpoints: unfortunately I'm working with an old version of JupyterLab and can't readily update it or install plugins. I'd rather use VS Code, but I've had some trouble setting up the SSH + Docker integration (my dev env is a Docker container running on an EC2 instance). I'll keep trying things to isolate the error further. Thanks for the pointers!
  • datajoely (03/05/2022, 12:15 PM)
    Can you use the old-fashioned pdb debugger or the `breakpoint()` syntax?
  • williamc (03/05/2022, 12:18 PM)
    I'm gonna try it out, thanks
  • Deep (03/07/2022, 1:41 PM)
    Hey guys. I'm new here. Need some help.
  • Deep (03/07/2022, 1:42 PM)
    So I'm saving this Spark dataset via Kedro and it's getting saved in DBFS. But the problem is that it's getting saved as a FOLDER and not a file.