Kedro #beginners-need-help

Matheus Serpa

11/10/2021, 1:34 PM

I get it. Thanks for your support @User. I have another question (if you don't mind 🙂). What would be good practice for loading the data from processed_semente_table into semente_table (which is a SQL table)? In summary, the steps are: 1) read a CSV file with semente_data; 2) read semente_table from the SQL DB; 3) remove duplicates (comparing the CSV with the SQL DB); 4) insert the new data into semente_table. We also tried the following: instead of reading semente_table in step 2, we read a semente_query with only the columns used to detect duplicates, which eliminates the semente_table cycle.

datajoely

11/10/2021, 2:12 PM

Creating a thread since the convo continued in the main channel

datajoely

11/10/2021, 2:12 PM

Are duplicates defined by an ID column? I think we may want to do some sort of UPSERT operation here?

datajoely

11/10/2021, 2:21 PM

And also, is there a large amount of data in the table?

Matheus Serpa

11/10/2021, 2:25 PM

They are defined by a combination of one varchar column and two FK ID columns. For example:

CSV with semente_data after some merge/join and transformations:

name; value1; value2; fk1_id; fk2_id
95R51; 30; 5; 1; 1
95Y72; 20; 5; 1; 1
96Y90; 10; 3; 2; 1

SQL with the semente_table data needed to check for duplicates:

name; fk1_id; fk2_id
95R51; 1; 1
CD 202; 3; 2
...

So the conclusion is that we should insert/load:

95Y72; 20; 5; 1; 1
96Y90; 10; 3; 2; 1

and we could either update or do nothing with:

95R51; 30; 5; 1; 1
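The duplicate check described in this message can be sketched in pandas as an anti-join on the three key columns. This is only an illustration built from the sample rows in the message; the function name `select_new_rows` and the variable names are made up for the example, and it assumes the key columns have matching dtypes on both sides.

```python
import pandas as pd

# Columns that together identify a duplicate, as described in the message.
KEY_COLUMNS = ["name", "fk1_id", "fk2_id"]


def select_new_rows(csv_df: pd.DataFrame, existing_keys: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of csv_df whose key is not present in existing_keys."""
    merged = csv_df.merge(
        existing_keys[KEY_COLUMNS].drop_duplicates(),
        on=KEY_COLUMNS,
        how="left",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    # Keep only rows found in the CSV but not in the SQL table,
    # and drop the helper "_merge" column again.
    new_rows = merged.loc[merged["_merge"] == "left_only", list(csv_df.columns)]
    return new_rows.reset_index(drop=True)


# Sample rows copied from the message above.
semente_data = pd.DataFrame(
    {
        "name": ["95R51", "95Y72", "96Y90"],
        "value1": [30, 20, 10],
        "value2": [5, 5, 3],
        "fk1_id": [1, 1, 2],
        "fk2_id": [1, 1, 1],
    }
)
semente_keys = pd.DataFrame(
    {"name": ["95R51", "CD 202"], "fk1_id": [1, 3], "fk2_id": [1, 2]}
)

new_rows = select_new_rows(semente_data, semente_keys)
print(new_rows)
```

Only the key columns of the SQL table are needed here, which matches the semente_query idea mentioned earlier in the thread.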

Matheus Serpa

11/10/2021, 2:26 PM

Less than a thousand records; in the near future, less than five thousand.
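For a table of this size, the UPSERT idea raised above could look roughly like the sketch below. It uses an in-memory SQLite database purely so the example is runnable; the thread does not say which database is actually used, and the `ON CONFLICT` syntax differs between engines. The unique constraint on the three key columns and the placeholder `value1`/`value2` values for the existing rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: the three duplicate-defining columns carry a UNIQUE constraint.
conn.execute(
    """
    CREATE TABLE semente_table (
        name   TEXT    NOT NULL,
        value1 INTEGER,
        value2 INTEGER,
        fk1_id INTEGER NOT NULL,
        fk2_id INTEGER NOT NULL,
        UNIQUE (name, fk1_id, fk2_id)
    )
    """
)

# Rows already in the table. The keys come from the thread; the value
# columns (0, 0) are placeholders, since the thread does not show them.
existing = [("95R51", 0, 0, 1, 1), ("CD 202", 0, 0, 3, 2)]
# Rows from the CSV, copied from the thread.
incoming = [("95R51", 30, 5, 1, 1), ("95Y72", 20, 5, 1, 1), ("96Y90", 10, 3, 2, 1)]

insert_sql = "INSERT INTO semente_table VALUES (?, ?, ?, ?, ?)"
# Insert new keys; for keys that already exist, update the value columns.
# Replacing DO UPDATE ... with DO NOTHING gives the "do nothing" variant.
upsert_sql = (
    insert_sql
    + " ON CONFLICT (name, fk1_id, fk2_id)"
    + " DO UPDATE SET value1 = excluded.value1, value2 = excluded.value2"
)

with conn:  # commits on success, rolls back on error
    conn.executemany(insert_sql, existing)
    conn.executemany(upsert_sql, incoming)

rows = conn.execute(
    "SELECT name, value1, value2 FROM semente_table ORDER BY name"
).fetchall()
print(rows)
```

With an upsert the database handles the duplicate check itself, so the pipeline would not need to read semente_table back before writing.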

datajoely

11/10/2021, 2:26 PM

so going back to this image

datajoely

11/10/2021, 2:27 PM

semente_table is a pandas.SQLDataSet?

Matheus Serpa

11/10/2021, 2:27 PM

yes

datajoely

11/10/2021, 2:27 PM

Okay so I think I'm going to suggest something a bit funny, but it will make sense from a Kedro point of view

datajoely

11/10/2021, 2:30 PM

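yaml
# Read side: the existing rows used for the duplicate check
original_semente_dataset:
  type: pandas.SQLDataSet
  ...

# Write side: the same table, declared under a second name
target_semente_dataset:
  type: pandas.SQLDataSet
  ...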

datajoely

11/10/2021, 2:31 PM

I would duplicate the definition of the dataset. Both entries point at the same table, but Kedro will think they are two different sources/targets

datajoely

11/10/2021, 2:31 PM

you then use target_semente_dataset downstream so that the rest of Kedro can sort the execution correctly
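Why the second name helps can be shown with a toy model of ordering steps by dataset name. The sketch below uses Python's standard `graphlib`, not Kedro's own resolver, and the `report` step and `summary` dataset are hypothetical names added for illustration. A step that reads and writes the same dataset name depends on itself, so no execution order exists; with two names the order is unambiguous.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(nodes):
    """Order steps so that every dataset is produced before it is consumed.

    nodes maps a step name to (input dataset names, output dataset names).
    """
    # Which step produces each dataset?
    producer = {ds: name for name, (_, outputs) in nodes.items() for ds in outputs}
    # A step depends on the producers of its inputs (raw inputs have none).
    graph = {
        name: {producer[ds] for ds in inputs if ds in producer}
        for name, (inputs, _) in nodes.items()
    }
    return list(TopologicalSorter(graph).static_order())


# One catalog entry: the step reads and writes "semente_table".
single_name = {
    "insert_new_rows": (["semente_data", "semente_table"], ["semente_table"]),
    "report": (["semente_table"], ["summary"]),
}
try:
    execution_order(single_name)
    single_name_result = "ordered"
except CycleError:
    single_name_result = "cycle"

# Two catalog entries for the same physical table.
two_names = {
    "insert_new_rows": (
        ["semente_data", "original_semente_dataset"],
        ["target_semente_dataset"],
    ),
    "report": (["target_semente_dataset"], ["summary"]),
}
order = execution_order(two_names)

print(single_name_result, order)
```

Anything that must run after the insert takes target_semente_dataset as its input, which is what places it after the write step.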

datajoely

11/10/2021, 2:31 PM

does that make sense?

Matheus Serpa

11/10/2021, 2:32 PM

yesss it does

Matheus Serpa

11/10/2021, 2:32 PM

we implemented something similar

Matheus Serpa

11/10/2021, 2:32 PM

that's great

datajoely

11/10/2021, 2:32 PM

💪

datajoely

11/10/2021, 2:32 PM

nice!

Matheus Serpa

11/10/2021, 2:36 PM

Thank you for your support @User. I'll probably have more questions in the future 😆 Arnaldo from Wildlife showed Kedro to us during a Google for Startups Accelerator mentoring session. We've been in love with Kedro; it is helping us a lot in migrating our data pipelines to Airflow@GCP

datajoely

11/10/2021, 2:37 PM

Oh amazing @User 🙏 thanks for spreading the word!

datajoely

11/10/2021, 2:37 PM

and no problem @User - happy to help!

Matheus Serpa

11/10/2021, 2:44 PM

thanks! Best,