#beginners-need-help

Matheus Serpa

11/10/2021, 1:34 PM
I get it. Thanks for your support @User. I have another question (if you don't mind 🙂). What would be good practice for loading the data from processed_semente_table into semente_table (which is a SQL table)? In summary, the steps are:
1) read a CSV file with semente_data;
2) read semente_table from the SQL DB;
3) remove duplicates (comparing the CSV with the SQL DB);
4) insert the new data into semente_table.
We also tried the following: instead of reading semente_table in step 2, we read a semente_query with only the columns used to detect duplicates, which eliminates the semente_table cycle.

datajoely

11/10/2021, 2:12 PM
Creating a thread since the convo continued in the main channel
Are duplicates defined by an ID column? I think we may want to do some sort of UPSERT operation here?
And also, is there a large amount of data in the table?
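The UPSERT idea mentioned above can be sketched with Python's standard-library sqlite3 module. This is only an illustration under assumptions: the in-memory database, the column types, the pre-existing row values, and the UNIQUE constraint on (name, fk1_id, fk2_id) are all hypothetical, and the exact UPSERT syntax differs between database engines.

```python
import sqlite3

# Hypothetical schema: the UNIQUE constraint is what lets the database
# detect a conflicting (duplicate) row on insert.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE semente_table (
        name   TEXT    NOT NULL,
        value1 INTEGER,
        value2 INTEGER,
        fk1_id INTEGER NOT NULL,
        fk2_id INTEGER NOT NULL,
        UNIQUE (name, fk1_id, fk2_id)
    )
    """
)

# One row already in the table (illustrative values).
conn.execute("INSERT INTO semente_table VALUES ('95R51', 0, 0, 1, 1)")

# Incoming rows: the first collides with the existing key, the second is new.
incoming = [
    ("95R51", 30, 5, 1, 1),
    ("95Y72", 20, 5, 1, 1),
]

# Insert new keys; on a key collision, update the value columns instead.
# Replacing "DO UPDATE SET ..." with "DO NOTHING" would skip duplicates.
conn.executemany(
    """
    INSERT INTO semente_table (name, value1, value2, fk1_id, fk2_id)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (name, fk1_id, fk2_id)
    DO UPDATE SET value1 = excluded.value1, value2 = excluded.value2
    """,
    incoming,
)
conn.commit()

rows = conn.execute(
    "SELECT name, value1, value2, fk1_id, fk2_id "
    "FROM semente_table ORDER BY name"
).fetchall()
```

With an UPSERT the database itself resolves duplicates, so the pipeline would not need to read the table back first. Whether that fits depends on the target engine and on being able to add a unique constraint.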

Matheus Serpa

11/10/2021, 2:25 PM
They are defined by a combination of one varchar column and two FK ID columns. For example:

CSV with semente_data after some merge/join and transformations:
name; value1; value2; fk1_id; fk2_id
95R51; 30; 5; 1; 1
95Y72; 20; 5; 1; 1
96Y90; 10; 3; 2; 1

SQL with the semente_table data needed to check for duplicates:
name; fk1_id; fk2_id
95R51; 1; 1
CD 202; 3; 2
...

So, the conclusion is that we should insert/load:
95Y72; 20; 5; 1; 1
96Y90; 10; 3; 2; 1

and we could either update or do nothing with:
95R51; 30; 5; 1; 1
Less than a thousand records; in the near future, less than five thousand.
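With a table this small, the duplicate check on the three key columns can be done in memory. The sketch below uses a pandas left merge with `indicator=True` as an anti-join on the example rows from the message above. The function name `select_new_rows` and the constant `KEY_COLUMNS` are made up for illustration, and the existing-keys frame stands in for whatever the SQL read returns.

```python
import pandas as pd

# Columns that together define a duplicate (one varchar plus two FK IDs).
KEY_COLUMNS = ["name", "fk1_id", "fk2_id"]


def select_new_rows(csv_df: pd.DataFrame, existing_keys: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of csv_df whose key is not present in existing_keys."""
    merged = csv_df.merge(
        existing_keys[KEY_COLUMNS].drop_duplicates(),
        on=KEY_COLUMNS,
        how="left",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    # "left_only" marks rows found in the CSV but not in the SQL table.
    new_rows = merged.loc[merged["_merge"] == "left_only", csv_df.columns]
    return new_rows.reset_index(drop=True)


# Example data copied from the message above.
semente_data = pd.DataFrame(
    {
        "name": ["95R51", "95Y72", "96Y90"],
        "value1": [30, 20, 10],
        "value2": [5, 5, 3],
        "fk1_id": [1, 1, 2],
        "fk2_id": [1, 1, 1],
    }
)
existing = pd.DataFrame(
    {
        "name": ["95R51", "CD 202"],
        "fk1_id": [1, 3],
        "fk2_id": [1, 2],
    }
)

new_rows = select_new_rows(semente_data, existing)
```

Here `new_rows` holds the 95Y72 and 96Y90 rows, matching the conclusion above, while 95R51 is dropped as a duplicate. This corresponds to the "do nothing" choice for existing rows; updating them would need a separate step.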

datajoely

11/10/2021, 2:26 PM
so going back to this image: is semente_table defined as pandas.SQLDataSet?

Matheus Serpa

11/10/2021, 2:27 PM
yes

datajoely

11/10/2021, 2:27 PM
Okay so I think I'm going to suggest something a bit funny, but it will make sense from a Kedro point of view
```yaml
original_semente_dataset:
  type: pandas.SQLDataSet
  ...

target_semente_dataset:
  type: pandas.SQLDataSet
  ...
```
I would duplicate the definition of the dataset, but Kedro will think it's two different source/targets
you then use target_semente_dataset downstream so that the rest of Kedro can sort the execution correctly
does that make sense?

Matheus Serpa

11/10/2021, 2:32 PM
yesss it does
we implemented something similar
that's great

datajoely

11/10/2021, 2:32 PM
💪
nice!

Matheus Serpa

11/10/2021, 2:36 PM
Thank you for your support @User. I'll probably have more questions in the future 😆 Arnaldo from Wildlife showed Kedro to us during a Google for Startups Accelerator mentoring session. We've been in love with Kedro; it is helping us a lot with migrating our data pipelines to Airflow@GCP.

datajoely

11/10/2021, 2:37 PM
Oh amazing @User 🙏 thanks for spreading the word!
and no problem @User - happy to help!

Matheus Serpa

11/10/2021, 2:44 PM
thanks! Best,