# beginners-need-help
Hi folks, in a previous post here, there was a question about passing parameters to SQL queries. While reading about transcoding in Kedro datasets yesterday, I got an idea for how to tackle that problem. The idea is to use a Jinja template for the SQL query (nothing new here) and use transcoding to enforce that the nodes run sequentially: first fill the template with the parameters, and only then run the query, with both steps sharing a common file. The catalog would look like
aggregates_template:
  type: text.TextDataSet
  filepath: data/01_raw/aggregates_query.sql

aggregates@query_string:
  type: text.TextDataSet
  filepath: data/02_intermediate/filled_aggregates_query.sql

aggregates@query:
  type: pandas.SQLQueryDataSet
  filepath: data/02_intermediate/filled_aggregates_query.sql
  credentials: aggregates_uri
and the pipeline
def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                parse_parameters,
                inputs=[
                    "params:start_date",
                    "params:end_date",
                    "params:metric",
                ],
                outputs="query_parameters",
            ),
            node(
                fill_template,
                inputs=["aggregates_template", "query_parameters"],
                outputs="aggregates@query_string",
            ),
            node(
                perform_query,
                inputs=["aggregates@query"],
                outputs="results",
            ),
        ]
    )
Note that the raw template stays in its own dataset (`aggregates_template`) rather than in the transcoding group: it points to a different file, and a node cannot have the same transcoded dataset as both input and output.
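The node functions themselves aren't shown above, so here is a minimal sketch of what they could look like, assuming Jinja2 as the templating engine (the function bodies are my assumption, only the names come from the pipeline):

```python
from typing import Any, Dict

from jinja2 import Template  # assumption: Jinja2 renders the SQL template


def parse_parameters(start_date: str, end_date: str, metric: str) -> Dict[str, Any]:
    # Bundle the run parameters into one dict so the template node has a single input.
    return {"start_date": start_date, "end_date": end_date, "metric": metric}


def fill_template(template: str, parameters: Dict[str, Any]) -> str:
    # Render the raw SQL template with the parsed parameters; the result is
    # written to the shared file through aggregates@query_string.
    return Template(template).render(**parameters)


def perform_query(query_result):
    # aggregates@query (pandas.SQLQueryDataSet) has already executed the SQL on
    # load, so this node just passes the resulting DataFrame downstream.
    return query_result
```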
Transcoding ensures that the second node runs before the third: `aggregates@query_string` and `aggregates@query` resolve to the same underlying dataset, so Kedro orders the nodes accordingly. I like using transcoding in this situation because it makes the link between the nodes more transparent than wiring in an extra dummy output/input. Please let me know what you think about it.
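For completeness, `aggregates_uri` would refer to an entry in `credentials.yml`, and `pandas.SQLQueryDataSet` expects a `con` key holding a SQLAlchemy connection string. A sketch (the URI itself is made up):

```yaml
aggregates_uri:
  con: postgresql://user:password@localhost:5432/analytics
```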