Mirko
04/26/2022, 8:14 PM
```python
def do_something(db_name: str, table_name: str):
    # Assumes a SparkSession is available as `spark`.
    spark.sql(
        "SELECT <some complicated expression> FROM {:s}.{:s}".format(db_name, table_name)
    )
    # write results to another table.
```
I am wondering what the best way to convert this to Kedro is. I can load the table as a Spark DataFrame using `kedro.extras.datasets.spark.SparkDataSet`, but I would like to avoid rewriting all of the SQL queries in the DataFrame API. Does it make sense to do something like this:
```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()


def my_node(my_table: DataFrame) -> DataFrame:
    # Register the input so the SQL query can refer to it by name.
    my_table.createOrReplaceTempView("tmp_table")
    # The SQL query is just copied from the function above.
    result = spark.sql("SELECT <some complicated expression> FROM tmp_table")
    spark.catalog.dropTempView("tmp_table")
    return result
```
Creating the temporary view seems like a bit of a hack to me, but I can't think of a better way that allows me to avoid rewriting the SQL queries in the DataFrame API. I'm also not sure if this has any performance implications.
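For context, wiring such a node into a Kedro pipeline might look like the sketch below; the dataset names `input_table` and `output_table` are illustrative, and `my_node` is the function from the snippet above.

```python
from kedro.pipeline import Pipeline, node

# Assumes `my_node` (defined above) is importable, e.g. from the project's nodes module.


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                my_node,
                inputs="input_table",    # e.g. a spark.SparkDataSet catalog entry
                outputs="output_table",  # hypothetical dataset names
            ),
        ]
    )
```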
04/27/2022, 9:57 AM
`dbt` is the right tool for the job here, and then use Kedro for the parts that need to live in Python.

Mirko
04/27/2022, 6:19 PM

datajoely
04/28/2022, 9:41 AM
I would do the `create` and `save` parts with a DataSet class and not in your node. There is `spark.HiveDataSet`, but you may want to define your own custom dataset to do the SQL view thing.
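A minimal sketch of what such a custom dataset could look like, assuming Kedro's `AbstractDataSet` interface; the class name `SparkTempViewDataSet`, the constructor arguments, and the overwrite save mode are illustrative choices, not an existing Kedro dataset:

```python
from typing import Any, Dict

from kedro.io import AbstractDataSet
from pyspark.sql import DataFrame, SparkSession


class SparkTempViewDataSet(AbstractDataSet):
    """Load a Hive table as a DataFrame and register it as a temporary view."""

    def __init__(self, database: str, table: str, view_name: str):
        self._database = database
        self._table = table
        self._view_name = view_name

    def _load(self) -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        df = spark.table(f"{self._database}.{self._table}")
        # Register the view so nodes can keep using the original SQL strings.
        df.createOrReplaceTempView(self._view_name)
        return df

    def _save(self, data: DataFrame) -> None:
        data.write.saveAsTable(f"{self._database}.{self._table}", mode="overwrite")

    def _describe(self) -> Dict[str, Any]:
        return dict(database=self._database, table=self._table, view_name=self._view_name)
```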
04/28/2022, 3:48 PMafter_dataset_loaded
and create the temporary view. The second hook would be of the type before_dataset_saved
and delete the temporary view.datajoely
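A minimal sketch of that hook pair, assuming Kedro's `after_dataset_loaded` and `before_dataset_saved` hook specs; using the dataset name as the view name and tracking created views in a set are illustrative choices:

```python
from typing import Any, Set

from kedro.framework.hooks import hook_impl
from pyspark.sql import DataFrame, SparkSession


class TempViewHooks:
    """Expose loaded Spark datasets to plain SQL via temporary views."""

    def __init__(self) -> None:
        self._views: Set[str] = set()

    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data: Any) -> None:
        if isinstance(data, DataFrame):
            # Make the loaded table addressable by name from SQL strings.
            data.createOrReplaceTempView(dataset_name)
            self._views.add(dataset_name)

    @hook_impl
    def before_dataset_saved(self, dataset_name: str, data: Any) -> None:
        spark = SparkSession.builder.getOrCreate()
        # Clean up the views created on load.
        for view in self._views:
            spark.catalog.dropTempView(view)
        self._views.clear()
```

The hooks instance would then be registered with the project, e.g. via `HOOKS` in `settings.py`.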
datajoely
04/28/2022, 3:50 PM

Mirko
04/28/2022, 3:55 PM
Should I use the `spark.HiveDataSet` instead of the `spark.SparkDataSet` as a starting point?

datajoely
04/28/2022, 3:56 PM

Mirko
04/28/2022, 6:06 PM