# advanced-need-help
m
Hi all, I am porting an existing project to kedro which reads a lot of data from tables that are hosted on Databricks. A lot of the functions look like this:
```python
def do_something(db_name: str, table_name: str):
    spark.sql(
        "SELECT <some complicated expression> FROM {:s}.{:s}".format(db_name, table_name)
    )

    # write results to another table.
```
I am wondering what the best way to convert this to kedro is. I can load the table as a spark DataFrame using
kedro.extras.datasets.spark.SparkDataSet
, but I would like to avoid rewriting all of the SQL queries in the DataFrame API. Does it make sense to do something like this:
```python
def my_node(my_table: DataFrame) -> DataFrame:
    my_table.createOrReplaceTempView("tmp_table")

    # The SQL query is just copied from the function above.
    result = spark.sql("SELECT <some complicated expression> FROM tmp_table")
    spark.catalog.dropTempView("tmp_table")

    return result
```
Creating the temporary view seems like a bit of a hack to me, but I can't think of a better way that allows me to avoid rewriting the SQL queries in the DataFrame API. I'm also not sure if this has any performance implications.
d
Hi @Mirko
so the SQL and DataFrame APIs have been equivalent in PySpark since 2.0 in terms of performance
so there isn't a penalty in that dimension, only in the effort migrating
we do have several spark datasets that may be useful
That being said we don't encourage Kedro nodes to be aware of IO and this sort of breaks the paradigm
in truth the best way to use Spark in Kedro is via the DataFrame API for this reason
What I would say is that if you want to stay in SQL perhaps
dbt
is the right tool for the job and then use Kedro for the parts that need to live in Python
m
Thanks for the quick reply. I think that from a technical perspective using the DataFrame API would be the best solution. However, there are always time and budget considerations as well and it is not unlikely that we will use the hack with the temporary views that I suggested as a temporary solution. Could you perhaps comment on what exactly is undesirable about creating the temporary views? As far as I can tell these views would only exist while the node is being run.
d
I think you're welcome to create a temporary view, but to keep it Kedro-style you should do the
create
and
save
parts with a DataSet class and not in your node
so I think you can get pretty close with the
spark.SparkHiveDataSet
but you may want to define your own custom dataset to do the SQL view thingy
m
Instead of defining a custom dataset, would I be able to achieve the same thing with two hooks? The first hook would be of the type
after_dataset_loaded
and create the temporary view. The second hook would be of the type
before_dataset_saved
and delete the temporary view.
d
You can but it feels like more work
m
That's good to know. I have never created a custom dataset before. That is why I am shying away from it, but if in this case it is likely to be less work then I should look at the docs and figure out how to do it. Is there a reason why you are recommending the
spark.SparkHiveDataSet
instead of the
spark.SparkDataSet
as a starting point?
d
the Hive one creates a table, the Spark one creates a file
m
Thanks for your help with this. I think I have a pretty good overview of all the possibilities now.