# advanced-need-help
m
Hi all, I am porting an existing project to kedro which reads a lot of data from tables that are hosted on Databricks. A lot of the functions look like this:
```python
def do_something(db_name: str, table_name: str):
    spark.sql(
        "SELECT <some complicated expression> FROM {:s}.{:s}".format(db_name, table_name)
    )

    # write results to another table.
```
I am wondering what the best way to convert this to kedro is. I can load the table as a spark DataFrame using
kedro.extras.datasets.spark.SparkDataSet
, but I would like to avoid rewriting all of the SQL queries in the DataFrame API. Does it make sense to do something like this:
```python
def my_node(my_table: DataFrame) -> DataFrame:
    my_table.createOrReplaceTempView("tmp_table")

    # The SQL query is just copied from the function above.
    result = spark.sql("SELECT <some complicated expression> FROM tmp_table")
    spark.catalog.dropTempView("tmp_table")

    return result
```
Creating the temporary view seems like a bit of a hack to me, but I can't think of a better way that allows me to avoid rewriting the SQL queries in the DataFrame API. I'm also not sure if this has any performance implications.
d
Hi @Mirko
so the SQL and DataFrame APIs have been equivalent in PySpark since 2.0 in terms of performance
so there isn't a penalty in that dimension, only in the effort migrating
we do have several spark datasets that may be useful
That being said we don't encourage Kedro nodes to be aware of IO and this sort of breaks the paradigm
in truth the best way to use Spark in Kedro is via the DataFrame API for this reason
What I would say is that if you want to stay in SQL perhaps
dbt
is the right tool for the job and then use Kedro for the parts that need to live in Python
m
Thanks for the quick reply. I think that from a technical perspective using the DataFrame API would be the best solution. However, there are always time and budget considerations as well and it is not unlikely that we will use the hack with the temporary views that I suggested as a temporary solution. Could you perhaps comment on what exactly is undesirable about creating the temporary views? As far as I can tell these views would only exist while the node is being run.
d
I think you're welcome to create a temporary view, but to keep it Kedro-style you should do the
create
and
save
parts with a DataSet class and not in your node
so I think you can get pretty close with the
spark.SparkHiveDataSet
but you may want to define your own custom dataset to do the SQL view thingy
m
Instead of defining a custom dataset, would I be able to achieve the same thing with two hooks? The first hook would be of the type
after_dataset_loaded
and create the temporary view. The second hook would be of the type
before_dataset_saved
and delete the temporary view.
d
You can but it feels like more work
m
That's good to know. I have never created a custom dataset before. That is why I am shying away from it, but if in this case it is likely to be less work then I should look at the docs and figure out how to do it. Is there a reason why you are recommending the
spark.SparkHiveDataSet
instead of the
spark.SparkDataSet
as a starting point?
d
the Hive one creates a table, the Spark one creates a file
m
Thanks for your help with this. I think I have a pretty good overview of all the possibilities now.