Hi, I am using Kedro for my ML project. I am happy that I got to know Kedro, because it makes it much easier for me as a data scientist to implement an ML model.
Currently I am working on making my Kedro ML project run on a Spark cluster that uses a Hive database. For this I use Kedro's SparkHiveDataSet. The problem I faced was a mismatch between the column name used in the Kedro code and the column name returned by the Hive database I was using. I use Kedro version 0.17.7, and below is the code of the function _exists() in spark_hive_dataset.py.
def _exists(self) -> bool:
    if (
        self._get_spark()
        .sql("show databases")
        .filter(col("namespace") == lit(self._database))
        .take(1)
    ):
        self._get_spark().sql(f"use {self._database}")
        if (
            self._get_spark()
            .sql("show tables")
            .filter(col("tableName") == lit(self._table))
            .take(1)
        ):
            return True
    return False
I changed "namespace" to "databaseName", and after that I was able to load data from the Hive table. In the newest version, 0.18.2, I found that the body of _exists() has been replaced with new code that uses a Spark catalog function to do the same check.
def _exists(self) -> bool:
    # noqa # pylint:disable=protected-access
    return (
        self._get_spark()
        ._jsparkSession.catalog()
        .tableExists(self._database, self._table)
    )
I had trouble using version 0.18.2 (I got various errors), so I kept the version I am using (0.17.7) and only replaced the code of _exists() in spark_hive_dataset.py with the updated code from version 0.18.2. I would like to ask whether it is OK to use it like this. Could this cause any problems or issues?
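For comparison, here is a minimal sketch of the same existence check that depends neither on the column name of "show databases" nor on the private _jsparkSession attribute. It assumes PySpark's public spark.catalog.listDatabases() and spark.catalog.listTables() methods; the helper name table_exists is hypothetical and not part of Kedro. It compares names exactly as given, so any case normalization done by Hive is not handled here.

```python
def table_exists(spark, database: str, table: str) -> bool:
    """Return True if `table` is listed in `database` (hypothetical helper).

    `spark` is expected to be a SparkSession; only its public `catalog`
    attribute is used, so no result-set column names are involved.
    """
    # Each entry returned by listDatabases() exposes the database name as `.name`.
    database_names = {db.name for db in spark.catalog.listDatabases()}
    if database not in database_names:
        return False

    # listTables(dbName) lists the tables of that database, each with a `.name`.
    table_names = {tbl.name for tbl in spark.catalog.listTables(database)}
    return table in table_names
```

If this works on my cluster, I suppose I could call it from an overridden _exists() in my own subclass of the dataset, as table_exists(self._get_spark(), self._database, self._table), instead of editing the installed spark_hive_dataset.py. I have not verified this against either Kedro version.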