09/01/2022, 12:20 PM
Hi, I am using Kedro for my ML project. I am happy that I got to know Kedro, because it makes it much easier for a data scientist to implement an ML model.

Currently I am working on making my Kedro ML project run on a Spark cluster that uses a Hive database. For this I use Kedro's SparkHiveDataset. The problem I faced was a mismatch between the column name used in the Kedro code and the column name returned by the Hive database I was using.

I use Kedro version 0.17.7, and below is the code of the function _exists() in that dataset:

    def _exists(self) -> bool:
        if (
            self._get_spark()
            .sql("show databases")
            .filter(col("namespace") == lit(self._database))
            .take(1)
        ):
            self._get_spark().sql(f"use {self._database}")
            if (
                self._get_spark()
                .sql("show tables")
                .filter(col("tableName") == lit(self._table))
                .take(1)
            ):
                return True
        return False

I changed "namespace" to "databaseName", and after that I was able to load data from the table in the Hive database.

What I found in the newest version, 0.18.2, is that the body of _exists() has been replaced with new code that uses a Spark catalog function to do the same check:

    def _exists(self) -> bool:
        # noqa # pylint:disable=protected-access
        return (
            self._get_spark()
            ._jsparkSession.catalog()
            .tableExists(self._database, self._table)
        )

I had trouble using version 0.18.2 (I got various errors), so I stayed on the version I am using (0.17.7) and only replaced the code in _exists() with the updated code from version 0.18.2.

I would like to ask whether it is OK to use it like this. Could this cause any problems or issues?
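For reference, the column-name mismatch described above could also be handled without hard-coding either name. The following is only a minimal sketch, not Kedro's implementation: the helper names `pick_database_column` and `database_exists` are hypothetical, and it assumes that "show databases" returns a single column called either "namespace" or "databaseName", the two names mentioned in this post.

```python
from typing import Iterable, Sequence

# Column names that "show databases" is assumed to return; which one appears
# depends on the cluster (both names are taken from the post above).
DATABASE_COLUMN_CANDIDATES = ("namespace", "databaseName")


def pick_database_column(
    columns: Iterable[str],
    candidates: Sequence[str] = DATABASE_COLUMN_CANDIDATES,
) -> str:
    """Return the first candidate column name present in `columns`."""
    available = list(columns)
    for name in candidates:
        if name in available:
            return name
    raise ValueError(
        f"None of {list(candidates)} found in columns {available}"
    )


def database_exists(spark, database: str) -> bool:
    """Hypothetical helper: check for a database with either column name.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    # Imported lazily so the pure-Python helper above works without pyspark.
    from pyspark.sql.functions import col, lit

    databases = spark.sql("show databases")
    column = pick_database_column(databases.columns)
    return bool(databases.filter(col(column) == lit(database)).take(1))
```

A helper like this could be called from an `_exists()` override in a small subclass of the dataset, which would keep the change in project code rather than in the installed Kedro package. Whether that fits depends on how the project registers its datasets.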