I plucked it out of my custom pyspark _save method...
# advanced-need-help
s
I plucked it out of my custom pyspark _save method. I wanted to test if my save args were being respected.
d
Hi @SandyShocks™ so does it still not work?
s
So this is the type of catalog in the scenario where it failed
data.write.save(save_path, self._file_format, **self._save_args)
d
what type of object is
data
s
spark dataset
this is how we were doing the save in my _save method
data.write.save(path=save_path, format=self._file_format, mode=self._save_args.get("mode"), **options)
this is how I'm changing it now
partition_col = self._save_args.get("partitionBy")
options = self._save_args.get("option")
data.write \
    .option("replaceWhere", f"{partition_col} >= '{start_date}'") \
    .save(
        path=save_path,
        format=self._file_format,
        mode=self._save_args.get("mode"),
        partitionBy=partition_col,
        **options
    )
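A minimal sketch of that new branch with guards for missing keys - the "or {}" fallback and the partition_col check are my additions, not from the original _save; save_path, start_date and self._file_format are as in the snippets above:
# sketch only: assumes self._save_args uses the mode/option/partitionBy keys
# shown in the catalog entry further down, and that start_date is computed elsewhere
partition_col = self._save_args.get("partitionBy")
options = self._save_args.get("option") or {}  # assumption: avoid unpacking None

writer = data.write
if partition_col is not None:
    # only add the replaceWhere predicate when a partition column is configured
    writer = writer.option("replaceWhere", f"{partition_col} >= '{start_date}'")

writer.save(
    path=save_path,
    format=self._file_format,
    mode=self._save_args.get("mode"),
    partitionBy=partition_col,
    **options
)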
d
what error do you get?
Is it still the column merge error as above
s
Tough to say
I ran the save explicitly on my jupyter notebook as I showed in the pic
So now the target schema has been overwritten, so my new implementation doesn't complain either
d
from our side it looks like it should work
s
Although this exact issue happened once before; running the command explicitly fixed the target and then it ran okay
d
in general we try not to change too much about the load/save methods and delegate to the underlying implementation
s
Correct
d
> command explicitly fixed target
what do you mean by this?
s
Cell #9
d
so this isn't a Kedro issue - it's a Spark issue with conflicting types of both tables
can you do a cast before?
s
One sec, I have a question regarding that
my work laptop and work VDI don't allow Discord, which makes chatting here very complicated if I wanna pull some examples
d
That's okay - but I do want to make clear that the AnalysisException is purely a Spark one so has to do with the data frame content rather than how Kedro is writing it
s
def save(self, path=None, format=None, mode=None, partitionBy=None, **options):
    """Saves the contents of the :class:`DataFrame` to a data source.

    The data source is specified by the ``format`` and a set of ``options``.
    If ``format`` is not specified, the default data source configured by
    ``spark.sql.sources.default`` will be used.

    .. versionadded:: 1.4.0

    Parameters
    ----------
    path : str, optional
        the path in a Hadoop supported file system
    format : str, optional
        the format used to save
    mode : str, optional
        specifies the behavior of the save operation when data already exists.

        * ``append``: Append contents of this :class:`DataFrame` to existing data.
        * ``overwrite``: Overwrite existing data.
        * ``ignore``: Silently ignore this operation if data already exists.
        * ``error`` or ``errorifexists`` (default case): Throw an exception if data already \
            exists.
    partitionBy : list, optional
        names of partitioning columns
    **options : dict
        all other string options

    Examples
    --------
    >>> df.write.mode("append").save(os.path.join(tempfile.mkdtemp(), 'data'))
    """
    self.mode(mode).options(**options)
    if partitionBy is not None:
        self.partitionBy(partitionBy)
    if format is not None:
        self.format(format)
    if path is None:
        self._jwrite.save()
    else:
        self._jwrite.save(path)
d
so in theory
withColumn('column_name', F.col('column_name').cast('string'))
on both sides will allow the merge
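Roughly, with a made-up column name event_date standing in for the conflicting column:
from pyspark.sql import functions as F

# cast the conflicting column to one common type on the DataFrame being written;
# the existing target table needs the same type for the overwrite to succeed
data = data.withColumn("event_date", F.col("event_date").cast("string"))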
s
that's pyspark/sql/readwriter.py
that method is what's being called behind save: it takes mode and applies .mode(mode), takes partitionBy and applies .partitionBy(partitionBy), and does the same with options
save_args:
  mode: "overwrite"
  option:
    overwriteSchema: true
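For the new _save above (where the nested option dict gets unpacked into **options), that entry effectively expands to something like this builder chain - a sketch, reusing save_path and self._file_format from the earlier snippets:
# equivalent of mode: "overwrite" plus option: {overwriteSchema: true} (sketch)
(data.write
    .mode("overwrite")              # from save_args["mode"]
    .options(overwriteSchema=True)  # from save_args["option"]
    .format(self._file_format)
    .save(save_path))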
d
right, but I think this is a within-node business-logic issue
s
So doing a data.write.save(save_path, self._file_format, **self._save_args) would route mode and partitionBy from the catalog straight through to save()'s own arguments
d
I'm not sure Spark can do a merge with columns of different types, even with
overwriteSchema: true
I think you need to do an explicit cast in the node
s
I wish I could do a cast easily
But that node copies data 1-1 and overwrites target tables. They are small tables so that's how we defined the logic.
Maybe source schema could be applied to target
d
yes I think that's the right call
s
sorry, read the target schema and apply it to the data. But that might be complicated with how all catalogs are designed to be abstracted away from the node
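A rough sketch of what "read the target schema and apply it to the data" could look like inside a node - align_to_target_schema, target_path and the direct spark.read call are illustrative assumptions, not the project's actual code:
from pyspark.sql import functions as F

def align_to_target_schema(df, spark, target_path, file_format="delta"):
    """Cast df's columns to the types of the existing target table (sketch)."""
    target_schema = spark.read.format(file_format).load(target_path).schema
    for field in target_schema:
        if field.name in df.columns:
            df = df.withColumn(field.name, F.col(field.name).cast(field.dataType))
    return df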
d
as a rule I'm hesitant to put too much transformation logic in the catalog, I feel that should live in node
s
great point you brought up there. Question: would I be able to access input and output catalog info like save_args, etc. from inside a node? Also, is it possible to update save_args from inside the node? I've only seen the catalog.add() functionality
d
No, we specifically don't provide that to users
declarative pointers to data live in the catalog. Business logic such as transformations lives in the node, in Python