beginners-need-help
  • datajoely
    05/16/2022, 10:46 PM
    No, the point of hooks is that you have a lot of flexibility; the example is very much demonstrating functionality more than specific good practices.
  • datajoely
    05/16/2022, 10:46 PM
    In my opinion.
  • wwliu
    05/16/2022, 10:49 PM
    I see. Thanks for replying and helping me out.
  • AnnaRie
    05/18/2022, 6:59 PM
    Hello 🙂 I'm using Kedro to train a model and predict values with this model afterwards. The model is saved versioned after training, and for prediction I usually take the latest version. But if I want to use a specific version, I have to define this in the terminal after kedro run (as I read in the Kedro documentation). Is there an option to get the defined version? I want to write a log for my prediction to keep things reproducible. Thanks, Anna
  • antony.milne
    05/18/2022, 8:32 PM
    Logging load version
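For reference, one way to surface the version Kedro resolves is a project hook; a minimal sketch, assuming a versioned catalog entry named "model" (the dataset name is hypothetical, and _get_dataset is not public API):

    import logging

    from kedro.framework.hooks import hook_impl
    from kedro.io import DataCatalog


    class LoadVersionLoggingHooks:
        """Logs the load version Kedro resolves for a versioned dataset."""

        @hook_impl
        def after_catalog_created(self, catalog: DataCatalog) -> None:
            # "model" is a hypothetical versioned catalog entry;
            # resolve_load_version() returns the version string that
            # will be used when the dataset is loaded.
            dataset = catalog._get_dataset("model")  # not public API
            logging.getLogger(__name__).info(
                "Load version for 'model': %s", dataset.resolve_load_version()
            )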
  • wwliu
    05/18/2022, 11:29 PM
    Hello. I have a question regarding the execution order of nodes. I understand the node execution order is decided by Kedro, not exactly like the layout in pipeline.py. I would like to understand the underlying mechanism of how the order is determined. Is there randomness involved? If there is randomness, it might make testing and QA work difficult.
  • noklam
    05/18/2022, 11:35 PM
    Under the hood it is a topological sort. There is no guarantee about the order if there is more than one possible solution. Nodes are supposed to be pure Python functions without side effects, so ordering should not affect the result.
  • noklam
    05/18/2022, 11:37 PM
    That said, there are cases, like random sequence generation, that are affected by the order; there is an issue to change this to a more deterministic approach.
  • datajoely
    05/19/2022, 10:48 AM
    Hi @wwliu, I would also clarify @noklam's point slightly: "there is no guarantee about the order" only applies per dependency level. I.e. if dataset D requires A, B and C, D will always be executed last, but the order in which A, B and C run is not fixed per run.
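A minimal sketch of that guarantee (function and dataset names are made up): the last node always runs after the first three, but the relative order of the first three can vary between runs:

    from kedro.pipeline import Pipeline, node

    def make_a():
        return 1

    def make_b():
        return 2

    def make_c():
        return 3

    def combine(a, b, c):
        # Only runs once A, B and C all exist.
        return a + b + c

    pipeline = Pipeline(
        [
            node(make_a, inputs=None, outputs="A"),
            node(make_b, inputs=None, outputs="B"),
            node(make_c, inputs=None, outputs="C"),
            node(combine, inputs=["A", "B", "C"], outputs="D"),
        ]
    )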
  • Lazy2PickName
    05/19/2022, 4:41 PM
    Hi, so, I have a pipeline like this:
    def _parse_inctf() -> Pipeline:
        return Pipeline(
            [
                node(
                    func=nodes.insert_columns_inctf,
                    inputs="external-inct-fracionada",
                    outputs="inctf-preprocess-01-insert-columns",
                    name="read-and-insert-columns-inctf",
                ),
                node(
                    func=nodes.parse_inct_dates,
                    inputs="inctf-preprocess-01-insert-columns",
                    outputs="inctf-preprocess-02-parse-dates",
                ),
                node(
                    func=nodes.get_pct_change,
                    inputs="inctf-preprocess-02-insert-columns",
                    outputs="inctf-preprocessed",
                ),
            ]
        )
    Of those datasets, only external-inct-fracionada and inctf-preprocessed are actually declared in the catalog.yml. I want to pass the others as MemoryDataSets, since they are intermediaries in my pipeline, but when I run, I get this error:
    ValueError: Pipeline input(s) {'inctf-preprocess-02-insert-columns'} not found in the DataCatalog
    Is there a way of doing this without declaring each intermediary dataset in my catalog? Just so you know, this is the entry for external-inct-fracionada in my catalog:
    external-inct-fracionada:
      type: project.io.encrypted_excel.EncryptedExcelDataSet
      filepath: "${DATA_DIR}/External/INCT/INCTF_0222.xls"
    EncryptedExcelDataSet and its implementation can be seen in the attached file.
  • noklam
    05/19/2022, 4:53 PM
    The naming of intermediate data can be arbitrary; you just need to use consistent names. If it is a memory dataset, it must be the output of some other node. Change 02 to 01 and it should run.
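noklam's rename makes the input match the first node's output; if the date-parsing step is meant to stay in the chain, the likelier intended fix is to point the third node at the second node's output instead. A sketch of the latter (same principle either way: the input must exactly match a name some node outputs):

    node(
        func=nodes.get_pct_change,
        # was "inctf-preprocess-02-insert-columns", which no node outputs;
        # "inctf-preprocess-02-parse-dates" is produced by the previous node
        inputs="inctf-preprocess-02-parse-dates",
        outputs="inctf-preprocessed",
    ),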
  • Lazy2PickName
    05/19/2022, 4:57 PM
    Thanks!
  • SirTylerDurden
    05/20/2022, 1:34 AM
    Is there any control flow that's supported in a Kedro DAG?
  • datajoely
    05/20/2022, 9:56 AM
    Could you explain more what you mean? Do you mean conditional nodes?
  • datajoely
    05/20/2022, 10:01 AM
    If you do: we don't, by design, since we want to enforce reproducibility. The way to achieve this is to have different registered pipelines/environments/instances of modular pipelines.
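A minimal sketch of the registered-pipelines pattern, assuming two hypothetical modular pipelines, training and inference, in a project's pipeline_registry.py (the project and pipeline names are made up):

    from typing import Dict

    from kedro.pipeline import Pipeline

    # hypothetical modular pipelines in this project
    from my_project.pipelines import inference, training


    def register_pipelines() -> Dict[str, Pipeline]:
        training_pipeline = training.create_pipeline()
        inference_pipeline = inference.create_pipeline()
        return {
            "__default__": training_pipeline + inference_pipeline,
            "training": training_pipeline,
            "inference": inference_pipeline,
        }

The "branch" is then chosen at run time, e.g. kedro run --pipeline=inference, rather than by a conditional node inside the DAG.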
  • SirTylerDurden
    05/21/2022, 12:08 AM
    I'm referring to the ability to run nodes conditionally depending on outputs of other nodes or configs. In many DAG solutions, you can have branching flows depending on conditions, but I don't see that supported in Kedro. I was wondering if I may be missing something.
  • datajoely
    05/21/2022, 3:04 AM
    Yeah, this is a deliberate choice not to support that, because we believe it harms reproducibility and makes things harder to debug.
  • RRoger
    05/21/2022, 6:16 AM
    What's the pattern for numerous files as raw data? I want to download about 2000 files of the same type with different dates, e.g. "senate_2006-03-30.xml". 1. Do I create a catalog entry for each file? 2. Does the download node output a list of length 2000, i.e. ["senate_2006-03-30", "senate_2006-03-31", ...], i.e. a 2000-line pipeline.py? Or is there some sort of clever templating?
  • datajoely
    05/21/2022, 6:35 AM
    PartitionedDataSet?
  • RRoger
    05/21/2022, 11:46 AM
    Yes, this solved the problem, thank you.
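For the record, a sketch of what such a catalog entry can look like (the entry name, path and underlying dataset type are assumptions; the node then receives a dictionary mapping each partition id to a load function):

    senate_files:
      type: PartitionedDataSet
      path: data/01_raw/senate  # hypothetical location of the ~2000 XML files
      dataset: text.TextDataSet
      filename_suffix: ".xml"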
  • Mackson
    05/24/2022, 12:37 AM
    Hello people, how should I work with chunking a huge dataset when applying a function? I found an issue, but it was not clear how I should deal with the context manager inside the function. Thanks!!!!
  • datajoely
    05/24/2022, 1:10 AM
    Are you using pandas? Can you post a snippet?
  • Mackson
    05/24/2022, 8:37 AM
    It's from my work, so I can't share it, but just consider a really huge pandas dataset that does not fit in memory, in a function where you just, let's say, add a column.
  • Mackson
    05/24/2022, 8:38 AM
    The only lazy evaluation I know is through Spark, which I don't have access to at the moment.
  • datajoely
    05/24/2022, 8:50 AM
    Sure - but is this a general pandas question or a Kedro one? This tutorial shows simple chunking + full dask: https://pythonspeed.com/articles/faster-pandas-dask/
  • Mackson
    05/24/2022, 8:51 AM
    It's a Kedro question; I know how to chunk outside the node context.
  • Mackson
    05/24/2022, 8:56 AM
    My question is: I know how _load can return the iterator (let's say I return the chunk iterator object from pandas), but how will the node itself know to apply all the steps inside the function to EACH chunk (not just one) without an outside loop (let's say a wrapper around the node)?
  • noklam
    05/24/2022, 10:02 AM
    I think what you get in a node will be a generator instead of a dataframe if you are using the chunk iterator. That looping would be logic that you write inside the node.
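A minimal sketch of that pattern (column names and chunk size are made up): the custom dataset's _load hands the node a chunk iterator, and the per-chunk loop lives in the node function:

    import pandas as pd

    # In the custom dataset, _load returns pandas' chunk iterator
    # instead of a single DataFrame, e.g.:
    #     def _load(self):
    #         return pd.read_csv(self._filepath, chunksize=100_000)

    def add_column(chunks) -> pd.DataFrame:
        """Node function: the per-chunk loop lives here."""
        processed = []
        for chunk in chunks:
            chunk["new_col"] = chunk["value"] * 2  # hypothetical step
            processed.append(chunk)
        # pd.concat re-materialises the full result; if that is also too
        # big, write each chunk out incrementally instead (see the
        # dataset sketch further down)
        return pd.concat(processed)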
  • Mackson
    05/24/2022, 10:54 AM
    Yeah, maybe returning a "map" will do the trick?
  • Mackson
    05/24/2022, 10:55 AM
    Or doing a whole new AbstractDataSet that will write the iterator
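A rough sketch of that idea: a custom dataset whose _save consumes an iterator of chunks and appends each one to disk, so the full result is never held in memory (untested; the class name and parameters are hypothetical):

    import pandas as pd

    from kedro.io import AbstractDataSet


    class ChunkedCSVDataSet(AbstractDataSet):
        def __init__(self, filepath: str, chunksize: int = 100_000):
            self._filepath = filepath
            self._chunksize = chunksize

        def _load(self):
            # lazy: hands the node a chunk iterator
            return pd.read_csv(self._filepath, chunksize=self._chunksize)

        def _save(self, chunks) -> None:
            for i, chunk in enumerate(chunks):
                # first chunk overwrites and writes the header, the rest append
                chunk.to_csv(
                    self._filepath,
                    mode="w" if i == 0 else "a",
                    header=i == 0,
                    index=False,
                )

        def _describe(self):
            return {"filepath": self._filepath, "chunksize": self._chunksize}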