# advanced-need-help
v
Custom DataSet for larger than memory data -- dask, SQL, other?
Hi, I'm currently confronted with a directory of container-format files, where each container holds one audio signal (i.e. an array) of a few megabytes and variable length, plus a handful of miscellaneous scalar parameters. I've written a custom DataSet that takes the root directory of my data as filepath. Its loading method returns a Tuple[pd.DataFrame, List[np.ndarray]]: the dataframe holds all scalar parameters and allows for easy exploration and analysis, while the list holds the actual data payload. On saving, I want to mirror the original directory structure, but for each container file write a yaml file holding the parameters, plus a .flac audio file for the signal.
Now I'm confronted with a directory holding some 70GB (larger than memory) of audio data. What could I use instead of a list of numpy arrays? To my knowledge, dask bags don't preserve order, while SQL might carry a large implementation overhead (especially since I've never worked with it)... Does anybody have experience with what one could do with data this large, while keeping it suitable for an ML workflow (splitting the data, building mini-batches later on, etc.)?
Or is it recommended to learn Spark for such an issue?
PS: I'm okay with saving formats other than the original directory structure. TL;DR, I guess I'm looking for:
* a data format that works well with kedro
* for larger-than-memory data
* indexability would be a plus
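For reference, a minimal sketch of the custom DataSet described above, assuming an older Kedro API where AbstractDataSet lives in kedro.io and soundfile for the flac output; the read_container() helper, the .container extension, and all paths are illustrative placeholders, not anything from a real library.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple

import numpy as np
import pandas as pd
import soundfile as sf
import yaml
from kedro.io import AbstractDataSet


def read_container(path: Path) -> Tuple[Dict[str, Any], np.ndarray, int]:
    """Hypothetical reader for the proprietary container format.

    Expected to return (scalar parameters, audio signal, sample rate).
    """
    raise NotImplementedError("replace with the actual container parser")


class AudioContainerDataSet(AbstractDataSet):
    def __init__(self, filepath: str):
        self._root = Path(filepath)

    def _load(self) -> Tuple[pd.DataFrame, List[np.ndarray]]:
        records, signals = [], []
        for path in sorted(self._root.rglob("*.container")):  # extension is illustrative
            params, signal, sample_rate = read_container(path)
            params["source_file"] = str(path.relative_to(self._root))
            params["sample_rate"] = sample_rate
            records.append(params)
            signals.append(signal)
        # row i of the dataframe describes signals[i]
        return pd.DataFrame(records), signals

    def _save(self, data: Tuple[pd.DataFrame, List[np.ndarray]]) -> None:
        params_df, signals = data
        for (_, row), signal in zip(params_df.iterrows(), signals):
            target = self._root / Path(row["source_file"]).with_suffix("")
            target.parent.mkdir(parents=True, exist_ok=True)
            # mirror the source layout: one yaml per container, one flac per signal
            params = {
                key: (value.item() if hasattr(value, "item") else value)
                for key, value in row.drop("source_file").to_dict().items()
            }
            with open(target.with_suffix(".yaml"), "w") as fh:
                yaml.safe_dump(params, fh)
            sf.write(str(target.with_suffix(".flac")), signal, int(row["sample_rate"]))

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._root)}
```

The eager list built in _load is exactly the part that stops working once the directory grows to 70GB, which is what the rest of the thread is about replacing.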
d
v
Thank you for pointing me towards this! Is it considered wise, though, to store numpy array objects (or any unspecified large objects, really) in dataframe cells in the first place (be it a dask drop-in or not)? Do you happen to have any experience with this? And thanks again 😊
d
My preference is to limit the abstractions
So numpy arrays in dicts, or other plain numpy constructs
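One way to read that suggestion while staying under the memory limit is a plain dict that maps each file key to a zero-argument loader, so an array is only materialised when its entry is called (the same lazy pattern Kedro's PartitionedDataSet uses for its load step). A minimal sketch under that assumption, with illustrative paths and helper names:

```python
from pathlib import Path
from typing import Callable, Dict

import numpy as np
import soundfile as sf  # assumed available for reading .flac files


def build_lazy_signal_dict(root: Path) -> Dict[str, Callable[[], np.ndarray]]:
    """Map each relative file path to a zero-argument loader.

    Nothing is read until a loader is called, and dicts preserve
    insertion order, so ordering and indexability by key are kept.
    """
    loaders: Dict[str, Callable[[], np.ndarray]] = {}
    for flac_path in sorted(root.rglob("*.flac")):
        key = str(flac_path.relative_to(root))
        # the default argument pins this entry's path inside the lambda
        loaders[key] = lambda p=flac_path: sf.read(p)[0]
    return loaders


# Usage: iterate lazily, e.g. to assemble mini-batches one signal at a time.
# signals = build_lazy_signal_dict(Path("data/01_raw/audio"))
# for key, load in signals.items():
#     signal = load()   # only now is this file read into memory
#     ...               # compute features, write results, free the array
```

Because the keys are ordinary strings in a fixed order, splitting the data or drawing mini-batches reduces to slicing or sampling the key list, with only the selected signals ever loaded.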