# advanced-need-help
v
Custom DataSet for larger than memory data -- dask, SQL, other?
Hi, I'm currently confronted with a directory of container-format files, where each container holds one audio signal (i.e. an array) of a few megabytes and variable length, plus a handful of miscellaneous scalar parameters. I've written a custom DataSet that takes the root directory of my data as filepath. Its loading method returns a Tuple[pd.DataFrame, List[np.ndarray]]: the dataframe holds all scalar parameters and allows for easy exploration and analysis, while the list holds the actual data payload. On saving, I want to mirror the original directory structure, but for each container file write a yaml file holding the parameters, plus a .flac audio file for the signal.
Now I'm confronted with a directory holding some 70GB (larger than memory) of audio data. What could I use instead of a list of numpy arrays? To my knowledge, dask bags don't preserve order, while SQL might carry a large implementation overhead (especially since I've never worked with it)... Does anybody have experience with what one could do with data this large, while keeping it suitable for an ML workflow (splitting the data, building mini-batches later on, etc.)?
Or is it recommended to learn Spark for such an issue?
PS: I'm okay with saving formats other than the original directory structure. TL;DR, I guess I'm looking for:
* a data format that works well with kedro
* for larger-than-memory data
* indexability would be a plus
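For reference, a minimal sketch of the custom DataSet described above, assuming an older Kedro API where AbstractDataSet lives in kedro.io and soundfile for the flac output; the read_container() helper, the .container extension, and all paths are illustrative placeholders, not anything from a real library.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple

import numpy as np
import pandas as pd
import soundfile as sf
import yaml
from kedro.io import AbstractDataSet


def read_container(path: Path) -> Tuple[Dict[str, Any], np.ndarray, int]:
    """Hypothetical reader for the proprietary container format.

    Expected to return (scalar parameters, audio signal, sample rate).
    """
    raise NotImplementedError("replace with the actual container parser")


class AudioContainerDataSet(AbstractDataSet):
    def __init__(self, filepath: str):
        self._root = Path(filepath)

    def _load(self) -> Tuple[pd.DataFrame, List[np.ndarray]]:
        records, signals = [], []
        for path in sorted(self._root.rglob("*.container")):  # extension is illustrative
            params, signal, sample_rate = read_container(path)
            params["source_file"] = str(path.relative_to(self._root))
            params["sample_rate"] = sample_rate
            records.append(params)
            signals.append(signal)
        # row i of the dataframe describes signals[i]
        return pd.DataFrame(records), signals

    def _save(self, data: Tuple[pd.DataFrame, List[np.ndarray]]) -> None:
        params_df, signals = data
        for (_, row), signal in zip(params_df.iterrows(), signals):
            target = self._root / Path(row["source_file"]).with_suffix("")
            target.parent.mkdir(parents=True, exist_ok=True)
            # mirror the source layout: one yaml per container, one flac per signal
            params = {
                key: (value.item() if hasattr(value, "item") else value)
                for key, value in row.drop("source_file").to_dict().items()
            }
            with open(target.with_suffix(".yaml"), "w") as fh:
                yaml.safe_dump(params, fh)
            sf.write(str(target.with_suffix(".flac")), signal, int(row["sample_rate"]))

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._root)}
```

The eager list built in _load is exactly the part that stops working once the directory grows to 70GB, which is what the rest of the thread is about replacing.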
d
v
Thank you for pointing me towards this! Is it considered wise, though, to store numpy array objects (or any unspecified large objects, really) in dataframe cells in the first place (be it a dask drop-in or not)? Do you happen to have any experience with this? And thanks again 😊
d
My preference is to limit the abstractions
So numpy arrays in dicts, or other plain numpy constructs
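One way to read that suggestion while staying under the memory limit is a plain dict that maps each file key to a zero-argument loader, so an array is only materialised when its entry is called (the same lazy pattern Kedro's PartitionedDataSet uses for its load step). A minimal sketch under that assumption, with illustrative paths and helper names:

```python
from pathlib import Path
from typing import Callable, Dict

import numpy as np
import soundfile as sf  # assumed available for reading .flac files


def build_lazy_signal_dict(root: Path) -> Dict[str, Callable[[], np.ndarray]]:
    """Map each relative file path to a zero-argument loader.

    Nothing is read until a loader is called, and dicts preserve
    insertion order, so ordering and indexability by key are kept.
    """
    loaders: Dict[str, Callable[[], np.ndarray]] = {}
    for flac_path in sorted(root.rglob("*.flac")):
        key = str(flac_path.relative_to(root))
        # the default argument pins this entry's path inside the lambda
        loaders[key] = lambda p=flac_path: sf.read(p)[0]
    return loaders


# Usage: iterate lazily, e.g. to assemble mini-batches one signal at a time.
# signals = build_lazy_signal_dict(Path("data/01_raw/audio"))
# for key, load in signals.items():
#     signal = load()   # only now is this file read into memory
#     ...               # compute features, write results, free the array
```

Because the keys are ordinary strings in a fixed order, splitting the data or drawing mini-batches reduces to slicing or sampling the key list, with only the selected signals ever loaded.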