09/27/2022, 3:12 PM
Custom DataSet for larger than memory data -- dask, SQL, other?
3:12 PM
Hi, I'm currently confronted with a file directory of container-format files, where each container holds one audio signal (i.e. an array) of a few megabytes and variable length, plus a handful of miscellaneous scalar parameters. So I've written a custom DataSet that takes the root directory of my data as filepath. Its loading method returns a Tuple[pd.DataFrame, List[np.ndarray]]: the dataframe holds all scalar parameters and allows for easy exploration and analysis, while the list holds the actual data payload. At saving time, I want to mirror the original directory structure, but for each container file write a yaml file holding the parameters and a .flac audio file holding the signal.

Now I'm confronted with a directory that holds some 70 GB (larger than memory) of audio data. What could I use instead of a list of numpy arrays? To my knowledge, e.g. dask bags don't preserve order, while SQL might carry a large implementation overhead (especially since I've never worked with it)... Does anybody have experience with what one could do with such large data while staying suitable for an ML workflow (splitting the data, doing mini-batches later on, etc.)?
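One sketch of the kind of thing I mean (hypothetical names, not my actual DataSet): instead of materializing a List[np.ndarray], the load method could return a list of small lazy-loader objects that only hold a file path and read the array on first access, so order is preserved but memory stays bounded. The demo below uses .npy files via np.load just to keep it self-contained; a real version would point at the .flac files and use a reader like soundfile instead.

```python
import tempfile
from pathlib import Path

import numpy as np


class LazyAudio:
    """Holds only a path; defers reading the array until .load() is called."""

    def __init__(self, path: Path):
        self.path = path

    def load(self) -> np.ndarray:
        # Swap np.load for e.g. soundfile.read on real .flac containers.
        return np.load(self.path)


def build_lazy_list(root: Path) -> list:
    # A sorted glob keeps a stable, reproducible order (unlike a dask bag).
    return [LazyAudio(p) for p in sorted(root.glob("*.npy"))]


# Demo: write three small "signals" to disk, then load them one at a time.
root = Path(tempfile.mkdtemp())
for i in range(3):
    np.save(root / f"sig_{i}.npy", np.arange(i + 1, dtype=np.float32))

lazy = build_lazy_list(root)
print(len(lazy))             # only paths are in memory so far
print(lazy[2].load().shape)  # one signal materialized on demand
```

The same idea underlies dask.delayed or a PyTorch-style map Dataset: keep an ordered index of paths in memory, and stream the payloads per item or per mini-batch.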