datajoely
03/11/2022, 2:33 PM

Walber Moreira
03/11/2022, 2:34 PM

datajoely
03/11/2022, 2:35 PM
`ThreadRunner` is for Spark and other remote execution workloads
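A minimal sketch of that advice, assuming a Kedro 0.17.x project (the one-node pipeline here is made up for illustration):

```python
# Hedged sketch: ThreadRunner runs nodes in a thread pool, which pays off
# when nodes mostly wait on remote systems (Spark, databases, REST APIs)
# rather than compete for the Python GIL.
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import ThreadRunner

def fetch(raw):
    return raw  # stand-in for a call out to a remote execution engine

pipeline = Pipeline([node(fetch, "raw", "fetched")])
catalog = DataCatalog({"raw": MemoryDataSet(42)})

# Inside a project this is equivalent to `kedro run --runner=ThreadRunner`
ThreadRunner().run(pipeline, catalog)
```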
Walber Moreira
03/11/2022, 2:40 PM

Walber Moreira
03/11/2022, 2:40 PM

datajoely
03/11/2022, 2:44 PM

Deep
03/11/2022, 2:46 PM

datajoely
03/11/2022, 2:47 PM

Walber Moreira
03/11/2022, 3:44 PM

boazmohar
03/11/2022, 4:07 PM

datajoely
03/11/2022, 4:15 PM

boazmohar
03/11/2022, 4:21 PM

datajoely
03/11/2022, 4:23 PM
```yaml
my_data:
  type: MyCustomImageClass
  data_location: path/to/files
  metadata_location: path/to/spec.xml
```
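For context (standard Kedro behaviour rather than something stated in the thread): every key in a catalog entry other than `type` is passed to the dataset's constructor as a keyword argument, so the entry above amounts to roughly:

```python
# Roughly what DataCatalog.from_config would do with the sketch above;
# MyCustomImageClass is the placeholder name from the YAML.
dataset = MyCustomImageClass(
    data_location="path/to/files",
    metadata_location="path/to/spec.xml",
)
```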
datajoely
03/11/2022, 4:23 PM
Does `metadata_location` need to be dynamic?

datajoely
03/11/2022, 4:25 PM

datajoely
03/11/2022, 4:26 PM

boazmohar
03/11/2022, 4:28 PM
It can be derived from the `data_location`, so there is not even a need for an xml path... Here is part of the class:
```python
import glob
import logging
import os
from pathlib import PurePosixPath
from typing import Any, Dict

import dask
import dask.array as da
import fsspec
import numpy as np
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path

# get_meta_alpha3 and load_tiff_stack are the author's own helpers (not shown)

logger = logging.getLogger(__name__)


class DaskAlpha3TifsDataset(AbstractDataSet):
    def __init__(self, filepath: str, params: Dict[str, Any] = None):
        # parse the path and protocol (e.g. file, http, s3, etc.)
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)
        load_path = get_filepath_str(self._filepath, self._protocol)
        self.xml = get_meta_alpha3(load_path)
        self.xml['filters'] = params
        self.xml['ch_names'] = [params[i] for i in self.xml['filter_order']]

    def _load(self) -> da.Array:
        """Loads data from the image files.

        Returns:
            Data from the image files as a lazy dask array.
        """
        file_shapes = self.xml['file_shapes']
        base_dir = self.xml['base_dir']
        files = glob.glob(os.path.join(base_dir, 'Raw', '*.tiff'))
        logger.info(f'Found {len(files)} files in {base_dir} Raw folder')
        # one delayed reader per tiff, each with a known (y, x) plane shape
        sizes = [(file_shapes[3], file_shapes[4])] * len(files)
        delay = [dask.delayed(load_tiff_stack)(fn) for fn in files]
        both = list(zip(delay, sizes))
        # group consecutive files into z-stacks, then stack tiles/channels
        slices = [slice(i, i + file_shapes[2]) for i in range(0, len(both), file_shapes[2])]
        lazy_arrays = [da.from_delayed(x, shape=y, dtype=np.uint16) for x, y in both]
        lazy_arrays_conZ = [da.stack(lazy_arrays[s], axis=0) for s in slices]
        lazy_arrays_conTileCh = da.stack(lazy_arrays_conZ, axis=0).reshape(file_shapes[[5, 0, 1, 2, 3, 4]])
        return lazy_arrays_conTileCh
```
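Since this is only part of the class, note that `AbstractDataSet` subclasses must also implement `_describe` and `_save`. A sketch of what the missing methods might look like for a read-only dataset (the bodies are assumptions, not from the thread):

```python
# To be added inside DaskAlpha3TifsDataset; DataSetError is importable
# from kedro.io.
from kedro.io import DataSetError

def _describe(self) -> Dict[str, Any]:
    # Report what identifies this dataset instance
    return dict(filepath=str(self._filepath), protocol=self._protocol)

def _save(self, data: da.Array) -> None:
    # The dataset only reads acquired tiffs, so refuse saves
    raise DataSetError("DaskAlpha3TifsDataset is read-only")
```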
boazmohar
03/11/2022, 4:29 PM
`get_meta_alpha3` knows how to find the xml based on the path
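The helper itself is not shown anywhere in the thread; purely as an illustration, it might locate a metadata file sitting next to the raw data, along these lines (file name and XML layout are hypothetical):

```python
import os
import xml.etree.ElementTree as ET
from typing import Any, Dict

import numpy as np

def get_meta_alpha3(load_path: str) -> Dict[str, Any]:
    """Hypothetical reconstruction; the real helper is the author's own code."""
    # Assumption: the microscope writes a settings XML next to the raw data
    root = ET.parse(os.path.join(load_path, "settings.xml")).getroot()
    return {
        "base_dir": load_path,
        # Assumption: shapes stored as whitespace-separated integers
        "file_shapes": np.array(root.findtext("file_shapes").split(), dtype=int),
        "filter_order": [int(x) for x in root.findtext("filter_order").split()],
    }
```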
datajoely
03/11/2022, 4:29 PM

boazmohar
03/11/2022, 4:29 PM
`node`

datajoely
03/11/2022, 4:30 PM
`params` however would be in the catalog definition, NOT the node
boazmohar
03/11/2022, 4:30 PM
```yml
gel3_round1:
  type: alpha3_expand.extra.datasets.dask_alpha3_tifs_dataset.DaskAlpha3TifsDataset
  filepath: /nrs/svoboda/moharb/ExM/Alpha3/20220310_YFP_ANM1_Gel3_R1_v2/Basal/
  params:
    2: YFP
    3: PSD95
    4: GluA1
```
datajoely
03/11/2022, 4:30 PM
```yaml
my_data:
  type: DaskAlpha3TifsDataset
  filepath: path/to/files
  params:
    xml: path/to/spec.xml
```
boazmohar
03/11/2022, 4:31 PM
`_load` returns a `DaskArray`, but how do I access `self.xml`?

boazmohar
03/11/2022, 4:35 PM
Maybe `DaskAlpha3TifsDataset._load` could return a dict with the raw data and metadata?
boazmohar
03/11/2022, 4:39 PM
```python
def _load(self) -> Dict[str, Any]:
    """Loads data from the image files.

    Returns:
        A dict with the image data as a lazy dask array under 'data'
        and the parsed metadata under 'meta'.
    """
    file_shapes = self.xml['file_shapes']
    base_dir = self.xml['base_dir']
    files = glob.glob(os.path.join(base_dir, 'Raw', '*.tiff'))
    logger.info(f'Found {len(files)} files in {base_dir} Raw folder')
    sizes = [(file_shapes[3], file_shapes[4])] * len(files)
    delay = [dask.delayed(load_tiff_stack)(fn) for fn in files]
    both = list(zip(delay, sizes))
    slices = [slice(i, i + file_shapes[2]) for i in range(0, len(both), file_shapes[2])]
    lazy_arrays = [da.from_delayed(x, shape=y, dtype=np.uint16) for x, y in both]
    lazy_arrays_conZ = [da.stack(lazy_arrays[s], axis=0) for s in slices]
    lazy_arrays_conTileCh = da.stack(lazy_arrays_conZ, axis=0).reshape(file_shapes[[5, 0, 1, 2, 3, 4]])
    return {'data': lazy_arrays_conTileCh, 'meta': self.xml}
```
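A downstream node would then unpack that dict; a quick sketch of a consumer (the function and output name are illustrative, `gel3_round1` is the catalog entry from above):

```python
from kedro.pipeline import node

def describe_stack(loaded):
    # The metadata now travels together with the lazy dask array
    data, meta = loaded["data"], loaded["meta"]
    print(data.shape, meta["ch_names"])
    return data

describe_node = node(describe_stack, inputs="gel3_round1", outputs="stack")
```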
datajoely
03/11/2022, 6:39 PM

boazmohar
03/11/2022, 8:36 PM

user
03/12/2022, 6:30 PM
In the `register_flow.py` script, the datasets in the catalog object are named as in the file. However, in the nodes the input and output datasets are namespaced. Therefore, when running that flow, it creates only memory datasets, because it assumes none of the datasets exist in the catalog. Now, if I change `register_flow.py` so that it does not create MemoryDataSets for everything, the `run_node` function does not work, as the input and catalog names don't match up and the save/load functions no longer work (it tries loading a namespaced dataset that it can't find in the catalog). Is there a way to obtain either a namespaced catalog or a pipeline object where the inputs/outputs of the nodes are not namespaced, so that the `run_node` function works properly? 🙂
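One direction this could go (a sketch against the 0.17.x API, with `pipeline` assumed to come from the project context as in the deployment guide's `register_flow.py`): build the catalog around the namespaced names the nodes actually use, which in `catalog.yml` simply means naming the entry with its prefix.

```python
# Sketch of a "namespaced catalog": key the entries by the names the
# nodes refer to ("ds" is an illustrative namespace).
from kedro.io import DataCatalog, MemoryDataSet

catalog = DataCatalog({"ds.my_data": MemoryDataSet()})
for name in pipeline.data_sets():   # namespaced names, e.g. "ds.model"
    if name not in catalog.list():
        catalog.add(name, MemoryDataSet())
```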
user
03/13/2022, 11:21 AM
Does `0.17.7` no longer work with the starters? In `0.17.6` the spaceflights starter still works fine, but in `0.17.7` the new namespacing seems to break the starter (pipelines and their nodes now include namespacing, but the catalog does not).
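For reference, this is the behaviour change in question: wrapping a modular pipeline in a namespace prefixes its free inputs and outputs, so the catalog must use the prefixed names unless they are remapped. A small sketch (dataset names are illustrative):

```python
from kedro.pipeline import Pipeline, node, pipeline

def identity(x):
    return x

base = Pipeline([node(identity, "raw", "clean")])
namespaced = pipeline(base, namespace="ds")

print(namespaced.inputs())   # {'ds.raw'}  -> catalog must define "ds.raw"
print(namespaced.outputs())  # {'ds.clean'}
# Unless remapped, e.g. pipeline(base, namespace="ds", inputs={"raw": "raw"})
```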