WolVez
03/15/2022, 4:02 PM
lbonini
03/15/2022, 8:44 PM
example_csv: # Input
  type: pandas.ParquetDataSet
  filepath: "s3://my_bucket/yyyy-mm-dd/*.parquet"
I need to save it inside a today() folder...
I was wondering about doing this with TemplatedConfigLoader, using environment variables, but I need to know whether it is simpler than I expect...
datajoely
03/15/2022, 8:46 PM
You would use PartitionedDataSet to do this with pandas. Spark will do this automatically.
lbonini
03/15/2022, 8:51 PM
datajoely
03/15/2022, 8:54 PM
datajoely
03/15/2022, 8:54 PM
DIVINE
03/15/2022, 8:54 PM
datajoely
03/15/2022, 8:57 PM
datajoely
03/15/2022, 8:58 PM
antony.milne
03/16/2022, 9:32 AM
vivecalindahl
03/16/2022, 10:05 AM
FILEPATH=/path/to/file/
and in my data catalog I'd want something like
example_iris_data:
  type: pandas.CSVDataSet
  filepath: "${params:filepath}"
Then I could run kedro as kedro run --params filepath=$FILEPATH.
I'm aware of the TemplatedConfigLoader (https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#template-configuration). But is there a way of skipping the extra loop through the config file globals.yaml? I haven't been able to figure out how to do what I outlined above, i.e. using the parameters dict directly.
Basically it would be nice to directly provide the env variable, without first creating a globals.yaml containing a single variable.
avan-sh
03/16/2022, 12:54 PM
The register_config_loader hook (https://kedro.readthedocs.io/en/latest/kedro.framework.hooks.specs.RegistrationSpecs.html#kedro.framework.hooks.specs.RegistrationSpecs.register_config_loader) also gets all the extra_params you'll be passing from the command line. You can use that to add filepath to globals_dict. Also, your filepath placeholder will be ${filepath}. LMK if I'm not clear, I can help get a quick snippet later if you want.
datajoely
03/16/2022, 12:57 PM
vivecalindahl
03/16/2022, 1:15 PM
avan-sh
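03/16/2022, 1:25 PM
from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths, env, extra_params) -> ConfigLoader:
        # extra_params carries the values passed via `kedro run --params`;
        # the `or {}` guards against None in case no --params were given.
        globals_dict = {}
        globals_dict["filepath"] = (extra_params or {}).get("filepath", "default_value")
        return TemplatedConfigLoader(
            conf_paths,
            globals_dict=globals_dict,
        )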
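To make the placeholder mechanics concrete, here is a rough stdlib-only sketch of what resolving `${filepath}` from a globals dict amounts to. It is only an illustration and not Kedro's actual TemplatedConfigLoader code: `render_templates`, the simplified placeholder pattern, and the sample catalog dict are all made up for this example.

```python
import re
from typing import Any, Dict

# Matches ${name}; a simplified stand-in for the real placeholder syntax.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_templates(config: Any, globals_dict: Dict[str, Any]) -> Any:
    """Recursively replace ${name} in strings with values from globals_dict."""
    if isinstance(config, dict):
        return {k: render_templates(v, globals_dict) for k, v in config.items()}
    if isinstance(config, list):
        return [render_templates(v, globals_dict) for v in config]
    if isinstance(config, str):
        # Unknown names are left untouched rather than raising (a choice
        # made for this sketch only).
        return _PLACEHOLDER.sub(
            lambda m: str(globals_dict.get(m.group(1), m.group(0))), config
        )
    return config


# Hypothetical catalog entry, mirroring the one discussed above.
catalog = {
    "example_iris_data": {
        "type": "pandas.CSVDataSet",
        "filepath": "${filepath}",
    }
}
extra_params = {"filepath": "/path/to/file/"}  # stands in for the --params values
globals_dict = {"filepath": extra_params.get("filepath", "default_value")}

rendered = render_templates(catalog, globals_dict)
print(rendered["example_iris_data"]["filepath"])
```

The point of the sketch is that the value only has to reach `globals_dict` somehow; whether it comes from `extra_params`, an environment variable, or a globals.yaml is up to the hook.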
vivecalindahl
03/16/2022, 1:25 PM
lbonini
03/16/2022, 1:53 PM
# catalog.yml
example_data:
  type: PartitionedDataSet
  dataset: pandas.ParquetDataSet
  credentials: dev_s3
  path: s3://bucket/path/to/folder
  filename_suffix: "_part.parquet"

# credentials.yml
dev_s3:
  client_kwargs:
    aws_access_key_id: xxxx
    aws_secret_access_key: xxx
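For the "today() folder" requirement, a sketch of how a dated partition id could map onto this catalog entry. As far as I understand it, PartitionedDataSet saves a dict of partition id to data and appends each id plus `filename_suffix` to `path`; the helpers below (`dated_partition`, `partition_to_path`) are hypothetical and only approximate that mapping in plain Python, they are not Kedro functions.

```python
from datetime import date
from typing import Any, Dict, Optional


def dated_partition(
    data: Any, name: str = "data", day: Optional[date] = None
) -> Dict[str, Any]:
    """Wrap data in a {partition_id: data} dict with a yyyy-mm-dd folder prefix."""
    day = day or date.today()
    return {f"{day.isoformat()}/{name}": data}


def partition_to_path(base: str, partition_id: str, suffix: str = "") -> str:
    """Approximate the file path a partition id would be written to."""
    return f"{base.rstrip('/')}/{partition_id}{suffix}"


# A node would return something like this dict; the payload is a placeholder
# for a DataFrame, and the date is fixed so the output is reproducible.
partitions = dated_partition({"rows": 3}, day=date(2022, 3, 16))
for partition_id in partitions:
    print(partition_to_path("s3://bucket/path/to/folder", partition_id, "_part.parquet"))
# prints s3://bucket/path/to/folder/2022-03-16/data_part.parquet
```

Leaving `day` unset falls back to `date.today()`, so the folder name changes with each run without touching the catalog.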
datajoely
03/16/2022, 1:53 PM
lbonini
03/16/2022, 2:07 PM
gui42
03/17/2022, 2:23 AM
gui42
03/17/2022, 2:23 AM
datajoely
03/17/2022, 9:17 AM
Schoolmeister
03/17/2022, 9:37 AM
I have a parameters.yml with the following contents:
folds:
  timeseries1:
    - start: 2021-08-24 15:00:00+00:00
      end: 2021-10-22 03:05:00+00:00
    - start: 2021-10-22 03:10:00+00:00
      end: 2021-12-28 05:00:00+00:00
    - start: 2021-12-28 05:05:00+00:00
      end: 2022-01-28 12:00:00+00:00
Can I reference the second [start, end]? I've tried using something like params:folds.timeseries1.1 or params:folds.timeseries1[1], but that doesn't work.
datajoely
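A note on the list-indexing question above: to my knowledge, Kedro only expands nested dictionaries into dotted `params:` keys and leaves lists whole, which would explain why neither form resolves. The sketch below imitates that flattening in plain Python (`flatten_params` and `second_fold` are hypothetical names, not Kedro APIs) and shows the workaround of passing the whole list to the node and indexing there.

```python
from typing import Any, Dict, List


def flatten_params(params: Dict[str, Any], prefix: str = "params:") -> Dict[str, Any]:
    """Build dotted keys for nested dicts only; lists are kept whole."""
    feed: Dict[str, Any] = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        feed[name] = value
        if isinstance(value, dict):
            feed.update(flatten_params(value, prefix=f"{name}."))
    return feed


# Same structure as the parameters.yml above; timestamps are kept as plain
# strings here, whereas a YAML loader would parse them into datetimes.
parameters = {
    "folds": {
        "timeseries1": [
            {"start": "2021-08-24 15:00:00+00:00", "end": "2021-10-22 03:05:00+00:00"},
            {"start": "2021-10-22 03:10:00+00:00", "end": "2021-12-28 05:00:00+00:00"},
            {"start": "2021-12-28 05:05:00+00:00", "end": "2022-01-28 12:00:00+00:00"},
        ]
    }
}
feed = flatten_params(parameters)


def second_fold(folds: List[Dict[str, str]]) -> Dict[str, str]:
    """Hypothetical node body: receive the whole list and index it in Python."""
    return folds[1]


print(sorted(feed))  # only the dict levels get their own keys
print(second_fold(feed["params:folds.timeseries1"]))
```

In a pipeline this would correspond to a node whose input is `params:folds.timeseries1`, with the indexing done inside the node function.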
03/17/2022, 9:38 AM
Schoolmeister
03/17/2022, 9:40 AM
lbonini
03/17/2022, 3:52 PM
kedro run
datajoely
03/17/2022, 3:57 PM
lbonini
03/17/2022, 3:57 PM
datajoely
03/17/2022, 4:04 PM
lbonini
03/17/2022, 4:05 PM
lbonini
03/17/2022, 4:05 PM
datajoely
03/17/2022, 4:08 PM
lbonini
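03/17/2022, 4:09 PM
from typing import Any, Dict, Iterable
from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

class ProjectHooks:
    @hook_impl
    def register_config_loader(
        self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any],
    ) -> TemplatedConfigLoader:
        print(conf_paths)  # debug: show which conf directories were resolved
        return TemplatedConfigLoader(conf_paths)  # assumed; the original stopped at print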
datajoely
03/17/2022, 4:10 PM
lbonini
03/17/2022, 4:18 PM
datajoely
03/17/2022, 4:19 PM
lbonini
03/17/2022, 4:20 PM
datajoely
03/17/2022, 4:20 PM
lbonini
03/17/2022, 4:20 PM