noklam
04/21/2022, 3:44 PM
Rafał
04/21/2022, 9:37 PM
hook_manager
Unfortunately all the documentation says nothing about hook_manager and how to initialize it.
Moreover, the documentation code gives an error, since calling run(pipeline, catalog=catalog) yields
TypeError: run() missing 1 required positional argument: 'hook_manager'
Rafał
04/21/2022, 9:41 PM
datajoely
04/21/2022, 9:44 PM
Rafał
04/21/2022, 10:21 PM
datajoely
04/21/2022, 10:22 PM
noklam
04/21/2022, 10:27 PM
Use session.run() instead of calling the runner directly.
If you absolutely need the runner for some reason, this is a hack that works for 0.18.0, but there is no guarantee it will continue to work in coming versions:
from kedro.framework.session.session import _create_hook_manager
print(runner.run(greeting_pipeline, data_catalog, _create_hook_manager()))
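A self-contained sketch of this 0.18.0 workaround, with the missing pieces (pipeline, catalog, runner) filled in; the names greet, greeting_pipeline and data_catalog are illustrative, and _create_hook_manager is private API that may move between releases:
```python
from kedro.framework.hooks.manager import _create_hook_manager  # private API
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import node, pipeline
from kedro.runner import SequentialRunner


def greet(name: str) -> str:
    return f"Hello, {name}!"


# A toy pipeline and an in-memory catalog standing in for the real project objects.
greeting_pipeline = pipeline([node(greet, inputs="name", outputs="greeting")])
data_catalog = DataCatalog({"name": MemoryDataSet("Kedro")})

runner = SequentialRunner()
# On kedro 0.18.0 the hook manager is a required positional argument of runner.run().
print(runner.run(greeting_pipeline, data_catalog, _create_hook_manager()))
```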
Rafał
04/22/2022, 5:34 AM
session.run(), then — it is just that I have found in the documentation that "Running pipeline" should create the runner first.
noklam
04/22/2022, 8:07 AM
eleonora.picca
04/22/2022, 10:49 AM
> from kedro.framework.hooks import _create_hook_manager
> print(runner.run(greeting_pipeline, data_catalog, _create_hook_manager()))
but I would like to use the session. Any help on how I could get or create the session to use for the session.run() command? Is there a way to pass a specific DataCatalog to the session? (I am using this to perform some tests and I created a fake data catalog for it.)
Thank you in advance!
noklam
04/22/2022, 2:43 PM
You would normally only create a session yourself in interactive mode (Jupyter/IPython); the most common way to interact with kedro is via the CLI, kedro run, which executes the pipeline. Under the hood, it will create all the necessary components like the session, context and catalog for you.
To start a new kedro project, you would do kedro new.
You can do kedro new --starter=spaceflights, which will create a template project for you with more advanced kedro features. Then you can run a pipeline via kedro run. The tutorial above will guide you step by step through creating such a project in practice.
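If you do want the session object itself (outside the CLI), here is a minimal sketch of creating one programmatically, assuming the working directory is the root of a kedro 0.18.x project:
```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path.cwd()        # root of the kedro project
bootstrap_project(project_path)  # loads the project's settings and metadata

with KedroSession.create(project_path=project_path) as session:
    session.run()                # runs the __default__ pipeline
```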
eleonora.picca
04/22/2022, 3:18 PM
Is there a way to pass a DataCatalog to session.run? The runner.run has a DataCatalog argument, while session.run doesn't, but in this case it would be useful to be able to pass a specific DataCatalog.
Thank you again for your time
noklam
04/22/2022, 3:46 PM
session.run() doesn't have a data catalog argument because the catalog is managed by the session itself.
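To see which catalog the session is going to use (it is built from the conf/ directory rather than passed in), you can load it from the context; a small sketch, assuming a session created as in the example above:
```python
# Inspect the catalog the session manages before running anything.
context = session.load_context()
catalog = context.catalog
print(catalog.list())   # dataset names resolved from the active config environment
```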
eleonora.picca
04/22/2022, 8:33 PM
noklam
04/22/2022, 8:56 PM
kedro run --env=test? Or optionally kedro run --env=test --pipeline=TARGET_PIPELINE?
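A rough programmatic equivalent of those commands, assuming a conf/test environment exists in the project and reusing the KedroSession import and project_path from the earlier sketch (TARGET_PIPELINE is a placeholder):
```python
# Create the session against the "test" config environment, then run one pipeline.
with KedroSession.create(project_path=project_path, env="test") as session:
    session.run(pipeline_name="TARGET_PIPELINE")
```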
Barros
04/23/2022, 1:58 PM
DataSetError: An exception occurred when parsing config for DataSet `val_csv_glebas`:
Class `local_pipeline.extras.io.vector_datasets.ShpVectorDataset` not found or one of its dependencies has not been installed.
How can I make the module local_pipeline.extras.io.vector_datasets.ShpVectorDataset known to the DataCatalog class?
datajoely
04/23/2022, 2:00 PM
If you use kedro jupyter notebook, it will be registered for you.
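The reason kedro jupyter notebook helps is that the type string in catalog.yml is resolved by importing that dotted path, so the project package has to be importable from the running interpreter. A quick check, assuming the package has been installed into the environment (for example with pip install -e src/):
```python
# If this import fails, DataCatalog will raise the same "Class not found" error.
import importlib

module = importlib.import_module("local_pipeline.extras.io.vector_datasets")
dataset_cls = module.ShpVectorDataset  # the class the catalog entry points at
print(dataset_cls)
```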
Barros
04/23/2022, 2:03 PM
user
04/25/2022, 9:12 AM
Flow
04/26/2022, 2:13 PM
conf as part of the src/ folder — is there any good example of how that actually looks? I guess what I am trying to figure out is, once it's there, do people add it to package_data in setup.py and then somehow change the CONF_SOURCE variable in settings.py, or are there better approaches?
Use case is an Airflow deployment.
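A sketch of the approach described in the question, not an official kedro recipe; my_project and the glob patterns are hypothetical, and it assumes conf/ has been moved under the package directory in src/:
```python
# src/setup.py (sketch): ship the packaged conf files inside the wheel so an
# Airflow worker gets them alongside the code.
from setuptools import find_packages, setup

setup(
    name="my_project",
    packages=find_packages(exclude=["tests"]),
    package_data={"my_project": ["conf/base/*.yml", "conf/base/*/*.yml"]},
)
```
CONF_SOURCE in settings.py would then need to point at wherever those files end up at runtime; whether that beats shipping conf/ separately next to the package is exactly the open question here.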
Flow
04/26/2022, 2:33 PM
Rafał
04/26/2022, 3:12 PM
datajoely
04/26/2022, 3:15 PM
Rafał
04/26/2022, 3:40 PM
avan-sh
04/26/2022, 4:12 PM
williamc
04/26/2022, 7:30 PM
datajoely
04/26/2022, 7:30 PM
williamc
04/26/2022, 7:47 PM
Mirko
04/26/2022, 8:14 PM
```python
def do_something(db_name: str, table_name: str):
    # `spark` is an existing SparkSession
    spark.sql(
        "SELECT <some complicated expression> FROM {:s}.{:s}".format(db_name, table_name)
    )
    # write results to another table.
```
I am wondering what the best way to convert this to kedro is. I can load the table as a Spark DataFrame using kedro.extras.datasets.spark.SparkDataSet, but I would like to avoid rewriting all of the SQL queries in the DataFrame API. Does it make sense to do something like this:
```python
from pyspark.sql import DataFrame, SparkSession


def my_node(my_table: DataFrame) -> DataFrame:
    spark = SparkSession.builder.getOrCreate()  # reuse the active Spark session
    my_table.createOrReplaceTempView("tmp_table")
    # The SQL query is just copied from the function above
    result = spark.sql("SELECT <some complicated expression> FROM tmp_table")
    spark.catalog.dropTempView("tmp_table")
    return result
```
Creating the temporary view seems like a bit of a hack to me, but I can't think of a better way that allows me to avoid rewriting the SQL queries in the DataFrame API. I'm also not sure if this has any performance implications.
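For the kedro wiring itself, the temp-view node can be registered like any other; a sketch assuming my_node is the function above and my_table / my_result are declared in the catalog as Spark datasets:
```python
from kedro.pipeline import node, pipeline

# "my_table" is loaded by kedro (e.g. a spark.SparkDataSet entry) and passed in;
# "my_result" is saved by whatever dataset the catalog maps that name to.
sql_pipeline = pipeline(
    [node(my_node, inputs="my_table", outputs="my_result", name="run_sql_query")]
)
```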
datajoely
04/27/2022, 9:57 AM