Powered by Linen
beginners-need-help
  • j

    Jose Alejandro M

    08/31/2022, 10:21 PM
    So ThreadRunner would handle this case? If I use SequentialRunner, are the nodes executed in the order shown in the second image?
  • n

    noklam

    08/31/2022, 10:30 PM
    It depends on what you are doing; threading isn't straightforward in Python due to the GIL.
  • n

    noklam

    08/31/2022, 10:31 PM
    They will be executed one by one, but the order a, b, c is not guaranteed.
  • n

    noklam

    08/31/2022, 10:31 PM
    In a DAG they are independent, so it shouldn't matter which one is executed first.
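To illustrate the point about ordering, here is a minimal sketch using Python's standard-library `graphlib` rather than Kedro's own runner code. The node names `a`, `b`, `c` come from the discussion above; the downstream node `d` and the helper `is_valid_order` are hypothetical, added only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: a, b and c do not depend on each other; d needs all three.
dependencies = {
    "a": set(),
    "b": set(),
    "c": set(),
    "d": {"a", "b", "c"},
}


def is_valid_order(order, deps):
    """Return True if every node appears after all of its dependencies."""
    position = {node: index for index, node in enumerate(order)}
    return all(
        position[dep] < position[node]
        for node, needed in deps.items()
        for dep in needed
    )


# One valid execution order; the relative order of a, b and c is not guaranteed.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Any permutation of a, b, c is acceptable as long as d runs last.
print(is_valid_order(["c", "a", "b", "d"], dependencies))
```

A sequential runner only has to respect the dependency edges, so a pipeline should not rely on independent nodes running in any particular order.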
  • j

    Jose Alejandro M

    09/01/2022, 1:10 AM
    OK, thanks, I will take a look at it 😄
  • f

    fantasprite

    09/01/2022, 12:20 PM
    Hi, I am using Kedro for my ML project. I am happy that I got to know Kedro, because it makes it much easier to implement an ML model as a data scientist. Currently I am working on making a Kedro ML project run on a Spark cluster that uses a Hive database. For this I use Kedro's SparkHiveDataSet. The problem I faced was a mismatch between a column name used in the Kedro code and the column name used in the Hive database I was using. I use Kedro version 0.17.7, and below is the code of the function _exists() in spark_hive_dataset.py:

        def _exists(self) -> bool:
            if (
                self._get_spark()
                .sql("show databases")
                .filter(col("namespace") == lit(self._database))
                .take(1)
            ):
                self._get_spark().sql(f"use {self._database}")
                if (
                    self._get_spark()
                    .sql("show tables")
                    .filter(col("tableName") == lit(self._table))
                    .take(1)
                ):
                    return True
            return False

    I changed "namespace" to "databaseName", and after that I was able to load the data from the Hive table. What I found in the newest version, 0.18.2, is that the code of _exists() has been replaced with new code that uses a Spark function to do the same thing:

        def _exists(self) -> bool:
            # noqa # pylint:disable=protected-access
            return (
                self._get_spark()
                ._jsparkSession.catalog()
                .tableExists(self._database, self._table)
            )

    I had trouble using version 0.18.2 (I got various errors), so I kept the version I am using (0.17.7) and only replaced the code of _exists() in spark_hive_dataset.py with the updated code from version 0.18.2. I would like to ask whether it is OK to use it like this. Could this cause any problems?
  • n

    noklam

    09/01/2022, 12:50 PM
    Did you just overwrite the source code? It's fine if you are just doing it for yourself.
    1. Ideally, you can just upgrade to `0.18.2`; share the errors that you get and we can work together to help you with the migration.
    2. If upgrading is not possible, a slightly better option is to make your own `CustomHiveDataSet` that inherits from the original `SparkHiveDataSet` and implements the `_exists` method to override its behavior. You can package that in your project. One additional thing you need to do is update the `DataSet` type in your `catalog.yml` to point to your custom implementation instead.
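The override pattern described in option 2 might look roughly like the sketch below. To keep it runnable without Spark or Kedro installed, `StandInHiveDataSet` and `FakeCatalog` are hypothetical stand-ins and not part of any library. In a real project the subclass would instead inherit from Kedro's `SparkHiveDataSet`, and `_exists` would contain the `tableExists` call quoted earlier in the thread.

```python
class FakeCatalog:
    """Hypothetical stand-in for the Spark catalog; only tableExists is modelled."""

    def __init__(self, tables):
        self._tables = set(tables)

    def tableExists(self, database, table):
        return (database, table) in self._tables


class StandInHiveDataSet:
    """Hypothetical stand-in for the library dataset whose _exists fails on this cluster."""

    def __init__(self, database, table, catalog):
        self._database = database
        self._table = table
        self._catalog = catalog

    def _exists(self) -> bool:
        # Mimics the column-name mismatch described in the thread.
        raise RuntimeError("column 'namespace' not found")

    def exists(self) -> bool:
        # Public entry point that delegates to the overridable hook.
        return self._exists()


class CustomHiveDataSet(StandInHiveDataSet):
    """Project-local subclass: everything is inherited except _exists."""

    def _exists(self) -> bool:
        return self._catalog.tableExists(self._database, self._table)


catalog = FakeCatalog([("sales_db", "orders")])
print(CustomHiveDataSet("sales_db", "orders", catalog).exists())
print(CustomHiveDataSet("sales_db", "missing", catalog).exists())
```

Because the subclass lives in the project's own package, it survives a reinstall of the dependencies. The catalog entry's `type` would then point at the project's module path for `CustomHiveDataSet`; the exact path depends on where the class is placed in the project.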
  • f

    fantasprite

    09/01/2022, 1:24 PM
    I have overwritten the source code with the updated source code. It is good to know that this is in principle not a problem. I plan to migrate to the newer version in the future, but I have to reserve some time for it. I will make a CustomHiveDataSet to make it work better if I have enough time 🙂. Thank you for your help, Kedro team!
  • n

    noklam

    09/01/2022, 1:38 PM
    Just be aware that it will only work on your own laptop; once you need to share or distribute the code, it won't work anymore.
  • f

    fantasprite

    09/01/2022, 1:47 PM
    This Kedro project is only deployed on a platform. Other users will use it there by providing their data and getting the predictions back, so I don't think the code will be shared.
  • n

    noklam

    09/01/2022, 2:08 PM
    In that sense, it's kind of a deployed project. Personally, I would be very cautious about doing this in a production environment. It's dangerous because once you re-deploy, or any CI/CD step reinstalls the dependencies, things will break and no one will know why.
  • f

    fantasprite

    09/01/2022, 2:13 PM
    Thank you for the information! Then I will create a CustomHiveDataSet to do the work. Will this be sufficient for a deployed project?
  • f

    fantasprite

    09/01/2022, 2:52 PM
    I see that it is relatively simple to create a custom dataset. I just copied all the contents of SparkHiveDataSet and placed them in CustomHiveDataSet, changing only the code in the _exists() function. By doing this I am now able to run the project the same way as before. Is this sufficient for a deployed project, or does Kedro have to be migrated to the newest version?
  • r

    rohan_ahire

    09/01/2022, 9:49 PM
    Hi all. I wanted to test experiment tracking using Kedro. I did not want to use the MLflow plugin and just wanted to try out what Kedro can do. I was using the spaceflights pipeline, defined a tracking.JSONDataSet in the data catalog, and modified the pipeline to output the metrics. However, I do not see the metrics dataset being written to disk. It gives the message below:
    ```
    INFO     Model has a coefficient R^2 of 0.449 on test data.    nodes.py:56
    INFO     Saving data to 'data_science.candidate_modelling_pipeline.metrics' (MemoryDataSet)...
    ```
  • r

    rohan_ahire

    09/01/2022, 9:51 PM
    I also tried to hard-code the metrics dataset with a sample JSON file just to see how Kedro-Viz would show the metric value. However, Kedro-Viz does not show the value; it just shows the metrics dataset as an output in the visualisation.
  • n

    noklam

    09/01/2022, 9:52 PM
    It's likely because you are using a namespaced pipeline; your dataset name needs to be "data_science.candidate_modelling_pipeline.metrics", based on your log.
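The naming issue can be sketched as follows. This is a simplified model, not Kedro's actual lookup code: the helper names and the dictionary-based catalog are hypothetical, and the un-prefixed `metrics` entry is an assumption about what the original catalog looked like. The namespace prefix and the `MemoryDataSet` fallback are taken from the log above.

```python
def namespaced(namespace: str, dataset_name: str) -> str:
    """Build the dataset name a namespaced pipeline uses at run time."""
    return f"{namespace}.{dataset_name}"


def resolve_dataset_type(name: str, catalog: dict) -> str:
    """Look up a dataset by exact name; unregistered names fall back to memory."""
    return catalog.get(name, "MemoryDataSet")


# Assumed original catalog entry, registered without the namespace prefix.
catalog_before = {"metrics": "tracking.JSONDataSet"}
# Entry renamed to match the name reported in the log.
catalog_after = {
    "data_science.candidate_modelling_pipeline.metrics": "tracking.JSONDataSet"
}

runtime_name = namespaced("data_science.candidate_modelling_pipeline", "metrics")

print(resolve_dataset_type(runtime_name, catalog_before))
print(resolve_dataset_type(runtime_name, catalog_after))
```

With the un-prefixed entry, the run-time name finds no match, which is consistent with the `(MemoryDataSet)` shown in the log and with nothing being written to disk.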
  • r

    rohan_ahire

    09/01/2022, 9:52 PM
    Yes, I am using a namespaced pipeline. Let me change that and see what happens.
  • r

    rohan_ahire

    09/01/2022, 10:00 PM
    That solved the problem: I am getting the metrics, and Kedro-Viz can see the metric values. How do I get the experiments to show in Kedro-Viz? It still says "You don't have any experiments".
  • n

    noklam

    09/01/2022, 10:01 PM
    Did you update your settings.py to use the SQLiteStore?
  • r

    rohan_ahire

    09/01/2022, 10:05 PM
    Do you mean instructions on this page? https://kedro.readthedocs.io/en/latest/tutorial/set_up_experiment_tracking.html I just found it, have to read it. I was reading the stable docs and hence did not find this page.
  • n

    noklam

    09/01/2022, 10:06 PM
    https://kedro.readthedocs.io/en/stable/tutorial/set_up_experiment_tracking.html#set-up-the-session-store This section
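For reference, the session-store setup in that section looks roughly like the fragment below, as best I recall from the Kedro documentation of that period. Treat the import path and the `data` directory as assumptions and check them against the linked page for your installed Kedro and Kedro-Viz versions.

```python
# src/<package_name>/settings.py (fragment; requires kedro-viz to be installed)
from pathlib import Path

from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

# Store session data in SQLite so Kedro-Viz can list runs as experiments.
SESSION_STORE_CLASS = SQLiteStore
# Assumed location: the project's data/ directory, two levels above this file's folder.
SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}
```

Only runs executed after this change would be recorded, so a fresh `kedro run` is needed before experiments appear.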
  • n

    noklam

    09/01/2022, 10:06 PM
    It's in stable, I think.
  • r

    rohan_ahire

    09/01/2022, 10:07 PM
    Got it. Will read and implement. I was only able to access this link and did not know where to go from there - https://kedro.readthedocs.io/en/stable/logging/experiment_tracking.html
  • n

    noklam

    09/01/2022, 10:12 PM
    I agree the docs don't make it very clear.
  • n

    noklam

    09/01/2022, 10:12 PM
    Let us know if you have any problems with experiment tracking.
  • n

    noklam

    09/01/2022, 10:14 PM
    I will raise this to the team, I think we can organise the docs slightly better here.
  • r

    rohan_ahire

    09/01/2022, 10:21 PM
    That would really help, as I spent too much time on it and was actually going to stop using Kedro. Luckily I found this Discord channel and everything started working so fast.
  • r

    rohan_ahire

    09/01/2022, 10:21 PM
    Experiment tracking is working for me now 🙂
  • n

    noklam

    09/01/2022, 10:24 PM
    Thanks for sticking with us! Experiment tracking is a relatively new feature; as a result, the docs have been updated frequently and we may sometimes forget to link things properly.
  • n

    noklam

    09/01/2022, 10:25 PM
    Don't hesitate to ask here if you have been stuck for a while.