778216384475693066 #resources

I finished writing something today; my drafts are starting to pile up again. It covers a use case that came up: "how can we compare our datasets over time?" The answer was to simply turn on versioned datasets now, a one-line change. Then, once we have enough data to compare, add a partitioned or incremental dataset on top to compare the dataset over time. https://waylonwalker.com/kedro-incremental-versioned-datasets/
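
To make the idea concrete, here is a minimal sketch in plain Python rather than Kedro's own dataset classes. It assumes the folder layout Kedro uses once a catalog entry has `versioned: true`, namely `<filepath>/<version>/<filename>`, and compares row counts across the saved versions. The function name `row_counts_by_version`, the `cars.csv` path, and the data are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict


def row_counts_by_version(versioned_root: Path) -> Dict[str, int]:
    """Return {version folder name: number of data rows} for a versioned CSV.

    Assumes the layout <filepath>/<version>/<filename>, where <filename>
    repeats the last component of <filepath> (cars.csv/<version>/cars.csv).
    """
    counts: Dict[str, int] = {}
    for version_dir in sorted(p for p in versioned_root.iterdir() if p.is_dir()):
        csv_path = version_dir / versioned_root.name
        if not csv_path.is_file():
            continue  # skip folders that do not hold a saved version
        with csv_path.open(newline="") as handle:
            rows = list(csv.reader(handle))
        counts[version_dir.name] = max(len(rows) - 1, 0)  # minus the header row
    return counts


if __name__ == "__main__":
    # Build two fake versions in a temporary folder (illustrative data only).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "cars.csv"
        fake_versions = [("2021-07-01T09.00.00.000Z", 2), ("2021-07-08T09.00.00.000Z", 3)]
        for version, n_rows in fake_versions:
            (root / version).mkdir(parents=True)
            with (root / version / "cars.csv").open("w", newline="") as handle:
                writer = csv.writer(handle)
                writer.writerow(["make", "mpg"])
                writer.writerows([["car", i] for i in range(n_rows)])
        for version, count in row_counts_by_version(root).items():
            print(version, count)
```

Because the version folders are timestamp-named, sorting them by name also sorts them chronologically, which is what makes a simple over-time comparison possible.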

waylonwalker

07/08/2021, 3:38 PM

@User you must share with us how you made those kedro-viz glitch embeds.

datajoely

07/08/2021, 3:42 PM

So it was a bit tricky (so now we have a ticket on our backlog to make embedding easier 😆 )

1. There are some example testing projects in the viz repo, and I adapted them.

2. I've included the JavaScript below. I needed to hide the sidebar and minimap as they were annoying when embedding in the Medium window.

3. Medium doesn't let you use any old `iframe`, so I used Glitch, which lets you host any SPA for free: `npm build` and deploy via GitHub!

javascript
import React from "react"; // needed for JSX unless the build uses the automatic JSX runtime
import KedroViz from "@quantumblack/kedro-viz";
import * as sourceDomainModel from "./data/source_domain_model.json";
import * as representativePipeline from "./data/complete_demo_pipeline.json";

export const dataSources = {
  sourceDomainModel: () => sourceDomainModel.default,
  representativePipeline: () => representativePipeline.default,
};

const App = ({ initialData }) => {
  const visibleSetting = { sidebar: false, miniMap: false };
  return (
    <div style={{ height: "100vh" }}>
      <KedroViz
        data={dataSources.representativePipeline()}
        visible={visibleSetting}
      />
    </div>
  );
};

App.defaultProps = {
  initialData: "layers",
};

export default App;

waylonwalker

07/08/2021, 3:48 PM

Great article @User. I need to take some time to think about it for sure. I have been using some of the layers a bit differently, and I would be curious to hear your thoughts on it. The largest difference I see is between intermediate and primary. At the intermediate layer I only really do automated (off-the-shelf) functions, plus anything that is needed just to get it to parquet. Sometimes datetimes don't want to store properly. I generally think of this intermediate layer as applying assumptions that my project has adopted, such as all strings being pre-stripped and all column names being lowercase and free of special characters. My primary layer looks a bit more like your intermediate layer. It most often starts as an identity function but gives us a place to do any manual coercion needed.
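
As a rough illustration of those intermediate-layer assumptions (stripped strings, lowercase column names without special characters), here is a small standard-library sketch. The helper names `clean_column_name` and `clean_record` are hypothetical, and replacing special characters with underscores is one possible convention, not necessarily the one used in the project being described.

```python
import re


def clean_column_name(name: str) -> str:
    """Lowercase a column name and replace runs of special characters with '_'."""
    lowered = name.strip().lower()
    underscored = re.sub(r"[^0-9a-z]+", "_", lowered)
    return underscored.strip("_")


def clean_record(record: dict) -> dict:
    """Apply the column-name rule to keys and strip whitespace from string values."""
    return {
        clean_column_name(key): value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    raw = {" Order ID (USD) ": "  42 ", "Customer Name": "Ada  ", "Qty": 3}
    print(clean_record(raw))
```

Keeping rules like these in one automated step means every later layer can rely on them without re-checking.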

datajoely

07/08/2021, 3:50 PM

For me the big difference is source-level versus domain-level thinking, i.e. `intermediate` retains the structure the data arrives with. With `primary` it is restructured for the problem at hand.

waylonwalker

07/08/2021, 3:51 PM

Would kedro viz run on preact? Could this make creating static pages with kedro any simpler? Could we have a cli become part of kedro-viz to output a static page or glitch ready site?

datajoely

07/08/2021, 3:53 PM

You've reached the limit of my JS knowledge :p @User any thoughts?

waylonwalker

07/08/2021, 3:54 PM

I think I am in line with that statement, but I think we might still end up with some tasks on different layers if we were to do the same project. Naming things is really hard, I'm sure there are days that if I did the same project 3 times they would all have pieces of it on different layers. Do you feel like you have achieved better consistency?

datajoely

07/08/2021, 3:56 PM

These are guidelines, not non-negotiables. Some people are stricter than others; I've seen primary 1 and primary 2 (not my style, but I understood it). The benefit for us is that by picking one vocab and sticking to it, you can move people around projects way easier than ever.

waylonwalker

07/08/2021, 4:00 PM

Got it. I can also see where some of the fuzziness can be standardized for a particular team that deals with similar borderline things often.

noklam

07/10/2021, 9:13 AM

Once again, nice post! I had not thought about using a partitioned dataset on a versioned dataset directly. I have tried partitioned/incremental datasets but found that they do not support the "versioned" flag. When using a partitioned dataset, I found that being folder-based adds some complexity to reproducing results, since it is easy not to notice that the underlying folder has changed. One time I partitioned the dataset by month and then ran a rolling ML train/test pipeline for backtesting. At one point the results looked really weird, and I found that while I was developing the pipeline some debug sets had been left behind in the folder, and they are hard to clean up among the timestamp-named folders.
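
One possible guard against stray folders like that is to validate partition keys before training. The sketch below assumes the node receives a dict mapping partition ids to load callables, as a Kedro partitioned dataset provides, and that real partitions are keyed by month as `YYYY-MM`. That key convention and the name `split_partitions` are hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical convention: real partitions are keyed by month, e.g. "2021-07".
MONTH_KEY = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def split_partitions(
    partitions: Dict[str, Callable[[], object]]
) -> Tuple[Dict[str, Callable[[], object]], List[str]]:
    """Separate month-keyed partitions from anything else found in the folder.

    Returns the expected partitions (sorted by key) and the unexpected keys.
    """
    expected = {
        key: partitions[key]
        for key in sorted(partitions)
        if MONTH_KEY.fullmatch(key)
    }
    unexpected = sorted(key for key in partitions if not MONTH_KEY.fullmatch(key))
    return expected, unexpected


if __name__ == "__main__":
    found = {
        "2021-02": lambda: "feb data",
        "2021-01": lambda: "jan data",
        "debug_sample": lambda: "left over from development",
    }
    expected, unexpected = split_partitions(found)
    print(list(expected), unexpected)
```

Failing the run, or at least logging loudly, when the unexpected list is not empty would surface a leftover debug set before it skews a backtest.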

datajoely

07/12/2021, 8:25 AM

I think this is a good time to use run environments and then use TemplatedConfigLoader to write to different locations between `debug` and `production` runs: https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#additional-configuration-environments https://kedro.readthedocs.io/en/stable/kedro.config.TemplatedConfigLoader.html
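
The idea can be sketched without Kedro at all. TemplatedConfigLoader fills `${...}` placeholders in the configuration from a set of global values, so each environment can supply its own base path. The snippet below imitates that substitution with the standard library's `string.Template`. It is not Kedro's implementation, and the catalog entry, the `base_path` key, and the folder names are hypothetical.

```python
from string import Template

# Hypothetical catalog entry with a placeholder for the environment-specific root.
CATALOG_TEMPLATE = {
    "model_input": {
        "type": "pandas.ParquetDataSet",
        "filepath": "${base_path}/05_model_input/model_input.parquet",
    }
}

# Hypothetical per-environment values, standing in for each environment's globals.
ENV_GLOBALS = {
    "debug": {"base_path": "data/debug"},
    "production": {"base_path": "data/production"},
}


def render(config, values):
    """Recursively substitute ${name} placeholders in every string of a config."""
    if isinstance(config, dict):
        return {key: render(value, values) for key, value in config.items()}
    if isinstance(config, list):
        return [render(item, values) for item in config]
    if isinstance(config, str):
        return Template(config).substitute(values)
    return config


if __name__ == "__main__":
    for env, values in ENV_GLOBALS.items():
        print(env, render(CATALOG_TEMPLATE, values)["model_input"]["filepath"])
```

With debug and production runs writing under different roots, leftover development outputs never land in the folder that a production backtest reads.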

noklam

07/12/2021, 9:25 AM

nice tips!

datajoely

07/16/2021, 4:43 PM

There was a talk about Kedro + Airflow + Great Expectations at today's Airflow Summit https://www.crowdcast.io/e/airflowsummit2021/43

Arnaldo

07/16/2021, 5:55 PM

@User

Arnaldo

07/16/2021, 11:52 PM

https://github.com/Mar1cX/kedro-toolkit

datajoely

07/17/2021, 6:32 AM

This is awesome!

waylonwalker

07/17/2021, 1:46 PM

Use find-kedro and you don't even need the create_pipeline snippet :). It works exactly like pytest does for finding tests, but finds nodes/pipelines for Kedro.

waylonwalker

07/17/2021, 1:49 PM

Cool package though. I believe there are ways to make snippet plugins cross-editor compatible. That would make it super cool. Maybe it belongs in kedro-lsp, as that is naturally cross-platform.

waylonwalker

07/17/2021, 1:52 PM

@User Is it considered complete or are there plans to add things like datasets?

Arnaldo

07/18/2021, 12:44 AM

Hi, @User. First of all, I really liked your `find-kedro` package. That's awesome too. I will use it in my next Kedro projects for sure. Regarding the package, IDK actually. I only found out about it yesterday and I don't know the author either. I just liked the toolkit and thought it could be interesting to share with the community here.

user

07/21/2021, 9:55 AM

What is Kedro

Kedro is an open source data pipeline framework. It provides guardrails to set

your project up right from the start wit

https://waylonwalker.com/what-is-kedro

datajoely

07/21/2021, 9:55 AM

@User this is now set up

datajoely

07/21/2021, 9:56 AM

If anyone would like their Kedro specific RSS feeds to appear here - please shout 🙂