Dynamic Entity Definition #643

pmbrull · 2021-10-02T13:12:01Z

pmbrull
Oct 2, 2021
Maintainer

Hi team 🤗

First of all, thank you and congratulations on your work. This looks like an amazing project, and I can see it becoming the de facto metadata management tool. The level of transparency on how to ingest data, together with the automation possibilities of the REST API, ticks a lot of boxes.

One idea/requirement I'd love to share is more flexibility for dynamically creating and updating Entities. Although many are already available (and in my opinion, treating a Pipeline as a proper Data Asset is something similar tools are missing...), I feel that it is hard to get definitions right the first time.

I would assume that not everyone needs to model the same assets, and not all projects will have the same requirements inside each asset. One example would be extending the Pipeline Entity definition to include a source and a sink, or creating completely new Entities such as Model to track Machine Learning related aspects.

I believe the key point here is that needs evolve and requirements differ between teams. Therefore, a great addition would be a clear approach on how to perform these kinds of customisations.

Based on this Slack discussion, it seems that we can go directly to the source code and update the JSON Schema. This is a great starting point, and my only question is how this approach would handle the addition of new Entities, such as the Model.

What might feel a bit cleaner, and allow for better interaction with an already ongoing Metadata project, would be the ability to create and update Entities directly through the REST API. But again, if we could customise the Entities even just in the source code, that would already be great!

More than happy to discuss 🌻

Thanks again,
Pere

pmbrull · 2021-10-03T15:46:05Z

pmbrull
Oct 3, 2021
Maintainer Author

After sleeping on it and taking into consideration your feedback in similar discussions, I believe that I need to shift my position a bit.

Based on my comments, the goal would be to allow flexibility at the user level in terms of establishing definitions of Entities and their properties. However, this goes completely against defining a standard generic enough for most use cases, which I believe is the point of OpenMetadata and a missing pillar in the ecosystem.

Therefore, I'd like to rephrase the idea: rather than asking for a specific endpoint or feature that allows for individual customisation, I'd ask for proper documentation on the implementation of OpenMetadata, so that we can test changes and contribute more actively to the Metadata standard being sought.

Many thanks,
Pere

0 replies

pmbrull · 2021-10-04T17:23:43Z

pmbrull
Oct 4, 2021
Maintainer Author

I believe the first version of a Model entity could be as follows:

{
  "$id": "https://open-metadata.org/schema/entity/data/model.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Model",
  "description": "This schema defines the Model entity. Models are algorithms trained on data to find patterns or make predictions.",
  "type": "object",

  "properties": {
    "id": {
      "description": "Unique identifier of a model instance.",
      "$ref": "../../type/basic.json#/definitions/uuid"
    },
    "name": {
      "description": "Name that identifies this model.",
      "type": "string",
      "minLength": 1,
      "maxLength": 64
    },
    "fullyQualifiedName": {
      "description": "A unique name that identifies a model.",
      "type": "string",
      "minLength": 1,
      "maxLength": 64
    },
    "displayName": {
      "description": "Display Name that identifies this Model.",
      "type": "string"
    },
    "description": {
      "description": "Description of the model, what it is, and how to use it.",
      "type": "string"
    },
    "algorithm": {
      "description": "Algorithm used to train the model",
      "type": "string"
    },
    "dashboard": {
      "description": "Performance Dashboard URL to track metric evolution",
      "$ref": "../../type/entityReference.json"
    },
    "href": {
      "description": "Link to the resource corresponding to this entity.",
      "$ref": "../../type/basic.json#/definitions/href"
    },
    "owner": {
      "description": "Owner of this model.",
      "$ref": "../../type/entityReference.json"
    },
    "followers": {
      "description": "Followers of this model.",
      "$ref": "../../type/entityReference.json#/definitions/entityReferenceList"
    },
    "tags": {
      "description": "Tags for this model.",
      "type": "array",
      "items": {
        "$ref": "../../type/tagLabel.json"
      },
      "default": null
    }
  },
  "required": ["id", "name"]
}
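
To get a feel for what this draft would accept, here is a minimal sketch in Python that checks a candidate Model document against only the `required` list and the `name` length bounds from the schema above. The helper `check_model` and the sample values are hypothetical and not part of OpenMetadata; a full JSON Schema validator would also resolve the `$ref` entries, which this sketch ignores.

```python
import uuid
from typing import List

# Constraints copied from the draft schema above; all other keywords are ignored.
REQUIRED = ["id", "name"]
NAME_MIN, NAME_MAX = 1, 64


def check_model(doc: dict) -> List[str]:
    """Return a list of problems found in a candidate Model document."""
    problems = [f"missing required property: {key}" for key in REQUIRED if key not in doc]
    name = doc.get("name")
    if name is not None:
        if not isinstance(name, str):
            problems.append("name must be a string")
        elif not NAME_MIN <= len(name) <= NAME_MAX:
            problems.append(f"name length must be between {NAME_MIN} and {NAME_MAX}")
    return problems


# Hypothetical sample instance; the values are made up for illustration.
sample = {
    "id": str(uuid.uuid4()),
    "name": "churn_predictor",
    "algorithm": "gradient boosting",
}

print(check_model(sample))        # no problems expected for the sample
print(check_model({"name": ""}))  # reports the missing id and the empty name
```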

As a second step, I believe it would be interesting to gather the requirements to define a Feature Entity. Models will then be based on an Array of Features, the same way a table is a collection of columns.

In each Feature, we could define its origin (e.g., table_A.column_X) and if it has undergone any specific transformation, such as normalisation.
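
As a rough illustration of that idea, the relationship could be sketched with Python dataclasses. Every field name here (`origin`, `transformation`, `features`) and the sample values are assumptions made for the example, not an agreed definition of the Feature Entity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """Hypothetical shape of a Feature; field names are illustrative only."""
    name: str
    origin: str                           # source column, e.g. "table_A.column_X"
    transformation: Optional[str] = None  # e.g. "normalisation"; None if used as-is


@dataclass
class Model:
    """A Model holding an array of Features, much as a table holds columns."""
    name: str
    algorithm: str
    features: List[Feature] = field(default_factory=list)

    def origins(self) -> List[str]:
        """List the source columns this model depends on."""
        return [feature.origin for feature in self.features]


# Made-up sample data for illustration.
model = Model(
    name="churn_predictor",
    algorithm="gradient boosting",
    features=[
        Feature("age_scaled", origin="table_A.column_X", transformation="normalisation"),
        Feature("plan", origin="table_A.column_Y"),
    ],
)
print(model.origins())
```

Keeping the origin on each Feature would let lineage from a Model back to its source columns be derived rather than stored separately.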

@harshach, what would be the steps to move this forward? More than happy to discuss any changes and looking forward to your input.

Thanks

0 replies

harshach · 2021-10-05T02:02:43Z

harshach
Oct 5, 2021
Maintainer

@pmbrull This looks great. Can you open a PR against main? We can provide comments if necessary.
cc @sureshms

4 replies

sureshms Oct 5, 2021
Maintainer

@pmbrull this looks like a good starting point. We can tweak it as it is developed. +1.

pmbrull Oct 5, 2021
Maintainer Author

Great, thank you both. Working on a PR atm; I might open it as a draft and ask for some input, as I have no real Dropwizard knowledge 😅 It is always fun to jump in headfirst.

sureshms Oct 5, 2021
Maintainer

@pmbrull this should be easy.

Look at TopicResource, TopicRepository, and TopicResourceTest as examples. In fact, you can start by copying them, renaming them to ModelResource, ModelRepository, and ModelResourceTest, and making the appropriate changes in those files.
Add a table model_entity in bootstrap/sql/mysql/v001__create_db_connection_info.sql for storing model entities.
Tests will automatically start the Dropwizard server and an embedded MySQL, so you will be able to test it end-to-end without having to start a real server.
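
The copy-and-rename step above can be sketched as a small Python helper. This is only a naive textual substitution under the assumption that the entity name appears as `Topic`, `topic`, or `TOPIC` in the copied files; the function `rename_entity` and the sample snippet are hypothetical, and the result would still need the manual changes mentioned above.

```python
import re


def rename_entity(source: str, old: str = "Topic", new: str = "Model") -> str:
    """Replace every occurrence of the entity name, preserving its letter case."""

    def swap(match: "re.Match") -> str:
        word = match.group(0)
        if word.isupper():
            return new.upper()                # TOPIC -> MODEL
        if word[0].islower():
            return new[0].lower() + new[1:]   # topic -> model
        return new                            # Topic -> Model

    return re.sub(re.escape(old), swap, source, flags=re.IGNORECASE)


# Hypothetical snippet standing in for the contents of a copied file.
snippet = "public class TopicResource { private final TopicRepository dao; }"
print(rename_entity(snippet))
print(rename_entity("topic_entity"))
```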

Ask questions; we are here to help you. We are motivated to get this as a community contribution from someone outside :). If you need a Zoom call, we can do that too to make it happen :).

pmbrull Oct 5, 2021
Maintainer Author

Thanks, @sureshms! Yeah, I was following exactly the same approach but with Dashboard instead of Topic.

So far I've got the API running with the Model Entity locally, which is nice enough! I am copying some tests now; I will open a PR once they all work and will pop some questions there :)

Many thanks!
