feat: validate local and external references #26

Closed
wants to merge 32 commits into from

@probablyArth probablyArth commented Mar 14, 2024

Validate external references

  • external URL is invalid
  • external URL leads to an invalid JSON Schema file
  • relative external URL but invalid $id

Validate internal references

TODO:

  • validate anchor fragments
  • look for optimizations

@probablyArth
Author

solves #7

@jdesrosiers
Collaborator

Sorry, but that's not what is meant by "external reference". It means external to the schema document, not external to the workspace. I wasn't thinking about fetching schemas over HTTP, and at this point I don't want to go there. There are complications involved in doing that right that we don't want to deal with right now. For this task, you should only be considering schemas defined in the workspace or pre-registered in @hyperjump/json-schema, like the meta-schemas.

my-project/schemas/address.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/address",

  ...
}

my-project/schemas/customer.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/customer",

  "type": "object",
  "properties": {
    ...,
    "address": { "$ref": "/schemas/address" } <-- External reference to a schema in the workspace
  }
}

@probablyArth
Author

Ahh! Got it!
Will make the necessary changes. I was also a little skeptical about fetching schemas over HTTP. I should've asked beforehand, my bad 😅

Collaborator

@jdesrosiers jdesrosiers left a comment

Just a quick review to point out a couple things. I didn't review in too much detail yet, because I know you're going to have to change a lot of it to get back on the right track.

@benodiwal

What exactly is a workspace? Is it just a folder containing many schema.json files? If we open a folder in an IDE and it contains many schema.json files plus another folder that also has schema.json files, how will the workspaces be defined?

@jdesrosiers
Collaborator

What exactly is workspace

"Workspace" is an LSP concept. Open a directory in vscode and look at the file tree for that directory. That's your workspace. Everything that's visible to the IDE/editor is your workspace.

@probablyArth
Author

probablyArth commented Mar 27, 2024

What exactly is workspace

"Workspace" is an LSP concept. Open a directory in vscode and look at the file tree for that directory. That's your workspace. Everything that's visible to the IDE/editor is your workspace.

ok! Just a little confused.

let's consider this example.

my-project/schemas/b.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/c",

  ...
}

my-project/schemas/c.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/b",

  ...
}

my-project/schemas/a.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/a",
  "type": "object",
  "properties": {
    ...,
    "address": { "$ref": "/schemas/b" }
  }
  ...
}

Is the ref in my-project/schemas/a.schema.json pointing to the file at /schemas/b.schema.json or to /schemas/c.schema.json, since c.schema.json has the $id ending in /schemas/b?

Here is another example to stress what I'm confused about: is the ref pointing to the $id or to the location?

my-project/another-folder/b.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/c",

  ...
}

my-project/another-folder/c.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/b",

  ...
}

my-project/schemas/a.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/a",
  "type": "object",
  "properties": {
    ...,
    "address": { "$ref": "/schemas/b" }
  }
  ...
}

@jdesrosiers
Collaborator

jdesrosiers commented Mar 27, 2024

If you haven't read Structuring a Complex Schema on the website, definitely do that. It should help with the basic concepts.

There are two schema identification concepts. One is the retrieval URI and the other is self-identification.

Because these are files that exist on the file system, they have a file: URI that identifies them by their location on the file system. LSP uses this URI to identify files in the workspace. An example would be file:///path/to/my-project/schemas/a.schema.json.

However, JSON Schema also defines a way for schemas to declare their own identifier in addition to the natural identifier they get based on their file system location. This is done using the $id keyword (id in draft-04). When a schema self-identifies, it effectively has two identifiers: the one it declares and its file system location. LSP understands the natural identifier, but not the self-assigned identifier, so you need to implement that somehow.
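
One way to picture the two identifiers is a lookup table that accepts either of them. This is only a minimal sketch: `indexSchemas` and the `{ uri, schema }` document shape are hypothetical names for illustration, not part of the language server, and only the root `$id`/`id` is considered.

```javascript
// Hypothetical helper: index each workspace schema under both of its identifiers.
// `documents` is an assumed shape: [{ uri: "file:///...", schema: { ... } }].
const indexSchemas = (documents) => {
  const index = new Map();
  for (const { uri, schema } of documents) {
    // The retrieval URI (the file: URI that LSP uses) always identifies the schema.
    index.set(uri, uri);

    // `$id` (`id` in draft-04) adds a self-assigned identifier,
    // resolved against the retrieval URI in case it is relative.
    const selfId = schema.$id ?? schema.id;
    if (typeof selfId === "string") {
      index.set(new URL(selfId, uri).href, uri);
    }
  }
  return index;
};

const schemaIndex = indexSchemas([
  {
    uri: "file:///path/to/my-project/another-folder/c.schema.json",
    schema: { "$id": "https://example.com/schemas/b" }
  }
]);
```

Looking up either `https://example.com/schemas/b` or the `file:` URI in `schemaIndex` leads to the same file.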

So, going to your example,

LSP-ID: file:///path/to/my-project/another-folder/b.schema.json
Self-ID: https://example.com/schemas/c

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/c",

  ...
}

LSP-ID: file:///path/to/my-project/another-folder/c.schema.json
Self-ID: https://example.com/schemas/b

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/b",

  ...
}

LSP-ID: file:///path/to/my-project/schemas/a.schema.json
Self-ID: https://example.com/schemas/a

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/a",
  "type": "object",
  "properties": {
    ...,
    "address": { "$ref": "/schemas/b" }
  }
  ...
}

The question is, what does the reference at /properties/address resolve to? $id doesn't just assign an identifier to the schema, it also sets the base URI. So, the URI-reference /schemas/b is resolved against https://example.com/schemas/a to get https://example.com/schemas/b. There is only one schema identified by that exact URI, and that's the one located at /path/to/my-project/another-folder/c.schema.json.
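
The resolution step can be reproduced with the standard `URL` constructor, which takes a reference and a base URI. The `knownIds` map below is made-up data standing in for "every identifier in the workspace"; it is an assumption for illustration only.

```javascript
// `$id` sets the base URI, so the reference is resolved against it.
const baseUri = "https://example.com/schemas/a";
const resolved = new URL("/schemas/b", baseUri).href;

// Stand-in for the identifiers known in the workspace (hypothetical data):
// self-ID -> file system location.
const knownIds = new Map([
  ["https://example.com/schemas/b", "file:///path/to/my-project/another-folder/c.schema.json"],
  ["https://example.com/schemas/c", "file:///path/to/my-project/another-folder/b.schema.json"]
]);

console.log(resolved); // https://example.com/schemas/b
console.log(knownIds.get(resolved)); // file:///path/to/my-project/another-folder/c.schema.json
```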

Now consider if the schema at /path/to/my-project/schemas/a.schema.json didn't self-identify (didn't have a $id). How would that change how the reference resolves? Let's also change the reference to ../another-folder/b.schema.json to make the result interesting. Since there's no $id, the base URI is file:///path/to/my-project/schemas/a.schema.json and it resolves to file:///path/to/my-project/another-folder/b.schema.json. This resolves to the schema that self-identifies as https://example.com/schemas/c.
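
The fallback to the retrieval URI can be sketched the same way. `baseUriFor` is a hypothetical helper name, and the sketch assumes only a root-level `$id`/`id` matters.

```javascript
// Hypothetical helper: pick the base URI used to resolve references in a schema.
// With no `$id` (or draft-04 `id`), the retrieval URI is the base URI.
const baseUriFor = (retrievalUri, schema) => {
  const selfId = schema.$id ?? schema.id;
  return typeof selfId === "string" ? new URL(selfId, retrievalUri).href : retrievalUri;
};

const retrievalUri = "file:///path/to/my-project/schemas/a.schema.json";
const schemaWithoutId = {
  "type": "object",
  "properties": {
    "address": { "$ref": "../another-folder/b.schema.json" }
  }
};

const base = baseUriFor(retrievalUri, schemaWithoutId);
const target = new URL(schemaWithoutId.properties.address.$ref, base).href;

console.log(target); // file:///path/to/my-project/another-folder/b.schema.json
```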

@probablyArth
Author

got it! I almost wrote another huge example, but I read the comment once again and now I completely get it!
Thank you so much, I will finish this ASAP!

@probablyArth probablyArth marked this pull request as ready for review March 28, 2024 21:25
@probablyArth
Author

I think I'm done with all the tasks and my PR is ready for review!!

@jdesrosiers
Collaborator

Please rebase and resolve conflicts. Also, make sure you run tests locally to make sure they pass before pushing. Don't rely on automation.

@probablyArth
Author

probablyArth commented Mar 29, 2024

Please rebase and resolve conflicts. Also, make sure you run tests locally to make sure they pass before pushing. Don't rely on automation.

The failed checks don't seem to be related to the changes I made. Could you please take a look?

@jdesrosiers
Collaborator

It's definitely related to your changes. The tests pass on "main".

@probablyArth
Author

Thanks for pointing that out @jdesrosiers
I made the necessary changes, tests should pass now.

Collaborator

@jdesrosiers jdesrosiers left a comment

I found a couple of bugs in some edge cases you probably hadn't considered.

Reference to an embedded Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/main",

  "allOf": [
    { "$ref": "a" },
    { "$ref": "b" }
  ],

  "$defs": {
    "embedded-a": {
      "$id": "a"
    },
    "embedded-b": {
      "$id": "https://example.com/b"
    }
  }
}

These are valid references that get flagged as invalid.
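
To illustrate why both references are valid, the sketch below collects the identifiers declared inside a single document. It is a rough illustration, not the PR's code: `collectIds` is a hypothetical helper, the keyword lists are deliberately incomplete, and the `file:` URI is an assumed retrieval URI.

```javascript
// Keywords whose values hold subschemas (an assumed, incomplete list).
const mapKeywords = ["$defs", "properties"];
const arrayKeywords = ["allOf", "anyOf", "oneOf"];

// Hypothetical helper: collect the identifier of the root schema and of every embedded schema.
const collectIds = (schema, baseUri, ids = new Set()) => {
  if (typeof schema !== "object" || schema === null || Array.isArray(schema)) {
    return ids;
  }

  // An embedded `$id` is resolved against the enclosing base URI
  // and becomes the base URI for everything nested below it.
  if (typeof schema.$id === "string") {
    baseUri = new URL(schema.$id, baseUri).href;
    ids.add(baseUri);
  }

  for (const keyword of mapKeywords) {
    const subschemas = schema[keyword];
    if (typeof subschemas === "object" && subschemas !== null) {
      for (const subschema of Object.values(subschemas)) {
        collectIds(subschema, baseUri, ids);
      }
    }
  }
  for (const keyword of arrayKeywords) {
    const subschemas = Array.isArray(schema[keyword]) ? schema[keyword] : [];
    for (const subschema of subschemas) {
      collectIds(subschema, baseUri, ids);
    }
  }

  return ids;
};

const mainSchema = {
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/main",
  "allOf": [{ "$ref": "a" }, { "$ref": "b" }],
  "$defs": {
    "embedded-a": { "$id": "a" },
    "embedded-b": { "$id": "https://example.com/b" }
  }
};

const embeddedIds = collectIds(mainSchema, "file:///path/to/main.schema.json");
const refA = new URL("a", "https://example.com/main").href;
const refB = new URL("b", "https://example.com/main").href;
console.log(embeddedIds.has(refA), embeddedIds.has(refB)); // true true
```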

$ref in a non-schema location

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/main",

  "properties": {
    "$ref": "foo"
  },
  "not-a-keyword": { "$ref": "foo" }
}

These get flagged as invalid references, but neither is a reference. They just look like references.
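
A reference finder that only descends through keywords known to contain schemas avoids both false positives. The following is a simplified sketch under stated assumptions: `findReferences` is a hypothetical helper, the keyword lists are incomplete, and JSON Pointer escaping of property names is omitted.

```javascript
// Keywords whose values hold subschemas (an assumed, incomplete list).
const subschemaMaps = ["$defs", "properties"];
const subschemaArrays = ["allOf", "anyOf", "oneOf"];

// Hypothetical helper: return the location and value of each `$ref` that sits in a schema location.
// Note: property names are not JSON Pointer escaped here, to keep the sketch short.
const findReferences = (schema, pointer = "", found = []) => {
  if (typeof schema !== "object" || schema === null || Array.isArray(schema)) {
    return found;
  }

  if (typeof schema.$ref === "string") {
    found.push({ pointer: `${pointer}/$ref`, ref: schema.$ref });
  }

  for (const keyword of subschemaMaps) {
    const subschemas = schema[keyword];
    if (typeof subschemas === "object" && subschemas !== null) {
      // The keys here are names (property names, definition names), never keywords.
      for (const [name, subschema] of Object.entries(subschemas)) {
        findReferences(subschema, `${pointer}/${keyword}/${name}`, found);
      }
    }
  }
  for (const keyword of subschemaArrays) {
    const subschemas = Array.isArray(schema[keyword]) ? schema[keyword] : [];
    subschemas.forEach((subschema, i) => {
      findReferences(subschema, `${pointer}/${keyword}/${i}`, found);
    });
  }

  // Unknown keywords such as "not-a-keyword" are never traversed.
  return found;
};

const lookalikes = findReferences({
  "$id": "https://example.com/main",
  "properties": { "$ref": "foo" },
  "not-a-keyword": { "$ref": "foo" }
});

const realReference = findReferences({
  "properties": { "address": { "$ref": "/schemas/b" } }
});
```

With these inputs, `lookalikes` is empty and `realReference` holds one entry at `/properties/address/$ref`.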


The design isn't great in that finding references is coupled to validating references. There are other features where we'll need to use or know about references. Ideally, we'd be able to reuse the reference identification functionality you've developed for those other features, but we can't because it's too coupled to validating references.

Similarly, I'd like to see the validation of the references decoupled from language server concerns. Eventually, I want to use the functionality developed here to make a CLI that validates all the schemas in a workspace. That means I need to use this validation functionality outside of the context of a language server. So, things like calling buildDiagnostic are a problem. It should just return the validation information. Then we can call it from the language server and run buildDiagnostic on the result, or call it from the CLI and run whatever is needed to present those errors on the console.
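
The separation described here might look roughly like the sketch below: a pure function returns plain data, and each front end formats it. Every name in it (`validateReferences`, `toDiagnosticMessage`, `toConsoleLine`) and every data shape is hypothetical; the real buildDiagnostic signature is not shown in this thread, so it is not used.

```javascript
// Pure validation: plain data in, plain data out, no language-server types.
// `references` is an assumed shape: [{ pointer, ref, baseUri }]; `knownIds` is a Set of URIs.
const validateReferences = (references, knownIds) => {
  return references
    .map(({ pointer, ref, baseUri }) => ({ pointer, ref, resolved: new URL(ref, baseUri).href }))
    // Ignore any fragment so only the schema resource itself is looked up.
    .filter(({ resolved }) => !knownIds.has(resolved.split("#")[0]));
};

// Language-server presenter (hypothetical stand-in for whatever builds LSP diagnostics).
const toDiagnosticMessage = (error) => {
  return { pointer: error.pointer, message: `Unknown reference: ${error.ref}` };
};

// CLI presenter (hypothetical).
const toConsoleLine = (error) => {
  return `${error.pointer}: cannot resolve ${error.ref} (${error.resolved})`;
};

const referenceErrors = validateReferences(
  [
    { pointer: "/properties/address/$ref", ref: "/schemas/b", baseUri: "https://example.com/schemas/a" },
    { pointer: "/properties/phone/$ref", ref: "/schemas/phone", baseUri: "https://example.com/schemas/a" }
  ],
  new Set(["https://example.com/schemas/a", "https://example.com/schemas/b"])
);

console.log(referenceErrors.map(toConsoleLine));
console.log(referenceErrors.map(toDiagnosticMessage));
```

Because `validateReferences` knows nothing about LSP, the same result can feed either presenter.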

@probablyArth
Author

probablyArth commented Apr 2, 2024

I will get on those asap!

Collaborator

@jdesrosiers jdesrosiers left a comment

I haven't gone through this in detail yet, but this is a lot more code than I was expecting. I feel like this should be simpler. I think you're duplicating things that are already handled. I would expect you to make use of the managed instances. Every time the workspace changes, every schema resource in the workspace is revalidated and is available in the schemaResourceCache. The schemas are already decomposed so there are no embedded schemas to worry about. You just need to collect the identifiers for every schema resource to get the list of valid identifiers. Then you just need to iterate over those schemaResources looking for and validating references, which shouldn't be too hard since you don't have to worry about embedded schemas.
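
The two passes suggested here can be sketched as follows. The real schemaResourceCache shape is not shown in this thread, so `fakeResourceCache` and its `baseUri`/`references` fields are assumptions made purely for illustration.

```javascript
// Stand-in for the cache described above (hypothetical shape and data).
const fakeResourceCache = new Map([
  [
    "https://example.com/schemas/a",
    { baseUri: "https://example.com/schemas/a", references: ["/schemas/b", "/schemas/missing"] }
  ],
  [
    "https://example.com/schemas/b",
    { baseUri: "https://example.com/schemas/b", references: [] }
  ]
]);

// Pass 1: every schema resource contributes one valid identifier.
const validIds = new Set([...fakeResourceCache.values()].map((resource) => resource.baseUri));

// Pass 2: resolve each reference against its resource's base URI and keep the unknown ones.
const unresolved = [];
for (const resource of fakeResourceCache.values()) {
  for (const ref of resource.references) {
    const targetUri = new URL(ref, resource.baseUri).href;
    if (!validIds.has(targetUri)) {
      unresolved.push({ from: resource.baseUri, ref });
    }
  }
}

console.log(unresolved); // one entry: the reference to /schemas/missing
```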

@probablyArth
Author

Hey @jdesrosiers, I have refactored my solution further.

I have removed the identifiers store, but I have kept the inactiveDocumentStore, since it caches the documents that are not open in the editor and saves a lot of I/O operations. Instances are removed from the inactiveDocumentStore once they are opened in the editor.

Collaborator

@jdesrosiers jdesrosiers left a comment

I have kept the inactiveDocumentStore, since that caches the documents which are not open in the editor

I think you've misunderstood what I was telling you in my last review. The schemaResourceCache should contain all schemas in the workspace, not just the ones that are open. All of them. You shouldn't need the inactiveDocumentStore. It should all be there already.

@probablyArth
Author

I have made some changes, please take a look @jdesrosiers !

@probablyArth
Author

probablyArth commented May 11, 2024

Seems like there were a lot of changes made for schema resources 😅
I'll update my PR 😅

@jdesrosiers
Collaborator

Yes, sorry. There was a big change introducing a more powerful and performant AST for working with schemas that replaced JsoncInstance. The change is preparing for supporting more advanced features like linting.

@jdesrosiers
Collaborator

Closing in favor of #44

@jdesrosiers jdesrosiers closed this Jun 8, 2024