Provide DocEvent Webhook #1002

Open
hackerwins opened this issue Sep 9, 2024 · 7 comments · May be fixed by #1113
Assignees
Labels
enhancement 🌟 New feature or request

Comments

@hackerwins
Member

What would you like to be added:

We are currently implementing LLM-based document search in CodePair. As part of this, we need to maintain a vector representation of each document's content in a Vector Store. It's crucial that every update to a document is reflected in the Vector Store, so the stored content stays in sync as the document is edited.

To achieve this, we require a mechanism that notifies external services like CodePair when documents are modified in Yorkie. We propose the introduction of a Webhook system that triggers when a document event occurs.

Specifically, we suggest that when handling the PushPullChanges requests, the server should check if a Webhook for the DocEvent is registered for the project. If it is, the server would call that Webhook during the background routine of the PushPullChanges API execution, right before publishing the DocEvent.

I think it will have a structure similar to the Auth Webhook. If changes occur frequently, an event-control mechanism such as debouncing will also be needed.

Why is this needed:

This enhancement would enable seamless integration with external services, allowing for real-time updates to Search Engine or Vector Store based on document changes in Yorkie, thereby enhancing the overall document management and search capabilities of our application.

@window9u
Contributor

Hello! Could I try this issue?

@window9u
Contributor

window9u commented Nov 23, 2024

What Events Should We Send?

To keep external services informed about the state of documents, we should send events corresponding to the CRUD (Create, Read, Update, Delete) operations.

Common Webhook Specifications
  • Webhook Request Type: HTTP POST
  • Content Type: application/json
  • Expected Response:
    • Status Code: 200 OK
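
A minimal Go sketch of the delivery contract above (HTTP POST, JSON body, 200 OK expected). The names postJSON, newStubEndpoint, and requestTimeout are hypothetical, and the 3-second timeout is an assumed value that this thread has not decided on. The stub server only stands in for an external service such as CodePair.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// requestTimeout is an assumed default; the thread has not settled on a value.
const requestTimeout = 3 * time.Second

// postJSON sends payload as an HTTP POST with a JSON body and treats any
// response other than 200 OK as a delivery failure.
func postJSON(url string, payload any) error {
	body, err := json.Marshal(payload)
	if err != nil {
		return fmt.Errorf("marshal payload: %w", err)
	}
	client := &http.Client{Timeout: requestTimeout}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("send webhook: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

// newStubEndpoint starts a local server that always answers with status.
// It stands in for the external service in this sketch.
func newStubEndpoint(status int) (url string, stop func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	return srv.URL, srv.Close
}

func main() {
	url, stop := newStubEndpoint(http.StatusOK)
	defer stop()

	payload := map[string]any{
		"type": "documentCreated",
		"attributes": map[string]string{
			"documentKey": "document-key",
			"clientKey":   "client-key",
			"issuedAt":    "2024-10-06T05:43:52.318Z",
		},
	}
	if err := postJSON(url, payload); err != nil {
		fmt.Println("delivery failed:", err)
		return
	}
	fmt.Println("delivered")
}
```

Treating every non-200 response as a failure keeps the contract simple; whether other 2xx codes should count as success is still open.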

Event Types and Payloads

a. Document Created

  • Event Type: documentCreated

  • Payload Schema:

    {
      "type": "documentCreated",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      }
    }
  • Example:

    {
      "type": "documentCreated",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T05:43:52.318Z"
      }
    }

b. Document Watched

  • Event Type: documentWatched

  • Payload Schema:

    {
      "type": "documentWatched",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      },
      "data": {
        "watchingNum": "integer"
      }
    }
  • Example:

    {
      "type": "documentWatched",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T05:45:00.000Z"
      },
      "data": {
        "watchingNum": 3
      }
    }

c. Document Unwatched

  • Event Type: documentUnwatched

  • Payload Schema:

    {
      "type": "documentUnwatched",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      },
      "data": {
        "watchingNum": "integer"
      }
    }
  • Example:

    {
      "type": "documentUnwatched",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T05:50:00.000Z"
      },
      "data": {
        "watchingNum": 2
      }
    }

d. Document Changed

1. Change Event
  • Event Type: documentChanged

  • Payload Schema:

    {
      "type": "documentChanged",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      }
    }
  • Example:

    {
      "type": "documentChanged",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T06:00:00.000Z"
      }
    }
2. Snapshot Event
  • Event Type: snapshotStored

  • Payload Schema:

    {
      "type": "snapshotStored",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      },
      "data": {
        "snapshot": "string" // marshaled snapshot data
      }
    }
  • Example:

    {
      "type": "snapshotStored",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T06:05:00.000Z"
      },
      "data": {
        "snapshot": ""
      }
    }

e. Document Deleted

  • Event Type: documentDeleted

  • Payload Schema:

    {
      "type": "documentDeleted",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      }
    }
  • Example:

    {
      "type": "documentDeleted",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T06:10:00.000Z"
      }
    }
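
The payloads above could be modeled in Go roughly as follows. This is only a sketch: the type names (EventType, Attributes, WatchData, Event) and the helpers encode and formatIssuedAt are hypothetical, and the optional "data" field is left out of the JSON when an event carries none.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// EventType enumerates the event names proposed in this comment.
type EventType string

const (
	DocumentCreated   EventType = "documentCreated"
	DocumentWatched   EventType = "documentWatched"
	DocumentUnwatched EventType = "documentUnwatched"
	DocumentChanged   EventType = "documentChanged"
	SnapshotStored    EventType = "snapshotStored"
	DocumentDeleted   EventType = "documentDeleted"
)

// Attributes are shared by every event.
type Attributes struct {
	DocumentKey string `json:"documentKey"`
	ClientKey   string `json:"clientKey"`
	IssuedAt    string `json:"issuedAt"` // ISO 8601 timestamp
}

// WatchData is the "data" object of watched/unwatched events.
type WatchData struct {
	WatchingNum int `json:"watchingNum"`
}

// Event is the common envelope; Data is omitted from the JSON when nil.
type Event struct {
	Type       EventType  `json:"type"`
	Attributes Attributes `json:"attributes"`
	Data       any        `json:"data,omitempty"`
}

// formatIssuedAt renders t in UTC with millisecond precision,
// matching the timestamps used in the examples.
func formatIssuedAt(t time.Time) string {
	return t.UTC().Format("2006-01-02T15:04:05.000Z07:00")
}

// encode marshals an event to its compact JSON form.
func encode(ev Event) (string, error) {
	b, err := json.Marshal(ev)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	ev := Event{
		Type: DocumentWatched,
		Attributes: Attributes{
			DocumentKey: "document-key",
			ClientKey:   "client-key",
			IssuedAt:    formatIssuedAt(time.Date(2024, 10, 6, 5, 45, 0, 0, time.UTC)),
		},
		Data: WatchData{WatchingNum: 3},
	}
	out, err := encode(ev)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```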

@window9u
Contributor

Explanation of Events

Common Parts

  • Event Types:
  • Attributes:
    • To maintain consistency, all events share the same set of attributes:
      • documentKey: Unique identifier for the document.
      • clientKey: Unique identifier for the client. If a user of Yorkie (e.g., CodePair) sets the clientKey to its own user_id, it can identify who triggered the event.
      • issuedAt: Timestamp when the event was issued (ISO 8601 format).

Document Watched / Unwatched

  • Purpose: Indicates that a client has started or stopped watching (reading) a document.
  • Watching Number (watchingNum):
    • Yorkie maintains the status and controls the watch/unwatch events of documents, and it sends the watchingNum to indicate the number of clients currently watching the document.
    • Potential Uses:
      • Real-time ranking of documents to identify which ones are popular or actively being edited.
      • Analytics on user engagement and collaboration intensity.
    • Considerations:
      • The usefulness of watchingNum is currently uncertain, and its implementation might not be immediately necessary.
      • It has a lower priority and could be omitted or deferred if it complicates the implementation.
      • If there are difficulties in implementing this feature, we can consider adding it later.

Document Changed

Change Event
  • Purpose: Notifies that changes have occurred in a document. Due to the high frequency of changes, we need to implement rate limiting mechanisms.
  • Rate Limiting Strategies:
    • Debouncing:
      • Collect events over a short period (e.g., 5 seconds).
      • Send a single aggregated event after the period ends.
      • Reduces the number of webhook calls and prevents overwhelming external services.
    • Throttling:
      • Limit the maximum number of webhook calls within a specific time frame.
      • Ensures that webhook calls are spaced out over time.
  • Implementation Choice:
    • I have chosen throttling for this event type.
      • During a defined time period, we acknowledge that changes occur but do not consider how many changes are made within that period.
      • This simplifies the implementation and reduces the load on external services.
  • Reasons:
    • The number of changes (changeNum) does not accurately represent the amount of data changed.
      • A single change could involve a large amount of data (e.g., copy and paste operations).
      • Minor edits like adding a word or a space also count as a change.
    • Therefore, the emphasis is on the occurrence of changes rather than the quantity.
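
The throttling choice above could be sketched as a per-document gate that lets through at most one documentChanged webhook per interval. Throttler, NewThrottler, and fakeClock are hypothetical names, and the 5-second interval is only an example value. The clock is injected so the behavior can be checked without sleeping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Throttler allows at most one notification per document per interval.
type Throttler struct {
	mu       sync.Mutex
	interval time.Duration
	now      func() time.Time
	lastSent map[string]time.Time
}

// NewThrottler builds a throttler; now supplies the current time.
func NewThrottler(intervalSeconds int, now func() time.Time) *Throttler {
	return &Throttler{
		interval: time.Duration(intervalSeconds) * time.Second,
		now:      now,
		lastSent: make(map[string]time.Time),
	}
}

// Allow reports whether a webhook may be sent for docKey right now.
// Calls inside the interval are dropped without being counted.
func (t *Throttler) Allow(docKey string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := t.now()
	if last, ok := t.lastSent[docKey]; ok && now.Sub(last) < t.interval {
		return false
	}
	t.lastSent[docKey] = now
	return true
}

// fakeClock is a manually advanced clock used only for this demonstration.
type fakeClock struct{ t time.Time }

func newFakeClock() *fakeClock { return &fakeClock{t: time.Unix(0, 0)} }

func (c *fakeClock) Now() time.Time { return c.t }

func (c *fakeClock) Advance(seconds int) {
	c.t = c.t.Add(time.Duration(seconds) * time.Second)
}

func main() {
	clock := newFakeClock()
	th := NewThrottler(5, clock.Now)
	fmt.Println(th.Allow("doc-1")) // true: first change in the window
	fmt.Println(th.Allow("doc-1")) // false: still inside the window
	clock.Advance(5)
	fmt.Println(th.Allow("doc-1")) // true: the window has elapsed
}
```

Two caveats with this sketch: the map grows without bound, so a real version would likely need an expiring cache, and a purely leading-edge gate never reports the last change in a burst unless a trailing notification is added.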
Snapshot Stored
  • Purpose: In certain situations, sending a snapshot of the entire document can be useful.
  • Use Case Examples:
    • Direct Snapshot Transmission:
      • If we need to periodically receive the entire data, it might be more efficient to send the snapshot directly rather than receiving a change event and then pulling the document to get the snapshot. This is especially beneficial when automatically creating up-to-date thumbnail documents.
    • CodePair Integration:
      • In CodePair, when changes occur, it retrieves a snapshot from Yorkie to store in a vector database (e.g., for search indexing or machine learning models).
      • Instead of updating the vector database incrementally with each change, it might be more efficient to send the complete data when a snapshot occurs.
    • Thumbnail Generation:
      • We could use snapshots as thumbnails for documents. Saving one snapshot per document for thumbnail purposes can be efficient.
  • Considerations:
    1. Size of Snapshots:
      • If the snapshot is large, sending it can be burdensome on network resources and processing time.
    2. Frequency of Snapshots:
      • We might consider sending snapshots after a certain number of changes instead of after every single change.
      • Similar to debouncing, we can aggregate changes and send snapshots periodically.
  • Possible Approach:
    • For example, we could implement a mechanism to send a snapshot every third time it is generated.
    • This balances the need for up-to-date data with the overhead of transmitting large snapshots.

@window9u
Contributor

If the above data types are finalized, we should consider the following:

  1. Where to Send the Data
    • Adding an Endpoint Attribute to the Project: We need to include an endpoint property in the project configuration to specify where the data should be sent.
    • Defining Endpoint Properties: We should define various properties of the endpoint, such as the debouncing period, snapshot period, or how frequently to send data (e.g., after a certain number of changes).
    • Batching Events: It might be possible to send events in batches (if CodePair processes them in batches). Therefore, we need to discuss batching strategies.
    • Security Considerations: Determine how to handle security, such as how users can verify that the Yorkie server is the one sending the data.
  2. How to Handle Exceptions
    • Timeout Settings for Requests: Decide on a timeout setting for individual requests.
    • Handling Unresponsive Endpoints: Determine what to do if the endpoint continuously fails to receive requests.
    • Storing Unsent Events: Should we store events that the endpoint failed to receive? Or should we allow users to choose whether or not to store them? If we decide to store them, where should we store them?
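
For the security question above, one common approach is to sign each request body with HMAC-SHA256 using a secret shared between the project and the receiving service. This is an assumption on my part, not something decided in this thread. Sign and Verify are hypothetical helper names, and the "sha256=" prefix is only illustrative.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sign returns a hex-encoded HMAC-SHA256 signature of body.
// The sender would attach it to the request, for example in a header.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the signature on the receiving side and compares it
// in constant time to avoid leaking information through timing.
func Verify(secret, body []byte, signature string) bool {
	expected := Sign(secret, body)
	return hmac.Equal([]byte(expected), []byte(signature))
}

func main() {
	secret := []byte("project-secret")
	body := []byte(`{"type":"documentChanged"}`)

	sig := Sign(secret, body)
	fmt.Println(Verify(secret, body, sig))               // true
	fmt.Println(Verify(secret, []byte("tampered"), sig)) // false
}
```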

@krapie
Member

krapie commented Nov 27, 2024

@hackerwins @devleejb @sejongk Any thoughts on this proposed schema?

@devleejb
Member

@window9u Sorry for the late review.

Overall looks good. I have a few questions.

  1. If documentChanged is throttled, I believe clientKey cannot be included in the payload. What are your thoughts on this?
  2. Could we include clientKey in the snapshotStored event? It seems this event is not directly related to clientKey.
  3. As you mentioned, how about evaluating the priority of these events? While the events cover many cases, implementing and discussing them all at once might be burdensome. I think the documentChanged event should have a higher priority than the others.

@window9u
Contributor

window9u commented Nov 28, 2024

@devleejb Thank you for reviewing my comments!

  1. If documentChanged is throttled, I believe clientKey cannot be included in the payload. What are your thoughts on this?
  I missed that point, and I believe there are two possible options to address it. Both have their pros and cons:
    • Option 1: Remove clientKey from the request type
      • Pros: Simpler implementation.
      • Key Question: Is there any case where we truly need clientKey in documentChanged, especially since we already have the documentWatched option?
    • Option 2: Include clientKey as an array (like a set)
      • Pros: This ensures we can track all clients who edited the document during the throttled time.
      • Implementation Idea: We could use a map in Go to aggregate clientKeys and convert it to a slice before sending.
      • Benefit: This approach provides more precise information about which clients are actively editing the document, as documentWatched only indicates that a client is viewing the document, not editing it.
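
Option 2 could be sketched as follows: a Go map used as a set collects client keys during the throttle window and is converted to a slice before sending. ClientKeySet and its methods are hypothetical names. Sorting is an extra step I added only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// ClientKeySet collects the distinct client keys seen in one throttle window.
type ClientKeySet map[string]struct{}

// Add records key; adding the same key twice has no further effect.
func (s ClientKeySet) Add(key string) {
	s[key] = struct{}{}
}

// Sorted returns the collected keys as a sorted slice, ready to be
// placed in the webhook payload.
func (s ClientKeySet) Sorted() []string {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	set := ClientKeySet{}
	set.Add("bob")
	set.Add("alice")
	set.Add("bob") // duplicate edit from the same client

	fmt.Println(set.Sorted()) // [alice bob]
}
```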

I lean toward removing clientKey for the following reasons:

  • While Option 2 is better for auditing, we could achieve the same outcome by auditing clientKey through the Change data.
  • If clientKey is needed in the future, we can implement it when the requirement arises.

  2. Could we include clientKey in the snapshotStored event? It seems this event is not directly related to clientKey.

Thank you for catching that mistake—I missed that point as well. As you mentioned, it’s better to remove clientKey from snapshotStored. I’ll update my previous comment accordingly.

Additionally, as discussed in our previous weekly sync, I agree with your perspective on the following:

Should users of Yorkie need to understand the concept of a snapshot?

I believe that customers could potentially use snapshot data as a preview for the document and subscribe to it. However, as per your question in point 3, we should set the priority of the snapshot webhook to a lower level and implement it only when it becomes an actual requirement.


  3. As you mentioned, how about evaluating the priority of these events? While the events cover many cases, implementing and discussing them all at once might be burdensome. I think the documentChanged event should have a higher priority than the others.

I agree with your perspective. The documentChanged event is more complex to implement than the other events because it needs throttling. Although demand for documentChanged is significant, I think it's better to implement the events incrementally, starting with the simpler ones and gradually moving toward the more complex ones.

Here’s my suggested implementation order:

  1. documentCreated, documentDeleted

    • These events simply require sending notifications.
    • All we have to do is publish these events.
    • I believe we can implement the foundational components for webhooks using these events.
    • For example, this would include tasks like:
      • Setting a secret key.
      • Registering endpoints.
      • Configuring the docevent properties API.
      • Handling undelivered webhooks.
      • Defining default timeouts.
    • I will summarize these tasks, organize them into a to-do list, and work on implementing step 1 with a proof of concept (POC) next time.
  2. documentChanged

    • This requires additional complexity, such as caching for throttling.
    • Although it is more complex than the documentWatched events, I rank it second because demand for it is higher.
  3. documentWatched, documentUnwatched

    • Similar to the first group, these events involve basic notification handling.
    • However, at some point, we may need to modify our code to calculate the number of watchers. Although we currently publish metrics for the watch count, it reflects the total number of watchers across the server, not the count for specific documents.
  4. snapshotStored

    • This can be implemented when it becomes an actual requirement, as its priority is lower.

Here’s the webhook code I envision for this implementation. For example, we could define the webhook in the Backend component like this:

type Backend struct {
    Config           *Config  
    serverInfo       *sync.ServerInfo  
    AuthWebhookCache *cache.LRUExpireCache[string, *types.AuthWebhookResponse]  
  
    Metrics      *prometheus.Metrics  
    Webhook      docevent.Webhook // Add webhook component here  
    DB           database.Database  
    Coordinator  sync.Coordinator  
    Background   *background.Background  
    Housekeeping *housekeeping.Housekeeping  
}

Then, publishing a webhook could follow the pattern used for metrics:

s.backend.Metrics.AddWatchDocumentEventPayloadBytes(
    s.backend.Config.Hostname,  
    project,  
    event.Type,  
    event.Body.PayloadLen(),  
)

// Example code for the POC; it will be modified.
s.backend.Webhook.PublishWatchedEvent(
    docInfo.Key,
    clientKey,
    issuedAt,
    watchingNum,
)

PS: My semester will be finished on 12/19. I think I’ll be able to work more intensively on this issue after that date.

@window9u window9u linked a pull request Dec 26, 2024 that will close this issue
3 tasks
Projects
Status: Backlog
Development

Successfully merging a pull request may close this issue.

4 participants