Nightly Playground High Level Design #130

Closed
Tracked by #129
gaiksaya opened this issue Oct 28, 2023 · 1 comment

gaiksaya commented Oct 28, 2023

Terminology

  • Playground: A publicly available OpenSearch and OpenSearch Dashboards instance running the most recently released version. Accessible here: https://playground.opensearch.org/app/home
  • Nightly artifacts: Artifacts that are built daily (usually but not necessarily overnight) and may experience instability due to active development.

Overview

Background

OpenSearch and OpenSearch Dashboards distributions are built daily and include actively developed features for the upcoming 1.x, 2.x and 3.0 versions. Community users and developers would like to try these new features as soon as they are built.
A nightly version of the playground that includes these actively built features would therefore let community users and developers try them out and provide feedback.

Overview of solution

The solution re-uses opensearch-cluster-cdk to deploy the nightly playground cluster, with the following enhancements:

  • Ability to provide customized security permissions for the cluster.
  • Ability to pass custom configuration for OpenSearch Dashboards.
  • Alarms to monitor cluster state.
  • Ability to choose between a network load balancer and an application load balancer.
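The enhancements listed above can be pictured as a small set of per-playground options that a deployment workflow turns into `cdk deploy -c key=value` context arguments. The following is only a minimal sketch: the class `PlaygroundOptions`, its fields and the context key names are hypothetical illustrations, not the actual opensearch-cluster-cdk interface, so check that repository's README for the real keys.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical option names; they only illustrate the shape of the choices.
VALID_LOAD_BALANCERS = {"nlb", "alb"}


@dataclass
class PlaygroundOptions:
    dist_version: str                      # version of the nightly build to deploy
    load_balancer: str = "nlb"             # "nlb" (network) or "alb" (application)
    enable_monitoring: bool = True         # whether to create the alarms stack
    dashboards_config: Dict[str, Any] = field(default_factory=dict)

    def validate(self) -> None:
        """Reject load balancer types other than the two the design mentions."""
        if self.load_balancer not in VALID_LOAD_BALANCERS:
            raise ValueError(
                f"load_balancer must be one of {sorted(VALID_LOAD_BALANCERS)}"
            )

    def to_context_args(self) -> List[str]:
        """Render the options as `-c key=value` pairs for the CDK CLI."""
        self.validate()
        pairs = {
            "distVersion": self.dist_version,
            "loadBalancerType": self.load_balancer,
            "enableMonitoring": str(self.enable_monitoring).lower(),
            # Custom Dashboards settings are passed as one JSON string.
            "additionalOsdConfig": json.dumps(self.dashboards_config, sort_keys=True),
        }
        args: List[str] = []
        for key, value in pairs.items():
            args += ["-c", f"{key}={value}"]
        return args
```

A workflow step could then call `PlaygroundOptions("2.12.0", load_balancer="alb").to_context_args()` (the version string is an arbitrary example) and append the result to its `cdk deploy` command, which keeps validation in one place when several playgrounds are deployed with different settings.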

New components that would be added:

  • A few more stacks for handling sample data and monitoring/notification.
  • OIDC setup between AWS and GitHub to deploy these clusters using GitHub Actions.
  • One-time manual setup of Route 53 with the endpoint.

Stakeholders

  • Developers contributing actively to any component
  • Community Users interested in upcoming features / bug fixes

Use cases

  • User: Component developer
    When: Developing an upcoming feature/bug-fix
    Then: I should be able to test my contribution using the nightly playground
  • User: Community member interested in a specific upcoming feature/bug-fix/functionality
    When: I am interested in testing the functionality
    Then: I want to be able to play around with the functionality in the nightly playground
  • User: Playground enthusiast
    When: I want to have my own playground
    Then: I should be able to reproduce this setup easily on my infrastructure
  • User: Playground Maintainer
    When: I receive a request to set up multiple playgrounds for different versions
    Then: I should be able to set up multiple playgrounds supporting multiple versions of OpenSearch(-Dashboards) easily and with minimal manual intervention.

Requirements

  • The nightly playground should be publicly accessible. [P0]
  • The nightly playground should always have the security plugin installed. [P0]
  • The nightly playground should support the upcoming version for 2.x. [P0]
  • The nightly playground should deploy both OpenSearch and OpenSearch-Dashboards on a regular basis using the latest builds. [P0]
  • The nightly playground should have generic anonymous (read-only) access for OpenSearch Dashboards. [P0]
  • Index some useful data apart from the sample data set. [P0]
  • End-to-end documentation. [P0]
  • The nightly playground should display the following information, either in an index or on a separate page: [P0]
    • Components included in this deployment
    • Version of OpenSearch and OpenSearch Dashboards
    • Commit IDs of all the components
    • Time at which the distribution was built
    • Java/NodeJS version used to build the distribution
    • Build number of the distribution artifact
  • Ability to index (customized) data in order to meet feature testing requirements. [P1] [Might need additional security review]
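The build information requirement above could be met by indexing one small document per deployment. The sketch below assumes the distribution build produces a manifest shaped like `{"build": {"name", "version", "id"}, "components": [{"name", "commit_id"}]}`; that shape, the function name `build_info_document` and the output field names are assumptions for illustration, and the build time and Java/NodeJS versions are passed in separately because the sketch does not assume the manifest carries them.

```python
from typing import Any, Dict


def build_info_document(
    manifest: Dict[str, Any],
    built_at: str,
    runtime_versions: Dict[str, str],
) -> Dict[str, Any]:
    """Collect the fields the requirements list into one indexable document.

    `manifest` is assumed to follow the shape described in the lead-in;
    adjust the key lookups if the real manifest differs.
    """
    build = manifest["build"]
    return {
        "distribution": build["name"],
        "version": build["version"],
        "build_number": build["id"],
        "built_at": built_at,                  # e.g. an ISO 8601 timestamp
        "runtime_versions": runtime_versions,  # e.g. {"java": "...", "nodejs": "..."}
        "components": [
            {"name": component["name"], "commit_id": component["commit_id"]}
            for component in manifest.get("components", [])
        ],
    }
```

Keeping the document flat and small means the same data can back either an index pattern in OpenSearch Dashboards or a static page, whichever form the playground ends up using.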

Out of scope

  • Support for any distribution type other than tarball
  • Customized distribution artifact (other than coming from distribution workflow) to be deployed on the playground
  • Gate keeping for whether a feature / functionality is security reviewed.

Proposed Solution

Architecture Diagram

image

Component Details

  • Tools Stack: Consists of tools such as AWS Step Functions, a Lambda function and an S3 bucket. This stack is responsible for managing the sample data in the cluster. If we receive ad-hoc requests to index a specific type of data, this model can be extended to review, approve or reject the data, copy it into the S3 bucket and index it into the right cluster. This stack will be shared by all playgrounds.

    • AWS Step Functions: Used to invoke the Lambda function.
    • AWS Lambda: Might use OpenSearch clients to manage sample data.
    • S3 bucket: Used to store sample data or snapshots.
  • Notification Stack: Consists of resources such as CloudWatch events, a Lambda function and SNS. It is mainly responsible for monitoring the clusters and sending notifications, for example when cluster health degrades or a cluster runs out of disk space. This stack will create GitHub issues as one notification mechanism and send SNS notifications to alert Slack channels or other notification targets. This stack will also be shared by multiple playground instances.

    • EventBridge: CloudWatch event acting as a trigger to invoke the Lambda function.
    • AWS Lambda: Responsible for creating GH issues in case of errors and failures.
    • SNS Notification: Responsible for notifying via email, Slack, etc.
  • opensearch-cluster-cdk: Originating in the opensearch-cluster-cdk repository, this setup consists of 3 dependency-based stacks. These stacks will not be shared by multiple playground setups. Each of

    • Network Stack: Consists of the VPC and security groups. Everything related to network and access management.
    • Infrastructure Stack: Consists of elements such as autoscaling groups that deploy OpenSearch and OpenSearch Dashboards on EC2, IAM role(s), the load balancer and CloudWatch logs.
    • Monitoring and alarm Stack: Would consist of resources such as CloudWatch alarms set up based on CloudWatch log groups. These alarms can be extended later to integrate with CloudWatch events, other actions, etc.
  • 2PR maintainer approval: Human approval is required before anything can be propagated to Prod. This will also take the form of GitHub issues.
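For the Tools Stack, the Lambda function that manages sample data would at some point need to turn documents read from the S3 bucket into a request body for OpenSearch's `_bulk` API, which expects newline-delimited JSON: one action line followed by one document line, with a trailing newline. The helper below is a minimal sketch of that step only; the function name `to_bulk_body` and the index name in the usage note are hypothetical, and sending the body (with whichever OpenSearch client the Lambda ends up using) is left out.

```python
import json
from typing import Any, Dict, Iterable


def to_bulk_body(index_name: str, docs: Iterable[Dict[str, Any]]) -> str:
    """Build a newline-delimited JSON body for the _bulk API.

    Each document becomes two lines: an "index" action naming the target
    index, then the document source. The body must end with a newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"
```

Calling `to_bulk_body("sample-logs", docs)` yields a string that can be posted to `_bulk` with the `application/x-ndjson` content type. Because the helper takes the index name as a parameter, the same Lambda can serve several playgrounds, which matches the intent of sharing this stack.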

Activity diagram:

1. Updating tools and notifications stack

image

2. Updating opensearch-cluster-cdk stacks

image

Failure modes

  • Deployment failures: CloudFormation deployment failures are possible. If the rollback completes, the stack should revert to the last known stable state; otherwise, human intervention is required to fix the stack. The GitHub Actions workflow status will be the metric for this problem.
  • Cluster instability:
    • Red cluster
    • Disk space usage
    • Master unreachable
    • JVM pressure

Each of the above metrics will create a GH issue that tags all the maintainers, serving as a notification and suggesting possible remediation.

  • Endpoint monitoring: If the nightly playground is unavailable for longer than a 10-minute threshold, it should notify us.
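The issue-creation step described above can be sketched as a pure function that maps an alarm notification to the JSON payload for GitHub's "create an issue" REST endpoint (`POST /repos/{owner}/{repo}/issues`, which accepts `title`, `body` and `labels`). The alarm fields used here (`AlarmName`, `NewStateValue`, `NewStateReason`, `StateChangeTime`) follow the CloudWatch alarm message delivered through SNS; the function name, the label, the alarm names and the remediation hints are hypothetical placeholders.

```python
from typing import Any, Dict, List

# Hypothetical alarm names and placeholder hints; the real alarms and
# remediation text would come from the monitoring stack and its runbooks.
REMEDIATION_HINTS = {
    "cluster-red": "Check unassigned shards and node availability.",
    "disk-space": "Check disk usage on the data nodes.",
    "master-unreachable": "Check that the cluster manager nodes are running.",
    "jvm-pressure": "Check heap usage and recent garbage collection activity.",
}


def alarm_to_issue(alarm: Dict[str, Any], maintainers: List[str]) -> Dict[str, Any]:
    """Turn a CloudWatch alarm message into a GitHub issue payload."""
    name = alarm["AlarmName"]
    mentions = " ".join(f"@{login}" for login in maintainers)
    hint = REMEDIATION_HINTS.get(name, "No remediation hint recorded for this alarm.")
    body = "\n".join(
        [
            f"Alarm: {name}",
            f"State: {alarm['NewStateValue']}",
            f"Reason: {alarm['NewStateReason']}",
            f"Time: {alarm['StateChangeTime']}",
            f"Possible remediation: {hint}",
            f"cc {mentions}",
        ]
    )
    return {
        "title": f"[Nightly Playground] {name} is in {alarm['NewStateValue']} state",
        "body": body,
        "labels": ["nightly-playground"],
    }
```

Keeping this mapping free of network calls makes it easy to unit test; the Lambda handler would only need to parse the SNS record, call this function and post the result with an authenticated HTTP request.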

Infrastructure and deployment

We will use GitHub Actions as the deployment pipeline. This includes deployment to the beta and prod environments.

Credentials management will be handled strictly using OIDC between the GitHub repository and AWS. The permissions should be minimal and need to be reviewed each time someone makes a change.
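With OIDC, the IAM role that GitHub Actions assumes needs a trust policy that only accepts tokens issued by GitHub for the intended repository. The sketch below builds such a policy document as a Python dictionary; the function name, the account ID and the repository in the usage note are placeholders, and restricting the role to a single branch is one possible choice for keeping permissions minimal, not something the design specifies.

```python
from typing import Any, Dict

# Issuer host of GitHub's OIDC provider for Actions.
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(
    account_id: str, repo: str, branch: str = "main"
) -> Dict[str, Any]:
    """Build an IAM trust policy allowing one repo/branch to assume a role.

    `repo` is "owner/name". The `sub` condition limits the role to workflows
    running on the given branch of that repository.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": (
                        f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                    )
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        f"{GITHUB_OIDC_HOST}:sub": (
                            f"repo:{repo}:ref:refs/heads/{branch}"
                        ),
                    }
                },
            }
        ],
    }
```

For example, `github_oidc_trust_policy("123456789012", "example-org/example-repo")` returns a document that can be serialized with `json.dumps` and attached to the deployment role. Separate roles for beta and prod, each with its own policy, would keep the two environments isolated.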

@github-actions github-actions bot added the untriaged Issues that have not yet been triaged label Oct 28, 2023
@gaiksaya gaiksaya removed the untriaged Issues that have not yet been triaged label Oct 30, 2023
@gaiksaya gaiksaya self-assigned this Oct 30, 2023
@gaiksaya gaiksaya moved this from Backlog to In review in OpenSearch Engineering Effectiveness Oct 30, 2023
@gaiksaya gaiksaya changed the title Nightly Playground High Level Design and Execution Plan Nightly Playground High Level Design Oct 30, 2023
gaiksaya commented

Below is the execution plan for implementing the above design. Adding the same in the main issue.

| Feature | Priority | Status | Efforts in points |
| --- | --- | --- | --- |
| Ability to pass custom configuration to OpenSearch Dashboards config (yml) file | P0 | Not started | 2 |
| Ability to provide customized security configurations to the cluster | P0 | Not started | 3 |
| Add optional monitoring and alarms stack | P0 | Not started | 3 |
| Research and integrate opensearch-cluster-cdk into nightly playgrounds. May involve publishing the cdk code base to npm or consuming it by cloning the repo. | P0 | Not started | 2 |
| Code refactoring to make it more generic (I do not have a list yet) | P0 | Not started | 2 |
| Tools stack along with lambda function | P0 | Not started | 3 |
| Notifications stack along with lambda function | P0 | Not started | 3 |
| Set up accounts and required permissions (OIDC) | P0 | Not started | 1 |
| Add support to enable validation with-security strictly | P0 | Not started | 2 |
| Add GHA deployment workflow for beta | P0 | Not started | 1 |
| Add Prod stage to deployment workflow | P0 | Not started | 1 |
| Test the notification and alarms | P1 | Not started | 2 |
