This section is for those interested in contributing to the development of fabric8-analytics. Please read through our glossary in case you are not sure about terms used in the docs.
Please find the documentation on how to set up CI/CD for your repository here.
You will need Git, and possibly other packages, depending on how you want to run the system (see below).
First of all, clone the fabric8-analytics-deployment repo (this one). It includes all the configuration for running the whole system, as well as some helper scripts and docs.
In order to have a good local development experience, the code repositories are mounted inside the containers, so that changes can be observed live or after a container restart (without image rebuilds). To achieve that, all the individual fabric8-analytics repos have to be checked out. The helper script setup.sh does that for you: run setup.sh -h and follow the instructions (most of the time, you'll be fine running setup.sh with no arguments).
This is not exactly a local development experience, but if you'd like to test your changes in OpenShift, there is some documentation on how to do it.
Requirements:
- docker
- docker-compose
Then, to bring the system up, run:
$ sudo docker-compose up
Please note that some error messages might be displayed during startup for the data-model-importer module. Such errors are caused by the gremlin-https service, which takes some time to start serving requests. After a while, the data-model-importer will start properly.
If you want a good development setup (source code mounted inside the containers, ability to rebuild images using docker-compose), use:
$ sudo ./docker-compose.sh up
docker-compose.sh will effectively mount the source code from the checked-out fabric8-analytics sub-projects into the containers, so any changes made to the local checkout will be reflected in the running containers. Note that some containers (e.g. server) will pick changes up interactively, while others (e.g. worker) will need a restart to pick up new code.
To rebuild the images, run:
$ sudo ./docker-compose.sh build --pull
Some parts (GithubTask, LibrariesIoTask) need credentials for proper operation. You can provide them as environment variables in worker_environment in docker-compose.yml.
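For illustration, the relevant fragment of docker-compose.yml might look like the sketch below; the variable names are placeholders, not the actual keys, so check which variables GithubTask and LibrariesIoTask really read:

```yaml
# Sketch only: the variable names below are assumptions.
worker_environment: &worker_environment
  GITHUB_TOKEN: "<your GitHub API token>"        # assumed name, used by GithubTask
  LIBRARIES_IO_TOKEN: "<your libraries.io key>"  # assumed name, used by LibrariesIoTask
```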
When running locally via docker-compose, you will likely not need to scale most of the system components. You may, however, want to run more workers, if you're running more analyses and want them finished faster. By default, only a single worker is run, but you can scale it to pretty much any number. Just run the whole system as described above and then in another terminal window execute:
$ sudo docker-compose scale worker-api=2 worker-ingestion=2
This will run 2 additional workers, giving you a total of 4 workers running. You can use this command repeatedly with different numbers to scale up and down as necessary.
When the whole application is started, there are several services you can access. When running through docker-compose, all of these services will be bound to localhost.
- fabric8-analytics Server itself - port 32000
- fabric8-analytics Jobs API - port 34000
- PGWeb (web UI for the database) - port 31003 (PGWeb only runs if you start the system with -f docker-compose.debug.yml)
- Minio S3 - port 33000
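For quick reference, the port mapping above can be captured in a small helper; the ports come straight from the list, and everything binds to localhost when running via docker-compose:

```python
# Ports taken from the service list above; all services are bound to
# localhost when running through docker-compose.
SERVICES = {
    "server": 32000,     # fabric8-analytics Server
    "jobs-api": 34000,   # fabric8-analytics Jobs API
    "pgweb": 31003,      # only available with -f docker-compose.debug.yml
    "minio-s3": 33000,   # Minio S3
}

def service_url(name):
    """Return the local URL for one of the services listed above."""
    return "http://localhost:%d" % SERVICES[name]
```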
All services log to their stdout/stderr, which makes their logs viewable through Docker:
- When using docker-compose, all logs are streamed to the docker-compose output and you can view them there in real time. If you want to see the output of a single container only, use docker logs -f <container>, e.g. docker logs -f coreapi-server (the -f is optional; it switches on "follow" mode).
Refer to the integration testing README.
Worker, by design, is a monolith that takes care of all tasks available in the system. However, it is possible to let a worker serve only some particular tasks. This can be accomplished by supplying one of the following (mutually exclusive) environment variables:
- WORKER_EXCLUDE_QUEUES - a comma-separated list of regexps describing names of queues that should be excluded from worker serving; all others are included
- WORKER_INCLUDE_QUEUES - a comma-separated list of regexps describing names of queues that should be included in worker serving; all others are excluded
This can be especially useful when performing prioritization or throttling of some particular tasks.
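As a sketch of these semantics (the exact matching rules live in the worker code; full-regexp matching and the queue names used below are assumptions):

```python
import re

def should_serve(queue_name, include=None, exclude=None):
    """Decide whether a worker serves a queue, mimicking the (assumed)
    semantics of WORKER_INCLUDE_QUEUES / WORKER_EXCLUDE_QUEUES.
    The two options are mutually exclusive, so pass only one of them."""
    if include is not None:
        # include list: serve only queues matching one of the regexps
        return any(re.fullmatch(p, queue_name) for p in include)
    if exclude is not None:
        # exclude list: serve everything except the matching queues
        return not any(re.fullmatch(p, queue_name) for p in exclude)
    return True  # no filtering configured: serve all queues

# e.g. WORKER_INCLUDE_QUEUES="ingestion_.*" would translate to
# include=["ingestion_.*"] after splitting on commas
```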
Even though the whole fabric8-analytics system runs in Amazon AWS, there are basically no strict limitations on local deployment.
The following AWS resources are used in the cloud deployment.
- Amazon SQS
- Amazon DynamoDB
- Amazon S3
- Amazon RDS
However, the local setup uses the following alternatives:
- RabbitMQ - an alternative to Amazon SQS
- Local DynamoDB - an alternative to Amazon DynamoDB
- Minio S3 - an alternative to Amazon S3
- PostgreSQL - an alternative to Amazon RDS
You can use RabbitMQ directly instead of Amazon SQS. To do so, do not provide the AWS_SQS_ACCESS_KEY_ID and AWS_SQS_SECRET_ACCESS_KEY environment variables. The RabbitMQ broker should then be running on the host described by the RABBITMQ_SERVICE_SERVICE_HOST environment variable (defaults to coreapi-broker).
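The broker selection described above can be sketched as follows; the exact decision logic in the worker code may differ, so treat this as an illustration of the documented behavior:

```python
import os

def broker_config(env=None):
    """Sketch of the (assumed) broker-selection logic: RabbitMQ is used
    whenever the SQS credentials are not provided."""
    env = os.environ if env is None else env
    use_sqs = bool(env.get("AWS_SQS_ACCESS_KEY_ID")) and \
        bool(env.get("AWS_SQS_SECRET_ACCESS_KEY"))
    if use_sqs:
        return {"broker": "sqs"}
    # fall back to RabbitMQ on the host named by RABBITMQ_SERVICE_SERVICE_HOST
    return {"broker": "rabbitmq",
            "host": env.get("RABBITMQ_SERVICE_SERVICE_HOST", "coreapi-broker")}
```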
As RabbitMQ scales pretty flawlessly, there shouldn't be any significant downside to using RabbitMQ instead of Amazon SQS.
The local setup is already prepared to work with local DynamoDB.
To substitute for Amazon S3 in the local deployment and development setup, Minio S3 was chosen as an alternative. Minio S3 is not a fully compatible alternative (e.g. it supports neither versioning nor encryption), but this does not restrict basic functionality in any way.
You can find credentials for the Minio S3 web console in the docker-compose.yml file (search for MINIO_ACCESS_KEY and MINIO_SECRET_KEY).
The only difference in using PostgreSQL instead of Amazon RDS is in supplying the connection string to the PostgreSQL/Amazon RDS instance. This connection string is constructed from the following environment variables:
- POSTGRESQL_USER - defaults to coreapi in the local setup
- POSTGRESQL_PASSWORD - defaults to coreapi in the local setup
- POSTGRESQL_DATABASE - defaults to coreapi in the local setup
- PGBOUNCER_SERVICE_HOST - defaults to coreapi-pgbouncer (note that PostgreSQL/Amazon RDS is accessed through PgBouncer, hence the naming)
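A minimal sketch of how such a connection string could be assembled from these variables; the URL scheme and the default port 5432 are assumptions, not taken from the docs:

```python
import os

def postgres_connection_string(env=None):
    """Assemble a PostgreSQL connection string from the documented
    environment variables, using the local-setup defaults."""
    env = os.environ if env is None else env
    user = env.get("POSTGRESQL_USER", "coreapi")
    password = env.get("POSTGRESQL_PASSWORD", "coreapi")
    database = env.get("POSTGRESQL_DATABASE", "coreapi")
    # PostgreSQL/Amazon RDS is reached through PgBouncer, hence the host name
    host = env.get("PGBOUNCER_SERVICE_HOST", "coreapi-pgbouncer")
    return "postgresql://%s:%s@%s:5432/%s" % (user, password, host, database)
```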
There can be situations where a developer has to import production data into her/his development graph database. The following process has to be followed for the data import:
1. The SRE team dumps production data into the S3 bucket dynamodb-migration-backups-osio in the osio-dev namespace.
2. Use the AWS Data Pipeline service to import data from the graph tables.
3. Create a pipeline with the source template Import DynamoDB backup data from S3.
4. Select one of the backup folders in s3://dynamodb-migration-backups-osio/backups/prod_edgestore/ as Input S3 folder.
5. Provide the corresponding developer DynamoDB table name as Target DynamoDB table name, for example mykerberosid_edgestore.
6. Set the DynamoDB write throughput ratio to 1. For a faster import, set the write capacity of the respective DynamoDB table to a higher number (for example, 100). Reset this value to the default as soon as the import of that table completes.
7. Schedule the data pipeline to run on pipeline activation.
8. Optionally enable logging, and then click Activate.
9. Repeat steps 2 to 8 for the two other tables, _graphindex and _titan_ids. All three data pipelines can be run in parallel.
10. The import may take a day or two to complete, depending on the DynamoDB write capacity set on the developer's tables.