Run a Grafana instance to provide a monitoring dashboard for a ceph cluster. The prerequisites are:
- docker and docker-compose (for simplicity)
- grafana image (official latest 4.3 release from docker hub)
- graphite image (docker.io/abezhenar/graphite-centos7)
- clone the cephmetrics repo (docker configuration, dashboards)
- host that will run the monitor should have passwordless ssh to all the ceph nodes
- the storage for the graphite database should be on SSD/flash if possible
- needs PyYAML, tested with python 2.7.13
- collectd rpm (5.7 or above)
Install the monitoring endpoint first, and then apply the collectd configuration to each of the ceph nodes.
On the monitoring host, perform the following steps:
- Pull the required docker images (listed above)
- Create the directories that will persist the grafana configuration db and settings, as well as the graphite data.
mkdir -p /opt/docker/grafana/etc
mkdir -p /opt/docker/grafana/data/plugins
mkdir -p /opt/docker/graphite
- Download the additional status panel plugin
cd /opt/docker/grafana/data/plugins
wget https://grafana.com/api/plugins/vonage-status-panel/versions/1.0.4/download
unzip download
rm -f download
- Copy the seed .ini file for grafana to the container's etc directory, and reset the permissions to be compatible with the containers
cp etc/grafana/grafana.ini /opt/docker/grafana/etc
chown -R 104:107 /opt/docker/grafana
chown -R 997 /opt/docker/graphite
chmod g+w /opt/docker/graphite
- Edit the docker-compose.yml example (if necessary)
- From the directory with the compose file, issue
docker-compose up -d
- Check that the containers are running and the endpoints are listening
7.1 Use docker ps
7.2 Use netstat and look for the following ports: 3000, 80, 2003, 2004, 7002
7.3 Open a browser and connect to graphite - it should be running on port 80 of the local machine
- Add the graphite instance as a datasource to grafana
8.1 update setup/add_datasource.json with the IP of the host machine
8.2 register the graphite instance to grafana as the default data source
curl -u admin:admin -H "Content-Type: application/json" -X POST http://localhost:3000/api/datasources \
--data-binary @setup/add_datasource.json
- Install the grafana labs pie-chart plugin
9.1 open a shell session to the grafana instance, and install the plugin
docker exec -it grafana bash
grafana-cli plugins install grafana-piechart-panel
- The sample dashboards need to be added/edited to reflect the ceph cluster to monitor
10.1 seed dashboards are provided in the dashboards/current directory
10.2 edit dashboard.yml with the short names of the OSDs and RGWs, plus the DNS domain name of the environment.
10.3 run the following command
python dashUpdater.py
After adding ceph nodes to the configuration, update the dashboard.yml file, and then rerun the dashUpdater.py script.
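The listening check in step 7 can also be scripted. The following is a minimal sketch, assuming Python 3 on the monitoring host; the function names `port_is_open` and `closed_ports` are illustrative and are not part of cephmetrics. It only tests whether each port accepts a TCP connection, so it complements rather than replaces docker ps.

```python
import socket

# Ports listed in step 7: grafana (3000), the graphite web UI (80) and
# the carbon listeners inside the graphite container (2003, 2004, 7002).
EXPECTED_PORTS = (3000, 80, 2003, 2004, 7002)


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


def closed_ports(host="localhost", ports=EXPECTED_PORTS):
    """Return the ports from `ports` that do not accept a TCP connection."""
    return [port for port in ports if not port_is_open(host, port)]
```

Calling `closed_ports()` on the monitoring host returns an empty list when all five endpoints are listening; any port it returns points at a container that did not start or a port mapping that is missing from docker-compose.yml.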
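You may need to update your SELinux policy to allow the write_graphite plugin to make outbound connections on port 2003. To test whether SELinux is the cause of a connection problem, temporarily disable it.
On each ceph node, perform the following steps: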
- install collectd (this will also require libcollectdclient)
- create the required directories for the cephmetrics collectors (see known issues [2])
mkdir -p /usr/lib64/collectd/python-plugins/collectors
- copy the collectors to the directory created in [2], and cephmetrics.py to /usr/lib64/collectd/python-plugins
- Setup the collectd plugins
4.1 Update the write_graphite.conf file to specify the hostname where the grafana/graphite environment is (use a hostname, not an IP address: anecdotally, with an IP the plugin fails to connect to the graphite container port)
4.2 copy the example plugin files to the /etc/collectd.d directory (e.g. cpu.conf, memory.conf)
- update the "ClusterName" parameter in the cephmetrics plugin file to match the name of your ceph cluster (default is 'ceph')
- copy the example collectd.conf file to the ceph node (or update the existing configuration to ensure there is an Include "/etc/collectd.d/*.conf" entry)
- enable collectd
- start collectd
- check collectd is running without errors
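If collectd runs cleanly but no data appears in graphite, it can help to confirm that the ceph node can reach port 2003 independently of collectd. The sketch below assumes Python 3 and uses Graphite's plaintext protocol (one `<path> <value> <timestamp>` line per metric); the function names and the metric path `test.cephmetrics.connectivity` are hypothetical and chosen only for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return "{} {} {}\n".format(path, value, int(timestamp))


def send_test_metric(host, port=2003,
                     path="test.cephmetrics.connectivity",  # hypothetical name
                     value=1, timeout=5.0):
    """Send a single metric line to host:port and return the line sent.

    Raises OSError if the connection cannot be established, which is the
    symptom to expect when a firewall or SELinux blocks the port.
    """
    line = format_metric(path, value)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("ascii"))
    return line
```

Run `send_test_metric("<graphite hostname>")` on the ceph node with the same hostname that is set in write_graphite.conf. Note that this runs as an ordinary user process rather than as collectd, so a success here does not rule out an SELinux denial that applies only to the collectd service.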
- Following a reboot of an OSD node, the cephmetrics collectd plugin doesn't send disk stats. Workaround: Following the reboot of an OSD, restart the collectd service.
- the cephmetrics.py and collectors should be installed through python-setuptools to cut down on the installation steps.
- SELinux may block the write_graphite plugin from writing outbound on port 2003