EMC Data Domain Availability and Performance Monitoring
You should at least check whether the DD is reachable (a normal ping host check). In addition, you can ping its interfaces with the IPs your backup servers use, to verify that the connections are reliable. You should also check the services you need to manage the device - SSH and the web frontend - via check_ssh and check_http.
Other important checks are the operational status of the attached disks, the interface states, the state of the NVRAM batteries, the state of the power supplies, and the space usage on the file system.
The example.cfg should give you an idea of what to do.
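As a minimal sketch of the service checks (the host name is an assumption, and -S only applies if the web frontend runs over HTTPS), the standard Nagios plugins can be called like this:

    # is SSH on the DD answering?
    ./check_ssh -H datadomain01.example.com
    # is the web frontend answering?
    ./check_http -H datadomain01.example.com -S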
An important part of the Data Domain is replication. Currently check_datadomain only supports checking the state of replication pairs; it cannot yet (a manual workaround via the CLI is sketched after this list):
- check the age of the last replication
- check the size of the data left to sync
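Until the plugin covers that, you can look at the replication pairs on the DD CLI itself; a sketch, assuming an SSH login to the appliance (exact commands and output vary between DD OS releases):

    # state of all replication contexts
    replication status
    # configured source/destination pairs
    replication show config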
The key measurements are diskReadKBytesPerSecond / diskWriteKBytesPerSecond and, for a global overview of disk performance, diskBusyPercentage. If you graph them, you get a quite nice performance profile of the disks. So if you notice performance problems in a certain time period, this will probably help you out.
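The counter names above are the ones from the Data Domain MIB; a minimal polling sketch, assuming SNMP v2c with the community public and the DATA-DOMAIN-MIB file loaded on the poller:

    # per-disk read/write throughput in KB/s
    snmpwalk -v2c -c public datadomain01.example.com DATA-DOMAIN-MIB::diskReadKBytesPerSecond
    snmpwalk -v2c -c public datadomain01.example.com DATA-DOMAIN-MIB::diskWriteKBytesPerSecond
    # how busy the disks are overall, in percent
    snmpwalk -v2c -c public datadomain01.example.com DATA-DOMAIN-MIB::diskBusyPercentage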
Now you can see your peak usage. Try to schedule the file system clean as far away from your peak usage as possible. filesys clean show config shows your current configuration, and with filesys clean schedule <day> <time> you can shift it. But don't forget to delete your old schedule.
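On the CLI that boils down to the following sketch (the day/time format is an assumption, and how the old schedule entry is removed depends on your DD OS release):

    # show the current clean schedule
    filesys clean show config
    # move the clean away from the backup window, e.g. to Tuesday morning
    filesys clean schedule Tue 0600
    # then remove the old schedule entry (check the syntax of your DD OS release)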
De-duplication is a quite load-intensive job. The compression the DD performs after it has broken up the data can also be hard on the system. Take a look at the cpuAvgPercentageBusy and cpuMaxPercentageBusy measurements to get an idea whether you should change your compression method.
Note: The default compression is lzo, so you shouldn't notice any problems. But if you changed it to gzfast or gz and raised the global-compression-type to a high value, please keep an eye on the CPU busy values.
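Both CPU counters can be polled the same way as the disk values; again a sketch assuming SNMP v2c and the community public:

    # average and peak CPU usage as reported by the DD
    snmpwalk -v2c -c public datadomain01.example.com DATA-DOMAIN-MIB::cpuAvgPercentageBusy
    snmpwalk -v2c -c public datadomain01.example.com DATA-DOMAIN-MIB::cpuMaxPercentageBusy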
You are noticing performance issues, but the graphs above look fine? Take a look at your NFS and CIFS ops. If you are writing to or reading from many hosts at the same time, they all have to share the available performance. nfsOpsPerSecond and cifsOpsPerSecond give you an idea of how many operations the system is asked to perform.
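To watch the protocol load live from the poller, a rough sketch (the 30 second interval is arbitrary, host and community as above):

    # refresh the NFS and CIFS ops counters every 30 seconds
    watch -n 30 'snmpwalk -v2c -c public datadomain01.example.com DATA-DOMAIN-MIB::nfsOpsPerSecond; snmpwalk -v2c -c public datadomain01.example.com DATA-DOMAIN-MIB::cifsOpsPerSecond'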
Spread your writes: please spread your writes over the time. If possible don't schedule your backup jobs at the same time.
You can, for example, watch PreCompBytesRemaining for each replication context of your devices. It shows how much not-yet-de-duplicated data is still left on the sending device. If this value never reaches zero, you definitely have a replication problem.
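A quick-and-dirty way to read that counter without guessing its exact object name (19746 is Data Domain's enterprise OID, host and community as above):

    # walk the Data Domain enterprise subtree and pick out the per-context counter
    snmpwalk -v2c -c public -m DATA-DOMAIN-MIB datadomain01.example.com .1.3.6.1.4.1.19746 | grep -i precompbytesremaining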
You should also measure the space usage on the DD file systems, to get an overview of how much data is written to your disks and when it is time to tidy up or to buy a new shelf. dd_space-Data is the really interesting value. You can spot replication issues here as well: if a snapshot doesn't expire or isn't deleted, you will see a difference in disk usage (if the devices are cross-replicated). Normally the snapshot is expired and cleaned up once the initial replication is done.
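The same numbers are available on the box itself; a small CLI sketch (output columns vary between DD OS releases):

    # space usage of /data (pre- and post-comp) and /ddvar, including cleanable space
    filesys show space
    # compression factors over time, useful to judge how well de-duplication works
    filesys show compression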