Performance Tests
Architecture and implementation need to be performance-test driven so that the final product satisfies its performance requirements. Performance is measured in terms of scalability along three axes: number of application instances, number of tasks, and number of agents.
We define three test categories:
- Large number of cores owned by one agent, represented by one task queue (single HPC scenarios)
- Large number of cores owned by a small number of agents (multiple HPC scenarios)
  - through one aggregate task queue (saga-pilot only scenario)
  - through one task queue per agent (saga-pilot + TROY)
- Small number of cores owned by a large number of agents (OSG / Cloud scenarios)
  - through one aggregate task queue (saga-pilot only scenario)
  - through one task queue per agent (saga-pilot + TROY)
These tests are to be fed by either one or many application instances.
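The categories above, combined with the single- vs. multi-application variants, can be captured as a small parameter matrix. The sketch below is purely illustrative: the field names (`n_agents`, `cores_per_agent`, `queue_model`, `n_app_instances`) and the concrete numbers are assumptions, not part of any saga-pilot interface.

```python
# Illustrative parameter matrix for the three test categories.
# All field names and numbers are placeholders, not saga-pilot API.

TEST_CATEGORIES = [
    # single HPC scenario: one agent owns many cores, one task queue
    {"name": "single-hpc", "n_agents": 1,    "cores_per_agent": 4096,
     "queue_model": "per-agent"},
    # multiple HPC scenario: few agents, many cores each
    {"name": "multi-hpc",  "n_agents": 4,    "cores_per_agent": 1024,
     "queue_model": "aggregate"},   # or "per-agent" (saga-pilot + TROY)
    # OSG / Cloud scenario: many agents, few cores each
    {"name": "osg-cloud",  "n_agents": 1024, "cores_per_agent": 4,
     "queue_model": "aggregate"},   # or "per-agent" (saga-pilot + TROY)
]

# each category is exercised with one and with many application instances
N_APP_INSTANCES = [1, 16]

def test_matrix():
    """Yield one test configuration per (category, #app instances) pair."""
    for category in TEST_CATEGORIES:
        for n_apps in N_APP_INSTANCES:
            yield dict(category, n_app_instances=n_apps)
```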
Tests shall be defined as soon as the user-facing REST API has been specified, and they shall be run periodically during all stages of the implementation period, both to ensure performance QoS and to get an early handle on overall scalability and performance numbers and limitations.
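Once that API exists, most tests reduce to timing a well-defined sequence of calls. The sketch below shows one way such a timed test could look; the service URL, the `/cus` route, and the payload are hypothetical placeholders, since the actual REST API is not yet defined.

```python
import time
import requests   # plain HTTP client used here for illustration only

SERVICE_URL = "http://localhost:8080"   # placeholder, not a real endpoint

def timed(label, func, *args, **kwargs):
    """Run func, return its result, and print the elapsed wall-clock time."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result

def submit_cus(n):
    """Submit n compute units in one (hypothetical) bulk request."""
    payload = [{"executable": "/bin/true"} for _ in range(n)]
    resp = requests.post(f"{SERVICE_URL}/cus", json=payload)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # e.g. 'time to schedule 100k CUs' becomes timing one bulk submission
    timed("schedule 100k CUs", submit_cus, 100_000)
```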
Performance metrics are:
- time to bootstrap the saga-pilot service layer
- time to bootstrap the saga-pilot agent layer
- early binding: time to schedule 100k CUs
- late binding: time to schedule 100k CUs
- time to stage input files for 100k CUs
- time to execute (NEW->DONE) all 100k CUs
- time to stage output files for 100k CUs
For these metrics:
- measure averages and variation
- understand minimum / maximum values
- determine overhead (time saga-pilot spends on anything other than CU execution)
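Per-CU timestamps are sufficient for these statistics. The sketch below computes mean, standard deviation, min/max, and a rough overhead estimate from hypothetical per-CU `start`/`stop` times; the record field names are assumptions, not saga-pilot state names.

```python
import statistics

def summarize(durations):
    """Mean, standard deviation, min, and max of a list of durations (seconds)."""
    return {
        "mean":  statistics.mean(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min":   min(durations),
        "max":   max(durations),
    }

def overhead(total_wallclock, cu_records, total_cores):
    """Rough overhead: wall-clock time not accounted for by CU execution.

    cu_records: list of dicts with hypothetical 'start'/'stop' timestamps.
    """
    busy_core_seconds = sum(r["stop"] - r["start"] for r in cu_records)
    return total_wallclock - busy_core_seconds / total_cores

# usage sketch:
#   exec_times = [r["stop"] - r["start"] for r in cu_records]
#   print(summarize(exec_times))
#   print(overhead(run_end - run_start, cu_records, total_cores=4096))
```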
Scenario 1: scale up the number of cores owned by a single agent.
- The largest HPC cluster we have access to is STAMPEDE:
  - normal queue: 256 nodes (4K cores)
  - large queue: 1024 nodes (10k cores, on request)
Scenario 2: similar to Scenario 1, but distribute the total number of cores over 4 distinct HPC resources (XSEDE + FutureGrid).
Scenario 3: similar to Scenario 2, but distribute the total number of cores over many OSG/Cloud resources -- this basically inverts the pilot-size to #CU ratio.
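The inversion of the pilot-size to #CU ratio can be made concrete with a back-of-the-envelope calculation; the pilot counts and sizes below are illustrative assumptions, not targets.

```python
# Illustrative pilot-size to #CU ratios for the three scenarios.
N_CUS = 100_000

scenarios = {
    # name:                               (number of pilots, cores per pilot)
    "scenario 1 (one large HPC pilot)":   (1,    4096),
    "scenario 2 (4 HPC pilots)":          (4,    1024),
    "scenario 3 (many OSG/Cloud pilots)": (1024, 4),
}

for name, (n_pilots, cores) in scenarios.items():
    print(f"{name}: {n_pilots} pilot(s) x {cores} cores, "
          f"~{N_CUS // (n_pilots * cores)} CUs per core")
```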