WHEN Release Decision Metrics
Niels Bruun Svendsen & John A. Fodeh
B-K Medical A/S Denmark
To set up a metrics program, the process described in the CMU/SEI handbook "Goal-Driven Software Measurement" was applied. Apart from leading to a well-defined set of metrics, the impact on the organisation was remarkable. The metrics program resulted in a "Release Form", i.e. a data sheet containing a set of metrics collected during the system test phase together with other relevant information needed for assessing the product's readiness for release. A number of the metrics in the Release Form have been applied in multiple releases, and the results were evaluated after each release. This paper highlights the benefits found as well as the problems encountered. It also emphasises the effect that introducing and working with these metrics has had on the organisation.
The work has been supported by the European Commission and this paper is part of the final dissemination of the ESSI – Process Improvement Experiment (PIE) project 27498 – WHEN. The PIE has three objectives, of which two are related to release decision support, the topic of this paper.
In order to measure the impact of the improvement, a questionnaire was designed based on the output of the management brainstorm. The questionnaire was applied after each of three releases: one before the improvement work started, one halfway through, and one at the end of the improvement work.
The questionnaire consists of 7 questions, each rated on one of 5 levels. The questions are:
1. How would you characterise the basis for the release decision in general?
2. How were the remaining known errors and their consequences presented?
3. How was the presentation of how much had been tested?
4. How was the presentation of how thorough the user evaluation was?
5. How was the estimate of remaining unknown errors?
6. How was the estimate of remaining unknown safety errors?
7. How was the post-release plan presented?
The five rating levels are:
1. Non-existent
2. Weak (very subjective)
3. Fair (subjective, but well argued)
4. Very good (mainly objective)
5. That's how to do it (objective, based on solid data)
The initial average score was 2.4, indicating that the basis for the release decision was altogether subjective. The results obtained during the WHEN project can be seen in The Results section.
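An average score of this kind can be computed as the mean rating over all respondents and all seven questions. The following is a minimal sketch, assuming each respondent's answers are stored as a list of seven integers from 1 to 5. The function name and the sample ratings are hypothetical and are not the project's data.

```python
from statistics import mean

NUM_QUESTIONS = 7
LEVELS = range(1, 6)  # 1 = Non-existent ... 5 = That's how to do it


def average_score(responses):
    """Return the mean rating over all respondents and all questions."""
    for ratings in responses:
        if len(ratings) != NUM_QUESTIONS:
            raise ValueError("each respondent must answer all 7 questions")
        if any(r not in LEVELS for r in ratings):
            raise ValueError("ratings must be integers from 1 to 5")
    return mean(r for ratings in responses for r in ratings)


# Hypothetical ratings from two respondents (illustrative only).
sample = [
    [2, 3, 2, 3, 2, 2, 3],
    [3, 2, 3, 2, 1, 2, 3],
]
print(round(average_score(sample), 1))
```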
The method builds on the GQM (Goal Question Metric) method by Basili and Rombach, and extends the GQM with a phase that guides the user from Business Goals through Sub-Goals to Measurement Goals. In overview, the method can be illustrated as shown in Figure NBS-JAF 1.
Figure NBS-JAF 1: Goal-Driven Software Measurement method
The work on Goal-Driven Software Measurement was conducted as a series of workshops involving the newly formed system test group and an external mentor. The system test group consisted of a senior test manager, a senior SW engineer and two newly employed test managers. The first step on the way from Business Goals to Sub-Goals was to ask questions concerning the process involved. As shown in Figure NBS-JAF 2, this raised another question: which process are we talking about? It was found that, although work is performed according to an ISO 9000 certified quality system, the process definitions were either lacking or not detailed enough, and the terms used were not defined.
The "Goal-Driven Software Measurement" handbook uses a concept called "Mental Models". Mental Models are the perception of procedures, processes and practices in the mind of the user. Such models can work when only one person is using the model, although the model tends to change according to the current situation. The problem occurs when more people are involved and only Mental Models exist, because there is at least one Mental Model per person involved.
Figure NBS-JAF 2: Which process?
Getting the individual Mental Models aligned and written down in process definitions took quite some time. However, the discussions afterwards could be aimed at continuing the Goal-Driven Software Measurement process, instead of discussing the proper use of terms and which sub-processes existed.
Having reached the point where a number of Sub-Goals were defined, we were ready to apply the GQ(i)M part of the process. The (i) stands for indicator and is an addition to the GQM that we found valuable. The idea is to make sketches of the desired presentation of the measurement results. It makes the measurements more real and "alive" and generates a number of additional discussions and ideas. Examples of indicators can be seen in Figure NBS-JAF 3.
Several measurements were defined using the Goal-Driven Software Measurement method. A number of these were selected as our release decision metrics. It was noticeable that some Sub-Goals did not directly result in measurements, but rather pointed out the need for templates, checklists etc.
Figure NBS-JAF 3: Indicators
The final step was to prepare a plan that addressed the identified actions needed for both implementing the measures and completing the templates and checklists. This plan set the framework for the WHEN PIE activities and established a reference for further improvements of the processes.
The test coverage data include the information concerning the progress and completeness of the testing. A low value reveals insufficient testing effort and the risk of potential latent defects. It is planned to extend this section with code coverage data for quantifying the portion of the code that is exercised by testing, thereby showing the thoroughness of the applied testing techniques.
The system stability section delivers vital information about the reliability of the software. The data show the mean number of operations between failures, analogous to the widely used Mean Time Between Failures (MTBF) metric. This section refers to the stability chart, which plots the mean number of random operations between failures as a function of the build number. The chart contains two limits: the lower limit is the entry criterion for system testing, while the higher limit is the release stability criterion. In this way, it is straightforward to confirm whether the system's stability is adequate for release.
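The stability check described here can be sketched in a few lines. This is an illustration only, assuming that each build is characterised by the number of random operations executed and the number of failures observed. The function names, the two limit values and the per-build data are hypothetical.

```python
def mean_operations_between_failures(total_operations, failures):
    """Mean number of operations between failures for one build.

    If no failure was observed, the total number of operations is
    returned as a lower bound on the true value.
    """
    if total_operations < 0 or failures < 0:
        raise ValueError("counts must be non-negative")
    return total_operations / failures if failures else float(total_operations)


def stability_status(mobf, entry_limit, release_limit):
    """Classify a build against the two limits of the stability chart."""
    if mobf >= release_limit:
        return "meets release criterion"
    if mobf >= entry_limit:
        return "meets system test entry criterion"
    return "below system test entry criterion"


# Hypothetical limits and per-build data: (build, operations run, failures).
ENTRY_LIMIT = 200
RELEASE_LIMIT = 1000
builds = [(1, 3000, 25), (2, 4000, 10), (3, 6000, 5)]

for build, operations, failures in builds:
    mobf = mean_operations_between_failures(operations, failures)
    print(build, round(mobf), stability_status(mobf, ENTRY_LIMIT, RELEASE_LIMIT))
```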
The test system status section contains statistics on the problems reported during the system test phase. The total number of problems reported is shown, divided into closed (fixed and verified) and open problems. The open problems are sorted according to their severity. These data deliver a snapshot of the system state at release time, making it possible to take into account the risk and consequences of releasing the system. For example, if the data reveal a large number of open high-severity or non-verified problems, releasing the system at this moment is clearly a high-risk decision.
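A summary of this kind could be derived from a list of problem reports as follows. This is a minimal sketch: the record fields (`severity`, `state`), the state names and the sample reports are assumptions made for the example, not the format of the actual problem-reporting system.

```python
from collections import Counter

CLOSED = "closed"  # fixed and verified; every other state counts as open


def summarise_problems(reports):
    """Count closed and open reports and group the open ones by severity."""
    open_reports = [r for r in reports if r["state"] != CLOSED]
    by_severity = Counter(r["severity"] for r in open_reports)
    return {
        "total": len(reports),
        "closed": len(reports) - len(open_reports),
        "open": len(open_reports),
        "open_by_severity": dict(sorted(by_severity.items())),
    }


# Hypothetical reports; severity 1 is taken to be the most severe.
reports = [
    {"id": 1, "severity": 1, "state": "closed"},
    {"id": 2, "severity": 1, "state": "open"},
    {"id": 3, "severity": 2, "state": "fixed"},  # fixed but not yet verified
    {"id": 4, "severity": 3, "state": "closed"},
    {"id": 5, "severity": 2, "state": "open"},
]
summary = summarise_problems(reports)
print(summary)
```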
User feedback during the development is undoubtedly of major importance. The user evaluation section presents relevant data collected during the user evaluation activities. At this time, this section only contains a summary of the raised problem reports and their classification. It is planned to extend this section with information about covered applications, user types, countries, etc.
The post-release plan section contains an overview of the activities to be performed after the release of the product, together with the responsibilities, schedules and the date for the subsequent release. The post-release plan sends a clear signal that the project does not end with the release of the product. This helps prevent management from allocating all resources to new projects just after release. Instead, efficient planning of the transition phase between projects can be made.
Figure NBS-JAF 4: Release Template
Figure NBS-JAF 5: Stability Trend
Figure NBS-JAF 6: Error Trend
The dots in the graph represent the reported errors, while the line through the dots is the least-squares best fit based on the Weibull function. This line is extrapolated to predict how the error finding rate will evolve.
As can be seen, the graph is S-shaped and can be divided into three sections: the first has a slight slope at the beginning, the second is the mid-section with a nearly linear slope, and the third is where the graph flattens out. This S-shape is found to correlate with empirical data from software projects. At system test initiation, the error finding rate is low (as the functionality of the software is often restricted to a few areas). The error finding rate increases with the addition of new functionality and the introduction of new errors during the correction of errors already found. Entering the third section, the error finding rate begins to decrease, as it becomes harder to find new errors. Ultimately, the graph flattens. Finding further errors at this stage requires a huge test effort, which indicates that the software is possibly ready for release (or that the limit of the applied testing technique has been reached).
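The curve fitting behind such an error trend can be illustrated with a small sketch. The exact parametrisation used in the project is not stated here, so the example assumes that the cumulative number of errors found follows total * (1 - exp(-(t/scale)^shape)), i.e. the cumulative Weibull function scaled by the total number of errors. It also swaps in a simple grid search for whatever least-squares solver was actually used. All names and the data are hypothetical; the data are synthetic and noise-free.

```python
import math


def weibull_cdf(t, scale, shape):
    """Cumulative Weibull function: fraction of errors found by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def fit_error_trend(times, cumulative_errors, scales, shapes):
    """Least-squares fit of total * weibull_cdf(t, scale, shape).

    For each candidate (scale, shape) on the grid, the best `total` has a
    closed form because the model is linear in it. Returns the tuple
    (total, scale, shape) with the smallest sum of squared residuals.
    """
    best = None
    for scale in scales:
        for shape in shapes:
            f = [weibull_cdf(t, scale, shape) for t in times]
            denom = sum(v * v for v in f)
            if denom == 0.0:
                continue
            total = sum(y * v for y, v in zip(cumulative_errors, f)) / denom
            sse = sum((y - total * v) ** 2 for y, v in zip(cumulative_errors, f))
            if best is None or sse < best[0]:
                best = (sse, total, scale, shape)
    if best is None:
        raise ValueError("no usable grid point")
    return best[1:]


# Synthetic data (not the project's): weeks of system test and cumulative
# errors found, generated from total=400, scale=12, shape=2.
weeks = list(range(1, 16))
found = [400 * weibull_cdf(w, 12.0, 2.0) for w in weeks]

scales = [s / 2 for s in range(10, 41)]   # 5.0, 5.5, ..., 20.0
shapes = [c / 10 for c in range(10, 41)]  # 1.0, 1.1, ..., 4.0
total, scale, shape = fit_error_trend(weeks, found, scales, shapes)
remaining = total - found[-1]
print(f"estimated total {total:.0f}, not yet found {remaining:.0f}")
```

The difference between the fitted total and the number of errors found so far gives an estimate of the errors still to be found. With real, noisy data a finer grid or a proper non-linear least-squares routine would be the more natural choice.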
More details on the error trending can be found in ref. [1] and the results obtained by using it are discussed in The Results section.
Figure NBS-JAF 7: Result of Release Decision Questionnaire
One of the major improvements is the estimate of remaining unknown errors. This estimate is based on the error trend. The error trend based estimates are compared with the actual number of errors found in the table below. What we conclude from this is that the error trend based estimate is optimistic, i.e. it underestimates the number of remaining errors. It is not highly precise, but it is fairly consistent and far more realistic than a subjective estimate. Our experience is that the error trend based estimate is nearly always perceived as high, i.e. "Do we really have that many errors left?" In that case, it is important to point out that so far the estimate has always been too low.
1 All reports are counted, including change requests.
2 Only 3 months of data available. The other results are based on 6 months.
In this respect, the metrics used for supporting the release decision have shown their value. By giving the management group a more objective release decision basis, a higher degree of freedom in their decision has been obtained. A visible effect has been that management has decided not to delay releases in order to reduce the number of unknown defects at time of release, but to focus on a post-release plan to bring down the impact of post-release errors.
In the planning phase, the metrics have also shown their strength. The ability to give a qualified guess on the effort size of a system test project 9 months ahead, by use of the SW development time to system test time ratio, is convincing. During the system test, the error trend has given input to the planning of the remaining amount of test and needed resources for both testing and error correction.
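The ratio-based planning estimate can be sketched as follows, assuming the ratio is taken over the accumulated effort of earlier projects. The function names, the unit (person-weeks) and all figures are hypothetical.

```python
def development_to_test_ratio(history):
    """Ratio of total SW development time to total system test time.

    `history` is a list of (development_time, system_test_time) pairs
    from earlier projects, in the same unit.
    """
    development = sum(d for d, _ in history)
    system_test = sum(t for _, t in history)
    if system_test <= 0:
        raise ValueError("history must contain system test time")
    return development / system_test


def estimate_system_test_time(planned_development_time, ratio):
    """Estimate system test time from planned development time."""
    return planned_development_time / ratio


# Hypothetical history in person-weeks: (development, system test).
history = [(120, 30), (200, 40), (80, 20)]
ratio = development_to_test_ratio(history)
print(round(ratio, 2), round(estimate_system_test_time(160, ratio), 1))
```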
Furthermore, the metrics have taken on the role of a common reference. The stability and error trends in particular provide a shared basis for discussing the system state: a simple graph that is understood and accepted by top management, project management, developers and QA staff.
Metrics demand maturity or the will to mature
While defining relevant metrics, we soon discovered a need for clear definitions of the processes on which to base the measures. In other words, for the metrics to be relevant, a certain level of maturity is required. We did not initially have that level of maturity, but we used the work on metrics to trigger and drive the improvement of the process definitions. We experienced major benefits from that work, especially in terms of job motivation: time no longer has to be spent on working out how regular routines are performed, so more effort can be put into solving the specific task at hand. Moreover, when time is spent on the process, it is spent on improving it instead of on figuring out what the process is.
As much of the work was focused on the system test phase, the major impact has been seen in the system test group. The results obtained, as well as the discussions generated during the PIE, have helped greatly in forming a dynamic and committed group that considers metric-supported process improvement a vital part of the process.
The conclusion on the use of Goal-Driven Software Measurement to drive the definition of a metrics program is that it can be highly recommended. Although it involved far more work than initially anticipated, it was undoubtedly worthwhile. Looking back, it was a necessary step for raising the level of maturity to where measurements start to make sense. As we started out unaware of the missing process definitions etc., Goal-Driven Software Measurement was a perfect trigger for the needed improvement actions. Especially in the system test group, the work completed with Goal-Driven Software Measurement has helped establish a solid infrastructure consisting of well-defined, functional and efficient processes.
In relation to the release decision support, a clear and positive effect has been seen. The greatest positive effect has been seen for the error trend based estimation of number of remaining errors and for the post release plan. These improvements have also triggered an interest in other metrics based on available data. An example of this is the calculation of the general cost of delaying release and comparing that with the cost of field update of the SW. It showed that the cost of updating the SW on all scanners in the field, 6 months after release, equals the loss of delaying the release by only 10 days.
The substantial cost of delaying the release shows the enormous pressure to release early and emphasises the importance of choosing the right release time, as the consequences of a "premature" release may be unrecoverable.
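The comparison between the cost of a field update and the cost of delaying the release reduces to a break-even calculation. The sketch below assumes a constant loss per day of delay. The function name and both cost figures are hypothetical, chosen only so that the result matches the 10 days reported above.

```python
def break_even_delay_days(field_update_cost, delay_loss_per_day):
    """Number of days of release delay that cost as much as a field update."""
    if delay_loss_per_day <= 0:
        raise ValueError("delay loss per day must be positive")
    return field_update_cost / delay_loss_per_day


# Hypothetical figures in an arbitrary currency unit.
FIELD_UPDATE_COST = 500_000  # updating the SW on all scanners in the field
DELAY_LOSS_PER_DAY = 50_000  # loss caused by each day of release delay

print(break_even_delay_days(FIELD_UPDATE_COST, DELAY_LOSS_PER_DAY))
```

Under these assumptions, any delay longer than the break-even number of days costs more than correcting the software in the field afterwards.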
The developed release template will without doubt be used on future releases. It will be enhanced with code coverage and an improved user feedback section. It will also evolve towards defining release criteria by defining more target values.
In a broader sense, this work has helped establish process improvement as a natural part of daily life in the development department.
[2] Robert E. Park, Wolfhart B. Goethert, William A. Florac, "Goal-Driven Software Measurement – A Guidebook", Handbook CMU/SEI-96-HB-002
[3] Linda Rosenberg and Lawrence Hyatt, "Developing a Successful Metrics Program", Software Assurance Technology Center (SATC), USA, 1997
[4] Stephen H. Kan, "Metrics and Models in Software Quality Engineering", Addison-Wesley, USA