EuroSPI99 
Learn from the Past - Experience the Future
European Software Process Improvement
SPI and Measurement

Error Trending - Why and How

Niels Bruun Svendsen

B-K Medical A/S, Denmark
nbs@bkmed.dk
 

Introduction

How do you waste your money? Do you make the perfect, error-free product and lose the market while doing so, or do you get your product out "first thing" and drown in error corrections, patches and possibly field updates?
When developing systems and software, an inevitable management question is: "When is the system ready for release?". At the bottom line, deciding when to release a new product for production and sales is a matter of being able to estimate the cost of releasing as well as the cost of postponing the release.

When calculating the cost of releasing a product, the number of remaining unknown errors is a major factor. Error detection trends during the system-testing phase have therefore been introduced as a means of estimating the number of remaining unknown errors.
This paper shares the experiences gained and the lessons learned from introducing error trending as an estimation tool, and highlights the benefits found as well as the problems encountered.
The results include not only experiences with the precision of the estimates but also, and no less interesting, the impact of error trending on the organisation. It was found that the error trend had great value throughout the system-testing phase, and for all groups involved.

In short, this means that one simple curve gives input and insight to top management, project management, the QA function and developers, i.e. it becomes the common reference on the state of the system.
 

Company Context


B-K Medical develops, produces and markets ultrasound systems for medical diagnostic imaging. The systems are sold throughout the world, with the major markets being Europe, USA and Asia. B-K Medical has 250 employees, 166 of whom are located in Denmark. The development department consists of 60 employees, of whom 20 are involved in software development. B-K Medical is ISO 9001 certified, and most of the products have FDA market clearance and are CE-Medical Device certified. External audits are therefore performed accordingly. No formal assessment against a model has been performed, but an informal self-assessment has been carried out using the BootCheck tool from ESI. This assessment gave maturity ratings between 2.5 and 3.25, indicating some areas in need of improvement to reach the Defined (3) level, and a general lack of metrics as required at the Managed (4) level.
 

Project Initiation

The introduction of error trending at B-K was initiated by a management request for an improved basis for making the release decision, i.e. to decide whether or not to release a new product for production and sales. As part of the initiatives taken in order to pursue this goal, Error Trending was introduced. By using Error Trending to estimate the number of remaining unknown errors rather than using pure intuition, the objectivity of the basis for the release decision is increased.

Although aimed primarily at an estimate of remaining errors at the time of the release decision, error trending was introduced in the system-testing group as a tool to be used from the beginning of system test execution until the product is released. Beginning error trending early in the system-testing phase yielded a number of useful experiences, as described later on.

The initial steps with error trending were carried out on error data from a scanner that had been on the market for a year, so the number of errors reported after release was known. Error reports from the last part of the system-testing phase were used and plotted as seen in fig. 1. The y-axis shows the accumulated number of errors reported, and the x-axis shows the number of test days. A test day is equal to a calendar day, except that only calendar days on which tests were performed are included.

Fig.NBS.1 : Accumulated no. of errors for released product
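
The two axes of such a plot can be derived from dated error reports along the following lines. This is a minimal sketch, not the procedure actually used at B-K Medical; the function name `accumulated_errors` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def accumulated_errors(report_dates, test_dates):
    """Return the accumulated error count per test day.

    report_dates: one date per error report.
    test_dates:   calendar dates on which testing took place.
    Calendar days without testing are skipped, so the first entry of the
    result belongs to test day 1, the second to test day 2, and so on.
    """
    per_day = Counter(report_dates)
    ordered = sorted(set(test_dates))
    return list(accumulate(per_day[d] for d in ordered))


# Illustrative data only: a Friday, then Monday and Tuesday after a weekend.
tests = [date(1999, 3, 5), date(1999, 3, 8), date(1999, 3, 9)]
reports = [date(1999, 3, 5), date(1999, 3, 5), date(1999, 3, 9)]
print(accumulated_errors(reports, tests))  # [2, 2, 3]
```

The weekend does not appear on the x-axis, which matches the definition of a test day given above.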




Despite the fact that the test effort per test day was not known in any detail, the plotted error data gave quite a clear trend, with a distinct convergence in the last part. To start off with, very simple functions were tried out, using the trending functions in the MS Excel spreadsheet. None of the experiments using all data gave any trustworthy results. Our criteria for a result to be trustworthy were that the estimated trend correlated well with the last, converging part of the data, and that it gave an estimated total number of errors higher than the number of errors already found.

Finally it was decided to focus only on the latter part of the error data and to apply the exponential function to those data, as shown in fig. 2. It gave a perfect match with the number of errors actually found after release.

Although this may well have been influenced by the fact that we knew the result we should get, it did give some confidence that there was something useful here. Fig. 2 was used for raising internal interest in error trending, with the argument that:

Based on data with a great deal of uncertainty, you can apparently still draw and extrapolate a trend using the data from the final stage of system test and get a very good estimate of remaining errors.

The conclusion from the work with the data from the released product was that, although limited in amount and precision, it created a good initial interest in error trending and was a kick-off for going further into the subject.

Fig.NBS.2 : Exponential Error Trend for released product




The Model

When searching for experiences on error trending, the name of SATC (Software Assurance Technology Center) at NASA is very likely to pop up. SATC has published articles that mention their work on an Error Trending Model, ref. [1] and ref. [2]. The Error Trending Model was also mentioned by Linda Rosenberg, SATC, at a QWE’98 tutorial. As we did not find any further description of this model, Linda Rosenberg was contacted. We got a very quick response saying that work on the model was still in progress and that they were working on a tool to support it. We were also invited to send our data to SATC to have them analysed.

We decided to send data from the first part of system testing on a new scanner. Unfortunately our data did not give any valid results when analysed by the SATC Software Error Trending Tool. Instead they returned a spreadsheet with our data analysed by a Weibull variate. It differed from the Weibull function in relation to the manpower utilisation, but as this did not influence the estimate on the number of remaining errors, we decided to proceed with the Weibull function itself. The Weibull function has the form:




With p = 1, we have the exponential function and with p = 2, we have the Rayleigh curve. When used for trending, the parameters k, tmax and p are optimised to minimise the sum of squared differences between the curve and the observed data. The spreadsheet included a set-up for using the MS Excel solver to analyse additional data, and has formed the basis of our further work with error trending. We are therefore very thankful for this valuable input from SATC.
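
The least-squares optimisation can be sketched in a few lines of Python. The sketch assumes the cumulative form F(t) = k * (1 - exp(-(t / tmax)^p)), with tmax treated as a scale parameter; the exact parameterisation in the SATC spreadsheet may differ. It also replaces the MS Excel solver with a plain grid search. The function names and the synthetic data are hypothetical.

```python
import math


def weibull(t, k, tmax, p):
    """Assumed cumulative form: k * (1 - exp(-(t / tmax) ** p))."""
    return k * (1.0 - math.exp(-((t / tmax) ** p)))


def fit_weibull(days, accumulated, tmax_grid, p_grid):
    """Grid search over tmax and p; k follows in closed form for each pair.

    Returns (sum_of_squares, k, tmax, p) for the best combination found.
    """
    best = None
    for tmax in tmax_grid:
        for p in p_grid:
            g = [1.0 - math.exp(-((t / tmax) ** p)) for t in days]
            denom = sum(gi * gi for gi in g)
            if denom == 0.0:
                continue
            # Least-squares optimum of k for fixed tmax and p.
            k = sum(y * gi for y, gi in zip(accumulated, g)) / denom
            sse = sum((y - k * gi) ** 2 for y, gi in zip(accumulated, g))
            if best is None or sse < best[0]:
                best = (sse, k, tmax, p)
    return best


# Synthetic data generated from the assumed model, not taken from the paper.
days = list(range(1, 41))
observed = [weibull(t, 200.0, 25.0, 2.0) for t in days]

tmax_grid = [5.0 * i for i in range(1, 21)]  # 5, 10, ..., 100
p_grid = [0.5 * i for i in range(1, 9)]      # 0.5, 1.0, ..., 4.0
sse, k, tmax, p = fit_weibull(days, observed, tmax_grid, p_grid)

# Estimated remaining errors: fitted total minus errors found so far.
remaining = k - observed[-1]
print(round(k), tmax, p, round(remaining, 1))
```

Here k is the estimated total number of errors, so the estimate of remaining unknown errors is k minus the number already found. A fit where k does not exceed the number of errors already found would fail the trustworthiness criterion stated earlier.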

The Weibull function and the alternative models that could be used are described further in references [3], [4] and [5].

In fig. 3 the use of the Weibull function on the data from the released scanner is shown. The estimated number of errors remaining is 5. A total of 15 error reports have been made since release, including also change requests.


Fig.NBS.3 : Weibull Error Trend for released product

Multiple reasons for the difference can be, and have been, discussed: for example, was the system test as thorough as the "test" performed by having customers use the system, and did we have sufficient data to make a reliable estimate? In this case the errors found were not corrected; otherwise, errors introduced while making corrections could have been a reason.

The conclusion drawn on the estimate was that, although a bit low, it is still a good estimate, especially when taking into account the uncertainty and limited amount of data. The estimate indicated a system ready for release, and the number of errors found after release was also within an acceptable level.

Error Trending during System Test

As mentioned earlier, data from the first part of system test on a new scanner were analysed by SATC, NASA. The results, based on the Weibull function, gave a very high estimate on the number of remaining errors, as well as a high number of days to find the remaining errors.

When these results were presented to the project manager, we had the first direct impact on the project:

With that many test days left, we need more test objects

The presented error trend and estimates were the direct cause of additional test objects being arranged. The fact that the calculations on our data were made by NASA was used to increase confidence in the estimate.

A good reference gives confidence

From this point in the system test phase, daily updates of the trend and estimates were made, i.e. yesterday's reported errors were entered and new parameters for the Weibull function were calculated. The test days are here counted as test man-days, e.g. 3 testers working for one day results in 3 test days. This way we account for changes in test effort.
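
Counting test days as test man-days amounts to accumulating the number of testers per day along the x-axis. A minimal sketch follows; the staffing and error figures are hypothetical.

```python
from itertools import accumulate


def effort_axis(testers_per_day):
    """Accumulated test man-days: 3 testers working one day count as 3 test days."""
    return list(accumulate(testers_per_day))


# Hypothetical staffing and error counts for four calendar days of testing.
testers = [3, 3, 1, 2]
errors_found = [6, 4, 1, 2]

x = effort_axis(testers)            # test man-days
y = list(accumulate(errors_found))  # accumulated errors
print(list(zip(x, y)))  # [(3, 6), (6, 10), (7, 11), (9, 13)]
```

A day with only one tester moves the curve one step along the x-axis, while a day with three testers moves it three steps, so a drop in test effort is not mistaken for a drop in error detection rate.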

Fig.NBS.4 : Weibull Error Trend for new product

The new trend and estimates were presented on the "project wall" and on the Intranet, see fig. 4. A lot of internal interest was gained, and although not everyone understood that the error trend curve was optimised every day, it gave opportunities to discuss the state of the system under test as well as error trending in general.

During the last part of the system test phase the project manager gave a demonstration of the system for top management. A fully functioning scanner was demonstrated and, as often happens in these situations, the comment the project manager received was: "This scanner looks complete. Why don’t we release tomorrow or at least at the end of the week?". The standard answer to this question is that "we still need a little optimisation on the quality of the image" and "we haven’t got all parts in production quantities". But this time the project manager had another argument, namely the error trend and the estimate of remaining errors and test days. So he showed the error trend, saying: "See, we estimate the need for another 100 test days. With the number of scanners and testers we have, that means we’re finished in 30 days, and that is exactly the planned release date." That was very convincing and ended the discussion. Of course, the input from SATC at NASA again played a role in creating confidence in the estimate.

The product is finished. Why not release "tomorrow"?

The Error Trend holds the answer

This time it was top management, but next time it will be the sales staff asking for a release "tomorrow". The visualisation of the Error Trend makes it easy to communicate the probability of further errors to all types of staff in the company.

Not only the project manager but also the developers can make use of the Error Trend at this stage of the project. Typically another project is crying out for development resources as soon as the developers have finished their work on the current project. And there is a strong tendency for developers to be almost finished, i.e. "I have only a few more (known) errors to correct, then I’m finished. A few errors might pop up but we’ll fix them in-between the other work". Here the Error Trend is a great help too: simply divide the estimated number of remaining errors by the number of developers, and you have an estimate of how many more errors each developer has to correct. In our case, and probably in many others, this means that a considerable amount of time must be planned for before the resources are ready for the next project.

You have implemented it all. Why can’t you start on a new project "tomorrow"?

The Error Trend holds the answer
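
Both arguments rest on simple arithmetic: remaining test days divided by the daily test capacity, and remaining errors divided by the number of developers. A minimal sketch, with hypothetical function names and inputs (the paper gives neither the test capacity nor the number of developers on the project):

```python
import math


def calendar_days_left(remaining_test_days, test_days_per_calendar_day):
    """Calendar days needed when several testers or scanners work in parallel."""
    return math.ceil(remaining_test_days / test_days_per_calendar_day)


def errors_per_developer(remaining_errors, developers):
    """Average number of still-unknown errors each developer should plan for."""
    return remaining_errors / developers


# Hypothetical inputs, for illustration only.
print(calendar_days_left(100, 4))    # 25
print(errors_per_developer(120, 8))  # 15.0
```

The project manager's figures of 100 test days and 30 calendar days correspond to an average capacity of roughly 3.3 test man-days per calendar day.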

The value of the Error Trend and the estimates in the situations mentioned naturally depends on the precision of the estimates. However, we find that normal expectations are so far from reality that almost any estimate is better than none. The benefit is there if you can just show that there are "a lot" of errors left and not just "a few".
 
 

Fluctuations of the Error Trend

Fluctuations in the trend were expected, as new builds, new test techniques and the start of test in previously untested areas are very likely to initially increase the number of errors found. Conversely, when you test the same build with the same test technique, fewer and fewer errors will be found. This phenomenon is seen in fig. 4, where the first 24 test days constitute their own "S" curve and a large increase in error detection rate is seen as we enter what is referred to as functional test.

So fluctuations are seen:

In our case the largest fluctuations were seen when entering test of new features, and the smallest when introducing new builds.
 


Fig.NBS.5 : Evolvement of estimated total no. of errors




As the estimated total number of errors was calculated every day, these fluctuations had an impact on the estimate. There was therefore a need to visualise how this estimate evolved. A trend for the estimated total number of errors was added, as seen in fig. 5. The first estimate of 530 errors in total was the estimate received from SATC’s analysis of our data and the figure used to get additional test objects. As seen, the estimate was reduced somewhat during the first period where Error Trending was used, and we saw it stabilise at approximately 350 errors. But then, around the 65th test day, the estimate of the total number of errors suddenly increased drastically. This was caused by the fact that we had entered test of 2 previously untested areas that were found to have a much higher error density than what had been tested so far.
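
Besides plotting the daily estimates, sudden changes like the one described can be flagged automatically. The following is a minimal sketch and not something that was part of the original spreadsheet; the function name, the 25 % threshold and the series of estimates are hypothetical, with only the values 530 and 350 taken from the text.

```python
def flag_jumps(estimates, threshold=0.25):
    """Return positions where the estimate changed by more than `threshold`
    (relative to the previous estimate)."""
    flagged = []
    for i in range(1, len(estimates)):
        prev, cur = estimates[i - 1], estimates[i]
        if prev > 0 and abs(cur - prev) / prev > threshold:
            flagged.append(i)
    return flagged


# Illustrative series of estimated total errors, one value per update.
estimates = [530, 480, 400, 360, 350, 350, 520, 600]
print(flag_jumps(estimates))  # [6]
```

A flagged update is a prompt to look at what changed in the test, for example a new build or a previously untested area, rather than a verdict on the model.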

This increase in the estimated number of errors in the system naturally posed a problem for the project, both in getting development resources to correct the errors and in the extra time needed for correcting the errors and verifying the corrections. When discussing the situation we could see that this was not a new problem, but one we have "always" had. It is a result of the way we plan the system test, where we execute the test sequentially, function by function. The problem is visualised in fig. 6. The illustration shows a set of functionalities, where the "F" functionality is significantly more error prone than the others. The first case is how we have traditionally covered the test of such a system: with a test suite for each functionality, executing the test suites sequentially. This means that we will not have any knowledge of the, in this case, high error density of "F" until late in the system test execution phase.

Fig.NBS.6 : Test Sequences

Therefore we have changed the strategy for test planning slightly, making a test suite that covers all functionalities. This test suite will not cover any functionality in depth, but just enough to get an impression of the error density of the functionality. Use Cases will be used for designing this test suite. By executing the Use Case based test suite first, we will get valuable data for planning the execution of the remaining test suites. We will also get the possibility to reject functionalities early in the test process, limiting the time spent on system testing features that are not ready for system test. This way of planning the execution of the system test will be applied in two projects during autumn 1999.
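
One way to use the results of such a breadth-first test suite for planning is to rank the functionalities by observed error density. This is a minimal sketch under the assumption that density is measured as errors found per test day spent; the function name and the figures are hypothetical.

```python
def rank_by_error_density(results):
    """Rank functionalities by errors found per test day, highest first.

    results maps a functionality name to (errors_found, test_days_spent).
    Functionalities with no test days spent are left out.
    """
    density = {
        name: errors / days
        for name, (errors, days) in results.items()
        if days > 0
    }
    return sorted(density.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from a first, shallow pass over three functionalities.
first_pass = {"A": (1, 2), "B": (0, 2), "F": (9, 2)}
print(rank_by_error_density(first_pass))
# [('F', 4.5), ('A', 0.5), ('B', 0.0)]
```

A functionality at the top of the list is a candidate either for early in-depth testing or for rejection as not yet ready for system test.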

Common Sense has to be triggered

This change in the system test execution is not directly connected to the Error Trending. But the way the Error Trend visualised the problem was the trigger needed to realise it and to have a broader group of people discussing the problem and possible solutions.

In the final stage of the system test it was found that the estimated total number of errors remained at a very high level, even with many test days on which no or very few errors were found. When looking at the trend curve it was apparent that it was not following the actual error findings in the final stage of the system test very well. Therefore the initial part of the system test, where new features were still being added to the system, was omitted from the trend calculations in the final stage. So although starting Error Trending early in the system test had a valuable impact, even when not all of the system was ready, it was found that the estimates used in the latter part of system test had to be based solely on data starting from the time when the total system was available.
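
Omitting the initial part of the data can be sketched as follows. The sketch assumes that both axes are restarted from the last omitted observation; whether the original calculations re-based the data in this way is not stated. The function name and the sample series are hypothetical.

```python
def rebase(days, accumulated, start_day):
    """Drop observations before start_day and restart both axes
    from the last dropped observation."""
    base_day, base_errors = 0, 0
    kept = []
    for d, y in zip(days, accumulated):
        if d < start_day:
            base_day, base_errors = d, y
        else:
            kept.append((d - base_day, y - base_errors))
    return kept


# Hypothetical series: test man-days and accumulated errors.
days = [3, 6, 7, 9]
accumulated = [6, 10, 11, 13]
print(rebase(days, accumulated, start_day=7))  # [(1, 1), (3, 3)]
```

The errors found before the start day are still known errors, so they must be added back when reporting the estimated total for the whole system test.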
 
 

Summary

The experiences with the work performed with Error Trending can be summarised in the following Why’s and How’s:

Why:

How:

Conclusions

We started off aiming at a technique to estimate the number of remaining errors at the time of possible release. What we found was a technique that, apart from doing that, was able to trigger common sense in several processes related to the system test phase. Improvement was triggered in relation to:

So far we have limited data on the precision we can obtain, but we have found that even with limited precision there is a lot of benefit in collecting and presenting data which in many cases are fairly easy to get hold of.

Apart from the models mentioned, we also tried using 3rd order polynomial approximation as suggested by Grove, but had some problems getting estimates we believed in. Trust in the model is a key issue when the idea is to be "sold" internally, and a good reference also plays a key role in that respect. But whatever model you choose, don’t trust it blindly. Keep your common sense and professional knowledge, but let Error Trending help you stay objective, and use it as a means of communicating between personnel groups.

Get your Error Trending started – You won’t regret it
 

References

[1] Robert E. Waterman, Lawrence E. Hyatt: "Testing - When Do I Stop?", International Testing and Evaluation Conference, Washington, DC, October 1994

[2] Dr. Linda Rosenberg, Ted Hammer, Jack Shaw: "Software Metrics and Reliability", 9th International Symposium on Software Reliability Engineering, Germany, November 1998

[3] Stephen H. Kan: "Metrics and Models in Software Quality Engineering", Addison Wesley, ISBN 0-201-63339-6

[4] J.D. Musa, A. Iannino, K. Okumoto: "Software Reliability: Measurement, Prediction, Application", ISBN 0-07-044093-X

[5] J.D. Musa: "Software Reliability Engineering: More Reliable Software, Faster Development and Testing", ISBN 0-07-913271-5
 


Editors
ISCN LTD, ISCN GesmbH, Schieszstattgasse 4/24, 8010 Graz, and Coordination Office, Florence House, 1 Florence Villas, Bray, Ireland, office@iscn.at, office@iscn.com, office@iscn.ie, Editing Done: 19.7.2002