Error Trending - Why and How
Niels Bruun Svendsen
B-K Medical A/S, Denmark nbs@bkmed.dk
In calculating the cost of releasing a product, the number of remaining unknown errors is a major factor. Error detection trends during the system-testing phase have therefore been introduced as a means of estimating the number of remaining unknown errors. This paper shares the experiences gained and the lessons learned from introducing error trending as an estimation tool, and highlights the benefits found as well as the problems encountered. The results include not only experiences with the precision of the estimates but also, no less interestingly, the impact of error trending on the organisation. It was found that the error trend had great value throughout the system-testing phase, and for all groups involved.
B-K Medical develops, produces and markets ultrasound systems for medical diagnostic imaging. The systems are sold throughout the world, with the major markets being Europe, the USA and Asia. B-K Medical has 250 employees, 166 of whom are located in Denmark. The development department consists of 60 employees, of whom 20 are involved in software development. B-K Medical is ISO 9001 certified, and most of the products have FDA market clearance and are CE-Medical Device certified; external audits are performed accordingly. No formal assessment against a model has been performed, but an informal self-assessment has been made using the BootCheck tool from ESI. This assessment gave maturity ratings between 2.5 and 3.25, indicating some areas in need of improvement to reach the Defined (3) level, and a general lack of metrics as required at the Managed (4) level.
Although aimed primarily at estimating the remaining errors at the time of the release decision, error trending was introduced in the system-testing group as a tool to be used from the beginning of system test execution until the product is released. Starting error trending early in the system-testing phase gave a number of good experiences, as described later.
The initial steps with error trending were taken on error data from a scanner that had been on the market for a year, so the number of errors reported after release was known. Error reports from the last part of the system-testing phase were used and plotted as shown in fig. 1. The y-axis shows the accumulated number of errors reported, and the x-axis shows the number of test days. A test day is equal to a calendar day, except that only calendar days on which tests were performed are included.
Fig.NBS.1 : Accumulated no. of errors for released product
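The construction of such a curve can be sketched as follows. This is a minimal illustration only: the function name, the input layout (one date per error report plus a list of dates on which testing took place) and the sample dates are assumptions, not taken from the project data.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def accumulated_errors(report_dates, test_dates):
    """Return the accumulated number of error reports per test day.

    test_dates   -- calendar dates on which testing took place
    report_dates -- one date per error report (assumed to fall on a test date)
    """
    test_days = sorted(set(test_dates))        # test day 1, 2, 3, ...
    per_day = Counter(report_dates)            # reports filed on each date
    daily = [per_day[d] for d in test_days]    # 0 for test days without reports
    return list(accumulate(daily))


# Made-up example: no testing on 3 March, so that date is not a test day.
tests = [date(1999, 3, 1), date(1999, 3, 2), date(1999, 3, 4)]
reports = [date(1999, 3, 1), date(1999, 3, 1), date(1999, 3, 4)]
curve = accumulated_errors(reports, tests)     # one value per test day
```

Plotting the returned list against its index plus one gives the kind of curve described for fig. 1.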
Despite the fact that the test effort per test day was not known in any detail, the plotted error data gave a quite clear trend, with a distinct convergence in the last part. To start with, very simple functions were tried out, using the trending functions in the MS Excel spreadsheet. None of the experiments using all the data gave trustworthy results. Our criteria for a trustworthy result were that the estimated trend correlated well with the last, converging part of the data, and that it gave an estimated total number of errors higher than the number of errors already found.
Finally it was decided to focus only on the latter part of the error data and to fit the exponential function to those data, as shown in fig. 2. It gave a perfect match with the number of errors actually found after release.
Although this result may well have been influenced by the fact that we knew the answer we should get, it did give some confidence that there was something useful here. Fig. 2 was used to raise internal interest in error trending.
Fig.NBS.2 : Exponential Error Trend for released product
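A fit of this kind can be sketched without a spreadsheet. The sketch below assumes a saturating exponential N(t) = k(1 - exp(-t/tau)), where k is the estimated total number of errors; this form, the function name and the synthetic data are assumptions for illustration and are not necessarily what the Excel trend function computes. For a fixed tau the least-squares k has a closed form, so only tau is searched.

```python
import math


def fit_exponential(days, cumulative):
    """Least-squares fit of N(t) = k * (1 - exp(-t / tau)).

    days must be positive. Returns (k, tau, sse).
    """
    span = max(days)
    best = None
    for e in range(-100, 101):                 # tau from 0.01x to 100x the span
        tau = span * 10 ** (e / 50)
        g = [1.0 - math.exp(-t / tau) for t in days]
        # For a fixed tau the best k is a simple ratio of sums.
        k = sum(y * x for y, x in zip(cumulative, g)) / sum(x * x for x in g)
        sse = sum((y - k * x) ** 2 for y, x in zip(cumulative, g))
        if best is None or sse < best[2]:
            best = (k, tau, sse)
    return best


# Synthetic "latter part" of a test phase (true k = 120, tau = 15).
tail_days = list(range(20, 41))
tail_counts = [120 * (1 - math.exp(-t / 15)) for t in tail_days]
k, tau, sse = fit_exponential(tail_days, tail_counts)
remaining = k - tail_counts[-1]                # estimated errors still to be found
```

The second trustworthiness criterion above corresponds to checking that k exceeds the last accumulated count, i.e. that `remaining` is positive.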
The Model
When searching for experiences with error trending, the name of SATC (Software Assurance Technology Center) at NASA is very likely to come up. SATC has published articles that mention their work on an Error Trending Model, ref. [1] and ref. [2]. The Error Trending Model was also mentioned by Linda Rosenberg of SATC at a QWE’98 tutorial. As we did not find any further description of this model, we contacted Linda Rosenberg. We got a very quick response saying that work on the model was still in progress and that they were working on a tool to support it. We were also invited to send our data to SATC to have them analysed.
We decided to send data from the first part of system testing on a new scanner. Unfortunately our data did not give any valid results when analysed by the SATC Software Error Trending Tool. Instead SATC returned a spreadsheet with our data analysed using a Weibull variate. It differed from the Weibull function with respect to manpower utilisation, but as this did not influence the estimate of the number of remaining errors, we decided to proceed with the Weibull function itself. The Weibull function has the form:
With p = 1 we have the exponential function, and with p = 2 the Rayleigh curve. When used for trending, the parameters k, tmax and p are optimised to minimise the sum of squared differences between the curve and the data. The spreadsheet included a set-up for using the MS Excel solver to analyse additional data, and has formed the basis of our further work with error trending. We are therefore very thankful for this valuable input from SATC.
The Weibull function and the alternative models that could be used are described further in references [3], [4] and [5].
Fig. 3 shows the Weibull function applied to the data from the released scanner. The estimated number of errors remaining is 5. A total of 15 error reports, including change requests, have been made since release.
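One cumulative parameterisation consistent with the parameters k, tmax and p named above is sketched here. The exact scaling used in the SATC spreadsheet is not known, so this form, the symbol N(t) for the expected accumulated number of errors after t test days, and y_i for the observed accumulated counts are all assumptions.

```latex
N(t) = k\left[1 - \exp\!\left(-\left(\frac{t}{t_{\max}}\right)^{p}\right)\right],
\qquad
\min_{k,\;t_{\max},\;p}\;\sum_{i}\bigl(N(t_i) - y_i\bigr)^{2}
```

In this form k is the estimated total number of errors, so the estimated number remaining after the last observation y_n is k - y_n. Setting p = 1 gives a saturating exponential and p = 2 gives a Rayleigh-type curve, up to the scale convention chosen for tmax.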
Fig.NBS.3 : Weibull Error Trend for released product
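The optimisation of k, tmax and p can be sketched in plain Python as a stand-in for the MS Excel solver. The cumulative form k(1 - exp(-(t/tmax)^p)), the grid ranges and the synthetic data are assumptions for illustration; a real solver would refine the parameters continuously rather than on a grid.

```python
import math


def weibull(t, k, tmax, p):
    """Assumed cumulative form: k * (1 - exp(-(t / tmax) ** p))."""
    return k * (1.0 - math.exp(-((t / tmax) ** p)))


def fit_weibull(days, cumulative):
    """Grid-search least-squares fit; returns (k, tmax, p, sse)."""
    span = max(days)
    best = None
    for i in range(10, 61):                    # p = 0.50, 0.55, ..., 3.00
        p = i / 20
        for e in range(-50, 76):               # tmax from 0.1x to about 32x the span
            tmax = span * 10 ** (e / 50)
            g = [weibull(t, 1.0, tmax, p) for t in days]
            gg = sum(x * x for x in g)
            if gg == 0.0:
                continue
            # For fixed tmax and p the best k has a closed form.
            k = sum(y * x for y, x in zip(cumulative, g)) / gg
            sse = sum((y - k * x) ** 2 for y, x in zip(cumulative, g))
            if best is None or sse < best[3]:
                best = (k, tmax, p, sse)
    return best


# Synthetic data generated from k = 200, tmax = 30, p = 2.
sample_days = list(range(1, 61))
sample_counts = [weibull(t, 200, 30, 2) for t in sample_days]
k, tmax, p, sse = fit_weibull(sample_days, sample_counts)
```

The estimate of remaining errors is then k minus the last accumulated count, and the fitted curve can be inverted to estimate how many further test days are needed to reach a given fraction of k.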
Multiple reasons for the difference can be, and have been, discussed: for example, was the system test as thorough as the "test" performed by having customers use the system, and did we have sufficient data to make a reliable estimate? In this case the errors found were not corrected; otherwise, errors introduced while making corrections could have been a reason.
The conclusion drawn was that the estimate, although a bit low, is still a good one, especially when taking into account the uncertainty and the limited amount of data. The estimate indicated a system ready for release, and the errors found after release were also within an acceptable level.
Error Trending during System Test
As mentioned earlier, data from the first part of system test on a new scanner were analysed by SATC, NASA. The results, based on the Weibull function, gave a very high estimate of the number of remaining errors, as well as a high number of days needed to find them.
When these results were presented to the project manager, we saw the first direct impact on the project:
With that many test days left, we need more test objects
The error trend and estimates presented were the direct cause of additional test objects being arranged. The fact that the calculations on our data were made by NASA was used to increase confidence in the estimate.
A good reference gives confidence
From this point in the system test phase, the trend and estimates were updated daily, i.e. yesterday's reported errors were entered and new parameters for the Weibull function were calculated. The test days are here counted as test man-days, e.g. 3 testers working one day results in 3 test days. In this way we account for changes in test effort.
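The man-day counting can be sketched as below. The function name, the log layout (number of testers and errors found per calendar test day) and the sample figures are hypothetical.

```python
from itertools import accumulate


def to_man_day_series(daily_log):
    """Convert a daily log into cumulative series for trending.

    daily_log -- list of (testers, errors_found), one entry per calendar test day
    Returns (cumulative_test_man_days, cumulative_errors).
    """
    man_days = list(accumulate(testers for testers, _ in daily_log))
    errors = list(accumulate(found for _, found in daily_log))
    return man_days, errors


# Made-up log: 3 testers on day one, 2 on day two, 3 on day three.
log = [(3, 7), (2, 4), (3, 5)]
x_axis, y_axis = to_man_day_series(log)
```

Using the cumulative man-days as the x-axis means that a day with fewer testers advances the curve by fewer test days, which is how changes in test effort are accounted for.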
Fig.NBS.4 : Weibull Error Trend for new product
The new trend and estimates were presented on the "project wall" and on the Intranet, see fig. 4. A lot of internal interest was gained, and although not everyone understood that the error trend curve was re-optimised every day, it gave opportunities to discuss the state of the system under test as well as error trending in general.
During the last part of the system test phase the project manager gave a demonstration of the system to top management. A fully functioning scanner was demonstrated, and as often happens in these situations the comment the project manager received was: "This scanner looks complete. Why don’t we release tomorrow or at least at the end of the week?". The standard answer to this question is that "we still need a little optimisation on the quality of the image" and "we haven’t got all parts in production quantities". But this time the project manager had another argument, i.e. the error trend and the estimate of remaining errors and test days. So he showed the error trend, saying: "See, we estimate the need for another 100 test days. With the number of scanners and testers we have, that means we’re finished in 30 days, and that is exactly the planned release date." That was very convincing and ended the discussion. Of course the input from SATC at NASA again played a role in creating confidence in the estimate.
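The arithmetic behind this argument can be written out as follows. The figures 100 and 30 are those quoted above; the daily test capacity is not stated in the text and is only implied by them, and whether the 30 days are calendar days or working days is likewise not stated.

```latex
\text{days to finish} \approx
\frac{\text{remaining test man-days}}{\text{test man-days per day}},
\qquad
\frac{100\ \text{test man-days}}{30\ \text{days}} \approx 3.3\ \text{test man-days per day}
```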
The product is finished. Why not release "tomorrow"?
The Error Trend holds the answer
This time it was top management, but next time it will be the sales staff asking for a release "tomorrow". The visualisation of the Error Trend makes it easy to communicate the probability of further errors to all types of staff in the company.
Not only the project manager but also the developers can make use of the Error Trend at this stage of the project. Typically another project is crying out for development resources as soon as they have finished their work on the current project. And there is a strong tendency for developers to be "almost finished", i.e. "I have only a few more (known) errors to correct, then I’m finished. A few errors might pop up, but we’ll fix them in between the other work". Here the Error Trend is a great help too: simply divide the estimated number of remaining errors by the number of developers to get an estimate of how many more errors each developer has to correct. In our case, and probably in many others, this means that a considerable amount of time has to be planned for before the resources are ready for the next project.
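The division described above is simple enough to do by hand, but a small helper makes the assumptions explicit. The function name and the sample figures are hypothetical, and an even split across developers is assumed.

```python
import math


def errors_per_developer(estimated_total, found_so_far, developers):
    """Estimated number of further errors each developer will have to correct."""
    if developers < 1:
        raise ValueError("at least one developer is required")
    remaining = max(0.0, estimated_total - found_so_far)
    return math.ceil(remaining / developers)   # round up: partial errors do not exist


# Made-up figures: 350 estimated in total, 290 found so far, 6 developers.
share = errors_per_developer(350, 290, 6)
```

Multiplying the result by a typical correction time per error, if one is known, turns it into a rough estimate of when the developers are free for the next project.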
You have implemented it all. Why can’t you start on a new project "tomorrow"?
The value of the Error Trend and the estimates in the situations mentioned naturally depends on the precision of the estimates. However, we find that normal expectations are so far from reality that almost any estimate is better than none. The benefit is there if you can just show that there are "a lot" of errors left and not just "a few".
So fluctuations are seen:
Fig.NBS.5 : Evolvement of estimated total no. of errors
As the estimated total number of errors was calculated every day, these fluctuations had an impact on the estimate. There was therefore a need to visualise how this estimate evolved. A trend for the estimated total number of errors was added, as seen in fig. 5. The first estimate of 530 errors in total was the estimate received from SATC’s analysis of our data and the figure used to obtain additional test objects. As seen, the estimate was reduced somewhat during the first period in which Error Trending was used, and we saw it stabilise at approximately 350 errors. But then, around the 65th test day, the estimates of the total number of errors suddenly increased drastically. This was because we had started testing 2 previously untested areas that were found to have a much higher error density than what had been tested so far.
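A jump of this kind can also be flagged automatically once the daily estimates are kept as a series. The sketch below compares each new estimate with the median of the preceding ones; the function name, window length, tolerance and most of the sample values are assumptions (only the 530 and 350 figures echo the text).

```python
from statistics import median


def flag_jumps(estimates, window=5, tolerance=0.25):
    """Return indices where an estimate deviates from the median of the
    previous `window` estimates by more than `tolerance` (as a fraction)."""
    flagged = []
    for i in range(window, len(estimates)):
        base = median(estimates[i - window:i])
        if base > 0 and abs(estimates[i] - base) / base > tolerance:
            flagged.append(i)
    return flagged


# Illustrative series: settling around 350, then rising sharply.
daily_estimates = [530, 480, 420, 380, 350, 355, 348, 352, 350, 470, 520]
jumps = flag_jumps(daily_estimates)
```

A flagged day is a prompt to ask what changed in the test, such as a newly entered test area, rather than a verdict in itself.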
This increase in the estimated number of errors in the system naturally posed a problem for the project, both in getting development resources to correct the errors and in the extra time needed for the corrections and their verification. When discussing the situation we could see that this was not a new problem, but one we have "always" had. It is a result of the way we plan the system test, where we execute the test sequentially, function by function. The problem is visualised in fig. 6. The illustration shows a set of functionalities, where the "F" functionality is significantly more error prone than the others. The first case is how we have traditionally covered the test of such a system, with a test suite for each functionality and the test suites executed sequentially. This means that we will not have any knowledge of the (in this case) high error density of "F" until late in the system test execution phase.
Fig.NBS.6 : Test Sequences
Therefore we have changed the strategy for test planning slightly, by making a test suite that covers all functionalities. This test suite will not cover any functionality in depth, but just enough to get an impression of the error density of the functionality. Use Cases will be used for designing this test suite. By executing the Use Case based test suite first, we will get valuable data for planning the execution of the remaining test suites. We will also be able to reject functionalities early in the test process, limiting the time spent on system testing features that are not ready for system test. This way of planning the execution of the system test will be applied in two projects during autumn 1999.
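How the results of such a shallow first pass might feed the planning can be sketched as below. Measuring error density as errors found per test hour, the function name and the sample figures are all assumptions; any other effort measure could be substituted.

```python
def rank_by_error_density(shallow_results):
    """Order functionalities by errors found per test hour, highest first.

    shallow_results -- dict mapping functionality name to
                       (errors_found, test_hours) from the first, shallow suite
    """
    density = {
        name: errors / hours
        for name, (errors, hours) in shallow_results.items()
        if hours > 0                           # skip functionalities not yet exercised
    }
    return sorted(density, key=density.get, reverse=True)


# Made-up results in which "F" stands out, as in the illustration.
first_pass = {"A": (1, 2.0), "C": (1, 4.0), "F": (9, 2.0)}
order = rank_by_error_density(first_pass)
```

The functionalities at the top of the list are candidates either for early in-depth testing or for being returned to development before more test time is spent on them.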
Common Sense has to be triggered
This change in the system test execution is not directly connected to Error Trending. But the visualisation of the problem provided by the Error Trend was the trigger needed to recognise it and to get a broader group of people discussing the problem and possible solutions.
In the final stage of the system test it was found that the estimated total number of errors remained at a very high level, even with many test days on which no or very few errors were found. Looking at the trend curve, it was apparent that it was not following the actual error findings in the final stage of the system test very well. Therefore the initial part of the system test, where new features were still being added to the system, was omitted from the trend calculations in the final stage. So, just as there was a valuable impact from starting Error Trending early in the system test, even though not all of the system was ready, it was found that the estimates to be used in the latter part of system test had to be based solely on data starting from the time when the total system was available.
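Omitting the initial part of the data can be sketched as below. Re-basing the remaining data so that the window starts from zero errors is an assumption made here for illustration; the text does not say how the omitted errors were treated. The function name and the sample figures are hypothetical.

```python
def trim_to_complete_system(days, cumulative, first_complete_day):
    """Keep observations from first_complete_day onwards, re-based to the
    last observation before that day.

    Returns (rebased_days, rebased_counts, errors_before_window).
    """
    before = [(d, c) for d, c in zip(days, cumulative) if d < first_complete_day]
    kept = [(d, c) for d, c in zip(days, cumulative) if d >= first_complete_day]
    if not kept:
        raise ValueError("no observations on or after first_complete_day")
    base_day, base_count = before[-1] if before else (0, 0)
    rebased_days = [d - base_day for d, _ in kept]
    rebased_counts = [c - base_count for _, c in kept]
    return rebased_days, rebased_counts, base_count


# Made-up data: the complete system is available from test day 3.
t, y, already_found = trim_to_complete_system(
    [1, 2, 3, 4, 5], [10, 25, 30, 33, 35], first_complete_day=3
)
```

A curve fitted to the re-based series estimates the errors in the complete system from that point on; adding back `already_found` gives a total comparable with the earlier estimates.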
Why:
Apart from the models mentioned, we also tried using a 3rd order polynomial approximation as suggested by Grove, but had some problems getting estimates we could believe in. Trust in the model is a key issue when the idea is to be "sold" internally, and a good reference also plays a key role in that respect. But whatever model you choose, don’t trust it blindly. Keep your common sense and professional knowledge, but let Error Trending help you stay objective, and use it as a means of communication between personnel groups.
Get your Error Trending started – You won’t regret it
[2] Dr. Linda Rosenberg, Ted Hammer, Jack Shaw: "Software Metrics and Reliability", 9th International Symposium on Software Reliability Engineering, Germany, Nov. 1998
[3] Stephen H. Kan: "Metrics and Models in Software Quality Engineering", Addison Wesley, ISBN 0-201-63339-6
[4] J.D. Musa, A. Iannino, K. Okumoto: "Software Reliability: Measurement, Prediction, Application", ISBN 0-07-044093-X
[5] J.D. Musa: "Software Reliability Engineering, More Reliable Software, Faster Development and Testing", ISBN 0-07-913271-5