Promote: Metrics & Models to improve the test process
Traversa Mauro Elsag spa, Genoa - Italy
Working, as a case study, on a critical software component (the address recognition task on a postal object), we:
The SRS Department, where the PIE has been performed, has the following organization:
The test team (about 12 members) uses TestDirector by Mercury as the database of test cases and WinRunner by Mercury for the validation of the graphical user interfaces.
The test team (during the testing phase) and the business units (after delivery of the products) use PVCS Tracker to report the detected anomalies, while PVCS Version Manager is used by the Software Configuration Management Team (3 members) for change control and release management activities.
The key factors of a postal machine are the level of reliability of its hardware and software components (which are expected to work 24 hours a day), the number of postal objects sorted per hour and the percentage of addresses on the postal objects that are recognized automatically. This last factor is very relevant because it lets the customer tune the number of human resources needed to complete the address recognition task manually. The case study considered in the PIE concerns this strategic software component.
A simple estimation of the size and complexity of this product can be derived from the following table:
The OCREngine system software plays a strategic role among the products of Elsag spa, and we can affirm that many customers decide whether to buy our postal system according to the quality of our address recognition task for a postal object. The percentage of recognized addresses is typically a basic requirement in the customer specifications and it is always a basic acceptance test for our customers.
For the PIE, Elsag selected a software module whose task is the identification of every single handwritten or typewritten character of the address on a postal object. This software is composed of 30 routines (1180 lines of code) written in C/C++.
This probability is defined as:
p(faults) = p(FP, CONTROL, V(G), LOC, Lines, Comments), with
log(p / (1 - p)) = -4.76740505 + 0.4789582 FP - 0.68208893 CONTROL + 0.6643859 V(G) - 0.50226561 LOC + 0.4514182 Lines - 0.42495702 Comments
where FP, CONTROL, V(G), LOC, Lines and Comments are classic static metrics [1][2]. See Appendix 3 for a description of these metrics and for the criteria used to evaluate the goodness of the model.
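As an illustration only, the static model can be evaluated for one routine by inverting the logit. The sketch below assumes that log is the natural logarithm (the usual convention for logistic regression) and that the metrics enter the formula as raw, unscaled values; neither point is stated in this report. The function name and the metric values of the example routine are hypothetical.

```python
import math

# Intercept and coefficients copied from the static-model formula.
INTERCEPT = -4.76740505
COEFFICIENTS = {
    "FP": 0.4789582,
    "CONTROL": -0.68208893,
    "V(G)": 0.6643859,
    "LOC": -0.50226561,
    "Lines": 0.4514182,
    "Comments": -0.42495702,
}


def fault_probability(metrics):
    """Return p(faults) for one routine, given a dict of its static metrics."""
    # Linear predictor: log(p / (1 - p)), natural logarithm assumed.
    logit = INTERCEPT + sum(
        coefficient * metrics[name] for name, coefficient in COEFFICIENTS.items()
    )
    # Invert the logit to obtain the probability.
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical metric values for a single routine (not measured data).
example_routine = {
    "FP": 3, "CONTROL": 5, "V(G)": 8, "LOC": 40, "Lines": 60, "Comments": 10,
}
print(f"p(faults) = {fault_probability(example_routine):.3f}")
```

Routines can then be ranked by this probability so that test effort is directed first at the routines with the highest values.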
param = 1.12013763 eLOC + 0.31751306 Comments - 0.4334234 Lines - 1.204101 FP + 6.06374051 CO - 6.27636 LFC + 8.29360992 eV(G) + 0.32320117 OC - 3.8617338 NEST - 0.9116793 n1 - 6.629129 N + 0.44606073 n - 0.0298177 V + 0.00011847 E - 5.4163831 BRANCH - 3.239876 ANION - 3.3306005 EXEC + 2.23758035 QCP_MAINT
See Appendix 3 for a description of the metrics mentioned in the above formulas and criteria to evaluate the goodness of the model.
We assign each module to the class that has the highest probability according to the above formulas, and from that class we evaluate its level of potential residual faultiness.
Furthermore, it is possible to estimate n, the minimum number of test cases that must pass through each routine in order to find a fault, if one is present in the module. According to the definition of the reveal/pass-through rate, this number is 1/xmin, rounded up to the next whole number, where xmin is the lower bound of the reveal/pass-through rate of the modules belonging to Class i (e.g. if a routine belongs to Class 2, xmin is 0.054 and consequently n = 19). The definition of n is rather intuitive; to understand it better, consider the limit situation in which every test case passing through a routine detects a fault. In this case the rate is equal to 1 and the minimum number of test cases is also 1: if any test case is able to detect the fault, only one test case needs to be executed.
We consider testing complete when, for every routine, the number of test cases passing through it is greater than n.
After the definition of metrics, models and tools, the project entered its two most significant technical phases.
The first one may be defined as the first experience phase (the offline phase). We used the tools to capture the selected measures on a delivered version of the case study, with the aim of "taking a photograph" of the initial state of the art in terms of violations, complexity and level of faultiness of the delivered software.
At the end of this task we defined a first version of the new process applicable to the current industrial process, and the developers had useful information for improving their software in terms of violations and complexity.
These improvements were introduced in a new release of the case study that had already been planned with our customers for bug fixing and minor technical enhancements. We allocated extra time (30% more than originally planned) to let our developers introduce the improvements required by the new process. It is important to underline that we applied the new process in a "realistic industrial scenario" (the online phase), where delivery time was one of the typical constraints of any product roadmap and the introduction of the improvements identified in the offline phase was only one of the objectives of the new release. We chose this approach in order to simulate the real scenario of any possible future extension of the new process to other software components, where industrial constraints of time and cost must be considered.
Once the improvements had been introduced and the new release of the baseline project had been delivered to the test team, the last technical phase could officially start. In this last step we captured the selected metrics again and measured the actual level of improvement we had reached in terms of violations and complexity. We applied the static model to the delivered software in order to give the test team accurate information about the routines with the highest probability of containing faults, so that the test activities could be focused on them. Finally, while test activities were running and coverage measures could be captured, we applied the dynamic model, calculating the probability of still having faults after delivery and the minimum number of useful test cases needed to assure a significant level of coverage. Both pieces of information were a great help in defining a criterion for stopping test activities on the new version of the case study.
20% of test cases do not increase the level of coverage.
Time for test: reduction of 10 %
These models seem to work well in our environment with our software. For the static model this can be confirmed by comparing, for every routine, the probability of having faults with the number of anomalies detected in the testing phase. We found a good correspondence between the routines identified by the static model as critical and the reports of the test team: the largest number of bugs was indeed detected in these routines; on the other hand, only a very small number of non-critical bugs was detected in the routines identified by the model as fault-free or as having a very low probability of faults.
As for the dynamic model, our confidence in it is based on the first reports coming from external customers about the quality of the software delivered after the introduction of the new process. Again, the number of bugs they report is decreasing and the anomalies tend to be concentrated in the routines classified in Class 1. This is the most significant way to demonstrate the goodness of the model and of the new process we introduced.
More interesting from a technical point of view is the introduction of the statistical models developed by CEFRIEL. The experiments performed and the results obtained clearly indicate a high correlation between static metrics (used in both models) and the presence of faults. The detailed statistical analysis, performed with different statistical methods, indicates that for a homogeneous class of software and a limited set of severe faults it is possible to build a statistical model based on a reasonably small set of metrics.
Business impacts from the test point of view are more immediate and relevant. The result table reports a 10% reduction of the time for test (with respect to the time originally planned), which is a first significant result considering that the case study represents only a small percentage of the software to be tested. This was possible because test activities were more focused on the functionalities implemented by the routines with a higher probability of faults. In general this information is also very useful for a better estimation of test times and costs, and the test manager will use it for planning future activities.
Another important aid in reducing the time for test is the estimate of the minimum number of test cases that are worth executing. In the experiment it was calculated that 20% of the test cases do not increase the level of coverage.
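One simple way to find test cases that do not increase coverage is to scan the suite in execution order and flag every test case whose covered items were all reached by earlier test cases. The sketch below is an assumption about how such a figure could be computed, not a description of the procedure used in the experiment; the function name and the coverage data are hypothetical, and the result depends on the order in which the test cases are examined.

```python
def redundant_test_cases(coverage_by_test):
    """Return the test cases that add no newly covered item.

    coverage_by_test maps a test-case name to the set of items (statements,
    branches or routines) it covers; test cases are examined in insertion order.
    """
    covered = set()
    redundant = []
    for name, items in coverage_by_test.items():
        if items <= covered:      # everything was already reached earlier
            redundant.append(name)
        covered |= items
    return redundant


# Hypothetical coverage data: routines reached by each test case.
suite = {
    "tc1": {"r1", "r2"},
    "tc2": {"r2"},
    "tc3": {"r3"},
    "tc4": {"r1", "r3"},
    "tc5": {"r4"},
}
redundant = redundant_test_cases(suite)
print(redundant, f"{len(redundant) / len(suite):.0%} of the suite")
```

A test case flagged in this way may still be worth keeping if it exercises different data on the same paths, so the list is a starting point for review rather than a deletion list.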
Finally, we cannot forget the obvious and most relevant impact: the quality of the product delivered to customers. The goal of the dynamic model is to calculate, on a statistical basis, the level of potential residual faultiness of the code; it consequently gives test and product managers significant help in the critical decision of whether to deliver the new release. Our products are delivered worldwide (Chile, South Korea, Japan, Taiwan, Indonesia, Kuwait…) and, consequently, the costs of detecting and fixing bugs in the field are very high (we estimate about 100 times the cost of fixing a bug in house).
Before this experiment we had no rules for better programming and no awareness of the significant benefits of well-structured and easily maintained software. The scenario in the test area was different: before the experiment, testers lacked criteria to understand whether they were making the best use of the time dedicated to test. After the experiment, every developer knows that it is not enough to deliver to testers a piece of software that has been compiled and debugged. The software must have a good level of quality, without violations and with a low level of complexity.
The cultural impact on testers is even more significant. They now know that a software component cannot be released as long as any routine has too high a probability of containing a fault. It is a completely different style of working, no longer based on "someone's feelings" about the critical components.
Product and test managers now have the opportunity to request and examine a quantitative report on the level of faultiness of the software during and after testing. It is a different approach that allows a more accurate evaluation of the quality of the products before delivery.
Working on software routines to reduce the number of violations may be long and tedious work, but it does not carry a significant risk of creating dangerous instabilities. The scenario is different for complexity. Reducing the level of complexity of a routine is often not simple work, and the risk of introducing instabilities is real. The lesson we learnt on this point is that this task has to be performed gradually and in several steps.
Initially we planned to estimate the quality of the delivered software using measures of coverage (e.g. statement coverage or branch coverage) and to use these measures also as a criterion to determine when test activities may be considered concluded. While the experiment was running we understood that a coverage measure is only an indication of test effort (useful but not specifically requested in this experiment) and cannot give us any significant information about the level of faultiness of the delivered software. Statistical models follow a different approach and try to estimate the potential residual faultiness of the software using static metrics as a baseline. Consequently, if well tuned, they can give more complete information about the quality of a product.
Another strong point is that every step of the new process is fully based on tools that automatically capture metrics and detect violations, and that give significant help in reducing the complexity of the software and in measuring the level of coverage.
A significant weak point is the already mentioned lack of initial measures on the types of violations introduced by the development team in the implementation phase.
The second weak point concerns the "range of validity" of the static and dynamic models. We cannot affirm that the models, after the tuning phase performed in Elsag, will work well on other software components, nor that their goodness will automatically remain confirmed forever. Continuous monitoring and, possibly, further tuning may be necessary in the baseline project whenever "external" conditions change (e.g. a change in the members of the development team). The same consideration applies when we extend the models to different software components written by other developers with a different programming style.
[2] A. J. Perlis, F. Sayward and M. Shaw. - Software Metrics: An Analysis and Evaluation. MIT Press, Cambridge, Mass., 1981.
[3] V. R. Basili. - Tutorial on Models and Metrics for Software Management and Engineering. - IEEE Computer Society, New York, 1980.
[4] V. R. Basili and E. E. Katz. - A formalization and categorization of software metrics. Technical report, Dept. Com. Sci., Univ. Maryland, College Park, 1986.
[5] S. Mohanty. - Models and measurements for quality assessment of software. ACM Computing Surveys, 11(3):251--275, September 1979.
[6] C. E. Walston and C. P. Felix. - A method of programming measurement and estimation. IBM Systems Journal, 16(1):54--73, 1977.
[7] J. P. Cavano and J. A. McCall. - A framework for the measurement of software quality. - Proc. Software Quality and Assurance Workshop, pages 133--139, San Diego, CA, Nov. 1978.
Mauro Traversa has 15 years of experience at Elsag in the development, testing and software configuration management areas. After his studies at the Electronic Department of the University of Genoa, he joined the Development and Research Department of Elsag, in a team responsible for the implementation of a proprietary multi-processor, real-time operating system for internal applications. In this team he worked initially in the test area, validating the programming interface of the kernel core of the operating system. After two years he joined the development team, contributing to the implementation of software components for a new major release of the operating system. It was in this context that he became involved in software configuration management, and in 1994 he became the configuration manager in a critical and strategic joint venture with a Californian company for the implementation of a distributed character recognition system. When the pilot project was concluded, Elsag formed a stable and independent team for software configuration management and Mauro Traversa assumed responsibility for this team, which currently manages every software component of a postal system. This team, initially dedicated to software configuration management activities, extended its competencies to offer the development area a centralised service for the installation procedures of software components.
Mauro Traversa is responsible for the best practice described in this article and is currently involved in an Esprit project whose goal is the implementation of a tool for the automatic generation of test cases and for threat detection.
Elsag S.p.A., established on November 1st 1998, was formerly a Division of Elsag Bailey, a Finmeccanica Company, one of the largest Italian industrial groups.
With revenues of about euro 360m in 1998 and euro 410m in 1999, and more than 2,450 employees, the Elsag S.p.A. group is now one of Italy's most important suppliers of IT solutions and services.
Elsag operates in the following areas.
LOC: lines of code; any line in the source that is not a comment or a blank line (it therefore includes standalone braces and parentheses on a single line);
eLOC: essential lines of code; lines of code not including blank lines, comment lines and lines containing only an isolated brace or parenthesis;
Lines: total number of code lines, no matter what they contain;
Comments: lines containing comments;
FP: number of parameters of the function;
V(G): cyclomatic complexity;
eV(G): essential cyclomatic complexity;
CO: comparison operators (<, >, ==, etc.);
LFC: logic flow complexity;
OC: operational complexity, weighted sum of the operations present;
NEST: maximum number of nesting levels of if-then-else control structures;
n1: Halstead number of unique operators;
N: Halstead program length, calculated as the sum of N1 (Halstead number of total operators) and N2 (Halstead number of total operands);
n: Halstead program vocabulary, calculated as the sum of n1 (Halstead number of unique operators) and n2 (Halstead number of unique operands);
V: Halstead program volume, calculated as V = N * log2(n);
D: Halstead difficulty, calculated as D = (n1/2) * (N2/n2);
E: Halstead effort, calculated as E = D * V;
BRANCH: number of forced exits from control structures by means of goto, exit and break (also inside a switch) statements;
OAC: operation argument complexity, weighted sum of the arguments of each operation of a module;
ANION: adjusted number of input/output nodes, calculated as the sum of the number of entry nodes of a module and the number of exit nodes from it, adjusted to behave intelligently where redundant "return" statements exist;
CONTROL: number of control statements;
EXEC: number of executable statements;
NSTAT: number of statements (equal to CONTROL + EXEC);
QCP_MAINT: Quality criteria profile, Maintainability; linear combination of other static metrics calculated as MAINTAINABILITY = 3 * N + NSTAT + NEST + 2 * V(G) + Number of Branching Nodes.
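The derived Halstead measures listed above follow directly from the four basic counts. The sketch below is illustrative: the function name and the input values are hypothetical, and it uses Halstead's standard definition of difficulty, D = (n1/2) * (N2/n2); whether the metrics tool used in the PIE computes D in exactly this way is an assumption.

```python
import math


def halstead(n1, n2, N1, N2):
    """Derive the Halstead measures from the four basic counts.

    n1, n2: number of unique operators and unique operands
    N1, N2: total number of operators and operands
    """
    n = n1 + n2                  # program vocabulary
    N = N1 + N2                  # program length
    V = N * math.log2(n)         # program volume
    D = (n1 / 2) * (N2 / n2)     # difficulty (standard Halstead definition)
    E = D * V                    # effort
    return {"n": n, "N": N, "V": V, "D": D, "E": E}


# Hypothetical counts for a small routine.
measures = halstead(n1=10, n2=6, N1=30, N2=18)
print(measures)
```

With these counts the vocabulary is 16 and the length 48, so V = 48 * log2(16) = 192, D = 5 * 3 = 15 and E = 2880.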
This statistic is not to be confused with the R2 of least-squares regression: the two are built upon different formulae, even though both range between 0 and 1 and are similar from an intuitive perspective. This statistic may be interpreted as the proportion of uncertainty in the dependent variable explained by the model. The higher the R2, the stronger the effect of the model's explanatory variables and the more accurate the model. However, as opposed to the R2 of least-squares regression, high R2 values are rare for logistic regression. For this reason, the reader should not interpret logistic regression R2 values using the usual heuristics for least-squares regression.
For the static model we obtained R2 = 0.5176, which is quite good in the context of logistic regression.
For the dynamic model R2 = 0.4251.
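The formula of the R2 statistic used for the two models is not reproduced here. One common variant that matches the "proportion of uncertainty explained" interpretation is the likelihood-ratio (McFadden) R2, defined as 1 minus the ratio between the log-likelihood of the fitted model and that of a model that always predicts the overall fault rate. The sketch below assumes this variant; the function names and the data are hypothetical.

```python
import math


def log_likelihood(outcomes, probabilities):
    """Log-likelihood of binary outcomes (1 = faulty) under predicted probabilities."""
    return sum(
        math.log(p) if y else math.log(1.0 - p)
        for y, p in zip(outcomes, probabilities)
    )


def likelihood_ratio_r2(outcomes, probabilities):
    """McFadden-style R2: 1 - LL(model) / LL(null model).

    Requires both outcomes to occur and every probability to lie strictly
    between 0 and 1; otherwise a logarithm or the ratio is undefined.
    """
    base_rate = sum(outcomes) / len(outcomes)
    ll_null = log_likelihood(outcomes, [base_rate] * len(outcomes))
    ll_model = log_likelihood(outcomes, probabilities)
    return 1.0 - ll_model / ll_null


# Hypothetical data: four routines, two of them faulty.
faulty = [1, 0, 1, 0]
predicted = [0.9, 0.1, 0.9, 0.1]
print(f"R2 = {likelihood_ratio_r2(faulty, predicted):.3f}")
```

A model that predicts the overall fault rate for every routine scores 0 on this scale, and values approach 1 only when the predicted probabilities are close to the observed outcomes.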