EuroSPI 2000 
Practical and innovation-based software process improvement to prepare for the new millennium.
European Software Process Improvement
SPI and Testing

 
 
 
 
 

Promote: Metrics & Models to improve the test process
 
 

Traversa Mauro
Elsag spa, Genoa - Italy


 
 
 

Abstract

This article reports on the Process Improvement Experiment (PIE) performed at Elsag spa, a large Italian company whose main business is postal and services automation. The main goal was to improve the development and test processes.
The main problems addressed in the PIE were the lack of standards and rules in the programming step (C/C++), badly structured and overly complex routines, and the high cost and duration of test activities performed without criteria or coverage measures.
The experiment (June 1998 to March 2000) was carried out in collaboration with CEFRIEL, an Italian consortium with extensive experience in testing metrics, whose role was to select the metrics and to calculate and tune the statistical models used in the PIE.

Working, as a case study, on a critical software component (the address recognition task for a postal object), we:

The results of the experiment were quite positive, and the technological, economic and cultural impacts were measured. Now that the best practice has been concluded, Elsag is extending the new process to other significant software components.
 
 

The Background Scenario

Elsag spa is a large engineering company operating mainly in the industrial sectors of postal automation, image processing, document management and automation of services. Its software components, mostly embedded in more complex systems, contribute about 180 MECU/year to Elsag's total business, about 50% of the company's business. This percentage has increased constantly over the last 10 years and is expected to increase rapidly in the coming years.
 

Company Organization

Elsag's organization is based on a central Development and Research Department (SRS), where development and test activities are concentrated. SRS interfaces with several Business Units, which are responsible for managing projects directly with external customers.

The SRS Department, where the PIE was performed, is organized as follows:

The development team (about 40 members) works mainly on the Windows NT platform and uses Microsoft Development Studio (Visual C and Visual Basic) to develop the software components.

The test team (about 12 members) uses TestDirector by Mercury as its test case database and WinRunner by Mercury to validate graphical user interfaces.

The test team (during the testing phase) and the business units (after product delivery) use PVCS Tracker to report detected anomalies, while the Software Configuration Management team (3 members) uses PVCS Version Manager for change control and release management.
 

Products

Elsag operates in several technological markets with a large product catalogue. One of the most strategic areas in which the company is currently investing is the postal automation market. Elsag offers its customers (typically national postal organizations) complete hardware and software products for the automatic sorting of postal objects, that is, letters, magazines, boxes and so on. We currently have many systems working in Chile, Indonesia, South Korea, Japan, Kuwait and Taiwan. In Europe, Elsag is the main contractor of the Italian Postal Organization and is now selling and installing 15 systems in France.

The key factors of a postal machine are the reliability of its hardware and software components (which are expected to work 24 hours a day), the number of postal objects sorted per hour, and the percentage of addresses on postal objects that are recognized automatically. This last factor is very important because it lets the customer determine how many people are needed to complete the address recognition task manually. The case study considered in the PIE concerns this strategic software component.
 
 

The Starting Scenario

This section describes the problems and needs we identified in our organization, the goals we planned to achieve in the PIE, and the benefits we expected after the conclusion of the experiment. The last subsection describes our choice of case study.
 

Identified Problems

We were not able to assess our organization against well-known models (CMM, SPICE, ...) and we did not have many quantitative measures demonstrating the weak points of our process. To identify our problems, we therefore held many interviews with middle management and with the development and test teams. The result of our analysis was the following:

Technical Objectives and Benefits

Considering the problems listed in the previous subsection, we identified the following objectives:

The expected benefits were:

The case study

The selected case study is one of the core products of Elsag spa: an automatic character recognition system named OCREngine.
OCREngine is a powerful, scalable and flexible family of products for automatic postal object reading. It is based on a common base technology covering a wide range of possible applications, from form processing (for document processing applications) to automatic encoding of postal objects based on address interpretation (for postal automation applications).

A rough estimate of the size and complexity of this product can be derived from the following table:

Number of source modules    ~ 600
Number of code lines        ~ 115,000
Size of source code         ~ 4.6 Mbytes

The OCREngine system software plays a strategic role among Elsag spa's products: many customers decide whether to buy our postal system according to the quality of its address recognition for postal objects. The percentage of recognized addresses is typically a basic requirement in customer specifications and is always a basic acceptance test for our customers.

For the PIE, Elsag selected a software module whose task is to identify each single handwritten or typewritten character of the address on a postal object. This software is composed of 30 routines (1180 lines of code) written in C/C++.
 
 

Metrics & Models

From the technical point of view, we worked in four different areas in order to achieve all the planned goals. The first two areas were investigated and managed in collaboration with the development team, while the third and fourth were investigated in collaboration with the test team.
 

Violations

To reduce the number of C/C++ language constructs that are potential causes of faults, we selected violations and classified them into three classes (A, B and C). Our developers selected these violations from the set of more than 100 suggested by the commercial tool used in the experiment. The tool detects and reports any critical language construct when its source code analysis function is run. Our development team analyzed every suggested violation and selected the most significant ones according to how frequently they occurred in our software and how serious they were.

Complexity

To reduce the structural complexity of the code, we focused on two well-known metrics:

In this case the aim was to focus the developers' attention on critical components (from the complexity point of view) and to give the test team useful indications about the complexity of the software to be tested.
 

The Static Model

Using logistic regression, we built a statistical model that is essentially a function of the static metrics captured with the selected tools. For each routine, it gives the probability that the routine contains faults. This information, captured before the testing phase, is used by the test team for cost evaluation and lets the test activity focus on the modules with a high probability of faults.

This probability is defined as:

p(faults) = p(FP, CONTROL, V(G), LOC, Lines, Comments), where

log(p / (1 - p)) = -4.76740505 + 0.4789582 FP - 0.68208893 CONTROL + 0.6643859 V(G) - 0.50226561 LOC + 0.4514182 Lines - 0.42495702 Comments

where FP, CONTROL, V(G), LOC, Lines, Comments are classic static metrics [1][2].
See Appendix 3 for a description of these metrics and criteria to evaluate the goodness of the model.
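
As an illustration, the static model can be evaluated with a few lines of Python. This is a minimal sketch, not the tooling used in the experiment. It assumes that the logarithm is the natural logarithm (as is usual for logistic regression) and that the metrics enter the formula as raw, untransformed values, which the article does not state. The function names and the sample metric values are hypothetical; the coefficients are those given above, and the 0.5 threshold is the one reported in the results.

```python
import math

# Intercept and coefficients of the static model, as given in the article.
INTERCEPT = -4.76740505
COEFFICIENTS = {
    "FP": 0.4789582,
    "CONTROL": -0.68208893,
    "V(G)": 0.6643859,
    "LOC": -0.50226561,
    "Lines": 0.4514182,
    "Comments": -0.42495702,
}


def fault_probability(metrics):
    """Return p(faults) for one routine from its static metrics."""
    logit = INTERCEPT + sum(
        coefficient * metrics[name] for name, coefficient in COEFFICIENTS.items()
    )
    # Numerically stable inverse of log(p / (1 - p)).
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def critical_routines(routines, threshold=0.5):
    """Return (name, probability) pairs above the threshold, most critical first."""
    scored = [(name, fault_probability(m)) for name, m in routines.items()]
    flagged = [(name, p) for name, p in scored if p > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical metric values, for illustration only (assumption, not article data).
sample = {
    "routine_a": {"FP": 2, "CONTROL": 3, "V(G)": 4, "LOC": 20, "Lines": 30, "Comments": 5},
    "routine_b": {"FP": 5, "CONTROL": 6, "V(G)": 15, "LOC": 40, "Lines": 70, "Comments": 10},
}
for name, p in critical_routines(sample):
    print(f"{name}: p(faults) = {p:.3f}")
```

Sorting the flagged routines by probability gives the test team a simple priority list, which is how the article describes the model being used before testing starts.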
 

The Dynamic Model

Again using logistic regression, we built a second model that defines the probability that a fault remains uncaught despite the achievement of a given coverage. This model is applied to each routine after the conclusion of the testing phase. It gives useful information about the potential residual faultiness of the delivered software and an indication of the minimum significant number of test cases.

Using coverage measures captured with Testbed, it is possible to define, for each routine, x as the ratio between the number of test cases that detect the failure (revealing test cases) and the number of test cases that exercise the routine (passing-through test cases).

The range of values of the reveal/pass-through rate was divided into four distinct sub-ranges identifying five distinct classes (from Class 0 to Class 4) representing different levels of faultiness:

We estimate that the number of faults still present in a module after testing decreases moving from Class 1 to Class 4. We again define the probability p(i) that a module belongs to Class i as:

where param is a combination of well-known static metrics [1] [2]:

param = 1.12013763 eLOC + 0.31751306 Comments - 0.4334234 Lines - 1.204101 FP + 6.06374051 CO - 6.27636 LFC + 8.29360992 eV(G) + 0.32320117 OC - 3.8617338 NEST - 0.9116793 n1 - 6.629129 N + 0.44606073 n - 0.0298177 V + 0.00011847 E - 5.4163831 BRANCH - 3.239876 ANION - 3.3306005 EXEC + 2.23758035 QCP_MAINT

See Appendix 3 for a description of the metrics mentioned in the above formulas and criteria to evaluate the goodness of the model.

We assign a module to the class with the highest probability in the above formulas and evaluate its level of potential residual faultiness accordingly.

Furthermore, it is possible to estimate n, the minimum number of test cases that must cover a routine in order to find a fault, if one is present in the module. According to the definition of the reveal/pass-through rate, this number is 1/xmin, where xmin is the lower bound of the reveal/pass-through rate of the modules belonging to Class i (for example, if a routine belongs to Class 2, xmin is 0.054 and consequently n = 19). The definition of n is rather intuitive. To understand it better, consider the limit case in which every test case passing through a routine detects a fault. In this case the rate is equal to 1, and the minimum number of test cases is also 1: if every test case is able to detect the fault, only one test case needs to be executed.

We consider testing complete when, for every routine, the number of passing-through test cases is greater than n.
 
 

The Experiment

This section describes the main phases of the experiment, performed jointly with the development and test teams, whose collaboration and work were decisive for its success.
 

The technical phases of the experiment

The first technical step was dedicated to selecting the commercial tool and defining the most suitable metrics and models. Significant internal benchmarks were used to compare different commercial tools, and LDRA TESTBED was selected for its high level of performance. Two other tools, Krakatau (www.powersoftware.com) and RSM (www.m2tech.net), were introduced in the project to capture metrics not supported by TESTBED.

After the definition of metrics, models and tools, the project entered its two most significant technical phases.

The first may be described as the first experience phase (the offline phase). We used the tools to capture the selected measures on a delivered version of the case study, with the aim of "taking a photograph" of the initial state of the art in terms of violations, complexity and level of faultiness of the delivered software.

At the end of this task we defined a first version of the new process, applicable to the current industrial process, and the developers had useful information for improving their software in terms of violations and complexity.

These improvements were introduced in a new release of the case study that had already been planned with our customers for bug fixing and minor technical enhancements. We allocated extra time (30% more than originally planned) to let our developers introduce the improvements required by the new process. It is important to underline that we applied the new process in a "realistic industrial scenario" (the online phase), where delivery time was one of the typical constraints of any product roadmap and the introduction of the improvements identified in the offline phase was only one of the objectives of the new release. We chose to do so in order to simulate the real scenario of any future extension of the new process to other software components, where industrial constraints of time and cost must be considered.

Once the improvements had been introduced and the new release of the baseline project had been delivered to the test team, the last technical phase could officially start. In this last step we captured the selected metrics again and measured the actual level of improvement achieved in terms of violations and complexity. We applied the static model to the delivered software to give the test team accurate information about the routines with the highest probability of faults, so that test activities could focus on them. Finally, while test activities were running and coverage measures could be captured, we applied the dynamic model, calculating the probability that faults remain after delivery and the minimum number of useful test cases needed to assure a significant level of coverage. Both pieces of information were a great help in defining a criterion for stopping test activities on the new version of the case study.
 

The Results

The following table summarizes the state of the art before the improvements and the results of the experiment. Recall that the case study consisted of 30 routines written in C.
 
Violations
  Before the improvements: every routine had violations in Class A and Class B.
  After the improvements: no violations in Class A; 6 routines with violations in Class B.

Essential cyclomatic number
  Before the improvements: 36% of the routines above the threshold (3).
  After the improvements: 13% of the routines above the threshold.

Essential knots number
  Before the improvements: 37% of the routines above the threshold (0).
  After the improvements: 16% of the routines above the threshold.

Static model
  Before the improvements: 87% of the routines with p(fault) above the threshold (0.5), and 23% of these with a value above 0.9.
  After the improvements: 45% of the routines with p(fault) above the threshold; none with a value above 0.9.

Dynamic model and testing
  Before the improvements: 13% of the routines still have a high probability of faults after testing (Class 1), while 23% are estimated to have a very low risk of failures (Class 4). 20% of the test cases do not increase the level of coverage.
  After the improvements: 7% of the routines still have a high probability of faults after testing (Class 1), while 61% are estimated to have a very low risk of failures. Test time reduced by 10%.

These models seem to work well in our environment with our software. For the static model, this can be confirmed by comparing each routine's probability of faults with the number of anomalies detected in the testing phase. We found good agreement between the routines identified as critical by the static model and the reports of the test team: the largest number of bugs was indeed detected in these routines, while only a small number of non-critical bugs was detected in the routines that the model identified as fault-free or as having a very low probability of faults.

As for the dynamic model, our confidence in it is based on the first reports from external customers about the quality of the software delivered after the introduction of the new process. The number of bugs they report is decreasing, and the anomalies tend to be concentrated in the routines classified in Class 1. This is the most significant evidence of the goodness of the model and of the new process we introduced.
 
 

Experiment Evaluation

This section gives our evaluation of the experiment, analysing its technical, cultural and economic impacts. Lessons learnt and the weak and strong points of the experiment are also described.
 

The Technical Impact

Violations and complexity metrics are nothing new or exciting from a technical point of view, but the awareness we gained of the problems caused by potential bugs (violations) and by badly structured software is new in our development area. On these points, the most significant technical impact is the introduction of source code analysis tools, now used by every member of the development team to capture violations, to obtain useful information about the complexity of the routines, and to get useful indications on how to reduce that complexity.

More interesting from a technical point of view is the introduction of the statistical models developed by CEFRIEL. The experiments performed and the results obtained clearly indicate a high correlation between static metrics (used in both models) and the presence of faults. The detailed statistical analysis, performed with different statistical methods, indicates that for a homogeneous class of software and a limited set of severe faults it is possible to build a statistical model based on a reasonably small set of metrics.
 

The Business Impact

The business impact of software improvements in terms of reduced violations and complexity is significant, but it is not easy to evaluate quantitatively over a short monitoring period. The positive effects of well-written and well-structured software will become more evident when, in the near future, this code is modified again for bug fixing or new enhancements.

Business impacts from the test point of view are more immediate and relevant. The results table reports a 10% reduction in test time (with respect to the originally planned time), which is a significant first result considering that the case study represents only a small percentage of the software to be tested. This was possible because test activities were more focused on the functionality implemented by the routines with the highest probability of faults. In general, this information is also very useful for better estimating test time and cost, and the test manager will use it to plan future activities.

Another important aid in reducing test time is the estimate of the minimum number of test cases worth executing. In the experiment it was calculated that 20% of the test cases do not increase the level of coverage.

Finally, we cannot forget the obvious and most relevant impact: the quality of the product delivered to customers. The dynamic model aims to calculate, on a statistical basis, the level of potential residual faultiness of the code, and consequently gives test and product managers significant help in the critical decision of whether to deliver the new release. Our products are delivered worldwide (Chile, South Korea, Japan, Taiwan, Indonesia, Kuwait, ...), so detecting and fixing bugs in the field is very expensive (we estimate about 100 times the cost of fixing bugs in house).
 

The Cultural Impact

The most significant cultural impact is the introduction of the concept of metrics as an important aid for measuring the quality of our lifecycle process.

Before this experiment we had no rules for better programming and no awareness of the significant benefits of well-structured, easily maintainable software. The scenario in the test area was different: before the experiment, testers lacked criteria for understanding whether they were making the best use of the time dedicated to testing. After the experiment, every developer knows that it is not enough to deliver a compiled and debugged piece of software to the testers. The software must have a good level of quality, without violations and with a low level of complexity.

The cultural impact is even more significant for testers. They now know that a software component cannot be released as long as one of its routines has too high a probability of containing a fault. It is a completely different style of working, no longer based on "someone's feelings" about the critical components.

Product and test managers now have the opportunity to request and examine a quantitative report on the level of faultiness of the software during and after testing. It is a different approach, which allows a more accurate evaluation of product quality before delivery.
 

Lessons Learnt

The first lesson learnt, in chronological order, concerns the need for a significant baseline of data about the process before starting any improvement action. This was not our situation: in the analysis phase, the problems we identified were not confirmed by quantitative measures but only by the feelings and impressions of the working teams. This lack did not help our initial work in the PIE, nor the work of CEFRIEL in defining metrics and models.

Working on software routines to reduce the number of violations may be long and boring work, but it does not carry a significant risk of creating dangerous instabilities. Complexity is a different matter. Reducing the complexity of a routine is often not simple, and the risk of generating instabilities is real. The lesson we learnt on this point is that this task has to be performed gradually, in several steps.

Initially we planned to estimate the quality of the delivered software using coverage measures (e.g. statement coverage or branch coverage) and to use these measures also as a criterion for deciding when test activities may be considered concluded. While the experiment was running, we understood that a coverage measure only gives an indication of test effort (useful, but not specifically requested in this experiment) and cannot give any significant information about the level of faultiness of the delivered software. Statistical models follow a different approach: they try to estimate the potential residual faultiness of the software using static metrics as a baseline. Consequently, if well tuned, they can give more complete information about the quality of a product.
 

Weak and strong points

A strong point is, without any doubt, the introduction of the concept of metrics as an important way to measure the quality of our process. We now have the rules and the tools to monitor the quality of our process at any moment and to carry out any corrective action identified.

Another strong point is that every step of the new process is fully based on tools that automatically support the capture of metrics and the detection of violations, and that give significant help in reducing the complexity of the software and measuring the level of coverage.

A significant weak point is the already mentioned lack of initial measures on the types of violations introduced by the development team in the implementation phase.

The second weak point concerns the "range of validity" of the static and dynamic models. We cannot claim that the models, after the tuning phase performed at Elsag, will work well on other software components, nor that their goodness will automatically hold forever. Continuous monitoring and, possibly, tuning may be necessary in the baseline project whenever "external" conditions change (e.g. changes in the membership of the development team). The same consideration applies when we extend the models to different software components written by other developers with a different programming style.
 
 

Future actions

As for future actions, we will proceed in two directions at the same time:

This last point introduces the aspect of the reproducibility of the experiment. We assume that any technical result captured within this experiment is fully reproducible for any other software component developed in C/C++, inside or outside our company. The points to be reviewed when the process defined in the current experiment has to migrate to a different environment are:
 

References

[1] S. D. Conte, H. E. Dunsmore and V. Y. Shen. - Software Engineering Metrics and Models. Benjamin/Cummings Publishing Company, Menlo Park, CA, 1986.

[2] A. J. Perlis, F. Sayward and M. Shaw. - Software Metrics: An Analysis and Evaluation. MIT Press, Cambridge, Mass., 1981.

[3] V. R. Basili. - Tutorial on Models and Metrics for Software Management and Engineering. - IEEE Computer Society, New York, 1980.

[4] V. R. Basili and E. E. Katz. - A formalization and categorization of software metrics. Technical report, Dept. Com. Sci., Univ. Maryland, College Park, 1986.

[5] S. Mohanty. - Models and measurements for quality assessment of software. ACM Computing Surveys, 11(3):251--275, September 1979.

[6] C. E. Walston and C. P. Felix. - A method of programming measurement and estimation. IBM Systems Journal, 16(1):54--73, 1977.

[7] J. P. Cavano and J. A. McCall. - A framework for the measurement of software quality. - Proc. Software Quality and Assurance Workshop, pages 133--139, San Diego, CA, Nov. 1978.
 
 

Appendix 1 - Author Profile


Traversa Mauro has 15 years of experience at Elsag in the development, testing and software configuration management areas. After his studies at the Electronic Department of the University of Genoa, he joined the Development and Research Department of Elsag, in a team responsible for implementing a proprietary multi-processor, real-time operating system for internal applications. In this team he initially worked in the test area, validating the programming interface of the kernel core of the operating system. After two years he joined the development team, contributing to the implementation of software components for a new major release of the operating system. It was in this context that he became involved in software configuration management issues, and in 1994 he became the configuration manager in a critical and strategic joint venture with a Californian company for the implementation of a distributed character recognition system. When the pilot project was concluded, Elsag formed a stable and independent team for software configuration management, and Mauro Traversa took responsibility for this team, which currently manages every software component of a postal system. This team, initially dedicated to software configuration management activities, has extended its competencies to offer the development area a centralised service for software installation procedures.

Mauro Traversa is responsible for the best practice described in this article and is currently involved in an Esprit project whose goal is the implementation of a tool for the automatic generation of test cases and for threat detection.
 

Appendix 2 - Company Profile


Elsag S.p.A., established on November 1st 1998, was formerly a Division of Elsag Bailey, a Finmeccanica Company, one of the largest Italian industrial groups.

With revenues of about euro 360m in 1998 and euro 410m in 1999, and more than 2,450 employees, the Elsag S.p.A. group is now one of Italy's most important suppliers of IT solutions and services.

Elsag operates in the following areas.

Elsag Spa has held ISO 9001 certification since 1993.
 

Appendix 3 - Applied metrics and evaluation of the models

Metrics used in the models

A short description of each software metric applied in the static and dynamic models is listed below.

LOC: lines of code: any line in the source that is not a comment or a blank line; therefore it includes standalone braces and parentheses on a single line;
eLOC: essential lines of code. Lines of code not including blank lines, comment lines and isolated braces or parenthesis lines;
Lines: total number of code lines, no matter what they contain;
Comments: lines containing comments;
FP: number of parameters of the function;
V(G): Cyclomatic complexity;
eV(G): essential Cyclomatic complexity;
CO: Comparison operators (<, >, ==, etc.);
LFC: Logic Flow Complexity;
OC: Operational complexity, weighted sum of present operations;
NEST: Maximum number of nesting levels of if-then-else control structures;
n1: Halstead number of unique operators;
N: Halstead program length, calculated as the sum of N1 (Halstead total number of operators) and N2 (Halstead total number of operands);
n: Halstead program vocabulary, calculated as the sum of n1 (Halstead number of unique operators) and n2 (Halstead number of unique operands);
V: Halstead program volume, calculated as V = N * log2(n);
D: Halstead difficulty, calculated as D = (n1/2) * (N2/n2);
E: Halstead effort, calculated as E=D*V;
BRANCH: number of forced exits from control structures by means of goto, exit, and break (also inside a switch) statements;
OAC: Operation argument complexity, weighted sum of arguments of each operation of a module;
ANION: Adjusted number of input/output nodes, calculated as the sum between number of entry nodes in a module and number of exit nodes from it, adjusted to behave intelligently where redundant "return" statements exist.
CONTROL: number of control statements;
EXEC: number of executable statements;
NSTAT: Number of statements (equals to CONTROL + EXEC);
QCP_MAINT: Quality criteria profile (Maintainability). Linear combination of other static metrics, calculated as
    MAINTAINABILITY = 3 * N + NSTAT + NEST + 2 * V(G) + Number of Branching Nodes.
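
The derived Halstead measures in the list can be computed from the four basic counts, as in the following sketch. It is an illustration only: the function name is hypothetical, and difficulty is computed with the textbook Halstead definition D = (n1/2) * (N2/n2). How operators and operands are counted depends on the measurement tool and is not covered here.

```python
import math


def halstead(n1, n2, N1, N2):
    """Derive Halstead measures from operator and operand counts.

    n1, n2: number of unique operators and unique operands
    N1, N2: total number of operators and total number of operands
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("a routine needs at least one operator and one operand")
    N = N1 + N2                # program length
    n = n1 + n2                # program vocabulary
    V = N * math.log2(n)       # program volume
    D = (n1 / 2) * (N2 / n2)   # difficulty (textbook definition)
    E = D * V                  # effort
    return {"N": N, "n": n, "V": V, "D": D, "E": E}


# Hypothetical counts for a small routine, for illustration only.
for name, value in halstead(n1=10, n2=5, N1=30, N2=20).items():
    print(f"{name} = {value:.2f}")
```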
 

Evaluation of the models

We used R², the goodness of fit, as the statistical coefficient for evaluating the experimental results obtained with logistic regression for the static and dynamic models.

This statistic is not to be confused with the least-squares regression R²: the two are built upon different formulae, even though both range between 0 and 1 and are similar from an intuitive perspective. It may be interpreted as the proportion of uncertainty in the dependent variable explained by the model. The higher the R², the stronger the effect of the model's explanatory variables and the more accurate the model. However, as opposed to the R² of least-squares regression, high R² values are rare for logistic regression. For this reason, the reader should not interpret logistic regression R² values using the usual heuristics for least-squares regression.

For the static model we obtained R² = 0.5176, which is quite good in the context of logistic regression.

For the dynamic model we obtained R² = 0.4251.
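
The article does not give the formula of this goodness-of-fit statistic. Its description (the proportion of uncertainty explained, distinct from the least-squares coefficient) is consistent with likelihood-based definitions such as McFadden's pseudo R-squared, so the sketch below assumes that definition. The function names and the sample data are hypothetical.

```python
import math


def log_likelihood(outcomes, probabilities):
    """Log-likelihood of binary outcomes (0 or 1) under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probabilities)
    )


def pseudo_r2(outcomes, probabilities):
    """Likelihood-based R2 (assumed definition): (LL_null - LL_model) / LL_null.

    The null model predicts the observed fault rate for every routine.
    Predicted probabilities must lie strictly between 0 and 1.
    """
    base_rate = sum(outcomes) / len(outcomes)
    if base_rate in (0.0, 1.0):
        raise ValueError("outcomes must contain both faulty and fault-free routines")
    ll_null = log_likelihood(outcomes, [base_rate] * len(outcomes))
    ll_model = log_likelihood(outcomes, probabilities)
    return (ll_null - ll_model) / ll_null


# Hypothetical data: 1 = faulty routine, 0 = fault-free routine.
observed = [1, 0, 1, 0]
predicted = [0.9, 0.1, 0.8, 0.2]
print(f"pseudo R2 = {pseudo_r2(observed, predicted):.3f}")
```

Under this definition a model that does no better than the observed fault rate scores 0, and the score approaches 1 only as the predicted probabilities approach the observed outcomes.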
 



 

Editors
ISCN LTD, ISCN GesmbH, Schieszstattgasse 4/24, 8010 Graz, and Coordination Office, Florence House, 1 Florence Villas, Bray, Ireland, office@iscn.at, office@iscn.com, office@iscn.ie, Editing Done: 19.7.2002