The importance of QA in software development: Lessons from 4 historical IT project failures

 

Intro – what is this article about, and what is it not?

Are you working in a highly automated and technology-dependent industry? Or perhaps you’re looking to take the first steps towards digitizing your production facilities? In either scenario, you’ve likely considered the impact of software quality on the success of your proposed solutions. 

The essential role of software quality and risk management in averting failures might seem self-evident, right? Yet, as our analysis of historical IT software failures reveals, even experienced organizations are not immune to substantial financial and reputational damage caused by sometimes minor lapses in software quality. 

There are many crucial aspects that contribute to the effectiveness and success of custom software development projects. The list below covers a wide range of them, but is by no means exhaustive: 

  • Clear Project Goals and Objectives 
  • Effective Project Planning 
  • Active Involvement of Stakeholders 
  • Skilled Development Team 
  • Requirements Gathering and Analysis 
  • Effective Design and Architecture 
  • Iterative Development and Agile Methodologies 
  • Change Management 
  • Risk Management 
  • Effective Communication 
  • Scalability and Future-Proofing 
  • Security and Compliance 
  • Documentation 
  • User Training and Support 
  • Post-Implementation Evaluation 
  • Budget and Resource Management 
  • Client Satisfaction 
  • Quality Assurance and Testing 

In this article, we will focus mostly on the last item on that list: Quality Assurance and Testing. 

Over the years, awareness of the importance of software quality has increased significantly, which has contributed to standardized test procedures, the adoption of best practices across industries, and the integration of quality assurance into all stages of the software development cycle. However, despite these advances, companies from all industries continue to grapple with software quality issues. 

 

Our analysis covers four historic IT disasters that show how much money and reputation is at stake from oversights in software quality assurance and testing:

  1. Knight Capital Group (2012) – a $440 million software flaw that resulted in the sale of the company, 

  2. NatWest and RBS (2012) – an IT glitch leading to a record fine by the Financial Conduct Authority (FCA), amounting to tens of millions of pounds for the banks involved,

  3. The Ariane 5 Rocket explosion (1996) – loss of $400 million in just 37 seconds, 

  4. Heathrow Airport in London, Terminal 5 (2008) – more than 23,000 bags went missing, 500 flights were canceled, and £16 million was lost. 

 

1. Knight Capital Group – a $440 Million software flaw that resulted in the sale of the company (2012)

 

Imagine working tirelessly for 17 years to build a successful company on Wall Street. You’ve built an empire – it is 2012, and Knight Capital Group is a leading American financial services firm specializing in market making, electronic execution, and institutional sales and trading. The firm is a major player in U.S. equities trading with a market share of approximately 17 percent on both the New York Stock Exchange and the Nasdaq Stock Market.


In June 2012, the New York Stock Exchange got approval to start its Retail Liquidity Program (RLP). This program aimed to give individual investors the best prices, even if it meant using “dark markets.” It was set to launch on August 1, giving firms just six weeks to prepare.


But on August 1, chaos ensued within the first 30 minutes of trading in what is remembered today as the Knight Capital incident, which cost the company an estimated $440 million. An error in code deployed overnight had disastrous consequences and put the company’s very existence at risk. The software triggered a buying spree under the rule “buy high, sell low,” purchasing 150 different stocks at a total cost of approximately $7 billion within the first hour of trading.


The incident shocked investors and triggered a massive sell-off of Knight Capital’s stock. In just two business days, the stock lost 75 percent of its value. Because of the enormous losses, Knight Capital had to borrow an extra $400 million. According to the Wall Street Journal, the company was essentially taken over by its new creditors.

What could have been done to prevent this situation?

 

Implementing modern software development and DevOps practices, such as version control, automated testing, and automated deployment, could have prevented the disaster. Knight Capital Group’s failure stemmed from a combination of software flaws, lack of code review, inadequate risk management, and manual deployment processes.
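To make the idea of an automated deployment check concrete, here is a minimal sketch of a pre-flight verification that every production server runs the same build before trading opens. This is purely illustrative: the function and server names are ours, not Knight Capital’s, and a real system would verify far more than a binary hash.

```python
import hashlib

def verify_uniform_deployment(server_binaries: dict) -> bool:
    """Return True only if every server reports an identical binary.

    Hypothetical safeguard: Knight Capital's incident involved new code
    reaching only part of its server fleet, so a check like this would
    have flagged the mismatch before the market opened.
    """
    digests = {hashlib.sha256(binary).hexdigest()
               for binary in server_binaries.values()}
    return len(digests) == 1

# A fleet where one server still runs last month's build fails the check.
fleet = {"srv1": b"build-2012-08-01", "srv2": b"build-2012-08-01"}
stale = {"srv1": b"build-2012-08-01", "srv2": b"build-2012-07-01"}
```

A check like this costs a few lines of code; the absence of anything comparable cost Knight Capital $440 million in 45 minutes.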

2. NatWest and RBS: an IT glitch leading to a record fine by the FCA (2012)

 

In 2012, NatWest and The Royal Bank of Scotland (RBS) experienced a major IT glitch that left millions of customers unable to access their accounts, make transactions, or receive their salaries. The failure was attributed to a software update that went awry during routine maintenance. Inadequate testing and a lack of contingency planning exacerbated the situation.

The IT glitches at NatWest and RBS happened because of technical problems with a software update to RBS’s CA-7 software, which ran the payment system. It was later revealed that RBS staff unintentionally corrupted the update, causing a huge backlog of customer transactions. Customers’ wages, payments and other transactions were disrupted. Some customers were unable to withdraw cash using ATMs or to see bank account details. Others faced fines for late payment of bills because the system could not process direct debit, and some customers were even stranded abroad or faced other serious difficulties.

These serious issues persisted for more than a week, and it ultimately took several weeks before all normal operations were completely restored.

The penalty for NatWest and RBS for the 2012 glitch was £56 million: the Financial Conduct Authority (FCA) fined Royal Bank of Scotland £42 million, and the Prudential Regulation Authority (PRA) fined the banks an additional £14 million for the IT failures, which affected over 6.5 million customers in the United Kingdom.

Ultimately, a mix of several factors contributed to this situation. Integrating systems between NatWest and RBS, aiming to create and sell financial products under different brands, was a big challenge. The banks faced even more IT problems because they were using outdated software and hardware, some of which was from the 1970s.

After two examples from the highly regulated financial market, let’s look at two more: one from the aerospace industry and one from the construction of a modern airport terminal.

3. Ariane 5 Rocket Explosion: a loss of $400 million in just 37 seconds

 

In 1996, the European Space Agency (ESA) experienced a catastrophic failure when their carrier rocket Ariane 5 exploded shortly after take-off. This disaster was caused by a software bug.


Ariane 5 had been in development for nearly 10 years, and the project’s budget was about $7 billion. Contrary to the original plans, Ariane 5 was not an improved Ariane 4 but a completely new rocket. Hardly any parts were shared; the hardware was completely new. Only the software was carried over from the old Ariane, because it had worked reliably up to that point and developing it from scratch would have been much more expensive.

Ariane 5 was supposed to carry a set of scientific probes into orbit around the Earth to study the magnetosphere. Unfortunately, 37 seconds after liftoff, the rocket suddenly tilted 90°, and the aerodynamic forces tore the engines away from the rest of the structure. This triggered the self-destruct procedure, and the rocket exploded nearly 4 km above the ground.

There was an error in the software code responsible for determining the rocket’s position in the coordinate system. The variable that stored the horizontal velocity correction was converted from a 64-bit floating-point number to a 16-bit signed integer.

37 seconds after launch, this variable exceeded the maximum value a signed 16-bit integer can hold. The position module reported an error value instead of the true value. The flight computer misinterpreted this, and the navigation system decided to “correct” the course. Interestingly, when the exception was detected, the module responsible for position determination shut down and the backup module was automatically activated. Unfortunately, both modules were identical and contained the same error. The part of the program in which all this happened was actually no longer needed on Ariane 5; it was only kept in order to minimise the differences between the rockets’ software versions.

Undeniably, specification errors stemming from a lack of proper risk analysis and testing are cited as the causes. In addition, part of the code came from an earlier version, Ariane 4. These are therefore process errors, and the faulty line of code is only their effect.

The error would probably have been caught under the “test as you fly, fly as you test” principle. However, instead of using data from the rocket’s real inertial navigation system, the control system was tested only with values from a simulation. Even those simulations did not follow the actual flight path of the new Ariane 5 missions. So the error slipped through and caused the disaster. The now-forgotten early embarrassments of Ariane 5 ended after 14 flights, of which only ten were successful.

4. Heathrow Airport in London, Terminal 5 (2008)

 

Two weeks before the terminal opened, Queen Elizabeth II said: “It gives me great pleasure to open Terminal 5 – this 21st century gateway to Britain and for us to the wider world.”

Terminal 5 was built at a cost of nearly £4.3 billion. Around £75 million of that went into technology, and BAA invested at least another £175 million in IT systems.

The work involved 180 IT suppliers and saw 163 IT systems installed, along with 546 interfaces, more than 9,000 connected devices, and 2,100 PCs.

The baggage handling system at T5 is the largest single-terminal baggage handling system in Europe: a main baggage sorter and a fast-track system designed to process 70,000 bags a day.

Bags undergo several processes on the way through the system, including automatic identification, explosives screening, fast tracking for urgent bags, automatic sorting, and passenger reconciliation.

The opening of the terminal was ceremonial and was supposed to be a milestone for London’s airport, adding capacity for an extra 30 million passengers annually. Instead of splendor, there was a huge mishap.

During the first 5 days, more than 23,000 bags went missing, 500 flights were canceled, and £16 million was lost. What was the reason for such a huge failure?

The glitches were attributed to various factors, including IT failures, inadequate training, parking-related issues, and broken lifts. The situation resulted in a backlog of lost bags, long wait times, and frustrated passengers. There was no single main cause, but rather a mix of smaller ones:

Terminal 5 Crisis Points:

1. Parking Location Confusion: Lack of testing and familiarity with new car park locations caused confusion for both staff and passengers.
2. Security Access Issues: Staff encountered difficulties passing through security checkpoints, impacting operational efficiency.
3. Delayed Check-In Opening: Delayed check-in opening led to long queues and passenger inconvenience.
4. Luggage Delay: Early arriving passengers faced an hour-long wait for their luggage, causing frustration.
5. Baggage Staff System Access: Some baggage handlers experienced challenges logging into the system, affecting baggage processing.
6. Resource Allocation Errors: The Resource Management Systems (RMS) incorrectly assigned baggage handlers, leading to operational inefficiencies.
7. Overloading Baggage System: Check-in staff continued adding luggage beyond system capacity, causing bottlenecks.
8. Conveyor Congestion: Clogged conveyor belts resulted in extended wait times for luggage, leading to flight cancellations.
9. Baggage System Failure: Complete suspension of check-in at Terminal 5 due to baggage system failure.
10. Check-In Suspension: Long queues at the “fast bag drop” desk prompted British Airways to suspend check-in for all hold luggage by 5:00 PM.
These points highlight the series of operational failures and challenges that contributed to the crisis at Terminal 5 during its opening day at London Heathrow Airport.

On the first day of opening by 5 p.m., British Airways stopped accepting checked baggage, informing passengers they might have to travel without luggage. Those who hadn’t checked in could opt for a flight without baggage or reschedule. Unrecognized bags were manually sorted daily until March 31.

“With the benefit of hindsight, it is clear we had made some mistakes. In particular, we had compromised on the testing regime as a result of delays in completing the building programme for T5 and the fact that we compromised on the testing of the building did impact the operation at T5 on the first few days after its opening”

BA Chairman Willie Walsh

In summary, British Airways attributed the failure to detect the IT problems to inadequate testing of the system, caused by delays in BAA’s construction work. Construction was scheduled to be completed on September 17, 2007. The delays meant that BA IT staff could not begin testing until October 31. Several tests had to be canceled, and BA had to limit the scope of system tests because testing personnel could not access the entire Terminal 5 site.
Despite the setbacks, Terminal 5 was a significant investment in infrastructure and technology, aiming to enhance the passenger experience at Heathrow Airport.

 

Summary

It’s important to note that software failures are typically complex and multifaceted, making it challenging to attribute them solely to one factor like bad quality testing. In many cases, failures result from a combination of issues that may include inadequate testing, poor project management, scope changes, and communication breakdowns.

In each of these cases, software testing and quality assurance played a crucial role in the failures. Inadequate testing, insufficient quality control measures, and a lack of attention to detail were common factors. These examples emphasize the importance of comprehensive testing, effective project management, and clear communication in software development to prevent such failures in the future.

Choosing the right partner for an IT project always carries risks. Minimize them by selecting Macrix Technology Group – a company with over 23 years of experience and ISTQB-certified testers. 

Do not hesitate to contact Macrix Technology Group with your next digitalization project; we will be happy to code and test! 

 

Sources:

 

Sources for Knight Capital Group: 

https://www.sec.gov/files/litigation/admin/2013/34-70694.pdf

https://www.zerohedge.com/news/what-happens-when-hft-algo-goes-totally-berserk-and-serves-knight-capital-bill

 

Sources for the NatWest and RBS glitch: 

https://www.fca.org.uk/news/press-releases/fca-fines-rbs-natwest-and-ulster-bank-ltd-%C2%A342-million-it-failures             

https://www.wealthbriefing.com/html/article.php/UK-Regulators-Hit-RBS,-Other-Banks-With-%C2%A356-Million-Fine-For-Major-IT-Breakdown-In-2012?id=159707


 

Sources for the Ariane 5 rocket explosion: 

https://www.mathe-museum.uni-passau.de/digitale-exponate-zum-ausprobieren/ariane-5/

http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html

 

Sources for Heathrow Airport: 

https://www.zdnet.com/article/it-failure-at-heathrow-t5-what-really-happened/

https://www.youtube.com/watch?v=HFEuzPgkQTc