The importance of QA in software development: Lessons from historical IT project failures, part 2

 

In the first part of this article we shared four case studies showing how a lack of quality assurance can lead to financial consequences. In this part we have prepared even more examples to learn from.

 

  1. DIA’s automated baggage system (1995) – the monthly maintenance cost of $1 million exceeded that of a manual tug-and-trolley system
  2. Deutsche Bank (2019) – the bank’s share price fell after the bug was revealed
  3. Starbucks Point of Sale Register Outage (2015) – cost the company around $3 to $4 million
  4. The Healthcare.gov Rollout (2013) – a tough lesson for the Obama administration that cost about $2 billion
  5. The Mars Climate Orbiter (1999) – cost NASA a total of $327.6 million
  6. Y2K bug – about $100 billion spent in the US alone to fix the problem before it happened

  

1. DIA’s automated baggage system – the monthly maintenance cost of $1 million outweighed the value it provided, and in 2005 the system was abandoned.

 

Denver International Airport (DIA) is the largest airport in the US by land area, spanning 135.69 km². It handled 61.4 million passengers, making it the fifth busiest on the continent.

The baggage handling system at Denver International Airport was initially hailed as a cutting-edge marvel but turned into a notorious project failure. Originally designed to automate baggage handling throughout the airport, the system proved to be far more complex than anticipated. Delays in its construction left the airport unused for 16 months. Maintaining the empty airport and paying interest on construction loans cost the city of Denver $1.1 million per day throughout the delay.

The project’s setbacks added around $560 million to the airport’s cost and were featured in a Scientific American article titled “Software’s Chronic Crisis.” The system that was eventually implemented was a scaled-down version of the original plan, supporting only outbound flights on one concourse. All other baggage had to be handled manually because the automated system could not meet its goals. The project’s reputation was badly damaged as the media widely reported bags being ejected from their carts on sharp turns.

Even the functional part of the system never worked correctly, and by August 2005 the entire system was abandoned. The monthly maintenance cost of $1 million outweighed the value it provided, and a manual system (designed by an external expert at an additional cost of $51 million) proved in the end to be more cost-effective.

Contributing factors to this failure included underestimated complexity, insufficient time for such a complex architecture, changing requirements, underestimated budget and schedule, disregarded expert advice, a lack of backup or recovery processes for system failures, and a lack of QA.

A project of this size, complexity and risk should have had a number of independent reviews along the way, and expert assessment should have been a continual part of the project.

A YouTube video about DIA’s automated baggage system is linked in the sources at the end of this article.

2. Deutsche Bank (2019) – the bank’s share price fell after the bug was revealed

 

For almost a decade, a software glitch at Deutsche Bank prevented some potentially suspicious transactions from being flagged to law enforcement authorities, Germany’s biggest bank discovered in 2019.

According to financial circles, parameters in this program had been configured incorrectly for years, so the second verification of payments was incomplete. Deutsche Bank reported the problem to the German financial regulator BaFin and the US Federal Reserve. The bank has several IT applications that monitor payment transactions for various risks. In one of these applications, two of 121 parameters were not defined correctly.
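A misconfigured parameter is the kind of defect an automated configuration check can catch before deployment. The sketch below is only an illustration: the parameter names, ranges, and the `validate_parameters` helper are hypothetical and are not taken from Deutsche Bank's systems. It compares a deployed configuration against a specification and reports every missing, unknown, or out-of-range value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamSpec:
    """Allowed range for one monitoring parameter (hypothetical)."""
    minimum: float
    maximum: float


def validate_parameters(specs, configured):
    """Return a list of problems; an empty list means the config matches the spec."""
    problems = []
    for name, spec in specs.items():
        if name not in configured:
            problems.append(f"{name}: missing")
            continue
        value = configured[name]
        if not spec.minimum <= value <= spec.maximum:
            problems.append(
                f"{name}: {value} outside [{spec.minimum}, {spec.maximum}]"
            )
    # Parameters that are configured but not specified are suspicious too.
    for name in sorted(configured.keys() - specs.keys()):
        problems.append(f"{name}: unknown parameter")
    return problems


# Hypothetical specification and a deployed configuration with two defects.
SPECS = {
    "max_unflagged_amount": ParamSpec(0, 10_000),
    "lookback_days": ParamSpec(1, 365),
    "risk_score_threshold": ParamSpec(0, 1),
}
DEPLOYED = {
    "max_unflagged_amount": 5_000,
    "lookback_days": 0,  # out of range
    # "risk_score_threshold" is missing entirely
}

for problem in validate_parameters(SPECS, DEPLOYED):
    print(problem)
```

Running a check like this in the release pipeline makes a wrongly defined parameter a failed build instead of a silent gap in monitoring.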

While the exact monetary cost is not known, a glitch of this kind can lead to substantial financial losses through regulatory fines, legal penalties, reputational damage, and the loss of customer trust.

The news likely affected investor sentiment, and the bank’s share price began to fall after the bug was announced. Staying with payment systems, another case study comes to mind.

3. Starbucks Point of Sale Register Outage (2015) – cost the company around $3 to $4 million

 

In 2015, Starbucks had to close almost all its stores in the USA and Canada for half a day due to an internal failure during a daily system update. The incident caused financial losses and reputational damage that better quality assurance of the company’s systems could have prevented. It is estimated that the error cost the company around $3 to $4 million, plus the unaccounted expense of giving away thousands of free drinks. Of course, the customers who were served were certainly happy with the free drinks.

 

 

4. The Healthcare.gov Rollout (2013) – a tough lesson for the Obama administration that cost about $2 billion

The launch of the Healthcare.gov website, the online platform for the Affordable Care Act (Obamacare), faced multiple issues during its rollout.

The website launched on October 1, 2013, and immediately ran into numerous technical issues, including crashes, errors, slow performance, and long wait times for users trying to enroll. These problems persisted for weeks: only about 1% of interested individuals managed to enroll in the first week of operations, and users were still reporting maddeningly long wait times and broken features in the third week.

The cost of the HealthCare.gov failure was substantial: a Bloomberg analysis (listed in the sources) put the total at more than $2 billion, far beyond the original budget.

The primary reasons for its failure included poor project management, inadequate testing, and a lack of coordination among contractors.

 

 

5. The Mars Climate Orbiter (1999): loss of $327.6 million

 

The Mars Climate Orbiter, a NASA spacecraft launched in 1998 to study Mars’ atmosphere and surface changes, was lost in 1999 because of a mix-up between metric and imperial (English) units. Lockheed Martin Astronautics provided data in imperial units, while the navigation team at JPL used metric units for its calculations.

This discrepancy led to trajectory miscalculations: instead of entering orbit around Mars, the spacecraft came in too low and was destroyed in the atmosphere. The failure cost NASA a total of $327.6 million, covering spacecraft development, launch, and mission operations.

This event emphasized the importance of accurate unit conversions, effective communication among teams, and precise calculations in space missions. NASA’s investigation revealed systemic project oversight issues, highlighting the need for stringent quality control measures in aerospace engineering to prevent such errors in the future. The financial impact of this failure on NASA’s mission to study Mars’ climate and surface changes underscores the significance of ensuring precise software and team communication in space exploration endeavors.
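One common defence against this class of error is to make the unit part of the data, so that a value cannot be used without stating what it measures. The sketch below is an illustration only and is not NASA's or Lockheed Martin's code: the `Impulse` type and its unit labels are hypothetical, while the conversion factor (1 lbf = 4.4482216152605 N) is the standard one.

```python
from dataclasses import dataclass

# One pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """A thruster impulse that always carries its unit (hypothetical type)."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        """Convert to metric, refusing to guess when the unit is unknown."""
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


def total_impulse_newton_seconds(impulses):
    """Sum impulses from different sources in a single, explicit unit."""
    return sum(i.to_newton_seconds() for i in impulses)


# Data from two teams, each labelled with the unit it was produced in.
readings = [Impulse(10.0, "N*s"), Impulse(10.0, "lbf*s")]
print(total_impulse_newton_seconds(readings))
```

With bare numbers, the second reading would silently be treated as 10 N·s, understating it by a factor of about 4.45. An integration test that feeds one system's real output into the other would expose the same mismatch.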

6. Y2K bug – about $100 billion spent in the US alone to fix the problem before it happened

 

The Y2K bug is a perfect example of the value of quality assurance (QA) because it highlights the critical role of QA in preventing catastrophic failures in software and systems. Many systems stored years as two digits, so the year 2000 could be read as 1900. The bug demonstrated that even seemingly minor issues, like incorrect date calculations, can have widespread and severe consequences if left unaddressed. Businesses and organizations worldwide invested in QA processes to identify and fix Y2K-related problems, ultimately averting the chaos that could have ensued.

The Y2K bug also served as a valuable lesson in risk management. It showed the importance of thorough testing and proactive measures in the face of a potential technological crisis, and the investments made in Y2K readiness contributed to a safer and more reliable digital landscape.

In summary, the investments made to address the Y2K bug were not wasted but rather a prudent and necessary response to a global challenge. They helped avert widespread disruptions, protect critical systems, and ensure the continued functioning of businesses and governments into the new millennium.

The remediation efforts aimed to identify and fix issues before they could lead to widespread failures. This approach was more cost-effective and less disruptive than dealing with the aftermath of unaddressed problems.
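To see why two-digit years were a problem, consider the sketch below. It is a simplified illustration with hypothetical function names, not code from any real legacy system. The first function does the naive arithmetic; the others apply date windowing, one of the remediation techniques used at the time, with an assumed pivot of 50.

```python
def years_elapsed_two_digit(start_yy: int, end_yy: int) -> int:
    """Naive arithmetic on two-digit years, as in many pre-2000 systems."""
    return end_yy - start_yy


def expand_year(yy: int, pivot: int = 50) -> int:
    """Windowing: read yy below the pivot as 20yy, otherwise as 19yy.

    The pivot value of 50 is an assumption chosen for this example.
    """
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a two-digit year, got {yy}")
    return 2000 + yy if yy < pivot else 1900 + yy


def years_elapsed_windowed(start_yy: int, end_yy: int, pivot: int = 50) -> int:
    """Elapsed years after expanding both values to four digits."""
    return expand_year(end_yy, pivot) - expand_year(start_yy, pivot)


# An account opened in 1985, checked in 2000.
print(years_elapsed_two_digit(85, 0))  # -85: the account appears to be from the future
print(years_elapsed_windowed(85, 0))   # 15
```

Windowing only postpones the problem to the pivot year, which is why many systems were converted to four-digit years instead. Either way, the fix had to be found and verified by testing before the date rolled over.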

A few words from our expert

Dariusz Rudziński, Macrix Test Manager:

The examples above show how complex the topic of quality assurance is: it encompasses every stage of a project, from gathering requirements to product deployment. For this reason, we strive to apply a shift-left testing approach, meaning that testing is performed as early as possible in the software development lifecycle and at every stage. We begin our testing activities as early as the requirements gathering stage. Mistakes made at this stage can be the most expensive, as may have been the case with the Mars Climate Orbiter, where two cooperating systems operated on different units, essentially not “speaking the same language.” Surprisingly, this problem was not discovered earlier, for example during integration or acceptance testing.

With our clients, we test software using simulations, models, and test benches, which allows the entire process to be tested safely. One example is steel production, where software at various levels, from machine control hardware to HMI, must cooperate according to protocol and with the highest precision in time and space. Even a 1 mm error in the positioning of a rolling mill could cause damage worth hundreds of thousands of euros.

A properly functioning system is not everything; it must also satisfy many non-functional requirements such as performance, usability, and security. The examples described show what happens when these aspects are neglected: users could not use a system for weeks, or a banking system operated incorrectly. Personally, I like to read the abbreviation QA as Quality Awareness. Our responsibility is to inform the client and implement appropriate testing processes so that the final products meet the defined functional and non-functional requirements.

Our testers become an integral part of project teams. They collaborate daily with developers, business analysts, and clients. Only a thorough understanding of the requirements, the client’s needs, and the entire system makes it possible to ensure the appropriate level of quality. Of course, we are aware that testing everything is rarely feasible in terms of time and cost, so we pay close attention to risk analysis and base our actions on its results.
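A simple way to picture risk-based prioritization is to score each area of a system by likelihood and impact of failure and spend the testing budget on the highest scores first. The sketch below is a generic illustration, not a description of Macrix's actual process: the `FeatureArea` type, the 1 to 5 scales, and the example scores are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureArea:
    """One area of the system with assumed risk scores."""
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain)
    impact: int      # 1 (negligible) to 5 (critical)

    @property
    def risk(self) -> int:
        """Classic risk score: likelihood multiplied by impact."""
        return self.likelihood * self.impact


def prioritize(areas, budget):
    """Return the `budget` highest-risk areas, highest first (ties by name)."""
    ranked = sorted(areas, key=lambda a: (-a.risk, a.name))
    return ranked[:budget]


# Hypothetical scores for three areas of an application.
areas = [
    FeatureArea("reporting", likelihood=2, impact=2),
    FeatureArea("payments", likelihood=3, impact=5),
    FeatureArea("login", likelihood=4, impact=4),
]

for area in prioritize(areas, budget=2):
    print(area.name, area.risk)
```

In practice the scores come from workshops with developers, analysts, and the client, and they are revisited as the project and its risks change.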

Finding the right partner for an IT project always carries risks. Minimize them by choosing Macrix Technology Group. With over 23 years of experience and ISTQB certification, we’re the right choice for your project. Don’t hesitate to reach out for your next digitalization endeavor. We’re ready to code and test for you!

 

Source for DIA’s automated baggage system:

https://www.youtube.com/watch?v=xmas0-SthUQ

https://www5.in.tum.de/~huckle/DIABaggage.pdf

https://calleam.com/WTPF/wp-content/uploads/articles/DIABaggage.pdf

 

Source for Deutsche Bank glitch:

https://www.ft.com/content/d537f416-7c71-11e9-81d2-f785092ab560

https://www.reuters.com/business/finance/fed-fines-deutsche-bank-186-mln-insufficient-progress-addressing-anti-money-2023-07-19/

https://www.sueddeutsche.de/wirtschaft/deutsche-bank-it-panne-1.4456987

 

 

Source for Y2K bug:

https://www.gao.gov/assets/aimd-00-290.pdf

https://dl.acm.org/doi/pdf/10.1145/572199.572205

https://education.nationalgeographic.org/resource/Y2K-bug/

 

Source for Starbucks’ glitch:

https://stories.starbucks.com/stories/2015/starbucks-point-of-sale-register-outage-resolved/

https://www.geekwire.com/2015/starbucks-lost-millions-in-sales-because-of-a-system-refresh-computer-problem/

 

Source for Healthcare.gov rollout:

https://www.bloomberg.com/news/articles/2014-09-24/obamacare-website-costs-exceed-2-billion-study-finds

https://oig.hhs.gov/oei/reports/oei-03-14-00231.asp

https://hackernoon.com/small-is-beautiful-the-launch-failure-of-healthcare-gov-5e60f20eb967

https://www.appdynamics.com/blog/product/technical-deep-dive-whats-impacting-healthcare-gov/

https://www.gao.gov/products/gao-14-694