Over the past two years, many employees have experienced a moment that rarely registers as a major event: the language model refuses service with 'Service temporarily unavailable.' The task that was supposed to take ten minutes suddenly becomes a problem. An hour later, everything runs again, and the incident is quickly forgotten.

These moments are more than anecdotes. They lead to a question that deserves, and in fact demands, greater attention: 'What actually happens if the AI we have built our work processes around suddenly becomes unavailable, not for an hour, but for a day, a week, a month?'

This question presents itself in three very different forms, and all three are real. The first is 'technical': What if a provider fails, a data center burns down, a supply chain breaks? The second is 'political': What if a provider restricts access for political reasons, deliberately and selectively, as leverage? The third is the least visible: 'What if the system continues to function, but its outputs are being manipulated without anyone noticing?'

All three forms lead to the same answer, and that is: resilience. But what resilience means in the age of AI is significantly more demanding than what we are used to from classical IT. That is precisely what we want to work through in the following.

Outages are no longer hypothetical

Before we get to the strategic question, it is worth looking at what has already happened. ChatGPT experienced several major outages in 2025 that may each look like an isolated incident but collectively reveal a pattern.

On 10 June 2025, the service went down worldwide for twelve hours. Several European data centers were in maintenance at the same time, which overloaded the remaining nodes. Europe was particularly hard hit because the outage occurred during normal working hours. On 3 September 2025, the next major outage followed, also at the beginning of the European business day. On 18 November 2025, a Cloudflare outage took ChatGPT, Sora, and several other AI services down with it. In December, another major outage followed.

The official availability of about 99 percent sounds reassuring. But 1 percent of the 8,760 hours in a year is approximately 87 hours, so that much downtime per year is still within the promise. If you have structured your business processes so that they become significantly slower without AI, you must factor in up to roughly 87 hours of economic damage per year, a cost that is rarely calculated explicitly. Observers also report that the average downtime of major AI services has increased significantly in recent quarters, because usage is growing faster than the infrastructure behind it.
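The arithmetic behind that figure is easy to reproduce for other availability levels. The following is a minimal sketch; the function name and the additional percentages are illustrative assumptions, not figures from any provider's service agreement.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def max_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service can be down while still meeting the stated availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# 99 percent is the figure from the text; the other levels are for comparison only.
for pct in (99.0, 99.9, 99.99):
    print(f"{pct} % availability allows {max_downtime_hours(pct):.1f} hours of downtime per year")
```

Multiplying the result by an estimated cost per hour of stalled work turns the abstract percentage into a budget figure that can be weighed against the cost of redundancy.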

The political side: what seemed unthinkable until recently

Until early 2025, I would not have accepted the following observation as a realistic scenario.

When AI services become 'strategic infrastructure' (no longer merely useful tools, but a prerequisite for the productivity of entire industries), their availability becomes a political lever. Whoever controls access controls productivity.

Here lies a characteristic that classical sanctions logic has rarely shown so clearly: 'A carefully targeted temporary outage can act like a tactical nuclear weapon.' A few hours, precisely placed, can cause considerable damage without requiring broad-based escalation. Maximum impact with minimal visibility. This very characteristic makes the lever politically attractive: it is granular, it can be deployed selectively, and its intentional use is easy to deny. Was it a technical glitch? Maintenance? A rate limit? From the outside, it is hardly possible to determine reliably whether it was a mistake or intent.

Concretely, this means: it has become conceivable that a US provider would restrict services in specific European countries for political reasons, that certain industries would be deliberately underserved, that model quality would be throttled for European customers while US customers retain full access. Export-heavy European industrial sectors currently undergoing massive shifts toward AI-driven development are obvious targets. Whoever wants to apply pressure here to favor American competitors now has a lever that did not exist until recently.

This sounds like a conspiracy theory, but structurally it is not unusual. Semiconductor export controls work exactly that way. SWIFT sanctions work that way. The question is not 'whether' such levers are used, but 'when.' Whoever does not factor this into their resilience planning is planning incompletely.

Why old backup concepts no longer work

Anyone who grew up in classical IT understands the logic of disaster recovery. Redundant systems at a second location, regular tests, documented emergency plans. This works excellently for classical IT infrastructure. For AI tools, it works only in limited ways, and the reason lies in an aspect that is often underestimated.

AI models are not arbitrarily interchangeable. If you have built a workflow around GPT-4, you cannot simply deploy Mistral or Aleph Alpha and expect the same results. The models have different strengths, different weaknesses, different response patterns, different security mechanisms. A prompt optimized for one model often delivers worse results with another. Entire toolchains (retrieval pipelines, agent systems, model orchestrations) are adapted to specific API interfaces and model peculiarities.

This means: even if you set up a 'backup provider,' you do not simply have a functionally equivalent replacement in an emergency. You have a second variant that completes 'similar' tasks, but with different quality, different behavior, different effort. The switch in a crisis is not a button press but an adaptation project, and the time for such a project is exactly what is missing when it is needed.

Added to this is a phenomenon that deserves more attention in resilience discussions: the 'eroded baseline capability.' Anyone who has worked with AI assistance for three years has often lost their own routines. The marketing employee who drafts every text with ChatGPT writes more slowly unaided than three years ago. The developer who generates code has never fully understood some of the frameworks involved. This is not laziness but specialization. It does mean, however, that the manual fallback that was once taken for granted can no longer be assumed. Anyone basing their backup plan on 'we'll do it by hand if necessary' is probably overestimating how well trained that hand still is.

AI companies under political pressure

Another development of recent months deserves special attention because it puts the political aspects around availability in a new light: major AI providers are increasingly coming under direct pressure from their home governments. This is not speculative. It is documented, and it is happening in various constellations.

In July 2025, the US Department of Defense awarded contracts worth up to 200 million dollars per company to Anthropic, Google, OpenAI, and xAI to help the Pentagon rapidly upgrade to advanced AI capabilities. Reactions were mixed: OpenAI's deal triggered substantial public backlash, including waves of uninstalls. Just a few months later, on 27 February 2026, US President Trump issued a directive ordering federal agencies to cease using Anthropic products, with a six-month transition period. The background: a dispute over acceptable use policies that restricted certain Pentagon applications. The Trump administration also revoked Biden's AI Safety Framework in January 2025 and announced in December 2025 that federal AI contracts would be tied to 'ideological neutrality.'

Parallel to this, there is a 'lobbying escalation' that is historically unprecedented. Anthropic's lobbying spending increased year-on-year by over 340 percent and exceeded OpenAI's spending for the first time. Anthropic allocated 20 million dollars to a political action committee defending state-level regulation. An OpenAI-aligned counter-movement, 'Leading the Future,' has raised over 125 million dollars since August 2025, aimed at advancing federal deregulation. The Trump administration is subsidizing the interest group 'Innovation Council Action' with 100 million dollars. The tech industry spent over 50 million dollars on AI lobbying alone in the first nine months of 2026.

What does this mean for a European company using one of these providers? 'The provider is no longer merely a service provider. It is a political actor in a conflict over which it has no control.' If a US government decides tomorrow that certain AI applications can only be exported under specific conditions, this affects the providers, but above all their international customers. If a provider is forced through government pressure to change its acceptable use policy, that change applies globally. If a model is retrained in the US for political reasons, this has consequences for every customer who relied on the old behavior.

This observation expands the resilience question into a new dimension. It is no longer just about what happens if the provider fails. It is also about what happens if the provider is forced by external political forces to change its acceptable use policy, model quality, regional availability, or data practices. Whoever relies on a single provider is coupling themselves to that provider's political situation. This was not an issue a few years ago. Today it is one that should be explicitly named in an organization's risk architecture.

The knowledge shadow: attacks that do not destroy, and are dangerous precisely because of that

All previous considerations have revolved around 'availability': the system is there or it is not. But there is another attack class that is neither outage nor political restriction. It is called an 'integrity attack.' The system continues to work. It delivers answers. It seems reliable. But its outputs are manipulated without anyone noticing.

This attack class moved from theoretical risk to documented practice in 2025. In June 2025, security researchers disclosed 'EchoLeak,' the first zero-click prompt injection in a production AI system: a vulnerability in Microsoft 365 Copilot (CVE-2025-32711) that allowed an attacker to exfiltrate confidential company data simply by sending an email. No user interaction needed, no click, no attachment. The AI itself executed the manipulation because it treated hidden instructions in an incoming email as part of its own directives. Microsoft had to issue emergency patches. NIST now calls indirect prompt injection 'the largest security vulnerability of generative AI,' OWASP lists it as number 1 of LLM threats for 2025, and in the 2026 list for agentic applications, 'Agent Goal Hijacking' ranks at the top.

EchoLeak is just one example. In March 2026, Oasis Security demonstrated a comparable attack against claude.ai. An analysis of the Moltbook AI agent network found that 2.6 percent of posts in that network contained hidden prompt injection payloads, the first large-scale observation of bot-to-bot manipulation in a production environment.

What these examples have in common is that they 'destroy nothing.' The service keeps running. The employee gets an answer. The answer seems plausible. It is just wrong: manipulated, slightly shifted, with a subtle error that has consequences weeks later. A sales answer that recommends an important competitor. A price calculation systematically shifted downward. A code snippet with a built-in vulnerability that will surface in the next audit, if there even is an audit. A market analysis with fabricated numbers serving as the basis for a strategic investment decision.

This is what makes these attacks so effective: they operate in a 'knowledge shadow.' By this I mean that zone of organizational reality where neither the company nor the provider routinely checks whether the system's outputs are correct, simply because both sides assume the AI is working. The employee does not scrutinize the answer in detail because they expect quality. The provider does not check each individual answer because it cannot see the context and scope in which the answer is used. It is precisely in this shadow that damage accumulates.

Three characteristics make integrity attacks significantly more dangerous than availability attacks. 'First, they are hard to detect.' An outage is immediately visible; subtle manipulation often never is. 'Second, they are hard to attribute.' Was that a hallucination? A model update? User error? A targeted attack? Root cause analysis is significantly more labor-intensive than for a classical cyber incident. 'Third, the damage accumulates quietly.' An outage has a clear beginning and a clear end. Manipulated outputs accumulate over weeks or months: incorrect strategic decisions, faulty code, wrong customer classifications, creeping competitive disadvantages.

This opens an entirely new class of economic attack vectors that no classical cyber insurance covers and that does not appear in most risk analyses. A competitor manipulating an LLM-based sales assistant. A state actor subtly weakening code generation across an industry. An extortionist gradually building small errors into business processes and then demanding payment. These scenarios sound extreme, but I believe they are only a matter of time because the tools for them already exist. EchoLeak was not a theoretical study. It was a working exploit against a production platform with over 10,000 enterprise customers.

Whoever reduces resilience to availability loses this dimension from their field of vision.

What genuine redundancy actually costs in the AI era

Building genuine AI resilience has a price, and that price is higher than most organizations calculate.

A meaningful redundancy strategy requires at least three things. 'First,' multiple providers in parallel active operation, not as a theoretical option but as tested practice. This doubles or triples the ongoing API costs because the same task must be prepared in multiple systems. 'Second,' an abstraction layer in your own software that makes the provider interchangeable. This is technically feasible, but it requires engineering effort and ongoing maintenance: someone must maintain this layer, review each model update, test compatibility. 'Third,' regular failover exercises where the switch is actually tested. Without these exercises, redundancy is merely theory.

For an SME with limited budget, this is a genuine challenge. Not impossible, but expensive enough that decision makers think twice. The temptation to rely on 'they won't fail anyway' is great, and most of the time you would be right.
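To make the abstraction layer less abstract, here is a minimal sketch of ordered failover between providers. Everything in it is an illustrative assumption: `primary_provider` and `backup_provider` are hypothetical stand-ins for real API clients, and a provider is modeled as nothing more than a function that takes a prompt and returns text.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a provider is modeled as a plain function, prompt in, answer text out.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when no configured provider could answer."""


def complete_with_failover(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, answer) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real layer would catch provider-specific errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real API clients.
def primary_provider(prompt: str) -> str:
    raise ConnectionError("Service temporarily unavailable")


def backup_provider(prompt: str) -> str:
    return f"[backup] answer to: {prompt}"


name, answer = complete_with_failover(
    "Summarize the contract",
    [("primary", primary_provider), ("backup", backup_provider)],
)
print(name, answer)
```

The sketch only covers the routing. Because prompts rarely transfer unchanged between models, a realistic layer would also keep a separately tested prompt variant per provider, which is where most of the maintenance effort goes.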

But: the question is not whether you are right most of the time. The question is what happens in the hours when you are wrong. If those hours determine the quarterly result, cost a lost customer contract, or sabotage a market launch, then the resilience investment is not overkill, but simply insurance against a known risk.

An honest threshold: which SMEs are particularly vulnerable?

Not every organization is equally vulnerable. Three criteria help with self-assessment.

'First, integration depth.' Whoever uses AI only sporadically for individual tasks (occasional text writing or coding help) has little to fear from an AI blackout. Whoever has built workflows around AI (sales pipelines with AI pre-qualification, customer service with AI bots, development processes with AI-assisted coding) is much more vulnerable. The outage does not affect one task, but the entire process.

'Second, time criticality.' Some tasks tolerate a few hours of delay, others do not. Whoever operates customer support with guaranteed response times has no buffer time. Whoever runs applicant pre-qualification over weeks can weather a pause. Time criticality determines how painful every lost hour is.

'Third, task substitutability.' Some AI tasks can be done manually if necessary, others cannot. Whoever writes texts with AI can write them without AI: slower, but possible. Whoever uses AI to generate embeddings for vector search has no search function without the provider. Substitutability determines how far the damage reaches.

Whoever ranks high in all three categories (deep integration, high time criticality, low manual substitutability) should treat resilience not as optional luxury but as operational necessity. Whoever ranks low in all three can pragmatically live with the risk.
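The three criteria can be written down as a simple self-assessment per process. The sketch below is one possible encoding, not a validated scoring model: the 1-to-3 scale, the field names, and the 'medium' bucket for mixed profiles are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProcessProfile:
    name: str
    integration_depth: int  # 1 = sporadic use, 3 = whole workflow built around AI
    time_criticality: int   # 1 = tolerates long delays, 3 = guaranteed response times
    manual_difficulty: int  # 1 = easily done by hand, 3 = no manual substitute


def vulnerability(profile: ProcessProfile) -> str:
    """Classify a process: 'high' if all three criteria are at 3, 'low' if all are at 1."""
    scores = (
        profile.integration_depth,
        profile.time_criticality,
        profile.manual_difficulty,
    )
    if all(score == 3 for score in scores):
        return "high"
    if all(score == 1 for score in scores):
        return "low"
    return "medium"  # assumption: mixed profiles need a case-by-case decision


# Hypothetical example processes.
for process in (
    ProcessProfile("customer support bot", 3, 3, 3),
    ProcessProfile("occasional text drafting", 1, 1, 1),
    ProcessProfile("applicant pre-qualification", 3, 1, 2),
):
    print(f"{process.name}: {vulnerability(process)}")
```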

What can be learned from other resilience discussions

SME decision makers are familiar with the resilience question from other areas. Supply chain shocks during COVID, the energy question after Russia's attack on Ukraine, the semiconductor bottlenecks of recent years: all showed what happens when critical inputs are concentrated with one supplier and no alternatives exist.

The lessons from these crises are transferable. First: concentration with one supplier is an efficiency-optimized decision for normal times and existentially threatening in crises. Second: redundancy costs in normal operations but pays off in rare events whose probability we systematically underestimate. Third: resilience cannot be built during a crisis; it must be developed beforehand.

These lessons apply to AI tools just as much. Whoever relies on a single provider today because it is the best is optimizing for efficiency. Whoever operates multiple providers in parallel is optimizing for resilience. Both strategies are legitimate, but they have different consequences when the situation changes.

What organizations should do now, concretely

This analysis suggests five straightforward steps any organization can take regardless of size.

'Step one: a criticality assessment.' Which AI-powered processes are operationally critical, which are not? This list does not need to be perfect, but it must exist. It is the foundation for any resilience investment.

'Step two: an outage simulation.' What would actually happen if the main providers were unreachable for 24 hours? Which customer promises would be endangered, which internal processes would stall, how much time would we have before serious problems emerged? This simulation does not need to be elaborate; a workshop with relevant departments is sufficient.

'Step three: at least one second provider in active operation.' Not as a theoretical option, but as ongoing practice. A European provider alongside OpenAI or Anthropic. Even if the quality is slightly lower, whoever has never practiced the switch cannot execute it in a crisis.

'Step four: architecture decisions that enable interchangeability.' Whoever builds a new AI application should not couple it directly to one provider's API. An abstraction layer is not technical luxury but a strategic decision. It costs somewhat more initially but saves multiples in a crisis.

'Step five: integrity verification as an ongoing process.' This is the consequence of the knowledge shadow problem. Whoever feeds AI outputs into business-critical processes should spot-check those outputs, build in plausibility controls, and monitor for anomalous patterns. In classical IT, no one checks the correctness of every individual computation; with AI, you should do so in risk areas because the error characteristics are different. A pragmatic minimum measure: for every process that builds on AI output and has business consequences, a regular four-eyes review of a sample.

These five steps are the AI equivalent of what organizations already do with supply chains and energy. What is missing is the transfer of this resilience logic to AI infrastructure, often because the necessity has not yet been felt.
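Two of the measures from step five, a random sample for four-eyes review and a plausibility control, can be sketched in a few lines. The function names, the 5 percent sample rate, and the 15 percent price tolerance are hypothetical values chosen for illustration; suitable values depend on the process.

```python
import random
from typing import List, Optional


def select_for_review(
    record_ids: List[str], sample_rate: float, seed: Optional[int] = None
) -> List[str]:
    """Draw a random sample of AI-generated records for manual four-eyes review."""
    if not record_ids:
        return []
    rng = random.Random(seed)
    # Always review at least one record, never more than exist.
    sample_size = min(len(record_ids), max(1, round(len(record_ids) * sample_rate)))
    return sorted(rng.sample(record_ids, sample_size))


def price_is_plausible(
    ai_price: float, reference_price: float, tolerance: float = 0.15
) -> bool:
    """Plausibility control: reject AI prices too far from a trusted reference price."""
    return abs(ai_price - reference_price) <= tolerance * reference_price


quotes = [f"quote-{i:03d}" for i in range(100)]
print(select_for_review(quotes, sample_rate=0.05, seed=42))
print(price_is_plausible(ai_price=92.0, reference_price=100.0))  # within tolerance
print(price_is_plausible(ai_price=70.0, reference_price=100.0))  # flagged for review
```

A fixed seed makes the sample reproducible for audit purposes. A systematically downward-shifted price calculation of the kind described earlier would be caught by the second check only if the reference price comes from a source the AI system cannot influence.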

The necessity of continuous monitoring

There is one aspect that distinguishes the resilience discussion from the pure backup question, and it is closely related to the knowledge shadow. With classical IT outages, recovery is a technical problem: the system is there, the data is there, you just need to turn it back on. With AI outages, the service may be running again while 'model quality' has changed. Providers modify their models without always communicating this transparently. What generated a specific answer yesterday can generate a different one tomorrow. Whoever has built processes on the assumption that the model behaves consistently can have problems even when it is technically available.

This circumstance is not in itself an attack. But it creates the same damage class as an integrity attack: changes in system behavior that are not immediately visible but accumulate. Whoever understands this draws two conclusions. First: AI resilience is an ongoing process, not a project with an endpoint. Second: investment in architecture decisions and integrity verification that make system behavior observable is cheaper in the long run than repairing cumulative damage that is only discovered months later.
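One way to make such behavior changes observable is to keep a small set of reference prompts with stored baseline answers and to compare fresh answers against them at regular intervals. The sketch below uses plain text similarity from Python's standard library as the comparison. That is a deliberately crude assumption: model answers vary legitimately in wording, so a real check would more likely test for specific facts or structure. The 0.8 threshold and the example answers are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Text similarity between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


def detect_drift(
    baseline: Dict[str, str], current: Dict[str, str], threshold: float = 0.8
) -> List[str]:
    """Return the reference prompts whose current answer deviates from the stored baseline."""
    return [
        prompt
        for prompt, old_answer in baseline.items()
        if similarity(old_answer, current.get(prompt, "")) < threshold
    ]


# Hypothetical baseline answers recorded earlier, and answers collected today.
baseline = {
    "vat-rate": "The standard VAT rate is 19 percent.",
    "capital": "Paris is the capital of France.",
}
current = {
    "vat-rate": "The standard VAT rate is 19 percent.",
    "capital": "I cannot help with that request.",
}
print(detect_drift(baseline, current))
```

Run on a schedule, a check like this does not say why an answer changed, but it turns a silent model update into a visible event that someone can investigate.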

What remains is an honest question every business decision maker should answer for themselves: if AI went down this afternoon for 48 hours, who in the organization would know what to do? And additionally, perhaps more importantly: if AI has been delivering slightly manipulated answers for weeks, who in the organization would notice at all? If the answer to either question is 'no one,' then resilience is no longer a theoretical exercise but homework that is already overdue.