
SAN FRANCISCO (Source: TechCrunch) - OpenAI suffered one of the longest outages in its history on Wednesday, affecting its popular AI tools, including ChatGPT, its video generator Sora, and its developer API. The disruption began around 3 p.m. Pacific Time, leaving users unable to access these services for several hours.
The company confirmed the problem and began working on a fix shortly after the issue was identified. Users experienced intermittent outages throughout the afternoon. Affected services slowly began coming back online around 7 p.m. PT, and by 9 p.m. PT OpenAI had mostly restored functionality.
The outage coincided with the launch of OpenAI’s integration with Apple’s iOS 18.2, causing some confusion among users who reported issues with ChatGPT on Apple devices. However, OpenAI clarified that the outage was unrelated to the integration or any holiday-related promotions, including its “12 Days of OpenAI” event.
In a detailed postmortem released on Thursday, OpenAI explained that the disruption was caused by a new telemetry service it had deployed to collect Kubernetes metrics. Kubernetes is an open-source system for managing containers, which package applications so they run in isolated environments. The new service’s configuration unintentionally triggered resource-intensive operations that overwhelmed OpenAI’s Kubernetes API servers, taking down much of its infrastructure. The problem was compounded by OpenAI’s use of DNS caching, which delayed full visibility into the issue.
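For readers curious how DNS caching can delay visibility into a failure, the following is a minimal illustrative sketch, not OpenAI's actual setup. It assumes a toy resolver that caches answers for a fixed time-to-live (TTL); the names `CachingResolver`, `ResolverDown`, and `first_visible_failure`, the hostname, and the address are all hypothetical. While a cached record is still valid, lookups keep succeeding even though the backend that answers them has already gone down.

```python
from dataclasses import dataclass, field


class ResolverDown(Exception):
    """Raised when the simulated DNS backend cannot answer (hypothetical)."""


@dataclass
class CachingResolver:
    """Toy resolver that caches each answer for `ttl` seconds (an assumption)."""
    ttl: float
    backend_up: bool = True
    _cache: dict = field(default_factory=dict)

    def resolve(self, name: str, now: float) -> str:
        entry = self._cache.get(name)
        if entry is not None and now < entry[1]:
            # Cache hit: the backend is never consulted, so an outage stays hidden.
            return entry[0]
        if not self.backend_up:
            raise ResolverDown(name)
        address = "10.0.0.1"  # placeholder address for illustration only
        self._cache[name] = (address, now + self.ttl)
        return address


def first_visible_failure(resolver, name, outage_at, horizon):
    """Look up `name` once per second; return the first second a lookup fails."""
    for now in range(horizon):
        if now >= outage_at:
            resolver.backend_up = False  # simulate the backend going down
        try:
            resolver.resolve(name, float(now))
        except ResolverDown:
            return now
    return None


if __name__ == "__main__":
    # Backend fails at t=5, but the cached record masks it until the TTL expires.
    print(first_visible_failure(CachingResolver(ttl=20.0), "svc.internal", 5, 60))
    # With no caching, the failure is visible as soon as it happens.
    print(first_visible_failure(CachingResolver(ttl=0.0), "svc.internal", 5, 60))
```

In this toy model the failure at second 5 only surfaces at second 20 when records are cached for 20 seconds, which mirrors in miniature how caching can make a rollout look healthy for a while after the underlying problem has begun.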
The company said that while it detected the issue minutes before customers began experiencing problems, it was unable to implement a quick fix due to the overwhelmed Kubernetes servers. OpenAI acknowledged that the incident was a result of multiple systems and processes failing simultaneously, and that its tests did not catch the impact of the new service on its infrastructure.
In response, OpenAI promised to adopt several measures to prevent similar incidents in the future, including improving monitoring systems and ensuring better access to its Kubernetes servers in emergency situations.
“We apologize for the impact that this incident caused to all of our customers—from ChatGPT users to developers to businesses who rely on OpenAI products,” the company said in the postmortem. “We’ve fallen short of our own expectations.”
The outage followed a separate disruption to Meta’s products earlier in the day, raising concerns about the reliability of major tech platforms during a time of increased online activity. OpenAI’s explanation of the technical failure offers insight into the complexities of maintaining large-scale AI infrastructure and the challenges posed by new service deployments.