On May 28, 2026, between 16:18 UTC and 17:53 UTC, Lucidworks AI hosted LLMs were unavailable in the us-southcarolina region. Customers using hosted LLM inference (llama-3-8b-instruct, llama-3v2-3b-instruct, and phi-4-multimodal-instruct) received 500 or 429 errors when attempting to query these models. Other SaaS Platform services, including search, embedding models, and the Lucidworks Platform UI, were not affected.
Lucidworks Engineering declared a Sev1 incident at 16:54 UTC and began remediation efforts. All hosted LLMs were fully restored and operational at 17:53 UTC.
The incident was caused by a routine Kubernetes patch upgrade on a cluster in the us-southcarolina region. LWAI-hosted models are served via Ray Serve, which uses both “head” and “worker” nodes as part of its deployment system for routing inference requests. The Kubernetes upgrade cycled node pools, causing all cluster head nodes and worker nodes to restart simultaneously. Under normal conditions, the Ray cluster can tolerate a head node restart because worker nodes continue serving requests. However, node pool upgrades use a surge strategy in which our platform waits for pods to be evicted from the old node but does not wait for them to be running on the new node. The platform considers the “drain” successful once the pod is gone from the old node and moves on to the next node, even if the pod is stuck in an “Initializing” state on the new node. As a result, all node pools were cycled in rapid succession, and both head and worker pods were evicted before any had finished initializing on their replacement nodes, resulting in a complete cluster outage.
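The difference between the two drain conditions can be illustrated with a small simulation. This is a minimal sketch, not the platform's actual upgrade logic: the function name, the tick-based timing, the pod names, and the initialization time of 30 ticks are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    ready_at: int = 0  # tick at which the pod is serving again


def simulate_upgrade(pod_names, init_ticks, wait_for_ready):
    """Return the lowest number of serving pods seen during a rolling upgrade.

    Hypothetical model: each eviction takes one tick, and an evicted pod
    needs init_ticks more ticks before it serves traffic on its new node.
    """
    pods = [Pod(name) for name in pod_names]
    tick = 0
    min_ready = len(pods)
    for pod in pods:
        # Evict the pod from its old node; it begins initializing elsewhere.
        tick += 1
        pod.ready_at = tick + init_ticks
        ready = sum(p.ready_at <= tick for p in pods)
        min_ready = min(min_ready, ready)
        if wait_for_ready:
            # Stricter drain: do not move on until the replacement is serving.
            tick = pod.ready_at
    return min_ready


pods = ["head", "worker-1", "worker-2"]
print(simulate_upgrade(pods, init_ticks=30, wait_for_ready=False))
print(simulate_upgrade(pods, init_ticks=30, wait_for_ready=True))
```

In this model, proceeding as soon as a pod is evicted lets every pod be in the initializing state at once, while waiting for readiness keeps at most one pod out of service at any time.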
Recovery was prolonged by several compounding factors. First, one of the replacement head nodes was in a degraded state and unable to pull container images, requiring manual intervention to delete the node. Second, the LLM container images (6-11 GB in size) experienced abnormally slow Docker image transfer, taking 30-52 minutes compared to the typical 2-4 minutes observed in normal operation. Third, in a separate operation, new models were being brought online to expand our LWAI offering. Because of this, the Ray operator's blue-green deployment strategy required both the existing and the replacement LLM deployments to be healthy before switching traffic, which extended the outage until the slow image pulls had completed on both the blue and green deployments.
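The effect of the blue-green health requirement on recovery time can be sketched as follows. This is an illustrative model only, not the Ray operator's API: the function names are hypothetical, and the assignment of the 30-minute and 52-minute pull times to specific deployments is an assumption.

```python
def can_switch_traffic(healthy):
    """Traffic may switch only when every deployment reports healthy."""
    return all(healthy.values())


def minutes_until_switch(pull_minutes):
    """With an all-healthy gate, the slowest image pull sets the wait."""
    return max(pull_minutes.values())


# Hypothetical pull times, taken from the 30-52 minute range reported above.
pull_minutes = {"blue": 30, "green": 52}

print(can_switch_traffic({"blue": True, "green": False}))
print(minutes_until_switch(pull_minutes))
```

Under this gate, a fast pull on one deployment does not shorten the outage; service resumes only after the slowest deployment becomes healthy.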
Lucidworks Engineering deleted the degraded node, waited for image pulls to complete on replacement nodes, and verified that all hosted models were responding to queries. The incident was verified as resolved at 18:07 UTC.
Lucidworks will take the following actions as a result of this incident:
Lucidworks recommends that clients subscribe to Lucidworks status updates to receive real-time notifications about Lucidworks SaaS Platform incidents. To enable this feature, click Subscribe to Updates at status.lucidworks.com.