Lucidworks AI Hosted LLM Service Disruption

Incident Report for Lucidworks Platform

Postmortem

Summary

On May 28, 2026, between 16:18 UTC and 17:53 UTC, Lucidworks AI hosted LLMs were unavailable in the us-southcarolina region. Customers using hosted LLM inference (llama-3-8b-instruct, llama-3v2-3b-instruct, and phi-4-multimodal-instruct) received 500 or 429 errors when attempting to query these models. Other SaaS Platform services, including search, embedding models, and the Lucidworks Platform UI, were not affected.

Lucidworks Engineering declared a Sev1 incident at 16:54 UTC and began remediation efforts. All hosted LLM models were fully restored and operational at 17:53 UTC.

Root Cause

The incident was caused by a routine Kubernetes patch upgrade on a cluster in the us-southcarolina region. LWAI-hosted models are served via Ray Serve, which uses both “head” and “worker” nodes as part of its deployment system for routing inference requests. The Kubernetes upgrade cycled node pools, causing all cluster head nodes and worker nodes to restart simultaneously. Under normal conditions, the Ray cluster can tolerate a head node restart because worker nodes continue serving requests. However, the node pool upgrades use a surge strategy in which our platform waits for pods to leave the old node (be evicted) but does not wait for them to be running on the new node. The platform considers the “drain” successful once the pod is gone from the old node and moves on to the next node, even if the pod is stuck in an “Initializing” state on the new node. As a result, all node pools were cycled in rapid succession, and both head and worker pods were evicted before any had finished initializing on their replacement nodes, resulting in a complete cluster outage.
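The difference between waiting for eviction and waiting for readiness can be illustrated with a small simulation. This is a simplified sketch, not the platform's actual upgrade logic: the Pod class, the min_ready_during_upgrade function, and the timing values are hypothetical and chosen only to show the effect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    name: str
    startup_minutes: int  # hypothetical time to become Ready on a new node


def min_ready_during_upgrade(
    pods: List[Pod], drain_minutes: int, wait_for_ready: bool
) -> int:
    """Drain pods one after another and return the lowest number of Ready
    pods seen immediately after any eviction."""
    clock = 0
    ready_at = {p.name: 0 for p in pods}  # every pod is Ready at t=0
    lowest = len(pods)
    for pod in pods:
        # Evicting the pod takes drain_minutes; it then needs
        # startup_minutes on the replacement node before it is Ready.
        clock += drain_minutes
        ready_at[pod.name] = clock + pod.startup_minutes
        ready_now = sum(1 for t in ready_at.values() if t <= clock)
        lowest = min(lowest, ready_now)
        if wait_for_ready:
            # Readiness-gated drain: do not move on until the pod is Ready.
            clock = ready_at[pod.name]
    return lowest


pods = [Pod("head", 30), Pod("worker-1", 30), Pod("worker-2", 30)]
eviction_only = min_ready_during_upgrade(pods, drain_minutes=1, wait_for_ready=False)
readiness_gated = min_ready_during_upgrade(pods, drain_minutes=1, wait_for_ready=True)
print(eviction_only, readiness_gated)
```

With these illustrative numbers, the eviction-only drain reaches zero Ready pods, while the readiness-gated drain never drops below two.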

Recovery was prolonged by multiple compounding factors. First, one of the replacement head nodes was in a degraded state and unable to pull container images, requiring manual intervention to delete the node. Second, the LLM container images (6-11 GB in size) experienced abnormally slow Docker image transfer, taking 30-52 minutes compared to the typical 2-4 minutes observed in normal operation. Third, a separate operation was bringing new models online to expand our LWAI offering. Because of this, the Ray operator's blue-green deployment strategy required both the existing and the replacement LLM deployments to be healthy before switching traffic, which extended the outage until the slow image pulls completed on both the blue and green deployments.
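The effect of the blue-green health requirement on recovery time can be sketched as follows. This is an illustration under stated assumptions, not the Ray operator's implementation: the earliest_switch function is hypothetical, and assigning the 30-minute and 52-minute pull times to particular deployments is illustrative only, since the report gives only the overall range.

```python
from typing import Dict, List, Tuple


def earliest_switch(
    pull_minutes: Dict[str, List[int]]
) -> Tuple[int, Dict[str, int]]:
    """Return the minute at which traffic can switch, plus the minute at
    which each deployment becomes healthy.

    Assumes a deployment is healthy once its slowest image pull finishes,
    and that traffic switches only when every deployment is healthy.
    """
    healthy_at = {name: max(pulls) for name, pulls in pull_minutes.items()}
    return max(healthy_at.values()), healthy_at


# Illustrative assignment of the reported 30-52 minute range.
slow_switch, _ = earliest_switch({"blue": [30], "green": [52]})
# Illustrative assignment of the typical 2-4 minute range.
typical_switch, _ = earliest_switch({"blue": [2], "green": [4]})
print(slow_switch, typical_switch)
```

Because the gate waits for every deployment, the slowest pull across both deployments sets the recovery time.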

Lucidworks Engineering deleted the degraded node, waited for image pulls to complete on replacement nodes, and verified that all hosted models were responding to queries. The incident was verified as resolved at 18:07 UTC.

Lucidworks Actions

Lucidworks will take the following actions as a result of this incident:

  • Implement sequenced Kubernetes upgrade procedures for clusters hosting LLM workloads, ensuring each node pool is fully healthy before the next pool is upgraded.
  • Investigate Docker image pre-loading strategies (such as pre-baked disks or image streaming) to eliminate long container image pull times for large ML model images.
  • Open a support ticket with our cloud provider to investigate the abnormal Docker pull times.
  • Establish a notification protocol to coordinate Kubernetes maintenance windows with LLM service owners to avoid conflicts with ongoing deployments.
  • Improve tooling around our Ray clusters to allow Lucidworks Engineering to force a failover to a blue or green state instead of waiting for Ray to automatically resolve the new and old deployments.
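
The first action above, sequencing upgrades so that each node pool is fully healthy before the next is upgraded, could be sketched along these lines. This is a minimal sketch rather than a description of Lucidworks tooling: upgrade_pools_in_sequence, upgrade_pool, and pool_is_healthy are hypothetical names, and the callables stand in for whatever cloud provider or Kubernetes operations would actually perform the upgrade and the health check.

```python
import time
from typing import Callable, Iterable, List


def upgrade_pools_in_sequence(
    pools: Iterable[str],
    upgrade_pool: Callable[[str], None],      # hypothetical: starts the upgrade
    pool_is_healthy: Callable[[str], bool],   # hypothetical: all pods Ready?
    timeout_seconds: float = 3600.0,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Upgrade pools one at a time, moving on only when a pool is healthy.

    Raises TimeoutError, leaving later pools untouched, if a pool does not
    become healthy within timeout_seconds.
    """
    completed: List[str] = []
    for pool in pools:
        upgrade_pool(pool)
        deadline = now() + timeout_seconds
        while not pool_is_healthy(pool):
            if now() >= deadline:
                raise TimeoutError(
                    f"{pool} not healthy after upgrade; halting before next pool"
                )
            sleep(poll_seconds)
        completed.append(pool)
    return completed
```

Halting on a timeout, rather than continuing, keeps the remaining pools serving traffic while the unhealthy pool is investigated.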

Recommended Client Actions

Lucidworks recommends that clients subscribe to Lucidworks status updates to receive real-time notifications about Lucidworks SaaS Platform incidents. To enable this feature, click Subscribe to Updates at status.lucidworks.com.

Posted Jun 02, 2026 - 22:33 UTC

Resolved

The service disruption affecting certain Lucidworks AI hosted models has been fully resolved, and all previously affected services are operating normally. End-user functionality is fully restored, with all dependent services operating as expected.

The disruption originated during routine Kubernetes maintenance involving node upgrades, during which a cloud provider capacity shortfall prevented the instances hosting these models from re-launching. Stability was restored after successfully launching new nodes and pulling the required images. We will share a postmortem report containing the full root cause analysis within three business days.
Posted May 28, 2026 - 18:08 UTC

Monitoring

We have successfully launched new nodes with the required LLM images, and redirected Lucidworks AI traffic to them. All LWAI hosted models are fully functional again, and stability has been confirmed through successful 200 responses from the prediction endpoint. End-user functionality is fully restored, with queries no longer returning errors. We are continuing to observe monitoring metrics to ensure system stability remains consistent before providing a final resolution update.
Posted May 28, 2026 - 18:01 UTC

Identified

The service disruption affecting certain Lucidworks AI hosted models remains ongoing, and the models are currently unavailable. The disruption occurred during routine Kubernetes maintenance involving node upgrades, during which a cloud provider capacity shortfall prevented the instances hosting these models from re-launching. We have secured the needed capacity and have engaged our cloud provider to resolve the remaining delays in bringing the models back up. We will provide further updates as additional information becomes available.
Posted May 28, 2026 - 17:39 UTC

Investigating

Certain Lucidworks AI hosted models (llama-3-8b-instruct, llama-3v2-3b-instruct, and phi-4-multimodal-instruct) in the us-southcarolina region are experiencing a service disruption and are currently unavailable. Customers attempting to use these Lucidworks-hosted models are receiving 500 or 429 errors for queries that rely on them. End users of services that depend on these specific models are encountering errors, though passthrough LLM calls remain fully functional. We are investigating the cause and are actively working to restore full availability. We will provide further updates as new information becomes available.
Posted May 28, 2026 - 17:14 UTC
This incident affected: Lucidworks AI.