SaaS Platform unavailable

Incident Report for Lucidworks Platform

Postmortem

Summary

On November 13, 2025, at 16:12 UTC, the Lucidworks SaaS Platform became unavailable. This issue affected Lucidworks AI functionality (including Neural Hybrid Search), Commerce Studio and Analytics Studio accessibility, and Connected Search. API requests received HTTP 403 errors, and the platform.lucidworks.com UI returned the error message RBAC: access denied. The access denied message was returned when attempting to access Commerce Studio instances, Analytics Studio dashboards, Lucidworks Search promotion requests, or any general Lucidworks Platform configuration controls.

Lucidworks Engineering resolved the issue by 17:08 UTC, restoring service for all affected products.

Root Cause

Background

To support development and testing, Lucidworks maintains two comprehensive Lucidworks SaaS Platform environments: the Production environment and the Development environment. The Production environment provides services to all Lucidworks customers. The Development environment is used by Lucidworks Engineering to test changes before deploying them to Production.

Lucidworks employs a concept of capabilities in regional deployments to control product and feature availability on the Lucidworks Platform. For example, Lucidworks can have all of the Lucidworks SaaS Platform products available in the us-iowa region but deploy only Lucidworks AI to the us-texas region. Capabilities are configured by applying labels to various resources in Google Cloud Platform (GCP), which is Lucidworks’ public cloud provider. Our deployment tooling assesses the labels to determine where to deploy applicable microservices.
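As an illustration only, the label-driven selection described above might be sketched as follows. The label keys (`capability-ai`, `capability-commerce-studio`), the `enabled` value, and the `Region` type are assumptions made for this example; they are not Lucidworks' actual label scheme or tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """A regional deployment and the labels applied to its resources (hypothetical model)."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


def regions_for(capability: str, regions: List[Region]) -> List[str]:
    """Return the names of regions whose labels enable the given capability."""
    key = f"capability-{capability}"  # assumed label naming convention
    return [r.name for r in regions if r.labels.get(key) == "enabled"]


# Mirrors the example in the text: every product in us-iowa, only Lucidworks AI in us-texas.
REGIONS = [
    Region("us-iowa", {"capability-ai": "enabled", "capability-commerce-studio": "enabled"}),
    Region("us-texas", {"capability-ai": "enabled"}),
]

print(regions_for("ai", REGIONS))
print(regions_for("commerce-studio", REGIONS))
```

Because deployment targets are derived entirely from labels in a model like this, changing a label changes where services are deployed and routed without any other code change.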

Platform Availability

On November 12, 2025, a change related to an upcoming product release was deployed in the Development environment, where it was tested and confirmed to be functional. However, after deploying this same change to Production, the product did not display in our testing workspaces. Lucidworks investigated and determined the requisite GCP resource labels had not been applied to the necessary Production regions. A Lucidworks engineer ran an internal tool to update the deployment labels in Production.

Unfortunately, due to human error, this tool was executed with an incorrect parameter, which had the unintended consequence of applying the labels from Development to Production. This action in turn updated the routing layer configuration, which relies on some of the same labels. As a result, the Production routing layer was configured to send traffic to the Development backend services. However, because the Production and Development environments are not connected to each other in any way, all incoming traffic was dropped.

The routing layer is deployed across multiple instances in order to provide highly available services. Once the same erroneous configuration was applied to all such instances, a widespread outage ensued. 

Blast Radius

API calls to the following systems began to fail:

  • Connected Search indexing requests and query responses
  • Lucidworks AI vectorization, prediction and usecase requests used by Neural Hybrid Search, and Lucidworks AI enhanced pipelines for both index updates and query processing
  • Signals sent to Lucidworks Analytics by the Lucidworks Signals Beacon for ingestion

Additionally, because the Platform could not route traffic to the user interfaces in Production, HTTP 403 responses were returned with an error message of RBAC: access denied. These errors affected the following user experiences:

  • Commerce Studio instances, including the ability to change search rules
  • Analytics Studio dashboards and charts
  • Lucidworks Search configuration change promotion requests
  • Connected Search configuration management
  • The primary Lucidworks Platform login system
  • The interface for configuring Lucidworks Platform constructs

Incident Detection

Lucidworks uses multiple monitoring and alerting tools to provide timely notification of production issues 24x7. However, we identified monitoring gaps that increased the time to detect this issue, which in turn increased the time to resolution.

Lucidworks also identified gaps in the response and mitigation of this issue.

External synthetic checks were retrieving SSL certificates to verify their integrity, and Lucidworks incorrectly assumed these checks would also alert if the service itself was down. However, the secure HTTP connection succeeded, and the failure occurred beyond that point.
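The gap can be illustrated with a minimal sketch of a deeper health check, one that treats a successful TLS handshake as necessary but not sufficient. The `ProbeResult` type, the `is_healthy` function, and the idea of matching an expected marker in the response body are assumptions for this example, not a description of Lucidworks' monitoring tools.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one synthetic check (hypothetical structure)."""
    tls_ok: bool            # certificate retrieved and validated
    status: Optional[int]   # HTTP status code, or None if no response arrived
    body: str = ""


def is_healthy(result: ProbeResult, expected_marker: str) -> bool:
    """Healthy only if TLS succeeded, the status is 2xx, and the body looks right."""
    if not result.tls_ok or result.status is None:
        return False
    if not 200 <= result.status < 300:
        return False
    return expected_marker in result.body


# During the incident the TLS layer was fine, but the response was a 403.
incident = ProbeResult(tls_ok=True, status=403, body="RBAC: access denied")
normal = ProbeResult(tls_ok=True, status=200, body='{"status": "ok"}')

print(is_healthy(incident, '"status": "ok"'))
print(is_healthy(normal, '"status": "ok"'))
```

A certificate-only check corresponds to looking at `tls_ok` alone, which would have reported the incident response as healthy.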

Lucidworks has active alerting in place for HTTP 500 errors but did not alert sufficiently on 4xx errors. This decision rested on the false premise that 4xx errors result from invalid requests rather than invalid responses. In this incident, the system incorrectly responded with a 403 error code instead of the more accurate 500 error.
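One possible way to catch this pattern is to alert on the share of 4xx responses over a sliding window rather than on individual 4xx errors. The sketch below assumes a window of recent status codes and a fixed ratio threshold; the class name, window size, and threshold are illustrative choices, not the values Lucidworks uses.

```python
from collections import deque


class ClientErrorRateAlert:
    """Fires when the share of 4xx responses in a sliding window reaches a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.5) -> None:
        self.statuses = deque(maxlen=window)  # most recent status codes only
        self.threshold = threshold

    def observe(self, status: int) -> bool:
        """Record one response status; return True if the alert should fire."""
        self.statuses.append(status)
        if len(self.statuses) < self.statuses.maxlen:
            return False  # not enough data to judge a rate yet
        errors = sum(1 for s in self.statuses if 400 <= s < 500)
        return errors / len(self.statuses) >= self.threshold


alert = ClientErrorRateAlert(window=10, threshold=0.5)
for _ in range(10):
    alert.observe(200)                               # normal traffic fills the window
fired = [alert.observe(403) for _ in range(5)]       # sudden run of 403 responses
print(fired)
```

A rate-based rule like this tolerates the background level of invalid client requests while still reacting when most traffic suddenly starts failing with 4xx codes.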

Resolution

After the source of this issue was identified, at 17:07 UTC Lucidworks personnel changed the necessary values in Production to trigger a repopulation of all relevant labels throughout the Lucidworks Platform. These labels propagated quickly. Within one minute, the routing layer had been updated with the correct information, and full service was restored for all affected products.

Lucidworks Actions

Lucidworks will take the following actions as a result of this incident:

  • Replace the tooling responsible for setting the labels with a tool that:

    • Constrains which parameters can be passed to it, which minimizes the potential for human error
    • Runs in an automated fashion that occurs from a centralized location and logs all activity in an auditable manner
    • Requires peer review prior to making any changes in the Production environment
  • Enhance external monitoring tools to:

    • More completely probe Production systems to ensure all products are functioning
    • Actively alert appropriate Lucidworks personnel to drastic increases in HTTP 4xx responses, even in cases where these may be indicative of invalid calls rather than an inability to properly respond to those calls
  • Update Lucidworks’ error handling processes in our product UIs so that:

    • Unroutable requests do not receive a 403 error but instead receive a 500 error that clearly specifies a Lucidworks infrastructure issue as the cause
    • Error response pages include a definitive message that indicates the issue is due to an inability to serve the requested traffic, instead of referring to unrelated role-based access control (RBAC) errors
  • Enhance Neural Hybrid Search to fall back more gracefully to a lexical-only query when Lucidworks AI is unreachable; the recently released Fusion 5.9.15 added a failsafe fallback to the Neural Hybrid Query Stage, and we will implement similar fallbacks in a future release of Fusion so that the overall system can better withstand service outages such as this one
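As a rough sketch of the first action above, a replacement labeling tool could accept only a fixed set of environment values, refuse to apply labels across environments, and require a named reviewer for Production. The `Environment` enum, the `validate_label_update` function, and the `approved_by` parameter are hypothetical names for this example; the actual replacement tool may work differently.

```python
from enum import Enum
from typing import Optional


class Environment(Enum):
    """The only environment values the tool accepts (assumed set)."""
    DEVELOPMENT = "development"
    PRODUCTION = "production"


def validate_label_update(
    label_source: str, target: str, approved_by: Optional[str] = None
) -> Environment:
    """Validate a label update request and return the target environment."""
    source_env = Environment(label_source)  # ValueError on any unknown value
    target_env = Environment(target)
    if source_env is not target_env:
        # Blocks the failure mode in this incident: Development labels into Production.
        raise ValueError(
            f"labels from {source_env.value} cannot be applied to {target_env.value}"
        )
    if target_env is Environment.PRODUCTION and not approved_by:
        raise PermissionError("Production changes require a peer reviewer")
    return target_env


print(validate_label_update("production", "production", approved_by="reviewer"))
```

Rejecting invalid combinations before any change is made keeps a single mistyped parameter from reaching the routing layer.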

Recommended Client Actions

Lucidworks recommends that clients using Neural Hybrid Search upgrade to Fusion 5.9.15 as soon as possible, in order to take advantage of the latest enhancements, including the automated lexical query fallback mentioned previously.
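For clients building their own query layers, the general idea of a lexical fallback can be sketched as below. This is not the Fusion implementation: the `search_with_fallback` function, the `AIServiceUnavailable` exception, and the simple merge strategy are assumptions chosen to show the pattern.

```python
from typing import Callable, List, Sequence


class AIServiceUnavailable(Exception):
    """Hypothetical error raised by a client when the embedding service cannot be reached."""


def search_with_fallback(
    query: str,
    lexical_search: Callable[[str], List[str]],
    embed: Callable[[str], Sequence[float]],
    vector_search: Callable[[Sequence[float]], List[str]],
) -> List[str]:
    """Run a hybrid query, degrading to lexical-only results if vectorization fails."""
    lexical_hits = lexical_search(query)
    try:
        vector_hits = vector_search(embed(query))
    except (AIServiceUnavailable, ConnectionError, TimeoutError):
        return lexical_hits  # serve lexical results rather than failing the query
    seen = set(lexical_hits)
    # Simplified merge: lexical hits first, then vector hits not already present.
    return lexical_hits + [doc for doc in vector_hits if doc not in seen]


def lexical(query: str) -> List[str]:
    return ["doc1", "doc2"]


def embed_down(query: str) -> Sequence[float]:
    raise AIServiceUnavailable("403 RBAC: access denied")


def nearest(vector: Sequence[float]) -> List[str]:
    return ["doc2", "doc3"]


print(search_with_fallback("running shoes", lexical, embed_down, nearest))
```

With a fallback of this kind, an outage of the vectorization service reduces result quality instead of causing the query to fail.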

We also recommend that clients subscribe to Lucidworks status updates to receive notifications about Lucidworks SaaS Platform incidents. To enable this feature, click “Subscribe to Updates” on status.lucidworks.com.

Posted Nov 17, 2025 - 09:15 PST

Resolved

We have confirmed that rolling back the erroneous change has resolved this incident. A postmortem will be posted here as soon as possible.
Posted Nov 13, 2025 - 09:47 PST

Monitoring

We have identified a recent job that ran with an incorrect parameter, which we believe to be the cause of this incident. We have reverted that change and services are now becoming available again. We are continuing to monitor to ensure full recovery.
Posted Nov 13, 2025 - 09:15 PST

Investigating

The Lucidworks SaaS Platform, which powers Lucidworks AI, Commerce Studio, Analytics Studio, and Connected Search, is currently unavailable. We are actively working to investigate and mitigate this issue and will post updates here.
Posted Nov 13, 2025 - 09:07 PST
This incident affected: Lucidworks Platform (User Logins & Configuration UI, Integrations), Connected Search (US Region Data Ingest, US Region Search APIs), Lucidworks AI (Shared Embeddings Models, Custom Model Training, Custom-trained Embeddings Models, Shared Generative Models), Lucidworks Analytics Studio (Beacon & Signals Ingestion, Usage Analytics Hub), and Lucidworks Experience Studios (Commerce Studio).