Elevated search errors

Incident Report for Lucidworks Platform

Postmortem

Summary

On October 30, 2025, beginning at 15:40 UTC, Lucidworks Connected Search query responses began returning an elevated rate of 504 Gateway Timeout errors, which affected the ability to serve search traffic. Lucidworks mitigated the issue by rolling back the most recent code change, resolving the incident at 17:01 UTC. Full functionality was verified as restored by 17:12 UTC.

The total impact duration was approximately 81 minutes, during which users experienced intermittently elevated search latency and query failures in Connected Search environments.

Lucidworks proactively opened outbound Support cases for affected customers during this incident.

Root Cause

The incident was initially believed to be caused by a recent change to an unrelated service that introduced unstable connection handling. Following that update, we detected frequent connection resets to an upstream third-party service, which propagated to other services, resulting in timeouts and degraded search performance. A Severity 1 (S1) event was declared, and we began posting updates to our status page.

Despite our belief that the most recently deployed change was isolated from Connected Search, we reverted it in an attempt to restore service. Around the time that rollback completed, we began to see recovery and declared the incident resolved.

However, while debugging the rolled-back change as part of our standard postmortem process, our testing suggested that the code in question may not have been the underlying source of this incident. We are coordinating with one of our third-party providers to obtain more information and will update this incident report when it is received.

Update from November 6, 2025:

After extensive investigation, consultation with third-party providers, and targeted testing of the suspected faulty code, Lucidworks identified the root cause as a compound issue. The recent change mentioned above, designed to improve vector-lookup performance, achieved that goal but inadvertently exposed a latent defect in HTTP connection reuse. In addition, the vector service was under-resourced, causing replicas to scale rapidly under query load and sharply increasing the number of concurrent connections. Combined with the missing connection reuse, this exhausted all available ephemeral ports; outgoing packets were dropped, and new connections to the index could not be established to serve search traffic.
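For illustration only (this is not Lucidworks production code, and the endpoint name is a hypothetical placeholder), the failure pattern resembles the following, where every vector lookup opens a brand-new connection; when many replicas issue lookups concurrently, each connection consumes an ephemeral source port, and the pool of available ports can run out:

    import requests

    VECTOR_SERVICE_URL = "http://vector-service.internal/lookup"  # hypothetical endpoint

    def lookup_vector(doc_id: str) -> dict:
        # Calling requests.get() directly builds a throwaway session per call,
        # so no keep-alive connection is ever reused. Each call therefore
        # claims a fresh ephemeral port; under a burst of concurrent lookups
        # the operating system can exhaust its port range, after which new
        # connections fail and upstream callers see timeouts (504s).
        response = requests.get(VECTOR_SERVICE_URL, params={"id": doc_id})
        response.raise_for_status()
        return response.json()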

To resolve the issue, Lucidworks increased service resource allocations, explicitly implemented HTTP connection reuse, and added enhanced logging and validation safeguards to detect similar issues in the future.
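As a rough sketch of the connection-reuse and timeout pattern described above (assuming a Python-style HTTP client; the endpoint, pool sizes, and timeout values are illustrative, not the actual service configuration):

    import logging

    import requests
    from requests.adapters import HTTPAdapter

    logger = logging.getLogger("vector-client")

    VECTOR_SERVICE_URL = "http://vector-service.internal/lookup"  # hypothetical endpoint

    # A single shared Session serves all lookups from a bounded keep-alive
    # pool, so concurrent requests multiplex over a small, stable set of
    # sockets instead of claiming a new ephemeral port each time.
    session = requests.Session()
    session.mount("http://", HTTPAdapter(pool_connections=10, pool_maxsize=50))

    def lookup_vector(doc_id: str) -> dict:
        try:
            # Explicit (connect, read) timeouts keep a slow upstream from
            # holding sockets open indefinitely.
            response = session.get(VECTOR_SERVICE_URL,
                                   params={"id": doc_id},
                                   timeout=(2, 5))
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            logger.exception("Vector lookup failed for doc_id=%s", doc_id)
            raise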

Lucidworks Actions

Lucidworks has taken the following actions as a result of this incident:

  • Rolled back the most recent change to the relevant service in production. This change will not be re-deployed until we’ve confirmed beyond a doubt that it will not cause a recurrence of this issue.
  • Restarted affected services to clear residual connection issues.
  • Added an internal process review to ensure change isolation assumptions are verified prior to all deployments.
  • Reached out to our third-party provider to get additional information and to determine if an unreported service issue on their end was the true underlying cause of our Connected Search traffic issues.

Additional Lucidworks Actions following further root-cause analysis:

  • Improved our vector-serving backend to scale more gradually.
  • Implemented HTTP connection reuse where it was lacking.
  • Added timeouts and more rigorous logging for better visibility.
  • Scale-tested these changes at 100x normal QPS levels to ensure the issue was fully resolved (a simplified sketch of this kind of check follows this list).
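As context for the last item, a check of this kind can be as simple as driving the search endpoint at a fixed multiple of normal query volume and measuring the error rate. The sketch below is illustrative only; the URL, request count, concurrency, and success criterion are assumptions rather than the actual test harness:

    import concurrent.futures

    import requests

    SEARCH_URL = "http://search.internal/query"  # hypothetical endpoint
    REQUEST_COUNT = 10_000                       # stand-in for 100x normal QPS over the test window

    session = requests.Session()  # shared session so the test itself reuses connections

    def send_query(i: int) -> int:
        try:
            response = session.get(SEARCH_URL, params={"q": f"probe-{i}"}, timeout=(2, 5))
            return response.status_code
        except requests.RequestException:
            return 0  # count transport-level failures alongside HTTP errors

    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        statuses = list(pool.map(send_query, range(REQUEST_COUNT)))

    failures = sum(1 for status in statuses if status != 200)
    print(f"{failures}/{len(statuses)} queries failed ({failures / len(statuses):.2%})")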

Recommended Client Actions

There are no recommended client actions as a result of this incident.

Posted Nov 01, 2025 - 12:21 PDT

Resolved

Rolling back the recent change has resolved this incident, and we have confirmed that we are no longer seeing query errors. Lucidworks Support has already raised proactive tickets for the individual Connected Search customers known to have been affected. Any additional inquiries or follow-ups can be submitted through our Support Portal: https://support.lucidworks.com/hc/en-us

We will post a full postmortem here within 48 hours.
Posted Oct 30, 2025 - 10:12 PDT

Monitoring

The change has been rolled back, and we are starting to see recovery. Search queries to Connected Search endpoints are once again responding successfully. Our team is continuing to monitor to confirm that all errors are resolved.
Posted Oct 30, 2025 - 10:01 PDT

Identified

We have identified a recent change that appears to be the source of this issue and are preparing to roll it back now.
Posted Oct 30, 2025 - 09:10 PDT

Investigating

We are experiencing an elevated rate of search errors for Connected Search customers. Many queries are currently returning 504 responses, and the affected calls are unable to complete. Lucidworks personnel are actively investigating.
Posted Oct 30, 2025 - 08:40 PDT
This incident affected: Connected Search (US Region Search APIs).