Notice: Aug. 27th, 2024

Service Disruption Notice – Betterworks Platform

We experienced an unexpected outage on the Betterworks platform on August 27, 2024, starting at approximately 12:30 PM EDT / 9:30 AM PDT and affecting multiple modules. We are happy to report that access has now been fully restored.

If you continue to experience any issues, please contact our support team at support@betterworks.com. We are actively investigating the root cause of this disruption. Once it is confirmed, we will publish a detailed Root Cause Analysis (RCA) covering our findings and the steps we are taking to prevent a recurrence.

We apologize for any inconvenience this may have caused and appreciate your understanding and patience as we work to ensure the stability of our platform.

Thank you for your continued trust in Betterworks.

 

Sept. 6th, 2024 RCA Update

Root Cause Analysis: Redis Cache System Outage

Issue Summary:
A critical issue in the Redis cache system caused a sharp increase in memory usage, leading to a service outage.

  • Date and Time of Occurrence:
    The issue began on August 26, 2024, with service failures reported on August 27, 2024.

  • Location:
    Redis cache system.

  • Impact:
    The outage disrupted multiple platform modules, causing downtime for customers starting at approximately 12:30 PM EDT / 9:30 AM PDT on August 27, 2024.


Contributing Factors:

  • A service installation request on August 26, 2024.
  • Lack of proper monitoring for abnormal cache growth.

Problem Analysis:

Key Observations:

  • A connector was storing unusually large values.
  • A retry mechanism in error handling caused a multiplicative increase in cache entries, as failed jobs spawned multiple retries.

Root Cause:

  • Incorrect retry logic led to duplicates in the Redis cache.
  • Each failure generated multiple duplicate jobs, exacerbating memory usage.
  • No Time to Live (TTL) was set for Redis keys, allowing the cache to grow without automatic clearance.
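
To illustrate why these factors compound, here is a minimal arithmetic sketch. It is not the actual connector code. It assumes a hypothetical fan-out (`retries_per_failure`), that every retry fails again, and that each job writes one cache entry that never expires. The figures are illustrative only.

```python
def cache_entries_after(rounds: int, retries_per_failure: int, initial_jobs: int = 1) -> int:
    """Count cache entries written when every failed job spawns several retries.

    Assumptions (hypothetical, for illustration): every job fails, each failure
    enqueues `retries_per_failure` new jobs, and each job writes one cache entry
    that is never removed because no TTL is set.
    """
    total_entries = 0
    jobs_in_round = initial_jobs
    for _ in range(rounds + 1):  # round 0 is the original attempt
        total_entries += jobs_in_round
        jobs_in_round *= retries_per_failure  # multiplicative fan-out per failure
    return total_entries


if __name__ == "__main__":
    for rounds in (1, 3, 5):
        print(rounds, cache_entries_after(rounds, retries_per_failure=3))
```

With one retry per failure the entry count grows linearly with the number of rounds. With a fan-out greater than one it grows geometrically, and without a TTL none of those entries are ever reclaimed.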

Immediate Resolution Steps:

  1. Cache memory was doubled to accommodate the increased usage.
  2. Unused data in the cache was cleared to reduce memory consumption.
  3. The faulty retry mechanism was removed to prevent further duplicate jobs from being generated.

Long-Term Fixes:

  1. A Time to Live (TTL) will be added to all Redis cache keys so that data expires automatically after a set period.
  2. The retry mechanism and error handling will be corrected so that a failed job no longer creates duplicate jobs.
  3. We will consider allocating separate caches for each service or implementing a namespacing strategy to isolate cache usage.
  4. Enhanced monitoring will be implemented to detect abnormal cache growth patterns and take corrective action promptly.
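
The first three fixes above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the production implementation. `ExpiringCache` is a hypothetical in-memory stand-in for Redis so that the example runs without a server, and the key layout (`service:job:id`), function names, and one-hour default TTL are all assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringCache:
    """In-memory stand-in for a Redis-like cache (hypothetical, for illustration)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def set_if_absent(self, key: str, value: str, ttl_seconds: float) -> bool:
        """Store value with a TTL only if key is not already live; True if stored."""
        self._purge_expired()
        if key in self._data:
            return False
        self._data[key] = (value, self._clock() + ttl_seconds)
        return True

    def get(self, key: str) -> Optional[str]:
        self._purge_expired()
        entry = self._data.get(key)
        return entry[0] if entry else None

    def __len__(self) -> int:
        self._purge_expired()
        return len(self._data)

    def _purge_expired(self) -> None:
        now = self._clock()
        expired = [k for k, (_, expires_at) in self._data.items() if expires_at <= now]
        for k in expired:
            del self._data[k]


def job_key(service: str, job_id: str) -> str:
    """Namespace keys per service so one service's growth is easy to isolate."""
    return f"{service}:job:{job_id}"


def enqueue_once(cache: ExpiringCache, service: str, job_id: str, payload: str,
                 ttl_seconds: float = 3600.0) -> bool:
    """Record a job at most once per TTL window; a retry reuses the same key."""
    return cache.set_if_absent(job_key(service, job_id), payload, ttl_seconds)
```

Against a real Redis instance, the redis-py client can express the same "set only if absent, with expiry" step in a single call, `set(name, value, ex=ttl_seconds, nx=True)`, with expiry handled by the server. Because a retry maps to the same key, it cannot add a second entry, and every entry eventually expires.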

Monitoring and Evaluation:

  • Monitoring: Ongoing tracking of memory usage and cache performance.
  • Evaluation: Daily review of cache logs and performance metrics for the first week, then weekly for the next month.
  • Adjustments: Make necessary changes based on findings.
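
As an illustration of what detecting abnormal cache growth could look like, the sketch below flags memory samples that jump well above a recent baseline. It is an assumption-laden example rather than the monitoring actually deployed. The function name, window size, and 1.5x threshold are hypothetical, and the samples are assumed to be periodic readings of a value such as the `used_memory` field reported by the Redis `INFO memory` command.

```python
from typing import List, Sequence


def abnormal_growth_indexes(samples_bytes: Sequence[int], window: int = 3,
                            max_ratio: float = 1.5) -> List[int]:
    """Return indexes of samples exceeding max_ratio times the mean of the prior window."""
    flagged: List[int] = []
    for i in range(window, len(samples_bytes)):
        baseline = sum(samples_bytes[i - window:i]) / window
        if baseline > 0 and samples_bytes[i] > max_ratio * baseline:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical readings: steady usage followed by a sharp spike.
    readings = [100, 102, 101, 103, 250, 600]
    print(abnormal_growth_indexes(readings))
```

A relative threshold like this reacts to sudden spikes regardless of the absolute cache size. A fixed ceiling on total memory would still be needed to catch slow, steady growth.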

Timeline:

  • 26-Aug-2024: Redis memory usage spiked after a service installation request.
  • 27-Aug-2024: Failures were reported starting at approximately 12:30 PM EDT / 9:30 AM PDT. Immediate actions were taken to mitigate the issue, and full access was restored.