Lead Timeouts
Incident Report for Decision Cloud
Resolved
After processing transactions all day and re-running the nightly files, we believe this issue has been fixed. We will continue to monitor the situation, but this incident can be marked as resolved.
Posted Dec 10, 2021 - 11:08 PST
Update
Some nightly data files failed to parse because of database maintenance. We are currently rerunning all nightly data to fix any gaps that may exist.
Posted Dec 10, 2021 - 05:39 PST
Update
We have re-enabled filebuilder, and all systems are operational again.

At approximately 3:17am PST we re-enabled the filebuilder processing queue. There were several thousand queued jobs (far more than under normal conditions), and our system attempted to process them all at once. This resulted in too many jobs contending for the same (non-database) resources and caused many of the queued jobs to fail.

The queue appears to be processing normally now, so any failed jobs can be manually re-run.
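For illustration only, the sketch below shows one way to drain a large backlog with bounded concurrency so that queued jobs do not all contend for the same shared resources at once; the queue and job APIs shown are hypothetical stand-ins, not our production filebuilder code.

    # Hypothetical sketch: drain a large backlog a few jobs at a time instead
    # of all at once, so jobs do not contend for the same shared resources.
    from concurrent.futures import ThreadPoolExecutor

    MAX_CONCURRENT_JOBS = 8  # deliberately far below "everything at once"

    def process_job(job):
        job.run()  # placeholder for the real filebuilder work

    def drain_queue(jobs):
        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_JOBS) as pool:
            futures = [pool.submit(process_job, job) for job in jobs]
        # the pool has finished at this point; collect failures so they can be re-run
        return [f for f in futures if f.exception() is not None]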

During all of this, the new database server did not slow down any operations for lead processing. We will monitor for the next 2 hours while the daily traffic increase starts to trickle in.
Posted Dec 10, 2021 - 03:29 PST
Update
The primary server has been given 4x the capacity it had yesterday. Moving the device triggered regular maintenance tasks that required rebuilding the ZFS kernel modules, which ultimately invalidated the SSD read cache for our primary database. This means the device will have to work a little harder over the next 1-2 hours to "warm" the cache. Regular traffic will continue to process during this time, but we will see more jitter in our response time metrics until the cache has warmed.
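As a rough illustration of what "warming" looks like from the outside, the sketch below computes the ARC hit ratio on a ZFS-on-Linux host; it assumes the kernel exposes ARC statistics at /proc/spl/kstat/zfs/arcstats, and it is purely a monitoring aid, not part of the fix itself.

    # Monitoring sketch (assumes ZFS on Linux): a low ARC hit ratio right after
    # the modules are rebuilt, trending upward over 1-2 hours, is what a read
    # cache "warming" back up looks like.
    ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

    def arc_hit_ratio():
        stats = {}
        with open(ARCSTATS) as f:
            for line in f.readlines()[2:]:        # skip the kstat header lines
                name, _kind, value = line.split()
                stats[name] = int(value)
        hits, misses = stats["hits"], stats["misses"]
        return hits / (hits + misses) if (hits + misses) else 0.0

    print(f"ARC hit ratio: {arc_hit_ratio():.1%}")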

We will monitor the server for the next several hours to ensure things are functioning properly.
Posted Dec 10, 2021 - 01:30 PST
Update
We are performing the pending hardware upgrades at 1am PST. All services will be unavailable for several minutes while the upgrades are applied.
Posted Dec 10, 2021 - 00:51 PST
Update
Reports have been re-enabled and data has synchronized to the reporting cluster. We are testing new read replicas now and will be performing our remaining hardware upgrades this evening.
Posted Dec 09, 2021 - 14:06 PST
Update
We have temporarily disabled reports in an effort to identify the current bottleneck. This is a short-term measure, and reporting will be restored once the investigation completes.
Posted Dec 09, 2021 - 09:29 PST
Update
We are still seeing periods of elevated timeouts coinciding with hourly traffic increases. While the timeout rate has been reduced, it has not been eliminated. We are working on a temporary solution to reach the end of the business day, at which point we will finish the upgrades we began last night.
Posted Dec 09, 2021 - 09:22 PST
Update
All automated alarms have cleared and our average response time is currently approximately 400ms while serving typical traffic flows.

We had to reduce the IO priority of the replica sync to avoid putting excess load on the database server, so it is running more slowly in the background.
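For context, lowering IO priority can be done by wrapping a long-running sync with ionice so it yields disk bandwidth to foreground database traffic; the sketch below is a simplified illustration, and the sync command path is a placeholder rather than our actual replication tooling.

    # Simplified illustration: run the replica sync at reduced IO priority.
    # The command path below is a placeholder.
    import subprocess

    # class 2 (best-effort) at the lowest level (7); class 3 ("idle") would
    # defer to foreground IO even more aggressively
    subprocess.run(
        ["ionice", "-c", "2", "-n", "7", "/usr/local/bin/replica-sync"],
        check=True,
    )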

At this point, our system is fully operational. We will enable filebuilder shortly and close this incident once we have confirmed that it is working.
Posted Dec 09, 2021 - 06:11 PST
Update
The main metrics we observe for database performance have all returned to our normal operating ranges.

Our internal automated alarms won't clear until we have had at least 20 minutes of normal operation, but our health system is once again showing optimal performance from our API.

We have paused the update of the new read replica until the current alarms clear, at which point we will resume that process.
Posted Dec 09, 2021 - 05:34 PST
Update
Optimization has completed.

We are still seeing very slow response times, but we expect those numbers to recover over the next several minutes. We will post another update as soon as we see performance begin to return to normal.
Posted Dec 09, 2021 - 05:20 PST
Update
The final disk is reading 99% optimized. In several minutes, the storage migration will be complete and we should begin to see response times return to normal values. We will update this incident when the final optimization has completed.
Posted Dec 09, 2021 - 05:02 PST
Update
Optimization is nearly complete. There are two remaining disks still in an "optimizing" state, both above 95% complete.

We are still experiencing read/write latency spikes while the storage array finishes this task, but they should subside as soon as optimization is complete. We have also begun bringing a new read-only database replica online from our fresh backup. When this device becomes available, we will reinstate filebuilder and should see a significant reduction in read requests against our primary endpoint. These factors combined should provide a large improvement to our platform's scalability and resiliency.
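Conceptually, the read replica helps because read-heavy consumers such as filebuilder and reports can be pointed at the replica endpoint while writes stay on the primary; the sketch below is illustrative only, and the endpoint names and helper function are hypothetical.

    # Illustrative only: route read-only work to the replica and writes to the
    # primary. Endpoint names are hypothetical.
    PRIMARY_DSN = "db-primary.internal/decision_cloud"
    REPLICA_DSN = "db-replica-1.internal/decision_cloud"

    def endpoint_for(read_only: bool) -> str:
        # bulk readers (filebuilder, reports) use the replica, which is what
        # reduces read requests against the primary endpoint
        return REPLICA_DSN if read_only else PRIMARY_DSN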

We will post another update in 30 minutes.
Posted Dec 09, 2021 - 04:31 PST
Update
The optimization process continues to make progress. The remaining disks are approximately 91% optimized. We will continue to monitor progress and update this incident accordingly.
Posted Dec 09, 2021 - 01:58 PST
Update
While we continue to run our backup, we have observed inconsistent latency on some database queries. After consulting with AWS, we have been informed that although our storage has been upgraded, a background optimization is running that will impact access to existing data until it completes.

At the current time, we can see that roughly 15% of our disks are fully optimized while the remaining disks are approximately 70% optimized. Based on the current progression of this process, we are confident that all storage will be optimized and online before 5am PST.
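If the volumes in question are EBS, this optimization progress can be tracked through the EC2 API; the sketch below uses boto3's describe_volumes_modifications call, with the region and volume ID shown as placeholder assumptions.

    # Sketch (assumes EBS volumes and boto3): track optimization progress after
    # a volume modification. Region and volume IDs are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    resp = ec2.describe_volumes_modifications(
        VolumeIds=["vol-0123456789abcdef0"],
    )
    for mod in resp["VolumesModifications"]:
        # ModificationState moves from "modifying" to "optimizing" to "completed"
        print(mod["VolumeId"], mod["ModificationState"], f'{mod["Progress"]}%')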

We cannot, however, upgrade our non-disk resources until this process has completed. We will continue to monitor optimization progress and evaluate whether the remaining hardware upgrades can be performed safely once it finishes. Even if we do not get a window to perform the CPU/RAM upgrades, we will still have more than doubled the capacity of the resource that was our recent processing bottleneck, and the system should continue to operate within normal levels.
Posted Dec 08, 2021 - 22:58 PST
Update
We have completed upgrading our storage array and are performing a full read/write test against our backup array. The test will take at least 6 hours to complete. When the test completes successfully, we will briefly take the array offline while transferring to new hardware.

This incident will remain open overnight until that transfer is complete.
Posted Dec 08, 2021 - 19:02 PST
Monitoring
A small percentage of our overall traffic has been subject to an increased error and timeout rate. We have identified the cause as a storage bandwidth constraint on our new database cluster. We are upgrading the primary storage array one disk at a time, and performance should improve linearly as upgrades complete over the next few hours. This evening we are going to allocate another read replica and move primary processing to a server with roughly 2x the current resources. We will continue to monitor this issue until all upgrades have been completed.
Posted Dec 08, 2021 - 11:09 PST
This incident affected: Decision Cloud - Web Site, Decision Cloud - API (Lead Processing), Decision Cloud - Reports, and File Builder.