Background
On March 3, 2025, the IFTTT team conducted scheduled maintenance from 8 PM to 9 PM UTC. During this time, some Applet executions were temporarily paused. More details about this maintenance can be found on our status page.
Following the maintenance, problems restoring cached data into our new caching service delayed the restoration of Applet schedules. As a result, some Applets may have run late or not at all between March 3 at 8 PM UTC and March 4 at 4:30 AM UTC.
Our team worked through the night to resolve the issue, and Applets are now running as expected. No action is needed on your part.
For a fuller account of what happened and how we responded, please see the More Details section below.
More Details
At IFTTT, we are dedicated to building a high-quality, reliable service that seamlessly connects apps and devices to make life easier for our users. Our engineering team is constantly improving our infrastructure to ensure that our platform remains scalable, resilient, and efficient. As part of this commitment, we recently undertook a planned migration to improve system performance and reliability.
While our goal is always to ensure a smooth transition, this migration presented unexpected challenges that temporarily impacted Applet execution.
What Happened?
On March 3, 2025, our engineering team began migrating critical services in an underlying caching layer to improve performance. This migration required temporarily scaling down services, taking snapshots of existing cached data, and launching new caching clusters. While these steps went as planned, an issue arose when we attempted to restore certain cache instances into the new caching service, causing delays in Applet execution and a backlog of scheduled jobs.
To provide transparency, here’s a high-level overview of the challenges we encountered:
- Execution Schedules Did Not Fully Restore: IFTTT Applets rely on a sophisticated scheduling system that determines when an automation should run. During our maintenance, the execution schedule was not preserved. Due to a limitation in our bulk schedule reset process, only about 30% of the expected schedules were restored quickly, leading to delays in Applet execution.
- Increased Load on the New Caching Clusters: The sudden surge in scheduling updates placed an unexpectedly high load on the new caching clusters, resulting in slow query times and timeouts.
- Temporary Service Degradation: Some Applets were delayed or did not execute during the maintenance window and the recovery period that followed (March 3 at 8 PM UTC to March 4 at 4:30 AM UTC), leading to increased support volume and user reports of missing or late Applet runs.
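To illustrate why a burst of schedule updates can overload a fresh cache cluster, here is a minimal sketch of one common mitigation: spreading restored schedules across a time window with random offsets so they do not all land in the same second. This is not a description of IFTTT's scheduler; the function names, the applet ID format, and the 600-second window are all hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Optional


def assign_run_offsets(
    applet_ids: Iterable[str],
    window_seconds: int,
    rng: Optional[random.Random] = None,
) -> Dict[str, int]:
    """Give each (hypothetical) applet ID a random offset inside the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    rng = rng or random.Random()
    return {applet_id: rng.randrange(window_seconds) for applet_id in applet_ids}


def peak_per_second(offsets: Dict[str, int]) -> int:
    """Return the largest number of schedule updates sharing one second."""
    return max(Counter(offsets.values()).values(), default=0)


if __name__ == "__main__":
    ids = [f"applet-{i}" for i in range(1000)]
    all_at_once = {applet_id: 0 for applet_id in ids}
    spread_out = assign_run_offsets(ids, window_seconds=600)
    print("peak without spreading:", peak_per_second(all_at_once))
    print("peak with spreading:", peak_per_second(spread_out))
```

The trade-off in a sketch like this is that individual schedules resume slightly later in exchange for a much lower peak load on the cache.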
How We Responded
The IFTTT engineering team worked around the clock to diagnose and resolve the issue. We took immediate action to mitigate the impact, including:
- Scaling Up Resources: We temporarily increased the capacity of our new caching clusters and adjusted our infrastructure to better handle the workload.
- Manual Restorations: To expedite recovery, we manually cleared blocked jobs and re-ran schedule resets in a controlled manner.
- Fallback Mechanism: Recognizing the potential delay in full recovery, we leveraged a backup of the previous caching instance to restore missing schedules and prevent further delays.
- Real-time Monitoring & Adjustments: Throughout the incident, we closely monitored system performance, scaled services dynamically, and made adjustments to stabilize Applet execution rates.
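As a rough illustration of what re-running schedule resets "in a controlled manner" can look like, the sketch below processes IDs in small batches, pauses between batches, and collects failures for a later retry. It is an assumption-laden example, not our actual tooling: `reset_one`, the batch size, and the pause length are hypothetical placeholders.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def reset_schedules_in_batches(
    applet_ids: Sequence[str],
    reset_one: Callable[[str], None],  # hypothetical per-applet reset call
    batch_size: int = 100,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Reset schedules batch by batch, pausing between batches.

    Returns the IDs whose reset raised an error so they can be retried.
    """
    failed: List[str] = []
    batches = list(batched(applet_ids, batch_size))
    for index, batch in enumerate(batches):
        for applet_id in batch:
            try:
                reset_one(applet_id)
            except Exception:
                # Record the failure and keep going instead of aborting the run.
                failed.append(applet_id)
        if index < len(batches) - 1:
            # Give the backing store time to recover before the next batch.
            sleep(pause_seconds)
    return failed
```

Passing `sleep` as a parameter keeps the pause easy to shorten or stub out when exercising the function outside production.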
What We Learned and Next Steps
At IFTTT, we believe that every challenge is an opportunity to learn and improve. Here are the key takeaways from this migration and how we are addressing them moving forward:
- Enhancing Bulk Schedule Reset Processes: We are refining the way we restore schedules to ensure more efficient processing and avoid execution delays in future migrations.
- Strengthening Load Testing Practices: We will conduct more extensive pre-migration stress tests to better predict system behavior under heavy loads.
- Improving Monitoring & Alerting: By expanding our internal observability tools, we will better detect and respond to performance degradation before it affects Applets.
- Building a More Resilient Fallback Strategy: We are streamlining our ability to revert to backups seamlessly, ensuring faster recovery times if needed in the future.
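One simple form the monitoring improvement could take is a check that compares the schedules expected after a migration with those actually restored, and raises an alert when the restored share falls below a threshold. The sketch below is illustrative only; the function names and the 99% threshold are assumptions, not a description of our internal observability tools.

```python
from typing import Iterable


def restored_fraction(expected_ids: Iterable[str], restored_ids: Iterable[str]) -> float:
    """Return the share of expected schedules that are present after a restore."""
    expected = set(expected_ids)
    if not expected:
        return 1.0  # nothing was expected, so nothing is missing
    return len(expected & set(restored_ids)) / len(expected)


def needs_alert(
    expected_ids: Iterable[str],
    restored_ids: Iterable[str],
    threshold: float = 0.99,  # hypothetical alerting threshold
) -> bool:
    """Return True when too few schedules came back after a restore."""
    return restored_fraction(expected_ids, restored_ids) < threshold


if __name__ == "__main__":
    expected = [f"applet-{i}" for i in range(10)]
    restored = expected[:3]  # mirrors a restore where only about 30% came back
    print(restored_fraction(expected, restored), needs_alert(expected, restored))
```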
Our Commitment to You
We recognize that reliability is critical for our users. While migrations like this are essential for long-term improvements, we acknowledge the impact this incident had and are committed to making our platform even stronger.