8:57 pm Pacific: Here is a timeline of what occurred. At 10:05 am Pacific today, we began seeing an elevated error rate. The errors were intermittent and well within our service level objectives. By 8:28 pm the error rate was rising, and we decided to restart some of the affected services, expecting the restart to complete immediately as it normally does. The services did not come back. The error reported an incorrect certificate, so we restarted the Istio service mesh. By 8:33 pm all services were restored and operating normally. Total downtime was under 5 minutes.
8:45 pm Pacific: The issue has been identified as a bug in Istio, the service mesh system we use to expose our cloud services to the internet. Istio generates certificates to encrypt traffic, and a bug in one of its certificate generators prevents the certificate from being renewed automatically. The certificate therefore expires one year after creation and is not replaced. See https://github.com/istio/istio/issues/14516 for details from Istio on the exact issue.
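For readers who want to guard against this class of failure, the following is a minimal sketch of an expiry check, not the tooling we run. It assumes the certificate's "not after" time is available in the text format returned by Python's `ssl.getpeercert()`; the function names, the 30-day warning window, and the example dates are hypothetical.

```python
import ssl
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and the certificate expiry.

    `not_after` uses the format produced by ssl.getpeercert(),
    for example "Jun  9 12:00:00 2020 GMT". A negative result
    means the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / SECONDS_PER_DAY


def needs_attention(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    # Made-up expiry date, used only to illustrate the check.
    example_not_after = "Jun  9 12:00:00 2020 GMT"
    now = datetime.now(timezone.utc)
    print(needs_attention(example_not_after, now))
```

Run on a schedule against each certificate in use, a check like this would flag a certificate that is close to expiry even when automatic renewal has silently failed.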
8:37 pm Pacific: Services have been restored. Engineering is monitoring and investigating the cause. More details about the impact and cause of the issue will be provided as we learn more.
8:35 pm Pacific: DialSource monitoring identified an issue with softphone connections to the Denali cloud. Engineering is investigating and recovering services.