Which Aiven services are affected?

A limited number of Aiven (https://aiven.io) customer PostgreSQL, Apache Kafka, InfluxDB, Elasticsearch and Grafana services running in the AWS us-east-1 region.

What happened?

According to Amazon, AWS region us-east-1 (N. Virginia) is suffering from a high rate of errors in the S3 service.

What is the impact to Aiven?

Aiven's 24/7 monitoring system alerted us to a number of problems in the AWS N. Virginia region, including failures in accessing EBS volumes of EC2 instances, launching new EC2 instances and storing backups in S3.

A number of our customers have services running in the affected region, and these services experienced several types of problems. Some of the services running in N. Virginia continued to operate normally but failed to store new backups in S3. Other services experienced worse failures, causing some of the cluster nodes to become unavailable.

Aiven's automation and operations efforts have resolved the immediate issues in all Aiven services. 

Is the incident over?

Yes.

How was the impact mitigated to the affected services?

Aiven allows customers to easily migrate services between different clouds and regions using the console or the API.  The same feature allowed our operations team to migrate affected services to the nearest unaffected AWS region (Ohio, us-east-2).
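As a rough illustration of the API route, the sketch below builds (but does not send) a request that moves a service to another region. It assumes the Aiven REST API exposes a service update endpoint at /v1/project/<project>/service/<service> that accepts a "cloud" field, that tokens are passed with the "aivenv1" authorization scheme, and that Ohio is named "aws-us-east-2". The project name, service name and token are placeholders. Check the current API documentation before relying on any of these details.

```python
import json
import urllib.request

API_BASE = "https://api.aiven.io/v1"  # assumed base URL of the Aiven REST API


def build_migration_request(project, service, target_cloud, token):
    """Build (but do not send) a request that moves a service to another cloud or region."""
    url = f"{API_BASE}/project/{project}/service/{service}"
    body = json.dumps({"cloud": target_cloud}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"aivenv1 {token}",  # assumed token scheme
            "Content-Type": "application/json",
        },
    )


# Placeholder project, service and token; "aws-us-east-2" is the assumed name for Ohio.
request = build_migration_request("my-project", "my-pg", "aws-us-east-2", "TOKEN")
# urllib.request.urlopen(request) would submit the request.
```

Keeping request construction separate from sending makes it easy to inspect or log what would be submitted before touching a production service.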

The services experiencing EBS or EC2 troubles were migrated online from N. Virginia to Ohio, and all services that continued to operate normally in N. Virginia were updated to store their backups in the Ohio region to ensure that all Aiven services have proper backups during the outage.

It won't be possible to spin up new instances in N. Virginia or perform a PostgreSQL point-in-time recovery from a backup stored only in N. Virginia while the outage is ongoing, but once AWS has resolved the issue, Aiven services will be automatically restored.

Timeline

17:45 UTC: Newly created Aiven services fail to enter the running state in AWS N. Virginia
18:01 UTC: A high volume of backup-upload failures from Aiven services in AWS N. Virginia noted by our monitoring system
18:17 UTC: Operations starts migrating affected services to AWS Ohio
18:56 UTC: All affected services migrated to AWS Ohio

What other services is this affecting besides Aiven?

"The issues appear to be affecting Adobe’s services, Amazon’s Twitch, Atlassian’s Bitbucket and HipChat, Buffer, Business Insider, Carto, Chef, Citrix, Codecademy, Coindesk, Convo, Coursera, Cracked, Docker, Elastic, Expedia, Expensify, FanDuel, FiftyThree, Flipboard, Flippa, Giphy, GitHub, GitLab, Google-owned Fabric, Greenhouse, Heroku, Home Chef, iFixit, IFTTT, Imgur, Ionic, isitdownrightnow.com, Jamf, JSTOR, Kickstarter, Lonely Planet, Mailchimp, Mapbox, Medium, Microsoft’s HockeyApp, the MIT Technology Review, MuckRock, New Relic, News Corp, PagerDuty, Pantheon, Quora, Razer, Signal, Slack, Sprout Social, StatusPage (which Atlassian recently acquired), Travis CI, Trello, Twilio, Unbounce, the U.S. Securities and Exchange Commission (SEC), Vermont Public Radio, VSCO, and Zendesk, among other things. Airbnb, Down Detector, Freshdesk, Pinterest, SendGrid, Snapchat’s Bitmoji, and Time Inc. are currently working slowly."

Source: http://venturebeat.com/2017/02/28/aws-is-investigating-s3-issues-affecting-quora-slack-trello/

How can I be sure my data is safe in Aiven?

Right now you do not need to do anything. Our online migration capability has resolved the situation for all customers. We are monitoring constantly for any sign of new problems.

Our backup system does not currently switch to a secondary backup destination automatically. An operator has to run a single command to redirect a service's backups to a working backup site. We have done this for all affected services, and we have added an item to our feature backlog to make the switch automatic.
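To illustrate what an automatic switch could look like, here is a minimal sketch that tries an ordered list of backup destinations and falls back to the next one when an upload fails. It is not Aiven's implementation: the function names, the destination labels and the simulated S3 failure are all hypothetical stand-ins.

```python
def store_backup(data, destinations):
    """Try each (name, upload) destination in order; return the name that succeeded."""
    errors = []
    for name, upload in destinations:
        try:
            upload(data)
            return name
        except Exception as exc:  # any upload failure triggers the fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backup destinations failed: " + "; ".join(errors))


# Hypothetical stand-ins for real object-storage uploads.
stored = {}


def upload_to_virginia(data):
    raise ConnectionError("S3 error rate too high")  # simulated outage


def upload_to_ohio(data):
    stored["ohio"] = data


used = store_backup(
    b"backup-chunk",
    [("us-east-1", upload_to_virginia), ("us-east-2", upload_to_ohio)],
)
```

In this run the primary destination fails, so the backup lands in the secondary one and `used` records which site accepted it.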

IMPORTANT: Please be aware of the limitations of any non-highly available plan ("Hobbyist" or "Startup") regarding availability and durability.

How are different service plans affected by incidents like this?

  • Single node plans (Hobbyist and Startup) have limited capabilities when the single node fails
  • Highly available plans (Business and Premium) are resilient to single node failures and in some cases to multiple node failures

Our Hobbyist and Startup plans are serviced by a single Virtual Machine node. This means that if the node fails, the service will be unavailable until a new VM is built. There may be some degree of data loss as it's possible that some of the latest changes to the data haven't been backed up before the failure. It can also take a long time to recover the service back to normal operation as a new VM needs to be created and restored from backups before the service can resume operations. The time to recover depends on the amount of data to restore. 

Our Business and Premium plans are very resilient to failures. A single node failure will cause no data loss, and any downtime will be minimal. If an acting PostgreSQL or Redis master node fails, an up-to-date replica node is automatically promoted to become the new master. This causes only a brief outage for applications, which need to reconnect to the database to reach the new master.
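On the application side, the reconnect after a failover can be handled with a small retry loop. The sketch below is one possible approach, assuming the application's own connect function raises ConnectionError while the new master is being promoted. The function names, attempt count and delays are illustrative, not values recommended by Aiven.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds, with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


# Hypothetical connection function that fails twice, as during a failover.
calls = {"count": 0}


def flaky_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("master unavailable")
    return "connected"


delays = []  # record the backoff delays instead of actually sleeping
result = connect_with_retry(flaky_connect, sleep=delays.append)
```

Passing the sleep function as a parameter keeps the example fast to run and makes the backoff schedule easy to check.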

More questions?

Please contact our support at support@aiven.io
