Within M+M we've deployed myriad products to Amazon Web Services, including photos.active.com, advantage.active.com, ihoops.com, mobileapps.active.com, widgets.active.com, realtime.active.com, search.eteamz.com and labs.active.com. We've found that, with each product, we have to run through the same checklist of items to ensure continuous uptime. Some of these items are related to application architecture, while others are related to infrastructure:
- Caching - Serving cached, pre-built content is one of the most effective ways to reduce I/O and keep a website fast. Memcached and Apache's mod_cache are tools we commonly use.
- Load Balance Across Availability Zones - AWS has a 99.95% uptime SLA, but only if applications are deployed across multiple availability zones. As such, the nodes in a pool should run in two or more availability zones. And, as if it needs to be said, applications should be load balanced.
- Data Backup - The cloud doesn't magically solve data retention issues (but it does help). Amazon's RDS offering is a great option if you don't want to set up backup policies yourself. Otherwise, backup policies must be put in place just as they would be in any other data center.
- VIP Monitoring - Like most companies with products in the cloud, we run some of our products in a proprietary data center and others on Amazon. Even so, our monitoring system acts as a central repository that reports status on all systems. As such, monitoring must be set up on VIPs (e.g. www.ihoops.com) and on individual nodes, with alerting configured should anything act up. Node monitoring is enabled by allowing traffic from the monitoring system's source IP address through Amazon's security group configuration.
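- CPU and Memory Monitoring - CPU and memory utilization should also be monitored. UDP port 161 should be opened to the monitoring system so that it can poll each node over SNMP. (SNMP traps travel the other way, from the node to UDP port 162 on the monitoring system.)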
- Human Support Process - When something goes wrong, somebody's gotta get paged.
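
Two of the checklist items above, spreading a pool across availability zones and monitoring VIPs and nodes, are easy to verify with a small script. The following is a minimal sketch rather than our actual monitoring setup: the `Node` type and the function names are hypothetical, and any node or zone names you pass in are placeholders. It uses only the Python standard library.

```python
from typing import Callable, Iterable, List, NamedTuple
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


class Node(NamedTuple):
    """Hypothetical record for one node in a load-balanced pool."""
    name: str
    zone: str  # availability zone the node runs in


def spans_multiple_zones(nodes: Iterable[Node], minimum: int = 2) -> bool:
    """True if the pool runs in at least `minimum` distinct availability zones."""
    return len({node.zone for node in nodes}) >= minimum


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`, or 0 if it cannot be reached."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return 0


def failing_targets(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> List[str]:
    """Return the URLs (VIPs or nodes) whose status is not in the 2xx/3xx range."""
    return [url for url in urls if not 200 <= fetch(url) < 400]
```

Calling `failing_targets` with a list of VIP and node URLs returns the ones that should raise an alert, and `spans_multiple_zones` can be run against a pool's node list before a deployment. The `fetch` parameter exists so the check can be exercised without touching the network. Note that node checks will only succeed if the security group allows traffic from the machine running the script.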