Availability Knob
Flexible User-Defined Availability in the Cloud
Mohammad Shahrad and David Wentzlaff
October 5, 2016
IaaS Providers and Availability Guarantees
One thing in common: fixed 99.95% availability
What’s wrong with fixed availability?
Cloud customers:
Various downtime demands
Different WTP*
Cloud infrastructures:
Heterogeneous HW & SW reliability
* WTP = Willingness to Pay
The Availability Knob (AK)
Let’s have clients ask for their desired availability and be charged accordingly.
[Figure: Cloud Scheduler]
What should change in the cloud to support AK?
Cloud management
Gathering failure data and building failure statistics
Availability-aware scheduling
Service Level Agreements (SLAs)
How do SLAs look with AK?
1. Desired availability / period (e.g. 99.8% / 7 days)
2. Availability price scale, e.g. (99.95%, 1.00), (99.9%, 0.95)
3. Variable service credit (penalty)
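As a rough illustration of the SLA elements on this slide, the sketch below turns a desired availability and period into a downtime budget and looks up a relative price on a price scale. The class and method names (`AkSla`, `allowed_downtime_minutes`, `relative_price`) and the tier lookup rule are assumptions made for this example; only the numbers are taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class AkSla:
    """Hypothetical container for an AK-style SLA (illustrative only)."""
    desired_availability: float          # e.g. 0.998 for 99.8%
    period_days: float                   # e.g. 7
    # (availability, relative price) pairs, as in the slide's example
    price_scale: list = field(
        default_factory=lambda: [(0.9995, 1.00), (0.999, 0.95)])

    def allowed_downtime_minutes(self) -> float:
        """Downtime budget implied by the availability target over the period."""
        period_minutes = self.period_days * 24 * 60
        return (1.0 - self.desired_availability) * period_minutes

    def relative_price(self) -> float:
        """Assumed rule: pay the price of the lowest listed tier that still
        meets the requested availability."""
        eligible = [(a, p) for a, p in self.price_scale
                    if a >= self.desired_availability]
        if not eligible:
            raise ValueError("requested availability exceeds the price scale")
        return min(eligible)[1]


sla = AkSla(desired_availability=0.998, period_days=7)
print(round(sla.allowed_downtime_minutes(), 2))  # 20.16 minutes per 7 days
print(sla.relative_price())                      # 0.95
```

The variable service credit (item 3) is left out here because the slide does not give its form.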
The AK Scheduler
1. Check for available resources
2. Find the cheapest resource, considering possible penalties, using:
- User’s experienced vs. requested DT
- Expected PM time-to-next-failure
- VM size and expected DT** length in case of failure
PM* Failure DB
Service Record DB
* PM= Physical Machine
** DT= Downtime
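The two scheduler steps can be sketched as follows. This is a minimal illustration, not the AK scheduler itself: it assumes exponentially distributed time-to-failure, a single failure per period, and a flat penalty whenever the expected downtime would exceed the user's remaining downtime budget. All names (`Machine`, `expected_cost`, `pick_machine`) and numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    cost_per_hour: float   # provider's cost of hosting the VM here
    mttf_hours: float      # mean time to failure, e.g. from a PM failure DB
    free_slots: int        # available capacity (step 1)


def expected_cost(machine, hours_left, downtime_budget_min,
                  expected_downtime_min, penalty):
    """Hosting cost plus expected SLA penalty for the rest of the period.

    Assumes exponential time-to-failure and that one failure causes
    `expected_downtime_min` minutes of downtime.
    """
    p_fail = 1.0 - math.exp(-hours_left / machine.mttf_hours)
    violates = expected_downtime_min > downtime_budget_min
    expected_penalty = p_fail * penalty if violates else 0.0
    return machine.cost_per_hour * hours_left + expected_penalty


def pick_machine(machines, **kwargs):
    """Step 1: filter by free capacity. Step 2: cheapest in expectation."""
    candidates = [m for m in machines if m.free_slots > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda m: expected_cost(m, **kwargs))


machines = [
    Machine("cheap", cost_per_hour=0.08, mttf_hours=500.0, free_slots=2),
    Machine("reliable", cost_per_hour=0.10, mttf_hours=5000.0, free_slots=1),
]
# A user with little downtime budget left vs. one with plenty of budget.
tight = pick_machine(machines, hours_left=168, downtime_budget_min=5,
                     expected_downtime_min=30, penalty=20.0)
loose = pick_machine(machines, hours_left=168, downtime_budget_min=60,
                     expected_downtime_min=30, penalty=20.0)
print(tight.name, loose.name)  # reliable cheap
```

With these made-up numbers, the user who is close to missing the requested availability is placed on the more reliable machine, while the user with spare budget goes to the cheaper one.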
AK-Specific Scheduler Features
Extra knowledge of user availability demand enables new scheduling features:
Benign VM* Migration (BVM)
Deliberate Downtimes (DDT)
* VM= Virtual Machine
Benign VM Migration (BVM)
VMs can be over-served due to:
- Low failure rate
- Assignment to HR resources (resource shortfall)
Periodic migration of over-served VMs to cheaper resources
* DTF= Downtime Fulfillment
** SLO= Service Level Objective
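A minimal sketch of how candidates for BVM might be selected. The slides do not define how over-service is measured, so this example assumes it is the unused share of a VM's downtime budget; the function name `select_for_bvm` and the data layout are hypothetical. The 10% default mirrors the "top 10% of over-served clients" setting used in the evaluation slides.

```python
import math


def select_for_bvm(vms, top_percent=10):
    """Return the names of the most over-served VMs.

    `vms` maps a VM name to (experienced_downtime_min, allowed_downtime_min).
    Over-service is measured as the unused share of the downtime budget,
    which is an assumed metric for this sketch.
    """
    def slack(item):
        experienced, allowed = item[1]
        return 1.0 - experienced / allowed

    # Only VMs that still have unused downtime budget are candidates.
    ranked = sorted((it for it in vms.items() if slack(it) > 0),
                    key=slack, reverse=True)
    count = math.ceil(len(vms) * top_percent / 100)
    return [name for name, _ in ranked[:count]]


# vm0 has used none of its 10-minute budget, vm19 has overshot it.
vms = {f"vm{i}": (float(i), 10.0) for i in range(20)}
print(select_for_bvm(vms))  # ['vm0', 'vm1']
```

The selected VMs would then be migrated to cheaper resources on the next periodic BVM pass.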
Deliberate Downtimes (DDT)
Providers can deliberately fail VMs near the end of the period.
Motivations:
Building market incentives
Lowering energy consumption
Bidding redeemed resources
etc.
[Figure: requested vs. delivered availability, separated by a safety margin]
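To make the safety margin concrete, the sketch below computes how much deliberate downtime a provider could still impose near the end of a period without dropping below the requested availability. The function and parameter names are illustrative assumptions, and the margin is expressed in minutes purely for simplicity.

```python
def deliberate_downtime_budget(requested_avail, period_min,
                               experienced_dt_min, safety_margin_min):
    """Minutes of downtime that can still be imposed without missing the SLO.

    The safety margin keeps delivered availability above the requested
    level even if some unplanned downtime follows.
    """
    allowed = (1.0 - requested_avail) * period_min
    return max(0.0, allowed - experienced_dt_min - safety_margin_min)


def delivered_availability(period_min, total_dt_min):
    """Availability actually delivered over the period."""
    return 1.0 - total_dt_min / period_min


period = 7 * 24 * 60                      # 7-day period, in minutes
ddt = deliberate_downtime_budget(0.998, period,
                                 experienced_dt_min=5.0,
                                 safety_margin_min=5.0)
print(round(ddt, 2))                                        # 10.16
print(delivered_availability(period, 5.0 + ddt) >= 0.998)   # True
```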
Economics of AK
How to set prices to ensure mutual benefit?
How does AK make money?
Incentive Compatibility
Clients may:
- run buggy VMs
- cause deliberate DTs**.
Providers can:
- neglect meeting SLOs*
* SLO = Service Level Objective
** DT = Downtime
Pricing for incentive compatibility
Using game theory to ensure:
- Providers maximize profit margin by not violating SLOs
- Clients pay less by requesting their true demands
How does AK make money?
1. Adapting service to real demand: higher market efficiency through supply chain flexibility
2. More efficient resource utilization: lowering OpEx, extra bidding/sprinting
3. Variable profit margins: compensate for risks and supply/demand disparity
~10% Cost Reduction
~20% Profit Increase
AK Deployment
No hardware change required
Low technology adoption cost
Existing fixed availability is a subset of AK
Can be offered as an optional feature
Easy shift to the new model
How to evaluate AK?
Infrequency of Failures
Accelerated testing
Simulations
Data center scale
1. Stochastic simulations in MATLAB
2. Prototype implementation with OpenStack
[1] http://gdkomeg.en.made-in-china.com/productimage
[1]
AKSim: Stochastic Cloud Simulator
Scalability
Resolution/accuracy trade-off
Diverse Applications
Multiple VMs
Various machine types (cost/resilience trade-off)
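To give a flavor of what a stochastic availability simulation involves, here is a small Monte Carlo sketch in Python. It is not AKSim: it models a single machine, assumes exponentially distributed time-to-failure and a fixed repair time, and all parameter values are made up for illustration.

```python
import random


def simulate_downtime(period_h, mttf_h, repair_h, rng):
    """Total downtime (hours) of one machine over one period.

    Assumes exponentially distributed time-to-failure and a fixed repair time.
    """
    t, downtime = 0.0, 0.0
    while True:
        t += rng.expovariate(1.0 / mttf_h)     # time until the next failure
        if t >= period_h:
            return downtime
        outage = min(repair_h, period_h - t)   # clip outages at the period end
        downtime += outage
        t += outage


def miss_rate(target_avail, period_h, mttf_h, repair_h,
              trials=10_000, seed=0):
    """Fraction of simulated periods in which the target is missed."""
    rng = random.Random(seed)
    allowed = (1.0 - target_avail) * period_h
    misses = sum(
        simulate_downtime(period_h, mttf_h, repair_h, rng) > allowed
        for _ in range(trials))
    return misses / trials


# Same machine, two availability targets over a 30-day (720 h) period.
for target in (0.999, 0.99):
    print(target, miss_rate(target, period_h=720, mttf_h=2000, repair_h=1.0))
```

Because both runs use the same seed, the stricter target can never show a lower miss rate than the looser one. Raising `trials` improves the estimate at the cost of run time, a small-scale version of the resolution/accuracy trade-off.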
OpenStack AK Prototype
Availability-aware Scheduler
1000 machines, 12000 users, Normal demand dist., 6 months
BVM every 1hr for top 10% of over-served clients
Benign VM Migration (BVM)
~7% cost reduction
Increased miss rate: 0.19% -> 0.34%
1000 machines, 12000 users, Uniform demand dist. [3 nines,5 nines], 30 days
BVM every 1hr for top 10% of over-served clients
Benefits of BVM depend on machine type blend and data-center utilization.
Deliberate Downtimes (DDT)
1000 machines, 12000 users, Normal demand dist. [3 nines,5 nines], 6 months
BVM every 1hr for top 10% of over-served clients
Benefits of DDT depend on demand distribution.
DDT
Improved Service Satisfaction
[Figure labels: Downtime, Price; series: AK Satisfaction, Fixed-avail Satisfaction]
* WTP = Willingness to Pay
Things to Remember
Supply chain flexibility -> market efficiency
Knowing user demand can enable new techniques
Game theory to ensure mutual economic incentive
Leveraging reliability/cost trade-offs
The Availability Knob
Mohammad Shahrad
Back-up Slides
What if client’s demand changed?
The client must have an incentive to change their plan.
Price:
No change; fixed A_1: $P_{A_1}$
Deliberate failures by user to earn cash back: $P_{A_1} - SC_{A_1}(\alpha A_1 + (1-\alpha) A_2)$
Change to A_2: $\alpha P_{A_1} + (1-\alpha) P_{A_2}$
Plan update condition:
Upper bound of SC given arbitrary P
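The equations for this slide did not survive extraction. One plausible reading, given here as an assumption rather than the authors' stated result, is that updating the plan must be no more expensive than staying on A_1 and failing deliberately to collect service credit. Here P_{A_i} is the price of plan A_i, SC_{A_1}(.) is the service credit paid under plan A_1 for a given delivered availability, and the deliberate-failure price is taken to be the plan price minus the service credit:

```latex
% Assumed condition: changing the plan costs no more than
% deliberately failing under the original plan A_1.
\alpha P_{A_1} + (1-\alpha) P_{A_2}
  \;\le\; P_{A_1} - SC_{A_1}\bigl(\alpha A_1 + (1-\alpha) A_2\bigr)

% Rearranging gives an upper bound on the service credit for arbitrary prices P:
SC_{A_1}\bigl(\alpha A_1 + (1-\alpha) A_2\bigr)
  \;\le\; (1-\alpha)\,\bigl(P_{A_1} - P_{A_2}\bigr)
```

The second line follows because P_{A_1} - \alpha P_{A_1} - (1-\alpha) P_{A_2} = (1-\alpha)(P_{A_1} - P_{A_2}).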
Nash Equilibrium
Nash equilibrium:
Catastrophic Failure & AK
When the whole cloud service is down.
[Figure: Missed SLOs (%), 0 to 100, vs. Catastrophic Event Length (Hour), 0 to 8; series: AK (Uniform Dist), Fixed Availability]
Why OpenStack
VM migration (unlike Eucalyptus)
Diverse hypervisor support (KVM)
AWS Compatibility
Big community (good support)
Real world adoption in public/private/hybrid clouds
Some More Results
Service Credit Reshaping
Availability Monitoring Tools
AK can use existing performance monitoring tools to gather availability data:
Nagios (used in AWS)
Zabbix
Ganglia