Data-Driven Software Automation
By Kyle Dunn
Most of us don’t give much thought to elevator rides or the data-driven nature of them. A set of sensors informs precise motor control for acceleration and deceleration, providing a comfortable ride and an accurate stop at your desired floor. Too much acceleration brings a roller-coaster experience to the office; too little will make you late for your team meeting. Striking a good balance between the two can be quite complex in practice.
Similar challenges are increasingly surfacing in the highly dynamic and complex world of software systems. In the case of a platform as a service, this manifests in application instance auto-scaling. It’s important to strike a balance between reacting too quickly, or with too much magnitude, and violating an SLA by waiting for absolute certainty from observations. Fortunately, the magic behind the elevator algorithm (aka the discipline of control systems theory) is well understood. We can learn from its countless applications in physical systems, like elevators, to move towards data-driven software automation.
For this writing we’ll focus on the market intelligence (read: data science) department at the fictitious ACME retail outlet. These data scientists need to evaluate a few hundred website design variations (for A/B testing) leading up to the launch of the marketing department’s back-to-school campaign. Coincidentally (or not), the platform devops team wants the weekend off for opening day (and recovery) but must ensure the results will be ready for the marketing team by Monday morning. ACME runs Pivotal Cloud Foundry (PCF) deployed on Google Cloud Platform… so… the finance department keeps close tabs on the monthly spend; over-provisioning capacity for a whole weekend will lead to a CFO nastygram – no bueno.
Enter first-order, linear forecasting. If we apply this concept to determine the required capacity (using PCF application instances) to complete the entire amount of work (say, in RabbitMQ) by Saturday evening, the operations team will have enough time to check in on the platform after Sunday brunch to verify they’ve met the SLA. The nuts and bolts of this are pretty straightforward:
Available Minutes for Work = Deadline - Current Time
Projected Work Completion [units] = Work Completion Rate [units/minute] * Available Minutes for Work

if Projected Work Completion > Amount of Work:
    Have another beer 🍺
else:
    Back to work 😓
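In runnable form the check is only a few lines. Here’s a minimal Java sketch; the rate, backlog, and deadline values are hypothetical placeholders standing in for live measurements:

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class ForecastCheck {
    public static void main(String[] args) {
        // Hypothetical observations; in practice these come from the platform.
        double workCompletionRate = 40.0;   // units/minute, measured
        double amountOfWork = 100_000.0;    // units remaining, e.g. RabbitMQ queue depth
        LocalDateTime deadline = LocalDateTime.of(2017, 9, 2, 18, 0); // Saturday evening

        long availableMinutes = Duration.between(LocalDateTime.now(), deadline).toMinutes();
        double projectedWorkCompletion = workCompletionRate * availableMinutes;

        if (projectedWorkCompletion > amountOfWork) {
            System.out.println("Have another beer");
        } else {
            System.out.println("Back to work");
        }
    }
}
```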
Picking through the above logic uncovers the three operative variables:
* Deadline
* Amount of Work
* Work Completion Rate
An operator can *only* influence the completion rate. The market intelligence team defines the amount of work required to provide meaningful results to the marketing department, and the marketing department’s campaign schedule dictates the deadline. The operator is simply an intermediary between the two departments: ideally they’d automate the platform to deliver all the work on time by controlling only the number of application worker instances (i.e. the knob dictating the completion rate).
a. Marketing defines a deadline
b. Market intelligence defines the worker applications
c. Devops defines a workflow to calculate the number of application workers (b) needed to satisfy (a), as sketched below
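Step (c) amounts to inverting the forecast: given the remaining work and the time left, solve for the required completion rate, then divide by a single worker’s measured throughput. A minimal sketch, with hypothetical numbers:

```java
public class WorkerMath {
    // Invert the forecast: solve for the completion rate needed to hit the
    // deadline, then divide by one instance's measured throughput.
    static int requiredInstances(double remainingWorkUnits,
                                 long minutesToDeadline,
                                 double unitsPerInstanceMinute) {
        double requiredRate = remainingWorkUnits / minutesToDeadline; // units/minute
        // Round up so the plan errs on the side of meeting the SLA.
        return (int) Math.ceil(requiredRate / unitsPerInstanceMinute);
    }

    public static void main(String[] args) {
        // e.g. 100,000 units left, 36 hours to go, 12 units/minute per worker
        System.out.println(requiredInstances(100_000, 36 * 60, 12.0)); // -> 4
    }
}
```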
Staying true to the light reading goal, we’ll assume the operations team succeeded in turning the tables in favor of their weekend social life and enabled the business to self-service the deadline, which can be seen in interactive, low-res, pictorial glory below:
While the intention of this article isn’t to emphasize the technical details or code-level implementation, it’s probably helpful to visualize how this might unfold using actual technologies. It should come as no surprise to the reader that Spring is used almost exclusively:
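A rough, hypothetical sketch (not the article’s actual implementation): a Spring `@Scheduled` task samples the RabbitMQ backlog via Spring AMQP’s `RabbitAdmin` and recomputes the desired instance count each minute. The queue name, per-instance rate, and deadline below are made-up values.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;

import org.springframework.amqp.rabbit.core.RabbitAdmin;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Assumes @EnableScheduling on a configuration class and a RabbitAdmin bean.
@Component
public class WorkerScaler {

    // Hypothetical names and numbers, for illustration only.
    private static final String WORK_QUEUE = "ab-test-work";
    private static final double UNITS_PER_INSTANCE_MINUTE = 12.0;
    private static final Instant DEADLINE = Instant.parse("2017-09-02T22:00:00Z");

    private final RabbitAdmin rabbitAdmin;

    public WorkerScaler(RabbitAdmin rabbitAdmin) {
        this.rabbitAdmin = rabbitAdmin;
    }

    @Scheduled(fixedRate = 60_000) // re-evaluate the forecast once a minute
    public void reconcile() {
        Properties props = rabbitAdmin.getQueueProperties(WORK_QUEUE);
        if (props == null) {
            return; // queue not found; nothing to do yet
        }
        int remainingWork = (Integer) props.get(RabbitAdmin.QUEUE_MESSAGE_COUNT);

        long minutesLeft = Math.max(1, Duration.between(Instant.now(), DEADLINE).toMinutes());
        double requiredRate = (double) remainingWork / minutesLeft;        // units/minute
        int desired = (int) Math.ceil(requiredRate / UNITS_PER_INSTANCE_MINUTE);

        // Applying the decision is platform-specific, e.g. shelling out to
        // `cf scale worker-app -i <desired>` or calling the CF API.
        System.out.println("Desired worker instances: " + desired);
    }
}
```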