Self-hosted agents at Azure DevOps: a little cost-saving trick

Azure DevOps does a great job of providing hosted agent services. The agents come loaded with all the required software, and Microsoft takes care of updates and everything else, but they have some major drawbacks:

No static external IP address (so it's not possible to add an extra layer of security)
You get a new VM each time, so you need to clone your repository afresh, install a fresh set of NPM packages, and pull all those base Docker images (not a big deal for alpine-based images, but when it comes to the Microsoft ones, it really hurts)
For closed source projects there is a hard limit of 1800 minutes per month per hosted job (I do not like limits even if I never hit them 😀 )
And so on – you name it

To overcome this, one can deploy self-hosted agents, but then you have to deal with updates, installation of tooling and extra cost. How to deal with updates and actual management is covered in this blog post, but it still leaves the question of cost partially open. I spent some time making several improvements to the work of Wouter de Kort in a fork of his repository; check it out at my repository. There I optimized and improved some of the scripting, but the main thing I wish to cover in this blog post is the cost-optimization tool I built for our fleet of self-managed agents.

Problem

We had only 2 hosted VS2015/17 jobs for performing builds and releases for 10 projects, each of which required anywhere from 5 to 20 minutes to build and somewhere between 15 and 30 minutes to release. That was quite taxing, especially when queues built up.

Idea

In our particular situation we have 7 parallel jobs for self-hosted agents, which come for "free" with our Visual Studio subscriptions, so I began looking for a way to leverage those to improve our build and release speed. The initial setup strictly followed the Wouter de Kort blog series, with the VMs in a virtual machine scale set automatically switched off at night and on weekends to save costs. But as soon as I began receiving requests to start some VMs for the weekend, to start them later in the evening for off-hours tasks, or earlier in the morning for urgent deployments, I started looking for a way to automate these tasks. This led to the idea of a continuous web job that monitors the queue in the target pool and starts VMs when they are needed (and stops them when they are not).

Realization

I came to Azure DevOps with a strong TeamCity background, so I was hoping to find something similar to the build queue in TeamCity. Alas, Azure DevOps takes another approach: all tasks are assigned to a pool, where they are picked up in a FIFO (first in, first out) manner by the first agent that is online and free. There is no queue as such; each task just has a "Result" property. If it is null, the task has not yet been executed. If all tasks assigned to the pool have a non-null "Result", there is nothing to do.

If some tasks in the pool do have a null "Result", the code checks how many online agents are present in the pool. If the agent count equals the task count, again, there is nothing to do. If the agent count is less than the task count, we need to start more virtual machines in the virtual machine scale set for our agents. If more agents are online in the pool than there are assigned tasks, we need to stop the virtual machines of the extra agents in the scale set.

There is also an option to define business hours and days of the week during which a minimum number of agents must be online to speed up development (so teams do not have to wait for an agent to become active and pick up a task; the task starts immediately).

The check to provision more VMs runs once every 2 minutes, to minimize waiting time without abusing the API too much. The check to deprovision VMs runs once every 15 minutes, which gives agents more time online before they are stopped. That matters because, as I normally observe, a developer will wish to deploy to the Test environment almost immediately after a successful build.
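The decision logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Autoscaler's actual code: tasks are modelled as plain dictionaries with a "result" key, and every function and constant name here is made up for the example. Only the rules themselves (null "Result", the comparison of agents against tasks, the business-hours minimum, and the 2 and 15 minute intervals) come from the text.

```python
from datetime import datetime, time, timedelta

# Check intervals taken from the text; the names are invented for this sketch.
PROVISION_INTERVAL = timedelta(minutes=2)
DEPROVISION_INTERVAL = timedelta(minutes=15)


def count_open_tasks(tasks):
    """Count tasks whose "result" is still None (the null "Result" in the text)."""
    return sum(1 for task in tasks if task.get("result") is None)


def minimum_agents(now, business_days, opens, closes, business_minimum):
    """Minimum number of online agents required at the moment `now`.

    `business_days` holds weekday numbers as returned by datetime.weekday()
    (0 is Monday). Outside business hours the minimum is zero.
    """
    in_business_hours = now.weekday() in business_days and opens <= now.time() < closes
    return business_minimum if in_business_hours else 0


def scaling_delta(open_tasks, online_agents, minimum):
    """Positive: VMs to start. Negative: VMs to stop. Zero: nothing to do."""
    desired = max(open_tasks, minimum)
    return desired - online_agents


def is_due(last_check, now, interval):
    """True when at least `interval` has passed since the last check."""
    return now - last_check >= interval
```

A caller would run the provisioning check whenever `is_due(..., PROVISION_INTERVAL)` holds and act only on a positive delta, and run the deprovisioning check on the slower `DEPROVISION_INTERVAL` and act only on a negative delta. For example, with two open tasks, one online agent and no business-hours minimum, `scaling_delta(2, 1, 0)` returns 1, so one more VM should be started.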

In my humble opinion, this solution is better than statically switching virtual machines off and on according to some schedule, because it allows any task (compile, test, release, whatever is executed on your agents) to be fulfilled at any time in a Just-In-Time manner. Naturally, if all agents were switched off, it will take some time for them to come online, but thanks to the business hour / day option this will only happen in off-hours.

All settings and deployment instructions for the Autoscaler application are described here. I will not duplicate them in this blog post, as they could change over time, while the readme document will be kept up to date.

The code of the Autoscaler app can be seen at this location.

There is also an ARM template for the baseline configuration of a web app. It is suitable if you have only one pool in Azure DevOps to monitor, as it defines settings at the App Settings level of the web app itself. If there is more than one pool, only shared settings should be defined in the App Settings of the web app, while pool-specific settings should be added to the App Settings of each individual web job. You can host as many web jobs as you need, but mind the web app limits.

Be aware that the ARM template by default deploys to a D1 web app (a Shared web app, which allows a limited amount of CPU time and only 1 GB of RAM, without Always On). The "Always On" feature ensures that the hosting process is always awake and is not shut down after 20 minutes of inactivity. So, if a web job is deployed without additional precautions, it will not work, as the web app runtime will shut down the Kudu process after 20 minutes of inactivity. There is a nice trick to keep it up and running: you need to ping the Kudu homepage of your web app at least once every 20 minutes. I am using https://www.happyapps.io/ to visit the Kudu homepage of my web app once every 5 minutes at the address https://webappName.scm.azurewebsites.net/
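If you would rather not rely on an external monitoring service, the same keep-alive ping could be scripted. The following is a minimal sketch using only the Python standard library; the function names are made up, and it assumes that any HTTP request reaching the Kudu site counts as activity, even one answered with an authentication error, which you should verify for your own setup.

```python
import urllib.error
import urllib.request


def kudu_url(web_app_name):
    """Build the Kudu homepage address for a web app, as given in the text."""
    return f"https://{web_app_name}.scm.azurewebsites.net/"


def ping(url, timeout=30):
    """Return the HTTP status code, or None if the site could not be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The site answered, just not with a success status (for example 401).
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return None
```

Calling `ping(kudu_url("webappName"))` from any scheduler every 5 minutes would mirror the setup described above. Keep in mind that the scheduler has to run somewhere that is itself always on.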

Deployment hints

By default, the Azure web app runtime does not execute a continuous web job from the path it is deployed to, but I still wish to be sure that it is not running while I am deploying it, so I use the following PowerShell scripts to stop and start the web job.

To stop webjob:

Invoke-AzureRmResourceAction -ResourceGroupName resourceGroupName -ResourceType Microsoft.Web/sites/continuouswebjobs -ResourceName webAppName/webJobName -Action Stop -Force -ApiVersion 2018-02-01

To start webjob:

Invoke-AzureRmResourceAction -ResourceGroupName resourceGroupName -ResourceType Microsoft.Web/sites/continuouswebjobs -ResourceName webAppName/webJobName -Action Start -Force -ApiVersion 2018-02-01

The same scripts are used when rebuilding the virtual machine scale set, to ensure that the web job does not attempt to stop the VMs before they have been registered with the pool.
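For pipelines where the AzureRM cmdlets are not available, the Azure Resource Manager REST API offers start and stop actions on the same resource type. The sketch below only builds the request URL, so that its shape is easy to inspect; the function name is made up, and the subscription ID (which the PowerShell commands take from the current login context) and the bearer token needed for the actual POST request are assumptions not covered in this post.

```python
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"


def webjob_action_url(subscription_id, resource_group, web_app, web_job,
                      action, api_version="2018-02-01"):
    """Build the ARM URL for starting or stopping a continuous web job."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    # Escape every path segment so unusual names cannot break the URL.
    segments = [quote(part, safe="") for part in
                (subscription_id, resource_group, web_app, web_job)]
    return (
        f"{ARM_ENDPOINT}/subscriptions/{segments[0]}"
        f"/resourceGroups/{segments[1]}"
        f"/providers/Microsoft.Web/sites/{segments[2]}"
        f"/continuouswebjobs/{segments[3]}/{action}"
        f"?api-version={api_version}"
    )
```

The resulting URL would be sent as an authenticated POST request. The API version defaults to the one used in the PowerShell commands above.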

This blog post has been created by me and edited by my colleague and friend Rob Habraken.