This post has been republished via RSS; it originally appeared at: Microsoft Mobile Engineering - Medium.
This is a four-part series of articles by Tehmur Khan and Praveen Pendyala providing an overview of how the SwiftKey Android team in Microsoft use Continuous Integration and Continuous Delivery (CI/CD), particularly around how we use Azure DevOps (ADO) following our migration from Jenkins. A special mention goes to Helalur Khan who was instrumental in the Cloud Infrastructure efforts during this migration project.
We will first provide some context of SwiftKey and our needs before giving an overview of Azure DevOps (ADO), allowing those less familiar with ADO to still understand the rest of the content. We will explain how we used ADO to set up our CI/CD and how we leveraged Azure Pipelines features. After that we will explain some of the challenges we faced and how we went about resolving them, before summarising some learnings and our future CI/CD plans.
Firstly, what exactly is CI/CD?
Continuous Integration (CI) is the practice used by development teams to automate the merging and testing of code. Implementing CI helps to catch bugs early in the development cycle, which makes them less expensive to fix. Automated tests execute as part of the CI process to ensure quality. Artifacts are produced from CI systems and fed to release processes to drive frequent deployments…
Continuous Delivery (CD) is a process by which code is built, tested, and deployed to one or more test and production environments. Deploying and testing in multiple environments drives quality. CI systems produce the deployable artifacts including infrastructure and apps. Automated release processes consume these artifacts to release new versions and fixes to existing systems. Monitoring and alerting systems run continually to drive visibility into the entire CD process…¹
SwiftKey is a large-scale intelligent keyboard application leveraging machine learning and Artificial Intelligence. It has over 500 million downloads, including preinstallation on many different devices. We release to market at a regular cadence of every two weeks, and deliver the same application binary to both the Google Play Store and our OEM partners.
With great scale comes great responsibility! With a wide range of features and functionality offered by our keyboard we must ensure it maintains our high quality and stability bar. With so many use cases and potential edge cases the need for automation is imperative. One critical path is Android’s Direct Boot mode feature that allows data to be protected until the device has been fully unlocked. In this use case, if we are the only keyboard on the device and the user has chosen to use a password to unlock it, there is a potential to brick the device if the keyboard crashes. Clearly, we want to ensure this never happens and so we see how important an extensive and rigorous testing infrastructure is. We thus need to make sure our CI/CD setup caters to having such testing.
SwiftKey CI/CD Needs
There are three main requirements we need within SwiftKey for our CI/CD.
- The Android build environment
This is essential to building APKs and running unit tests.
- Android emulators
Android emulators are used extensively throughout our CI/CD, including for running Android nightly tests for each API level and Android tests on pull requests.
- Physical devices
We utilise physical devices for various custom test suites to identify memory leaks, keyboard performance regressions, end-to-end functionality regressions and Direct Boot mode regressions.
SwiftKey’s CI/CD Evolution
Over time our CI/CD infrastructure has evolved. We had been using Jenkins for a long time, but cracks were starting to appear. We would continuously apply plasters to mask the issues we faced. It was clear it was an ageing system and so we began a migration project onto Azure DevOps (ADO) in line with many other Microsoft apps and services. By the end of 2019 we had fully migrated onto ADO.
We previously used Jenkins with a site-to-site VPN between our own server room and Amazon Web Services (AWS). Our Jenkins master was hosted in AWS and could communicate with the local build nodes in our server room. Our Cloud Infrastructure (CLI) team owned the node setup, and the Android team handled the creation and maintenance of the job configurations.
Some advantages of our Jenkins setup included ease of set-up for custom webhooks, as well as being able to run on bare metal, which offered fast execution times. However, there were major inconveniences when we needed to install new packages or perform maintenance work on Jenkins, as these required Jenkins to be rebooted, causing downtime for our CI/CD. Also, when there were power cuts, Jenkins would go down, and this would often cause issues across many nodes and builds.
Our lack of configuration management software on Jenkins meant changes to each node, such as when we needed to upgrade the Android tools, had to be done manually. With over 10 nodes, this led to configuration drift, with nodes gradually ending up on different tool versions and settings.
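To make the idea of configuration drift concrete, here is a minimal Python sketch of how it could be detected across build nodes. This is an illustration only and not something taken from our Jenkins setup: the node names, tool names and version numbers are hypothetical, and treating the most common version of each tool as the expected one is an assumption made for the example.

```python
from collections import Counter


def find_drift(node_versions):
    """Return {node: {tool: (found, expected)}} for nodes that differ
    from the most common version of each tool across all nodes."""
    tools = sorted({tool for versions in node_versions.values() for tool in versions})

    # Assumption: the version installed on most nodes is the intended baseline.
    baseline = {}
    for tool in tools:
        counts = Counter(versions.get(tool) for versions in node_versions.values())
        baseline[tool] = counts.most_common(1)[0][0]

    drift = {}
    for node, versions in node_versions.items():
        diffs = {
            tool: (versions.get(tool), baseline[tool])
            for tool in tools
            if versions.get(tool) != baseline[tool]
        }
        if diffs:
            drift[node] = diffs
    return drift


# Hypothetical inventory of tool versions reported by each build node.
inventory = {
    "node-01": {"build-tools": "29.0.2", "platform-tools": "29.0.5"},
    "node-02": {"build-tools": "29.0.2", "platform-tools": "29.0.5"},
    "node-03": {"build-tools": "28.0.3", "platform-tools": "29.0.5"},
}

print(find_drift(inventory))
```

In this made-up inventory, only node-03 is reported, because its build-tools version differs from the other two nodes. A check like this only surfaces drift; configuration management or containerised agents are what prevent it in the first place.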
We also did not leverage newer practices such as containerisation, and we relied on a few Jenkins specialists, which led to situations where we were blocked until the infrastructure was fixed.
Azure DevOps (ADO)
So, what does ADO bring to the table? Our main requirements were reliability and scalability, and these are exactly what ADO was designed for. More specifically, we replaced our Jenkins setup with Azure Pipelines.
Azure Pipelines…is a cloud native continuous integration pipeline, providing the management of build and release pipelines and build agent virtual machines hosted in the cloud. In addition, Azure Pipelines supports a hybrid cloud and on-premises model…²
Using ADO meant we could move away from the more complex SSH and firewall setups we had on Jenkins and utilise the built-in ADO security mechanisms.
ADO also provides the flexibility to use either Microsoft-hosted agents or, for more complex needs, self-hosted agents. Microsoft-hosted agents are managed entirely by Microsoft, so software updates and maintenance are taken care of. Amongst other benefits, self-hosted agents provide more flexibility in terms of utilisation time, but come at the cost of additional complexity and management.
With the agent abstraction provided by ADO, each engineering team can own their Docker agent configuration (Infrastructure as Code). Within SwiftKey this creates a clear separation of concerns between the Cloud Infrastructure (CLI) team and other engineering teams. This allows our Android team to self-serve our infrastructure and resolve issues or update packages without relying on the CLI team. The CLI team owns the base platform that offers this self-serving functionality for various engineering teams.
Migrating to ADO would naturally bring some degree of complexity for all teams, as it was a new infrastructure where up-skilling was needed to learn how it works and how to manage our own agents. We felt however this was a great opportunity to utilise a platform that is constantly evolving and has great momentum and an ever-growing community behind it.
In the next article in this series, we will dive into how we used Azure DevOps to set up our CI/CD. We will cover the building blocks of Azure Pipelines and how we set up our agents within the SwiftKey Android team.