Automation is now a fixture of software development. We rely on it not only when building application code or running tests, but also when deploying the product for the client and maintaining it afterwards. There is nothing surprising or mysterious about it: Continuous Integration (CI) and Continuous Delivery (CD) have become part of everyday work. Development teams consist of engineers with broad competences, including specialists in test automation (QA) and infrastructure management (DevOps).
At Metapack, cross-functional software development teams can meet business challenges in much less time than they could a decade ago.
What changed, and what was optimized, so that the time needed to deliver new functionality is now an order of magnitude shorter than a few years ago? To answer this question, I will walk through the transformation that the software delivery process itself had to undergo. I will also show how the teams coped with the challenges that came their way while building a platform to support our clients’ business processes.
A long time ago…
As is usually the case in stories, it all began a long, long time ago with a monolith to which many teams made hundreds of changes from sprint to sprint. The code was managed by an archaic version control system hosted ‘behind the wall’ on an internal server. The final application was compiled and assembled by our own tools and, as you have probably guessed, finding the time to check that the latest changes had not broken the whole thing was always a big challenge. Deployments, in turn, were an emotional affair. Over time, gradual improvements appeared, such as a dedicated build system (TFS) and automated nightly tests using SoapUI, which eliminated the need for lengthy manual testing. Although this greatly shortened the time needed to prepare a version for the client, the process still took several hours. There was one more problem to solve: the complicated deployment. We’ll get to that later; for now, let’s see how things turned out for our CI process.
A few years ago our team was challenged to move all sources and tools from local servers to the cloud.
This task let us introduce a number of further improvements and opened the door to more. The TFS server and the version control system it used were replaced by cloud tools: Bitbucket and the well-known Git, with all its benefits, such as proven branching strategies and change review through ‘pull requests’. The ‘pipelines’ for automatic code building and testing were, in turn, built from scratch in TeamCity, a very versatile and intuitive tool. Without much effort, we cut the build time of our main products by tens of minutes more, relying only on TeamCity’s built-in capabilities, such as rebuilding only the components whose code had changed and optimizing shared library management through its built-in caching mechanisms.
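The idea of rebuilding only what changed can be illustrated with a small sketch. This is not how TeamCity implements it internally (the article does not describe that); it is only one way to reason about the selection, assuming a hypothetical dependency graph between components. The component names and the function name are made up for illustration.

```python
from collections import deque


def components_to_rebuild(depends_on, changed):
    """Return the changed components plus everything that depends on them.

    depends_on maps a component name to the components it depends on.
    """
    # Invert the graph: component -> components that depend on it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk from the changed components through their dependents.
    to_rebuild = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in to_rebuild:
                to_rebuild.add(dependent)
                queue.append(dependent)
    return to_rebuild


# Hypothetical component graph, for illustration only.
graph = {
    "Api": ["Core", "Shipping"],
    "Shipping": ["Core"],
    "Reporting": ["Core"],
    "Core": [],
}
rebuild = components_to_rebuild(graph, {"Shipping"})
```

In this example a change in "Shipping" leaves "Core" and "Reporting" untouched, while a change in "Core" would force a rebuild of everything.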
The migration was successful, but we did not rest, and with each successive sprint we introduced gradual improvements to the existing mechanisms. These included parallelizing unit test execution and consolidating a large number of small projects (Solutions in .NET) into one, in order to take advantage of parallel compilation across all available CPUs of the machine. Still, the time needed to deliver the monolith to production did not come anywhere near the fifteen minutes mentioned in the title.
Technologies to the rescue
We were greatly helped by the company’s strategic decisions about the architecture of the planned new functionality and its integration with the current platform. This is how we started our adventure with highly available, independent microservices in the AWS cloud. The autonomy of services went hand in hand with the autonomy of teams: each owner of a new service decided on its internal architecture and technology.
My team was tasked with building an alternative to the core functionality of our platform. It was an ambitious and demanding task, especially given the number of dependencies between components and the need to maintain compatibility with the existing solution. The result of our work was the concept of a platform based on containers (Docker), whose functionality can be extended with dedicated modules injected during the integration process.
The result resembles the well-known Dependency Injection pattern, which we could also easily use in the code of the new platform itself, thanks to its independent sources and the choice of the latest version of .NET Core.
But how did it actually reduce the time it takes to deliver new functionalities to the customer? I will try to explain it briefly.
Separating a large part of the system’s core into its own repository made it possible to remove ‘dead’ code and cover the rest with automated tests thoroughly enough to ensure backward compatibility. At this point, we were able to build a base image of the so-called ‘core’ of the platform. Most importantly, as long as the ‘core’ did not change, its build no longer had to be repeated every time the next version of the application was released.
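One way to decide whether the ‘core’ needs rebuilding is to fingerprint its sources and compare the result with the fingerprint recorded for the last base image. The sketch below assumes exactly that; the article does not say how the check is actually done, and the function names and the in-memory file map are hypothetical.

```python
import hashlib


def core_fingerprint(files):
    """SHA-256 digest over file paths and contents.

    files maps a relative path to the file's bytes. Paths are processed in
    sorted order so the same sources always give the same digest.
    """
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between path and content
        digest.update(files[path])
        digest.update(b"\0")  # separator between entries
    return digest.hexdigest()


def needs_core_rebuild(files, last_built_fingerprint):
    """True when the core sources differ from those of the last base image."""
    return core_fingerprint(files) != last_built_fingerprint


# Hypothetical core sources, for illustration only.
core_files = {
    "Core/Engine.cs": b"class Engine {}",
    "Core/Rules.cs": b"class Rules {}",
}
last_fingerprint = core_fingerprint(core_files)

# A later commit that touches a core file changes the fingerprint.
modified_files = dict(core_files)
modified_files["Core/Rules.cs"] = b"class Rules { int version = 2; }"
```

With such a check in place, a module-only change reuses the existing base image, and only a change inside the core triggers the longer build.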
The second improvement was eliminating dependencies between the modules that extend the platform, which increased the autonomy of teams: they gained the ability to choose a database suited to their needs and to decide how to integrate with external systems. Additionally, each team has at its disposal a universal ‘pipeline’ for deploying a module as an independent service directly in the AWS infrastructure. Thanks to this, code merged into the ‘master’ branch is immediately built and automatically integrated with the main platform into a ready-to-deploy container image (Docker).
Then the whole thing is verified for quality. This step runs on two tracks: static code analysis is performed together with the unit tests (NUnit / xUnit); at the same time, the module itself is verified from a business perspective. For this purpose, we use higher-level tests, the so-called acceptance tests, written following the BDD approach in the SpecFlow framework. In line with the test pyramid, there are proportionally fewer of them, so they run quickly and within a single module, which allowed us to eliminate the need for additional nightly tests.
Can such a verified image go directly to the cloud?
Almost, but before that happens, we must be sure that our service will work with the AWS services it depends on, such as DynamoDB, SQS or S3. But how can we find that out as early as possible, so as not to waste time unwinding the entire intricate process? The very useful LocalStack framework (localstack.cloud) came to our rescue.
Because LocalStack provides AWS services in the form of a container image, we were able to run our integration tests as early as the CI stage.
The state of the external services is always constant and known. Above all, however, the tests are independent of others run on the same infrastructure, and their financial cost (excluding the cost of the TeamCity agent) is practically zero. Not to mention the time saved, which we would otherwise spend creating independent infrastructure in the cloud each time. We know this cost first-hand from our first approach, in which we used Terraform scripts to create the resources.
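In practice, pointing integration tests at LocalStack mostly comes down to overriding the endpoint the AWS client talks to. The sketch below shows that switch in Python purely for illustration (the services described here are written in .NET). The `USE_LOCALSTACK` and `LOCALSTACK_ENDPOINT` variables, the function name and the fallback region are assumptions; port 4566 is LocalStack's default edge port, and the dummy credentials are there only because a client expects some to be present.

```python
import os

# LocalStack's default edge endpoint when run locally as a container.
DEFAULT_LOCALSTACK_ENDPOINT = "http://localhost:4566"


def aws_client_kwargs(env=None):
    """Build keyword arguments for an AWS SDK client.

    When the (hypothetical) USE_LOCALSTACK switch is set to "1", the client
    is pointed at the LocalStack container instead of real AWS.
    """
    env = os.environ if env is None else env
    if env.get("USE_LOCALSTACK") == "1":
        return {
            "endpoint_url": env.get("LOCALSTACK_ENDPOINT", DEFAULT_LOCALSTACK_ENDPOINT),
            "region_name": "us-east-1",
            # Placeholder credentials; they are not real AWS keys.
            "aws_access_key_id": "test",
            "aws_secret_access_key": "test",
        }
    # Against real AWS, rely on the normal credential chain.
    # The fallback region is an assumption made for this sketch.
    return {"region_name": env.get("AWS_REGION", "eu-west-1")}


ci_kwargs = aws_client_kwargs({"USE_LOCALSTACK": "1"})
prod_kwargs = aws_client_kwargs({"AWS_REGION": "eu-west-2"})

# With boto3 installed, the result would be used like this:
#   sqs = boto3.client("sqs", **aws_client_kwargs())
```

Keeping the switch in one place means the test code itself does not need to know whether it is talking to a container on the CI agent or to real infrastructure.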
An additional benefit of using containerized AWS services is that a similar setup can be used on a local machine. For this purpose, teams can use a NuGet package we created (a set of PowerShell scripts combined with docker-compose), which, when the code is built, automatically integrates the module with the platform and runs acceptance tests for the entire service.
To sum up: the entire process, from merging the code into the ‘master’ branch, through building an image (Docker) and testing the application, to verifying its interaction with AWS services, currently takes from 5 to 10 minutes, depending on the number of tests.
There is only the last element left – deployment
As we remember, releasing a version to the client was usually associated with stress and time-consuming regression testing. We didn’t want to repeat that this time around, so we put the greatest emphasis on reliability and stability. Above all, however, we focused on automation with minimal human input. With this approach, we tried to reuse what was already in place from the very beginning of the process. Since the carefully prepared acceptance tests (SpecFlow) have already verified the finished artifacts, why not use them in the next stages as well?
In this way, the concept of service performance verification in the CD process (deployment pipeline) was born, and it also became possible to create smoke tests. The latter also act as a warm-up mechanism for our service, so that it is ready to accept traffic as soon as it starts.
From the previously run tests we obtained reliable samples of requests to our API, which we successfully reused in the CD process we were creating. They are used to generate a high load on the freshly deployed service, which we do with the great Gatling tool. Of course, we do not do this in the production environment, but in the previously updated test environment, which mirrors production in terms of infrastructure. If the API response times (most often the 95th percentile) are within the defined criterion (SLO) for the module being published, nothing prevents us from approving the update of the service in production. An automated Terraform script is responsible for the last stage and promotes the service to the latest version in a few simple steps. Thanks to the use of Fargate in AWS ECS and the aforementioned ‘warm-up’ mechanism, all of this takes place without any downtime for ongoing traffic (zero-downtime), and the service remains scaled appropriately for the prevailing conditions in the production environment.
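The gate itself is a simple comparison: take the response times measured during the load test, compute the chosen percentile and check it against the SLO. The sketch below is not Gatling's own assertion mechanism; it assumes the response times have already been exported as a list of milliseconds, and it uses the nearest-rank definition of a percentile. The function names, sample values and threshold are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(response_times_ms, slo_ms, pct=95):
    """True when the given percentile of response times is within the SLO."""
    return percentile(response_times_ms, pct) <= slo_ms


# Hypothetical load-test result: 100 requests taking 1..100 ms.
response_times_ms = list(range(1, 101))
p95 = percentile(response_times_ms, 95)
approved = meets_slo(response_times_ms, slo_ms=200)
```

Using a percentile rather than the average keeps a handful of very fast responses from hiding a slow tail, which is why it is a common choice for this kind of criterion.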
How long does it all take?
On average, the test environment updates in minutes. To this we add 5 minutes for load tests. Once they pass, we have the green light to ‘update’ production. At this point, we assign a new image to the task definition, run short smoke tests and finally switch traffic, which in total takes a few more minutes.
To sum up: in the most favorable scenario, where the CI process takes a maximum of 5 minutes and we need 7 minutes to ‘update’ the test environment and run load tests, we are left, for dessert, with a whole 3 minutes to safely switch production. And there we have it: a quarter of an hour.
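The budget for the most favorable scenario can be written down as a short calculation. The stage labels below are only shorthand for the steps described in this article; the minute values are the ones quoted above.

```python
# Minutes per stage in the most favorable scenario described in the text.
stage_minutes = {
    "CI: build, tests, LocalStack integration": 5,
    "Test environment update and load tests": 7,
    "Production switch": 3,
}

total_minutes = sum(stage_minutes.values())

for stage, minutes in stage_minutes.items():
    print(f"{stage}: {minutes} min")
print(f"Total: {total_minutes} min")
```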
Technologies and tools used: