
June 9th, 2024

Streamlining Engineering Workflows: Zooming in on Deployment Previews.

{ Engineering }

by Chima Ataman


Introduction.

Managing a shared codebase in a collaborative software engineering environment can present efficiency, consistency, and collaboration challenges. Streamlining workflows and ensuring efficient communication and coordination are essential for our team’s success. In this article, I highlight how we enhanced collaboration in our shared codebases, providing insights into the challenges, common workflows, and solutions we implemented.

Multiple Engineers, One Codebase.

When multiple engineers contribute to a shared environment, many challenges arise that impact the quality and speed of software development. These could include conflicts arising from simultaneous changes, deployment issues, and potential communication gaps. Establishing clear guidelines, communication channels, and structured workflows is essential to tackle these problems and ensure smooth collaboration.

Any company using shared environments should invest in tools and resources that facilitate collaboration and make it easier to track changes. Version control tools like Git and code-sharing platforms like GitHub help teams track and record changes made to codebases; however, properly using these tools determines their effectiveness.

For those unfamiliar with the concept, a codebase is the collection of source code files used to build a software application. Git, a widely used version control system, lets multiple engineers edit the same codebase simultaneously and manages changes and revisions effectively. You can read more about it in the Pro Git Book.

How We Solved This at Cowrywise.

The choice of a workflow depends on the specific needs and dynamics of the team and the nature of the projects undertaken. We’ve adopted the Feature Branch Workflow as our primary engineering process, tailoring our approach to the unique requirements of our team and projects. We also implemented several strategies to optimise this workflow within our teams. Here are some key approaches we adopted:

Leveraging Feature Flags for Flexibility

Introducing new features into a shared codebase environment can be complex, particularly when managing feature rollouts, conducting A/B testing, and enabling or disabling features based on customer feedback. Feature flags, or feature toggles, provide a powerful mechanism to address these challenges. We integrated feature flags into our codebase to allow us to turn specific features on or off, independent of code deployment.

This capability has proven invaluable in orchestrating gradual feature rollouts, conducting A/B testing, and controlling feature access for different customer segments. With feature flags, we can release new functionality to a subset of customers, gather feedback, and make informed decisions about feature activation and refinement. Furthermore, feature flags offer a mechanism for instant rollback by simply toggling off a feature flag, providing us with a safety net in case of unexpected issues following a feature rollout.
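To make the toggle mechanism concrete, here is a minimal sketch of an environment-backed feature flag check. The `ENABLED_FLAGS` variable and the flag names are hypothetical, not our actual configuration; a real setup would typically read flags from a database or a dedicated flag service so they can be flipped without redeploying.

```shell
#!/bin/sh
# Minimal sketch: flags arrive as a comma-separated list,
# e.g. ENABLED_FLAGS="savings_v2,new_cards" (hypothetical names).
feature_enabled() {
  flag="$1"
  case ",${ENABLED_FLAGS}," in
    *",${flag},"*) return 0 ;;  # flag is toggled on
    *)             return 1 ;;  # flag is toggled off
  esac
}

ENABLED_FLAGS="new_cards"
if feature_enabled "new_cards"; then
  echo "new_cards is on"
fi
```

Because the check happens at request time rather than deploy time, toggling a flag off doubles as an instant rollback.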

Implementing Deployment Previews for Validation.

This was the game-changer for us, and I’ll share why later.

Testing and validating changes before integrating them into the main codebase is essential to ensure the stability and functionality of our app. We implemented deploy previews for feature branches, which we activate when needed. As engineers work on new features in their respective branches, the changes are automatically deployed to a staging environment, allowing engineers and other stakeholders to preview and interact with the new features in a production-like setting. This approach encourages thorough testing, validation, and feedback collection, contributing to greater confidence in the quality and functionality of the changes before we merge them into the main codebase.

This approach is especially handy when changes affect more than one team. For example, while the backend team works on a feature that affects mobile, the mobile team has a preview URL to build against. The mobile team can also use it to build test applications, enabling manual testing by non-technical people.

Deployment Previews: Our Implementation Journey.

At this point, we knew what we needed: a way to deploy feature branches that mirrors our staging environment. Making this happen, however, was quite the task. Since we like controlling our entire deployment stack, completely external solutions were not high on our list.

First Draft.

Our first draft used Docker images, GitHub Actions, a dedicated EC2 box and tunnelling tools like localtunnel and ngrok.

First, we created a custom label, branch-deploy, then a GitHub Actions workflow that would be triggered when a pull request (or PR) with this label is opened or updated. When a feature branch is ready, we open a PR to the main development branch (develop). If we need a deployment preview for the feature, we attach the branch-deploy label to the PR.
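The label gate itself is simple. In CI, the label names would come from the GitHub CLI, e.g. `gh pr view <number> --json labels --jq '.labels[].name'`; the sketch below only shows the check, reading one label name per line from stdin.

```shell
#!/bin/sh
# Returns success only if "branch-deploy" appears as a whole line on stdin.
has_branch_deploy_label() {
  grep -qx "branch-deploy"
}

# Illustrative: a PR carrying the label triggers a preview deployment.
printf 'bug\nbranch-deploy\n' | has_branch_deploy_label && echo "deploy preview requested"
```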

After the normal Continuous Integration (CI) test (also a GitHub Actions workflow) runs and passes, the CI job calls the branch-deploy workflow using the workflow_call trigger. You can read more about that in the GitHub Actions documentation.

The branch-deploy workflow logs into the provisioned EC2 server and does a few simple things:

  1. Builds a Docker image using the codebase’s Dockerfile.
  2. Deploys the image using Docker Compose.
  3. Creates a secure web tunnel from this server to the internet using either localtunnel or ngrok.

Also, the deployment goes down when the PR is merged or closed.
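The three steps above can be sketched roughly as follows. The image tag, Compose file name, and port are illustrative, not our actual values; `DOCKER` and `TUNNEL` are overridable so the sketch can be dry-run.

```shell
#!/bin/sh
# Sketch of the first-draft deployment steps (names are assumptions).
DOCKER=${DOCKER:-docker}
TUNNEL=${TUNNEL:-ngrok}

deploy_preview() {
  branch="$1"
  $DOCKER build -t "app-preview:${branch}" . &&           # 1. build the image
  $DOCKER compose -f docker-compose.preview.yml up -d &&  # 2. deploy via Compose
  $TUNNEL http 8000                                       # 3. open a tunnel
}
```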

There were a few drawbacks to this approach.

First, and most annoyingly, the deployments had no state, because the databases were destroyed and recreated on each deployment (that is, on each push to the deployed branch).

Second, the tunnelling tools left much to be desired; localtunnel, which is free, becomes unreachable at random, while ngrok limits free accounts to a single tunnel link. We had no intention of paying for this service.

Second Draft.

We updated the first draft to fix a few pain points.

We created a volume for the database containers on deployments and namespaced it to the head branch of the PR that created it. This way, all updates to the PR and even new PRs on that branch would use this volume, and data for the feature would persist. It worked perfectly.
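Namespacing a volume to a branch mostly comes down to sanitising the branch name into a valid Docker volume name. A minimal sketch (the `pr_`/`_data` naming is illustrative):

```shell
#!/bin/sh
# Turn a branch name into a safe, lowercase volume-name fragment,
# e.g. feature/Savings-V2 -> feature_savings_v2.
branch_volume() {
  printf '%s' "$1" | tr -c '[:alnum:]' '_' | tr '[:upper:]' '[:lower:]'
}

echo "pr_$(branch_volume 'feature/Savings-V2')_data"
# The volume is then created once and reused across deployments, e.g.:
# docker volume create "pr_$(branch_volume "$BRANCH")_data"
```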

As for tunnelling tools, we started testing alternatives. We tried hosting our own localtunnel instance (it is an open-source project); however, we ran into blockers and were unwilling to allocate precious engineering hours to fixing a seemingly unmaintained project. We then checked for alternative tools and found serveo.net. For good measure, we kept all of these services in active use: the deployment job tries each one in succession, in the order localtunnel, serveo, ngrok, until one succeeds. Both serveo.net and localtunnel also allowed setting up a dedicated subdomain for deployments, so we set that up, which meant each update would get the same deployment URL. (On ngrok, this is a paid feature, so we didn't bother with it.) This worked well enough for a while.
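The succession logic is just "try each command until one succeeds". A sketch, with the actual provider commands passed in as strings (the `cw-preview` subdomain in the usage comment is hypothetical):

```shell
#!/bin/sh
# Try each tunnelling command in order; stop at the first that succeeds.
start_tunnel() {
  for cmd in "$@"; do
    if $cmd; then
      return 0  # this provider worked
    fi
  done
  return 1  # every provider failed
}

# Illustrative usage:
# start_tunnel "npx localtunnel --port 8000 --subdomain cw-preview" \
#              "ssh -R cw-preview:80:localhost:8000 serveo.net" \
#              "ngrok http 8000"
```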

All this was starting to feel really good, but it didn’t last.

The tunnelling services still had issues at times. localtunnel, after a successful deployment, would sometimes refuse connections. serveo would sometimes not work at all, which meant many fallbacks to ngrok and, thus, frequent changes to deployment URLs. These drawbacks were really frustrating for the mobile and web teams.

localtunnel and serveo also had an understandable but frustrating security feature: each unique user connecting to a tunnel URL had to be verified first. The services would redirect the user to a web page where they entered the host's IP address, confirming they knew they were connecting to a tunnelled link. Since we were exposing APIs, not browser-facing pages, this was unacceptable.

Also, the single EC2 box would sometimes run out of resources, which capped the number of deployments we could make, primarily because of the task server we bundled with the app: each deployment ran a few task worker instances for processing asynchronous requests.

Radical Third Draft (Final Form).

The second draft ran only briefly, as engineering hours were limited. Then we noticed an opportunity: we had just set up a Docker Swarm cluster with multiple compute nodes. We updated our branch-deploy workflow to deploy feature branches as stacks in the Swarm cluster using a template Compose file. We also reduced the number of task workers. This solved our resource problem.

To solve our tunnelling issue, we went fully internal. We used Traefik, an open-source edge router, to route requests from the internet (through an internal tunnelling subdomain “*.tunnels...”) to the deployed Docker Swarm stack.

The whole route now looked like this:

Internet ("<subdomain>.tunnels...") -> AWS Load balancers -> Traefik -> Docker Swarm network -> App container

We also improved the cleanup routine. Now, when the pull request is merged or closed, the stack and all images are removed.
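The cleanup routine boils down to removing the stack and its image. A sketch, with names mirroring the template below but still illustrative; `DOCKER` is overridable for dry runs:

```shell
#!/bin/sh
# Remove a PR's preview stack and its built image when the PR closes.
DOCKER=${DOCKER:-docker}

cleanup_preview() {
  ref="$1"                                 # e.g. pr-1234 (ref format is an assumption)
  ref_under=$(printf '%s' "$ref" | tr '-' '_')
  $DOCKER stack rm "example_pr_${ref_under}"               # removes the stack's services and networks
  $DOCKER image rm "ghcr.io/org/example/app-build:${ref}"  # frees the built image
}
```

Note that `docker stack rm` leaves named volumes in place, which is what preserves the feature's database between PRs on the same branch.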

When an update comes to the feature branch PR and CI passes, it triggers the branch-deploy workflow. This workflow builds the Docker image and then deploys it to the Swarm cluster using a Compose file generated from a template.

Here’s a stripped version of this template:

docker-compose-tmpl.swarm.yml


   1  version: '3.8'
   2  volumes:
   3    example_pr_{{ref_under}}_data: {}
   4 
   5  services:
   6    example_pr_{{ref_under}}_staging:
   7      image: ghcr.io/org/example/app-build:{{ref}}
   8      environment:
   9          PR_DEPLOY: 'true'
  10      depends_on:
  11        - example_pr_{{ref_under}}_db
  12 
  13      networks:
  14        - swarm-ingress-overlay
  15        - example-pr-{{ref_under}}-network
  16 
  17      deploy:
  18        restart_policy:
  19          condition: on-failure
  20          delay: 2s
  21          window: 30s
  22        labels:
  23          - "traefik.http.routers.example_pr_{{ref_under}}_staging.rule=Host(`{{subdomain}}.tunnels.example.com`)"
  24          - "traefik.http.routers.example_pr_{{ref_under}}_staging.entrypoints=web"
  25          - "traefik.http.services.example_pr_{{ref_under}}_staging.loadbalancer.server.port=8000"
  26          - "traefik.enable=true"
  27          - "traefik.docker.lbswarm=true"
  28 
  29    example_pr_{{ref_under}}_db:
  30      image: mysql
  31      volumes:
  32        - ./staging_dump.sql.gz:/docker-entrypoint-initdb.d/staging_dump.sql.gz
  33        - example_pr_{{ref_under}}_data:/var/lib/mysql
  34      networks:
  35        - example-pr-{{ref_under}}-network
  36      deploy:
  37        restart_policy:
  38          condition: on-failure
  39          delay: 2s
  40          max_attempts: 5
  41          window: 30s
  42 
  43  networks:
  44    swarm-ingress-overlay:
  45      external: true
  46    example-pr-{{ref_under}}-network:
  • This template is copied in the deployment workflow, and the placeholders (the text in double curly braces) are replaced with actual values unique to the PR. We accomplish this easily with tools like sed, tr, and cut. This filled-in copy is what gets deployed to the Swarm cluster.

    Here’s a demo:

    REF=$(echo ${{ github.event.client_payload.ref }} | tr -c '[:alnum:]\n\r' '-' | tr '[:upper:]' '[:lower:]' | sed 's/refs-pull-/pr-/')
    REF_UNDER=$(echo ${{ github.event.client_payload.ref }} | tr -c '[:alnum:]\n\r' '_' | tr '[:upper:]' '[:lower:]' | sed 's/refs_pull_/pr_/')
    SUBDOMAIN=$(echo cws-${{ github.event.client_payload.branch }} | tr -c '[:alnum:]\n\r' '-' | tr '[:upper:]' '[:lower:]' | cut -c1-60)
    COMPOSE_FILE="path/to/docker-compose-${REF_UNDER}.swarm.yml"

    # ======= use sed to prepare the Compose file
    sed -i -e "s/{{ref_under}}/$REF_UNDER/g" $COMPOSE_FILE
    sed -i -e "s/{{ref}}/$REF/g" $COMPOSE_FILE
    sed -i -e "s/{{subdomain}}/$SUBDOMAIN/g" $COMPOSE_FILE
  • The labels on lines 22-27 of the Compose file are how Traefik determines how and where to route requests. See the Traefik documentation for more.
  • The first volume feeds an initialisation SQL script for the database volumes on lines 31-33. We then use it to set up the database on the first deployment and populate it with data. staging_dump.sql.gz is a recent dump of the database in the staging environment. We use the second volume to store data between deployments.
  • For the networks (lines 43-46), swarm-ingress-overlay is the network configured for Traefik to route internet requests. This network is attached to all services that need to be reachable from the internet.

In a Nutshell.

We’ve enhanced collaboration and efficiency in our shared codebase environment by embracing the Feature Branch Workflow, implementing deploy previews, and leveraging feature flags. These measures have improved the quality and stability of our software products and fostered a culture of seamless collaboration and rapid iteration within our engineering team. We cut down ‘time-to-deployment’ for major features from several days to under one day.

As our engineering continues to evolve, the emphasis on streamlined workflows and collaboration will remain foundational to delivering innovative, high-quality products in a collaborative environment.

Happy coding and collaborating!

References

  1. Pro Git Book
  2. Git Flow
  3. Branching Patterns
  4. GitHub Actions Docs
  5. Traefik Docs
  6. Docker Swarm Docs
  7. Linux Handbook

Chima Ataman
{ Team Lead Infrastructure } Cowrywise