Breaking Down a Monolithic API - Dockerizing the Bloomreach Suggest API Layer
Customers interact with Bloomreach’s products through Bloomreach’s frontend API layer which, historically, has been monolithic and served all types of requests like search, suggest, thematic, jfy, etc.
But not all requests are the same: they can have different requirements in terms of latency and queries per second (QPS). For example, autosuggest requests need much lower latency than any other request type.
So, we decided to refactor suggest out of the monolithic API and deploy it as a microservice instead. Suggest requests fit the microservice model well, as the code base was already small and had minimal dependencies on other modules.
The internal codename for the project is Sunbird which is a bird known for being lean, quick & sweet - and these were the qualities we wanted in our project too :)
Fig: Old architecture with all dependencies hosted on a single node
Before we began our migration efforts we had four goals in mind:
Have an independent & isolated application
Dockerize it using docker-compose
Deploy it to an automated container orchestration service
Integrate the new development and deployment process into Jenkins
When we began refactoring code for this new application, we wanted to make sure that:
We used the latest versions of all libraries and frameworks
The code base was as lean as possible
The general request-response flow remained unchanged
We had to port our code to Python 3 in the process. We didn’t face any major difficulties in the migration, and 2to3 was really useful.
After getting a basic skeleton ready, the next step was to slim down the codebase further. To eliminate dead code, we used coverage.py with unit tests and integration tests.
After this, we arrived at a code base that is just shy of 5K lines of code.
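The idea behind coverage-driven dead code removal can be sketched with the standard library alone. This is not coverage.py itself, which works at line level and with far more care. It is a simplified function-level illustration using sys.settrace, and the handler names (suggest, legacy_thematic) are hypothetical.

```python
import sys


def functions_never_called(test_runner, candidates):
    """Run test_runner and return the names of candidates it never entered."""
    called = set()

    def tracer(frame, event, arg):
        # A "call" event fires whenever a new Python frame is entered.
        if event == "call":
            called.add(frame.f_code)
        return None  # no per-line tracing needed for this sketch

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        test_runner()
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return [f.__name__ for f in candidates if f.__code__ not in called]


# Hypothetical handlers standing in for real application code.
def suggest(prefix):
    return [w for w in ("shoes", "shirts", "socks") if w.startswith(prefix)]


def legacy_thematic(prefix):
    return []


def run_tests():
    assert suggest("sh") == ["shoes", "shirts"]


dead = functions_never_called(run_tests, [suggest, legacy_thematic])
print(dead)
```

Anything the tests never reach is only a candidate for removal; it still needs a manual check, since a function may be unused by tests yet used in production.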
One issue we faced was fetching configuration settings, which depended on other modules. To abstract this away, we offloaded the settings as a dictionary (dict) to Redis (managed by AWS ElastiCache), along with a cron job on a different machine that gathers all the configurations and syncs them. Adding Redis to the architecture added around 8-10 ms of latency, because in the previous setup all configs were present locally on the system, but more on this later.
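A minimal sketch of that sync-and-fetch flow might look like the following. To keep it runnable without a Redis server, it uses a small in-memory stand-in that exposes only get and set; the key name, the settings, and the helper names are all assumptions made for illustration.

```python
import json


class FakeRedis:
    """In-memory stand-in exposing only the get/set subset of a Redis client."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)


CONFIG_KEY = "sunbird:config"  # hypothetical key name


def sync_config(store, gather):
    """Cron-side step: collect settings and publish them as one JSON blob."""
    settings = gather()
    store.set(CONFIG_KEY, json.dumps(settings, sort_keys=True))
    return settings


def load_config(store):
    """Service-side step: fetch and decode the settings dict."""
    raw = store.get(CONFIG_KEY)
    return json.loads(raw) if raw is not None else {}


def gather_from_modules():
    # Hypothetical settings; a real job would read them from other modules.
    return {"max_suggestions": 10, "timeout_ms": 50}


store = FakeRedis()
sync_config(store, gather_from_modules)
config = load_config(store)
print(config)
```

Storing the whole dict under one key keeps the service to a single round trip per fetch, which matters when every remote call costs several milliseconds.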
We came up with a multi-container architecture for Sunbird, where one container would only run NGINX, and the other would run Python code using Gunicorn. With docker compose, orchestrating multi-container apps is pretty easy. Here is what we initially had:
version: '2'
services:
  nginx:
    image: snap/sunbird:$NTAG
    ports:
      - "8000:8000"
    depends_on:
      - web
    links:
      - web
  web:
    image: snap/sunbird:$DTAG
    env_file:
      - ../config/devEnv.list
    environment:
      - ENVIRONMENT_VARIABLE=value
    expose:
      - "8000"
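For readers unfamiliar with this split, the Python container only needs to expose a WSGI callable for Gunicorn to serve, while NGINX proxies requests to it. The sketch below is a hypothetical stand-in, not the Sunbird code: the catalog, the q parameter, and the response shape are assumptions.

```python
import json
from urllib.parse import parse_qs

CATALOG = ("shirts", "shoes", "shorts", "socks")  # hypothetical suggestion data


def application(environ, start_response):
    """Minimal WSGI app: return catalog terms starting with the ?q= prefix."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    prefix = query.get("q", [""])[0].lower()
    matches = [term for term in CATALOG if prefix and term.startswith(prefix)]
    body = json.dumps({"q": prefix, "suggestions": matches}).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

Saved as a module (say, a hypothetical app.py), this could be started inside the container with something like gunicorn --bind 0.0.0.0:8000 app:application.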
Again, we wanted to have a minimal footprint, so we used a Python 3 Alpine image as the base image in the Dockerfile. A fully built image was only 160MB in size.
Removing ElastiCache Latency
To remove the additional latency caused by using ElastiCache, we came up with a multi-level cache alternative. We started storing some configurations locally using django-cache-memoize. This removed all the latency related to fetching configurations. It fit our use case perfectly because the data size that we wanted to store was rather small.
You wouldn’t want to use this approach if you have limited memory, as multiple Gunicorn workers would result in multiple local caches.
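The two-level idea can be illustrated without Django. The sketch below is not django-cache-memoize; it is a plain per-process cache with a time-to-live in front of a slower remote fetch. The TTL value, the loader, and the injected clock are assumptions chosen to keep the example self-contained.

```python
import time


class LocalTTLCache:
    """Per-process cache that refetches from a slower source after a TTL."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()  # slow path: remote round trip
            self._expires_at = now + self._ttl
        return self._value  # fast path: served from local memory


remote_calls = {"count": 0}


def fetch_from_remote():
    # Hypothetical stand-in for reading the config dict from ElastiCache.
    remote_calls["count"] += 1
    return {"max_suggestions": 10}


fake_now = [0.0]  # controllable clock so the example does not need to sleep
cache = LocalTTLCache(fetch_from_remote, ttl_seconds=60, clock=lambda: fake_now[0])
cache.get()
cache.get()         # still fresh: no remote call
fake_now[0] = 61.0
cache.get()         # expired: one more remote call
print(remote_calls["count"])
```

Because each Gunicorn worker is a separate process, each one would hold its own copy of such a cache, which is the memory trade-off mentioned above.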
Finally, our application architecture looked something like this:
Containerized Suggest Serving
Next, we had to figure out how to get logs from the application. Logs are pretty important, as we do a lot of analytics and monitoring based on logs. We didn’t want to use any kind of persistent storage on the machine that would be running the Docker containers, which is why we turned to a remote rsyslog server.
The standard way of getting logs from a container is to let the application print everything to stdout, and then use the Docker logging driver to redirect the logs to a remote rsyslog server.
Using rsyslog for logging made sense for us because we were already using it for log processing, rotation and storage in other applications.
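On the application side, this only requires that the logger writes to stdout rather than to a file. A minimal sketch with Python's standard logging module follows; the logger name and the message format are assumptions.

```python
import logging
import sys


def build_stdout_logger(name="sunbird"):
    """Return a logger that writes to stdout, where Docker can pick it up."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing it again
    return logger


logger = build_stdout_logger()
logger.info("suggest request served q=%s latency_ms=%d", "sho", 12)
```

No file paths or rotation settings appear in the application, since shipping and storage are left to the logging driver and rsyslog.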
An example Docker command with logging enabled:
docker run \
  --log-driver syslog --log-opt syslog-address=tcp://220.127.116.11:1111 \
  --log-opt mode=non-blocking --log-opt max-buffer-size=4m \
  alpine echo hello world
Non-blocking mode means that an intermediate buffer holds log messages before they are transferred from the container to the log driver. Using non-blocking mode is recommended because it prevents the container's standard output (STDOUT) and standard error (STDERR) streams from blocking the application when the log driver falls behind. The trade-off is that messages can be dropped if the buffer fills up.
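The behavior can be pictured with a small bounded buffer. This is a conceptual sketch only, not Docker's implementation: Docker sizes its buffer in bytes (max-buffer-size), while this one counts messages, and the policy of dropping the newest message when full is an assumption made for illustration.

```python
import queue


class NonBlockingLogBuffer:
    """Bounded buffer whose writer never waits for the consumer."""

    def __init__(self, max_messages):
        self._queue = queue.Queue(maxsize=max_messages)
        self.dropped = 0

    def write(self, message):
        """Never blocks the caller; returns False if the message was dropped."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self):
        """What a log driver would do: take everything currently buffered."""
        messages = []
        while True:
            try:
                messages.append(self._queue.get_nowait())
            except queue.Empty:
                return messages


buffer = NonBlockingLogBuffer(max_messages=2)
results = [buffer.write(m) for m in ("a", "b", "c")]
print(results, buffer.dropped)
```

In blocking mode the third write would instead wait until the driver caught up, which is exactly the stall that non-blocking mode avoids.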
Deploying to ECS
The next step was to take our local setup and deploy it on ECS, which fit right in with our existing infrastructure.
To get started with ECS, we needed to create an ECS cluster, which is a group of EC2 machines that can be managed either by you or by ECS. This is where your Docker containers will run.
In our launch model, we started with an empty cluster and created our own custom AMIs based on the Amazon ECS-optimized AMI. Each machine in an ECS cluster runs the ecsAgent (itself a Docker container), which is responsible for managing everything that happens on the EC2 instance linked to the cluster.
Before we could deploy to ECS, we needed to define a base taskdefinition.json file, which is like a JSON version of a docker-compose file. For this, we used ecs-cli, which takes a docker-compose file and creates a task definition for you on ECS.
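To make the relationship between the two formats concrete, here is a rough sketch of how a compose-style structure maps onto task definition fields such as family, containerDefinitions, portMappings and environment. It is not how ecs-cli is implemented and covers only a few keys; the image tags, the default memory value, and the function name are assumptions.

```python
import json


def compose_to_task_definition(family, compose, default_memory_mib=256):
    """Map a small subset of compose keys onto ECS task definition fields."""
    containers = []
    for name, service in compose["services"].items():
        port_mappings = []
        for mapping in service.get("ports", []):
            host_port, container_port = mapping.split(":")  # "host:container"
            port_mappings.append(
                {"hostPort": int(host_port), "containerPort": int(container_port)}
            )
        environment = []
        for entry in service.get("environment", []):
            key, _, value = entry.partition("=")  # "KEY=value"
            environment.append({"name": key, "value": value})
        container = {
            "name": name,
            "image": service["image"],
            "essential": True,
            "memory": default_memory_mib,  # assumed default, in MiB
            "portMappings": port_mappings,
            "environment": environment,
        }
        if service.get("links"):
            container["links"] = list(service["links"])
        containers.append(container)
    return {"family": family, "containerDefinitions": containers}


# Hypothetical, simplified compose structure with made-up image tags.
compose = {
    "services": {
        "nginx": {
            "image": "snap/sunbird:nginx-tag",
            "ports": ["8000:8000"],
            "links": ["web"],
        },
        "web": {
            "image": "snap/sunbird:app-tag",
            "environment": ["ENVIRONMENT_VARIABLE=value"],
        },
    }
}
task_definition = compose_to_task_definition("sunbird", compose)
print(json.dumps(task_definition, indent=2))
```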
Fig: Final Architecture
After the initial setup, it was time to run the performance tests. We use Vegeta for running performance tests. After the initial tests, we saw a 30% improvement in QPS without any additional impact to CPU and RAM usage when compared to old systems.
Ship Containers, Not Code
After successfully testing in our local dev environments, the next move in ECS was to automate everything using Jenkins.
With Docker, release management was a cakewalk. We tagged each release candidate as a container image and uploaded it to ECR.
For deploying to multiple regions, we set up a separate cluster for each region. Cluster and service names followed a consistent scheme, so deploying to a new realm only meant adding a new suffix to determine the cluster and service name.
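A naming scheme like this is easy to encode. In the sketch below, the account ID, region suffixes, base name, and release tag are all hypothetical. Only the general shape of the ECR image URI (account, region, repository, tag) follows the standard AWS format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployTarget:
    region: str
    cluster: str
    service: str
    image: str


def deploy_target(account_id, region, suffix, tag,
                  base="sunbird", repository="snap/sunbird"):
    """Derive cluster name, service name and image URI for one region."""
    image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
    return DeployTarget(
        region=region,
        cluster=f"{base}-cluster-{suffix}",
        service=f"{base}-service-{suffix}",
        image=image,
    )


# Hypothetical region-to-suffix mapping; a new realm is one more entry.
REALMS = {"us-east-1": "use1", "eu-west-1": "euw1"}

targets = [
    deploy_target("123456789012", region, suffix, "rc-42")
    for region, suffix in REALMS.items()
]
for target in targets:
    print(target.cluster, target.service, target.image)
```

A CI job could loop over such targets and run the same deploy step for each, which keeps per-region differences confined to the mapping.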
The above approach can pretty easily be integrated into a Continuous Integration system, where a CI server like Jenkins would pull in the code from a git repo and push finished images to ECR.
In conclusion, by experimenting with Docker and microservices and upgrading to Python 3, we got a 30% increase in the performance of our existing machines with minimal additional infrastructure costs. This allows us to handle more traffic for the autosuggest service with less downtime!
Acknowledgements - This project was made possible with help from Mohit Jain, Raunak Bhansali, Santosh Hegde, Bholi Patra, Scott Powers, Corwin Brown & Gianluca Privitera of the Bloomreach engineering team.
Special thanks to Naveen Vardhi for being an awesome mentor throughout the project.