Strengthening the current app

After years of solid growth, Shopmonkey reached a point where we had outgrown the old architecture we now call "Shopmonkey V1". Many of our services had accrued significant technical debt over the years, making them more challenging and complex to manage and maintain at scale. We quickly recognized that substantial changes had to be made to ensure that our services remained resilient, scalable, and (most importantly) highly available to all of our customers.

To address this, we began work on "Shopmonkey V2". This completely revised architecture will allow us to take advantage of better technologies and frameworks and redesign our services from the ground up to be much more efficient and scalable. We’re excited about the V2 architecture, which you can read more about in Jeff’s blog post here. Although we’re still a few months shy of realizing the full potential of V2, we’re working hard in the meantime to reduce the likelihood and impact of severe performance issues and outages that can affect your shop’s experience in the current app.

Recently, we embarked on a mission to shore up our V1 services by adding more advanced alerting and monitoring that can catch potential issues before they become a real problem. This allows us to be more proactive and sometimes stop an issue before your shop even notices it. We also started hardening our Data Caching Layer (this is the software that makes your data available to the app as quickly as possible) to withstand load spikes during peak usage times and allow faster data access. Finally, we improved our postmortem workflow to communicate more efficiently with our customers during downtime incidents.

To start with, we set up a comprehensive monitoring system that tracks the performance of all of our V1 services in real time. We added better alerting capabilities that immediately notify our team when an issue arises, allowing us to address it before it affects the system’s stability. With this setup, we can now proactively identify potential problems and take corrective measures in a timely fashion.

 

We also began moving some of our services to hosting providers better equipped to handle our load and scaling needs. One of those services was our primary database, which is where your data is stored in our system. Moving it to a new managed provider has allowed us to provide a much more scalable service. It has also helped us identify performance bottlenecks so we can address them proactively, and it has given us the ability to "scale up" as needed to handle peak load conditions.

Another critical area we addressed was the performance of our data caching clusters. We noticed that this system was becoming a bottleneck under heavy loads, causing the entire V1 app to slow down to a crawl or crash altogether. To remedy this, we implemented a series of hardening measures to make our data caching clusters more resilient to load spikes. We optimized configurations, added redundancy, and distributed the load across multiple physical machines to improve performance and reduce the likelihood of downtime. We also migrated our data caching from a self-managed instance to a cloud-hosted service, which allows us to take advantage of the reliability, scalability, and Service Level Objectives offered by our cloud provider.
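For the technically curious, here is a rough idea of what a resilient caching layer does. This is a minimal sketch and not Shopmonkey's actual code: the names `TTLCache` and `read_through` are hypothetical, and the in-memory dictionary simply stands in for a real cache cluster. It shows the general pattern of serving data from the cache when possible and quietly falling back to the database if the cache is unavailable.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a cache cluster (hypothetical, for illustration)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it so the caller reloads fresh data.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._items[key] = (self._clock() + self._ttl, value)


def read_through(key: str, cache, load_from_db: Callable[[str], str]) -> str:
    """Return the value for `key`, preferring the cache over the database."""
    try:
        cached = cache.get(key)
    except Exception:
        # Cache is down: degrade to the database instead of failing the request.
        return load_from_db(key)
    if cached is not None:
        return cached
    value = load_from_db(key)
    try:
        cache.set(key, value)
    except Exception:
        # A failed cache write should never break the read path.
        pass
    return value
```

In this sketch, a cache outage makes a request slower rather than making it fail, which is the kind of behavior that hardening a caching layer generally aims for.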

To prepare for the unlikely event that an unavoidable issue does arise, we have greatly improved our internal processes to allow our teams to communicate more efficiently and to keep you in the loop. As part of this effort, we implemented a postmortem template to handle every incident systematically and thoroughly. This template captures a detailed report of what happened, what actions were taken, which customers were impacted, a root cause analysis posted on our status page, and the measures being put in place to prevent similar incidents from happening in the future.

The efforts of our Engineering team have created a much more stable and responsive product for your shop. We look forward to continuing to work with you and helping you grow your business on the stable foundation Shopmonkey provides.

 

Mike Goode is the Engineering Lead for our Infrastructure team, which is responsible for the monitoring, observability, and performance of the Shopmonkey platform.