Building a better Shopmonkey

For the past six months, we have embarked on a very ambitious, but necessary, re-platforming of our product. We call this version "2.0" or "v2" and you'll be hearing much more about this effort starting today. I've started this blog to give you additional information about this journey and help you gain better insight into how we're building a better Shopmonkey platform for your business. My goal for this blog is to be super transparent. In some cases it's going to be deeply technical in nature, but I'll do my best to try and frame it in ways that help you understand what's in it for you – your shop and your employees – and ultimately, what's in it for your customers, too.

So, let's get "under the hood" and talk shop about building a better Shopmonkey.

How we got here

To start off, I thought it would be good to recap how we got here.  Our current platform, which we refer to as "1.0" or "v1", was built several years ago when the company was started.  It served us well as we grew from a few hundred to a few thousand customers.  But like any great vehicle, over time, it needs maintenance and at some point, it just needs a new engine. v1 is servicing many thousands of users per minute and at peak usage during the day, we're seeing over 21,000 requests per second against our database. That's a LOT.

User requests per second

Fundamentally, our scale and technical requirements to satisfy the needs of your shop have changed over the years.  About six months ago, just after I joined Shopmonkey and after a deep dive into some of the issues, we set off to build v2.  I will talk about some of the challenges and the solutions to these issues in this blog today and over the coming weeks and months. I hope you'll join along in this journey to help contextualize the issues you're sometimes experiencing.

In the end, we're building a much better product than you're using today and one that will serve us all for many years to come.

Let's talk about some of the "challenges" of our current product from the standpoint of architecture. You can think of a product's architecture like the fundamental building blocks of the design of a car. The requirements are vastly different between an F1 race car and a modern sedan. The engine, the gear box, the suspension, fuel injection, and every little detail is different between a modern, high-performance racing vehicle and something that helps you get around town. Products are the same.

Car architecture differences

At the current scale of the number of shops and users we have using Shopmonkey, we are challenged by a few major areas:

- Scalability.  How much can the product handle?  For example, how many users can use the product at the same time?

- Responsiveness.  Given the number of users at any given point, how responsive is the application (in terms of latency: "quickness" or "slowness")?

- Availability.  When I go to use the product, can I perform the task or does it not work?

- Stability.  When I do something in the product, does it work every single time?

- Usability.  Is it clear how to perform the task I'm trying to perform, or is it confusing?

And to make matters more challenging, all of the above are interrelated. There's not often "one thing" that can be fixed, since changing or tweaking one will impact the others.

v1 Architecture

Today, when you log in to Shopmonkey and use the product, your browser is making requests to one major data center in the cloud located in Boardman, Oregon (regardless of where your shop is located). This impacts Responsiveness depending on how far you are from this location. For example, if your shop is located in Atlanta, Georgia, you are approximately 2,436 miles from the place where the product runs in the cloud. Unfortunately, while the Internet is fast these days, data transfer is still constrained by the speed of light. In practice, this constraint means that it takes a minimum of about 85ms to send a packet of data from the east coast to the west coast, excluding all the other factors such as size, congestion, etc. Simply put: the farther you are from the destination, the more latency (slowness) you will experience. Even if everything were available and scaled up on the other side, the product can only be as fast as your distance from that location allows. In modern cloud computing, you might have heard a lot of talk about "Edge Computing". This is one of the problems that edge computing is attempting to solve: putting your workloads as close to your users (the "edge" of the network) as possible.
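For the curious, here is a rough back-of-the-envelope sketch of the distance penalty in Python. It only computes the physical floor for a straight-line path; real routes are longer and pass through many routers, which is why real-world figures such as the one quoted above are higher. The 2/3 factor for light in fiber is a common approximation and an assumption here, as are all the names in the code.

```python
# Back-of-the-envelope propagation delay over a straight-line path.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum
KM_PER_MILE = 1.609344
FIBER_VELOCITY_FACTOR = 2 / 3  # assumption: light in glass fiber travels at ~2/3 c


def one_way_delay_ms(distance_miles: float, velocity_factor: float = 1.0) -> float:
    """Minimum one-way propagation delay in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    return distance_km / (SPEED_OF_LIGHT_KM_PER_S * velocity_factor) * 1000


atlanta_to_boardman_miles = 2436  # distance quoted in the post

vacuum_ms = one_way_delay_ms(atlanta_to_boardman_miles)
fiber_ms = one_way_delay_ms(atlanta_to_boardman_miles, FIBER_VELOCITY_FACTOR)
fiber_round_trip_ms = 2 * fiber_ms  # a request has to go there and come back

print(f"one way, vacuum:   {vacuum_ms:.1f} ms")
print(f"one way, fiber:    {fiber_ms:.1f} ms")
print(f"round trip, fiber: {fiber_round_trip_ms:.1f} ms")
```

Every request pays this cost at least once, and a page that makes several requests in sequence pays it several times. Halving the distance halves the floor, which is the whole argument for running closer to your shop.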

The other problem with having one location is that all traffic will land at the same point and often at the same time. This adds a lot of challenges for load distribution, which impacts Scalability, Availability and Responsiveness, and results in the real perception of (in)Stability. In v1, all our users are using the same infrastructure (compute, network, storage, database, etc.), which creates a ton of burden for scaling our hardware and software to meet the needs of so many shops. We often see what is called the "thundering herd problem", where a lot of different users hit the network at the exact same moment. You can see this represented in these typical graphs:

Web transactions time, morning

Web transactions time, afternoon

These spikes are when you experience "sluggishness" or the feeling "it's not responding". The software is trying to respond, but so many requests are queued up that each one waits too long to be serviced. And what makes this problem even more challenging is that a browser will time out or the network will retry the request, further compounding the issue.
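The queueing effect can be illustrated with a tiny simulation. This is a simplified sketch with made-up numbers, not a model of our actual systems: each "tick" a fixed number of requests can be served, and anything beyond that waits in a backlog.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the number of requests still waiting at the end of each tick.

    Each tick, new requests arrive and at most `capacity_per_tick` are served.
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_tick:
        backlog = max(0, backlog + arrivals - capacity_per_tick)
        history.append(backlog)
    return history


# Hypothetical traffic: a steady 80 requests per tick with a 3-tick burst of 300.
arrivals = [80, 80, 300, 300, 300, 80, 80, 80, 80, 80]
history = simulate_backlog(arrivals, capacity_per_tick=100)

for tick, waiting in enumerate(history):
    print(f"tick {tick}: {waiting} requests waiting")
```

The burst lasts only three ticks, but the backlog it creates drains slowly because there is little spare capacity once traffic returns to normal. Retries make this worse by adding extra arrivals on top of requests that are already waiting.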

One of the challenges with small spikes like this is that our infrastructure can't respond quickly enough to autoscale up to support the load. When traffic arrives in very short bursts of a high number of requests, the system suffers in two major ways: (1) when it averages requests/latency over a window, the result will often look like "normal, in range" latency and (2) even when it notices a spike, the scale-up is wasted because the traffic has subsided within a minute or two, before the new infrastructure is ready to support the load. 🤯
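Here is a small sketch of why averaging hides a short burst. The latency samples, the threshold and the scale-up time are all hypothetical values chosen for illustration.

```python
from statistics import mean

# Hypothetical per-second latency samples (ms) over a 60-second window:
# 57 seconds of normal traffic and a 3-second burst.
latencies_ms = [50] * 57 + [2000] * 3

THRESHOLD_MS = 200      # hypothetical autoscaling trigger on average latency
BURST_SECONDS = 3       # length of the burst in the samples above
SCALE_UP_SECONDS = 120  # hypothetical time to bring new capacity online

window_mean = mean(latencies_ms)
window_max = max(latencies_ms)

# (1) The average stays under the trigger even though some users waited 2 seconds.
average_looks_normal = window_mean < THRESHOLD_MS

# (2) Even if the spike were noticed, new capacity would arrive after it ended.
capacity_arrives_too_late = SCALE_UP_SECONDS > BURST_SECONDS

print(f"mean latency: {window_mean} ms, worst latency: {window_max} ms")
print(f"average looks normal: {average_looks_normal}")
print(f"capacity arrives too late: {capacity_arrives_too_late}")
```

Looking at the worst case or a high percentile instead of the average makes the burst visible, but it does not fix the second problem: capacity that takes minutes to start cannot help with a spike that lasts seconds.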

All of these issues make for a difficult and extremely stressful situation at Shopmonkey. During the Christmas / New Year's break, some of the team spent multiple days (with little to no rest and constant working) trying to maintain the system and keep it running to reduce any impact to your day-to-day operations. And even after all this work, we weren't successful for all shops. At some point, there's only so much we can do given the current architecture.

v2 Architecture

v2 is a fundamentally different architecture that attempts to tackle the unique requirements for you and the rest of our customers: (1) we have a large set of shops (nearly 5,000 and growing each month) across a massive geography in the US and Canada, (2) we have predictable but high demand during peak periods, defined approximately as 6AM ET to 8PM ET, with almost no traffic in the valley, and (3) we have certain very short periods of the day where we have massive traffic spikes.

This allows us to rethink our architecture to match these requirements in a different way, starting with taking advantage of modern cloud capabilities in edge computing to move most of our processing as close to your shops as possible. In v2, we are introducing 3 "super regions" and 11 "city regions" (or simply "region") where our software will run. A super region is a geographic region – in this case, US West Coast, US Central and US East Coast – where the majority of our infrastructure runs in a highly available configuration. Inside each of the super regions, we have smaller city regions, 11 of them in total. And within each region, we further divide into what are called "availability zones".

Think of a super region as a geographic part of the country spanning multiple states (US) and provinces (Canada).  A region is a city inside that larger geographic area.  An availability zone is a physical data center in close proximity (but not too close for disaster situations) in a city. For each region, we run in a minimum of 3 availability zones.  This means that if one data center had a disaster (say a fiber cut or fire in the building that caused an outage in connectivity or power) another center across the city would still be operational.

Regions diagram

From a business continuity standpoint, this means we can suffer a full super region failure (e.g., a hurricane on the East Coast of the US) and still have 2 super regions and likely at least 8 other cities able to service your traffic.
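To make the hierarchy concrete, here is a minimal sketch of super regions, regions and availability zones as a data structure. The region names and the 4/4/3 split of the 11 regions across the 3 super regions are placeholders invented for this example; only the counts (3 super regions, 11 regions, at least 3 zones per region) come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    zones_total: int = 3  # minimum of 3 availability zones per region
    zones_down: int = 0

    @property
    def is_serving(self) -> bool:
        # A region keeps serving as long as at least one zone is up.
        return self.zones_down < self.zones_total


@dataclass
class SuperRegion:
    name: str
    regions: list[Region] = field(default_factory=list)


def serving_regions(super_regions, failed_super_regions=frozenset()):
    """Names of regions still able to serve traffic."""
    return [
        region.name
        for super_region in super_regions
        if super_region.name not in failed_super_regions
        for region in super_region.regions
        if region.is_serving
    ]


# Placeholder topology: the real city list and split are not given here.
topology = [
    SuperRegion("us-west", [Region(f"west-{i}") for i in range(1, 5)]),
    SuperRegion("us-central", [Region(f"central-{i}") for i in range(1, 5)]),
    SuperRegion("us-east", [Region(f"east-{i}") for i in range(1, 4)]),
]

# One data center outage in a city: the region keeps serving.
topology[0].regions[0].zones_down = 1
print(len(serving_regions(topology)))

# A whole super region lost: the remaining regions keep serving.
print(len(serving_regions(topology, failed_super_regions={"us-east"})))
```

With this placeholder split, a single zone outage leaves all 11 regions serving, and losing the East Coast super region leaves 8.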

For each super region, we have an active-active replica of our core components, namely our database, messaging infrastructure and other critical systems that make the product work.

Each region is located inside the high speed, low latency private network of our cloud provider. That means even if a request comes in to one city – say Los Angeles, California – and needs to be serviced in Dallas, Texas, it's super fast given the high speed backbone of the cloud provider (vs. the public Internet, which has to traverse a lot of different providers).

The other aspect of this architecture is that we can distribute load (i.e. your activity while you're using the product) across a lot of different regions. Instead of one location and set of infrastructure taking all the burden, we can now distribute that load across 11 different regions, which often have their own traffic distribution based on time zones. It also means any failure (although we hope to make any failures completely transparent and behind the scenes) is localized to a smaller set of customers given this diffusion.

Shopmonkey cloud latency testing app

You can see your nearest region, along with its latency and mileage, by visiting https://api.shopmonkey.cloud. (Only 4 of the 11 regions are operational today, but that will change over the next few weeks.)
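As an illustration of what "nearest region" means in terms of mileage, here is a hedged sketch that picks the closest of a set of candidate regions by great-circle distance. It is not how the testing app works internally (that is not described here), and the two candidate cities are simply the ones named earlier as examples; whether they are actual regions is an assumption, as are the coordinates and function names.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def great_circle_miles(a, b):
    """Haversine distance in miles between two (latitude, longitude) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


# Hypothetical candidate regions with approximate city-center coordinates.
REGIONS = {
    "Los Angeles": (34.0522, -118.2437),
    "Dallas": (32.7767, -96.7970),
}


def nearest_region(shop_location):
    """Return (region name, distance in miles) for the closest candidate."""
    name = min(REGIONS, key=lambda n: great_circle_miles(shop_location, REGIONS[n]))
    return name, great_circle_miles(shop_location, REGIONS[name])


atlanta_shop = (33.7490, -84.3880)  # approximate coordinates for Atlanta, Georgia
region_name, miles = nearest_region(atlanta_shop)
print(f"nearest region: {region_name}, about {miles:.0f} miles away")
```

Geographic distance is only a proxy. Measured latency is the better signal, which is why the testing app reports both.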

Takeaways

From a business standpoint, our goal is to provide you with the best experience possible to help your shop service your customers every single day. This means we need to ensure that our product is as fast as possible, all the time, and without any interruption. This is no small feat for any product company with our scale. But the Shopmonkey team is "all in" on this commitment to you and I believe we are well on our way to delivering on this promise. I'm confident in our path forward and look forward to sharing details with you and getting your feedback.

Next up, we'll start to deep dive into some of the more exciting features coming as part of v2 that I know you're going to love. 💙