
November 2014 Service Disruption Postmortem

As part of our ongoing commitment to providing you with the highest possible level of service, we at MJ Freeway have published this Knowledgebase article to provide transparency into the service disruption many users experienced November 6-8. We hope it sheds some light on the circumstances in which the disruption occurred and reassures you that the measures we are taking in response will minimize the risk of similar disruptions in the future.

First, a timeline of the circumstances leading up to the service disruption:

 

  •        In the Spring of 2014 it became apparent to us that our hosting architecture needed substantial revision to support our continued growth.
  •        Following the plan of action set out in the Spring, we initiated a hardware upgrade in early August with our hosting provider at the time.
  •        Shortly thereafter it became evident that this upgrade would not be sufficient for our continued growth, resulting in a more immediate need for a reengineered server architecture.
  •        We employed expert web application and hosted software consultants to audit our infrastructure and determine the best solution to accommodate current and future growth. In that process, we determined that we would move to a fully virtual environment, allowing nearly unlimited future growth. (Advances in cloud encryption technology have allowed us to provide our solution in a virtual environment while still following HIPAA compliance guidelines, something that has only recently become possible in the realm of cloud solutions.)
  •        We interviewed and selected a hosting provider capable of delivering this type of solution and premium service.
  •        In preparation for the migration to the new environment, we performed weeks of both functional and load testing.
  •        Although our load testing accurately predicted performance, including expected post-upgrade performance, on our old environment, it proved unreliable on our new environment. The main reason was that the test load came from only a handful of IP addresses, whereas in production the same level of load comes from many hundreds of different IP addresses.
  •        The migration itself proceeded smoothly and according to plan; however, when traffic began to pick up the following day, it became clear that the traffic exceeded the environment’s capacity, leading to the significant disruption in the days that followed.

We have taken the following measures to stabilize and accommodate traffic:

  •        We have reorganized client data into three “pods,” which are clusters of site databases, each with its own dedicated hosting infrastructure. The pods will be geographically distributed to support multiple failover scenarios. This arrangement will ensure a more even distribution of resources.
  •        We are revising our load and stress testing processes to ensure more accurate results in the future, which will help us prevent or mitigate future disruptions.

In the long term, our plan is as follows:

  •        We will continue fine-tuning the pod system to guarantee each individual pod is configured optimally.
  •        To ensure the future scalability we sought from the outset, we will regularly partition new databases into additional pods.
  •        Should a given pod begin to suffer diminished performance due to insufficient resources, the pod system will allow us to react much more quickly, either by upgrading that individual pod or by splitting off a new pod, with a far faster turnaround. This will allow the growth of our business and yours to continue, unimpeded by any growing pains we or any other clients may experience.
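
For readers who want a more concrete picture of the pod approach, the following is a minimal illustrative sketch in Python and not our production code. It assumes a simplified model in which each site database has a single numeric load and each pod has a fixed capacity. All names in it (Pod, assign, split_if_overloaded) and all numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """A hypothetical pod: a cluster of site databases with a fixed capacity."""
    name: str
    capacity: int                               # assumed abstract load units
    sites: dict = field(default_factory=dict)   # site name -> load units

    @property
    def load(self) -> int:
        return sum(self.sites.values())


def assign(pods, site, load):
    """Place a site database on the pod with the lowest relative load."""
    target = min(pods, key=lambda p: p.load / p.capacity)
    target.sites[site] = load
    return target


def split_if_overloaded(pods, pod, new_name):
    """If a pod exceeds its capacity, move its heaviest sites to a new pod.

    Returns the new pod, or None when no split was needed.
    """
    if pod.load <= pod.capacity:
        return None
    new_pod = Pod(new_name, pod.capacity)
    # Iterate over a sorted copy so the pod's own dict can be modified safely.
    heaviest_first = sorted(pod.sites.items(), key=lambda kv: kv[1], reverse=True)
    for site, _ in heaviest_first:
        if pod.load <= pod.capacity:
            break
        new_pod.sites[site] = pod.sites.pop(site)
    pods.append(new_pod)
    return new_pod


# Example: three pods, then one pod outgrows its capacity and is split.
pods = [Pod("pod-1", 100), Pod("pod-2", 100), Pod("pod-3", 100)]
assign(pods, "site-a", 60)
assign(pods, "site-b", 50)
assign(pods, "site-c", 10)
pods[0].sites["site-d"] = 70      # simulated growth on pod-1
split_if_overloaded(pods, pods[0], "pod-4")
```

The point of the sketch is that an overloaded pod can be relieved on its own, by moving a few site databases, without touching the other pods.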

 

We take our responsibility to protect your data very seriously and continue to follow HIPAA security guidelines. At no point was any client data at risk, compromised, or lost during this process.

 

While this was a painful migration, virtualizing our hosting provides our clients with a long-term environment that offers increased performance and reliability.
