Node single point of failure

Note: written in August 2012 during my summer at Voxer. sky has been released as zag!

Voxer believes that any single point of failure is bad. Very bad (this is why we use riak). To achieve this, we've written a couple libraries to allow node processes to communicate across machines and in concert. The first is poolee by Danny Coates which does http load balancing on the client, and the second is ring, which organizes processes that compose a service into a cooperating whole (each process owns part of the keyspace, and thus part of the work).

Here's how this looks:

Requests emanate from clients via poolee, which has a list of all API servers. Poolee chooses a random API server to speak with, and will stop talking to the API server if it becomes unresponsive. Health checks on unresponsive servers are performed with using exponential backoff (e.g. 1 second, 8 seconds, 16 seconds, etc), so as not to flood the network.

The ring of API servers kicks in! Each client request has a distinguishing characteristic (e.g. geographic region), which dictates a specific API server as the "home" or "owner" of that data, and it is tasked with responding to that request. You may have noticed that poolee chooses a random node, but it turns our that this random node knows it is not the owner of the request, and can forward it to the true "home" node of that request.

ring is consistent, which means that when an API server goes down, the other servers will fill in for the downed server without changing the homes of data stored on any nodes other than one neighboring node. The efficiency of home re-shuffling is what makes the ring consistent. If this doesn't make sense, it is because it's difficult to explain, wikipedia does a better job of explaining consistent hash rings.

Failure

Poolee and ring work together and allow us to do rolling restarts without service interruption, restart single machines when SSL has memory leaks, and remove machines from our clusters.

Thus, Voxer achieves a no single point of failure architecture using the power of node.js.

Thanks for reading!
David Trejo
@ddtrejo & DTrejo on github

David Trejo

Engineer at Chime & consultant. Past clients include Credit Karma, Aconex, Triplebyte, Neo, the Brown Computer Science Department, Voxer, Cloudera, and the Veteran's Benefits Administration.