Load Balancing, Consistent Hashing, and Locality

http://cleverdomain.org/haproxy-velocity/

Vimeo

Skyfire

Handles the majority of our website plays, and all iOS/Android; gets over 1 billion requests/day

Skyfire (deployment diagram)

Dynamic Packaging

Indexing

First Approach

(Launch through Nov 2015)

Hash Review: Modulo Hashing
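
As a quick refresher, a minimal sketch of modulo hashing (illustrative only, not Skyfire code): hash the key and take it modulo the number of servers. Simple, but almost every key moves whenever a server is added or removed.

#include <stdint.h>

/* Toy FNV-1a hash, purely for illustration. */
static uint32_t hash_key(const char *key)
{
    uint32_t h = 2166136261u;
    for (; *key; key++) {
        h ^= (uint8_t)*key;
        h *= 16777619u;
    }
    return h;
}

/* Modulo hashing: easy to implement, but when nservers changes,
 * nearly every key maps to a different server and the local caches
 * behind them go cold. */
static unsigned pick_server_modulo(const char *key, unsigned nservers)
{
    return hash_key(key) % nservers;
}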

Consistent Hashing

(Image courtesy Zack M. Davis)

Hash space is treated like a ring. When a new server is added, it adds several points to the ring. Some parts of the ring will now be closest to the new points, so a portion of traffic moves to the new server. But most parts stay the same. Likewise for deletion.
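
A minimal sketch of that lookup, assuming the points are kept in a flat sorted array (HAProxy's real implementation, shown later, walks an elastic binary tree instead): find the two points that bracket the request's hash and pick the closer one, wrapping around at both ends of the ring.

#include <stddef.h>
#include <stdint.h>

struct ring_point {
    uint32_t hash;   /* position of this point on the ring */
    int      server; /* index of the server that owns it */
};

/* points[] holds every server's points, sorted by hash; assumes npoints > 0. */
static int pick_server_ring(const struct ring_point *points, size_t npoints,
                            uint32_t h)
{
    /* Binary search for the first point with hash >= h. */
    size_t lo = 0, hi = npoints;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (points[mid].hash < h)
            lo = mid + 1;
        else
            hi = mid;
    }
    size_t next = (lo == npoints) ? 0 : lo;             /* wrap past the end */
    size_t prev = (next == 0) ? npoints - 1 : next - 1; /* wrap before the start */

    /* Unsigned subtraction measures distance around the ring correctly
     * even when one of the differences wraps past zero. */
    uint32_t dn = points[next].hash - h;
    uint32_t dp = h - points[prev].hash;
    return (dp <= dn) ? points[prev].server : points[next].server;
}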

Problem: Load Distribution

Solution: Different Load Balancing Policy

Problem: Now our cache is no good!

Solution: Add a shared cache

Shared Cache Pitfalls

The more servers we have, the lower the chance that a request goes to a server that already has the data in local cache, so the memcache traffic goes up linearly (or worse). If memcache goes down, then the situation on the previous slide dominates.
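
A toy model of that scaling (assumed uniform balancing and made-up numbers, not our measurements): if each item lives in only one server's local cache and requests are spread evenly across N servers, roughly (N-1)/N of requests land on a server that doesn't have the item, so the share of traffic hitting the shared cache climbs quickly as the fleet grows.

#include <stdio.h>

int main(void)
{
    /* Fraction of requests expected to miss the local cache when load
     * is spread uniformly and only one server holds each item locally. */
    for (int nservers = 1; nservers <= 64; nservers *= 2)
        printf("%2d servers: ~%5.1f%% of requests go to the shared cache\n",
               nservers, 100.0 * (nservers - 1) / nservers);
    return 0;
}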

Wouldn’t it be nice to have both?

Brief summary of the Power of Two Choices: when you need to make a choice for load balancing, LRU eviction, and the like, picking two entries at random and selecting the better of the two by whatever criterion you care about is almost as good as picking the globally optimal one, and sometimes even better.
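
A minimal sketch of the idea (illustrative only, not HAProxy code): sample two servers uniformly at random and keep the one that currently looks better.

#include <stdlib.h>

struct lb_server {
    int active_requests; /* current load, by whatever metric you care about */
};

/* Power of two choices: pick two servers at random and take the less
 * loaded one. Assumes nservers >= 1 and that rand() was seeded elsewhere. */
static struct lb_server *pick_two_choices(struct lb_server *servers, int nservers)
{
    struct lb_server *a = &servers[rand() % nservers];
    struct lb_server *b = &servers[rand() % nservers];
    return (a->active_requests <= b->active_requests) ? a : b;
}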

Someone else had more follow-through

The paper considers hashing for storage, where it’s important to move items to maintain balance after deletions. I can’t move HTTP connections, only assign them when they come in, so I can ignore these concerns.

The Algorithm in Short

The paper says “While the idea seems pretty obvious, it appears to not have been considered before.”

Bringing it to Life

HAProxy Implementation

Challenges

Hash Table Lookup

/* find the node after and the node before */
next = eb32_lookup_ge(root, hash);
if (!next)
  next = eb32_first(root);
if (!next)
  return NULL; /* tree is empty */

prev = eb32_prev(next);
if (!prev)
  prev = eb32_last(root);

nsrv = eb32_entry(next, struct tree_occ, node)->server;
psrv = eb32_entry(prev, struct tree_occ, node)->server;

/* OK we're located between two distinct servers, let's
 * compare distances between hash and the two servers
 * and select the closest server.
 */
dp = hash - prev->key;
dn = next->key - hash;

if (dp <= dn) {
  next = prev;
  nsrv = psrv;
}

// vvvvvv HERE vvvvvv
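/* Bounded-load extension: if a balance factor is configured and the
 * chosen server is already at its slot limit, keep walking forward
 * around the ring (wrapping past the end) until we reach an eligible
 * server.
 */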
while (p->lbprm.chash.balance_factor && !chash_server_is_eligible(nsrv)) {
    next = eb32_next(next);
    if (!next)
        next = eb32_first(root);
    nsrv = eb32_entry(next, struct tree_occ, node)->server;
}

return nsrv;

Everything above “HERE” is pre-existing code.

Eligibility test

int chash_server_is_eligible(struct server *s) {
    /* The total number of slots to allocate is the total number of outstanding requests 
     * (including the one we're about to make) times the load-balance-factor, rounded up.
     */
    unsigned tot_slots = ((s->proxy->served + 1) * s->proxy->lbprm.chash.balance_factor + 99) / 100;
    unsigned slots_per_weight = tot_slots / s->proxy->lbprm.tot_weight;
    unsigned remainder = tot_slots % s->proxy->lbprm.tot_weight;
    
    /* Allocate a whole number of slots per weight unit... */
    unsigned slots = s->eweight * slots_per_weight;
    /* And then distribute the rest among servers proportionally to their weight. */
    slots += ((s->cumulative_weight + s->eweight) * remainder) / s->proxy->lbprm.tot_weight
            - (s->cumulative_weight * remainder) / s->proxy->lbprm.tot_weight;
    /* But never leave a server with 0. */ 
    if (slots == 0)
        slots = 1;

    return s->served < slots;
}
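
To make the arithmetic concrete, here is a small standalone sketch (hypothetical numbers, not taken from any real deployment) that reruns the slot calculation above for four equal-weight servers with seven requests in flight and a balance factor of 125.

#include <stdio.h>

int main(void)
{
    unsigned balance_factor = 125; /* no server may exceed ~125% of average load */
    unsigned served = 7;           /* outstanding requests on the whole backend */
    unsigned tot_weight = 4;       /* sum of the server weights */
    unsigned eweight = 1;          /* weight of each individual server */

    /* ceil((served + 1) * balance_factor / 100), as in the code above. */
    unsigned tot_slots = ((served + 1) * balance_factor + 99) / 100;
    unsigned slots_per_weight = tot_slots / tot_weight;
    unsigned remainder = tot_slots % tot_weight;

    for (unsigned i = 0; i < tot_weight / eweight; i++) {
        unsigned cumulative_weight = i * eweight;
        unsigned slots = eweight * slots_per_weight
            + ((cumulative_weight + eweight) * remainder) / tot_weight
            - (cumulative_weight * remainder) / tot_weight;
        if (slots == 0)
            slots = 1;
        printf("server %u: %u of %u slots\n", i, slots, tot_slots);
    }
    return 0;
}

With these inputs, tot_slots is ceil(8 * 1.25) = 10, and the servers get 2, 3, 2, and 3 slots respectively; a server already serving as many requests as it has slots is ineligible and gets skipped on the ring.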

Timeline

Willy was very enthusiastic about the feature and very helpful in getting my work up to snuff. The only delay was one of those long European vacations in the middle :)
The researchers who wrote the paper were very happy to see their work used in real life – and outside of Google!

Cache Hit %

The ripples are diurnal variation caused by scaling the number of servers with load. Here you can see the effect: more servers means fewer local hits and more traffic to the shared cache.

Shared Cache Bandwidth

Response Time

i.e. no difference. Switching to consistent hashing made things no worse at our current traffic, which is all I could ask for.

Nice Properties

Closing Notes

I’m not a keynote speaker here to make inspirational pronouncements about the future, so I’ll keep it simple.
Usually when I encounter research algorithms, either they've already been implemented, or else they're so abstruse that, while I can maybe appreciate them, I have no chance of doing anything with them. I think I was very lucky to find a paper that was not only applicable, but also well within my reach. But somebody has to do it, so keep your eyes open.
Data organization for cacheline locality, CPU pinning for L2 cache locality… smart load balancing for memcache locality! Sometimes the cloud really deserves to be thought of as a distributed system, and optimized as one.

Thank You!

http://cleverdomain.org/haproxy-velocity/

Thanks to the authors for making this possible. Thanks to O’Reilly for inviting me. Thanks to the audience for being here.