We got it all wrong

The very basic highly available infra now looks like this:

    dns:
      example.com A 1
      example.com A 2

    [ load balancer 1 ]      [ load balancer 2 ]
        /    |    \              /    |    \
          ... same for DC 2 ...,
          or in some cases cross-dc things

    [ web server DC 1 ]   [           ]   [           ]
           |                    |               |
     [ service A ]        [ service B ]   [ service C ]

So the request is *pushed* down from the load balancer, which usually uses some sophisticated strategy with backoff and error prediction (varying from UCB to logistic regression and neural nets). At each hop someone (either the mesh layer or the box itself) decides who to push the traffic to (again using fancy strategies), and this feels very natural, because that is how HTTP works.

I *believe* this is wrong, and that if we simply change the model from push to pull, many nice properties will emerge.

Let me elaborate on the actual problem with pushing work. Let's say we have the following request distribution:

    90% of the requests take 10ms
    5% take 200ms
    5% take 2000ms

and each service has a queue with 10 slots and only 1 thread [oversimplification]. Let's imagine those times are taken from the top load balancer's point of view. Also, for simplicity, let's pick a request path:

    load balancer -> webserver -> shoppingCartService -> discountService -> campaignManager

The top load balancer sees something like "GET /cart" with some cookie, and for simplicity let's imagine the cookie just contains an unencrypted session id. What features can it use to guess if the request will take 2 seconds or 10ms?

If it sends 10 requests taking 10ms each to a webserver that is completely occupied handling a 2 second request, the webserver will of course enqueue them, and then those requests will have to wait 2 seconds to be executed.

You can say: "wait wait wait, why is it queueing work which it can not execute? why not return 503 when busy?", but remember that in our case most of the requests are handled within 10ms, and some of them are async, so of course we queue them so we can process another request while waiting.

This pattern *always* (at least in the real world) creates pileups, because you always end up balancing fast requests on top of a service that is already handling a slow request.

So now the whole industry is stuck trying to *predict* errors and timeouts, but as you can see the features you can use are very limited. The top load balancer can use the user ip, request path, session id, time of day, and if it is stateful it could use some aggregations (like the number of slow requests for this session id); the mesh layer can also use all kinds of infra features like cpu_utilization_last_5_seconds_per_instance. But as you can guess, all those features are just effects, or symptoms, of the slowness; rarely does any of them *cause* the slow request (well.. if you have 1 user with 1 million items in their shopping cart, the session id is a very good predictor for the slowness of the request).

And we keep piling up algorithms and features and systems and platforms... to improve the balancing and reduce the pileups. If you think about it, it is an extremely difficult problem, in the modern free-for-all-stop-the-world-garbage-collection-cross-dc-sdn crazy world, to predict the time of a request with any practical certainty.

Some of the solutions we end up with: we create "batch" clusters for "slower" requests, we limit the queues to some super conservative numbers, retry strategies use an upper confidence bound when picking the next host, etc.
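To see the pileup concretely, here is a toy simulation (a sketch, not production code) of one box from the oversimplified model above: 1 worker thread, a queue of 10 slots. One 2000ms request gets in just before ten 10ms requests:

    import java.util.concurrent.*;

    // Toy model of one box from the example above: 1 thread, 10 queue slots.
    public class Pileup {
        public static void main(String[] args) throws Exception {
            ExecutorService box = new ThreadPoolExecutor(
                    1, 1, 0, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(10));

            submit(box, "slow", 2000); // one 2000ms request arrives first
            for (int i = 0; i < 10; i++)
                submit(box, "fast-" + i, 10); // ten 10ms requests pile up behind it

            box.shutdown();
            box.awaitTermination(1, TimeUnit.MINUTES);
        }

        static void submit(ExecutorService box, String name, long workMs) {
            final long enqueued = System.nanoTime();
            box.execute(() -> {
                try {
                    Thread.sleep(workMs); // the actual "work"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                long totalMs = (System.nanoTime() - enqueued) / 1_000_000;
                // every fast request reports work=10ms but total=~2000ms+
                System.out.println(name + " work=" + workMs + "ms total=" + totalMs + "ms");
            });
        }
    }

Every fast request ends up with ~2 seconds of total latency even though its own work is 10ms, and no prediction at the load balancer could have helped: the only information that mattered (this box is stuck on a slow request) lives inside the box itself.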
So, again, I believe we got excited and stepped into a new world while using only old ideas. Imagine if the request looked like this:

    load balancer -> [ queue ] { request for /cart }
                         |
                         |
                         |
                  [ shoppingCart ]
                         |
                         |
                         |
                     [ queue ] { getDiscounts }
                         |
                         |
                         |
                 [ discountService ]
                         |
                         |
                         |
                     [ queue ] { getRunningCampaignRules }
                         |
                         |
                         |
                 [ campaignManager ]

Each instance of the services takes work *only* when it actually *can* process it. Apache Camel does that, on top of ActiveMQ with IO queues, and there is some support for Spring+Kafka to do the same (the ActiveMQ one does sub-millisecond request roundtrips); a minimal worker sketch is at the bottom of this post.

A colleague of mine proposed this architecture a few weeks ago, and the more I think about it the happier I get. I have not used it in production yet, but I hope I can try it out at larger scale.

This year I will try to unlearn everything, and think about those problems from a different angle.

EDIT: I started investigating the concept more and made https://github.com/jackdoe/back-to-back to play with it, so far it looks really nice

---
github.com/jackdoe

Tue Jan 8 13:40:27 UTC 2019
edit: Sun Feb 17 15:18:58 UTC 2019
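The promised sketch of one pull-based worker, assuming a local ActiveMQ broker; the queue name "shoppingCart", the broker url and handle() are made up for illustration. The key point is the blocking receive(): the instance asks for work only when its thread is actually free, so a slow request occupies nobody's queue but the broker's, where any other idle instance can pull the next message:

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    // One single-threaded worker instance for the shoppingCart hop.
    public class CartWorker {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer requests =
                    session.createConsumer(session.createQueue("shoppingCart"));
            MessageProducer replies =
                    session.createProducer(null); // reply destination comes per-message

            while (true) {
                // pull: block until this instance can *actually* process work;
                // nothing gets enqueued on a busy box
                TextMessage request = (TextMessage) requests.receive();

                TextMessage reply = session.createTextMessage(handle(request.getText()));
                reply.setJMSCorrelationID(request.getJMSCorrelationID());
                replies.send(request.getJMSReplyTo(), reply); // request/reply via JMSReplyTo
            }
        }

        // stand-in for the real work (which would call getDiscounts the same way)
        static String handle(String sessionId) {
            return "cart for session " + sessionId;
        }
    }

The caller does the mirror image: send to the shoppingCart queue with a temporary JMSReplyTo queue and block on the reply. If no instance is free, the message simply waits in the broker, visible and measurable, instead of being pushed into a box that cannot serve it.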