The Cloud
---

The cloud (aws, gcloud, azure.. whatever) is a piece of shit. 10x the price for 1/10th the performance (so 100x, haha). Not only is the hardware pathetic and all IPC horrible, but all the managed services you use from them perform horribly as well (from es to rabbitmq to even attached block storage, which is probably why they perform this way).

What I do for https://baxx.dev/stat (and for other projects):

* buy 2-3 machines from hetzner [or somewhere else]; for 200E per month you get a 24 core (48 ht), 128g ram box with 2tb of ssd (usually in mirror, so 1tb) that can easily do ~100k randread and ~50k randwrite iops, with 1gbps unlimited network. just for reference, this will cost ~5k on the cloud and (even with similar specs) will perform at 1/10th of the bare metal box

* learn some basic sysadmin skills; it is easier now than ever

* systemd + docker can go a long way (see the sketch after this section)

* try not to use many dependencies, don't decouple without good reason

* avoid queues if you can. this seems counterintuitive; by queues I don't mean just kafka, I mean all kinds of receive (usually unbounded) queues. for example nginx's listen(2) backlog queue has limit N (unlimited in some cases), then you have the accept(2) queue on whatever nginx is proxying to, then from that thing to your database, and the database's queue depth, and so on (the backlog sketch below shows how to actually look at them). interacting queues have extremely annoying emergent chaotic properties, so every time you can avoid one, do it. (I did some investigation in the we-got-it-all-wrong posts, https://punkjazz.org/~jack/we-got-it-all-wrong.txt and https://punkjazz.org/~jack/we-got-it-all-wrong-2.txt, where I changed from push to pull to understand the dynamics better)

you will probably need:

* postgres/mysql set up master->slave so you have a 'hot' standby (sketch below). on these machines 1 postgres master can handle your traffic (unless you just do bad design) until you reach mid size [100-200 employees]

* zookeeper; pretty much you start it and let it run, unless you abuse it

* es, kafka, nginx, redis, some backend (node, go, whatever) etc. use cgroups or docker to make sure one dependency won't bring the whole box into thrashing (the systemd sketch below sets such limits); keep in mind modern thrashing is pretty much unstoppable

* some external dns; set up your zone records with a 5 min ttl so when one of the machines dies you just manually switch until you have a new one set up (which could take a day). the machines don't die every day, so dns round robin is enough and should bring you to .99+ availability (zone sketch below)

* keep in mind you have 1 machine worth of capacity; the other one is pretty much a live/live backup, which means at all times you must be able to handle all the traffic with 1 machine

* make the machines ping each other (sketch below): https://github.com/jackdoe/baxx/blob/master/README.txt#L76 (example of how I do it for baxx, so I get notified when any process or cronjob on any box is not running as expected)

* secure your boxes; following How-To-Secure-A-Linux-Server will give you a *very* good head start: https://github.com/imthenachoman/How-To-Secure-A-Linux-Server

Once you are on your own:

* keep running a live/live setup. backups do not work very well in chaotic systems; there are a gazillion reasons why a backup will fail, and the only way you can be sure you can recover when a machine dies is to know for a fact that the other machine is serving traffic.

Here I want to distinguish between backups of data (saving an old copy of a database in case someone truncates the wrong table by accident, which sadly happens way more often than we want to admit) and having a way to recover from a situation where a machine is dead. As stated, the only way to ensure quick recovery is to actually know that the fallback machine was working with the same live traffic as the dead machine (the dump sketch below is for the data-backup side).
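A minimal sketch of the systemd + docker point, which also covers the cgroup limits mentioned above. The unit name, image, port and limits are all made up for illustration; adjust to your own service:

    cat > /etc/systemd/system/myapp.service <<'EOF'
    [Unit]
    Description=myapp under docker
    After=docker.service
    Requires=docker.service

    [Service]
    # remove a stale container if one is left over ("-" means ignore failure)
    ExecStartPre=-/usr/bin/docker rm -f myapp
    # --memory/--cpus are cgroup limits: myapp can die alone,
    # it cannot drag the whole box into thrashing
    ExecStart=/usr/bin/docker run --name myapp \
        --memory=8g --cpus=4 \
        -p 127.0.0.1:8080:8080 myapp:latest
    ExecStop=/usr/bin/docker stop myapp
    Restart=always

    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl daemon-reload && systemctl enable --now myapp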
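On the queues point: a quick way to actually see the accept(2) queues instead of guessing. The nginx line is just an example value:

    # on LISTEN sockets Recv-Q is connections waiting to be accept(2)ed,
    # Send-Q is the configured backlog limit
    ss -lnt

    # the kernel caps every listen(2) backlog at this value
    sysctl net.core.somaxconn

    # nginx sets its backlog per listen directive in nginx.conf:
    #     listen 80 backlog=511;
    # bounded queues you can see beat unbounded queues you forgot about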
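A sketch of the postgres hot standby, assuming postgres 12 or later and Debian-style paths; the host, user and data directory are placeholders, and the replication user plus its pg_hba.conf entry on the master are assumed to already exist:

    # on the standby box:
    systemctl stop postgresql
    rm -rf /var/lib/postgresql/12/main
    # copy the master's data directory; -R writes standby.signal and
    # primary_conninfo so the box comes up as a streaming hot standby
    pg_basebackup -h master.example.com -U replicator \
        -D /var/lib/postgresql/12/main -P -R
    systemctl start postgresql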
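The dns round robin bit is just two A records with a 300 second ttl (the IPs below are placeholders from the documentation range); when a box dies you delete its record and wait out the ttl:

    app  300  IN  A  203.0.113.10
    app  300  IN  A  203.0.113.11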
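The shape of the cross-box ping (this is not baxx's actual api, just the idea; endpoint and mail address are made up): each box checks the other from cron and alerts on silence. baxx works the other way around, boxes push pings and you get alerted when one is missing, so a dead cron is also caught:

    # in each box's crontab, pointed at the *other* box:
    * * * * * curl -fsS --max-time 5 https://peer.example.com/health || echo "peer is down" | mail -s "ALERT peer" you@example.com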
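And for backups of data, as opposed to the standby: a nightly logical dump guards against the truncated-table scenario. Database name, path and schedule are placeholders, and the cron user is assumed to be able to connect; note the escaped %, since cron treats a bare % as a newline:

    # crontab: custom-format dump every night at 03:00
    0 3 * * * pg_dump -Fc mydb > /backup/mydb.$(date +\%F).dump
    # restore just the damaged table later with:
    #     pg_restore -d mydb -t the_table /backup/mydb.2019-08-07.dump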
* avoid buying managed services. not being able to strace/gdb/iostat or use jmx to hook into the service that is causing you issues has caused me so much pain. I regret it every time I helplessly look at a slow operation that I intuitively know should be fast and can't explain why it is performing like shit. you can't even log in to see if the disk is faulty. all those graphs and logs that managed services usually give you are useless in a crisis or hardware degradation scenario, as it is often impossible to isolate the symptom from the cause once the thrashing starts.

* don't use CDNs. this is harder than it sounds of course, especially if you managed to get to a 2mb javascript bundle and 50 megapixel images.. CDNs increase your complexity, they creep into your deployments and the way you think.. invalidation of objects, naming conventions, etc etc.. inline as much as you can and be free.

EDIT(08/08/2019): many people commented that they don't agree with this point; the theme of the whole post is reducing complexity and cost *if you can*. I realize sometimes this is not possible, and when you have to use CDNs then you must use them; the reality is that in many cases you don't have to.

* do it once. because you will end up running like 20 things, it is important to not have to worry about them. this whole enterprise boils down to running things that are just good software, e.g. redis: you run it once and that's it. (LTS is way more marketing than it seems, so don't trust it blindly)

* avoid big data while you can. most companies can go very far by appending their analytics events to a log file or a table. 30-40 million events in a text file is on the order of 10-20gb; on a good ssd with a good cpu you can slice and dice it with incredible speed.

    cat | rg | jq | sort | uniq -c | sort -rn > report.$(date +%F).txt

is amazing. just imagine the alternative: oozie, hadoop, spark, job reports, transformers, dependencies.. brrrrr. amazing how we ended up here just so we can count some numbers.

* remove layers. e.g. don't run elasticsearch if you only need lucene; don't run rails if you can do it with sinatra; don't introduce caching layers unless absolutely needed; don't use haproxy if you can get by with dns round robin; don't run cassandra if you just need an LSM tree, you can simply embed rocksdb; don't run kubernetes if you can do it with systemd..

Don't go to the cloud. It will force you to use super crappy, slow or limited things such as s3, and it will over-complicate your infrastructure to an incredible degree. It is truly a piece of shit and will just force you to design systems in a horrible way.

-b 07/08/2019