The Cloud
---

The cloud (aws, gcloud, azure.. whatever) is a piece of shit. 10x the price for 1/10th the performance (so 100x, haha). Not only is the hardware pathetic and all IPC horrible, but all the managed services you use from them perform horribly as well (from es to rabbitmq to even attached block storage, which is probably why they perform this way).

What I do for https://baxx.dev/stat (and for other projects):

* buy 2-3 machines from hetzner [or somewhere else]; for 200E per month you get a 24 core (48 ht), 128g ram box with 2tb of ssd (usually in mirror, so 1tb) that can easily do ~100k randread and ~50k randwrite iops, with 1gbps unlimited network. just for reference, this will cost ~5k on the cloud and (even with similar specs) will perform at 1/10th of the bare metal box

* learn some basic sysadmin skills; it is easier now than ever

* systemd + docker can go a long way (see the sketch after this section)

* try not to use many dependencies, don't decouple without good reason

* avoid queues if you can. this seems counterintuitive; by queues I don't mean just kafka, I mean all kinds of receive (usually unbounded) queues. for example nginx's listen(2) backlog queue has limit N (unlimited in some cases), then you have the accept(2) queue on whatever nginx is proxying to, then from that thing to your database, and the database's queue depth, and so on (the backlog sketch below shows how to actually look at them). interacting queues have extremely annoying emergent chaotic properties, so every time you can avoid one, do it. (I did some investigation in the we-got-it-all-wrong posts, https://punkjazz.org/~jack/we-got-it-all-wrong.txt and https://punkjazz.org/~jack/we-got-it-all-wrong-2.txt, where I changed from push to pull to understand the dynamics better)

you will probably need:

* postgres/mysql set up master->slave so you have a 'hot' standby (sketch below). on these machines 1 postgres master can handle your traffic (unless you just do bad design) until you reach mid size [100-200 employees]

* zookeeper; pretty much you start it and let it run, unless you abuse it

* es, kafka, nginx, redis, some backend (node, go, whatever) etc. use cgroups or docker to make sure one dependency won't bring the whole box into thrashing (the systemd sketch below sets such limits); keep in mind modern thrashing is pretty much unstoppable

* some external dns; set up your zone records with a 5 min ttl so when one of the machines dies you just manually switch until you have a new one set up (which could take a day). the machines don't die every day, so dns round robin is enough and should bring you to .99+ availability (zone sketch below)

* keep in mind you have 1 machine worth of capacity; the other one is pretty much a live/live backup, which means at all times you must be able to handle all the traffic with 1 machine

* make the machines ping each other (sketch below): https://github.com/jackdoe/baxx/blob/master/README.txt#L76 (example of how I do it for baxx, so I get notified when any process or cronjob on any box is not running as expected)

* secure your boxes; following How-To-Secure-A-Linux-Server will give you a *very* good head start: https://github.com/imthenachoman/How-To-Secure-A-Linux-Server

Once you are on your own:

* keep running a live/live setup. backups do not work very well in chaotic systems; there are a gazillion reasons why a backup will fail, and the only way you can be sure you can recover when a machine dies is to know for a fact that the other machine is serving traffic.

Here I want to distinguish between backups of data (saving an old copy of a database in case someone truncates the wrong table by accident, which sadly happens way more often than we want to admit) and having a way to recover from a situation where a machine is dead. As stated, the only way to ensure quick recovery is to actually know that the fallback machine was working with the same live traffic as the dead machine (the dump sketch below is for the data-backup side).
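A minimal sketch of the systemd + docker point, which also covers the cgroup limits mentioned above. The unit name, image, port and limits are all made up for illustration; adjust to your own service:

    cat > /etc/systemd/system/myapp.service <<'EOF'
    [Unit]
    Description=myapp under docker
    After=docker.service
    Requires=docker.service

    [Service]
    # remove a stale container if one is left over ("-" means ignore failure)
    ExecStartPre=-/usr/bin/docker rm -f myapp
    # --memory/--cpus are cgroup limits: myapp can die alone,
    # it cannot drag the whole box into thrashing
    ExecStart=/usr/bin/docker run --name myapp \
        --memory=8g --cpus=4 \
        -p 127.0.0.1:8080:8080 myapp:latest
    ExecStop=/usr/bin/docker stop myapp
    Restart=always

    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl daemon-reload && systemctl enable --now myapp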
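On the queues point: a quick way to actually see the accept(2) queues instead of guessing. The nginx line is just an example value:

    # on LISTEN sockets Recv-Q is connections waiting to be accept(2)ed,
    # Send-Q is the configured backlog limit
    ss -lnt

    # the kernel caps every listen(2) backlog at this value
    sysctl net.core.somaxconn

    # nginx sets its backlog per listen directive in nginx.conf:
    #     listen 80 backlog=511;
    # bounded queues you can see beat unbounded queues you forgot about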
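A sketch of the postgres hot standby, assuming postgres 12 or later and Debian-style paths; the host, user and data directory are placeholders, and the replication user plus its pg_hba.conf entry on the master are assumed to already exist:

    # on the standby box:
    systemctl stop postgresql
    rm -rf /var/lib/postgresql/12/main
    # copy the master's data directory; -R writes standby.signal and
    # primary_conninfo so the box comes up as a streaming hot standby
    pg_basebackup -h master.example.com -U replicator \
        -D /var/lib/postgresql/12/main -P -R
    systemctl start postgresql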
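The dns round robin bit is just two A records with a 300 second ttl (the IPs below are placeholders from the documentation range); when a box dies you delete its record and wait out the ttl:

    app  300  IN  A  203.0.113.10
    app  300  IN  A  203.0.113.11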
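The shape of the cross-box ping (this is not baxx's actual api, just the idea; endpoint and mail address are made up): each box checks the other from cron and alerts on silence. baxx works the other way around, boxes push pings and you get alerted when one is missing, so a dead cron is also caught:

    # in each box's crontab, pointed at the *other* box:
    * * * * * curl -fsS --max-time 5 https://peer.example.com/health || echo "peer is down" | mail -s "ALERT peer" you@example.com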
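And for backups of data, as opposed to the standby: a nightly logical dump guards against the truncated-table scenario. Database name, path and schedule are placeholders, and the cron user is assumed to be able to connect; note the escaped %, since cron treats a bare % as a newline:

    # crontab: custom-format dump every night at 03:00
    0 3 * * * pg_dump -Fc mydb > /backup/mydb.$(date +\%F).dump
    # restore just the damaged table later with:
    #     pg_restore -d mydb -t the_table /backup/mydb.2019-08-07.dump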
* avoid buying managed services. not being able to strace/gdb/iostat or use jmx to hook into the service that is causing you issues has caused me so much pain. I regret it every time I helplessly look at a slow operation that I intuitively know should be fast and can't explain why it is performing like shit. you can't even log in to see if the disk is faulty. all those graphs and logs that managed services usually give you are useless in a crisis or hardware degradation scenario, as it is often impossible to isolate the symptom from the cause once the thrashing starts.

* don't use CDNs. this is harder than it sounds of course, especially if you managed to get to a 2mb javascript bundle and 50 megapixel images.. CDNs increase your complexity, they creep into your deployments and the way you think.. invalidation of objects, naming conventions, etc etc.. inline as much as you can and be free.

EDIT(08/08/2019): many people commented that they don't agree with this point; the theme of the whole post is reducing complexity and cost *if you can*. I realize sometimes this is not possible, and when you have to use CDNs then you must use them; the reality is that in many cases you don't have to.

* do it once. because you will end up running like 20 things, it is important to not have to worry about them. this whole enterprise boils down to running things that are just good software, e.g. redis: you run it once and that's it. (LTS is way more marketing than it seems, so don't trust it blindly)

* avoid big data while you can. most companies can go very far by appending their analytics events to a log file or a table. 30-40 million events in a text file is on the order of 10-20gb; on a good ssd with a good cpu you can slice and dice it with incredible speed.

    cat | rg | jq | sort | uniq -c | sort -rn > report.$(date +%F).txt

is amazing. just imagine the alternative: oozie, hadoop, spark, job reports, transformers, dependencies.. brrrrr. amazing how we ended up here just so we can count some numbers.

* remove layers. e.g. don't run elasticsearch if you only need lucene; don't run rails if you can do it with sinatra; don't introduce caching layers unless absolutely needed; don't use haproxy if you can get by with dns round robin; don't run cassandra if you just need an LSM tree, you can simply embed rocksdb; don't run kubernetes if you can do it with systemd..

Don't go to the cloud. It will force you to use super crappy, slow or limited things such as s3, and it will over-complicate your infrastructure to an incredible degree. It is truly a piece of shit and will just force you to design systems in a horrible way.

-b 07/08/2019