martes, 7 de marzo de 2017

High availabilty


  • Faul tolerance - A system to keep working when something unforeseen happens. Unexpected errors will happen, bugs will creep in, components might occassionally fail, network connections might drop, even the entire machine where the system is running might crash. Whatever happens, we want to localize the impact of an error as much as possible, recover from the error and keep the system running and providing service.
  • Scalability - A system should be able to handle any possible load. Of course that we will not busy tons of hardware just in case the entire planet population starts using our system some day. But we must be able to respond to a load increase by adding more hardware resources, without any software intervention. Ideally, this should be possible without a system restart.
  • Distribution - To make a system that never stops, we have to run it on multiple machines. This promotes the overall stability of the system - if some machines is taken down, another one may take over. Furthermore, this gives us means to scale horizontally - we can address load increase by adding more machines to the system, thus adding more work units to support the higher demand.
  • Responsiveness - It goes without saying that a system should always be reasonably fast and responsive. Drastic prolongations of request handling shouldn't happen, even if the load increase or unexpected errors happen. In particular, ocassional long tasks shouldn't block the rest of the system, or have a significant effect on its performance.
  • Live update - In some cases we want to push a new version of our software without restarting any server. For example, in a telephony system, we don't want to disconnect the established calls while upgrading the software.
source: Elixir in Action v8 MEAP-1

No hay comentarios: