Failures inevitably happen when they are least expected. To withstand them, distributed systems need to be resilient and fault tolerant. However, assessing whether a system can actually handle failures is non-trivial. Chaos Engineering is a practice for assessing a system’s resilience and has been successfully applied by large companies such as Netflix.
Inspired by this practice, we propose an automated technique for testing the resilience of actor programs written in Akka. In contrast to that practice, our technique is intended for use during development rather than in production, and it operates at the granularity of actors and messages.
In particular, we leverage existing test suites to obtain a program execution trace of each test run and analyze these traces to detect potential resilience problems. The results of this analysis are perturbation plans that consist of (i) perturbation targets (e.g., persistent actors or messages with a particular payload) and (ii) perturbations (e.g., restarting actors or dropping messages). Subsequent test case runs are then automatically perturbed at run time with the generated plans to assess whether the system is resilient against these failures.
A test case that fails under perturbation may indicate a resilience problem and requires further inspection. We provide an overview of the perturbations applied in each test run so that developers can diagnose the problem. While our technique already works on synthetic programs, we are looking for more complex programs and test suites to further improve and evaluate it.
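To make the notions of trace, perturbation plan, and perturbed run more concrete, the following is a minimal, language-neutral sketch in Python. It is not the implementation of our technique and does not use the Akka API. All names (`TraceEvent`, `Plan`, `generate_plans`, `perturb_run`) are hypothetical, and the plan-generation heuristic (one plan per persistent actor and one per distinct payload) is an assumption made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Perturbation(Enum):
    RESTART_ACTOR = "restart actor"
    DROP_MESSAGE = "drop message"


@dataclass(frozen=True)
class TraceEvent:
    """One message delivery observed during an unperturbed test run."""
    sender: str
    receiver: str
    payload: str
    receiver_is_persistent: bool = False


@dataclass
class Plan:
    """A perturbation target (a predicate over events) plus a perturbation."""
    description: str
    matches: Callable[[TraceEvent], bool]
    perturbation: Perturbation


def generate_plans(trace: List[TraceEvent]) -> List[Plan]:
    """Assumed heuristic: restart each persistent actor, drop each payload kind."""
    plans: List[Plan] = []
    seen_actors, seen_payloads = set(), set()
    for event in trace:
        if event.receiver_is_persistent and event.receiver not in seen_actors:
            seen_actors.add(event.receiver)
            plans.append(Plan(
                f"restart persistent actor '{event.receiver}'",
                lambda e, actor=event.receiver: e.receiver == actor,
                Perturbation.RESTART_ACTOR,
            ))
        if event.payload not in seen_payloads:
            seen_payloads.add(event.payload)
            plans.append(Plan(
                f"drop messages with payload '{event.payload}'",
                lambda e, payload=event.payload: e.payload == payload,
                Perturbation.DROP_MESSAGE,
            ))
    return plans


def perturb_run(plan: Plan, events: List[TraceEvent]) -> Tuple[List[TraceEvent], List[str]]:
    """Simulate a perturbed run: return delivered events and a log of applied perturbations."""
    delivered: List[TraceEvent] = []
    applied: List[str] = []
    for event in events:
        if plan.matches(event):
            applied.append(
                f"{plan.perturbation.value}: "
                f"{event.sender} -> {event.receiver} ({event.payload})"
            )
            if plan.perturbation is Perturbation.DROP_MESSAGE:
                continue  # a dropped message never reaches its receiver
        # For RESTART_ACTOR the message is still delivered after the logged restart.
        delivered.append(event)
    return delivered, applied


# A hypothetical two-message trace recorded from one test run.
trace = [
    TraceEvent("client", "cart", "AddItem", receiver_is_persistent=True),
    TraceEvent("cart", "billing", "Charge"),
]
plans = generate_plans(trace)
```

Calling `perturb_run(plan, trace)` for each entry of `plans` yields, per run, the messages that were still delivered and a log of the applied perturbations. Such a log plays the role of the per-run overview mentioned above. In a real setting, the perturbation would be injected into the running actor system rather than into a recorded list of events.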