Dynamic Job Replication for Balancing Fault Tolerance, Latency, and Economic Efficiency: Work in Progress
Vladimir V. Marbukh
Recent research has demonstrated the benefits of request replication with canceling, which initiates multiple concurrent replicas of a request, uses the first successful result, and immediately removes the remaining replicas of the completed request from the system. This paper suggests that the benefits of replication may come at the risk of an abrupt system transition to an undesirable, highly congested equilibrium. To expose, evaluate, and ultimately manage these risk/benefit trade-offs, we generalize the replication strategy by: (a) accounting for possible inefficiency of remote service, (b) allowing replication only when static routing fails to identify an idle local server, and (c) requiring one or more replicas of the same request to complete, so that a majority-rule decision can improve fault tolerance. Because the Markov performance model is intractable, our analysis is based on mean-field and fluid approximations. Future research should evaluate the accuracy of assertions based on these approximations, and ultimately develop practical solutions for optimizing the various performance trade-offs in distributed systems with replication.
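The latency/work trade-off behind replication with canceling can be illustrated with a minimal sketch. It is not the paper's model: it assumes every replica starts at time zero on an idle server (no queueing), that canceling is instantaneous, and that the request finishes once a required number of replicas have completed. The function name `replicate_with_cancel` and its parameters are hypothetical.

```python
from typing import Sequence, Tuple


def replicate_with_cancel(
    service_times: Sequence[float], required: int = 1
) -> Tuple[float, float]:
    """Return (latency, total_work) for one replicated request.

    Assumptions (illustrative, not taken from the paper):
    - each entry of service_times is the time one replica would need
      on its own server, and all replicas start at time zero;
    - the request completes when `required` replicas have finished;
    - all still-running replicas are canceled at that instant.
    """
    if not 1 <= required <= len(service_times):
        raise ValueError("required must be between 1 and the number of replicas")

    ordered = sorted(service_times)
    # The request finishes when the `required`-th fastest replica completes.
    latency = ordered[required - 1]
    # Each server works until its replica finishes or is canceled.
    total_work = sum(min(t, latency) for t in ordered)
    return latency, total_work


# Three replicas; the first completion is enough.
first = replicate_with_cancel([3.0, 1.0, 2.0], required=1)
# Three replicas; two completions are needed, e.g. for a majority vote.
majority = replicate_with_cancel([3.0, 1.0, 2.0], required=2)
```

In this toy setting, `first` is `(1.0, 3.0)` and `majority` is `(2.0, 5.0)`: requiring more completions raises both the latency and the server time consumed, while a single unreplicated request on the slowest server would take 3.0 time units with 3.0 units of work. The sketch ignores the congestion effects that the abstract identifies as the main risk.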
Dynamic Job Replication for Balancing Fault Tolerance, Latency, and Economic Efficiency: Work in Progress, IEEE SERVICES 2018, San Francisco, CA, [online], https://doi.org/10.1109/SCC.2018.00043
(Accessed February 28, 2024)