Recently, while running performance tests, we spotted a worrying phenomenon. The response times were quite satisfying at first glance: on average we got < 20 ms, which met the contract requirements. Unfortunately the average is not always a good measure, and in our case it perfectly masked the flaw.
| Percentile [%] | Value [ms] |
|---|---|
| 50 | 15 |
| 90 | 28 |
| 95 | 30 |
| 99 | 118 |
| 99.9 | 350 |
The percentile values we observed had quite an unusual distribution. A small percentage of requests took unacceptably long, causing contract violations and timeouts. With the help of NewRelic we were able to quickly pinpoint the culprit: it was Redis. Surprisingly, the long response times occurred in isolation (responses following the long ones were normal) and there was no pattern to them.
By the way, our infrastructure sits on AWS. After spending some time tweaking the Redis configuration we realised this was not going to be sufficient. We also got feedback from other teams saying that they had run into similar problems and had failed to tame that database.
I can imagine that in some cases one wouldn't care about the last percentiles. Sometimes, however, you are constrained by SLAs on the error rate, and overly long response times are counted as errors.
The first idea we had was to define an aggressive timeout accompanied by retries for Redis operations. This would probably work in most situations. It has some limitations though - you need to squeeze the operation time together with the timeout into the overall time limit, which in our case was 50 ms. This means the timeout would need to be set at half of that time: 25 ms. Consequently, given the percentile distribution, we knew that a significant percentage of perfectly correct responses would fall into the aggressive timeout trap.
Let's give an example. Suppose the first call hits a slow response and is cut off after 25 ms. The retry then starts with only 25 ms of the 50 ms budget left. The second call has a chance to succeed, but the overall aftermath is not promising: any healthy response slower than 25 ms (and the 90th percentile alone was already 28 ms) gets killed by the timeout just the same. A minimal sketch of this rejected approach follows below.
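To make this concrete, here is a minimal sketch of the aggressive-timeout-plus-retry idea, written with Monix (which we introduce properly below); the 25 ms cut-off and the `withAggressiveTimeout` name are purely illustrative, not code we shipped.

```scala
import monix.eval.Task
import monix.execution.Scheduler
import scala.concurrent.Future
import scala.concurrent.duration._

object AggressiveTimeout {
  // Illustrative only: cut each attempt off after 25 ms and retry once,
  // so that the whole operation still fits into a 50 ms budget.
  def withAggressiveTimeout[R](action: () => Future[R])
                              (implicit scheduler: Scheduler): Future[R] =
    Task.deferFuture(action())
      .timeout(25.millis)  // any attempt slower than 25 ms is reported as a failure
      .onErrorRestart(1)   // a single retry consumes the remaining 25 ms of the budget
      .runAsync
}
```

The weakness is visible right away: every perfectly healthy response above 25 ms is turned into a failure, and after one retry there is no budget left.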
The inspiration came from the Big Data world. In Spark, when a task runs suspiciously long, the application master speculatively starts the same task on a different executor but lets the long-running one continue; whichever copy finishes first wins. Translated to our problem the solution looked elegant: start the call, and if it has not answered within a certain patience interval, fire another attempt while the original keeps running; the first result to arrive is handed back to the caller and the remaining attempts are cancelled. We named the pattern speculative task.
We selected Monix as a good candidate to help with the implementation, and it turned out to be the right choice. Monix offers not only execution scheduling but also a time injection mechanism for its scheduler, which is an invaluable tool for testing.
```scala
import java.util.concurrent.TimeoutException

import monix.eval.Task
import monix.execution.atomic.{Atomic, AtomicAny}
import monix.execution.{Cancelable, Scheduler}

import scala.concurrent.duration.DurationLong
import scala.concurrent.{Future, Promise}
import scala.language.postfixOps
import scala.util.{Failure, Success, Try}

object SpeculativeCall {

  /**
   * Runs `action` speculatively: the first attempt starts immediately and
   * every following attempt is scheduled `patience` milliseconds after the
   * previous one. The first successful result completes the returned future
   * and the remaining attempts are cancelled. If all attempts fail, the last
   * failure is reported; if nothing completes within `timeout` milliseconds
   * the future fails with a TimeoutException.
   */
  def speculatively[R](attempts: Int, patience: Long, timeout: Long)
                      (action: () => Future[R])
                      (implicit scheduler: Scheduler): Future[R] = {
    val promise     = Promise[R]()
    val lastFailure = AtomicAny[Option[Throwable]](None)
    val failedCount = Atomic(0)

    // Completes the promise at most once; results arriving later are ignored.
    def complete(result: Try[R]): Unit = {
      promise.tryComplete(result)
      ()
    }

    // Attempts 1 .. (attempts - 1), each delayed by a multiple of `patience`.
    val scheduledTasks = (for {
      attempt <- 1 until attempts
    } yield {
      val delay = attempt * patience
      attempt -> Task.deferFuture(action()).delayExecution(delay.millis)
    }).toList

    // Attempt 0 starts immediately.
    val firstTask = Task.deferFuture(action())
    val tasks = (0, firstTask) :: scheduledTasks

    val cancelables = tasks.map { case (_, task) =>
      task.runOnComplete {
        case s: Success[R] =>
          complete(s)
        case Failure(ex: TimeoutException) =>
          if (failedCount.incrementAndGet(1) == attempts) {
            complete(Failure(lastFailure.get.getOrElse(ex)))
          }
        case f @ Failure(ex) =>
          lastFailure.set(Some(ex))
          if (failedCount.incrementAndGet(1) == attempts) {
            complete(f)
          }
      }
    }

    // Global guard: give up if nothing has completed within `timeout` ms.
    Task {
      if (!promise.isCompleted) {
        complete(Failure(lastFailure.get.getOrElse(new TimeoutException("All attempts timed-out"))))
      }
    }.delayExecution(timeout millis).runAsync

    // Once any result is in, cancel the attempts that have not started yet.
    val result = promise.future
    result.onComplete { _ => Cancelable.cancelAll(cancelables) }
    result
  }
}
```
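To show how it is meant to be used, here is a hypothetical call site; `fetchFromRedis`, the scheduler name and the attempts/patience/timeout values are illustrative choices for a 50 ms budget, not necessarily the ones we used.

```scala
import monix.execution.Scheduler
import scala.concurrent.Future

object SpeculativeCallExample {
  // Hypothetical Redis read returning a Future.
  def fetchFromRedis(key: String): Future[Option[String]] = ???

  implicit val scheduler: Scheduler = Scheduler.io("redis-speculative")

  // Start the first attempt immediately; if it has not answered after 20 ms,
  // fire a second one; give up entirely after 50 ms.
  val value: Future[Option[String]] =
    SpeculativeCall.speculatively(attempts = 2, patience = 20L, timeout = 50L) { () =>
      fetchFromRedis("user:42")
    }
}
```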
Monix has a quite clever solution for injecting time into the scheduler, which allowed us to write reliable and deterministic tests. Below is just one example of a test - I leave it to you to add some more.
```scala
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.Future
import scala.concurrent.duration.DurationLong
import scala.language.postfixOps
import scala.util.{Failure, Success, Try}

import monix.eval.Task
import monix.execution.atomic.Atomic
import monix.execution.schedulers.TestScheduler
import org.scalatest.{AsyncFlatSpec, Matchers}

class SpeculativeCallTest extends AsyncFlatSpec with Matchers {

  implicit val scheduler = TestScheduler()

  // Simulates an action whose n-th invocation completes after the given delay
  // with the given result, and records which attempts were actually started.
  def tasksSimulation[R](results: (Long, Try[R])*): (ArrayBuffer[Int], () => Future[R]) = {
    val runs    = ArrayBuffer.empty[Int]
    val attempt = Atomic(0)
    (runs, () => {
      val attemptNo = attempt.getAndIncrement()
      val (delay, result) = results(attemptNo)
      runs.append(attemptNo)
      Task {
        result match {
          case Success(s)  => s
          case Failure(ex) => throw ex
        }
      }.delayExecution(delay millis).runAsync
    })
  }

  it should "succeed even if the first attempt fails" in {
    val (runs, simulation) = tasksSimulation(
      1L   -> Failure(new RuntimeException),
      1L   -> Success("result2"),
      100L -> Success("result3")
    )
    val resultF = SpeculativeCall.speculatively(3, 30L, 300L)(simulation)
    val testResult = resultF map { result =>
      result shouldBe "result2"
    }
    // Advance the virtual clock; no real time passes during the test.
    scheduler.tick(250 millis)
    // The third attempt was cancelled before it ever started.
    runs should contain only (0, 1)
    testResult
  }

  // ...
}
```
The speculative task saved the day in our case: it reduced the last percentiles. At the same time we have not observed any increased load on Redis - in reality a speculative call takes place only once per a few hundred calls.
| Percentile [%] | Value [ms] |
|---|---|
| 50 | 15 |
| 90 | 28 |
| 95 | 30 |
| 99 | 35 |
| 99.9 | 65 |
I'll leave it to you to adjust the configuration of `monix.execution.Scheduler`. In our case the calls were blocking, so we used a fixed number of threads, sized according to the average duration of a single call.
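As a starting point, such a scheduler might be configured roughly like this; the pool name and size are illustrative, not the values we used.

```scala
import monix.execution.Scheduler

// Illustrative only: a dedicated fixed pool for blocking Redis calls.
// Size it from the expected request rate and the average call duration
// (roughly: threads ≈ requests per second × average call time in seconds).
implicit val redisScheduler: Scheduler =
  Scheduler.fixedPool("redis-speculative", 16)
```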