Advanced search
1 file | 2.00 MB Add to list

A performance analysis of fault recovery in stream processing frameworks

(2021) IEEE ACCESS. 9. p.93745-93763
Author
Organization
Abstract
Distributed stream processing frameworks have gained widespread adoption in the last decade because they abstract away the complexity of parallel processing. One of their key features is built-in fault tolerance. In this work, we dive deeper into the implementation, performance, and efficiency of this critical feature for four state-of-the-art frameworks. We include the established Spark Streaming and Flink frameworks and the more novel Spark Structured Streaming and Kafka Streams frameworks. We test the behavior under different types of faults and settings: master failure with and without high-availability setups, driver failures for Spark frameworks, worker failure with or without exactly-once semantics, application and task failures. We highlight differences in behavior during these failures on several aspects, e.g., whether there is an outage, downtime, recovery time, data loss, duplicate processing, accuracy, and the cost and behavior of different message delivery guarantees. Our results highlight the impact of framework design on the speed of fault recovery and explain how different use cases may benefit from different approaches. Due to their task-based scheduling approach, the Spark frameworks can recover within 30 seconds and in most cases without necessitating an application restart. Kafka Streams has only a few seconds of downtime, but is slower at catching up on delays. Finally, Flink can offer end-to-end exactly-once semantics at a low cost but requires job restarts for most failures leading to high recovery times of around 50 seconds.
Keywords
STATE MANAGEMENT, MODEL, Sparks, Task analysis, Semantics, Fault tolerant systems, Fault, tolerance, Storms, Benchmark testing, Apache spark, structured, streaming, apache flink, apache kafka, kafka streams, distributed, computing, stream processing frameworks, fault tolerance, benchmarking, big data

Downloads

  • published article.pdf
    • full text (Published version)
    • |
    • open access
    • |
    • PDF
    • |
    • 2.00 MB

Citation

Please use this url to cite or link to this publication:

MLA
van Dongen, Giselle, and Dirk Van den Poel. “A Performance Analysis of Fault Recovery in Stream Processing Frameworks.” IEEE ACCESS, vol. 9, 2021, pp. 93745–63, doi:10.1109/ACCESS.2021.3093208.
APA
van Dongen, G., & Van den Poel, D. (2021). A performance analysis of fault recovery in stream processing frameworks. IEEE ACCESS, 9, 93745–93763. https://doi.org/10.1109/ACCESS.2021.3093208
Chicago author-date
Dongen, Giselle van, and Dirk Van den Poel. 2021. “A Performance Analysis of Fault Recovery in Stream Processing Frameworks.” IEEE ACCESS 9: 93745–63. https://doi.org/10.1109/ACCESS.2021.3093208.
Chicago author-date (all authors)
van Dongen, Giselle, and Dirk Van den Poel. 2021. “A Performance Analysis of Fault Recovery in Stream Processing Frameworks.” IEEE ACCESS 9: 93745–93763. doi:10.1109/ACCESS.2021.3093208.
Vancouver
1.
van Dongen G, Van den Poel D. A performance analysis of fault recovery in stream processing frameworks. IEEE ACCESS. 2021;9:93745–63.
IEEE
[1]
G. van Dongen and D. Van den Poel, “A performance analysis of fault recovery in stream processing frameworks,” IEEE ACCESS, vol. 9, pp. 93745–93763, 2021.
@article{8724505,
  abstract     = {{Distributed stream processing frameworks have gained widespread adoption in the last decade because they abstract away the complexity of parallel processing. One of their key features is built-in fault tolerance. In this work, we dive deeper into the implementation, performance, and efficiency of this critical feature for four state-of-the-art frameworks. We include the established Spark Streaming and Flink frameworks and the more novel Spark Structured Streaming and Kafka Streams frameworks. We test the behavior under different types of faults and settings: master failure with and without high-availability setups, driver failures for Spark frameworks, worker failure with or without exactly-once semantics, application and task failures. We highlight differences in behavior during these failures on several aspects, e.g., whether there is an outage, downtime, recovery time, data loss, duplicate processing, accuracy, and the cost and behavior of different message delivery guarantees. Our results highlight the impact of framework design on the speed of fault recovery and explain how different use cases may benefit from different approaches. Due to their task-based scheduling approach, the Spark frameworks can recover within 30 seconds and in most cases without necessitating an application restart. Kafka Streams has only a few seconds of downtime, but is slower at catching up on delays. Finally, Flink can offer end-to-end exactly-once semantics at a low cost but requires job restarts for most failures leading to high recovery times of around 50 seconds.}},
  author       = {{van Dongen, Giselle and Van den Poel, Dirk}},
  issn         = {{2169-3536}},
  journal      = {{IEEE ACCESS}},
  keywords     = {{STATE MANAGEMENT,MODEL,Sparks,Task analysis,Semantics,Fault tolerant systems,Fault,tolerance,Storms,Benchmark testing,Apache spark,structured,streaming,apache flink,apache kafka,kafka streams,distributed,computing,stream processing frameworks,fault tolerance,benchmarking,big data}},
  language     = {{eng}},
  pages        = {{93745--93763}},
  title        = {{A performance analysis of fault recovery in stream processing frameworks}},
  url          = {{http://dx.doi.org/10.1109/ACCESS.2021.3093208}},
  volume       = {{9}},
  year         = {{2021}},
}

Altmetric
View in Altmetric
Web of Science
Times cited: