
RFHOC: a random-forest approach to auto-tuning Hadoop's configuration

Abstract
Hadoop is a widely used implementation of the MapReduce programming model for large-scale data processing. Hadoop performance, however, is significantly affected by the settings of its configuration parameters. Unfortunately, manually tuning these parameters is very time-consuming, if at all practical. This paper proposes an approach, called RFHOC, to automatically tune the Hadoop configuration parameters for optimized performance of a given application running on a given cluster. RFHOC constructs two ensembles of performance models, one for the map stage and one for the reduce stage, using a random-forest approach. Leveraging these models, RFHOC employs a genetic algorithm to automatically search the Hadoop configuration space. The evaluation of RFHOC using five typical Hadoop programs, each with five different input data sets, shows that it achieves a speedup of 2.11× on average and up to 7.4× over the recently proposed cost-based optimization (CBO) approach. In addition, RFHOC’s performance benefit increases with input data set size.
Keywords
random forest, genetic algorithm, system configuration, performance tuning, MapReduce/Hadoop
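
The search step described in the abstract (a genetic algorithm exploring the configuration space, guided by learned performance models) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predicted_runtime` is a hypothetical stand-in for the trained map-stage and reduce-stage random-forest models, the three parameters are Hadoop 1.x settings chosen only for illustration, and their ranges, the population size, and the mutation rate are made up.

```python
import random

# Illustrative search space: inclusive integer ranges per parameter.
# The names are Hadoop 1.x settings; the ranges are assumptions for this sketch.
SPACE = {
    "io.sort.mb": (50, 400),
    "io.sort.factor": (10, 100),
    "mapred.reduce.tasks": (1, 64),
}


def predicted_runtime(cfg):
    """Hypothetical stand-in for the learned map- and reduce-stage models."""
    map_cost = abs(cfg["io.sort.mb"] - 200) / 10 + abs(cfg["io.sort.factor"] - 50) / 5
    reduce_cost = abs(cfg["mapred.reduce.tasks"] - 32) * 2
    return 100.0 + map_cost + reduce_cost


def random_config(rng):
    """Draw one configuration uniformly from the search space."""
    return {k: rng.randint(lo, hi) for k, (lo, hi) in SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter is inherited from one parent."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}


def mutate(cfg, rng, rate=0.2):
    """Resample each parameter with probability `rate`."""
    out = dict(cfg)
    for k, (lo, hi) in SPACE.items():
        if rng.random() < rate:
            out[k] = rng.randint(lo, hi)
    return out


def genetic_search(model, generations=40, pop_size=30, elite=5, seed=0):
    """Minimize model(cfg) with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=model)      # lowest predicted runtime first
        parents = population[:elite]    # elites survive unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return min(population, key=model)


best = genetic_search(predicted_runtime)
print(best, predicted_runtime(best))
```

Because the elites are carried over unchanged, the best predicted runtime never gets worse from one generation to the next. In a real setting the model would be fitted to measured job runs rather than written by hand.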


Citation


MLA
Bei, Zhendong et al. “RFHOC: a Random-forest Approach to Auto-tuning Hadoop’s Configuration.” IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 27.5 (2016): 1470–1483. Print.
APA
Bei, Z., Yu, Z., Zhang, H., Xiong, W., Xu, C., Eeckhout, L., & Feng, S. (2016). RFHOC: a random-forest approach to auto-tuning Hadoop’s configuration. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 27(5), 1470–1483.
Chicago author-date
Bei, Zhendong, Zhibin Yu, Huiling Zhang, Wen Xiong, Chengzhong Xu, Lieven Eeckhout, and Shengzhong Feng. 2016. “RFHOC: a Random-forest Approach to Auto-tuning Hadoop’s Configuration.” IEEE Transactions on Parallel and Distributed Systems 27 (5): 1470–1483.
Chicago author-date (all authors)
Bei, Zhendong, Zhibin Yu, Huiling Zhang, Wen Xiong, Chengzhong Xu, Lieven Eeckhout, and Shengzhong Feng. 2016. “RFHOC: a Random-forest Approach to Auto-tuning Hadoop’s Configuration.” IEEE Transactions on Parallel and Distributed Systems 27 (5): 1470–1483.
Vancouver
1. Bei Z, Yu Z, Zhang H, Xiong W, Xu C, Eeckhout L, et al. RFHOC: a random-forest approach to auto-tuning Hadoop’s configuration. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS. 2016;27(5):1470–83.
IEEE
[1] Z. Bei et al., “RFHOC: a random-forest approach to auto-tuning Hadoop’s configuration,” IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, vol. 27, no. 5, pp. 1470–1483, 2016.
@article{7245576,
  abstract     = {{Hadoop is a widely-used implementation framework of the MapReduce programming model for large-scale data processing. Hadoop performance however is significantly affected by the settings of the Hadoop configuration parameters. Unfortunately, manually tuning these parameters is very time-consuming, if at all practical. This paper proposes an approach, called RFHOC, to automatically tune the Hadoop configuration parameters for optimized performance for a given application running on a given cluster. RFHOC constructs two ensembles of performance models using a random-forest approach for the map and reduce stage respectively. Leveraging these models, RFHOC employs a genetic algorithm to automatically search the Hadoop configuration space. The evaluation of RFHOC using five typical Hadoop programs, each with five different input data sets, shows that it achieves a performance speedup by a factor of 2.11 on average and up to 7.4 over the recently proposed cost-based optimization (CBO) approach. In addition, RFHOC’s performance benefit increases with input data set size.}},
  author       = {{Bei, Zhendong and Yu, Zhibin and Zhang, Huiling and Xiong, Wen and Xu, Chengzhong and Eeckhout, Lieven and Feng, Shengzhong}},
  issn         = {{1045-9219}},
  journal      = {{IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS}},
  keywords     = {{random forest,genetic algorithm,system configuration,Performance tuning,MapReduce/Hadoop}},
  language     = {{eng}},
  number       = {{5}},
  pages        = {{1470--1483}},
  title        = {{RFHOC: a random-forest approach to auto-tuning Hadoop's configuration}},
  url          = {{http://dx.doi.org/10.1109/TPDS.2015.2449299}},
  volume       = {{27}},
  year         = {{2016}},
}
