
Re-assessing accuracy degradation : a framework for understanding DNN behavior on similar-but-non-identical test datasets
- Author
- Esla Timothy Anzaku (UGent), Haohan Wang, Ajiboye Babalola, Arnout Van Messem (UGent), and Wesley De Neve (UGent)
- Abstract
- Deep Neural Networks (DNNs) often demonstrate remarkable performance when evaluated on the test dataset used during model creation. However, their ability to generalize effectively when deployed is crucial, especially in critical applications. One approach to assess the generalization capability of a DNN model is to evaluate its performance on replicated test datasets, which are created by closely following the same methodology and procedures used to generate the original test dataset. Our investigation focuses on the performance degradation of pre-trained DNN models in multi-class classification tasks when evaluated on these replicated datasets; this performance degradation has not been entirely explained by generalization shortcomings or dataset disparities. To address this, we introduce a new evaluation framework that leverages uncertainty estimates generated by the models studied. This framework is designed to isolate the impact of variations in the evaluated test datasets and assess DNNs based on the consistency of their confidence in their predictions. By employing this framework, we can determine whether an observed performance drop is primarily caused by model inadequacy or other factors. We applied our framework to analyze 564 pre-trained DNN models across the CIFAR-10 and ImageNet benchmarks, along with their replicated versions. Contrary to common assumptions about model inadequacy, our results indicate a substantial reduction in the performance gap between the original and replicated datasets when accounting for model uncertainty. This suggests a previously unrecognized adaptability of models to minor dataset variations. Our findings emphasize the importance of understanding dataset intricacies and adopting more nuanced evaluation methods when assessing DNN model performance. This research contributes to the development of more robust and reliable DNN models, especially in critical applications where generalization performance is of utmost importance. The code to reproduce our experiments will be available at https://github.com/esla/Reassessing_DNN_Accuracy.
- Keywords
- DNN model performance degradation, ImageNet benchmarking, ML datasets and benchmarks, Model evaluation, Multi-class classification, Pattern recognition
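The abstract describes an evaluation framework that re-weighs the original-versus-replicated accuracy gap using the models' own uncertainty estimates. The paper's exact procedure is not given on this page, so the snippet below is only a minimal sketch of the general idea, assuming softmax confidence as the uncertainty estimate and a simple confidence threshold; the helper names (`top1_accuracy`, `confident_accuracy`, `accuracy_gap`) and the 0.9 threshold are illustrative assumptions, not taken from the paper or its repository.

```python
# Minimal sketch (not the paper's actual framework): compare the accuracy gap
# between an original and a replicated test set, both with and without taking
# the model's own confidence (max softmax probability) into account.
# Array names and the 0.9 threshold are illustrative assumptions.
import numpy as np


def top1_accuracy(probs: np.ndarray, labels: np.ndarray) -> float:
    """Conventional top-1 accuracy over all samples."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))


def confident_accuracy(probs: np.ndarray, labels: np.ndarray,
                       threshold: float = 0.9) -> float:
    """Top-1 accuracy restricted to samples whose maximum softmax
    probability meets the confidence threshold."""
    keep = probs.max(axis=1) >= threshold
    if not keep.any():
        return float("nan")  # no sufficiently confident predictions
    return top1_accuracy(probs[keep], labels[keep])


def accuracy_gap(probs_orig: np.ndarray, labels_orig: np.ndarray,
                 probs_repl: np.ndarray, labels_repl: np.ndarray,
                 threshold: float = 0.9) -> dict:
    """Report the original-vs-replicated accuracy gap under the conventional
    view and under the confidence-filtered (uncertainty-aware) view."""
    return {
        "gap_all_samples": top1_accuracy(probs_orig, labels_orig)
                           - top1_accuracy(probs_repl, labels_repl),
        "gap_confident_only": confident_accuracy(probs_orig, labels_orig, threshold)
                              - confident_accuracy(probs_repl, labels_repl, threshold),
    }
```

Given the softmax outputs of a pre-trained classifier on, for example, CIFAR-10 and a replicated counterpart such as CIFAR-10.1, a `gap_confident_only` that is markedly smaller than `gap_all_samples` would be in the spirit of the reported finding that the apparent degradation shrinks once model uncertainty is taken into account.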
Downloads
- DS876.pdf
- full text (Published version) | open access | 2.72 MB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-01JMVTJPHS3HXEV7WB402R053Z
- MLA
- Anzaku, Esla Timothy, et al. “Re-Assessing Accuracy Degradation : A Framework for Understanding DNN Behavior on Similar-but-Non-Identical Test Datasets.” MACHINE LEARNING, vol. 114, no. 3, 2025, doi:10.1007/s10994-024-06693-x.
- APA
- Anzaku, E. T., Wang, H., Babalola, A., Van Messem, A., & De Neve, W. (2025). Re-assessing accuracy degradation : a framework for understanding DNN behavior on similar-but-non-identical test datasets. MACHINE LEARNING, 114(3). https://doi.org/10.1007/s10994-024-06693-x
- Chicago author-date
- Anzaku, Esla Timothy, Haohan Wang, Ajiboye Babalola, Arnout Van Messem, and Wesley De Neve. 2025. “Re-Assessing Accuracy Degradation : A Framework for Understanding DNN Behavior on Similar-but-Non-Identical Test Datasets.” MACHINE LEARNING 114 (3). https://doi.org/10.1007/s10994-024-06693-x.
- Chicago author-date (all authors)
- Anzaku, Esla Timothy, Haohan Wang, Ajiboye Babalola, Arnout Van Messem, and Wesley De Neve. 2025. “Re-Assessing Accuracy Degradation : A Framework for Understanding DNN Behavior on Similar-but-Non-Identical Test Datasets.” MACHINE LEARNING 114 (3). doi:10.1007/s10994-024-06693-x.
- Vancouver
- 1. Anzaku ET, Wang H, Babalola A, Van Messem A, De Neve W. Re-assessing accuracy degradation : a framework for understanding DNN behavior on similar-but-non-identical test datasets. MACHINE LEARNING. 2025;114(3).
- IEEE
- [1] E. T. Anzaku, H. Wang, A. Babalola, A. Van Messem, and W. De Neve, “Re-assessing accuracy degradation : a framework for understanding DNN behavior on similar-but-non-identical test datasets,” MACHINE LEARNING, vol. 114, no. 3, 2025.
@article{01JMVTJPHS3HXEV7WB402R053Z,
  abstract  = {{Deep Neural Networks (DNNs) often demonstrate remarkable performance when evaluated on the test dataset used during model creation. However, their ability to generalize effectively when deployed is crucial, especially in critical applications. One approach to assess the generalization capability of a DNN model is to evaluate its performance on replicated test datasets, which are created by closely following the same methodology and procedures used to generate the original test dataset. Our investigation focuses on the performance degradation of pre-trained DNN models in multi-class classification tasks when evaluated on these replicated datasets; this performance degradation has not been entirely explained by generalization shortcomings or dataset disparities. To address this, we introduce a new evaluation framework that leverages uncertainty estimates generated by the models studied. This framework is designed to isolate the impact of variations in the evaluated test datasets and assess DNNs based on the consistency of their confidence in their predictions. By employing this framework, we can determine whether an observed performance drop is primarily caused by model inadequacy or other factors. We applied our framework to analyze 564 pre-trained DNN models across the CIFAR-10 and ImageNet benchmarks, along with their replicated versions. Contrary to common assumptions about model inadequacy, our results indicate a substantial reduction in the performance gap between the original and replicated datasets when accounting for model uncertainty. This suggests a previously unrecognized adaptability of models to minor dataset variations. Our findings emphasize the importance of understanding dataset intricacies and adopting more nuanced evaluation methods when assessing DNN model performance. This research contributes to the development of more robust and reliable DNN models, especially in critical applications where generalization performance is of utmost importance. The code to reproduce our experiments will be available at https://github.com/esla/Reassessing_DNN_Accuracy.}},
  articleno = {{84}},
  author    = {{Anzaku, Esla Timothy and Wang, Haohan and Babalola, Ajiboye and Van Messem, Arnout and De Neve, Wesley}},
  issn      = {{0885-6125}},
  journal   = {{MACHINE LEARNING}},
  keywords  = {{DNN model performance degradation, ImageNet benchmarking, ML datasets and benchmarks, Model evaluation, Multi-class classification, Pattern recognition}},
  language  = {{eng}},
  number    = {{3}},
  pages     = {{22}},
  title     = {{Re-assessing accuracy degradation : a framework for understanding DNN behavior on similar-but-non-identical test datasets}},
  url       = {{http://doi.org/10.1007/s10994-024-06693-x}},
  volume    = {{114}},
  year      = {{2025}},
}