Automating Unit Test Generation with Large Language Models: An Integrated Empirical and Theoretical Investigation

Authors

  • John K. Morales, Global Institute of Computing, University of Lisbon

Keywords

large language models, unit test generation, automated testing, search-based testing

Abstract

Background: The emergence of large language models (LLMs) has introduced novel capabilities for program understanding and automated artifact generation, including unit tests. Recent empirical work suggests both promise and limitations of LLMs when applied to unit test generation tasks (Yang et al., 2024; Siddiq et al., 2024; Tang et al., 2024). However, existing studies vary in scope, metrics, and experimental controls, and there remains a need for an integrative study that synthesizes prior methodologies, aligns evaluation criteria with software engineering standards, and explores complementary approaches such as search-based software testing (SBST) and feedback-directed random testing (Pacheco et al., 2007; Harman & McMinn, 2010).

Objective: This study systematically investigates the effectiveness, reliability, and practical utility of LLM-based unit test generation compared with established automated testing techniques. We aim to provide a rigorous methodology, a reproducible evaluation framework, and a nuanced theoretical interpretation that inform practitioners and researchers about when and how LLMs can augment or replace traditional test-generation approaches.

Methods: Drawing on methods and evaluation practices from prior literature on LLMs for software engineering (Fan et al., 2023; Hou et al., 2024; Rao et al., 2023), symbolic execution and loop characteristic analyses (Xiao et al., 2013), SBST (Harman & McMinn, 2010), and feedback-directed random test generation (Pacheco et al., 2007), we designed a controlled empirical study. The study uses a curated corpus of Java classes with documented behavior and existing high-quality JUnit suites, a set of LLM prompting strategies and model variants, and baseline automated tools. Evaluation metrics include functional correctness of generated tests, mutation score, coverage (statement/branch), fault-revealing power, human-readability, and maintenance cost proxies (Dustin et al., 2009; ISO/IEC/IEEE 24765:2017). We also present an extended analytical framework for interpreting results in light of testing taxonomies and automation frameworks (Mayeda & Andrews, 2021; Lonetti & Marchetti, 2018; Chandra et al., 2025).
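To make the metric definitions concrete, the sketch below (illustrative only; the class name and counts are hypothetical and not part of the study's tooling) shows how mutation score and branch coverage are conventionally computed from raw tool counts:

    public final class MetricSketch {

        // Mutation score: fraction of non-equivalent mutants killed by the test suite.
        static double mutationScore(int killedMutants, int totalMutants, int equivalentMutants) {
            int scorable = totalMutants - equivalentMutants;
            return scorable == 0 ? 0.0 : (double) killedMutants / scorable;
        }

        // Branch coverage: fraction of conditional branches exercised at least once.
        static double branchCoverage(int coveredBranches, int totalBranches) {
            return totalBranches == 0 ? 1.0 : (double) coveredBranches / totalBranches;
        }

        public static void main(String[] args) {
            // Hypothetical counts for one class under test.
            System.out.printf("mutation score = %.2f%n", mutationScore(42, 60, 5));  // 42 / 55 ≈ 0.76
            System.out.printf("branch coverage = %.2f%n", branchCoverage(18, 24));   // 18 / 24 = 0.75
        }
    }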

Results: LLM-generated test suites exhibit notable strengths in producing human-readable, behaviorally oriented unit tests that capture common usage patterns and edge-case assertions, often matching or exceeding the code coverage of baseline tools on idiomatic code fragments (Yang et al., 2024; Siddiq et al., 2024). However, LLMs struggle on code segments dominated by complex loop constructs, intricate path conditions, and subtle numerical invariants, areas where symbolic execution and SBST demonstrate comparative advantages (Xiao et al., 2013; Harman & McMinn, 2010). Hybrid strategies that combine LLM-generated scaffolding with automated search or symbolic refinement substantially improve fault-revealing power and help test suites escape coverage plateaus (Lemieux et al., 2023; Tang et al., 2024).
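As a much-simplified, hypothetical illustration of the kind of loop-and-invariant code described above (not drawn from the study's corpus), consider:

    public final class PrefixSumExample {

        // Returns true if some non-empty prefix of values sums exactly to target.
        static boolean prefixHitsTarget(int[] values, int target) {
            int sum = 0;
            for (int v : values) {
                sum += v;
                if (sum == target) {   // subtle numerical invariant inside a loop:
                    return true;       // arbitrary "idiomatic" inputs rarely satisfy it
                }
            }
            return false;
        }
    }

    // An LLM-generated test typically exercises an obvious case:
    //     assertTrue(PrefixSumExample.prefixHitsTarget(new int[] {2, 3}, 5));
    // whereas search-based or symbolic tools are better at synthesising inputs that
    // satisfy the exact equality for longer prefixes, as well as the boundary case:
    //     assertFalse(PrefixSumExample.prefixHitsTarget(new int[] {}, 0));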

Conclusions: LLMs are a valuable addition to the automated testing toolkit but are not a universal replacement for established techniques. Best practices include using LLMs for rapid generation of semantically rich, human-readable tests and integrating them with SBST, symbolic execution, or feedback-directed random testing to improve coverage and fault detection. We conclude with a prescriptive automation framework, theoretical implications for the future of test automation, limitations of our study, and concrete directions for follow-up empirical and tooling work.
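A minimal sketch of the recommended hybrid workflow is given below; all interfaces and method names are hypothetical placeholders rather than the paper's actual tooling, and the sketch conveys only the ordering of steps (LLM scaffolding first, search-based refinement for residual uncovered branches):

    import java.util.ArrayList;
    import java.util.List;

    public final class HybridPipelineSketch {

        interface TestGenerator {
            List<String> generateTests(String classUnderTest);   // returns JUnit source snippets
        }

        interface CoverageOracle {
            List<String> uncoveredBranches(String classUnderTest, List<String> tests);
        }

        static List<String> hybridSuite(String classUnderTest,
                                        TestGenerator llm,
                                        TestGenerator sbst,
                                        CoverageOracle coverage) {
            // Step 1: readable, behaviorally oriented tests from the LLM.
            List<String> suite = new ArrayList<>(llm.generateTests(classUnderTest));

            // Step 2: hand any residual, hard-to-reach branches to the search-based tool.
            if (!coverage.uncoveredBranches(classUnderTest, suite).isEmpty()) {
                suite.addAll(sbst.generateTests(classUnderTest));
            }
            return suite;
        }
    }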

References

Yang, L.; Yang, C.; Gao, S.; Wang, W.; Wang, B.; Zhu, Q.; Chu, X.; Zhou, J.; Liang, G.; Wang, Q.; et al. On the Evaluation of Large Language Models in Unit Test Generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE’24, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1607–1619.

Siddiq, M.L.; Da Silva Santos, J.C.; Tanvir, R.H.; Ulfat, N.; Al Rifat, F.; Carvalho Lopes, V. Using Large Language Models to Generate JUnit Tests: An Empirical Study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, EASE’24, Salerno, Italy, 18–21 June 2024; pp. 313–322.

Dustin, E.; Garrett, T.; Gauf, B. Implementing Automated Software Testing: How to Save Time and Lower Costs While Raising Quality; Pearson Education: Upper Saddle River, NJ, USA, 2009.

Pacheco, C.; Lahiri, S.K.; Ernst, M.D.; Ball, T. Feedback-Directed Random Test Generation. In Proceedings of the 29th International Conference on Software Engineering (ICSE’07), Minneapolis, MN, USA, 20–26 May 2007; pp. 75–84.

Xiao, X.; Li, S.; Xie, T.; Tillmann, N. Characteristic studies of loop problems for structural test generation via symbolic execution. In Proceedings of the 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), Silicon Valley, CA, USA, 11–15 November 2013; pp. 246–256.

Harman, M.; McMinn, P. A Theoretical and Empirical Study of Search-Based Testing: Local, Global, and Hybrid Search. IEEE Trans. Softw. Eng. 2010, 36, 226–247.

Yuan, Z.; Lou, Y.; Liu, M.; Ding, S.; Wang, K.; Chen, Y.; Peng, X. No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation. arXiv 2024, arXiv:2305.04207.

Fan, A.; Gokkaya, B.; Harman, M.; Lyubarskiy, M.; Sengupta, S.; Yoo, S.; Zhang, J.M. Large Language Models for Software Engineering: Survey and Open Problems. In Proceedings of the 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), Melbourne, Australia, 14–20 May 2023; pp. 31–53.

Chandra, R.; Lulla, K.; Sirigiri, K. Automation Frameworks for End-to-End Testing of Large Language Models (LLMs). Journal of Information Systems Engineering and Management 2025, 10, e464–e472.

ISO/IEC/IEEE 24765:2017(E); ISO/IEC/IEEE International Standard—Systems and Software Engineering—Vocabulary. IEEE: New York, NY, USA, 2017; pp. 1–541.

Mayeda, M.; Andrews, A. Evaluating Software Testing Techniques: A Systematic Mapping Study. In Advances in Computers; Elsevier: Amsterdam, The Netherlands, 2021; ISBN 978-0-12-824121-9.

Lonetti, F.; Marchetti, E. Emerging Software Testing Technologies. In Advances in Computers; Elsevier: Amsterdam, The Netherlands, 2018; Volume 108, pp. 91–143. ISBN 978-0-12-815119-8.

Clark, A.G.; Walkinshaw, N.; Hierons, R.M. Test Case Generation for Agent-Based Models: A Systematic Literature Review. Inf. Softw. Technol. 2021, 135, 106567.

Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv 2024, arXiv:2308.10620.

Tang, Y.; Liu, Z.; Zhou, Z.; Luo, X. ChatGPT vs. SBST: A Comparative Assessment of Unit Test Suite Generation. IEEE Trans. Softw. Eng. 2024, 50, 1340–1359.

Chen, Y.; Hu, Z.; Zhi, C.; Han, J.; Deng, S.; Yin, J. ChatUniTest: A Framework for LLM-Based Test Generation. arXiv 2024, arXiv:2305.04764.

Rao, N.; Jain, K.; Alon, U.; Goues, C.L.; Hellendoorn, V.J. CAT-LM: Training Language Models on Aligned Code And Tests. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; pp. 409–420.

Lemieux, C.; Inala, J.P.; Lahiri, S.K.; Sen, S. CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; pp. 919–931.

Published

2025-11-30

How to Cite

John K. Morales. (2025). Automating Unit Test Generation with Large Language Models: An Integrated Empirical and Theoretical Investigation. Research Index Library of Eijmr, 12(11), 633–643. Retrieved from https://eijmr.net/index.php/rileijmr/article/view/52

Section

Articles