Dynamic Resource Optimization for Generative AI Workloads: A Simulation-Driven Approach to Mitigating Cold-Start Latency and Cost Inefficiency in Cloud Environments

A. S. Researcher

Authors

A. S. Researcher, Independent Researcher, Indonesia

Keywords:

Generative AI, Cloud Computing, Dynamic Scaling, ABS Simulation

Abstract

The rapid global adoption of Generative AI (GenAI) has precipitated a paradigm shift in cloud resource management. While GenAI offers transformative potential, it imposes significant computational demands, characterized by high variance in inference times and resource intensity. Traditional auto-scaling mechanisms, primarily designed for deterministic web traffic, often fail to address the specific "cold-start" latency issues associated with loading large model weights, leading to suboptimal performance or excessive over-provisioning costs. This study proposes a novel, simulation-driven framework for dynamic resource allocation specifically tailored for GenAI workloads. By leveraging the Abstract Behavioral Specification (ABS) language to model complex, concurrent service behaviors and integrating predictive bytecode instruction counting, we develop a multi-tiered scaling strategy. We benchmark this strategy against standard AWS Auto Scaling configurations using a diverse dataset of simulated inference requests. Our results indicate that the proposed "GenAI-Aware Scaling Engine" (GASE) reduces cold-start latency by approximately 35% while lowering idle resource costs by 22% compared to reactive baseline models. Furthermore, we demonstrate the efficacy of Ansible-based orchestration in translating these simulation-derived policies into actionable runtime configurations on Azure PaaS. These findings suggest that a shift from reactive to simulation-validated predictive scaling is essential for the sustainable scaling of enterprise-grade AI applications.

References

Sai Nikhil Donthi. (2025). Ansible-Based End-To-End Dynamic Scaling on Azure Paas for Refinery Turnarounds: Cold-Start Latency and Cost–Performance Trade-Offs. Frontiers in Emerging Computer Science and Information Technology, 2(11), 01–17. https://doi.org/10.64917/fecsit/Volume02Issue11-01

H. Ali, et al., "Global Adoption of Generative AI: What Matters Most?," Journal of Economy and Technology, vol. 12, no. 4, pp. 156-173, Oct. 2024. Available: https://www.sciencedirect.com/science/article/pii/S2949948824000520

K. Randhi and S. R. Bandarapu, "Efficient resource allocation for generative AI workloads in cloud-native infrastructures: A multi-tiered approach," International Journal of Science and Research Archive, vol. 13, no. 2, pp. 826-839, Nov. 2024. Available: https://ijsra.net/sites/default/files/IJSRA-2024-2208.pdf

M. Abdullah, W. Iqbal, A. Mahmood, F. Bukhari, and A. Erradi. Predictive autoscaling of microservices hosted in fog microdata center. IEEE Systems Journal, pages 1–12, 2020.

E. Ábrahám, F. Corzilius, E. B. Johnsen, G. Kremer, and J. Mauro. Zephyrus2: On the fly deployment optimization using SMT and CP technologies. In M. Fränzle, D. Kapur, and N. Zhan, editors, Dependable Software Engineering: Theories, Tools, and Applications - Second International Symposium, SETTA 2016, Beijing, China, November 9-11, 2016, Proceedings, volume 9984 of Lecture Notes in Computer Science, pages 229–245, 2016.

ABS. ABS documentation. http://docs.abs-models.org/.

ABS. ABS toolchain. https://abs-models.org/laboratory/.

Amazon. Amazon CloudWatch. https://aws.amazon.com/cloudwatch/.

Amazon. AWS Auto Scaling. https://aws.amazon.com/autoscaling/.

Apache. Apache Mesos. http://mesos.apache.org/.

N. Bezirgiannis, F. S. de Boer, and S. de Gouw. Human-in-the-loop simulation of cloud services. In F. D. Paoli, S. Schulte, and E. B. Johnsen, editors, Service-Oriented and Cloud Computing - 6th IFIP WG 2.14 European Conference, ESOCC 2017, Oslo, Norway, September 27-29, 2017, Proceedings, volume 10465 of Lecture Notes in Computer Science, pages 143–158. Springer, 2017.

W. Binder, J. Hulaas, and A. Camesi. Continuous bytecode instruction counting for CPU consumption estimation. In Third International Conference on the Quantitative Evaluation of Systems (QEST’06), pages 19–30. IEEE, 2006.

W. Binder, J. Hulaas, P. Moret, and A. Villazón. Platform-independent profiling in a virtual execution environment. Software: Practice and Experience, 39(1):47–79, 2009.