Accurate crowd density estimation has become critical in applications ranging from intelligent urban planning and public safety monitoring to marketing analytics and emergency response, and a variety of methods have recently been proposed to improve the precision of crowd analysis systems. This study presents a Convolutional Neural Network (CNN)-based approach to crowd density estimation that employs the Congested Scene Recognition Network (CSRNet) architecture with a Visual Geometry Group (VGG)-16 backbone. The method was applied to two benchmark datasets, Mall and Crowd-UIT, to assess its effectiveness in real-world crowd scenarios. Density maps were generated to visualize the spatial distribution of people, and performance was evaluated quantitatively using the Mean Squared Error (MSE) and Mean Absolute Error (MAE). On the Mall dataset, the model achieved an MSE of 0.08 and an MAE of 0.10; on the Crowd-UIT dataset, it achieved an MSE of 0.05 and an MAE of 0.15. These results suggest that the proposed VGG-16-based CSRNet model estimates crowd density with low error across the environments and crowd densities represented in the two datasets. The similar error levels obtained on datasets with different characteristics further suggest that the model generalizes well, indicating its potential applicability in both surveillance systems and public space management. The outcomes of this investigation offer a promising direction for future research in data-driven crowd analysis, particularly in improving the predictive reliability and real-time deployment of deep learning models for population monitoring.
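The evaluation described above can be illustrated with a minimal sketch. It assumes that the predicted count for an image is obtained by summing its density map and that MSE and MAE are computed over per-image counts, which is a common convention in crowd counting; the abstract does not state whether the reported values were computed on counts, on normalized counts, or per pixel. The function names and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence

# A density map is represented here as a 2-D grid of non-negative floats.
DensityMap = Sequence[Sequence[float]]


def count_from_density(density_map: DensityMap) -> float:
    """Estimate the number of people as the sum of all density-map cells."""
    return sum(sum(row) for row in density_map)


def _check_lengths(predicted: Sequence[float], actual: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before computing a metric."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean Absolute Error over per-image counts."""
    _check_lengths(predicted, actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def mse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean Squared Error over per-image counts."""
    _check_lengths(predicted, actual)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


# Toy example (hypothetical data): two 2x2 predicted density maps
# and the corresponding ground-truth head counts.
predicted_maps = [
    [[0.5, 0.5], [1.0, 0.0]],      # sums to 2.0
    [[0.25, 0.25], [0.25, 0.25]],  # sums to 1.0
]
true_counts = [4.0, 1.0]

predicted_counts = [count_from_density(m) for m in predicted_maps]
print("MAE:", mae(predicted_counts, true_counts))  # 1.0
print("MSE:", mse(predicted_counts, true_counts))  # 2.0
```

Because MSE squares each error, a single badly miscounted image raises it far more than it raises MAE, which is why the two metrics are usually reported together.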