This study explores the application and fine-tuning of You Only Look Once version 8 (YOLOv8) models for real-time tomato recognition in drone imagery of greenhouse environments, with a focus on practical optimization strategies. In our evaluation of YOLOv8’s speed, robustness, and adaptability, varying the batch size and the number of training epochs had minimal impact on performance. Notably, the lightweight YOLOv8n model matched the performance of the much larger YOLOv8x model while reducing training time by up to a factor of 60. Further fine-tuning identified the final learning rate (lrf) and dataset annotation quality as the critical factors for model performance: optimizing the lrf and improving the dataset annotations significantly increased accuracy, underscoring their importance for effective YOLO model deployment. Our results demonstrate YOLOv8’s superiority over YOLOv5, and the optimized YOLOv8n model is ready for deployment in future tomato recognition tasks, paving the way for more efficient agricultural monitoring. This work offers practical guidance for researchers addressing similar object detection challenges.
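To make the role of lrf concrete, the following is a minimal sketch of a linear learning-rate decay in which lrf sets the final learning rate as a fraction of the initial one, which is how the Ultralytics trainer interprets this argument by default. The function names and the values of `lr0`, `lrf`, and `epochs` are illustrative assumptions and are not the settings used in this study.

```python
def lr_factor(epoch: int, epochs: int, lrf: float) -> float:
    """Multiplier applied to the initial learning rate at a given epoch.

    Decays linearly from 1.0 at epoch 0 down to lrf at the last epoch.
    """
    return max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf


def lr_schedule(lr0: float, lrf: float, epochs: int) -> list[float]:
    """Learning rate at every epoch boundary, from 0 to epochs inclusive."""
    return [lr0 * lr_factor(e, epochs, lrf) for e in range(epochs + 1)]


# Illustrative values only (assumed, not taken from the paper).
lr0, lrf, epochs = 0.01, 0.01, 100
schedule = lr_schedule(lr0, lrf, epochs)

print(f"initial lr: {schedule[0]:.6f}")
print(f"final lr:   {schedule[-1]:.6f}")  # equals lr0 * lrf
```

Under this schedule a smaller lrf means a lower learning rate in the last epochs, so the weights settle more finely late in training, while a larger lrf keeps the updates coarse until the end. This is one plausible reason the parameter can affect final accuracy.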