Towards Robust Fire Detection Across Diverse Environments using Mixture of
YOLOv8 Experts
Sheraz Waseem – Umair Amir – Ahmed Rayyan Aamir – Talha Rehan
https://github.com/UmairAmir/moe-fire-detection
1. Abstract
Robust fire detection in real-world settings requires models
that can generalize across vastly different visual environments, including indoor scenes, outdoor street views, far-field surveillance footage, and satellite imagery. We propose
a Mixture of Experts (MoE) architecture that leverages four
specialized YOLOv8 detectors, each trained on a distinct
fire scenario. A lightweight convolutional gating network
dynamically assigns soft weights to expert outputs based on
scene context. To enhance gating accuracy, we incorporate
a self-attention mechanism, boosting classification performance from 90% to 96%. For final prediction refinement,
we employ Weighted Box Fusion (WBF) to aggregate detections. Our results demonstrate significant improvements
in detection metrics, including a rise in F1 score from 0.57
to 0.60 and approximately 50 additional correct detections,
highlighting the system’s adaptability and effectiveness.
2. Introduction
Detecting fire accurately across diverse real-world environments poses a persistent challenge in computer vision, particularly for applications in surveillance, industrial safety,
and disaster response. Standard object detection models
often struggle to maintain performance across varied visual
domains such as indoor spaces, open streets, or satellite
images.
To overcome this limitation, we introduce a Mixture of Experts (MoE) framework that decomposes the task across
multiple specialized detectors. Each expert—a YOLOv8
model—is trained exclusively on one type of fire scenario.
A separate CNN-based gating network evaluates the input
scene and assigns soft weights to each expert’s output, effectively tailoring predictions to the image context.
We further strengthen the gating network with a lightweight
self-attention module, allowing it to focus on spatially
salient features, and refine final outputs using Weighted Box
Fusion to merge overlapping detections. This paper presents
the design, implementation, and evaluation of our system,
demonstrating the advantages of modular, scenario-aware
fire detection.
3. Motivation
Detecting fire in real-world applications poses unique challenges due to the diversity of scenes in which fires can
appear—from confined indoor spaces to satellite images capturing vast landscapes. A one-size-fits-all detection model
often struggles to generalize across such drastically different
environments.
The Mixture of Experts (MoE) paradigm offers a compelling
solution by allowing the system to leverage the strengths
of multiple specialized models, or "experts." Rather than
forcing a single model to learn all scene types, MoE assigns
responsibility dynamically to models that are best suited for
each input image. This selective activation is orchestrated
by a gating network that determines the most appropriate
experts to trust for any given input.
Our project is motivated by this potential: we hypothesize
that combining multiple YOLOv8 models—each fine-tuned
for a specific fire scenario—and using a lightweight, self-attention-enhanced gating network to blend their outputs,
will result in significantly more accurate and robust fire
detection.
4. Related Work
Several recent studies have explored fire detection using
deep learning techniques, primarily relying on convolutional
neural networks and object detection models.
Abdusalomov et al. (1) proposed an enhanced forest fire
detection system using Mask R-CNN within the Detectron2
framework to perform instance segmentation. Their approach was especially effective for detecting small and irregularly shaped fires. They achieved 99.3% precision using
an augmented dataset of over 119,000 images.
Cao et al. (2) introduced an attention-enhanced Bidirectional
LSTM (ABi-LSTM) for early smoke detection in surveillance videos. The model combined spatial features from
InceptionV3 and temporal dependencies from Bi-LSTM
with a soft attention mechanism, achieving 97.8% accuracy.
Li and Zhao (3) discussed the evolution of fire detection models, comparing two-stage detectors like Faster R-CNN and R-FCN with one-stage models such as SSD and
YOLOv3. YOLOv3 was found to provide the best balance
between speed and accuracy, especially for fire detection.
Everingham et al. (4) conducted a robustness analysis indicating that YOLOv3 maintained strong performance across
variable conditions and was significantly better for fire detection compared to other models.
Despite strong performance in isolated domains, most prior
methods lack adaptability across vastly different scenarios
such as indoor environments or satellite views. Our work
addresses this gap by using a Mixture of Experts approach
with scenario-specific detectors and a context-aware gating
mechanism.
5. Datasets
We curated four distinct datasets, each tailored to a specific fire detection scenario: indoor, outdoor, far-field, and satellite imagery. These datasets were sourced from publicly available repositories and carefully processed to ensure scene-specific specialization among expert models.

Figure 1. Image distribution per scenario. Outdoor images dominate due to greater availability, while indoor samples are more limited.

• Outdoor (Street-level): We collected 2621 annotated images from a combination of Kaggle fire datasets and Google Image scraping. The outdoor dataset was initially highly redundant, with many similar-looking street fires. To avoid overfitting and improve generalization, we applied stratified sampling to ensure diversity in background textures, fire size, and time of day. This dataset forms the largest subset due to the availability of varied street fire footage.

• Far-field (Tower Cameras): We selected 2000 images from the PyroNear and Pyro-SDIS collections, which primarily include tower surveillance imagery of distant wildfires. These far-field images presented challenges in scale and low fire-pixel visibility. We manually ensured the inclusion of diverse terrain types and fire intensities.

• Indoor: We merged 1157 images from multiple Roboflow fire datasets containing images of controlled fires in indoor environments such as kitchens, warehouses, and labs. This dataset was the most limited in size, so we supplemented it with synthetically augmented data during training to maintain performance.

• Satellite: We gathered 2000 annotated images from the Roboflow wildfire satellite dataset, which provides false-color satellite views of wildfire spread. This modality required bounding box annotations over noisy terrain patterns. To improve consistency, we filtered out corrupted or low-resolution samples before training.

6. Methodology
6.1. Baseline: Expert-Specific YOLOv8 Models
We trained four YOLOv8 object detection models, each specialized for a specific fire detection scenario as described in the Datasets section. Each model used a YOLOv8m backbone with 640×640 input resolution and was trained for 100 epochs with a batch size of 16.

Each model was trained independently to specialize in its respective domain, serving as an "expert" in the later Mixture of Experts architecture.
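Using the standard Ultralytics training API, one expert's training run as described in Section 6.1 might look like the following sketch. The dataset YAML filename is a placeholder we introduce here; the paper does not publish its data configuration files.

```python
from ultralytics import YOLO

# Train one scenario-specific expert (here, hypothetically, the indoor
# expert). "indoor_fire.yaml" is a placeholder dataset config.
model = YOLO("yolov8m.pt")      # YOLOv8m backbone, as in Section 6.1
model.train(
    data="indoor_fire.yaml",    # one of the four scenario datasets
    imgsz=640,                  # 640x640 input resolution
    epochs=100,
    batch=16,
)
```

The same call, repeated with each of the four dataset configs, yields the four independent experts.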
6.2. Gating Network
To dynamically weight the outputs of the experts, we trained a lightweight CNN classifier, referred to as the gating network. This model takes a resized input image (224×224) and outputs a 4-dimensional softmax probability vector indicating the confidence for each scenario. The architecture includes three convolutional layers, ReLU activations, max pooling, and a final linear classifier.
The gating network was trained on a labeled dataset combining all four scenarios. Ground truth labels indicated the
scenario type of each image.
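A minimal PyTorch sketch of such a gating network follows. The channel widths and kernel sizes are our assumptions; the paper specifies only the overall layout (three conv layers, ReLU, max pooling, linear classifier over four scenarios).

```python
import torch
import torch.nn as nn

class GatingCNN(nn.Module):
    """Sketch of the lightweight gating network: three conv layers with
    ReLU and max pooling, then a linear classifier over the four
    scenario classes. Channel counts are illustrative assumptions."""
    def __init__(self, num_experts: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 224 -> 112
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 112 -> 56
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
        )
        self.classifier = nn.Linear(64 * 28 * 28, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(self.features(x).flatten(1))  # raw logits G(x)
        return torch.softmax(logits, dim=1)                    # expert weights

gate = GatingCNN()
weights = gate(torch.randn(1, 3, 224, 224))  # one 224x224 RGB image
```

The softmax output plays the role of the soft expert weights used at inference time.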
6.3. First Improvement: Mixture of Experts Inference
At inference time, the input image is passed through all
four YOLOv8 experts. Simultaneously, the gating network
computes softmax weights for the image. The outputs of
each expert are then scaled by their respective weights.
Figure 2. Sample images from different fire detection scenarios: (a) Indoor, (b) Outdoor, (c) Far-field, and (d) Satellite imagery, showing
the diversity in appearance and context.
A confidence threshold is applied to filter low-confidence
boxes, and the remaining boxes are aggregated and passed
through Non-Maximum Suppression (NMS) to eliminate
overlaps and finalize predictions.
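The thresholding and NMS step described above can be sketched in plain Python. This is a simplified, framework-free greedy NMS; the threshold values shown are illustrative defaults, not the paper's actual settings.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, conf_thr=0.25, iou_thr=0.5):
    """Greedy NMS: drop low-confidence boxes, then suppress any box that
    overlaps an already-kept higher-scoring box. Returns kept indices."""
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes plus one distant box: the weaker
# duplicate is suppressed, the distant box survives.
kept = nms([(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)],
           [0.9, 0.8, 0.7])
```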
Applying softmax to obtain normalized weights for each expert:

w_i = e^{z_i} / Σ_{j=1}^{4} e^{z_j},   for i = 1, 2, 3, 4
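The softmax weighting can be computed directly from the gating logits; a minimal stdlib-only sketch follows (the logit values are made up for illustration).

```python
import math

def expert_weights(logits):
    """Softmax over the four gating logits z_1..z_4, with the usual
    max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits favouring the first (e.g. indoor) expert.
w = expert_weights([2.0, 0.5, 0.1, -1.0])
```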
6.4. Second Improvement: Attention-Enhanced Gating + WBF
We introduced a lightweight self-attention block into the gating CNN to help the model focus on spatially important features. This resulted in a classification accuracy jump from 90% to 96% on the validation set.

To further improve output box aggregation, we replaced NMS with Weighted Box Fusion (WBF), which merges overlapping predictions based on box coordinates and confidence scores. This yielded a higher F1 score and increased the total number of correct detections.

As illustrated in Figure 3, the input image is passed through the gating network to assign weights, followed by inference from all four YOLOv8 experts and a final aggregation via Weighted Box Fusion.

6.5. Additional Experiments: Test-Time Augmentation
We also experimented with Test-Time Augmentation (TTA), applying horizontal flips, scaling, and brightness adjustments. Each augmented image was passed through the MoE pipeline and the results were merged. However, performance degraded due to misaligned bounding boxes and translation inconsistencies during post-aggregation.

6.6. Mixture of Experts: Mathematical Flow
Let x ∈ R^{3×224×224} be the input image. The self-attention-enhanced gating CNN outputs raw logits

G(x) = [z_1, z_2, z_3, z_4] ∈ R^4,

which the softmax of Section 6.3 converts into the expert weights w_i.

Expert Inference and Weighting. Each expert E_i produces a set of detections

E_i(x) = {(b_{i,k}, c_{i,k})}_{k=1}^{N_i},

where b_{i,k} = (x_1, y_1, x_2, y_2) is a bounding box and c_{i,k} ∈ [0, 1] is its confidence score. We reweight the confidences using the gating weights:

ĉ_{i,k} = w_i · c_{i,k}

We keep predictions that pass the confidence threshold:

ĉ_{i,k} > τ

Weighted Box Fusion (WBF). All filtered boxes are normalized and combined:

B = ⋃_{i=1}^{4} {(b_{i,k}, ĉ_{i,k})}

The WBF algorithm merges each group of overlapping boxes {b_1, b_2, ..., b_n} using a confidence-weighted average:

b̄ = ( Σ_{j=1}^{n} ĉ_j · b_j ) / ( Σ_{j=1}^{n} ĉ_j ),

producing a refined box b̄ with weighted-average coordinates.
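The fusion rule b̄ = Σ ĉ_j·b_j / Σ ĉ_j can be sketched in plain Python. This is a deliberately simplified WBF (single-pass IoU clustering, confidence-weighted coordinate averaging, mean fused score); reference WBF implementations additionally rescale fused confidences by the number of contributing boxes.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def weighted_box_fusion(boxes, scores, iou_thr=0.55):
    """Cluster boxes by IoU against each cluster's top box, then fuse
    every cluster with a confidence-weighted average of coordinates.
    Returns a list of (fused_box, fused_score) pairs."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    clusters = []
    for i in order:
        for cl in clusters:
            if iou(boxes[i], boxes[cl[0]]) >= iou_thr:
                cl.append(i)
                break
        else:
            clusters.append([i])
    fused = []
    for cl in clusters:
        w = sum(scores[j] for j in cl)
        coords = tuple(sum(scores[j] * boxes[j][k] for j in cl) / w
                       for k in range(4))
        fused.append((coords, w / len(cl)))  # mean confidence as fused score
    return fused

# Two overlapping detections of the same fire fuse into one box whose
# coordinates are pulled toward the higher-confidence prediction.
fused = weighted_box_fusion([(0, 0, 10, 10), (2, 0, 12, 10)], [0.8, 0.4])
```

Unlike NMS, which discards the weaker box entirely, WBF lets every expert's prediction contribute to the final coordinates in proportion to its reweighted confidence.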
Figure 3. Mixture of Experts architecture for fire detection. The input image is fed both to the gating CNN with self-attention, which outputs the scenario weights, and to the four expert models (indoor, outdoor, far-field, and satellite YOLOv8 experts); Weighted Box Fusion then performs IoU-based box merging to produce the final fire detections.
6.7. Self-Attention Block
The self-attention mechanism computes contextual importance by constructing Query (Q), Key (K), and Value (V) projections from the feature map x. The attention map is then calculated as:

Attention(x) = γ · V · softmax(Q · K^T) + x

where γ is a learnable scalar and Q, K, and V are 1×1 convolution layers. The output retains the same spatial dimensions and is used to enhance the feature representation in the gating network.

7. Results
7.1. Quantitative Evaluation

Metric                 MoE       MoE+Attn+WBF
Total Predictions      –         –
mAP@0.5                0.8998    0.8877
Average IoU            –         –
Precision              0.7048    0.6770
Recall                 0.5029    0.5315
F1-Score               0.57      0.60
Classification Acc.    90%       96%

Table 1. Performance comparison between the baseline Mixture of Experts (MoE) approach and the enhanced version with attention and Weighted Box Fusion.
We evaluated our approach incrementally on a held-out validation set containing 698 ground-truth boxes across all scenarios. Table 1 presents the performance metrics for the baseline Mixture of Experts (MoE) configuration using a gating CNN, and the enhanced MoE setup incorporating a self-attention block and Weighted Box Fusion (WBF).

The enhanced MoE architecture increased the total number of correct detections from 498 to 548, an improvement of approximately 50 fire instances across the validation set. Although there was a minor drop in mAP@0.5 (from 0.8998 to 0.8877) and precision (from 0.7048 to 0.6770), these were offset by an increase in recall (from 0.5029 to 0.5315), leading to a higher overall F1-score.

In particular, the addition of the self-attention block in the gating network significantly boosted scene classification accuracy from 90% to 96%, resulting in better expert selection and stronger end-to-end detection performance.
7.2. Comparison with Single Model Baseline
To validate the effectiveness of our Mixture of Experts approach, we compared it against a single YOLOv8 model trained on the entire combined dataset. The results strongly favor the MoE architecture across all metrics; the most significant improvement is observed in mAP@0.5.
Metric              Single YOLO   MoE (Ours)
Precision           –             0.6770
Recall              –             0.5315
mAP@0.5             0.5253        0.8877
mAP@0.5:0.95        –             –
Average IoU         –             –
F1-Score            0.5577*       0.60
Total Predictions   –             –

Table 2. Performance comparison between a single YOLOv8 model trained on the combined dataset versus our Mixture of Experts approach. *F1-Score for single YOLO calculated from provided precision and recall values.
Figure 5. Confusion matrix for the Single YOLO model.
Our MoE approach achieves a mAP@0.5 of 0.8877, compared to 0.5253 for the single model, a 69% relative improvement. This substantial gain indicates that our expert specialization and dynamic weighting strategy is highly effective at producing high-confidence detections that match ground truth objects.
Both precision and recall metrics show consistent improvements, resulting in a higher overall F1-Score. These results
validate our hypothesis that specialized expert models, when
combined with an intelligent gating mechanism, outperform
a monolithic approach that attempts to handle all scenarios
with a single model.
7.3. Confusion Matrix Analysis
To better understand classification errors, we visualize the confusion matrices of both the baseline YOLO model and our enhanced Mixture of Experts (MoE) system.

Both models show a tendency to confuse fire with background regions, particularly in cases involving bright lights or partial occlusion. However, the MoE model achieves a better balance in true fire detection, as evidenced by a higher true-positive count in the upper-left cell of its matrix. The YOLO model makes more conservative predictions, which results in lower recall but higher precision.

These matrices support our earlier findings: the MoE framework improves recall and scene adaptability while slightly compromising precision, aligning with the trade-offs observed in Table 1.
7.4. Impact of Weighted Box Fusion (WBF)
Replacing NMS with WBF led to better box refinement and
reduced missed overlaps in crowded fire scenarios. WBF
particularly helped in merging partial predictions from multiple experts into a single high-confidence detection.
7.5. Test-Time Augmentation (TTA) Results
TTA failed to yield improvements and often degraded performance due to coordinate inconsistencies and increased false
positives. The model misinterpreted transformations like
flips and zooms, resulting in scattered or ghost detections.
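The coordinate bookkeeping that TTA requires illustrates where such misalignments creep in. For a horizontal flip, detections found in the flipped image must be mirrored back before fusion; a minimal sketch (the box values are illustrative):

```python
def unflip_box_horizontal(box, img_w):
    """Map a (x1, y1, x2, y2) box detected in a horizontally flipped
    image back to the original frame: x coordinates mirror about the
    image width, y coordinates are unchanged."""
    x1, y1, x2, y2 = box
    return (img_w - x2, y1, img_w - x1, y2)

# A detection at the left edge of the flipped 640px-wide image
# corresponds to the right edge of the original image.
restored = unflip_box_horizontal((0, 10, 30, 40), img_w=640)
```

Any error in this inverse mapping (or in the analogous inverses for scaling and cropping) shifts boxes relative to the un-augmented predictions, producing exactly the scattered duplicates observed above.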
8. Discussion
The proposed Mixture of Experts (MoE) framework effectively addresses the challenge of generalizing fire detection
across diverse environments by leveraging scenario-specific
detectors and dynamic weighting. This modular design
proved superior to a single, generalized model, as demonstrated by significant gains in recall and F1-score.
Figure 4. Confusion matrix for the Mixture of Experts model.
A key strength of the system lies in the gating network’s
ability to assign scenario-specific weights. Incorporating a
self-attention block enhanced its spatial awareness, resulting
in better expert selection and ultimately contributing to an
improved detection rate. The observed increase in scene
classification accuracy from 90% to 96% highlights this
advancement.
However, certain limitations persist. False positives were
commonly triggered by bright, fire-like light sources such
as vehicle headlights or interior lighting. These distractors,
often misclassified due to visual similarity in color and intensity, underscore the need for improved contextual modeling
or hard negative mining.
Additionally, while Weighted Box Fusion (WBF) refined
output quality by resolving overlapping detections, it introduced slight trade-offs in precision. This suggests that
over-merging across expert outputs may sometimes dilute
individual model confidence.
Our experiments with Test-Time Augmentation (TTA) revealed its sensitivity to spatial transformations. Techniques
such as flips and brightness adjustments introduced box
misalignments, reducing reliability and increasing false positives. This result underscores the importance of bounding
box calibration when integrating TTA into fusion pipelines.
Overall, the MoE framework provides a flexible foundation
for scalable fire detection and can be extended to accommodate more scenarios or future enhancements in object
detection and scene understanding.
9. Future Work
While our current Mixture of Experts framework demonstrates strong performance across multiple fire detection
scenarios, several areas remain open for future exploration:
• Model Compression: Investigate knowledge distillation or pruning techniques to reduce the computational
cost of running four YOLOv8 models simultaneously.
• Multi-modal Inputs: Extend the system to incorporate additional input modalities, such as thermal or
infrared data, to improve detection under low-visibility
conditions.
• Hard Negative Mining: Incorporate explicit handling
of fire-like distractors such as car headlights or bright
lights to reduce false positives in nighttime or indoor
scenes.
• Active Learning: Implement strategies for continuous
data acquisition and model fine-tuning using uncertain
or misclassified samples.
• Multi-class Support: Extend detection capabilities to
include related classes such as smoke, heat signatures,
or fire sources (e.g., stove, electrical fault).
10. Conclusion
In this project, we proposed a Mixture of Experts framework tailored for robust fire detection across varied scenarios
including indoor, outdoor, far-field, and satellite imagery.
By combining four specialized YOLOv8 detectors with
a learned gating network, we dynamically weighted each
model’s contribution to inference. Our results demonstrate
that expert specialization combined with context-aware
weighting significantly improves detection quality.
Further enhancements using attention in the gating model
and box fusion strategies led to measurable gains in detection metrics. While some augmentations like TTA did
not perform as expected, the modular nature of our system
makes it adaptable for future experimentation and real-world
deployment.
11. Contributions
• Trained four scenario-specific YOLOv8 fire detection
models for satellite, far-field, indoor, and outdoor imagery.
• Developed a lightweight CNN-based gating network to
classify scene types and assign soft weights to expert
outputs.
• Proposed a Mixture of Experts (MoE) architecture that
blends expert predictions based on learned scene context.
• Incorporated a self-attention mechanism into the gating
network, improving scene classification accuracy from
90% to 96%.
• Replaced Non-Maximum Suppression (NMS) with
Weighted Box Fusion (WBF), leading to improved
box aggregation and an increase in F1 score.
• Conducted extensive evaluation and ablation experiments, including Test-Time Augmentation (TTA), to
analyze impact on performance.
• Achieved an overall improvement in correct detections from 498 (baseline) to 548 (final model) on a 698-image validation set.
• Deployed an interactive Streamlit web application that
allows users to upload images and visualize fire detections using the trained MoE model.
References
[1] A. B. Abdusalomov, B. M. S. Islam, R. Nasimov, M. Mukhiddinov, and T. K. Whangbo. An Improved Forest Fire Detection Method Based on the Detectron2 Model and a Deep Learning Approach. Sensors, 23(3):1512, 2023.
[2] Y. Cao, F. Yang, Q. Tang, and X. Lu. An Attention Enhanced Bidirectional LSTM for Early Forest Fire Smoke Recognition. IEEE Access, 7, 2019.
[3] P. Li and W. Zhao. Image fire detection algorithms
based on convolutional neural networks. Case Studies
in Thermal Engineering, 19:100625, 2020.
[4] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I.
Williams, J. Winn, and A. Zisserman. The PASCAL
Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111:98–136,
2015.