Towards Robust Fire Detection Across Diverse Environments using Mixture of
YOLOv8 Experts
Sheraz Waseem – Umair Amir – Ahmed Rayyan Aamir – Talha Rehan
https://github.com/UmairAmir/moe-fire-detection
1. Abstract
Robust fire detection in real-world settings requires models
that can generalize across vastly different visual environments, including indoor scenes, outdoor street views, far-field surveillance footage, and satellite imagery. We propose
a Mixture of Experts (MoE) architecture that leverages four
specialized YOLOv8 detectors, each trained on a distinct
fire scenario. A lightweight convolutional gating network
dynamically assigns soft weights to expert outputs based on
scene context. To enhance gating accuracy, we incorporate
a self-attention mechanism, boosting classification performance from 90% to 96%. For final prediction refinement,
we employ Weighted Box Fusion (WBF) to aggregate detections. Our results demonstrate significant improvements
in detection metrics, including a rise in F1 score from 0.57
to 0.60 and approximately 50 additional correct detections,
highlighting the system’s adaptability and effectiveness.
2. Introduction
Detecting fire accurately across diverse real-world environments poses a persistent challenge in computer vision, particularly for applications in surveillance, industrial safety,
and disaster response. Standard object detection models
often struggle to maintain performance across varied visual
domains such as indoor spaces, open streets, or satellite
images.
To overcome this limitation, we introduce a Mixture of Experts (MoE) framework that decomposes the task across
multiple specialized detectors. Each expert—a YOLOv8
model—is trained exclusively on one type of fire scenario.
A separate CNN-based gating network evaluates the input
scene and assigns soft weights to each expert’s output, effectively tailoring predictions to the image context.
We further strengthen the gating network with a lightweight
self-attention module, allowing it to focus on spatially
salient features, and refine final outputs using Weighted Box
Fusion to merge overlapping detections. This paper presents
the design, implementation, and evaluation of our system,
demonstrating the advantages of modular, scenario-aware
fire detection.
3. Motivation
Detecting fire in real-world applications poses unique challenges due to the diversity of scenes in which fires can
appear—from confined indoor spaces to satellite images capturing vast landscapes. A one-size-fits-all detection model
often struggles to generalize across such drastically different
environments.
The Mixture of Experts (MoE) paradigm offers a compelling
solution by allowing the system to leverage the strengths
of multiple specialized models, or "experts." Rather than
forcing a single model to learn all scene types, MoE assigns
responsibility dynamically to models that are best suited for
each input image. This selective activation is orchestrated
by a gating network that determines the most appropriate
experts to trust for any given input.
Our project is motivated by this potential: we hypothesize
that combining multiple YOLOv8 models—each fine-tuned
for a specific fire scenario—and using a lightweight, self-attention-enhanced gating network to blend their outputs,
will result in significantly more accurate and robust fire
detection.
4. Related Work
Several recent studies have explored fire detection using
deep learning techniques, primarily relying on convolutional
neural networks and object detection models.
Abdusalomov et al. (1) proposed an enhanced forest fire
detection system using Mask R-CNN within the Detectron2
framework to perform instance segmentation. Their approach was especially effective for detecting small and irregularly shaped fires. They achieved 99.3% precision using
an augmented dataset of over 119,000 images.
Cao et al. (2) introduced an attention-enhanced Bidirectional
LSTM (ABi-LSTM) for early smoke detection in surveillance videos. The model combined spatial features from
InceptionV3 and temporal dependencies from Bi-LSTM
with a soft attention mechanism, achieving 97.8% accuracy.
Li and Zhao (3) discussed the evolution of fire detection models, comparing two-stage detectors like Faster R-CNN and R-FCN with one-stage models such as SSD and
YOLOv3. YOLOv3 was found to provide the best balance
between speed and accuracy, especially for fire detection.
Everingham et al. (4) conducted a robustness analysis indicating that YOLOv3 maintained strong performance across
variable conditions and was significantly better for fire detection compared to other models.
Despite strong performance in isolated domains, most prior
methods lack adaptability across vastly different scenarios
such as indoor environments or satellite views. Our work
addresses this gap by using a Mixture of Experts approach
with scenario-specific detectors and a context-aware gating
mechanism.
5. Datasets
We curated four distinct datasets, each tailored to a specific fire detection scenario: indoor, outdoor, far-field, and satellite imagery. These datasets were sourced from publicly available repositories and carefully processed to ensure scene-specific specialization among expert models.

Figure 1. Image distribution per scenario. Outdoor images dominate due to greater availability, while indoor samples are more limited.

• Outdoor (Street-level): We collected 2621 annotated images from a combination of Kaggle fire datasets and Google Image scraping. The outdoor dataset was initially highly redundant, with many similar-looking street fires. To avoid overfitting and improve generalization, we applied stratified sampling to ensure diversity in background textures, fire size, and time of day. This dataset forms the largest subset due to the availability of varied street fire footage.

• Far-field (Tower Cameras): We selected 2000 images from the PyroNear and Pyro-SDIS collections, which primarily include tower surveillance imagery of distant wildfires. These far-field images presented challenges in scale and low fire-pixel visibility. We manually ensured the inclusion of diverse terrain types and fire intensities.

• Indoor: We merged 1157 images from multiple Roboflow fire datasets containing images of controlled fires in indoor environments such as kitchens, warehouses, and labs. This dataset was the most limited in size, so we supplemented it with synthetically augmented data during training to maintain performance.

• Satellite: We gathered 2000 annotated images from the Roboflow wildfire satellite dataset, which provides false-color satellite views of wildfire spread. This modality required bounding box annotations over noisy terrain patterns. To improve consistency, we filtered out corrupted or low-resolution samples before training.

6. Methodology
6.1. Baseline: Expert-Specific YOLOv8 Models
We trained four YOLOv8 object detection models, each specialized for a specific fire detection scenario as described in the Datasets section. Each model used a YOLOv8m backbone with 640×640 input resolution and was trained for 100 epochs with a batch size of 16.

Each model was trained independently to specialize in its respective domain, serving as an "expert" in the later Mixture of Experts architecture.
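Using the standard Ultralytics training API, one expert's training run as described in Section 6.1 might look like the following sketch. The dataset YAML filename is a placeholder we introduce here; the paper does not publish its data configuration files.

```python
from ultralytics import YOLO

# Train one scenario-specific expert (here, hypothetically, the indoor
# expert). "indoor_fire.yaml" is a placeholder dataset config.
model = YOLO("yolov8m.pt")      # YOLOv8m backbone, as in Section 6.1
model.train(
    data="indoor_fire.yaml",    # one of the four scenario datasets
    imgsz=640,                  # 640x640 input resolution
    epochs=100,
    batch=16,
)
```

The same call, repeated with each of the four dataset configs, yields the four independent experts.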
6.2. Gating Network
To dynamically weight the outputs of the experts, we trained a lightweight CNN classifier, referred to as the gating network. This model takes a resized input image (224×224) and outputs a 4-dimensional softmax probability vector indicating the confidence for each scenario. The architecture includes three convolutional layers, ReLU activations, max pooling, and a final linear classifier.
The gating network was trained on a labeled dataset combining all four scenarios. Ground truth labels indicated the
scenario type of each image.
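A minimal PyTorch sketch of such a gating network follows. The channel widths and kernel sizes are our assumptions; the paper specifies only the overall layout (three conv layers, ReLU, max pooling, linear classifier over four scenarios).

```python
import torch
import torch.nn as nn

class GatingCNN(nn.Module):
    """Sketch of the lightweight gating network: three conv layers with
    ReLU and max pooling, then a linear classifier over the four
    scenario classes. Channel counts are illustrative assumptions."""
    def __init__(self, num_experts: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 224 -> 112
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 112 -> 56
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
        )
        self.classifier = nn.Linear(64 * 28 * 28, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(self.features(x).flatten(1))  # raw logits G(x)
        return torch.softmax(logits, dim=1)                    # expert weights

gate = GatingCNN()
weights = gate(torch.randn(1, 3, 224, 224))  # one 224x224 RGB image
```

The softmax output plays the role of the soft expert weights used at inference time.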
6.3. First Improvement: Mixture of Experts Inference
At inference time, the input image is passed through all
four YOLOv8 experts. Simultaneously, the gating network
computes softmax weights for the image. The outputs of
each expert are then scaled by their respective weights.
Figure 2. Sample images from different fire detection scenarios: (a) Indoor, (b) Outdoor, (c) Far-field, and (d) Satellite imagery, showing
the diversity in appearance and context.
A confidence threshold is applied to filter low-confidence
boxes, and the remaining boxes are aggregated and passed
through Non-Maximum Suppression (NMS) to eliminate
overlaps and finalize predictions.
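The thresholding and NMS step described above can be sketched in plain Python. This is a simplified, framework-free greedy NMS; the threshold values shown are illustrative defaults, not the paper's actual settings.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, conf_thr=0.25, iou_thr=0.5):
    """Greedy NMS: drop low-confidence boxes, then suppress any box that
    overlaps an already-kept higher-scoring box. Returns kept indices."""
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes plus one distant box: the weaker
# duplicate is suppressed, the distant box survives.
kept = nms([(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)],
           [0.9, 0.8, 0.7])
```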
Applying softmax to obtain normalized weights for each expert:

w_i = e^{z_i} / Σ_{j=1}^{4} e^{z_j},   for i = 1, 2, 3, 4
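The softmax weighting can be computed directly from the gating logits; a minimal stdlib-only sketch follows (the logit values are made up for illustration).

```python
import math

def expert_weights(logits):
    """Softmax over the four gating logits z_1..z_4, with the usual
    max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits favouring the first (e.g. indoor) expert.
w = expert_weights([2.0, 0.5, 0.1, -1.0])
```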
6.4. Second Improvement: Attention-Enhanced Gating + WBF
We introduced a lightweight self-attention block into the gating CNN to help the model focus on spatially important features. This resulted in a classification accuracy jump from 90% to 96% on the validation set.

To further improve output box aggregation, we replaced NMS with Weighted Box Fusion (WBF), which merges overlapping predictions based on box coordinates and confidence scores. This yielded a higher F1 score and increased the total number of correct detections.

As illustrated in Figure 3, the input image is passed through the gating network to assign weights, followed by inference from all four YOLOv8 experts and a final aggregation via Weighted Box Fusion.

6.5. Additional Experiments: Test-Time Augmentation
We also experimented with Test-Time Augmentation (TTA), applying horizontal flips, scaling, and brightness adjustments. Each augmented image was passed through the MoE pipeline and the results were merged. However, performance degraded due to misaligned bounding boxes and translation inconsistencies during post-aggregation.

6.6. Mixture of Experts: Mathematical Flow
Let x ∈ R^{3×224×224} be the input image. The self-attention-enhanced gating CNN outputs raw logits

G(x) = [z_1, z_2, z_3, z_4] ∈ R^4,

which the softmax of Section 6.3 converts into the expert weights w_i.

Expert Inference and Weighting. Each expert E_i produces a set of detections

E_i(x) = {(b_{i,k}, c_{i,k})}_{k=1}^{N_i},

where b_{i,k} = (x_1, y_1, x_2, y_2) is a bounding box and c_{i,k} ∈ [0, 1] is its confidence score. We reweight the confidences using the gating weights:

ĉ_{i,k} = w_i · c_{i,k}

We keep predictions that pass the confidence threshold:

ĉ_{i,k} > τ

Weighted Box Fusion (WBF). All filtered boxes are normalized and combined:

B = ⋃_{i=1}^{4} {(b_{i,k}, ĉ_{i,k})}

The WBF algorithm merges each group of overlapping boxes {b_1, b_2, ..., b_n} using a confidence-weighted average:

b̄ = ( Σ_{j=1}^{n} ĉ_j · b_j ) / ( Σ_{j=1}^{n} ĉ_j ),

producing a refined box b̄ with weighted-average coordinates.
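The fusion rule b̄ = Σ ĉ_j·b_j / Σ ĉ_j can be sketched in plain Python. This is a deliberately simplified WBF (single-pass IoU clustering, confidence-weighted coordinate averaging, mean fused score); reference WBF implementations additionally rescale fused confidences by the number of contributing boxes.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def weighted_box_fusion(boxes, scores, iou_thr=0.55):
    """Cluster boxes by IoU against each cluster's top box, then fuse
    every cluster with a confidence-weighted average of coordinates.
    Returns a list of (fused_box, fused_score) pairs."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    clusters = []
    for i in order:
        for cl in clusters:
            if iou(boxes[i], boxes[cl[0]]) >= iou_thr:
                cl.append(i)
                break
        else:
            clusters.append([i])
    fused = []
    for cl in clusters:
        w = sum(scores[j] for j in cl)
        coords = tuple(sum(scores[j] * boxes[j][k] for j in cl) / w
                       for k in range(4))
        fused.append((coords, w / len(cl)))  # mean confidence as fused score
    return fused

# Two overlapping detections of the same fire fuse into one box whose
# coordinates are pulled toward the higher-confidence prediction.
fused = weighted_box_fusion([(0, 0, 10, 10), (2, 0, 12, 10)], [0.8, 0.4])
```

Unlike NMS, which discards the weaker box entirely, WBF lets every expert's prediction contribute to the final coordinates in proportion to its reweighted confidence.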
Figure 3. Mixture of Experts architecture for fire detection. The input image is fed both to the gating CNN with self-attention, which outputs the scenario weights, and to the four expert models (indoor, outdoor, far-field, and satellite YOLOv8 experts); Weighted Box Fusion then performs IoU-based box merging to produce the final fire detections.
6.7. Self-Attention Block
The self-attention mechanism computes contextual importance by constructing Query (Q), Key (K), and Value (V) projections from the feature map x. The attention map is then calculated as:

Attention(x) = γ · V · softmax(Q · K^T) + x

where γ is a learnable scalar and Q, K, and V are 1×1 convolution layers. The output retains the same spatial dimensions and is used to enhance the feature representation in the gating network.

7. Results
7.1. Quantitative Evaluation

Metric                 MoE       MoE+Attn+WBF
Total Predictions      –         –
mAP@0.5                0.8998    0.8877
Average IoU            –         –
Precision              0.7048    0.6770
Recall                 0.5029    0.5315
F1-Score               0.57      0.60
Classification Acc.    90%       96%

Table 1. Performance comparison between the baseline Mixture of Experts (MoE) approach and the enhanced version with attention and Weighted Box Fusion.
We evaluated our approach incrementally on a held-out validation set containing 698 ground-truth boxes across all scenarios. Table 1 presents the performance metrics for the baseline Mixture of Experts (MoE) configuration using a gating CNN, and the enhanced MoE setup incorporating a self-attention block and Weighted Box Fusion (WBF).

The enhanced MoE architecture increased the total number of correct detections from 498 to 548, an improvement of approximately 50 fire instances across the validation set. Although there was a minor drop in mAP@0.5 (from 0.8998 to 0.8877) and precision (from 0.7048 to 0.6770), these were offset by an increase in recall (from 0.5029 to 0.5315), leading to a higher overall F1-score.

In particular, the addition of the self-attention block in the gating network significantly boosted scene classification accuracy from 90% to 96%, resulting in better expert selection and stronger end-to-end detection performance.
7.2. Comparison with Single Model Baseline
To validate the effectiveness of our Mixture of Experts approach, we compared it against a single YOLOv8 model trained on the entire combined dataset. The results strongly favor the MoE architecture across all metrics; the most significant improvement is observed in mAP@0.5.
Metric              Single YOLO   MoE (Ours)
Precision           –             0.6770
Recall              –             0.5315
mAP@0.5             0.5253        0.8877
mAP@0.5:0.95        –             –
Average IoU         –             –
F1-Score            0.5577*       0.60
Total Predictions   –             –

Table 2. Performance comparison between a single YOLOv8 model trained on the combined dataset versus our Mixture of Experts approach. *F1-Score for single YOLO calculated from provided precision and recall values.
Figure 5. Confusion matrix for the Single YOLO model.
Our MoE approach achieves a mAP@0.5 of 0.8877, compared to 0.5253 for the single model, a 69% relative improvement. This substantial gain indicates that our expert specialization and dynamic weighting strategy is highly effective at producing high-confidence detections that match ground truth objects.
Both precision and recall metrics show consistent improvements, resulting in a higher overall F1-Score. These results
validate our hypothesis that specialized expert models, when
combined with an intelligent gating mechanism, outperform
a monolithic approach that attempts to handle all scenarios
with a single model.
7.3. Confusion Matrix Analysis
To better understand classification errors, we visualize the confusion matrices of both the baseline YOLO model and our enhanced Mixture of Experts (MoE) system.

Both models show a tendency to confuse fire with background regions, particularly in cases involving bright lights or partial occlusion. However, the MoE model achieves a better balance in true fire detection, as evidenced by a higher true-positive count in the upper-left cell of its matrix. The YOLO model makes more conservative predictions, which results in lower recall but higher precision.

These matrices support our earlier findings: the MoE framework improves recall and scene adaptability while slightly compromising precision, aligning with the trade-offs observed in Table 1.
7.4. Impact of Weighted Box Fusion (WBF)
Replacing NMS with WBF led to better box refinement and
reduced missed overlaps in crowded fire scenarios. WBF
particularly helped in merging partial predictions from multiple experts into a single high-confidence detection.
7.5. Test-Time Augmentation (TTA) Results
TTA failed to yield improvements and often degraded performance due to coordinate inconsistencies and increased false
positives. The model misinterpreted transformations like
flips and zooms, resulting in scattered or ghost detections.
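The coordinate bookkeeping that TTA requires illustrates where such misalignments creep in. For a horizontal flip, detections found in the flipped image must be mirrored back before fusion; a minimal sketch (the box values are illustrative):

```python
def unflip_box_horizontal(box, img_w):
    """Map a (x1, y1, x2, y2) box detected in a horizontally flipped
    image back to the original frame: x coordinates mirror about the
    image width, y coordinates are unchanged."""
    x1, y1, x2, y2 = box
    return (img_w - x2, y1, img_w - x1, y2)

# A detection at the left edge of the flipped 640px-wide image
# corresponds to the right edge of the original image.
restored = unflip_box_horizontal((0, 10, 30, 40), img_w=640)
```

Any error in this inverse mapping (or in the analogous inverses for scaling and cropping) shifts boxes relative to the un-augmented predictions, producing exactly the scattered duplicates observed above.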
8. Discussion
The proposed Mixture of Experts (MoE) framework effectively addresses the challenge of generalizing fire detection
across diverse environments by leveraging scenario-specific
detectors and dynamic weighting. This modular design
proved superior to a single, generalized model, as demonstrated by significant gains in recall and F1-score.
Figure 4. Confusion matrix for the Mixture of Experts model.
A key strength of the system lies in the gating network’s
ability to assign scenario-specific weights. Incorporating a
self-attention block enhanced its spatial awareness, resulting
in better expert selection and ultimately contributing to an
improved detection rate. The observed increase in scene
classification accuracy from 90% to 96% highlights this
advancement.
However, certain limitations persist. False positives were
commonly triggered by bright, fire-like light sources such
as vehicle headlights or interior lighting. These distractors,
often misclassified due to visual similarity in color and intensity, underscore the need for improved contextual modeling
or hard negative mining.
Additionally, while Weighted Box Fusion (WBF) refined
output quality by resolving overlapping detections, it introduced slight trade-offs in precision. This suggests that
over-merging across expert outputs may sometimes dilute
individual model confidence.
Our experiments with Test-Time Augmentation (TTA) revealed its sensitivity to spatial transformations. Techniques
such as flips and brightness adjustments introduced box
misalignments, reducing reliability and increasing false positives. This result underscores the importance of bounding
box calibration when integrating TTA into fusion pipelines.
Overall, the MoE framework provides a flexible foundation
for scalable fire detection and can be extended to accommodate more scenarios or future enhancements in object
detection and scene understanding.
9. Future Work
While our current Mixture of Experts framework demonstrates strong performance across multiple fire detection
scenarios, several areas remain open for future exploration:
• Model Compression: Investigate knowledge distillation or pruning techniques to reduce the computational
cost of running four YOLOv8 models simultaneously.
• Multi-modal Inputs: Extend the system to incorporate additional input modalities, such as thermal or
infrared data, to improve detection under low-visibility
conditions.
• Hard Negative Mining: Incorporate explicit handling
of fire-like distractors such as car headlights or bright
lights to reduce false positives in nighttime or indoor
scenes.
• Active Learning: Implement strategies for continuous
data acquisition and model fine-tuning using uncertain
or misclassified samples.
• Multi-class Support: Extend detection capabilities to
include related classes such as smoke, heat signatures,
or fire sources (e.g., stove, electrical fault).
10. Conclusion
In this project, we proposed a Mixture of Experts framework tailored for robust fire detection across varied scenarios
including indoor, outdoor, far-field, and satellite imagery.
By combining four specialized YOLOv8 detectors with
a learned gating network, we dynamically weighted each
model’s contribution to inference. Our results demonstrate
that expert specialization combined with context-aware
weighting significantly improves detection quality.
Further enhancements using attention in the gating model
and box fusion strategies led to measurable gains in detection metrics. While some augmentations like TTA did
not perform as expected, the modular nature of our system
makes it adaptable for future experimentation and real-world
deployment.
11. Contributions
• Trained four scenario-specific YOLOv8 fire detection
models for satellite, far-field, indoor, and outdoor imagery.
• Developed a lightweight CNN-based gating network to
classify scene types and assign soft weights to expert
outputs.
• Proposed a Mixture of Experts (MoE) architecture that
blends expert predictions based on learned scene context.
• Incorporated a self-attention mechanism into the gating
network, improving scene classification accuracy from
90% to 96%.
• Replaced Non-Maximum Suppression (NMS) with
Weighted Box Fusion (WBF), leading to improved
box aggregation and an increase in F1 score.
• Conducted extensive evaluation and ablation experiments, including Test-Time Augmentation (TTA), to
analyze impact on performance.
• Achieved an overall improvement in correct detections from 498 (baseline) to 548 (final model) on a 698-image validation set.
• Deployed an interactive Streamlit web application that
allows users to upload images and visualize fire detections using the trained MoE model.
References
[1] A. B. Abdusalomov, B. M. S. Islam, R. Nasimov, M. Mukhiddinov, and T. K. Whangbo. An Improved Forest Fire Detection Method Based on the Detectron2 Model and a Deep Learning Approach. Sensors, 23(3):1512, 2023.
[2] Y. Cao, F. Yang, Q. Tang, and X. Lu. An Attention Enhanced Bidirectional LSTM for Early Forest Fire Smoke Recognition. IEEE Access, 7, 2019.
[3] P. Li and W. Zhao. Image fire detection algorithms
based on convolutional neural networks. Case Studies
in Thermal Engineering, 19:100625, 2020.
[4] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I.
Williams, J. Winn, and A. Zisserman. The PASCAL
Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111:98–136,
2015.