Alzheimer Detection using Squeeze and Excitation Model
Slice Aware Vision Transformer with Squeeze and Excitation for MRI Based
Alzheimer’s Progression
Sheraz Waseem, Umair Amir
github.com/UmairAmir/Alzheimer-Detection-23-
Abstract
Accurate staging of Alzheimer's Disease (AD) from MRI data remains a clinically significant yet technically challenging task due to subtle, spatially diffuse brain changes. We propose a novel Slice-Aware Vision Transformer (SE-ViT) architecture that integrates a Squeeze and Excitation (SE) module to rank and select the most diagnostically salient MRI slices prior to classification. Using the OASIS-2 dataset, we benchmark four progressively refined models: a baseline ViT (E-0), a binary SE-ViT (E-1), a direct 4-class SE-ViT (E-2), and a hierarchical SE-ViT (E-3) that mirrors clinical diagnostic pipelines. Results show that the SE module improves model specificity and early-stage recall, particularly for the clinically critical "Very Mild" cohort. The final hierarchical model achieves 74% accuracy and a 0.72 F1 score, outperforming the baseline by over 10 percentage points. Our framework offers interpretable, anatomically grounded predictions and sets the foundation for future extensions incorporating temporal modeling and multimodal fusion.
1. Introduction
Alzheimer's disease (AD) is a long-term brain disorder that causes memory loss and thinking problems and worsens over time. Tracking how the disease progresses, from normal aging through the early and later stages of dementia, is very important for starting treatment early and for testing new drugs. MRI scans help doctors see changes in the brain, such as shrinking of the hippocampus, enlarged brain cavities (ventricles), and thinning of the brain's outer layer (cortex).
However, these changes are often subtle, vary from person to person, and appear in different parts of the brain at
different times. Traditional analysis methods that rely on
manually selected features or specific brain regions often
fail to fully capture this complex and detailed information.
2. Research in the Field
Recent advances in deep learning have shifted the paradigm
towards end-to-end representation learning. Convolutional
Neural Networks (CNNs), Graph Neural Networks (GNNs),
and survival analysis hybrids have reported encouraging
results, but exhibit two persistent limitations:
1. Global token misweighting: Three-dimensional CNNs process full MRI volumes indiscriminately, allocating equal importance to every slice, even though medial temporal slices typically carry far more pathological signal than superior or inferior sections.
2. Limited long-range contextual modelling: CNN kernels possess finite receptive fields; capturing distant, cross-regional dependencies (e.g., simultaneous hippocampal and ventricular changes) requires very deep architectures with heavy parameter counts.
To address these shortcomings, Vision Transformers (ViTs)
have emerged as attractive alternatives. By decomposing
input images into patch tokens and processing them through
self-attention, ViTs model global interactions irrespective
of spatial distance, providing state-of-the-art performance
on natural image benchmarks and increasingly on medical
imaging tasks. Nevertheless, vanilla ViTs remain slice-agnostic when applied to volumetric MRI stacks, treating each token identically and ignoring domain knowledge about slice saliency.
2.1. Project Motivation
We propose a Slice-Aware ViT framework enhanced with a
Squeeze and Excitation (SE) gating mechanism that:
1. Learns slice importance weights via channel-wise recalibration, effectively ranking 256 axial slices and
selecting the top-k salient ones during inference.
2. Trains jointly in a pipeline regime, where SE parameters are optimized using the downstream ViT classification loss, ensuring end-to-end gradient flow.
3. Supports both direct four-way classification and hierarchical staging, offering two complementary strategies:
• Experiment 1: Single-head ViT with a 4-class
softmax (Nondemented, Very Mild, Mild, Moderate Demented).
• Experiment 2: Two-stage model that first distinguishes Demented vs. Nondemented, then refines
Demented cases into severity sub-classes.
Our tests show that adding the SE module improves the overall F1 score and per-class sensitivity, especially for the less common Very Mild stage, while keeping the model size reasonably small.

3. Dataset
The selection of an appropriate dataset is crucial for research involving medical imaging, particularly for studies on conditions like Alzheimer's disease. Two prominent datasets frequently utilized in this domain are the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Open Access Series of Imaging Studies (OASIS). A comparison of these datasets, highlighting their key characteristics, is presented in the table below.

3.1. Dataset Comparison

Figure 1. Dataset Comparison between ADNI and OASIS

3.2. Selection Rationale
Although ADNI offers a larger cohort, we opted for OASIS-2 because of several key advantages:

• Complete longitudinal metadata: CDR scores are available for every visit, enabling precise disease-progression labels.
• Homogeneous acquisition protocol: all scans were captured on a Siemens Vision 1.5 T scanner using identical MPRAGE parameters (TR = 9.7 ms, TE = 4 ms, flip = 10°), reducing scanner-induced variance.
• Elderly-focused cohort: subjects aged 60–96 years better represent the typical onset window for Alzheimer's pathology.
• Ethical and licensing clarity: the CC BY-SA licence permits derivative works without additional Institutional Data Use Agreements.

3.3. OASIS-2 Characteristics
The table below describes the characteristics of the OASIS-2 dataset.

Figure 2. OASIS 2 Characteristics

Figure 3. OASIS 2 Age wise distribution

Figure 4. OASIS 2 Slices

Figure 5. OASIS 2 Different Patients
3.4. Preprocessing Applied
Before feeding the MRI data into our model, we implemented a comprehensive preprocessing pipeline to standardize the images and remove potential confounding factors.
This multi-step approach was carefully designed to enhance
pathological features while minimizing technical variations
that could interfere with disease classification. Our preprocessing workflow consisted of the following sequential
steps:
1. Spatial normalization: resample to 1 mm³ isotropic, then center-crop to 256 × 256 × 128.
2. Intensity normalization: z-score per volume (µ = 0, σ = 1).
3. Slice extraction: 256 axial slices retained for the SE-ViT pipeline.

This harmonized pipeline ensures that the model learns pathology-specific features rather than scanner artifacts or intensity drift.

3.5. Limitations & Mitigation
While OASIS-2 provides valuable advantages for our research, we acknowledge several limitations of our dataset selection. These constraints represent important considerations that could affect the generalizability and scope of our findings, and we have identified strategies to address them in future work:

• Sample Size: OASIS-2 is modest compared with the full ADNI dataset.
• Demographic Bias: predominantly U.S. Caucasian cohort; future work will validate on cross-site datasets.
• Single Modality: only T1-weighted MRI is used; multimodal fusion (e.g., PET, fMRI) remains future scope.

By rigorously comparing ADNI and OASIS-2, we selected the dataset that offers the most consistent longitudinal ground truth and acquisition homogeneity, thereby maximizing internal validity for slice-aware transformer training.

4. Methodology

4.1. Data Pipeline Overview
Our entire workflow is organized as a slice-aware, end-to-end learning pipeline. Each 3D T1-weighted MRI volume in OASIS-2 is first standardized through a reproducible pre-processing stack; subsequently, a Squeeze and Excitation (SE) gate ranks axial slices by diagnostic salience. The gated subset then feeds a custom Vision Transformer (ViT) that performs either (i) direct four-way staging or (ii) a hierarchical two-stage classification, depending on the experiment.

Figure 6. Workflow Pipeline

4.2. Slice-Aware Gating with Squeeze and Excitation
Let $S \in \mathbb{R}^{256 \times H \times W}$ be the stack of axial slices for a subject. The SE block performs:

1. Squeeze:
$$z_i = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} S_{i,h,w}, \quad i = 1, \dots, 256$$
producing a 256-dimensional global descriptor $z$.

2. Excitation:
$$w = \sigma(\mathrm{MLP}(z)), \quad w \in (0, 1)^{256}$$
where MLP is a fully connected bottleneck with dimensions 256 → 64 → 256 and ReLU activation, and $\sigma$ is the sigmoid function.

3. Top-k selection:
$$\operatorname{arg\,top}_3(w)$$
yields the indices of the three most informative slices, $S'$. All other slices are zeroed, ensuring backpropagation still flows through $w$.

The SE parameters $\theta_{\mathrm{SE}}$ are learned jointly with the ViT parameters $\theta_{\mathrm{ViT}}$ via the downstream cross-entropy loss:
$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log \hat{y}_c, \qquad \hat{y} = f_{\mathrm{ViT}}(S'; \theta_{\mathrm{ViT}})$$
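The squeeze, excitation, and top-k selection steps can be sketched in NumPy as follows. This is a minimal illustration with random, untrained weights; in the actual model the bottleneck MLP parameters are learned jointly with the ViT, and the helper name `se_gate` is ours.

```python
import numpy as np

def se_gate(S, W1, b1, W2, b2, k=3):
    """Slice-level SE gating: squeeze -> excitation -> top-k masking.
    S is a (256, H, W) stack of axial slices."""
    # Squeeze: global average pool each slice to a single scalar
    z = S.mean(axis=(1, 2))                       # (256,)
    # Excitation: 256 -> 64 -> 256 bottleneck MLP with ReLU, then sigmoid
    h = np.maximum(z @ W1 + b1, 0.0)              # (64,)
    w = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # (256,), each weight in (0, 1)
    # Top-k selection: keep the k highest-weighted slices, zero the rest
    top = np.argsort(w)[-k:]
    mask = np.zeros_like(w)
    mask[top] = 1.0
    # Multiplying the kept slices by w keeps the gate differentiable
    # in an autograd setting, so gradients still flow through w
    S_gated = S * (mask * w)[:, None, None]
    return S_gated, w, np.sort(top)

# Tiny usage example with random weights
rng = np.random.default_rng(0)
S = rng.standard_normal((256, 8, 8))
W1, b1 = rng.standard_normal((256, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((64, 256)) * 0.1, np.zeros(256)
S_gated, w, top = se_gate(S, W1, b1, W2, b2, k=3)
```

In a PyTorch implementation the hard mask would typically be combined with the soft weights exactly as above, so the zeroed slices contribute nothing forward while the selected slices carry gradient back into the excitation MLP.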
4.3. Vision Transformer Backbone
Our model uses a Vision Transformer (ViT) architecture as the backbone, with the following specifications:

• Patch projection: each 128 × 128 slice is split into 16 × 16 patches, yielding 64 tokens per slice. Tokens are flattened and linearly embedded to d = 768.
• Positional encodings: learnable, added at the slice-token level to retain intra-slice geometry.
• Transformer encoder: 12 layers, 12-head self-attention, feed-forward width 3072 with GELU activation.
• Classification scheme:
  – E-0/E-1 (binary) → single-neuron sigmoid.
  – E-2 (multiclass) → 4-neuron softmax.
  – E-3 (hierarchical) → Stage 1: sigmoid; Stage 2: 3-way softmax attached to the same CLS embedding, enabling parameter sharing.

4.4. Training Details

Figure 7. Training Details

4.5. Experiments
Our experimental methodology involved four distinct model configurations (E-0 through E-3) to systematically evaluate different aspects of Alzheimer's disease progression classification. Each experiment was designed to build upon insights from the previous one, progressively enhancing the model architecture to address specific challenges in MRI-based AD staging.

4.5.1. Experiment E-0: Baseline Assessment
Our investigation began with a baseline experiment utilizing a pre-existing Vision Transformer model from the Hugging Face repository (fawadkhan/ViT_FineTuned_on_ImagesOASIS), which had been previously fine-tuned on the OASIS dataset. This model implemented a standard ViT architecture with pre-trained weights, focusing solely on binary classification (demented vs. non-demented). We evaluated this model on our curated subset of OASIS-2 data to establish baseline performance metrics.

The results revealed suboptimal generalization, with accuracy falling significantly below reported benchmarks. Through error analysis, we identified several contributing factors:

• Divergent preprocessing protocols between the original model training and our implementation
• Lack of slice-specific attention mechanisms to focus on diagnostically relevant brain regions
• Inability to distinguish between progressive stages of Alzheimer's disease

4.5.2. Experiment E-1: Custom SE-ViT for Binary Classification
Building upon lessons from E-0, we developed a custom Vision Transformer implementation with an integrated Squeeze and Excitation (SE) gating mechanism. This model maintained the binary classification objective but introduced several key innovations:

• Implementation of an end-to-end differentiable pipeline with full control over all architectural components
• Introduction of the novel SE gating module to automatically identify and prioritize the most diagnostically informative MRI slices

The SE module specifically addressed the "all slices are equal" limitation of conventional ViTs, enabling the model to focus computational resources on medial temporal regions where AD pathology is most evident. Binary classification performance improved significantly with this approach, achieving a 7.2% increase in F1 score compared to E-0.
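The patch tokenization described in §4.3 can be sketched as below. This is a NumPy illustration using a random projection matrix; the function name and parameters are ours, and the learned projection in the real model is a trained linear layer.

```python
import numpy as np

def patchify(slice_img, patch=16, d_model=768, proj=None, rng=None):
    """Split a square slice into non-overlapping patches and embed each
    patch linearly, as in a standard ViT patch projection."""
    H, W = slice_img.shape
    assert H % patch == 0 and W % patch == 0
    n = (H // patch) * (W // patch)               # number of tokens per slice
    # Rearrange (H, W) -> (n_patches, patch*patch)
    patches = (slice_img
               .reshape(H // patch, patch, W // patch, patch)
               .transpose(0, 2, 1, 3)
               .reshape(n, patch * patch))
    if proj is None:                              # random projection for the sketch
        rng = rng or np.random.default_rng(0)
        proj = rng.standard_normal((patch * patch, d_model)) / np.sqrt(patch * patch)
    return patches @ proj                         # (n, d_model) token embeddings

tokens = patchify(np.zeros((128, 128)))
# A 128x128 slice with 16x16 patches gives 64 tokens of dimension 768
```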
4.5.3. Experiment E-2: Direct Multi-class Progression Staging
Experiment E-2 extended our SE-ViT architecture to perform direct four-way classification, distinguishing between the Nondemented, Very Mild, Mild, and Moderate-to-Severe Dementia classes. The model architecture remained consistent with E-1, with the following modifications:

• Replacement of the sigmoid output layer with a 4-class softmax
• Implementation of class-weighted cross-entropy loss to account for class imbalance
• Additional regularization via label smoothing (ε = 0.1) to enhance generalization
• A modified learning rate schedule with a longer warm-up period to stabilize multi-class training

This experiment tested the hypothesis that direct multi-class staging could leverage inter-class relationships and subtle progression markers that might be lost in binary classification. While overall accuracy remained competitive, we observed challenges in differentiating between adjacent severity classes, particularly between the Very Mild and Mild categories.

4.5.4. Experiment E-3: Hierarchical Staging Approach
Our final experiment implemented a hierarchical classification strategy that mirrors clinical diagnostic procedures. The SE-ViT architecture first performed binary classification (demented vs. non-demented), and then, only for subjects classified as demented, further differentiated between the Very Mild, Mild, and Moderate-to-Severe categories using a secondary three-class softmax.

This approach offered several advantages:

• Better alignment with the clinical decision process
• Mitigation of class-imbalance effects by separating out the initial binary decision
• Specialized feature representations for severity staging
• Parameter sharing between stages, reducing overall model complexity

In particular, the hierarchical classifier achieved superior performance for the challenging Very Mild class (a 9% improvement in recall), which has particular clinical value for early intervention. The model maintained high precision across all severity levels while reducing false-positive classifications within the critical mild cognitive impairment spectrum.

Each experiment was carried out using identical train/validation splits (80%/20%) stratified by class to ensure fair comparison. All models were trained until convergence, with 5–20 epochs.

Figure 8. Experiments
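The E-2 loss modifications (class weighting plus label smoothing) can be sketched as follows. The inverse-frequency weighting scheme shown here is a common default and an assumption on our part, as the exact weights are not specified; the class counts in the example are purely illustrative.

```python
import numpy as np

def weighted_smoothed_ce(logits, label, class_counts, eps=0.1):
    """Class-weighted cross-entropy with label smoothing for C classes."""
    C = logits.shape[0]
    # Softmax probabilities (shifted for numerical stability)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Smoothed target: 1 - eps on the true class, eps spread over the rest
    t = np.full(C, eps / (C - 1))
    t[label] = 1.0 - eps
    # Inverse-frequency class weight (assumed scheme), normalized to mean 1
    wts = class_counts.sum() / (C * class_counts)
    return -wts[label] * (t * np.log(p)).sum()

# Illustrative counts: 4 classes with heavy imbalance in the rarest stage
counts = np.array([190.0, 70.0, 60.0, 16.0])
loss = weighted_smoothed_ce(np.array([2.0, 0.5, 0.2, -1.0]),
                            label=3, class_counts=counts)
```

Because the rarest class receives the largest weight, a misclassified Moderate case contributes far more to the loss than a misclassified Nondemented case with the same logits.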
5. Results
5.0.1. Experiment E-0: Baseline Assessment Result
Our baseline experiment served as the initial benchmark to evaluate how well existing ViT architectures could generalize to our curated OASIS-2 subset. The model achieved an accuracy of 64.1% and an F1 score of 0.643 on the held-out test set. The corresponding confusion matrix is shown in Figure 9.

Figure 9. Confusion matrix for baseline ViT model on test set.

From the matrix, we observe that the model correctly identified 15 out of 23 Nondemented patients (true negatives) and misclassified 8 as Demented (false positives). Among 16 Demented patients, 10 were correctly classified (true positives), while 6 were misclassified as Nondemented (false negatives).

These results highlight a noticeable asymmetry in classification performance. While the model showed moderate ability to detect dementia, it struggled with specificity, as evidenced by the high number of false positives. This underperformance likely stems from the model's lack of slice-specific attention, which we introduced in our next experiment.

5.0.2. Experiment E-1: SE-ViT Binary Classification Result
In Experiment E-1, we introduced our custom Slice-Aware Vision Transformer model enhanced with a Squeeze and Excitation (SE) gating mechanism. Unlike the baseline ViT, this model was trained end-to-end with full control over the preprocessing pipeline, slice selection, and architectural parameters.

The SE-ViT model achieved a notable improvement in performance, with an accuracy of 71.7% and an F1 score of 0.716 on the held-out test set. The corresponding confusion matrix is shown in Figure 10.

Figure 10. Confusion matrix for SE-ViT model (E-1) on test set.

The model correctly classified 21 out of 23 Nondemented patients (true negatives) and misclassified only 2 as Demented (false positives). Among the 16 Demented patients, 9 were correctly identified (true positives), while 7 were misclassified as Nondemented (false negatives).

Compared to the baseline (E-0), the SE-ViT model significantly reduced false positives, demonstrating improved specificity. This improvement is attributed to the SE module's ability to rank and prioritize diagnostically informative axial slices, primarily from the medial temporal lobe, during training and inference.

These results confirm that incorporating slice-level attention not only aligns better with neuropathological expectations but also enhances the model's ability to focus on disease-relevant patterns while ignoring non-contributory slices.

Figure 11 shows the frequency with which different axial slices were selected by the SE module during validation. Notably, slices near indices 140–180 were chosen most often, aligning with the location of medial temporal structures like the hippocampus, which are known to exhibit early signs of atrophy in Alzheimer's disease.

Figure 11. Most frequently selected slices by SE module on the validation set. Peaks around slices 140–180 reflect medial temporal lobe importance.
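The E-0 metrics can be reproduced directly from the confusion-matrix counts reported in §5.0.1. The short sketch below recovers both numbers; the support-weighted F1 averaging is our inference from the reported 0.643.

```python
# Reproduce the E-0 metrics from its confusion-matrix counts:
# Nondemented: 15 correct, 8 misclassified; Demented: 10 correct, 6 misclassified.
tp, fn = 10, 6   # Demented (positive class)
tn, fp = 15, 8   # Nondemented (negative class)

accuracy = (tp + tn) / (tp + tn + fp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Per-class F1, then support-weighted average over the 39 test subjects
f1_dem = f1(tp / (tp + fp), tp / (tp + fn))      # Demented, support 16
f1_non = f1(tn / (tn + fn), tn / (tn + fp))      # Nondemented, support 23
f1_weighted = (16 * f1_dem + 23 * f1_non) / 39

print(round(accuracy, 3), round(f1_weighted, 3))  # 0.641 0.643
```

The match with the reported 64.1% accuracy and 0.643 F1 suggests the paper's F1 scores are support-weighted averages over both classes.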
5.0.3. Experiment E-2: Direct Multi-class Progression Staging Result
In Experiment E-2, we adapted the SE-ViT architecture to perform direct four-way classification of Alzheimer's disease progression using the Clinical Dementia Rating (CDR) scale. The model predicted one of four stages: Nondemented (CDR 0), Very Mild (CDR 0.5), Mild (CDR 1), and Moderate or worse (CDR 2+). This setup sought to evaluate the model's ability to distinguish fine-grained disease severity levels directly.

The model achieved an accuracy of 64.1% and an F1 score of 0.58 on the validation set. The corresponding confusion matrix is shown in Figure 12.

Figure 12. Confusion matrix for SE-ViT (E-2) in direct 4-class progression staging.

The model correctly classified 20 out of 22 Nondemented patients. However, performance degraded across the dementia spectrum: only 5 out of 12 Very Mild cases were correctly identified, with the rest misclassified as Nondemented; all Mild cases were misclassified, predominantly as Nondemented; and no Moderate cases were detected.

While overall accuracy remained comparable to the baseline, the model struggled to separate adjacent CDR stages. This confusion is likely due to subtle anatomical differences between early-stage classes, which the model found difficult to resolve in a flat classification setup. The severe underperformance on Moderate cases further suggests a need for a stronger inductive bias or staging structure. These shortcomings motivated the hierarchical setup explored in Experiment E-3, which separates dementia detection and severity estimation into distinct, dedicated tasks.

5.0.4. Experiment E-3: Hierarchical Staging Result
In Experiment E-3, we adopted a hierarchical classification strategy that mirrors real-world diagnostic workflows: the model first performed binary classification (Demented vs. Nondemented), followed by a second-stage classifier that determined the severity level for demented cases. Both stages were trained jointly in a multi-head architecture, allowing parameter sharing and optimizing for both coarse detection and fine-grained staging.

The model achieved an accuracy of 74.0% and an F1 score of 0.72 on the validation set. The resulting 4-class confusion matrix is presented in Figure 13.

Figure 13. Confusion matrix for SE-ViT (E-3) hierarchical staging model.

The model correctly classified 21 out of 23 Nondemented subjects. Most notably, it achieved:

• Improved sensitivity to the Very Mild class: 6 out of 11 cases correctly classified (vs. 5/12 in E-2).
• Mild-class detection with limited accuracy: 1 correct, 3 misclassified as Nondemented or Very Mild.
• No Moderate cases detected, likely due to class imbalance and limited training samples for that category.

These results represent a significant improvement in early-stage detection, particularly for the Very Mild class, which holds high clinical value for early intervention. The separation of detection and staging allowed the model to specialize in both tasks independently, leading to a more interpretable and effective framework compared to flat softmax classification.
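The two-stage decision rule used in E-3 can be sketched as follows. This is a minimal NumPy illustration; the logits passed in are stand-ins for the outputs of the sigmoid and 3-way softmax heads attached to the shared CLS embedding described in §4.3, and the threshold of 0.5 is an assumed default.

```python
import numpy as np

STAGES = ["Nondemented", "Very Mild", "Mild", "Moderate+"]

def hierarchical_predict(binary_logit, severity_logits, threshold=0.5):
    """Stage 1: sigmoid Demented-vs-Nondemented gate.
    Stage 2: 3-way softmax over severity, applied only to Demented cases."""
    p_dem = 1.0 / (1.0 + np.exp(-binary_logit))
    if p_dem < threshold:
        return "Nondemented"
    # Softmax over the three severity levels (shifted for stability)
    sev = np.exp(severity_logits - severity_logits.max())
    sev /= sev.sum()
    return STAGES[1 + int(np.argmax(sev))]

# Usage: a confidently demented case whose severity head favors "Very Mild"
label = hierarchical_predict(2.0, np.array([1.5, 0.3, -0.8]))
# -> "Very Mild"
```

Note that the severity head is never consulted for subjects gated out at stage 1, which is how the hierarchy isolates the binary decision from the class-imbalanced staging problem.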
5.0.5. Overall Comparison Across Experiments
To consolidate insights from our experimental pipeline, we present a summary comparison in Figure 14. Each model reflects a stepwise refinement in architecture and training strategy, progressing from a baseline ViT to slice-aware and hierarchically structured approaches.

Figure 14. Experiment results

As seen in Figure 14, the baseline model (E-0) struggled with false positives and lacked contextual focus, motivating the integration of slice-level attention in E-1. The SE-ViT binary classifier (E-1) significantly improved specificity by leveraging medial temporal saliency. However, direct four-way staging in E-2 introduced confusion between adjacent classes, particularly Mild and Moderate. The hierarchical model (E-3) achieved the best balance between accuracy and clinical relevance, demonstrating improved sensitivity to early dementia stages and clearer separation of classification responsibilities. This stepwise evolution confirms the value of both anatomical priors and architectural structuring in medical imaging pipelines.

6. Discussion

6.1. Interpreting the Performance Gains
The experimental series demonstrates that explicit slice-level attention is a decisive factor in MRI-based Alzheimer's staging.

Figure 15. Interpreting the Performance Gains

These findings align with neuropathological evidence: the mesial temporal lobe manifests the earliest and most pronounced atrophy in AD. By allowing the SE gate to up-weight such slices, we implicitly encode domain knowledge without manual ROI segmentation, preserving end-to-end differentiability.

6.2. Clinical Relevance

• Early-stage sensitivity: SE-ViT improves recall for the Very Mild group, arguably the most clinically valuable cohort. Detecting such prodromal changes can facilitate earlier lifestyle or pharmacological interventions.
• Workflow synergy: the hierarchical model mirrors the diagnostic pipeline, in which radiologists first confirm "presence of dementia" (binary) before assigning severity (ordinal). Embedding this structure in the model yields a more intuitive user experience.
6.3. Limitations
1. Cohort Size & Demographics: OASIS-2 involves
≈150 subjects, predominantly Caucasian. This limits
statistical power and external validity across ethnic
groups or scanner vendors.
2. Longitudinal Ignorance: Our current framework
treats each session independently; temporal progression cues (atrophy trajectory) are not explicitly modeled.
3. Single Modality Constraint: MRI alone may not capture metabolic changes detectable via PET or CSF
biomarkers; multimodal fusion could enhance staging
accuracy.
4. Hyperparameter Sensitivity: Top-k = 3 was selected
heuristically. Although ablation shows it outperforms
k = 1, broader search (k = 2–5) may further optimize
the trade-off between information retention and noise.
5. Potential Slice Order Bias: Axial ordering is fixed;
pathologies in oblique planes might be overlooked. Incorporating multi-plane slices (axial, coronal, sagittal)
may provide a fuller context.
6.4. Future Work
• Temporal Transformers: Explore models like TimeSformer or recurrent Vision Transformers to capture how
a patient’s condition changes over time across multiple
scan sessions.
• Multimodal Extension: Combine MRI with other
data sources such as FDG-PET scans and genetic information using cross-modal attention to build a more
comprehensive diagnostic tool.
• Uncertainty Quantification: Add methods such as Monte Carlo dropout or deep ensembles to estimate how confident the model is in its predictions, an important step toward clinical reliability.
• Broader Applications: Our combined approach of SE
and Vision Transformers can also be applied to other
medical fields that use 3D scan data, such as CT or
MRI scans for cancer detection or orthopedic analysis.
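The uncertainty-quantification idea above can be illustrated with a toy Monte Carlo dropout loop: average the softmax predictions of many stochastic forward passes and use their spread as an uncertainty estimate. The single dropout-plus-linear "model" here is purely a stand-in of our own devising, not the SE-ViT.

```python
import numpy as np

def mc_dropout_predict(x, W, n_samples=100, p_drop=0.5, rng=None):
    """Toy MC dropout: average softmax predictions over stochastic
    forward passes of a single dropout + linear layer."""
    rng = rng or np.random.default_rng(0)
    probs = []
    for _ in range(n_samples):
        mask = rng.random(x.shape) > p_drop       # units kept this pass
        h = (x * mask) / (1 - p_drop)             # inverted-dropout scaling
        logits = h @ W
        e = np.exp(logits - logits.max())         # stable softmax
        probs.append(e / e.sum())
    probs = np.stack(probs)
    # Mean = prediction; per-class std = a simple uncertainty estimate
    return probs.mean(axis=0), probs.std(axis=0)

rng = np.random.default_rng(1)
mean_p, std_p = mc_dropout_predict(rng.standard_normal(32),
                                   rng.standard_normal((32, 4)))
```

A high per-class standard deviation flags predictions that should be deferred to a clinician rather than trusted automatically.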
7. Conclusion
To conclude, the proposed SE-ViT pipeline advances MRI-based Alzheimer's staging by combining slice-level saliency learning with global transformer reasoning. Its SE module highlights diagnostically relevant slices without manual ROI selection, while the hierarchical classification approach improves early-stage detection, particularly for Very Mild cases. Moreover, its modular and generalizable design extends to other 3D imaging tasks such as tumor staging, orthopedic analysis, and organ assessment, supporting future integration with multimodal data and temporal modeling.
8. Contributions
Dr. Murtaza Taj supervised the research, provided expert guidance throughout the project, and offered critical feedback on the model design and evaluation. Sheraz and Umair contributed equally to the work, each designing and conducting two key experiments, including model training, analysis, and result interpretation. Yahya Khawaja offered architectural insights and advised on the training pipeline and experimental setup. All authors discussed the results and contributed to the final manuscript.
References

Abunadi, I. (2022). Deep and hybrid learning of MRI diagnosis for early detection of the progression stages in Alzheimer's disease. Connection Science, 34(1). https://doi.org/10.1080/-

Malik, I., Iqbal, A., Gu, Y. H., & Al-Antari, M. A. (2024). Deep learning for Alzheimer's disease prediction: A comprehensive review. Diagnostics, 14(12), 1281. https://doi.org/10.3390/diagnostics-

Li, H., Habes, M., Wolk, D. A., & Fan, Y. (2019). A deep learning model for early prediction of Alzheimer's disease dementia based on hippocampal magnetic resonance imaging data. Alzheimer's & Dementia, 15(8). https://doi.org/10.1016/j.jalz-

Kim, M., Kim, J., Qu, J., Huang, H., Long, Q., Sohn, K., Kim, D., & Shen, L. (2021). Interpretable temporal graph neural network for prognostic prediction of Alzheimer's disease using longitudinal neuroimaging data. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). https://doi.org/10.1109/bibm-

Ocasio, E., & Duong, T. Q. (2021). Deep learning prediction of mild cognitive impairment conversion to Alzheimer's disease at 3 years after diagnosis using longitudinal and whole-brain 3D MRI. PeerJ Computer Science, 7, e560. https://doi.org/10.7717/peerj-cs.560

fawadkhan. (n.d.). ViT FineTuned on ImagesOASIS. Hugging Face. https://huggingface.co/fawadkhan/ViT_FineTuned_on_ImagesOASIS