CNN, Transformer, and Mamba: Which Architecture Best Suits PPG-Based Emotion Recognition?
Wearable Affective Computing Faces an Architecture Showdown
With the proliferation of smartwatches, fitness bands, and other wearable devices, emotion recognition from photoplethysmography (PPG) signals has become an active research direction in affective computing. PPG sensors are inexpensive, easy to integrate, and already widely embedded in consumer-grade devices. However, as new deep learning architectures continue to emerge, a critical question remains open: are long-range sequence models, which have excelled in natural language processing and general time-series tasks, truly suitable for PPG-based emotion recognition?
A recent paper posted to arXiv (arXiv:2604.26078v1) addresses this question systematically. Adopting a "measurement-driven" perspective, the research team compared three mainstream deep learning architectures on PPG-based emotion recognition tasks: Convolutional Neural Networks (CNN), Transformer, and Mamba.
Core Differences Among the Three Architectures
These three architectures represent three distinct technical approaches to processing sequential data in deep learning:
- CNN (Convolutional Neural Network): Extracts features through local convolutional kernels, excelling at capturing local patterns and short-range dependencies in signals. CNNs have a well-established track record in physiological signal processing.
- Transformer: Based on the self-attention mechanism, Transformers can model long-range dependencies between arbitrary positions in a sequence. Since their introduction in 2017, Transformers have demonstrated powerful capabilities across NLP, computer vision, and time-series analysis.
- Mamba: As the latest representative of Structured State Space Models (SSMs), Mamba achieves efficient long-sequence modeling while maintaining linear computational complexity, positioning it as a strong competitor to the Transformer.
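To make the contrast concrete, here is a minimal pure-Python sketch of the three ways of mixing information along a sequence. It is an illustration only, not the models evaluated in the paper: the function names, the toy sine "signal", and all parameter values are assumptions, and each function is stripped down to its core idea (no learned weights, scalar samples instead of feature vectors).

```python
import math


def conv1d(x, kernel):
    """Local view (CNN-style): each output sees only len(kernel) neighbouring samples."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]


def self_attention(x):
    """Global view (Transformer-style): every position is compared with every other.

    Scalar tokens and identity projections are used for brevity, so the work
    grows with len(x) ** 2.
    """
    out = []
    for query in x:
        scores = [query * key for key in x]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out


def ssm_scan(x, a=0.9, b=0.1, c=1.0):
    """Recurrent view (state-space style): h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    This is a fixed-parameter linear scan with cost linear in len(x). Mamba goes
    further by making the state-space parameters depend on the input.
    """
    h, out = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        out.append(c * h)
    return out


# Hypothetical toy input: 50 samples of a sine wave standing in for a PPG trace.
signal = [math.sin(2 * math.pi * 1.2 * t / 25) for t in range(50)]
local = conv1d(signal, [1 / 3, 1 / 3, 1 / 3])
global_mix = self_attention(signal)
recurrent = ssm_scan(signal)
print(len(local), len(global_mix), len(recurrent))
```

The convolution only ever looks at a three-sample neighbourhood, the attention function touches every pair of samples, and the scan carries a single running state forward, which is the structural difference the comparison in the paper turns on.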
Research Methodology and Key Findings
A major highlight of this study is its "measurement-driven" experimental design. Rather than simply running benchmarks on a single dataset, the research team systematically evaluated the performance of different architectures under various experimental conditions, grounded in the physical characteristics of PPG signals and the actual quality of emotion annotations.
Based on the published research framework, the study focuses on several core dimensions:
- The Necessity of Long-Range Dependencies: How long a temporal window do emotion-relevant features in PPG signals actually span? Does long-range modeling capability genuinely translate into performance gains?
- Trade-offs Between Computational Efficiency and Accuracy: Does Mamba's linear complexity advantage hold practical significance in PPG scenarios? Does the Transformer's quadratic complexity pose a real bottleneck?
- Architectural Robustness: Which architecture proves more resilient in the face of common PPG challenges such as motion artifacts and inter-subject variability?
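The efficiency question can be framed with a back-of-the-envelope operation count. The sketch below is a rough illustration under stated assumptions, not a measurement from the paper: the 64 Hz sampling rate, the window lengths, and the values of `d_model`, `kernel_size`, and `state_size` are all hypothetical, and the formulas count only the dominant multiply-accumulate term of each sequence-mixing step.

```python
def sequence_length(sample_rate_hz, window_seconds):
    """Number of samples in one analysis window."""
    return int(sample_rate_hz * window_seconds)


def rough_costs(seq_len, d_model=64, kernel_size=7, state_size=16):
    """Order-of-magnitude multiply-accumulate counts for one mixing layer.

    All default sizes are hypothetical. Constant factors and projection
    layers are ignored; only the growth with seq_len matters here.
    """
    return {
        "cnn": seq_len * kernel_size * d_model,      # linear in seq_len
        "attention": seq_len * seq_len * d_model,    # quadratic in seq_len
        "ssm": seq_len * state_size * d_model,       # linear in seq_len
    }


# Hypothetical window lengths at an assumed 64 Hz sampling rate.
for seconds in (5, 30, 120):
    n = sequence_length(64, seconds)
    costs = rough_costs(n)
    print(f"{seconds:>4d} s window, {n:>5d} samples: "
          + ", ".join(f"{name}={value:,}" for name, value in costs.items()))
```

Doubling the window doubles the CNN and SSM terms but quadruples the attention term, so whether the quadratic cost is a real bottleneck depends mainly on how long a window the task actually needs.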
The value of this research lies in its refusal to blindly champion novel architectures. Instead, it examines whether the technology transfer actually works, based on data and experimentation. In many physiological signal processing tasks, classical CNN architectures are more than adequate. Whether newer long-range models deliver tangible benefits is precisely the question this study seeks to answer.
Implications for Wearable Affective Computing
This study is a useful reference for both industry and academia:
For device manufacturers, architecture selection directly impacts power consumption and latency when deploying emotion recognition models on edge devices. If a lightweight CNN can meet requirements, there is no need to increase computational overhead in pursuit of cutting-edge architectures.
For researchers, this study provides a standardized comparison framework. In physiological signal processing, blindly applying the latest architectures from NLP or computer vision is not always optimal; the physical characteristics of the signal and the requirements of the task should be the primary basis for architecture selection.
For the affective computing community, PPG-based emotion recognition still faces numerous challenges, including annotation noise, cross-subject generalization, and multimodal fusion. Architecture choice is only one component of system design; data quality and experimental design are equally critical.
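One common way to probe cross-subject generalization is a leave-one-subject-out split, in which every recording from one participant is held out for testing while the model is trained on everyone else. The sketch below is a generic illustration of that protocol, not a description of the paper's evaluation setup; the function name and the subject labels are hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical example: six PPG segments recorded from three participants.
subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_subject_out(subjects))
for held_out, train, test in folds:
    print(f"test on {held_out}: train={train} test={test}")
```

Because no participant appears on both sides of a fold, the score reflects how well a model transfers to an unseen wearer rather than how well it memorizes individual physiology.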
Future Outlook
As novel state space models like Mamba undergo rapid iteration, and as Transformers continue to be optimized for long-sequence modeling (e.g., linear attention, FlashAttention), the technology stack for PPG-based emotion recognition remains in active evolution. Future research is expected to focus on the following directions:
- Hybrid Architecture Design: Combining CNN's local feature extraction capabilities with the long-range modeling power of Transformers or Mamba
- Efficient On-Device Inference: Enabling real-time emotion sensing on resource-constrained devices such as smartwatches
- Cross-Scenario Generalization: Advancing robust emotion recognition from laboratory settings to real-world environments
This study holds up a clear mirror to the wearable affective computing field, reminding us to stay pragmatic and task-driven even as we embrace new technologies.
📌 Source: GogoAI News (www.gogoai.xin)
🔗 Original: https://www.gogoai.xin/article/cnn-transformer-mamba-ppg-emotion-recognition-comparison