Publications

Facial Expression Recognition (FER) has emerged as a pivotal research domain within computer vision and affective computing. Despite technological advances in deep learning architectures such as CNNs, VGG-16, ResNet-50, and MobileNet, FER systems continue to face critical challenges, particularly data imbalance, where some emotional categories are abundantly represented while others appear far less frequently. This imbalance leads to biased model performance and poor recognition rates for minority emotions, which are often the most crucial in sensitive applications. While numerous systematic reviews have examined fundamental FER aspects, including recognition techniques, deep learning methodologies, and general challenges, there is a notable absence of focused systematic literature reviews that comprehensively analyze imbalance-specific solutions within FER. Our analysis revealed four primary approaches to handling data imbalance in FER: 1) Loss Function approaches, the most prevalent due to their simplicity and computational efficiency; 2) Generative Network approaches, demonstrating the highest performance gains through sophisticated synthetic data generation; 3) Resampling approaches, offering intuitive solutions through oversampling and undersampling techniques; and 4) Learning approaches, employing multi-stage and ensemble architectures for sophisticated representation learning.
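The loss-function family above can be illustrated with a minimal sketch: a class-weighted cross-entropy in which minority emotions receive larger weights, so errors on rare classes contribute more to the loss. The class counts and weights below are hypothetical, not drawn from the reviewed studies.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Class-weighted cross-entropy: minority classes get larger
    weights, so mistakes on rare emotions cost more."""
    # probs: (N, C) softmax outputs; labels: (N,) integer class ids
    n = probs.shape[0]
    p_true = probs[np.arange(n), labels]            # probability of the true class
    w = class_weights[labels]                       # per-sample weight
    return float(np.mean(-w * np.log(p_true + 1e-12)))

# Inverse-frequency weighting: rare classes weigh more (hypothetical counts).
counts = np.array([5000, 500])
weights = counts.sum() / (len(counts) * counts)     # -> [0.55, 5.5]
```

With these inverse-frequency weights, a misclassified minority-class sample contributes roughly ten times as much loss as a majority-class one, which is the bias-correcting effect the reviewed loss-function approaches exploit.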
 
This systematic literature review charts Mamba research in medical imaging. A preregistered Scopus search followed by strict quality appraisal yielded 85 peer-reviewed papers. Evidence shows that Mamba-enhanced networks have been applied to eleven tasks: regression, radiology-report generation, super-resolution, image translation, fusion, denoising, enhancement, reconstruction, segmentation, classification, and registration, with segmentation dominating the corpus. Fourteen imaging modalities are represented; MRI, computed tomography, dermoscopy, endoscopy, and fundus photography appear most frequently, confirming the architecture’s versatility across clinical domains. Mamba is typically integrated following one of five strategies. Convolution–Mamba hybrids remain the workhorse, while attention- and diffusion-based variants lower GPU memory and sampling cost. Recurrence-augmented designs and specialised pairings with graph, spiking-neuron, or physics-informed modules address motion, temporal coherence, or energy efficiency. Despite these advances, studies converge on five critical limitations: 1) memory and computation overhead, 2) poor scalability to full 3-D volumes, 3) unmet real-time or bedside latency, 4) vulnerability to noise, low-dose, and sparse sampling, and 5) instability under respiratory or cardiac motion. Future work consistently points to: high-resolution 3-D pipelines that preserve Mamba’s linear complexity; aggressive model compression for edge deployment; robust domain generalisation with continual learning; tighter coupling with complementary deep-learning architectures to encode structural or physical priors; and data-efficient training that exploits weak labels and synthetic augmentation. Addressing these directions will translate Mamba from a promising sequence backbone into a reliable, resource-aware engine for diagnostic, interventional, and robotic imaging workflows.
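The linear complexity that makes Mamba attractive for long imaging sequences comes from its selective state-space recurrence. A minimal sketch, assuming a diagonal state matrix and illustrative shapes (not any reviewed paper's implementation):

```python
import numpy as np

def selective_scan(x, A_bar, B_bar, C):
    """Linear-time scan of a discretized state-space model:
    h_t = A_bar[t] * h_{t-1} + B_bar[t] * x[t],  y_t = C[t] . h_t.
    Shapes: x (T,), A_bar/B_bar/C (T, d_state). Cost is O(T),
    versus O(T^2) for self-attention over the same sequence."""
    T, d = B_bar.shape
    h = np.zeros(d)
    y = np.empty(T)
    for t in range(T):
        h = A_bar[t] * h + B_bar[t] * x[t]   # elementwise update (diagonal A)
        y[t] = C[t] @ h                       # readout
    return y
```

Because A_bar, B_bar, and C are themselves functions of the input in Mamba, the recurrence can selectively retain or forget context while still processing each step in constant time.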
 
Urban slums present critical challenges for sustainable development, particularly in rapidly urbanizing cities like Makassar, Indonesia. This study develops an automated slum mapping approach that integrates high-resolution SPOT-6/7 satellite imagery (1.5-m spatial resolution) with multimodal geospatial data using a U-Net convolutional neural network. Our methodology combines spectral and textural features from satellite imagery with nighttime light emissions, infrastructure proximity analysis, land use classifications, and socioeconomic indicators. The integrated approach achieves an overall accuracy of 97.1%–98.3% across both datasets. However, slum-specific classification remains challenging, with producer’s accuracy of 55.8%–59.1% and user’s accuracy of 22.9%–35.7%, yielding F1 scores of 0.33–0.43 for slum detection. Despite these limitations, the approach demonstrates significant improvements over traditional census-based methods through automated processing, finer spatial resolution (1.5 m versus administrative units), and higher temporal frequency (annual versus decadal updates). The framework provides actionable insights for urban planning and social assistance targeting while establishing a foundation for the iterative improvement of automated slum monitoring systems.
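The reported F1 range follows from the accuracy-assessment convention that user's accuracy plays the role of precision and producer's accuracy the role of recall. Which bounds pair together across the two datasets is not stated above, so the example pairing below is an assumption:

```python
def f1_score(users_accuracy, producers_accuracy):
    """F1 as the harmonic mean of precision (user's accuracy)
    and recall (producer's accuracy)."""
    p, r = users_accuracy, producers_accuracy
    return 2 * p * r / (p + r)
```

For instance, f1_score(0.229, 0.591) gives roughly 0.33, matching the lower end of the reported range; the low user's accuracy (many false slum detections) is what pulls F1 down despite the high overall accuracy.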
Multi-temporal satellite data for agricultural land segmentation poses significant computational challenges when processing extended temporal sequences: convolutional neural networks are constrained by local receptive fields, whereas Transformers incur quadratic complexity in their self-attention mechanisms. These constraints particularly impede the precise identification of spectrally analogous crop varieties and minority classes within agricultural landscapes. We present GWSC-SegMamba, an innovative hybrid architecture that integrates State Space Models (SSMs) with Gate Wavelet Convolution (GWC) and Gate Spatial Convolution (GSC) components to tackle multi-temporal agricultural segmentation. The GWC component conducts multi-resolution analysis with discrete wavelet transforms to address the spatial resolution constraints of medium-resolution satellite data, whereas the GSC component captures the spatial correlations essential for delineating crop boundaries. Our thorough assessment on three benchmark datasets (Munich, Lombardia, and PASTIS) reveals substantial performance enhancements: a 7.65% increase in mIoU relative to the conventional SegMamba, a 4.43% improvement over the Swin-UNETR baseline, and a 53.13% gain in IoU for difficult minority classes, including winter triticale. The design processes temporal sequences at linear computational cost, incurring about 8–10% additional overhead relative to the basic SegMamba while demonstrating better scalability than transformer-based approaches. These findings support enhanced crop monitoring systems vital for precision agriculture, especially in differentiating the spectrally analogous crop varieties needed for yield estimation and sustainable land management decisions.
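The gating idea behind the GSC component can be sketched in one dimension; the kernels and function names here are illustrative, not the paper's exact design. A sigmoid-activated branch modulates a feature branch elementwise, letting the network suppress or pass spatial responses:

```python
import numpy as np

def gated_spatial_conv(x, w_feat, w_gate):
    """Illustrative gating pattern: out = sigmoid(conv_gate(x)) * conv_feat(x).
    The gate acts as a learned soft mask over spatial positions."""
    feat = np.convolve(x, w_feat, mode="same")                      # feature branch
    gate = 1.0 / (1.0 + np.exp(-np.convolve(x, w_gate, mode="same")))  # gate branch
    return gate * feat                                               # gated output
```

In the 2-D case the same pattern applies per channel; the GWC variant described above would additionally split the input into wavelet subbands before convolving, trading spatial for frequency resolution.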
Facial Expression Recognition (FER) is a task that recognizes the expression or emotion of a person from their face, enabling computers to identify the mood and emotions of individuals. FER in real-world scenarios remains challenging due to variations in the many parameters captured by the image sensor. Many approaches have been proposed to improve FER in real-world scenarios; one of them is the use of an ensemble model. A weighted ensemble combines member models by assigning each a weight. However, choosing the weight values is left to the researcher, which raises the further problem of how to determine suitable weights for the weighted ensemble. In this research, we propose a novel weighted voting scheme inspired by the Prisoner's Dilemma. In our experiments, the proposed method achieved an accuracy of 83.25% and an F1 score of 75.02% on the RAFDB dataset, and an accuracy of 64.73% and an F1 score of 63.07% on the FER2013 dataset, outperforming other state-of-the-art methods and our baseline.
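The voting step itself is straightforward; the contribution lies in how the weights are chosen. A minimal sketch of weighted soft voting, where the weights are placeholders for those the Prisoner's Dilemma-inspired scheme would produce:

```python
import numpy as np

def weighted_soft_vote(prob_list, weights):
    """Weighted ensemble: average each member's class probabilities
    with its weight, then predict the argmax class.
    prob_list: list of (N, C) probability arrays, one per model."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                               # normalize weights
    stacked = np.stack(prob_list)                 # (M, N, C)
    combined = np.tensordot(w, stacked, axes=1)   # weighted average: (N, C)
    return combined.argmax(axis=1), combined
```

Shifting the weight vector shifts which model dominates the vote, which is exactly why an unprincipled manual choice of weights is a problem the proposed scheme aims to remove.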
 
Change detection is a remote sensing task that detects changes between two satellite images of the same area taken at different times. It is one of the most difficult remote sensing tasks because the change to be detected (real change) is mixed with apparent changes (pseudo-change) caused by differences between the two images, such as brightness, humidity, and seasonal conditions. The emergence of the Vision Transformer (ViT) as a new standard in computer vision, replacing the Convolutional Neural Network (CNN), has also shifted the role of CNNs in remote sensing. Although ViT can capture long-range interactions between image patches, its computational complexity increases quadratically with the number of patches. One way to reduce this complexity is to shrink the key and value matrices in the self-attention (SA) mechanism. However, this causes information loss, resulting in a trade-off between the effectiveness and efficiency of the method. To address this problem, we developed a new change detection method called WaveCD. WaveCD uses Wave Attention (WA) instead of SA. WA uses Discrete Wavelet Transform (DWT) decomposition to reduce the key and value matrices. Besides reducing the data size, DWT decomposition also extracts important features that represent the images, so the initial data can be approximated through the Inverse Discrete Wavelet Transform (IDWT). On the CDD dataset, WaveCD outperforms the state-of-the-art CD method SwinSUNet by 12.3% in IoU and 7.3% in F1 score, while on the LEVIR-CD dataset, WaveCD outperforms SwinSUNet by 4% in IoU and 2.5% in F1 score.
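The core idea of Wave Attention, reducing key and value length with a DWT before standard attention, can be sketched with a one-level Haar transform. This is an illustrative single-head version under that assumption, not the paper's implementation:

```python
import numpy as np

def haar_approx(x):
    """One-level Haar DWT along the sequence axis: keep only the
    low-frequency (approximation) coefficients, halving the length."""
    return (x[0::2] + x[1::2]) / np.sqrt(2.0)

def wave_attention(q, k, v):
    """Attention against Haar-reduced keys/values: scores are computed
    over L/2 positions instead of L, cutting the L x L attention cost."""
    k_a, v_a = haar_approx(k), haar_approx(v)         # (L/2, d)
    scores = q @ k_a.T / np.sqrt(q.shape[-1])         # (L, L/2)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)    # softmax over L/2 keys
    return attn @ v_a                                 # (L, d)
```

Because the Haar approximation coefficients summarize each pair of adjacent positions, the discarded detail coefficients are what an IDWT would need to reconstruct the original sequence exactly, which is the effectiveness/efficiency trade-off the abstract describes.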