Please use this identifier to cite or link to this item: https://repositori.mypolycc.edu.my/jspui/handle/123456789/9932
Full metadata record
DC Field | Value | Language
dc.contributor.author | Huang, Jiasheng | -
dc.contributor.author | Li, Huan | -
dc.contributor.author | Mo, Xinyue | -
dc.date.accessioned | 2026-05-08T07:35:35Z | -
dc.date.available | 2026-05-08T07:35:35Z | -
dc.date.issued | 2025-09-26 | -
dc.identifier.other | doi.org/10.3390/electronics14193828 | -
dc.identifier.uri | https://repositori.mypolycc.edu.my/jspui/handle/123456789/9932 | -
dc.description.abstract | The widespread emergence of multimodal data on social platforms has presented new opportunities for sentiment analysis. However, previous studies have often overlooked the loss of detail during modal interaction fusion, and they remain limited in handling semantic alignment and the sensitivity of individual modalities to noise. To enhance analytical accuracy, a novel model named MAHFNet is proposed, composed of three main components. First, an attention-guided gated interaction alignment module models the semantic interaction between text and image using a gated network and a cross-modal attention mechanism; a contrastive learning objective additionally encourages semantically aligned image-text pairs to cluster together. Second, an intra-modality emotion extraction module extracts local emotional features within each modality, compensating for the detail lost during interaction fusion. Third, the intra-modal local emotion features and cross-modal interaction features are fed into a hierarchical gated fusion module, where the local features are fused through a cross-gated mechanism that dynamically adjusts the contribution of each modality while suppressing modality-specific noise; the fused result and the cross-modal interaction features are then combined by a multi-scale attention gating module that captures hierarchical dependencies between local and global emotional information, enhancing the model's ability to perceive and integrate emotional cues across multiple semantic levels. Finally, extensive experiments on three public multimodal sentiment datasets demonstrate that the proposed model outperforms existing methods across multiple evaluation metrics: on the TumEmo dataset it improves on the second-best method by 2.55% in ACC and 2.63% in F1 score; on the HFM dataset the gains are 0.56% in ACC and 0.9% in F1 score; and on the MVSA-S dataset, 0.03% in ACC and 1.26% in F1 score. These findings collectively validate the overall effectiveness of the proposed model. | ms_IN
dc.language.iso | en | ms_IN
dc.publisher | MDPI | ms_IN
dc.relation.ispartofseries | Electronics; 2025, 14, 3828 | -
dc.subject | Multimodal | ms_IN
dc.subject | Multimodal sentiment analysis | ms_IN
dc.subject | Multi-level fusion | ms_IN
dc.subject | Gated networks | ms_IN
dc.subject | Contrastive learning | ms_IN
dc.title | Multimodal Alignment and Hierarchical Fusion Network for Multimodal Sentiment Analysis | ms_IN
dc.type | Article | ms_IN
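
The abstract above describes a cross-gated fusion step in which each modality gates the other's features so that modality contributions are adjusted dynamically and modality-specific noise is suppressed. The following is a minimal illustrative sketch of that general idea only; the module name, dimensions, and exact gating form are assumptions for illustration and are not the authors' MAHFNet implementation.

```python
# Illustrative sketch of a generic cross-gated fusion of text and image
# feature vectors. All names and dimensions are assumed, not taken from
# the paper's code.
import torch
import torch.nn as nn


class CrossGatedFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Each modality produces a sigmoid gate for the other modality's features.
        self.gate_from_text = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_from_image = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # Gate image features with a signal derived from the text, and vice versa;
        # feature channels the other modality treats as unreliable are attenuated.
        gated_image = self.gate_from_text(text_feat) * image_feat
        gated_text = self.gate_from_image(image_feat) * text_feat
        # Concatenate the mutually gated features and project back to `dim`.
        return self.proj(torch.cat([gated_text, gated_image], dim=-1))


if __name__ == "__main__":
    fusion = CrossGatedFusion(dim=256)
    text = torch.randn(8, 256)   # batch of hypothetical text embeddings
    image = torch.randn(8, 256)  # batch of hypothetical image embeddings
    print(fusion(text, image).shape)  # torch.Size([8, 256])
```

In this sketch the gating is elementwise, so either modality can down-weight individual feature channels of the other rather than the whole modality; the paper's hierarchical design additionally feeds the fused output and the cross-modal interaction features into a multi-scale attention gating module, which is not reproduced here.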
Appears in Collections: JABATAN KEJURUTERAAN ELEKTRIK

Files in This Item:
File | Description | Size | Format
Multimodal Alignment and Hierarchical Fusion Network for.pdf | | 2.39 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.