Please use this identifier to cite or link to this item: https://repositori.mypolycc.edu.my/jspui/handle/123456789/6805
Full metadata record
dc.contributor.author: Deng, Guangfeng
dc.date.accessioned: 2025-10-13T04:04:38Z
dc.date.available: 2025-10-13T04:04:38Z
dc.date.issued: 2025-04-30
dc.identifier.issn: 2327-5227
dc.identifier.issn: 2327-5219
dc.identifier.other: doi.org/10.4236/jcc.2025.134020
dc.identifier.uri: https://repositori.mypolycc.edu.my/jspui/handle/123456789/6805
dc.description.abstract: Telling a story requires emotional ups and downs as well as pauses. Preparing a parallel corpus for emotional voice conversion is often costly and impractical, so developing high-quality non-parallel methods poses a significant challenge. Although non-parallel methods have been shown to enable emotional voice conversion, their capability for Chinese storytelling has not been clarified, and the storytelling results of emotional voice conversion have not been validated with children aged 3-12 years. This study proposes a two-stage Chinese Storytelling Style Speech Generation System (SSPGS) composed of a text-to-speech system and an emotional voice conversion module. The SSPGS requires no parallel utterances, transcriptions, or time-alignment procedures for training the speech generator, and needs only a few minutes of training examples to generate reasonably realistic-sounding speech. In the first stage, a neutral speech model is trained from a small corpus using a Hidden Markov Model (HMM)-based speech synthesis system. In the second stage, an emotional voice conversion module based on Cycle-Consistent Generative Adversarial Networks (CycleGAN) is built. It transforms the neutral speech generated by the first-stage text-to-speech system into the happy, angry, and sad tones needed for storytelling by converting the timbre (spectrum), pitch (fundamental frequency, F0), and rhythm (speech rate) of the neutral speech. The validity of SSPGS was verified in two ways. First, a 5-point Mean Opinion Score (MOS) evaluation was conducted with parents of young children. The results demonstrated that, compared with general speech synthesizers such as Google's, the system generated more natural and genuine sound that was more preferable to the target audience. The children then completed a story immersion evaluation.
Analysis of the degree of engagement, liking, and empathy while listening to the story revealed no statistically significant difference between real-person dubbing and emotional speech-synthesis dubbing. These results provide initial confirmation that SSPGS could be integrated into storytelling robot products in the future. [ms_IN]
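The abstract's second stage trains a CycleGAN with non-parallel neutral and emotional utterances, which relies on a cycle-consistency term to preserve content without time-aligned pairs. The paper's actual networks and features are not reproduced here; the sketch below is a toy illustration of that loss only, using linear "generators" and 24-dimensional feature frames (all names and dimensions are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the CycleGAN generators:
# G maps neutral-speech features to emotional features; F maps back.
W_g = np.eye(24) + 0.1 * rng.normal(size=(24, 24))
W_f = np.linalg.inv(W_g)  # ideal inverse, so the cycle loss is ~0 here

def G(x):
    return x @ W_g

def F(y):
    return y @ W_f

def cycle_consistency_loss(x_neutral, y_emotional):
    # L_cyc = E||F(G(x)) - x||_1 + E||G(F(y)) - y||_1
    # Penalizes any content lost on the round trip neutral -> emotional -> neutral.
    return (np.abs(F(G(x_neutral)) - x_neutral).mean()
            + np.abs(G(F(y_emotional)) - y_emotional).mean())

x = rng.normal(size=(100, 24))  # e.g. spectral frames of neutral speech
y = rng.normal(size=(100, 24))  # frames of emotional speech (unpaired with x)
loss = cycle_consistency_loss(x, y)
```

In training, G and F would be neural networks optimized jointly with adversarial losses; the cycle term is what lets the mapping be learned from non-parallel corpora.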
dc.language.iso: en [ms_IN]
dc.publisher: Scientific Research Publishing Inc. [ms_IN]
dc.relation.ispartofseries: Journal of Computer and Communications; 2025, 13(4), 324-346
dc.subject: Storytelling style speech generation system [ms_IN]
dc.subject: Emotional voice conversion module [ms_IN]
dc.subject: Cycle-consistent generative adversarial networks [ms_IN]
dc.subject: Text-to-speech system [ms_IN]
dc.subject: Mean opinion score [ms_IN]
dc.subject: Immersion measurement [ms_IN]
dc.title: STORYTELLING STYLE SPEECH GENERATION SYSTEM: EMOTIONAL VOICE CONVERSION MODULE BASED ON CYCLE-CONSISTENT GENERATIVE ADVERSARIAL NETWORKS [ms_IN]
dc.type: Article [ms_IN]
Appears in Collections:JABATAN KEJURUTERAAN ELEKTRIK

Files in This Item:
Storytelling Style Speech Generation System.pdf (7.57 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.