Bilingual Experiments on Automatic Recovery of Capitalization and Punctuation of Automatic Speech Transcripts

Batista, F.; Moniz, H.; Trancoso, I.; Mamede, N.

Please use this identifier to cite or link to this record: http://hdl.handle.net/10071/7049

Full record

DC Field	Value	Language
dc.contributor.author	Batista, F.	-
dc.contributor.author	Moniz, H.	-
dc.contributor.author	Trancoso, I.	-
dc.contributor.author	Mamede, N.	-
dc.date.accessioned	2014-05-02T14:42:32Z	-
dc.date.available	2014-05-02T14:42:32Z	-
dc.date.issued	2012	-
dc.identifier	10.1109/TASL.2011.2159594	en_US
dc.identifier.issn	1558-7916	por
dc.identifier.uri	https://ciencia.iscte-iul.pt/public/pub/id/6531	en_US
dc.identifier.uri	http://hdl.handle.net/10071/7049	-
dc.description	WOS:000299525800012 (Web of Science Accession Number)	-
dc.description	“Prémio Científico ISCTE-IUL 2013”	-
dc.description.abstract	This paper focuses on the tasks of recovering capitalization and punctuation marks from texts without that information, such as spoken transcripts produced by automatic speech recognition systems. These two practical rich transcription tasks were performed using the same discriminative approach, based on maximum entropy, suitable for on-the-fly usage. Reported experiments were conducted over both Portuguese and English broadcast news data. Both force-aligned and automatic transcripts were used, making it possible to measure the impact of speech recognition errors. Capitalized words and named entities are intrinsically related, and are influenced by time variation effects. For that reason, the so-called language dynamics have been addressed for the capitalization task. Language adaptation results indicate, for both languages, that capitalization performance is affected by the temporal distance between the training and testing data. Regarding the punctuation task, this paper covers the three most frequent punctuation marks: full stop, comma, and question mark. Different methods were explored for improving the baseline results for full stop and comma. The first uses punctuation information extracted from large written corpora. The second applies different levels of linguistic structure, including lexical, prosodic, and speaker-related features. Comma detection improved significantly with the first method, indicating that it depends more on lexical features. The second method provided even better results, for both languages and both punctuation marks, with the best results achieved mainly for full stop. As for question marks, there is a small gain, but the differences are not very significant, due to the relatively small number of question marks in the corpora.	por
dc.language.iso	eng	por
dc.publisher	IEEE Signal Processing Society	por
dc.rights	embargoedAccess	por
dc.subject	Automatic speech processing	por
dc.subject	Capitalization	por
dc.subject	Language dynamics	por
dc.subject	Natural language processing	por
dc.subject	Punctuation marks	por
dc.subject	Rich transcription	por
dc.title	Bilingual Experiments on Automatic Recovery of Capitalization and Punctuation of Automatic Speech Transcripts	por
dc.type	article	en_US
dc.pagination	474-485	por
dc.publicationstatus	Published	por
dc.peerreviewed	Yes	por
dc.relation.publisherversion	The definitive version is available at IEEE: http://dx.doi.org/10.1109/TASL.2011.2159594	por
dc.journal	IEEE Transactions on Audio, Speech, and Language Processing	por
dc.distribution	International	por
dc.volume	20	por
dc.number	2	por
degois.publication.firstPage	474	por
degois.publication.lastPage	485	por
degois.publication.issue	2	por
degois.publication.title	IEEE Transactions on Audio, Speech, and Language Processing	por
dc.date.updated	2014-05-02T14:34:58Z	-
Appears in collections:	CTI-RI - Articles in international peer-reviewed scientific journals

Files in this record:

File	Description	Size	Format
Batista 2012 IEEE.pdf Restricted Access		925.74 kB	Adobe PDF
