Semi-supervised annotation of Portuguese hate speech across social media domains

Santos, R. B.; Matos, B. C.; Carvalho, P.; Batista, F.; Ribeiro, R.

doi:10.4230/OASIcs.SLATE.2022.11

Use this identifier to cite this record: http://hdl.handle.net/10071/25973

Full record

DC Field	Value	Language
dc.contributor.author	Santos, R. B.	-
dc.contributor.author	Matos, B. C.	-
dc.contributor.author	Carvalho, P.	-
dc.contributor.author	Batista, F.	-
dc.contributor.author	Ribeiro, R.	-
dc.contributor.editor	Cordeiro, J., Pereira, M. J., Rodrigues, N. F., and Pais, S.	-
dc.date.accessioned	2022-08-02T13:59:09Z	-
dc.date.available	2022-08-02T13:59:09Z	-
dc.date.issued	2022	-
dc.identifier.isbn	978-3-95977-245-7	-
dc.identifier.issn	2190-6807	-
dc.identifier.uri	http://hdl.handle.net/10071/25973	-
dc.description.abstract	With the increasing spread of hate speech (HS) on social media, it becomes urgent to develop models that can help detect it automatically. Typically, such models require large-scale annotated corpora, which are still scarce in languages such as Portuguese. However, creating manually annotated corpora is a very expensive and time-consuming task. To address this problem, we propose an ensemble of two semi-supervised models that can be used to automatically create a corpus representative of online hate speech in Portuguese. The first model combines Generative Adversarial Networks and a BERT-based model. The second model is based on label propagation, and consists of propagating labels from existing annotated corpora to the unlabeled data, by exploring the notion of similarity. We have explored the annotations of three existing corpora (CO-HATE, ToLR-BR, and HPHS) in order to automatically annotate FIGHT, a corpus composed of geolocated tweets produced in the Portuguese territory. Through the process of selecting the best model and the corresponding setup, we have tested different pre-trained embeddings, performed experiments using different training subsets, labeled by different annotators with different perspectives, and performed several experiments with active learning. Furthermore, this work explores back translation as a means to automatically generate additional hate speech samples. The best results were achieved by combining all the labeled datasets, obtaining an F1-score of 0.664 for the Hate Speech class in FIGHT.	eng
dc.language.iso	eng	-
dc.publisher	Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing	-
dc.relation	HATE Covid-19 (Proj. 759274510)	-
dc.relation	info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F50021%2F2020/PT	-
dc.relation	info:eu-repo/grantAgreement/FCT/3599-PPCDT/PTDC%2FCCI-CIF%2F32607%2F2017/PT	-
dc.relation.ispartof	OpenAccess Series in Informatics	-
dc.rights	openAccess	-
dc.subject	Hate speech	eng
dc.subject	Semi-supervised learning	eng
dc.subject	Semi-automatic annotation	eng
dc.title	Semi-supervised annotation of Portuguese hate speech across social media domains	eng
dc.type	conferenceObject	-
dc.event.title	11th Symposium on Languages, Applications and Technologies (SLATE 2022)	-
dc.event.type	Conference	eng
dc.event.location	Covilhã	eng
dc.event.date	2022	-
dc.peerreviewed	yes	-
dc.volume	104	-
dc.date.updated	2022-08-02T14:57:13Z	-
dc.description.version	info:eu-repo/semantics/publishedVersion	-
dc.identifier.doi	10.4230/OASIcs.SLATE.2022.11	-
dc.subject.fos	Domain/Scientific Area::Natural Sciences::Computer and Information Sciences	eng
dc.subject.fos	Domain/Scientific Area::Humanities::Languages and Literatures	eng
iscte.subject.ods	Peace, justice and strong institutions	eng
iscte.identifier.ciencia	https://ciencia.iscte-iul.pt/id/ci-pub-89928	-
Appears in collections:	IT-CRI - International conference papers

Files in this record:

File	Size	Format
conferenceobject_89928.pdf	541.72 kB	Adobe PDF
