SDRS: a new lossless dimensionality reduction for text corpora

De Mendizabal, I. V.; Basto-Fernandes, V.; Ezpeleta, E.; Méndez, J. R.; Zurutuza, U.

doi:10.1016/j.ipm.2020.102249

Use this identifier to cite or link to this record: http://hdl.handle.net/10071/20406

Full record

DC field	Value	Language
dc.contributor.author	De Mendizabal, I. V.	-
dc.contributor.author	Basto-Fernandes, V.	-
dc.contributor.author	Ezpeleta, E.	-
dc.contributor.author	Méndez, J. R.	-
dc.contributor.author	Zurutuza, U.	-
dc.date.accessioned	2020-04-22T14:54:41Z	-
dc.date.issued	2020	-
dc.identifier.issn	0306-4573	-
dc.identifier.uri	http://hdl.handle.net/10071/20406	-
dc.description.abstract	In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches by means of token-based representations of textual contents. After introducing multiple performance enhancements, the impact has been virtually irrelevant. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different forms to take advantage of semantic information to address problems, such as dimensionality reduction. These preliminary solutions present some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset-features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single features “anti-impotence drug”, “drug” or “chemical substance” depending on the generalization of 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA) and particularly, of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without executing dimensional reductions. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to a better performance of Naïve Bayes classifiers.	eng
dc.language.iso	eng	-
dc.publisher	Elsevier	-
dc.relation	UIDP/04466/2020	-
dc.relation	UIDB/04466/2020	-
dc.rights	openAccess	-
dc.subject	Spam filtering	eng
dc.subject	Token-based representation	eng
dc.subject	Synset-based representation	eng
dc.subject	Semantic-based feature reduction	eng
dc.subject	Multi-objective evolutionary algorithms	eng
dc.title	SDRS: a new lossless dimensionality reduction for text corpora	eng
dc.type	article	-
dc.peerreviewed	yes	-
dc.journal	Information Processing and Management	-
dc.volume	57	-
dc.number	4	-
degois.publication.issue	4	-
degois.publication.title	SDRS: a new lossless dimensionality reduction for text corpora	eng
dc.date.updated	2020-04-22T15:53:39Z	-
dc.description.version	info:eu-repo/semantics/acceptedVersion	-
dc.identifier.doi	10.1016/j.ipm.2020.102249	-
dc.subject.fos	Domínio/Área Científica::Ciências Naturais::Ciências da Computação e da Informação	por
dc.date.embargo	2023-03-21	-
iscte.subject.ods	Indústria, inovação e infraestruturas	por
iscte.subject.ods	Cidades e comunidades sustentáveis	por
iscte.subject.ods	Paz, justiça e instituições eficazes	por
iscte.identifier.ciencia	https://ciencia.iscte-iul.pt/id/ci-pub-70824	-
iscte.alternateIdentifiers.scopus	2-s2.0-85081988881	-
Appears in collections:	ISTAR-RI - Artigos em revistas científicas internacionais com arbitragem científica

Files in this record:

File	Description	Size	Format
Manuscript.pdf	Post-print	1.1 MB	Adobe PDF
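
The hypernym-based grouping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `HYPERNYM` table is a hypothetical toy taxonomy standing in for BabelNet, the function names `generalize` and `reduce_features` are invented for illustration, and the per-synset generalization levels are hard-coded here, whereas the paper selects them with NSGA-II.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> direct hypernym); not real BabelNet data.
HYPERNYM = {
    "viagra": "anti-impotence drug",
    "cialis": "anti-impotence drug",
    "anti-impotence drug": "drug",
    "drug": "chemical substance",
}


def generalize(synset, levels):
    """Climb `levels` hypernym steps, stopping early if no hypernym is known."""
    for _ in range(levels):
        parent = HYPERNYM.get(synset)
        if parent is None:
            break
        synset = parent
    return synset


def reduce_features(doc_counts, levels_per_synset):
    """Generalize each synset by its own level and sum counts that collide."""
    reduced = Counter()
    for synset, count in doc_counts.items():
        target = generalize(synset, levels_per_synset.get(synset, 0))
        reduced[target] += count
    return dict(reduced)


# Assumed example document: synset -> occurrence count.
doc = {"viagra": 2, "cialis": 1, "meeting": 3}
# Assumed level assignment; the paper searches for these values with NSGA-II.
levels = {"viagra": 1, "cialis": 1}

reduced = reduce_features(doc, levels)
print(reduced)
```

With one level of generalization, "viagra" and "cialis" collapse into the single feature "anti-impotence drug" and their counts are summed, so the feature space shrinks from three columns to two while the total occurrence count is preserved.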
