Creating classification models from textual descriptions of companies using Crunchbase

Marco Felgueiras; Batista, F.; João P. Carvalho

doi:10.1007/978-3-030-50146-4_51

Please use this identifier to cite or link to this item: http://hdl.handle.net/10071/20839

Full record

DC Field	Value	Language
dc.contributor.author	Marco Felgueiras	-
dc.contributor.author	Batista, F.	-
dc.contributor.author	João P. Carvalho	-
dc.contributor.editor	Lesot, Marie-Jeanne and Vieira, Susana and Reformat, Marek Z. and Carvalho, João Paulo and Wilbik, Anna and Bouchon-Meunier, Bernadette and Yager, Ronald R.	-
dc.date.accessioned	2020-11-20T11:05:04Z	-
dc.date.available	2020-11-20T11:05:04Z	-
dc.date.issued	2020	-
dc.identifier.isbn	978-3-030-50146-4	-
dc.identifier.uri	http://hdl.handle.net/10071/20839	-
dc.description.abstract	This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600,000 companies. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the proposed models predict the categories based solely on the company's textual description. A number of natural language processing strategies have been tested for feature extraction, including stemming, lemmatization, and part-of-speech tags. This is a highly unbalanced dataset, where the frequency of each category ranges from 0.7% to 28%. Our findings reveal that the description text of each company contains features that make it possible to predict its area of activity, expressed by its corresponding categories, with about 70% precision and 42% recall. In a second set of experiments, a multiclass problem that attempts to find the most probable category, we obtained about 67% accuracy using SVM and Fuzzy Fingerprints. The resulting models may constitute an important asset for automatic classification of texts, not only consisting of company descriptions, but also other texts, such as web pages, text blogs, news pages, etc.	eng
dc.language.iso	eng	-
dc.publisher	Springer International Publishing	-
dc.relation	UIDB/50021/2020	-
dc.rights	openAccess	-
dc.title	Creating classification models from textual descriptions of companies using Crunchbase	eng
dc.type	conferenceObject	-
dc.event.title	IPMU 2020: Information Processing and Management of Uncertainty in Knowledge-Based Systems	-
dc.event.type	Conference	eng
dc.event.location	Lisboa	eng
dc.event.date	2020	-
dc.pagination	695 - 707	-
dc.peerreviewed	yes	-
dc.journal	Information Processing and Management of Uncertainty in Knowledge-Based Systems	-
degois.publication.firstPage	695	-
degois.publication.lastPage	707	-
degois.publication.location	Lisboa	eng
degois.publication.title	Creating classification models from textual descriptions of companies using Crunchbase	eng
dc.date.updated	2020-11-20T11:01:40Z	-
dc.description.version	info:eu-repo/semantics/publishedVersion	-
dc.identifier.doi	10.1007/978-3-030-50146-4_51	-
dc.subject.fos	Domain/Scientific Area::Natural Sciences::Computer and Information Sciences	eng
dc.subject.fos	Domain/Scientific Area::Engineering and Technology::Electrical, Electronic and Computer Engineering	eng
dc.subject.fos	Domain/Scientific Area::Humanities::Languages and Literatures	eng
iscte.identifier.ciencia	https://ciencia.iscte-iul.pt/id/ci-pub-72399	-
iscte.alternateIdentifiers.scopus	2-s2.0-85086244630	-
Appears in collections:	IT-CRI - Comunicações a conferências internacionais