A Deep Learning Approach for Sentence Classification of Scientific Abstracts

The classification of abstract sentences is a valuable tool to support scientific database querying, to summarize relevant literature and to assist in the writing of new abstracts. This study proposes a novel deep learning approach, based on a convolutional layer and a bi-directional gated recurrent unit, to classify the sentences of abstracts. The proposed neural network was tested on a sample of 20 thousand abstracts from the biomedical domain. Competitive results were achieved, with weight-averaged Precision, Recall and F1-score values around 91%, which are higher than those of a state-of-the-art neural network.


Introduction
In recent decades, there has been a rise in the number of scholarly publications [14]. For instance, around 114 million English scholarly documents were accessible on the Web in 2014 [9]. Such a volume makes it difficult to quickly select relevant scientific documents. Scientific abstracts summarize the most important elements of a paper and thus they are valuable sources for filtering the most relevant papers during a literature review [1].
The classification of scientific abstracts is a particular instance of the sequential classification task, considering that there is a typical order in the classes (e.g., the 'Objective' label tends to appear after the 'Background'). This classification transforms unstructured text into a more manageable information structure [6]. This is acknowledged by the Emerald publisher, which requires all submissions to include a structured abstract [4]. In effect, the automatic classification of abstract sentences presents several advantages. It is a valuable tool for general scientific database querying (e.g., using Web of Science or Scopus). Also, it can assist in manual [11] or text mining [15] systematic literature review processes, as well as other bibliometric analyses. Moreover, it can help in the writing of new paper abstracts [13].
In this study, we present a deep learning neural network architecture for the sequential classification of abstract sentences. The architecture uses a word embedding layer, a convolutional layer, a bi-directional Gated Recurrent Unit (GRU) and a final concatenation layer. The proposed deep learning model is compared with a recently proposed bi-directional Long Short-Term Memory (LSTM) based model [6], showing a competitive performance on a large 20K abstract corpus that assumes five sentence classes: 'Background', 'Objectives', 'Methods', 'Results' and 'Conclusions'. This paper is organized as follows. First, the related work is introduced in Section 2. Next, the abstract corpus and methods are described in Section 3. Then, the experimental results are presented and analyzed in Section 4. Finally, the main conclusions are discussed in Section 5.

Related Work
As pointed out in [6], most sequential sentence classification methods are based on 'shallow' methods (e.g., naive Bayes, Support Vector Machines (SVM)) that require a manual feature engineering based on lexical (e.g., bag of words, n-grams), semantic (e.g., synonyms), structural (e.g., part-of-speech tags) or sequential (e.g., sentence position) information. The advantage of using deep learning is that the neural networks do not require such a manual design of features. Also, deep learning often achieves competitive results in text classification [8].
Regarding abstract sentence classification, this topic has been scarcely researched when compared to other text classification tasks (e.g., sentiment analysis). The main reason for this reduced attention is the restricted availability of publicly accessible datasets. In 2010 [2], manual feature engineering was used to set nine features (e.g., bi-grams) and train five classifiers (e.g., SVM) that were combined to classify four main elements of medical abstracts. In 2013 [13], a private corpus with 4550 abstracts from different scientific fields was collected from ScienceDirect. The abstract sentences were manually labeled into four categories: 'Background', 'Goal', 'Methods' and 'Results'. The authors also used the conventional manual feature design approach (e.g., n-grams) and a transductive SVM. More recently, in 2017 [5], a large abstract corpus was made publicly available. Using this dataset, a deep learning model based on one bi-directional LSTM was proposed for a five-class sentence prediction, outperforming four other approaches (e.g., n-gram logistic regression, multilayer perceptron) [6].
In this paper, we propose a different deep learning architecture, composed mainly of a convolutional layer and a bi-directional GRU layer, to classify the sentences of abstracts, using word embeddings instead of character embeddings. By taking into consideration the position of the sentences, as well as encoding contextual information into the vector of each sentence, we expect that the proposed architecture can potentially achieve better results when compared with the study of [6].

Abstract Corpus
We adopted the abstract corpus first analyzed by [5], which sets the baseline for comparison purposes. The corpus includes open access papers from the PubMed biomedical database related to Randomized Controlled Trials (RCT). The sentences were classified by the authors of the articles into the five standardized labels.
The full corpus has a total of 200K abstracts. A smaller subset, with the 20K most recent abstracts, was also made available for a faster experimentation of sequential sentence classification methods. Considering that the 20K subset was used in the work of [6], we adopt the same dataset to facilitate the experimental comparison. Table 1 presents the class frequencies and the train, validation and test split sizes. This is an unbalanced dataset, with most sentences being related to 'Methods' or 'Results' (around 30%).

Neural Networks Models
In recent years, there have been remarkable developments in deep learning [8]. Architectures such as the Convolutional Neural Network (CNN), LSTM and GRU have obtained competitive results in several competitions (e.g., computer vision, signal and natural language processing). The CNN is a network composed mainly of convolutional layers, whose purpose is to extract features that preserve relevant information from the inputs [12]. To obtain the features, a convolutional layer receives a matrix as input, to which a matrix with a set of weights, known as a filter, is applied using a sliding window approach; at each sliding window step, a convolution is calculated, resulting in a feature. The size of the filter is a relevant hyperparameter.
Although CNNs have been widely used in computer vision, they can also be used in sentence classification [10]. The use of convolutional layers enables the extraction of features from a window of words, which is useful because word embeddings alone are not able to detect specific nuances, such as double negation, which is important for sentiment classification. The width of the filter, represented by h, determines the length of the n-grams. The number of filters is also a hyperparameter, making it possible to use multiple filters with varying lengths [10]. The filters are initialized with random weights and, during the training of the network, the weights are learned for the specific task through backpropagation. Since each filter produces its own feature map, there is a need to reduce the dimensionality caused by using multiple filters. A sentence can be encoded as a single vector by applying a max pooling layer after the convolutional layer, which takes the maximum value at each position over all feature maps, keeping only the most important features.
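The sliding-window convolution and max pooling steps described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the sentence length, embedding size and random filter weights are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d, h = 10, 8, 5           # words per sentence, embedding size, filter width
E = rng.normal(size=(m, d))  # sentence matrix of word embeddings (illustrative)
w = rng.normal(size=(h, d))  # one convolutional filter
b = 0.1                      # bias term

# Slide the filter over the sentence: one ReLU-activated feature per window.
features = np.array([
    max(0.0, float(np.sum(w * E[i:i + h]) + b))
    for i in range(m - h + 1)
])

# Max pooling keeps only the strongest response of this filter.
pooled = features.max()
print(features.shape)  # (6,) -> m - h + 1 window positions
```

With several filters, each produces one pooled value, so the sentence is encoded as a vector with one entry per filter.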
Recurrent Neural Networks (RNN) are relevant for sequential data, such as the words that appear in a sentence. Consider the words (x_1, ..., x_t) of a given sentence (sequence of words). The hidden state s_t of the word x_t depends on the hidden state s_{t−1}, which in turn is the hidden state of the word x_{t−1}; for this reason, the order in which the words appear in the sequence also influences the various hidden states of the RNN.
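A tiny NumPy sketch of this recurrence makes the order dependence concrete; the weight matrices are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 4, 3  # word-embedding size, hidden-state size (illustrative)
W = rng.normal(size=(k, d))
U = rng.normal(size=(k, k))

def final_state(word_vectors):
    """Read the word vectors in order; each state depends on the previous one."""
    s = np.zeros(k)
    for x in word_vectors:
        s = np.tanh(W @ x + U @ s)  # s_t is a function of x_t and s_{t-1}
    return s

words = [rng.normal(size=d) for _ in range(5)]
# The same words in a different order yield a different final hidden state.
print(np.allclose(final_state(words), final_state(words[::-1])))  # False
```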
The LSTM network is a particular RNN that uses an internal memory to keep information between distant time steps, in order to model long-term dependencies of the sequence. It uses two gating mechanisms, an update gate and a forget gate, which control, respectively, what information should be updated into the memory and what information should be erased from it. The GRU [3] was introduced more recently and can be used as an alternative to the LSTM model. The GRU uses a reset and an update gate, which control how much information should be kept from previous time steps. Both GRU and LSTM are solutions that help mitigate the vanishing gradient problem of conventional RNNs.
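One GRU step, with its reset and update gates, can be sketched in NumPy as below. This is a minimal illustration of the standard GRU formulation, with randomly initialized weights and arbitrary sizes, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 4, 3  # input size, hidden size (illustrative)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Separate weights for the reset gate, update gate and candidate state.
Wr, Wz, Wh = (rng.normal(size=(k, d)) for _ in range(3))
Ur, Uz, Uh = (rng.normal(size=(k, k)) for _ in range(3))

def gru_step(x, h_prev):
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate, in [0, 1]
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate, in [0, 1]
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand        # keep old vs. store new

h = np.zeros(k)
for x in (rng.normal(size=d) for _ in range(4)):
    h = gru_step(x, h)
print(h.shape)  # (3,)
```

The update gate interpolates between the previous state and the candidate, which is what lets gradients flow over many time steps.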
A deep learning model was used in [6] for abstract sentence classification. The model uses character embeddings that are then concatenated with word embeddings and used as input for a bi-directional LSTM layer, which outputs a sentence vector based on those hybrid embeddings. The sentence vector is used to predict the probabilities of the labels for that sentence. The authors also use a sequence optimization layer, which has the objective of optimizing the classification of a sequence of sentences, exploiting existing dependencies between labels.

Proposed Architecture
The proposed word embedding, convolutional and bi-directional GRU (Word-BiGRU) architecture is shown in Figure 1. We assume that each abstract has i sentences (S_1, ..., S_i) and each individual sentence has n words (x_1^i, ..., x_n^i), where x_n^i is the n-th word of the i-th sentence. The various words of the sentences are mapped to their respective word embeddings, and those embeddings are used to create a sentence matrix E ∈ R^{m×d}, where d equals the dimensionality of the embeddings. We use word embeddings pre-trained on English Wikipedia, provided by GloVe (with d = 200).

Then, a convolutional layer is used with a sliding window approach that extracts the most important features from the sentences. Let E ∈ R^{m×d} denote the sentence matrix, w ∈ R^{h×d} a filter, and E[i : j] the sub-matrix from row i to row j. The single feature o_i is obtained using:

o_i = w · E[i : i + h − 1] (1)

In this study, we use a filter with a size of h = 5. To add nonlinearity to the output, an activation function is applied to every single feature. The feature o_i is then obtained by:

o_i = f(w · E[i : i + h − 1] + b) (2)

where f is the activation function and b is the bias. We use ReLU as the activation function in our model because it tends to present a faster convergence [7]. Next, we take the various feature maps obtained from the convolutional layer and feed them into a max pooling layer, to encode the most important features extracted by the convolutional layer into a single vector representation that can be used by the next layers. Let g_1, ..., g_i denote the resulting vectors, each one encoding a particular sentence of the abstract. The vectors are then fed to a bi-directional GRU layer, where the hidden states for each time step are calculated. We use ⊙ to denote the Hadamard product, while W and U denote weight matrices of the GRU layer.
Let h_{i−1} be the hidden state of the previous sentence from the same abstract. The candidate hidden state h̃_i for the current sentence is given by:

h̃_i = tanh(W g_i + U (r_i ⊙ h_{i−1})) (3)

The reset gate r_i ∈ [0, 1] controls how much information of the past hidden state h_{i−1} will be kept. Let σ be the sigmoid activation function. The reset gate r_i is calculated by:

r_i = σ(W_r g_i + U_r h_{i−1}) (4)

To control how much new information will be stored in the hidden state, an update gate z_i ∈ [0, 1] is used, given by:

z_i = σ(W_z g_i + U_z h_{i−1}) (5)

The hidden state h_i of sentence i is obtained by:

h_i = (1 − z_i) ⊙ h_{i−1} + z_i ⊙ h̃_i (6)

Since we use a bi-directional GRU layer, there is a forward pass and a backward pass. The hidden states resulting from the forward pass are:

→h_i = GRU(g_i, →h_{i−1}) (7)

where →h_i is the hidden state of the i-th sentence of the abstract. Similarly, the hidden states resulting from the backward pass are:

←h_i = GRU(g_i, ←h_{i+1}) (8)

By using a bi-directional GRU, we want to capture contextual information about each sentence of the abstract, by taking into consideration the sentences that appear before and after it. For the i-th sentence of the abstract, the individual vector k_i, which encodes the sentence with the contextual information captured using the bi-directional GRU layer, is obtained by concatenating (⊕ operator) the forward and backward hidden states:

k_i = →h_i ⊕ ←h_i (9)

Each encoded sentence k_i is then concatenated with an integer value indicating the position of that sentence in the abstract, resulting in z_i:

z_i = k_i ⊕ i (10)

Finally, a softmax layer is used, such that the outputs can be interpreted as class probabilities.
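The flow from pooled sentence vectors through the forward and backward passes to the position-augmented vector can be sketched in NumPy. This is an illustrative, untrained miniature of the idea (random weights, arbitrary sizes), not the Keras implementation used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 6, 4  # pooled-sentence-vector size, hidden size (illustrative)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_gru():
    """One GRU with its own randomly initialised weights."""
    Wr, Wz, Wh = (rng.normal(size=(k, d)) for _ in range(3))
    Ur, Uz, Uh = (rng.normal(size=(k, k)) for _ in range(3))
    def step(x, h):
        r = sigmoid(Wr @ x + Ur @ h)  # reset gate
        z = sigmoid(Wz @ x + Uz @ h)  # update gate
        return (1.0 - z) * h + z * np.tanh(Wh @ x + Uh @ (r * h))
    return step

def run(step, xs):
    h, states = np.zeros(k), []
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states

forward, backward = make_gru(), make_gru()  # separate weights per direction
g = [rng.normal(size=d) for _ in range(5)]  # one pooled vector per sentence

h_fwd = run(forward, g)
h_bwd = run(backward, g[::-1])[::-1]  # backward pass, realigned to sentence order

# Forward and backward states are concatenated, then the position i is appended.
z = [np.concatenate([hf, hb, [i]])
     for i, (hf, hb) in enumerate(zip(h_fwd, h_bwd))]
print(z[0].shape)  # (9,) -> 2k hidden features plus one position value
```

Each z_i would then be passed through the final softmax layer to produce the five class probabilities.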

Evaluation
Classification accuracy is often measured using a confusion matrix, which maps predicted versus desired labels. From this matrix, several metrics can be computed [17], such as Precision, Recall and F1-score. For a class c, these metrics are obtained using:

Precision_c = TP_c / (TP_c + FP_c) (11)

Recall_c = TP_c / (TP_c + FN_c) (12)

F1-score_c = 2 · Precision_c · Recall_c / (Precision_c + Recall_c) (13)

where TP_c, FP_c and FN_c denote the number of true positives, false positives and false negatives for class c.
To combine all five class results into a single measure, we adopt two aggregation methods: macro-averaging and weight-averaging. Macro-averaging first computes the metric for each class (e.g., Precision using Equation 11) and then averages the results over all classes. Weight-averaging is computed in a similar way, except that each class metric is weighted proportionally to its prevalence in the data. In [6], only the weight-averaging method was used.
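The difference between the two aggregation methods can be shown on a toy confusion matrix (the counts below are invented for illustration and use three classes instead of five):

```python
import numpy as np

# Toy confusion matrix for three classes (rows: true labels, columns: predicted).
cm = np.array([[8, 1, 1],
               [2, 5, 3],
               [0, 2, 18]])

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp  # predicted as c but belonging to another class
fn = cm.sum(axis=1) - tp  # belonging to c but predicted as another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

support = cm.sum(axis=1)  # class prevalence in the data
macro_precision = precision.mean()                           # classes count equally
weighted_precision = np.average(precision, weights=support)  # weighted by prevalence
print(round(macro_precision, 3), round(weighted_precision, 3))  # 0.748 0.765
```

On unbalanced data such as this corpus, the two aggregations can differ noticeably, since weight-averaging is dominated by the frequent classes.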
For comparison purposes, we adopt the same train, validation and test sets used in [6] (Table 1). When fitting the deep learning architecture, we adjusted different combinations of its main hyperparameters, namely: the number of filters (128 or 256) in the convolutional layer and the number of units (∈ {25, 50, 75, 100}) in the bi-directional GRU layer. The validation set was used to select the best configuration, by monitoring the macro-averaged Precision metric. In the test set comparison, we computed all classification metrics.

Results
The deep learning models were trained on a p2.xlarge instance from Amazon Elastic Compute Cloud, which has an Intel Xeon E5-2686 v4 2.30 GHz, an Nvidia Tesla K80 and 61 GB of RAM. The experiments were implemented in Python using the Keras and scikit-learn packages. The selected hyperparameters (using validation metrics) are shown in Table 2. Figure 2 shows the normalized confusion matrix of the proposed model. The matrix confirms that a very good classification was achieved, in particular for the 'Methods', 'Conclusions' and 'Results' labels, which correspond to the most frequent classes.
The proposed Word-BiGRU deep learning architecture is compared with two other approaches: a similar model that does not include the bi-directional GRU layer (CNN model), and the results provided in [6] (Char-BiLSTM). Table 3 shows the test results for each class. Word-BiGRU shows competitive results when compared with Char-BiLSTM. Specifically, it achieves the best Precision and Recall values for three classes and the best F1-scores for all classes. Furthermore, the proposed deep learning model provides the highest classification improvement (11.3 percentage points) for the least frequent class ('Objectives'). The averaged class results are detailed in Table 4, where Word-BiGRU provides better results in all metrics when compared with the other models.

Conclusions
Abstract sentence classification is a key element to assist in scientific database querying, to perform literature reviews and to support the writing of new abstracts. In this paper, we presented a novel deep learning architecture for abstract sentence classification. The proposed Word-BiGRU architecture assumes word embeddings, a convolutional layer and a bi-directional Gated Recurrent Unit (GRU). Using a large sentence corpus, related to 20 thousand abstracts from the biomedical domain, we obtained high-quality classification performances, with weight-averaged Precision, Recall and F1-score values around 91%. These results compare favourably against a state-of-the-art bi-directional Long Short-Term Memory (LSTM) model. In future work, we wish to extend the experimentation of the proposed deep learning architecture to abstract corpora from other scientific domains, as well as to other sequential classification tasks.