A Data-Driven Approach to Predict the Success of Bank Telemarketing

We propose a data mining (DM) approach to predict the success of telemarketing calls for selling bank long-term deposits. A Portuguese retail bank was addressed, with data collected from 2008 to 2013, thus including the effects of the recent financial crisis. We analyzed a large set of 150 features related with bank client, product and social-economic attributes. A semi-automatic feature selection was explored in the modeling phase, performed with the data prior to July 2012, and allowed the selection of a reduced set of 22 features. We also compared four DM models: logistic regression, decision trees (DT), neural network (NN) and support vector machine. Using two metrics, the area under the receiver operating characteristic curve (AUC) and the area of the LIFT cumulative curve (ALIFT), the four models were tested in an evaluation phase, using the most recent data (after July 2012) and a rolling windows scheme. The NN presented the best results (AUC=0.8 and ALIFT=0.7), allowing us to reach 79% of the subscribers by selecting the better classified half of the clients. Also, two knowledge extraction methods, a sensitivity analysis and a DT, were applied to the NN model and revealed several key attributes (e.g., Euribor rate, direction of the call and bank agent experience). Such knowledge extraction confirmed the obtained model as credible and valuable for telemarketing campaign managers.


Introduction
Marketing selling campaigns constitute a typical strategy to enhance business. Companies use direct marketing when targeting segments of customers by contacting them to meet a specific goal. Centralizing customer remote interactions in a contact center eases operational management of campaigns.
Such centers allow communicating with customers through various channels, telephone (fixed-line or mobile) being one of the most widely used. Marketing operationalized through a contact center is called telemarketing due to the remoteness characteristic [16]. Contacts can be divided into inbound and outbound, depending on which side triggered the contact (client or contact center), with each case posing different challenges (e.g., outbound calls are often considered more intrusive). Technology enables rethinking marketing by focusing on maximizing customer lifetime value through the evaluation of available information and customer metrics, thus allowing companies to build longer and tighter relations in alignment with business demand [28]. Also, it should be stressed that the task of selecting the best set of clients, i.e., those that are more likely to subscribe to a product, is considered NP-hard in [31].
Decision support systems (DSS) use information technology to support managerial decision making. There are several DSS sub-fields, such as personal and intelligent DSS. Personal DSS are related with small-scale systems that support a decision task of one manager, while intelligent DSS use artificial intelligence techniques to support decisions [1]. Another related DSS concept is Business Intelligence (BI), which is an umbrella term that includes information technologies, such as data warehouses and data mining (DM), to support decision making using business data [32]. DM can play a key role in personal and intelligent DSS, allowing the semi-automatic extraction of explanatory and predictive knowledge from raw data [34]. In particular, classification is the most common DM task [10], and the goal is to build a data-driven model that learns an unknown underlying function that maps several input variables, which characterize an item (e.g., bank client), to one labeled output target (e.g., outcome of a bank deposit sale: "failure" or "success").
There are several classification models, such as the classical Logistic Regression (LR), decision trees (DT) and the more recent neural networks (NN) and support vector machines (SVM) [13]. LR and DT have the advantage of fitting models that tend to be easily understood by humans, while also providing good predictions in classification tasks. NN and SVM are more flexible (i.e., no a priori restriction is imposed) when compared with classical statistical modeling (e.g., LR) or even DT, presenting learning capabilities that range from linear to complex nonlinear mappings. Due to such flexibility, NN and SVM tend to provide accurate predictions, but the obtained models are difficult for humans to understand. However, these "black box" models can be opened by using a sensitivity analysis, which allows measuring the importance and effect of a particular input on the model output response [7]. When comparing DT, NN and SVM, several studies have shown different classification performances. For instance, SVM provided better results in [6,8], comparable NN and SVM performances were obtained in [5], while DT outperformed NN and SVM in [24]. These differences in performance emphasize the impact of the problem context and provide a strong reason to test several techniques when addressing a problem before choosing one of them [9]. DSS and BI have been applied to banking in numerous domains, such as credit pricing [25]. However, the research is rather scarce in terms of the specific area of banking client targeting. For instance, [17] described the potential usefulness of DM techniques in marketing within the Hong Kong banking sector, but no actual data-driven model was tested. The research of [19] identified clients for targeting at a major bank using pseudo-social networks based on relations (money transfers between stakeholders). Their approach offers an interesting alternative to the traditional usage of business characteristics for modeling.
In previous work [23], we have explored data-driven models for modeling bank telemarketing success. Yet, we only achieved good models when using attributes that are only known at call execution time, such as call duration. Thus, while providing interesting information for campaign managers, such models cannot be used for prediction. In what is more closely related with our approach, [15] analyzed how a mass media (e.g., radio and television) marketing campaign could affect the buying of a new bank product. The data was collected from an Iranian bank, with a total of 22427 customers, relating to a six-month period, from January to July of 2006, when the mass media campaign was conducted. It was assumed that all customers who bought the product (7%) were influenced by the marketing campaign. Historical data allowed the extraction of a total of 85 input attributes related with recency, frequency and monetary features and the age of the client. A binary classification task was modeled using an SVM algorithm that was fed with 26 attributes (after a feature selection step), using 2/3 of randomly selected customers for training and 1/3 for testing. The classification accuracy achieved was 81% and, through a Lift analysis [3], such a model could select 79% of the positive responders with just 40% of the customers. While these results are interesting, a robust validation was not conducted: only one holdout run (train/test split) was considered. Also, such a random split does not reflect the temporal dimension that a real prediction system would have to follow, i.e., using past patterns to fit the model in order to issue predictions for future client contacts.
In this paper, we propose a personal and intelligent DSS that can automatically predict the result of a phone call to sell long term deposits by using a DM approach. Such DSS is valuable to assist managers in prioritizing and selecting the next customers to be contacted during bank marketing campaigns.
For instance, a Lift analysis estimates the probability of success, leaving to managers only the decision of how many customers to contact.
As a consequence, the time and costs of such campaigns would be reduced.
Also, by performing fewer and more effective phone calls, client stress and intrusiveness would be diminished. The main contributions of this work are:
• We focus on feature engineering, which is a key aspect in DM [10], and propose generic social and economic indicators in addition to the more commonly used bank client and product attributes, in a total of 150 analyzed features. In the modeling phase, a semi-automated process (based on business knowledge and a forward method) allowed the reduction of the original set to 22 relevant features that are used by the DM models.
• We analyze a recent and large dataset (52944 records) from a Portuguese bank. The data were collected from 2008 to 2013, thus including the effects of the global financial crisis that peaked in 2008.
• We compare four DM models (LR, DT, NN and SVM) using a realistic rolling windows evaluation and two classification metrics. We also show how the best model (NN) could benefit the bank telemarketing business.
The paper is organized as follows: Section 2 presents the bank data and DM approach; Section 3 describes the experiments conducted and analyzes the obtained results; finally, conclusions are drawn in Section 4.

Bank telemarketing data
This research focuses on targeting through telemarketing phone calls to sell long-term deposits. Within a campaign, the human agents execute phone calls to a list of clients to sell the deposit (outbound) or, if meanwhile the client calls the contact center for any other reason, he is asked to subscribe the deposit (inbound). Thus, the result is a binary unsuccessful or successful contact.
This study considers real data collected from a Portuguese retail bank from 2008 to 2013. Each record included the output target, the contact outcome ({"failure", "success"}), and candidate input features. These include telemarketing attributes (e.g., call direction), product details (e.g., interest rate offered) and client information (e.g., age). These records were enriched with social and economic influence features (e.g., unemployment variation rate), by gathering external data from the statistical web site of the central bank of the Portuguese Republic.
The merging of the two data sources led to a large set of potentially useful features, with a total of 150 attributes, which are scrutinized in Section 2.4.

Data mining models
In this work, we test four binary classification DM models, as implemented in the rminer package of the R tool [5]: logistic regression (LR), decision trees (DT), neural network (NN) and support vector machine (SVM).
The LR is a popular choice (e.g., in credit scoring) that operates a smooth nonlinear logistic transformation over a multiple regression model and allows the estimation of class probabilities [33]:

p(c|x_k) = 1 / (1 + exp(-(w_0 + Σ_{i=1}^{M} w_i x_{k,i})))

where p(c|x_k) denotes the probability of class c given the k-th input example x_k = (x_{k,1}, ..., x_{k,M}) with M features and w_i denotes a weight factor, adjusted by the learning algorithm. Due to the additive linear combination of its independent variables (x), the model is easy to interpret. Yet, the model is quite rigid and cannot adequately model complex nonlinear relationships.
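As a minimal illustration of this class-probability estimate (a sketch only; the paper's actual models are fit with the rminer R package, and the weights below are hypothetical), the logistic transform can be written as:

```python
import math

def logistic_prob(x, w):
    """p(c|x) = 1 / (1 + exp(-(w0 + sum_i wi * xi))): a smooth logistic
    transform over a linear combination of the M input features."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights (w[0] is the intercept) for a 3-feature example.
w = [-1.0, 0.8, -0.5, 0.3]
x = [1.2, 0.4, 2.0]
print(round(logistic_prob(x, w), 3))  # a probability in [0, 1]
```

Because the score z is an additive linear combination, each weight w_i can be read directly as the direction and strength of the i-th feature's contribution, which is what makes the model easy to interpret.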
The DT is a branching structure that represents a set of rules, distinguishing values in a hierarchical form [2]. This representation can be translated into a set of IF-THEN rules, which are easy for humans to understand. The multilayer perceptron is the most popular NN architecture [14]. We adopt a multilayer perceptron with one hidden layer of H hidden nodes and one output node. The H hyperparameter sets the model learning complexity: a NN with H = 0 is equivalent to the LR model, while a high H value allows the NN to learn complex nonlinear relationships. For a given input x_k, the state of the i-th neuron (s_i) is computed by:

s_i = f(w_{i,0} + Σ_{j∈P_i} w_{i,j} s_j)

where P_i represents the set of nodes reaching node i; f is the logistic function; w_{i,j} denotes the weight of the connection between nodes j and i; and s_1 = x_{k,1}, ..., s_M = x_{k,M}. Given that the logistic function is used, the output node automatically produces a probability estimate (∈ [0, 1]). The NN final solution is dependent on the choice of starting weights. As suggested in [13], to address this issue, the rminer package uses an ensemble of N_r distinct trained networks and outputs the average of the individual predictions.
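A schematic forward pass of such a one-hidden-layer network, together with the N_r-member ensemble averaging, can be sketched as follows (all weights here are hypothetical; the paper's networks are trained with the BFGS algorithm inside rminer):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, hidden_w, out_w):
    """One-hidden-layer perceptron: each hidden node computes
    s_i = f(w_i0 + sum_j w_ij * x_j); the logistic output node then
    yields a probability estimate in [0, 1]."""
    hidden = [logistic(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
              for w in hidden_w]
    return logistic(out_w[0] + sum(wj * sj for wj, sj in zip(out_w[1:], hidden)))

def ensemble_predict(x, members):
    """Average the predictions of N_r independently trained networks
    (each member is a (hidden_weights, output_weights) pair)."""
    return sum(mlp_forward(x, hw, ow) for hw, ow in members) / len(members)
```

Averaging over members with different random starting weights reduces the variance caused by the initialization, which is the rationale for the ensemble in [13].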
The SVM classifier [4] transforms the input space x ∈ ℜ^M into a high m-dimensional feature space by using a nonlinear mapping that depends on a kernel. Then, the SVM finds the best linear separating hyperplane, related to a set of support vector points, in the feature space. The rminer package adopts the popular Gaussian kernel [13], which presents fewer parameters than other kernels (e.g., polynomial): K(x, x') = exp(-γ||x - x'||^2), γ > 0. The probabilistic SVM output is given by [35]: f(x_i) = Σ_{j=1}^{m} y_j α_j K(x_j, x_i) + b and p(x_i) = 1 / (1 + exp(A f(x_i) + B)), where A and B are estimated from the training data. Before fitting the NN and SVM models, the input data is first standardized to a zero mean and one standard deviation [13]. For DT, rminer adopts the default parameters of the rpart R package, which implements the popular CART algorithm [2]. For the LR and NN learning, rminer uses the efficient BFGS algorithm [22], from the family of quasi-Newton methods, while SVM is trained using sequential minimal optimization (SMO) [26]. The learning capabilities of NN and SVM are affected by the choice of their hyperparameters (H for NN; γ and C, a complexity penalty parameter, for SVM). For setting these values, rminer uses grid search and heuristics [5].
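The Gaussian kernel and the support-vector decision function can be sketched as follows (a toy illustration with hypothetical support vectors and coefficients, not the SMO-trained model used in the paper):

```python
import math

def gaussian_kernel(x, y, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2), gamma > 0."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def svm_decision(x, support_vectors, y, alpha, b, gamma):
    """f(x) = sum_j y_j * alpha_j * K(x_j, x) + b over the support
    vectors; a sigmoid map (Platt-style) would turn this score into
    the probabilistic output."""
    return sum(yj * aj * gaussian_kernel(xj, x, gamma)
               for xj, yj, aj in zip(support_vectors, y, alpha)) + b
```

Note how the kernel value decays with the squared distance between points, so only support vectors near x meaningfully influence the decision; γ controls how fast this decay happens.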
Complex DM models, such as NN and SVM, often achieve accurate predictive performances. Yet, the increased complexity of NN and SVM makes the final data-driven model difficult for humans to understand. To open these black-box models, there are two interesting possibilities: rule extraction and sensitivity analysis. Rule extraction often involves the use of a white-box method (e.g., decision tree) to learn the black-box responses [29]. The sensitivity analysis procedure works by analyzing the responses of a model when a given input is varied through its domain [7]. By analyzing the sensitivity responses, it is possible to measure input relevance and the average impact of a particular input on the model. The former can be shown visually using an input importance bar plot and the latter by plotting the Variable Effect Characteristic (VEC) curve. Opening the black box allows explaining how the model makes its decisions and improves the acceptance of prediction models by domain experts, as shown in [20].
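The core of a one-dimensional sensitivity analysis can be sketched with a generic model callback (a simplification of the DSA/VEC machinery provided by rminer [7]; the response-range importance measure below is one simple choice, not necessarily the one used by the package):

```python
def sensitivity_curve(model, baseline, feature, levels):
    """Vary one input through `levels` of its domain, holding the
    other inputs at their baseline values, and record the model
    responses (the basis of a VEC curve)."""
    responses = []
    for v in levels:
        x = list(baseline)
        x[feature] = v
        responses.append(model(x))
    return responses

def importance(model, baseline, feature, levels):
    """A simple relevance proxy: the range of the responses obtained
    when sweeping the feature through its domain."""
    r = sensitivity_curve(model, baseline, feature, levels)
    return max(r) - min(r)
```

An input that barely changes the model's output across its whole domain receives an importance near zero, while an influential input produces a wide response range; plotting the curve itself gives the VEC view of the average effect.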

Evaluation
A class can be assigned from a probabilistic outcome by setting a decision threshold D, such that event c is true if p(c|x_k) > D. The receiver operating characteristic (ROC) curve shows the performance of a two-class classifier across the range of possible threshold (D) values, plotting one minus the specificity (x-axis) versus the sensitivity (y-axis) [11]. The overall accuracy is given by the area under the curve (AUC = ∫_0^1 ROC dD), measuring the degree of discrimination that can be obtained from a given model. AUC is a popular classification metric [21] that has the advantages of being independent of the class frequency and of specific false positive/negative costs. The ideal method should present an AUC of 1.0, while an AUC of 0.5 denotes a random classifier.
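Equivalently, the AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, which the following sketch computes directly (a pairwise illustration; practical tools compute the same quantity from ranks):

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive example
    is ranked above a randomly chosen negative one (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This rank-based view makes the two stated properties visible: only the ordering of the scores matters (class frequencies do not enter), and no misclassification costs are involved.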
In the domain of marketing, the Lift analysis is popular for assessing the quality of targeting models [3]. Usually, the population is ranked by the predicted probability of response and divided into deciles; analogously to the AUC, the area of the LIFT cumulative curve (ALIFT) can be used as a single quality measure. Given that the training data includes a large number of contacts (51651), we adopt the popular and fast holdout method (with R distinct runs) for feature and model selection purposes. Under this holdout scheme, the training data is further divided into training and validation sets by using a random split with 2/3 and 1/3 of the contacts, respectively. The results are aggregated by the average of the R runs and a Mann-Whitney non-parametric test is used to check statistical significance at the 95% confidence level.
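A point on the cumulative Lift curve can be sketched as follows: rank the clients by predicted probability and measure which share of all positive responders is captured when contacting only the top fraction (a minimal illustration with a toy ranking, not the rminer implementation):

```python
def cumulative_lift(scores, labels, fraction):
    """Share of all positive responders captured when contacting only
    the top `fraction` of clients ranked by predicted probability."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    k = max(1, int(round(fraction * len(ranked))))
    return sum(ranked[:k]) / sum(ranked)
```

In these terms, the paper's reported result means that the NN captures 79% of the subscribers at a fraction of 0.5, i.e., by contacting only the better classified half of the clients.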
In a real environment, the DSS should be regularly updated as new contact data becomes available. Moreover, client propensity to subscribe a bank product may evolve through time (e.g., due to changes in the economic environment). Hence, for achieving a robust predictive evaluation, we adopt the more realistic fixed-size (of length W) rolling windows evaluation scheme, which performs several model updates and discards the oldest data [18]. Under this scheme, a training window of W consecutive contacts is used to fit the model, and then we perform predictions for the next K contacts. Next, we update (i.e., slide) the training window by replacing the oldest K contacts with the K newest ones (related with the previously predicted contacts, but now assuming that the outcome result is known), in order to perform K new predictions, and so on.
For a test set of length L, a total of U = L/K model updates (i.e., trainings) is performed. Figure 1 exemplifies the rolling windows evaluation procedure.

Fig. 1. Schematic of the adopted rolling windows evaluation procedure.
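The window-sliding logic can be sketched as an index generator (the W, K and data-length values in the usage are hypothetical; the paper applies the scheme to the bank contact records in temporal order):

```python
def rolling_windows(n, W, K):
    """Yield (train, test) index ranges over n time-ordered records:
    a fixed-size window of W consecutive contacts fits the model, the
    next K contacts are predicted, then the window slides by K."""
    start = 0
    while start + W + K <= n:
        yield (start, start + W), (start + W, start + W + K)
        start += K
```

For example, with n = 100, W = 60 and K = 10 the generator produces four train/test pairs, i.e., U = (n - W) / K = 4 model updates, and the test ranges never overlap the data used to fit their model.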

Feature selection
The large number (150) of potentially useful features demanded a stricter choice of relevant attributes. Feature selection is often a key DM step, since it is useful to discard irrelevant inputs, leading to simpler data-driven models that are easier to interpret and that tend to provide better predictive performances [12]. In [34], it is argued that while automatic methods can be useful, the best way is to perform a manual feature selection by using problem domain knowledge, i.e., by having a clear understanding of what the attributes actually mean. In this work, we use a semi-automatic approach for feature selection based on the two steps described below.
In the first step, business intuitive knowledge was used to define a set of fourteen questions, which represent certain hypotheses that are tested. Each question (or factor of analysis) is defined in terms of a group of related attributes selected from the original set of 150 features by a bank campaign manager (domain expert). For instance, the question about gender influence (male/female) includes three features, related with the gender of the banking agent, the gender of the client and the client-agent gender difference (0 if same sex; 1 otherwise). Table 1 exhibits the analyzed factors and the number of attributes related with each factor, covering a total of 69 features (a reduction of 46%).
In the second step, an automated selection approach is adopted, based on an adapted forward selection method [12]. Given that standard forward selection is dependent on the sequence of features used and that the features related with a factor of analysis are highly related, we first apply a simple wrapper selection method that works with a DM model fed with combinations of inputs taken from a single factor. The goal is to identify the most interesting factors and the features attached to such factors. Using only training set data, several DM models are fit, by using: each individual feature related to a particular question (i.e., one input) to predict the contact result; and all features related with the same question (e.g., 3 inputs for question #2 about gender influence). Let AUC_q and AUC_{q,i} denote the AUC values, as measured on the validation set, for the model fed with all inputs related with question q and with only the i-th individual feature of question q, respectively. We assume that the business hypothesis is confirmed if at least one of the individually tested attributes achieves an AUC_{q,i} greater than a threshold T_1 and if the model with all question-related features returns an AUC_q greater than another threshold T_2. When a hypothesis is confirmed, if AUC_{q,m} > AUC_q - T_3, then only the m-th feature is selected, where AUC_{q,m} = max_i(AUC_{q,i}). Else, we rank the input relevance of the model with all question-related features in order to select the most relevant ones, such that the sum of input importances is higher than a threshold T_4.
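The threshold-based decision of this first step can be sketched as follows (the default threshold values are those fixed later in the modeling section; the exact tie-breaking details are a simplified reading of the described procedure, so treat this as illustrative):

```python
def select_factor_features(auc_q, auc_qi, importances,
                           T1=0.60, T2=0.65, T3=0.01, T4=0.60):
    """Confirm a factor (question) and pick its features.
    auc_q: validation AUC of the model with all factor features.
    auc_qi: per-feature validation AUCs. importances: per-feature
    input relevances of the all-features model."""
    best = max(auc_qi)
    if best <= T1 or auc_q <= T2:
        return []                      # hypothesis not confirmed
    if best > auc_q - T3:              # one feature is nearly as good
        return [auc_qi.index(best)]    # keep only the best single one
    # otherwise keep the most relevant features until their summed
    # importance exceeds T4
    ranked = sorted(range(len(importances)), key=lambda i: -importances[i])
    chosen, total = [], 0.0
    for i in ranked:
        chosen.append(i)
        total += importances[i]
        if total > T4:
            break
    return sorted(chosen)
```

Either branch shrinks a confirmed factor to a small feature subset, which is what keeps the subsequent forward search tractable.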
Once a set of confirmed hypotheses and relevant features is achieved, a forward selection method is applied, working on a factor-by-factor basis. A DM model is fit with training set data, using as inputs all relevant features of the first confirmed factor, and then the AUC is computed over the validation set.
Then, another DM model is trained with all previous inputs plus the relevant features of the next confirmed factor. If there is an increase in the AUC, then the current factor features are included in the next-step DM model; else, they are discarded. This procedure ends when all confirmed factors have been tested for whether they improve the predictive performance in terms of the AUC value.
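The factor-by-factor forward search can be sketched with a user-supplied evaluation callback (`evaluate` and the toy scoring in the test are hypothetical stand-ins; in the paper, each evaluation fits a NN and measures the AUC on the validation set):

```python
def forward_by_factor(factors, evaluate):
    """Greedy forward selection on a factor-by-factor basis: a factor's
    relevant features are kept only if adding them improves the
    validation AUC returned by `evaluate(feature_list)`."""
    selected, best_auc = [], 0.0
    for factor_features in factors:
        auc = evaluate(selected + factor_features)
        if auc > best_auc:
            selected += factor_features
            best_auc = auc
    return selected
```

Because factors are tested in sequence and each candidate is judged against the best AUC so far, a factor whose features add no validation improvement is simply skipped, mirroring the discard rule described above.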

Modeling
All experiments were performed using the rminer package and R tool [5] and conducted in a Linux server, with an Intel Xeon 5500 2.27GHz processor. Each DM model related with this section was executed using a total of R = 20 runs.
For the feature selection, we adopted the NN model described in Section 2.2 as the base DM model, since preliminary experiments, using only training data, confirmed that the NN provided the best AUC and ALIFT results when compared with the other DM methods. These preliminary experiments also confirmed that the SVM required much more computation than the NN, an expected result, since the memory and processing requirements of the SMO algorithm grow much faster with the size of the dataset than those of the BFGS algorithm used by the NN. At this stage, we set the number of hidden nodes using the heuristic H = round(M/2) (where M is the number of inputs), which is also adopted by the WEKA tool [34] and tends to provide good classification results [5]. The NN ensemble is composed of N_r = 7 distinct networks, each trained with 100 epochs of the BFGS algorithm.
Before executing the feature selection, we fixed the initial phase thresholds at reasonable values: T_1 = 0.60 and T_2 = 0.65, two AUC values better than the random baseline of 0.5 and such that T_2 > T_1; T_3 = 0.01, the minimum difference of AUC values; and T_4 = 60%, such that the sum of input importances accounts for at least 60% of the influence (the factors under analysis are those of Table 1). The SVM hyperparameter C (which is less relevant) was fixed using the heuristic C = 3, proposed in [5] for standardized input data. The rminer package applies the grid search by performing an internal holdout scheme over the training set, in order to select the best hyperparameter values. The obtained results for the modeling phase (using only training and validation set data) are shown in Table 3 (the selected input features are described in Table 2). Nevertheless, given that a simpler model was selected (i.e., H = 6), we opt for such a model in the remainder of this paper.

Table 3. Comparison of DM models for the modeling phase (bold denotes best value).

Predictive knowledge and potential impact
The best model from the previous section (NN fed with the 22 features of Table 2) was then evaluated under the rolling windows scheme. The left of Figure 2 plots the ROC curves for the four models tested. A good model should offer the best compromise between a desirably high true positive rate (TPR) and a low false positive rate (FPR). The former goal corresponds to a sensitive model, while the latter is related with a more specific model. When comparing the best proposed model (NN) in terms of the modeling versus rolling windows phases, there is a decrease in performance, with a reduction in AUC from 0.929 to 0.794 and in ALIFT from 0.878 to 0.672. However, such a reduction was expected, since in the modeling phase the feature selection was tuned based on validation set errors, while the best model was then fixed (i.e., 22 inputs and H = 6) and tested on completely unseen and more recent data. Moreover, the obtained AUC and ALIFT values are much better than the random baseline of 50%.

Explanatory knowledge
In this section, we show how explanatory knowledge can be extracted by using sensitivity analysis and rule extraction techniques (Section 2.2) to open the data-driven model. Using the Importance function of the rminer package, we applied the Data-based Sensitivity Analysis (DSA) algorithm, which is capable of measuring the global influence of an input, including its interactions with other attributes [7]. The DSA algorithm was executed on the selected NN model, fitted with all training data (51651 oldest contacts). Figure 4 exhibits the respective input importance bar plot (the attribute names are described in more detail in Table 2). A DT was also applied to the output responses of the NN model that was fitted with all training data. We set the DT complexity parameter to 0.001, which allowed the fitting of a DT with a low error, obtaining a mean absolute error of 0.03 when predicting the NN responses. A large tree was obtained and, to simplify the analysis, Figure 5 presents the obtained decision rules up to six decision levels. An example of an extracted rule is: if the number of employed is equal to or higher than 5088 thousand and the duration of previously scheduled calls is less than 13 minutes and the call is not made in March, April, October or December, and the call is inbound, then the probability of success is 0.62. In Figure 5, decision rules that are aligned with the sensitivity analysis are shown in bold and are discussed in the next paragraphs. An interesting result shown by Figure 4 is that the three month Euribor rate (euribor3m), computed by the European Central Bank (ECB) and published by Thomson Reuters, i.e., a publicly available and widely used index, was considered the most relevant attribute, with a relative importance of around 17%. Next comes the direction of the phone call (inbound versus outbound), followed by the bank agent experience and the difference between the best product rate offered and the national average. The euribor3m and the rate difference are the two attributes from the top five which are not specifically related to the call context, so they will be analyzed together further ahead.
Last in the top five attributes comes the duration of previous calls that needed to be rescheduled to obtain a final answer from the client. It is also interesting to notice that the top ten attributes found by the sensitivity analysis (Figure 4) are also used by the extracted decision tree, as shown in Figure 5.
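The idea of learning a white-box surrogate from the black-box responses can be illustrated with a minimal one-split regression "stump" (the paper grows a full rpart/CART tree over the NN outputs; this sketch only finds the single best IF-THEN split, which is enough to show the mechanism):

```python
def fit_stump(X, responses):
    """Pedagogical rule extraction: learn a one-split 'tree' that best
    mimics a black-box model's responses by minimizing the squared
    error of the two leaf means. Returns (feature, threshold,
    left_mean, right_mean), i.e. IF x[feature] >= threshold THEN
    predict right_mean ELSE left_mean."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [r for x, r in zip(X, responses) if x[j] < t]
            right = [r for x, r in zip(X, responses) if x[j] >= t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - ml) ** 2 for r in left)
                   + sum((r - mr) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    _, j, t, ml, mr = best
    return j, t, ml, mr
```

Applied recursively, such splits yield the hierarchy of IF-THEN rules of Figure 5; here, the leaf means play the role of the success probabilities attached to each extracted rule.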

Concerning the sensitivity analysis input ranking, one may also take into consideration the conventional expectation that a higher Euribor rate, being associated with more attractive deposit interest rates, should favor deposit subscription [27]. Still, Figure 6 reveals the opposite, with a lower Euribor corresponding to a higher probability of deposit subscription, and the same probability decreasing as the three month Euribor increases. A similar effect is visible in a decision node of the extracted DT (Figure 5), where the probability of success decreases by 10 pp when the Euribor rate is higher than 0.73. This behavior is explained by more recent research [30], which revealed that while prior to 2008 a weak positive relation could be observed between the offered rate for deposits and the savings rate, after 2008, with the financial crisis, that relation reversed, turning clients more prone to savings while the Euribor constantly decreased. This apparent contradiction might be due to clients' perception of a real economic recession and social depression.
Consumers might feel an increased need to consider saving for the future, as opposed to the immediate gratification coming from spending money on desired products or services. This observation supports the inclusion of this kind of information in similar DM projects. Concerning the difference between the best product rate offered and the national average, Figure 6 confirms our expectation that an increase in this attribute does increase the probability of subscribing a deposit. Still, once the difference reaches 0.73%, the influence on the probability of subscription is highly reduced, which means that an interest rate slightly above the competition seems to be enough to make the difference in the result. It is also interesting to note that the extracted DT reveals a positive effect of the rate difference on a successful contact (Figure 5).
The right of Figure 6 shows the influence of the second, third and fifth most relevant attributes. Regarding call direction, we validate that clients contacted through inbound calls are keener to subscribe the deposit. A similar effect is measured by the extracted DT, where an inbound call increases the probability of success by 25 pp (Figure 5). Inbound is associated with less intrusiveness, given that the client has called the bank and thus he/she is more receptive to a sale. Another expected outcome is related with agent experience, where the knowledge extraction results show that it has a significant impact on a successful contact. Quite interestingly, a few days of experience are enough to produce a strong impact, given that under the VEC analysis, with just six days the average probability of success is above 50% (Figure 6), and the extracted DT increases the probability of a successful sale by 9 pp when the experience is equal to or higher than 3.3 days (Figure 5). Regarding the duration of previously scheduled calls, it happens often that the client does not decide on the first call whether or not to subscribe the deposit, asking to be called again, thus rescheduling another call. In those cases (63.8% of the whole dataset), a contact develops through more than one phone call. The sensitivity analysis (Figure 6) shows that more time already spent on past calls within the same campaign increases the probability of success. Similarly, the extracted DT confirms a positive effect of the duration of previous calls: when the duration is equal to or higher than 13 minutes (left node at the second level of Figure 5), the associated global probability of success increases.

Conclusions

In this study, we propose a personal and intelligent DSS that uses a data mining (DM) approach for the selection of bank telemarketing clients.
Two knowledge extraction techniques were also applied to the proposed model: a sensitivity analysis, which ranked the input attributes and showed the average effect of the most relevant features on the NN responses; and a decision tree, which learned the NN responses with a low error and allowed the extraction of decision rules that are easy to interpret. As an interesting outcome, the three month Euribor rate was considered the most relevant attribute by the sensitivity analysis, followed by the call direction (outbound or inbound), the bank agent experience, the difference between the best possible rate for the product being offered and the national average rate, and the duration of previous calls that needed to be rescheduled to obtain a final answer from the client.
Several of the extracted decision rules were aligned with the sensitivity analysis results and make use of the top ten attributes ranked by the sensitivity analysis. The obtained results are credible for the banking domain and provide valuable knowledge for telemarketing campaign managers. For instance, we confirm the result of [30], which claims that the financial crisis changed the way the Euribor affects the savings rate, turning clients more likely to save while the Euribor decreased. Moreover, inbound calls and increases in the other highly relevant attributes (i.e., difference in the best possible rate, agent experience or duration of previous calls) enhance the probability of a successful deposit sale.
In future work, we intend to address the prediction of other relevant telemarketing variables, such as the duration of the call (which highly affects the probability of a successful contact [23]) or the amount that is deposited in the bank. Additionally, the dataset may provide historical telemarketing behavior for cases where clients have previously been contacted. Such information could be used to enrich the dataset (e.g., by computing recency, frequency and monetary features) and possibly provide new valuable knowledge to improve model accuracy. Also, it would be interesting to consider splitting the sample according to two sub-periods of time within the range 2008-2012, which would allow analyzing the impact of the hard-hit recession versus the slow recovery.