Data science strategies leading to the development of data scientists’ skills in organizations

The purpose of this paper is to compare the strategies of companies with data science practices and methodologies and to identify the data specificities/variables that can influence the definition of a data science strategy in pharma companies. The paper is an empirical study, and the research approach consists of testing, through a set of statistical tests, the differences between companies with a data science strategy and companies without one. We designed a specific questionnaire and applied it to a sample of 280 pharma companies. The main findings are based on the analysis of the following variables, compared between companies with and without a data science strategy: overwhelming volume, managing unstructured data, data quality, availability of data, access rights to data, data ownership issues, cost of data, lack of pre-processing facilities, lack of technology, shortage of talent/skills, privacy concerns and regulatory risks, security, and difficulties of data portability. The paper offers an in-depth comparative analysis between companies with and without a data science strategy. The key limitation concerns the literature review: as a consequence of the novelty of the theme, there is a lack of scientific studies regarding this specific aspect of data science. In terms of the practical business implications, an organization with a data science strategy will have better direction and management practices, as the decision-making process is based on accurate and valuable data, but it needs data scientists' skills to fulfil those goals.


Introduction
Data science in the pharma industry is a new topic, and it can bring significant advantages for early adopters, because making decisions based on accurate, high-quality data is a competitive advantage for companies.
The access to high-quality and large datasets combined with data science techniques [2,32] will optimize business processes and the health sector. Data science can be a transformative driver of the health sector, namely the health tech industry (creating new data science applications to analyse and visualize in an optimal way the big data available for all the stakeholders of the health system); healthcare providers (as they are the main drivers for deploying better health services for citizens); pharmacists (adjusting themselves to the needs of healthcare providers and citizens in general, making decisions on new product research and development to improve the quality of life); and other stakeholders involved in all health processes, sharing big data [27].
Central concerns with any big data-relevant venture are the privacy of data and ethics, though the use of big data can effectively produce better outcomes and more tailored responses with improved quality of life. However, personal data are sensitive, and legal and ethical issues need to be considered when using data science to analyse this kind of data.
In the European Union, the definition of specific policies and technologies to enable data science as a base for big data analysis [21] will facilitate the creation of global value chains in the health sector (pharma, healthcare providers, citizens, and all other health sector stakeholders), and it will contribute to digital single-market strategy. The value created by data science [6][7][8] can help to transform the health sector to increase its quality, decrease costs, and improve accessibility for all citizens.
The novelty of this research from the perspective of neural computing and applications (NCAA), besides analysing the pharma industry regarding the data science strategy of pharma companies and taking it as a pilot example for other companies and policymakers, is based on the presupposition that data science methods are unquestionably fundamental to produce information for better decision-making processes. In this context, data scientists use neural network mechanisms to increase attention, learning, and decision-making, supported by applications that help them in those processes. This is reflected in the goals of this research, which identifies the need for data scientists to develop specific skills to support with more sustainability and accuracy the data science strategies and models implemented by organizations.
The paper is organized as follows. Section 2 provides the literature review of the relevant research based on data science methods. Section 3 presents the methodology. Section 4 presents the analysis and discussion, and Sect. 5 concludes the paper and presents the implications for future research.

The literature review on data science
Data science, as the name implies, is the science of understanding, analysing, and transforming data through statistical and computational methodologies to solve real-life problems. As suggested by Hayashi [16], the main purpose of data science is to use knowledge of well-established theory and methods in supporting sciences like statistics, data mining, and machine learning to resolve and unmask the hidden structures or features in data and provide solutions for complex natural or social phenomena.
In recent times, the term 'big data' has often appeared in connection with data science. However, it has been observed that data science problems may not always require the use of big data. Alternatively, the use of big data in conjunction with data science methodologies may prove useful in solving some specific problems. Similarly, as observed in the recent literature, deep-learning methodologies are often applied instead of traditional machine learning techniques when dealing with big data in a data science project, as their use with big data has shown better results than most traditional machine learning approaches.
Data science has attracted intense and growing attention from significant health and life sciences organizations [13], including big pharmaceutical companies that maintain a traditional data-oriented scientific and clinical development approach for various parts of the business and management processes, where data are not shared across different departments like market access or marketing.
Progressing digital transformation stimulates a considerable growth of digital data. Data are an asset for any business organization, and having the capacity to understand all the connected trends and patterns and extract meaningful information and knowledge from data is referred to as data science [12].
The field of data science technology is enriched by knowledge and techniques from statistics as well as computing. On the one hand, data science refers to traditional statistics [9] that are produced for argumentation analysis or specific, methodical problems, with additional capacity for exploratory analysis and the integration of data crunching and data mining. On the other hand, data science technologies are also borrowed from the field of software development, which has a strong basis in traditional platforms like data warehouses. This field thus can aggregate and maintain various large datasets for data management and storage on distributed development platforms through distributed computation or integration software [33].
The discovery of adaptive solutions through big data using data science technologies for long-standing pharmaceutical processes is often labelled Pharma 4.0. As discussed by Steinwandter et al. [24], cyber-physical systems and dark factories, which are well-known concepts in Pharma 4.0, often require data science techniques and tools as core components to resolve long-standing pharmaceutical issues and have been found to be useful.
Deep learning and big data analyses have recently gained much attention in leading medical sciences and pharma research, as observed by Tariq et al. [29]. This is reflected in the initiative of many medical and pharmaceutical organizations to gather large amounts of relevant data in their domain. The recent research on the possible application of data science processes to solve long-standing pharmaceutical problems, and the growing trend in the industry to utilize big data and deep-learning technologies to better understand various processes and make knowledgeable decisions, are the motivations for our study. The adoption of these technologies requires a better understanding of their advantages and disadvantages, and hence, this research is focused on identifying and studying the factors involved.
It is fundamental for the strategic decision-making process of a pharma organization to identify challenges, capitalize on opportunities, and predict future trends and behaviours of healthcare professionals (HCPs), key opinion leaders (KOLs), and other stakeholders (Grom, 2013). For this purpose, many data science techniques [32] can be applied, such as the ones grouped below:
a. Descriptive statistical analysis, which is used to summarize data from a sample using parameters such as mean, median, standard deviation, variance, and others.
b. Inferential statistical analysis, which is used to draw conclusions from data through hypotheses (null and alternative hypotheses).
c. Predictive analytics, which uses predictive algorithms and machine learning techniques to define the probability of future results, behaviours, and patterns, based on existing data.
d. Prescriptive analytics, which aims to find the optimal recommendations for a decision-making process.
e. Causal analysis, which searches for data to understand the causes.
f. Exploratory data analysis, which is an alternative to inferential statistics, and its emphasis is on detecting general trends and patterns in the data and tracking associations.
g. Mechanistic analysis, which is used in big data analysis in industries.
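As a minimal illustration of technique (a), the sketch below computes the summary parameters listed there with Python's standard library. The sample values and the variable name `prescriptions` are hypothetical and are not taken from the survey.

```python
import statistics

# Hypothetical sample: monthly prescriptions recorded for ten physicians.
prescriptions = [12, 15, 11, 18, 15, 20, 14, 15, 13, 17]

summary = {
    "mean": statistics.mean(prescriptions),
    "median": statistics.median(prescriptions),
    "variance": statistics.variance(prescriptions),  # sample variance (n - 1)
    "std_dev": statistics.stdev(prescriptions),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```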
The understanding and knowledge of these techniques are essential for data scientists as their tasks involve complex data analysis and require a high level of understanding of learning processes to be used accurately in organizations, as discussed by [5].
Data science is gaining ground in all pharma companies for the efficient use of resources (storage and time) and for efficient decision-making that exploits new methods and procedures. The critical challenges are the management of exponentially growing data, its meaningful analysis, and the deployment of low-cost processing tools and practices while minimizing potential risks relating to safety, inconsistency, redundancy, and privacy. In this context, various variables need to be considered by organizations to define a data science strategy [6][7][8].
The volume of data generated by sources such as patients, hospitals, physicians, suppliers, and others nowadays is considerable, and to obtain valuable information from such enormous and heterogeneous data requires a multimodal learning methodology to make the insights from such combined information available to decision-makers and policymakers.
In the pharma industry, there is a high percentage of unstructured, internal data [1] from liaisons with stakeholders and from products. Furthermore, the use of external data such as lifestyle information for research, development, and marketing is vital to gain useful information and define future strategies. The identification of new trends and research and the development of new products could also be facilitated through this insight into data [17]. An optimal analytical approach should, as much as possible, generate recognizable patterns to allow for cross-checking results and enabling trust in the solutions.
Data quality is a significant issue as decisions are taken based on data that is sensitive, and there is a great responsibility and expectation regarding data accuracy and the quality of analytics tools [28].
Availability of and access to data [1] should also enable expert-driven self-service analytics to allow experts to control the analytics process. However, data are excessive and ever growing. There are several repositories, and new data are being generated daily by billions of connected devices or self-generated by people. It is necessary to find more appropriate and effective ways to leverage this data according to privacy and ethical principles, to access it, to understand the purposes for its use [11], and to improve and optimize the quality of the processes.
Data portability involves the transfer of data in a structured, commonly used, and machine-readable format from one organization (controller) to another organization [34]. The increase in data captured through the Internet and the Internet of Things needs to be dealt with, keeping in mind the regulations of data ownership standards [31]. There is a need to raise awareness and trigger debate for policymakers and develop data protection and privacy laws and legislation to protect patients and companies.
The literature also discusses the cost of data, not only the cost of online data storage (cloud computing) but also the cost of gathering data and analysing it and the cost of using data to create innovations [12] and advance society and the quality of life of citizens.
Another issue discussed in the literature is the lack of pre-processing [18] facilities in data mining processes. Data pre-processing may require techniques that involve transforming raw data into any required and machine-understandable format. Real-world data are often incomplete, inconsistent, and likely to contain errors. Data pre-processing prepares raw data for further processing and is used in database-driven applications such as customer relationship management and rule-based applications (like neural networks). This process is fundamental for the pharma industry to handle data gathered from diverse sources.
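A minimal sketch of such a pre-processing step is given below. The records, field names, and cleaning rules (label normalization, discarding negative counts, median imputation) are hypothetical and chosen only for illustration; they are not the procedure used in this study.

```python
from statistics import median

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"country": "Portugal ", "units_sold": "120"},
    {"country": "portugal", "units_sold": None},   # incomplete record
    {"country": "SPAIN", "units_sold": "95"},
    {"country": "Spain", "units_sold": "-4"},      # likely an entry error
]

def preprocess(records):
    """Normalize labels, discard invalid values, and impute missing ones."""
    cleaned = []
    for rec in records:
        country = rec["country"].strip().title()   # resolve inconsistent labels
        raw = rec["units_sold"]
        units = int(raw) if raw is not None else None
        if units is not None and units < 0:        # treat negative counts as errors
            units = None
        cleaned.append({"country": country, "units_sold": units})
    # Impute missing values with the median of the observed values.
    observed = [r["units_sold"] for r in cleaned if r["units_sold"] is not None]
    fill = median(observed)
    for r in cleaned:
        if r["units_sold"] is None:
            r["units_sold"] = fill
    return cleaned

clean_records = preprocess(raw_records)
```

Median imputation is only one possible choice; depending on the data source, dropping incomplete records may be more appropriate.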
It is important to note here that when we work with big data, we are still missing specific technology, mainly applications that help to analyse the data [21] and make it readable for all stakeholders.
When it comes to medical data, privacy and security are of primary concern [23] as medical data are highly sensitive information; there are strong regulations at the national and European Union level. The preservation of privacy in the practical implementation of a data science strategy also requires specific analytical tools and cybersecurity systems.
Stakeholders are the actual beneficiaries of data science's potential and its facilitation in making the decision process more efficient. However, there is still a market gap in the skills needed to treat raw data and transform it into knowledge to support research, development [5], and other functions of a pharma company.
Finally, it is essential to discuss the importance of regulatory risks regarding the protection and security of data [23] to make sure individuals with dubious intentions do not access data.
From the above exploration of the literature, we have gathered the most relevant variables regarding pharmaceutical business data requirements; these will guide our empirical research and help us formulate the answer to the main research question, identified below:
Research Question: What are the main differences between pharma companies with and without a data science strategy, and what is data science's relation to the shortage of skills?

Methodological approach
To answer the research question identified at the end of Sect. 2, we have applied an approach based on quantitative methods, supported by a questionnaire to identify differences in data science challenges and framework conditions among organizations with or without a data science strategy.
The information was collected via a structured questionnaire that was prepared after a review of the literature. A convenience sample was used (non-probabilistic sampling procedure). When a complete sample is difficult to obtain, convenience sampling is suitable [19,20]. The fieldwork was carried out between April and June of 2019 with a group of 280 individuals. To improve the representativeness of the data, we selected individuals from companies around the world; for a confidence level of 95% (and p = q = 0.5), the sampling error for the estimate of a proportion is 5.8%. Table 1 summarizes the information regarding the data collection and the technical details of the sample.
The sample includes pharma workers from all over the world, as shown in Table 2.
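The reported sampling error can be approximated with the usual margin-of-error formula for a proportion, e = z * sqrt(p * q / n). The sketch below assumes simple random sampling from a large population with no finite-population correction; this is an assumption of the illustration (the study itself used a convenience sample), and the function name `sampling_error` is hypothetical.

```python
import math
from statistics import NormalDist

def sampling_error(n, p=0.5, confidence=0.95):
    """Margin of error for a proportion, assuming simple random sampling
    from a large population (no finite-population correction)."""
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

error = sampling_error(280)
print(f"{error:.2%}")
```

For n = 280 and p = q = 0.5, this gives a value slightly below 5.9%, close to the 5.8% reported for the sample.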

Data analysis and discussion
To analyse the differences between organizations that have a data science strategy and those that do not, an analysis of covariance (ANCOVA) was carried out. The Shapiro-Wilk test was carried out, and the null hypothesis of normality was rejected. Rheinheimer and Penfield [22] state that Snedecor's F is robust to violations of normality. Tabachnick and Fidell [30] also state (p. 204) that the robustness of the analysis of variance is assured with large samples, with approximately 20 degrees of freedom. Blanca et al. [4] also support this conclusion. The multivariate design with covariates is intended to reduce the error caused by other covarying variables, such as the psychological variable, where data science influences the creation of innovations. In this way, the variance due to individual differences was estimated from the regression of the dependent variable on the covariate. The scores on the dependent variable were statistically adjusted for the covariate. Finally, an ANOVA was performed on these adjusted scores [30]. Thus, the analysis controls for the effect of the covariate by removing the variation it explains from the ANOVA error term.
To investigate technological differences, the following hypotheses were analysed.
$$H_0:\ \bar{X}_{\text{adjusted, Data Science}} = \bar{X}_{\text{adjusted, No Data Science}}$$

The adjusted mean is obtained from the following expression:

$$\bar{X}_{j,\text{adjusted}} = \bar{X}_j - b\,(\bar{C}_j - \bar{C})$$

where $\bar{X}_{j,\text{adjusted}}$ is the adjusted mean of the dependent variable in the jth group, $\bar{X}_j$ is the mean of the dependent variable in the jth group without the adjustment, $b$ is the common regression slope, $\bar{C}_j$ is the covariate mean in the jth group, and $\bar{C}$ is the total mean (of all groups) of the covariate.
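The adjustment described above can be sketched as follows: each group mean of the dependent variable is corrected by the common slope times the distance between the group's covariate mean and the overall covariate mean. The sketch assumes that the common slope is the pooled within-group regression slope, as is standard in ANCOVA; the function name `adjusted_means` and the toy data are hypothetical and do not come from the survey.

```python
from statistics import mean

def adjusted_means(groups):
    """groups: dict mapping group name -> list of (covariate, dependent) pairs.
    Returns a dict mapping group name -> covariate-adjusted mean."""
    # Pooled within-group slope: summed cross-products over summed squares.
    sxy = sxx = 0.0
    for pairs in groups.values():
        c_bar = mean(c for c, _ in pairs)
        y_bar = mean(y for _, y in pairs)
        sxy += sum((c - c_bar) * (y - y_bar) for c, y in pairs)
        sxx += sum((c - c_bar) ** 2 for c, _ in pairs)
    b = sxy / sxx
    # Overall covariate mean across all groups.
    grand_c = mean(c for pairs in groups.values() for c, _ in pairs)
    return {
        name: mean(y for _, y in pairs) - b * (mean(c for c, _ in pairs) - grand_c)
        for name, pairs in groups.items()
    }

# Hypothetical toy data: (covariate, challenge score) per respondent.
toy = {
    "data_science": [(1, 3.0), (2, 4.0), (3, 5.0)],
    "no_data_science": [(3, 3.0), (4, 4.0), (5, 5.0)],
}
adj = adjusted_means(toy)
```

In this toy example both groups have the same raw mean, yet their adjusted means differ once the difference in covariate levels is taken into account.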
Table 3 presents the adjusted means for each group, together with the F-statistics and p values. The analysis shows that there are statistically significant differences in all variables related to challenges (p value < 0.01 in all cases), always showing a higher score in organizations with data science.
Following this, an exploratory factor analysis of the variables related to challenges was carried out to see which factors could be extracted.
The extraction method was principal component analysis, and varimax was used as the rotation procedure. The first factor obtained explained 63.965% of the total variance of the matrix of challenges; this dimension has eight items and is classified as the data management dimension. The second factor extracted explained 10.908% of the total variance, has five items, and is called the data technostructure dimension. Together, these two factors explain 74.874% of the total variance (Table 4). The Kaiser-Meyer-Olkin index is 0.762, which is higher than 0.7, and the Bartlett test of sphericity is fulfilled, with a chi-square of 4633.704 (p value < 0.001).
It was also important, based on the factors obtained (data management and data technostructure), to validate the scale using confirmatory factor analysis (Table 5). Cronbach's alpha (α) is higher than 0.7 [10], the composite reliability index (CRI) is higher than 0.7 [14], and the average variance extracted (AVE) is higher than 0.5 [14]. The measures of validity are also adequate: the standardized loadings are higher than 0.5, and their means are higher than 0.7 (Hair et al. [15]). Moreover, the confidence intervals of the correlations lie below 1 [3] (Table 6).
Therefore, we have validated the scale for data management with eight indicators and the scale for data technostructure with five indicators, the lack of technology and the shortage of talent/skills having the highest loadings.
Next (Table 7), we analysed globally the differences that may exist in data management and data technostructure between organizations with and without a data science strategy.
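The reliability and convergent validity indices mentioned above can be computed as sketched below. The formulas are the standard ones for Cronbach's alpha, composite reliability, and AVE from standardized loadings; the function names, item scores, and loadings are hypothetical, since the values of Table 5 are not reproduced here.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, all for the same respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

def composite_reliability(loadings):
    """Composite reliability from standardized factor loadings."""
    s = sum(loadings)
    error = sum(1 - l ** 2 for l in loadings)         # summed error variances
    return s ** 2 / (s ** 2 + error)

def average_variance_extracted(loadings):
    """Mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)

# Hypothetical data: three items answered by five respondents.
items = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]
loadings = [0.8, 0.7, 0.9]                            # hypothetical loadings

alpha = cronbach_alpha(items)
cri = composite_reliability(loadings)
ave = average_variance_extracted(loadings)
```

With these illustrative values, all three indices exceed the thresholds cited in the text (0.7 for alpha and composite reliability, 0.5 for AVE).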
In Table 7, it is possible to verify that the organizations with a data science strategy have the best performance in both management and technostructure. Therefore, in terms of practical business implications, an organization with a data science strategy will have better direction and management practices, as the decision-making process is based on accurate and valuable data and on the data science skills of the workers.
Once the measurement model was validated, it was essential to analyse, in an exploratory way, whether there is a relationship between data technostructure and data management depending on whether the organization has a data science strategy. Table 8 shows the standardized coefficients used to analyse this structural relationship ([14], 1982). The structural equation models were estimated by maximum likelihood with the Satorra-Bentler correction, which is robust to non-normal data [25,26].
Based on this analysis, it is possible to conclude that there is empirical evidence of the existence of a robust positive relationship between data technostructure and data management in organizations that have data science. However, the relationship is not statistically significant if the organization does not have a data science strategy (Fig. 1), which is justified by the fact that they are not focussed on that type of strategy and that they have not reached the maturity to understand the importance of having data science skills for data analysis.

Conclusions
This study presents an analysis of the application of data science to the pharma industry. Data science is a new interdisciplinary science that requires strong practical ability and an adaptive organizational culture to effectively implement the described techniques and models to support the pharma industry in daily activities. The review is conducted along two major dimensions, data technostructure and data management, as these are the two main components of a data science strategy. From the survey results, we can conclude that companies essentially need to empower their data technostructure for better results. It was also observed that most of the companies had a high and increasing interest in data management. Many pharma companies are in the process of realizing the importance of the skills needed by a data scientist and of implementing data science in their analytics processes. According to the study, there is empirical evidence of the relationship between data technostructure and data management, as they need to be defined and managed as core dimensions for the competitiveness of the pharma industry. For future work, we intend to carry out a survey with medical affairs practitioners and compare the collected data with the results of this study. It would be interesting to disaggregate the concepts of data management and data technostructure further and perform a deeper analysis.

Declaration
Conflict of interest The authors declare that there is no conflict of interest.