As an analytical approach, decision-making is the process of finding the best option from all feasible alternatives. The application of decision- making process in economics, management, psychology, mathematics, statistics and engineering is obvious and this process is an important part of all science-based professions. Proper management and utilization of valuable data could significantly increase knowledge and reduce cost by preventive actions, whereas erroneous and misinterpreted data could lead to poor inference and decision-making. This paper presents a class of practical methods to analyze high-dimensional event history data to reduce redundant information and facilitate practical interpretation through variable inefficiency recognition. In addition, numerical experiments and simulations are developed to investigate the performance and validation of the proposed methods.
Analytics data driven decision-making can substantially improve management decision-making process. In social science areas such as economics, business and management, decision-making is increasingly are based on the type and size of data, as well as analytic methods. It has been suggested that new methods to collect, use and interpret data should be developed to increase the performance of the decision makers (Lohr, S., 2012) (Brynjolfsson, E., 2012).
In the fields of economics, business and management, analyzing the collected data from different sources such as financial reports and consequently determining effective explanatory variables, specifically in complex and high-dimensional event history data provide an excellent opportunity to increase efficiency and reduce costs.
In economics, term event history analysis is used as an alternative to time-to-event analysis which has been used widely in the social sciences where interest is on analyzing time to events such as job changes, marriage, birth of children and so forth (Lee, E. T., and Wang, J. W., 2013). Some aspects make difficulty in analyzing this type of data using traditional statistical models. Dimensionality and non-linearity are among those (Allison, Paul D., 1984). Analysis of datasets with high number of explanatory variables requires different approaches and variable selection techniques could be used to determine a subset of variables that are significantly more valuable to (Yao, F., 2007) (Hellerstein, J., 2008) (Segaran, T., and Hammerbacher, J., 2009) (Feldman, D. et al., 2013) (Manyika, J. et al., 2011) (Moran, J., 2013) (Brown, B. et al., 2011).
The purpose of this study is to design a procedure including a class of methods for variable reduction via determining variable inefficiency in high-dimensional event history analysis where variable efficiency refers to the effect of a variable on event history data. As an outline, the concept of decision-making process, event history analysis, and relevant data analysis techniques are presented in Section 2. The logical model for the transformation of the explanatory variable dataset is proposed and three multidisciplinary variable selection methods and algorithms through variable efficiency are designed in Section 3. The results and comparison of results with well-known methods and simulation patterns are presented in Section 4. Finally, concluding remarks, including the advantages of the proposed methods are discussed in Section 5. The computer package that we use in this research is the MATLAB® R2011b programming environment.
In this section, applied introductions to decision-making process and event history analysis as well as data analysis techniques are presented.
Decision-making theories are classified based on two attributes: (a) Deterministic, which deals with a logical preference relation for any given action or Probabilistic, which postulate a probability function instead, and (b) Static, which assume the preference relation or probability function as time-independent or Dynamic which assume time-dependent events (Busemeyer, J. R., and Townsend, J. T., 1993). Historically, the Deterministic-Static decision-making is more popular decision-making process specifically under uncertainty. The assumption of decision-making in this study falls in this category as well.
As a process of making choices by setting objectives, gathering information, and assessing alternative choices in a decision-making process, broadly includes seven steps: (1) Defining the decision, (2) Collecting information, (3) Identifying alternatives, (4) Evaluating the alternatives, (5) Selecting best alternative(s), (6) Taking action, (7) Review decision and consequences (Busemeyer, J. R., and Townsend, J. T., 1993).
A major part of decision-making involves the analysis of a finite set of alternatives described in terms of evaluative criteria. The mathematical techniques of decision-making are among the most valuable factors of this process, which are generally referred to as realization in the quantitative methods of decision-making (Sadeghzadeh, K., and Salehi, M. B., 2010). With the increasing complexity and the variety of decision-making problems due to the huge size of data, the process of decision-making becomes more valuable (Brynjolfsson, E., 2012).
A brief review of event history analysis concept and definition of survival function is following.
Event history analysis consider the time until the occurrence of an event. The time can be measured in days, weeks, years, etc. Event history analysis is also known as time-to-event analysis which generally defined as a set of methods for analyzing such data where subjects are usually followed over a specified time period. Event history (time-to-event data) analysis has been used widely in the social sciences such as felons’ time to parole in criminology, duration of first marriage in sociology, length of newspaper or magazine subscription in marketing and worker’s compensation claims in insurance (Lee, E. T., and Wang, J. W., 2013) (Hosmer D. W. Jr., and Lemeshow, S., 1999) (Kalbfleisch, J. D., and Prentice, R. L., 2011).
Methods to analyze event history data can be categorized in parametric, semi-parametric and nonparametric methods. Parametric methods are based on survival function distributions such as exponential. Semi-parametric methods don’t assume knowledge of absolute risk and estimates relative rather than absolute risk and this assumption is called the proportional hazards assumption. For moderate- to high-dimensional covariates, it is difficult to apply semi-parametric methods (Huang, J., Ma, S., and Xie, H, 2006). In nonparametric methods which are useful when the underlying distribution of the problem is unknown, there are no math assumptions. Nonparametric methods are used to describe survivorship in a population or comparison of two or more populations. The Kaplan-Meier Product Limit estimate is a nonparametric method which is the most commonly used nonparametric estimator of the survival function and has clear advantages since it does not require an approximation that results the division of follow-up time assumption (Lee, E. T., and Wang, J. W., 2013) (Holford, T. R., 2002).
The probability of the event occurring at time t is
In event history analysis, information on an event status and follow up time is used to estimate a survival function S(t), which is defined as the probability that an object survives at least until time t:
From the definition of the cumulative distribution function:
Accordingly survival function is calculated by probability density function as:
In most applications, the survival function is shown as a step function rather than a smooth curve. Nonparametric estimate of $S(t)$ according to Kaplan-Meier (KM) estimator for distinct ordered event times $t_1$ to $t_n$ is:
Where at each event time $t_j$ there are $n_j$ subjects at risk and $d_j$ is the number of subjects which experienced the event.
A review of relevant used data analysis techniques in this study including discretization process as well as data reduction and variable selection methods is presented next.
Discretization Process
Variables in a dataset potentially are a combination format of different data types such as dichotomous (binary), nominal, ordinal, categorical, discrete, and continuous (Interval). There are many advantages of using discrete values over continuous as discrete variables are easy to understand and utilize, more compact and more accurate. Quantizing continuous variables is called discretization process.
In the splitting discretization methods, continuous ranges are divided into sub-ranges by the user specified width considering range of values or frequency of the observation values in each interval, respectively called equal-width and equal-frequency. A typical algorithm for splitting discretization process which quantifies one continuous feature at a time generally consists of four steps: (1) sort the feature values, (2) evaluate an appropriate cut-point, (3) split the range of continuous values according to the cut-point, and (4) stop when a stopping criterion satisfies.
In this study, discretization of explanatory variables of event history dataset assumed unsupervised, static, global and direct in order to reach a top-down splitting approach and transformation of all types of variables in dataset into a logical (binary) format. Briefly, static discretization is dependent of classification task, global discretization uses the entire observation space to discretize, and direct methods divide the range of k intervals simultaneously. For a comprehensive study of discretization process, see (Liu, Huan, et al., 2002).
Data Reduction and Variable Selection Methods
Data reduction techniques are categorized in three main strategies, including dimensionality reduction, numerosity reduction, and data compression (Han, J. et al, 2006) (Tan, P. et al., 2006). Dimensionality reduction as the most efficient strategy in the field of large-scale data deals with reducing the number of random variables or attributes in the special circumstances of the problem. All dimensionality reduction techniques are also classified as feature extraction and feature selection approaches. Feature Extraction is defined as transforming the original data into a new lower dimensional space through some functional mapping such as PCA and SVD (Motoda, H., and Huan, L., 2002) (Addison, D. et al., 2003). Feature selection is denoted as selecting a subset of the original data (features) without a transformation in order to filter out irrelevant or redundant features, such as filter methods, wrapper methods and embedded methods (Saeys, Y. et al., 2007) (Guyon, I., and Elisseeff, A., 2003).
Variable selection is a necessary step in a decision-making process dealing with a large-scale data. There is always uncertainty when researchers aim to collect most important variables specifically in the presence of big data. Variable selection for decision-making in many fields is mostly guided by expert opinion (Casotti, M., n.d.). The computational complexity of all the possible combinations of the p variables from size 1 to p, could be overwhelming, where the total number of combinations are 2𝑝𝑝−1. For example, for a dataset of 20 explanatory variables, the number all possible combinations is 220−1= 1048575.
Next section presents proposed methodology for multidisciplinary decision-making approach based on proposed analytical model, designed methods and heuristic algorithms for explanatory variable subset selection.
In this section, first proposed analytical model for transformation of the explanatory variable dataset to reach the logical representation as a sort of binary variables is presented. Next, in order to select most significant variables in terms of inefficiency, designed variable selection methods and heuristic clustering algorithms are introduced.
A multipurpose and flexible model for a type of event history data with a large number of variables when the correlation between variables is complicated or unknown is presented. The logical model is to simplify the original covariate dataset into a logical dataset by transformation lemma. Next, we show the validation of this designed logical model by correlation transformation (Sadeghzadeh, K., and Fard, N, in press) (Sadeghzadeh, K., and Fard, N, 2014).
The original event history dataset may include any type of explanatory. Many time-independent variables are even binary or interchangeable with a binary variable such as dichotomous variable. Also, interpretation of binary variable is simple, understandable and comprehensible. In addition, the model is appropriate for fast and low-cost calculation. The General schema of high-dimensional event history dataset includes n observations with p variables as shown in Table 1.
Variables | |||||
Obs. # | Time to Event | Var. 1 | Var. 2 | … | Var. p |
1 | t1 | u11 | u12 | … | u1p |
2 | t2 | u21 | u22 | … | u2p |
… | … | … | … | … | … |
n | tn | un1 | un2 | … | unp |
Each array of $p$ variables vectors will take only two possible values, canonically 0 and 1 . As discussed in Section 2, discretization method is applied to values by dividing the range of values for each variable into 2 equally sized parts. We define $w_{i j}$ as an initial splitting criterion equal to arithmetic mean of maximum and minimum value of $u_{i j}$ for $i=1 \ldots n, j=1 \ldots p$. The criteria $w_{i j}$ could be defined by expert using experimental or historical data as well. For any array $u_{i j}$ in the $n$-by- $p$ dataset matrix $U=$ $\left[u_{i j}\right]$, then allocate a substituting array $v_{i j}$ as 0 if $u_{i j}<w_{i j}$ and 1 if $u_{i j} \geq w_{i j}$. The proposed model assumes any array with a value of 1 as desired for expert and 0 otherwise. In other words, $v_{i j}=0$ represent the lack of the $j$ th variable in the $i$ th observation. The result of the transformation is an $n$-by- $p$ dataset matrix $V =\left[v_{n p}\right]$ which will be used in the following methods and algorithms. Also, we define timeto-event vector $T=\left[t_n\right]$ including all observed event times. The logical model initially could be satisfied by proper design of data collection process by based on Boolean logic to generate binary attributes.
To validate the robustness of this model we show that the change of correlation between variables before and after transformation is not significant and the logical dataset has followed the same pattern and behavior as the original; in terms of correlation of covariates. We define correlation matrix for each of original and transformed dataset based on Pearson product-moment correlation coefficient; $M=\left[m_{i j}\right]$ and $N=\left[n_{i j}\right]$ where $i=1 \ldots n, j=1 \ldots p$, where $n_{i j}$ and $m_{i j}$ denote covariance of variables $i$ and $j$ for original and transformed dataset respectively as follows:
where $\left(u_{i k}, v_{i k}\right)$ and $\left(\bar{u}_i, \bar{v}_i\right)$ represent value of variable $i$ in observation $k$ and mean of variable $i$ in each dataset respectively, and similarly the second parenthesis in equations(6) and (7) are defined for variable j.
The experimental fitted line for the scatter plot of $m_{i j}$ and $n_{i j}$ for any dataset is $y=a+b x$ where $b$ is positive small and $a$ is not significant. For instance, Figure 1 shows the primary biliary cirrhosis ( $PBC )$ dataset (Section 4) for an experimental result of an uncensored data with the fitted line of $y=0.6356 x+$ $0.0116 b$.
The proposed logical model validation and verification of the robustness were presented comprehensively in (Sadeghzadeh, K., and Fard, N, in press) and (Sadeghzadeh, K., and Fard, N, 2014).
In order to select the most significant variables in terms of inefficiency, methods and algorithms are presented next.
We design a class of methods applying on proposed logical model to select inefficient variables in a high-dimensional event history datasets. The major assumption to design appropriate methods for this purpose is that the variable which is completely inefficient solely can provide a significant performance improvement when engaged with others, and two variables that are inefficient by themselves can be efficient together (Guyon, I., and Elisseeff, A., 2003). Based on this assumption, we design three methods and heuristic algorithms to select inefficient variables in event history datasets with high-dimensional covariates. We use Kaplan-Meier estimator in this study to estimate survival probabilities as a function of time. The n-by-p matrix V is the prepared transformed logical dataset according to Section 3.1, where n is the number of observations, p is the number of variables, and k is the estimated subset size to select for calculation parts in the algorithms.
Recalling $V$ which is constructed by $k$ observation vectors corresponding to each of the variables, $D=$ [ $d_{k p}$ ] as a $k$-by- $p$ matrix is a selected subset of $V$ and $k$ is defined as the number of observations in any subset of $V$, where $k \leq n$. For any variable $i$, we define vector $O^i$ as a time-to-event vector which includes failure times of any observation $j$ the value of $v_{i j}$ is one. Similarly, we define vector $Z$ including failure times of any observation $j$ where the value of $v_{i j}$ is zero. The vectors $R$ and $S$ are defined as follow:
Vector R is constructed by all non-zero arrays r and similarly vector 𝑺𝑺 is constructed by all non-zero arrays s.
We propose three methods and algorithms to select inefficient variables as follows:
Singular Variable Effect Algorithm
The objective of Singular Variable Effect (SVE) method is to determine the efficiency of a variable by analyzing the effect of the presence of any variable singularly in comparison with its absence in a transformed logical dataset. For p variable, we aim to set vector Δ = [δi] where 𝑖 =1…𝑝𝑝 to rank the efficiency of the variables. The preliminary step for the highest efficiency in this method is to initially clustering the variables based on the correlation coefficient matrix of original dataset, M, and choose a representative variable from each highly correlated cluster and eliminate the other variables from the dataset. For instance, for any given dataset, if three variables are highly correlated, only one of them is selected randomly and the other two are eliminated from the dataset. The result of this process assures that the remaining variables for applying methods and heuristic algorithms are not highly correlated.
As an outcome of the SVE procedure, if one hopes to reduce the number of variables in the dataset for further analysis, could eliminate less efficient identified variables or if aims to concentrate on a reduced number of variables, could choose a category of more efficient identified variables as well. Heuristic algorithm for SVE method is:
for $i=1$ to $p$ do Calculate $O ^i$ and $Z ^i$ for variable $i$ observation vector in dataset $V$ Compare $T$ and $O ^i$ with Wilcoxon rank sum test Save the test score for variable $i$ as $\alpha_i$ Compare $T$ and $O ^i$ with Wilcoxon rank sum test Save the test score for variable $i$ as $\beta_i$ Calculate $\delta_i=\alpha_i-\beta_i$ end for Return $\Delta =\left\lceil\delta_p\right\rceil$ as the variable efficiency vector |
Splitting Semi-Greedy Clustering Algorithm
Splitting Semi-Greedy (SSG) method to select an inefficient variable subset is proposed. A clustering procedure through randomly splitting approach to select the best local subset according to a defined criterion incorporated. In this method we use block randomization which is designed to randomize subjects into equal sample sizes groups. A nonparametric test is used to test a null hypothesis that whether two samples are drawn from the same distribution, as compared to a given alternative hypothesis. Wilcoxon rank sum test is used in this method.
The concept of this method is inspired by the semi-greedy heuristic (Feo, T. A., and Resende, M. G., 1995) (Hart, J. P., and Shogan, A. W., 1987) and tabu search (Gendreau, M., and Potvin, J. Y., 2005). Criterion of this search is similar to The Nonparametric Test Score (NTS) method (Sadeghzadeh, K., and Fard, N, in press) which is to collect the most inefficient variable subset via Wilcoxon rank sum test score. At each of l trials, all p variables from the transformed logical dataset V are randomly clustered into subsets of size k variables, where one cluster possibly contains less than k variables and the number of clusters is equal to $\lceil p / k\rceil$. To calculated score summation for each variable over all trials, a randomization dataset matrix $\Xi=\left[\xi_{l k}\right]$ where each row is formed by $k$ variable identification numbers in any selected subsets for all $l$ trials. Comprehensive experimental results for validation of the proposed methods $\mid$ by comparison with similar methods are presented next. Heuristic algorithm for SSG method is:
for $i=1$ to $l$ do Split the data into equally sized subsets Compose the dataset $D$ for each subset Calculate $R$ over the $D$ for each subset Compare $T$ and $R$ with Wilcoxon rank sum test and save the test score for each subset one by one Select a subset with the highest test score Save the test score for variables in the selected subset as $\xi_{i(k+l)}$ end for Assume $\Theta =\left\lceil\theta_p\right\rceil$ as the reverse variable efficiency vector where initially each array as the cumulative contribution score corresponding to a variable is zero for $i=1$ to $l$ do for $j=1$ to $k$ do Add the value of $\xi_{i(k+l)}$ to the cumulative contribution score $\theta_p$ of the variable $i$ based on its identification number $=\xi_{i j}$ end for end for Return $\Theta =\left[\theta_p\right]$ as the variable inefficiency vector |
Weighted Time Score Algorithm
The Weighted Time Score (WTS) method is a variable clustering technique which selects set of size $k$ variables from the transformed logical dataset $V$ and calculates the score of each variable. The first step is to determine the observations in a selected subset which all $k$ variables are 1 for that observation and eliminate other observation from subset. Cumulative time score over the vector $T$ credit each of variables in the subset. Final score of all variables is reached by aggregation of those credits in $l$ trials. Randomization algorithm randomly chooses a defined $l$ subset of $k$ from the $V$, transformed logical dataset of $p$ variable. We define a randomization dataset matrix $\Psi=\left[\psi_{l k}\right]$ where each row is formed by $k$ variable identification numbers in any selected subsets for overall $l$ subsets. Heuristic algorithm for method is:
for $i=1$ to $l$ do Compose the dataset $D _i$ for variable set $i$ in $\Psi _{\text {including variables }} \psi_{i j}$ where $j=1$ to $k$ Calculate $S_i$ over the dataset $D_i$ Calculate $\sum t_i$ for $S _i$ as a time score Save the time score for variables in subset $i$ as $\psi_i(k+1)$ end for Assume $\Omega =\left[\omega_p\right]$ as the reverse variable efficiency vector where initially each array as the cumulative contribution score corresponding to a variable is zero for $i=1$ to $l$ do for $j=1$ to $k$ do Add the value of $\psi_{i(k+l)}$ to the cumulative contribution score $\omega_p$ of the variable $i$. based on its identification number $=\psi_{i i}$ end for end for Return $Q =\left[\omega_p\right]$ as the variable inefficiency vector |
The experiment results for these algorithms are followed in Section 4.
To evaluate the performance of the designed methods, first well-known and publicly available primary biliary cirrhosis (PBC) dataset (Fleming and Harrington 1991) is considered as the sample collected dataset. These dataset includes 111 uncensored complete observations and 17 explanatory variables in addition to event times for each observation. In order to obtain an approximate value of desired number of variables in any selected subset, we use principal component analysis (PCA) scree plot criterion (Sadeghzadeh, K., and Fard, N, in press) (Sadeghzadeh, K., and Fard, N, 2014). For the original uncensored PBC dataset, approximation of this number is 3.
To verify the performance of the proposed methods, the result of these methods and algorithms for the transformed logical uncensored PBC dataset is compared with the results of Nonparametric Test Score (NTS) method (Sadeghzadeh, K., and Fard, N, in press), Random Survival Forest (RSF) method (Ishwaran, H. et al., 2008) (Ishwaran, I., and Kogalur, U. B., 2007), Additive Risk Model (ADD) (Ma, S., Kosorok, M. R., and Fine, J. P., 2006), and Weighted Least Square (LS) method (Huang, J., Ma, S., and Xie, H, 2006) for similar dataset variable selection, given in Table 1. A comprehensive comparison of NTS, RSF, ADD and LS performance with other relevant methods in high-dimensional time-to-event data analysis such as Cox’s Proportional Hazard Model, LASSO and PCR has been presented in (Huang, J., Ma, S., and Xie, H, 2006) (Ishwaran, H. et al., 2008) (Ma, S., Kosorok, M. R., and Fine, J. P., 2006).
Each number in Tables 2 and 3 represents a specific variable in experiment dataset. For example, in Table 2, variable #1 is a selected as an inefficient variable by all methods.
Method | Selected Inefficient Variables |
SVE | 1, 3, 5, 10, 13, 17 |
SSG | 1, 3, 5, 10, 15, 17 |
WTS | 1, 3, 5, 10, 15, 17 |
NTS | 1, 3, 5, 6, 10, 15, 17 |
RSF | 1, 3, 5, 12, 13, 14, 15, 17 |
ADD | 1, 2, 5, 12, 14 |
LS | 1, 2, 3, 14, 15, 17 |
From the results shown on Tables 2, the SSG and WTS methods have a same performance. More than 80% of inefficient variables which has been detected by other methods (NTS, RSF, LS and ADD) are collected by proposed algorithms at significantly shorter calculation period, where the robustness of this class of methods has examined for several sample datasets.
To show variable inefficiency through three designed methods SVE, SSG, and WTS, graphical representation for the experiment results for uncensored PBC dataset is depicted in Figure 2. Each variable with larger radius and more distance from the center is less efficient and an ideal candidate to remove from dataset if it is desired.
As another validation of the proposed methods, a simulation is designed. We set n = 400 observations and p = 15 variables and simulated event times from a pseudorandom algorithm. We also set first five variables inefficient, where first two are absolutely inefficient. Some variable vectors are set as a linear function of event time data in addition to constant and periodic binary numbers as well as normal and exponential distributed pseudorandom numbers as independent values of explanatory variables. The results of methods and algorithms applying the simulated data are presented in Table 3. These results are compared with the results from NTS method. From the simulation defined pattern the comparison verifies the performance of all proposed methods.
Method | Selected Inefficient Variables (No.) |
SVE | 1, 2, 3, 10 |
SSG | 1, 2, 3 |
WTS | 1, 2, 3, 5, 12 |
NTS | 1, 2, 5 |
Definition | 1, 2, 3, 4, 5 |
Inefficiency analysis results for the simulation experiment shows that variables with identification number 1, 2 and 3 are detected as inefficient variables by all proposed methods. To reduce the number of variables in the dataset for further analysis, these explanatory variables are the best candidates to be eliminated from the dataset.
The proposed logical model, designed variable selection methods, and heuristic clustering algorithms in this paper are beneficial to explanatory variable reduction through an inefficient variable selection approach to obtain an appropriate variable subset in high-dimensional and large-scale event history data in order to avoid difficulties in decision-making.
By using such novel methods in the fields of economics, business and management, data analysis and decision-making processes will be faster, simpler and more accurate. For example, in business applications, many explanatory variables in a customer survey are defined based on cause and effect analysis process data or similar analytic process outcome. In most cases, correlations of these explanatory variables are complicated and unknown, and it is important to simply understand the efficiency of each variable in the survey. These procedures potentially applicable solutions for many problems in a vast area of science and technologies are presented.
Next step in this study is to considering event data and time-to-event models including new types of dependent variables through well-known models such as accelerated failure time and applying heuristic algorithms especially in the field of artificial intelligence.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.