Uses of principal component analysis
Often in healthcare there are multiple factors that influence an outcome of interest. For example, when trying to predict the length of stay of a patient this could be influenced by the severity of the patients condition, the complexity of their case, whether they require a package of social care before discharge, the day on which they are admitted, to name but a few.
Multi-variate regression is the most common way to produce a model to predict an outcome based on many different factors but you need a way to determine which factors to include in the regression model, this is where principal component analyses are useful.
Principal component analyses (PCA) are used to explore the ability of multiple factors to account for the variance in a data set. If you do not know which factors are most likely to be predictive of an outcome then PCA is a good starting point. There are different types of principal component analysis and the type of data that you are using determines which type of PCA you need to use. Three different types of PCA are described below and they provide a way to explore most healthcare data.
Types of PCA
The three type of PCA to be covered in this tutorial will help you to explore quantitative factors, qualitative factors and mixed quantitative and qualitative factors. They are:
- Principal component analysis (PCA) – for use with only continuous and discrete numerical data
- Multiple correspondence analysis (MCA) – for use with categorical data only
- Factor analysis of mixed data (FAMD) – for use with mixed numerical and categorical data
Principal component analysis
In this example we will need to load the factoextra library to visualise the results of the analysis.
We start by creating an example data set containing numerical data. There are 100 observations of 8 factors.
Input:
df_num <- data.frame(los=rexp(100, 5),
weight=rnorm(100, 78, 12),
height=rnorm(100, 165, 10),
procedures=rpois(100, 1.5),
age=rnorm(100, 45, 10),
complexity=rexp(100, 2),
severity=rexp(100, 3))
df$bmi <- df$weight/((df$height/100)^2)
head(df_num)
Output:
los weight height procedures age complexity severity
1 0.115322054 81.60335 182.9714 2 47.20549 0.17355499 0.08115066
2 0.265810974 78.89828 156.8967 1 36.65484 2.41406744 0.03199619
3 0.006315472 80.47647 184.0090 0 43.69646 0.19319890 0.52674936
4 0.011242195 72.13293 172.0895 2 50.67329 0.03718549 0.03969725
5 0.063300243 70.46458 172.3619 1 57.08467 0.05149866 0.33449668
6 0.062845458 77.43700 178.6578 3 46.27220 0.24534184 0.14639152
We then use the prcomp()
function to run a principal component analysis on our data. Note that centering and scaling have been set to TRUE
to standardise the variables.
df_pca <- prcomp(df_num,center=TRUE,scale.=TRUE)
A scree plot can be used to visualise the percentage of the variance in the data accounted for by different ways of combining the factors within the data. These are referred to as dimensions.
Input:
fviz_screeplot(df_pca,labels=TRUE)
Output:
To explore the contribution of each factor within the different dimensions created by the analysis we look at the variable level results. These are extracted using the get_pca_var()
function, displayed using $contrib
and visualised using the fviz_pca_var()
and fviz_contrib()
functions. fviz_pca_var()
shows the contribution of each factor across the first two dimensions of the analysis in a scatter graph and fviz_contrib()
produces a bar plot of the relative contribution of each factor within a dimension defined using the axes argument.
Note: the first two dimensions of a PCA are the combination of factors that account for the greatest variance with your data providing the combination of factors with the maximum likely explanatory power for a multi variate model.
Input:
vars <- get_pca_var(df_pca)
vars$contrib
Output:
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
los 5.3015078 13.231623 0.19497267 14.94394785 19.77971812 45.385871683
weight 30.3637358 1.065937 11.92369876 1.45409169 0.01760632 5.680743674
height 17.8111980 10.846066 9.44954267 11.56213368 1.65407486 3.761525479
procedures 0.1810879 21.093671 6.90714547 15.79046517 37.18931709 18.222235152
age 0.9674055 30.352961 1.97443213 13.32098274 7.29539076 20.258146115
complexity 0.0725160 19.353656 0.05117394 42.42371294 31.42683739 6.380088970
severity 1.3408652 2.998179 68.35320572 0.05327024 2.31412069 0.001815931
bmi 43.9616838 1.057907 1.14582864 0.45139571 0.32293475 0.309572995
Dim.7 Dim.8
los 1.08582388 7.653498e-02
weight 18.43870604 3.105548e+01
height 28.81399137 1.610147e+01
procedures 0.60253994 1.353790e-02
age 25.83047114 2.108957e-04
complexity 0.28775938 4.255188e-03
severity 24.91389322 2.464981e-02
bmi 0.02681503 5.272386e+01
Input:
fviz_pca_var(df_pca)
fviz_contrib(df_pca, choice = "var", axes = 1)
fviz_contrib(df_pca, choice = "var", axes = 2)
Output:
Multiple correspondence analysis
For this analysis we will require the FactoMineR and factoextra packages (Note: the packages are for use with R version 3.5.* and later).
The example data set created below is made up of 100 observations of 7 categorical variables (the columns).
Input:
df_cat <- data.frame(admitted=sample(0:1,100,replace=TRUE),
admitType=sample(1:4,100,replace=TRUE),
firstVisit=sample(0:1,100,replace=TRUE),
complex=sample(0:1,100,replace=TRUE),
ageGroup=sample(1:6,100,replace=TRUE),
diagCat=sample(1:10,100,replace=TRUE),
day=sample(1:7,100,replace=TRUE))
head(df_cat)
Output:
admitted admitType firstVisit complex ageGroup diagCat day
1 1 4 0 0 2 1 1
2 1 2 0 1 6 8 7
3 1 3 0 1 1 4 7
4 1 2 1 0 4 6 6
5 0 3 1 1 6 7 3
6 0 2 1 1 4 4 4
The easiest way to work with categorical data in R is to convert it to the factor data type. This works with both string/text based data and integer coded categories as used in this example. We can convert all of the columns to factors using the lapply()
function and converting the data back to a data frame using as.data.frame()
.
Input:
df_cat <- as.data.frame(lapply(df_cat, factor))
str(df_cat)
Output:
'data.frame': 100 obs. of 7 variables:
$ admitted : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 1 2 1 ...
$ admitType : Factor w/ 4 levels "1","2","3","4": 4 2 3 2 3 2 1 4 1 4 ...
$ firstVisit: Factor w/ 2 levels "0","1": 1 1 1 2 2 2 1 1 1 2 ...
$ complex : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 1 2 2 ...
$ ageGroup : Factor w/ 6 levels "1","2","3","4",..: 2 6 1 4 6 4 2 5 4 4 ...
$ diagCat : Factor w/ 10 levels "1","2","3","4",..: 1 8 4 6 7 4 5 5 3 7 ...
$ day : Factor w/ 7 levels "1","2","3","4",..: 1 7 7 6 3 4 6 7 3 5 ...
The function for MCA is MCA()
and we add the argument graph=FALSE
to suppress the automatically generated graphs as we will create our own.
Input:
df_mca <- MCA(df_cat,graph=FALSE)
The information that we want to retrieve from the analysis and the visualisations of that information that we want to produce are very similar to those in the PCA analysis above. We begin with a scree plot to see the amount of variance accounted for in the first 10 dimensions.
Input:
fviz_screeplot(df_mca,labels=TRUE)
Output:
To view the contribution of each variable to the amount of variance explained by each dimension you need to use get_mca_var()
and then $contrib
to view the output.
Input:
vars <- get_mca_var(df_mca)
vars$contrib
Output:
Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
admitted_0 0.001750298 1.791328e-01 1.60802656 12.235398809 0.939527483
admitted_1 0.001432062 1.465632e-01 1.31565810 10.010780844 0.768704305
admitType_1 1.699524502 9.757696e+00 1.65083010 0.001412699 1.066248765
admitType_2 2.319885728 1.444341e-01 1.76623023 2.093369928 3.453491118
admitType_3 4.694504621 9.532601e-01 8.34974299 0.327415533 6.838234314
admitType_4 4.642117494 2.901158e+00 0.06588051 0.556584406 2.932488213
firstVisit_0 0.007981436 1.002564e+01 2.70743358 0.541482918 0.045503584
firstVisit_1 0.009755089 1.225356e+01 3.30908549 0.661812455 0.055615492
complex_0 3.155233215 7.088524e-01 1.29788661 2.983423727 0.021736097
complex_1 2.687791257 6.038373e-01 1.10560711 2.541435026 0.018515934
ageGroup_1 4.478460077 3.398178e+00 1.81439375 0.180051977 2.426826984
ageGroup_2 8.415530198 2.074877e+00 3.07705223 1.327278754 1.851695168
ageGroup_3 0.648089597 1.035906e+01 6.40819738 1.541699286 1.487802113
ageGroup_4 1.104617420 1.068189e+00 6.45107327 5.041606391 0.644188180
ageGroup_5 0.028643865 8.804525e-03 6.12096664 11.755043033 0.237669434
ageGroup_6 5.233839716 2.665486e-01 0.04954071 10.059937190 2.979330392
diagCat_1 1.211361369 3.931928e-01 3.29532046 1.732256475 2.722436831
diagCat_2 8.829331840 2.460947e-02 6.27367331 0.985144802 0.674332572
diagCat_3 0.005640283 1.456480e+00 1.10474769 0.264369196 9.477488027
diagCat_4 3.774542362 1.900483e+00 1.46455031 9.257833540 0.720998413
diagCat_5 4.337814007 2.742854e-01 0.38185908 2.956681822 0.578555398
diagCat_6 0.803925795 1.503228e+00 15.45814726 0.442884116 0.285319301
diagCat_7 0.014622074 3.613199e-04 0.32577112 13.824994397 1.087739160
diagCat_8 5.450960523 6.030231e+00 0.18030642 0.324617240 17.117290144
diagCat_9 2.278031713 9.420936e-01 1.08533540 1.502722595 8.874660902
diagCat_10 1.408464867 2.692705e+00 3.06131496 0.045741869 1.497714064
day_1 1.351042738 1.862585e+00 6.43745973 1.325621189 8.075718633
day_2 0.505902108 6.470470e-01 4.05761636 1.350966228 3.790374898
day_3 1.104627894 1.080610e+01 0.76402203 0.332277120 0.088915147
day_4 19.666270572 1.289532e+00 1.39131067 0.031542487 0.000029358
day_5 2.652214692 3.671076e-03 5.78282222 0.849825839 3.883336229
day_6 3.084466186 2.762171e-03 0.94764722 2.002494842 12.048080355
day_7 4.391624402 1.532083e+01 0.89049050 0.911293269 3.309432992
A graph of the contribution of each variable to the first two dimensions is produced using fviz_mca_var()
and as with the PCA analysis bar plots of the variable contributions to each dimension are produced using fviz_contrib()
and specifying the dimension in the axes=
argument.
Input:
fviz_mca_var(df_mca)
fviz_contrib(df_mca, choice = "var", axes = 1)
fviz_contrib(df_mca, choice = "var", axes = 2)
Output:
Factor analysis of mixed data
For this analysis we will require the FactoMineR and factoextra packages (Note: the packages are for use with R version 3.5.* and later).
Let's begin by creating a data set with mixed numerical and categorical data. This data set contains 100 observations of 4 categorical factors and 5 numerical factors.
Input:
df_mix <- data.frame(admitType=sample(1:4,100,replace=TRUE),
day=sample(1:7,100,replace=TRUE),
diagCat=sample(1:10,100,replace=TRUE),
complex=sample(0:1,100,replace=TRUE),
los=rexp(100, 5),
weight=rnorm(100, 78, 12),
height=rnorm(100, 165, 10),
procedures=rpois(100, 1.5),
age=rnorm(100, 45, 10))
head(df_mix)
Output:
admitType day diagCat complex los weight height procedures age
1 4 3 10 0 0.29656683 57.99030 165.3455 1 51.05371
2 2 4 7 1 0.05482085 86.83795 166.9023 1 39.93666
3 4 7 6 1 0.15006675 82.63232 166.7473 2 30.79434
4 4 4 5 1 0.21007950 74.81218 154.4498 0 46.27993
5 1 7 9 0 0.07231502 79.41773 169.7613 3 64.45851
6 3 7 4 0 0.19218883 79.60846 178.7857 2 53.00914
We now need to convert the columns that contain categorical variables from the numerical data type to factor type. This is done by identifying those columns to be converted to factor type, extracting the column names and passing them to the lapply
function to iterate over them and perform the conversion.
Input:
fac <- c(1:4)
facNames <- names(df_mix[fac])
df_mix[facNames] <- lapply(df_mix[facNames], factor)
str(df_mix)
Output:
'data.frame': 100 obs. of 9 variables:
$ admitType : Factor w/ 4 levels "1","2","3","4": 4 2 4 4 1 3 4 3 2 4 ...
$ day : Factor w/ 7 levels "1","2","3","4",..: 3 4 7 4 7 7 5 3 2 7 ...
$ diagCat : Factor w/ 10 levels "1","2","3","4",..: 10 7 6 5 9 4 3 2 2 5 ...
$ complex : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 1 1 ...
$ los : num 0.2966 0.0548 0.1501 0.2101 0.0723 ...
$ weight : num 58 86.8 82.6 74.8 79.4 ...
$ height : num 165 167 167 154 170 ...
$ procedures: int 1 1 2 0 3 2 1 0 2 1 ...
$ age : num 51.1 39.9 30.8 46.3 64.5 ...
The data is now ready for analysis. The function for the mixed data type analysis is FAMD()
and the graph=FALSE
argument is used to suppress the automatically generated plots.
df_famd <- FAMD(df_mix, graph=FALSE)
The important analysis outputs are similar to those for PCA and MCA. To create a customised scree plot showing as many or as few dimensions as you like, you need to extract the eigen values for the dimensions from the analysis and plot them as a bar graph.
Input:
eig.val <- df_famd$eig
x <- barplot(eig.val[, 2],
names.arg = 1:nrow(eig.val),
main = "Variances Explained by Dimensions (%)",
xlab = "Principal Dimensions",
ylab = "Percentage of variances",
col ="steelblue")
lines(x = x, eig.val[, 2],
type = "b", pch = 19, col = "red")
Output:
The contribution of the variables to explaining the variance in each dimension is extracted using the get_famd_var()
function and calling the $contrib
part of the output.
Input:
vars <- get_famd_var(df_famd)
vars$contrib
Output:
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
los 5.076082 1.140650 4.157746e+00 0.006575909 4.5807267
weight 5.347653 1.019035 7.389999e+00 3.372029522 8.6797507
height 9.250164 5.405068 4.447199e-03 0.013552674 5.2811866
procedures 1.195516 10.568965 1.738896e+01 0.440959702 0.5531763
age 1.766913 1.189035 3.022982e+01 2.237621735 3.7033632
admitType 14.115936 12.372150 1.023400e+01 5.697187858 18.4482633
day 19.230572 36.062218 1.218112e+01 38.763925601 26.2901018
diagCat 26.130542 31.065785 1.841379e+01 49.292828000 31.8876683
complex 17.886623 1.177094 1.226301e-04 0.175318999 0.5757631
The contributions of the variables to dimensions one and two are plotted using the fviz_famd_var()
function. The percentage contribution of each variable to each dimension's explanatory power is given as a bar plot using the fviz_contrib()
function and the dimension is changed using the axes=
argument.
Input:
fviz_famd_var(df_famd, repel = TRUE)
fviz_contrib(df_famd, "var", axes = 1)
fviz_contrib(df_famd, "var", axes = 2)
Output: