# Principal component analyses

## Uses of principal component analysis

Often in healthcare there are multiple factors that influence an outcome of interest. For example, when trying to predict the length of stay of a patient this could be influenced by the severity of the patients condition, the complexity of their case, whether they require a package of social care before discharge, the day on which they are admitted, to name but a few.

Multi-variate regression is the most common way to produce a model to predict an outcome based on many different factors but you need a way to determine which factors to include in the regression model, this is where principal component analyses are useful.

Principal component analyses (PCA) are used to explore the ability of multiple factors to account for the variance in a data set. If you do not know which factors are most likely to be predictive of an outcome then PCA is a good starting point. There are different types of principal component analysis and the type of data that you are using determines which type of PCA you need to use. Three different types of PCA are described below and they provide a way to explore most healthcare data.

## Types of PCA

The three type of PCA to be covered in this tutorial will help you to explore quantitative factors, qualitative factors and mixed quantitative and qualitative factors. They are:

• Principal component analysis (PCA) – for use with only continuous and discrete numerical data
• Multiple correspondence analysis (MCA) – for use with categorical data only
• Factor analysis of mixed data (FAMD) – for use with mixed numerical and categorical data

## Principal component analysis

In this example we will need to load the factoextra library to visualise the results of the analysis.

We start by creating an example data set containing numerical data. There are 100 observations of 8 factors.

``````
Input:
df_num <- data.frame(los=rexp(100, 5),
weight=rnorm(100, 78, 12),
height=rnorm(100, 165, 10),
procedures=rpois(100, 1.5),
age=rnorm(100, 45, 10),
complexity=rexp(100, 2),
severity=rexp(100, 3))
df\$bmi <- df\$weight/((df\$height/100)^2)

Output:
los   weight   height procedures      age complexity   severity
1 0.115322054 81.60335 182.9714          2 47.20549 0.17355499 0.08115066
2 0.265810974 78.89828 156.8967          1 36.65484 2.41406744 0.03199619
3 0.006315472 80.47647 184.0090          0 43.69646 0.19319890 0.52674936
4 0.011242195 72.13293 172.0895          2 50.67329 0.03718549 0.03969725
5 0.063300243 70.46458 172.3619          1 57.08467 0.05149866 0.33449668
6 0.062845458 77.43700 178.6578          3 46.27220 0.24534184 0.14639152
``````
``` ```

We then use the `prcomp()` function to run a principal component analysis on our data. Note that centering and scaling have been set to `TRUE` to standardise the variables.

``````
df_pca <- prcomp(df_num,center=TRUE,scale.=TRUE)
``````
``` ```

A scree plot can be used to visualise the percentage of the variance in the data accounted for by different ways of combining the factors within the data. These are referred to as dimensions.

``````
Input:
fviz_screeplot(df_pca,labels=TRUE)

Output:

``````
``` ```

To explore the contribution of each factor within the different dimensions created by the analysis we look at the variable level results. These are extracted using the `get_pca_var()` function, displayed using `\$contrib` and visualised using the `fviz_pca_var()` and `fviz_contrib()` functions. `fviz_pca_var()` shows the contribution of each factor across the first two dimensions of the analysis in a scatter graph and `fviz_contrib()` produces a bar plot of the relative contribution of each factor within a dimension defined using the axes argument.

Note: the first two dimensions of a PCA are the combination of factors that account for the greatest variance with your data providing the combination of factors with the maximum likely explanatory power for a multi variate model.

``````
Input:
vars <- get_pca_var(df_pca)
vars\$contrib

Output:
Dim.1     Dim.2       Dim.3       Dim.4       Dim.5        Dim.6
los         5.3015078 13.231623  0.19497267 14.94394785 19.77971812 45.385871683
weight     30.3637358  1.065937 11.92369876  1.45409169  0.01760632  5.680743674
height     17.8111980 10.846066  9.44954267 11.56213368  1.65407486  3.761525479
procedures  0.1810879 21.093671  6.90714547 15.79046517 37.18931709 18.222235152
age         0.9674055 30.352961  1.97443213 13.32098274  7.29539076 20.258146115
complexity  0.0725160 19.353656  0.05117394 42.42371294 31.42683739  6.380088970
severity    1.3408652  2.998179 68.35320572  0.05327024  2.31412069  0.001815931
bmi        43.9616838  1.057907  1.14582864  0.45139571  0.32293475  0.309572995
Dim.7        Dim.8
los         1.08582388 7.653498e-02
weight     18.43870604 3.105548e+01
height     28.81399137 1.610147e+01
procedures  0.60253994 1.353790e-02
age        25.83047114 2.108957e-04
complexity  0.28775938 4.255188e-03
severity   24.91389322 2.464981e-02
bmi         0.02681503 5.272386e+01
``````
``` ```

``````
Input:
fviz_pca_var(df_pca)
fviz_contrib(df_pca, choice = "var", axes = 1)
fviz_contrib(df_pca, choice = "var", axes = 2)

Output:

``````
``` ```

## Multiple correspondence analysis

For this analysis we will require the FactoMineR and factoextra packages (Note: the packages are for use with R version 3.5.* and later).

The example data set created below is made up of 100 observations of 7 categorical variables (the columns).

``````
Input:
firstVisit=sample(0:1,100,replace=TRUE),
complex=sample(0:1,100,replace=TRUE),
ageGroup=sample(1:6,100,replace=TRUE),
diagCat=sample(1:10,100,replace=TRUE),
day=sample(1:7,100,replace=TRUE))

Output:
1        1         4          0       0        2       1   1
2        1         2          0       1        6       8   7
3        1         3          0       1        1       4   7
4        1         2          1       0        4       6   6
5        0         3          1       1        6       7   3
6        0         2          1       1        4       4   4
``````
``` ```

The easiest way to work with categorical data in R is to convert it to the factor data type. This works with both string/text based data and integer coded categories as used in this example. We can convert all of the columns to factors using the `lapply()` function and converting the data back to a data frame using `as.data.frame()`.

``````
Input:
df_cat <- as.data.frame(lapply(df_cat, factor))
str(df_cat)

Output:
'data.frame':	100 obs. of  7 variables:
\$ admitted  : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 1 2 1 ...
\$ admitType : Factor w/ 4 levels "1","2","3","4": 4 2 3 2 3 2 1 4 1 4 ...
\$ firstVisit: Factor w/ 2 levels "0","1": 1 1 1 2 2 2 1 1 1 2 ...
\$ complex   : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 1 2 2 ...
\$ ageGroup  : Factor w/ 6 levels "1","2","3","4",..: 2 6 1 4 6 4 2 5 4 4 ...
\$ diagCat   : Factor w/ 10 levels "1","2","3","4",..: 1 8 4 6 7 4 5 5 3 7 ...
\$ day       : Factor w/ 7 levels "1","2","3","4",..: 1 7 7 6 3 4 6 7 3 5 ...
``````
``` ```

The function for MCA is `MCA()` and we add the argument `graph=FALSE` to suppress the automatically generated graphs as we will create our own.

``````
Input:
df_mca <- MCA(df_cat,graph=FALSE)
``````
``` ```

The information that we want to retrieve from the analysis and the visualisations of that information that we want to produce are very similar to those in the PCA analysis above. We begin with a scree plot to see the amount of variance accounted for in the first 10 dimensions.

``````
Input:
fviz_screeplot(df_mca,labels=TRUE)

Output:

``````
``` ```

To view the contribution of each variable to the amount of variance explained by each dimension you need to use `get_mca_var()` and then `\$contrib` to view the output.

``````
Input:
vars <- get_mca_var(df_mca)
vars\$contrib

Output:
Dim 1        Dim 2       Dim 3        Dim 4        Dim 5
admitted_0    0.001750298 1.791328e-01  1.60802656 12.235398809  0.939527483
admitted_1    0.001432062 1.465632e-01  1.31565810 10.010780844  0.768704305
admitType_1   1.699524502 9.757696e+00  1.65083010  0.001412699  1.066248765
admitType_2   2.319885728 1.444341e-01  1.76623023  2.093369928  3.453491118
admitType_3   4.694504621 9.532601e-01  8.34974299  0.327415533  6.838234314
admitType_4   4.642117494 2.901158e+00  0.06588051  0.556584406  2.932488213
firstVisit_0  0.007981436 1.002564e+01  2.70743358  0.541482918  0.045503584
firstVisit_1  0.009755089 1.225356e+01  3.30908549  0.661812455  0.055615492
complex_0     3.155233215 7.088524e-01  1.29788661  2.983423727  0.021736097
complex_1     2.687791257 6.038373e-01  1.10560711  2.541435026  0.018515934
ageGroup_1    4.478460077 3.398178e+00  1.81439375  0.180051977  2.426826984
ageGroup_2    8.415530198 2.074877e+00  3.07705223  1.327278754  1.851695168
ageGroup_3    0.648089597 1.035906e+01  6.40819738  1.541699286  1.487802113
ageGroup_4    1.104617420 1.068189e+00  6.45107327  5.041606391  0.644188180
ageGroup_5    0.028643865 8.804525e-03  6.12096664 11.755043033  0.237669434
ageGroup_6    5.233839716 2.665486e-01  0.04954071 10.059937190  2.979330392
diagCat_1     1.211361369 3.931928e-01  3.29532046  1.732256475  2.722436831
diagCat_2     8.829331840 2.460947e-02  6.27367331  0.985144802  0.674332572
diagCat_3     0.005640283 1.456480e+00  1.10474769  0.264369196  9.477488027
diagCat_4     3.774542362 1.900483e+00  1.46455031  9.257833540  0.720998413
diagCat_5     4.337814007 2.742854e-01  0.38185908  2.956681822  0.578555398
diagCat_6     0.803925795 1.503228e+00 15.45814726  0.442884116  0.285319301
diagCat_7     0.014622074 3.613199e-04  0.32577112 13.824994397  1.087739160
diagCat_8     5.450960523 6.030231e+00  0.18030642  0.324617240 17.117290144
diagCat_9     2.278031713 9.420936e-01  1.08533540  1.502722595  8.874660902
diagCat_10    1.408464867 2.692705e+00  3.06131496  0.045741869  1.497714064
day_1         1.351042738 1.862585e+00  6.43745973  1.325621189  8.075718633
day_2         0.505902108 6.470470e-01  4.05761636  1.350966228  3.790374898
day_3         1.104627894 1.080610e+01  0.76402203  0.332277120  0.088915147
day_4        19.666270572 1.289532e+00  1.39131067  0.031542487  0.000029358
day_5         2.652214692 3.671076e-03  5.78282222  0.849825839  3.883336229
day_6         3.084466186 2.762171e-03  0.94764722  2.002494842 12.048080355
day_7         4.391624402 1.532083e+01  0.89049050  0.911293269  3.309432992
``````
``` ```

A graph of the contribution of each variable to the first two dimensions is produced using `fviz_mca_var()` and as with the PCA analysis bar plots of the variable contributions to each dimension are produced using `fviz_contrib()` and specifying the dimension in the `axes=` argument.

``````
Input:
fviz_mca_var(df_mca)
fviz_contrib(df_mca, choice = "var", axes = 1)
fviz_contrib(df_mca, choice = "var", axes = 2)

Output:

``````
``` ```

## Factor analysis of mixed data

For this analysis we will require the FactoMineR and factoextra packages (Note: the packages are for use with R version 3.5.* and later).

Let's begin by creating a data set with mixed numerical and categorical data. This data set contains 100 observations of 4 categorical factors and 5 numerical factors.

``````
Input:
day=sample(1:7,100,replace=TRUE),
diagCat=sample(1:10,100,replace=TRUE),
complex=sample(0:1,100,replace=TRUE),
los=rexp(100, 5),
weight=rnorm(100, 78, 12),
height=rnorm(100, 165, 10),
procedures=rpois(100, 1.5),
age=rnorm(100, 45, 10))

Output:
admitType day diagCat complex        los   weight   height procedures      age
1         4   3      10       0 0.29656683 57.99030 165.3455          1 51.05371
2         2   4       7       1 0.05482085 86.83795 166.9023          1 39.93666
3         4   7       6       1 0.15006675 82.63232 166.7473          2 30.79434
4         4   4       5       1 0.21007950 74.81218 154.4498          0 46.27993
5         1   7       9       0 0.07231502 79.41773 169.7613          3 64.45851
6         3   7       4       0 0.19218883 79.60846 178.7857          2 53.00914
``````
``` ```

We now need to convert the columns that contain categorical variables from the numerical data type to factor type. This is done by identifying those columns to be converted to factor type, extracting the column names and passing them to the `lapply` function to iterate over them and perform the conversion.

``````
Input:
fac <- c(1:4)
facNames <- names(df_mix[fac])
df_mix[facNames] <- lapply(df_mix[facNames], factor)
str(df_mix)

Output:
'data.frame':	100 obs. of  9 variables:
\$ admitType : Factor w/ 4 levels "1","2","3","4": 4 2 4 4 1 3 4 3 2 4 ...
\$ day       : Factor w/ 7 levels "1","2","3","4",..: 3 4 7 4 7 7 5 3 2 7 ...
\$ diagCat   : Factor w/ 10 levels "1","2","3","4",..: 10 7 6 5 9 4 3 2 2 5 ...
\$ complex   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 1 1 ...
\$ los       : num  0.2966 0.0548 0.1501 0.2101 0.0723 ...
\$ weight    : num  58 86.8 82.6 74.8 79.4 ...
\$ height    : num  165 167 167 154 170 ...
\$ procedures: int  1 1 2 0 3 2 1 0 2 1 ...
\$ age       : num  51.1 39.9 30.8 46.3 64.5 ...
``````
``` ```

The data is now ready for analysis. The function for the mixed data type analysis is `FAMD()` and the `graph=FALSE` argument is used to suppress the automatically generated plots.

``````
df_famd <- FAMD(df_mix, graph=FALSE)
``````
``` ```

The important analysis outputs are similar to those for PCA and MCA. To create a customised scree plot showing as many or as few dimensions as you like, you need to extract the eigen values for the dimensions from the analysis and plot them as a bar graph.

``````
Input:
eig.val <- df_famd\$eig
x <- barplot(eig.val[, 2],
names.arg = 1:nrow(eig.val),
main = "Variances Explained by Dimensions (%)",
xlab = "Principal Dimensions",
ylab = "Percentage of variances",
col ="steelblue")

lines(x = x, eig.val[, 2],
type = "b", pch = 19, col = "red")

Output:

``````
``` ```

The contribution of the variables to explaining the variance in each dimension is extracted using the `get_famd_var()` function and calling the `\$contrib` part of the output.

``````
Input:
vars <- get_famd_var(df_famd)
vars\$contrib

Output:
Dim.1     Dim.2        Dim.3        Dim.4      Dim.5
los         5.076082  1.140650 4.157746e+00  0.006575909  4.5807267
weight      5.347653  1.019035 7.389999e+00  3.372029522  8.6797507
height      9.250164  5.405068 4.447199e-03  0.013552674  5.2811866
procedures  1.195516 10.568965 1.738896e+01  0.440959702  0.5531763
age         1.766913  1.189035 3.022982e+01  2.237621735  3.7033632
admitType  14.115936 12.372150 1.023400e+01  5.697187858 18.4482633
day        19.230572 36.062218 1.218112e+01 38.763925601 26.2901018
diagCat    26.130542 31.065785 1.841379e+01 49.292828000 31.8876683
complex    17.886623  1.177094 1.226301e-04  0.175318999  0.5757631
``````
``` ```

The contributions of the variables to dimensions one and two are plotted using the `fviz_famd_var()` function. The percentage contribution of each variable to each dimension's explanatory power is given as a bar plot using the `fviz_contrib()` function and the dimension is changed using the `axes=` argument.

``````
Input:
fviz_famd_var(df_famd, repel = TRUE)
fviz_contrib(df_famd, "var", axes = 1)
fviz_contrib(df_famd, "var", axes = 2)

Output:

``````
``` ```