This notebook is best used in conjunction with the recorded delivery of the training session which is available on https://youtu.be/5klSpGC2puU and the Advanced R presentation available in the https://gitlab.com/SManzi/r-for-healthcare-training.
Plot your data
Make a simple plot of your data and look at the shape of it
Import the fitdistrplus library
library(fitdistrplus)
#Import data and plot
data("groundbeef", package = "fitdistrplus")
my_data <- groundbeef $ serving
plot(my data, pch=20)
A histogram is a more useful way to view your data Here we use ggplot2 to plot the data or you could use the base R hist() function
ggplot(data=groundbeef) +
geom_histogram(mapping=aes(x=serving),bins=10,
col="black",
fill="grey")
Empirical density and cumulative distribution
¶
Empirical density – equivalent to histogram giving density of observations Cumulative distribution – Adds up density of observations
plotdist(my data, histo = TRUE, demp = TRUE)
Cullen and Frey graph
Assess the data in terms of skewness (+ve or -ve skew) and kurtosis (sharpness of the peak of the curve)
descdist(my data, discrete=FALSE, boot=500)
Fitting your data
# Individually fit specific distributions
fit_w <- fitdist(my_data, "weibull")
fit_g <- fitdist(my_data, "gamma")
fit_ln <- fitdist(my_data, "lnorm")
summary(fit_ln)
# Fit a list of named distributions and print the summary
dists <- c("gamma","lnorm","weibull")
fit <- list()
for (i in 1:length(dists)){
fit[[i]] <- fitdist(my_data, dists[i])
}
for (i in 1:length(dists)){
print(summary(fit[[i]]))
}
# Plot the results from the individually
#fitted distributions and compare
par(mfrow=c(2,2))
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
cdfcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
qqcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
ppcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
# Plot the results from the list of
#fitted distributions and compare
par(mfrow=c(2,2))
plot.legend <- dists
denscomp(fit, legendtext = plot.legend)
cdfcomp (fit, legendtext = plot.legend)
qqcomp (fit, legendtext = plot.legend)
ppcomp (fit, legendtext = plot.legend)
denscomp: probability density function
cdfcomp: cumulative density function
qqcomp: qq plot – compares the empirical cumulative distribution function of a data set with a specified theoretical cumulative distribution function
ppcomp: pp plot – compares the quantiles of a data distribution with the quantiles of a standardized theoretical distribution from a specified family of distributions
Credit: pp and qq plot descriptions https://v8doc.sas.com/sashtml/qc/chap8/sect9.htm#:~:text=A%20P%2DP%20plot%20compares%20the,a%20specified%20family%20of%20distributions.
Fitting distributions using the actuar package
The package fitdistrplus only contains a limited number of named distributions. The actuar package contains more named distributions to try extending fitdistrplus.
# Import the actuar library
library(actuar)
# For example
fit_ll <- fitdist(my_data, "llogis", start = list(shape = 1, scale = 500))
fit_P <- fitdist(my_data, "pareto", start = list(shape = 1, scale = 500))
fit_B <- fitdist(my_data, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1))
cdfcomp(list(fit_ln, fit_ll, fit_P, fit_B), xlogscale = TRUE, ylogscale = TRUE,
legendtext = c("lognormal", "loglogistic", "Pareto", "Burr"), lwd=2)
Goodness of fit
f <- gofstat(fit, fitnames=c("gamma","lnorm","weibull"))
g <- gofstat(list(fit_ln, fit_ll, fit_P, fit_B), fitnames = c("lnorm", "llogis", "Pareto", "Burr"))
Estimating uncertainty in the parameters
ests <- bootdist(fit_B, niter = 1e3)
summary(ests)
#Plot
plot(ests)
#95% bootstrap confidence interval
quantile(ests, probs=.05)