Distribution fitting with fitdistrplus

This notebook is best used in conjunction with the recorded delivery of the training session which is available on https://youtu.be/5klSpGC2puU and the Advanced R presentation available in the https://gitlab.com/SManzi/r-for-healthcare-training.

Plot your data

Make a simple plot of your data and look at the shape of it

Import the fitdistrplus library

library(fitdistrplus)
#Import data and plot
data("groundbeef", package = "fitdistrplus")
my_data <- groundbeef $ serving
plot(my data, pch=20)

A histogram is a more useful way to view your data Here we use ggplot2 to plot the data or you could use the base R hist() function

ggplot(data=groundbeef) +
  geom_histogram(mapping=aes(x=serving),bins=10,
                 col="black",
                 fill="grey")

Empirical density and cumulative distribution

Empirical density – equivalent to histogram giving density of observations Cumulative distribution – Adds up density of observations

plotdist(my data, histo = TRUE, demp = TRUE)

Cullen and Frey graph

Assess the data in terms of skewness (+ve or -ve skew) and kurtosis (sharpness of the peak of the curve)

descdist(my data, discrete=FALSE, boot=500)

Fitting your data

# Individually fit specific distributions
fit_w  <- fitdist(my_data, "weibull")
fit_g  <- fitdist(my_data, "gamma")
fit_ln <- fitdist(my_data, "lnorm")
summary(fit_ln)
# Fit a list of named distributions and print the summary
dists <- c("gamma","lnorm","weibull")
fit <- list()
for (i in 1:length(dists)){
  fit[[i]]  <- fitdist(my_data, dists[i])
}

for (i in 1:length(dists)){
  print(summary(fit[[i]]))
}
# Plot the results from the individually
#fitted distributions and compare
par(mfrow=c(2,2))
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
cdfcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
qqcomp  (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
ppcomp  (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
# Plot the results from the list of
#fitted distributions and compare
par(mfrow=c(2,2))
plot.legend <- dists
denscomp(fit, legendtext = plot.legend)
cdfcomp (fit, legendtext = plot.legend)
qqcomp  (fit, legendtext = plot.legend)
ppcomp  (fit, legendtext = plot.legend)

denscomp: probability density function
cdfcomp: cumulative density function
qqcomp: qq plot – compares the empirical cumulative distribution function of a data set with a specified theoretical cumulative distribution function
ppcomp: pp plot – compares the quantiles of a data distribution with the quantiles of a standardized theoretical distribution from a specified family of distributions
Credit: pp and qq plot descriptions https://v8doc.sas.com/sashtml/qc/chap8/sect9.htm#:~:text=A%20P%2DP%20plot%20compares%20the,a%20specified%20family%20of%20distributions.

Fitting distributions using the actuar package

The package fitdistrplus only contains a limited number of named distributions. The actuar package contains more named distributions to try extending fitdistrplus.

# Import the actuar library
library(actuar)
# For example
fit_ll <- fitdist(my_data, "llogis", start = list(shape = 1, scale = 500))
fit_P  <- fitdist(my_data, "pareto", start = list(shape = 1, scale = 500))
fit_B  <- fitdist(my_data, "burr",   start = list(shape1 = 0.3, shape2 = 1, rate = 1))
cdfcomp(list(fit_ln, fit_ll, fit_P, fit_B), xlogscale = TRUE, ylogscale = TRUE,
        legendtext = c("lognormal", "loglogistic", "Pareto", "Burr"), lwd=2)

Goodness of fit

f <- gofstat(fit, fitnames=c("gamma","lnorm","weibull"))
g <- gofstat(list(fit_ln, fit_ll, fit_P, fit_B), fitnames = c("lnorm", "llogis", "Pareto", "Burr"))

Estimating uncertainty in the parameters

ests <- bootdist(fit_B, niter = 1e3)
summary(ests)
#Plot
plot(ests)
#95% bootstrap confidence interval
quantile(ests, probs=.05)