Introduction

The ggcorr function plots correlation matrices as ggplot2 objects. It was inspired by a Stack Overflow question.

Rationale

Correlation matrices show the correlation coefficients between a relatively large number of continuous variables. However, while R offers a simple way to create such matrices through the cor function, it does not offer a plotting method for the matrices created by that function.
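As a brief illustration of this point, the sketch below calls cor on a few columns of the built-in mtcars dataset. That dataset is used here only for convenience and is not part of the examples in this vignette. The result is an ordinary numeric matrix that prints as a table of numbers.

```r
# Correlation matrix of four numeric columns from the built-in mtcars dataset
# (mtcars is an illustrative stand-in, not the data used in this vignette)
m <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# The result is a plain numeric matrix: symmetric, with 1s on the diagonal
round(m, 2)
is.matrix(m)
```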

The ggcorr function offers such a plotting method, using the “grammar of graphics” implemented in the ggplot2 package to render the plot. In practice, its results are graphically close to those of the corrplot function, which is part of the excellent arm package.

Installation

ggcorr is available through the GGally package:

install.packages("GGally")

It can also be used as a standalone function:

source("https://raw.githubusercontent.com/briatte/ggcorr/master/ggcorr.R")

Dependencies

The main package dependency of ggcorr is the ggplot2 package for plot construction.

library(ggplot2)

The ggplot2 package can be installed from CRAN through install.packages. Doing so will also install the reshape2 package, which is used internally by ggcorr for data manipulation.

Example: Basketball statistics

The examples shown in this vignette use NBA statistics shared by Nathan Yau at his excellent blog “Flowing Data”.

nba = read.csv("http://datasets.flowingdata.com/ppg2008.csv")

Let’s pass the entire dataset to ggcorr without any further work:

ggcorr(nba)
## Warning in ggcorr(nba): data in column(s) 'Name' are not numeric and were
## ignored
## Warning: Non Lab interpolation is deprecated

This example shows the default output of ggcorr. It also produces a warning indicating that one column of the dataset does not contain numeric data and was therefore dropped from the correlation matrix. The warning can be avoided by dropping the column from the dataset passed to ggcorr:

ggcorr(nba[, -1])
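When the positions of the non-numeric columns are not known in advance, they can also be filtered out programmatically. The sketch below uses a small made-up data frame (toy, a hypothetical stand-in for nba) and keeps only its numeric columns with base R's sapply and is.numeric. The resulting data frame could then be passed to ggcorr.

```r
# Hypothetical toy data standing in for the nba dataset
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  G    = c(79, 81, 82, 81),
  PTS  = c(30.2, 28.4, 26.8, 25.9),
  stringsAsFactors = FALSE
)

# Flag the numeric columns, then keep only those
numeric_cols <- sapply(toy, is.numeric)
toy_numeric  <- toy[, numeric_cols, drop = FALSE]

names(toy_numeric)
```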

Note: when used with a continuous color scale, ggcorr also currently produces a warning related to color interpolation. This is an innocuous warning that should disappear in future updates of the ggplot2 and scales packages. This warning is hidden in the rest of this vignette.

Correlation dataset

The first argument of ggcorr is called data. It accepts either a data frame, as shown above, or a matrix of observations, which will be converted to a data frame before plotting:

ggcorr(matrix(runif(5), 2, 5))

ggcorr can also accept a correlation matrix through the cor_matrix argument, in which case its first argument must be set to NULL to indicate that ggcorr should use the correlation matrix instead:

ggcorr(data = NULL, cor_matrix = cor(nba[, -1], use = "everything"))
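Passing a precomputed matrix can be convenient when the correlations come from an earlier step of an analysis. The sketch below assumes the built-in mtcars dataset as a stand-in, computes a Spearman correlation matrix, runs two basic sanity checks, and only calls ggcorr if the GGally package happens to be installed.

```r
# Precompute a Spearman correlation matrix (mtcars is an illustrative stand-in)
cm <- cor(mtcars[, 1:5], method = "spearman", use = "pairwise.complete.obs")

# Sanity checks: coefficients lie in [-1, 1] and the matrix is symmetric
stopifnot(all(abs(cm) <= 1 + 1e-12), isSymmetric(cm))

# Build the plot only if GGally is available, using the arguments described above
if (requireNamespace("GGally", quietly = TRUE)) {
  p <- GGally::ggcorr(data = NULL, cor_matrix = cm)
}
```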

Correlation methods

ggcorr supports all correlation methods offered by the cor function. The method is controlled by the method argument, which takes a character vector of two strings:

  1. The first setting that needs to be taken into account in a correlation matrix is the selection of observations to be used. This setting might take any of the following values: "everything", "all.obs", "complete.obs", "na.or.complete" or "pairwise.complete.obs" (the default used by ggcorr). These settings control how covariances are computed in the presence of missing values. The difference between each of them is explained in the documentation of the cor function.
  2. The second setting that ggcorr requires is the type of correlation coefficient to be computed. There are three possible values for it: "pearson" (the default used both by ggcorr and by cor), "kendall" or "spearman". Again, the difference between each setting is explained in the documentation of the cor function. Generally speaking, unless the data are ordinal, the default choice should be "pearson", which produces correlation coefficients based on Pearson’s method.

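Here are some examples showing how to pass different correlation methods to ggcorr. As with the use argument of cor, the first string can be abbreviated, as in "pairwise" or "complete":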

# Pearson correlation coefficients, using pairwise observations (default method)
ggcorr(nba[, -1], method = c("pairwise", "pearson"))
# Pearson correlation coefficients, using all observations
ggcorr(nba[, -1], method = c("everything", "pearson"))
# Kendall correlation coefficients, using complete observations
ggcorr(nba[, -1], method = c("complete", "kendall"))
# Spearman correlation coefficients, using strictly complete observations
ggcorr(nba[, -1], method = c("all.obs", "spearman"))

If only the first string is provided, ggcorr will default to "pearson" for the second.
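The effect of the first setting is easiest to see with missing values. The sketch below calls cor directly on a small made-up data frame (d, hypothetical) with one missing value, and compares three of the settings listed above.

```r
# Hypothetical toy data with one missing value in column y
d <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, NA, 6, 8, 11),
  z = c(5, 3, 4, 1, 2)
)

# "everything": off-diagonal coefficients involving y are NA
r_all <- cor(d, use = "everything", method = "pearson")

# "complete.obs": rows with any NA are dropped before every computation
r_complete <- cor(d, use = "complete.obs", method = "pearson")

# "pairwise.complete.obs": each pair uses the rows complete for that pair
r_pairwise <- cor(d, use = "pairwise.complete.obs", method = "pearson")

is.na(r_all["x", "y"])   # y contains a missing value
r_complete["x", "z"]     # computed on the 4 complete rows
r_pairwise["x", "z"]     # computed on all 5 rows, since x and z have no NA
```

In this toy example, the coefficient between x and z differs between the last two settings because the row in which y is missing is discarded under "complete.obs" but retained under "pairwise.complete.obs".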

Plotting parameters

The rest of this vignette focuses on how to tweak the appearance of the correlation matrix plotted by ggcorr.

Controlling the color scale

By default, ggcorr uses a continuous color scale that extends from \(-1\) to \(+1\) to show the strength of each correlation represented in the matrix. To switch to categorical colors, add the nbreaks argument, which specifies how many breaks the color scale should contain:

ggcorr(nba[, 2:15], nbreaks = 5)

When the nbreaks argument is used, the number of digits shown in the color scale is controlled through the digits argument. The digits argument defaults to two digits, but as shown in the example above, it will default to a single digit if the breaks do not require more precision.
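To make the idea of a categorical scale concrete, the sketch below bins a few example coefficients into equally spaced intervals over the range from -1 to 1, using base R's seq and cut. It only illustrates the general technique and is not necessarily how ggcorr implements it internally. The variable names are hypothetical.

```r
nbreaks <- 5

# Equally spaced cut points over the range of a correlation coefficient
breaks <- seq(-1, 1, length.out = nbreaks + 1)

# Assign a few example coefficients to their categories;
# dig.lab controls the number of digits used in the interval labels
r <- c(-0.9, 0, 0.9)
bins <- cut(r, breaks = breaks, include.lowest = TRUE, dig.lab = 2)

nlevels(bins)      # number of categories
as.integer(bins)   # index of the interval each coefficient falls into
```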

Further control over the color scale includes the name argument, which sets its title, the legend.size argument, which sets the size of the legend text, and the legend.position argument, which controls where the legend is displayed. The latter two are just shortcuts to the same arguments in ggplot2’s theme, and since the plot is a ggplot2 object, all other relevant theme and guides methods also apply:

ggcorr(nba[, 2:15], name = expression(rho), legend.position = "bottom", legend.size = 12) +
  guides(fill = guide_colorbar(barwidth = 18, title.vjust = 0.75)) +
  theme(legend.title = element_text(size = 14))

Controlling the color palette

ggcorr uses a default color gradient that goes from bright red to light grey to bright blue. This gradient can be modified through the low, mid and high arguments, which are similar to those of the scale_fill_gradient2 and scale_color_gradient2 functions in ggplot2:

ggcorr(nba[, 2:15], low = "steelblue", mid = "white", high = "darkred")
## Warning: Non Lab interpolation is deprecated
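For readers who want to see what a diverging three-color gradient looks like in plain ggplot2, the sketch below builds a simple tile plot by hand with scale_fill_gradient2. It assumes the built-in mtcars dataset as a stand-in and is only an approximation of the kind of plot ggcorr produces, not a reproduction of its internals.

```r
# Correlation matrix reshaped to long format (mtcars is an illustrative stand-in)
cm <- cor(mtcars[, 1:5])
long <- as.data.frame(as.table(cm))   # columns: Var1, Var2, Freq

# Build the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = Var1, y = Var2, fill = Freq)) +
    geom_tile(color = "white") +
    # low, mid and high set the colors at -1, 0 and +1 respectively
    scale_fill_gradient2(low = "steelblue", mid = "white", high = "darkred",
                         midpoint = 0, limits = c(-1, 1))
}
```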