The ggcorr
function is a visualization function to plot correlation matrixes as ggplot2
objects. It was inspired by a Stack Overflow question.
Correlation matrixes show the correlation coefficients between a relatively large number of continuous variables. However, while R offers a simple way to create such matrixes through the cor
function, it does not offer a plotting method for the matrixes created by that function.
The ggcorr
function offers such a plotting method, using the “grammar of graphics” implemented in the ggplot2
package to render the plot. In practice, its results are graphically close to those of the corrplot
function, which is part of the excellent arm
package.
ggcorr
is available through the GGally
package:
install.packages("GGally")
It can also be used as a standalone function:
source("https://raw.githubusercontent.com/briatte/ggcorr/master/ggcorr.R")
The main package dependency of ggcorr
is the ggplot2
package for plot construction.
library(ggplot2)
The ggplot2
package can be installed from CRAN through install.packages
. Doing so will also install the reshape2
package, which is used internally by ggcorr
for data manipulation.
The examples shown in this vignette use NBA statistics shared by Nathan Yau at his excellent blog “Flowing Data”.
nba = read.csv("http://datasets.flowingdata.com/ppg2008.csv")
Let’s pass the entire dataset to ggcorr
without any further work:
ggcorr(nba)
## Warning in ggcorr(nba): data in column(s) 'Name' are not numeric and were
## ignored
## Warning: Non Lab interpolation is deprecated
This example shows the default output of ggcorr
. It also produced a warning to indicate that one column of the dataset did not contain numeric data and was therefore dropped from the correlation matrix. The warning can be avoided by dropping the column from the dataset passed to ggcorr
:
ggcorr(nba[, -1])
Note: when used with a continuous color scale, ggcorr
also currently produces a warning related to color interpolation. This is an innocuous warning that should disappear in future updates of the ggplot2
and scales
packages. This warning is hidden in the rest of this vignette.
The first argument of ggcorr
is called data
. It accepts either a data frame, as shown above, or a matrix of observations, which will be converted to a data frame before plotting:
ggcorr(matrix(runif(50), 10, 5))
ggcorr
can also accept a correlation matrix through the cor_matrix
argument, in which case its first argument must be set to NULL
to indicate that ggcorr
should use the correlation matrix instead:
ggcorr(data = NULL, cor_matrix = cor(nba[, -1], use = "everything"))
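As a hedged illustration of this second route, the sketch below builds a correlation matrix with base R's cor on the built-in mtcars data (an assumption made here so that the code runs without downloading the NBA file) and hands it to ggcorr only if GGally is installed.

```r
# Correlation matrix computed up front; mtcars stands in for the NBA data.
cm <- cor(mtcars[, 1:5], use = "pairwise.complete.obs", method = "pearson")

# A correlation matrix is square, symmetric, and has ones on its diagonal.
print(round(cm, 2))

# Plot it only when GGally is available, passing NULL as the data argument.
if (requireNamespace("GGally", quietly = TRUE)) {
  p <- GGally::ggcorr(data = NULL, cor_matrix = cm)
}
```

Computing the matrix separately can be convenient when the same coefficients are also needed elsewhere, for instance in a table.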
ggcorr
supports all correlation methods offered by the cor
function. The method is controlled by the method
argument, which takes two character strings. The first one controls how covariances are computed in the presence of missing values, and must be one of
"everything"
, "all.obs"
, "complete.obs"
, "na.or.complete"
or "pairwise.complete.obs"
(the default used by ggcorr
). The difference between each of these settings is explained in the documentation of the cor
function. The second string that ggcorr
requires is the type of correlation coefficient to be computed. There are three possible values for it: "pearson"
(the default used both by ggcorr
and by cor
), "kendall"
or "spearman"
. Again, the difference between each setting is explained in the documentation of the cor
function. Generally speaking, unless the data are ordinal, the default choice should be "pearson"
, which produces correlation coefficients based on Pearson’s method.
Here are some examples showing how to pass different correlation methods to ggcorr
:
# Pearson correlation coefficients, using pairwise observations (default method)
ggcorr(nba[, -1], method = c("pairwise", "pearson"))
# Pearson correlation coefficients, using all observations
ggcorr(nba[, -1], method = c("everything", "pearson"))
# Kendall correlation coefficients, using complete observations
ggcorr(nba[, -1], method = c("complete", "kendall"))
# Spearman correlation coefficients, using strictly complete observations
ggcorr(nba[, -1], method = c("all.obs", "spearman"))
If no second argument is provided, ggcorr
will default to "pearson"
.
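To see what the first string changes in practice, here is a minimal sketch in base R. It assumes that the two strings are forwarded to the use and method arguments of cor, which is how the description above reads; the small data frame is hypothetical and exists only to introduce missing values.

```r
# Toy data with missing values (hypothetical, for illustration only).
df <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 1, 4, 3, 5),
  c = c(5, 3, NA, 1, 2)
)

# First string -> cor()'s `use` argument; second string -> its `method` argument.
pairwise   <- cor(df, use = "pairwise.complete.obs", method = "pearson")
complete   <- cor(df, use = "complete.obs",          method = "pearson")
everything <- cor(df, use = "everything",            method = "pearson")

pairwise["a", "b"]    # uses the four rows where both a and b are observed
complete["a", "b"]    # uses only the three rows with no missing value at all
everything["a", "b"]  # NA, because a contains a missing value
```

The pairwise setting keeps the most observations for each pair, which is presumably why it makes a practical default for plotting.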
The rest of this vignette focuses on how to tweak the appearance of the correlation matrix plotted by ggcorr
.
By default, ggcorr
uses a continuous color scale that extends from \(-1\) to \(+1\) to show the strength of each correlation represented in the matrix. To switch to categorical colors, all the user has to do is to add the nbreaks
argument, which specifies how many breaks should be contained in the color scale:
ggcorr(nba[, 2:15], nbreaks = 5)
When the nbreaks
argument is used, the number of digits shown in the color scale is controlled through the digits
argument. The digits
argument defaults to two digits, but as shown in the example above, only a single digit is displayed if the breaks do not require more precision.
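To make the effect of nbreaks and digits concrete, here is a minimal sketch that bins a few made-up coefficients into equal-width intervals over the full range with base R's cut. Whether ggcorr computes its breaks in exactly this way is an assumption, and the coefficient values are hypothetical.

```r
nbreaks <- 5
digits  <- 2

# Equal-width break points spanning the full correlation range.
breaks <- seq(-1, 1, length.out = nbreaks + 1)

# Hypothetical coefficients, binned into nbreaks categories.
coefs <- c(-0.85, -0.30, 0.05, 0.45, 0.92)
bins  <- cut(coefs, breaks = breaks, include.lowest = TRUE, dig.lab = digits)

levels(bins)  # interval labels only use as many digits as the breaks need
table(bins)
```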
Further control over the color scale includes the name
argument, which sets its title, the legend.size
argument, which sets the size of the legend text, and the legend.position
argument, which controls where the legend is displayed. The latter two are just shortcuts to the same arguments in ggplot2
’s theme
, and since the plot is a ggplot2
object, all other relevant theme
and guides
methods also apply:
ggcorr(nba[, 2:15], name = expression(rho), legend.position = "bottom", legend.size = 12) +
guides(fill = guide_colorbar(barwidth = 18, title.vjust = 0.75)) +
theme(legend.title = element_text(size = 14))
ggcorr
uses a default color gradient that goes from bright red to light grey to bright blue. This gradient can be modified through the low
, mid
and high
arguments, which are similar to those of the scale_fill_gradient2
function in ggplot2
:
ggcorr(nba[, 2:15], low = "steelblue", mid = "white", high = "darkred")
## Warning: Non Lab interpolation is deprecated
By default, the midpoint of the gradient is set at \(0\), which indicates a null correlation. The midpoint
argument can be used to modify this setting. In particular, setting midpoint
to NULL
will automatically select the median correlation coefficient as the midpoint, and will show that value to the user:
ggcorr(nba[, 2:15], midpoint = NULL)
## Color gradient midpoint set at median correlation to 0.08
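A rough idea of where such a midpoint comes from can be had in base R. The sketch below takes the median of the off-diagonal coefficients of a correlation matrix computed on the built-in mtcars data; both the stand-in dataset and the choice to leave out the diagonal are assumptions made for illustration, not a description of the ggcorr internals.

```r
cm <- cor(mtcars[, 1:6])

# Keep each pair once and leave out the diagonal of ones.
coefs <- cm[lower.tri(cm)]

midpoint <- median(coefs)
round(midpoint, 2)
```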
A final option for controlling the colors of the color scale is to use a ColorBrewer palette through the palette
argument. This argument should be used only when the color scale is categorical, i.e. when the nbreaks
argument is in use:
ggcorr(nba[, 2:15], nbreaks = 4, palette = "RdGy")
Note: trying to use a ColorBrewer palette on a color scale that contains more breaks than there are colors in the palette will return a warning (two identical warnings, actually) to the user.
By default, ggcorr
uses color tiles to represent the strength of the correlation coefficients, in similar fashion to how heatmaps represent counts of observations (the data used in this vignette were initially used to illustrate such heatmaps).
ggcorr
can also represent the correlations as proportionally sized circles. All it takes is to set its geom
argument to "circle"
:
ggcorr(nba[, 2:15], geom = "circle", nbreaks = 5)
Additionally, the user might set the minimum and maximum size of the circles through the min_size
and max_size
arguments:
ggcorr(nba[, 2:15], geom = "circle", nbreaks = 5, min_size = 0, max_size = 6)
Additional controls over the geometry of ggcorr
are illustrated towards the end of this vignette.
ggcorr
can show the correlation coefficients on top of the correlation matrix by setting the label
argument to TRUE
:
ggcorr(nba[, 2:15], label = TRUE)
The label_color
and label_size
arguments control the style of the coefficient labels:
ggcorr(nba[, 2:15], nbreaks = 4, palette = "RdGy", label = TRUE, label_size = 3, label_color = "white")
The label_round
argument further controls the number of digits shown in the coefficient labels, which defaults to a single digit, while the label_alpha
argument controls the level of transparency of the labels. If label_alpha
is set to TRUE
, the opacity of each label will follow the absolute value of its correlation coefficient, so that labels fade as the coefficient approaches \(0\):
ggcorr(nba[, 2:15], label = TRUE, label_size = 3, label_round = 2, label_alpha = TRUE)
In several of the examples above, the rendering of the variable labels (which are shown on the diagonal of the correlation matrix) is not necessarily optimal. To modify the appearance of these labels, all the user has to do is pass any argument supported by geom_text
directly to ggcorr
. The example below shows how to reduce the size of the labels while moving them to the left and changing their color:
ggcorr(nba[, 2:15], hjust = 0.75, size = 5, color = "grey50")
One issue that is likely to arise with variable labels in a correlation matrix is that they will be too long to be displayed in full at the bottom-left of the plot. This issue is illustrated below by starting the correlation matrix with the MIN
variable, which appears to be slightly clipped at the very bottom-left of the plot:
ggcorr(nba[, 3:16], hjust = 0.75, size = 5, color = "grey50")
To solve this issue, ggcorr
can add some whitespace to the horizontal axis of the plot through the layout.exp
argument. Passing any numeric value to this argument will add one or more ‘invisible tile(s)’ to the left of the plot, which can help display variables with long names:
ggcorr(nba[, 3:16], hjust = 0.75, size = 5, color = "grey50", layout.exp = 1)
It might be useful, in some circumstances, to show the empirical range of correlation coefficients instead of the full \((-1, +1)\) range in the color scale. When the color scale is a continuous color gradient, this can be achieved by setting the limits
argument to FALSE
:
ggcorr(nba[, 2:15], limits = FALSE)
When the color scale is categorical, setting the limits
argument to FALSE
or, equivalently, setting the drop
argument to TRUE
will drop the breaks that do not correspond to any of the correlation coefficients:
ggcorr(nba[, 2:15], nbreaks = 9, limits = FALSE)
ggcorr(nba[, 2:15], nbreaks = 9, drop = TRUE)
If the geom
argument of ggcorr
is set to "text"
, it will represent the correlation coefficients as their (colored) values:
ggcorr(nba[, 2:15], geom = "text", nbreaks = 5, palette = "RdYlBu", hjust = 1)
The size of these values will be set to that of label_size
, which makes it possible to superimpose coefficient labels:
ggcorr(nba[, 2:15], geom = "text", nbreaks = 5, palette = "RdYlBu", hjust = 1, label = TRUE, label_alpha = 0.5)
Last, if the geom
argument of ggcorr
is set to "blank"
, it will plot nothing, which is useful when hacking into the internal values of the plot, as illustrated below.
Since ggcorr
produces ggplot2
objects, it can be useful to understand how the object is constructed in order to obtain more specific plots from it. Every ggcorr
object contains the following data
object:
head(ggcorr(nba[, 2:15])$data, 5)
## x y coefficient label
## 2 MIN G 0.18686608 0.2
## 3 PTS G 0.06309908 0.1
## 4 FGM G 0.03992195 0.0
## 5 FGA G -0.05958051 -0.1
## 6 FGP G 0.18087541 0.2
This allows for a fair amount of “hacking” into the internal values of ggcorr
, as in the following example, which highlights all correlation coefficients greater than \(0.5\) or less than \(-0.5\), using different colors for negative and positive coefficients:
ggcorr(nba[, 2:15], geom = "blank", label = TRUE, hjust = 0.75) +
geom_point(size = 10, aes(color = coefficient > 0, alpha = abs(coefficient) > 0.5)) +
scale_alpha_manual(values = c("TRUE" = 0.25, "FALSE" = 0)) +
guides(color = "none", alpha = "none")
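The same kind of filtering can be tried without a plot. The sketch below rebuilds a data frame with the column names shown above (x, y, coefficient, label) from a plain correlation matrix, using mtcars as a stand-in dataset, and keeps the strong correlations. The construction is an illustration only and is not taken from the ggcorr source.

```r
cm <- cor(mtcars[, 1:6])

# One row per pair of distinct variables, taken from the lower triangle.
idx  <- which(lower.tri(cm), arr.ind = TRUE)
long <- data.frame(
  x           = rownames(cm)[idx[, "row"]],
  y           = colnames(cm)[idx[, "col"]],
  coefficient = cm[idx],
  stringsAsFactors = FALSE
)
long$label <- round(long$coefficient, 1)

# Coefficients above 0.5 in absolute value, flagged by sign.
strong <- long[abs(long$coefficient) > 0.5, ]
strong$positive <- strong$coefficient > 0
head(strong, 5)
```

Working on a plain data frame like this makes it easy to check a filter before mapping it to aesthetics such as color or alpha.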
ggcorr
is strictly limited to correlation matrixes: it cannot plot heatmaps or cluster heatmaps (the latter can be plotted with ggplot2
through the gapmap
package). For scatterplot matrixes, see the ggpairs
function, which is also part of the GGally
package.
If you find other limitations to ggcorr
, please submit an issue about them, thanks!
Last printed Sep 11, 2015.