Tutorial
This tutorial shows how to create data visualizations using the StatsMakie grouping and styling APIs as well as the StatsMakie statistical recipes.
Grouping data by discrete variables
The first feature that StatsMakie adds to Makie is the ability to group data by some discrete variables and use those variables to style the result. Let's first create some vectors to play with:
N = 1000
a = rand(1:2, N) # a discrete variable
b = rand(1:2, N) # a discrete variable
x = randn(N) # a continuous variable
y = @. x * a + 0.8*randn() # a continuous variable
z = x .+ y # a continuous variable
To see how x and y relate to each other, we could simply try (be warned: the first plot is quite slow, the following ones will be much faster):
scatter(x, y, markersize = 0.2)
It looks like there are two components in the data, and we can ask whether they come from different values of the a variable:
scatter(Group(a), x, y, markersize = 0.2)
Group will split the data by the discrete variable we provided and color according to that variable. Colors will cycle across a range of default values, but we can easily customize those:
scatter(Group(a), x, y, color = [:black, :red], markersize = 0.2)
Of course, we are not limited to grouping with colors: we can use the shape of the marker instead. Group(a) defaults to Group(color = a), whereas Group(marker = a) will encode the information about variable a in the marker:
scatter(Group(marker = a), x, y, markersize = 0.2)
Grouping by many variables is also supported:
scatter(Group(marker = a, color = b), x, y, markersize = 0.2)
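To make the grouping idea concrete, here is a minimal sketch in plain Julia of what splitting by one or more discrete variables amounts to. It does not use StatsMakie; the helper split_indices is hypothetical and only illustrates the concept, so the library's internals may well differ.

```julia
# Hypothetical helper, not part of StatsMakie: map each distinct combination
# of values of the grouping variables to the row indices where it occurs.
function split_indices(groupvars...)
    groups = Dict{Any, Vector{Int}}()
    for i in eachindex(groupvars[1])
        key = map(v -> v[i], groupvars)   # e.g. (a[i], b[i])
        push!(get!(groups, key, Int[]), i)
    end
    return groups
end

a = [1, 2, 1, 2, 1]
b = [1, 1, 2, 2, 1]
x = [0.1, 0.2, 0.3, 0.4, 0.5]

groups = split_indices(a, b)
for key in sort(collect(keys(groups)))
    idx = groups[key]
    println(key, " => rows ", idx, ", x = ", x[idx])
end
```

Each subset of rows would then be drawn with its own combination of marker and color.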
Styling data with continuous variables
One advantage of using an inherently discrete quantity (like the shape of the marker) to encode a discrete variable is that continuous attributes (e.g. color within a colorscale) remain free for continuous variables. In this case, if we want to see how a, x, y, z interact, we could choose the marker according to a and style the color according to z:
scatter(Group(marker = a), Style(color = z), x, y)
Just like with Group, we can Style any number of attributes in the same plot. color is probably the most common; markersize is another sensible option (especially if we are already using color for the grouping):
scatter(Group(color = a), x, y, Style(markersize = z ./ 10))
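Because z is a sum of normally distributed values, it can be negative, so z ./ 10 may yield values that are not meaningful as sizes. One possible workaround, sketched here in plain Julia, is to rescale the variable onto a positive range first. The helper rescale and the target range 0.1 to 0.5 are assumptions chosen for illustration, not part of StatsMakie.

```julia
# Hypothetical helper: linearly map the values of `v` onto [lo, hi].
# Assumes `v` contains at least two distinct values.
function rescale(v, lo, hi)
    vmin, vmax = extrema(v)
    return @. lo + (v - vmin) * (hi - lo) / (vmax - vmin)
end

z = [-3.0, 0.0, 1.0, 5.0]
sizes = rescale(z, 0.1, 0.5)
println(sizes)
```

The rescaled vector could then be passed in place of z ./ 10, for example as Style(markersize = sizes).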
Split-apply-combine strategy with a plot
StatsMakie also has the concept of a "visualization" function (which is somewhat different from, but inspired by, Grammar of Graphics statistics). The idea is that any function whose return type is understood by StatsMakie (meaning, there is an appropriate visualization for it) can be passed as the first argument, and it will be applied to the arguments that follow.
A simple example is probably linear and non-linear regression.
Linear regression
StatsMakie knows how to compute both a linear and a non-linear fit of y as a function of x, via the "analysis functions" linear (linear regression) and smooth (local polynomial regression) respectively:
using StatsMakie: linear, smooth
plot(linear, x, y)
That was anticlimactic! It is the linear prediction of y given x, but it's a bit of a sad plot! We can make it more colorful by splitting our data by a, and everything will work as above:
plot(linear, Group(a), x, y)
And then we can plot it on top of the previous scatter plot, to make sure we got a good fit:
scatter(Group(a), x, y, markersize = 0.2)
plot!(linear, Group(a), x, y)
Here of course it makes sense to group both things by color, but for line plots we have other options like linestyle:
plot(linear, Group(linestyle = a), x, y)
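The split-apply-combine idea behind a grouped linear fit can be sketched without any plotting. The example below is plain Julia, not StatsMakie code: the helper fit_line is hypothetical and uses the closed-form ordinary least-squares solution, which may differ from what the linear analysis does internally. The data are generated as at the start of the tutorial, so the fitted slope in each group should come out close to the value of a.

```julia
using Random

# Hypothetical helper: ordinary least-squares fit of y ≈ intercept + slope * x.
function fit_line(x, y)
    n = length(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((x .- mx) .* (y .- my)) / sum((x .- mx) .^ 2)
    return (intercept = my - slope * mx, slope = slope)
end

Random.seed!(1)                 # fixed seed so the run is reproducible
N = 1000
a = rand(1:2, N)
x = randn(N)
y = x .* a .+ 0.8 .* randn(N)

# Split by `a`, apply the fit to each subset, and combine the results.
fits = Dict(g => fit_line(x[a .== g], y[a .== g]) for g in unique(a))
for g in sort(collect(keys(fits)))
    println("a = ", g, ": ", fits[g])
end
```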
A non-linear example
Using non-linear techniques here is not very interesting as linear techniques work quite well already, so let's change variables:
N = 200
x = 10 .* rand(N)
a = rand(1:2, N)
y = sin.(x) .+ 0.5 .* rand(N) .+ cos.(x) .* a
and then:
scatter(Group(a), x, y)
plot!(smooth, Group(a), x, y)
Different analyses
linear and smooth are two examples of possible analyses, but many more are possible and it's easy to add new ones. If we were interested in the distributions of x and y, for example, we could do:
plot(histogram, y)
The default plot type is determined by the dimensionality of the input and the analysis: with two variables one would get a heatmap:
plot(histogram, x, y)
This plot is reasonably customizable, in that one can pass keyword arguments to the histogram analysis:
plot(histogram(nbins = 30), x, y)
and change the default plot type to something else:
wireframe(histogram(nbins = 30), x, y)
Of course heatmap is the saner choice, but why not abuse Makie's 3D capabilities?
Other available analyses are density (to use kernel density estimation rather than binning) and frequency (to count occurrences of discrete variables).
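As a rough illustration of the computations behind a histogram and a frequency count, here is a plain-Julia sketch. It uses simple equal-width binning over the range of the data; the helpers bin_counts and count_values are hypothetical, and StatsMakie's own analyses may choose bin edges and return types differently.

```julia
# Count how many values of `v` fall into each of `nbins` equal-width bins
# spanning the range of the data. Assumes `v` has at least two distinct values.
function bin_counts(v, nbins)
    lo, hi = extrema(v)
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for val in v
        # clamp so that the maximum falls into the last bin
        i = clamp(floor(Int, (val - lo) / width) + 1, 1, nbins)
        counts[i] += 1
    end
    return counts
end

# Count the occurrences of each distinct value of a discrete variable.
function count_values(v)
    counts = Dict{eltype(v), Int}()
    for val in v
        counts[val] = get(counts, val, 0) + 1
    end
    return counts
end

binned = bin_counts([0.0, 0.1, 0.4, 0.5, 0.9, 1.0], 2)
freqs = count_values([1, 2, 2, 1, 2])
println(binned)
println(freqs)
```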
What if I have data instead?
If the data lives in a table rather than in separate vectors, it is possible to signal to StatsMakie that we are working from a DataFrame (or any table, actually), and it will interpret symbols as column names:
using DataFrames, RDatasets
iris = RDatasets.dataset("datasets", "iris")
scatter(Data(iris), Group(:Species), :SepalLength, :SepalWidth)
And everything else works as usual:
# use Position.stack to signal that you want bars stacked vertically rather than superimposed
plot(Position.stack, histogram, Data(iris), Group(:Species), :SepalLength)
wireframe(density(trim=true), Data(iris), Group(:Species), :SepalLength, :SepalWidth)
Wide data
Other than comparing the same column split by a categorical variable, one may also compare different columns put side by side (here in a Tuple, (:PetalLength, :PetalWidth)). The attribute that styles them has to be set to bycolumn. Here color will distinguish :PetalLength versus :PetalWidth, whereas the marker will distinguish the species.
scatter(
    Data(iris),
    Group(marker = :Species, color = bycolumn),
    :SepalLength, (:PetalLength, :PetalWidth)
)
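One way to think about plotting columns side by side is as a reshaping from wide to long form: the selected columns are stacked into a single vector, with a label recording which column each value came from, and that label plays the role of a grouping variable. The plain-Julia sketch below illustrates this with a small made-up table; the helper stack_columns is hypothetical and is not how StatsMakie necessarily implements bycolumn.

```julia
# Hypothetical helper: stack several columns of a table (here a NamedTuple
# of vectors) into one long vector, plus a label vector recording which
# column each value came from.
function stack_columns(table, cols)
    vals = reduce(vcat, [getproperty(table, c) for c in cols])
    labels = reduce(vcat, [fill(c, length(getproperty(table, c))) for c in cols])
    return vals, labels
end

# A tiny made-up table, for illustration only.
table = (SepalLength = [5.1, 4.9], PetalLength = [1.4, 1.3], PetalWidth = [0.2, 0.3])
cols = (:PetalLength, :PetalWidth)

vals, labels = stack_columns(table, cols)
xs = repeat(table.SepalLength, length(cols))   # the shared x column, once per stacked column

for (xi, vi, li) in zip(xs, vals, labels)
    println(li, ": x = ", xi, ", y = ", vi)
end
```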