Blog - advanced ggplot

df <- data.frame(
  x = c(3, 1, 5), 
  y = c(2, 4, 6), 
  label = c("a","b","c")
)

df

A data.frame: 3 × 3
x	y	label
<dbl>	<dbl>	<chr>
3	2	a
1	4	b
5	6	c

p <- ggplot(df, aes(x, y, label = label)) + 
  labs(x = NULL, y = NULL) + # Hide axis label
  theme(plot.title = element_text(size = 12)) # Shrink plot title

p + geom_point() + ggtitle("point")

p + geom_text() + ggtitle("text")

p + geom_bar(stat = "identity") + ggtitle("bar")

p + geom_tile() + ggtitle("raster")

p + geom_line() + ggtitle("line")

p + geom_area()+ggtitle("area")

p + geom_path() + ggtitle("path")

collective geoms

Geoms can be roughly divided into individual and collective geoms.

different groups on different layers

we want to plot summaries that use different levels of aggregation.

data(Oxboys, package = "nlme")
head(Oxboys)

A nfnGroupedData: 6 × 4
	Subject	age	height	Occasion
	<ord>	<dbl>	<dbl>	<ord>
1	1	-1.0000	140.5	1
2	1	-0.7479	143.4	2
3	1	-0.4630	144.8	3
4	1	-0.1643	147.1	4
5	1	-0.0027	147.7	5
6	1	0.2466	150.2	6

ggplot(Oxboys, aes(age, height, group = Subject)) + 
  geom_line() + 
  geom_smooth(method = "lm", se = FALSE)
#> `geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

ggplot(Oxboys, aes(age, height)) + 
  geom_line(aes(group = Subject)) + 
  geom_smooth(method = "lm", size = 2, se = FALSE)

Warning message:
“Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.”
`geom_smooth()` using formula = 'y ~ x'

head(Oxboys)

A nfnGroupedData: 6 × 4
	Subject	age	height	Occasion
	<ord>	<dbl>	<dbl>	<ord>
1	1	-1.0000	140.5	1
2	1	-0.7479	143.4	2
3	1	-0.4630	144.8	3
4	1	-0.1643	147.1	4
5	1	-0.0027	147.7	5
6	1	0.2466	150.2	6

ggplot(Oxboys,aes(Occasion,height))+
    geom_boxplot()

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot() +
  geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)

matching aesthetics to graphic objects

lines and paths operate on the first value principle: each segment is defined by two observations
ggplot2 applies the aesthetic value associated with the first observation when drawing the segment.

df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))

ggplot(df, aes(x, y, colour = factor(colour))) + 
  geom_line( aes(group = 1),size = 2) +
  geom_point(size = 5)

ggplot(df, aes(x, y, colour = colour)) + 
  geom_line(aes(group = 1),size = 2) +
  geom_point(size = 5)

在左手边的颜色是离散的，右手边是连续的，即使颜色变量是连续的，ggplot不会平滑

library(ggplot2)
ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

Warning message:
“The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(prop)` instead.”

group="whatever" 是一个”虚拟”分组来覆盖默认行为，(这里)是按 cut 分组，通常是按x 变量.geom_bar的默认值是按 x 变量分组，以便分别计算 x 变量的每个级别中的行数.例如，在这里，geom_bar默认返回cut 等于"Fair"、"Good"等的行数.

但是，如果我们想要比例，那么我们需要将所有级别的cut一起考虑.在第二个图中，数据首先按cut分组，因此分别考虑 cut的每个级别.Fair in Fair 的比例是 100%，Good in Good 等的比例也是如此.group=1(或 group="x" 等)阻止了这一点，因此每个级别的削减比例将相对于所有削减水平.

xgrid <- with(df, seq(min(x), max(x), length = 50))
interp <- data.frame(
  x = xgrid,
  y = approx(df$x, df$y, xout = xgrid)$y,
  colour = approx(df$x, df$colour, xout = xgrid)$y  
)
ggplot(interp, aes(x, y, colour = colour)) + 
  geom_line(size = 2) +
  geom_point(data = df, size = 5)

若我们要进行混合渐变形式，

ggplot(mpg, aes(class)) + 
  geom_bar()
ggplot(mpg, aes(class, fill = drv)) + 
  geom_bar()

显示多种颜色，需要多种的bars对于每一个class

statistical summaries

A layer combines data, aesthetic mapping, a geom (geometric object), a stat (statistical transformation), and a position adjustment. Typically, you will create layers using a geom_ function, overriding the default position and stat if needed.

revealing uncertainty

having the infomation about the uncertainty present in your idea

discrete x,range:geom_errorbar(),geom)linerange()
discrete x,range&center:geom_crossbar(),geom_pointrange()
continuous x,range:geom_ribbon()
continuous x,range&center:geom_smooth(stat="identity")

y <- c(18,11,16)
df <- data.frame(x=1:3,y=y,se=c(1.2,0.5,1.0))

base <- ggplot(df,aes(x,y,ymin=y-se,ymax=y+se))

箱线图

base+geom_crossbar()

base+geom_pointrange()

base+geom_smooth(stat="identity")

base+geom_errorbar()
base+geom_linerange()
base+geom_ribbon()

weighted data

dealing with overplotting

constructing a bi-gauss distribution

df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
norm + geom_point()
norm + geom_point(shape = 1) # Hollow circles "1"is circle,"2"is rectangle
norm + geom_point(shape = ".") # Pixel sized

adjust the opacity

norm + geom_point(alpha = 1 / 3)
norm + geom_point(alpha = 1 / 5)
norm + geom_point(alpha = 1 / 10)

norm

norm +geom_bin2d()
norm + geom_bin2d(bins=10)

Estimate the 2d density with stat_density2d()

Statistical summaries

bar

ggplot(diamonds, aes(color)) + 
  geom_bar()

ggplot(diamonds, aes(color, price)) + 
  geom_bar(stat = "summary_bin", fun = mean)

surfaces

we are considered two classes of geoms: - simple geoms where there’s a one-on-one correspondence between rows in the data - statistical geoms where introduce a layer of statistical summaries in between the raw data and the fault - we will consider cases where a visualization of a three dimensional surface

data(faithfuld)

ggplot(faithfuld,aes(eruptions,waiting))+
    geom_contour(aes(z=density,color=..level..))

..level.. 变量

..意味着一个内部计算的变量

演示相同的分布做一个热力图

ggplot(faithfuld, aes(eruptions, waiting)) + 
  geom_raster(aes(fill = density))

generated variables

a stat takes a data frame as input and returns a data frame as output, and so a stat can add new variables to the original dataset

Geoms

geometric objects or geoms for short,perform the actual rendering of the layer, controlling the type of plot that you create.

graphical primitives:
- geom_blank()：啥也没有
- geom_point()points
- geom_path()
- geom_rect() rectangles.
- geom_ploygon() filled polygons.
- geom_text()
One variable:
- discrete
- continuous
two variables:
- both continuous:
  - geom_point()
  - geom_smooth()
three variables:
- geom_contour()
- geom_tile()
- geom_raster(): fast version of geom_tile() for equal sized tiles

Stats

统计变换，或统计转换数据，通常是用某种方式来总结它

stat_bin()
stat_bin2d()
stat_bindot()
stat_binplot()

other stats can’t be created with a geom_ function

ggplot(mpg,aes(trans,cty))+
    geom_point()+
    stat_summary(geom="point",fun="mean",color="red",size=4)

ggplot(mpg,aes(trans,cty))+
    geom_point()+
    geom_point(stat="summary",fun="mean",color="red",size=4)

the way to use these functions. you can either add a stat_() function and override the default geom or add a geom_() function and override the default stat:

generated variables

a stat takes a data frame as input and returns a data frame as output, and so a stat can add new variables to the original dataset. it is possible to map aesthetics to these new variables. example: stat_bin 用于构建histogram 产生一系列的变量： - count,the number of observation in each bin - density the density of observation in each bin - x the centre of the bin

ggplot(diamonds,aes(price))+
    geom_histogram(binwidth = 500)

the after_stat() must wrap the name, preventing the confusion in case the original dataset includes a variables with the same name as a generated variable

ggplot(diamonds,aes(price))+
    geom_histogram(aes(y = after_stat(density)),binwidth=500)

scale and guides

ggplot(mpg, aes(displ, hwy)) + 
  geom_point(aes(colour = class))