FigDraw 4. Plotting for SCI Papers: Scatter Plots

Source: WeChat public account 桓峰基因


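The previous two posts gave a brief, fairly rough introduction to the basics of R and then covered reshaping tables in R. Most plotting today has moved away from base graphics in favor of ggplot2, so this installment of the SCI plotting series starts with scatter plots.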

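Introduction

A scatter plot shows the relationship between two continuous variables, and it is particularly useful for examining how strongly the two variables are correlated.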
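As a numeric companion to the plot, the strength of a linear relationship can be checked with base R's cor() and cor.test(). The sketch below uses the built-in iris data and its two sepal columns purely for illustration; choosing Pearson's coefficient is an assumption, not something this post prescribes.

```r
# Pearson correlation between two continuous variables (built-in iris data)
r <- cor(iris$Sepal.Length, iris$Sepal.Width, method = "pearson")

# cor.test() reports the same coefficient with a confidence interval and p-value
ct <- cor.test(iris$Sepal.Length, iris$Sepal.Width)

print(r)
print(ct$conf.int)
```

A coefficient near 0 suggests little linear association overall, which is a good reason to look at the scatter plot by group as well.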

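Overview of basic parameters

Drawing a simple scatter plot involves adjusting a few parameters: point size, color, and shape, plus grouping. The most commonly used ones are introduced below.

Basic parameters

The basic parameters are size, color, and shape:

shape: the shape of the points

size: the size of the points

color: the color of the points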

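Grouped scatter plots

1 Convert the numeric grouping variable to a factor

2 Map the grouping variable to the color aesthetic (the mapping must be made inside aes())

3 Map the grouping variable to the shape aesthetic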
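The three steps above can be sketched as follows. The iris data used later already stores Species as a factor, so this example assumes the built-in mtcars data instead, where the number of cylinders (cyl) is numeric. The object names cars and p_grouped are illustrative.

```r
library(ggplot2)

# Step 1: convert the numeric grouping variable to a factor
cars <- mtcars
cars$cyl <- factor(cars$cyl)

# Steps 2 and 3: map the factor to color and shape inside aes()
p_grouped <- ggplot(cars, aes(x = wt, y = mpg, color = cyl, shape = cyl)) +
    geom_point(size = 3) +
    theme_bw()

print(p_grouped)
```

Without step 1, ggplot2 treats cyl as continuous: the color becomes a gradient and the shape mapping fails.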

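Mapping continuous variables to aesthetics such as color and size

1 Map a continuous variable to the color aesthetic

2 Map a continuous variable to the size aesthetic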
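Color and size accept continuous variables directly, but ggplot2 does not allow a continuous variable on the shape aesthetic. A minimal workaround, assuming that three equal-width bins are acceptable, is to discretize the variable with cut() first. The column name Petal.Bin and its labels are made up for this sketch.

```r
library(ggplot2)

iris_binned <- iris
# Discretize the continuous variable into three equal-width intervals
iris_binned$Petal.Bin <- cut(iris_binned$Petal.Width, breaks = 3,
    labels = c("narrow", "medium", "wide"))

# One continuous variable goes to color; the binned variable goes to shape
p_binned <- ggplot(iris_binned, aes(x = Sepal.Length, y = Sepal.Width,
    color = Petal.Length, shape = Petal.Bin)) +
    geom_point(size = 3) +
    theme_bw()

print(p_binned)
```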

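Handling overlapping points

With very large data sets, many points are drawn on top of each other. Semi-transparent points reduce the problem; the alpha argument controls point transparency.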

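Multi-panel layout

To draw each category of a variable in its own panel, add facet_wrap() with "+". Its first argument is the faceting variable preceded by "~", and nrow sets the number of rows in the panel layout.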

Worked examples

1. Loading the data

We use the classic iris data set, which the R help page describes as follows:

Description: This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

library(ggplot2)
data(iris)
str(iris)
## 'data.frame':	150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

2. geom_point{ggplot2}

a. Simple scatter plot

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() + theme_bw()

b. Changing the point shape with the shape argument

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(shape = 8) +
    theme_bw()

c. Changing the point size with the size argument

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(shape = 8,
    size = 6) + theme_bw()

d. Changing the point color with the color argument

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(shape = 8,
    size = 6, color = "red") + theme_bw()

e. geom_text: adding text labels

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(shape = 8,
    size = 6, color = "red") + geom_text(aes(label = Petal.Width), position = position_dodge(width = 0.5),
    size = 3) + theme_bw()

To keep the labels from overlapping, we can use the ggrepel package:

library(ggrepel)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(shape = 8,
    size = 6, color = "red") + geom_text_repel(aes(label = Petal.Width), size = 3) +
    theme_bw()
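If installing ggrepel is not an option, ggplot2 itself offers a simpler fallback: geom_text(check_overlap = TRUE) skips any label that would overlap a previously drawn one. This sketch only hides labels rather than moving them, so it is an alternative with different behavior, not a replacement for geom_text_repel(). The vjust offset is an arbitrary choice.

```r
library(ggplot2)

p_labels <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(color = "red") +
    # Labels that would overlap an earlier label are not drawn
    geom_text(aes(label = Petal.Width), check_overlap = TRUE, size = 3, vjust = -0.8) +
    theme_bw()

print(p_labels)
```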

f. Mapping the grouping variable to color (the mapping must be made inside aes())

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point(shape = 8,
    size = 6) + theme_bw()

g. Mapping the grouping variable to shape

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species)) + geom_point(size = 3) +
    theme_bw()

h. Mapping the grouping variable to both color and shape

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
    geom_point(size = 3) + scale_color_brewer(palette = "Accent") + scale_shape_manual(values = c(2,
    9, 16)) + theme_bw()

i. Mapping a continuous variable to color

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Petal.Width)) +
    geom_point(size = 3) + scale_color_gradient(low = "lightblue", high = "darkblue") +
    theme_bw()

j. Mapping a continuous variable to size (a typical bubble chart)

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(size = Petal.Width,
    fill = Petal.Length), shape = 21, colour = "black") + scale_fill_gradient(low = "#377EB8",
    high = "#E41A1C") + theme_bw()

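k. Handling overlapping points

With very large data sets, many points overlap. Semi-transparent points reduce the problem; the alpha argument controls point transparency.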

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(size = Petal.Width,
    fill = Petal.Length), shape = 21, colour = "black", alpha = 0.5) + scale_fill_gradient(low = "#377EB8",
    high = "#E41A1C") + theme_bw()

l. Multi-panel layout

Add facet_wrap(), whose first argument is the faceting variable preceded by "~"; nrow sets the number of rows in the panel layout.

ggplot(data = iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width, color = Species,
    shape = Species, size = Petal.Length), alpha = 0.5) + facet_wrap(~Species, nrow = 1) +
    theme_bw()

To facet by two variables, use facet_grid() and give the row and column variables separated by "~":

iris$Group = ifelse(iris$Sepal.Length > mean(iris$Sepal.Length), "High", "Low")
ggplot(data = iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
    facet_grid(Group ~ Species) + theme_bw()

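Points in a scatter plot often overlap, especially when the data have been rounded before plotting. Setting position = "jitter" breaks up this gridding by adding a small amount of random noise to each point. Because two points are very unlikely to receive the same amount of noise, they no longer pile up on top of each other.

ggplot(data = iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width, color = Species,
    size = Petal.Length, shape = Species), alpha = 0.5, position = "jitter") +
    theme_bw() + facet_grid(Group ~ Species)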

3. geom_jitter{ggplot2}

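ggplot2 also provides geom_jitter(), which scatters overlapping points randomly around their original positions. The width argument sets the amount of horizontal jitter: each point is shifted by up to width in either direction. Comparing plots drawn with different width values makes the difference easy to see: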

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_jitter(width = 0.5,
    size = 1) + theme_bw()
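To see how width changes the picture, and to make the random offsets reproducible, the jitter can also be specified through position_jitter(), which takes width, height, and seed arguments. The widths 0.05 and 0.5 and the seed 42 below are arbitrary values chosen for illustration.

```r
library(ggplot2)

p_base <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + theme_bw()

# Small horizontal jitter: points stay close to their original positions
p_narrow <- p_base + geom_point(size = 1,
    position = position_jitter(width = 0.05, height = 0, seed = 42))

# Large horizontal jitter: points move up to 0.5 units to either side
p_wide <- p_base + geom_point(size = 1,
    position = position_jitter(width = 0.5, height = 0, seed = 42))

print(p_narrow)
print(p_wide)
```

Fixing the seed means the figure looks the same every time the script is run, which helps when a plot has to be regenerated for a manuscript revision.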

4. geom_count{ggplot2}

When the coordinates take discrete values and many points overlap, geom_count() counts the observations at each location and maps that count to point size:

# Leave size unmapped so that geom_count() can map the per-location count to it
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_count(aes(color = Species)) +
    theme_bw()

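It works best together with scale_size_area(), which guarantees that a count of 0 would be mapped to a size of 0, so point area is proportional to the count. The points with the smallest counts then become very small, as the result shows: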

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_count() + scale_size_area() +
    theme_bw()
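To check what geom_count() is drawing, the same per-location counts can be computed by hand. The sketch below uses base R's aggregate(); the helper column n and the object names are illustrative.

```r
library(ggplot2)

# Count how many observations share each (Sepal.Length, Sepal.Width) pair
counts <- aggregate(n ~ Sepal.Length + Sepal.Width,
    data = transform(iris, n = 1), FUN = sum)

# Locations with the most overlapping points
print(head(counts[order(-counts$n), ]))

# geom_count() computes the same numbers and exposes them as the variable n
p_count <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_count() + scale_size_area() + theme_bw()
built <- layer_data(p_count)
print(head(built[, c("x", "y", "n")]))
```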

5. geom_dotplot{ggplot2}

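With geom_dotplot() the points are always drawn as dots and their shape cannot be changed. The resulting figure is called a dot plot, and it shows the distribution of the data by drawing one dot per observation. There are two ways to bin the dots: dot-density and histodot. With dot-density binning, the bin positions are determined by the data and binwidth, so they vary with the data, but no bin is wider than binwidth. With histodot binning, the bins have fixed positions and fixed widths, much like a histogram built from dots.

Binning is a statistical technique that divides a range of continuous values into intervals. Each interval is called a bin (or bucket), so each bin defines a numeric range and every value falls into the bin that covers it. When the dots are binned, their positions can be adjusted in several ways (position adjustments):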

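identity: no adjustment
dodge: shifts elements horizontally, leaving the vertical position unchanged
nudge: shifts elements by a fixed horizontal and vertical offset
jitter: adds random noise; useful when positions are discrete and there are relatively few points
jitterdodge: jitters and dodges at the same time
stack: stacks elements on top of each other
fill: stacks elements and rescales each stack to the same height; used for bar charts

Each position adjustment has a corresponding function named position_xxx().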

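When binning along the x axis and stacking along the y axis, the numbers on the y axis are not meaningful because of technical limitations of ggplot2. You can hide the y axis or manually scale it to match the number of dots.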

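The commonly used arguments of geom_dotplot() are:

mapping: the aesthetic mapping created with aes(); in the examples below, x is a factor and y is numeric;

data: the data frame;

position: the position adjustment; the default is identity, meaning no adjustment;

method: dotdensity (dot-density binning, the default) or histodot (fixed bin widths, like a histogram);

binwidth: the bin width, interpreted according to method. With dotdensity it is the maximum bin width; with histodot it is the fixed bin width. The default is 1/30 of the range of the data;

binaxis: the axis to bin along; the default is x;

stackdir: the stacking direction; valid values are up (the default), down, center, and centerwhole;

stackratio: how closely the dots are stacked; the default is 1, and smaller values stack the dots more tightly;

dotsize: the dot diameter relative to binwidth; the default is 1.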
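As a sketch of the two binning methods along the x axis, the example below draws the same variable with dotdensity and with histodot, and hides the meaningless y axis with scale_y_continuous(NULL, breaks = NULL). The binwidth of 0.1 is an arbitrary value chosen for the iris data, and the object names are illustrative.

```r
library(ggplot2)

# Dot-density binning: bin positions follow the data, at most binwidth wide
p_density <- ggplot(iris, aes(x = Sepal.Length)) +
    geom_dotplot(method = "dotdensity", binwidth = 0.1) +
    scale_y_continuous(NULL, breaks = NULL) +
    theme_bw()

# Histodot binning: fixed bin positions and widths, like a histogram of dots
p_histodot <- ggplot(iris, aes(x = Sepal.Length)) +
    geom_dotplot(method = "histodot", binwidth = 0.1) +
    scale_y_continuous(NULL, breaks = NULL) +
    theme_bw()

print(p_density)
print(p_histodot)
```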

a. Drawing a basic dot plot

ggplot(iris, aes(x = Species, y = Sepal.Width)) + geom_dotplot(binaxis = "y", stackdir = "center",
    stackratio = 1.5, dotsize = 1) + theme_bw()

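b. Adding the mean and median to the dot plot

# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
ggplot(iris, aes(x = Species, y = Sepal.Width)) + geom_dotplot(binaxis = "y", stackdir = "center",
    stackratio = 1.5, dotsize = 1) + theme_bw() + stat_summary(fun = mean, geom = "point",
    shape = 16, size = 3, color = "blue") + stat_summary(fun = median, geom = "point",
    shape = 18, size = 3, color = "red")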

c. Coloring the dots by group

# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) + geom_dotplot(binaxis = "y",
    stackdir = "center", stackratio = 1.5, dotsize = 1) + theme_bw() + stat_summary(fun = median,
    geom = "point", shape = 18, size = 3, color = "red") + theme(legend.position = "none")

