Basic Operations in R

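Why choose R?

Rich resources

R covers nearly every method used for data analysis across many industries.

Good extensibility

Writing functions and packages is straightforward, and R is cross-platform; it handles complex data analysis and produces polished graphics.

A complete help system

Every function has help in a uniform format, with runnable examples.

GNU software

R is free, and the source code of R itself and of its packages is open.

Features of R:

Statistical resources for many fields

The R website currently hosts about 4,000 packages covering basic statistics, sociology, economics, ecology, spatial analysis, phylogenetic analysis, bioinformatics, and many other areas.

Cross-platform

R runs on many operating systems, including Windows, macOS, and various Linux and UNIX systems.

Command-line driven

R is interpreted on the fly: enter a command and the result is returned immediately.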

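Part 1: R Data Structures

This part follows Chapter 1 of the companion notes; for fuller coverage, see Chapter 1, "R Data Structures", in 【Detailed Summary of R Knowledge Points】.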

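1. Define a vector holding the single value 4

In [3]:
x <- 4 # `=` also works for assignment
print(x)

2. Check the data type of vector x

In [4]:
typeof(x)

3. Test whether x is a vector

In [5]:
is.vector(x)

4. Define a vector with several elements: 88, 5, 12, 13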

[1] 4

'double'

TRUE

In [6]:
y <- c( 88 , 5 , 12 , 13 )
print(y)
print(typeof(y))
print(is.vector(y))

5. Create a vector containing 1 through 5

In [7]:
# Method 1: the c() function
x1 <- c(1, 2, 3, 4, 5)
print(x1)
# Method 2: the colon operator
x2 <- 1:5
print(x2)

6. Create a vector from 12 to 30 with a step of 3

In [8]:
seq(from = 12, to = 30, by = 3)

7. Create a vector from 1.1 to 2 with length 10

In [9]:
seq(from = 1.1, to = 2, length.out = 10)

8. Create a vector containing four 8s

[1] 88 5 12 13

[1] "double"
[1] TRUE

[1] 1 2 3 4 5

[1] 1 2 3 4 5

12 15 18 21 24 27 30

1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2

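In [10]:
# Method 1: the rep() function
rep(8, 4)
# Method 2: the c() function
c(8, 8, 8, 8)

9. Insert the element 168 into vector y at index 4

In [11]:
y <- c(y[1:3], 168, y[4])
print(y)

10. Insert several elements (56, 24, 35, 10, 5, 7) into vector y starting at index 4

In [12]:
y <- c(y[1:3], c(56, 24, 35, 10, 5, 7), y[4]) # only y[4] follows the new values, so the old 5th element (13) is dropped
print(y)

11. Get the length of vector y

In [13]:
length(y)

12. Compute the sum, difference, product, and quotient of c(1,2,4) and c(5,0,-1)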

8 8 8 8

8 8 8 8

[1] 88 5 12 168 13

[1] 88 5 12 56 24 35 10 5 7 168

10

In [14]:
c( 1 , 2 , 4 ) + c( 5 , 0 ,-1)
c( 1 , 2 , 4 ) - c( 5 , 0 ,-1)
c( 1 , 2 , 4 ) * c( 5 , 0 ,-1)
c( 1 , 2 , 4 ) / c( 5 , 0 ,-1)
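The four operations above work element by element, which is why dividing by the zero in the second position yields `Inf` rather than an error. A minimal sketch of that behaviour (the names `u`, `v`, and `quotient` are introduced here only for illustration):

```r
u <- c(1, 2, 4)
v <- c(5, 0, -1)

quotient <- u / v              # element-wise division; 2 / 0 becomes Inf
finite_flags <- is.finite(quotient)  # FALSE marks the division by zero

# A shorter operand is recycled to the length of the longer one
doubled <- u * 2
```

Checking `is.finite()` after a division is one simple way to catch zero denominators before they propagate into later statistics.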

13. Access the 2nd element of vector y

In [15]:
y[2]

14. Access the 2nd through 4th elements of vector y

In [16]:
y[2:4]

15. Change the 2nd through 4th elements of vector y to (8, 14, 67)

In [17]:
print(y)
y[2:4] = c(8, 14, 67)
print(y)

16. Access every element of vector y except the first 3

6 2 3

-4 2 5

5 0 -4

0.2 Inf -4

5

5 12 56

[1] 88 5 12 56 24 35 10 5 7 168

[1] 88 8 14 67 24 35 10 5 7 168

In [18]:
print(y)
print(y[-c(1:3)]) # or: b = 1:3; y[-b]
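Negative indices exclude positions instead of selecting them. The sketch below compares three common ways of subsetting; the vector `v` holds made-up values and is not the `y` from the exercises.

```r
v <- c(88, 8, 14, 67, 24)      # hypothetical values for illustration

dropped_first3 <- v[-(1:3)]    # negative indices exclude positions 1 to 3
same_result    <- tail(v, -3)  # tail() with a negative n drops the first 3
above_20       <- v[v > 20]    # logical indexing keeps matching elements
```

Note the parentheses in `-(1:3)`: without them, `-1:3` would be the sequence from -1 to 3, which mixes negative and positive indices and raises an error.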

17. Create a matrix from the following columns

  • X = c(1,1,1)

  • Y = c(2,2,2)

  • temp = c(14.7,18.5,25.9)

  • RH = c(66,73,41)

  • wind = c(2.7,8.5,3.6)

  • rain = c(0,0,0)

  • area = c(0,0,0)

  • rank = c(1,2,3)

In [19]:
X = c( 1 , 1 , 1 )
Y = c( 2 , 2 , 2 )
temp = c(14.7,18.5,25.9)
RH = c( 66 , 73 , 41 )
wind = c(2.7,8.5,3.6)
rain = c( 0 , 0 , 0 )
area = c( 0 , 0 , 0 )
rank = c( 1 , 2 , 3 )
ForeData = cbind(X,Y,temp,RH,wind,rain,area,rank)
print(ForeData)
print(is.matrix(ForeData)) # check whether it is a matrix

18. Given the vector c(1,2,3,11,12,13), create a matrix with 2 rows and 3 columns, with rows named (row1, row2) and columns named (C.1, C.2, C.3)

[1] 88 8 14 67 24 35 10 5 7 168

[1] 67 24 35 10 5 7 168

X Y temp RH wind rain area rank
[1,] 1 2 14.7 66 2.7 0 0 1
[2,] 1 2 18.5 73 8.5 0 0 2
[3,] 1 2 25.9 41 3.6 0 0 3
[1] TRUE

In [20]:
mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("row1", "row2"), c("C.1", "C.2", "C.3")))
print(mdat)

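19. Create an empty 2 x 2 matrix, then assign 1, 2, 3, 4 to its positions in column order

In [21]:
x = matrix(nrow = 2, ncol = 2) # note: do not write matrix(2, 3); positional arguments are data and nrow, not the dimensions
x[1, 1] = 1
x[2, 1] = 2
x[1, 2] = 3
x[2, 2] = 4
print(x)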
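Because R fills matrices column by column by default, the same 2 x 2 matrix can also be built in a single call. A minimal sketch (the name `x_direct` is introduced here for illustration):

```r
# matrix() fills column by column unless byrow = TRUE is given
x_direct <- matrix(1:4, nrow = 2, ncol = 2)

top_right   <- x_direct[1, 2]  # first row, second column
bottom_left <- x_direct[2, 1]  # second row, first column
```

Element-by-element assignment, as in the cell above, is mainly useful when the values become known one at a time.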

20. Rename the rows and columns of the matrix x created above: rows ('1', '2'), columns ('a', 'b')

In [22]:
colnames(x) = c('a', 'b')
rownames(x) = c('1', '2')
print(x)

21. Access the element in row 2, column 3 of the ForeData matrix

In [23]:
print(ForeData[2, 3])

22. Access rows 1 to 2 and columns 1 to 3 of the ForeData matrix

C.1 C.2 C.3

row1 1 2 3
row2 11 12 13

[,1] [,2]

[1,] 1 3

[2,] 2 4

a b
1 1 3
2 2 4
temp
18.5

In [24]:
print(ForeData[ 1 : 2 , 1 : 3 ])

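23. Access rows 1 to 2 and columns 1 and 3 of the ForeData matrix (note the difference from item 22)

In [25]:
print(ForeData[1:2, c(1, 3)])

24. Define a three-dimensional array with 4 rows, 5 columns, and 3 layers holding the values 1:60; name the rows c('R1','R2','R3','R4'), the columns c('C1','C2','C3','C4','C5'), and the layers c('T1','T2','T3')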

X Y temp
[1,] 1 2 14.7
[2,] 1 2 18.5
X temp
[1,] 1 14.7
[2,] 1 18.5

In [26]:
a = c( 1 : 60 )
dim1 = c('R1', 'R2', 'R3', 'R4')
dim2 = c('C1', 'C2', 'C3', 'C4', 'C5')
dim3 = c('T1', 'T2', 'T3')
f = array(a,c( 4 , 5 , 3 ),dimnames = list(dim1,dim2,dim3))
print(f)

25. Create a data frame from the columns given below

  • X = c(1,1,1)

  • Y = c(2,2,2)

  • temp = c(14.7,18.5,25.9)

  • RH = c(66,73,41)

  • wind = c(2.7,8.5,3.6)

  • rain = c(0,0,0)

  • area = c(0,0,0)

  • month = c('aug','aug','aug')

  • day = c('fri','fri','fri')

, , T1

C1 C2 C3 C4 C5

R1 1 5 9 13 17

R2 2 6 10 14 18

R3 3 7 11 15 19

R4 4 8 12 16 20

, , T2

C1 C2 C3 C4 C5

R1 21 25 29 33 37

R2 22 26 30 34 38

R3 23 27 31 35 39

R4 24 28 32 36 40

, , T3

C1 C2 C3 C4 C5

R1 41 45 49 53 57

R2 42 46 50 54 58

R3 43 47 51 55 59

R4 44 48 52 56 60

In [27]:
X = c( 1 , 1 , 1 )
Y = c( 2 , 2 , 2 )
temp = c(14.7,18.5,25.9)
RH = c( 66 , 73 , 41 )
wind = c(2.7,8.5,3.6)
rain = c( 0 , 0 , 0 )
area = c( 0 , 0 , 0 )
month = c('aug', 'aug', 'aug')
day = c('fri', 'fri', 'fri')
ForeDataFrm = data.frame(FX = X, FY = Y, Fmonth = month, Fday = day, Ftemp = temp, FRH = RH, Fwind = wind, Frain = rain, Farea = area)
print(ForeDataFrm)

26. List the column names of the ForeDataFrm data frame

In [28]:
names(ForeDataFrm)

27. Test whether ForeDataFrm is a data frame

In [29]:
is.data.frame(ForeDataFrm)

28. Access columns 1 and 3 of the ForeDataFrm data frame

FX FY Fmonth Fday Ftemp FRH Fwind Frain Farea
1 1 2 aug fri 14.7 66 2.7 0 0
2 1 2 aug fri 18.5 73 8.5 0 0
3 1 2 aug fri 25.9 41 3.6 0 0
'FX' 'FY' 'Fmonth' 'Fday' 'Ftemp' 'FRH' 'Fwind' 'Frain' 'Farea'

TRUE

In [30]:
# Method 1:
print(ForeDataFrm[, c(1, 3)])
# Method 2:
print(ForeDataFrm[, c('FX', 'Fmonth')])

29. Access the Fwind column of ForeDataFrm

In [31]:
# Method 1:
ForeDataFrm$Fwind
# Method 2:
ForeDataFrm[['Fwind']]
# Method 3:
ForeDataFrm[[7]]

30. Test whether a = 123.4 and b = '123.4' are numeric, integer, character, or logical

FX Fmonth
1 1 aug
2 1 aug
3 1 aug
FX Fmonth
1 1 aug
2 1 aug
3 1 aug

2.7 8.5 3.6

2.7 8.5 3.6

2.7 8.5 3.6

In [32]:
a <- 123.4
is.numeric(a)
is.integer(a)
is.character(a)
is.logical(a)

In [33]:
b <- "123.4"
is.numeric(b)
is.integer(b)
is.character(b)
is.logical(b)

31. Check the data types of a and b

In [34]:
typeof(a)
typeof(b)

32. Convert a to a character string and b to a floating-point number

TRUE

FALSE

FALSE

FALSE

FALSE

FALSE

TRUE

FALSE

'double'
'character'

In [35]:
a <- as.character(a)
b <- as.double(b)
typeof(a)
typeof(b)

33. Convert the vector e = c(1:10) to a matrix

In [36]:
e <- c( 1 : 10 )
f <- as.matrix(e)
print(f)

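Part 2: Importing Data

This part follows Chapter 2 of the companion notes; for fuller coverage, see Chapter 2, "Importing Data", in 【Detailed Summary of R Knowledge Points】. Only txt import is shown here; Chapter 2 of the companion notes covers import methods for more file formats.

34. Read ReportCard1.txt and ReportCard2.txt into data frames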

'character'
'double'

[,1]

[1,] 1

[2,] 2

[3,] 3

[4,] 4

[5,] 5

[6,] 6

[7,] 7

[8,] 8

[9,] 9

[10,] 10

In [37]:
ReportCard1 = read.table(file = '/home/mw/input/wlong6309/ReportCard1.txt', header = TRUE)
ReportCard2 = read.table(file = '/home/mw/input/wlong6309/ReportCard2.txt', header = TRUE)
names(ReportCard1)
names(ReportCard2)

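Part 3: Data Management in R

This part follows Chapter 3 of the companion notes; for fuller coverage, see Chapter 3, "Data Management in R", in 【Detailed Summary of R Knowledge Points】.

35. Merge ReportCard1 and ReportCard2 on the student ID field xh

In [38]:
ReportCard = merge(ReportCard1, ReportCard2, by = 'xh')
print(head(ReportCard))

36. Sort ReportCard by the math field in descending order

In [39]:
Ord = order(ReportCard$math, na.last = TRUE, decreasing = TRUE)
print(Ord) # Ord is a vector of row positions: row 1 has the highest math score; row 3 comes last because its math score is the lowest or missing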

'xh' 'sex' 'poli' 'chi' 'math'
'xh' 'fore' 'phy' 'che' 'geo' 'his'
xh sex poli chi math fore phy che geo his
1 92101 2 96 96 87.5 72 93 65 76.0 92
2 92102 1 94 97 86.5 61 93 64 79.5 95
3 92103 2 NA NA NA 66 98 79 89.0 81
4 92104 2 89 97 69.5 86 83 62 83.0 94
5 92105 1 82 85 79.5 60 88 66 72.5 98
6 92106 2 88 88 78.0 60 90 70 81.5 77

[1] 1 33 2 32 34 31 14 5 6 35 10 45 9 12 8 36 38 46 4 7 44 39 13 50 11

[26] 49 41 16 37 43 42 40 17 47 27 19 58 15 18 52 20 57 22 23 24 48 54 21 30 51

[51] 53 55 60 26 25 56 28 59 29 3

In [40]:

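# To reorder the rows of the data frame in this order
a = ReportCard[Ord,]
print(head(a))

37. Find the rows of ReportCard whose math field is missing

In [41]:
a = is.na(ReportCard$math)
print(ReportCard[a,])

38. Find the rows of ReportCard that contain missing values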

In [42]:
a = complete.cases(ReportCard)
print(ReportCard[!a,])
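Alongside listing the incomplete rows, it is often useful to count the missing values per column. A minimal sketch on a made-up data frame (`scores` and its values are hypothetical, not the ReportCard data):

```r
# Hypothetical toy data frame, used only for illustration
scores <- data.frame(
  id   = 1:4,
  math = c(87.5, NA, 69.5, 79.5),
  chi  = c(96, 97, NA, NA)
)

incomplete    <- !complete.cases(scores)   # TRUE for rows with at least one NA
na_per_column <- colSums(is.na(scores))    # number of NAs in each column

scores[incomplete, ]  # the rows that contain missing values
na_per_column         # per-column missing counts
```

`is.na()` on a data frame returns a logical matrix, so `colSums()` counts the `TRUE` entries column by column.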

39. Generate a missing-data report for the ReportCard data frame

xh sex poli chi math fore phy che geo his
1 92101 2 96 96 87.5 72 93 65 76.0 92
33 92204 2 88 81 87.5 60 84 63 79.0 92
2 92102 1 94 97 86.5 61 93 64 79.5 95
32 92203 2 74 93 84.5 50 89 72 82.5 92
34 92205 2 81 79 84.0 60 91 64 81.0 92
31 92202 1 78 89 83.5 81 91 77 81.0 93
xh sex poli chi math fore phy che geo his
3 92103 2 NA NA NA 66 98 79 89 81
xh sex poli chi math fore phy che geo his
3 92103 2 NA NA NA 66 98 79 89 81
27 92142 2 NaN 70 59 22 68 26 26 63

In [43]:
install.packages("mice")
library(mice)
Updating HTML index of packages in ‘.Library’
Making ‘packages.html’ … done
Warning message:
“As of rlang 0.4.0, dplyr must be at least version 0.8.0.

  • dplyr 0.7.8 is too old for rlang 0.4.11.
  • Please update dplyr with install.packages("dplyr") and restart R.”
    Attaching package: ‘mice’
    The following object is masked from ‘package:stats’:
    filter
    The following objects are masked from ‘package:base’:
    cbind, rbind

In [44]:
print(md.pattern(ReportCard))

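40. Compute the square root of the base-2 logarithm of 10, rounded to 3 decimal places

Note: for more statistical functions, see "Variable Computation" in Chapter 3, "Data Management in R", of 【Detailed Summary of R Knowledge Points】, which covers mathematical, statistical, probability, and string functions.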

xh sex fore phy che geo his chi math poli
58 1 1 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 0 0 0 3
0 0 0 0 0 0 0 1 1 2 4

In [45]:
round(sqrt(log( 10 , 2 )),digits= 3 )

41. Compute the mean, median, standard deviation, variance, maximum, and minimum of vector y

Note: for more functions, see "Statistical Functions" in Chapter 3 of 【Detailed Summary of R Knowledge Points】.

In [46]:
mean(y) # mean
median(y) # median
sd(y)
var(y)
max(y)
min(y)

42. Compute each student's total and average score from the course scores in ReportCard

In [47]:
attach(ReportCard) # attach so columns can be referenced by name
SumScore = poli + chi + math + fore + phy + che + geo + his
detach(ReportCard)
AvScore = SumScore / 8 # average score
ReportCard$sumScore = SumScore
ReportCard$avScore = AvScore
sum(is.na(ReportCard$sumScore)) # number of observations whose total is missing
mean(complete.cases(ReportCard)) # proportion of complete observations

1.

42.

19

52.

2727.

168

5

2

0.

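43. Compute the sum, cumulative sum, and product of vector y

In [48]:
sum(y)
cumsum(y)
prod(y)

44. Using the math values in ReportCard together with their mean and standard deviation, compute the normal density values

Note: for more functions, see "Probability Functions" in Chapter 3 of 【Detailed Summary of R Knowledge Points】.

In [49]:
a = is.na(ReportCard$math)
math = ReportCard$math
math = math[!a]
dnorm(math, mean(math), sd(math))

45. "You like R. So do I": remove "So do I", replace spaces with _, and convert all letters to upper case

Note: for more functions, see "String Functions" in Chapter 3 of 【Detailed Summary of R Knowledge Points】.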

426

88 96 110 177 201 236 246 251 258 426

32616105984000

0.00575809401592225 0.00645108245058991 0.0227150564107415 0.

0.0141904617776079 0.0227150564107415 0.0213938476169057 0.

0.01577914953468 0.0251373901050085 0.019951012746438 0.

0.0116313829477201 0.0243359978745361 0.0261470803551543 0.

0.0232497215340507 0.024950643098031 1 0.0219931442165863 0.

0.0201 10646396652 0.0175429182127999 0.0170139044800109 0.

0.0112979964667856 0.0261887891036519 0.00521672261070003 0.

0.0148879840945352 0.00883547768683561 0.007991 13070263313

0.00575809401592225 0.00840732757664685 0.0147175850040436 0.

0.0262670193933022 0.0218495019576 0.0231224289399321 0.

0.0256209012015816 0.0264212263047745 0.0263585015720458 0.

0.01577914953468 0.0222903309124429 0.0262996025645777 0.

0.0253919543357831 0.0248580221981635 0.0143598431926238 0.

0.0143598431926238 0.0154185816426542 0.0143598431926238 0.008543917573531 1

0.0215421231399613 0.0246549590986037 0.00521672261070003 0.

In [50]:
str = "You like R. So do I"
str_1 = strsplit(str, 'S')[[1]] # strsplit returns a list; [[1]] extracts the vector of pieces
str_2 = sub(' ', '_', sub(' ', '_', str_1[1])) # nested because sub() replaces only the first match
str_3 = toupper(str_2)
print(str_3)
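Since `sub()` replaces only the first match, `gsub()` is the usual choice when every space should be replaced. The sketch below is an alternative under that assumption; it also trims the trailing space with `trimws()`, so its result differs slightly from the cell above. The names `s`, `head_part`, and `result` are introduced here for illustration.

```r
s <- "You like R. So do I"

# Remove the literal phrase, then drop the space left at the end
head_part <- trimws(sub("So do I", "", s, fixed = TRUE))

# gsub() replaces every space, not just the first one
result <- toupper(gsub(" ", "_", head_part, fixed = TRUE))
```

Using `fixed = TRUE` treats the pattern as plain text, which avoids surprises from regular-expression metacharacters such as the dot.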

Note: items 46 to 50 cover matrix operations; for more, see "Matrix Operations" in Chapter 3, "Data Management in R", of 【Detailed Summary of R Knowledge Points】.

46. Generate a 4 x 4 identity matrix

In [51]:
print(diag( 4 ))

47. Generate two 2 x 2 matrices m and n, one filled with 1s and the other with 2s

In [52]:
m = matrix( 1 , nrow= 2 , ncol= 2 )
n = matrix( 2 , nrow= 2 , ncol= 2 )
print(m)
print(n)

48. Compute the product mn of matrices m and n, and print its main diagonal

[1] "YOU_LIKE_R. "

[,1] [,2] [,3] [,4]

[1,] 1 0 0 0

[2,] 0 1 0 0

[3,] 0 0 1 0

[4,] 0 0 0 1

[,1] [,2]

[1,] 1 1

[2,] 1 1

[,1] [,2]

[1,] 2 2

[2,] 2 2

In [53]:
mn = m %*% n
print(mn)

In [54]:
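print(diag(mn)) # print the main diagonal

49. Fill a 3 x 3 matrix mm with 1:9 in row-major order, then compute its transpose

In [55]:
mm = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
print(mm)
print('转置后的矩阵:')
print(t(mm))

50. Compute the eigenvalues and eigenvectors of the matrix mm from the previous step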

In [56]:
eigen(mm)
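One way to sanity-check an eigendecomposition is to confirm that each pair satisfies A v = λ v. A minimal sketch for the first pair, rebuilding the same 3 x 3 matrix under the illustrative name `A`:

```r
A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
e <- eigen(A)

lambda1 <- e$values[1]     # first eigenvalue as returned by eigen()
v1      <- e$vectors[, 1]  # matching eigenvector (a column of $vectors)

# A %*% v1 and lambda1 * v1 should agree up to rounding error
residual     <- A %*% v1 - lambda1 * v1
max_residual <- max(abs(residual))
```

A third eigenvalue on the order of 1e-15, as in the printed result, is numerically zero, which reflects that this matrix is singular.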

[,1] [,2]

[1,] 4 4

[2,] 4 4

[1] 4 4

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

[3,] 7 8 9

[1] "转置后的矩阵:"

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

eigen() decomposition
$values
[1] 1.611684e+01 -1.116844e+00 -1.303678e-15
$vectors
[,1] [,2] [,3]
[1,] -0.2319707 -0.78583024 0.4082483
[2,] -0.5253221 -0.08675134 -0.8164966
[3,] -0.8186735 0.61232756 0.4082483

51. Convert each student's average score in ReportCard into one of five grades: A, B, C, D, E

A: 90 or above;

B: at least 80 and below 90;

C: at least 70 and below 80;

D: at least 60 and below 70;

E: below 60.

In [57]:
attach(ReportCard) # attach so columns can be referenced by name
SumScore = poli + chi + math + fore + phy + che + geo + his
detach(ReportCard)
AvScore = SumScore / 8 # average score
ReportCard$sumScore = SumScore
ReportCard$avScore = AvScore
# Use within() and logical operators to split the averages into 5 grades
ReportCard = within(ReportCard, {
avScore[avScore >= 90] = 'A'
avScore[avScore >= 80 & avScore < 90] = 'B'
avScore[avScore >= 70 & avScore < 80] = 'C'
avScore[avScore >= 60 & avScore < 70] = 'D'
avScore[avScore < 60] = 'E'
})
# Use %in% to find entries that are not valid grades
flag = ReportCard$avScore %in% c('A', 'B', 'C', 'D', 'E')
# Use flag to set the invalid entries to missing
ReportCard$avScore[!flag] = NA
# Print the average-score grades
print(ReportCard$avScore)
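Assigning a letter into a numeric column can coerce the whole column to character, after which the remaining comparisons are made on strings. As an alternative sketch, `cut()` bins the numeric values in one step; the vector `avg` below holds made-up averages, not the ReportCard data.

```r
avg <- c(95, 84.7, 72.3, 60, 59.9, NA)  # hypothetical average scores

# right = FALSE makes each interval closed on the left: [60, 70), [70, 80), ...
grade <- cut(avg,
             breaks = c(-Inf, 60, 70, 80, 90, Inf),
             labels = c("E", "D", "C", "B", "A"),
             right  = FALSE)

grade_chr <- as.character(grade)  # plain character vector; NA stays NA
```

Missing averages remain `NA` automatically, so no separate `%in%` clean-up pass is needed with this approach.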

52. Replace the sex values 1 and 2 in ReportCard with 'M' and 'F'

'M' denotes male;

'F' denotes female;

The following object is masked _by_ .GlobalEnv:
math
Warning message in poli + chi + math:
“longer object length is not a multiple of shorter object length”
[1] "B" "B" NA "B" "C" "C" "C" "C" "C" "C" "C" "C" "C" "D" "C" "D" "D" "C" "D"
[20] "D" "D" "D" "D" "D" "E" "E" NA "E" "E" "E" "B" "B" "C" "C" "C" "C" "C" "C"
[39] "D" "C" "C" "C" "C" "D" "D" "D" "D" "C" "D" "D" "D" "D" "D" "D" "D" "D" "D"
[58] "E" "E" "E"

In [58]:
ReportCard$sex = factor(ReportCard$sex, levels = c(1, 2), labels = c("M", "F"))
str(ReportCard$sex)

In [59]:
print(head(ReportCard))

52. From ReportCard, select the records with sex M and average grade E

In [60]:
MaleScore = subset(ReportCard, ReportCard$sex == 'M' & ReportCard$avScore == 'E')
print(MaleScore)

53. Randomly draw the records of 10 students from ReportCard

Factor w/ 2 levels "M","F": 2 1 2 2 1 2 2 1 1 2 ...
xh sex poli chi math fore phy che geo his sumScore avScore
1 92101 F 96 96 87.5 72 93 65 76.0 92 677.5 B
2 92102 M 94 97 86.5 61 93 64 79.5 95 670.0 B
3 92103 F NA NA NA 66 98 79 89.0 81 NA 
4 92104 F 89 97 69.5 86 83 62 83.0 94 673.5 B
5 92105 M 82 85 79.5 60 88 66 72.5 98 629.5 C
6 92106 F 88 88 78.0 60 90 70 81.5 77 624.0 C
xh sex poli chi math fore phy che geo his sumScore avScore
28 92144 M 59 79.0 34.0 34 57 37 37 76 409.5 E
29 92145 M 74 84.5 30.5 33 64 34 34 71 439.5 E
30 92146 M 61 69.0 45.0 20 49 32 32 51 397.5 E
58 92234 M 66 79.0 55.5 57 52 57 41 65 451.0 E
59 92236 M 79 76.0 34.0 28 63 36 36 52 414.0 E

In [61]:
xh = sample(ReportCard$xh, size = 10, replace = FALSE)
sample_s = ReportCard[ReportCard$xh %in% xh,]
print(sample_s)

54. Print the multiples of 6 up to 50, using a repeat loop and a for loop in turn

In [62]:
# repeat loop
i = 6
repeat{ if(i > 50) break else {print(i); i = i + 6}}

In [63]:
# for loop
for(i in seq(from = 6, to = 50, by = 6))
print(i)

Part 4: Basic Data Analysis in R

xh sex poli chi math fore phy che geo his sumScore avScore
1 92101 F 96 96 87.5 72 93 65 76.0 92 677.5 B
5 92105 M 82 85 79.5 60 88 66 72.5 98 629.5 C
7 92108 F 84 90 69.5 50 80 60 86.5 94 615.5 C
27 92142 F NaN 70 59.0 22 68 26 26.0 63 NaN 
30 92146 M 61 69 45.0 20 49 32 32.0 51 397.5 E
39 92211 F 71 73 69.0 42 95 61 76.5 76 556.0 D
41 92213 M 82 76 65.0 60 75 60 78.0 76 569.0 C
46 92218 M 87 72 70.0 65 72 49 62.0 68 534.5 D
56 92231 F 83 84 38.5 60 76 46 65.5 49 515.0 D
58 92234 M 66 79 55.5 57 52 57 41.0 65 451.0 E

[1] 6

[1] 12

[1] 18

[1] 24

[1] 30

[1] 36

[1] 42

[1] 48

[1] 6

[1] 12

[1] 18

[1] 24

[1] 30

[1] 36

[1] 42

[1] 48

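This part follows Chapter 4 of the companion notes; for fuller coverage, see Chapter 4, "Basic Data Analysis in R", in 【Detailed Summary of R Knowledge Points】.

55. Compute basic descriptive statistics for all fields in ReportCard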

In [64]:
summary(ReportCard)

56. Compute the mean and standard deviation of the scores for every course in ReportCard

In [65]:
Av.Course = sapply(ReportCard[, 3:10], FUN = mean, na.rm = TRUE) # means
Sd.Course = sapply(ReportCard[, 3:10], FUN = sd, na.rm = TRUE) # standard deviations
print(Av.Course)
print(Sd.Course)

57. Compute the average and total score for each course in ReportCard

xh sex poli chi math
Min. :92101 M:30 Min. :40.00 Min. :63.00 Min. :30.50
1st Qu.:92122 F:30 1st Qu.:74.50 1st Qu.:77.00 1st Qu.:47.25
Median :92174 Median :82.50 Median :84.00 Median :62.50
Mean :92170 Mean :79.64 Mean :83.28 Mean :61.17
3rd Qu.:92217 3rd Qu.:87.00 3rd Qu.:90.00 3rd Qu.:70.75
Max. :92239 Max. :96.00 Max. :97.00 Max. :87.50
NA's :2 NA's :1 NA's :1
fore phy che geo
Min. :20.00 Min. :49.00 Min. :26.00 Min. :26.00
1st Qu.:40.75 1st Qu.:67.75 1st Qu.:45.50 1st Qu.:57.75
Median :50.00 Median :76.50 Median :55.00 Median :66.00
Mean :49.92 Mean :75.20 Mean :54.08 Mean :65.24
3rd Qu.:60.00 3rd Qu.:83.25 3rd Qu.:62.25 3rd Qu.:78.00
Max. :86.00 Max. :98.00 Max. :83.00 Max. :89.00
his sumScore avScore
Min. :49.00 Min. :372.5 Length:60
1st Qu.:71.75 1st Qu.:510.0 Class :character
Median :79.50 Median :554.0 Mode :character
Mean :78.68 Mean :548.7
3rd Qu.:91.00 3rd Qu.:589.2
Max. :98.00 Max. :677.5
NA's :2
poli chi math fore phy che geo his
79.63793 83.27966 61.16949 49.91667 75.20000 54.08333 65.24167 78.68333
poli chi math fore phy che geo his
10.575872 8.127365 15.076417 14.018501 12.351902 12.315474 15.394389 12.735233

In [66]:
Av.Course = colMeans(ReportCard[, 3:10], na.rm = TRUE) # average per course
Sums.Course = colSums(ReportCard[, 3:10], na.rm = TRUE) # total per course
print(Av.Course)
print(Sums.Course)

58. Compute each student's average and total score across all courses in ReportCard

In [67]:
Av.Person = rowMeans(ReportCard[, 3:10], na.rm = TRUE)
Sum.Person = rowSums(ReportCard[, 3:10], na.rm = TRUE)
print(Av.Person)
print(Sum.Person)

59. Compute the mean score in each course for female students in ReportCard

In [68]:
# extract the records of female students
FeMaleCard = subset(ReportCard, ReportCard$sex == "F")
# mean score per course for female students
Des.FeMale = sapply(FeMaleCard[3:10], FUN = mean, na.rm = TRUE)
print(Des.FeMale)

60. Compute basic descriptive statistics of the politics exam scores by sex

poli chi math fore phy che geo his
79.63793 83.27966 61.16949 49.91667 75.20000 54.08333 65.24167 78.68333
poli chi math fore phy che geo his
4619.0 4913.5 3609.0 2995.0 4512.0 3245.0 3914.5 4721.0

[1] 84.68750 83.75000 82.60000 82.93750 78.87500 79.06250 76.75000 79.00000

[9] 71.75000 75.75000 72.87500 72.31250 72.81250 70.06250 71.18750 68.56250

[17] 67.25000 71.68750 67.68750 70.00000 68.06250 67.12500 63.00000 62.87500

[25] 56.25000 56.68750 47.71429 51.62500 53.12500 44.87500 84.18750 79.62500

[33] 79.31250 79.00000 71.81250 74.50000 73.75000 72.75000 70.43750 71.81250

[41] 71.50000 70.81250 69.43750 68.50000 67.43750 68.12500 67.62500 69.06250

[49] 64.25000 63.93750 66.62500 64.68750 63.81250 62.75000 62.81250 62.75000

[57] 60.12500 59.06250 50.50000 41.12500

[1] 677.5 670.0 413.0 663.5 631.0 632.5 614.0 632.0 574.0 606.0 583.0 578.5

[13] 582.5 560.5 569.5 548.5 538.0 573.5 541.5 560.0 544.5 537.0 504.0 503.0

[25] 450.0 453.5 334.0 413.0 425.0 359.0 673.5 637.0 634.5 632.0 574.5 596.0

[37] 590.0 582.0 563.5 574.5 572.0 566.5 555.5 548.0 539.5 545.0 541.0 552.5

[49] 514.0 511.5 533.0 517.5 510.5 502.0 502.5 502.0 481.0 472.5 404.0 329.0

poli chi math fore phy che geo his
80.46429 83.05172 62.34483 48.63333 77.66667 55.80000 67.95000 78.43333

In [69]:
Des.Gender = tapply(ReportCard$poli, INDEX = ReportCard$sex, FUN = summary, na.rm = TRUE)
print(Des.Gender)

61. Compute the simple correlation matrix of the math, physics, and chemistry scores

In [70]:
Tmp = ReportCard[complete.cases(ReportCard),]
CorMatrix = cor(Tmp[, c(5, 7, 8)], use = "everything", method = "pearson")
print(CorMatrix)

62. Test the simple correlation coefficient between the math and physics scores

In [71]:
Tmp = ReportCard[complete.cases(ReportCard),]
cor.test(Tmp[, 5], Tmp[, 7], alternative = "two.sided", method = "pearson")

63. Based on the contingency table of sex and average grade, analyse whether the two variables are independent

$M

Min. 1st Qu. Median Mean 3rd Qu. Max.
56.00 73.25 82.00 78.87 86.75 94.00
$F
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
40.00 76.00 83.00 80.46 88.00 96.00 2
math phy che
math 1.0000000 0.7535317 0.7171637
phy 0.7535317 1.0000000 0.6207730
che 0.7171637 0.6207730 1.0000000
Pearson's product-moment correlation
data: Tmp[, 5] and Tmp[, 7]
t = 8.5775, df = 56, p-value = 8.753e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6149204 0.8469769
sample estimates:
cor
0.7535317

In [72]:
CrossTable = table(ReportCard[,c( 2 , 12 )])
chisq.test(CrossTable)

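Part 5: Data Visualization in R

This part follows Chapter 5 of the companion notes; for fuller coverage, see Chapter 5, "Data Visualization in R", in 【Detailed Summary of R Knowledge Points】.

64. Read ForestData.txt into the Forest data frame

In [4]:
Forest = read.table(file = '/home/mw/input/wlong6309/ForestData.txt', header = TRUE)
print(head(Forest))

65. Draw a stem-and-leaf plot of the temp field in Forest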

Warning message in chisq.test(CrossTable):
“Chi-squared approximation may be incorrect”
Pearson's Chi-squared test
data: CrossTable
X-squared = 0.67532, df = 3, p-value = 0.879
X Y month day temp RH wind rain area
1 1 2 aug fri 14.7 66 2.7 0 0
2 1 2 aug fri 18.5 73 8.5 0 0
3 1 2 aug fri 25.9 41 3.6 0 0
4 1 2 aug sat 25.9 32 3.1 0 0
5 1 2 aug sun 19.5 39 6.3 0 0
6 1 2 aug sun 17.9 44 2.2 0 0

In [74]:
stem(Forest$temp)

66. Using the temp and month fields in Forest, draw boxplots of temperature by month

The decimal point is at the |
2 | 2
4 | 26666668111112333588
6 | 755
8 | 022337889038
10 | 1112334566690002223345556667888
12 | 223444677899123344777888899
14 | 012222334456677778911222222444444455667788999999
16 | 011222234446666677888888900001111222333444444446666677777888888999
18 | 00001222222334444556666777888999999000111111222233333344444556666666
20 | 11111122233334444445566666667777778888889900011112222333344445555666+3
22 | 11112223344566778888899990001223333344444455677778889999
24 | 0111112222233333566668889901333445679999
26 | 122344444788899234556788899
28 | 002336779236
30 | 2226880
32 | 344613

In [75]:
Forest$month = factor(Forest$month, levels = c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"))
boxplot(temp~month, data = Forest, main = "Boxplot of monthly temperature in the forest region")

67. Draw a histogram of the temp field in Forest

In [76]:
hist(Forest$temp, xlab = "Forest region temperature", ylab = "Frequency", main = "Histogram of forest region temperature", cex.

68. Draw a bar chart of the number of students in each average grade (ABCDE) in ReportCard

In [77]:
NumGrade = tapply(ReportCard$avScore, INDEX = ReportCard$avScore, FUN = length)
barplot(NumGrade, xlab = "Average grade", ylab = "Number of students", ylim = c(0, 25))

69. Draw a pie chart of the number of students in each average grade (ABCDE) in ReportCard

In [78]:
Pct = round(NumGrade/length(ReportCard$avScore) * 100, 2)
GLabs = paste(c("B","C","D","E"), Pct, "%", sep = "")
pie(NumGrade, labels = GLabs, cex = 0.8, main = "Pie chart of average grades", cex.main = 0.8)

70. For the temperature (temp) and humidity (RH) fields in Forest, draw a scatter plot with temp on the x axis and RH on the y axis

In [79]:
plot(Forest$temp, Forest$RH, main = "Scatter plot of temperature and relative humidity in the forest region", xlab = "Temperature", ylab

71. For the Forest data, draw the simple scatter plot of temperature against relative humidity, then add fitted regression lines (linear and loess) to it

In [80]:
plot(Forest$temp, Forest$RH, main = "Scatter plot of temperature and relative humidity in the forest region", xlab = "Temperature", ylab
M0 = lm(RH~temp, data = Forest)
abline(M0$coefficients)
M.Loess = loess(RH~temp, data = Forest)
Ord = order(Forest$temp)
lines(Forest$temp[Ord], M.Loess$fitted[Ord], lwd = 1, lty = 1, col = 2)

72. For the Forest data, draw a three-dimensional scatter plot of temperature, relative humidity, and wind

In [81]:
install.packages("scatterplot3d")
library("scatterplot3d")
with(Forest, scatterplot3d(temp, RH, wind, main = "3D scatter plot of temperature, relative humidity, and wind in the forest region"

73. For the ReportCard data, draw a correlogram of the students' scores across courses

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done

In [82]:
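install.packages("corrgram")
library("corrgram")
corrgram(ReportCard[, 3:10], lower.panel = panel.shade, upper.panel = panel.pie, text.p

Part 6: Statistical Analysis in R

This part follows Chapter 6 of the companion notes; for fuller coverage, see Chapter 6, "Statistical Analysis in R", in 【Detailed Summary of R Knowledge Points】.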

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done

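74. The table below gives the scores of 28 students in a course. Can 80 be taken as the 3/4 quantile of the scores? Significance level = 0.01

In [83]:
# Sign test (a nonparametric test)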

x <- c( 95 , 89 , 68 , 90 , 88 , 60 , 81 , 67 , 60 , 60 , 60 , 63 , 60 , 92 , 60 , 88 , 88 , 87 , 60 , 73 , 60 , 97 , 91 , 60
binom.test(min(sum(x> 80 ),sum(x< 80 )),sum(x!= 80 ), 0.75)

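75. A website collected the number of spam emails received per day in the inboxes of the CEOs of 19 large companies, giving the data below (unit: emails). Does the central location of the spam count exceed 320?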

310 350 370 377 389 400 415 425 440 295 325 296 250 340 298 365 375 360 385

In [84]:

# Wilcoxon signed-rank test
spamail <- c(310, 350, 370, 377, 380, 400, 415, 425, 440, 295, 325, 296, 250, 340, 298, 365, 37
wilcox.test(spamail, 320, alt = 'great', conf.int = TRUE)
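A bare second argument to `wilcox.test` is taken as a second sample `y`, which gives the two-sample rank sum test; the printed header "Wilcoxon rank sum test" is consistent with that. A minimal sketch of the one-sample signed-rank form, assuming the hypothesised centre is passed as `mu` and using the 19 values as listed in the problem statement (the names `spam` and `res` are introduced here for illustration):

```r
# Values as listed in the problem statement for item 75
spam <- c(310, 350, 370, 377, 389, 400, 415, 425, 440, 295,
          325, 296, 250, 340, 298, 365, 375, 360, 385)

# mu = 320 makes this a one-sample test of the centre against 320
res <- wilcox.test(spam, mu = 320, alternative = "greater", conf.int = TRUE)

res$method     # names the test that was actually run
res$statistic  # V, the sum of ranks of the positive differences
res$p.value
```

Inspecting `res$method` is a quick way to confirm which variant of the test a given call performed.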

Exact binomial test
data: min(sum(x > 80), sum(x < 80)) and sum(x != 80)
number of successes = 13, number of trials = 28, p-value = 0.001436
alternative hypothesis: true probability of success is not equal to 0.75
95 percent confidence interval:
0.2751086 0.6613009
sample estimates:
probability of success
0.4642857
Wilcoxon rank sum test
data: spamail and 320
W = 14, p-value = 0.3
alternative hypothesis: true location shift is greater than 0
95 percent confidence interval:
-70 Inf
sample estimates:
difference in location
45

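76. The blood lead levels of 10 workers not exposed to lead and 7 workers exposed to lead were measured as shown in the table below. Use the Wilcoxon rank sum test to analyse whether the two groups differ.

In [85]:
# Wilcoxon rank sum test
x <- c(24, 26, 29, 34, 43, 58, 63, 72, 87, 101)
y <- c(82, 87, 97, 121, 164, 208, 213)
# without continuity correction
wilcox.test(x, y, alternative = "less", exact = FALSE, correct = FALSE)

77. To study the relationship between blood type and liver disease, 295 liver disease patients and 638 people without liver disease (the control group) were surveyed by blood type, as shown in the table below. Is there an association between blood type and liver disease?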

Wilcoxon rank sum test
data: x and y
W = 4.5, p-value = 0.001449
alternative hypothesis: true location shift is less than 0

In [86]:

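# Chi-squared test of independence
x <- c(98, 67, 13, 18, 38, 41, 8, 12, 289, 262, 57, 30)
dim(x) <- c(4, 3)
chisq.test(x)

78. To understand the effect of a treatment, efficacy data for drugs A and B were collected into the two-way contingency table below. Test whether drug and efficacy are independent.

In [87]:
# Fisher's exact test
medicine <- matrix(c(8, 7, 2, 23), 2, 2)
fisher.test(medicine)

79. To study how effective 4 different drugs are against cough in children, 25 patients of similar constitution were randomly divided into 4 groups and treated with drugs A, B, C, and D respectively. The number of coughs per day for each patient, measured after 5 days, is shown in the table below. Are the 4 drugs equally effective?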

Pearson's Chi-squared test
data: x
X-squared = 15.073, df = 6, p-value = 0.01969
Fisher's Exact Test for Count Data
data: medicine
p-value = 0.002429
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.856547 143.340082
sample estimates:
odds ratio
12.12648

In [88]:

# Location inference for several groups: Kruskal-Wallis test

drug <- c( 80 , 203 , 236 , 252 , 284 , 368 , 457 , 393 , 133 , 180 , 100 , 160 , 156 , 295 , 320 , 448 , 465 , 48
gr.drug<-c( 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 , 3 , 3 , 3 , 3 , 3 , 3 , 3 , 4 , 4 , 4 , 4 , 4 , 4 )
kruskal.test(drug,gr.drug)

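80. Four chefs from regions A, B, C, and D each prepared Beijing-style boiled fish. To compare whether the quality is the same, four food judges scored the dishes as shown in the table below. Test whether the quality of the dish differs among the 4 regions.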

Kruskal-Wallis rank sum test
data: drug and gr.drug
Kruskal-Wallis chi-squared = 8.0721, df = 3, p-value = 0.04455

In [89]:

# Location inference for several groups: Friedman test

beijingfish <- c( 85 , 82 , 82 , 79 , 87 , 75 , 86 , 82 , 90 , 81 , 80 , 76 , 80 , 75 , 81 , 75 )
treat.BF <- c( 1 , 2 , 3 , 4 , 1 , 2 , 3 , 4 , 1 , 2 , 3 , 4 , 1 , 2 , 3 , 4 )
block.BF <- c( 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 , 3 , 3 , 3 , 3 , 4 , 4 , 4 , 4 )
friedman.test(beijingfish,treat.BF,block.BF)

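81. First-year final English exam scores of some students at a university were collected and compared with their college entrance exam English scores. The results for 12 students are shown in the table below. Test whether the students' secondary school results are correlated with their university results.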

Friedman rank sum test
data: beijingfish, treat.BF and block.BF
Friedman chi-squared = 8.1316, df = 3, p-value = 0.04337

In [90]:
x <- c( 65 , 79 , 67 , 66 , 89 , 85 , 84 , 73 , 88 , 80 , 86 , 75 )
y <- c( 62 , 66 , 50 , 68 , 88 , 86 , 64 , 62 , 92 , 64 , 81 , 80 )
cor.test(x, y) # Pearson correlation test
cor.test(x, y, meth = 'spearman') # Spearman correlation
cor.test(x, y, meth = 'kendall') # Kendall correlation

82. Given the two vectors x and y below, carry out a simple linear regression analysis

x:318,910,200,409,425,502,314,1210,1022,1225
y:524,1019,638,815,913,928,605,1516,1219,1624
Pearson's product-moment correlation
data: x and y
t = 3.4403, df = 10, p-value = 0.006328
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2811026 0.9209916
sample estimates:
cor
0.7362315
Warning message in cor.test.default(x, y, meth = "spearman"):
“Cannot compute exact p-value with ties”
Spearman's rank correlation rho
data: x and y
S = 65.227, p-value = 0.003265
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7719346
Warning message in cor.test.default(x, y, meth = "kendall"):
“Cannot compute exact p-value with ties”
Kendall's rank correlation tau
data: x and y
z = 2.6181, p-value = 0.008842
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.5846846

In [91]:
x<-c( 318 , 910 , 200 , 409 , 425 , 502 , 314 , 1210 , 1022 , 1225 )
y<-c( 524 , 1019 , 638 , 815 , 913 , 928 , 605 , 1516 , 1219 , 1624 )
plot(x,y)
lm.reg<-lm(y~ 1 +x)
summary(lm.reg)
op=par(mfrow=c( 2 , 2 ))
plot(lm.reg) # produces four plots: 1 Residuals vs Fitted; 2 Normal Q-Q; 3 Scale-Location; 4 Residuals vs Leverage
par(op)

Call:
lm(formula = y ~ 1 + x)
Residuals:
Min 1Q Median 3Q Max
-191.52 -86.63 45.26 79.32 138.17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 393.0431 79.6510 4.935 0.00114 **
x 0.8983 0.1057 8.498 2.82e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 125.4 on 8 degrees of freedom
Multiple R-squared: 0.9003, Adjusted R-squared: 0.8878
F-statistic: 72.21 on 1 and 8 DF, p-value: 2.821e-05

In [92]:

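# Predicted value and prediction interval
point <- data.frame(x = 425)
lm.pred <- predict(lm.reg, point, interval = 'prediction', level = 0.95)
print(lm.pred)

Part 7: Miscellaneous Review Exercises

83. Compute the sum from 1 to 100 using at least 3 different methods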

fit lwr upr
1 774.8322 466.5557 1083.109

In [93]:
# Sum from 1 to 100
# Method 1: for loop
sum1 = 0
for(i in seq(from = 1, to = 100, by = 1)) sum1 = sum1 + i
print(sum1)
# Method 2: repeat loop
i = 0
sum2 = 0
repeat{if(i > 100) break else {sum2 = sum2 + i; i = i + 1}}
print(sum2)
# Method 3: while loop
sum3 = 0
i = 0
while(i <= 100){ sum3 = sum3 + i; i = i + 1}
print(sum3)
# Method 4: the sum() function
print(sum(c(1:100)))

84. Compute the sum of squares from 1 squared to 100 squared using at least 3 methods

In [94]:
# Sum of squares from 1^2 to 100^2
# Method 1: for loop
sum4 = 0
for(i in seq(from = 1, to = 100, by = 1)) sum4 = sum4 + i^2
print(sum4)
# Method 2: repeat loop
i = 0
sum5 = 0
repeat{if(i > 100) break else {sum5 = sum5 + i^2; i = i + 1}}
print(sum5)
# Method 3: while loop
sum6 = 0
i = 0
while(i <= 100){ sum6 = sum6 + i^2; i = i + 1}
print(sum6)
# Method 4: the sum() function
print(sum(c((1:100)^2)))
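Both sums also have closed forms, which give a quick cross-check on the loop results: 1 + 2 + ... + n = n(n + 1)/2 and 1² + 2² + ... + n² = n(n + 1)(2n + 1)/6. A minimal sketch (the names `n`, `sum_linear`, and `sum_squares` are introduced here for illustration):

```r
n <- 100

sum_linear  <- n * (n + 1) / 2                # closed form for 1 + 2 + ... + n
sum_squares <- n * (n + 1) * (2 * n + 1) / 6  # closed form for the sum of squares

# Cross-check against the vectorised versions
sum_linear  == sum(1:n)
sum_squares == sum((1:n)^2)
```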

[1] 5050

[1] 5050

[1] 5050

[1] 5050

[1] 338350

[1] 338350

[1] 338350

[1] 338350

85. Create a vector of all odd numbers in [1, 100]

In [95]:
t = seq(from = 1, to = 100, by = 2) # from 1 to 100 in steps of 2
print(t)

86. Delete the 5th element of the length-200 numeric vector t, then insert the elements 11 and 21 at that position

In [96]:
t = c(1:200)
t = t[-5] # delete the 5th element
t = c(t[1:4], 11, 21, t[5:199]) # insert 11 and 21 at the 5th position
print(t)
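Base R's `append()` performs the same insertion without slicing the vector by hand. A minimal sketch, using the illustrative name `v` rather than `t`:

```r
v <- 1:200
v <- v[-5]                            # delete the 5th element
v <- append(v, c(11, 21), after = 4)  # insert the new values after position 4

length(v)  # 199 remaining elements plus the 2 inserted ones
v[4:7]     # the elements around the insertion point
```

The `after` argument names the position the new values follow, so `after = 4` places them at positions 5 and 6.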

87. Arrange the natural numbers 1 to 24 into an array with 3 rows, 4 columns, and 2 layers, and access the second layer

[1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49

[26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

[1] 1 2 3 4 11 21 6 7 8 9 10 11 12 13 14 15 16 17

[19] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

[37] 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53

[55] 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

[73] 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89

[91] 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107

[109] 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125

[127] 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143

[145] 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161

[163] 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179

[181] 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197

[199] 198 199 200

In [97]:
y = c( 1 : 24 )
t = array(y, c(3, 4, 2)) # build the 3 x 4 x 2 array; t[, , 2] selects the second layer
print(t)
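Leaving the first two indices empty and fixing the third selects a whole layer of a three-dimensional array. A minimal sketch (the names `arr` and `second` are introduced here for illustration):

```r
arr <- array(1:24, dim = c(3, 4, 2))

second <- arr[, , 2]  # all rows and all columns of layer 2

dim(second)     # the slice is an ordinary 3 x 4 matrix
second[1, 1]    # first value of the second layer
```

Adding `drop = FALSE` to the subscript would keep the result three-dimensional instead of simplifying it to a matrix.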

88. Use the following vectors as columns to create a data frame, then access the temp column

X = c(1,1,1)
Y = c(2,2,2)
temp = c(14.7,18.5,25.9)
RH = c(66,73,41)

In [98]:
X = c( 1 , 1 , 1 )
Y = c( 2 , 2 , 2 )
temp = c(14.7,18.5,25.9)
RH = c( 66 , 73 , 41 )
data = data.frame(X, Y, temp, RH) # define the data frame
print(data)

In [99]:
print(data[, 'temp']) # access the temp column; data[, 3] also works

89. Set a random seed and generate 100 numbers from the standard normal distribution

, , 1

[,1] [,2] [,3] [,4]

[1,] 1 4 7 10

[2,] 2 5 8 11

[3,] 3 6 9 12

, , 2

[,1] [,2] [,3] [,4]

[1,] 13 16 19 22

[2,] 14 17 20 23

[3,] 15 18 21 24

X Y temp RH
1 1 2 14.7 66
2 1 2 18.5 73
3 1 2 25.9 41

[1] 14.7 18.5 25.9

In [100]:
set.seed( 100 )
y = rnorm(100, 0, 1) # generate 100 standard normal values
print(y)

90. Sort y in ascending order, then use the sorted values to plot the normal density

In [101]:
y = sort(y) # sort the values of y
print(y)

[1] -0.50219235 0.13153117 -0.07891709 0.88678481 0.11697127 0.31863009

[7] -0.58179068 0.71453271 -0.82525943 -0.35986213 0.08988614 0.09627446

[13] -0.20163395 0.73984050 0.12337950 -0.02931671 -0.38885425 0.51085626

[19] -0.91381419 2.31029682 -0.43808998 0.76406062 0.26196129 0.77340460

[25] -0.81437912 -0.43845057 -0.72022155 0.23094453 -1.15772946 0.24707599

[31] -0.09111356 1.75737562 -0.13792961 -0.11119350 -0.69001432 -0.22179423

[37] 0.18290768 0.41732329 1.06540233 0.97020202 -0.10162924 1.40320349

[43] -1.77677563 0.62286739 -0.52228335 1.32223096 -0.36344033 1.31906574

[49] 0.04377907 -1.87865588 -0.44706218 -1.73859795 0.17886485 1.89746570

[55] -2.27192549 0.98046414 -1.39882562 1.82487242 1.38129873 -0.83885188

[61] -0.26199577 -0.06884403 -0.37888356 2.58195893 0.12983414 -0.71302498

[67] 0.63799424 0.20169159 -0.06991695 -0.09248988 0.44890327 -1.06435567

[73] -1.16241932 1.64852175 -2.06209602 0.01274972 -1.08752835 0.27053949

[79] 1.00845187 -2.07440475 0.89682227 -0.04999577 -1.34534931 -1.93121153

[85] 0.70958158 -0.15790503 0.21636787 0.81736208 1.72717575 -0.10377029

[91] -0.55712229 1.42830143 -0.89295740 -1.15757124 -0.53029645 2.44568276

[97] -0.83249580 0.41351985 -1.17868314 -1.17403476

[1] -2.27192549 -2.07440475 -2.06209602 -1.93121153 -1.87865588 -1.77677563

[7] -1.73859795 -1.39882562 -1.34534931 -1.17868314 -1.17403476 -1.16241932

[13] -1.15772946 -1.15757124 -1.08752835 -1.06435567 -0.91381419 -0.89295740

[19] -0.83885188 -0.83249580 -0.82525943 -0.81437912 -0.72022155 -0.71302498

[25] -0.69001432 -0.58179068 -0.55712229 -0.53029645 -0.52228335 -0.50219235

[31] -0.44706218 -0.43845057 -0.43808998 -0.38885425 -0.37888356 -0.36344033

[37] -0.35986213 -0.26199577 -0.22179423 -0.20163395 -0.15790503 -0.13792961

[43] -0.11119350 -0.10377029 -0.10162924 -0.09248988 -0.09111356 -0.07891709

[49] -0.06991695 -0.06884403 -0.04999577 -0.02931671 0.01274972 0.04377907

[55] 0.08988614 0.09627446 0.11697127 0.12337950 0.12983414 0.13153117

[61] 0.17886485 0.18290768 0.20169159 0.21636787 0.23094453 0.24707599

[67] 0.26196129 0.27053949 0.31863009 0.41351985 0.41732329 0.44890327

[73] 0.51085626 0.62286739 0.63799424 0.70958158 0.71453271 0.73984050

[79] 0.76406062 0.77340460 0.81736208 0.88678481 0.89682227 0.97020202

[85] 0.98046414 1.00845187 1.06540233 1.31906574 1.32223096 1.38129873

[91] 1.40320349 1.42830143 1.64852175 1.72717575 1.75737562 1.82487242

[97] 1.89746570 2.31029682 2.44568276 2.58195893

In [102]:
plot(y, dnorm(y, 0, 1), type = "l", main = "Normal distribution density") # draw the normal density curve
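The curve above is evaluated only at the sampled points, so it can look jagged where samples are sparse. One alternative, sketched here under the assumption that a range of -4 to 4 is wide enough, is to evaluate `dnorm()` on an evenly spaced grid. The names `grid` and `dens` are illustrative.

```r
grid <- seq(-4, 4, length.out = 201)    # evenly spaced x values
dens <- dnorm(grid, mean = 0, sd = 1)   # density at each grid point

peak_x <- grid[which.max(dens)]         # location of the highest point
peak_y <- max(dens)                     # height there, about 1 / sqrt(2 * pi)

# Draw only in an interactive session so the sketch also runs in batch mode
if (interactive()) {
  plot(grid, dens, type = "l", main = "Standard normal density")
}
```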

91. Write an R function that computes f(n) = 1^3 + 2^3 + ... + n^3, and evaluate f(5);

In [103]:
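# First, define the function f
f = function(n){
  total = 0 # running total (named total so that base R's sum() is not masked)
  for(i in seq_len(n)) total = total + i^3 # add the cube of each integer from 1 to n
  return(total) # return the accumulated sum
}
f(5) # call f with n = 5; the result is 225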
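As a cross-check, the same sum can be written without a loop, and it also has a well-known closed form, (n(n+1)/2)^2. The function names below are illustrative.

```r
# Vectorised version: cube 1..n in one step, then add up
f_vec <- function(n) sum(seq_len(n)^3)

# Closed form for the sum of the first n cubes
f_closed <- function(n) (n * (n + 1) / 2)^2

print(f_vec(5))
print(f_closed(5))
```

Both calls return 225, matching the loop version.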

92. Use R to compute the cube root of |e^1 - e^2| and round the result to two decimal places;

In [104]:
round(abs(exp(1) - exp(2))^(1/3), 2)
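A brief sketch of the difference between rounding a number and formatting it for display: `round()` returns a numeric value, while `sprintf()` returns text with a fixed number of decimals. The variable names are illustrative.

```r
value <- abs(exp(1) - exp(2))^(1 / 3)

rounded <- round(value, 2)         # still numeric
shown   <- sprintf("%.2f", value)  # character string with two decimals

print(rounded)
print(shown)
print(sprintf("%.2f", 2.5))        # trailing zero is kept: "2.50"
```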

93. For the vector x = c(3:95), compute the mean, median, standard deviation, variance, maximum, minimum, length, and the sum of its elements;

225

1.67

In [105]:
x = c(3:95)
mean(x)
median(x)
sd(x)
var(x)
max(x)
min(x)
length(x)
sum(x)
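To make clear what `var()` and `sd()` compute, this sketch rebuilds the sample variance by hand (dividing by n - 1) and compares it with the built-in result. The names `manual_var`, `manual_sd` and `value_range` are illustrative.

```r
x <- 3:95
n <- length(x)

manual_var <- sum((x - mean(x))^2) / (n - 1)  # sample variance, denominator n - 1
manual_sd  <- sqrt(manual_var)                # standard deviation is its square root

print(c(manual = manual_var, builtin = var(x)))
print(summary(x))            # min, quartiles, median, mean and max in one call
value_range <- range(x)      # minimum and maximum together
```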

94. Read the files ReportCard1.txt and ReportCard2.txt into the variables Reportcard1 and Reportcard2; use the merge function with "xh" as the key to combine the two, and save the result in the variable Reportcard.

In [106]:
Reportcard1 = read.table("/home/mw/input/wlong6309/ReportCard1.txt", header=TRUE)
Reportcard2 = read.table("/home/mw/input/wlong6309/ReportCard2.txt", header=TRUE)
Reportcard = merge(Reportcard1, Reportcard2, by='xh')
print(head(Reportcard))
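The behaviour of `merge()` is easier to see on a tiny example. The two data frames below are made up for illustration (only the key name `xh` comes from the exercise). By default only keys present in both inputs are kept; `all = TRUE` keeps every key and fills the gaps with NA.

```r
# Hypothetical miniature report cards sharing the key column xh
card1 <- data.frame(xh = c(1, 2, 3), chi = c(96, 97, 85))
card2 <- data.frame(xh = c(2, 3, 4), math = c(86.5, 79.5, 78))

inner <- merge(card1, card2, by = "xh")              # keys 2 and 3 only
outer <- merge(card1, card2, by = "xh", all = TRUE)  # keys 1 to 4, NA where missing

print(inner)
print(outer)
```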

95. Remove the rows with missing values from Reportcard

49

49

26.9907391525316

728.5

95

3

93

4557

xh sex poli chi math fore phy che geo his
1 92101 2 96 96 87.5 72 93 65 76.0 92
2 92102 1 94 97 86.5 61 93 64 79.5 95
3 92103 2 NA NA NA 66 98 79 89.0 81
4 92104 2 89 97 69.5 86 83 62 83.0 94
5 92105 1 82 85 79.5 60 88 66 72.5 98
6 92106 2 88 88 78.0 60 90 70 81.5 77

In [107]:
Reportcard = na.omit(Reportcard)
print(head(Reportcard))
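Before dropping rows it can help to see how many values are missing. The sketch below uses a made-up three-row data frame (the column names follow the report card, the rest is illustrative) and shows `complete.cases()` as an equivalent row filter.

```r
scores <- data.frame(xh   = c(92101, 92102, 92103),
                     poli = c(96, 94, NA),
                     fore = c(72, 61, 66))

n_missing <- colSums(is.na(scores))       # missing values per column
kept <- scores[complete.cases(scores), ]  # rows with no NA at all

print(n_missing)
print(kept)
```

`na.omit(scores)` keeps the same rows; it additionally records the dropped row numbers in an attribute.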

96. Recode the sex column in Reportcard, replacing 1 with M and 2 with F

In [108]:
Reportcard$sex = factor(Reportcard$sex, levels=c(1, 2), labels=c("M", "F"))
Reportcard$sex
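A minimal sketch of the same recoding on a short made-up vector, showing how `levels` and `labels` pair up and how `table()` then counts by label. The names `sex_code` and `sex` are illustrative.

```r
sex_code <- c(2, 1, 2, 1, 1)

# levels lists the raw codes, labels gives the text for each code in the same order
sex <- factor(sex_code, levels = c(1, 2), labels = c("M", "F"))

print(sex)
counts <- table(sex)
print(counts)
```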

97. In the data frame Reportcard, compute each student's total score and average score, stored in the variables SumScore and AvScore respectively

In [109]:
SumScore = rowSums(Reportcard[, 3:10], na.rm=TRUE)
Reportcard$SumScore = SumScore
AvScore = rowMeans(Reportcard[, 3:10], na.rm=TRUE)
Reportcard$AvScore = AvScore
print(head(Reportcard))
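The sketch below repeats the calculation on a made-up two-row data frame, selecting the score columns by name instead of by position. It also shows what `na.rm = TRUE` does: a missing score is skipped, so that row's mean is taken over the remaining subjects only.

```r
marks <- data.frame(xh = c(92101, 92102),
                    chi = c(96, 97),
                    math = c(87.5, NA))

score_cols <- c("chi", "math")  # columns that hold scores
marks$SumScore <- rowSums(marks[, score_cols], na.rm = TRUE)
marks$AvScore  <- rowMeans(marks[, score_cols], na.rm = TRUE)

print(marks)
```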

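98. Using AvScore as the criterion, divide the scores into grades (A, B, C, D, E), store the recoded variable in avScore, then use a bar chart to view the distribution of grades in the class;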

xh sex poli chi math fore phy che geo his
1 92101 2 96 96 87.5 72 93 65 76.0 92
2 92102 1 94 97 86.5 61 93 64 79.5 95
4 92104 2 89 97 69.5 86 83 62 83.0 94
5 92105 1 82 85 79.5 60 88 66 72.5 98
6 92106 2 88 88 78.0 60 90 70 81.5 77
7 92108 2 84 90 69.5 50 80 60 86.5 94

F M F M F F M M F M M F M M M M F M F F F F M F F

M M M M F F F F F M F F M M M M M F M F F M F M M

M M F F F M M F

Levels: M F
xh sex poli chi math fore phy che geo his SumScore AvScore
1 92101 F 96 96 87.5 72 93 65 76.0 92 677.5 84.6875
2 92102 M 94 97 86.5 61 93 64 79.5 95 670.0 83.7500
4 92104 F 89 97 69.5 86 83 62 83.0 94 663.5 82.9375
5 92105 M 82 85 79.5 60 88 66 72.5 98 631.0 78.8750
6 92106 F 88 88 78.0 60 90 70 81.5 77 632.5 79.0625
7 92108 F 84 90 69.5 50 80 60 86.5 94 614.0 76.7500

(Grades: 90 or above is A; at least 80 and below 90 is B; at least 70 and below 80 is C; at least 60 and below 70 is D; below 60 is E)

In [110]:
Reportcard = within(Reportcard,{
# caution: the first assignment turns AvScore into a character vector, so the
# later comparisons are string comparisons; this happens to work for two-digit averages
AvScore[AvScore >= 90] = 'A'
AvScore[AvScore >= 80 & AvScore < 90] = 'B'
AvScore[AvScore >= 70 & AvScore < 80] = 'C'
AvScore[AvScore >= 60 & AvScore < 70] = 'D'
AvScore[AvScore < 60] = 'E'
})
avScore = Reportcard[, 12] # save the recoded grades in avScore
print(avScore)

[1] "B" "B" "B" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "D" "D" "C" "D" "C"
[20] "D" "D" "D" "D" "E" "E" "E" "E" "E" "B" "C" "C" "C" "C" "C" "C" "C" "C" "C"
[39] "C" "C" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "E" "E"
[58] "E"

In [111]:
n = table(Reportcard$AvScore)
barplot(n, ylim=c(0, 25)) # draw the bar chart
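An alternative way to grade numeric scores is `cut()`, which bins a numeric vector in one step and never compares strings. This is a sketch on made-up averages, not the approach used in the exercise; the names `av` and `grade` are illustrative.

```r
av <- c(84.6875, 78.875, 69.5, 59.9, 90, 100)

# right = FALSE makes each bin closed on the left: [60, 70), [70, 80), ...
grade <- cut(av,
             breaks = c(-Inf, 60, 70, 80, 90, Inf),
             labels = c("E", "D", "C", "B", "A"),
             right  = FALSE)

print(grade)
print(table(grade))
```

Because the result is a factor with all five levels, `table()` also reports grades that have a count of zero, which keeps the bars of a bar chart in a fixed order.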

99. Fill a matrix of 2 rows and 3 columns with 1:6 in column order, then compute the maximum and minimum of every row and every column

In [112]:
data = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
row_max = c()
row_min = c()
col_max = c()
col_min = c()
for(i in 1:nrow(data))
{
  row_max = c(row_max, max(data[i,]))
  row_min = c(row_min, min(data[i,]))
}
data = cbind(data, row_max, row_min)
for(j in 1:ncol(data))
{
  col_max = c(col_max, max(data[,j]))
  col_min = c(col_min, min(data[,j]))
}
data = rbind(data, col_max, col_min)
print(data)
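The same row and column extremes can be obtained without explicit loops by using `apply()`, where the second argument selects rows (1) or columns (2). This sketch works on the original 2 x 3 matrix only; the name `m` is illustrative.

```r
m <- matrix(1:6, nrow = 2)  # filled column by column

row_max <- apply(m, 1, max)  # one value per row
row_min <- apply(m, 1, min)
col_max <- apply(m, 2, max)  # one value per column
col_min <- apply(m, 2, min)

print(rbind(row_max, row_min))
print(rbind(col_max, col_min))
```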

100. Use R to scrape the first page of new homes for sale in Shijiazhuang on 58.com (58同城)

              row_max row_min
        1 3 5       5       1
        2 4 6       6       2
col_max 2 4 6       6       2
col_min 1 3 5       5       1

In [3]:
install.packages("rvest")
library(rvest) # package that provides the scraping functions

# Read the page and get the new homes for sale in Shijiazhuang
page_text <- read_html("https://sjz.58.com/xinfang/") # load the first page of listings
# get the estate names
estate_name <- page_text %>% html_nodes("span.items-name") %>% html_text()
# get the estate locations
estate_detail_address <- page_text %>% html_nodes("span.list-map") %>% html_tex
estate_brief_address <- substr(estate_detail_address, 3, 4) # district or county
# average price
estate_price <- page_text %>% html_nodes("p.price") %>% html_nodes("span")%>% h
# data fix: the listing for 翰林观天下 shows the average price of the surrounding area (kept)
estate_price <- c(estate_price[1:16], "15990", estate_price[17:59])
# store the scraped data in a data frame
estate <- data.frame(name=estate_name,address=estate_brief_address,price=estate

# print only the first few rows
print(head(estate))
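Because a live page can change or be unreachable, the selector logic can be tried offline first. The sketch below parses a made-up HTML fragment whose class names mirror the selectors used above; the markup, the estate names and the prices are all hypothetical, and the real page structure may differ. It assumes the rvest package is installed and skips the work otherwise.

```r
if (requireNamespace("rvest", quietly = TRUE)) {
  library(rvest)

  # Hypothetical markup for two listings (not copied from the real site)
  html <- paste0(
    '<div><span class="items-name">Estate A</span>',
    '<p class="price"><span>14800</span></p></div>',
    '<div><span class="items-name">Estate B</span>',
    '<p class="price"><span>10500</span></p></div>'
  )

  page <- read_html(html)  # read_html also accepts literal HTML text
  names_found <- page %>% html_nodes("span.items-name") %>% html_text()
  prices <- page %>% html_nodes("p.price") %>% html_nodes("span") %>% html_text()

  # html_text() returns character strings, so convert prices before doing arithmetic
  listing <- data.frame(name = names_found, price = as.numeric(prices))
  print(listing)
}
```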

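The full code for scraping 58.com new-home listings with R will appear in a complete follow-up project, which will be updated soon. Likes and forks are welcome!
【Detailed summary of the companion R knowledge points】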

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
name address price
1 紫晶悦和中心 长安 14800
2 天润福庭 藁城 10500
3 美好时光 裕华 12500
4 玖筑翰府 开发 11000
5 绿城诚园 新华 12800
6 东华国樾府 裕华 15500
