The basic data types in R are logical, numeric, integer, character, complex, and raw.
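As a quick check of these six types, the sketch below calls `class()` on one value of each. The list name `values` and the sample values are illustrative only.

```r
# One value of each basic type (names and values are illustrative)
values <- list(
  logical   = TRUE,
  numeric   = 3.14,
  integer   = 2L,              # the L suffix creates an integer
  character = "R",
  complex   = 1 + 2i,
  raw       = charToRaw("A")   # raw bytes of the string "A"
)

# class() reports the type of each element
types <- sapply(values, class)
print(types)
```

Each element's class should match the name it was stored under.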
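R's data structures include vectors, lists, two-dimensional matrices, multidimensional arrays, factors, and data frames. The table below shows how to create each one and how to access its elements: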
Type | Constructor | Element access |
---|---|---|
Vector (vector) | c( ) | v[index] |
List (list) | list( ) | L[[index]] |
Two-dimensional matrix (matrix) | matrix( ) | M[index1,index2] |
Multidimensional array (array) | array( ) | A[index1,…,indexn] |
Factor (factor) | factor( ) | F[index] |
Data frame (data.frame) | data.frame( ) | DF[index1,index2] |
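The following sketch creates one object of each structure and reads a single element from it. All object names and values are made up for illustration; the factor is named `fct` rather than `F` because `F` is also shorthand for `FALSE` in R.

```r
# Create one object of each structure (values are made up)
v   <- c(10, 20, 30)                      # vector
lst <- list(name = "Ken", age = 24)       # list
m   <- matrix(1:6, nrow = 2)              # 2 x 3 matrix, filled column by column
arr <- array(1:24, dim = c(2, 3, 4))      # 2 x 3 x 4 array
fct <- factor(c("low", "high", "low"))    # factor
df  <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Access one element of each
v[2]                 # 20
lst[["age"]]         # 24
m[2, 3]              # 6
arr[1, 2, 3]         # 15
as.character(fct[2]) # "high"
df[3, "y"]           # "c"
```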
Operators in R fall into four groups: arithmetic, relational, logical, and other operators, as listed in the table below:
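Operator type | Operator | Description |
---|---|---|
Arithmetic | + | Addition |
Arithmetic | - | Subtraction |
Arithmetic | * | Multiplication |
Arithmetic | / | Division |
Arithmetic | %% | Remainder (modulo) |
Arithmetic | %/% | Integer division |
Arithmetic | ^ | Exponentiation |
Relational | > | Greater than |
Relational | < | Less than |
Relational | >= | Greater than or equal to |
Relational | <= | Less than or equal to |
Relational | == | Equal to |
Relational | != | Not equal to |
Logical | &, && | AND |
Logical | \|, \|\| | OR |
Logical | ! | NOT |
Other | : | Create a sequence |
Other | %in% | Membership test |
Other | %*% | Matrix multiplication |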
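A few of the less obvious operators are sketched below with small made-up values. The result names (`remainder`, `quotient`, and so on) are illustrative only.

```r
# Arithmetic
remainder <- 7 %% 3      # 1
quotient  <- 7 %/% 3     # 2
power     <- 2 ^ 10      # 1024

# Relational: equality is ==, a single = is assignment
is_equal  <- 3 == 3      # TRUE
not_equal <- 3 != 4      # TRUE

# Logical: & and | work element by element, && and || on single values
elementwise_and <- c(TRUE, FALSE) & c(TRUE, TRUE)    # TRUE FALSE
single_and      <- TRUE && FALSE                     # FALSE
elementwise_or  <- c(TRUE, FALSE) | c(FALSE, FALSE)  # TRUE FALSE

# Other
sequence  <- 1:5                   # 1 2 3 4 5
is_member <- 3 %in% sequence       # TRUE
m         <- matrix(1:4, nrow = 2)
product   <- m %*% diag(2)         # multiplying by the identity matrix returns m's values
```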
R lets you define your own functions, and doing so is simple. The code below defines a first function:
myfunction <- function(a,b){
result_ab <- a*b
print(result_ab) # print the result
}
myfunction(2,6)
The output is:
> myfunction <- function(a,b){
+ result_ab <- a*b
+ print(result_ab)
+ }
> myfunction(2,6)
[1] 12
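A function can also hand its result back to the caller instead of only printing it, so the value can be stored and reused. The sketch below is a hypothetical variant named `multiply` with a default value for its second argument; neither the name nor the default comes from the original example.

```r
# Hypothetical variant: returns the product instead of printing it
multiply <- function(a, b = 1) {
  a * b   # the last evaluated expression is the return value
}

product <- multiply(2, 6)   # store the result for later use
product                     # 12
multiply(5)                 # b falls back to its default: 5
```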
A factorial, for example, can be computed with a for loop:
jiecheng_function <- function(a){
  ss <- 1
  for (i in seq_len(a)) { # seq_len(0) is empty, so a = 0 gives 1
    ss <- ss*i
  }
  print(ss)
}
Now verify that the function is correct:
> jiecheng_function(1)
[1] 1
> jiecheng_function(5)
[1] 120
> jiecheng_function(10)
[1] 3628800
The results match the expected values, which confirms that the function is correct.
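The check can also be automated by comparing a loop-based factorial against base R's built-in `factorial()`. The sketch below uses a hypothetical helper `factorial_loop` that returns its result rather than printing it; the test inputs are arbitrary.

```r
# Hypothetical helper: loop-based factorial that returns its result
factorial_loop <- function(n) {
  result <- 1
  for (i in seq_len(n)) {   # seq_len(0) is empty, so n = 0 yields 1
    result <- result * i
  }
  result
}

inputs      <- c(0, 1, 5, 10)                  # arbitrary test inputs
loop_values <- sapply(inputs, factorial_loop)

# Compare against the built-in factorial()
all(loop_values == factorial(inputs))
```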
Data stored on your computer must be imported into R before it can be processed. This section covers importing csv and Excel files, which requires the readr and readxl packages.
The readr documentation describes the package as follows: "The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. If you are new to readr, the best place to start is the data import chapter in R for data science."
The readxl documentation states: "The readxl package makes it easy to get data out of Excel and into R. Compared to many of the existing packages (e.g. gdata, xlsx, xlsReadWrite) readxl has no external dependencies, so it's easy to install and use on all operating systems. It is designed to work with tabular data."
The code is as follows:
> install.packages(c("readr", "readxl")) # install both packages
> library(readr)
> library(readxl)
> setwd("E:\\R_course\\Chapter2\\Data") # set the working directory
> getwd() # show the current working directory
[1] "E:/R_course/Chapter2/Data"
> cp.csv <- read_csv("comp.csv") # read the csv file
Parsed with column specification:
cols(
Berri1 = col_double(),
Boyer = col_double(),
CSC = col_double(),
Dame = col_double(),
Parc = col_double(),
PierDup = col_double(),
Ren = col_double(),
Urbain = col_double(),
University = col_double(),
Viger = col_double()
)
> cp.xls <- read_excel("comp.xlsx")
New names:
* `` -> ...2
> summary(cp.xls) # summarize the cp.xls data
Date ...2 Berri1
Length:366 Min. :1899-12-31 Min. : 32.0
Class :character 1st Qu.:1899-12-31 1st Qu.: 456.2
Mode :character Median :1899-12-31 Median :2381.5
Mean :1899-12-31 Mean :2701.1
3rd Qu.:1899-12-31 3rd Qu.:4764.0
Max. :1899-12-31 Max. :7544.0
After processing, export the data with write_csv(). The readr documentation notes: "This is about twice as fast as write.csv(), and never writes row names. output_column() is a generic method used to coerce columns to suitable output."
write_csv(cp.csv,"kk.csv") # export the cp.csv data to a file named kk.csv
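To confirm that an export can be read back unchanged, the sketch below writes a tiny data frame to a temporary file and re-imports it. The data frame `bikes`, its values, and the temporary path are made up for illustration; `col_types = cols()` is used only to suppress the column-specification message.

```r
library(readr)

# Made-up data standing in for the real file
bikes <- data.frame(Berri1 = c(35, 83), Dame = c(0, 1))

# Write to a temporary file, then read it back
path <- tempfile(fileext = ".csv")
write_csv(bikes, path)
restored <- read_csv(path, col_types = cols())

# The restored columns should hold the same values
identical(restored$Berri1, bikes$Berri1)
identical(restored$Dame, bikes$Dame)
```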
Data processing mainly consists of data extraction, data tidying, and pipe operations.
The dplyr package offers a convenient way to extract data. Its description reads: "A fast, consistent tool for working with data frame like objects, both in memory and out of memory." The documentation calls dplyr "a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges":
- mutate() adds new variables that are functions of existing variables.
- select() picks variables based on their names.
- filter() picks cases based on their values.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
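library(dplyr)
setwd("E:\\R_course\\Chapter2\\Data")
cp <- read_csv("comp.csv")
# extract the element in row 2, column 3
cp[2,3]
# extract the second element of the "Dame" column
cp[2,'Dame']
# extract the 'Dame' column
select(cp,'Dame')
# extract the columns whose names start with P
select(cp,starts_with("P"))
# keep the rows where 'Dame' equals 0
filter(cp,Dame==0)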
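The verbs mutate() and summarise() are not shown in the example above. The sketch below applies them to a small made-up data frame; `counts`, its values, and the column names `total` and `mean_total` are assumptions for illustration and do not come from comp.csv.

```r
library(dplyr)

# Made-up counts standing in for two columns of the real data
counts <- data.frame(Dame = c(0, 4, 10), Parc = c(2, 6, 8))

# mutate(): add a column computed from existing columns
with_total <- mutate(counts, total = Dame + Parc)

# summarise(): reduce a column to a single value
overall <- summarise(with_total, mean_total = mean(total))

with_total$total      # 2 10 18
overall$mean_total    # 10
```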
Non-relational data has records with varying numbers of fields and no fixed structure, so it calls for the rlist package. Its description reads: "Provides a set of functions for data manipulation with list objects, including mapping, filtering, grouping, sorting, updating, searching, and other useful functions. Most functions are designed to be pipeline friendly so that data processing with lists can be chained."
person <- list(
p1=list(name="Ken",age=24,interest=c("reading","music","movies"),
lang=list(r=2,csharp=4,python=3)),
p2=list(name="James",age=25,interest=c("sports","music"),
lang=list(r=3,java=2,cpp=5)),
p3=list(name="Penny",age=24,
interest=c("movies","reading"),
lang=list(r=1,cpp=4,python=2)))
str(person)
List of 3
$ p1:List of 4
..$ name : chr "Ken"
..$ age : num 24
..$ interest: chr [1:3] "reading" "music" "movies"
..$ lang :List of 3
.. ..$ r : num 2
.. ..$ csharp: num 4
.. ..$ python: num 3
$ p2:List of 4
..$ name : chr "James"
..$ age : num 25
..$ interest: chr [1:2] "sports" "music"
..$ lang :List of 3
.. ..$ r : num 3
.. ..$ java: num 2
.. ..$ cpp : num 5
$ p3:List of 4
..$ name : chr "Penny"
..$ age : num 24
..$ interest: chr [1:2] "movies" "reading"
..$ lang :List of 3
.. ..$ r : num 1
.. ..$ cpp : num 4
.. ..$ python: num 2
library(rlist)
list.map(person,age)
list.map(person,names(lang)) # map an expression over each element
p.age25 <- list.filter(person,age>=25)
str(p.age25)
p.py3 <- list.filter(person,lang$python>=3) # keep elements that satisfy a condition
str(p.py3)
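Data frames can be sorted with arrange() from dplyr, and lists with list.sort() from rlist:
library(dplyr)
arrange(cp,Dame) # ascending order
arrange(cp,desc(Dame)) # descending order
library(rlist)
str(list.sort(person,age)) # ascending order
str(list.sort(person,desc(lang$r))) # descending order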
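For comparison, the same map, filter, and sort steps can be approximated in base R without rlist. This is a swapped-in technique rather than the approach used above, and the list `people` is a trimmed, hypothetical version of the `person` list.

```r
# Trimmed, hypothetical version of the person list
people <- list(
  p1 = list(name = "Ken",   age = 24),
  p2 = list(name = "James", age = 25),
  p3 = list(name = "Penny", age = 24)
)

# Map: pull the age out of every element
ages <- sapply(people, function(p) p$age)

# Filter: keep the elements with age >= 25
older <- Filter(function(p) p$age >= 25, people)

# Sort: reorder the list by descending age
by_age_desc <- people[order(-ages)]

names(older)         # "p2"
names(by_age_desc)   # "p2" "p1" "p3"
```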
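For data tidying, gather() and spread() from the tidyr package convert a data frame between wide and long formats:
> widedata <- data.frame(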
+ person=c('Alex','Bob','Cathy'),
+ grade=c(2,3,4),
+ score=c(78,89,88),
+ age=c(18,19,18)
+ )
> library(tidyr)
> widedata
person grade score age
1 Alex 2 78 18
2 Bob 3 89 19
3 Cathy 4 88 18
> longdata <- gather(widedata,variable,value,-person)
> longdata
person variable value
1 Alex grade 2
2 Bob grade 3
3 Cathy grade 4
4 Alex score 78
5 Bob score 89
6 Cathy score 88
7 Alex age 18
8 Bob age 19
9 Cathy age 18
> widedata2 <- spread(longdata,variable,value)
> widedata2
person age grade score
1 Alex 18 2 78
2 Bob 19 3 89
3 Cathy 18 4 88
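Newer versions of tidyr (1.0.0 and later) also provide pivot_longer() and pivot_wider(), which cover the same wide-to-long and long-to-wide conversions. The sketch below repeats the reshaping with those functions on the same sample data; it assumes such a tidyr version is installed.

```r
library(tidyr)

widedata <- data.frame(
  person = c("Alex", "Bob", "Cathy"),
  grade  = c(2, 3, 4),
  score  = c(78, 89, 88),
  age    = c(18, 19, 18)
)

# Wide to long: every column except person becomes a variable/value pair
longdata <- pivot_longer(widedata, cols = -person,
                         names_to = "variable", values_to = "value")

# Long to wide: spread the variable/value pairs back into columns
widedata2 <- pivot_wider(longdata, names_from = variable, values_from = value)

nrow(longdata)     # 9
widedata2$score    # 78 89 88
```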
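In addition, the unite() and separate() functions in tidyr merge several columns into one and split one column into several, respectively: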
> wideunite <- unite(widedata,information,person,grade,
+ score,age,sep = "_")
> wideunite
information
1 Alex_2_78_18
2 Bob_3_89_19
3 Cathy_4_88_18
> widesep <- separate(wideunite,information,c("person",
+ "grade","score","age"),sep = "_")
> widesep
person grade score age
1 Alex 2 78 18
2 Bob 3 89 19
3 Cathy 4 88 18
Pipe operations reduce redundant code, speed up writing, and make code easier to read. The `%>%` operator from the magrittr package is the most commonly used pipe. The magrittr documentation describes it as follows: "The magrittr package offers a set of operators which promote semantics that will improve your code by structuring sequences of data operations left-to-right (as opposed to from the inside and out), avoiding nested function calls, minimizing the need for local variables and function definitions, and making it easy to add steps anywhere in the sequence of operations."
For example:
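# Select the columns of cp whose names start with D, multiply them by 2,
# flatten the result to a vector, reshape it into a 2-row matrix,
# take the mean of each column, and plot the means.
cp %>% select(starts_with("D")) %>% "*"(2) %>%
unlist() %>% matrix(nrow = 2) %>% colMeans() %>%
plot()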
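To see what the pipe changes, the sketch below writes the same computation twice, once as nested calls and once as a pipeline. The vector `x` is made up for illustration.

```r
library(magrittr)

x <- c(1, 4, 9, 16)   # made-up values

# Nested calls read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The pipeline reads left to right: square root, then mean, then round
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)   # TRUE
```

Both forms compute the same value; the pipeline lists the steps in the order they run.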