For more detail, you can download the HTML file here for free.
These notes follow that course.
Raw Data
Processed Data:
Data processing is actually part of data analysis; in fact, a huge component of a data scientist's job is performing those sorts of processing operations. The raw data may only need to be processed once, but regardless of how often you process it, you need to keep a record of everything you did, because it can have a major impact on the downstream analysis.
Raw Data:
Tidy Data:
Other important tips about tidy data:
Four things you should have from raw data to tidy data:
Code book:
Other important tips about the code book:
Get/set your working directory:
getwd() and setwd()
Relative paths: setwd("./data") or setwd("../")
Absolute paths: setwd("/Users/jtleek/data/")
On Windows: setwd("C:/Users/Andrew/Downloads") or setwd("C:\\Users\\Andrew\\Downloads")
Checking for and creating directories:
file.exists("directoryName") will check to see if the directory exists.
dir.create("directoryName") will create a directory if it does not exist.
Example:
if(!file.exists("data")){
dir.create("data")
}
Getting data from the internet–download.file()
Example:
# data from https://data.baltimorecity.gov/Transportation/Baltimore-Fixed-Speed-Cameras/dz54-2aru
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/camera.csv")
list.files("./data")
On Linux/Mac, the second line should be a little different:
download.file(fileUrl, destfile = "./data/camera.csv" , method = "curl" )
An important consideration when downloading files from the internet is that those files might change. For example, if they change the cameras, there might be a new set of cameras and the data we are analyzing might be different, so record when you downloaded:
dateDownloaded <- date()
dateDownloaded
Some notes about download.file():
If the URL starts with http, you can use download.file().
If the URL starts with https, on Mac you may need to set method = "curl".
Loading flat files–read.table():
read.table() is the main function for loading flat files into R. It reads the data into RAM, so very large files can cause problems. Related functions are read.csv() and read.csv2().
Example:
If we use
cameraData <- read.table("./data/camera.csv")
we will get an error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 13 elements
The reason is that camera.csv is comma-separated, but the default for read.table() is to look for a tab-delimited file. There are two ways you can read this data:
cameraData <- read.table("./data/camera.csv", sep = ",", header = TRUE)
head(cameraData,3)
or, since read.csv() automatically sets sep = "," and header = TRUE:
cameraData <- read.csv("./data/camera.csv")
head(cameraData, 3)
Some more important parameters:
quote = "" means no quotes.
nrows = 10 reads 10 lines.
In my experience, the biggest trouble with reading flat files is quotation marks ' or " placed in data values; setting quote = "" often resolves these problems.
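As a minimal sketch of that quote fix (using a small hypothetical file, quoted.txt, written on the fly):

```r
# Write a small tab-delimited file with a stray apostrophe in one value
writeLines(c("id\tname", "1\tO'Brien", "2\tSmith"), "quoted.txt")

# With the default quoting, the apostrophe opens a quote that never closes
# and the rows get mangled; quote = "" treats it as ordinary text
df <- read.table("quoted.txt", header = TRUE, sep = "\t",
                 quote = "", stringsAsFactors = FALSE)
df$name  # "O'Brien" "Smith"
```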
Loading Excel files–read.xlsx():
Excel files are still probably the most widely used format for sharing data.
Example:
# Download the file to load
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/camera.xlsx", mode="wb")
dateDownloaded <- date()
# install xlsx package first
## install.packages("xlsx")
# load package and read excel data
library(xlsx)
cameraData <- read.xlsx("./data/camera.xlsx", sheetIndex = 1, header = TRUE)
head(cameraData)
On Linux/Mac, add method = "curl" to the download.file() call and delete mode = "wb".
mode = "wb" is very important on Windows because without it there will be an error:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.util.zip.ZipException: invalid distance too far back
Reading specific rows and columns:
colIndex <- 2:3
rowIndex <- 1:4
cameraDataSubset <- read.xlsx("./data/camera.xlsx", sheetIndex = 1, colIndex = colIndex, rowIndex = rowIndex)
cameraDataSubset
Further notes:
write.xlsx() will write out an Excel file with similar arguments.
read.xlsx2() is much faster than read.xlsx(), but may be unstable for reading subsets of rows.
The XLConnect package has more options for writing and manipulating Excel files.
XML:
Tags, elements and attributes:
Tags correspond to general labels: start tags like <section>, end tags like </section>, and empty tags like <line-break />.
Elements are specific examples of tags, for example <Greeting> Hello, world </Greeting>.
Attributes are components of the label, for example <step number="3"> Connect A to B. </step>.
Read XML file into R
# install XML package first
## install.packages("XML")
library(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
doc <- xmlTreeParse(fileUrl, useInternal = TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
names(rootNode)
xmlTreeParse() parses the XML file: it loads the document into memory in R, where it is still a structured object, so we use different functions to access its different parts.
xmlRoot() gives access to the root element of that XML file.
xmlName() gives the name of the root node.
names() gives the names of all the elements nested within the root node.
The next thing you can do is directly access parts of the XML document, in much the same way you access a list in R:
# first element
rootNode[[1]]
# first element of the first element
rootNode[[1]][[1]]
# extract different parts of the file programmatically
xmlSApply(rootNode, xmlValue)
With xmlSApply(), you pass a parsed XML object and the function you would like to apply; it loops through all the elements of the XML root node and applies that function to each. Here xmlValue extracts text content: some types of XML nodes have no child nodes but are leaf nodes that simply contain text.
XPath: a new language
/node: top level node
//node: node at any level
node[@attr-name]: node with an attribute named attr-name
node[@attr-name="bob"]: node with attribute attr-name equal to "bob"
Get the items on the menu and their prices:
# extract content by elements
xpathSApply(rootNode, "//name", xmlValue)
xpathSApply(rootNode, "//price", xmlValue)
Extract content by attributes
# Extract content by attributes
fileUrl <- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens"
doc <- htmlTreeParse(fileUrl, useInternal = TRUE)
# find li elements with `class = "team-name"` and return their value
teams <- xpathSApply(doc, "//li[@class='team-name']", xmlValue)
teams
JSON:
Reading data from JSON (jsonlite package)
# install package first
## install.packages("jsonlite")
library(jsonlite)
# what you get from fromJSON function is a structured data frame
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
# all names of this data frame
names(jsonData)
# look at the names of that particular variable
names(jsonData$owner)
jsonData$owner$login
How to convert a data frame to JSON
# writing data frame to JSON
myjson <- toJSON(iris, pretty = TRUE)
# print it out: too long, you can view it yourself
#cat(myjson, nrow = 2)
# fromJSON return data frame again
iris2 <- fromJSON(myjson)
head(iris2, 3)
data.table:
Create data tables just like data frames
# install package first
## install.packages("data.table")
library(data.table)
# data frame
DF <- data.frame(x = rnorm(9), y = rep(letters[1:3], each = 3), z = rnorm(9))
head(DF, 3)
# data table
DT <- data.table(x = rnorm(9), y = rep(letters[1:3], each = 3), z = rnorm(9))
head(DT, 3)
See all the data tables in memory:
tables()
Subsetting rows
DT[2,]
DT[DT$y=="a",]
DT[c(2,3)]
# or
DT[c(2,3),]
Subsetting columns:
Subsetting columns is where data tables and data frames really diverge. A data table does not subset columns with the same subsetting rules a data frame uses; instead it takes expressions, which can be used to summarize the data in various different ways.
# expression in R
k <- {print(10); 5}
print(k)
Calculating values for variables with expressions
DT[, list(mean(x), mean(z))]
DT[,table(y)]
Another thing data.table does very fast and memory-efficiently is adding a new column. Usually, when you add a new variable to a data frame, R copies over the entire data frame, so you end up with two copies in memory; with big data sets this causes memory problems. A data table adds the column in place without creating a new copy, so if you do want a copy you have to make one explicitly with the copy() function.
DT[, w:=z^2]
DT2 <- DT
DT[, y:=2]
head(DT, 2)
head(DT2, 2)
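The assignment DT2 <- DT above only creates a second reference to the same table, which is why DT2 picks up the new y. To get a true independent copy, use copy(); a minimal sketch:

```r
library(data.table)

DT <- data.table(x = 1:3, y = c("a", "b", "c"))

# copy() allocates a genuinely new table rather than another reference
DT3 <- copy(DT)

# modifying DT by reference no longer touches DT3
DT[, y := "z"]
DT3$y  # still "a" "b" "c"
```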
Multiple operations
DT[, m:={tmp <- {x+z}; log2(tmp+5)}]
plyr-like operations
DT[,a:=x>0]
DT[,b:=mean(x+w),by=a]
Special variable .N:
An integer, length 1, containing the number of rows in the current group.
set.seed(123)
DT <- data.table(x = sample(letters[1:3], 1E5, TRUE))
# number of appearances of a, b, c
DT[, .N,by=x]
Keys:
A unique aspect of data tables is that they have keys; if you set a key, it's possible to subset and sort a data table much more rapidly than you can with a data frame.
DT <- data.table(x = rep(letters[1:3], each = 100), y = rnorm(100))
# set keys
setkey(DT, x)
# subsetting rows with x == "a"
head(DT["a"])
Joins or merge data table using keys
DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)
# set keys
setkey(DT1, x)
setkey(DT2, x)
# join
merge(DT1, DT2)
Fast reading
big_df <- data.frame(x = rnorm(1E6), y = rnorm(1E6))
file <- tempfile()
write.table(big_df, file = file, row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE)
# the fread command can be used to read data tables much faster
system.time(fread(file))
system.time(read.table(file, head = TRUE, sep = "\t")) # about 10 times slower