Exploratory Data Analysis

From Course 4, Exploratory Data Analysis, of the Johns Hopkins University Data Science series on Coursera.


Principles of Analytic Graphics

  1. Show comparisons: always ask "compared to what?" (for example, PM2.5 in a house with an air cleaner compared to a house without one)
  2. Show causality, mechanism, explanation: show how you believe the world works (for example, that a child living in a house with lower PM2.5 is more likely to be healthy)
  3. Show multivariate data: more than 2 variables
  4. Integrate multiple modes of evidence: don't let the tools drive the analysis (the plot should follow your own idea, not the tool)
  5. Describe and document the evidence
  6. Content is king

Take a Look at the Data

library(datasets)
data(airquality)

One dimension

  1. Summary
    summary(airquality$Ozone) # ozone
  2. Boxplots
    boxplot(airquality$Ozone, col = "blue")
    abline(h = 100)
  3. Histograms
    hist(airquality$Ozone, col = "green", breaks = 100)
    abline(v = 100, lwd = 2)
    abline(v = median(airquality$Ozone, na.rm = TRUE), col = "magenta", lwd = 4) # Ozone contains NAs, so na.rm = TRUE is required
    rug(airquality$Ozone)
  4. Barplot
    barplot(table(airquality$Month),col = "wheat")
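
The Ozone column contains missing values, which matters for the one-dimensional summaries above. The following is a minimal sketch (the variable names are illustrative, not from the course) of how the NAs affect median() and how to handle them before drawing a reference line:

```r
library(datasets)

ozone <- airquality$Ozone

n_missing   <- sum(is.na(ozone))            # number of missing ozone readings
med_default <- median(ozone)                # NA, because NAs are not removed by default
med_clean   <- median(ozone, na.rm = TRUE)  # median of the observed values only

hist(ozone, col = "green", breaks = 100)    # hist() drops the missing values itself
abline(v = med_clean, col = "magenta", lwd = 4)
```

Passing an NA to abline() draws nothing and raises no error, so a missing median line is easy to overlook.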

Two dimensions

  1. Multiple Boxplots
    boxplot(Ozone ~ Month, data = airquality, col = "red")
  2. Multiple Histograms
    par(mfrow = c(2,1), mar = c(4, 4, 2, 1))
    hist(subset(airquality, Month == 5)$Ozone, col = "green")
    hist(subset(airquality, Month == 8)$Ozone, col = "green")
  3. Scatterplot
    par(mfrow = c(1,1))
    with(airquality, plot(Solar.R, Ozone, col = Month))
    legend("topright", pch = 1, col = c(5, 6, 7, 8, 9), legend = c("May", "June", "July", "August", "September"))
    abline(h = 100, lwd = 2, lty = 2)

Plotting Systems in R

Base Plotting System

  • Base: the "artist's palette" model; you start from a blank canvas and build the plot up, usually in two steps
  • Representative function: plot()

Two packages

  • graphics (including plot, hist, boxplot, etc.)
  • grDevices (including the X11, PDF, PostScript, PNG devices, etc.)
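
The split between the two packages is easiest to see when a device is opened by hand. A minimal sketch, assuming a temporary output path (`out_file` is an illustrative name, not from the course): grDevices opens and closes the PDF device, and graphics draws on it.

```r
library(datasets)

# Illustrative output path (an assumption; any writable location works)
out_file <- file.path(tempdir(), "ozone_wind.pdf")

pdf(out_file, width = 5, height = 5)   # grDevices: open a PDF file device
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind"))  # graphics: draw on it
invisible(dev.off())                   # grDevices: close the device so the file is written

file.exists(out_file)
```

Until dev.off() is called, the file may be incomplete, so the close step should not be skipped.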

Two steps to create a base plot

  • Initializing a new plot
    library(datasets)
    with(airquality, plot(Wind, Ozone)) # scatterplot
  • Annotating an existing plot
    model <- lm(Ozone ~ Wind, airquality) # linear fit of Ozone on Wind
    abline(model, lwd = 2) # add the regression line

Base Plotting Functions

Initialize

  • plot: initialize a new plot (a scatterplot by default)
  • hist: initialize a new histogram
  • boxplot: initialize a new boxplot

Add

  • lines: add lines to a plot by connecting the given x, y points
  • abline: add straight lines to a plot (horizontal, vertical, or from an intercept and slope)
  • points: add points to a plot
  • text: add text labels to a plot using specified x, y coordinates
  • title: add annotations such as x, y axis labels, title, subtitle, outer margin
  • mtext: add arbitrary text to the margins
  • axis: add axis ticks and labels
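
The functions above can be combined on a single plot. What follows is only a sketch: the data frame `aq`, the label text, and the units in the axis titles are choices made for this illustration, not taken from the course.

```r
library(datasets)

# Keep only rows where both variables are observed
aq <- airquality[complete.cases(airquality[, c("Wind", "Ozone")]), ]

# Initialize an empty plot region: no points, no axes, no labels
plot(aq$Wind, aq$Ozone, type = "n", axes = FALSE, xlab = "", ylab = "")

points(aq$Wind, aq$Ozone, pch = 20)                        # add the observations
lines(lowess(aq$Wind, aq$Ozone), col = "blue", lwd = 2)    # add a smoothed curve
abline(h = median(aq$Ozone), lty = 2)                      # add a horizontal reference line
text(max(aq$Wind), median(aq$Ozone), "median", pos = 2)    # label placed at given x, y
axis(1); axis(2); box()                                    # draw both axes and the frame
title(main = "Ozone and Wind", xlab = "Wind (mph)", ylab = "Ozone (ppb)")
mtext("airquality data set", side = 3, line = 0.3, cex = 0.8)  # text in the top margin
```

Each call only adds to what is already on the device; nothing drawn earlier can be removed, which is why the order of the calls matters in the base system.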

Some Important Base Graphics Parameters

  • pch: the plotting symbol
  • lty: the line type
  • lwd: the line width
  • col: the color
  • xlab: string for the x-axis label
  • ylab: string for the y-axis label
  • las: the orientation of the axis labels
  • bg: the background color
  • mar: the margin size (bottom, left, top, right)
  • oma: the outer margin size (bottom, left, top, right)
  • mfrow: layout as c(rows, columns); plots are filled row-wise
  • mfcol: layout as c(rows, columns); plots are filled column-wise

Default parameters:

par("bg") # "white" or "transparent", depending on the graphics device
par("mar") # 5.1 4.1 4.1 2.1 (bottom, left, top, right)
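
Because par() changes settings globally for the current device, it is often useful to restore them afterwards. A minimal sketch (the names `old`, `during` and `after` are illustrative): par() returns the previous values of whatever it changes, and those can be passed back in.

```r
# par() returns the previous values of the parameters it changes
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
during <- par("mfrow")   # layout while the two-panel setting is active

set.seed(1)
plot(1:10, main = "left panel")
hist(rnorm(100), main = "right panel")

par(old)                 # restore the saved settings
after <- par("mfrow")    # layout after restoring
```

The same pattern works for any parameter in the list above, including mar and oma.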

Examples

Example: Legend

with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York", type = "n")) # type = "n" sets up the plot without drawing points
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))

Example: Multiple Base Plots

par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, {
  plot(Wind, Ozone, main = "Ozone and Wind")
  plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
  plot(Temp, Ozone, main = "Ozone and Temperature")
  mtext("Ozone and Weather in New York", outer = TRUE) # overall title in the outer margin
})

The Lattice System

  • Lattice: the entire plot is specified by one function call
  • Useful for plotting high-dimensional data (conditioning plots)
  • Unlike base plotting, which draws directly on the graphics device, a lattice function returns an object of class trellis (which is auto-printed at the console)
  • Representative function: xyplot()
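
The trellis object can be stored and printed later, which is a small sketch of the difference from base plotting. The formula and the variable name `p` below are illustrative choices, not from the course.

```r
library(datasets)
library(lattice)

# Nothing is drawn here: the call only builds and returns a trellis object
p <- xyplot(Ozone ~ Wind | factor(Month), data = airquality, layout = c(5, 1))

class(p)   # "trellis"
print(p)   # drawing happens when the object is printed
```

Auto-printing only happens at the top level of an interactive session, so inside functions, loops, or sourced scripts an explicit print() is needed for the plot to appear.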

Two packages

  • lattice (including xyplot, bwplot, levelplot, etc.)
  • grid (usually called indirectly through lattice or ggplot2)

Lattice Functions

  • xyplot: create scatterplots
  • bwplot: box-and-whiskers plots
  • histogram: histograms
  • stripplot: like a boxplot but with actual points
  • dotplot: plot dots on "violin strings"
  • splom: scatterplot matrix (like pairs in base plotting)
  • levelplot, contourplot: for plotting "image" data

Examples

Example: xyplot

library(datasets)
library(lattice)
state <- data.frame(state.x77, region = state.region)
xyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1)) # one panel per region, 4 columns by 1 row

Example: panel function

library(lattice)
set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50)
y <- x + f - f * x + rnorm(100, sd = 0.5)
f <- factor(f, labels = c("group1", "group2"))
xyplot(y ~ x | f, panel = function(x, y, ...) {
  panel.xyplot(x, y, ...)     # first draw the default scatterplot for the panel
  panel.lmline(x, y, col = 2) # then overlay a simple linear regression line
})

The ggplot2 System

  • ggplot2: mixes elements of Base and Lattice
  • From the ggplot2 book: "In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system."
  • Representative functions: qplot(), ggplot()

Basic Components of a ggplot2 Plot

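  • a data frame
  • aesthetic mappings: how data are mapped to color, size
  • geoms: points, lines, shapes
  • facets: for conditional plots
  • stats: statistical transformations such as binning, quantiles, smoothing
  • scales: the scale an aesthetic mapping uses (for example, how the values of sex map to colors)
  • coordinate system

library(ggplot2)
data(mpg)
g <- ggplot(mpg, aes(displ, hwy)) # data and aesthetic mapping only; no layer yet
summary(g)
g + geom_point() # adding a geom layer makes the plot drawable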

Annotation

  1. labs and theme
  • xlab(), ylab(), ggtitle(), labs()
  • theme(legend.position = "none")
  • theme_gray()
  • theme_bw()

library(ggplot2)
data(mpg)
g <- ggplot(mpg, aes(displ, hwy))
g + geom_point(color = "steelblue", size = 3.14, alpha = 0.5) + labs(x = expression(PM[2.5])) # alpha sets the transparency; expression() renders the subscript
g + geom_point(aes(color = drv), size = 3.14, alpha = 0.5) + ggtitle("title") + theme(plot.title = element_text(hjust = 0.5)) # hjust = 0.5 centers the title

  2. ylim and coord_cartesian
  • ylim(-3, 3): restricts the y scale to (-3, 3); observations outside the range are dropped
  • coord_cartesian(ylim = c(-3, 3)): only zooms the view to (-3, 3); all observations are kept

testdata <- data.frame(x = 1:100, y = rnorm(100))
testdata[50,2] <- 100 # plant an outlier
g <- ggplot(testdata, aes(x = x,y = y))
g + geom_line()
g + geom_line() + ylim(-3, 3) # restrict y to (-3, 3); the outlier is removed and the line breaks
g + geom_line() + coord_cartesian(ylim = c(-3, 3)) # show only the (-3, 3) range; the outlier is kept
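
  3. cut: bin a continuous variable into a factor

testdata <- data.frame(x = 1:100, y = rnorm(100))
testdata[50,2] <- 100
cutpoints <- quantile(testdata$y, seq(0, 1, length = 4), na.rm = TRUE) # tertile boundaries
testdata$y_new <- cut(testdata$y, cutpoints, include.lowest = TRUE) # without include.lowest the minimum becomes NA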
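
By default cut() builds intervals that are open on the left, so when the breaks come from quantile() the smallest value falls outside every interval and is turned into NA. A minimal sketch with simulated data (the names `open_left` and `closed_left` are illustrative) shows the effect of include.lowest:

```r
set.seed(1)
y <- rnorm(100)

# Breaks at the 0%, 33.3%, 66.7% and 100% quantiles
cutpoints <- quantile(y, seq(0, 1, length = 4))

open_left   <- cut(y, cutpoints)                         # intervals (a, b]: the minimum is excluded
closed_left <- cut(y, cutpoints, include.lowest = TRUE)  # first interval becomes [a, b]

sum(is.na(open_left))     # the minimum of y was not assigned to any bin
sum(is.na(closed_left))   # every value has a bin
table(closed_left)
```

The resulting factor can then be used directly as a faceting or coloring variable in ggplot2.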

Examples

Example: geom

library(ggplot2)
data(mpg)
qplot(displ, hwy, data = mpg, geom = c("point", "smooth")) # points plus a smoothed trend

Example: fill

library(ggplot2)
data(mpg)
qplot(hwy, data = mpg, fill = drv) # histogram of hwy, bars filled by drive type

Example: facets

library(ggplot2)
data(mpg)
qplot(displ, hwy, data = mpg, facets = .~drv) # one column of panels per drv value

library(ggplot2)
data(mpg)
qplot(displ, hwy, data = mpg, facets = drv~.) # one row of panels per drv value

Example: boxplot

library(ggplot2)
data(mpg)
qplot(drv, hwy, data = mpg, geom = "boxplot", color = manufacturer)

Example: ggplot

library(ggplot2)
data(mpg)
g <- ggplot(mpg, aes(displ, hwy)) # redefine g: it was last bound to the testdata plot
g + geom_point() + facet_grid(drv ~ cyl, margins = TRUE) + geom_smooth(method = "lm", size = 1, se = FALSE, color = "pink")


Color

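library(RColorBrewer) # provides brewer.pal()

# showMe() is not part of base R or RColorBrewer. The definition below is an
# assumed minimal stand-in that draws a vector of colors as a strip.
showMe <- function(cv) {
  image(seq_along(cv), 1, as.matrix(seq_along(cv)), col = cv, xlab = "", ylab = "", axes = FALSE)
}

p1 <- colorRampPalette(c("red","yellow")) # returns a function that interpolates n colors
showMe(p1(100))

p2 <- colorRampPalette(c("orange","yellow","green"))
showMe(p2(100))

cols <- brewer.pal(3, "BuGn") # 3 colors from the sequential blue-green palette
showMe(cols)
pal <- colorRampPalette(cols)
showMe(pal(20))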
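
colorRampPalette() returns a function, and calling that function with n yields n hex color strings interpolated between the endpoints. A small sketch using only grDevices (the name `ramp` is illustrative) makes the output visible without any plotting helper:

```r
# colorRampPalette() ships with grDevices, which is attached by default
ramp <- colorRampPalette(c("red", "yellow"))

two <- ramp(2)   # just the two endpoints
six <- ramp(6)   # endpoints plus four evenly spaced colors in between

two   # "#FF0000" "#FFFF00"
six   # "#FF0000" "#FF3300" "#FF6600" "#FF9900" "#FFCC00" "#FFFF00"
```

Red and yellow differ only in the green channel, so the interpolation shows up in the middle two hex digits, which step from 00 to FF.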