[Programming Study] Big Data Platform Fundamentals Course Notes — Implementing Multithreading in R

The instructor's lecture slides are available here:

Since the course was taught in English and I'm too lazy to translate, the rest of this post is in English~~~

First, let's check how many cores R can use on this machine:

library(parallel)
nCores <- detectCores()  # number of logical cores available on this machine
nCores

PARALLELIZE USING parallel

multicore, snow, foreach

The parallel package was introduced in 2011 to unify two popular parallelisation packages: snow and multicore. The multicore package was designed to parallelise via the fork mechanism on Unix-like systems such as Linux. The snow package was designed to parallelise via other mechanisms, such as sockets. R processes started with snow are not forked, so they will not see the parent's data; data has to be copied to the child processes. The good news: snow can start R processes on Windows machines, or on remote machines in a cluster.
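
A minimal sketch of the difference (the cluster names here are illustrative, and type = "FORK" is only available on Unix-alikes):

library(parallel)
# fork-based cluster (multicore lineage): children inherit the parent's workspace
cl_fork <- makeCluster(2, type = "FORK")
# socket-based cluster (snow lineage): children are fresh R sessions,
# so any data they need must be exported explicitly
cl_sock <- makeCluster(2, type = "PSOCK")
stopCluster(cl_fork)
stopCluster(cl_sock)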

  • The parallel library can be used to send tasks (encoded as function calls) to each of the processing cores on your machine in parallel.
  • The most popular of these, mclapply(), essentially parallelizes calls to lapply().
  • mclapply() gathers up the results from each of these function calls and returns a list of results that is the same length as the input list or vector (one result per input item).
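
For example, a toy call (note that mc.cores > 1 is not supported on Windows, where mclapply() can only run serially):

library(parallel)
# a drop-in parallel replacement for lapply(): same input, same list output
squares <- mclapply(1:8, function(i) i^2, mc.cores = 2)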

REMARK ON mclapply

  • The mclapply() function (and related mc* functions) works via the fork mechanism on Unix-style operating systems.
  • Briefly, your R session is the main process and when you call a function like mclapply(), you fork a series of sub-processes that operate independently from the main process (although they share a few low-level features).
  • These sub-processes then execute your function on their subsets of the data, presumably on separate cores of your CPU. Once the computation is complete, each sub-process returns its results and then the sub-process is killed. The parallel package manages the logistics of forking the sub-processes and handling them once they’ve finished.
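
An illustrative sketch of the practical consequence: forked children can read objects from the parent's workspace without any explicit export, which would not be true for a socket cluster.

library(parallel)
big <- rnorm(1e6)  # created in the parent process
# each forked sub-process reads big directly; nothing has to be copied or exported
means <- mclapply(1:2, function(i) mean(big), mc.cores = 2)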

Example

  • Let's start with a simple example. We select Sepal.Length and Species from the iris dataset, keep the 100 non-setosa observations, and then iterate over 10,000 trials, each time resampling the observations with replacement.
  • For each trial, run a logistic regression of species on sepal length, and record the coefficients to be returned.
# build the bootstrap function first
# droplevels() removes the empty "setosa" level so glm() sees a two-class response
x <- droplevels(iris[which(iris[, 5] != "setosa"), c(1, 5)])
trials <- seq(1, 10000)
boot_fx <- function(trial) {
  ind <- sample(100, 100, replace = TRUE)  # resample the 100 rows with replacement
  result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(link = "logit"))
  r <- coefficients(result1)
  res <- rbind(data.frame(), r)  # return the coefficients as a one-row data frame
}
# benchmark: serial execution with lapply
system.time({
  results <- lapply(trials, boot_fx)
})

# output:
# user system elapsed
# 18.808 0.000 18.934
# parallel execution with mclapply, using all detected cores
system.time({
  results <- mclapply(trials, boot_fx, mc.cores = detectCores())
})
# user system elapsed
# 28.256 1.611 1.756

Elapsed is the wall-clock time the call takes to run. As this case study shows, mclapply cut the elapsed time from about 18.9 s to 1.8 s, a huge improvement. (The user time is higher than in the serial run because it sums the CPU time of all child processes.)

PARALLELIZE WITH parLapply

  • Using the forking mechanism on your computer is one way to execute parallel computation, but it is not the only mechanism the parallel package offers.
  • Another way to build a “cluster” using the multiple cores on your computer is via sockets.
  • A socket is simply a mechanism with which multiple processes or applications running on your computer (or different computers, for that matter) can communicate with each other.
  • With parallel computation, data and results need to be passed back and forth between the parent and child processes and sockets can be used for that purpose.

Example

  • Building a socket cluster is simple to do in R with the makeCluster() function.
library(snow)  # parallel::makeCluster(type = "PSOCK") offers the same mechanism
cl <- makeCluster(nCores, type = 'SOCK')
clusterExport(cl, "x")  # socket workers do not see the parent's data, so export x explicitly
system.time(results <- parLapply(cl, trials, boot_fx))
stopCluster(cl)
# output:
# user system elapsed
# 0.025 0.003 1.670

OTHER PACKAGES FOR PARALLEL COMPUTING IN R

Many R packages come with parallel features (or arguments). You may refer to this link for more details. Examples are:

  • future provides a lightweight and unified Future API for sequential and parallel processing of R expressions via futures (see the sketch after this list).
  • data.table is a venerable and powerful package written primarily by Matt Dowle. It is a high-performance implementation of R's data frame construct with an enhanced syntax, providing a highly optimized tabular data structure for the most common analytical operations; innumerable benchmarks have showcased its power.
  • The caret package (Classification And REgression Training) is a set of functions that streamline the process for creating predictive models. The package contains tools for data splitting, preprocessing, feature selection, model tuning using resampling, variable importance estimation, and other functionality.
  • multidplyr is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with partition(), and the data then stays on each node until you explicitly retrieve it with collect(). This minimizes time spent moving data around and maximizes parallel performance.
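
An illustrative sketch of the future API (assuming the future package is installed; multisession runs futures in background R sessions, so it also works on Windows):

library(future)
plan(multisession, workers = 2)  # evaluate futures in background R sessions
f <- future(sum(rnorm(1e6)))     # starts evaluating asynchronously
value(f)                         # blocks until the result is ready
plan(sequential)                 # restore the default sequential plan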