Sunday, April 6, 2014

SparkR - Spark + R


# My test SparkR program - mySparkR.R

require(SparkR)

# this does not work since I don't have a cluster set up
# sc <- sparkR.init(master="spark://david-centos6:7077", sparkEnvir=list(spark.executor.memory="1g"))
sc <- sparkR.init(master="local[2]", sparkEnvir=list(spark.executor.memory="1g"))

# read the input file from HDFS
lines <- textFile(sc, "hdfs://david-centos6:8020/user/david/data/result.txt")

# split each line into words
words <- flatMap(lines,
   function(line) {
      strsplit(line, " ")[[1]]
})

# pair each word with a count of 1
wordCount <- lapply(words, function(word) { list(word, 1L) })

# sum the counts per word across 2 partitions and bring the result back to the driver
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)

for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}
                                
# put the input file into HDFS
> hadoop fs -put result.txt data
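
# to verify the upload, listing the target directory should show result.txt
> hadoop fs -ls data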

# Run SparkR to test it
> ./sparkR examples/mySparkR.R
Or, to increase the memory used by the driver:
> SPARK_MEM=1g ./sparkR examples/mySparkR.R


NOTE: the Spark UI is at http://david-centos6:4040
Reference: http://stackoverflow.com/questions/21677142/running-a-job-on-spark-0-9-0-throws-error

ERROR you may sometimes see:
"Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory"

A simple test to check whether you have memory or other problems is to use the Spark shell:
Run MASTER="local[2]" spark-shell on the same machine where you are trying to run the code, and then run the same kind of job in the Spark console: sc.parallelize(1 to 100).count
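
The same sanity check can be done from SparkR itself. This is just a minimal sketch, assuming the same SparkR API used in the program above (parallelize/count):

# minimal SparkR sanity check: local context with 2 threads
sc <- sparkR.init(master="local[2]")

# distribute the numbers 1..100 over 2 slices and count them back
rdd <- parallelize(sc, 1:100, 2L)
count(rdd)   # should return 100 if the local setup is working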

If the insufficient-memory problem persists, you might want to try adding SPARK_WORKER_MEMORY=2g to the file tools/spark-0.9.0-incubating-bin-hadoop2/conf/spark-env.sh (not sure if this helps yet???)
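
For reference, a sketch of what that addition to spark-env.sh could look like (SPARK_WORKER_MEMORY is a standard Spark setting; the worker has to be restarted for it to take effect):

# in conf/spark-env.sh - memory a worker makes available to executors
export SPARK_WORKER_MEMORY=2g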


