Spark Setup with Scala


A memo for future reference.

【OS】
This guide uses CentOS 6.6 (x86_64). See the following for details:
http://centos.server-manual.com/
Preparation
The packages required for the setup must be installed in advance; complete all of the steps below.
These steps modify the system, so administrator privileges are required. su to root before starting.
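(For reference, switching to root can be done with the following; this assumes you know the root password.)

su -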

【YUM Package Management】
yum -y install yum-plugin-fastestmirror
yum -y update
yum -y groupinstall "Base" "Development tools" "Japanese Support"
[Add the RPMforge repository]
rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
rpm -ivh http://apt.sw.be/redhat/el6/en/x86_64/rpmforge/RPMS/rpmforge-release-0.5.3-1.el6.rf.x86_64.rpm
[Add the EPEL repository]
rpm --import http://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-6
rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
[Add the ELRepo repository]
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-6-6.el6.elrepo.noarch.rpm
[Add the Remi repository]
rpm --import http://rpms.famillecollet.com/RPM-GPG-KEY-remi
rpm -ivh http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
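As an optional check, you can confirm that the new repositories are registered (repository IDs may vary slightly depending on the release packages):

yum repolist
# the list should now include entries such as rpmforge, epel, elrepo and remi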

【Disable SELinux】
getenforce
Enforcing ← SELinux is enabled
setenforce 0
getenforce
Permissive ← SELinux is now disabled (for the current session)
vi /etc/sysconfig/selinux
SELINUX=enforcing
SELINUX=disabled ← change this line (disables SELinux at boot)
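As an optional check, confirm that the boot-time setting was actually changed:

grep ^SELINUX= /etc/sysconfig/selinux
SELINUX=disabled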

【Allow HTTP through iptables】
vi /etc/sysconfig/iptables
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT ← add this line
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
Restart iptables:
service iptables restart
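As an optional check, list the active rules and make sure the rule for port 80 was loaded:

service iptables status
# look for an ACCEPT entry with tcp dpt:80 in the INPUT chain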

【JAVA】
Uninstall the Java version that was installed by default during the CentOS installation.
yum erase java*
Download the latest JDK (RPM version) from the web and install it:
rpm -ivh jdk-8u45-linux-x64.rpm
Check the version:
java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
■ Set JAVA_HOME
vi /etc/profile


export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
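To apply the settings to the current shell and make a quick check (the path below is the symlink created by the JDK RPM, which the export above points to):

source /etc/profile
echo $JAVA_HOME
/usr/java/default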

【Prerequisites】
Spark will be run in standalone mode, so no build from source is performed this time (the prebuilt binary is used).

【Scala】
cd /usr/local/src
wget http://www.scala-lang.org/files/archive/scala-2.11.7.tgz
tar -zxvf scala-2.11.7.tgz
chown -R root:root scala-2.11.7
mv scala-2.11.7 ../scala
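As an optional check right after the install, run the interpreter by its full path (PATH is only updated once the environment variables are added below):

/usr/local/scala/bin/scala -version
# should report Scala version 2.11.7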

【Spark】
wget http://ftp.riken.jp/net/apache/spark/spark-1.4.0/spark-1.4.0-bin-cdh4.tgz
tar -zxvf spark-1.4.0-bin-cdh4.tgz
chown -R root:root spark-1.4.0-bin-cdh4
mv spark-1.4.0-bin-cdh4 ../spark

Append the environment variables:
vi /etc/profile


export SCALA_HOME=/usr/local/scala
export SPARK_HOME=/usr/local/spark
export PATH=$SCALA_HOME/bin:$PATH

source /etc/profile

Verify:


cd $SPARK_HOME
./bin/spark-shell
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    14/10/01 05:53:08 INFO SecurityManager: Changing view acls to: hdspark,
    14/10/01 05:53:08 INFO SecurityManager: Changing modify acls to: hdspark,
    14/10/01 05:53:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdspark, ); users with modify permissions: Set(hdspark, )
    14/10/01 05:53:08 INFO HttpServer: Starting HTTP Server
    14/10/01 05:53:09 INFO Utils: Successfully started service 'HTTP class server' on port 33066.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
          /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
14/10/01 05:53:22 INFO SecurityManager: Changing view acls to: hdspark,
14/10/01 05:53:22 INFO SecurityManager: Changing modify acls to: hdspark,
14/10/01 05:53:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdspark, ); users with modify permissions: Set(hdspark, )
14/10/01 05:53:24 INFO Slf4jLogger: Slf4jLogger started
14/10/01 05:53:24 INFO Remoting: Starting remoting
14/10/01 05:53:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@localhost:36288]
14/10/01 05:53:25 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@localhost:36288]
14/10/01 05:53:25 INFO Utils: Successfully started service 'sparkDriver' on port 36288.
14/10/01 05:53:25 INFO SparkEnv: Registering MapOutputTracker
14/10/01 05:53:25 INFO SparkEnv: Registering BlockManagerMaster
14/10/01 05:53:25 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141001055325-22ac
14/10/01 05:53:26 INFO Utils: Successfully started service 'Connection manager for block manager' on port 56196.
14/10/01 05:53:26 INFO ConnectionManager: Bound socket to port 56196 with id = ConnectionManagerId(localhost,56196)
14/10/01 05:53:26 INFO MemoryStore: MemoryStore started with capacity 267.3 MB
14/10/01 05:53:26 INFO BlockManagerMaster: Trying to register BlockManager
14/10/01 05:53:26 INFO BlockManagerMasterActor: Registering block manager localhost:56196 with 267.3 MB RAM
14/10/01 05:53:26 INFO BlockManagerMaster: Registered BlockManager
14/10/01 05:53:26 INFO HttpFileServer: HTTP File server directory is /tmp/spark-a33f43d9-37da-4c9e-a0b8-71b117b37012
14/10/01 05:53:26 INFO HttpServer: Starting HTTP Server
14/10/01 05:53:26 INFO Utils: Successfully started service 'HTTP file server' on port 54714.
14/10/01 05:53:27 INFO Utils: Successfully started service 'SparkUI' on port 4040.
14/10/01 05:53:27 INFO SparkUI: Started SparkUI at http://localhost:4040
14/10/01 05:53:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/10/01 05:53:29 INFO Executor: Using REPL class URI: http://localhost:33066
14/10/01 05:53:29 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@localhost:36288/user/HeartbeatReceiver
14/10/01 05:53:30 INFO SparkILoop: Created spark context..
Spark context available as sc.


scala>



// Run a simple line count

scala> val txtFile = sc.textFile("README.md")
    14/10/01 05:56:17 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
    14/10/01 05:56:17 INFO MemoryStore: ensureFreeSpace(156973) called with curMem=0, maxMem=280248975
    14/10/01 05:56:17 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.3 KB, free 267.1 MB)
    txtFile: org.apache.spark.rdd.RDD[String] = ../README.md MappedRDD[1] at textFile at <console>:12




scala> txtFile.count()
    14/10/01 05:56:29 INFO FileInputFormat: Total input paths to process : 1
    14/10/01 05:56:29 INFO SparkContext: Starting job: count at <console>:15
    14/10/01 05:56:29 INFO DAGScheduler: Got job 0 (count at <console>:15) with 1 output partitions (allowLocal=false)
    14/10/01 05:56:29 INFO DAGScheduler: Final stage: Stage 0(count at <console>:15)
    14/10/01 05:56:29 INFO DAGScheduler: Parents of final stage: List()
    14/10/01 05:56:29 INFO DAGScheduler: Missing parents: List()
    14/10/01 05:56:29 INFO DAGScheduler: Submitting Stage 0 (../README.md MappedRDD[1] at textFile at <console>:12), which has no missing parents
    14/10/01 05:56:29 INFO MemoryStore: ensureFreeSpace(2384) called with curMem=156973, maxMem=280248975
    14/10/01 05:56:29 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 267.1 MB)
    14/10/01 05:56:29 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (../README.md MappedRDD[1] at textFile at <console>:12)
    14/10/01 05:56:29 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
    14/10/01 05:56:29 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1207 bytes)
    14/10/01 05:56:29 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
    14/10/01 05:56:29 INFO HadoopRDD: Input split: file:/usr/local/spark/README.md:0+4811
    14/10/01 05:56:29 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
    14/10/01 05:56:29 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
    14/10/01 05:56:29 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
    14/10/01 05:56:29 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
    14/10/01 05:56:29 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
    14/10/01 05:56:30 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1731 bytes result sent to driver
    14/10/01 05:56:30 INFO DAGScheduler: Stage 0 (count at <console>:15) finished in 0.462 s
    14/10/01 05:56:30 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 423 ms on localhost (1/1)
    14/10/01 05:56:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
    14/10/01 05:56:30 INFO SparkContext: Job finished: count at <console>:15, took 0.828128221 s
    res0: Long = 141

// Success!
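For reference, a few more simple operations can be tried in the same shell against the same RDD (a minimal sketch; log output omitted, and the results depend on the contents of README.md):

scala> txtFile.filter(line => line.contains("Spark")).count()   // lines mentioning "Spark"
scala> txtFile.first()                                          // first line of the file
scala> txtFile.flatMap(line => line.split(" ")).count()         // rough word count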