6 node Raspberry Pi 3 Spark Lego Cluster

Tested on:
MacOS X Sierra 10.12.4
Raspbian Jessie lite 2017-04-10
Spark 2.1.0

–==[Initial setup]==–
1. Download:
https://downloads.raspberrypi.org/raspbian_lite_latest.torrent
2. Use Etcher at https://etcher.io/ to burn images.
3. Create empty file ‘ssh’ in /boot on each SD card to enable ssh:
$ touch ssh
4. On the mac, enable Internet Sharing for the ethernet LAN.
5. Check the mac’s IP (interface bridge100):
$ ifconfig
6. Scan IPs:
$ sudo nmap -sn -PE 192.168.2.0/24
(install nmap via brew; the IP address in the command is the mac's IP with the
last octet set to 0)
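Run as root, nmap also prints each host's MAC vendor, which usually identifies
the Pis in the scan output; a quick filter (a sketch, assuming the same
192.168.2.0/24 range and that the Pis' MACs resolve to the Raspberry Pi vendor string):
$ sudo nmap -sn -PE 192.168.2.0/24 | grep -B 2 -i raspberry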
7. ssh pi@192.168.2.2 with password: raspberry
If getting "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!" (from previous
ssh keys) run:
$ ssh-keygen -R 192.168.2.2
and ssh again.
8. Configure the RPis by
$ sudo raspi-config
a. boot to console
b. change hostname
c. fill partition
d. ram split for GPU to minimum
9. side note:
$ cat /proc/cpuinfo
can tell which RPi version it is (the last three lines show the revision; a higher
revision such as a02082 indicates an RPi 3)
10. To reduce power consumption on RPi3 turn off wifi and bluetooth by
blacklisting corresponding kernel modules and reboot:
add to /etc/modprobe.d/raspi-blacklist.conf
#wifi
blacklist brcmfmac
blacklist brcmutil
#bt
blacklist btbcm
blacklist hci_uart
11. Check the module list:
$ lsmod
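After the reboot, a quick check that the blacklist took effect (the grep should
return nothing):
$ lsmod | grep -E 'brcmfmac|brcmutil|btbcm|hci_uart'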
–==[Networking]==–
I. Simplistic approach with mac as a router.
NOTE: The problem with this approach is that it relies on the mac acting as the
router at a known IP, which makes it awkward to connect the cluster to another
machine later. For that we need a DHCP server inside the cluster, on a node with
two ethernet ports (see II.)

1. Network configuration (RPis can access internet through laptop):
assign static IPs by adding the following lines to /etc/dhcpcd.conf:

interface eth0
static ip_address=192.168.2.3/24 # IP of the host RPi
static routers=192.168.2.1 # IP of the laptop
static domain_name_servers=192.168.2.1 # IP of the laptop

2. Without real DNS populate /etc/hosts (on the laptop also) by
#IP hostname
192.168.2.1 Borislavs-MacBook-Pro
192.168.2.2 rpi2
192.168.2.3 rpi3
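
With /etc/hosts populated, name resolution can be checked from the laptop or any
node, e.g.:
$ ping -c 1 rpi3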

II. Configuration with one node in the cluster acting as a DHCP server/gate.
see
http://makezine.com/projects/build-a-compact-4-node-raspberry-pi-cluster/
–==[Spark standalone cluster configuration]==–
0. Install Java 8.
The Oracle JDK is almost twice as fast as OpenJDK:
$ sudo apt-get -y install oracle-java8-jdk

1. creating user ‘spark’ and group ‘spark’ on each node:
$ sudo addgroup spark
$ sudo adduser --ingroup spark spark
$ sudo adduser spark sudo

Note: to delete a user:
$ sudo deluser --remove-home tecmint

2. Switch to user spark:
$ su spark
$ cd ~
NOTE: now we always log in as spark user!

3. setting up passwordless ssh
on the master node run:
$ ssh-keygen
with an empty passphrase.
Copy ssh key to every other machine:
$ ssh-copy-id spark@rpi2
Note: copy the key to the master itself if planning to have workers on the
master node.
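
With more nodes the key can be copied in a loop (a sketch, assuming the worker
hostnames from /etc/hosts, here rpi2 and rpi3; each host still asks for the
spark password once):

$ for host in rpi2 rpi3; do ssh-copy-id spark@$host; done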

4. copy spark distribution to RPis (in parallel, install pssh):

$ parallel-scp -h workers -r -p 100 -t 0 spark-2.1.0-bin-hadoop2.7.tgz /home/spark/

where 'workers' is a file with one worker per line, in the form [user@]host[:port].
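
For example, with the two workers used above, 'workers' could simply contain:

spark@rpi2
spark@rpi3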

or

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz

i.e. the download link from the Spark website. Make sure it is a pre-built binary
package; compiling from source takes a very long time on the Pi.

5. unpack spark distribution on each node:

$ tar -xzvf spark-2.1.0-bin-hadoop2.7.tgz
$ mv spark-2.1.0-bin-hadoop2.7 spark-2.1.0
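
With pssh installed, the unpacking can be done on all nodes at once (a sketch,
assuming the 'workers' file from step 4 and the tarball sitting in /home/spark):

$ parallel-ssh -h workers -t 0 "cd /home/spark && tar -xzf spark-2.1.0-bin-hadoop2.7.tgz && mv spark-2.1.0-bin-hadoop2.7 spark-2.1.0"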

6. Spark configuration on the master node. create file ‘slaves’ in spark-2.1.0/conf/
where we list all workers, i.e.

rpi2
rpi3

7. On all nodes add to spark-2.1.0/conf/spark-env.sh:
SPARK_MASTER_HOST=192.168.2.3
where 192.168.2.3 is the IP of the master node (here we assume rpi3 is the master
node with IP 192.168.2.3).

8. Spark memory config options. Add to spark-2.1.0/conf/spark-env.sh:

SPARK_EXECUTOR_MEMORY=500M
SPARK_DRIVER_MEMORY=500M
SPARK_WORKER_MEMORY=500M
SPARK_DAEMON_MEMORY=500M
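
Rather than editing spark-env.sh on every node by hand, the file can be edited on
the master and pushed out with pssh (a sketch, again assuming the 'workers' file
from step 4):

$ parallel-scp -h workers spark-2.1.0/conf/spark-env.sh /home/spark/spark-2.1.0/conf/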

9. On the master node configure pyspark.
Add environment variables (temporarily with export, or permanently by adding them
to ~/.bashrc and then running . .bashrc) so that we can do 'import pyspark'
in python (Note: if it is a separate bash file, then use export and quotes
for "/home/spark/spark-2.1.0"):

export SPARK_HOME=/home/spark/spark-2.1.0
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH

10. Check that pyspark module is available:
$ python
>>> import pyspark
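
A slightly stronger check is printing the version (pyspark 2.x exposes
pyspark.__version__); with the PYTHONPATH from step 9 active, this should print 2.1.0:
$ python -c "import pyspark; print(pyspark.__version__)"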

–==[Installing and configuring jupyter]==–
1. Install and update pip
$ sudo apt-get install -y python-pip python-dev
$ sudo -H pip install --upgrade pip

2. Install jupyter on the master node:
$ sudo -H pip install jupyter

3. remote access to jupyter notebook:

option A: ssh tuneling, see
http://kawahara.ca/how-to-run-an-ipythonjupyter-notebook-on-a-remote-machine/

option B: configure jupyter for remote access:
see http://jupyter-notebook.readthedocs.io/en/latest/public_server.html
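
For option A, a minimal sketch of the tunnel (assuming jupyter will listen on
port 8888 on rpi3, as configured below):

$ ssh -N -L 8888:localhost:8888 spark@rpi3

and then point the laptop's browser at http://localhost:8888.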

Option B:
a. Generate ~/.jupyter/jupyter_notebook_config.py by:
$ jupyter notebook --generate-config

generate and copy hashed password by the following python snippet:

In [1]: from notebook.auth import passwd
In [2]: passwd()
sha1:9fccb3141886:81e622d22fc82d4b781143deb46bfe5f1c1f083b

b. Add to ~/.jupyter/jupyter_notebook_config.py:

c.NotebookApp.ip = '*'
c.NotebookApp.password = u'sha1:bcd259ccf…'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888

c. Run the server by

$ jupyter notebook

and access it from a web browser at http://192.168.2.3:8888 or http://rpi3:8888.

NOTE:
a. It is better to run jupyter inside tmux (see the tmux section below).
b. It may be worth restricting which IPs are allowed to connect.
–==[Launching spark cluster and jupyter notebook]==–
1. On the master node run the cluster by

$ spark-2.1.0/sbin/start-all.sh
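
Before launching the notebook it is worth checking that the daemons are up and
that a job actually runs on the cluster (a sketch; jps ships with the JDK, and
pi.py ships with the Spark binary distribution):

$ jps
$ spark-2.1.0/bin/spark-submit --master spark://192.168.2.3:7077 spark-2.1.0/examples/src/main/python/pi.py 10

jps on the master should list a Master process (and a Worker, if one runs there);
the master web UI at http://192.168.2.3:8080 should show the registered workers.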

2. Launch jupyter:
$ jupyter notebook

3. In the notebook create SparkContext by:
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setAppName('Test')
conf.setMaster('spark://192.168.2.3:7077')
conf.set("spark.executor.memory", "500M")
sc = SparkContext(conf=conf)

NOTE: spark-env.sh defines SPARK_WORKER_MEMORY, the total memory a worker can
hand out to its executors; per-application defaults such as spark.executor.memory
can also be set in spark-2.1.0/conf/spark-defaults.conf.

4. Monitor the running application at http://192.168.2.3:4040/ (the Spark master
web UI is at http://192.168.2.3:8080/).

5. After finishing work, stop the current SparkContext by:
sc.stop()

and stop the cluster by

$ spark-2.1.0/sbin/stop-all.sh

IMPORTANT:
1. WITHOUT HDFS WE NEED TO KEEP A COPY OF ANY DATA FILE WE WANT TO READ ON EACH
WORKER. THEN WE CAN READ IT BY PREFIXING THE FILENAME WITH:
file:///
(SEE THE EXAMPLE AFTER THIS LIST)
2. CHECK WHETHER PYSPARK AND THE SAME PYTHON LIBRARIES NEED TO BE AVAILABLE ON THE WORKERS.
3. CHECK WHERE IT WRITES FILES.
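
One way to keep those per-worker copies in sync is to push the file from the
master with pssh (a sketch, assuming the 'workers' file from above and a
hypothetical data file /home/spark/data.csv):

$ parallel-scp -h workers /home/spark/data.csv /home/spark/

and then read it in the notebook as file:///home/spark/data.csv.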

–==[tmux]==–
In order to keep jobs active even if the network is not available/ssh fails use
tmux.
1. Install tmux by:
$ sudo apt-get install tmux
2. Launching the job:
$ tmux
$ <the long-running command, e.g. jupyter notebook>
then detach with ctrl+b d
3. can safely unplug ethernet/terminate ssh
4. list available sessions:
$ tmux ls
5. Attach to a session:
$ tmux a -t <session number from 'tmux ls'>
(e.g. for '0: 1 windows ...' the session number is 0)

Upcoming:

  1. running commands via ssh in parallel
  2. some Spark benchmarks
  3. Lego case details
