GEOFBOT/ Setting up a Flink cluster.md

## Setting up a Flink cluster.md

      
NOTE: HDFS is required for Flink's DistributedCache, which distributes Python plans to the worker nodes. We use BlueData Hadoop CDH nodes.
Make sure your Python plans do not call env.execute(local=True), which runs the plan locally instead of on the cluster!
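One quick way to catch a leftover local=True before submitting a job is to search the plan files for it. The following is only a sketch: the function name `find_local_execute` and the idea of keeping all plans under one directory are assumptions made for illustration, not part of the original setup.

```shell
# Hypothetical helper: list Python plan lines that still request local execution.
# Usage: find_local_execute <directory-with-plans>
find_local_execute() {
    dir="${1:-.}"
    # -r: recurse, -n: print line numbers, -E: extended regular expressions.
    # Matches e.g. env.execute(local=True) and env.execute( local = True ).
    grep -rnE --include='*.py' 'execute\(.*local[[:space:]]*=[[:space:]]*True' "$dir"
}
```

grep exits with status 0 when it finds a match, so `find_local_execute ~/plans && echo "fix these first"` prints the warning only when something needs fixing.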
On the master node:


Install git and other useful things that we like
sudo yum install git bzip2 -y


Install JDK 8 and Anaconda Python
cd ~
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-8u92-linux-x64.rpm"
sudo yum localinstall jdk-8u92-linux-x64.rpm -y
rm ~/jdk-8u92-linux-x64.rpm
sudo alternatives --set java /usr/java/jdk1.8.0_92/jre/bin/java

cd ~
wget http://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh
sudo bash Anaconda3-4.1.1-Linux-x86_64.sh -b -p /opt/anaconda3
echo 'export PATH=/opt/anaconda3/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
rm Anaconda3-4.1.1-Linux-x86_64.sh


Set up Apache Maven
cd ~
wget http://apache.osuosl.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.zip
unzip apache-maven-3.3.9-bin.zip
rm ~/apache-maven-3.3.9-bin.zip
sudo mv ~/apache-maven-3.3.9/ /opt/maven
sudo ln -s /opt/maven/bin/mvn /usr/bin/mvn


Build Apache Flink
# git clone https://github.com/GEOFBOT/flink flink-src
# cd flink-src
# # BlueData uses CDH 5.4.3 which has Hadoop 2.3.0+. We don't need to specify Hadoop version
# # because Flink uses Hadoop 2.3.0+ by default anyways.
# mvn clean install -DskipTests
# ln -s ~/flink-src/build-target ~/flink
wget https://github.com/GEOFBOT/flink/releases/download/iteration/flink-bulkiterations.tgz
tar xzvf flink-bulkiterations.tgz
rm flink-bulkiterations.tgz


Set up Flink config files
cd ~/flink/conf/
# Modified configuration file
wget -O flink-conf.yaml https://gist.github.com/geofbot/3ffc9b21214174ae750cc3fdb2625b71/raw/flink-conf.yaml
echo >> flink-conf.yaml # Github Gist strips ending newline?
# Lists of master and worker node IPs
wget -O masters https://gist.github.com/geofbot/3ffc9b21214174ae750cc3fdb2625b71/raw/masters
echo >> masters
wget -O slaves https://gist.github.com/geofbot/3ffc9b21214174ae750cc3fdb2625b71/raw/slaves
echo >> slaves
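The `echo >>` lines above append a newline whether or not the downloaded file already ends with one. If you would rather add it only when it is missing, a conditional version could look like the sketch below. The function name `ensure_trailing_newline` is made up for this example.

```shell
# Hypothetical helper: append a newline only when the file's last byte is not one.
# Usage: ensure_trailing_newline <file>
ensure_trailing_newline() {
    file="$1"
    # Command substitution strips a trailing newline, so the result is
    # non-empty only if the last byte is some other character.
    if [ -n "$(tail -c 1 "$file")" ]; then
        echo >> "$file"
    fi
}

# Example call, run from ~/flink/conf/ after the downloads:
#   for f in flink-conf.yaml masters slaves; do ensure_trailing_newline "$f"; done
```

Calling it twice on the same file is harmless, because the second call sees the newline added by the first.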


Repeat process for each worker


Start Flink on the master node
~/flink/bin/start-cluster.sh
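Once the cluster is up, the web monitor should be reachable on the port set by jobmanager.web.port. As a rough sketch, the URL can be derived from flink-conf.yaml itself. The helper names `flink_conf_get` and `flink_web_url` are invented for this example, the parser only understands flat `key: value` lines, and it assumes the web monitor is served from the address in jobmanager.rpc.address.

```shell
# Hypothetical helper: print the value of a flat "key: value" entry.
# Usage: flink_conf_get <key> <path-to-flink-conf.yaml>
flink_conf_get() {
    key="$1"
    file="$2"
    # Skip comment lines, split each remaining line at its first colon,
    # and print the trimmed value of the first line whose key matches.
    awk -v k="$key" '
        /^[[:space:]]*#/ { next }
        {
            i = index($0, ":")
            if (i == 0) next
            name = substr($0, 1, i - 1)
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", name)
            if (name == k) {
                val = substr($0, i + 1)
                gsub(/^[[:space:]]+|[[:space:]]+$/, "", val)
                print val
                exit
            }
        }
    ' "$file"
}

# Hypothetical helper: build the web monitor URL from the two settings.
flink_web_url() {
    conf="$1"
    echo "http://$(flink_conf_get jobmanager.rpc.address "$conf"):$(flink_conf_get jobmanager.web.port "$conf")"
}
```

With the configuration from this gist, `flink_web_url ~/flink/conf/flink-conf.yaml` would print the address to open in a browser or pass to curl.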


Set up private keys so that the master node can access the worker nodes
I used the following ~/.ssh/config:
Host 172.17.77.*
    User bluedata
    IdentityFile <key location>
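Before starting the cluster it may be worth confirming that the master can actually log in to every worker without a password prompt. The sketch below loops over a host list such as the slaves file. The function name `check_workers` and the `SSH_BIN` override (added only so the loop can be exercised without a network) are assumptions for illustration.

```shell
# Hypothetical helper: try a non-interactive login on every host in a list.
# Usage: check_workers <hosts-file>
check_workers() {
    hosts_file="$1"
    ssh_bin="${SSH_BIN:-ssh}"   # assumption: override hook for testing the loop
    failed=0
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while read -r host || [ -n "$host" ]; do
        # Skip blank lines and comments.
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # -n keeps ssh from consuming the rest of the host list on stdin.
        if "$ssh_bin" -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done < "$hosts_file"
    return "$failed"
}
```

Running `check_workers ~/flink/conf/slaves` on the master relies on the ~/.ssh/config entry above to supply the user name and key.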


## autosetup.fish
# run setup script on each worker node
cd ~
wget https://gist.github.com/GEOFBOT/3ffc9b21214174ae750cc3fdb2625b71/raw/slaves
for ip in (cat slaves)
    ssh bluedata@$ip "curl -L https://gist.github.com/GEOFBOT/3ffc9b21214174ae750cc3fdb2625b71/raw/quicksetup.sh | sh" &
end
rm slaves

## AWS Notes.md

      
Make sure the cluster's security group allows all traffic between its nodes so that they can exchange data. Set up an SSH key on the master node. Use only AWS internal IPs in the list of worker nodes (slaves).
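To guard against a public address slipping into the worker list, the entries can be checked against the private IPv4 ranges. This is a sketch under the assumption that the VPC uses RFC 1918 address ranges, which is common but not guaranteed. The function names are made up, and the patterns do not validate that each octet is a number.

```shell
# Hypothetical helper: succeed only for RFC 1918 private IPv4 addresses.
is_private_ipv4() {
    case "$1" in
        10.*.*.*) return 0 ;;
        192.168.*.*) return 0 ;;
        # 172.16.0.0/12 covers second octets 16 through 31.
        172.1[6-9].*.*|172.2[0-9].*.*|172.3[01].*.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print every entry of a host list that is not private.
# Usage: list_public_entries <hosts-file>
list_public_entries() {
    while read -r host || [ -n "$host" ]; do
        # Skip blank lines and comments.
        case "$host" in ''|'#'*) continue ;; esac
        is_private_ipv4 "$host" || echo "$host"
    done < "$1"
}
```

An empty result from `list_public_entries slaves` means every listed worker uses an address in one of the private ranges.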

  
## convert_to_CentOS.sh
#!/bin/bash
# Convert CDH BlueData node to CentOS so we can have nice packages

yum clean all
mkdir ~/centos; cd ~/centos
wget http://mirror.centos.org/centos/6.8/os/x86_64/RPM-GPG-KEY-CentOS-6
wget http://mirror.centos.org/centos/6.8/os/x86_64/Packages/centos-release-6-8.el6.centos.12.3.x86_64.rpm
wget http://mirror.centos.org/centos/6.8/os/x86_64/Packages/yum-3.2.29-73.el6.centos.noarch.rpm
wget http://mirror.centos.org/centos/6.8/os/x86_64/Packages/yum-utils-1.1.30-37.el6.noarch.rpm
wget http://mirror.centos.org/centos/6.8/os/x86_64/Packages/yum-plugin-fastestmirror-1.1.30-37.el6.noarch.rpm
rpm --import RPM-GPG-KEY-CentOS-6
rpm -e --nodeps redhat-release-server
rpm -Uhv --force --nodeps *.rpm
yum upgrade -y
# -y avoids an interactive prompt, which would otherwise read from the piped script
yum remove subscription-manager -y

cd ~
rm -r centos

## flink-conf.yaml
################################################################################
#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
# limitations under the License.
################################################################################


#==============================================================================
# Common
#==============================================================================

# JAVA_HOME
env.java.home: /usr/java/jdk1.8.0_92

# The host on which the JobManager runs. Only used in non-high-availability mode.
# The JobManager process will use this hostname to bind the listening servers to.
# The TaskManagers will try to connect to the JobManager on that host.

jobmanager.rpc.address: 172.17.77.20


# The port where the JobManager's main actor system listens for messages.

jobmanager.rpc.port: 6123


# The heap size for the JobManager JVM

jobmanager.heap.mb: 256


# The heap size for the TaskManager JVM

taskmanager.heap.mb: 512


# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.

taskmanager.numberOfTaskSlots: 1

# Specify whether TaskManager memory should be allocated when starting up (true) or when
# memory is required in the memory manager (false)

taskmanager.memory.preallocate: false

# The parallelism used for programs that did not specify any other parallelism.

parallelism.default: 1


#==============================================================================
# Web Frontend
#==============================================================================

# The port under which the web-based runtime monitor listens.
# A value of -1 deactivates the web server.

jobmanager.web.port: 8081

# Flag to specify whether job submission is enabled from the web-based
# runtime monitor. Uncomment to disable.

#jobmanager.web.submit.enable: false


#==============================================================================
# Streaming state checkpointing
#==============================================================================

# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends: jobmanager, filesystem, <class-name-of-factory>
#
#state.backend: filesystem


# Directory for storing checkpoints in a Flink-supported filesystem
# Note: State backend must be accessible from the JobManager and all TaskManagers.
# Use "hdfs://" for HDFS setups, "file://" for UNIX/POSIX-compliant file systems,
# (or any local file system under Windows), or "S3://" for S3 file system.
#
# state.backend.fs.checkpointdir: hdfs://namenode-host:port/flink-checkpoints


#==============================================================================
# Advanced
#==============================================================================

# The number of buffers for the network stack.
#
# taskmanager.network.numberOfBuffers: 2048


# Directories for temporary files.
#
# Add a delimited list for multiple directories, using the system directory
# delimiter (colon ':' on unix) or a comma, e.g.:
#     /data1/tmp:/data2/tmp:/data3/tmp
#
# Note: Each directory entry is read from and written to by a different I/O
# thread. You can include the same directory multiple times in order to create
# multiple I/O threads against that directory. This is for example relevant for
# high-throughput RAIDs.
#
# If not specified, the system-specific Java temporary directory (java.io.tmpdir
# property) is taken.
#
# taskmanager.tmp.dirs: /tmp


# Path to the Hadoop configuration directory.
#
# This configuration is used when writing into HDFS. Unless specified otherwise,
# HDFS file creation will use HDFS default settings with respect to block-size,
# replication factor, etc.
#
# You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml
# via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'.
#
fs.hdfs.hadoopconf: /etc/hadoop/conf/


#==============================================================================
# Master High Availability (required configuration)
#==============================================================================

# The list of ZooKeeper quorum peers that coordinate the high-availability
# setup. This must be a list of the form:
# "host1:clientPort,host2[:clientPort],..." (default clientPort: 2181)
#
# recovery.mode: zookeeper
#
# recovery.zookeeper.quorum: localhost:2181,...
#
# Note: You need to set the state backend to 'filesystem' and the checkpoint
# directory (see above) before configuring the storageDir.
#
# recovery.zookeeper.storageDir: hdfs:///recovery

## masters
172.17.77.20:8081

## quicksetup.sh
#!/bin/sh

### Setup script that automates commands listed in the Markdown file

### ON EACH MASTER AND WORKER NODE:

JDK_VER_MAJ=8
JDK_VER_MIN=92

## Enable CentOS repositories if this node is still RedHat
if rpm -q redhat-release-server; then
  curl -L https://gist.github.com/geofbot/3ffc9b21214174ae750cc3fdb2625b71/raw/convert_to_CentOS.sh | sudo bash
fi

## Install git and bzip2
sudo yum install git bzip2 -y

## Install JDK 8 and Anaconda Python
if [ ! -d "/usr/java/jdk1.${JDK_VER_MAJ}.0_${JDK_VER_MIN}" ]; then
  cd ~
  wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-${JDK_VER_MAJ}u${JDK_VER_MIN}-linux-x64.rpm"
  sudo yum localinstall jdk-${JDK_VER_MAJ}u${JDK_VER_MIN}-linux-x64.rpm -y
  rm ~/jdk-${JDK_VER_MAJ}u${JDK_VER_MIN}-linux-x64.rpm
  sudo alternatives --set java /usr/java/jdk1.${JDK_VER_MAJ}.0_${JDK_VER_MIN}/jre/bin/java
fi

if ! which python3; then
  wget http://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh
  sudo bash Anaconda3-4.1.1-Linux-x86_64.sh -b -p /opt/anaconda3
  echo 'export PATH=/opt/anaconda3/bin:$PATH' >> ~/.bashrc
  . ~/.bashrc  # POSIX spelling of 'source', since this script is run with sh
  rm Anaconda3-4.1.1-Linux-x86_64.sh
fi

## Set up Apache Maven

if [ ! -d "/opt/maven" ]; then
  cd ~
  wget http://apache.osuosl.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.zip
  unzip apache-maven-3.3.9-bin.zip
  rm ~/apache-maven-3.3.9-bin.zip
  sudo mv ~/apache-maven-3.3.9/ /opt/maven
  sudo ln -s /opt/maven/bin/mvn /usr/bin/mvn
fi
## Build Apache Flink

# git clone https://github.com/GEOFBOT/flink flink-src
# cd flink-src
# # BlueData uses CDH 5.4.3 which has Hadoop 2.3.0+. We don't need to specify Hadoop version
# # because Flink uses Hadoop 2.3.0+ by default anyways.
# mvn clean install -DskipTests
# ln -s ~/flink-src/build-target ~/flink
cd ~
if [ ! -d "flink" ]; then
  wget https://github.com/GEOFBOT/flink/releases/download/iteration/flink-bulkiterations.tgz
  tar xzvf flink-bulkiterations.tgz
  rm flink-bulkiterations.tgz
fi

## Set up Flink config files

mkdir -p ~/flink/tmp
cd ~/flink/conf/
# Modified configuration file
wget -O flink-conf.yaml https://gist.github.com/geofbot/3ffc9b21214174ae750cc3fdb2625b71/raw/flink-conf.yaml
echo >> flink-conf.yaml # No trailing newline from wget?
# Lists of master and worker node IPs
wget -O masters https://gist.github.com/geofbot/3ffc9b21214174ae750cc3fdb2625b71/raw/masters
echo >> masters
wget -O slaves https://gist.github.com/geofbot/3ffc9b21214174ae750cc3fdb2625b71/raw/slaves
echo >> slaves

## Repeat process for each worker
echo Repeat this process on the master node and on each worker node.

## Set up private keys
echo Remember to set up .ssh/config on the master node so that it can control the workers.

## Start Flink on the master node

# ~/flink/bin/start-cluster.sh

## slaves
172.17.77.21