$ cat mail.py
import os
import sys
missing_list = sys.argv[1]
f = open('/home/kkadigar/krish/missingfiles.html','ab')
header="""
<!DOCTYPE html>
<html>
Step 1: Download the avro-tools jar
wget http://mirrors.sonic.net/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar
Step 2: Extract the schema from an Avro data file
java -jar avro-tools-1.7.7.jar getschema /home/hdfs/genre1/part-m-00000.avro
Step 3: Import the table from MySQL as Avro data files
sqoop import --connect jdbc:mysql://172.16.2.164/movielens --username hive -P --table genre --as-avrodatafile
This imports the genre data from MySQL into HDFS as .avro files and generates the .avsc schema file on the local filesystem.
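The generated .avsc file is plain JSON, so its fields can be listed without any Avro library. A minimal sketch, assuming a hypothetical genre schema with `id` and `name` fields (the real field names come from the columns of the MySQL table and are not shown here):

```python
import json

# Hypothetical contents of a generated genre.avsc; the real field names
# depend on the columns of the MySQL `genre` table.
SAMPLE_AVSC = """
{
  "type": "record",
  "name": "genre",
  "fields": [
    {"name": "id", "type": ["null", "int"]},
    {"name": "name", "type": ["null", "string"]}
  ]
}
"""


def list_fields(avsc_text):
    """Return (field name, type) pairs from an Avro record schema given as JSON text."""
    schema = json.loads(avsc_text)
    return [(field["name"], field["type"]) for field in schema["fields"]]


# Print one line per field in the sample schema.
for name, avro_type in list_fields(SAMPLE_AVSC):
    print(name, avro_type)
```

To inspect a real schema, read the .avsc file from disk and pass its text to `list_fields` in place of `SAMPLE_AVSC`.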
Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables, which provide a means of attaching structure to data stored in HDFS.
Metadata, such as table schemas, is stored in a database called the metastore.
Hive Installation: Hive requires Hadoop and Sqoop to be installed on the machine before Hive itself is installed.
1. Install the three services: hive, hive-metastore, and hive-server2
yum install hive
yum install hive-metastore
yum install hive-server2
2. Install MySQL server and start MySQL
Pre-Requisites
1. Disable iptables so that daemons on remote machines can reach ports on this system
sudo service iptables stop
sudo chkconfig iptables off
2. Disable SELinux
sudo /usr/sbin/setenforce 0
sudo sed -i.old --follow-symlinks 's/SELINUX=enforcing/SELINUX=disabled/' /etc/sysconfig/selinux
3. Set the hostname (replace [name_of_host] with your system's hostname)
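The SELinux substitution in step 2 above can be previewed on a copy of the config text before the real file is touched. A minimal sketch, assuming a hypothetical three-line sample config; unlike the sed expression, this version anchors the match to a whole line:

```python
import re


def disable_selinux(config_text):
    """Return config_text with a SELINUX=enforcing line changed to SELINUX=disabled."""
    return re.sub(
        r"^SELINUX=enforcing$",
        "SELINUX=disabled",
        config_text,
        flags=re.MULTILINE,
    )


# Hypothetical sample; a real config file contains more comment lines.
SAMPLE_CONFIG = "# sample config\nSELINUX=enforcing\nSELINUXTYPE=targeted\n"

print(disable_selinux(SAMPLE_CONFIG))
```

The function only transforms a string, so it is safe to run anywhere; writing the result back to the config file still needs root privileges.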
# Type parted in the console;
# this takes you into the parted shell
parted
# Select the device to partition using the following command
select devicename
# Example: select /dev/vda
# vda is a virtual (virtio) disk
# sda is a SCSI/SATA disk, typically physical storage
Prerequisites:
1) Set up passwordless SSH on all instances
2) Add all hosts with their FQDNs to the /etc/hosts file on all instances
3) Disable SELinux on all instances
sudo /usr/sbin/setenforce 0
sudo sed -i.old --follow-symlinks 's/SELINUX=enforcing/SELINUX=disabled/' /etc/sysconfig/selinux
4) Disable iptables on all instances so that daemons on other machines can interact
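For the /etc/hosts prerequisite (item 2 above), the entries can be generated once and appended on every instance. A minimal sketch, assuming hypothetical IP addresses and host names; each line follows the usual layout of IP address, FQDN, then short alias:

```python
def hosts_lines(nodes):
    """Format (ip, fqdn) pairs as /etc/hosts lines: IP, FQDN, short alias."""
    lines = []
    for ip, fqdn in nodes:
        short_name = fqdn.split(".")[0]  # text before the first dot
        lines.append(f"{ip}\t{fqdn}\t{short_name}")
    return lines


# Hypothetical cluster nodes; replace with the real addresses and FQDNs.
NODES = [
    ("10.0.0.11", "master1.example.com"),
    ("10.0.0.12", "worker1.example.com"),
]

print("\n".join(hosts_lines(NODES)))
```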