Scala environment setup:
1. Install Scala
1.1 Download Scala (latest version) from http://www.scala-lang.org/
1.2 Uncompress it
1.3 Add the Scala bin folder to the PATH variable
2. Install Eclipse Mars or Luna
anjijava16 / SkipMapper.java
Created December 26, 2016 10:17 — forked from amalgjose/SkipMapper.java
MapReduce program for removing stop words from the given text files. Hadoop's distributed cache and counters are used in this program.
package com.hadoop.skipper;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
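The gist preview is cut off after the imports. A minimal sketch of how such a mapper is typically completed, assuming the stop-word file arrives via the distributed cache and skipped words are tallied with a counter (the class body, counter name, and key/value types below are illustrative reconstructions, not the original code):

package com.hadoop.skipper;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkipMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Illustrative counter name; the original gist may use a different one.
    public enum StopWordCounter { STOPWORDS_SKIPPED }

    private final Set<String> stopWords = new HashSet<String>();
    private final Text word = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the stop-word file shipped via the classic DistributedCache API.
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles != null) {
            for (Path cacheFile : cacheFiles) {
                BufferedReader reader = new BufferedReader(new FileReader(cacheFile.toString()));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        stopWords.add(line.trim().toLowerCase());
                    }
                } finally {
                    reader.close();
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (stopWords.contains(token.toLowerCase())) {
                // Count every skipped stop word instead of emitting it.
                context.getCounter(StopWordCounter.STOPWORDS_SKIPPED).increment(1);
            } else {
                word.set(token);
                context.write(word, NullWritable.get());
            }
        }
    }
}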
http://www.geoinsyssoft.com/rdds-vs-dataframes-apache-spark/
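The linked article contrasts Spark RDDs and DataFrames/Datasets. A minimal, hedged Java sketch of the two APIs over the same data (the local master, app name, and sample values are assumptions for illustration, not from the article):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class RddVsDataFrame {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RddVsDataFrame") // illustrative app name
                .master("local[*]")        // assumption: local demo run
                .getOrCreate();

        List<String> words = Arrays.asList("rdd", "dataframe", "dataset");

        // RDD API: low-level, functional transformations, no schema.
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<String> rdd = jsc.parallelize(words);
        System.out.println("RDD count: " + rdd.filter(w -> w.startsWith("data")).count());

        // Dataset/DataFrame API: schema-aware, optimized by the Catalyst planner.
        Dataset<String> ds = spark.createDataset(words, Encoders.STRING());
        ds.filter((FilterFunction<String>) w -> w.startsWith("data")).show();

        spark.stop();
    }
}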
[-] GIT
https://www.codeschool.com/learn/git
https://cursos.alura.com.br/course/git
https://www.codecademy.com/learn/learn-git
A Git project is made up of three parts (walked through in the sketch after this list):
- Working Directory: where files are edited and deleted
- Staging Area: where we add the files to be committed
- Repository: where commits happen and the latest version is stored.
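A minimal JGit sketch mapping the three areas onto API calls; the /tmp path and file name are made up for illustration, and using JGit (org.eclipse.jgit) at all is an assumption, since the plain git CLI does the same:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.eclipse.jgit.api.Git;

public class GitAreasDemo {
    public static void main(String[] args) throws Exception {
        // Repository: create (or reuse) the .git object store. Path is illustrative.
        File dir = new File("/tmp/git-demo");
        dir.mkdirs();
        Git git = Git.init().setDirectory(dir).call();

        // Working Directory: edit a file on disk.
        Files.write(Paths.get(dir.getPath(), "notes.txt"), "hello git".getBytes());

        // Staging Area: add the file so it becomes part of the next commit.
        git.add().addFilepattern("notes.txt").call();

        // Repository: commit stores the staged snapshot as the latest version.
        git.commit().setMessage("add notes.txt").call();

        git.close();
    }
}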
• DISTINCT and GROUP BY - Use these only when necessary; they degrade performance.
• PARTITION - Partition the table; filtering on the partition column improves performance.
• Rewrite - Do not reuse a query as written for an RDBMS; rewrite it completely for Hive to improve performance.
• Map Split Size - Reduce the map split size to launch more mappers and gain parallelism, which cuts query time.
• Map Join - Map-join small tables so that joining them with a large table takes less time.
• Memory - Tune memory settings based on the queries being run.
• ORC Format - Keep tables in ORC format, which speeds up queries against them.
• Hive Parallel Execution - Enable hive.exec.parallel to run independent job stages in parallel.
• CTAS - Prefer creating managed tables over external tables.
• Data Explosion - Filter the data sets before joining, and make sure there is no cross join between large data sets. (Several of these settings are sketched in the example after this list.)
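A minimal Java/JDBC sketch of a few of these knobs, assuming HiveServer2 is reachable on the default port; the connection URL and the sales_raw/sales_orc table names are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTuningDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumption: HiveServer2 on localhost with no auth; adjust the URL for your cluster.
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Run independent stages of a query in parallel.
        stmt.execute("SET hive.exec.parallel=true");

        // Let the optimizer map-join small tables automatically.
        stmt.execute("SET hive.auto.convert.join=true");

        // Partitioned, ORC-backed managed table; sales_raw is a hypothetical source table.
        stmt.execute("CREATE TABLE sales_orc (id INT, amount DOUBLE) "
                + "PARTITIONED BY (sale_date STRING) STORED AS ORC");

        // Filtering on the partition column prunes partitions instead of scanning everything.
        stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict");
        stmt.execute("INSERT OVERWRITE TABLE sales_orc PARTITION (sale_date) "
                + "SELECT id, amount, sale_date FROM sales_raw WHERE sale_date >= '2017-01-01'");

        stmt.close();
        con.close();
    }
}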
Procedure to install Google Chrome on RHEL/CentOS/Fedora Linux:
Open the Terminal application and grab the 64-bit Google Chrome package.
Type the following command to download the 64-bit version of Google Chrome:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
To install Google Chrome and its dependencies on CentOS/RHEL, type:
sudo yum install ./google-chrome-stable_current_*.rpm
Start Google Chrome from the CLI:
google-chrome &
i) Copy a file into HDFS with an explicit block size (64 MB) and replication factor (4):
hadoop fs -Ddfs.block.size=67108864 -Ddfs.replication=4 -copyFromLocal pom.xml /app/data
ii) Check the block, file, and location details of the copied file:
hdfs fsck -blocks -files -locations /app/data/pom.xml
Output: Connecting to namenode via http://localhost:50070
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path /app/data/pom.xml at Mon Aug 21 23:36:58 IST 2017
/app/data/pom.xml 2617 bytes, 1 block(s): Under replicated BP-806356112-127.0.0.1-1489344343967:blk_1073742850_2027. Target Replicas is 4 but found 1 replica(s).
0. BP-806356112-127.0.0.1-1489344343967:blk_1073742850_2027 len=2617 repl=1 [127.0.0.1:50010]
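The same copy can be done through the HDFS Java API. A minimal sketch, assuming a single-node cluster with fs.defaultFS at hdfs://localhost:9000 (the paths mirror the commands above):

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCopyDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: single-node cluster on the default namenode RPC port.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create the target file with a 64 MB block size and replication factor 4,
        // mirroring -Ddfs.block.size=67108864 -Ddfs.replication=4 above.
        Path target = new Path("/app/data/pom.xml");
        OutputStream out = fs.create(target, true, 4096, (short) 4, 64L * 1024 * 1024);

        InputStream in = Files.newInputStream(Paths.get("pom.xml"));
        IOUtils.copyBytes(in, out, 4096, true); // closes both streams when done

        System.out.println("Replication: " + fs.getFileStatus(target).getReplication());
        fs.close();
    }
}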
http://www3.cs.stonybrook.edu/~youngkwon/cse535/Lecture13_Hadoop_HDFS.pdf
http://www3.cs.stonybrook.edu/~youngkwon/cse535/
http://www3.cs.stonybrook.edu/~youngkwon/cse535/Lecture12_Hadoop_MapReduce.pdf
0) hadoop jar mapReduceUtils-0.1.jar com.iwinner.m_techlearn.hadoop.mapreduce.custom1.TempuratureJob /data/OutputCust/
i) hadoop fs -Ddfs.block.size=67108864 -Ddfs.replication=4 -copyFromLocal pom.xml /app/data
ii) hdfs fsck -blocks -files -locations /app/data/pom.xml
iii) yarn application -list
iv) yarn application -kill <<Application_ID>>
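A driver such as the TempuratureJob above usually follows the standard Job-setup pattern. This is only a hedged sketch of that pattern: the identity Mapper/Reducer and the /data/input path stand in for the real mapReduceUtils classes and input, which are not shown in these notes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TempuratureJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "temperature");
        job.setJarByClass(TempuratureJob.class);

        // Identity classes as placeholders for the real mapper/reducer in the jar.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The gist invocation passes only the output dir; the input path is assumed.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}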