Scala Environment:
1. Install Scala
1.1 Download Scala (latest version) from
1.2 Uncompress it
1.3 Add the Scala bin folder to the PATH variable
2. Install Eclipse Mars or Luna
MapReduce program for removing stop words from the given text files. The program uses the Hadoop distributed cache and counters.
package com.hadoop.skipper;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
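The program listing above is cut off after its imports. As a rough illustration of the filtering step only, here is a minimal plain-Java sketch that runs without Hadoop. The class name StopWordSketch, the hard-coded stop-word list, and the static skippedCount field are all assumptions for illustration: in the real job the list would presumably be loaded from a file shipped through the distributed cache, and the count would be kept in a Hadoop counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

public class StopWordSketch {
    // Hypothetical stop-word list. The original job would read this from a
    // file distributed to each mapper via the Hadoop distributed cache.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    // Stand-in for a Hadoop counter: number of tokens that were skipped.
    static long skippedCount = 0;

    // Splits the line on whitespace, drops stop words (case-insensitive),
    // and returns the remaining tokens joined by single spaces.
    static String removeStopWords(String line) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (STOP_WORDS.contains(token.toLowerCase())) {
                skippedCount++;
            } else {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("The cat is a friend of the dog"));
        System.out.println("skipped=" + skippedCount);
    }
}
```

In a mapper, the same loop would run once per input line, emitting only the kept tokens and incrementing the counter for each skipped one.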
A Git project is made up of 3 parts:
- Working Directory: where files are edited and deleted
- Staging Area: where we add the files to be committed
- Repository: where the commit happens and the latest version is stored.
• DISTINCT and GROUP BY - Use them only when necessary, because they degrade performance.
• PARTITION - Partition the table where possible. Filtering on the partition column improves performance because only the matching partitions are read.
• Rewrite - Do not reuse a query exactly as it was written for an RDBMS. Rewrite it completely to improve performance.
• Map Split Size - Try reducing the map split size. Smaller splits produce more map tasks, which can shorten the time taken by the query.
• Map Join - Use a map join for small tables so that joining them with a large table takes less time.
• Memory - Adjust the memory settings to suit the queries being run.
• ORC Format - Keep tables in ORC format where possible, which speeds up queries on those tables.
• Hive Execute Parallel - Enable parallel execution (hive.exec.parallel) so that independent jobs run in parallel.
• CTAS - Prefer creating managed tables over external tables.
• Data Explosion - Filter each data set before joining, and make sure there is no cross join between large data sets.
Procedure to install Google Chrome 52 on RHEL/CentOS/Fedora Linux:
Here is how to install and use Google Chrome in five easy steps:
Open the Terminal application. Grab the 64-bit Google Chrome package.
Type the following command to download the 64-bit version of Google Chrome:
To install Google Chrome and its dependencies on CentOS/RHEL, type:
sudo yum install ./google-chrome-stable_current_*.rpm
Start Google Chrome from the CLI:
i) hadoop fs -Ddfs.block.size=67108864 -Ddfs.replication=4 -copyFromLocal pom.xml /app/data
ii) hdfs fsck -blocks -files -locations /app/data/pom.xml
Output:
Connecting to namenode via http://localhost:50070
FSCK started by hadoop (auth:SIMPLE) from / for path /app/data/pom.xml at Mon Aug 21 23:36:58 IST 2017
/app/data/pom.xml 2617 bytes, 1 block(s): Under replicated BP-806356112- Target Replicas is 4 but found 1 replica(s).
0. BP-806356112- len=2617 repl=1 []
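The numbers in this output follow from simple arithmetic. The block size passed in step i) is 67108864 bytes (64 MiB), so the 2617-byte pom.xml fits in a single block, and with a requested replication of 4 but only 1 replica found, 3 replicas are missing (presumably because this cluster has a single DataNode, which is an assumption based on the localhost address). A minimal sketch of that calculation, with hypothetical class and method names:

```java
public class BlockMathSketch {
    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    // An empty file occupies no blocks.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Replicas still missing before the block reaches its target replication.
    static int missingReplicas(int targetReplicas, int liveReplicas) {
        return Math.max(0, targetReplicas - liveReplicas);
    }

    public static void main(String[] args) {
        long blockSize = 67108864L; // 64 * 1024 * 1024, from -Ddfs.block.size
        long fileSize = 2617L;      // size of pom.xml reported by fsck
        System.out.println("blocks=" + blockCount(fileSize, blockSize));
        System.out.println("missing replicas=" + missingReplicas(4, 1));
    }
}
```

This is only the bookkeeping behind the fsck report, not how HDFS itself is implemented.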
Wednesday, March 11, 2015
HDFS Tutorial
java -cp "hiveJdbcQueryUtils-0.1.jar:lib/*"