@SaiVijaya
Created March 21, 2016 05:32
Hadoop FAQ Consolidated Viji
HDFS Questions
1. What is the difference between the Secondary NameNode, Checkpoint NameNode, and Backup Node? (The Secondary NameNode is a poorly named component of Hadoop.)
2. What are the side data distribution techniques?
3. What is shuffling in MapReduce?
4. What is partitioning?
5. Can we change the file cached by the Distributed Cache?
6. What happens if the JobTracker machine is down?
7. Can we deploy the JobTracker on a machine other than the NameNode?
8. What are the four modules that make up the Apache Hadoop framework?
9. Which modes can Hadoop be run in? List a few features of each mode.
10. Where are Hadoop's configuration files located?
11. List Hadoop's three configuration files.
12. What are "slaves" and "masters" in Hadoop?
13. How many DataNodes can run on a single Hadoop cluster?
14. What is the JobTracker in Hadoop?
15. How many JobTracker processes can run on a single Hadoop cluster?
16. What sorts of actions does the JobTracker process perform?
17. How does the JobTracker schedule a job for the TaskTracker?
18. What does the mapred.job.tracker property do?
19. What is a "PID"?
20. What is "jps"?
21. How would you restart the NameNode?
22. Is there another way to check whether the NameNode is working?
24. What is "fsck"?
25. What is a "map" in Hadoop?
26. What is a "reducer" in Hadoop?
27. What are the parameters of mappers and reducers?
28. Is it possible to rename the output file, and if so, how?
29. What is a rack?
30. What is Big Data?
31. What do the four V's of Big Data denote?
32. How does big data analysis help businesses increase their revenue? Give an example.
33. Differentiate between structured and unstructured data.
34. On what concept does the Hadoop framework work?
35. Why do we need Hadoop?
36. What are the main components of a Hadoop application?
37. What is Hadoop Streaming?
38. What is the best hardware configuration to run Hadoop?
39. What are the most commonly defined input formats in Hadoop?
40. What is the basic difference between a traditional RDBMS and Hadoop?
41. What is fault tolerance?
42. Replication causes data redundancy, so why is it pursued in HDFS?
43. Which port does SSH work on?
44. What is streaming in Hadoop?
45. What is the difference between an input split and an HDFS block?
46. What does the file hadoop-metrics.properties do?
47. Name the most common input formats defined in Hadoop. Which one is the default?
48. What is the difference between the TextInputFormat and KeyValueTextInputFormat classes?
49. What is an InputSplit in Hadoop?
50. How is the splitting of a file invoked in the Hadoop framework?
51. Consider this scenario: in a MapReduce system, the HDFS block size is 64 MB, the input format is FileInputFormat, and there are three files of size 64 KB, 65 MB, and 127 MB. How many input splits will be created? (A worked example appears after this list.)
52. Explain what speculative execution is.
53. Can you give some examples of Big Data?
54. Can you give a detailed overview of the Big Data being generated by Facebook?
55. According to IBM, what are the three characteristics of Big Data?
56. How big is 'Big Data'?
57. How is the analysis of Big Data useful for organizations?
58. Who are 'data scientists'?
59. Give a brief overview of Hadoop's history.
60. Give examples of some companies that use the Hadoop structure.
61. What are the key features of HDFS?
62. Since the data is replicated three times in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
63. What is throughput? How does HDFS achieve good throughput?
64. What is streaming access?
65. What is commodity hardware? Does commodity hardware include RAM?
66. What is metadata?
67. Why do we use HDFS for applications with large data sets and not when there are a lot of small files?
68. What is a daemon?
69. Is the NameNode machine the same as a DataNode machine in terms of hardware?
70. What is a heartbeat in HDFS?
71. Are the NameNode and JobTracker on the same host?
72. What are the benefits of block transfer?
73. If we want to copy 10 blocks from one machine to another, but the other machine can store only 8.5 blocks, can blocks be broken up at the time of replication?
74. How is indexing done in HDFS?
75. If a DataNode is full, how is that identified?
76. If DataNodes increase, do we need to upgrade the NameNode?
77. How can you check whether the NameNode is working, besides using the jps command?
78. What mechanism does the Hadoop framework provide to synchronize changes made in the Distributed Cache during the runtime of the application?
79. Have you ever used counters in Hadoop? Give an example scenario.
80. Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to a Hadoop job?
81. Is it possible to have Hadoop job output in multiple directories? If yes, how?
82. Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for some reason?
83. When we send data to a node, do we allow it time to settle before sending more data to that node?
84. Does Hadoop always require digital data to process?
85. On what basis does the NameNode decide which DataNode to write to?
86. Doesn't Google have its very own version of DFS?
87. On what basis is data stored on a rack?
88. Do we need to place the 2nd and 3rd replicas on rack 2 only?
89. What if rack 2 and the DataNode fail?
90. What is the difference between Gen1 and Gen2 Hadoop with regard to the NameNode?
91. What is the difference between the MapReduce engine and the HDFS cluster?
92. Is a map like a pointer?
93. Why is the number of splits equal to the number of maps?
94. Is a job split into maps?
95. Which are the two types of 'writes' in HDFS?
96. Why is 'reading' done in parallel in HDFS while 'writing' is not?
97. Can Hadoop be compared to a NoSQL database like Cassandra?
98. How can I install the Cloudera VM on my system?
99. What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?
100. What are the four basic parameters of a mapper?
111. What is the input type/format in MapReduce by default?
112. Can we do online transactions (OLTP) using Hadoop?
113. Explain how HDFS communicates with the Linux native file system.
114. What are the IdentityMapper and IdentityReducer in MapReduce?
115. How does the JobTracker schedule a task?
116. When are the reducers started in a MapReduce job?
117. What other technologies have you used in the Hadoop stack?
118. How does the NameNode handle DataNode failures?
119. How many daemon processes run on a Hadoop system?
120. What is the configuration of a typical slave node on a Hadoop cluster?
121. How many JVMs run on a slave node?
122. How will you make changes to the default configuration files?
123. Can I set the number of reducers to zero?
125. What is the default port that the JobTracker listens on?
126. How do you resolve "unable to read options file" when importing data from MySQL to HDFS?
127. What problems have you faced when working on Hadoop code?
128. How would you modify a word-count solution to count only the number of unique words in all the documents?
129. What is the difference between Hadoop, a relational database, and NoSQL?
130. How are HDFS blocks replicated?
131. What is a task instance in Hadoop? Where does it run?
132. What is the meaning of replication factor?
133. If reducers do not start before all mappers finish, why does the progress of a MapReduce job show something like Map (50%) Reduce (10%)? Why is the reducers' progress percentage displayed when the mappers are not yet finished?
134. How does the client communicate with HDFS?
135. Which object can be used to get the progress of a particular job?
136. What is the next step after the Mapper or MapTask?
137. What are the default configuration files used in Hadoop?
138. Does the MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job, can a reducer communicate with another reducer?
139. What is the HDFS block size? How is it different from the traditional file system block size?
140. What is a SPOF (single point of failure)?
141. Where do you specify the Mapper implementation?
142. Explain the core methods of the Reducer.
143. How can you add arbitrary key-value pairs in your mapper?
144. What are combiners? When should I use a combiner in my MapReduce job?
145. How is the Mapper instantiated in a running job?
146. Which interface needs to be implemented to create a Mapper and Reducer for Hadoop?
147. What happens if you don't override the Mapper methods and keep them as they are?
148. What does a Hadoop application look like, i.e., what are its basic components?
149. What is the meaning of speculative execution in Hadoop? Why is it important?
150. What are the restrictions on the key and value classes?
151. What are the Writable and WritableComparable interfaces?
152. What is the use of the Context object?
153. How can we control which reducer a particular key goes to? (See the partitioner sketch after this list.)
154. What do you understand about object-oriented programming (OOP)? Use Java examples.
155. Describe what happens to a MapReduce job from submission to output.
157. What is the benefit of the Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?
158. What is the difference between HDFS and NAS?
160. Why would NoSQL be better than using a SQL database? And how much better is it?
161. What do you understand by standalone (or local) mode?
162. What is pseudo-distributed mode?
163. What does /var/hadoop/pids do?
164. What infrastructure do we need to process 100 TB of data using Hadoop?
165. How do you enable the recycle bin (trash) in Hadoop?
166. What is the difference between int and IntWritable?
167. In MapReduce, why does the map write its output to the local disk instead of HDFS?
168. How do you write a custom key class? (See the WritableComparable sketch after this list.)
169. Why do we use IntWritable instead of int? Why do we use LongWritable instead of long?
170. If data is present in HDFS and the replication factor is defined, how can we change the replication factor?
171. How can we change the replication factor while data is on the fly?
172. How do you resolve "mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/hadoop/inpdata. Name node is in safemode"?
173. What does Hadoop do in safe mode?
174. What should the ideal replication factor be in a Hadoop cluster?
175. What are the considerations when doing hardware planning for the master in a Hadoop architecture?
176. When should a Hadoop archive be created?
177. What factors does the block size take into account before creation?
178. In which location does the NameNode store its metadata, and why?
179. Should we use RAID in Hadoop or not?
180. How are blocks distributed among all DataNodes for a particular chunk of data?
181. Why is the DataNode block size in HDFS 64 MB?
182. What is 'Non DFS Used'?
183. What is rack awareness?
184. What is an OutputFormat in Hadoop?
185. What is the difference between a split and a block in Hadoop?
186. How can one write a custom RecordReader?
187. What do you understand by node redundancy, and does it exist in a Hadoop cluster?
188. How do you resolve "IOException: Cannot create directory" while formatting the NameNode in Hadoop?
189. What is an AMI?
190. How can one set a space quota on a Hadoop (HDFS) directory?
191. What are the identity mapper and identity reducer? In which cases can we use them?
192. What are reduce-only jobs?
193. What is crontab? Explain with a suitable example.
194. Safe mode exceptions.
195. What is the meaning of the term "non-DFS used" in the Hadoop web console?
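
Worked example for question 51: FileInputFormat creates splits per file, with the split size defaulting to the HDFS block size (64 MB here). So the 64 KB file yields 1 split, the 65 MB file yields 2 splits (64 MB + 1 MB), and the 127 MB file yields 2 splits (64 MB + 63 MB), for a total of 5 splits and hence 5 map tasks. (Strictly, FileInputFormat lets the last split run up to 10% over the block size, which would fold the 65 MB file into a single split; 5 is the commonly expected answer.)

For question 153, a particular key is routed to a specific reducer with a custom Partitioner. Below is a minimal sketch against the org.apache.hadoop.mapreduce API; the class name, key/value types, and the "special" key are illustrative assumptions, not part of the original FAQ.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Pins the key "special" to reducer 0 and hashes all other keys
// across the remaining reducers.
public class SpecialKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1 || "special".equals(key.toString())) {
            return 0; // illustrative routing rule, not from the original FAQ
        }
        // Spread every other key over partitions 1..numPartitions-1.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

It is registered on the job with job.setPartitionerClass(SpecialKeyPartitioner.class).

For questions 151, 168, and 169, a custom key must implement WritableComparable so the framework can serialize it and sort it during the shuffle (this is also why IntWritable exists instead of plain int). A minimal sketch; the composite (symbol, timestamp) key is an illustrative assumption.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key sorted by symbol, then timestamp.
public class EventKey implements WritableComparable<EventKey> {
    private String symbol = "";
    private long timestamp;

    public EventKey() { } // Hadoop instantiates keys reflectively, so a no-arg constructor is required

    public EventKey(String symbol, long timestamp) {
        this.symbol = symbol;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(symbol);      // serialize fields in a fixed order
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        symbol = in.readUTF();     // deserialize in the same order
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(EventKey other) { // defines the shuffle sort order
        int c = symbol.compareTo(other.symbol);
        return (c != 0) ? c : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() { // used by the default HashPartitioner
        return 31 * symbol.hashCode() + Long.hashCode(timestamp);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EventKey)) return false;
        EventKey k = (EventKey) o;
        return symbol.equals(k.symbol) && timestamp == k.timestamp;
    }
}
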
Pig, Hive, HBase, and Flume Questions
1. Pig for Hadoop: give some points.
2. Hive for Hadoop: give some points.
3. What are file permissions in HDFS?
4. What is ODBC and JDBC connectivity in Hive? (See the JDBC sketch after this list.)
5. What is the Derby database?
6. What is schema-on-read and schema-on-write?
7. What are internal and external tables in Hive?
8. What is bucketing in Hive?
9. What is clustering in Hive?
10. How do you write data to HBase using Flume?
11. What is the difference between the memory channel and the file channel in Flume?
12. How do you resolve the following error while running a query in Hive: "Error in metadata: Cannot validate serde"?
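
For question 4, Hive exposes JDBC and ODBC connectivity through HiveServer2, so standard clients can submit HiveQL. Below is a minimal JDBC sketch; the host/port (localhost:10000), credentials, and table name are illustrative assumptions, and the hive-jdbc driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Older setups may need: Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default"; // HiveServer2 JDBC URL

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM sample_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // first column of each row
            }
        }
    }
}
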
Scenario-Based Questions
1. How would you modify the word-count solution to count only the number of unique words in all the documents? (See the sketch after this list.)
2. How would you tackle calculating the number of unique visitors for each hour by mining a huge Apache log? You can use post-processing on the output of the MapReduce job.
3. How would you tackle counting words in several text documents?
4. Have you ever used counters in Hadoop? Give an example scenario.
5. Can we write a MapReduce program in a programming language other than Java? How?
6. Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for some reason?
7. Did you ever run into a lopsided (skewed) job that resulted in an out-of-memory error? If yes, how did you handle it?
8. How would you proceed to write your first MapReduce program?
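
For scenario question 1, one common approach is to keep the word-count mapper but have the reducer increment a framework-aggregated counter once per distinct key instead of summing values; the shuffle guarantees all occurrences of a word reach a single reducer. A minimal sketch, with illustrative class and counter names that are not from the original FAQ.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueWordCount {
    public enum Stats { UNIQUE_WORDS } // aggregated across all reducers by the framework

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, NullWritable.get()); // emit each occurrence
                }
            }
        }
    }

    public static class UniqueReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Each distinct word is handled by exactly one reduce call,
            // so this increments once per unique word.
            context.getCounter(Stats.UNIQUE_WORDS).increment(1);
        }
    }
}

After the job finishes, the driver reads the total with job.getCounters().findCounter(UniqueWordCount.Stats.UNIQUE_WORDS).getValue(). This also doubles as an example scenario for question 4 on counters.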