In this article, CyberITHub offers the top 150 Big Data Hadoop interview questions and answers. In today's digital world, one of the biggest challenges is storing and handling huge volumes of complex data. This has led to the rise of Big Data platforms, of which Hadoop is the most common and popular. Organizations dealing with huge chunks of data are always looking for professionals with Hadoop skills. CyberITHub therefore provides interview questions that will help you crack Big Data and Hadoop job interviews.
Top 150 Hadoop Interview Questions and Answers
1. What is Hadoop ?
Ans. Apache Hadoop is an open source framework used to store, process, and analyze huge volumes of data.
2. What is Big Data ?
Ans. Big Data is a field that deals with storing and processing large data sets across complex data clusters.
3. What are the two different types of nodes in HDFS ?
Ans. The two different types of nodes are:-
- A single NameNode
- Multiple DataNodes
4. What is the main role of NameNode ?
Ans. NameNode manages all the metadata needed to store and retrieve the actual data from the DataNodes.
5. What is the purpose of SecondaryNameNode ?
Ans. The purpose of the SecondaryNameNode is to perform periodic checkpoints that evaluate the status of the NameNode.
6. What is the usual default size of HDFS block ?
Ans. 64 MB (Hadoop 1) or 128 MB (Hadoop 2 and later)
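To see what the block size means in practice, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes a file is split into fixed-size blocks with a smaller final block when the size is not an exact multiple.

```python
MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks needed to hold a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)

def last_block_size(file_size_bytes, block_size_bytes=128 * MB):
    """Size of the final block, which may be smaller than a full block."""
    remainder = file_size_bytes % block_size_bytes
    return remainder or min(file_size_bytes, block_size_bytes)

# A 1 GB file needs 8 blocks at 128 MB, or 16 blocks at 64 MB.
print(block_count(1024 * MB), block_count(1024 * MB, 64 * MB))
```

With these assumptions, a 200 MB file occupies two blocks at the 128 MB setting: one full block and a final block of 72 MB.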
7. Which command can be used to bring HDFS into safe mode for maintenance ?
Ans. hdfs dfsadmin -safemode enter
8. What is fsimage ?
Ans. It is a kind of file in which the NameNode stores the metadata of the HDFS File System.
9. What are the different features offered by the HDFS Snapshots ?
Ans. Following are the features offered by the HDFS Snapshots:-
- Snapshot creation is instantaneous.
- Snapshots can be used for data backup, protection against user errors, and disaster recovery.
- Snapshots do not adversely affect regular HDFS operations.
- Snapshots can be taken of a sub-tree of the file system or the entire file system.
10. Does the HDFS NFS Gateway support NFS version 3 ?
Ans. Yes.
11. Which command can be used to get an HDFS Status report ?
Ans. hdfs dfsadmin -report
12. Which command can be used to take HDFS Snapshots ?
Ans. hdfs dfs -createSnapshot <path> [<snapshotName>]
13. What is TestDFSIO ?
Ans. TestDFSIO is an HDFS benchmark application. It is basically a read and write test for HDFS.
14. What is Speculative Execution ?
Ans. Hadoop does not try to diagnose and fix slow-running tasks. Instead, it detects when a task is running slower than expected and launches an equivalent backup task for it. This is called Speculative Execution.
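The idea can be illustrated with a deliberately simplified sketch. The rule used here (flag any running task whose progress is well below the average) and the `slack` parameter are assumptions made for illustration only; Hadoop's real heuristic is more involved.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    progress: float  # fraction complete, 0.0 to 1.0

def pick_speculative(tasks, slack=0.2):
    """Return ids of running tasks that lag the average progress by more than `slack`."""
    if not tasks:
        return []
    average = sum(t.progress for t in tasks) / len(tasks)
    return [t.task_id for t in tasks
            if t.progress < 1.0 and t.progress < average - slack]

tasks = [Task("m1", 0.9), Task("m2", 0.95), Task("m3", 0.3)]
print(pick_speculative(tasks))  # candidates for a backup task
```

A backup copy would then be started for each flagged task, and whichever copy finishes first is used while the other is killed.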
15. In which file can Speculative Execution be switched on or off ?
16. What is MapReduce ?
Ans. MapReduce is a Hadoop framework on which applications are written to process huge volumes of data on large clusters.
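The programming model can be sketched in plain Python with the classic word count. This is a single-process illustration, not Hadoop code: the function names are hypothetical, and the shuffle step that Hadoop performs across the cluster is imitated here with a dictionary.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "big cluster"]))
```

In a real job the mappers and reducers run in parallel on different nodes, but the map, shuffle, and reduce steps follow the same pattern.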
17. What is Pipes ?
Ans. Pipes is a library that allows C++ source code to be used for mapper and reducer code.
18. What is Pig in Hadoop ?
Ans. Pig is a high level platform which provides an abstraction over MapReduce.
19. How to start Pig so that it can use Hadoop MapReduce ?
Ans. pig -x mapreduce
20. What is Apache Hive ?
Ans. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, ad hoc queries, and the analysis of large data sets using a SQL-like language called HiveQL.
21. What are the different features offered by Hive ?
Ans. Hive currently offers the following features:-
- Tools to enable easy data extraction, transformation, and loading (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in HDFS or in other data storage systems such as HBase
- Query execution via MapReduce and Tez (optimized MapReduce)
22. Can a programmer add their custom mappers and reducers to Hive queries ?
23. What is Apache Flume ?
Ans. Apache Flume is an independent agent designed to collect, transport, and store data into HDFS.
24. What are the different components Flume Agent composed of ?
Ans. A Flume agent is composed of three components:-
- Source - The source component receives data and sends it to the channel.
- Channel - A channel queues the source data and forwards it to the sink destination.
- Sink - The sink delivers the data to a destination, i.e. to HDFS, a local file, or another Flume agent.
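The flow through the three components can be sketched with a small in-memory model. This is not the Flume API: the class and method names are hypothetical, and a Python list stands in for the destination.

```python
from collections import deque

class Channel:
    """Queues events between the source and the sink."""
    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        return self._events.popleft() if self._events else None

class Source:
    """Receives data and sends it to the channel."""
    def __init__(self, channel):
        self.channel = channel

    def receive(self, event):
        self.channel.put(event)

class Sink:
    """Takes events from the channel and delivers them to the destination."""
    def __init__(self, channel, destination):
        self.channel = channel
        self.destination = destination

    def drain(self):
        while (event := self.channel.take()) is not None:
            self.destination.append(event)

channel, destination = Channel(), []
source, sink = Source(channel), Sink(channel, destination)
source.receive("log line 1")
source.receive("log line 2")
sink.drain()
print(destination)
```

The channel decouples the two ends, so the source can keep accepting events even when the sink is temporarily slower.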
25. Can a source of a Flume agent write to multiple channels ?
Ans. Yes, a source can write to multiple channels.
26. Can a sink of a flume agent take data from multiple channels ?
Ans. No, it can take data from only one channel.
27. Which data transfer format is usually used by Flume ?
Ans. Apache Avro
28. What are the advantages of using Apache Avro data transfer format ?
Ans. A few advantages of the Apache Avro data transfer format are:-
- Avro is a data serialization/deserialization system that uses a compact binary format.
- Avro also uses remote procedure calls (RPCs) to send data. That is, an Avro sink will contact an Avro source to send data.
29. What is Apache Oozie ?
Ans. Apache Oozie is a workflow director system designed to run and manage multiple related Apache Hadoop jobs.
30. How many types of Oozie jobs are permitted in Hadoop ?
Ans. Three types of Oozie jobs are permitted in Hadoop :-
- Workflow - It is a specified sequence of Hadoop jobs with outcome-based decision points and control dependency.
- Coordinator - It is a scheduled workflow job that can run at various time intervals or when data becomes available.
- Bundle - It is a higher level Oozie abstraction that will batch a set of coordinator jobs.
31. In which language are Oozie workflow definitions written ?
Ans. hPDL (an XML Process Definition Language)
32. What are the different features provided by Apache HBase ?
Ans. Below are the features provided by the Apache HBase:-
- Linear and modular scalability
- Strictly consistent reads and writes
- Automatic and configurable sharding of tables
- Automatic failover support between Region Servers
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
- Easy-to-use Java API for client access
33. What is Apache HBase ?
Ans. Apache HBase is an open-source non-relational distributed database modeled after Google's BigTable and written in Java.
34. Which utility can be used to load data in tab-separated values(tsv) format into HBase ?
35. Which application can be used to import and export RDBMS data into HDFS ?
Ans. Apache Sqoop
36. Which application can be used to capture and transport weblog data ?
Ans. Apache Flume
37. What is the command to run the balancer ?
Ans. hdfs balancer
38. When does HDFS go into safe mode ?
Ans. HDFS goes into safe mode when a major issue arises within the file system. The NameNode also starts in safe mode and leaves it once enough blocks have been reported by the DataNodes.
39. How to check if HDFS is in safe mode ?
Ans. hdfs dfsadmin -safemode get
40. Which property provides the time in seconds between the SecondaryNameNode checkpoints ?
41. Which command can be used to force the checkpoint if the SecondaryNameNode is not running ?
Ans. hdfs secondarynamenode -checkpoint force
42. How many simultaneous snapshots can be accommodated in a snapshottable directory ?
43. In which directories can snapshots be taken ?
Ans. Snapshots can be taken on any directory once the directory has been set as snapshottable.
44. Which is the default Scheduler for YARN ?
Ans. Capacity Scheduler
45. What capabilities does the NFSv3 gateway support ?
Ans. Below are the capabilities the NFSv3 gateway currently supports:-
- Users can browse the HDFS file system through their local file system using an NFSv3 client-compatible operating system.
- Users can download files from the HDFS file system to their local file system.
- Users can upload files from their local file system directly to the HDFS file system.
- Users can stream data directly to HDFS through the mount point. File append is supported, but random write is not supported.
46. What is DistributedCache ?
Ans. DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.
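47. What are the main configuration files in Hadoop ?
Ans. Below are the main configuration files in Hadoop. These files hold the default values; site-specific overrides go in the matching core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml files:-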
- core-default.xml: System-wide properties
- hdfs-default.xml: Hadoop Distributed File System properties
- mapred-default.xml: Properties for the YARN MapReduce framework
- yarn-default.xml: YARN properties
48. How to run YARN WebProxy in standalone mode ?
Ans. By adding the configuration property yarn.web-proxy.address to yarn-site.xml
49. What is the role of JobHistoryServer ?
Ans. The JobHistoryServer provides all YARN MapReduce applications with a central location in which to aggregate completed jobs for historical reference and debugging.
50. In which configuration file can the settings for the JobHistoryServer be found ?
Ans. mapred-site.xml file
51. How does YARN control a container's memory ?
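Ans. YARN controls the container's memory through three important values in the yarn-site.xml file:-
- yarn.nodemanager.resource.memory-mb
- yarn.scheduler.minimum-allocation-mb
- yarn.scheduler.maximum-allocation-mb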
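How such settings interact can be sketched with some simple arithmetic. The sketch assumes the three values are yarn.nodemanager.resource.memory-mb (memory a node offers to containers), yarn.scheduler.minimum-allocation-mb, and yarn.scheduler.maximum-allocation-mb, and that the scheduler rounds each request up to a multiple of the minimum allocation, as the Capacity Scheduler does by default. The function names are hypothetical.

```python
def normalize_request(requested_mb, min_alloc_mb, max_alloc_mb):
    """Round a memory request up to a multiple of the minimum allocation."""
    if requested_mb > max_alloc_mb:
        raise ValueError("request exceeds the maximum allocation")
    multiples = -(-requested_mb // min_alloc_mb)  # ceiling division
    return min(max(min_alloc_mb, multiples * min_alloc_mb), max_alloc_mb)

def containers_per_node(node_memory_mb, container_mb):
    """How many containers of a given size fit on one NodeManager."""
    return node_memory_mb // container_mb

granted = normalize_request(1500, min_alloc_mb=1024, max_alloc_mb=8192)
print(granted, containers_per_node(8192, granted))
```

Under these assumptions, a 1500 MB request is granted 2048 MB, so a node offering 8192 MB can host four such containers.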
52. Which command can be used to check the health of HDFS ?
Ans. hdfs fsck <path>
53. How to enable Application Master restart when an error occurs in MapReduce Job ?
Ans. To enable Application Master restart, you need to set the following properties:-
- Inside yarn-site.xml, you can tune the property yarn.resourcemanager.am.max-attempts (named yarn.resourcemanager.am.max-retries in early Hadoop 2 releases). The default is 2.
- Inside mapred-site.xml, you can more directly tune how many times a MapReduce ApplicationMaster should restart with the property mapreduce.am.max-attempts. The default is 2.
54. What are blocks in HDFS ?
Ans. Blocks define the minimum amount of data that HDFS can read and write at a time.
55. What are Cache Pools ?
Ans. Cache pools are an administrative grouping for managing cache permissions and resource usage.
56. What are the different types of communication protocol in HDFS ?
Ans. There are three different communication protocols in HDFS :-
- Client Protocol - This is a communication protocol that's defined for communication between the HDFS Client and the Namenode server.
- Data Transfer Protocol - The HDFS Client, after receiving metadata information from Namenode, establishes communication with Datanode to read and write data. This communication between the client and the Datanode is defined by the Data Transfer Protocol.
- Data Node Protocol - This protocol defines communication between Namenode and DataNode.
57. What are the different operations performed by the QJM when it writes to the JournalNode ?
Ans. Below operations are performed by the QJM when it writes to the JournalNode:-
- The writer makes sure that no other writers are writing to the edit logs. This guarantees that even if two NameNodes are active at the same time, only one will be allowed to make namespace changes to the edit logs.
- It is possible that the writer has not logged namespace modifications to all the JournalNodes or that some JournalNodes have not completed the logging. The QJM makes sure that all the JournalNodes are in sync based on file length.
- Once the preceding two conditions are verified, the QJM can start a new log segment to write to the edit logs.
- The writer sends the current batch of edits to all the JournalNodes in the cluster and waits for an acknowledgement from a quorum of the JournalNodes before considering the write a success. JournalNodes that fail to respond to the write request are marked as OutOfSync and are not used for the current batch of the edit segment.
- The QJM sends an RPC request to the JournalNodes to finalize the log segment. After receiving confirmation from a quorum of JournalNodes, the QJM can begin the next log segment.
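The quorum-based write step above can be sketched with a small in-memory model. This is an illustration under stated assumptions, not the QJM implementation: JournalNodes are plain dictionaries, the `reachable` flag stands in for an RPC succeeding, and all names are hypothetical.

```python
def quorum_size(node_count):
    """A quorum is a strict majority of the JournalNodes."""
    return node_count // 2 + 1

def write_batch(journal_nodes, batch):
    """Send a batch of edits to every node; succeed if a quorum acknowledges."""
    acked, out_of_sync = [], []
    for name, node in journal_nodes.items():
        if node["reachable"]:
            node["edits"].extend(batch)
            acked.append(name)
        else:
            out_of_sync.append(name)  # skipped for the rest of this segment
    success = len(acked) >= quorum_size(len(journal_nodes))
    return success, out_of_sync

nodes = {
    "jn1": {"reachable": True, "edits": []},
    "jn2": {"reachable": True, "edits": []},
    "jn3": {"reachable": False, "edits": []},
}
print(write_batch(nodes, ["mkdir /a"]))
```

With three JournalNodes the quorum is two, so the write still succeeds when one node is down, which is why an odd number of JournalNodes is normally deployed.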
58. What is Quorum Journal Manager(QJM) ?
Ans. The QJM is a dedicated HDFS implementation, designed for the sole purpose of providing a highly available edit log, and is the recommended choice for most HDFS installations.
59. Which command can be used to fetch the latest fsimage from NameNode ?
Ans. hdfs dfsadmin -fetchImage /home/packt
60. Which tool is used to convert fsimage file content into a human-readable format ?
Ans. Offline Image Viewer tool
61. What is HDFS Federation ?
Ans. HDFS Federation is a feature introduced in Hadoop 2 that allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.
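The idea of splitting the namespace can be sketched as a lookup from path prefix to namenode, similar in spirit to a client-side mount table. The mount table contents, the namenode names, and the function name are all hypothetical.

```python
MOUNT_TABLE = {
    "/user": "namenode1",
    "/data": "namenode2",
    "/data/archive": "namenode3",
}

def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the namenode that owns the longest matching path prefix."""
    best = None
    for prefix in mount_table:
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namenode manages {path}")
    return mount_table[best]

print(resolve_namenode("/user/alice/file.txt"))
print(resolve_namenode("/data/archive/2014/logs"))
```

Each namenode only needs to hold the metadata for its own portion of the tree, which is what lets the cluster grow beyond the memory of a single namenode.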
62. What is Checkpoint ?
Ans. Checkpoint is the process of merging an fsimage with edit logs by applying all the actions of the edit log on the fsimage.
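The merge can be sketched by modelling the fsimage as a dictionary of paths and the edit log as a list of operations. The operation names and the metadata layout are assumptions chosen for illustration, not the real on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Apply every edit-log operation to a copy of the fsimage and return the result."""
    new_image = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            new_image[path] = args[0]           # args[0] is the file metadata
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "rename":
            new_image[args[0]] = new_image.pop(path)  # args[0] is the new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image

fsimage = {"/a": {"replication": 3}}
edits = [
    ("create", "/b", {"replication": 2}),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
print(checkpoint(fsimage, edits))
```

After the merge the edit log can be truncated, since the new fsimage already reflects every logged change.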
63. Does QJM use ZooKeeper ?
64. What is failover controller ?
Ans. It is an entity in the system that manages the transition from the active namenode to the standby.
65. What is graceful failover ?
Ans. A failover that is manually initiated by an administrator, for example for routine maintenance, is called a graceful failover.
66. Which fencing technique is used as a last resort to fence the previously active namenode ?
Ans. STONITH (shoot the other node in the head)