Mphasis Hadoop Interview Questions
1. What are the three modes in which Hadoop can run?
Hadoop can run in three modes:
1) Standalone (local) mode: This is the default mode. It uses the local file system and runs everything in a single Java process, with no Hadoop daemons started.
2) Pseudo-distributed mode: This uses a single-node Hadoop deployment in which all Hadoop daemons run on the same machine, each in its own Java process.
3) Fully-distributed mode: This uses separate nodes to run the Hadoop master and worker (slave) services.
2. What are the differences between regular FileSystem and HDFS?
1) Regular FileSystem: Data is maintained on a single system. If the machine crashes, data recovery is difficult because fault tolerance is low. Seek time is higher, so processing the data takes longer.
2) HDFS: Data is distributed and maintained on multiple systems. If a DataNode crashes, the data can still be recovered from replicas on other nodes in the cluster. An individual read takes comparatively longer, because it involves a local disk read on a DataNode plus coordination of data from multiple systems; HDFS is designed for high-throughput access to large files rather than low-latency access.
3. What happens when two clients try to access the same file in HDFS?
HDFS supports exclusive writes only; any number of clients can read the same file at the same time. When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants that client a lease on the file. When a second client tries to open the same file for writing, the “NameNode” sees that the lease has already been granted to another client and rejects the second client's open request.
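The lease check described above can be pictured with a small Python sketch. This is a toy model, not the HDFS API: the `LeaseManager` class, its method names and the file path are all made up for illustration, and real leases also expire and get renewed, which is left out here.

```python
class LeaseManager:
    """Toy stand-in for the NameNode's lease bookkeeping (hypothetical, not the HDFS API)."""

    def __init__(self):
        self._holders = {}  # file path -> name of the client holding the write lease

    def open_for_write(self, path, client):
        holder = self._holders.get(path)
        if holder is not None and holder != client:
            # Another client already holds the lease, so this open is rejected.
            raise PermissionError(f"{path} is already being written by {holder}")
        self._holders[path] = client
        return True

    def close(self, path, client):
        # Closing the file releases the lease so another writer can acquire it.
        if self._holders.get(path) == client:
            del self._holders[path]


manager = LeaseManager()
manager.open_for_write("/data/log.txt", "client-1")
try:
    manager.open_for_write("/data/log.txt", "client-2")
    second_open_succeeded = True
except PermissionError:
    second_open_succeeded = False
```

Once client-1 closes the file, the same call from client-2 goes through, because the lease is free again.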
4. What is a checkpoint?
In brief, “Checkpointing” is a process that takes an FsImage and an edit log and compacts them into a new FsImage. Thus, instead of replaying a long edit log, the NameNode can load its final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by the Secondary NameNode (or by the Standby NameNode in a high-availability setup).
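As a rough illustration of what "compacting" means, here is a Python sketch that folds an edit log into an image. The dictionary layout and the three operation names (`create`, `set_replication`, `delete`) are assumptions made for the example; the real FsImage and edit log are binary files with many more operation types.

```python
def apply_edit(namespace, edit):
    """Apply one edit-log entry to an in-memory namespace (dict of path -> metadata)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"replication": args[0]}
    elif op == "set_replication":
        namespace[path]["replication"] = args[0]
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Fold the edit log into a copy of the old image and return the new image."""
    new_image = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image


old_image = {"/a.txt": {"replication": 3}}
edits = [
    ("create", "/b.txt", 3),
    ("set_replication", "/a.txt", 2),
    ("delete", "/b.txt"),
    ("create", "/c.txt", 3),
]
new_image = checkpoint(old_image, edits)
```

After the checkpoint, the four edits are no longer needed: loading `new_image` gives the same state as loading `old_image` and replaying them.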
5. How is HDFS fault tolerant?
When data is stored in HDFS, each block is replicated across several DataNodes under the direction of the NameNode. The default replication factor is 3, and you can change the replication factor as per your need. If a DataNode goes down, the NameNode automatically arranges for the affected blocks to be copied from the surviving replicas to other nodes, so the data stays available. This provides fault tolerance in HDFS.
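The recovery step can be pictured with a short Python sketch. The block IDs, node names and the `re_replicate` helper are made up for illustration, and targets are picked in alphabetical order only to keep the example deterministic; real HDFS weighs rack placement and node load when choosing targets.

```python
def re_replicate(block_locations, live_nodes, replication=3):
    """Return a new block -> nodes map with lost replicas restored on live nodes."""
    repaired = {}
    for block, nodes in block_locations.items():
        survivors = [n for n in nodes if n in live_nodes]
        if not survivors:
            # Every replica is gone, so there is nothing left to copy from.
            repaired[block] = []
            continue
        # Candidate targets are live nodes that do not already hold this block.
        candidates = sorted(n for n in live_nodes if n not in survivors)
        missing = max(0, replication - len(survivors))
        repaired[block] = survivors + candidates[:missing]
    return repaired


locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
live = {"dn1", "dn3", "dn4", "dn5"}  # dn2 has failed
repaired = re_replicate(locations, live)
```

Each block that lost its copy on `dn2` ends up with three replicas again, all on live nodes.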
6. Can NameNode and DataNode be commodity hardware?
The smart answer to this question would be that DataNodes can be commodity hardware, comparable to personal computers and laptops, because they store the data and are required in large numbers. But from your experience, you can add that the NameNode is the master node and stores metadata about all the blocks in HDFS. It requires a lot of memory (RAM), so the NameNode needs to be a high-end machine with plenty of memory.
7. What does the ‘jps’ command do?
The ‘jps’ command helps us check whether the Hadoop daemons are running. It lists the Java processes running on the machine, which include Hadoop daemons such as NameNode, DataNode, ResourceManager and NodeManager.
8. How do you define “Rack Awareness” in Hadoop?
Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, so as to minimize network traffic between “DataNodes” in different racks while still surviving the loss of a rack. With a replication factor of 3 (the default), the policy is that “for every block of data, two copies will exist in one rack, third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
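A minimal Python sketch of the "two copies in one rack, third copy in a different rack" rule follows. The `place_replicas` function and the rack and node names are hypothetical. The sketch assumes the first replica goes on the writer's own node and the other two go on one remote rack, and it picks racks and nodes in sorted order only to stay deterministic, where a real cluster would choose among eligible nodes.

```python
def place_replicas(topology, writer_node):
    """Pick three DataNodes: one on the writer's rack, two on a single other rack.

    topology maps rack name -> list of DataNode names (hypothetical layout).
    """
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    # Choose a remote rack that can hold two replicas on distinct nodes.
    remote_rack = next(
        r for r in sorted(topology) if r != local_rack and len(topology[r]) >= 2
    )
    first = writer_node
    second, third = sorted(topology[remote_rack])[:2]
    return [first, second, third]


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5"],
}
replicas = place_replicas(topology, "dn1")
```

With this layout the block lands on `dn1` in rack1 and on `dn3` and `dn4` in rack2, so losing either rack still leaves at least one copy, and only one copy has to cross racks when the block is written.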
9. What is the purpose of “RecordReader” in Hadoop?
The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “InputFormat”.
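To make the idea concrete, here is a Python sketch of a line-oriented reader that turns a byte range (the split) into (byte offset, line) pairs. It is modelled loosely on how a text input format behaves and is not Hadoop code: the `read_records` function and the sample data are assumptions for illustration.

```python
import io


def read_records(data, start, length):
    """Yield (byte_offset, line) pairs for the split [start, start + length)."""
    end = start + length
    stream = io.BytesIO(data)
    stream.seek(start)
    if start != 0:
        # A split that starts mid-file skips its first (possibly partial) line;
        # the reader of the previous split is responsible for it.
        stream.readline()
    while True:
        offset = stream.tell()
        if offset > end:
            break
        line = stream.readline()
        if not line:
            break
        yield offset, line.rstrip(b"\n").decode("utf-8")


data = b"alpha\nbeta\ngamma\ndelta\n"
first_split = list(read_records(data, 0, 12))
second_split = list(read_records(data, 12, 11))
```

The split boundary at byte 12 falls in the middle of "gamma", yet the line is read whole by the first split and skipped by the second, so every record reaches exactly one mapper.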
10. What is a “Combiner”?
A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help improve the efficiency of “MapReduce” by reducing the amount of data that has to be sent to the “reducers”.
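The saving is easy to see in a word-count sketch written in plain Python. Nothing here uses Hadoop itself: the function names and the two input lines are made up, and each line stands in for the input of one mapper.

```python
from collections import Counter, defaultdict


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum counts locally, per mapper, before anything is shuffled."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(all_pairs):
    """Reducer: sum the counts for each word across all mappers."""
    totals = defaultdict(int)
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


splits = ["to be or not to be", "to do or not to do"]
mapped = [map_words(line) for line in splits]
combined = [combine(pairs) for pairs in mapped]

pairs_without_combiner = sum(len(p) for p in mapped)
pairs_with_combiner = sum(len(p) for p in combined)
result = reduce_counts(pair for pairs in combined for pair in pairs)
```

The reducer gets the same totals either way, but fewer pairs have to travel to it. This works because addition is associative and commutative; an operation such as a plain average would not give the right answer if it were combined this way.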
11. What do you know about “SequenceFileInputFormat”?
“SequenceFileInputFormat” is an input format for reading sequence files. A sequence file is a binary key-value file format, with optional compression, that is optimized for passing data from the output of one “MapReduce” job to the input of another “MapReduce” job.
Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data passing from one MapReduce job to another.