HADOOP

Hadoop can be run in 3 different modes.

1. Standalone Mode
2. Pseudo Distributed Mode(Single Node Cluster)
3. Fully distributed mode (or multiple node cluster)

Standalone Mode

  1. Default mode of Hadoop
  2. HDFS is not utilized in this mode.
  3. Local file system is used for input and output
  4. Used for debugging purpose
  5. No custom configuration is required in the 3 Hadoop configuration files (mapred-site.xml, core-site.xml, hdfs-site.xml).
  6. Standalone mode is much faster than Pseudo-distributed mode.

Pseudo Distributed Mode(Single Node Cluster)

  1. Configuration is required in the 3 files mentioned above for this mode.
  2. The replication factor is one for HDFS.
  3. Here one node is used as Master Node / Data Node / Job Tracker / Task Tracker.
  4. Used to test real code against HDFS.
  5. A pseudo-distributed cluster is a cluster where all daemons run on a single node.

Fully distributed mode (or multiple node cluster)

  1. This is the production phase.
  2. Data is used and distributed across many nodes.
  3. Different nodes are used as Master Node / Data Node / Job Tracker / Task Tracker.
Hadoop Architecture
Nowadays we need a framework that can handle the huge amounts of data generated by applications like Facebook, Twitter, LinkedIn, Google, Yahoo, etc., all of which have lots of data. These companies need some way to process that huge data, namely 1. data analysis, 2. proper handling of data, and 3. transforming data into an understandable, custom format.

Apache Hadoop's MapReduce and HDFS components originally derived respectively from Google's MapReduce and Google File System (GFS) papers.

In 2003-2004 Google introduced some new techniques in its search engine: 1. a file system called GFS (Google File System) and 2. a framework for data analysis called MapReduce, to make searching and analyzing data fast. Google published these as white papers.

In 2005-2006 Yahoo took these techniques from Google and implemented them in a single framework named Hadoop. Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. No one knows the curious story behind the name better than Doug Cutting, chief architect of Cloudera. When he was creating the open source software that supports the processing of large data sets, Cutting knew the project would need a good name. Fortunately, he had one up his sleeve, thanks to his son. His son, then 2, was just beginning to talk and called his beloved stuffed yellow elephant "Hadoop" (with the stress on the first syllable). The son (who's now 12) is frustrated with this. He's always saying, "Why don't you say my name, and why don't I get royalties? I deserve to be famous for this." :)


What is Big Data?
  • Lots of Data (Terabytes or Petabytes)
  • Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • Systems or enterprises generate huge amounts of data, from terabytes to even petabytes of information.












  • Facebook generates 500+ TB of data per day for analysis.
  • NYSE generates about 1 TB of new trade data per day to perform stock trading analytics and determine trends for optimal trades.
  • A single airplane generates almost as much data per day as Facebook.
  • An electricity board also has lots of data per day to analyze for calculating power consumption for a particular state.
















|---------------------------------|---------------------------------|
|                               DATA                                |
|---------------------------------|---------------------------------|
| 70% Unstructured data (NoSQL)   | 30% Structured data (RDBMS)     |
| Vertical Scalability            | No Vertical Scalability         |
|---------------------------------|---------------------------------|













Big Data Technologies

1. Operational Big Data -
Systems that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored (e.g. MongoDB).

2. Analytical Big Data -
Systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data (e.g. Hadoop).















What is Hadoop? Before answering that, let us first understand what a DFS (Distributed File System) is and why we need one.

 DFS (Distributed File System)-

A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server.
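
Just to get a feel for this from the programming side, here is a minimal sketch (not part of the original write-up) using Hadoop's FileSystem API. It assumes a reachable HDFS pointed to by fs.defaultFS and a hypothetical /user/hadoop/data directory, and simply lists that directory as if it were local.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
  public static void main(String[] args) throws IOException {
    // Picks up core-site.xml/hdfs-site.xml from the classpath, so this client
    // talks to whatever file system fs.defaultFS points to (HDFS or local).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical directory, used only for illustration.
    Path dir = new Path("/user/hadoop/data");
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}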












In the picture above there are different physical machines in different locations, but together they form one logical machine with a common file system shared across all the physical machines.


  • System that permanently stores data
  • Divided into logical units (files, shards, chunks, blocks, etc.)
  • A file path joins file and directory names into a relative or absolute address to identify a file
  • Supports access to files and remote servers
  • Supports concurrency
  • Supports distribution
  • Supports replication
  • NFS, GPFS, Hadoop DFS, GlusterFS, MogileFS...










WHY DFS?














What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Apache Hadoop is simply a framework; it is a library built using Java with the objective of providing the capability to manage huge amounts of data.

Hadoop is a Java framework provided by Apache to manage huge amounts of data. It provides components that can understand the data, provide the right storage capability, and provide the right algorithms to analyze it.

Open Source Software + Commodity Hardware = IT Costs reduction

What is Hadoop used for?
  • Searching
  • Log Processing
  • Recommendation systems
  • Analytics
  • Video and Image Analysis
  • Data Retention

Company Using Hadoop:
  • Yahoo
  • Google
  • Facebook
  • Amazon
  • AOL
  • IBM
  • and many more
http://wiki.apache.org/hadoop/PoweredBy

Hadoop Architecture

Here we describe the Hadoop architecture. At a high level, the Hadoop architecture consists of two main modules: HDFS and MapReduce.

               That is, HDFS + MapReduce = Hadoop Framework

The following picture shows the high-level architecture of Hadoop version 1 and version 2.

















Hadoop provides a distributed filesystem(HDFS) and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. While the interface to HDFS is patterned after the Unix filesystem, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.

The Apache Hadoop framework is composed of the following modules :
1] Hadoop Common - contains libraries and utilities needed by other Hadoop modules

2] Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

3] Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.

4] Hadoop MapReduce - a programming model for large scale data processing.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components originally derived respectively from Google's MapReduce and Google File System (GFS) papers.

Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop "platform" is now commonly considered to consist of a number of related projects as well – Apache Pig, Apache Hive, Apache HBase, and others

















For end-users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Apache Pig and Apache Hive, among other related projects, expose higher level user interfaces like Pig Latin and a SQL variant respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts.

Core Components of Hadoop 1.x (HDFS & MapReduce):
There are two primary components at the core of Apache Hadoop 1.x: the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. These open source projects were inspired by technologies created inside Google.
 Hadoop Distributed File System (HDFS) - Storage
  • Distributed across "nodes"
  • Natively redundant
  • NameNode tracks locations

 MapReduce - Processing
  • Splits a task across processors
  • Processes near the data and assembles results
  • Self-healing and high bandwidth
  • Clustered storage
  • JobTracker manages the TaskTrackers

























The NameNode is the admin node; it is associated with the JobTracker, in a master/slave architecture.

The JobTracker is associated with the NameNode and works with multiple TaskTrackers for processing of data sets.

What is HDFS?

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
  • Highly fault-tolerant 
    "Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS."
  • High throughput
    "HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future."
  • Suitable for application with large data sets
    "Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance."
  • Streaming access to file system data
    "Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access."
  • can be built out of commodity hardware

Areas where HDFS is Not a Good Fit Today-
  • Low-latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications

HDFS Components-
  • NameNode
    • associated with the JobTracker
    • the master of the system
    • maintains and manages the blocks which are present on the DataNodes

    The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of blocks to DataNodes. The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.
  • DataNodes
    • associated with TaskTrackers
    • slaves which are deployed on each machine and provide the actual storage
    • responsible for serving read and write requests for the clients

    Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself and the second file records the block's metadata including checksums for the data and the generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional filesystems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.




Hi, in this Hadoop tutorial we will now describe the HDFS architecture. The following are the two main components of HDFS.

Main Components of HDFS-
  • NameNode
    • master of the system
    • maintains and manages the blocks which are present on the DataNodes
    • The NameNode is like the "Lamborghini car" in the picture above: strong in build, but a single point of failure.

  • DataNodes
    • slaves which are deployed on each machine and provide the actual storage
    • responsible for serving read and write requests for the clients
    • The DataNodes are like the "ambassador cars" in the picture above: less strong compared to the "Lamborghini", but they are the actual service providers, running on commodity hardware.
HDFS Architecture-
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
 
Rack - the storage area where multiple DataNodes are kept.
Client - the application you use to interact with the NameNode and DataNodes.

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

NameNode Metadata-

Meta data in memory 
  • the entire meta-data in main memory
  • no demand paging for FS meta-data
Types of Meta-data
  • List of files
  • List of blocks of each file
  • List of data nodes of each block
  • File attributes e.g. access time, replication factors
A Transaction Log
  • Records file creations and deletions

Secondary NameNode - stores a backup of the NameNode metadata. It does not work as an alternate NameNode; it just stores the NameNode metadata.
















HDFS Client Creates a New File-

When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. The list is sorted by the network topology distance from the client. The client contacts a DataNode directly and requests the transfer of the desired block. When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Choice of DataNodes for each block is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in following figure.





















Unlike conventional filesystems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications like the MapReduce framework to schedule a task to where the data is located, thus improving the read performance. It also allows an application to set the replication factor of a file. By default a file's replication factor is three. For critical files or files which are accessed very often, having a higher replication factor improves tolerance against faults and increases read bandwidth.
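
To make that concrete, the sketch below (the file path and replication value are purely illustrative, not from the original text) shows how an application could ask HDFS for a file's block locations and raise its replication factor through the FileSystem API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/hadoop/file.txt");   // illustrative path

    // Where do the blocks of this file live? MapReduce uses this information
    // to schedule tasks close to the data.
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset " + loc.getOffset() + " hosts: "
          + String.join(",", loc.getHosts()));
    }

    // Raise the replication factor for a frequently read file (default is 3).
    fs.setReplication(file, (short) 5);
    fs.close();
  }
}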

Anatomy of File Write by HDFS Client-





















1. The HDFS client creates input splits.
2. The client then goes to the NameNode, and the NameNode returns information about which DataNodes are to be selected.
3. The client then writes the data packets directly to the DataNodes; the NameNode itself never writes anything to the DataNodes.
4. Data is written to the DataNodes in a pipeline by the HDFS client.
5. Every DataNode returns an ack packet, i.e. an acknowledgement, back to the HDFS client (a non-posted write; all writes here are asynchronous).
6. The client closes the connection with the DataNodes.
7. The client confirms completion to the NameNode.
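
As a small illustration of the write path described above, here is a hedged sketch of the client-side code (the file path is illustrative); the DataNode pipeline and the acknowledgements all happen inside the FSDataOutputStream returned by create().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // create() asks the NameNode for target DataNodes; the stream then pushes
    // packets down the DataNode pipeline and collects the acks described above.
    FSDataOutputStream out = fs.create(new Path("/user/hadoop/output.txt")); // illustrative path
    out.writeUTF("hi how are you\n");
    out.close();   // flushes the last block and tells the NameNode the file is complete
    fs.close();
  }
}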

Anatomy of File Read by HDFS Client-




















1. The user asks the HDFS client to read a file, and the client passes the request to the NameNode.
2. The NameNode returns block information about which DataNodes hold the file.
3. The client then goes to read the data from the DataNodes.
4. The client reads data from the DataNodes in parallel, not in a pipeline (for fast access, and so that a failure of any one DataNode does not block the read).
5. Data is read from every DataNode where the file's blocks exist.
6. After reading is complete, the client closes the connection with the DataNode cluster.
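
And here is the matching read-path sketch (again with an illustrative path): open() fetches the block locations from the NameNode, and the stream then reads the bytes from the DataNodes directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // open() gets the list of DataNodes holding each block from the NameNode;
    // the actual bytes are streamed from the DataNodes, never through the NameNode.
    FSDataInputStream in = fs.open(new Path("/user/hadoop/file.txt")); // illustrative path
    IOUtils.copyBytes(in, System.out, 4096, false);
    IOUtils.closeStream(in);
    fs.close();
  }
}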

The JobTracker and TaskTracker come into the picture when we need to process a data set. In a Hadoop system there are five services always running in the background (called Hadoop daemon services).
Daemon Services of Hadoop-
  1. NameNode
  2. Secondary NameNode
  3. JobTracker
  4. DataNode
  5. TaskTracker

The first three services (1, 2, 3) can talk to each other, and the other two services (4, 5) can also talk to each other. The NameNode and DataNodes also talk to each other, as do the JobTracker and TaskTrackers.























Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The Job Tracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser.

If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

JobTracker and TaskTrackers Work Flow-



















1. The user copies all input files to the distributed file system using NameNode metadata.
2. Jobs are submitted to the client, to be applied to the input files stored in the DataNodes.
3. The client gets information about the input files to be processed from the NameNode.
4. The client creates splits of all files for the job.
5. After splitting the files, the client stores metadata about this job in DFS.
6. Now the client submits the job to the JobTracker.
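
Step 6, the hand-off from the client to the JobTracker, corresponds roughly to a single call in the old mapred API. The sketch below is illustrative only and reuses the WordMapper and WordReducer classes that are developed later in this tutorial.

package com.doj.hadoop.driver;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitJobSketch.class);
    conf.setJobName("word count");
    conf.setMapperClass(WordMapper.class);    // mapper class shown later in this tutorial
    conf.setReducerClass(WordReducer.class);  // reducer class shown later in this tutorial
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // This is the hand-off of step 6: the client computes the splits, copies the
    // job resources to DFS, and submits the job to the JobTracker.
    JobClient.runJob(conf);
  }
}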




















7. Now the JobTracker comes into the picture and initializes the job in the job queue.
8. The JobTracker reads the job files from DFS submitted by the client.
9. Now the JobTracker creates maps and reduces for the job, and the input splits are applied to the mappers. There are as many mappers as there are input splits. Every map works on an individual split and creates output.
















10. Now the TaskTrackers come into the picture: tasks are submitted to every TaskTracker by the JobTracker, which receives a heartbeat from every TaskTracker to confirm it is working properly. This heartbeat is sent to the JobTracker every 3 seconds by each TaskTracker. If a TaskTracker does not send a heartbeat within 3 seconds, the JobTracker waits another 30 seconds, after which it considers that TaskTracker to be in a dead state and updates the metadata about it.
11. The JobTracker picks tasks from the splits.
12. It assigns them to TaskTrackers.



















Finally all TaskTrackers create outputs, and as many reducers are generated as there are outputs created by the TaskTrackers. After that, the reducers give us the final output.

Introduction to MapReduce

In this Hadoop tutorial we introduce MapReduce: what MapReduce is, and how big data was analyzed before MapReduce. Please look at the following picture.



 










Here the big data is split into equal-sized pieces, and each piece is searched using the Linux grep command to match some specific pattern, such as the highest temperature in a large weather data set. But this approach has some problems, as follows.

Problems with the traditional way of analysis-

1. Critical path problem (the amount of time taken to finish the job without delaying the next milestone or the actual completion date).
2. Reliability problem
3. Equal split issues
4. A single split may fail
5. Sorting problem

To overcome all these problems, Hadoop introduced MapReduce for analyzing such amounts of data quickly.






















What is MapReduce
  • MapReduce is a programming model for processing large data sets.
  • MapReduce is typically used to do distributed computing on clusters of computers.
  • The model is inspired by the map and reduce functions commonly used in functional programming.
  • Function output is dependent purely on the input data and not on any internal state, so for a given input the output is always guaranteed.
  • The stateless nature of the functions guarantees scalability.
Key Features of MapReduce Systems
  • Provides Framework for MapReduce Execution
  • Abstracts the developer from the complexity of distributed programming
  • Partial failure of the processing cluster is expected and tolerable
  • Redundancy and fault-tolerance are built in, so the programmer doesn't have to worry about them
  • MapReduce Programming Model is Language Independent
  • Automatic Parallelization and distribution
  • Fault Tolerance
  • Enable Data Local Processing
  • Shared Nothing Architecture Model
  • Manages inter-process communication
MapReduce Explained
  • MapReduce consists of 2 phases or steps
    • Map
    • Reduce
The "map" step takes a key/value pair and produces an intermediate key/value pair.

The "reduce" step takes a key and a list of the key's values and outputs the final key/value pair.
















  • MapReduce Simple Steps
    • Execute map function on each input received
    • Map Function Emits Key, Value pair
    • Shuffle, Sort and Group the outputs
    • Executes Reduce function on the group
    • Emits the output per group
Map Reduce WAY-














1. Very large data is converted into splits.
2. The splits are processed by mappers.
3. A partitioning function operates on the output of the mappers.
4. After that the data moves to the reducers, which produce the desired output.

Anatomy of a MapReduce Job Run-
  • Classic MapReduce (MapReduce 1) 
    A job run in classic MapReduce is illustrated in following Figure. At the highest level, there
    are four independent entities:
    • The client, which submits the MapReduce job.
    • The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
    • The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.

































  • YARN (MapReduce 2) 
    MapReduce on YARN involves more entities than classic MapReduce. They are:
    • The client, which submits the MapReduce job.
    • The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
    • The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
    • The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
    The process of running a job is shown in following Figure and described in the following sections.





MapReduce Flow Chart Sample Example


In this MapReduce tutorial we will explain a sample MapReduce example with its flow chart, and show how MapReduce handles a job.

A SIMPLE EXAMPLE FOR WORD COUNT
  • We have a large collection of text documents in a folder. (Just to give a feel for the size: we have 1000 documents, each with an average of 1 million words.)
  • What we need to calculate:
    • The frequency of each distinct word in the documents.
  • How would you solve this using a simple Java program?
  • How many lines of code would you write?
  • How long would the program take to execute?
To overcome the problems listed above in a few lines of code, we use a MapReduce program. Now let us look at the MapReduce functions below to understand how they work on a large dataset.

MAP FUNCTION
  • Map functions operate on every key/value pair of the data, applying the transformation logic provided in the map function.
  • A map function always emits a key/value pair as output
       Map(Key1, Value1) --> List(Key2, Value2)
  • A map function transformation is similar to a row-level function in standard SQL
  • For each file
    • The map function is:
      • Read each line from the input file
        • Tokenize and get each word
          • Emit (word, 1) for every word found
The emitted (word, 1) pairs form the list that is output from the mapper.

So who ensures that the file is distributed and that each line of the file is passed to the map function? The Hadoop framework takes care of this; there is no need to worry about the distributed system.

REDUCE FUNCTION
  • Reduce functions take the list of values for every key and transform the data based on the (aggregation) logic provided in the reduce function.
  • Reduce Function
        Reduce(Key2, List(Value2)) --> List(Key3, Value3)
  • Reduce functions are similar to aggregate functions in standard SQL

For the List(Key, Value) output from the mapper, shuffle and sort the data by key, then group by key and create the list of values for each key.
  • The reduce function is:
    • Read each key (word) and the list of values (1, 1, 1, ...) associated with it.
      • For each key, add up the list of values to calculate the sum.
        • Emit (word, sum) for every word found.
So who ensures the shuffle, sort, group by, etc.?

MAP FUNCTION FOR WORD COUNT
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    context.write(word, one);
  }
}

REDUCE FUNCTION FOR WORD COUNT
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  context.write(key, new IntWritable(sum));
}

ANATOMY OF A MAPREDUCE PROGRAM












FLOW CHART OF A MAPREDUCE PROGRAM
Suppose we have a file of about 200 MB with content as follows:

-----------file.txt------------
_______File(200 MB)____________
hi how are you
how is your job (64 MB) 1-Split
________________________________
-------------------------------
________________________________
how is your family
how is your brother (64 MB) 2-Split
________________________________
-------------------------------
________________________________
how is your sister
what is the time now (64 MB) 3-Split
________________________________
-------------------------------
_______________________________
what is the strength of hadoop (8 MB) 4-Split
________________________________
-------------------------------

In the above file we have divided the content into 4 splits: three splits of 64 MB each and a fourth split of 8 MB.

Input File Formats:
----------------------------
1. TextInputFormat
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
4. SequenceFileAsTextInputFormat 
------------------------------
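
With the old mapred API used later in this tutorial, the input format is chosen on the JobConf. A minimal, illustrative sketch:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputFormatChoice {
  public static void main(String[] args) {
    JobConf conf = new JobConf(InputFormatChoice.class);

    // Default: every line of a split becomes a (byte offset, line text) record.
    conf.setInputFormat(TextInputFormat.class);

    // For files whose lines are already "key<TAB>value" pairs, this could be
    // used instead:
    // conf.setInputFormat(org.apache.hadoop.mapred.KeyValueTextInputFormat.class);
  }
}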



Let's look at another figure to understand the MapReduce process.

MapReduce Programming Hello World Job


In this Hadoop and MapReduce tutorial we will see how to create a hello world job and what the steps are for creating a MapReduce program. The steps are as follows.

Step1- Creating a file
J$ cat>file.txt
hi how are you
how is your job
how is your family
how is your brother 
how is your sister
what is the time now 
what is the strength of hadoop

Step2- loading file.txt from local file system to HDFS
J$ hadoop fs -put file.txt  file

Step3- Writing programs
  1.  DriverCode.java
  2.  MapperCode.java
  3.  ReducerCode.java

Step4- Compiling all above .java files

J$ javac -classpath $HADOOP_HOME/hadoop-core.jar *.java

Step5- Creating jar file

J$ jar cvf job.jar *.class

Step6- Running the above job.jar on the file (which is in HDFS)

J$ hadoop jar job.jar DriverCode file TestOutput

Let's start with the actual code for the steps above.

Hello World Job -> WordCountJob

1. DriverCode (WordCount.java)
package com.doj.hadoop.driver;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author Dinesh Rajput
 */
public class WordCount extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("Word Count");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(WordMapper.class);
    conf.setCombinerClass(WordReducer.class);
    conf.setReducerClass(WordReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // JobConf has no waitForCompletion(); with the old API the job is run via JobClient
    RunningJob job = JobClient.runJob(conf);
    return job.isSuccessful() ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new WordCount(), args);
    System.exit(exitCode);
  }
}
2. MapperCode (WordMapper.java)
package com.doj.hadoop.driver;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * @author Dinesh Rajput
 */
public class WordMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);

  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

3. ReducerCode (WordReducer.java)
package com.doj.hadoop.driver;

/**
 * @author Dinesh Rajput
 */
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Hadoop Configuration

Hadoop configuration has to be done in the following layers.


  • HDFS Layer
    • NameNode-Master
    • DataNode-Store Data(Actual Storage)
  • MapReduce Layer
    • JobTracker
    • TaskTracker
  • Secondary NameNode - stores a backup of the NameNode metadata; it does not work as an alternate NameNode, it just stores the NameNode metadata

Types of Hadoop Configurations
  • Standalone Mode
    • All processes run as a single process
    • Preferred in development
  • Pseudo Cluster Mode
    • All processes run as different processes but on a single machine
    • Simulates a cluster
  • Fully Cluster Mode
    • All processes run on different boxes
    • Preferred in production

Important files to be configured
  • hadoop-env.sh (sets the Java environment and logging files)
  • core-site.xml (configures the NameNode)
  • hdfs-site.xml (configures the DataNodes)
  • mapred-site.xml (MapReduce configuration, responsible for configuring the JobTracker and TaskTrackers)
  • yarn-site.xml
  • masters (file configured on each DataNode telling it about its NameNode)
  • slaves (file configured on the NameNode telling it which DataNode slaves it has to manage)
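
As a small hedged sketch (the /etc/hadoop/conf paths are hypothetical; in practice these files are usually picked up automatically from the classpath), this is how a Java client sees the values set in core-site.xml and hdfs-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ShowConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // These resources are normally picked up automatically from HADOOP_CONF_DIR;
    // adding them explicitly here just makes the lookup visible.
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));   // hypothetical path
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));   // hypothetical path

    // fs.default.name is the Hadoop 1.x name of the NameNode address property
    // (fs.defaultFS in later versions); dfs.replication comes from hdfs-site.xml.
    System.out.println("NameNode   : " + conf.get("fs.default.name", "not set"));
    System.out.println("Replication: " + conf.get("dfs.replication", "3"));
  }
}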



