Understanding Hadoop Clusters and the Network

Filed in Big Data, Fabrics, Featured, Hadoop on September 10, 2011

This article is Part 1 in a series that will take a closer look at the architecture and methods of a Hadoop cluster, and how it relates to the network and server infrastructure.  The content presented here is largely based on academic work and conversations I’ve had with customers running real production clusters.  If you run production Hadoop clusters in your data center, I’m hoping you’ll provide your valuable insight in the comments below.  Subsequent articles will cover the server and network architecture options in closer detail.  Before we do that though, let’s start by learning some of the basics about how a Hadoop cluster works.  (In a hurry? Download this article here.) OK, let’s get started!

Server Roles in Hadoop

The three major categories of machine roles in a Hadoop deployment are Client machines, Master nodes, and Slave nodes.  The Master nodes oversee the two key functional pieces that make up Hadoop: storing lots of data (HDFS), and running parallel computations on all that data (Map Reduce).  The Name Node oversees and coordinates the data storage function (HDFS), while the Job Tracker oversees and coordinates the parallel processing of data using Map Reduce.  Slave nodes make up the vast majority of machines and do all the dirty work of storing the data and running the computations.  Each slave runs both a Data Node and a Task Tracker daemon that communicate with and receive instructions from their master nodes.  The Task Tracker daemon is a slave to the Job Tracker; the Data Node daemon is a slave to the Name Node.

Client machines have Hadoop installed with all the cluster settings, but are neither a Master nor a Slave.  Instead, the role of the Client machine is to load data into the cluster, submit Map Reduce jobs describing how that data should be processed, and then retrieve or view the results of the job when it’s finished.  In smaller clusters (~40 nodes) you may have a single physical server playing multiple roles, such as both Job Tracker and Name Node.  In medium to large clusters, each role will often run on its own dedicated server.

In real production clusters there is no server virtualization, no hypervisor layer.  That would only amount to unnecessary overhead impeding performance.  Hadoop runs best on Linux machines, working directly with the underlying hardware.  That said, Hadoop does work in a virtual machine.  That’s a great way to learn and get Hadoop up and running fast and cheap.  I have a 6-node cluster up and running in VMware Workstation on my Windows 7 laptop.


Hadoop Cluster

This is the typical architecture of a Hadoop cluster.  You will have rack servers (not blades) populated in racks, each rack connected to a top of rack switch, usually with 1 or 2 GE bonded links.  10GE nodes are uncommon but gaining interest as machines continue to get denser with CPU cores and disk drives.  The rack switch has uplinks connected to another tier of switches connecting all the other racks with uniform bandwidth, forming the cluster.  The majority of the servers will be Slave nodes with lots of local disk storage and moderate amounts of CPU and DRAM.  Some of the machines will be Master nodes that might have a slightly different configuration favoring more DRAM and CPU, less local storage.

In this post, we are not going to discuss various detailed network design options.  Let’s save that for another discussion (stay tuned).  First, let’s understand how this application works…


Hadoop Workflow

Why did Hadoop come to exist? What problem does it solve?  Simply put, businesses and governments have a tremendous amount of data that needs to be analyzed and processed very quickly.  If I can chop that huge mass of data into small chunks and spread it out over many machines, and have all those machines process their portion of the data in parallel — I can get answers extremely fast — and that, in a nutshell, is what Hadoop does.

In our simple example, we’ll have a huge data file containing emails sent to the customer service department.  I want a quick snapshot to see how many times the word “Refund” was typed by my customers.  This might help me to anticipate the demand on our returns and exchanges department, and staff it appropriately.  It’s a simple word count exercise.  The Client will load the data into the cluster (File.txt), submit a job describing how to analyze that data (word count), the cluster will store the results in a new file (Results.txt), and the Client will read the results file.


Writing Files to HDFS

Your Hadoop cluster is useless until it has data, so we’ll begin by loading our huge File.txt into the cluster for processing.  The goal here is fast parallel processing of lots of data.  To accomplish that I need as many machines as possible working on this data all at once.  To that end, the Client is going to break the data file into smaller “Blocks”, and place those blocks on different machines throughout the cluster.  The more blocks I have, the more machines will be able to work on this data in parallel.  At the same time, these machines may be prone to failure, so I want to ensure that every block of data is on multiple machines at once to avoid data loss.  So each block will be replicated in the cluster as it’s loaded.  The standard setting for Hadoop is to have (3) copies of each block in the cluster.  This can be configured with the dfs.replication parameter in the file hdfs-site.xml.
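For reference, the stanza in hdfs-site.xml that controls this would look something like the following (3 is also the default, so this is only needed if you want a different value):

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```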

The Client breaks File.txt into (3) Blocks.  For each block, the Client consults the Name Node (usually TCP 9000) and receives a list of (3) Data Nodes that should have a copy of this block.  The Client then writes the block directly to the Data Node (usually TCP 50010).  The receiving Data Node replicates the block to other Data Nodes, and the cycle repeats for the remaining blocks.  The Name Node is not in the data path.  The Name Node only provides the map of where data is and where data should go in the cluster (file system metadata).


Hadoop Rack Awareness

Hadoop has the concept of “Rack Awareness”.  As the Hadoop administrator you can manually define the rack number of each slave Data Node in your cluster.  Why would you go through the trouble of doing this?  There are two key reasons: data loss prevention, and network performance.  Remember that each block of data will be replicated to multiple machines to prevent the failure of one machine from losing all copies of its data.  Wouldn’t it be unfortunate if all copies of a block happened to be located on machines in the same rack, and that rack experienced a failure, such as a switch failure or power failure?  That would be a mess.  So to avoid this, somebody needs to know where Data Nodes are located in the network topology and use that information to make an intelligent decision about where data replicas should exist in the cluster.  That “somebody” is the Name Node.

There is also an assumption that two machines in the same rack have more bandwidth and lower latency between each other than two machines in two different racks.  This is true most of the time.  The rack switch uplink bandwidth is usually (but not always) less than its downlink bandwidth.  Furthermore, in-rack latency is usually lower than cross-rack latency (but not always).  If at least one of those two basic assumptions is true, wouldn’t it be cool if Hadoop could use the same Rack Awareness that protects data to also optimally place work streams in the cluster, improving network performance?  Well, it does! Cool, right?

What is NOT cool about Rack Awareness at this point is the manual work required to define it the first time, continually update it, and keep the information accurate.  If the rack switch could auto-magically provide the Name Node with the list of Data Nodes it has, that would be cool. Or vice versa, if the Data Nodes could auto-magically tell the Name Node what switch they’re connected to, that would be cool too.
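For a sense of what that manual work looks like: Hadoop is pointed (via the topology.script.file.name property in core-site.xml) at an administrator-written script that maps an IP address to a rack.  A toy Python sketch, assuming an addressing scheme where the third octet of the IP encodes the rack number:

```python
#!/usr/bin/env python
# Toy rack topology script.  Hadoop passes one or more IP addresses
# as arguments and expects a rack path for each on stdout.
# Assumption: 10.1.<rack>.<host> addressing, e.g. 10.1.2.15 is rack 2.
import sys

def resolve(ip):
    octets = ip.split(".")
    if len(octets) == 4:
        return "/rack-" + octets[2]
    # Anything we can't parse falls into the default rack.
    return "/default-rack"

if __name__ == "__main__":
    print(" ".join(resolve(ip) for ip in sys.argv[1:]))
```

The addressing convention here is made up; the real work is deciding a rack-friendly IP plan up front so the mapping stays this trivial.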

Even more interesting would be an OpenFlow network, where the Name Node could query the OpenFlow controller about a Node’s location in the topology.


Preparing HDFS Writes

The Client is ready to load File.txt into the cluster and breaks it up into blocks, starting with Block A.  The Client informs the Name Node that it wants to write File.txt, gets permission from the Name Node, and receives a list of (3) Data Nodes for each block, a unique list for each block.  The Name Node used its Rack Awareness data to influence the decision of which Data Nodes to provide in these lists.  The key rule is that for every block of data, two copies will exist in one rack, and another copy in a different rack.  So the list provided to the Client will follow this rule.
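To make the rule concrete, here’s a toy Python sketch of that placement decision as described above (the names and structure are mine, not HDFS code, and real placement also weighs node load and free disk space):

```python
import random

def choose_targets(racks):
    """Pick (3) Data Nodes for one block: the first replica in one
    rack, the second and third together in a different rack, so the
    block survives the loss of an entire rack.
    `racks` maps rack name -> list of Data Node names."""
    rack_a, rack_b = random.sample(sorted(racks), 2)
    first = random.choice(racks[rack_a])
    second, third = random.sample(racks[rack_b], 2)
    return [first, second, third]
```

However the random choices fall, the returned list always satisfies the two-racks rule, which is exactly the guarantee the Name Node gives the Client.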

Before the Client writes “Block A” of File.txt to the cluster it wants to know that all Data Nodes which are expected to have a copy of this block are ready to receive it.  It picks the first Data Node in the list for Block A (Data Node 1), opens a TCP 50010 connection and says, “Hey, get ready to receive a block, and here’s a list of (2) Data Nodes, Data Node 5 and Data Node 6.  Go make sure they’re ready to receive this block too.”  Data Node 1 then opens a TCP connection to Data Node 5 and says, “Hey, get ready to receive a block, and go make sure Data Node 6 is ready to receive this block too.”  Data Node 5 will then ask Data Node 6, “Hey, are you ready to receive a block?”

The acknowledgments of readiness come back on the same TCP pipeline, until the initial Data Node 1 sends a “Ready” message back to the Client.  At this point the Client is ready to begin writing block data into the cluster.


HDFS Write Pipeline

As data for each block is written into the cluster a replication pipeline is created between the (3) Data Nodes (or however many you have configured in dfs.replication).  This means that as a Data Node is receiving block data it will at the same time push a copy of that data to the next Node in the pipeline.

Here too is a prime example of leveraging the Rack Awareness data in the Name Node to improve cluster performance.  Notice that the second and third Data Nodes in the pipeline are in the same rack, and therefore the final leg of the pipeline does not need to traverse between racks and instead benefits from in-rack bandwidth and low latency.  The next block will not begin until this block is successfully written to all three nodes.


HDFS Pipeline Write Success

When all three Nodes have successfully received the block they will send a “Block Received” report to the Name Node.  They will also send “Success” messages back up the pipeline and close down the TCP sessions.  The Client receives a success message and tells the Name Node the block was successfully written.  The Name Node updates its metadata info with the Node locations of Block A in File.txt.

The Client is ready to start the pipeline process again for the next block of data.


HDFS Multi-block Replication Pipeline

As the subsequent blocks of File.txt are written, the initial node in the pipeline will vary for each block, spreading around the hot spots of in-rack and cross-rack traffic for replication.

Hadoop uses a lot of network bandwidth and storage.  We are typically dealing with very big files, Terabytes in size.  And each file will be replicated onto the network and disk (3) times.  If you have a 1TB file it will consume 3TB of network traffic to successfully load the file, and 3TB disk space to hold the file.
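That back-of-the-envelope math can be captured in a couple of lines of Python (a toy helper for illustration, not part of Hadoop):

```python
def load_cost(file_tb, replication=3):
    """Rough cost of loading a file into HDFS: every byte crosses
    the network once per replica written (Client to the first node,
    then node-to-node down the pipeline) and lands on disk once per
    replica."""
    return {"network_tb": file_tb * replication,
            "disk_tb": file_tb * replication}
```

So a 1TB file at the default replication of (3) comes out to 3TB of traffic and 3TB of disk, as above; drop replication to 2 and both numbers shrink accordingly.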


Client Writes Span Cluster

After the replication pipeline of each block is complete the file is successfully written to the cluster.  As intended, the file is spread in blocks across the cluster of machines, each machine having a relatively small part of the data.  The more blocks that make up a file, the more machines the data can potentially spread across.  More CPU cores and disk drives with a piece of my data mean more parallel processing power and faster results.  This is the motivation behind building large, wide clusters: to process more data, faster.  When the machine count goes up and the cluster goes wide, our network needs to scale appropriately.

Another approach to scaling the cluster is to go deep. This is where you scale up the machines with more disk drives and more CPU cores.  Instead of increasing the number of machines you begin to look at increasing the density of each machine.  In scaling deep, you put yourself on a trajectory where more network I/O requirements may be demanded of fewer machines.  In this model, how your Hadoop cluster makes the transition to 10GE nodes becomes an important consideration.


Name Node

The Name Node holds all the file system metadata for the cluster, oversees the health of Data Nodes, and coordinates access to data.  The Name Node is the central controller of HDFS.  It does not hold any cluster data itself.  The Name Node only knows what blocks make up a file and where those blocks are located in the cluster.  The Name Node points Clients to the Data Nodes they need to talk to, keeps track of the cluster’s storage capacity and the health of each Data Node, and makes sure each block of data is meeting the minimum defined replica policy.

Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP handshake, using the same port number defined for the Name Node daemon, usually TCP 9000.  Every tenth heartbeat is a Block Report, where the Data Node tells the Name Node about all the blocks it has.  The block reports allow the Name Node to build its metadata and ensure (3) copies of each block exist on different nodes, in different racks.

The Name Node is a critical component of the Hadoop Distributed File System (HDFS).  Without it, Clients would not be able to write or read files from HDFS, and it would be impossible to schedule and execute Map Reduce jobs.  Because of this, it’s a good idea to equip the Name Node with a highly redundant enterprise class server configuration; dual power supplies, hot swappable fans, redundant NIC connections, etc.


Re-replicating Missing Replicas

If the Name Node stops receiving heartbeats from a Data Node it presumes it to be dead and any data it had to be gone as well.  Based on the block reports it had been receiving from the dead node, the Name Node knows which copies of blocks died along with the node and can make the decision to re-replicate those blocks to other Data Nodes.  It will also consult the Rack Awareness data in order to maintain the two copies in one rack, one copy in another rack replica rule when deciding which Data Node should receive a new copy of the blocks.

Consider the scenario where an entire rack of servers falls off the network, perhaps because of a rack switch failure, or power failure.  The Name Node would begin instructing the remaining nodes in the cluster to re-replicate all of the data blocks lost in that rack.  If each server in that rack had a modest 12TB of data, this could be hundreds of terabytes of data that needs to begin traversing the network.


Secondary Name Node

Hadoop has a server role called the Secondary Name Node.  A common misconception is that this role provides a high availability backup for the Name Node.  This is not the case.

The Secondary Name Node occasionally connects to the Name Node (by default, every hour) and grabs a copy of the Name Node’s in-memory metadata and the files used to store metadata (both of which may be out of sync).  The Secondary Name Node combines this information into a fresh set of files and delivers them back to the Name Node, while keeping a copy for itself.

Should the Name Node die, the files retained by the Secondary Name Node can be used to recover the Name Node.  In a busy cluster, the administrator may configure the Secondary Name Node to provide this housekeeping service much more frequently than the default setting of one hour.  Maybe every minute.


Client Read from HDFS

When a Client wants to retrieve a file from HDFS, perhaps the output of a job, it again consults the Name Node and asks for the block locations of the file.  The Name Node returns a list of each Data Node holding a block, for each block.

The Client picks a Data Node from each block list and reads one block at a time with TCP on port 50010, the default port number for the Data Node daemon.  It does not progress to the next block until the previous block completes.


Data Node reads from HDFS

There are some cases in which a Data Node daemon itself will need to read a block of data from HDFS.  One such case is where the Data Node has been asked to process data that it does not have locally, and therefore it must retrieve the data from another Data Node over the network before it can begin processing.

This is another key example of the Name Node’s Rack Awareness knowledge providing optimal network behavior.  When the Data Node asks the Name Node for the location of block data, the Name Node will check if another Data Node in the same rack has the data.  If so, the Name Node provides the in-rack location from which to retrieve the data.  The flow does not need to traverse two more switches and congested links to find the data in another rack.  With the data retrieved more quickly in-rack, the data processing can begin sooner, and the job completes that much faster.


Map Task

Now that File.txt is spread in small blocks across my cluster of machines I have the opportunity to provide extremely fast and efficient parallel processing of that data.  The parallel processing framework included with Hadoop is called Map Reduce, named after two important steps in the model: Map and Reduce.

The first step is the Map process.  This is where we simultaneously ask our machines to run a computation on their local block of data.  In this case we are asking our machines to count the number of occurrences of the word “Refund” in the data blocks of File.txt.

To start this process the Client machine submits the Map Reduce job to the Job Tracker, asking “How many times does Refund occur in File.txt?” (paraphrasing the Java code).  The Job Tracker consults the Name Node to learn which Data Nodes have blocks of File.txt.  The Job Tracker then provides the Task Trackers running on those nodes with the Java code required to execute the Map computation on their local data.  Each Task Tracker starts a Map task and monitors the task’s progress, providing heartbeats and task status back to the Job Tracker.

As each Map task completes, each node stores the result of its local computation in temporary local storage.  This is called the “intermediate data”.  The next step will be to send this intermediate data over the network to a Node running a Reduce task for final computation.
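Stripped of the Hadoop Java API, the Map side of our word count amounts to something like this toy Python sketch (the function name and return shape are mine; a real job implements Hadoop’s Mapper interface in Java):

```python
def map_task(block_text, target="Refund"):
    """Runs on each Data Node against its local block of File.txt,
    emitting a (word, 1) pair for every occurrence of the target
    word.  The returned list stands in for the 'intermediate data'
    a real Map task writes to temporary local storage."""
    return [(word, 1) for word in block_text.split() if word == target]
```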


What if Map Task data isn’t local?

While the Job Tracker will always try to pick nodes with local data for a Map task, it may not always be able to do so.  One reason for this might be that all of the nodes with local data already have too many other tasks running and cannot accept any more.

In this case, the Job Tracker will consult the Name Node whose Rack Awareness knowledge can suggest other nodes in the same rack.  The Job Tracker will assign the task to a node in the same rack, and when that node goes to find the data it needs the Name Node will instruct it to grab the data from another node in its rack, leveraging the presumed single hop and high bandwidth of in-rack switching.


Reduce Task computes data received from Map Tasks

The second phase of the Map Reduce framework is called, you guessed it, Reduce.  The Map tasks on the machines have completed and generated their intermediate data.  Now we need to gather all of this intermediate data to combine and distill it for further processing such that we have one final result.

The Job Tracker starts a Reduce task on any one of the nodes in the cluster and instructs the Reduce task to go grab the intermediate data from all of the completed Map tasks.  The Map tasks may respond to the Reducer almost simultaneously, resulting in a situation where you have a number of nodes sending TCP data to a single node, all at once.  This traffic condition is often referred to as “Incast” or “fan-in”.  For networks handling lots of incast conditions, it’s important that the network switches have well-engineered internal traffic management capabilities, and adequate buffers (not too big, not too small).  Throwing gobs of buffers at a switch may end up causing unwanted collateral damage to other traffic.  But that’s a topic for another day.

The Reducer task has now collected all of the intermediate data from the Map tasks and can begin the final computation phase.  In this case, we are simply adding up the total occurrences of the word “Refund” and writing the result to a file called Results.txt.
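The Reduce side of our word count boils down to a simple tally.  A toy Python sketch of the idea (not Hadoop’s Reducer API): the input is the lists of (word, count) pairs fetched over the network from each completed Map task, and the output is the final tally that gets written to Results.txt:

```python
def reduce_task(intermediate_from_maps):
    """Sum the (word, count) pairs collected from every completed
    Map task into one final result."""
    totals = {}
    for pairs in intermediate_from_maps:
        for word, count in pairs:
            totals[word] = totals.get(word, 0) + count
    return totals
```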

The output from the job is a file called Results.txt that is written to HDFS following all of the processes we have covered already; splitting the file up into blocks, pipeline replication of those blocks, etc.  When complete, the Client machine can read the Results.txt file from HDFS, and the job is considered complete.

Our simple word count job did not result in a lot of intermediate data to transfer over the network.  Other jobs, however, may produce a lot of intermediate data, such as sorting a terabyte of data, where the output of the Map Reduce job is a new set of data equal in size to the data you started with.  How much traffic you see on the network in the Map Reduce process is entirely dependent on the type of job you are running at that given time.

If you’re a studious network administrator, you would learn more about Map Reduce and the types of jobs your cluster will be running, and how the type of job affects the traffic flows on your network.  If you’re a Hadoop networking rock star, you might even be able to suggest ways to better code the Map Reduce jobs so as to optimize the performance of the network, resulting in faster job completion times.


Unbalanced Hadoop Cluster

Hadoop may start to be a real success in your organization, providing a lot of previously untapped business value from all that data sitting around.  When business folks find out about this you can bet that you’ll quickly have more money to buy more racks of servers and network for your Hadoop cluster.

When you add new racks full of servers and network to an existing Hadoop cluster you can end up in a situation where your cluster is unbalanced.  In this case, Racks 1 & 2 were my existing racks containing File.txt and running my Map Reduce jobs on that data.  When I added two new racks to the cluster, my File.txt data doesn’t auto-magically start spreading over to the new racks.  All the data stays where it is.

The new servers are sitting idle with no data, until I start loading new data into the cluster.  Furthermore, if the servers in Racks 1 & 2 are really busy, the Job Tracker may have no other choice but to assign Map tasks on File.txt to the new servers which have no local data.  The new servers need to go grab the data over the network.  As a result you may see more network traffic and slower job completion times.


Hadoop Cluster Balancer

To fix the unbalanced cluster situation, Hadoop includes a nifty utility called, you guessed it, balancer.

Balancer looks at the difference in available storage between nodes and attempts to bring them into balance within a certain threshold.  New nodes with lots of free disk space will be detected, and balancer can begin copying block data off nodes with less available space to the new nodes.  Balancer doesn’t run until someone types the command at a terminal, and it stops when the terminal is canceled or closed.

The amount of network traffic balancer can use is very low, with a default setting of 1MB/s.  This setting can be changed with the dfs.balance.bandwidthPerSec parameter in the file hdfs-site.xml.
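As a sketch (the value here is illustrative, expressed in bytes per second), raising the cap to 10MB/s in hdfs-site.xml would look like:

```xml
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
</property>
```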

The Balancer is good housekeeping for your cluster.  It should definitely be used any time new machines are added, and perhaps even run once a week for good measure.  Given the balancer’s low default bandwidth setting, it can take a long time to finish its work, perhaps days or weeks.

Wouldn’t it be cool if cluster balancing was a core part of Hadoop, and not just a utility?  I think so.


This material is based on studies, training from Cloudera, and observations from my own virtual Hadoop lab of six nodes.  Everything discussed here is based on the latest stable release of Cloudera’s CDH3 distribution of Hadoop. There are new and interesting technologies coming to Hadoop such as Hadoop on Demand (HOD) and HDFS Federations, not discussed here, but worth investigating on your own if so inclined.

If  you notice any errors or inaccuracies here please do not hesitate to let me know in the comments.  Furthermore, I hope that the folks who are running and managing real production clusters are so inclined to contribute their opinions and expertise below.


Download:

Slides – PDF

Slides and Text – PDF

Cheers,
Brad

About the Author

Brad Hedlund is an Engineering Architect with the CTO office of VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center. Brad’s background in data center networking begins in the mid-1990s with a variety of experience in roles such as IT customer, systems integrator, architecture and technical strategy roles at Cisco and Dell, and speaker at industry conferences. CCIE Emeritus #5530.

Comments (100)


  1. Steve Loughran says:

    This isn’t too bad an article, even if it creates excess worries about network setup.

    1. rack topology scripts are trivial to write if you get your racks set up with decent masks; you take the IP address and filter it slightly.

    2. those comments about 3x copies are only valid if you use the default replication factor of 3. That’s not mandatory. Some clusters run at 2x as they can replicate elsewhere, when you submit a job to the cluster that JAR can be replicated many more times, to stop JAR reads becoming a bottleneck.

    3. The other thing to consider is that a lot of that bandwidth you discuss is intra-rack; it’s backbone traffic that matters the most. That and IO bandwidth. If a 12 TB server goes down then the data on the backplane will be limited by how fast it can come off disks and go onto other ones, which depends on where the data is scattered.

    4. When you add new nodes to a cluster, you should crank up the rebalancer bandwidth to make it rebalance faster.

    5. Cluster ingress/egress can often be done with clients that connect at 10Gbps to the backplane, this increases upload speed without needing to move the (many) servers up from 1Gbe.

    Key point: because of its topology awareness, bandwidth isn’t that much of a problem in everyday use. If it is, something is wrong with your code, or you forgot about the shuffle stage. The one thing that ops teams are scared of is ToR switch failure (generating a rack’s worth of traffic), or a full network partitioning, which Hadoop doesn’t yet recognise as an emergency “take the cluster offline” event. What is worth mentioning is that off-cluster traffic can be high, so caching DNS servers and/or static hosts tables are often used there to help worker nodes locate the master nodes. The other issue is that you do have to plan to isolate the cluster from the rest of your network -no VLANs here, plan for rack growth, and during your cluster commissioning tests verify that the switch manufacturer -whoever they are- have made switches that really can handle every node lighting up their ports simultaneously. Classic “enterprise” switches may make assumptions about network usage that aren’t valid in this world.

  2. Srikanth says:

    Awesome! One of the best. Thank you.

  3. SaraB says:

    Steve, Great comments!

    I am a bit surprised at your suggestion to not use VLANs. Say you have branchOffice1 accessing rack1, branchOffice2 access racks 2-3 and so on. Wouldn’t you isolate their traffic and L2 domains via VLANs?

    i.e. Rack1 (nServers) —VLAN x—-ToR Switch—-VLAN x— Core Switch

    Also, any thoughts on VPN, Load balancers and firewall in the network design for hadoop?
    I would imagine one would require VPN one per rack, and a firewall to protect management ports of servers?

    Thank you.

    • Brad Hedlund says:

      Sara,
      I think Steve’s comment about “no VLANs here” was about how you should connect the cluster to the rest of your data center network. That it should be a clean Layer 3 hand-off between the cluster spine switches and the upstream network, providing isolation from any Layer 2 problems outside of the cluster. I agree with that 100%.

      Cheers,
      Brad

    • Steve Loughran says:

      @Sara – I meant you do need to have dedicated networking for your Hadoop cluster, instead of thinking you can share existing infrastructure just by setting up some VLANs. If you look how Yahoo! work, they even have caching DNS servers on their nodes because otherwise DNS traffic would overload the system -but that was for Web Indexing, which has to resolve the DNS address of every web site in the planet.

      As far as management ports go, I’d just have a separate 100 MBit/s Ether hooked up to the ILO ports. Low cost, isolated and better for things like BIOS upgrades. Speaking of which, you will also need some cluster management tooling that works at scale. That’s for the computer management; the Hadoop ports (50070 etc) are web pages. You could try and isolate them, but any code that runs in the cluster (i.e. MR jobs) could issue HTTP requests against the nodes and you’d be hard-pressed to stop them. Maybe the security releases help with that; I haven’t looked.

      -steve

      • Allen Wittenauer says:

        I (still) recommend local DNS caches regardless of whether you are doing web work or not. Hadoop is extremely DNS heavy vs. many other applications due to its distributed nature. Something such as not setting to give preference to IPv4 on nodes with IPv6 enabled but not used can lead to unintended DoS attacks on the upper infrastructure. CPU/mem requirements are tiny and benefits massive. The only real downside to a local DNS cache is that it is a) more content for your configuration management and b) one needs to make sure that the port usage doesn’t clash.

        (Regarding nscd as an alternative: if the grid is small enough, nscd might suffice, but even with tuning, my experiences lead to me to believe that it gets less efficient the bigger the cache. nscd is going to be busy caching user and group information, so might as well pass the host work off to something that specializes in it.)

      • dj says:

        Sorry to bring up an old posting, but I am trying to understand Hadoop and the replication process.

        Brad states “two copies in one rack, one copy in another rack replica rule when deciding which Data Node should receive a new copy of the blocks.”

        On Apache.org documents they state “replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack.”

        So is it two local and one remote or one local and two remote (same rack)?

        • Brad Hedlund says:

          dj,
          In my lab observations, the last two nodes in the replication pipeline were in the same rack. This is what is depicted in my “HDFS Write Pipeline” diagram above, and matches with the description you found in the Apache.org documentation.

  4. While Steve suggests that you don’t need to have your replication set to 3, its always a good idea to start out with this, especially if you’re new to Hadoop. (Local machine, local rack, remote rack.)

    Steve is right, the topology scripts are trivial. I even think there’s a Python example floating around. I would however recommend that you give your machines names in your setup, because they are a lot easier to remember than just the IP addresses. Also note that the topology script defines the topology of the rack, and it’s really a virtual topology. While it’s a good idea that the topology script follows your physical rack layout, it doesn’t necessarily have to…
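    For illustration, a minimal rack-topology script could look like the sketch below (the IP prefixes and rack names are assumptions; adapt them to your cluster). Hadoop invokes the script configured in topology.script.file.name with node addresses as arguments and expects one rack path per address on stdout.

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop rack-topology script.
# Hadoop calls it with one or more IPs/hostnames as arguments and reads
# one rack path per argument from stdout. The mapping below is hypothetical.
import sys

RACK_MAP = {
    "10.0.1": "/rack1",   # nodes 10.0.1.x live in rack 1
    "10.0.2": "/rack2",   # nodes 10.0.2.x live in rack 2
}
DEFAULT_RACK = "/default-rack"

def rack_of(addr):
    prefix = addr.rsplit(".", 1)[0]  # e.g. "10.0.1.5" -> "10.0.1"
    return RACK_MAP.get(prefix, DEFAULT_RACK)

if __name__ == "__main__":
    for addr in sys.argv[1:]:
        print(rack_of(addr))
```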

    To the point about Layer 2 VLANs, good planning should remove the need for this… All of your machines within the rack should be in both a contiguous IP address block and on the same ToR switch.
    If you’re building a large cluster, then you should hopefully plan out your machine room such that the racks are close to one another on the back plane. (Of course nothing always goes to plan and most machine rooms are built out on a ‘first come, first served’ layout.) :-)

    In this event, you may want to consider VLANs. I don’t know for sure because I wasn’t allowed to set one up and test it out. Note that not all switches can do Layer 2. For those that can, I have to ask whether Layer 2 takes precedence over Layer 3. That is to say, if I set up my Layer 2 VLAN, would its traffic between nodes defined within the VLAN take higher precedence than other traffic moving across the back plane? If the answer is no, then there’s no real benefit in setting up the VLAN. (This also assumes that you’ve got a lot of traffic through your network…)

    With respect to port bonding and 10GbE: the price of 10GbE has dropped, and some hardware manufacturers are now putting 10GbE on the motherboard, so you shouldn’t need port bonding. Remember that 1+1 doesn’t always equal 2. ;-) You also tie up half your ToR switch ports, limiting the number of nodes you can physically put in a rack, or you have to double up on your switches. In addition, you also have to be careful in how you handle your uplinks between switches. Even if you’re running 1GbE ports on a switch, if your uplink (trunk?) is only 1GbE, then you could be creating a bottleneck.
    Note that while the author works for Cisco, Arista and Blade Networks are pushing hard in the 10GbE ToR switch space. (And they also claim compatibility with Cisco hardware.)

    When you design your cluster, you have to be cognizant of both your cluster node configuration, and your network. As Steve points out, putting 12+TB of disk on each node would mean that if you lose a node, you can swamp your network as your cluster heals itself.

  5. Chakri says:

    Hi Brad,

    Thanks for the writeup. I forwarded the link to a couple of my colleagues. Very well composed for beginners. Especially the pictorial depictions: simply superb.

    Thank you
    Chakri

  6. Eric Ericson says:

    Brad, et al…

    Just a quick clarification question regarding client reads from HDFS:

    “The Client picks a Data Node from each block list and reads one block at a time with TCP on port 50010, the default port number for the Data Node daemon. It does not progress to the next block until the previous block completes.”

    So, since HDFS addresses each drive independently in a filesystem, and distributes the blocks across each for redundancy, wouldn’t the above stated behavior indicate that any non-Data-Node reads are limited to one spindle at a time? Given that the current Hadoop dogma is to purchase low-cost commodity hardware (with SATA drives), that would mean that any transfers out of the cluster will be capped at ~140 MB/s (the sustained read rate of a 7200 RPM SATA drive)?

    • Brad Hedlund says:

      Eric,
      Interesting point, though the value of a Hadoop cluster lies in how fast it can process the data inside the cluster. How fast you can pull the results out of the cluster is not the key problem being solved.

      • Mj says:

        Brad,

        If we are able to pull data out fast, won’t that help us process our data fast by leveraging better methods of parallel processing?

  7. Sudheer says:

    Brad:
    Excellent post with the helpful slides! Great explanation indeed.

    I was wondering if anyone here (or @Brad) has tested deployment of a Hadoop cluster that
    spans multiple data centers. I understand from the rack set-up documentation that HDFS treats
    it as a tree with a distance 1 between each layer of the tree (see https://issues.apache.org/jira/browse/HADOOP-692, especially its attachment: https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf). So if I configured a Hadoop cluster in 2 data-centers, with 2/3rd capacity in one DC and 1/3rd capacity in
    another DC – would that make up a good DR (Disaster Recovery) strategy and make it an Active-Active cluster? Or would MapReduce and HDFS reads degrade significantly to not warrant this kind of set up?

    Appreciate any insight into this data-center awareness/DR configuration.

    Thanks,
    Sudheer

  8. Joseph E. says:

    Brad. Great article! As a complete beginner to Hadoop, I have some questions to help me get my head around it all. Please forgive the lengthy questions and set up below.

    I wanted to pose some questions around the relationship between what I’ll term “Standard DC infrastructure” and Hadoop. Let’s define “Standard DC infrastructure” as the typical DC setup for applications on servers accessing storage on large arrays like EMC, Netapp.

    So, continuing on Brad’s example of “count the number of times ‘refund’ shows up in emails”. The source of that data must come from the corporate email servers, with mboxs on a storage array. When we want to engage our Hadoop cluster for an MR job, the client node has to somehow gather up all those emails into “file.txt”.

    Q1: Doesn’t this put a tremendous burden on the hadoop client in terms of processing, converting, gathering?

    Q2: Once the file.txt is built (assuming on the storage array), does the copying of such a file depend on the hadoop client actually existing in the data path for the copy? Put another way, does the Hadoop client machine have to first get data from the storage array, then initiate a copy process of its own to move it to the hadoop cluster?

    Q3: Once the file.txt is copied, replicated, and the MR job run, results returned, what happens to file.txt? Does it sit around waiting for a new job, or does it get flushed since the MR job is complete?

    Q4: If file.txt sits around after the job is done, and we want to run that job again next week but with additional, more recent data, can Hadoop simply grab the changes to file.txt (with new emails considered) instead of having to build a whole new file.txt from zero?

  9. Techno Tantrik says:

    I am a newbie to Hadoop. We have an application where multiple nodes (>600) generate data in parallel (each file 20MB, 5000 files total). Then we run computations on this data in parallel. The solution is data-parallel, but I don’t see how I can generate data in parallel, store it in HDFS (without merging, because each file stands on its own), and use that data in a single MapReduce job. Pretty much all of the examples I see on the net talk about a single file of data getting split up and injected into HDFS.

    Any pointers in the right direction are very helpful.

    Thank you for your help and Happy Holidays & New Year!
    TT

  10. Arjun Sood says:

    Hi Brad,
    Your article is simply brilliant. The language, along with the diagrams, makes it extremely lucid for newbies like me to understand. I’m a fourth-year computer engineering student from India, and all my classmates have used your page to understand Hadoop clusters throughout the year, it being part of our curriculum. It’s been extremely useful for vivas, unit tests, and the final exams as well. Just wanted to thank you very much from all of us here!

  11. Manish says:

    Brad, this is the best Hadoop documentation I’ve ever seen. Thanks for sharing!

  12. Shyam Kadari says:

    Brad, good information. I learned a good deal about networking and scaling Hadoop networks from your blogs.
    I came to your site searching for some information on how to scale when Hadoop jobs require several GB of static lookup data, but learned other things that are interesting.
    Can you point me to any information you may have come across?
    Thank you very much for sharing your knowledge and expertise.

  13. ercoppa says:

    Hi, I have a question about the pipeline during a write:

    The Client breaks File.txt into (3) Blocks. For each block, the Client consults the Name Node (usually TCP 9000) and receives a list of (3) Data Nodes that should have a copy of this block. The Client then writes the block directly to the Data Node (usually TCP 50010). The receiving Data Node replicates the block to other Data Nodes, and the cycle repeats for the remaining blocks. The Name Node is not in the data path. The Name Node only provides the map of where data is and where data should go in the cluster (file system metadata).

    How does the receiving Data Node replicate the block received? Does it wait until the whole block is received, or does it start the replication process before that (e.g. after the first packet of the block)?
    Based on your answer, what is the expected write throughput if the replication factor is, for example, 3? Suppose Data Nodes A, B, C are chosen for the pipeline: A on rack R1 with write bandwidth 50 MB/s (from the client); B and C both on the same rack R2, with write bandwidth 20 MB/s inter-rack (so from A->B) and write bandwidth 40 MB/s intra-rack (B->C). (By write bandwidth I mean an average of the network and disk write bandwidth.)

    Sorry for my poor English.

    Emilio.

    • Brad Hedlund says:

      Emilio,
      Based on observations in my lab, the first Data Node in the pipeline (A) begins replicating the data to the next node (B) almost immediately, while data is still being received from the Client. It is a similar situation with node (B): it too begins replicating data to the next node (C) almost immediately, while still receiving data from node (A). That is why it’s called a Pipeline. Data flows from node to node like water in a pipe.

      The next block of data does not begin until the Client receives confirmation that the previous block was successfully written to all Nodes selected for replication.
      In that sense, I would suspect that the write throughput of a block would not be any faster than the lowest-throughput link in the pipeline, such as perhaps the link that traverses from one rack to the next (inter-rack). The assumption being that inter-rack traffic flows will encounter more congestion than the non-blocking intra-rack traffic.

      On the other hand, the inter-rack bandwidth is usually N x 10Gig, whereas the intra-rack bandwidth is usually N x 1Gig. So, if the network utilization is light, the available inter-rack bandwidth may not be any less than the intra-rack bandwidth.

      Cheers,
      Brad
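      Emilio’s numbers can be checked with a quick back-of-the-envelope sketch: since data streams through every hop of the pipeline concurrently, the sustained throughput of one block is bounded by the slowest hop.

```python
# Back-of-the-envelope estimate: with a pipelined write, the sustained
# throughput of one block is bounded by the narrowest hop in the chain,
# because data flows through all hops concurrently.
def pipeline_throughput(hop_bandwidths_mb_s):
    # The slowest hop governs the steady-state rate of the whole pipeline.
    return min(hop_bandwidths_mb_s)

# Hops: Client -> A (50 MB/s), A -> B inter-rack (20 MB/s), B -> C intra-rack (40 MB/s)
print(pipeline_throughput([50, 20, 40]))  # -> 20
```

So under Emilio’s assumed bandwidths, the inter-rack hop at 20 MB/s would cap the write rate of each block.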

  14. ketan says:

    Hello Brad,

    It’s a really helpful article for a beginner like me. I am waiting for subsequent articles.

    Can you write an article on Hadoop Map Reduce programming?

    Thank You.
    Ketan

  15. Neeraj says:

    Hi All,

    I have one doubt: does the Hadoop cluster know which node comes under which rack, or do we need to configure it in some configuration file?

  16. Dung says:

    Hi brad,

    This article is very helpful for me. It is very easy for beginners to understand.

    Thank you,
    Dung.

  17. Narsi says:

    Yes, I am new to Hadoop. For days I have been searching Google for exactly this document. Thank you.

  18. vivek says:

    Hi Brad,
    It’s a really nice article for understanding the Hadoop flow.

    I am a beginner in Hadoop. As you said, you have a Hadoop lab on a virtual machine.
    Can you please give me brief info about this setup? I want to do the same setup on my Windows 7 laptop.

  19. Arun says:

    Excellent write-up and presentation, Brad! This is one link I would recommend to learn the end-to-end functionality of Hadoop. Also, do you have any presentations / articles on the Software stack of hadoop (connectors / monitors…)?

  20. Jue says:

    Excellent article about Hadoop basics. Thanks for writing it!

  21. karthika says:

    Brad,
    You gave such a great tutorial.

    Thanks

  22. ohm says:

    Great article Brad!
    Brad, as you mention, the client reads blocks sequentially from HDFS, so the read throughput will be very low. I came across the DistCp utility documentation; it performs parallel reads (using map/reduce internally).
    Is there any other way (utility/API) to get faster read throughput?
    Also, can we access an HDFS file from a data node? (Can a data node behave as an HDFS client?)

    Thanks!

  23. hadis says:

    Hi Brad,
    Excellent article about Hadoop concepts. Is it right that any NoSQL storage that uses HDFS could Map/Reduce?

  24. Jeff Rogers says:

    Hello Brad – I have storage questions… Are we talking about the cluster servers being hooked to a SAN for common fast storage, or internal disk in each of the clustered servers? While I appreciate all of your insight on network throughput, I am lost on IOPS expectations and shared storage. With all of the new flash and hybrid SAN technologies becoming more affordable for large databases, do you see any of that fitting into the Hadoop solution? Thank you for any thoughts/insight.

  25. Lelala says:

    What I do not understand is:
    How does the system guarantee data integrity & consistency across – let’s say – 40 machines? (Yes, this may be a bigger installation.)
    If everything is routed over the master node, somewhere there must be something like a “master table/list” that stores all the stuff…

  26. Mohammed Junaid K says:

    Brad,
    Thanks a lot !
    Very informative tutorial about Hadoop explained using a simple real world use-case !

  27. Ravi Kottu says:

    Hi Brad,
    Thank you!

    Awesome write-up and it gives high level information on Hadoop.
    As said in the first line, “This article is Part 1″. Can we expect you to post the next part of this article?

    Cheers,
    Ravi

  28. Shanky says:

    Thanks Brad,

    I am a beginner. I got to know a lot from your article. If you could have explained Map and Reduce with an example, it would have been even better.

  29. Krishna says:

    Hi Brad,

    It was a great writeup. Clear pictures and notes made it easy to understand. Looking forward to seeing more from you.

    Thanks,
    Krishna Vuppala

  30. One of the best, most detailed yet simple explanations of Hadoop and Map-Reduce paradigm. I would go ahead and link the same article to my own blog post on the same for the benefit of readers of my own blog.

    Thank you once again. Your effort is commendable.

  31. Pal says:

    Excellent write-up; thanks, and I really appreciate you making it available. I have some fundamental questions.
    (1) When we request to write a file to HDFS, how does HDFS know to split the file into smaller blocks (for example, 3 blocks)? Is there a default configuration where one can specify the block size?
    (2) Assuming that the file has been split into three blocks a, b, and c, does it write the three blocks to Data Node 1, Data Node 5, and Data Node 6 in parallel, or in serial, such that block (b) is written only after block (a) is successfully written to Data Node 1? If that is the case, with hundreds of blocks won’t the writing process be slow?

  32. Grace says:

    Under what conditions will a Data Node stop sending heartbeats to the Name Node? I am sure network problems, the system going down, etc. will certainly do that. But how about events such as:

    1: Data Node is running out of memory?
    2: Disks are full?

    Thanks a lot!

  33. SCOTT CHU says:

    Hi Brad,
    I just started to learn what Hadoop is.
    Your article “Understanding Hadoop Clusters and the Network” is the best for understanding Hadoop that I have ever read on the internet.
    It helps me a lot.

    thank you.

    but I have some questions as follows,

    Figure “Client Read from HDFS” & Figure “Data Node reads from HDFS”:
    in the right corner, metadata -> Blk B: should it be DN8,DN1,DN2 (not DN7,DN1,DN2)?

    Figure “Data Node reads from HDFS”:
    You try to explain that it could read a block from the same rack;
    should block B (orange line) point to “Data Node: 2″ instead of “Data Node: 3″?

    AND

    Is “Client” missing here?
    “Tell me the locations of Block A of File.txt” —> Client

    Best Regards,
    Scott Chu

  34. Manish says:

    Excellent Write Up. Thank you very much

  35. Rahul says:

    Thanks Brad!!!! That’s a great article for a heads-up! I have just started with Hadoop and was wondering what the replication policy would be in case the dfs.replication parameter is set to 4 (or more)!!!

    Thanks for answers in advance :)

  36. Sergio says:

    Excellent beginning for Hadoop. Thanks for your publication.

  37. Chandra K says:

    Hi Brad,
    I felt very happy while reading your article “Understanding Hadoop Clusters and the Network”; it is the best for understanding Hadoop.

    I just started to learn what Hadoop is and got a good idea.

    Is it possible to explain a real-time scenario, with a little bit of practice sessions, to understand Hadoop?

    Thanks a lot ,
    Regards,
    Chandra k

  38. Moon In Young says:

    It’s very helpful for understanding the Hadoop process.
    And very useful for me.

    I have a question about this system.
    In this system, the machines used are all the same level (cores, network bandwidth, etc.).

    Is there no bottleneck or weakness?

    Thanks again for your paper.

    And thanks in advance.

  39. Brijesh says:

    In Hadoop, suppose a request from a client comes in; how will the read and write operations be done?
    E.g. for a write operation, suppose there are 3 blocks of 64 MB and the last block finishes at 40 MB.
    Then does the next file start with a new block, or start from 41 MB?

  40. sravan says:

    awesome article,
    one of the best.

    thanks

  41. Pramod says:

    Hi Brad Hedlund,

    This is one of the best posts on Hadoop. Anybody can understand Hadoop basics and its internals (to some extent) by just going through this post. You have done an amazing job. Thanks a ton for posting this. Hope to see many more posts like this.

    Regards,
    Pramod

  42. Pruthvi says:

    Hi Brad,

    An excellent article for understanding the fundamentals of Hadoop. It made me understand the basic concepts well. Good work!

    Regards
    Pruthvi

  43. winneryong says:

    What tools did you use for the drawings?
    thanks

  44. ben says:

    Replication 3 means 3 copies of the data? And with the one in use, is all the data 4 copies?

  45. insider says:

    With HDFS RAID you can achieve a replication factor of 1.2, but more network and CPU are consumed and good data locality is lost.
    Facebook uses Reed–Solomon RAID in production, which saves them many PB of space.

  46. Sanguru says:

    Awesome Article. Really appreciate you taking the time to get this explained

  47. Shravan says:

    “For each block, the Client consults the Name Node (usually TCP 9000)”
    I am using Cloudera Hadoop with Cloudera Manager. TCP 9000 doesn’t run the Name Node.
    It is a bit confusing with the different ports that run the NameNode (8020/50070/8022). Which would the client connect to?

  48. Aneesh says:

    An excellent article to understand the fundamentals of HADOOP.

    Thanks,
    Aneesh

  49. Jags says:

    Great article Brad !!
    Simply Superb

  50. sushil says:

    Nice introduction in practical approach thanks a lot.
    But I can’t understand what in-rack bandwidth and low latency are. Why low latency? Can you please give me some example, or a real-life example?

  51. James says:

    Hi Brad,
    Thank you for this excellent article, the best I have ever read about Hadoop clusters. On another note, I wonder what software utility you used to make such clear diagrams in the above article?

  52. Abhishek says:

    Hello Brad, it’s awesome work…

  53. vishnu viswanath says:

    Thanks, great blog :)

    I would like to know 3 things.
    1) Which property in the Hadoop conf sets the time limit for considering a node dead?
    2) After detecting a node as dead, after how much time does Hadoop replicate its blocks to another node?
    3) If the dead node comes alive again, in how much time does Hadoop identify a block as over-replicated, and when does it delete that block?
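    On question 1, a hedged sketch: the Name Node’s dead-node timeout is commonly described as being derived from two configuration properties (the property names and defaults below are quoted from memory of the Hadoop documentation; verify them against your version).

```python
# Hypothetical sketch of how the Name Node's dead-node timeout is commonly
# described: 2 x the recheck interval plus 10 x the heartbeat interval.
# Defaults assumed: dfs.namenode.heartbeat.recheck-interval = 300000 ms,
# dfs.heartbeat.interval = 3 s. Verify both against your Hadoop version.
def dead_node_timeout_s(recheck_interval_ms=300000, heartbeat_interval_s=3):
    return 2 * (recheck_interval_ms / 1000) + 10 * heartbeat_interval_s

print(dead_node_timeout_s())  # -> 630.0 (i.e. 10.5 minutes)
```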

  54. SUREnder RAJA says:

    This is really nice. Can you please explain the main difference between the Name Node and the Job Tracker, and the main difference between the Data Node and the Task Tracker?

    • vijay ramisetty says:

      Name Node – the one which takes care of the HDFS metadata
      Data Node – the ones on which the data blocks are present
      Job Tracker – the master process which takes care of job launching and task (piece of a job) tracking
      Task Tracker – the one which does the tasks

      Difference:
      when talking in terms of data/HDFS, the NN is the master and the DNs are the slaves;
      when talking in terms of jobs/tasks/work, the JT is the master who gives work to the slave TTs.

  55. vijay ramisetty says:

    great job man :)

  56. Surender RAJA says:

    I have a file.txt that has 3 blocks (block a, block b, block c).
    file.txt is divided into 3 blocks; say, for example, the size of file.txt is 192 MB, then file.txt will be divided into 3 blocks of 64 MB each.
    How does Hadoop write these 3 blocks to the distributed cluster? My question is: does Hadoop write block a, block b, and block c in parallel? Or does block b have to wait for block a to be written to the cluster?
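    The article’s description suggests the blocks are written serially: the Client does not start block b until every replica of block a has been acknowledged. A minimal sketch of that behavior (the node names and 3-node target lists are illustrative):

```python
# Minimal simulation of the serial write behavior described in the article:
# the Client finishes the whole replication pipeline for one block before
# starting the next. Node names and placements are illustrative only.
def write_file(blocks, placements):
    """placements: dict mapping block name -> list of target Data Nodes."""
    log = []
    for block in blocks:                 # one block at a time, in order
        for node in placements[block]:   # pipeline through each replica target
            log.append((block, node))
        # only after all replicas ack does the loop advance to the next block
    return log

placements = {"A": ["dn1", "dn5", "dn6"], "B": ["dn2", "dn7", "dn8"]}
print(write_file(["A", "B"], placements))
```

Every replica write for block A appears in the log before any write for block B, mirroring the serial behavior the article describes.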

  57. navaz says:

    Hi Brad
    This is an excellent article; it cleared most of my doubts on Hadoop.
    I set up a cluster with 1 Name Node and 3 Data Nodes. I have a few clarifications to be made. It is mentioned in the article that “DN sends a heartbeat to NN every 3 sec and every 10th heartbeat is a block report and it uses TCP port 9000″. So the block report is every 30 sec? In my cluster I can see 2 different messages. One is a “heartbeat send” every 3 sec from DN to NN; the second is a “heartbeat request” every 0.3 sec from the DNs to the NN, which includes the task tracker status and uses TCP port 54311. After 10 heartbeat requests it sends “send heartbeat” again. Is this the TCP 3-way handshake? I can find a 3-way handshake when a client wants to write a file, i.e. from NN to DN1 (3-way handshake), DN1 to DN2 (3-way handshake), and DN2 to DN3 (3-way handshake). Then it sends “DFS Client NON_MAPREDUCE”. What is this message? Then it copies the file to DN1 via TCP 50010, and from DN1 to DN2 and DN2 to DN3. I am also able to see a “block received” message from the DNs to the NN.
    In the article: “The acknowledgments of readiness come back on the same TCP pipeline, until the initial Data Node 1 sends a “Ready” message back to the Client.”
    What is this “Ready” message? Can we see it if we capture the traffic using Wireshark? Also, could you please tell me what you mean by “TCP pipeline”, for which I haven’t found any reference material so far.
    Also, connection close happens in this sequence, if I am not wrong:
    DN3–close—->DN2—close—>DN1—close—->NM (client)

    Thank you
    Navaz

  58. Div says:

    Thank you so much for the awesome tutorial on how things work within Hadoop.

  59. Mahesh says:

    Hi Brad,

    Thank you for this excellent article, the best I have ever read about Hadoop clusters, but I have one doubt. When we move data of size 500 GB to HDFS, does the client have to split the data as per the block size?

    In the article, it is given as below:

    The Client breaks File.txt into (3) Blocks. For each block, the Client consults the Name Node (usually TCP 9000) and receives a list of (3) Data Nodes that should have a copy of this block. The Client then writes the block directly to the Data Node (usually TCP 50010).

  60. Manoj says:

    Truly one of the best Hadoop articles I have seen till now… the way you prepared it is awesome…

    thanks for sharing with us.

  61. Shiju Sathyadevan says:

    One of the best, I say!!!! Great work, mate.

  62. Ido Schacham says:

    Thanks for this excellent post! It really helped me to understand how Hadoop works when I first got into it. I just sent this post to a colleague as an example of how Hadoop can be written about in a clear and interesting way.

  63. Nada says:

    Hi,

    Thanks for this insightful article!
    I want to ask about setting up 6 virtual nodes. How can I do that on one laptop?

    Thanks

  64. Rajashekhar says:

    Really good info…
    Please share if you find some more information…

  65. Siva says:

    Fantastic article, very useful

  66. Murali says:

    Excellent article Brad!!.. thanks very much. Please post more, and regularly.

  67. Prabakaran says:

    Awesome, awesome article for people who are starting to work in Hadoop.

  68. Chandra says:

    Very nice article, Brad, on the cluster and MapReduce.
    Can I get links to your other publications?

  69. Ramakrishna says:

    Great article; it is very interesting. As a beginner, I have now got an idea of the architecture of Hadoop. Thanks a lot for your great contribution.

    I want to set up Hadoop on my personal laptop. Could you please suggest some links which could help me?

  70. triggergirl says:

    Perfect explanation with diagrams. Really crisp & simple.

  71. kiran says:

    hi! Brad,

    This is Kiran. It’s a very nice article and really helpful for a beginner like me.
    I have one doubt about Input Formats. Can you write an article on Hadoop Map Reduce programming?

    thank you,
    kiran.
