Hadoop network design challenge

Filed in Big Data, Design Diagrams, Hadoop on November 5, 2011

I had a bit of fun recently working on a hypothetical network design for a large scale Hadoop implementation.  A friend of mine mentioned he was responding to a challenging RFP, and when I asked him more about it out of curiosity he sent me these requirements:

(4) Containers

In each Container:

(25) Racks

(64) Hadoop nodes per rack

(2) GE per Hadoop node

Try to stay under 15KW network power per container

Try to minimize cables between containers

Non-blocking end-to-end, across all 6400 servers in the (4) containers.

Application: Hadoop

Pretty crazy requirements, right?  Though certainly not at all unimaginable given the explosive growth and popularity of Big Data and Hadoop.  Somebody issued an RFP for this design.  Somebody out there must be serious about building this.

So I decided to challenge myself with these guiding requirements, and after a series of mangled and trashed PowerPoint slides (my preferred diagramming tool) I finally sorted things out and arrived at two designs.  Throughout the process I questioned the logic of such a large cluster with a requirement for non-blocking end-to-end.  After all, Hadoop was designed with awareness of and optimization for an oversubscribed network.  Read my Understanding Hadoop Clusters and the Network post if you haven’t already.

Such a large cluster built for non-blocking end-to-end is going to have a lot of network gear that consumes power, rack space, and money.  Money that could otherwise be spent on more Hadoop nodes, the one thing that actually makes the cluster useful.

UPDATE: After re-thinking these designs one more time (huge thanks to a sanity check from @x86brandon) and making some changes, I was pleased to see how large a non-blocking cluster I could potentially build with a lot less gear than I had originally thought.  Maybe massive non-blocking clusters are not so crazy after all ;-)

With that in mind I built two hypothetical designs just for fun and comparison purposes.  The first design meets the non-blocking end-to-end requirement. The second design is 3:1 oversubscribed, which in my opinion makes a little more sense.

Design #1A, Non-blocking (1.06:1)


Here we have (4) containers, each with 1600 dual-attached Hadoop nodes, with a non-blocking network end-to-end.  I’ve chosen the S60 switch for the top of rack because its large buffers are well suited for the incast traffic patterns and large sustained data streams typical of a Hadoop cluster.  The S60 has a fully configured max power consumption of 225W.  Not bad.  I wasn’t able to dig up nominal power, so we’ll just use max for now.

The entire network is constructed as a non-blocking L3 switched network with massive ECMP using either OSPF or BGP.  No need to fuss with L2 here at all.  The Hadoop nodes do not care at all about having L2 adjacency with each other (Thank God!).  Each S60 top of rack switch is an L2/L3 boundary and the (16) servers connected to it are in their own IP subnet.
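Just to make the subnet-per-ToR idea concrete, here is a tiny Python sketch that carves out one subnet per S60.  The /27 prefix length and the 10.0.0.0/16 supernet are my own assumptions for illustration, not part of the design:

```python
import ipaddress

# Purely illustrative: one IP subnet per S60 L2/L3 boundary.
# The /27 prefix and the 10.0.0.0/16 supernet are assumptions, not from the design.
containers, racks_per_container, tors_per_rack = 4, 25, 4
servers_per_tor = 16                      # 64 nodes per rack / 4 S60s, each dual-homed to one S60

subnets = ipaddress.ip_network("10.0.0.0/16").subnets(new_prefix=27)   # 30 usable hosts each

allocations = {}
for c in range(1, containers + 1):
    for r in range(1, racks_per_container + 1):
        for t in range(1, tors_per_rack + 1):
            allocations[f"c{c}-rack{r:02d}-s60-{t}"] = next(subnets)

print(len(allocations), "ToR subnets")              # 400
print("first:", allocations["c1-rack01-s60-1"])     # 10.0.0.0/27
```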

There will be (4) S60 switches in each rack, each with 32 server connections.  Each server will have a 2GE bond to just one S60 (simplicity).  Each S60 will have 3 x 10GE uplinks to a leaf tier of (5) Z9000 switches.  The leaf tier aggregates the ToR switches: each Z9000 in this tier will have (60) 10GE links connecting down to the S60 switches, and 16 x 40G links connecting to the next layer up, a spine of (16) Z9000 switches.  The spine provides a non-blocking interconnect for all the racks in this container, as well as a non-blocking feed to the other containers.  For power and RU calculations, the Z9000 is 2RU with a power consumption of just 800W.
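Here’s a quick back-of-envelope sketch (purely illustrative) of that port accounting, using only the numbers above:

```python
# Port accounting for Design #1A, using only the numbers in the text above.
nodes_per_rack, racks, s60_per_rack, ge_per_node = 64, 25, 4, 2
s60_uplinks_10g, leaf_z9000 = 3, 5

servers_per_s60 = nodes_per_rack // s60_per_rack        # 16 servers, 32 switch ports
s60_down_gbps = servers_per_s60 * ge_per_node           # 32GE of server bandwidth
s60_up_gbps = s60_uplinks_10g * 10                      # 30G of uplinks
print(f"ToR oversubscription: {s60_down_gbps / s60_up_gbps:.3f}:1")   # prints 1.067:1

tor_uplinks = s60_per_rack * racks * s60_uplinks_10g    # 300 x 10GE per container
leaf_down_10g = leaf_z9000 * 60                         # 300 x 10GE facing the S60s
leaf_up_40g = leaf_z9000 * 16                           # 80 x 40GE toward the spine
assert leaf_down_10g == tor_uplinks
print(f"leaf tier: {leaf_down_10g} x 10GE down, {leaf_up_40g} x 40GE up per container")
```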

The Z9000s in the spine layer will use (20) of their (32) non-blocking 40GE interfaces, leaving room in the non-blocking design to scale to a possible (6) containers and 9600 2GE nodes. Not bad!

At (4) containers I came in around 30KW of network power footprint per container. It’s tough to get even close to the 15KW. I’m not sure that’s a realistic number.
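For the curious, here’s roughly how that ~30KW figure can be reconstructed.  The assumption that the 16-switch spine is amortized evenly across the (4) containers is mine; the diagrams don’t pin down where the spine physically sits:

```python
# Back-of-envelope for the ~30KW figure, assuming the 16-switch spine is
# shared (amortized) evenly across the (4) containers.
S60_W, Z9000_W = 225, 800
s60s, leaf_z9000, spine_z9000, containers = 100, 5, 16, 4

per_container_w = s60s * S60_W + leaf_z9000 * Z9000_W + spine_z9000 * Z9000_W / containers
print(f"~{per_container_w / 1000:.1f} KW of network power per container")   # ~29.7 KW
```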

UPDATE: This design, “Option A”, is 1.06:1 oversubscribed, which is close enough to “non-blocking”.  Another option would be to have (3) S60 switches per rack, each with (4) 10GE uplinks.  This would increase the oversubscription to 1.07:1 and remove (25) S60 switches from each container, saving almost 6KW in power.  This is shown below as “Option B”.

Design #1B, Non-blocking (1.07:1)

In a nutshell, this design “Option B” reduces the power consumption by almost 6KW and the rack space by 25RU, at the expense of increasing the average oversubscription by .01 to 1.07:1.  Each of the 25 racks has (3) S60 switches, each with (4) 10GE uplinks.  (2) of the S60 switches in each rack will have 42 server connections; (1) will have 44.  Not exactly balanced, but if power is a big concern (it usually is), the minor imbalance and extra .01 of potential blocking may be a small price to pay.
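A quick sketch of the Option A vs. Option B trade-off, again purely back-of-envelope (the post’s 25RU figure implies 1RU per S60):

```python
# Option A vs. Option B per container (power figures from above, 1RU per S60).
S60_W = 225
opt_a_s60, opt_b_s60 = 4 * 25, 3 * 25                   # 100 vs. 75 ToR switches
print(f"saved: {(opt_a_s60 - opt_b_s60) * S60_W / 1000:.2f} KW and {opt_a_s60 - opt_b_s60} RU")

rack_down_gbps = 64 * 2                                 # 128GE of server bandwidth per rack
opt_b_rack_up_gbps = 3 * 4 * 10                         # 3 S60s x (4) 10GE uplinks
print(f"Option B rack oversubscription: {rack_down_gbps / opt_b_rack_up_gbps:.2f}:1")  # 1.07:1
print(f"per switch: {42 / 40:.2f}:1 and {44 / 40:.2f}:1")                              # 1.05 / 1.10
```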

Design #2, 3:1 oversubscribed


I decided to try out a design where each S60 ToR switch is 3:1 oversubscribed and connected to a non-blocking fabric that connects all ToR switches in all the racks, in all the containers.  There’s a big reason why I chose to have the oversubscription at the first access switch, and not in the fabric.  I wanted to strive for the highest possible entropy, for the goal of flexible workload placement across the cluster.  In this design pretty much every server is 3:1 oversubscribed to every other server in the cluster.  There’s very little variation to that, which makes it easier to treat the entire network as one big cluster, one big common pool of bandwidth.  Just like with the non-blocking design, where you place your workloads in this network makes very little difference.

Each S60 access switch has only (1) 10GE uplink, rather than (3), so fewer Z9000 switches are required to provide a non-blocking fabric.  This reduces the power consumed by network gear by about 5KW in each container and frees up 22RU of rack space compared to the previous non-blocking design, Option A.
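The access-layer math for this design, as a quick sketch.  I’m assuming each S60 still serves 16 dual-homed nodes (32GE of server bandwidth), as in Design #1A:

```python
# Access-layer sketch for Design #2, assuming each S60 still serves 16 dual-homed
# nodes (32GE of server bandwidth) as in Design #1A.
s60_down_gbps, s60_up_gbps = 32, 10              # (1) 10GE uplink per S60
print(f"ToR oversubscription: {s60_down_gbps / s60_up_gbps:.1f}:1")      # 3.2:1, roughly 3:1

s60_per_container = 4 * 25
print(f"ToR uplinks per container: {s60_per_container} x 10GE "
      f"(vs. {s60_per_container * 3} x 10GE in Design #1A)")             # 100 vs. 300
```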

Notice that non-blocking Design #1 Option B has a lower power and RU footprint than this 3:1 oversubscribed Design #2.

UPDATE: I updated Design #2 to use the user port stacking capability of the S60 switch.  I’ll stack just (2) switches at a time, thus allowing each Hadoop node’s 2GE bond to have switch diversity. This minimizes the impact to the cluster when the upstream Z9000 fails or loses power.

I’m not calling out the capital costs of the network gear in each design, but suffice to say Design #2 will have lower costs, freeing up capital for more Hadoop nodes and building a larger cluster.  Which works out nicely because with the oversubscribed design we’re able to quadruple the possible cluster size to (16) containers and 25600 2GE nodes.
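Here’s one way that (16)-container ceiling can be sanity-checked.  The leaf layout below is my assumption, not something spelled out in the diagrams: (2) leaf Z9000s per container, each using all (16) of its 40GE uplinks, into the same (16)-switch Z9000 spine as the non-blocking design:

```python
# My assumptions, not from the post: (2) leaf Z9000s per container, each using
# all (16) of its 40GE uplinks, into the same (16)-switch Z9000 spine as Design #1.
spine_ports_40g = 16 * 32                        # 512 x 40GE in the spine
uplinks_per_container = 2 * 16                   # 32 x 40GE of leaf uplinks per container
# leaf downlink check: 2 leaves x 60 x 10GE = 120 ports >= 100 S60 uplinks per container
containers = spine_ports_40g // uplinks_per_container
print(containers, "containers,", containers * 25 * 64, "nodes")   # 16 containers, 25600 nodes
```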

By the way, I’m aware that the current scalability limit of a single Hadoop cluster is about 4000 nodes.  Some Hadoop code contributors have said that forthcoming enhancements to Hadoop may increase scalability to over 6000 nodes.  And, who’s to say you can’t have multiple Hadoop clusters sharing the same network infrastructure?

So. Are you up to taking the Hadoop network design challenge?  How would you design the network? :-)

Cheers,

Brad


Disclaimer: The author is an employee of Dell, Inc. However, the views and opinions expressed by the author do not necessarily represent those of Dell, Inc. The author is not an official media spokesperson for Dell, Inc.

About the Author

Brad Hedlund is an Engineering Architect with the CTO office of VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center. Brad’s background in data center networking begins in the mid-1990s with a variety of experience in roles such as IT customer, systems integrator, architecture and technical strategy roles at Cisco and Dell, and speaker at industry conferences. CCIE Emeritus #5530.

Comments (31)


Sites That Link to this Post

  1. TRILLapalooza « The Data Center Overlords | April 28, 2012
  1. Duro says:

    Hi Brad,

    just few comments

    You have 32 server ports (1G x 32) and the uplink is “just” 30G. The uplink speed is lower (by 2G) than the sum of the connected hosts’ speeds. Is this considered non-blocking, or is 2G of uplink missing?

    It seems to me that the S60 has an option for only 2 x 10G interfaces, so it isn’t suitable for a non-blocking design.

    Regarding the upper scalability limit of a single Hadoop cluster: Igor Gashinsky, during an OpenFlow presentation, mentioned 10,000 hosts per cluster as a usual size (at the 3:07 mark)
    http://techfieldday.com/2011/yahoo-google-openflow-technology/
    so maybe 4000 isn’t the limit for Hadoop.

    I just realized that my favorite vendor doesn’t have 40G switches yet :(

    Duro

    • Brad Hedlund says:

      Hi Duro,

      The S60 switch comes with (4) 10G uplink ports, so it certainly is suitable for a non-blocking design.

      Yes, in Design #1 each S60 switch has 32GE downlink bandwidth and 30G uplink bandwidth. This amounts to a ratio of 1.06:1, which is, in effect, statistically non-blocking.

      At Yahoo, 4,000 Hadoop nodes has been the largest production cluster, with a target of 10,000 nodes provided some scalability enhancements to HDFS. If Yahoo has recently made those enhancements and deployed a 10,000 node cluster in production, that’s awesome news!

      Cheers,
      Brad

      • Duro says:

        Brad,

        Apologies for doubting the S60 10G ports; the Key Features section of the data sheet misled me.

        Thanks for another nice article about Hadoop’s limitations.

        What impresses me is that Cisco, HP, and Juniper don’t have a switch with 40G, but in contrast Force10, Arista, and Extreme Networks do. The Extreme BlackDiamond 8 (not shipping yet) will be a nice aggregation switch for a container and also an inter-container switch (less cabling).
        I would say that Force10 has the best products (right now) for this kind of container design; the rest of the vendors will release some 40G devices soon. The access switch could be replaced by another vendor’s, but the Z9000 is unique for now.

        Duro

        • Simon Leinen says:

          Cisco sure has some 40G capable switches: The Nexus 3064 has the same 4 QSFP (+48 SFP+) ports that all the other ToR switches of the brands you mention (and IBM and Brocade and …) have. And the Nexus 3016 has 16 QSFP ports. Not quite the 32 that the Z9000 has, but not too shabby either.

  2. Jason Edelman says:

    Brad, interesting stuff. I’m up for the challenge! Based on a Cisco infrastructure, which I’m sure you’re familiar with, let me give it a shot.

    Rack Design:

    2 x 5596

    Each 5596 will connect to 32 Hadoop nodes, using up 64 ports (1GE) downstream. Since these do support VPC, we can also dual-home the nodes across the 5Ks as an option. Not sure what Dell supports here, but it wasn’t a requirement anyway. Just an option. Secondly, we will use 6 ports (10GE) from each 5596 to connect northbound to the inter-rack, i.e. intra-container, layer. We have plenty of ports here, so we could use 7 or 8 uplinks to get to 1:1, but you used 3 uplinks for 32 x 1GE ports as well :).

    Intra-Container Design:

    2 x 7010

    Each rack has 12 x 10GE ports (6 per 5K) that will need to be inter-connected into this pair of 7010s. This gives us a requirement of 300 total ports at this layer (12 ports/rack * 25 racks). To keep things simple, we will split them evenly between both 7Ks. We will be using 150 ports (3 linecards) for all of these ports that will connect downstream to each rack. Unfortunately, the 7K isn’t shipping 40GE interfaces yet, so to match the port count going northbound to the inter-container layer and to remain non-blocking, we’ll add another 3 linecards. Not ideal for cabling, so I guess I get a ding for that requirement :).

    Each 7010 will have 2 x Supervisors and 8 x 48 Port 10GE blades (F2 blades to be exact). Only 6 blades required for North/South connectivity, but added 2 more for VPC required links.

    Inter-Container Design:

    2 x 7018
    Each container will have 300 x 10GE that will need to be inter-connected into this pair of 7018s. This means we’ll have 300 x 10GE ports per container x 4 containers, aka 1200 ports of 10GE required. Like in the intra-container layer, we’ll distribute these between two Nexus 7Ks. We’ll need 600 ports of 10GE per chassis, in other words, 13 x 48-port 10GE blades. Dual supervisors? Sure, we have enough space in the chassis! Just in case ISSU is preferred.

    Note: 7018s could be used at the Intra-Container layer as well. This could expand the scalability of the entire Hadoop implementation.

    In this design, L2/L3 at the 7010 layer. L3 boundary per rack using VPC on 7010s. If we use 2K/5K, we could go with L3 in the 5K, keeping in mind we’re limited at 80G throughput on the 5K.

    This far exceeds the designs I do day to day for my customers, but it was a fun exercise.

    How would you do it with Cisco? Feel free to poke holes in my design or modify to improve!

    Oh, by the way, I really didn’t focus on power all that much!

    Regards,
    Jason

    • Brad Hedlund says:

      Hey Jason,

      If I were designing it with Cisco, I would probably go with Catalyst 4948E or Nexus 3048 at the ToR layer. Fussing around with (128) GE SFPs per rack using Nexus 5596 would be (a) unnecessary, (b) costly, and (c) more power-hungry.

      In the 7010 layer you actually need (7) linecards per chassis for the North/South links, which should give you 36 ports left over for other stuff like VPC peer-links.
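      Here’s the quick math on that, assuming 48-port F2 cards and an even split of the container’s rack-facing ports across the two 7010s:

```python
import math

# Quick sketch: 48-port F2 cards, 300 x 10GE rack-facing ports per container,
# split evenly across the two 7010s, matched 1:1 northbound to stay non-blocking.
rack_ports = 12 * 25                        # 12 x 10GE per rack x 25 racks = 300
per_chassis_down = rack_ports // 2          # 150 rack-facing ports per 7010
per_chassis_up = per_chassis_down           # 150 northbound ports to stay non-blocking
cards = math.ceil((per_chassis_down + per_chassis_up) / 48)
print(cards, "F2 cards per 7010,", cards * 48 - (per_chassis_down + per_chassis_up),
      "ports left over")                    # 7 cards, 36 ports left over
```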

      Where would you locate the “Inter-Container” layer? Just gonna pick two lucky containers to get the 7018s? One other way might be to go with (4) 7009s at that layer, placing (1) in each container.

      To complete the challenge, Jason, you really should figure out what your approximate power and RU footprint is per container.

      Cheers,
      Brad

  3. Jason Edelman says:

    Yes, I will need to run through the different scenarios on power consumption to see what it would look like with the 3K vs. the 5K. However, I do like the 5K in terms of scalability/flexibility with FEX (though I guess that is sort of irrelevant without knowing future requirements); regardless, using all those SFPs would not be fun to manage.

    I like the idea of breaking apart the 2 x 7018s into 4 x 7009s. I was hoping to put those 7018s in more of a centralized location. I get the point now: they have to go into one (1) of those four (4) containers! Misunderstanding on my part there.

    Thanks for the catch on the linecard count. Must have been the time change affecting my math! Since I proposed 8, it wouldn’t have affected the sale though. ;)

    Thanks much.

    Jason

    • Ian Erikson says:

      Jason,

      How about this for your rack layer: (2) HP c-Class blade chassis with 32 BL2x220c G7 blades each. There are your 64 nodes. Then use dual Virtual Connect modules, with (4) 10Gb connections on each side. (That’s 32 nodes so far with 2Gb each; you are providing a bit more, with 80Gb total per chassis.) You need 25 racks of these, with 16 total connections per rack. (4) pairs of 5548s might do the trick, and I think you can get away with 5m twinax to uplink from the VC to the 5548.

      I will let you continue from there if you like it.

      • Brad Hedlund says:

        Ian,
        Blade servers are generally a non-starter for Hadoop due to a lack of DAS density and options. Pitching a blade server architecture to a serious Hadoop shop is one way to kill your credibility on the spot. Just sayin’.

        Cheers,
        Brad

        • ian erikson says:

          I can agree with that. I was just trying to throw out an option to get 64 nodes in one rack.

        • Adam Richardson says:

          Actually, that’s not entirely true, Brad.

          One of the advantages the HP c-Class has is its ability to support the MDS600 drive trays (70 drives in 5U, 4 trays per enclosure), which from a drive density perspective actually beats things like the SL line (and Dell DCS-type cookie-cutter servers).

          There are a few customers in Europe (where I’m based) who have Hadoop clusters running on c-Class (admittedly not large implementations), and the main reason is power efficiency. Quite simply, blades are more efficient, and in Europe power efficiency is king!

          In fact, for general HPC users the 2x220c is by far a better option. It’s more dense, more efficient, and the network savings are huge (especially when you are using InfiniBand, like most do).

          That being said, I will agree that most large clusters favour the SL line and other cookie-cutter servers, as in those worlds the extra management features that all blades have are unnecessary and add complexity (plus cost).

  4. Rik Herlaar says:

    Hi Brad ,

    Hope you’re well –
    Thanks for sharing this interesting exercise.
    Two things spring to mind when I read this.
    First off, the requirement for deeper buffers at ingress, due to the nature of Hadoop traffic patterns I assume (where some switches, especially the ones geared towards DCB use cases, do not necessarily sport deep queues per se). And secondly, how large are these data racks? Possibly due to my ineptitude in understanding how Hadoop nodes are counted, but if each node equals 1RU worth of compute power, you must be factoring in some mighty racks. So how do you cram so many switches and so many nodes into a single rack (as my understanding is most are standard 42U data racks)?

    Thanks in advance
    Rik

    • Brad Hedlund says:

      Hi Rik,
      These are standard size racks. There are quite a few rack server platforms out there that allow you to fit 64 servers in 32U of rack space. One example is the Dell C6100.
      Rack servers such as these with their density and massive DAS options make them well suited for Hadoop.

      Cheers,
      Brad

  5. Rik Herlaar says:

    So 4 compute nodes in a 2RU unit – the math makes sense again – thx.
    Would you say that the 1GE-per-node requirement (which in a sense drives certain connectivity inefficiencies, as I see it, vs. 10GE nodes) is due to the very distributed nature of Hadoop as such? While I understand that constraint to be meaningful in the light of specific use cases (social networks), I see it as a constraint for Enterprise use, where the nature of federation is likely less pronounced, as most enterprises have consolidated their DCs by now (albeit sometimes offset again by off-premise cloud bursting).

    Regards

    Rik

  6. Joe Deering says:

    Brad, great information. What about the need for remote connectivity? RAC or iLO ports. The idea was to save power and space by removing the KVM/KMMs.

    Both vendors’ BW needs would be 100M connections from every server.

    Also, as far as the 3rd option, how many commercial companies use oversubscribed designs for large Hadoop clusters?

    thank you

    • Brad Hedlund says:

      Hi Joe,

      I didn’t make any special considerations for out of band management to the servers. I just assumed the servers would be managed in-band. What’s wrong with in-band management?

      FWIW, most of the clusters I have seen that have been deployed as non-blocking have been small clusters, where deploying a non-blocking network didn’t add much incremental cost. The largest Hadoop clusters such as those deployed at Yahoo and Facebook are typically oversubscribed.

      Cheers,
      Brad

  7. Hadoop is rack aware, so the backplane should only be needed for:
    - ingress
    - block writes (which deliberately replicate across switches for availability)
    - merging output from map jobs; ideally “combined” first with a local presort to help direct output
    - egress
    - executing work that can’t be scheduled on-rack
    - rebalancing blocks across machines (background work)
    - replicating under-replicated blocks

    If you look at Facebook’s fair scheduling paper you can see how that scheduler (bundled with hadoop) delivers better machine and rack locality, so you should be able to get away with oversubscribing the backplane. If you are going 1:1, then you are missing the whole notion of topology awareness.

    The big risk for bandwidth is replication after a server outage, especially as the Hadoop versions in production need a node restart after a single disk fails; the new 0.20.205 and forthcoming 0.23 versions will only take out that single disk. That means when an HDD fails, there’s 1-3 TB of data to move round; Hadoop will probably go switch-local if that happens, so it’s not going to use up the backplane, as those missing 2TB worth of blocks will be scattered over the same and remote switches.

    A full 36TB outage will generate more traffic, and if the cluster is getting full, P(space-on-same-switch) will invariably decrease, generating more backplane traffic.

    The real nightmare is loss-of-switch events, either from a HW outage or router misconfiguration. With your design of having both 1GbE cables going to the same switch “for simplicity”, the loss of an S60 ToR switch will trigger the re-replication of 16 nodes’ worth of data, which at 36TB/server will be at least 576 TB. I say at least, as every block that was 2x replicated on those nodes generates 2x the traffic of a single block replication.

    While that replication is going on, the rest of the cluster (assuming it’s still under load) will be fielding its normal workload, yet now there is a lot more traffic hitting the namenode. If that metadata server gets overloaded enough to start missing heartbeats from other storage nodes, then it will assume that more servers have gone offline. This has been observed in the field, and has led to cascade failures. HDFS-599 fixes this in 0.23 (https://issues.apache.org/jira/browse/HDFS-599); until then, cascading liveness failures would worry me more than backplane oversubscription.

    I’d go for:
    - bonded 2x1GbE, routed to separate switches (if you use 2GbE)
    - or 10GbE once it’s on the mainboard, for faster intra-rack replication
    - a very rigorous router reconfiguration process, with reviews before changes are applied
    - someone implementing the discussed proposal for the NN to recognise the loss of a specific percentage of worker nodes as a network partitioning event and to drop into safe mode until the ops team can fix the problem, rather than trying to auto-recover

    Summary: I’d worry more about failure resilience than bandwidth. Hadoop is designed to be bandwidth-efficient, but the growing storage size per server means that the traffic from an outage is getting worse (though admittedly, so has bandwidth, as the first clusters were all 100 Mbit/s).

    see also:
    http://steveloughran.blogspot.com/2011/11/solving-netapp-open-solution-for-hadoop.html
    http://steveloughran.blogspot.com/2011/11/towards-topology-of-failure.html
    http://steveloughran.blogspot.com/2011/10/reading-availability-in-globally.html

  8. Marius Purice says:

    Hi, Brad,

    Apart from the buffer size, is there any other reason not to choose the F10 S55 instead of the S60? S55 switches have a maximum consumption of 130W, much lower than the 225W consumed by the S60. For 100 switches per container, as in design 1A, this would translate into savings of 9.5 kW per container, or 7.1 kW for design 1B.
    Based on the S55, in design 1B the maximum consumption per container should reach 75 x 130 + 9 x 800 = 16.9 kW.

    • Brad Hedlund says:

      Marius,
      You make an excellent point. Now that you mention it, I did immediately dismiss the S55 based on buffers. But perhaps with the emphasis on power consumption in the design requirements I should have given the S55 the consideration that a performance focused Hadoop customer would not. It would be helpful to see the delta in job completion times you would get with 4MB buffers vs. 1.25GB buffers in the top of rack (S55 vs. S60), to make a better value judgement. Such as running a network stressing Hadoop benchmark like Yahoo TeraSort.
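      For what it’s worth, the power delta you describe checks out (a quick sketch using the max power figures):

```python
# Quick check of the S55 vs. S60 power delta (max power figures from above).
S55_W, S60_W = 130, 225
for design, tor_count in (("1A", 100), ("1B", 75)):
    saved_kw = (S60_W - S55_W) * tor_count / 1000
    print(f"Design {design}: {saved_kw:.1f} KW saved per container")
# Design 1A: 9.5 KW, Design 1B: 7.1 KW
```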

      Cheers,
      Brad

      • Chaoz says:

        Hi, Brad,

        A. About the switch buffer concern in the design, can we make a better trade-off: use the S55 for Hadoop storage nodes, and the S60 for nodes that run MapReduce? The reason being, only the MapReduce nodes suffer from the incast issue.

        B. Besides the design with big-buffer (GE) switches, I think the other option is to have a high-density 10G stackable switch (S2410/4810) for the MapReduce nodes, in theory. And the leaf/spine links can be limited to just 1 or 2 x 10G, because of the nature of incast across all the nodes on that switch, in order to save cost. BTW, 10GE LOM is going to be very popular and its cost will drop dramatically soon, so IMHO this design would be more promising for the future.

        C. The Z9000 has a very tiny buffer; is that to say incast at the spine switch is not an issue? Just to confirm your theory with you.

        Comment or reply?

        Cheers,

        Chaoz

        • Brad Hedlund says:

          Chaoz,
          Excellent comments. Here are my thoughts:

          can we make a better trade-off: use the S55 for Hadoop

          The S55 vs S60 price/perf tradeoff for Hadoop is something I’ve been thinking about lately. There is a significant price difference between the two switches ($8K vs $12K list). This begs the question: How much better does an S60 perform vs S55 for different types of workloads and jobs (eg. ETL vs. word count)? Given that data, we might be able to assess on a case-by-case basis the importance of job completion times to a given cluster, and if that delta is worth the S60 price point. For example, if an ETL intensive job on the S55 completes in 30 minutes, vs. 22 minutes on the S60 — what is that worth in terms of meeting the business objectives?

          use the S55 for Hadoop storage nodes, and the S60 for nodes that run MapReduce? The reason being, only the MapReduce nodes suffer from the incast issue.

          Actually, in Hadoop each slave node is both a storage node (HDFS) and a compute node (Map Reduce). Unless we are talking about HBase, where you might not have any Map Reduce running in that cluster, rather HBase running on top of HDFS.

          10GE LOM is going to be very popular and its cost will drop dramatically soon, so IMHO this design [10GE Hadoop] would be more promising for the future.

          I agree. 10GE Hadoop is another thing on my mind lately. As you point out, cost effective 10GE LOM is coming this year. Also, you now have machines that are purpose-built for Hadoop and Big Data workloads, where you might have 12-24 disks and 12-16 cores in a 2RU machine (e.g. Dell PowerEdge C servers). With that much compute and storage density in one machine, 10GE starts to make sense. Also, I tend to believe that Enterprise customers just getting started with Hadoop may consider 10GE from the start — simply because they want to. They may already be buying 10GE for other environments (virtualization). The scale of a Hadoop cluster is much smaller in Enterprise vs. Web 2.0, so the price difference in 1GE vs 10GE nodes doesn’t have the same magnitude. Most of the folks that say 10GE Hadoop is too costly are coming from the Web 2.0 world where Hadoop was born, and where you have clusters of several thousand machines. The average Enterprise cluster in my experience is around 150-300 machines. Federal government clusters are usually larger, in the 300-1000 machine range.

          The Z9000 has a very tiny buffer; is that to say incast at the spine switch is not an issue? Just to confirm your theory with you.

          The Z9000 has 54MB buffers. That’s “very tiny”? Compared to what? At any rate, in Hadoop most of the buffers are absorbed by the ToR switch, not the next layer up. When I was at Cisco, we had a team that did some pretty extensive testing on Hadoop clusters and this was one of the key findings which they presented at Hadoop World 2011 in NYC. Granted, this was for 1GE clusters. Buffer load on a 10GE cluster was not tested.

          Cheers,
          Brad

          • Chaoz says:

            Hi, Brad,

            Thanks for the detailed comments. I pretty much agree with what you said here.
            Just some comments,

            “Actually, in Hadoop each slave node is both a storage node (HDFS) and a compute node (Map Reduce). Unless we are talking about HBase, where you might not have any Map Reduce running in that cluster, rather HBase running on top of HDFS.”
            - I thought it could be manipulated to limit the range of DN/TT nodes, but further study tells me this is not practical in the real world, the same as your comments.

            “The Z9000 has 54MB buffers. That’s “very tiny”? Compared to what? At any rate, in Hadoop most of the buffers are absorbed by the ToR switch,… ”
            - 54MB is surely not big compared to the total throughput. However, I fully understand the reasons behind that: cost, chipset limits, etc. Maybe when 10G Hadoop becomes the mainstream, we can expect more.

            Regards,

            Chaoz

  9. John Wu says:

    Hi Brad,

    The S60 has a huge 1.2GB buffer in total, or we could say 260MB per port. Do you think each GE port really needs such a large buffer, and is it really useful here? If they really used a 260MB buffer, wouldn’t the application have already timed out?

    Thanks,
    John

    • Brad Hedlund says:

      John,
      The S60 buffers are shared among all ports. Large buffers help in environments where you have bursty streams of data in the presence of speed-mismatched ports (1GE & 10GE), such as in IP storage or Hadoop. There is also a traffic condition called Incast, commonly found in Map Reduce implementations such as Hadoop, where bursty streams of data from multiple ports are simultaneously sent to one port. In either case, large buffers work to minimize packet loss. And with TCP, packet loss is what prevents you from achieving the best possible application throughput.
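      As a rough illustration with made-up numbers (not a measurement): say 20 mappers each burst a 256KB shuffle segment at the same reducer behind a single 1GE port:

```python
# Rough incast illustration; the sender count and segment size are arbitrary,
# and TCP windowing is ignored to keep the arithmetic simple.
senders, segment_bytes = 20, 256 * 1024
arriving_bytes = senders * segment_bytes        # ~5 MB aimed at one port at once
drain_rate_Bps = 1e9 / 8                        # a 1GE egress port drains 125 MB/s
drain_ms = arriving_bytes / drain_rate_Bps * 1000
print(f"{arriving_bytes / 2**20:.1f} MB of near-simultaneous arrivals, "
      f"~{drain_ms:.0f} ms to drain")
# Anything the egress queue can't hold during that burst gets dropped, and TCP
# retransmit timeouts on those drops are what hurt shuffle throughput.
```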

      • John Wu says:

        Hi Brad,

        I totally agree with the theory that a larger buffer will help Hadoop applications, but how does a normal 16MB ToR switch (Cisco) compare to the S60 or to other vendors’ huge-buffer switches based on the BRCM chip? 16MB is very small compared to 1.2GB if we just take a simple “numbers” point of view; 16 vs. 1200 doesn’t look like the same level, right? But you can’t say 1.2GB will give 100x better performance, from my understanding. Is there any application calculation model that can prove 16MB is not enough, and that we need at least 100MB~200MB in a 48GE + 4 x 10GE switch for Hadoop or other cloud application requirements?

        Thanks,
        John

  10. Amol K says:

    Hi,
    I want your guidance. I have an assignment to design a network for a Hadoop cluster.
    I am not sure what to do for this.
    I have to create four nodes: one name node and three data nodes.
    Can you help me?

  11. Yongwei Zhang says:

    Hi Brad,
    Nice post. Some comments:
    The Z9000 has 32 x 40G ports; each S60 (only) uses 30 x GE plus 3 x 10GE.
    To achieve 1:1 oversubscription, we need 12800 Gbps of throughput in the ECMP domain.
    For the spine layer, we then need 12800 / 40 = 320 x 40 Gbps interfaces, which equals 10 Z9000 switches (320 / 32).
    For the leaf layer, 12800 / (16 * 40) = 20, so we need 20 Z9000 switches.
    For the access layer, 12800 / 30 = 427, so we need 427 S60 switches.

  12. Hnet says:

    Hi Brad,
    For the TCP Incast problem, since the burst experienced at the switch is a transient micro-burst, what is required is a transient burst absorption capability rather than deep egress queues. Don’t you think a shaper with a deeper burst bucket could help absorb these kinds of bursts, rather than larger buffers on the egress queues?
