One of my accounts had an unfortunate network outage that lasted about an hour. The outage was caused by human error with VTP, but not in the classic way we have all heard about before, where a switch with a higher configuration revision number is plugged in and overwrites the VLAN database.

Here is what happened…

1) A CatOS access switch fails and is scheduled to be replaced by the network team.

2) The network team grabs a replacement switch off the shelf and configures it with the IP address, default gateway, SNMP strings and VTP domain name of the failed switch. In addition, the switch is configured as a VTP Server, which was the mistake. At this point the switch has a very low revision number.

3) The failed switch is removed and the replacement switch is put in its place. Once the new switch connects to the network it downloads the VTP configuration and syncs up its configuration revision. At this point everything is fine.

4) To restore the exact configuration of the failed switch, a Ciscoworks configuration restore job is launched. The Ciscoworks server does a stare and compare of the last archived config and starts configuring the switch.

5) In the process of configuring the switch, the Ciscoworks server deletes all VLANs except the ones needed by the switch (as called for in the config file). Since the switch is still a VTP Server, each deletion is advertised to the rest of the domain and the VLANs start disappearing across the campus. Network connectivity to the switch on the MGMT VLAN was lost before Ciscoworks could set the VTP mode back to Client or make any further configuration changes.

The customer had to manually recreate each VLAN at the intended VTP servers to restore the network.
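
The sequence above can be sketched as a toy simulation. This is only an illustration of the revision-number logic, not a real VTP implementation: the `Switch` and `Domain` classes, the switch names, the VLAN IDs and the starting revision of 57 are all made up for the example. It assumes that every switch in the domain accepts any advertisement carrying a higher revision, and that each VLAN deletion on a server bumps the revision by one.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Toy model of a switch in a VTP domain (hypothetical, for illustration)."""
    name: str
    mode: str  # "server" or "client"
    revision: int = 0
    vlans: set = field(default_factory=set)

    def receive(self, revision, vlans):
        # Assumption: any advertisement with a higher revision overwrites
        # the local VLAN database, regardless of this switch's mode.
        if revision > self.revision:
            self.revision = revision
            self.vlans = set(vlans)


class Domain:
    """All switches sharing one VTP domain name."""

    def __init__(self, switches):
        self.switches = switches

    def advertise(self, sender):
        for sw in self.switches:
            if sw is not sender:
                sw.receive(sender.revision, sender.vlans)

    def delete_vlan(self, sw, vlan_id):
        # Only a server may change the VLAN database.
        if sw.mode != "server":
            raise PermissionError(f"{sw.name} is a VTP {sw.mode}")
        sw.vlans.discard(vlan_id)
        sw.revision += 1
        self.advertise(sw)


def run_scenario(replacement_mode):
    campus_vlans = {1, 10, 20, 30, 40}  # made-up VLAN IDs
    core = Switch("core", "server", revision=57, vlans=set(campus_vlans))
    access = Switch("access-2", "client", revision=57, vlans=set(campus_vlans))
    # Step 2: shelf spare with a very low revision number.
    replacement = Switch("replacement", replacement_mode, revision=1, vlans={1})
    domain = Domain([core, access, replacement])

    # Step 3: the replacement joins and syncs up; nothing breaks yet.
    domain.advertise(core)

    # Step 5: the restore job deletes every VLAN the switch does not need.
    needed = {1, 10}
    for vlan_id in sorted(replacement.vlans - needed):
        try:
            domain.delete_vlan(replacement, vlan_id)
        except PermissionError:
            break  # a client refuses the change, so nothing propagates
    return core, access, replacement


if __name__ == "__main__":
    for mode in ("server", "client"):
        core, _, _ = run_scenario(mode)
        print(mode, "-> core revision", core.revision, "VLANs", sorted(core.vlans))
```

In this sketch, running the scenario with the replacement as a server leaves the core with only the VLANs the access switch needed, while running it as a client leaves the campus VLAN database untouched. The dangerous moment is not the initial connection but the first configuration change made after the revision numbers have synced.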

This is an unfortunate reminder that VTP really is a risky thing that should be turned off everywhere. Whatever administrative ease it provides does not offset the risks.