One of the key flexibilities Cisco ACI offers is the range of options for interconnecting multiple data centers to meet different business continuity and disaster recovery (BC/DR) requirements.
Before we go too far with these design options, let’s take a moment to define BC and DR, as these two labels are often used interchangeably when they should not be. We will then discuss how BC and DR can drive the design decision with regard to Cisco ACI Multi-Pod and Multi-Site.
Business continuity planning (BCP) is a practice or methodology that aims to build and govern a tested and validated plan to maintain the continuity of key business functions and operations before, during and following a disruptive event. This event could be a natural disaster, a human error or a technical system failure. We often hear about zero downtime; this actually refers to continuous availability, which is a subset of BC.
Note: the feasibility of continuous availability depends on the criticality of the systems and of the business. For instance, if the cost of a system’s downtime, such as in financial services, outweighs the cost of implementing continuous availability, then it is more feasible to consider redundant systems that help to achieve continuous availability.
Disaster recovery (DR), on the other hand, is part of BC and not another term for it. DR focuses on the immediate actions to contain the impact of an event (failure) on a system and the actions involved in recovering it.
Technically, you cannot build a justifiable BC/DR plan without identifying the potential threats the business may encounter. This is typically addressed by doing a thorough risk management and assessment exercise to identify the possible risks and their potential impact. This blog is not going to discuss risk management in detail; nonetheless, the key point here is that risk management is the starting point of any BC/DR planning.
Following the risk assessment stage, which focuses on the different potential risks the organization may encounter, the business impact assessment (BIA) comes into the picture. It makes more focused assessments tied to individual business functions and highlights the impact of losing any of these functions for a certain amount of time.
Following these assessments, you should be able to come up with a mitigation strategy that can ultimately be translated into more detailed actions and procedures to be used as the BC/DR plan.
The figure below summarizes the process described above (in an oversimplified way).
Taking the above into consideration, the impact magnitude is one of the key considerations when organizations build a BC/DR plan and perform the risk assessment and BIA. For example, with a natural disaster such as a flood, having two redundant data centers within the same city can still put the business at risk. In this case, the organization may start considering a new data center in a different geographical region.
A common example here is the way AWS structures its data centers, with multiple regions available across the globe. Each of these regions is in turn built from multiple interconnected local data centers that provide availability zones within the region.
To simplify these different DC distribution models and how Cisco ACI can address them, let’s consider the following sample scenarios:
One of the key design advantages of this scenario is that both DCs are located in buildings that are owned by the same organization and sit next to each other, so it should be easy to add direct, low-latency (short-distance) fiber links between them.
The Cisco ACI Multi-Pod design option is the best fit for these design requirements, as it offers Company-A a single pane of management for both physical DCs. At the same time, the two DCs are treated as separate fabrics from a control plane point of view (each Pod runs separate instances of the fabric control plane protocols [IS-IS, COOP, MP-BGP]), which isolates the control plane failure domains among the ACI Pods. The management plane, however, is still shared: any change takes effect across the two fabrics, because they are logically a single ACI fabric.
With this design option, Company-A has the ability to distribute its DC workload within a single logical availability zone residing in two physical data centers.
As discussed earlier in this blog, some organizations aim to achieve continuous availability for their services and applications. This BC goal requires taking into consideration the impact magnitude of certain threats that were identified during the risk assessment phase, such as a natural disaster that might hit a city. Consequently, a secondary DC in a different geographical area is essential to protect the organization from such a threat.
Company-B has an existing ACI-based DC and would like to integrate it with the new ACI DC in the other region, so that both are managed centrally without introducing any fate sharing across the data, control and management planes.
The best fit for such requirements is the Cisco ACI Multi-Site design option. With Cisco ACI Multi-Site, each DC has its own data, control and management planes, and each DC/site has its own APIC controllers. So, how can the sites be managed centrally if each DC fabric/site has its own APIC cluster?
With the Cisco ACI Multi-Site policy manager, DC operators can manage two interconnected ACI fabrics from a single pane of management, where they can monitor the health score of all the interconnected DC fabrics/sites. They can also define all the inter-site policies centrally and then push them to the different APIC domains, which render them on the local DC fabric/site. Most importantly, it offers clear control over when and where to push those policies. It therefore eliminates change/management plane fate sharing, which is what uniquely characterizes the Cisco ACI Multi-Site architecture.
As illustrated in the figure below, each DC in this design represents an availability zone for the organization’s services and applications.
Inter-site communication is achieved using MP-BGP EVPN to carry and exchange MAC and IP address information for the endpoints that communicate across sites, while inter-site VXLAN tunnels are used for data plane forwarding among the ACI sites. The underlay network between the interconnected DCs only needs to provide IP connectivity, with the ability to increase the MTU size to cater for the overhead of the VXLAN encapsulation.
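To make the MTU point concrete, here is a minimal sketch of the arithmetic, assuming an IPv4 underlay and the standard VXLAN header sizes. The function name and the example MTU values are illustrative only; check the actual value to configure against the platform documentation, which may recommend extra headroom.

```python
# Bytes that must fit inside the underlay IP MTU in addition to the
# endpoint's own IP packet (IPv4 underlay, no IP options).
INNER_ETHERNET = 14  # original (inner) Ethernet header carried as payload
INNER_DOT1Q = 4      # only if the inner frame keeps an 802.1Q tag
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20


def min_underlay_ip_mtu(endpoint_ip_mtu: int, inner_vlan_tag: bool = False) -> int:
    """Smallest underlay IP MTU that carries the encapsulated packet unfragmented."""
    overhead = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4
    if inner_vlan_tag:
        overhead += INNER_DOT1Q
    return endpoint_ip_mtu + overhead


if __name__ == "__main__":
    print(min_underlay_ip_mtu(1500))  # 1550
    print(min_underlay_ip_mtu(9000))  # 9050
```

In other words, the inter-site network needs at least 50 bytes of headroom on top of the largest packet the endpoints send.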
Note: with the ACI Multi-Site design model, the spine switch that receives the VXLAN packet translates the VNID value contained in the header to the locally significant VNID value associated with the same bridge domain. It then re-encapsulates the traffic and sends it to the local leaf node based on the destination host information (MAC/IP).
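The translation step in the note above can be pictured as two table lookups. The sketch below is a conceptual model only, not how the spine hardware or software is implemented: the class names, table layouts, VNID values and leaf names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class VxlanPacket:
    source_site: str  # site that originated the packet
    vnid: int         # VNID carried in the VXLAN header
    dst_mac: str      # destination endpoint MAC


class ReceivingSpine:
    """Toy model of a spine that receives inter-site VXLAN traffic."""

    def __init__(
        self,
        vnid_translation: Dict[Tuple[str, int], int],
        local_endpoints: Dict[Tuple[int, str], str],
    ) -> None:
        # (remote site, remote VNID) -> locally significant VNID
        self._vnid_translation = dict(vnid_translation)
        # (local VNID, endpoint MAC) -> local leaf that owns the endpoint
        self._local_endpoints = dict(local_endpoints)

    def forward(self, packet: VxlanPacket) -> Optional[Tuple[str, VxlanPacket]]:
        """Return (leaf, re-encapsulated packet), or None if a lookup misses."""
        local_vnid = self._vnid_translation.get((packet.source_site, packet.vnid))
        if local_vnid is None:
            return None
        leaf = self._local_endpoints.get((local_vnid, packet.dst_mac))
        if leaf is None:
            return None
        rewritten = VxlanPacket(packet.source_site, local_vnid, packet.dst_mac)
        return leaf, rewritten


if __name__ == "__main__":
    spine = ReceivingSpine(
        vnid_translation={("site-1", 16001): 15001},
        local_endpoints={(15001, "00:50:56:aa:bb:cc"): "leaf-101"},
    )
    print(spine.forward(VxlanPacket("site-1", 16001, "00:50:56:aa:bb:cc")))
```

The point of the model is that the same bridge domain can carry different VNID values in each site, and only the receiving spine needs to know the mapping.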
Note: ACI Multi-Pod was not selected to interconnect these two DCs in this scenario for two key reasons. First, from an HA/continuous availability and fate sharing point of view, Company-B requires isolated control, data and management planes; therefore, each DC has to be treated as a separate availability zone.
Second, from a sizing point of view, since each DC will have over 160 leaf switches, the combined count may exceed the maximum number of leaf switches supported across all Pods under a single APIC cluster.
This scenario is similar to scenario 2 above; however, each of the ACI fabrics consists of two Pods built using the Cisco ACI Multi-Pod concept.
In such a scenario, Cisco ACI Multi-Site can take it a step further and interconnect these four physical DCs. From a control plane perspective, the spines of each DC form MP-BGP EVPN sessions (eBGP or iBGP) with the spines of the other DCs in a full-mesh manner, or through BGP route reflectors (RRs) if iBGP is used among them and there is a large number of interconnected DCs. These sessions exchange host reachability information (MAC/IP).
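The reason route reflectors become attractive as the number of DCs grows is simple counting. The sketch below compares the number of BGP sessions in a full mesh with a route-reflector design; it treats each DC as a single BGP speaker for simplicity (real sites have several spines), and the function names are illustrative.

```python
def full_mesh_sessions(speakers: int) -> int:
    """Sessions needed when every speaker peers with every other speaker."""
    if speakers < 0:
        raise ValueError("speakers must be non-negative")
    return speakers * (speakers - 1) // 2


def route_reflector_sessions(speakers: int, reflectors: int) -> int:
    """Sessions needed when clients peer only with the route reflectors.

    The reflectors are counted among the speakers and are fully meshed
    with each other.
    """
    if not 1 <= reflectors <= speakers:
        raise ValueError("reflectors must be between 1 and the number of speakers")
    clients = speakers - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


if __name__ == "__main__":
    for count in (4, 10):
        print(count, full_mesh_sessions(count), route_reflector_sessions(count, 2))
```

With four DCs the difference is small (6 sessions against 5 with two reflectors), which is why a full mesh is fine at this scale; at ten DCs it is 45 against 17.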
As for the Cisco ACI Multi-Site policy manager cluster that provides the centralized management/control, its VMs can be placed in a single DC across different physical hosts, or can be distributed across the different DCs for a higher level of resiliency. In the latter case, take into consideration that there must be between 300 Mbps and 1 Gbps of available bandwidth (depending on the scale) for the Multi-Site policy manager clustering over the WAN. The good news is that the VMs of the ACI Multi-Site cluster communicate with each other over TCP connections, so if any drops occur in the WAN, the dropped packets will be retransmitted.
From a data plane perspective, inter-site VXLAN is used to carry the traffic seamlessly among the interconnected DCs.
This scenario is different from the ones described above, as it addresses a case where the two DCs are not both ACI-based. In such a scenario, OTV can be used as the L2 DCI solution, and each DC has to hand off the relevant L2 VLANs at the OTV edge to be extended to the other DC. This scenario is commonly referred to as the “dual/multi fabric” design. This DCI design option can be used to interconnect two ACI fabrics as well as an ACI fabric with a non-ACI-based DC. The key point here is that each DC must terminate its control/data plane at the OTV/DCI edge, and each DC has its own separate management plane. However, with this design option, the DCI technology used will determine how scalable and flexible the DCI can be in terms of control plane, L2 flooding, number of virtual networks and the amount of routing/host reachability information carried.
In contrast, Cisco ACI Multi-Pod and Multi-Site encapsulate the inter-DC traffic in VXLAN along with the VNID tags, and use MP-BGP EVPN as the control plane for L2/L3 host reachability information. This means there is no reliance on the DCI transport network to carry host routes/MACs or to provide path isolation for the different virtual networks; therefore, these options can support flexible and large-scale multi-DC designs.
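As a recap, the selection logic running through the scenarios can be sketched as a small decision helper. This is a deliberate simplification of the reasoning in this blog, not a Cisco sizing tool: the field names are made up for illustration, it does not model the combined Multi-Site plus Multi-Pod case, and the leaf limit is passed in by the caller because the supported maximum depends on the software release (the value 300 used below is a placeholder, not a real limit).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    all_sites_run_aci: bool
    needs_isolated_management_plane: bool
    total_leaf_switches: int


def pick_interconnect(req: Requirements, max_leaves_per_apic_cluster: int) -> str:
    """Map BC/DR-driven requirements to one of the design options discussed."""
    if not req.all_sites_run_aci:
        # A non-ACI DC is involved: hand off VLANs to a separate DCI technology.
        return "dual-fabric with L2 DCI (for example OTV)"
    if req.needs_isolated_management_plane:
        # No fate sharing allowed in the management plane: one APIC cluster per site.
        return "ACI Multi-Site"
    if req.total_leaf_switches > max_leaves_per_apic_cluster:
        # Too many leaf switches for a single APIC cluster.
        return "ACI Multi-Site"
    return "ACI Multi-Pod"


if __name__ == "__main__":
    placeholder_limit = 300  # hypothetical value, check the scalability guide
    company_a = Requirements(True, False, 40)
    company_b = Requirements(True, True, 320)
    print(pick_interconnect(company_a, placeholder_limit))
    print(pick_interconnect(company_b, placeholder_limit))
```

The order of the checks mirrors the argument above: the presence of a non-ACI DC rules out both ACI options, and management plane isolation or scale pushes the design from Multi-Pod to Multi-Site.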