How This Startup Built a Scalable and Secure Data Center from the Ground Up
Choosing a data center and network design is one of the biggest factors in a startup’s ability to grow in the future. Yet networking is one of the most overlooked areas of a company.
Until something goes wrong, of course. Then suddenly everyone’s aware of it.
So how do you build a data center that allows for future growth, while also keeping security in mind?
Should you build it yourself, or is it better to go with AWS or Google?
Or perhaps you want to opt for a hybrid option?
We went to Michael Starr, the Director and Product Owner of Secure Edge at Fortinet, with these questions. After years of experience as an infrastructure engineer, Mike has become a galvanizing force around networking and data center operations. He’s passionate about sharing his knowledge with others and empowering them to make the best choice for their company.
His experience at OPAQ, which Fortinet acquired in the summer of 2020, provides an inside look at the decision points behind choosing a networking design.
Can you tell us about the decision-making process behind the networking design for OPAQ?
OPAQ was a combination of three acquisitions. After those acquisitions, the network was in a state of disarray.
Our goal was to deliver a single web interface to build and maintain network security policies across all of our customers’ sites, which is what the market today calls Secure Access Service Edge, or SASE (pronounced “sassy”).
If you look at how security controls are inserted into the network, they typically add latency. Because that added latency is inherent, we had to design the network around it. When the packet comes in, it goes through a number of security controls and a network decision point. Then it goes back out. Depending on architecture decisions, sometimes you can see latency spikes of 20, 30, or 40 percent. Sometimes more.
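To make that arithmetic concrete, here is a minimal Python sketch. The baseline and per-control delays are made-up illustrative numbers, not OPAQ measurements, and the function name is hypothetical.

```python
def added_latency_percent(baseline_ms, control_delays_ms):
    """Return the percentage increase in latency caused by inline controls."""
    overhead_ms = sum(control_delays_ms)
    return 100.0 * overhead_ms / baseline_ms


# Hypothetical numbers: a 40 ms baseline path and three inline steps
# (for example a firewall, threat inspection, and a routing decision).
baseline_ms = 40.0
control_delays_ms = [4.0, 5.0, 3.0]

increase = added_latency_percent(baseline_ms, control_delays_ms)
print(f"Added latency: {increase:.0f}% over baseline")
```

With these assumed values, 12 ms of inspection on a 40 ms path is a 30 percent increase, which is the kind of overhead the network design has to absorb.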
In order to be successful, we needed to ensure that the user experience wasn’t impacted by the advanced security controls that we have in play. The way we started reasoning about how we do that was to consider cloud vs colocation. Could we reliably host a network-sensitive service in AWS, Google, and/or Azure? They’re massive networks and are highly connected. One of the questions we asked ourselves was: “What do those networks actually optimize for?” The answer: They’re optimized to get traffic inside their network and keep it there.
We realized that it wasn’t just expensive to egress data or traffic from one of those cloud providers, it also wasn’t as performant. That’s not where they put their effort, time, or money for network performance. They focus on the hyperconverged networking pieces for interservice, intraprovider routing.
So that’s when we doubled down on building our own backbone.
Building the Backbone
BatBlue was the first acquisition at OPAQ. They had a network and had started this idea of Firewall-as-a-Service (FWaaS). They had attempted to build a network, but it wasn’t scalable or stable. We realized we’d essentially have to rebuild it.
That meant starting from the ground up. We had to decide how to go about doing that.
There are a few things that go into those decisions:
- What are the routers we’re going to use (and other physical equipment)?
- How are we going to do connectivity (transit, transport, peering)?
- Once we have physical connectivity, what are the protocols of the software that we’re going to run across these things (OSPF, BGP, IS-IS)?
We started by considering the equipment vendor. We had Brocade Communications Systems equipment, but found it expensive and inflexible. They’re not known for their stability. If we weren’t going to use Brocade, we needed to look into another vendor.
One of my focus areas since about 2014 has been around enabling and building Software Defined Networks. Additionally, as a company delivering the combination of advanced security controls and high-performance networking, we wanted to prioritize incorporating Zero Trust Architecture principles into the network design.
If we actually wanted to be able to scale and incorporate orchestration and automation across our backbone, we needed an infrastructure that didn’t require human intervention once it was built.
There are a few major players in the market that can do that: Cisco, Juniper, and Arista.
For us, Arista was the no-brainer here. Cost, port density, form factor. These three things, all with 2 million supported BGP routes and SDN ready? What’s not to love?
Arista has come a long way in providing a fantastic routing architecture. However, when we bought it in 2017 it was a brand-new implementation and, although performant, lacked some of the routing state visibility you get with Cisco or Juniper. Their original focus was high-frequency trading and hyperconverged data centers.
We then needed to figure out how to connect all our locations together. The old way of doing that is to contact a circuit provider and then wait 90 days while they built out the optical network or provisioned a wave for you across all those sites.
For our connectivity requirements, this paradigm was going to be really expensive and take forever.
There’s a newer type of market called Software Defined Interconnect (SDI). Companies like Megaport, PacketFabric, and more recently Equinix’s Cloud Exchange Fabric provide it. At the time, we knew the people at PacketFabric and Megaport really well and decided to use those for our transport network, which provides dedicated connectivity for our site-to-site traffic.
With both providers, it only takes 30 seconds to provision a virtual circuit. With SDI, we get one physical port but can have an enormous number of Virtual Cross Connects.
So, at that point we had the Aristas for routing, and our transport network powered by PacketFabric and Megaport.
We just had to decide what routing protocols we wanted to use across the backbone.
Choosing the Protocols
The natural response is to use BGP (Border Gateway Protocol) if you operate an Autonomous System Number (ASN) and you require advanced connectivity to other ASNs as well as between sites within your ASN. We use internal BGP (iBGP) for sharing routes between our internal sites and external BGP (eBGP) for our interactions with the internet and our peers. All of this is quite standard.
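The iBGP/eBGP split comes down to comparing AS numbers: a session between two routers in the same AS is iBGP, and a session to a different AS is eBGP. A minimal Python sketch of that rule follows. The peer names and AS numbers are hypothetical (64512 is from the private-use range and 64496 from the documentation range), not OPAQ’s actual values.

```python
def bgp_session_type(local_asn: int, peer_asn: int) -> str:
    """Classify a BGP session: same ASN means iBGP, different ASN means eBGP."""
    return "iBGP" if local_asn == peer_asn else "eBGP"


LOCAL_ASN = 64512  # private-use ASN, chosen for illustration only

# Hypothetical peers: two internal sites and one external transit provider.
peers = {
    "site-a": 64512,
    "site-b": 64512,
    "transit-provider": 64496,
}

for name, peer_asn in peers.items():
    print(f"{name}: {bgp_session_type(LOCAL_ASN, peer_asn)}")
```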
But then the question was how do we transport customer traffic within each of their routing domains? Do we need to use MPLS? Should we use IS-IS? Do we need something fancy or is there something simpler?
Well, as a SASE provider we’re kind of an ISP and kind of a security provider.
Those networks look different for two reasons.
One, our customers need site-to-site connectivity that’s secure. They also need secure connectivity to other Cloud providers.
Two, customers that are “born in the cloud” may not need site-to-site traffic, but they need highly performant connectivity to Azure or Google and are still looking to get security assurances around their regular Internet connectivity.
Some customers may connect to us through a branch office or via one of our endpoint agents at home. Either connection method gives our customers all of the advanced security controls: threat prevention, anti-spyware, anti-virus, and malware prevention.
While we could go down the MPLS route, we didn’t think the complexity was worth it given our size at the time.
The Deciding Factors
One of the biggest aspects of engineering is figuring out how to accomplish your immediate objective in the most efficient way possible, without hindering your ability to grow in the future. This is extremely important from a startup’s standpoint. At OPAQ we were early stage and had no revenue.
We didn’t know North America was going to be our biggest market (vs. Europe or Asia-Pacific), so we needed to figure out the smallest amount of expenditure for the most performance that we could deploy.
What’s the simplest configuration to template and automate, one that someone in our NOC (Network Operations Center) could pick up so our network architects don’t have to make every single change?
How do we train all our staff and document the process so we’re able to scale?
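The templating question above can be sketched in a few lines of Python using the standard library’s `string.Template`. The configuration lines are generic, Cisco-like syntax chosen for illustration, and the site record, addresses (from the documentation range), and function name are all hypothetical; this is not the configuration OPAQ used.

```python
from string import Template

# A deliberately small per-site template. Keeping it short is the point:
# a NOC engineer only fills in the site record, not the structure.
BGP_TEMPLATE = Template(
    "hostname $hostname\n"
    "router bgp $local_asn\n"
    "   router-id $router_id\n"
    "   neighbor $peer_ip remote-as $peer_asn\n"
)


def render_site_config(site: dict) -> str:
    """Render the template; substitute() raises KeyError if a field is missing."""
    return BGP_TEMPLATE.substitute(site)


# Hypothetical site record.
site = {
    "hostname": "edge-site-a",
    "local_asn": 64512,
    "router_id": "192.0.2.1",
    "peer_ip": "192.0.2.2",
    "peer_asn": 64512,
}

print(render_site_config(site))
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice in this sketch: an incomplete site record fails loudly instead of producing a half-filled configuration.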
We also considered known limitations. For example, were we going to have to redo this thing once we reached 50 sites? 200 sites? 10 customers? 10,000 customers? Knowing the break points, bottlenecks, and scaling limiters for your deployments is essential when making any critical decision. Otherwise, you could hinder yourself from scaling down the road.
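One well-known example of such a break point, offered here as an illustration rather than something the interview says OPAQ ran into: if every router keeps an iBGP session with every other router (a full mesh), the session count grows quadratically. The sketch below assumes one router per site, which is a simplification.

```python
def full_mesh_sessions(routers: int) -> int:
    """Number of sessions when every router peers with every other: n(n-1)/2."""
    return routers * (routers - 1) // 2


# Assumption: one router per site, full-mesh iBGP.
for sites in (10, 50, 200):
    print(f"{sites} sites -> {full_mesh_sessions(sites)} sessions")
```

Going from 50 to 200 sites multiplies the session count by roughly sixteen, which is the sort of limit worth identifying before committing to a design.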
That’s how we made our decisions. We kept the future in mind but made the most economical choice to support the existing requirements.
As an engineer, you have to keep thinking about whether or not anyone can maintain your work once you leave.
One of the things I say to my team is “I hired you as a builder and you claim you want to be a builder. But if someone else can’t maintain what you’ve built, you will eventually become a maintainer of the things that you built.”
I think that strong engineering processes and documentation guidelines, coupled with keeping the solution simple, go a long way.