Getting Started with Network Automation

Network Automation: Horrific Results . . . Instantly!
The compelling promises of network automation are luring some businesses into ignoring the Triple Constraint Triangle of Cost, Quality, and Speed:

Cost? "We already have network experts and Python is free! No need to augment staff with expensive developers."
Quality? "Our network engineers are expert at all things, and they develop very complicated configurations on switches and routers, so they should have the discipline and talent to become software engineers and developers and write reliable, secure code."
Speed? "We already have Cost and Quality under control, and most of the staff can already write bash scripts . . . automating the network (whatever that means) won't take any time at all!"

Another argument that frequently surfaces in the blind march toward network automation goes like this: "We've automated server installs. Let's automate switch, router, and firewall installations." That's very close to the logical fallacy: "Honey badgers have two eyes and a nose. I have two eyes and a nose. I must be a honey badger." (Author's note: I am actually human.)

So how can you make progress toward Infrastructure as Code or a Programmable Network? You probably can’t, but you have a greater than zero chance if you follow these steps:

Define what you mean by network automation
Perform a standards assessment
Understand authoritative sources and be ready to create them
Develop an automation library and a code classification scheme
Hire enterprise developers with network skills
Establish Credential Handling Standards . . . NOW!
Write a strategic plan and create a road map

Define What You Mean by Network Automation
That's obvious, right? You mean . . . automate the network. Just automate it. That's clear, isn't it? You know . . . Infrastructure as Code. Programmable Networks. Intent Based Networking. What's truly obvious is that unless you can define in your own terms -- not Google's or some product salesman's -- what automation means and what you want from automation, you will struggle . . . mightily. I was once told as a Network Automation Developer that my automation objective was to remove the need for an engineer to log into a console, and I would have 12 months to do that. I don't work there any more, and that was an insane (and 3 years later, unmet) objective.

Another prospective employer told me during an interview that "we've automated most of our networking activities". My next interview at that company was with the network engineer who did the "automation," and who was delighted to tell me everything he had done . . . but the "automation" really amounted to hand-coding long, complicated XML and JSON input files, work he admitted was tedious, prone to error, and demanding of inexhaustible attention to detail. "But," he said, "I can push a change to 500 devices in just a few minutes." What he really meant was "with 3-4 weeks of pre-work, I can push a change to 500 devices in just a few minutes."

So, define network automation. What problem are you trying to solve? If the answer is "we need to automate our network changes", that's vague and unhelpful. If the answer is "changes take a long time and we need to do them more quickly", you still don't understand network automation. Taking a long time to perform changes almost always means that configuration data is not centralized or classified or easily retrievable. That's a fundamental problem that has to be fixed so that you can automate.

The end of that last paragraph will be easier to understand with an example. Let's say that a client is leaving and you want to remove all the network configuration for that client. If you have a single internet-facing router that also acts as your firewall and a single server that hosts your only application, removing a client should be easy and you don't really need automation. However, if you have multiple data centers and multiple applications in each data center with load balancing and failover and complex resiliency, then you certainly could use automation. For this specific example, here's the litmus test for whether you are ready to automate: Do you have a mechanism whereby you can key in a client name and get the following data back in a few seconds?

All VPN configuration artifacts including interfaces, policies, routes, keys, etc, along with all device names hosting that data
All firewall rules (ACLs), and the device names where the rules are hosted
All NAT rules, and the device names where the rules are hosted
All BGP definitions, routes, AS numbers (or OSPF info) for that client, and the device names hosting the configuration
All prefix lists, network statements, IP address / NAT assignments, NAT mapping and pooling statements, hosted DNS names, etc.

That's not a comprehensive list, but a picture should be emerging. If you can't submit a client name and, within a few seconds, get back all of the network configuration information you need to remove that client, then automation won't buy you much. You will have to comb through the configurations on a bunch of devices and look for stuff related to the departing client, and hope you find it all. That's going to take time, and you may miss things.

So, the total non-automated removal time for this example would likely be measured in hours. With automation, it would be a couple of minutes. Plus, with automation, you can immediately put it all back if the client chooses to stay!

For automation to work for IT, all of the network configuration items in the list above need to be in an authoritative source: a table that stores configuration data that can be searched and updated programmatically, using a tightly controlled, standards-driven, repeatable process. That's the heavy lifting of automation.
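To make the litmus test concrete, here is a minimal sketch of a "key in a client name" lookup using Python's built-in sqlite3 module. The table name, column names, device names, and sample rows are all invented for illustration; a real authoritative source would cover far more item types and would be fed by a controlled process, not hand-typed inserts.

```python
import sqlite3

# Hypothetical schema: one row per configuration item, keyed by client.
SCHEMA = """
CREATE TABLE config_items (
    client    TEXT NOT NULL,
    item_type TEXT NOT NULL,   -- e.g. 'vpn', 'acl', 'nat', 'bgp'
    device    TEXT NOT NULL,   -- device hosting the item
    detail    TEXT NOT NULL    -- the configuration artifact itself
)
"""

# Invented sample data, for illustration only.
SAMPLE_ROWS = [
    ("acme", "vpn", "nyc-fw-01", "tunnel-interface tunnel.12"),
    ("acme", "acl", "nyc-fw-01", "permit tcp 203.0.113.0/24 any eq 443"),
    ("acme", "bgp", "nyc-rt-01", "neighbor 198.51.100.2 remote-as 64512"),
    ("globex", "nat", "chi-fw-01", "static 10.1.1.10 -> 192.0.2.10"),
]


def build_source():
    """Create an in-memory table and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO config_items VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def items_for_client(conn, client):
    """Return {item_type: [(device, detail), ...]} for one client."""
    rows = conn.execute(
        "SELECT item_type, device, detail FROM config_items "
        "WHERE client = ? ORDER BY item_type, device",
        (client,),
    )
    result = {}
    for item_type, device, detail in rows:
        result.setdefault(item_type, []).append((device, detail))
    return result


if __name__ == "__main__":
    conn = build_source()
    for item_type, entries in items_for_client(conn, "acme").items():
        for device, detail in entries:
            print(f"{item_type:4} {device:10} {detail}")
```

The query itself is trivial. The hard part is everything that has to be true for the table to be trusted: every item for every client is in it, and nothing reaches a device without also reaching the table.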

Perform a Standards Assessment
Practically every IT department has some sort of established standard for at least a few things, like device names. Or maybe not. But in most businesses, there's a belief that standards exist because there are Standards Meetings and a lot of people talking about "standards". But a quick look at a list of device names might show all sorts of deviations from the "standard": SW01 and SW2 ("do I really need the 0?"), NYCFW-A and NYFWB ("Is there a dash in the name?"), sw-01.companyname.com and sw-02 ("does the device name need to be an FQDN?").

While device name differences like these are trivial, they require a developer to write code to accommodate differences. And that’s not a huge deal, hopefully. It just means that a few things are going to be a little more complicated than they should be. But more revealingly, if you don't have clear standards or if those standards are not enforced or audited, there's a legitimate question as to whether you really have standards at all. Let's call them "guidelines". And "guidelines" and "automation" don't work well together.

More often, however, standards are tribal . . . ideas that have never been put to paper, and everyone should just sorta "know" that virtual servers should contain "vs" in the name, or that the first 5 addresses on any subnet are reserved for clusters or high-availability pairs, or that PTR records should be created any time an A record is created. These are very simple, foundational things, and far more complicated standards issues will need to be investigated, because without standards, there’s less predictability, and without predictability, automation gets elusive and dangerous.
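One way to see how much is hiding in tribal knowledge is to try writing the rules down as code. The sketch below encodes the three examples above with Python's standard ipaddress module. The interpretation of each rule is an assumption made for the example: "first 5 addresses" is read as the first five usable host addresses, and the "vs" check is case-insensitive. Having to make those choices at all is the point, since the tribal version never says.

```python
import ipaddress

RESERVED_COUNT = 5  # assumption: "first 5" means the first five usable host addresses


def reserved_addresses(subnet):
    """Return the first RESERVED_COUNT usable host addresses of a subnet."""
    hosts = ipaddress.ip_network(subnet).hosts()
    return [addr for _, addr in zip(range(RESERVED_COUNT), hosts)]


def is_reserved(address, subnet):
    """True if the address falls in the block reserved for clusters or HA pairs."""
    return ipaddress.ip_address(address) in reserved_addresses(subnet)


def has_vs_marker(name):
    """True if a virtual server name contains 'vs' (case-insensitive is an assumption)."""
    return "vs" in name.lower()


def ptr_name(address):
    """Name of the PTR record that should exist alongside an A record for this address."""
    return ipaddress.ip_address(address).reverse_pointer


if __name__ == "__main__":
    subnet = "10.20.30.0/24"
    print("reserved:", [str(a) for a in reserved_addresses(subnet)])
    print("10.20.30.6 reserved?", is_reserved("10.20.30.6", subnet))
    print("PTR for 10.20.30.5:", ptr_name("10.20.30.5"))
```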

So . . . figure out what items will need to be called from a script -- like device name or a syslog server or IP address or VLAN number or structure for an interface description -- and determine whether a published reference to that standard exists. And perform an audit to see how broadly the standard has been applied. If there are problems, fix them now, or grandfather them out and postpone full automation for a bit.
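An audit like this can start very small. The following sketch checks a list of device names against a naming pattern and reports the deviations. The pattern (three-letter site code, role, two-digit number, lowercase, no domain suffix) is a made-up standard for the example, not a recommendation; substitute whatever your published standard actually says.

```python
import re

# Hypothetical naming standard: <3-letter site>-<role>-<2-digit number>,
# lowercase, no domain suffix. Replace with your own published standard.
NAME_STANDARD = re.compile(r"[a-z]{3}-(sw|rt|fw)-\d{2}")


def audit_names(names):
    """Split names into (compliant, deviations) against NAME_STANDARD."""
    compliant, deviations = [], []
    for name in names:
        if NAME_STANDARD.fullmatch(name):
            compliant.append(name)
        else:
            deviations.append(name)
    return compliant, deviations


if __name__ == "__main__":
    inventory = [
        "nyc-sw-01", "SW2", "NYCFW-A", "NYFWB",
        "sw-01.companyname.com", "chi-fw-02",
    ]
    ok, bad = audit_names(inventory)
    print(f"{len(ok)} of {len(inventory)} names follow the standard")
    for name in bad:
        print("deviation:", name)
```

The ratio of compliant names to total names is a quick, honest measure of whether you have a standard or a guideline.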

Ultimately, everything that you programmatically reference will need to be standardized: the way it is stored, retrieved, and updated . . . vlan ranges, autonomous system numbers, IP addresses, NTP servers, Syslog servers, SNMP community strings, access list names, virtual server names . . . the list will be quite long.

Understand Authoritative Sources
Usually, in larger organizations there's a "CMDB" or configuration management database, and more than likely it is widely believed to be inaccurate, so it's seen as a bit of a joke among engineers who, ironically, are usually the reason it's not accurate (they don't see it as their job to update it). Or there's a spreadsheet that somebody keeps up to date that lists a device name, its management IP address, the vendor, model, and OS version. Or maybe you don't have a CMDB or a spreadsheet, but you have a product that does "discovery", so you don't think you need either. Or you've decided "that's ok" because your vendor keeps track of what they sold you and will let you know when a product has reached end-of-life, or will just bill you for annual maintenance, and it will be accurate. Maybe.

Here's the axiom: IF your "authoritative source" is updated manually, THEN it is out of date. And if that simple spreadsheet list of devices and attributes is out of date, it's almost a certainty that your business only has a vague knowledge of maintenance contract information, or "end of sale" or "end of support" or an endless list of "end of X" data. Or perhaps the organization has been lulled into a false sense of security because when some vulnerability was announced for version X.Y.Z, a clever, resourceful network engineer wrote a shell script to log into and query every device (don't think about how the credentials were stored) for that vulnerable OS version. Or worse, the script simply executed a "show version" command and then the entire output was pasted into a spreadsheet to be filtered and sorted. That's not automation; that's a workaround or a band-aid.

Similar to the standards assessment, the organization needs to look deeply into all the elements or key:value pairs that need to be programmatically accessed. And all that data needs to be stored in a table that is searchable. And there should be a process that unfailingly updates every field in that table anytime a change is made to a configuration on a network device.
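The "unfailingly updates" requirement is easiest to meet when the table update is part of the change itself and not a follow-up chore. Here is a minimal sketch of that idea using sqlite3. The table layout and function names are hypothetical, and the push argument stands in for whatever mechanism actually configures the device; nothing here talks to real hardware.

```python
import sqlite3
from datetime import datetime, timezone


def open_source():
    """Create an in-memory table of per-device key:value pairs (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE device_attrs ("
        " device TEXT NOT NULL, attr TEXT NOT NULL, val TEXT NOT NULL,"
        " updated_at TEXT NOT NULL, PRIMARY KEY (device, attr))"
    )
    return conn


def record_attr(conn, device, attr, val):
    """Insert or overwrite one key:value pair and stamp the time of the change."""
    stamp = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO device_attrs VALUES (?, ?, ?, ?)",
            (device, attr, val, stamp),
        )


def apply_change(conn, device, attr, val, push):
    """Push a change, then record it. 'push' is a caller-supplied function."""
    push(device, attr, val)  # raises on failure, leaving the table untouched
    record_attr(conn, device, attr, val)


def get_attr(conn, device, attr):
    """Return the recorded value for one attribute, or None if absent."""
    row = conn.execute(
        "SELECT val FROM device_attrs WHERE device = ? AND attr = ?",
        (device, attr),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = open_source()
    apply_change(conn, "nyc-sw-01", "ntp_server", "192.0.2.1",
                 push=lambda d, a, v: print(f"pushing {a}={v} to {d}"))
    print(get_attr(conn, "nyc-sw-01", "ntp_server"))
```

If the only sanctioned way to change a device is through a function like apply_change, nobody has to remember to update the table afterward. This sketch does not handle the case where the push succeeds and the record step fails; a real process would need an answer for that.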

In a process-mature organization, inventory and authoritative sources are updated by different departments during the life cycle of a device. Procurement, Receiving, Data Center Operations, and Network Operations all own elements of the device configuration record. Asset management, hardware maintenance, software maintenance, customer configuration data, environment, device state . . . specific data for all these categories is then reflected for network devices in a table or number of tables.

The table (or tables) thus reflects the state for the device and becomes the authoritative source for the device. This is culturally a difficult departure for most network engineers, who see logging on to the console and scrolling through the configuration as the only accurate method for understanding the state of the device. However, without a cultural change in this area, automation will simply never be anything more than a passing fad.

Develop an Automation Library
Before you start creating and testing code, it's a good idea to know at a minimum:

Where you are going to store the code
How you are going to classify code
Who will own API access to given devices
How you will manage sharing code

Most organizations acquire GitLab or GitHub like any other application they purchase: the application is sold as a solution, so they are comfortable with the feeling that buying and installing it should solve problems without any further effort . . . kinda like expecting to get buff a few days after purchasing a fitness membership without ever visiting the facility. Actually, what’s been accomplished is more like deciding to use the barn to store junk. Code gets written and saved to projects in some repository with an odd group or project name and with even more unrecognizable individual file names. You are left with the illusion that you are “managing your code” but in reality, you are just calling it all “code” and putting it in one place . . . more like a "Lost and Found" that you rummage through looking for something interesting or valuable.

One option is to settle on REST API access as the method for all network automation projects and use a commercial REST API management tool to expose all device and program APIs. There's a compelling simplicity to this idea, and rightly so. At the very least, you end up with all your programmatic access in a single "API marketplace". But this should not be optional; otherwise, you'll have a free-for-all with everyone writing their own APIs. If you choose this route, make using the REST API management tool mandatory.

How these code repositories should be organized and classified is well beyond the scope of this document . . . which is another way of saying that I will happily do that for you for $350 / hour. Sigh. Without some sort of classification scheme, the code repository will simply be a hodge-podge of hard to find, who-knows-what-it-does scripts.

Determine Staff Skill Levels and Augment as Needed
This is actually a deceptive title for this section. Your existing staff doesn't have the skill to do what you want them to do or you wouldn't be in the position you are in . . . you'd already be "automated". You need new talent.

Imagine you own and operate a restaurant that specializes in BBQ and after a period of sharply declining revenues you discover that the demographic has changed in your area and you need to be serving Thai or Ethiopian food. It would never occur to you to tell your BBQ master to go learn how to prepare Thai or Ethiopian food and then monkey with the BBQ pit to accommodate the new food type. Well, it might occur to you, but you'd be an idiot to act on that idea.

Nonetheless, IT departments around the globe are telling their network engineers (often with veiled threats) to learn Python, and GitLab, and CI/CD pipelines, and good coding discipline and software security. That's easily one of the nuttiest ideas in the industry. Think about your reaction to an interviewee for a Network Automation position who responds to a question about Python development experience: "I don't have any, but I can learn while you pay me." Are you seriously going to hire that person? Why? You know why: to save money. And in doing so, you risk your business.

You can't convert your Route | Switch staff to Software Developers. Ok, you can try. But it's a stupid idea, fraught with risk, and they won't be very good. Or you may get lucky, and some of your Route | Switch engineering staff may be able to write decent Python code. But if they have any sense, they will find another job and get a 30% salary increase. And you will have lost some network talent.

Hire professionals to do the work. Or resign yourself to taking years to move toward automation. There’s nothing wrong with replacing a network engineer with a software developer who has network experience, and then building an automation team over time. But that should be a direction you choose intentionally.

Establish Credential Handling Standards . . . NOW!
The first confusing issue that a network engineer encounters in trying to automate . . . something, is how to pass login credentials to a device. The usual path is to hard code the user name (why not?) and write the code to prompt for a password. During code test iterations, the engineer will likely get tired of keying in the password and just hard code it into the script, even though it's a violation of corporate policy. "I'll take it out later."

Using that same principle, I could just leave my keys in the ignition of my car whenever I go into the grocery store. Or leave the door to my home unlocked when I go to work. After all, it's pretty unlikely that anything will happen. On what planet is "likelihood" a factor in determining whether or not something is a good idea? Not Earth. And certainly not IT on planet Earth.

So, before unleashing automation on your IT infrastructure, make certain you have published, easy-to-follow, low-frustration instructions for using credentials in every possible situation. Don't leave it up to the developer to figure out how to pass credentials. Eventually, out of ignorance or impatience with Identity and Access Management (the DMV of IT), someone is going to hardcode a password into a script and it will be found.

There are countless ways of accomplishing the task of protecting credentials, far too many to enumerate in a short article like this one, but you should not let individuals choose or create a new method. Collaborate with your security team and write strict access methods for program-to-system, program-to-program, and system-to-system communications. Decide whether or not to use "system" accounts or whether you want a user account associated with the execution of a script. Figure out whether you want to use a vault like CyberArk or HashiCorp Vault or Ansible Vault or some other commercial or open source product, but make a decision and enforce it.

Once again: do not let developers choose how to pass credentials. Establish standards and enforce them.
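What "establish and enforce" can look like in code is a single sanctioned helper that every script must call, so that no script contains its own credential logic. The sketch below is only one possible shape, not a recommendation of a specific method: it assumes the approved mechanism (a vault agent, a pipeline secret store, or whatever your security team selects) places the values in environment variables, and the NETAUTO prefix and variable names are invented for the example.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credentials:
    username: str
    password: str = field(repr=False)  # keep the secret out of logs and tracebacks


class MissingCredentialError(RuntimeError):
    """Raised when a required credential has not been provided."""


def get_credentials(prefix="NETAUTO", env=None):
    """Read <prefix>_USERNAME and <prefix>_PASSWORD; never prompt, never default."""
    env = os.environ if env is None else env
    values = {}
    for name in ("USERNAME", "PASSWORD"):
        variable = f"{prefix}_{name}"
        value = env.get(variable)
        if not value:
            raise MissingCredentialError(
                f"{variable} is not set; follow the published credential standard"
            )
        values[name.lower()] = value
    return Credentials(**values)


if __name__ == "__main__":
    try:
        creds = get_credentials()
        print("running as", creds.username)
    except MissingCredentialError as err:
        print("refusing to run:", err)
```

The helper fails loudly when a credential is missing, because a fallback prompt or default is exactly where a hard-coded password sneaks in "temporarily".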

Write a Strategic Plan and Create a Road Map
Perhaps the most important activity you undertake will be to write a strategic plan that will consider your current infrastructure, current staff skill set, budget, process maturity level . . . and then create a road map describing the tasks you need to complete to follow your strategic plan.

The most difficult step in the strategic plan is self-assessment, candidly evaluating your standards, process maturity, and staff requirements. If you think your standards are well documented, easy to follow and easy to find, but they aren’t, you will fail. If you think you are a process-mature company but in fact, your organization is held together by tribal knowledge, the professional efforts and integrity of a few, and luck, you will fail miserably. And if you think your staff can perform their current job, learn automation, learn development discipline, and not just learn but excel in Python or C or Java in just a few short months, then you will not only fail, you are crazy.

So, if you still insist on moving forward with an automation effort, and you start before you have solid standards, before you define systems of record for all configuration data and have identified owners for those systems of record, before you have thought through and agreed upon a classification scheme and source control methodology, and before you augment staff with enterprise-quality developers, you will spend a lot of time and money and not produce much more than a few outages, which, if you are lucky, you will be able to automatically back out.

Good luck!
