
@bdelano
Last active January 16, 2022 14:46
netops lifecycle
authors: Brian Delano <brian.delano@joyent.com>
state: draft
date: Aug 8 2019
updated: Oct 14 2019 (adds unified DB as a prerequisite, tidy up)

OPS-RFD 16 Network Device Lifecycle

Introduction

Currently, when new network infrastructure is added, it is done in an ad-hoc, semi-automated fashion, and there is no standard by which individual devices and accompanying allocations are added to the management and monitoring systems.

The move to the Salt framework across our network and server estate allows us to standardise and fully automate the introduction and ongoing production management of network infrastructure. To do this we can rely on Salt for any device interactions, but we will also need additional Python scripts which interact with the network device support infrastructure.

This document attempts to call out the specific requirements of Salt and any additional scripts which will be required to take a network device from build into production and through to removal.

Proposed Network Device Lifecycle

Acronyms

  • DCOps -> DC Operations
  • NBT -> Network Build Team
  • DB -> Infrastructure Database (currently referring to dcim.samsungcloud.io, subject to change)

Adding new devices

Manual Changes

  • cable map is produced (NBT)
  • range of mgmt IP addresses is allocated in the DB, tagged with the appropriate build name (NBT)
  • device is racked and connected to the appropriate Opengear console device (DCOps)
  • device is added to the DB (DCOps)
    • rack location
    • serial (used as both the device name and serial; supplied by NBT)
    • MAC address (supplied by NBT)
    • device role (supplied by NBT)
    • OOB connection information
    • tag with a unique buildid
    • device 'status' is labelled 'Inventory' in the DB
    • can be done from an imported sheet if necessary
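As a sketch, the record DCOps creates could look like the following. The field names here are illustrative only, not the actual DCIM schema, and `new_inventory_device` is a hypothetical helper:

```python
# Hypothetical sketch of the device record DCOps would create in the DB.
# Field names are illustrative, not the real DCIM schema.
def new_inventory_device(serial, mac, role, rack, oob, buildid):
    """Build a device record in 'Inventory' status, tagged with its buildid."""
    return {
        "name": serial,            # serial doubles as the initial device name
        "serial": serial,
        "mac_address": mac,
        "device_role": role,
        "rack_location": rack,
        "oob_connection": oob,
        "tags": [f"build:{buildid}"],
        "status": "Inventory",
    }

device = new_inventory_device(
    serial="JN123456", mac="aa:bb:cc:dd:ee:ff", role="leaf",
    rack="AZ1-R12-U20", oob="og-az1-01:port7", buildid="build-2019-08",
)
```

Importing from a sheet would just mean producing a list of such records in bulk.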

Automated Changes

  • run initial script which does the following
    • checks DB for any Inventory devices with names that match the serial number
    • assigns hostname based on device role and rack location
    • adds appropriate range to DHCP server
  • Opengear device is configured and connectivity to new device is validated (manually run script)
  • initial device setup over console (may not be necessary with some switches, can be scripted)
  • Get MAC address and IP from DHCP (needs a script which pulls the range from the allocation in the DB)
  • Add base config to network device (assign mgmt ip, snmp and validate reachability)
    • ZTP base config onto device (where applicable)
    • Add base config via script through opengear (where no ZTP)
  • DHCP server/Script sends syslog message to napalm logs
  • salt receives syslog event via napalm logs (needs to include mgmtip)
    • runs build script which does the following
      • pulls hostname, device role and build group from DB
      • pulls base information from device via SNMP (serial,mgmtip)
      • creates proxy based on device serial number
      • runs various salt states to complete device base build
        • this needs to be fleshed out as currently we don't have templates for each build type
      • updates/adds device in DB
        • MGMT IP
        • change status to 'Staged'
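The build-script steps above (match Inventory device by serial, derive hostname from role and rack location, record the mgmt IP, flip status to 'Staged') can be sketched as follows. The hostname scheme and function names are assumptions for illustration, not the final convention:

```python
# Sketch of the build script's DB update step. The hostname format derived
# from role + rack location is an assumed convention, not a decided one.
def assign_hostname(role, rack_location):
    """e.g. role='leaf', rack_location='AZ1-R12-U20' -> 'leaf-az1-r12-u20'"""
    return f"{role}-{rack_location}".lower().replace("_", "-")

def stage_device(db, serial, mgmt_ip):
    """db maps device name -> record; Inventory devices are named by serial."""
    device = db.get(serial)
    if device is None or device["status"] != "Inventory":
        return None  # not one of ours; ignore the event
    device["name"] = assign_hostname(device["device_role"], device["rack_location"])
    device["mgmt_ip"] = mgmt_ip
    device["status"] = "Staged"
    return device
```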

Promoting a device into production

  • select a device or group of devices on the DB.
  • change the status to Active
  • update script will then add any new Active devices to the appropriate tools and add final production config to device
    • run salt state to implement TACACS, syslog and SNMP Traps
    • update DNS
    • add to OpenNMS (initially but this will eventually go away)
    • add to trigger
    • add to netopsinfodb
      • parses all interfaces adds ifindexes, descriptions etc
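The promotion pass of the update script might look like the following: walk the DB, and for each newly Active device emit the integration steps listed above. The step names and the `promoted` flag are placeholders for illustration:

```python
# Sketch of the update script's promotion pass. Step names mirror the list
# above; the 'promoted' bookkeeping flag is an assumption.
PROMOTION_STEPS = ["tacacs/syslog/snmp-traps", "dns", "opennms", "trigger", "netopsinfodb"]

def promote(devices):
    """Return (device, step) work items for newly Active devices."""
    actions = []
    for dev in devices:
        if dev["status"] != "Active" or dev.get("promoted"):
            continue  # not selected for production, or already handled
        for step in PROMOTION_STEPS:
            actions.append((dev["name"], step))
        dev["promoted"] = True
    return actions
```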

Production daily updates

  • update script will do the following
    • update device interfaces on DB
    • validate device configurations and send notifications for any failed validations
    • update Circonus with new interfaces or description changes (ifindexes are pulled from netopsinfodb)
    • update connections between devices on DB
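The configuration-validation step could be as simple as diffing the intended config against the running config pulled via Salt and collecting devices that drift; failures would then feed the notification path. A minimal sketch, assuming both configs are available as text:

```python
# Sketch of daily config validation: diff intended vs running config and
# return the drifting devices so notifications can be sent for them.
import difflib

def validate_configs(intended, running):
    """Both args map device name -> config text; returns {name: unified diff}."""
    failures = {}
    for name, want in intended.items():
        got = running.get(name, "")
        if got != want:
            failures[name] = "\n".join(difflib.unified_diff(
                got.splitlines(), want.splitlines(),
                fromfile=f"{name}/running", tofile=f"{name}/intended", lineterm=""))
    return failures
```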

Replacing a device

  • Device is swapped out by DCOps and console is verified
  • DHCP kicks off and standard 'add' automation is followed
  • Build script matches hostname to existing device and updates serial number on DB
  • As device is already Active the Promoting process will automatically finish/validate the device config
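The serial swap in the replacement path is small enough to sketch directly; `replace_device` is a hypothetical helper, and falling back to the normal add flow on an unknown hostname is an assumption:

```python
# Sketch of the replacement path: match the existing DB record by hostname
# and swap in the new serial/MAC instead of creating a fresh record.
def replace_device(db, hostname, new_serial, new_mac):
    device = db.get(hostname)
    if device is None:
        return None  # unknown hostname: fall back to the normal add flow
    device["serial"] = new_serial
    device["mac_address"] = new_mac
    # status stays Active, so the promotion pass re-validates the config
    return device
```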

Removing a device

  • select a device or group of devices on the DB.
  • change the status to Offline
  • update script will remove any Offline devices from the appropriate tools
    • remove from OpenNMS (initially, but this will eventually go away)
    • remove from trigger
    • remove from netopsinfodb
    • remove interfaces from Circonus
    • remove from DNS
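Removal is the mirror image of promotion, so the same update-script pattern applies; as before, the step names and `removed` flag are illustrative:

```python
# Sketch of the removal pass, mirroring promotion: for each device set to
# Offline, emit the teardown steps in the order listed above.
REMOVAL_STEPS = ["opennms", "trigger", "netopsinfodb", "circonus-interfaces", "dns"]

def remove_offline(devices):
    """Return (device, step) teardown items for newly Offline devices."""
    actions = []
    for dev in devices:
        if dev["status"] != "Offline" or dev.get("removed"):
            continue
        actions.extend((dev["name"], step) for step in REMOVAL_STEPS)
        dev["removed"] = True
    return actions
```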

What is Required?

We are currently working on putting all the pieces together to bring the Salt framework into production. Once we have everything in place, we will need to carry out the following tasks:

  • Move ourselves in to a position where we have a single DB (source of truth)
  • Make sure we have the hardware capacity (memory and cpu) to accommodate the Salt proxies
  • Work with DCOps on process and procedures for adding new devices
  • We need a lab environment which allows us to test all aspects of this workflow
  • Work needs to be done on how/which DHCP server is best suited for the initial build and how we go about 'locking down' addresses when a device is reloaded (unless we want to make addresses dynamic)
  • Deploy individual Salt minions in all SPC AZs
    • Still need to decide exactly how this is going to work in terms of numbers
  • Decide where to put the Master Salt zone
  • Decide on a redundancy/backup solution for the Master
  • Decide on a production setup for napalm-logs
    • configure definitions to support the DHCP syslog messages
    • configure napalm-logs to push to Salt (module already exists for this)
  • Salt states/scripts will need to be written which automatically setup and remove network device proxies
  • Salt validation scripts need to be written to push notifications on failures
  • Decide on the format and location for how and where all the various scripts will run
  • Write various Python scripts to interact with the supporting infrastructure
    • Script which pulls interface ifindexes and updates netopsinfodb
    • Script which pulls interface information via salt and updates netopsinfodb
    • Script to add/remove devices from DNS
    • Script to generate a netdevices.json file for trigger based on Active network devices
    • Netbox class for updating devices
      • bdelano has the pieces for this but would need to adapt it for network devices specifically
    • Opennms class for adding/removing devices
      • bdelano has this already but it needs to be updated and integrated with this toolset
    • Circonus class to add/remove checks for monitored interfaces and devices
      • bdelano already has a solution for this but it would need to be adapted to work within the bounds of this framework
  • Train NetOps on how the system works, what to expect and how to manipulate changes
  • All the other stuff I am forgetting about
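Of the scripts above, the netdevices.json generator for Trigger is small enough to sketch. The metadata field names (`nodeName`, `adminStatus`, etc.) follow common Trigger examples and would need to be checked against the deployed Trigger version:

```python
# Sketch of generating netdevices.json for Trigger from Active DB devices.
# Field names follow common Trigger metadata examples; verify against the
# deployed Trigger version before relying on them.
import json

def netdevices_json(devices):
    active = [
        {
            "nodeName": d["name"],
            "manufacturer": d.get("manufacturer", ""),
            "deviceType": d.get("device_role", ""),
            "adminStatus": "PRODUCTION",
        }
        for d in devices
        if d["status"] == "Active"
    ]
    return json.dumps(active, indent=2)
```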

Future Considerations

  • building out regional napalm-logs zones should allow us to eventually migrate away from OpenNMS
    • would need to build out definitions for all relevant notifications
    • would need to add scripts which would forward all notifications to pagerduty/mattermost
    • need to come up with a solution for where we are using SNMP traps instead of syslog (Arista, Juniper)
  • Move all device configuration away from direct user interaction to a salt based solution
    • Need to write appropriate scripts/interfaces to allow NetOps/NOC to interact
    • Need 'buy-in' from NetOps as this would be a considerable paradigm shift
    • Should allow us to move away from current Trigger dependence