Credit to Mathias Wegner on Network to Code's Slack community!

Nautobot on AWS

This is a brief summary of setting up Nautobot on AWS. I used Nautobot 1.1.3, but the process should not vary much with other versions. It assumes some familiarity with Nautobot and AWS. The AWS CLI commands shown are the bare minimum needed to stand up some version of this deployment, and they only show creating one resource where multiple identical resources are needed; you should add tags and consider your own requirements around sizing, redundancy, and so on. The end result is a basic Nautobot deployment with no scaling: just one HTTP server and one Celery worker.

Starting infrastructure and configurations

  • a VPC containing 2 private and 2 public subnets, with each pair of subnets in two different AZs
  • routes configured back to your on premise network from the VPC
  • an ec2 bastion host with an interface on the private subnet for local access to EFS and RDS
  • a nautobot_config.py file that has been updated to meet your needs
  • a nautobot.env file that has been updated to meet your needs
  • a nautobot docker image that contains nautobot and all of the plugins that you want to run

Nautobot config changes

In order to work with redis SSL, you'll need to add the following to your nautobot_config.py:

from nautobot.core.settings_funcs import parse_redis_connection  # shipped with Nautobot

if parse_redis_connection(redis_database=0).startswith("rediss"):
    CELERY_REDIS_BACKEND_USE_SSL = {"ssl_cert_reqs": "YOUR SETTING"}

where "YOUR SETTING" is one of 'required', 'optional', or 'none'. This is needed because Celery requires some value for ssl_cert_reqs when using SSL.

Starting with dependencies

VPC Security Groups

First, I went to VPC and created all of the security groups that I intended to use for the installation. For the internal services (EFS, RDS, Elasticache, ECS), I configured the security groups to allow inbound access to the service from the private subnets only, e.g., the postgres security group only allows inbound traffic on the Postgres port from the CIDRs of the two private subnets. For the external service (ELB), I configured the security group to allow inbound HTTP/HTTPS access from my end-user networks.

The full list of security groups that I created is:

  • nautobot-ecs-secgrp
  • nautobot-efs-secgrp
  • nautobot-elb-secgrp
  • nautobot-rds-secgrp
  • nautobot-redis-secgrp

Creating a Security Group

aws ec2 create-security-group
    --description "allow postgres port to rds instance"
    --group-name "nautobot-rds-secgrp"
    --vpc-id "<vpc-xxxxxxxx id of the VPC>"

aws ec2 authorize-security-group-ingress
    --group-id "<sg-xxxxxxxx id of nautobot-rds-secgrp>"
    --protocol "tcp"
    --port 5432
    --cidr "<CIDR of private subnet one>"

aws ec2 authorize-security-group-ingress
    --group-id "<sg-xxxxxxxx id of nautobot-rds-secgrp>"
    --protocol "tcp"
    --port 5432
    --cidr "<CIDR of private subnet two>"

...
<repeat for each security group>

RDS

Second, I went to RDS and created a database subnet group, nautobot-postgres-subnetgrp, that includes both private subnets. I then created a generic Postgres RDS database named nautobot-postgres-db and assigned it to the nautobot-postgres-subnetgrp subnet group and the nautobot-rds-secgrp security group. The database does not need public access. I created a superuser at this point. Sizing, redundancy, backup, and encryption settings will depend on your use case, so your database setup will differ. I have not run into any issues with all encryption enabled and starting with limited resources, intending to scale up as needed.

Creating RDS subnet group and postgres instance

aws rds create-db-subnet-group
    --db-subnet-group-name nautobot-postgres-subnetgrp
    --db-subnet-group-description "subnet group for nautobot rds instance"
    --subnet-ids "<arn of private subnet one>" "<arn of private subnet two>"

aws rds create-db-instance
    --db-instance-identifier "nautobot-postgres-db"
    --db-instance-class "<node size>"
    --engine "postgres"
    --db-subnet-group-name "nautobot-postgres-subnetgrp"
    --vpc-security-group-ids "<sg-xxxxxxxx id of nautobot-rds-secgrp>"
    --master-username "postgres"
    --master-user-password "<password>"
    ...
    <other options as needed>

Elasticache

The Elasticache instance also needs a subnet group, so I created one named nautobot-elasticache-subnetgrp that includes both private subnets. I then created an Elasticache redis replication group, nautobot-elasticache-redis, encrypted both at rest and in transit, assigned an auth token at creation, and assigned it to the nautobot-elasticache-subnetgrp subnet group and the nautobot-redis-secgrp security group. I used default settings for the number of shards and redundancy, but used the smallest node size I reasonably could, intending to scale up as needed, because Elasticache gets expensive fast. Note that I used a redis replication group in order to use transit encryption; there are additional options if you don't care about transit encryption.

Creating Elasticache subnet group and redis instance

aws elasticache create-cache-subnet-group
    --cache-subnet-group-name "nautobot-elasticache-subnetgrp"
    --cache-subnet-group-description "subnet group for nautobot redis instance"
    --subnet-ids "<arn of private subnet one>" "<arn of private subnet two>"

aws elasticache create-replication-group
    --replication-group-id "nautobot-elasticache-redis"
    --replication-group-description "redis instance for nautobot"
    --cache-subnet-group-name "nautobot-elasticache-subnetgrp"
    --security-group-ids "<sg-xxxxxxxx id of nautobot-redis-secgrp>"
    --auth-token "<redis auth token>"
    --transit-encryption-enabled
    --at-rest-encryption-enabled
    --cache-node-type "<node type>"
    ...
    <other options as needed>

EFS

I created an EFS file system, nautobot-efs-configfiles, and a corresponding access point, nautobot-efs-accesspoint, and attached the nautobot-efs-secgrp security group to the mount targets in each private subnet.

Creating the EFS file system and access point

aws efs create-file-system
    --encrypted

aws efs create-access-point
    --file-system-id "<fs-xxxxxxxxxx id of file system>"

aws efs create-mount-target
    --file-system-id "<fs-xxxxxxxxxx id of file system>"
    --subnet-id "<subnet-xxxxxxxx id of private subnet one>"
    --security-groups "<sg-xxxxxxxx id of nautobot-efs-secgrp>"

aws efs create-mount-target
    --file-system-id "<fs-xxxxxxxxxx id of file system>"
    --subnet-id "<subnet-xxxxxxxx id of private subnet two>"
    --security-groups "<sg-xxxxxxxx id of nautobot-efs-secgrp>"

Systems Manager Parameter Store

For secrets, I used Systems Manager Parameter Store and created SecureStrings. Note that whichever KMS key you use needs to be accessible by the nautobot-ecs-task-runner-role role, whether you use the account default key or specify a key. There are three secrets that are used at every startup (the Django secret key, the redis auth token, and the postgres nautobot user password) and two that are only needed at initialization (the admin user password and the admin user API token).

Creating a parameter store secret

aws ssm put-parameter
    --name "NautobotDjangoSecretKey"
    --value "<secret key value>"
    --type "SecureString"

...
<repeat as needed>

ECR

I chose to use ECR for container image storage, but any container image repo will work as long as it is accessible by ECS and the nautobot-ecs-task-runner-role role. I created a single repo for the nautobot container, nautobot, and uploaded my container image.

For my nautobot container, I started with the standard nautobot image as a base and installed my preferred plugins locally. The result was uploaded to ECR.

Creating the ECR repo and pushing your existing image

aws ecr create-repository
    --repository-name nautobot

aws ecr get-login-password --region <your region> | docker login --username AWS --password-stdin <FQDN for ECR endpoint>

docker tag nautobot:latest <FQDN for ECR endpoint>/nautobot:latest

docker push <FQDN for ECR endpoint>/nautobot:latest

Identity and Access Management

Next, I went to IAM and created a role for executing the ECS task, nautobot-ecs-task-runner-role. I assigned a custom policy to the role that gives it permission to mount the EFS access point, pull images from ECR, read the Parameter Store secrets, run tasks in ECS, and send logs to CloudWatch.

This policy is not as locked down as it could be; some of the resources didn't exist when I created the policy, and it needs to be revisited to specify ARNs. Still, it should be a good starting point and covers all of the permissions that my nautobot task needed.

IAM task runner policy json

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "elasticfilesystem:DescribeMountTargets",
                "elasticfilesystem:DescribeLifecycleConfiguration",
                "elasticfilesystem:ClientMount",
                "elasticfilesystem:DescribeFileSystemPolicy",
                "elasticfilesystem:ClientWrite"
            ],
            "Resource": [
                "arn:aws:elasticfilesystem:*:<accountid>:file-system/<nautobot efs>",
                "arn:aws:elasticfilesystem:*:<accountid>:access-point/<nautobot efs access point>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "ssm:GetParameters",
                "ssm:GetParameter"
            ],
            "Resource": [
                "arn:aws:secretsmanager:*:<accountid>:secret:<secret>",
                "arn:aws:ssm:*:<accountid>:parameter/<params>",
                ...
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Resource": "arn:aws:ecr:*:<accountid>:repository/<nautobot repo>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:<accountid>:log-group:*:log-stream:*",
                "arn:aws:logs:*:<accountid>:log-group:*"
            ]
        }
    ]
}

Creating an IAM role and policy

aws iam create-policy
    --policy-name "nautobot-ecs-task-runner-policy"
    --policy-document "file://<path to IAM task runner policy json>"
    --description "Policy for running nautobot containers"

aws iam create-role
    --role-name "nautobot-ecs-task-runner-role"
    --description "Role for running nautobot containers"
    --assume-role-policy-document "file://<path to ECS tasks trust policy json>"

aws iam attach-role-policy
    --role-name "nautobot-ecs-task-runner-role"
    --policy-arn "<arn of nautobot-ecs-task-runner-policy>"

ELB

The load balancer is the only part of this deployment that should be reachable from outside the VPC. Whatever your connectivity to AWS looks like (Direct Connect, VPN, public internet), the load balancer needs to be deployed so that it is reachable from your network. First, the load balancer needs a target group to forward traffic to, so I created nautobot-elb-targetgroup. Next, I created the load balancer itself, nautobot-elb. Finally, I created a listener that forwards HTTPS connections on the load balancer to port 8080 on the nautobot container.

Many of the load balancer configuration settings will depend on your use case; consult the AWS documentation on load balancers for details. In my environment, I used an internal load balancer, since my VPC is connected to a transit gateway that connects back to our premises via a VPN tunnel. To make it easier on users, I created a CNAME within my domain that points to the AWS FQDN of the load balancer and installed an SSL cert on the load balancer.

https://awscli.amazonaws.com/v2/documentation/api/latest/reference/elbv2/create-load-balancer.html
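
If you use AWS Certificate Manager and Route 53 for the cert and the CNAME, the commands look roughly like the sketch below; the domain name and hosted zone ID are placeholders, and if your DNS lives elsewhere you would create the CNAME there instead. ACM's DNS validation also requires adding a validation record before the certificate is issued; the resulting certificate ARN is what goes into the create-listener command further down.

Requesting a certificate and creating the CNAME (sketch)

aws acm request-certificate
    --domain-name "nautobot.yourdomain.com"
    --validation-method "DNS"

aws route53 change-resource-record-sets
    --hosted-zone-id "<hosted zone id for yourdomain.com>"
    --change-batch '{"Changes": [{"Action": "CREATE", "ResourceRecordSet": {"Name": "nautobot.yourdomain.com", "Type": "CNAME", "TTL": 300, "ResourceRecords": [{"Value": "<AWS FQDN of the load balancer>"}]}}]}'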

Listener default action json

[
  {
    "Type": "forward",
    "TargetGroupArn": "<arn of ELB target group>",
    "Order": integer,
    "ForwardConfig": {
      "TargetGroups": [
        {
          "TargetGroupArn": "<arn of ELB target group>",
          "Weight": 1
        }
        ...
      ]
    }
  }
  ...
]

Creating a target group and load balancer

aws elbv2 create-target-group
    --name "nautobot-elb-targetgroup"
    --protocol "HTTP"
    --port 8080
    --target-type "ip"
    --vpc-id "<vpc id>"

aws elbv2 create-load-balancer
    --name "nautobot-elb"
    --subnets <subnet-XXXXXXXX id of private subnet one> <subnet-XXXXXXXX id of private subnet two>
    --type "application"
    ...
    <other options WILL be needed, but depend on your use case>

aws elbv2 create-listener
    --load-balancer-arn "<arn of load balancer>"
    --protocol "HTTPS"
    --port 443
    --certificates "<arn of cert if using AWS cert manager>"
    --default-actions "<Listener default action json>"
    ...
    <other options as needed, such as ssl security policy>

...
<repeat as needed for additional listeners, such as a redirect from 80 to 443>

Connect to the bastion host

There are a few tasks that are easier to do from the bastion host. From the bastion host, I connected to postgres, created a nautobot user and a nautobot database, and granted the nautobot user access to the database. I then mounted the EFS volume from the bastion host and copied over my nautobot.env and nautobot_config.py files. I also created media, static, and jobs directories and updated the configuration to point at them.
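
A rough sketch of those steps, assuming a bastion host with the postgres client and amazon-efs-utils installed; the hostnames, IDs, and password are placeholders:

Bastion host prep (sketch)

psql --host "<FQDN for postgres endpoint>" --username postgres --command "CREATE USER nautobot WITH PASSWORD '<nautobot db password>';"

psql --host "<FQDN for postgres endpoint>" --username postgres --command "CREATE DATABASE nautobot OWNER nautobot;"

psql --host "<FQDN for postgres endpoint>" --username postgres --command "GRANT ALL PRIVILEGES ON DATABASE nautobot TO nautobot;"

sudo mkdir -p /mnt/nautobot-efs
sudo mount -t efs -o tls "<fs-xxxxxxxxxx id of file system>":/ /mnt/nautobot-efs
sudo cp nautobot.env nautobot_config.py /mnt/nautobot-efs/
sudo mkdir -p /mnt/nautobot-efs/media /mnt/nautobot-efs/static /mnt/nautobot-efs/jobs

Because the access point was created with default settings, its root is the file system root, so the files copied here are what the containers will see under the EFS mount path.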

Now that all of the dependencies are in place... ECS!

ECS Cluster

Since I don't want to maintain EC2 hosts, I decided to create a Fargate cluster and named it nautobot-fargate-cluster.

Creating the cluster

aws ecs create-cluster
    --cluster-name "nautobot-fargate-cluster"

ECS Task(s)

Since we are going to use environment variables to create the superuser on initial startup, we can either create two tasks or start with the superuser environment variables and then edit the task to remove them after initialization is done. I'll describe the startup task here; the other task is nearly identical but does not include any of the SUPERUSER environment variables.

I created nautobot-ecs-task as a Fargate 1.4.0 task and assigned nautobot-ecs-task-runner-role to be the execution role and task role. I added nautobot-efs-configfiles as an EFS volume with the nautobot-efs-accesspoint access point, encrypted in transit and using IAM authorization.

I created the first container, nautobot, with the image I uploaded to the ECR repo, and mapped port 8080. I'm not using https on the container since it will only communicate with the load balancer. I specified a working directory of /opt/nautobot and added the EFS volume, mounted at /opt/nautobot/config. Finally, I added environment variables for runtime configuration. Most of them are plain-text values, but the secrets stored in Parameter Store are referenced with ValueFrom and the ARN of the parameter. I loaded some variables into the task definition to make them easier to change or secure; the rest are defined in the nautobot.env file. The task definition variables are:

  • NAUTOBOT_ALLOWED_HOSTS
  • NAUTOBOT_CONFIG - points to my nautobot_config.py located within the EFS mount, /opt/nautobot/config/nautobot_config.py
  • NAUTOBOT_DB_HOST - points to the RDS endpoint
  • NAUTOBOT_REDIS_HOST - points to the Elasticache endpoint hostname. I had trouble here when I included a rediss:// service descriptor or a port number in this variable.
  • NAUTOBOT_DB_PASSWORD - ValueFrom ARN of nautobot user password to postgres
  • NAUTOBOT_REDIS_PASSWORD - ValueFrom ARN of redis auth string
  • NAUTOBOT_SECRET_KEY - ValueFrom ARN of django secret key
  • NAUTOBOT_CREATE_SUPERUSER - true for initialization
  • NAUTOBOT_SUPERUSER_EMAIL
  • NAUTOBOT_SUPERUSER_NAME
  • NAUTOBOT_SUPERUSER_API_TOKEN - ValueFrom ARN of superuser api token
  • NAUTOBOT_SUPERUSER_PASSWORD - ValueFrom ARN of superuser password

Again, the SUPERUSER variables are only needed for the first startup. I kept them defined in my task and changed the value of NAUTOBOT_CREATE_SUPERUSER to false in a newer version of the task.

The second container is the celery worker, nautobot-celery. It does not need mapped ports, but it does need the same container image, EFS volume, and working directory as well as the database, redis, and config environment variables. In addition, we need to specify the entrypoint and command. I used an entrypoint of "/usr/local/bin/nautobot-server" and a command of "celery,worker,--loglevel,INFO,--pidfile,/opt/nautobot/nautobot-celery.pid,-n,nautobot-celery".
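
For reference, the celery-specific lines of that second container definition would look something like the fragment below; everything else mirrors the nautobot container definition shown further down.

Celery worker container definition fragment

      "entryPoint": ["/usr/local/bin/nautobot-server"],
      "command": ["celery", "worker", "--loglevel", "INFO", "--pidfile", "/opt/nautobot/nautobot-celery.pid", "-n", "nautobot-celery"],
      "workingDirectory": "/opt/nautobot",
      "name": "nautobot-celery"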

For both containers, I used default CloudWatch logging. How much memory and CPU you want to commit to the task, how much you want to assign to each container, and what sort of scale-up/scale-out you want to add all depend on usage. My initial usage is low enough that I have not tinkered with scaling.

A warning about ALLOWED_HOSTS: this can be a giant pain with Django and ELB. The HTTP Host header on ELB health checks is set to the IP address that the load balancer is targeting. There are a variety of ways to work around this, from Django plugins built for this issue to HTTP server configs that rewrite the Host header when the request comes from the CIDRs of the private subnets. Because I am using /27 CIDRs, I went with the fast but ugly approach of adding every IP address within the private subnets to ALLOWED_HOSTS.
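
If you take the same approach, a one-liner like this can generate the value for NAUTOBOT_ALLOWED_HOSTS; the two /27 CIDRs are stand-ins for your private subnets, and this assumes your nautobot_config.py splits the variable on spaces, as the default config does:

python3 -c "import ipaddress; print(' '.join(str(h) for c in ('10.0.1.0/27', '10.0.1.32/27') for h in ipaddress.ip_network(c).hosts()))"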

Container definition json for nautobot server

  [
    {
      "portMappings": [
        {
          "hostPort": 8080,
          "protocol": "tcp",
          "containerPort": 8080
        }
      ],
      "environment": [
        {
          "name": "NAUTOBOT_ALLOWED_HOSTS",
          "value": "<host names here>"
        },
        {
          "name": "NAUTOBOT_CONFIG",
          "value": "/opt/nautobot/bazaar/nautobot_config.py"
        },
        {
          "name": "NAUTOBOT_CREATE_SUPERUSER",
          "value": "false"
        },
        {
          "name": "NAUTOBOT_DB_HOST",
          "value": "<FQDN for postgres endpoint>"
        },
        {
          "name": "NAUTOBOT_REDIS_HOST",
          "value": "<FQDN for redis endpoint>"
        },
        {
          "name": "NAUTOBOT_REDIS_PORT",
          "value": "6379"
        },
        {
          "name": "NAUTOBOT_REDIS_SSL",
          "value": "true"
        },
        {
          "name": "NAUTOBOT_SUPERUSER_EMAIL",
          "value": "username@yourdomain.com"
        },
        {
          "name": "NAUTOBOT_SUPERUSER_NAME",
          "value": "admin"
        }
      ],
      "mountPoints": [
        {
          "readOnly": null,
          "containerPath": "/opt/nautobot/bazaar",
          "sourceVolume": "<EFS filesystem>"
        }
      ],
      "workingDirectory": "/opt/nautobot",
      "secrets": [
        {
          "valueFrom": "<arn for database password ssm parameter>",
          "name": "NAUTOBOT_DB_PASSWORD"
        },
        {
          "valueFrom": "<arn for django secret key ssm parameter>",
          "name": "NAUTOBOT_SECRET_KEY"
        },
        {
          "valueFrom": "<arn for superuser api token ssm parameter>",
          "name": "NAUTOBOT_SUPERUSER_API_TOKEN"
        },
        {
          "valueFrom": "<arn for superuser password ssm parameter>",
          "name": "NAUTOBOT_SUPERUSER_PASSWORD"
        }
      ],
      "image": "<URL for ECR image>",
      "essential": true,
      "name": "nautobot"
      ...
      <other options as needed, such as health check, log options, cpu or memory reservations>
    },
    ...
    <json for celery worker container>
  ]

Volume definition json

  [
    {
      "efsVolumeConfiguration": {
        "transitEncryptionPort": null,
        "fileSystemId": "<fs-xxxxxxxxxx id of file system>",
        "authorizationConfig": {
          "iam": "ENABLED",
          "accessPointId": "<fsap-xxxxxxxxxx id of access point>"
        },
        "transitEncryption": "ENABLED",
        "rootDirectory": "/"
      },
      "name": "nautobot-efs-configfiles"
    }
  ]

Create the task

aws ecs register-task-definition
    --family "nautobot-ecs-task"
    --network-mode "awsvpc"
    --task-role-arn "<arn of nautobot-ecs-task-runner-role>"
    --execution-role-arn "<arn of nautobot-ecs-task-runner-role>"
    --container-definitions "<list of jsons defining containers>"
    --volumes "<list of jsons defining volumes>"
    --cpu "<number of CPU units used by the task>"
    --memory "<amount of memory used by the task>"
    ...
    <other options as needed>

Service

The service uses the latest revision of the task defined above and starts just one instance of the task. It launches as Fargate 1.4.0 on Linux on the cluster that we already created. It's deployed into the two private subnets of our VPC, and I've assigned the nautobot-ecs-secgrp security group. This security group allows local access to port 8080, so the load balancer can connect directly to the containers but nothing else can. The load balancer is also defined here, with an IP target group.

Load balancer list json

[
  {
    "targetGroupArn": "<ARN for target group>",
    "loadBalancerName": "nautobot-elb",
    "containerName": "nautobot",
    "containerPort": 8080
  }
  ...
]

Network structure json

{
  "awsvpcConfiguration": {
    "subnets": ["<subnet-XXXXXXXX id of private subnet one>", "<subnet-XXXXXXXX id for private subnet two>"],
    "securityGroups": ["nautobot-ecs-secgrp", ],
    "assignPublicIp": "DISABLED"
  }
}

Note that the task-definition argument below takes the form task-family:task-revision; update the revision number as needed.

Create the service

aws ecs create-service
    --cluster "<ARN of fargate cluster>"
    --service-name "nautobot"
    --task-definition "nautobot-ecs-task:1"
    --load-balancers "<list of load balancer jsons>"
    --desired-count 1
    --launch-type "FARGATE"
    --platform-version "1.4.0"
    --network-configuration "<json for network structure>"
    ...
    <other options as needed>

Test it out!

Once the service is defined, it should start to spin up a task. It will take several seconds to pull the image from ECR, assign a node, and start the task. Once the task is in a RUNNING state, the load balancer will need to register the IP of the nautobot server. At that point, you should be able to connect to the FQDN of the load balancer!
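
A quick end-to-end check once the task is running and the target is healthy; this assumes the CNAME from the ELB section and that your Nautobot version exposes the /health/ endpoint:

curl -i "https://<CNAME or AWS FQDN of the load balancer>/health/"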

If you used the NAUTOBOT_CREATE_SUPERUSER=true environment variable to create a superuser on the first run, you should go back, create a new revision of the task definition with that variable set to false, and update the service to use the new revision. Updating the service should launch a new task instance and drain the old one.
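
Registering the updated task definition produces a new revision (nautobot-ecs-task:2 in this example); switching the service over to it is a single command:

aws ecs update-service
    --cluster "nautobot-fargate-cluster"
    --service "nautobot"
    --task-definition "nautobot-ecs-task:2"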

Troubleshooting

Issues you are likely to run into:

  • tasks restarting - this can be caused by the nautobot server processes running into a failure state or by the ELB or ECS health checks failing. Look at the task logs and the state of the health checks.
  • ALLOWED_HOSTS errors - as noted earlier, ELB does not play nicely with Django ALLOWED_HOSTS.
  • security groups - connectivity problems between the nautobot containers and any of the dependencies can be caused by security group misconfiguration.
  • errors connecting to Redis via SSL - log entries for the nautobot-celery container such as "A rediss:// URL must have parameter ssl_cert_reqs and this must be set to CERT_REQUIRED, CERT_OPTIONAL, or CERT_NONE", or any traceback in the celery or kombu libraries, can be caused by bad redis settings. Verify that the redis SSL changes have been made to nautobot_config.py (see above) and that the NAUTOBOT_REDIS_HOST variable contains only the hostname, not a protocol or port. A quick connectivity check from the bastion host is sketched below.
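
To rule out networking or auth token problems independently of Nautobot, a TLS-capable redis-cli (version 6 or newer, built with TLS support) on the bastion host can talk to Elasticache directly; the hostname and token are placeholders:

redis-cli -h "<FQDN for redis endpoint>" -p 6379 --tls -a "<redis auth token>" ping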