

@AWegnerGitHub
Last active November 22, 2017 02:46
Helios proposal and architecture discussion for a centrally managed database for SmokeDetector


What is this?

This document describes the proposed architecture for a central SmokeDetector database. As this is a proof of concept, a few areas are not finished; these are described in the Work to do section.

This project is tentatively named Helios.

Proposal

Using Amazon Web Services and the Serverless Framework, we will move the blacklists, watchlist, notifications and potentially more items (such as permissions) from Git-managed assets to Helios. We'll also eliminate the assets that are only managed locally, such as the notifications list.

This will allow us to maintain a single notification list that is accurate across all instances of SmokeDetector. The Helios-managed blacklist and watchlist will remove the need for SmokeDetector and metasmoke to perform long-running, error-prone Git commands. Additionally, we won't need to wait for continuous integration to complete when adding to these lists; continuous integration can go back to running on code changes, as it should.

Architecture

Access to the blacklists and notifications will occur via calls to the API that has been set up. The endpoints are described in the Endpoints section.

These endpoints will be used to add, delete and list the various data structures we need. Each instance of SmokeDetector will call the GET HTTP endpoints at start up, on promotion from standby, or on demand to refresh the local cache.

As SmokeDetector is run and receives commands to add or delete items, the local cache will be updated and a call will be made to the API to do the same on Helios. The local cache will be utilized by the running instance at all times, but can be updated at any time as well.

This means that SmokeDetector will never need to call the API to match patterns during runtime. New instances will be updated to the latest patterns on activation.

HTTP GET endpoints will be open to all. This will allow users to fire up a local copy of SmokeDetector and pull the latest version of our centrally managed lists. Authorized instances of SmokeDetector will require an Authentication token, as they do for metasmoke currently, to be able to add or delete from Helios.

If abuse occurs, the open endpoints will be eliminated to prevent unexpected AWS costs from being incurred. However, this is not anticipated.

On the AWS side, we will utilize the Serverless Framework to handle deployment of Python Lambda functions, while information will be persisted in DynamoDB. Deployment of code upgrades will be accomplished with TravisCI. This behind-the-scenes detail doesn't directly impact SmokeDetector or metasmoke, as all access to the information will be via HTTP calls.

Simple timing tests

Attached to this gist is a Python script that walks through a simple workflow SmokeDetector would use to get/update/delete items from the blacklist.

This isn't an exact replication of what SmokeDetector would use, because it doesn't handle the local caching of files. Its goal is to provide timings for the activities that interact with AWS.

This script will perform the following:

  • Request an authentication token
  • Validate the token is active (simulating a user access check)
  • Retrieve each of the blacklists/watchlists
  • Add five items to a single blacklist
  • Delete the five items from the blacklist

The results of the script are also attached. In them we can see the timings it takes to pull the blacklist/watchlist items. In the attached run, the longest list takes roughly 1.3 seconds to pull and the shortest about 0.2 seconds. Altogether, we pull in approximately 5,500 patterns.

Next we send 5 random patterns to Helios. The average time for the full cycle of sending a pattern and receiving a response back is about four tenths of a second, with the shortest trip taking under a fifth of a second and the longest taking just over a second.

Finally, we remove the 5 patterns that we added. The response cycle time for this activity was less than a quarter second for each item being deleted.

This cycle of adding and deleting will be extended slightly by adding in a write operation to a local file. However, even with this additional operation, adding to a list will take seconds. The current cycle of add, commit to Git, issue a pull request, wait for continuous integration, pull, and restart is on the order of minutes for every item being added or deleted.

Response payloads

All endpoints will return a JSON object with the following format:

{
  'items': [array of items],
  'numItems': integer count of items in the above array,
  'message': an optional message indicating that an error may have occurred
}
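
For illustration, here is a minimal sketch of how a client might unwrap that envelope. The helper name is hypothetical, and it assumes the unified format described above, which the proof of concept does not return everywhere yet (see Work to do):

def unwrap_response(response):
    """Hypothetical helper: unwrap the proposed {items, numItems, message} envelope."""
    body = response.json()
    if body.get('message'):
        # The optional message indicates that an error may have occurred.
        raise RuntimeError(body['message'])
    assert body['numItems'] == len(body['items'])
    return body['items']

# Usage sketch:
# items = unwrap_response(requests.get(url))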

Endpoints

Endpoints listed here are for the proof of concept only and live in a development staging area. If this proof of concept is accepted by the community, a production area will be set up and new endpoints will be shared.

Create Authentication token

POST https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/auth/create

This proof of concept allows anyone to create an authentication token. This endpoint will be restricted and current tokens invalidated if this proposal is accepted.

Creating a token requires a payload containing the name of the user the token will be associated with. A Python example follows:

requests.post(url, json={'name': 'Andy'})

Test Authentication token

GET https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/test_auth

Test whether a passed token is valid. This endpoint exists only for the proof of concept and will not be publicly accessible when deployed. The token being checked must be passed in the Authorization header.

r = requests.get(url, headers={'Authorization': 'AVALIDTOKEN'})

This will return Success!

r = requests.get(url, headers={'Authorization': 'ANINVALIDTOKEN'})

This will return {"Message":"User is not authorized to access this resource with an explicit deny"}

Get Blacklists/Watchlists

GET https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/blacklists/{id}

Blacklists and watchlists are accessible to all users. The endpoint requires a valid list type.

Valid options are: watch-keyword, blacklist-website, blacklist-username, blacklist-keyword. Replace {id} in the URL above with one of those options.

r = requests.get(url)

This returns a list of all items in the selected blacklist:

[
    "rhubcom\\.com",
    "erepairpowerpoint\\.com",
    "createspace\\.com",
    "992\\W?993\\W?3179",
    ...
]
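
As a sketch of how SmokeDetector could use these endpoints at start up or on activation to refresh its local cache (the helper and cache file names are assumptions, and this assumes the bare-array response shown above):

import requests

BASE_URL = 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev'
LIST_TYPES = ['watch-keyword', 'blacklist-website',
              'blacklist-username', 'blacklist-keyword']

def refresh_local_cache():
    """Pull every list from Helios and rewrite the local cache files."""
    for list_type in LIST_TYPES:
        r = requests.get('{}/blacklists/{}'.format(BASE_URL, list_type))
        r.raise_for_status()
        patterns = r.json()  # bare JSON array of patterns, as shown above
        # Hypothetical cache file name; SmokeDetector's real file names differ.
        with open('{}.txt'.format(list_type), 'w') as f:
            f.write('\n'.join(patterns) + '\n')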

Create blacklist/watchlist item

POST https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/blacklists/{id}

Adding to the blacklist requires an Authorization token. It also requires that a payload is passed with the pattern to be added in a pattern variable. The {id} value in the URL is one of the valid blacklist options. A Python example is below.

params = {'pattern': r'My.Complicated\sPattern'}
r = requests.post(url, json=params, headers={'Authorization': token})

A successful insert will return a record of the inserted item. The created_at and modified_at values are Unix timestamps.

{
    'modified_by': 'Andy',
    'modified_at': 1510606300,
    'created_at': 1510606300,
    'id': 'watch-keyword-TOTALLYATEST1.COM',
    'text_pattern': 'TOTALLYATEST1.COM',
    'type': 'watch-keyword'
}

Duplicates are not allowed. A duplicate will not be inserted; instead, the response will include a notice of the duplicate attempt and a record of the existing item.
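
A sketch of detecting that notice, continuing from the example above (in the proof of concept the duplicate notice is currently plain text wrapped around the record, so this is just a string check; a unified JSON response is listed under Work to do):

r = requests.post(url, json={'pattern': 'TOTALLYATEST1.COM'},
                  headers={'Authorization': token})
if 'Duplicate' in r.text:
    # The record returned alongside the notice describes the existing item.
    print('Pattern already exists, not inserted.')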

Delete blacklist/watchlist item

DELETE https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/blacklists/{id}

Deleting a blacklist item requires an Authorization token. It also requires that a payload is passed with the pattern to be deleted in a pattern variable. The {id} value in the URL is one of the valid blacklist options.

A Python example:

params = {'pattern': r'My.Complicated\sPattern'}
r = requests.delete(url, json=params, headers={'Authorization': token})

Get all notifications

GET https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/notifications

Notifications are used to alert a user of a specific post type. Currently, these are stored locally and not shared. This means that when a new instance starts, it doesn't have any of the notifications the previous instance had unless users re-add them while the new instance is running.

r = requests.get(url)

This returns a response like this. Each item is a notification:

[{
    "user_id": 66258,
    "server": "chat.stackexchange.com",
    "room_id": 11540,
    "site": "communitybuilding.stackexchange.com"
 },
 ...
]

Create a notification

POST https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/notifications

Creating a notification requires an Authorization token. It also requires a payload that contains each of the following: user_id, server, room_id, site

The combination of these must be unique.

params = {'user_id': 9854,
          'server': 'test_server',
          'room_id': 888,
          'site': 'example.com'}
r = requests.post(url, json=params, headers={'Authorization': token})

Delete a notification

DELETE https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/notifications

Deleting a notification requires an Authorization token. It also requires a payload that contains each of the following: user_id, server, room_id, site

params = {'user_id': 9854,
          'server': 'test_server',
          'room_id': 888,
          'site': 'example.com'}
r = requests.delete(url, json=params, headers={'Authorization': token})

Work to do

Since this proposal only includes a proof of concept, there is more that would need to be done to fully implement it. Below is a short list of items, though it is not guaranteed to include everything.

Auth tokens between MS and Helios

SmokeDetector instances aren't the only things that will require write access to the various endpoints described in this document; metasmoke will also require it. On top of that, we should discuss whether current tokens should be shared, so that SmokeDetector instances only need one token to integrate with both metasmoke and Helios, or whether it is acceptable to have two such tokens.

SmokeDetector Changes

SmokeDetector development will be required to create and update local copies of the lists. These lists already exist as text files or pickles. The required changes include removing the files from Git and updating the commands to send an HTTP request and write the response to the files. Theoretically, no further change to these files would be required, as the rest of the application already uses them today.

Additionally, the code that writes new records will need to append to the end of the appropriate file and then send an HTTP request to Helios so that other instances are updated on activation.
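
For example, the add flow could look roughly like this (a sketch only; the function, file, and list names are illustrative, not actual SmokeDetector code):

import requests

HELIOS_URL = 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/blacklists/{id}'

def add_pattern(list_type, pattern, token, local_file):
    """Append the pattern to the local cache file, then tell Helios so
    other instances pick it up when they activate."""
    with open(local_file, 'a') as f:
        f.write(pattern + '\n')
    r = requests.post(HELIOS_URL.replace('{id}', list_type),
                      json={'pattern': pattern},
                      headers={'Authorization': token})
    return r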

Git pull functionality will need to be changed to eliminate the need for autopulls after adding to lists. This will now take moments instead of minutes.

Many of these changes will be easier once the NG Chat Backport has been completed.

Request / Response models

Status: common responses are implemented, but not via models.

The proof of concept does not have unified response objects. Before deploying to production, these should be implemented so that consumers - SmokeDetector and metasmoke - can expect a common response layout. This will make development easier because everything will follow the same layout.

Move from a dev staging area to a production area

The proof of concept lives in a development area. A production area is needed as well. Keeping both allows us to deploy a test branch of any AWS code without impacting downstream systems.

Doing this means we should also have a configuration option in SmokeDetector to use development or production. This will allow for easier testing.
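
A minimal sketch of that option, assuming a hypothetical configuration value and a production URL that does not exist yet:

# Hypothetical setting; could live alongside SmokeDetector's existing configuration.
HELIOS_STAGE = 'dev'  # or 'prod'

HELIOS_BASE_URLS = {
    'dev': 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev',
    'prod': 'https://<production-id>.execute-api.us-east-1.amazonaws.com/prod',  # placeholder
}

HELIOS_BASE_URL = HELIOS_BASE_URLS[HELIOS_STAGE]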

List approval

One advantage that GitHub provides that this system will eliminate is the ability for more experienced users to approve a pattern. This functionality should be retained. Options to do this include building the functionality into metasmoke, building a command to list new patterns in SmokeDetector, adding a new permission that users need to generate lists, or perhaps something else.

Automated API Documentation

Documenting the API will be vital for developers of SmokeDetector, metasmoke and other applications. This documentation can be done automatically, but requires a bit of initial set up.

---------------------
AWS Serverless test
---------------------
This will go through a few tests to show how the serverless framework would be
utilized with the SmokeDetector project. It will simulate a few activities
and provide results at the end.
Any, and all, information generated during this test will be eliminated before
full deployment. That means the authentication tokens that are being created
for this test will be invalidated in the future.
Information returned from the calls is not completely in sync with live
SmokeDetector. It is reasonably recent - the last week or so - but is not
guaranteed to be identical.
Please provide a name to proceed: Andy
Requesting an authentication token for Andy
Token returned: LtFPPk63WZ5h5AECCcWCxbRuBfWc32mxvPMeUA6e
Fetch Blacklists/Watchlist
Blacklist Type: watch-keyword
Response Time (request to full response): 1.2878460884094238
Response Time (request to header parse): 1.228434
Number of records: 2930
Response Size (bytes): 65679
Blacklist Type: blacklist-website
Response Time (request to full response): 0.559063196182251
Response Time (request to header parse): 0.536927
Number of records: 1598
Response Size (bytes): 39387
Blacklist Type: blacklist-username
Response Time (request to full response): 0.19565057754516602
Response Time (request to header parse): 0.189555
Number of records: 56
Response Size (bytes): 1313
Blacklist Type: blacklist-keyword
Response Time (request to full response): 0.40804362297058105
Response Time (request to header parse): 0.400076
Number of records: 921
Response Size (bytes): 19723
Blacklist 5 random patterns
Blacklist selected for insert: blacklist-username
Min. Response Time (request to full response): 0.17793011665344238
Max. Response Time (request to full response): 1.2436163425445557
Avg. Response Time (request to full response): 0.39461569786071776
Min. Response Time (request to header parse): 0.170033
Max. Response Time (request to header parse): 1.235353
Avg. Response Time (request to header parse): 0.38690599999999997
Delete 5 blacklist patterns
Min. Response Time (request to full response): 0.15787363052368164
Max. Response Time (request to full response): 0.17655277252197266
Avg. Response Time (request to full response): 0.16491231918334961
Min. Response Time (request to header parse): 0.149413
Max. Response Time (request to header parse): 0.168255
Avg. Response Time (request to header parse): 0.1555826
Full suite run time: 6.3377768993377686
# This was written and tested with Python 3.5.
import requests
import time
import string
import random
urls = {
    'create_auth': 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/auth/create',
    'test_auth': 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/test_auth',
    'get_blacklists': 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/blacklists/{id}',
    'create_blacklist': 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/blacklists/{id}',
    'delete_blacklist': 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/blacklists',
    'get_notifications': 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/notifications',
    'create_notification': 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/notifications',
    'delete_notification': 'https://fggqk618ri.execute-api.us-east-1.amazonaws.com/dev/notifications',
}
blacklist_types = [
    'watch-keyword',
    'blacklist-website',
    'blacklist-username',
    'blacklist-keyword'
]
def time_request(url, params=None):
    """
    Time a GET request
    return:
        r -> GET response
        roundtrip -> Total time it took to perform the entire request/response
        elapsed -> Time between sending request and parsing the response *headers*
    """
    start = time.time()
    r = requests.get(url, params=params)
    roundtrip = time.time() - start
    return r, roundtrip, r.elapsed.total_seconds()


def generate_pattern(size=40, chars=string.ascii_letters + string.digits):
    """
    size = Size of string to return
    chars = Character set to choose token characters from
    """
    return ''.join(random.SystemRandom().choice(chars) for _ in range(size))
print("""
---------------------
AWS Serverless test
---------------------
This will go through a few tests to show how the serverless framework would be
utilized with the SmokeDetector project. It will simulate a few activities
and provide results at the end.
Any, and all, information generated during this test will be eliminated before
full deployment. That means the authentication tokens that are being created
for this test will be invalidated in the future.
Information returned from the calls is not completely in sync with live
SmokeDetector. It is reseasonably recent - the last week or so - but is not
guarenteed to be identical.\n
""")
name = input("Please provide a name to proceed: ")
overall_start = time.time()
# Create an authentication token
print("Requesting an authentication token for {}".format(name))
params = {'name': name}
response = requests.post(urls['create_auth'], json=params)
token = response.json()['items'][0]['token']
print("Token returned: {}".format(token))
# Get each of the blacklist types
print("Fetch Blacklists/Watchlist")
for bl_type in blacklist_types:
    url = urls['get_blacklists'].replace("{id}", bl_type)
    response, roundtrip, elapsed = time_request(url)
    print(" Blacklist Type: {}".format(bl_type))
    print(" Response Time (request to full response): {}".format(roundtrip))
    print(" Response Time (request to header parse): {}".format(elapsed))
    print(" Number of records: {}".format(len(response.json())))
    print(" Response Size (bytes): {}".format(len(response.content)))
# Create blacklist items
patterns = [generate_pattern() for x in range(5)]
bl_type = random.choice(blacklist_types)
total_roundtrip = []
total_elapsed = []
print("Blacklist 5 random patterns")
print("Blacklist selected for insert: {}".format(bl_type))
for p in patterns:
    url = urls['create_blacklist'].replace("{id}", bl_type)
    params = {'pattern': p}
    start = time.time()
    r = requests.post(url, json=params, headers={'Authorization': token})
    total_roundtrip.append(time.time()-start)
    total_elapsed.append(r.elapsed.total_seconds())
min_roundtrip = min(total_roundtrip)
max_roundtrip = max(total_roundtrip)
avg_roundtrip = sum(total_roundtrip) / float(len(total_roundtrip))
min_elapsed = min(total_elapsed)
max_elapsed = max(total_elapsed)
avg_elapsed = sum(total_elapsed) / float(len(total_elapsed))
print(" Min. Response Time (request to full response): {}".format(min_roundtrip))
print(" Max. Response Time (request to full response): {}".format(max_roundtrip))
print(" Avg. Response Time (request to full response): {}".format(avg_roundtrip))
print(" Min. Response Time (request to header parse): {}".format(min_elapsed))
print(" Max. Response Time (request to header parse): {}".format(max_elapsed))
print(" Avg. Response Time (request to header parse): {}".format(avg_elapsed))
# Delete blacklist items
total_roundtrip = []
total_elapsed = []
print("Delete 5 blacklist patterns")
for p in patterns:
    # The documented delete endpoint is DELETE /blacklists/{id}.
    url = urls['delete_blacklist'] + '/' + bl_type
    params = {'pattern': p}
    start = time.time()
    r = requests.delete(url, json=params, headers={'Authorization': token})
    total_roundtrip.append(time.time()-start)
    total_elapsed.append(r.elapsed.total_seconds())
min_roundtrip = min(total_roundtrip)
max_roundtrip = max(total_roundtrip)
avg_roundtrip = sum(total_roundtrip) / float(len(total_roundtrip))
min_elapsed = min(total_elapsed)
max_elapsed = max(total_elapsed)
avg_elapsed = sum(total_elapsed) / float(len(total_elapsed))
print(" Min. Response Time (request to full response): {}".format(min_roundtrip))
print(" Max. Response Time (request to full response): {}".format(max_roundtrip))
print(" Avg. Response Time (request to full response): {}".format(avg_roundtrip))
print(" Min. Response Time (request to header parse): {}".format(min_elapsed))
print(" Max. Response Time (request to header parse): {}".format(max_elapsed))
print(" Avg. Response Time (request to header parse): {}".format(avg_elapsed))
print("Full suite run time: {}".format(time.time() - overall_start))
@ArtOfCode-

Security - we're obviously going to want to limit who can generate a token, and we may want some sort of validation of the name that gets sent in the token request. Both of those things could be done via metasmoke - metasmoke has an Uber-Token that must be passed to create a regular token; users request tokens through metasmoke, and it requests and displays a token from Helios. Only users who have linked their account to SE are allowed to create a token, so that we can verify usernames.

@nic-hartley

Please make the responses entirely valid JSON -- it makes parsing it on the client-side a lot easier. As a random example, take this error response:

Duplicate entry attempted. {
    'id': 'watch-keyword-TOTALLYATEST1.COM',
    'type': 'watch-keyword',
    'text_pattern': 'TOTALLYATEST1.COM',
    'modified_by': 'Andy',
    'created_at': 1510606419,
    'modified_at': 1510606419
}

It should probably be something like this, instead:

{
    "error": "Duplicate entry attempted",
    "previous": {
        "id": "watch-keyword-TOTALLYATEST1.COM",
        "type": "watch-keyword",
        "text_pattern": "TOTALLYATEST1.COM",
        "modified_by": "Andy",
        "created_at": 1510606419,
        "modified_at": 1510606419
    }
}

That's just an example; the precise format can be changed. However, keeping things purely JSON (or XML, or BSON, but one consistent format whatever it is) will make it significantly easier to parse -- instead of having to write wrappers around the parsers, we can just pass the response text directly in and... you know, not write half of a custom parser.

@nic-hartley

(This is a separate comment for the purposes of letting people yell at me for a specific thing. Clarity and all that.)

Are we going to have any sort of redundancy? I don't see any mentions of it just quickly skimming through this, but it might be worth at least thinking about when we're making this -- if nothing else, building the design so redundancy can be added later more easily. Sure, AWS is fantastically reliable except for that one instance, but if the central database went down because one person forgot to pay the bills, it would be frustrating.

@AWegnerGitHub

@nic-hartley It will be valid JSON. That's one of the TODOs listed. I agree that it's difficult as is, though.

What kind of redundancy are you expecting? Multiple people running multiple instances of this? In multiple AWS regions?

I don't like the idea of having this run by multiple people. That eliminates the "central" part of this and complicates how everything interacts with everything (multiple smokedetectors, multiple Helios, single metasmoke).

While "paying bills" is a crappy failure reason, I think we'll figure out how and who can handle that aspect.

Regarding multiple AWS regions, that can be investigated. I don't know if it's needed right now.

@AWegnerGitHub

@ArtOfCode- Authorization, in this case, would be for instances of SmokeDetector that should be allowed to update the lists. One thought I had was sharing with Helios the metasmoke token that these instances already have. This means the user running the instance only has to manage one token.

Another option is to give metasmoke an authorized route to Helios where it can create SmokeDetector tokens. This would be part of the "creating integration tokens" step that we go through when we bring on a new instance. Now we just provide two tokens, one to talk to metasmoke and one to talk with Helios. This route isn't something that SmokeDetector would ever use and only metasmoke would have a token to use it.

The advantage to the first is that there is only one token for end instances to manage. However, it does require sharing that token between the two systems and keeping them in sync (does a metasmoke token expire or change?)

The advantage to the second is that we can provide more granularity between who can talk to metasmoke and who can update the underlying lists. I don't know if that's a scenario we need though.

@ArtOfCode-

Another option is to give metasmoke an authorized route to Helios where it can create SmokeDetector tokens. This would be part of the "creating integration tokens" step that we go through when we bring on a new instance. Now we just provide two tokens, one to talk to metasmoke and one to talk with Helios. This route isn't something that SmokeDetector would ever use and only metasmoke would have a token to use it.

This is what I was thinking with my comment above - metasmoke generates the Helios tokens on behalf of users who have permissions to do so.

The advantage to the first is that there is only one token for end instances to manage. However, it does require sharing that token between the two systems and keeping them in sync (does a metasmoke token expire or change?)

Yup. MS tokens can be deleted by either admins or developers, I forget which. Their owners can also re-generate them.

@Undo1

Undo1 commented Nov 15, 2017

Let metasmoke interface with it to handle tokens - it's good at that.

What kind of redundancy are you expecting? Multiple people running multiple instances of this? In multiple AWS regions?

One region should be plenty. It's AWS - downtime is rare, data loss rarer, and we have a bunch of distributed backups of data everywhere. Metasmoke is one VPS managed by an incompetent dog; this will have a bunch more nines. The incompetent dog probably has more eights, though!

While "paying bills" is a crappy failure reason, I think we'll figure out how and who can handle that aspect.

I've done the numbers on (and currently run a few of) these kinds of projects before. I have a dollar bill on my desk, it should last a decade or two.

Re: Authorization, this is easy. Metasmoke can interface directly with the underlying DynamoDB tables for anything we need MS' permission structure for (my running it would simplify permissions, but not entirely necessary). No dealing with tokens at all.

@quartata

quartata commented Nov 15, 2017

I've mentioned this in chat already, but I've already made most of a POC for a central blacklist tracker that uses deltas. Is there any way we can merge the blacklisting parts of this into your more broad specification?
