
iD Insights stack (technical)

This will serve as a guide for making future technical enhancements to the iD insights stack (ELK and GA), covering general maintenance tasks and current stack architecture.

Important note regarding this document: this document is maintained through a Google Drive <--> StackEdit sync/pairing. Please make edits on one of those two platforms. Do not edit the gist directly; doing so will effectively fork the document from its source.

Source: https://drive.google.com/open?id=0B52VNRX_C7xPN0ZvZV9paHd1MmM

Sink: https://gist.github.com/j-groeneveld/c1ffccc6640f52d15491

Important note: keep in mind the versions of the software we are running when poking around at documentation, etc.: `logstash --version`

logstash 1.5.4

`kibana --version`

4.1.4

`curl [ES domain]`

{ "status" : 200, "name" : "Arc", "cluster_name" : "095446746036:es-stage", "version" : { "number" : "1.5.2", "build_hash" : "62ff9868b4c8a0c45860bebb259e21980778ab1c", "build_timestamp" : "2015-04-27T09:21:06Z", "build_snapshot" : false, "lucene_version" : "4.10.4" }, "tagline" : "You Know, for Search" }

Target audience: DevOps, IT and developers.

Tracking new things

To track new events in the ELK stack, you will utilize the ElkLogger service. The following are required attributes: event_name and user_id.
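
Once the application-side ElkLogger call is in place, a quick way to confirm the new event is actually being emitted (before chasing it through the rest of the pipeline) is to watch the application log described in the next section. A minimal sketch, with a placeholder event name:

# event_name and user_id are the two required attributes; grep for your new
# event name in the application log on the relevant machine.
tail -f /var/log/phoenix/application.log | grep '"event_name":"My New Event"'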

Event data flow

Here we will go through the typical data flow of any event being ingested into the ELK stack. We will use the following format to describe event state: "Event state [step] [environment] [software]". The ElkLogger writes to a file, application.log, on the current machine. Event state 1 [prod] (log courier):

[2016-01-28 16:56:23 -0800] [phoenix] {"event_name":"Logged In","user_id":3372,"ip_address":"54.187.86.88","user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36","referrer":"https://www.identity.com/signin","tags":["audit log"]}

Log Courier, constantly watching application.log, ships this data across a TCP socket to the "broker", ls-elk-broker-01.ins, on port 4545. Event state 2 [elk] (logstash):

{ "environment" => "prod", "host" => "web-px-03.prod.us-west-2.i.identityaws.net", "message" => "[2016-01-28 16:56:23 -0800] [phoenix] {"event_name":"Logged In","user_id":3372,"ip_address":"54.187.86.88","user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36","referrer":"https://www.identity.com/signin\",\"tags\":[\"audit log"]}", "offset" => 123188, "path" => "/var/log/phoenix/application.log", "tags_courier" => [ [0] "application log", [1] "prod" ], "@version" => "1", "@timestamp" => "2016-01-29T00:56:27.764Z", "orig_host" => "web-px-03.prod.us-west-2.i.identityaws.net", "tags" => [ [0] "broker" ] }

The event is sent to the [stage] "processor" (elk-stage-esclient-01.ins, port 4546) and/or the [elk] "processor" (elk-elk-esclient-01.ins, port 4546), depending on the event's destination environment. At the processor level the events get formatted appropriately by the respective logstash configuration. Event state 3 [stage] (logstash):

{ "environment" => "prod", "host" => "web-px-03.prod.us-west-2.i.identityaws.net", "message" => "[2016-01-28 16:56:23 -0800] [phoenix] {"event_name":"Logged In","user_id":3372,"ip_address":"54.187.86.88","user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36","referrer":"https://www.identity.com/signin\",\"tags\":[\"audit log"]}", "offset" => 123188, "path" => "/var/log/phoenix/application.log", "@version" => "1", "@timestamp" => "2016-01-29T00:56:23.000Z", "orig_host" => "web-px-03.prod.us-west-2.i.identityaws.net", "tags" => [ [0] "audit log", [1] "application log", [2] "prod" ], "timestamp" => "2016-01-28 16:56:23 -0800", "source" => "phoenix", "event_name" => "Logged In", "user_id" => 3372, "ip_address" => "54.187.86.88", "user_agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36", "referrer" => "https://www.identity.com/signin", "geoip" => { "ip" => "54.187.86.88", "country_code2" => "US", "country_code3" => "USA", "country_name" => "United States", "continent_code" => "NA", "region_name" => "NJ", "city_name" => "Woodbridge", "postal_code" => "07095", "latitude" => 40.55250000000001, "longitude" => -74.2915, "dma_code" => 501, "area_code" => 732, "timezone" => "America/New_York", "real_region_name" => "New Jersey", "location" => [ [0] -74.2915, [1] 40.55250000000001 ] } }

The processor writes every event to the corresponding Elasticsearch cluster (AWS ES service). Important note: environment mappings are as follows - elk => production, stage => stage. Both of these elk infrastructure environments work across all application environments. Event state 4 [stage] (elasticsearch):

{ "_index": "v2-user-events-2016.01.29", "_type": "logs", "_id": "AVKK4aucWpmR-kl3AvQu", "_score": null, "_source": { "environment": "prod", "host": "web-px-03.prod.us-west-2.i.identityaws.net", "message": "[2016-01-28 16:56:23 -0800] [phoenix] {"event_name":"Logged In","user_id":3372,"ip_address":"54.187.86.88","user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36","referrer":"https://www.identity.com/signin\",\"tags\":[\"audit log"]}", "offset": 123188, "path": "/var/log/phoenix/application.log", "@version": "1", "@timestamp": "2016-01-29T00:56:23.000Z", "orig_host": "web-px-03.prod.us-west-2.i.identityaws.net", "tags": [ "audit log", "application log", "prod" ], "timestamp": "2016-01-28 16:56:23 -0800", "source": "phoenix", "event_name": "Logged In", "user_id": 3372, "ip_address": "54.187.86.88", "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36", "referrer": "https://www.identity.com/signin", "geoip": { "ip": "54.187.86.88", "country_code2": "US", "country_code3": "USA", "country_name": "United States", "continent_code": "NA", "region_name": "NJ", "city_name": "Woodbridge", "postal_code": "07095", "latitude": 40.55250000000001, "longitude": -74.2915, "dma_code": 501, "area_code": 732, "timezone": "America/New_York", "real_region_name": "New Jersey", "location": [ -74.2915, 40.55250000000001 ] } }, "fields": { "@timestamp": [ 1454028983000 ] }, "sort": [ 1454028983000 ] }

Event data may now be consumed through the corresponding Kibana instances, kibana-stage and kibana-elk. [Diagram: Event data flow]
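
As a final end-to-end check you can also query the index directly instead of going through Kibana. A sketch using the event above (substitute the real ES endpoint for {ES domain}):

# Look up the "Logged In" event for user 3372 in that day's index.
curl -s "http://{ES domain}:80/v2-user-events-2016.01.29/_search?pretty" -d '{
  "query": {
    "bool": {
      "must": [
        { "term": { "user_id": 3372 } },
        { "match": { "event_name": "Logged In" } }
      ]
    }
  },
  "size": 1
}'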

MySQL data in ES

The only data that does not go through this whole pipeline is the MySQL data. An example of MySQL data in ES is the Users dashboard, which consumes a scrubbed form of our MySQL users table. Instead of running another logstash service on the esclient machines, we run a simple cron job every 30 minutes, so here we skip event states 1 and 2. This job is stored in the logstash user's crontab (a rough sketch of the entry follows the recipe below). The configuration is currently loaded through Salt and can be found here, with the config file named [environment]-mysql.conf. The following recipe was used to install the MySQL drivers:

> wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.38.tar.gz
> tar -zxf mysql-connector-java-5.1.38.tar.gz
> mv mysql-connector-java-5.1.38 /usr/lib/
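
The cron entry itself looks roughly like the following sketch; the logstash binary path and log file here are illustrative assumptions, so check crontab -u logstash -l on the esclient machines for the real entry:

# Illustrative only -- runs the MySQL import pipeline every 30 minutes.
*/30 * * * * /opt/logstash/bin/logstash -f /etc/logstash/conf.d/stage-mysql.conf >> /var/log/logstash/mysql-import.log 2>&1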

Event storage

It is important to note that currently all application events should be found in the union of the admin-events and user-events aliased indexes. We can extend this to an *-events index pattern in the future if needed.
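
A quick way to sanity-check this from the command line (substitute the real ES endpoint for {ES domain}):

# Count documents visible through the two aliases.
curl -s "http://{ES domain}:80/admin-events,user-events/_count?pretty"

# See which concrete daily indexes the aliases currently resolve to.
curl -s "http://{ES domain}:80/_alias/admin-events,user-events?pretty"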

Infrastructure changes

  • log-courier: this determines which files on which machines get sent into the ELK stack. Refer to the log-courier configuration.
  • logstash: this determines how the data from those files gets inserted into the Elasticsearch clusters. Refer to the logstash configuration.
  • elasticsearch: this is hosted by AWS and provided as a service by them. Navigate to the AWS console and look at our ES cluster domains in the Oregon region. Use the Sense plugin to investigate the internal structure of the ES datastore.
  • kibana: this is where you go to consume ELK data and derive insights. Refer to this diagram for an understanding of how the Kibana infrastructure is laid out. You can also look over forum questions regarding iD insights on the Elastic forums.

Debugging a broken pipeline

When the pipeline has broken down in the past, it has usually been following some sort of update or change in configuration or architecture. If the data pipeline has broken down and you are not seeing any new data enter Kibana, here is a list of things to check, in decreasing order of priority (a few of these checks are collected in the sketch after the list). For help, we have a file at /var/log/logstash/dots.log to which a single dot is written every time an event is processed into Elasticsearch. Inspect it with watch -n 1 wc -l /var/log/logstash/dots.log for a realtime view of pipeline processing speed.

  1. Are there any ghost daemons running on the esclient machines? Check with ps aux | grep logstash or ps aux | grep log-courier. Kill any such processes manually with kill -9 [pid] and start them again using systemctl.
  2. Is the logstash process running as expected on the ls-elk-broker machine? Logs are at /var/log/logstash/logstash.log. If necessary, restart the process with systemctl, or kill -9 [pid] and start it again.
  3. All our application machines need to be able to connect to the ls-elk-broker machine on port 4545. Check whether an application machine can connect by using telnet [private hostname] 4545. Also run ss -plnt on the broker machine to see whether port 4545 is listening.
  4. Can the logstash broker machine connect to the logstash processor machines on port 4546? Use telnet [private_host] 4546 from the broker machine to test.
  5. Does the Elasticsearch access policy (found in the AWS console under the cluster name) allow communication from the appropriate (environment-dependent) logstash processor machine(s)?
  6. Is the event getting through the logstash processor configuration to the right ES index name, and are you looking at the correct ES index?
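
A minimal sketch collecting a few of these checks; hostnames and ports are the ones described above, and the systemctl unit name is an assumption:

# 1. Ghost daemons on the esclient machines
ps aux | grep -E 'logstash|log-courier' | grep -v grep
# kill -9 [pid] any strays, then restart cleanly, e.g.:
# systemctl restart logstash

# 3. / 4. Connectivity checks (run from an application machine / the broker)
telnet ls-elk-broker-01.ins 4545
telnet elk-elk-esclient-01.ins 4546

# Is the broker actually listening on port 4545?
ss -plnt | grep 4545

# Realtime view of pipeline throughput on the processor machines
watch -n 1 wc -l /var/log/logstash/dots.log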

Stale data & Curator

Data turns stale when it is no longer useful for the business to keep it around. We are using Curator to help manage old indexes by doing some general housecleaning: we run a handful of curator commands once a day via the logstash user's crontab on the esclient machines, deleting nginx and logstash indexes older than 7 days. Inspect this crontab with crontab -u logstash -l as root.
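
A rough sketch of what those daily jobs look like, assuming Curator 3.x-style syntax; the actual entries in the logstash user's crontab are authoritative:

# Delete nginx and logstash indexes older than 7 days.
curator --host {ES domain} --port 80 delete indices --older-than 7 --time-unit days --timestring %Y.%m.%d --prefix nginx-
curator --host {ES domain} --port 80 delete indices --older-than 7 --time-unit days --timestring %Y.%m.%d --prefix logstash-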

Reindexing data

Sometimes we will want to reindex data in Elasticsearch, for example when we have a mapping conflict. [Image: mapping conflict message in Kibana]

Reindexing allows us to restructure data that is already in the Elasticsearch datastore. Standard practice here is to use logstash to pull data out of ES, reformat it, and write it back to the same or a different index (or indexes). Typically we will perform this reindexing job from the logstash "processor" machines (e.g., elk-elk-esclient-01.ins).

Sample logstash configuration (others can be found on the processor machines with locate reindex-*); domains scrubbed for security:

input {
  elasticsearch {
    hosts => ["{ES domain}:80"]
    index => "user-events-*"
    docinfo => true
  }
}
filter {
  mutate {
    rename => { "foo" => "bar" }
  }
}
output {
  elasticsearch {
    host => "{ES domain}"
    port => 80
    protocol => "http"
    index => "v2-%{[@metadata][_index]}"
  }
}

If we are changing the data types of particular fields, we will likely want to create an ES template so that the new formatting gets applied. In this case we will also want to insert the data into an entirely new index; the best way of doing this is to use the v2-user-events naming convention. Note that you cannot write directly to an alias; rather, you have to be explicit about the concrete index (for example v2-user-events-2015.12.31). It is a good idea to investigate the indexes you are modifying with the Sense Chrome extension, or something similar, before you "initiate the reindex sequence". Here is a detailed reindexing sequence that I wrote out as a guideline for this particular case (a curl sketch of the aliasing step follows the list):

Re-index with dynamic templates & .raw fields

* define "template": "*"
* clean old indices if curator has failed
* reindex all nginx-* to v2-nginx-*
* repeat last step for admin-events, user-events, audit-logs and logstash
* make the change in the salt-formulas logstash configuration (LOGSTASH): write to v2-* indices and add in any new filters as necessary (test in stage)
* check all v2-* indexes for today to make sure data is coming in: v2-nginx-*, v2-admin-events-*, v2-user-events-*, v2-audit-logs, v2-logstash-*
* as an extra precaution make sure no data is coming into: nginx-*, admin-events-*, user-events-*, audit-logs, logstash-*
* Do these next two steps side by side for each index
    * DELETE nginx-*, admin-events-*, user-events-*, audit-logs, logstash-*
    * alias nginx-* to v2-nginx-*, admin-events-* to v2-admin-events-*, user-events-* to v2-user-events-*, audit-logs to v2-audit-logs, logstash-* to v2-logstash-*
* check: nginx-*, admin-events-*, user-events-*, audit-logs, logstash-*, to make sure they all resolve to v2-* and that data is coming into them (where expected)
* refresh index patterns in kibana to display new mappings
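
For reference, the aliasing step above can be done with the ES 1.x _aliases API. A sketch with example index names (adapt to the indexes you are actually migrating, and substitute the real ES endpoint):

# Point the old names at the new v2-* index.
curl -XPOST "http://{ES domain}:80/_aliases" -d '{
  "actions": [
    { "add": { "index": "v2-user-events-2016.01.29", "alias": "user-events-2016.01.29" } },
    { "add": { "index": "v2-user-events-2016.01.29", "alias": "user-events" } }
  ]
}'

# Confirm the alias now resolves to the v2-* index.
curl -s "http://{ES domain}:80/_alias/user-events?pretty"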

Elasticsearch (ES)

Templates tell ES the kinds of metadata that should be associated with a new index when it is created. We have a template called all that handles all indexes that get created in ES. It takes care of ensuring things like "for every field that is a string, create a separate field named [field].raw that is not analyzed". We also have templates for all of our aliased indexes, to ensure that a new index such as admin-events-2016.01.01 will automatically be reachable through the admin-events alias. You can use the Sense plugin and run GET _template/* to inspect all existing templates. Elastic Documentation
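
For illustration, the .raw behaviour is typically implemented with a dynamic template along the following lines in ES 1.x. This is a sketch of the shape only, not the literal contents of our all template; inspect the real one before changing anything:

# Inspect the real template first.
curl -s "http://{ES domain}:80/_template/all?pretty"

# Illustrative shape only -- do not blindly overwrite the live template.
curl -XPUT "http://{ES domain}:80/_template/all" -d '{
  "template": "*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "analyzed",
              "fields": {
                "raw": { "type": "string", "index": "not_analyzed", "ignore_above": 256 }
              }
            }
          }
        }
      ]
    }
  }
}'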

Aliases simply act as proxies to the real indexes behind the scenes. For instance, we have indexes admin-events-2016.01.01 and admin-events-2016.01.02. With aliases properly set up on both of these indexes, we can query over both of them simply by using admin-events. Elastic Documentation

ES types

At an Elastic workshop, we were advised to avoid using ES types. This is well explained here and has to do with Lucene indexes and how they operate under the hood. It is for this reason that, at present, every document in our ES clusters has "_type": "logs".

The fact that documents of different types can be added to the same index introduces some unexpected complications.

Future enhancements

  • Replace the logstash broker layer. With the Convox setup, we may also need to drop the log-courier shippers. The likely move here is substituting AWS Kinesis for both the log-courier shippers and the logstash broker layer, and having the logstash processors subscribe directly to the Kinesis stream.

  • Move ES hosting from the AWS ES service to Elastic Found. There are a couple of reasons. Community consensus is that the AWS ES service is still a little buggy. While this is common for a new AWS offering, conversations with support staff suggest they will be in "bug fixing mode" at least until (approx.) July 2016. We are also locked in to v1.5.2 from July 1, 2015. The Elastic Found offering gives an essentially up-to-date ES version and relatively seamless upgrades. The main reason we went with the AWS ES service originally was the prospect of VPC support, to integrate our ES deployment with our application environments; however, there is currently no VPC support, and it is likely not on the cards until the end of 2016. That said, we are managing fine by explicitly defining the IPs that can talk to our ES deployment in our access policy. We would do the same on Elastic Found, so it doesn't seem prohibitively expensive to maintain. AWS support addressing upgrade

  • Move config management from SaltStack to Ansible. SaltStack is proving difficult to maintain, and with the application infrastructure moving to Convox, the ELK stack will be the only remaining infrastructure served up through SaltStack.

  • Remove the httpd web server from in front of Kibana. At the moment the sole purpose of this web server is to act as the authentication layer for our Kibana instance. We should integrate this authentication into Kibana itself. The reason we haven't done this yet is that we have httpd authenticating against Inflection's LDAP servers, and there was no easy way to do this in Kibana (at the time, the options we looked at were creating a single Kibana login shared with everyone or piggy-backing off of our existing LDAP setup).
