@46bit
Last active January 14, 2018 16:38

Cloud Foundry Summit Europe 2017

Still to watch:

Keynotes

  • They're making Kubo part of the core Cloud Foundry project.
    • The CF we run is now the Cloud Foundry Application Runtime.
    • Kubo (Kubernetes deployed by BOSH) is now the Cloud Foundry Container Runtime.
  • The Open Service Broker API is seeing some use outside of CF, and they're very keen on getting wider use. It's getting optional extensions for things like triggering backups.
  • CF continues to rapidly grow in deployment.

Foundry (trade show)

  • IBM BlueMix has a web interface with QuickStart kits, CI pipelines, etc. It hides a lot of the detail, but by choice: when creating a CI pipeline it asks whether you want it to run on CF or Kubernetes. It also offers IBM value-add services such as Watson.
  • Datadog have a new APM (Application Performance Monitoring) feature set. This could be useful for watching our own components (Service Brokers, etc.). They have libraries which report on the function calls, queries, etc. that apps make. It's a separate cost from what we pay for, but there's a free tier.
  • Swisscom are selling CF deployments to enterprise customers, rather than doing public cloud. It's a mix of CF apps and a container layer (I didn't ask what it's built on.) Their value sell is that they have lots of data centres in Switzerland (with low latency between them), so they can offer more than deploying to AWS/etc.

Diego Project Updates

https://www.youtube.com/watch?v=YmXNA49tVpo&index=49&list=PLhuMOCWn4P9hsn9q-GRTa77gxavTOnHaa

  • There's a new cfdot (Diego Operator Toolkit) CLI for Diego
    • This helps because the Diego API is protobuf RPC, not RESTful
    • Examples in talk: count apps by instance state; find app GUID from IP:port pair
  • They're working on authenticated routes
    • These should make it nearly impossible for traffic to be routed to the wrong apps
    • They plan to put App IDs into the Common Name fields on certificates
    • I'm not sure if the killer value of this applies to us
  • They're starting work on natively-supported zero-downtime/rolling app updates
  • They're thinking about rebalancing apps or overcommitting cells to reduce need for excess platform capacity
    • Planning to automatically back-off stressed cells or clusters
  • Plans for a v2
    • All components TLS-only
    • Further switch Consul usage to Locket
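The "count apps by instance state" query from the talk can be sketched as below. This assumes cfdot-style output of one JSON actual-LRP record per line; the field names here are illustrative, not taken from the real Diego BBS schema.

```python
# Sketch of grouping actual-LRP records by instance state, assuming one
# JSON record per line of cfdot output. Record shape is hypothetical.
import json
from collections import Counter

def count_by_state(ndjson_lines):
    """Count actual-LRP records by their instance state."""
    states = (json.loads(line)["state"] for line in ndjson_lines if line.strip())
    return Counter(states)

# Example input resembling cfdot output (hypothetical records):
sample = [
    '{"process_guid": "app-1", "state": "RUNNING"}',
    '{"process_guid": "app-2", "state": "RUNNING"}',
    '{"process_guid": "app-3", "state": "CRASHED"}',
]
print(count_by_state(sample))  # Counter({'RUNNING': 2, 'CRASHED': 1})
```

In practice you'd pipe the real cfdot output into something like this (or jq) rather than hard-coding records.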

An Introduction to Cloud Foundry Core Development

https://www.youtube.com/watch?v=LYg5yUcwvj4&index=42&list=PLhuMOCWn4P9hsn9q-GRTa77gxavTOnHaa

  • The vcap username comes from CF's VMware history!
  • CF has been an OSS foundation since 2015.
  • PMC committees manage related groups of projects
    • Runtime PMC: Diego+CLI+CAPI
    • BOSH PMC
    • etc
  • Core project is themed around Dedicated Contributors
    • Active members of the team working on the backlog, not jumping in.
    • You join by active pairing in person for a few months
  • To get deeply involved in core, properly, you need your employer to dedicate you 50-75+% to working on an OSS CF team's backlog
    • I discussed this further: it's relatively rare that core CF projects receive patches from non-DCs, but you could coordinate in their Slack channel.

CF AutoScaler Service Project Update

https://www.youtube.com/watch?v=YHt3ydqzBpY&t=106s&list=PLhuMOCWn4P9hsn9q-GRTa77gxavTOnHaa&index=71 https://github.com/cloudfoundry-incubator/app-autoscaler

  • cf-incubator project, ready for wide usage

    • BETA release was in July 2017
    • SAP Cloud Platform has it, preparing for General Availability
    • Multi-cloud - AWS, Azure, OpenStack, GCP is next
    • Strong internal and ramp-up-customer adoption for lots of use-cases
    • Performance tests went well, but no key details were given
  • It's a CF service, deployed by BOSH, for scaling horizontally.

    • You define a scaling policy when creating a service instance, then bind it to apps.
    • Scaling policies given as JSON and are quite feature-rich
      • Thresholds over monitoring windows, e.g., "for window of 300s with cooldown of 300s, if > 8000 Throughput, +1 instance"
      • According to recurring schedules
      • At specific dates and times
    • The AutoScaler has a public API for managing policies.
      • The service broker talks to this public API.
      • Public APIs seem to be a popular technique, offering a workaround for not being able to retrieve provisioning parameters.
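A scaling policy like the one described above ("for a window of 300s with a cooldown of 300s, if Throughput > 8000, +1 instance") might look like the following. The field names follow my reading of the app-autoscaler docs and may not match the current schema exactly, so treat this as illustrative.

```python
# Sketch of an app-autoscaler scaling policy. Field names are assumptions
# based on the app-autoscaler repo's examples, not a verified schema.
import json

policy = {
    "instance_min_count": 2,
    "instance_max_count": 10,
    "scaling_rules": [
        {
            "metric_type": "throughput",
            "breach_duration_secs": 300,  # monitoring window
            "cool_down_secs": 300,        # wait before scaling again
            "threshold": 8000,
            "operator": ">",
            "adjustment": "+1",           # add one instance on breach
        }
    ],
}

# The JSON would be supplied when creating the service instance; the policy
# then applies to every app the instance is bound to.
print(json.dumps(policy, indent=2))
```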
  • It doesn't respect open connections because of Diego limitations.

    • Connections will be killed if the app is TERMed because of scaling down :-(
    • This is another case of expecting containers to have long lifetimes
  • It has a lot of moving parts

    • Metric Controller takes raw events from Loggregator and transforms to scaling metrics
    • Aggregator/EventGen triggers dynamic scaling events
    • Scaling Engine talks to the Cloud Controller to perform the scaling
    • Scheduler triggers scheduled scaling events
    • Service Broker talks to API Server
    • API Server serves the Public API
    • Postgres backing DB (plans to expand to support other backing DBs)
  • I asked about reducing the number of moving parts, as Graham had asked.

    • They plan to reduce the number over time, but it's longer-term.
    • At the moment they have so many microservices to make it easy for people to add custom metrics.
    • If in time all metrics come via Loggregator, they can merge components down and get something similar.

CF Networking Project Updates

https://www.youtube.com/watch?v=lskNPk1c2xM&list=PLhuMOCWn4P9hsn9q-GRTa77gxavTOnHaa&index=44 https://github.com/cloudfoundry/cf-networking-release

  • Until now, apps have needed public routes because their outgoing traffic is NATted.

    • We want app-to-app networking. Right now each container has an IP, but we want app-to-app policies rather than IP-based ones.
    • In 1.0 you can now configure this as a space developer or an admin, via CLI or API.
    • Out of incubation in September 2017. Default in cf-networking. Has been in PCF for a long time in various guises.
  • Using the standard Container Network Interface (CNI).

    • The default networking uses a Silk CNI overlay network for all traffic between apps.
      • It uses a /16 subnet by default which can contain 255 cells each with 255 containers.
      • New Garden External Networking on the cell does the mapping to the CNI API. There's then a kernel-and-security-related component that also talks to Silk.
      • There are almost two sides, the networking side and the policy side. See diagrams from the recorded presentation.
      • Supports both private apps (e.g., hidden backend APIs) and apps that operate as clusters.
      • Current service discovery is messy but they're two weeks into a discovery phase. The demo used Eureka for service discovery, but they brought that with them.
    • You forward port N on a container to a different container, because a private app has no route at all. Not sure if you can map multiple ports or suchlike.
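The subnet arithmetic above can be checked with Python's ipaddress module: a /16 overlay splits into 256 /24s, which is where the "255 cells each with ~255 containers" sizing comes from. The 10.255.0.0/16 range is my recollection of Silk's default and may differ in a given deployment.

```python
# Checking the /16 overlay arithmetic: 256 /24 cell subnets, each with
# 254 usable host addresses. The range itself is an assumption.
import ipaddress

overlay = ipaddress.ip_network("10.255.0.0/16")
cell_subnets = list(overlay.subnets(new_prefix=24))
print(len(cell_subnets))                  # 256 /24s available for cells
print(cell_subnets[0].num_addresses - 2)  # 254 usable hosts per /24
```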
  • It's actively designed so that vendors can make different networking setups. This is seriously happening, for instance in DC environments where it makes sense to do cleverer networking to containers.

  • Next

    • Polyglot Service Discovery: apps able to discover each other without routes, so users stop having to bring their own service discovery. It needs to support finding one instance of an app or all instances.
    • Aim to give all apps an Infrastructure name. Currently the app GUID followed by .cfinternal. Then each instance will get an ID. Aim to integrate with Istio, a new service mesh.
    • Better tagging of logs to include relevant IDs.
  • I asked Angela and Usha (the Product Manager that Graham talked to last month) if they could tell me a bit about how the new components inside the cell work.

    • The agent (whose name I didn't catch) gets the IP addresses of the remote apps, then sets up the network interfaces and iptables rules (on both the sending and receiving ends) to make the routing and port forwarding happen. Maybe I should read the source.

Open Container and Open Networking Initiatives Updates

https://www.youtube.com/watch?v=T4OJwoQCpt4&list=PLhuMOCWn4P9hsn9q-GRTa77gxavTOnHaa&index=45

  • Garden is becoming smaller, gaining pluggable networking (and another pluggable piece) just as it already had pluggable runtimes.
  • IPv6 support is to-do in Garden; that would then let the wider CF project start working on support.
  • I was a bit fatigued by this point, but the video is most definitely worth watching.

UAA Project Updates

https://www.youtube.com/watch?v=cerANFN9ufk&index=48&list=PLhuMOCWn4P9hsn9q-GRTa77gxavTOnHaa

  • Roadmap

    • Gaining support for opaque tokens: an easily revocable alternative to the current JWTs.
    • Improving performance. Goal is no performance regressions with new releases.
    • Perm is coming; it's in the Incubator on GitHub. It offers fine-grained custom roles. Perhaps it'll let us give people roles which can create users?
    • Doing MVP of moving Cloud Controller roles to UAA, which they hope to take all the way! They want feedback.
  • They have public project update notes. Graham is too busy now to read them. Maybe I should read them? It'd be a good way to build our knowledge and find ways to contribute.

  • Of interest, it has SAML support, which is partly for onboarding legacy apps. Allows converting from SAML assertions to OAuth tokens.

    • Some of this is hard to follow without knowing about SAML. It clearly has lots of features to support things far more complex/legacy than what we're doing.

Discussions at CAPI Office Hours

CAPI is the team name and Cloud Controller is the main project.

I wanted to talk about the job queue backing up from slow LastOperation, and that it doesn't seem to be documented. I discussed it in depth with Zach, a Pivotal person on the Core CAPI team.

We talked through the issue where the job queue was building up because of slow LastOperation, and he seemed to find it very enlightening. The CDN Broker is badly written, but it explained an issue we filed about logging job queue time, for which they had wanted a concrete use case (given that a slow job would be surrounded by fast ones).

Pivotal's prod environment has 4 workers, but that's a very recent increase. You have to scale them, but not all that quickly. I could ask IBM BlueMix about theirs.

The Pivotal people on the CAPI team have one graph of the Pivotal prod CC metrics: it shows the number of jobs processed and the number of jobs failed. I'm not clear if jobs processed was actually the queue length. Zach said they're slowly learning the key metrics to look at for each CF component, to diagnose what component is going wrong, and the job queue is a big one for CC.

We could put a timeout on calling LastOperation if there isn't one already. But deleting orgs can take a really long time too, and LastOperation might not be safely interruptible.

We think the team might decide to do nothing for now. But Zach found it fascinating to think about, and is going to relay it for discussion inside the team.

Really good discussion.

Routing Isolation Segments

https://www.youtube.com/watch?v=rlBc1VZe4nw&index=93&list=PLhuMOCWn4P9hsn9q-GRTa77gxavTOnHaa

  • Isolation Segments are so that particular workloads can be done on segregated/dedicated infrastructure. This can help with PCI accreditation or for particularly sensitive workloads.

    • Compute Isolation Segments can already be done with a custom org or space tagged with a particular isolation segment to put it on cells with that tag.
    • But with routing we want to not share network traffic. This can prevent low-traffic apps from being starved of network, protect against Host header spoofing, etc.
  • At present they're working on preventing Host header spoofing to access private isolation segments. The attack is sending a Host header naming a private domain and it getting through to the Gorouter:

    1. Send a "GET abc.cloudapps.digital" with a Host: secretbbcdomain.com header.
    2. This may get through the external load balancer because of the abc.cloudapps.digital request line.
    3. The internal gorouter will prioritise the Host header… potentially routing to the private isolation segment.
    4. I'm still missing details but thanks to one of the people speaking for teaching me this.
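The steps above can be made concrete by building the raw request: the connection (and any load-balancer checks) target abc.cloudapps.digital, but the HTTP Host header names a different, private domain. The domain names are the examples from my notes, not real routes.

```python
# Sketch of the spoofed request: the Host header disagrees with the
# domain the TCP connection was made to. Domains are illustrative.
def build_spoofed_request(path, host_header):
    """Build a raw HTTP/1.1 GET whose Host header differs from the target."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

request = build_spoofed_request("/", "secretbbcdomain.com")
print(request.splitlines()[1])  # Host: secretbbcdomain.com
# A router that keys its routing decision off the Host header alone would
# route this into the private isolation segment.
```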

Autosleeping idle applications

https://www.youtube.com/watch?v=4tRA6vTjIEE&index=90&list=PLhuMOCWn4P9hsn9q-GRTa77gxavTOnHaa https://github.com/cloudfoundry-community/autosleep

  • SAP-built service to stop apps that don't receive any traffic for a given length of time.

    • SAP use this to avoid apps left running on abandoned free trials. In use on their internal environment but not yet on their external.
    • Default is to stop apps that have been inactive for 24 hours.
  • Each autosleep service binds itself to all apps in its space.

    • It has a forced enrolment mode that prevents its deletion without providing a secret. Provided the app developer does not know the secret, they cannot prevent their apps from autosleeping. There's a REST API to make this process easier.
  • Potentially useful for us

    • Would reduce resources used by rarely-accessed prototypes.
    • Could enforce that PaaS trials aren't used for production.
    • We'd want to carefully think through any chilling effects this could have on our users.
  • May have problems configuring enforcement of autosleep on tenants

Loggregator SLAs and Project Updates

https://www.youtube.com/watch?v=93IdFq47c-w&index=95&list=PLhuMOCWn4P9hsn9q-GRTa77gxavTOnHaa

Project Updates

  • There's a new Loggregator V2 API
    • Based on gRPC
    • Able to supply filters (e.g., gauge metrics of X)
    • They're working to tag logs with org and space. It'll then be possible to filter logs by org and space, and to supply them to app developers as a "Developer Segmented Firehose."
    • V1 API will be removed in time.

Service Level Objectives when operating Loggregator

  • Operators should be defining SLOs for Loggregator
    • The objectives should be your targets for log stream reliability
    • Pivotal aim for 99.9% and hit it. BlueMix and others are more like 95-97%.
  • The reliability of your message delivery takes work
    • Suggestion: Use syslog drains for log storage.
    • Suggestion: Monitor the syslog drains (and the Firehose?)
  • Tools exist for monitoring your log stream reliability
    • Black box testing: emit known logs and see what proportion you receive
      • cf-logmon is a black-box approach for measuring log stream reliability
      • Logspinner can produce logs at known rates
      • Loggregator-ci is the Loggregator team's approach for monitoring multiple environments
    • White box monitoring
      • Monitor resources free on Loggregator VMs
      • Pivotal's presentation showed plotting dropped % of messages directly but I'm not clear how
  • They suggest monitoring Doppler utilisation based on envelopes-per-second.
    • They see messages dropped from roughly 10,000 envelopes/second/doppler, thus scale their Dopplers when they reach this rate.
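The black-box approach above can be sketched in a few lines: emit a batch of uniquely-marked log lines, count how many come back out of the drain, and compare the delivery rate against the SLO. Everything here is simulated; a real check (like cf-logmon) would write to stdout and read from a syslog drain or the Firehose.

```python
# Simulated black-box log reliability check: compare uniquely-marked
# emitted lines against what was received, per measurement window.
import uuid

def measure_reliability(emitted, received, slo=0.999):
    """Return (delivery_rate, met_slo) for one measurement window."""
    delivered = len(set(emitted) & set(received))
    rate = delivered / len(emitted)
    return rate, rate >= slo

# Simulate a window where 2 of 1000 marked lines were dropped:
markers = [str(uuid.uuid4()) for _ in range(1000)]
rate, ok = measure_reliability(markers, markers[:-2])
print(rate, ok)  # 0.998 False
```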

They explained their process for working on this:

  1. Set up black-box monitoring with enough precision to monitor your target reliability
  2. Achieve 99% for 24 hours
  3. Achieve 99% for 1 week
  4. Add a 9 to the 24-hour target
  5. Achieve 99% for 30 days
  6. Add a 9 to the 1-week target

Pivotal have 99% over 6 months but have not managed to hit 99.9% over 30 days.
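The "add a 9" ratchet above amounts to shrinking the allowed error budget tenfold each time a target is held for its window:

```python
# "Adding a 9" to a reliability target = dividing the error budget by 10.
def add_a_nine(target):
    """0.99 -> 0.999 -> 0.9999 -> ..."""
    return 1 - (1 - target) / 10

target = 0.99
target = add_a_nine(target)
print(round(target, 6))  # 0.999
```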

For more details, there's a Loggregator Operator Guide in the repo.
