Good, let's get started. What happens when you load a URL? This is what we're going to do: We're going to trace a request to a Web site as it happens. But first, we need to cover some basics.
A client is a computer which makes requests.
A server is a computer which responds to requests.
The internet is a global network of physically connected servers and clients.
The World Wide Web is a subset of the internet concerned with hypertext (links) between resources (URLs).
We got it? Good. Time to really get down to business.
We're going to cover four big buckets: Networking, DNS, routing and HTTP. The first one is networking.
Networking relies on the 7 OSI layers for communication. These are the fundamental building blocks that underlie any interaction between your computing devices and other computing devices. It's what separates us from the animals, really.
The problem with the OSI is that it's kind of complicated.
And, frankly, hellish.
So be prepared to abandon hope. It's not really that bad, and to preserve your presumably fragile mental state, I've interspersed images of cute kittens throughout this section of the talk.
These are the 7 OSI layers: Physical, Data link, Network, Transport, Session, Presentation and Application. Every bit of communication travels up this stack from the lowest level to the highest -- and there are means for communicating between the levels.
First, the physical layer. The physical layer translates communication requests into hardware-specific operations -- you know, radios or cables or whatnot.
At this level, you have things like Ethernet, Bluetooth, USB and 802.11 (WiFi) among others. Communication appears as bits (1s and 0s) and isn't guaranteed to be reliable -- there can be interference and other stuff.
All of that unreliability gets smoothed out at the next level up, the data link layer. This is a theme throughout the OSI layers; any rough spots at a lower layer are usually resolved the next layer up. The data link layer is responsible for access control, flow control and error checking.
There are not a lot of fun technologies working at this layer: packet-switching, DOCSIS, quality-of-service and VLANs. But the good news is that we've switched from bits to frames -- and that makes everything more reliable.
We lose some of that reliability at the next level: Level 3, the network layer. This layer is responsible for packet forwarding including routing.
Here we get IPv4 and IPv6, which are the internet protocol; IPsec, which is VPNs; and ICMP, which is how two tools we'll use in a bit, traceroute and ping, work.
The good news: At this layer, we start talking about packets. Packets are awesome.
You might say that a packet is the atomic unit of the internet. There are other data formats (frames, bits, etc), but the internet protocol is all about packets.
A packet is a layer cake of messages, including headers that identify the source and the destination and the data we're sending. There's a bunch of other stuff too and if I had any spare time we'd talk about it. I just realized this layer had no kittens. Damn.
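To make the layer cake a bit more concrete, here is a minimal Python sketch that packs a 20-byte IPv4 header in front of some data and then reads the source and destination back out. The function names are made up for illustration, the addresses come from the ranges reserved for documentation, and the checksum is left at zero, so treat it as a picture of the layout rather than a packet you could actually send.

```python
import socket
import struct

# Field layout of a basic IPv4 header (no options): 20 bytes, network byte order.
IPV4_HEADER_FORMAT = "!BBHHHBBH4s4s"


def build_ipv4_header(src: str, dst: str, payload_len: int) -> bytes:
    """Pack a minimal IPv4 header. The checksum is left at 0 for simplicity."""
    version_and_length = (4 << 4) | 5  # version 4, header is 5 x 32-bit words
    return struct.pack(
        IPV4_HEADER_FORMAT,
        version_and_length,
        0,                      # type of service
        20 + payload_len,       # total length: header plus data
        0,                      # identification
        0,                      # flags and fragment offset
        64,                     # time to live
        socket.IPPROTO_TCP,     # protocol carried inside (TCP is 6)
        0,                      # header checksum (not computed in this sketch)
        socket.inet_aton(src),  # source address as 4 bytes
        socket.inet_aton(dst),  # destination address as 4 bytes
    )


def parse_ipv4_header(packet: bytes) -> dict:
    """Read the interesting fields back out of the first 20 bytes."""
    fields = struct.unpack(IPV4_HEADER_FORMAT, packet[:20])
    return {
        "version": fields[0] >> 4,
        "total_length": fields[2],
        "ttl": fields[5],
        "protocol": fields[6],
        "source": socket.inet_ntoa(fields[8]),
        "destination": socket.inet_ntoa(fields[9]),
    }


payload = b"hello"
packet = build_ipv4_header("192.0.2.1", "198.51.100.7", len(payload)) + payload
print(parse_ipv4_header(packet))
```

The header comes first and the data rides behind it, which is the "layer cake": each layer wraps what the layer above handed down.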
NEXT LAYER: We're moving on to Layer 4, the transport layer. This level is concerned with the orderly and reliable delivery of packets between points on a network.
This is where we get TCP and UDP, which are the primary "languages" that computers on the internet speak to each other. You've heard of TCP/IP? Of course not. If you had, you'd know it was about the TRANSMISSION CONTROL PROTOCOL over the INTERNET PROTOCOL.
Moving on to layer 5, the session layer. This layer is concerned with opening, closing and managing sessions, which are a semi-permanent dialogue between applications.
At this layer: H.245 (video calling); NetBIOS, which isn't cool anymore but was for a while; PPTP (a VPN service); the Remote Procedure Call service, or where you got lots of Windows viruses; and SOCKS, which is a browser proxy service -- think Tor.
MOVING ON to Layer 6, the presentation layer, which is concerned with mostly encryption and decryption.
Services that reside on layer 6 include Telnet, AFP, ICA (Citrix). This layer is super-boring.
Finally, the final layer of the OSI stack: Layer 7! This is known as the application layer and is the layer at which we're translating lower-level data formations to formats which are human recognizable. Basically, strings of text.
The only reason we're even talking about the OSI levels is because I want you to know about these important things: HTTP, DNS. What is DNS you ask?
Well, now that you fully understand how networking works, I feel like we can move on to the next big bucket: DNS.
DNS, or the Domain Name System, is the internet's phone book.
DNS is concerned with translating domain names
Into IP addresses, e.g., NPR.ORG into 220.127.116.11. The way we accomplish this feat is with a series of records that fall into several categories.
I am mostly interested in A records, which map a domain or a subdomain to an IP address. There are other kinds of records, some that handle mail or certificates or canonical names. Those make the internet work but are functionally irrelevant to our narrative. FOCUS FOCUS FOCUS
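Here is a toy version of that phone book in Python. Everything in it is a placeholder: example.org and the 192.0.2.x address are reserved for documentation, and `RECORDS` and `resolve_a` are names invented for this sketch. Real DNS is a distributed system and not a dictionary, but the lookup idea is the same.

```python
# Hypothetical zone data keyed by (name, record type). Not real records.
RECORDS = {
    ("example.org", "A"): ["192.0.2.10"],
    ("www.example.org", "CNAME"): ["example.org"],
    ("example.org", "MX"): ["mail.example.org"],
}


def resolve_a(name, records=RECORDS, max_depth=5):
    """Return the A records for a name, following canonical names (CNAMEs)."""
    name = name.lower().rstrip(".")  # domain names are case-insensitive
    for _ in range(max_depth):
        if (name, "A") in records:
            return records[(name, "A")]
        alias = records.get((name, "CNAME"))
        if not alias:
            return []  # no A record and no alias: nothing to return
        name = alias[0]  # try again with the canonical name
    return []


print(resolve_a("WWW.EXAMPLE.ORG"))
```

The MX entry is there only to show that other record types live side by side with A records; the lookup ignores it.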
So, if there's a record that matches NPR.ORG to 18.104.22.168 -- who holds the record?
This computer holds a record like this. This computer is the K root domain name server, aka, k.root-servers.net. If there's a K, what other letters are there?
Good question. There are A-M root name servers. These are the ultimate arbiters of DNS records -- the justices of the supreme court of the internet. If your record isn't on these servers, nobody can find you. It's like you don't exist.
Now, the funny thing about root name servers is that they're physical objects -- they exist in meatspace! Many of these are associated with universities or the military -- the D root name server is right down the street at the University of Maryland!
Now, if you were to use the UNIX tool dig to check out the DNS records for NPR.org, you'd see
That NPR.org has an A record pointing to 22.214.171.124.
But wait, I'm not hitting a root name server. This IP address is a private address -- I'm going to talk about that later -- but why is that happening?
Because there's a shit-ton of caching happening. First, there's the browser cache.
Then there's at least an OS cache, a router cache, a local DNS server on your network, your ISP's DNS server, a peering ISP's DNS server, and only then will you ever get to a root name server. And in real life, it's probably way more layers than that!
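That chain of caches can be sketched in a few lines of Python. This is a simplification under loud assumptions: the cache labels are just strings, the `lookup` function is invented for this example, and real caches expire entries after a time-to-live, which this sketch leaves out entirely.

```python
def lookup(name, caches, authoritative):
    """Check each cache, nearest first; only on a total miss ask the source.

    `caches` is a list of (label, dict) pairs. After a miss everywhere, the
    answer is copied into every cache so the next lookup stops early.
    """
    for label, cache in caches:
        if name in cache:
            return cache[name], label
    answer = authoritative[name]
    for _, cache in caches:
        cache[name] = answer
    return answer, "authoritative"


# Placeholder data: a documentation domain and documentation addresses.
authoritative = {"example.org": "192.0.2.10"}
caches = [("browser", {}), ("os", {}), ("router", {}), ("isp", {})]

first = lookup("example.org", caches, authoritative)
second = lookup("example.org", caches, authoritative)

# Change the record at the source; the caches still hold the old answer.
authoritative["example.org"] = "192.0.2.99"
third = lookup("example.org", caches, authoritative)

print(first, second, third)
```

The third lookup still returns the old address from the browser cache, because nothing in this sketch ever throws a cached answer away.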
This is why it takes so damn long to make a change to your domain. It's amazing it even works at all, right? People should all be much, much nicer to their IT staffs, is all I'm saying.
Now I want to know who registered the domain NPR.ORG, which I can do with the UNIX whois command.
This tells me that the NPR.ORG domain was registered in 1993 and will expire in 2018 -- keep an eye out! It also lists a human being as a contact -- Lilly Ladd, who is very nice. So, wait, what does it mean to register a domain name?
If you want a .com or a .org domain, you purchase the rights to it from a domain registrar.
They permit you to modify the DNS records for that domain, perhaps creating an A record that points to your server's IP address.
This is what it means to own a domain name -- you have the right to point it wherever you want.
So, how do people get from their computer to your computer if they know the IP address?
GREAT QUESTION, FAKE PERSON WHO LIVES IN MY HEAD Now we're going to talk about the third of four big buckets -- ROUTING.
If DNS is concerned with mapping domain names to IP addresses, what is an IP address?
An IP address is a string of numbers. And there are a finite number of them!
This concerns me because basically every device connected to the internet needs a public IP address. Including Matt Waite's drones.
There are two formats for IP addresses: The one you know is called IP version 4 or IPv4 for short. There are 4.29 billion IPv4 addresses. That's a lot! There are slightly more IPv6 addresses ...
This many. I prefer to say Three hundred and forty undecillion and change.
How many is that? We can assign an IPv6 address to every atom on the surface of the earth and still have enough addresses for another 100-ish earths. But this is irrelevant because nobody uses IPv6 in any meaningful way.
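The arithmetic behind those two numbers is short enough to check in Python. IPv4 addresses are 32 bits long and IPv6 addresses are 128 bits long, so the totals are just powers of two; the variable names are mine.

```python
ipv4_total = 2 ** 32   # IPv4 addresses are 32 bits long
ipv6_total = 2 ** 128  # IPv6 addresses are 128 bits long

print(f"{ipv4_total:,}")  # 4,294,967,296
print(f"{ipv6_total:,}")  # 340,282,366,920,938,463,463,374,607,431,768,211,456

# How many entire IPv4 internets fit inside the IPv6 address space?
print(ipv6_total // ipv4_total == 2 ** 96)  # True
```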
Okay, so these IP addresses that we care about are IPv4 and they're referred to as "dotted quads." Why? Because there are four groups of up to three digits, separated by dots, and each group is a number from 0 to 255.
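Under the hood, a dotted quad is just a friendly way of writing one 32-bit number, with each group standing for one byte. A small Python sketch of the conversion follows; `quad_to_int` is a name made up for this example, and the standard library's `ipaddress` module is used only to cross-check it.

```python
import ipaddress


def quad_to_int(address: str) -> int:
    """Fold the four 0-255 groups of a dotted quad into one 32-bit integer."""
    value = 0
    for part in address.split("."):
        value = (value << 8) | int(part)  # shift left one byte, add the next group
    return value


print(quad_to_int("192.168.0.1"))  # 3232235521
print(quad_to_int("192.168.0.1") == int(ipaddress.IPv4Address("192.168.0.1")))  # True
```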
Those numbers span 0.0.0.0 through 255.255.255.254 ...
with some certain exceptions.
One of these, 255.255.255.255 is a "broadcast" address that everyone can listen in on. And there's also the matter of these "private" addresses which can't be routed on the internet.
These private addresses span these ranges -- 192.168.x.x, 172.16.x.x through 172.31.x.x and 10.x.x.x.
These addresses are for LOCAL AREA NETWORKS or lans, like the wifi network we have at the hotel. These are tiny little mini internets! And we can all re-use these same IP addresses if we're not connecting to the actual internet.
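If you want to check whether an address falls in one of those private ranges, Python's standard `ipaddress` module can do the comparison. The three blocks below are the private ranges written in CIDR notation; the function name is my own.

```python
import ipaddress

# The three private IPv4 blocks, in CIDR notation.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),      # 10.x.x.x
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.x.x through 172.31.x.x
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.x.x
]


def is_private_lan_address(address: str) -> bool:
    """True if the address sits inside one of the private blocks above."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in PRIVATE_BLOCKS)


for candidate in ["192.168.1.20", "172.31.255.255", "172.32.0.1", "8.8.8.8"]:
    print(candidate, is_private_lan_address(candidate))
```

Note where the middle block ends: 172.31.255.255 is private, 172.32.0.1 is not.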
The actual internet is an example of a wide-area network or a WAN. So, if there are 4.29 billion IP addresses, why are we running out?
For example, Xerox was given 16.8 million IPv4 addresses in 1994. They're using ALL of those, I'm sure. Any address starting with 13.x.x.x.
Addresses starting with 17 were given to Apple in 1992. That's another 16.8 million.
Ford Motor Company: All addresses starting with 19.x.x.x, there's another 16.8 million.
Prudential: All addresses starting with 48.x.x.x. That's ANOTHER 16.8 million. Get the picture? Of course not. Let's keep going.
The DOD has 10 blocks of 16.8 million addresses each. Nobody else can use these or buy them or anything. They're just ... gone. We're all fighting for what's left.
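Where does 16.8 million come from? Fixing the first group of a dotted quad leaves 24 bits free, and 2 to the 24th is about 16.8 million. A quick Python check, counting only the blocks named in this talk (Xerox, Apple, Ford, Prudential and the DOD's ten); the variable names are invented for the sketch.

```python
import ipaddress

# Every address starting with 13 is the block 13.0.0.0/8.
block = ipaddress.ip_network("13.0.0.0/8")
per_block = block.num_addresses
print(f"{per_block:,}")  # 16,777,216

# Blocks mentioned in the talk: four companies plus ten for the DOD.
blocks_named = 4 + 10
held = blocks_named * per_block
share = held / 2 ** 32

print(f"{held:,}")     # 234,881,024
print(f"{share:.1%}")  # 5.5%
```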
And there are none left. They're all assigned to someone already. Who assigned them?
The IANA, the Internet Assigned Numbers Authority, is a department of an American non-profit company named ICANN, which runs this whole thing. IP addresses; domains; the time zone database; even the protocols that exist like FTP and HTTP. IP address blocks are delegated to regional internet registries for each of the major continents (GEOGRAPHY AGAIN INTERNET PEOPLE THIS IS A THEME) and the blocks are divvied up according to magic. Or whatever.
Okay, so, still: How does my computer know how to get to your computer? Because of routes. Routes are like a global bucket brigade; you're sending your request to an ISP who peers with another until you find your way to the destination.
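The bucket brigade can be sketched as a chain of "who do I hand this to next?" lookups. Every name below is hypothetical, and the model is deliberately crude: real routers choose a next hop based on the destination address, while this sketch gives each stop exactly one neighbor to pass things to.

```python
# Hypothetical next-hop table: each stop only knows the neighbor one step closer.
NEXT_HOP = {
    "my-laptop": "my-isp",
    "my-isp": "backbone-west",
    "backbone-west": "backbone-east",
    "backbone-east": "their-isp",
    "their-isp": "their-server",
}


def trace(source, destination, next_hop=NEXT_HOP, max_hops=30):
    """Follow next hops from source to destination and return the full path."""
    path = [source]
    while path[-1] != destination:
        current = path[-1]
        if len(path) > max_hops or current not in next_hop:
            raise LookupError(f"no route from {source} to {destination}")
        path.append(next_hop[current])  # hand the bucket to the next neighbor
    return path


route = trace("my-laptop", "their-server")
print(" -> ".join(route))
print(len(route) - 1, "hops")
```

No single stop knows the whole path; each one only knows where to pass the bucket.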
Now here's the fun part: The internet is super-duper physical. There's like wires and tubes and things. Ted Stevens was right, fools!
If I were to trace the route between a computer I rent from Amazon in Oregon and NPR's servers, what do you think would happen?
It would take 21 hops between computers for me to get to NPR! Which turns out to be hosted in Northern Virginia, of all places.
The first 11 hops are within Amazon's data center in Portland, Oregon. That's ludicrous.
These three are interesting.
We go from Portland to Seattle and then Seattle to San Jose. We're making progress!
We don't just cover ground; we also trade across three networks; from Amazon to Telia to Equinix. These companies agree to trade data with each other -- an arrangement called peering.
You might have heard about Netflix and Comcast agreeing to a peering relationship this last week -- this is what they promised to do -- to share traffic directly instead of through another provider.
Back to the list! We're going to do some more geographic travel.
We bounce around between San Jose and San Francisco.
Finally, between hops 16 and 17, something amazing happens.
We go from sanfransisco.savvis.net to nor.savvis.net.
FREAKING TUBES, MAN. TUBES.
This map is of Savvis, AKA CenturyLink.
They are part of the internet backbone.
They handle about 20% of the traffic to the whole internet.
They have TWO giant 10-gigabit pipes between San Jose and NYC.
Check out this map again. There are some other concentrations of pipes.
How about 15 between NYC and Chicago? Ever heard of high-frequency trading? It's a way to make a lot of money based on minute changes in stock values over very short periods of time. The faster your connection, the more money you can make. Traders in Chicago are making money on the difference between the time it takes data to travel DIRECTLY to Chicago and the time it takes data to travel to Chicago through other means. ASSHOLES.
Back to the story.
Next two hops?
Still Savvis, but from NYC to Northern Virginia.
Then the last few hops?
All in Northern Virginia.
We travel less than a mile.
Here's the map again.
Summary: 2 hops get us 3200 miles; 19 others get us just 1030 miles.
WE ARE MOVING ALONG. Last bucket! HTTP. You know how networking works. You know how DNS maps names to IP addresses. And you know how your computer routes to other computers via IP addresses. What's left?
FREAKING THE WHOLE INTERNET IS STILL LEFT. The Hypertext Transfer Protocol is our focus for the last few seconds of this talk.
This fetching fellow is Sir
Timothy Berners-Lee, OM, KBE, FRS, FREng, FRSA, DFBCS. That second one? Knight Commander of the Order of the British Empire. What's the first one then? NOT TELLING -- LOOK IT UP ON THE INTERNET, LOSERS.
He's known as TimBL.
And he is a stone-cold, elemental bad-ass of the internet, much like Jacob Harris or Jeff Larson.
Or, Andy Boyle.
Tim brought hypertext to the internet: those silly links that make the whole Web work.
Tim invented the world wide web. Seriously. That WWW? He made that happen.
Tim wrote the first web server, called HTTPd for the HTTP daemon. You can still download and run it.
Tim invented the http:// notation you type every time you write a web address.
In 2009, incidentally, he decided that maybe the double slash was a mistake -- merely a good idea at the time.
Tim's big idea is that he connected existing bits -- TCP/IP and DNS with hypertext -- into something he invented: the Web, that clickable thing we all enjoy.
BACK TO THE SCRIPT, BOWERS. HTTP is a request-response protocol.
Let's break that down -- browsers make requests.
A request is just a bunch of formatted text, with two basic parts -- headers and a body.
Here is what request headers look like. Notice that the top line starts with GET and ends with HTTP/1.1. What does GET mean?
It's an HTTP verb! The two I'm going to bother explaining to you are GET and POST.
GET fetches data from a URL.
POST sends data to a URL.
Loading pages? That's a GET.
Submitting forms? That's a POST.
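Since a request is just formatted text, you can build one by hand. The Python sketch below assembles a bare-bones GET and a bare-bones POST; `build_request` is a helper invented for this example, example.org is a placeholder host, and a real browser sends many more headers than this.

```python
def build_request(method, path, host, body=""):
    """Assemble the text of a minimal HTTP/1.1 request: headers, blank line, body."""
    lines = [
        f"{method} {path} HTTP/1.1",  # request line: verb, path, protocol version
        f"Host: {host}",              # which site we want on that server
    ]
    if body:
        # Form submissions carry their data in the body, so describe it.
        lines.append("Content-Type: application/x-www-form-urlencoded")
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # Header lines end with CRLF; an empty line separates headers from body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


get_request = build_request("GET", "/", "example.org")
post_request = build_request("POST", "/signup", "example.org", body="name=bowers")

print(get_request)
print(post_request)
```

The GET has no body at all; the POST carries the form data after the blank line.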
Then we got all complicated with our AJAX and our client-side shenanigans.
You use GMAIL?
You might make GET, POST, PUT, DELETE and HEAD requests, and that's just in the first minute!
Servers, our friends who respond to requests, send ... RESPONSES.
Here is a response -- notice that the top says HTTP/1.1 200 OK.
What is that 200? A status code!
An enterprising group on the internet decided rappers are a good way to explain HTTP status codes. I will broadly crib their work here.
2xx statuses mean success. 200 means GOOD TO GO OKAY.
3xx statuses mean redirection. 301 means moved permanently from one location to another.
4xx statuses mean that there is a problem with the request. 400 means bad request.
404, which is coincidentally my apartment number, means NOT found.
5xx means a server error, like the Wu-Tang Clan.
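The first digit of a status code tells you which family it belongs to, which makes for a tiny Python function. The function name is mine; the 1xx family ("informational") isn't covered in this talk but is included so the table is complete.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Map a status code to its family using the hundreds digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")


for code in (200, 301, 400, 404, 500):
    print(code, status_class(code))
```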
But where does a Web page come from? From a web server! A Web server sends response headers and a response body. The headers contain that status code and some other information about the response.
Remember these? Sure you do. It was like 10 seconds ago.
The response body contains the HTML your browser wants!
Look! It's a string!
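Because the response really is a string, you can pull it apart with ordinary string operations. A minimal Python sketch, assuming a well-formed response with a status line, some headers, a blank line and a body; the sample response and the `parse_response` name are made up, and a real parser has to cope with far messier input.

```python
def parse_response(raw: str):
    """Split a raw HTTP response into status code, reason, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")   # blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)  # e.g. HTTP/1.1 200 OK
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names ignore case
    return int(code), reason, headers, body


# A made-up response, shaped like the one on the slide.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 15\r\n"
    "\r\n"
    "<html>hi</html>"
)

code, reason, headers, body = parse_response(raw)
print(code, reason)
print(headers)
print(body)
```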
TimBL also invented the Web browser. So he's got that going for him.
WRAPPING UP: We covered Networking, DNS, Routing and HTTP.
THINGS I DIDN'T TALK ABOUT: I'm not talking about SSL or encryption.