Skip to content

Instantly share code, notes, and snippets.

@mamacneil
Created October 17, 2018 11:25
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mamacneil/84bba366ab6f06ee86d592f02245a6b4 to your computer and use it in GitHub Desktop.
Save mamacneil/84bba366ab6f06ee86d592f02245a6b4 to your computer and use it in GitHub Desktop.
Lecture 7 - string handling
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Lecture 7 - String manipulation"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"A surprising amount of what people do with computers involves text - searching for and manipulating strings within a programming language. In biology, the area with the major lock on text manipulation is bioinformatics. As the name implies, bioinformatics deals with biological information - especially analysis of DNA, RNA and protein sequences. However, the challenges and scientific opportunities for analyzing the information are incredible. In its simplest form, we can represent DNA/RNA and protein as text -- either nucleic or amino acids. Each base or amino acid is represented as a single letter (e.g. A/C/G/T for DNA). Stored in the sequence of nucleic and amino acids are all the instructions to create life. So strings are important!!\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Among the simplest but most cricial attributes of strings is determining its length:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"protein = \"vlspadktnv\"\n",
"length(protein)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait - that wasn't quite what we wanted. R, as it turns out, has really awkward commands for using strings. To get the length of a string object, we need to use the special function `nchar()`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nchar(protein)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How about taking a slice of the string, should be simple through indexing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"protein[1:4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Crap - this is harder than it should be. To grab a section of a string by its position of each letter, R has a `substr()` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"substr(protein,1,3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"R also has the confusingly named `substring()` which is some legacy code from S to be compatible. \n",
"\n",
"Another common action on a string is to split on a specific element, using the `strsplit()` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"strsplit(protein,'a')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice here that the 'a' has disappeared. If we want to keep that 'a', we need to take a slice at the 'a' position. So how do we find the 'a' position? "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gregexpr('a', protein)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reasons known only to the original R programmers, this returns a list object, with a number at the begnining, followed by the position we're looking for. So to use this to get that index number, we need to index the list:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gregexpr('a', protein)[[1]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 1\n",
"---\n",
"\n",
"Using the baseball pitcher data, use a loop to store the first names of all the MLB pitchers, then provide a summary table of their counts **in order**.\n",
"\n",
"*Hint: be sure to check the class of the column containing the names and recall your summary for team saves in a previous lecture*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Regular expressions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Regular expressions allow you to define a class of words/character sequences that have some parts the look the same. They are really useful for sifting through and subsetting large textual datasets (like genetic sequences). Using them can impart superhero-like qualities:\n",
"\n",
"<img src=\"regular_expressions.png\" alt=\"xkcd\" style=\"width: 500px;\"/>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Special characters for string matching \n",
" \n",
"**\\^** matches the beginning of the string \n",
"**\\$** matches the end of a string \n",
"**[]** match any characters inside the square brackets \n",
"**[^]** match any characters *except* those inside the brackets \n",
"**{n}** match preceding character n times \n",
"**\\{n,m\\}** match preceding character between n and m times \n",
"**\\n** new line \n",
"**\\t** tab \n",
"**\\.** matches any single character \n",
"**\\*** match preceding character/number 0 or more times \n",
"**\\+** match preceding character/number at least once \n",
"**\\?** match preceding character/number exactly once \n",
"**\\** suppresses special meanings. Use this if you want to search for one of the special characters in your string\n",
"\n",
"...and there are lots more here \n",
"Look at pg2 of the 'Strings' Cheatsheet for R\n",
"https://www.rstudio.com/resources/cheatsheets/ \n",
"Use of metacharacters for pattern matching\n",
"http://stat545.com/block022_regular-expression.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Things you can do with regular expressions...\n",
"#### 1) Find values in strings/vectors that match your desired pattern or sequence \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#find all the players whose first name is Jim\n",
"grep(\"^Jim\", mlb_pitching$Name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`grep()` returns the position of each instance of the search string, so we can also use them inside an indexing statement to find other values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Grep indexing the Jims\n",
"mlb_pitching$Name[grep(\"^Jim\", mlb_pitching$Name)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Slightly more powerfully than just the Jims is to figure out quantitites, for example what proportion of players are in their 30's?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Proportion of 30-year olds\n",
"length(grep(\"3.\", mlb_pitching$Age))/length(mlb_pitching$Age)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Only a third, but better certainly than those over 40:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Proportion of 40-year olds\n",
"length(grep(\"4.\", mlb_pitching$Age))/length(mlb_pitching$Age)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A paltry 1.4%...who are they?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 2\n",
"---\n",
"\n",
"Figure out who the MLB pitchers AND batters are over 40."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another thing you can do is\n",
"\n",
"### 2) Return logical vectors that match your pattern \n",
"\n",
"\n",
"The `grepl()` function will return TRUE if conditions are satisfied, and FALSE if not. Here you have to use the standard escape character `\\` to stop the function of `*` as a special character:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Which players have a * next to their name?\n",
"grepl(\"\\\\*\", mlb_pitching$Name)[1:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this, you can then do typical boolean indexing. In addition you can\n",
"\n",
"### 3) Find and replace values that match your pattern\n",
"\n",
"If you or your data provider has a spelling problem, you can correct them on the fly with the `gsub()` find and replace function. For example, we can change the name of all the players called 'Tyler' to 'Superman':"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Find and replace\n",
"mlb_pitching$Name <- gsub(\"Tyler\", \"Superman\", mlb_pitching$Name)\n",
"\n",
"# Search for all players with names that start with Superman\n",
"mlb_pitching$Name[grep(\"^Superman\", mlb_pitching$Name)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or an even more useful thing to do is filter out stuff you dont' want. For example, those annoying player codes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Get rid of all the text after the back-slashes\n",
"mlb_pitching$Name <- gsub(\"\\\\\\\\.{1,20}\", \"\", mlb_pitching$Name)\n",
"mlb_pitching$Name[1:10]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"But wait, why are there three Al Alburquerques?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 3\n",
"---\n",
"\n",
"Figure out why there are three Al Alburquerques in the MLB pitching data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Concatenation\n",
"\n",
"Among the most common tasks in working with strings is to concatenate (stick together) two strings. In R, this is done using the somwhat obscure `paste()` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Add attributes to player name\n",
"paste(mlb_pitching$Name[1], ' is a swell guy.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 4\n",
"---\n",
"\n",
"Create an array of baseball player position names in order of their positional number (you might need to look this up, depending on your leisure activities). Then use this to concatenate the names and positions of the first 10 players in the MLB batting data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Web scraping\n",
"\n",
"Among the more functional and powerful things R can do is to pull down information from the web and process it for use. The librarary to do this is `rvest` created (again) by [Hadley Wickham](https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/) (where does he find the time...but praises be that he does!). This is a deep topic that requires some insight into `html`, a tag-driven programming language that powers most of the web. \n",
"\n",
"Because this is likely not on your computer yet, we'll try to install it:\n",
"\n",
"**Long pause**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(rvest)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Let's start by picking a page at random: https://www.baseball-reference.com/leagues/MLB/2017-standard-batting.shtml\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Grab webpage\n",
"mlb_batting2 = read_html(\"https://www.baseball-reference.com/leagues/MLB/2017-standard-batting.shtml\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"mlbtext = mlb_batting2 %>% html_nodes('tr') %>% html_text()\n",
"mlbtext"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 5\n",
"---\n",
"\n",
"Using positional character indexing, turn mlbtext into a `data.frame()`. Hint: you can tediously count the positions matched to the numbers on the website>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What have you learned and what's next?\n",
"\n",
"The point of today's lab was to learn about how to handle basic text and introduce webscraping\n",
"\n",
"**You should at this point be comfortable:**\n",
" 1. Splitting strings based on a deliminator\n",
" 2. Splitting strings by position\n",
" 3. Regular expressions\n",
" 4. Initial webscraping\n",
"\n",
"Next week we will delve into the taggy world of html...\n",
"\n",
"\n",
"---\n",
"# ** A bientôt ** !"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment