@OddExtension5
Last active January 27, 2020 05:34
Data Science For Engineers

Set the working directory

Enter in the console: setwd("directorypath")

Executing an R file

  • Press Run/Ctrl+Enter

  • Press Source/Ctrl+Shift+S OR Press Ctrl+Shift+Enter (Source with Echo)

  • Run can be used to execute selected lines

  • Source/ Source with echo is for a whole file

Add Comments

  • For single line comment, insert '#' at the start of the line
  • Multiple line comments can be added in two ways:
    • Select multiple lines using cursor, then press Ctrl + Shift + C
    • Select multiple lines using cursor, click on "Code" in menu and select "Comment/Uncomment Lines"

Clear the console

Ctrl + L

Clear the environment

  • Single Variable: Enter in console/R Script: rm(variable)
  • All variables: Enter in console/R Script: rm(list=ls())

Saving data from workspace

Workspace Data

  • Workspace information is temporary
  • Is not retained after the session
    • If you close the R-session
    • If you restart the computer

Manual Saving

  • Can be permanently saved in a file - save command
  • Can be reloaded for future sessions - load command
save(a, file="sess1.Rdata")  # to save a single variable 'a'

# to save a full workspace with specified file name
save(list=ls(all.names=TRUE), file="sess1.Rdata")

save.image() # shortcut function to save whole workspace

load(file="sess1.Rdata") # to load saved workspace
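
The save/load round trip can be sketched as follows. The temporary file path is an assumption made here so that nothing is written to the working directory; any writable path would do.

```r
# Round trip: save a variable, remove it, then load it back.
a <- c(10, 20, 30)
path <- tempfile(fileext = ".RData")  # hypothetical location for the saved file

save(a, file = path)    # write 'a' to disk
rm(a)                   # 'a' is now gone from the workspace
restored <- load(path)  # load() returns the names of the objects it restored

print(restored)
print(a)
```

Note that load() recreates objects under their original names, so it silently overwrites any existing variable with the same name.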

Variables and datatypes in R

Variables

  • Rules:
    • Allowed characters are alphanumeric characters, _ and .
    • A name must start with a letter, or with a . that is not followed by a digit
    • No other special characters such as !, @, #, $
    • Examples: b2 = 7, Manoj_GDPL = "Scientist", Manoj.GDPL = "Scientist"
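
One way to check a candidate name against these rules is base R's make.names(), which rewrites a string into a syntactically valid name. A string that comes back unchanged was already valid. The candidate names below are made up for illustration.

```r
# make.names() converts arbitrary strings into syntactically valid names;
# a string that is returned unchanged is already a valid variable name.
candidates <- c("b2", "Manoj_GDPL", "Manoj.GDPL", "2b", "a@b", ".x", ".2x")
is_valid <- candidates == make.names(candidates)

print(data.frame(candidates, is_valid))
```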

Predefined Constants

  • pi, letters, LETTERS, month.name, month.abb

Basic Data Types

  • Logical, Integer, Numeric, Complex, Character
  • Find datatype of object typeof(object)
  • Verify if object is of a certain datatype is.data_type(object)
  • Coerce or convert data type of object to another as.data_type(object)
  • Note: Not all coercions are possible and if attempted will return "NA" as output
typeof(1) # double
typeof(("22-01-2001")) # character

is.character("21-11-2001") # TRUE
is.character(as.Date("21-11-2001", format = "%d-%m-%Y")) # FALSE

as.complex(2) # 2+0i
as.numeric("a") # NA
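
A short sketch of the typeof / is. / as. pattern on a few more values, including a coercion that fails. The variable names are illustrative only.

```r
x <- 5L                      # the L suffix creates an integer
print(typeof(x))             # "integer"
print(is.numeric(x))         # TRUE: integers also count as numeric

y <- as.character(x)         # integer to character
print(y)                     # "5"

w <- as.integer(3.9)         # coercion to integer truncates toward zero
print(w)                     # 3

# An impossible coercion yields NA and raises a warning (suppressed here).
z <- suppressWarnings(as.integer("abc"))
print(z)                     # NA
```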

Basic Objects

  • Vector : Ordered collection of same data types
  • List: Ordered collection of objects
  • Data Frame: Generic tabular object

Vectors

  • An ordered collection of basic data types of a given length
  • All the elements of a vector must be of same data type
X = c(1,2,3,4)
print(X)
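
Because every element must share one data type, c() silently coerces mixed inputs to the most general type present. A minimal sketch:

```r
# Mixing types: everything is coerced to character.
mixed <- c(1, "a", TRUE)
print(typeof(mixed))   # "character"
print(mixed)           # "1" "a" "TRUE"

# Mixing numbers and logicals: logicals become 1 and 0.
nums <- c(1, TRUE, FALSE)
print(typeof(nums))    # "double"
print(nums)            # 1 1 0
```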

Lists

  • A generic object consisting of an ordered collection of objects
  • A list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function and so on
ID = c(1,2,3,4)
emp.name = c("Man", "Rag", "Sha", "Din")
num.emp = 4
emp.list = list(ID,emp.name, num.emp)
print(emp.list)

---------
OUTPUT

[[1]]
[1] 1 2 3 4

[[2]]
[1] "Man" "Rag" "Sha" "Din"

[[3]]
[1] 4

Accessing components (by names)

  • All the components of a list can be named
  • These components can be accessed using the given names
emp.list = list("Id" = ID, "Names"= emp.name, "Total staff" = num.emp)
print(emp.list$Names)

------
OUTPUT
[1] "Man" "Rag" "Sha" "Din"

Accessing components (indices)

  • Use [] to get a top-level component as a sub-list and [[]] to get the component itself; to reach an element inside a component, use [[]] followed by []
print(emp.list[1])
print(emp.list[2])
print(emp.list[[1]][1])
print(emp.list[[2]][1])

------
OUTPUT

$Id
[1] 1 2 3 4

$Names
[1] "Man" "Rag" "Sha" "Din"

[1] 1

[1] "Man"
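
The difference between the two operators shows up in the type of the result. The sketch below rebuilds a smaller version of the list so that it runs on its own.

```r
emp.list <- list(Id = c(1, 2, 3, 4),
                 Names = c("Man", "Rag", "Sha", "Din"))

sub  <- emp.list[1]     # single brackets: a list of length 1
comp <- emp.list[[1]]   # double brackets: the numeric vector itself

print(class(sub))       # "list"
print(class(comp))      # "numeric"

# Components can also be picked by name before indexing into them.
second_name <- emp.list[["Names"]][2]
print(second_name)      # "Rag"
```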

Manipulating Lists

  • A list can be modified by accessing components & replacing them
emp.list["Total staff"] = 5
emp.list[[2]][5] = "Nir"
emp.list[[1]][5] = 5
print(emp.list)

--------
OUTPUT

$Id
[1] 1 2 3 4 5

$Names
[1] "Man" "Rag" "Sha" "Din" "Nir"

$`Total staff`
[1] 5

Concatenation of lists

  • Two lists can be concatenated using the concatenation function, c(list1, list2)
emp.ages = list("ages" = c(23,48,54,30,32))
emp.list = c(emp.list, emp.ages)

print(emp.list)

-------------
OUTPUT
$Id
[1] 1 2 3 4 5

$Names
[1] "Man" "Rag" "Sha" "Din" "Nir"

$`Total staff`
[1] 5

$ages
[1] 23 48 54 30 32

Data Frames

  • Data frames are generic data objects of R, used to store tabular data
vec1 = c(1,2,3)
vec2 = c("R", "Scilab", "Java")
vec3 = c("For prototyping", "For prototyping", "For Scaleup")

df = data.frame(vec1,vec2,vec3)
print(df)

------------
OUTPUT

  vec1   vec2            vec3
1    1      R For prototyping
2    2 Scilab For prototyping
3    3   Java     For Scaleup

Create a dataframe using data from a file

  • A dataframe can also be created by reading data from a file using the following command

    newDF = read.table(file="path of the file")

  • In the path, use / instead of \

    Example: "C:/Users/hii/Documents/R/R-Workspace/"

  • A separator can also be specified to distinguish between entries. The default separator is whitespace (one or more spaces, tabs or newlines)

    newDF = read.table(file="path of the file", sep=' ')
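
A self-contained sketch of reading a delimited file: it first writes a small comma-separated file to a temporary path (the file contents and column names are made up for illustration) and then reads it back.

```r
# Create a small comma-separated file to read from (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c("tool,use",
             "R,Prototyping",
             "Java,Scaleup"), path)

# header = TRUE treats the first line as column names;
# sep = "," overrides the default whitespace separator.
newDF <- read.table(file = path, sep = ",", header = TRUE,
                    stringsAsFactors = FALSE)
print(newDF)
```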

Accessing rows and columns

  • df[val1, val2] refers to row "val1", column "val2". Each can be a number or a string
  • "val1" or "val2" can also be a vector of values like 1:2 or c(1,3)
  • df[val2] (no comma) refers to column "val2" only
# accessing first & second row
print(df[1:2,])
# accessing first & second column:
print(df[,1:2])
# OR
print(df[1:2])

Subset

  • subset() extracts a subset of the data based on conditions
pd = data.frame("Name"=c("Senthil","Senthil","Sam","Sam"), "Month"=c("Jan","Feb","Jan","Feb"),
   "BS" = c(141.2,139.3,135.2,160.1),
   "BP" = c(90,78,80,81))

pd2 = subset(pd, Name== "Senthil" | BS>150)
print("new subset pd2")
print(pd2)

----------
OUTPUT

     Name Month    BS BP
1 Senthil   Jan 141.2 90
2 Senthil   Feb 139.3 78
4     Sam   Feb 160.1 81
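
The same filter can be written with plain logical indexing, which is roughly what subset() does internally. The sketch also shows the select argument of subset() for keeping only some columns.

```r
pd <- data.frame(Name  = c("Senthil", "Senthil", "Sam", "Sam"),
                 Month = c("Jan", "Feb", "Jan", "Feb"),
                 BS    = c(141.2, 139.3, 135.2, 160.1),
                 BP    = c(90, 78, 80, 81))

# subset() lets the condition refer to columns by bare name.
pd2 <- subset(pd, Name == "Senthil" | BS > 150)

# Equivalent logical indexing: columns must be qualified with pd$.
pd3 <- pd[pd$Name == "Senthil" | pd$BS > 150, ]
print(all.equal(pd2, pd3))

# select keeps only the listed columns of the matching rows.
high_bs <- subset(pd, BS > 150, select = c(Name, BS))
print(high_bs)
```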

Editing dataframe

  • Dataframes can be edited by direct assignment

df[[2]][2] = "R"

  • A dataframe can also be edited using the edit() command
  • Create an instance of a data frame and use the edit command to open a table editor, where changes can be made manually
  myTable = data.frame()
  myTable = edit(myTable)

Adding extra rows and columns

  • Extra row can be added with rbind function and extra column with cbind
df = rbind(df, data.frame(vec1=4, vec2="C", vec3="For Scale Up"))
print("adding extra row")
print(df)

df = cbind(df, vec4=c(10,20,30,40))
print("adding extra col")
print(df)

Deleting rows and columns

  • There are several ways to delete a row/column; some cases are shown
df2 = df[-3,-1]
df3 = df[, !names(df)%in%c("vec3")]
df4 = df[!df$vec1==3,]
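
To see what each of these deletions leaves behind, the sketch below rebuilds a four-row dataframe (values chosen for illustration) and inspects the results.

```r
df <- data.frame(vec1 = c(1, 2, 3, 4),
                 vec2 = c("R", "Scilab", "Java", "C"),
                 vec3 = c("For prototyping", "For prototyping",
                          "For Scaleup", "For Scaleup"),
                 stringsAsFactors = FALSE)

df2 <- df[-3, -1]                        # drop row 3 and column 1 by position
df3 <- df[, !names(df) %in% c("vec3")]   # drop a column by name
df4 <- df[df$vec1 != 3, ]                # drop rows matching a condition

print(dim(df2))      # 3 rows, 2 columns
print(names(df3))    # "vec1" "vec2"
print(df4$vec1)      # 1 2 4
```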

Manipulating rows - the factor issue

  • In R versions before 4.0.0, character columns in a data.frame become factors by default; from R 4.0.0 onward they stay character unless stringsAsFactors = TRUE is set
  • Factor variables are those where the character column is split into categories or factor levels
  • New entries need to be consistent with factor levels which are fixed when the dataframe is first created
vec1 = c(1,2,3)
vec2 = c("R","Scilab","Java")
vec3 = c("For prototyping", "For prototyping", "For ScaleUp")
df = data.frame(vec1, vec2, vec3, stringsAsFactors = FALSE)
df[3,3] = "Others"
print(df)
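
The factor issue can be reproduced on any R version by requesting factors explicitly. In the sketch below (column name and values are illustrative), assigning a value outside the existing levels produces NA, while the character version accepts it.

```r
# With factors, the set of levels is fixed when the column is created.
df_f <- data.frame(tool = c("R", "Scilab", "Java"), stringsAsFactors = TRUE)
print(levels(df_f$tool))

# "Others" is not an existing level, so R warns and stores NA instead.
suppressWarnings(df_f[3, "tool"] <- "Others")
print(df_f)

# With plain character columns the same assignment simply works.
df_c <- data.frame(tool = c("R", "Scilab", "Java"), stringsAsFactors = FALSE)
df_c[3, "tool"] <- "Others"
print(df_c)
```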

Recasting and joining of dataframes

Recasting dataframes

  • Recasting is the process of manipulating a dataframe in terms of its variables
  • Reshaping the data
pd=data.frame("Name"=c("Senthil","Senthil","Sam","Sam"),
"Month"=c("Jan","Feb","Jan","Feb"),
"BS" = c(141.2,139.3,135.2,160.1),
"BP" = c(90,78,80,81))
  • Recast in two steps:
    • Melt
    • Cast
  • Identifiers (discrete type variables)
  • Measurements (numeric variables)
  • Categorical and date variables cannot be measurements

Step 1: Melt

  • Call the library reshape2 using the library() command
  • melt(data, id.vars, measure.vars, variable.name="variable", value.name="value")
install.packages("reshape2")
library(reshape2)

Df = melt(pd, id.vars=c("Name","Month"), measure.vars=c("BS","BP"))
print(Df)

Step 2: cast

  • Applying the dcast() function
  • dcast(data, formula, value.var = column with values)

Df2 = dcast(Df, variable+Month ~ Name, value.var = "value")
print(Df2)

Recasting in single step

  • Applying the recast() function performs melt and cast in one command
  • recast(data, formula, ..., id.var, measure.var)
recast(pd, variable+Month~Name, id.var=c("Name","Month"))

Add new variable to dataframe based on existing ones

  • Load the dplyr package using the library() command
  • mutate() command will add extra variable columns based on existing ones
library(dplyr)
pd2 <- mutate(pd, log_BP = log(BP))
print(pd2)
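
If dplyr is not installed, base R's transform() gives a comparable result. The sketch below is an alternative to mutate(), not the dplyr call itself, and the pd_base name is introduced here for illustration.

```r
pd <- data.frame(Name  = c("Senthil", "Senthil", "Sam", "Sam"),
                 Month = c("Jan", "Feb", "Jan", "Feb"),
                 BS    = c(141.2, 139.3, 135.2, 160.1),
                 BP    = c(90, 78, 80, 81))

# transform() adds (or replaces) columns computed from existing ones.
pd_base <- transform(pd, log_BP = log(BP))
print(pd_base)
```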

Joining of two frames

  • Combining two dataframes - dplyr package

  • The common syntax for "dplyr" functions used to combine dataframes:

    function(dataframe1, dataframe2, by = id.variable)

    where:
    • "id.variable" is common to both dataframes
    • This variable provides the identifier for combining the two dataframes
    • The nature of the combination depends on the function used

Combining two dataframes

  • Load the 'dplyr' package using the library() command

  • The following commands would be used to combine datasets:

    left_join(),right_join(),inner_join(),full_join(),semi_join(),anti_join()


## Creating first dataframe
pd = data.frame("Name" = c("Senthil", "Senthil", "Sam", "Sam"),
"Month" = c("Jan", "Feb", "Jan", "Feb"),
"BS" = c(141.2, 139.3, 135.2, 160.1),
"BP" = c(90,78,80,81))

## creating another dataframe

pd_new = data.frame("Name" = c("Senthil", "Ramesh","Sam"),
"Department"=c("PSE","Data Analytics", "PSE"))
print(pd_new)

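## left_join(): keeps every row of the first dataframe (pd)
pd_left_join1 <- left_join(pd, pd_new, by="Name")

## right_join(): keeps every row of the second dataframe (pd_new)
pd_right_join1 <- right_join(pd,pd_new, by="Name")

## inner_join(): keeps only rows with a match in both dataframes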
pd_inner_join1 <- inner_join(pd_new, pd, by="Name")
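
The behaviour of the different joins can be checked without dplyr by using base R's merge(), whose all.x / all.y / all arguments correspond roughly to left, right and full joins. This is a sketch of the equivalent base calls, not the dplyr functions. Unlike dplyr, merge() sorts the result by the key column.

```r
pd <- data.frame(Name  = c("Senthil", "Senthil", "Sam", "Sam"),
                 Month = c("Jan", "Feb", "Jan", "Feb"),
                 BS    = c(141.2, 139.3, 135.2, 160.1),
                 BP    = c(90, 78, 80, 81))
pd_new <- data.frame(Name = c("Senthil", "Ramesh", "Sam"),
                     Department = c("PSE", "Data Analytics", "PSE"))

left  <- merge(pd, pd_new, by = "Name", all.x = TRUE)  # all rows of pd
right <- merge(pd, pd_new, by = "Name", all.y = TRUE)  # all rows of pd_new
inner <- merge(pd, pd_new, by = "Name")                # matches only
full  <- merge(pd, pd_new, by = "Name", all = TRUE)    # everything

# Rows of pd_new with no match in pd (what anti_join(pd_new, pd) would keep).
anti <- pd_new[!(pd_new$Name %in% pd$Name), ]

print(c(left = nrow(left), right = nrow(right),
        inner = nrow(inner), full = nrow(full)))
print(anti)
```

"Ramesh" appears only in pd_new, so the right and full results gain one extra row whose Month, BS and BP are NA.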

Looping over objects

  • apply : Apply a function over the margins of an array or matrix
  • lapply: Apply a function over a list or a vector
  • tapply: Apply a function over a ragged array
  • mapply: Multivariate version of lapply
  • xxply : (plyr package)

apply function

  • Applies a given function over the margins of a given array
  • Syntax: apply(array, margins, function,..)
  • Here margins refers to the dimension of the array along which the function needs to be applied (1 for rows, 2 for columns).
A <- matrix(1:9, 3,3)
apply(A,1,sum) # along rows
apply(A,2,sum) # along columns
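
For the 3x3 matrix above, the results can be cross-checked against the dedicated rowSums() and colSums() functions. apply() also accepts any user-defined function; the range example is an illustration added here.

```r
A <- matrix(1:9, 3, 3)   # filled column by column

row_totals <- apply(A, 1, sum)   # 12 15 18
col_totals <- apply(A, 2, sum)   # 6 15 24
print(row_totals)
print(col_totals)

# Same numbers from the specialised helpers.
print(all.equal(row_totals, rowSums(A)))
print(all.equal(col_totals, colSums(A)))

# An anonymous function: spread (max - min) within each column.
col_spread <- apply(A, 2, function(col) max(col) - min(col))
print(col_spread)                # 2 2 2
```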

lapply function

  • lapply is used to apply a function over a list
  • lapply always returns a list of the same length as the input list
  • Syntax: lapply(list, function, ...)
A = matrix(1:9, 3,3)
B = matrix(10:18, 3,3)
Mylist = list(A,B)
determinant = lapply(Mylist, det)
determinant
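
When a plain vector is more convenient than a list, sapply() applies the function the same way and then simplifies the result. A minimal sketch, using named list components (the names A and B are chosen here for readability):

```r
Mylist <- list(A = matrix(1:9, 3, 3), B = matrix(10:18, 3, 3))

dims <- lapply(Mylist, dim)     # a list: one integer vector per matrix
print(dims)

totals <- sapply(Mylist, sum)   # simplified to a named vector
print(totals)                   # A = 45, B = 126
```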

mapply function

  • mapply is a multivariate version of lapply
  • A function can be applied over several lists simultaneously
  • Syntax: mapply(fun, list1, list2, ..)
source('~/volcylinder.R')
dia = c(1,2,3,4)
len = c(7,4,3,2)
vol = mapply(volcylinder,dia, len)
vol
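
The file volcylinder.R is not shown in these notes. The sketch below assumes it defines a function taking a diameter and a length and returning the cylinder volume; the definition here is a hypothetical stand-in so that the example runs on its own.

```r
# Hypothetical stand-in for the function defined in volcylinder.R:
# volume of a cylinder from its diameter and length.
volcylinder <- function(dia, len) {
  pi * (dia / 2)^2 * len
}

dia <- c(1, 2, 3, 4)
len <- c(7, 4, 3, 2)

# mapply() calls volcylinder(dia[i], len[i]) for each i.
vol <- mapply(volcylinder, dia, len)
print(vol)
```

This particular stand-in is already vectorised, so volcylinder(dia, len) would give the same answer; mapply() matters most when the function only handles one pair of values at a time.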

tapply function

  • tapply is used to apply a function over subsets of a vector defined by a combination of factors
  • Syntax: tapply(vector, factors, function, ..)
Id = c(1,1,1,1,2,2,2,3,3)
Values = c(1,2,3,4,5,6,7,8,9)
tapply(Values, Id, sum)
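
For the vectors above, the groups are Id 1 (values 1 to 4), Id 2 (values 5 to 7) and Id 3 (values 8 and 9). The sketch below shows the sums and, as an extra illustration, the group means.

```r
Id     <- c(1, 1, 1, 1, 2, 2, 2, 3, 3)
Values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

group_sums <- tapply(Values, Id, sum)
print(group_sums)    # 10 18 17, labelled by Id

group_means <- tapply(Values, Id, mean)
print(group_means)   # 2.5 6.0 8.5
```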