## stat545a-2013-hw05_khosravi-mah.Rmd
Mahdiar Khosravi
========================================================
**STAT-545A hw#05**
**October.07.2013**

```{r include = FALSE}
knitr::opts_chunk$set(tidy = FALSE)
```

### Source Data
I decided to work on a new data set for this homework. I picked a data set from [OECD.Stat Extracts](http://stats.oecd.org/) about Bank Profitability Statistics under its Finance subsection.

### Data Import
Used libraries:
```{r}
library(ggplot2)
library(xtable)
```

```{r}
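# Read strings as factors explicitly: levels() and droplevels() below rely on it (not the default since R 4.0)
gDat <- read.delim(file = "BankProfitabilityStatistics.csv", stringsAsFactors = TRUE)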
```

Basic sanity check:
```{r}
str(gDat)
names(gDat)
levels(gDat$Country)
levels(gDat$Bank)
```
As we can see, this data set reports financial indicators for different types of banking institutions in different countries over `r length(unique(gDat$Year))` years (`r min(gDat$Year)` to `r max(gDat$Year)`).

### Data Manipulation
This is a *long-format* data.frame. It contains a number of dependent variables and levels that we can omit from the main data.frame. `All Bank`, for example, is the sum of the values for the different types of banks.
I also chose to omit the two levels `Foreign commercial banks` and `Large commercial banks`, for consistency.
```{r}
BankLevels <- c("Co-operative banks","Commercial banks","Other miscellaneous monetary institutions",
                "Savings banks")
iDat <- droplevels(subset(gDat, subset= Bank %in% BankLevels))
str(iDat)
```
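The call to `droplevels()` matters because `subset()` on its own leaves the removed bank types behind as empty factor levels. A minimal sketch on toy data (the bank names and values below are invented for illustration and are not taken from the OECD file):

```r
# Toy data; names and values are assumptions made for illustration only.
toyBanks <- data.frame(
  Bank  = factor(c("All banks", "Commercial banks", "Savings banks")),
  Value = c(30, 20, 10)
)

keep <- c("Commercial banks", "Savings banks")
kept <- subset(toyBanks, Bank %in% keep)

# subset() alone keeps the now-unused level in the factor.
levelsBefore <- levels(kept$Bank)

# droplevels() removes levels that no longer occur in the data.
levelsAfter <- levels(droplevels(kept)$Bank)
```

Here `levelsBefore` still lists all three bank types, while `levelsAfter` lists only the two that remain in the rows.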

The same dependency issue exists among the levels of the `Item` variable; many of the items below are derived by subtracting earlier ones:

```{r}
levels(iDat$Item)
```
As these levels show, we have a problem with **level names** here, which I did not know how to handle properly. I did not succumb to the temptation of replacing them with reasonable names in the .csv file prior to import! Instead, I decided to reshape the data.frame and change the names whenever needed.
The levels fall into different categories, so I define some vectors of level names to use later on.

```{r}
incomeLevels <- c("1. Interest income","2. Interest expenses",
                "4. Net non-interest income","6. Operating expenses",
                "8. Net provisions","10. Income tax",
                "12. Distributed profit","13. Retained profit ")
assetLevels <- c("14. Cash and balance with Central bank","15. Interbank deposits","16. Loans",
                 "17. Securities","18. Other assets")
liabLevels <- c("19. Capital and reserves","20. Borrowing from Central bank",
                "21. Interbank deposits","22. Customer deposits","23. Bonds","24. Other liabilities")
QuantLevels <- c("37. Number of institutions","38. Number of branches","39. Number of employees")
```

Now I can introduce a data.frame addressing *Income Statements* as follows:
```{r}
incomeDat <- droplevels(subset(iDat, Item %in% incomeLevels))
incomeDat <- reshape(incomeDat, idvar=c("Country","Bank","Year"), timevar="Item", direction= "wide")
names(incomeDat) <- c("Country","Bank","Year","InterestIncome","InterestExpenses",
                       "NetNonInterestIncome","OperatingExpenses","NetProvisions","IncomeTax",
                       "DistributedProfit","RetainedProfit")
str(incomeDat)
```
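The positional `names()` assignment above relies on the wide columns coming out in the same order as the new names. `reshape()` creates them in the order the items first appear in the data, so a name-based lookup is a safer alternative. The following is only a sketch on toy data: the value column name `Value`, the numbers, and the `shortNames` lookup are assumptions for illustration and are not taken from the OECD file.

```r
# Toy long-format data; the column name `Value` and all numbers are
# assumptions made for illustration.
toyLong <- data.frame(
  Country = "A",
  Bank    = "Commercial banks",
  Year    = rep(c(2000, 2001), each = 2),
  Item    = rep(c("2. Interest expenses", "1. Interest income"), times = 2),
  Value   = c(5, 9, 6, 10),
  stringsAsFactors = FALSE
)

toyWide <- reshape(toyLong, idvar = c("Country", "Bank", "Year"),
                   timevar = "Item", direction = "wide")

# Lookup table from the original item label to a short column name.
shortNames <- c("1. Interest income"   = "InterestIncome",
                "2. Interest expenses" = "InterestExpenses")

# The new columns are named "<value column>.<item label>", so strip the
# prefix and rename by label rather than by position.
isValue <- grepl("^Value\\.", names(toyWide))
labels  <- sub("^Value\\.", "", names(toyWide)[isValue])
names(toyWide)[isValue] <- shortNames[labels]
```

In this toy data the expenses item appears first, so a positional assignment in numeric item order would have swapped the two columns. The lookup keeps each value under its matching name.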
We can look at a few random rows of `incomeDat`. The table below shows that some values are missing; as far as I could see on the website, they are also missing in the source data.
```{r results='asis'}
Dumm1 <- incomeDat[sample(nrow(incomeDat), 10),  ]
print(xtable(Dumm1),type = "html", include.rownames = FALSE)
```

With `incomeDat` we can look at interest income for the different types of banks, with country as the grouping variable. The plots below also show the distribution of Operating Expenses and how this value changes over time.

```{r fig.width=14, fig.height=10}
ggplot(incomeDat, aes(x= Bank, y= InterestIncome, color=Country)) +
      geom_jitter(alpha=1/2, position = position_jitter(width = .2)) + scale_y_log10()
```

```{r fig.width=14, fig.height=10}
ggplot(incomeDat, aes(x= OperatingExpenses, fill= Country)) + geom_density() + scale_x_log10()

ggplot(incomeDat, aes(x= Year, y= OperatingExpenses, color=Bank)) +
       facet_wrap(~ Country) + geom_point() + scale_y_log10()
```

Surprisingly, the last plot does not show any marked change in expenses over time.
Now I want to do a small HR study of the banking institutions, using another level vector:

```{r}
HRDat <- droplevels(subset(gDat, Item %in% QuantLevels))
HRDat <- reshape(HRDat, idvar=c("Country","Bank","Year"), timevar="Item", direction= "wide")
names(HRDat) <- c("Country","Bank","Year","institutions","branches","employees")
```
We can take a look at some rows of this new data.frame.
```{r results='asis'}
Dumm1 <- HRDat[sample(nrow(HRDat), 10),  ]
print(xtable(Dumm1),type = "html", include.rownames = FALSE)
```
This new data.frame allows a different set of investigations. Here, I plot the number of employees in the different types of banking institutions in the U.S. by year, with point size reflecting the number of institutions.
```{r fig.width=14, fig.height=10}
ggplot(subset(HRDat, Country == "United States"), aes(x = Year, y = employees, color= Bank,
                size = 1500 *sqrt(institutions/pi))) + geom_point()
```
As we can see, although `Other monetary institutions` comprise a considerable number of institutions, they account for relatively few employees, whereas `Large commercial banks` employ a disproportionately large share given their relatively small number of institutions. Another interesting observation is the drop in employees after 2007 in `Commercial banks`, which consistently had the most employees.

Finally, I should say that a more professional way of manipulating this data set would probably avoid splitting it up through the level vectors I used here. I used them as a quick approach to the problem, after getting a headache trying to prepare the input data because of the weird level names!
I am also a bit suspicious about my use of `reshape` and am not sure whether I have messed things up with it!
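
One way to gain some confidence in a long-to-wide `reshape()` is to run a few consistency checks on it. The sketch below does this on toy data. The value column name `Value` and all numbers are assumptions for illustration, and the checks are suggestions rather than a complete validation.

```r
# Toy long-format data; the column name `Value` and the numbers are assumptions.
toyLong <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  Bank    = "Commercial banks",
  Year    = c(2000, 2000, 2001, 2000, 2000),
  Item    = c("16. Loans", "17. Securities", "16. Loans",
              "16. Loans", "17. Securities"),
  Value   = c(1, 2, 3, 4, NA),
  stringsAsFactors = FALSE
)

keys <- c("Country", "Bank", "Year", "Item")

# Check 1: each id/item combination should occur only once; otherwise
# several values would collide in a single wide cell.
nDuplicates <- sum(duplicated(toyLong[keys]))

toyWide <- reshape(toyLong, idvar = c("Country", "Bank", "Year"),
                   timevar = "Item", direction = "wide")

# Check 2: the wide table should have one row per id combination.
nIds <- nrow(unique(toyLong[c("Country", "Bank", "Year")]))

# Check 3: no observed value should be lost or invented.
valueCols   <- grep("^Value\\.", names(toyWide))
nValuesLong <- sum(!is.na(toyLong$Value))
nValuesWide <- sum(!is.na(toyWide[valueCols]))

stopifnot(nDuplicates == 0,
          nrow(toyWide) == nIds,
          nValuesLong == nValuesWide)
```

If the same checks pass on `incomeDat` and `HRDat`, with the real value column name substituted, the reshaping has at least preserved the row structure and the number of observed values.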