@isaacarnault
Last active May 5, 2022 06:39
Data exploration and visualization using R
exercise_solution.md


Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.

Scripting in R and Jupyter

The following repository helps you learn how to create a dataset end-to-end and perform some data exploration and visualization.

Prerequisites: the story you want to tell

To implement data visualization in R, you should have some interest in the data you use daily, whether at work or at university. Before creating this gist, I considered how my data visualization could interest Hadoop professionals on social networks, since I ultimately share my gists with my Twitter and LinkedIn followers. I therefore decided to find some publicly available data related to this technology, build a dataset from it, read that dataset with R, perform some analysis and cleaning operations on it, and create a visualization chart that could tell a story about the data.

How to: perform data exploration and visualization using R

The following steps will help you visualize "the number of nodes in a Hadoop cluster used by major tech companies" (the story I want to tell). To implement what I did, you may proceed as follows:

  • you can follow the steps below to understand the whole process end to end
  • or you can run program.rda in RStudio or in your favorite workbench to check the output

Steps

  • Check https://who.is to retrieve data you'll use in your dataset (e.g. search for https://last.fm)
  • Open your favorite text editor
  • Name your columns company, nodes, country, server_type, server_version, Id
  • Create 20 observations (an observation is a row, so 20 obs = 20 rows)
  • Make sure to store data in each cell of your dataset (if no data is available, use "NA")
  • Save your file as .csv
  • Make sure you have RStudio installed on your machine (see Running the tests)
  • Open your file with R and visualize it
  • Create a new R script, then install and load the packages (refer to Tips.md)
  • Open your .csv in R and explore the data (refer to Tips.md to know how)
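If you prefer to build the .csv from R itself rather than in a text editor, the steps above can be sketched as follows; the sample values are illustrative placeholders, not the exercise data:

```r
# Build a small illustrative dataset with the column names from the steps above
MyData <- data.frame(
  company        = c("Adobe", "Ebay", "Facebook"),
  nodes          = c(3, 532, 1400),
  country        = c("USA", "USA", "USA"),
  server_type    = c("Apache", NA, NA),
  server_version = c(NA, NA, NA),
  Id             = 1:3
)

# Save as .csv, then read it back to check the round trip
write.csv(MyData, file = "dataset_hadoop.csv", row.names = FALSE)
check <- read.csv("dataset_hadoop.csv")
str(check)
```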

Running the tests

I am using Ubuntu (18.04 bionic).

  • Check in your shell that R and Jupyter Notebook are correctly installed:

Check R version

$ R --version

Check Jupyter Notebook version

$ jupyter --version

You need RStudio and Jupyter Notebook installed on your PC to properly use this gist.
Jupyter Notebook is not compulsory; it is simply another way to run R scripts.
You can also use Jupyter Notebook on remote sites to perform the same operations you would perform in RStudio.

Built With

  • Notepadqq - A text editor - Linux/Unix
  • R Studio - A statistical computing environment
  • ggvis - an interactive grammar-of-graphics package for R
  • ggplot2 - a popular package for plotting in R
  • This dataset was created using notepadqq.
  • Data is sorted by company name, number of nodes, country name, server type, server version and position in the table.
  • Save the dataset below as .csv and read it with RStudio before you invoke visualization functions.
  • Data are provided by various sites. Some of them are listed in Tips.md

Versioning

I used no versioning system for this gist. My repository's status is flagged as active because it has reached a stable, usable state. The original gist related to this repository is pending as concept.

Author

  • Isaac Arnault

License

All public gists https://gist.github.com/isaacarnault
Copyright 2018, Isaac Arnault
MIT License, http://www.opensource.org/licenses/mit-license.php

Exercise

As an IT or Big Data Project Manager, you are asked by the Information System Manager to use a dataset to prepare a presentation on the management of Hadoop clusters around the world. For your presentation, you have decided to include some metrics related to the number of nodes run by top Internet companies, and to locate the servers on which the nodes run by Internet Protocol address. Since some of the data are available in the public domain (on the Internet), you have decided to use them. This exercise is only one part of the whole set of steps you would conduct for such a presentation (business understanding, analytic approach, data requirements, collection, analysis, preparation, and modeling). Completing this exercise can be seen as a prerequisite for enterprise data analysis.

  • Create your dataset by using data from this Slideshare
  • Consider the following range of data while extracting them from the above link: dataset = {2, 21}
  • Name the variables of your dataset Id, Company, Nodes, Country, Server
  • Go to Tips.md to find sources where you can find Server name and Country
  • Assign to each Id a Company, number of Nodes, Country and Server Name
  • Read your dataset using RStudio or Jupyter
  • Use Jupyter to perform some exploration of your dataset
  • Use RStudio to perform some visualisation of your dataset:
    1. Install and activate ggvis and ggplot2 packages from the CRAN
    2. Use geom_dotplot function for plotting. Sort the graph by Company per Nodes.
  • Question: How many companies use {500, 1500} nodes? Name the companies while visualizing the graph.
| Id | Company      | Nodes | Server      | Version | IP  |
|----|--------------|-------|-------------|---------|-----|
| 1  | Adobe        | 3     | Apache      | NA      | 193 |
| 2  | Crowdmedia   | 5     | Apache      | NA      | 88  |
| 3  | Beebler      | 14    | Nginx       | 1.11.9  | 54  |
| 4  | Bixolabs     | 20    | Nginx       | 1.14.0  | 50  |
| 5  | Careers      | 15    | Nginx       | NA      | 185 |
| 6  | Contextweb   | 50    | Openresty   | NA      | 52  |
| 7  | Criteo       | 2000  | Nginx       | NA      | 178 |
| 8  | Ebay         | 532   | NA          | NA      | 66  |
| 9  | Facebook     | 1400  | NA          | NA      | 31  |
| 10 | Infochimps   | 30    | Nginx       | NA      | 23  |
| 11 | Lastfm       | 100   | Nginx       | NA      | 64  |
| 12 | Mercadolibre | 20    | Tengine     | NA      | 54  |
| 13 | Openneptune  | 200   | Apache      | NA      | 103 |
| 14 | Quantcast    | 3000  | Apache      | NA      | 34  |
| 15 | Rackspace    | 30    | Akamaighost | NA      | 173 |
| 16 | Rakuten      | 69    | Akamaighost | NA      | 203 |
| 17 | Spotify      | 1650  | Nginx       | NA      | 104 |
| 18 | Telenav      | 60    | CentOS      | 2.4.6   | 35  |
| 19 | Worldlingo   | 44    | Nginx       | NA      | 204 |
  • Your dataset should render like this in RStudio for dataset = {2, 21}.
| Id | Company           | Nodes | Country | Server      |
|----|-------------------|-------|---------|-------------|
| 1  | Linkedin.com      | 4100  | USA     | Play        |
| 2  | Facebook.com      | 1400  | USA     | NA          |
| 3  | NetSeer.com       | 1050  | USA     | Nginx       |
| 4  | Ebay.com          | 532   | USA     | NA          |
| 5  | CRS4.com          | 400   | USA     | Nginx       |
| 6  | Powerset          | 400   | USA     | NA          |
| 7  | Aknowledge.com    | 400   | USA     | NA          |
| 8  | Neptune.com       | 200   | UK      | Cloudflare  |
| 9  | Aol.com           | 1400  | CHE     | ATS         |
| 10 | Immobi.com        | 150   | GER     | Apache      |
| 11 | FOX.com           | 140   | USA     | AkamaiGHost |
| 12 | Specificmedia.com | 20    | USA     | Apache      |
| 13 | Search.wikia.com  | 125   | USA     | Apache      |
| 14 | Ecircle.com       | 120   | Germany | Apache      |
| 15 | Spotify.com       | 120   | USA     | Nginx       |
| 16 | A9.com            | 69    | USA     | Server      |
| 17 | Ara.com.tr        | 100   | USA     | NA          |
| 18 | Cornell.com       | 100   | USA     | NA          |
| 19 | Last.fm           | 100   | USA     | Nginx       |
| 20 | Tint.com          | 94    | USA     | Nginx       |
see dataset

Id, Company, Nodes, Country, Server
1, Linkedin.com, 4100, USA, Play
2, Facebook.com, 1400, USA, NA
3, NetSeer.com, 1050, USA, Nginx
4, Ebay.com, 532, USA, NA
5, Crs4.com, 400, USA, Nginx
6, Powerset, 400, USA, NA
7, Aknowledge.com, 400, USA, NA
8, Neptune.com, 200, UK, Cloudflare
9, Aol.com, 1400, CHE, ATS
10, Immobi.com, 150, GER, Apache
11, Fox.com, 140, USA, AkamaiGHost
12, Specificmedia.com, 20, USA, Apache
13, Search.wikia.com, 125, USA, Apache
14, Ecircle.com, 120, Germany, Apache
15, Spotify.com, 120, USA, Nginx
16, A9.com, 69, USA, Server
17, Ara.com.tr, 100, USA, NA
18, Cornell.com, 100, USA, NA
19, Last.fm, 100, USA, Nginx
20, Tint.com, 94, USA, Nginx

See graph

isaac-arnault-datavisualization-using-R-17.png

See answer

The graph shows that 4 companies use 500 to 1500 nodes: Ebay, NetSeer, Facebook, Aol.
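You can also confirm this answer programmatically; a minimal sketch using the node counts from the dataset above:

```r
# The 20-row dataset from the exercise, entered inline
MyData <- data.frame(
  Company = c("Linkedin.com", "Facebook.com", "NetSeer.com", "Ebay.com",
              "CRS4.com", "Powerset", "Aknowledge.com", "Neptune.com",
              "Aol.com", "Immobi.com", "FOX.com", "Specificmedia.com",
              "Search.wikia.com", "Ecircle.com", "Spotify.com", "A9.com",
              "Ara.com.tr", "Cornell.com", "Last.fm", "Tint.com"),
  Nodes = c(4100, 1400, 1050, 532, 400, 400, 400, 200, 1400, 150,
            140, 20, 125, 120, 120, 69, 100, 100, 100, 94)
)

# Companies whose node count falls in the 500-1500 range
hits <- subset(MyData, Nodes >= 500 & Nodes <= 1500)
hits$Company   # Facebook.com, NetSeer.com, Ebay.com, Aol.com
nrow(hits)     # 4
```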

MIT License
Copyright (c) 2018 Isaac Arnault
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

You can ignore the outputs below if your scripts ran correctly.

  • Creating a dataset, saving it in .csv and reading it with Jupyter Notebook
see raw format

# Raw format
Id, Company, Nodes, Server, Version, IP
1, Adobe, 3, Apache, NA, 193
2, Crowdmedia, 5, Apache, NA, 88
3, Beebler, 14, Nginx, 1.11.9, 54
4, Bixolabs, 20, Nginx, 1.14.0, 50
5, Careers, 15, Nginx, NA, 185 
6, Contextweb, 50, Openresty, NA, 52
7, Criteo, 2000, Nginx, NA, 178
8, Ebay, 532, NA, NA, 66
9, Facebook, 1400, NA, NA, 31
10, Infochimps, 30, Nginx, NA, 23
11, Lastfm, 100, Nginx, NA, 64
12, Mercadolibre, 20, Tengine, NA, 54
13, Openneptune, 200, Apache, NA, 103
14, Quantcast, 3000, Apache, NA, 34
15, Rackspace, 30, Akamaighost, NA, 173
16, Rakuten, 69, Akamaighost, NA, 203
17, Spotify, 1650, Nginx, NA, 104
18, Telenav, 60, CentOS, 2.4.6, 35
19, Worldlingo, 44, Nginx, NA, 204

isaac-arnault-datavisualization-using-R-0.png

Data Exploration

  • Reading our dataset
first argument

# 1. Reading dataset using Jupyter Notebook
MyData <- read.csv(file="dataset_hadoop.csv")
MyData

isaac-arnault-datavisualization-using-R-2.png

  • Exploring our dataset
dim() function

# 2. Showing the dimensions of the dataset by variables (columns) and observations (rows)
MyData <- read.csv(file="dataset_hadoop.csv")

dim(MyData)

isaac-arnault-datavisualization-using-R-3.png

  • Exploring our dataset
str() function

# 3. Showing the structure of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

str(MyData)

isaac-arnault-datavisualization-using-R-4.png

  • Exploring our dataset
summary() function

# 4 Summary statistics on the variables (columns) of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

summary(MyData)

isaac-arnault-datavisualization-using-R-5.png

  • Exploring our dataset
colnames() function

# 5 Showing the name of each variable (column) of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

colnames(MyData)

isaac-arnault-datavisualization-using-R-6.png

  • Exploring our dataset
head() function

# 6  Showing the first 6 observations (rows) of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

head(MyData)

isaac-arnault-datavisualization-using-R-7.png

  • Exploring our dataset
tail() function

# 7  Showing the last 6 observations (rows) of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

tail(MyData)

isaac-arnault-datavisualization-using-R-8.png
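Beyond head() and tail(), it can help to count the missing values, since this dataset deliberately contains NA cells; a small sketch on an inline example (the same two calls apply to MyData after read.csv):

```r
# Missing-value audit on a tiny inline data frame;
# run the same calls on MyData once the .csv is loaded
df <- data.frame(Server  = c("Apache", NA, "Nginx"),
                 Version = c(NA, NA, "1.14.0"))
colSums(is.na(df))           # NAs per column: Server 1, Version 2
sum(complete.cases(df))      # rows with no missing value: 1
```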

Data Visualization

  • Visualizing our dataset
basic plotting

# ggvis: first visualization using `layer_points` function
MyData %>% 
  ggvis(~Server, ~Nodes) %>%
layer_points()

isaac-arnault-datavisualization-using-R-9.png

  • Visualizing our dataset
improved plotting

# ggvis: improving the above script by coloring the points per Company
MyData %>% 
  ggvis(~Server, ~Nodes) %>%
  layer_points() %>%
  layer_points(fill = ~Company)

isaac-arnault-datavisualization-using-R-10.png

  • Visualizing our dataset
IP per Nodes, diamond

# ggvis: third visualization using layer_points, diamond shape
MyData %>% 
  ggvis(~IP, ~Nodes) %>% 
  layer_points(size := 25, shape := "diamond", stroke := "red", fill := NA)

isaac-arnault-datavisualization-using-R-11.png

  • Visualizing our dataset
IP per Nodes, triangles

# ggvis: fourth visualization using layer_lines, layer_points, triangle shape
MyData %>%
  ggvis(~IP, ~Nodes, stroke := "skyblue",
        strokeOpacity := 0.5, strokeWidth := 5) %>%
  layer_lines() %>%
  layer_points(fill = ~Company,
               shape := "triangle-up",
               size := 300)

isaac-arnault-datavisualization-using-R-12.png

  • Visualizing our dataset
Company per Nodes using geom_bar

# ggplot: first visualization using geom_bar
g <- ggplot(MyData, aes(Company))
g + geom_bar(aes(fill=Nodes), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Using geom_bar", 
       subtitle="Company Vs Nodes", 
       caption="Author: Isaac Arnault")

isaac-arnault-datavisualization-using-R-14.png

  • Visualizing our dataset
IP per Nodes using geom_violin

# ggplot: second visualization using geom_violin
g <- ggplot(MyData, aes(IP, Nodes))
g + geom_violin(trim=FALSE, fill='#ffffff', color="black") + 
  labs(title="Using geom_violin" , 
       subtitle="IP Vs Nodes",
       caption="Author: Isaac Arnault",
       x="IP",
       y="Nodes")

isaac-arnault-datavisualization-using-R-15.png

  • Visualizing our dataset
IP per Nodes using geom_point, geom_segment

# ggplot: third visualization using geom_point, geom segment, shape: tomato
ggplot(MyData, aes(x=IP, y=Nodes)) + 
  geom_point(col="tomato2", size=3) + 
  geom_segment(aes(x=IP, 
                   xend=IP, 
                   y=min(Nodes), 
                   yend=max(Nodes)), 
               linetype="dashed", 
               size=0.1) +  
  labs(title="Using geom_points and geom_segment", 
       subtitle="IP Vs Nodes",
       caption="Author: Isaac Arnault")

isaac-arnault-datavisualization-using-R-16.png
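The exercise above calls for geom_dotplot, which the walkthrough does not show; a minimal sketch, assuming the dataset file from earlier and ggplot2 attached:

```r
library(ggplot2)
MyData <- read.csv(file = "dataset_hadoop.csv")

# geom_dotplot: one dot per observation, binned along the y axis;
# reorder() sorts the companies by their node count
ggplot(MyData, aes(x = reorder(Company, Nodes), y = Nodes)) +
  geom_dotplot(binaxis = "y", stackdir = "center", dotsize = 0.8) +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  labs(title = "Using geom_dotplot",
       subtitle = "Company per Nodes",
       x = "Company")
```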

# Reading our dataset
MyData <- read.csv(file="dataset_hadoop.csv",
header=TRUE, sep=",")
MyData
# ggvis: first visualization using `layer_points` function
MyData %>%
ggvis(~Server, ~Nodes) %>%
layer_points()
# ggvis: improving the above script by coloring the points per Company
MyData %>%
ggvis(~Server, ~Nodes) %>%
layer_points() %>%
layer_points(fill = ~Company)
# ggvis: third visualization using layer_points, shape: diamond
MyData %>%
ggvis(~IP, ~Nodes) %>%
layer_points(size := 25, shape := "diamond", stroke := "red", fill := NA)
# ggvis: fourth visualization using layer_lines, layer_points, triangle shape
MyData %>%
ggvis(~IP, ~Nodes, stroke := "skyblue",
strokeOpacity := 0.5, strokeWidth := 5) %>%
layer_lines() %>%
layer_points(fill = ~Company,
shape := "triangle-up",
size := 300)
# ggplot: first visualization using geom_bar
g <- ggplot(MyData, aes(Company))
g + geom_bar(aes(fill=Nodes), width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Using geom_bar",
subtitle="Company Vs Nodes",
caption="Author: Isaac Arnault")
# ggplot: second visualization using geom_violin
g <- ggplot(MyData, aes(IP, Nodes))
g + geom_violin(trim=FALSE, fill='#ffffff', color="black") +
labs(title="Using geom_violin" ,
subtitle="IP Vs Nodes",
caption="Author: Isaac Arnault",
x="IP",
y="Nodes")
# ggplot: third visualization using geom_point, geom segment, shape: tomato
ggplot(MyData, aes(x=IP, y=Nodes)) +
geom_point(col="tomato2", size=3) +
geom_segment(aes(x=IP,
xend=IP,
y=min(Nodes),
yend=max(Nodes)),
linetype="dashed",
size=0.1) +
labs(title="Using geom_points and geom_segment",
subtitle="IP Vs Nodes",
caption="Author: Isaac Arnault")
  • Install and load the packages we are using. Get them from the CRAN, or install and load them from your R script
install ggplot2

install.packages('ggplot2')

activate ggplot2

library('ggplot2')


install ggvis

install.packages('ggvis')

activate ggvis

library('ggvis')


Note: there is no CRAN package named BoxPlot. Box plots are available out of the box through base R's boxplot() function or ggplot2's geom_boxplot(), so no extra installation is needed.

  • Open your .csv in RStudio and explore the data. You have the file hosted on your PC.
Dataset_1 <- read.csv(file="filepath/myfile.csv", header=TRUE, sep=",")
  • Or you can read your file in RStudio if having the .csv hosted on a remote site
Dataset_1 <- read.csv("http://fileurl/myfile.csv", header=TRUE, sep=",")
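Because some cells in the dataset were filled with the literal string "NA", it can help to tell read.csv explicitly which tokens count as missing; a sketch using the same hypothetical file path as above:

```r
# Treat both the "NA" token and empty cells as missing values,
# and trim stray whitespace around fields (e.g. "USA ")
Dataset_1 <- read.csv(file = "filepath/myfile.csv",
                      header = TRUE, sep = ",",
                      na.strings = c("NA", ""),
                      strip.white = TRUE)
```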

  • Launching R using your terminal
see code

R

  • Launching Jupyter using your terminal
see code

jupyter notebook

  • If you don't have R and Jupyter installed, you can still use the following links:
    RStudio - Download RStudio Desktop (Open Source Licence)
    Jupyter Notebook - You don't need to install it (see Tips.md)

  • If you need to access / search data related to this gist and its exercise, please check:
    Who.is
    Wiki.apache.org
    Hostingchecker.com
