@isaacarnault
Last active May 5, 2022 06:39
Data exploration and visualization using R
exercise_solution.md


Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.

Scripting in R and Jupyter

The following repository helps you learn how to create a dataset end-to-end and perform some data exploration and visualization.

Prerequisites: the story you want to tell

To implement data visualization in R, you should have some interest in the data you use daily, whether at work or at university. Before creating this gist, I considered how my data visualization could interest Hadoop professionals on social networks, since I ultimately share my gists with my Twitter and LinkedIn followers. I therefore decided to find some publicly available data related to this technology, build a dataset from it, read that dataset with R, perform some analysis and cleaning operations on it, and create a visualization chart that could tell a story about the data.

How to: perform data exploration and visualization using R

The following steps will help you visualize "the number of nodes in a Hadoop cluster used by major tech companies" (the story I want to tell). To implement what I did, you may proceed as follows:

  • you can follow the steps below to understand the whole process end to end
  • or you can run program.rda in RStudio or in your favorite workbench to check the output

Steps

  • Check https://who.is to retrieve data you'll use in your dataset (e.g. search for https://last.fm)
  • Open your favorite text editor
  • Name your columns company, nodes, country, server_type, server_version, Id
  • Create 20 observations (an observation is a row, so 20 obs = 20 rows)
  • Make sure to store data in each cell of your dataset (if no data is available, use "NA")
  • Save your file as .csv
  • Make sure you have RStudio installed on your machine (see Running the tests)
  • Open your file with R and visualize it
  • Create a new R script, then install and load the packages (refer to Tips.md)
  • Open your .csv in R and explore the data (refer to Tips.md to know how)
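If you prefer to build the .csv from R itself rather than in a text editor, the steps above can be sketched as follows; the sample values are illustrative placeholders, not the exercise data:

```r
# Build a small illustrative dataset with the column names from the steps above
MyData <- data.frame(
  company        = c("Adobe", "Ebay", "Facebook"),
  nodes          = c(3, 532, 1400),
  country        = c("USA", "USA", "USA"),
  server_type    = c("Apache", NA, NA),
  server_version = c(NA, NA, NA),
  Id             = 1:3
)

# Save as .csv, then read it back to check the round trip
write.csv(MyData, file = "dataset_hadoop.csv", row.names = FALSE)
check <- read.csv("dataset_hadoop.csv")
str(check)
```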

Running the tests

I am using Ubuntu (18.04 bionic).

  • Check in your shell that R and Jupyter Notebook are correctly installed:

Check R version

$ R --version

Check Jupyter Notebook version

$ jupyter --version

You need RStudio and Jupyter Notebook installed on your PC to properly use this gist.
Jupyter Notebook is not compulsory; it is simply another way to run R scripts.
You can also use Jupyter Notebook on remote sites to perform the same operations you would perform in RStudio.

Built With

  • Notepadqq - A text editor - Linux/Unix
  • R Studio - A statistical computing environment
  • ggvis - an interactive grammar-of-graphics package for R
  • ggplot2 - a popular package for plotting in R
  • This dataset was created using notepadqq.
  • Data is sorted by company name, number of nodes, country name, server type, server version and position in the table.
  • Save the dataset below as .csv and read it with RStudio before you invoke visualization functions.
  • Data are provided by various sites. Some of them are listed in Tips.md

Versioning

I used no versioning system for this gist. My repository's status is flagged as active because it has reached a stable, usable state. The original gist related to this repository is pending as concept.

Author

  • Isaac Arnault

License

All public gists https://gist.github.com/isaacarnault
Copyright 2018, Isaac Arnault
MIT License, http://www.opensource.org/licenses/mit-license.php

Exercise

As an IT or Big Data Project Manager, you are asked by the Information System Manager to use a dataset to prepare a presentation on the management of Hadoop clusters around the world. For your presentation, you have decided to include some metrics related to the number of nodes run by top Internet companies, and to locate the servers on which the nodes run by Internet Protocol address. Since some of the data are available in the public domain (on the Internet), you have decided to use them. This exercise is only one part of the whole set of steps you would conduct for such a presentation (business understanding, analytic approach, data requirements, collection, analysis, preparation, and modeling). Completing this exercise can be seen as a prerequisite for enterprise data analysis.

  • Create your dataset by using data from this Slideshare
  • Consider the following range of data while extracting them from the above link: dataset = {2, 21}
  • Name the variables of your dataset Id, Company, Nodes, Country, Server
  • Go to Tips.md to find sources where you can find Server name and Country
  • Assign to each Id a Company, number of Nodes, Country and Server Name
  • Read your dataset using RStudio or Jupyter
  • Use Jupyter to perform some exploration of your dataset
  • Use RStudio to perform some visualisation of your dataset:
    1. Install and activate ggvis and ggplot2 packages from the CRAN
    2. Use geom_dotplot function for plotting. Sort the graph by Company per Nodes.
  • Question: How many companies use {500, 1500} nodes? Name the companies while visualizing the graph.
| Id | Company      | Nodes | Server      | Version | IP  |
|----|--------------|-------|-------------|---------|-----|
| 1  | Adobe        | 3     | Apache      | NA      | 193 |
| 2  | Crowdmedia   | 5     | Apache      | NA      | 88  |
| 3  | Beebler      | 14    | Nginx       | 1.11.9  | 54  |
| 4  | Bixolabs     | 20    | Nginx       | 1.14.0  | 50  |
| 5  | Careers      | 15    | Nginx       | NA      | 185 |
| 6  | Contextweb   | 50    | Openresty   | NA      | 52  |
| 7  | Criteo       | 2000  | Nginx       | NA      | 178 |
| 8  | Ebay         | 532   | NA          | NA      | 66  |
| 9  | Facebook     | 1400  | NA          | NA      | 31  |
| 10 | Infochimps   | 30    | Nginx       | NA      | 23  |
| 11 | Lastfm       | 100   | Nginx       | NA      | 64  |
| 12 | Mercadolibre | 20    | Tengine     | NA      | 54  |
| 13 | Openneptune  | 200   | Apache      | NA      | 103 |
| 14 | Quantcast    | 3000  | Apache      | NA      | 34  |
| 15 | Rackspace    | 30    | Akamaighost | NA      | 173 |
| 16 | Rakuten      | 69    | Akamaighost | NA      | 203 |
| 17 | Spotify      | 1650  | Nginx       | NA      | 104 |
| 18 | Telenav      | 60    | CentOS      | 2.4.6   | 35  |
| 19 | Worldlingo   | 44    | Nginx       | NA      | 204 |
  • Your dataset should render like this in RStudio for dataset = {2, 21}.
| Id | Company           | Nodes | Country | Server      |
|----|-------------------|-------|---------|-------------|
| 1  | Linkedin.com      | 4100  | USA     | Play        |
| 2  | Facebook.com      | 1400  | USA     | NA          |
| 3  | NetSeer.com       | 1050  | USA     | Nginx       |
| 4  | Ebay.com          | 532   | USA     | NA          |
| 5  | CRS4.com          | 400   | USA     | Nginx       |
| 6  | Powerset          | 400   | USA     | NA          |
| 7  | Aknowledge.com    | 400   | USA     | NA          |
| 8  | Neptune.com       | 200   | UK      | Cloudflare  |
| 9  | Aol.com           | 1400  | CHE     | ATS         |
| 10 | Immobi.com        | 150   | GER     | Apache      |
| 11 | FOX.com           | 140   | USA     | AkamaiGHost |
| 12 | Specificmedia.com | 20    | USA     | Apache      |
| 13 | Search.wikia.com  | 125   | USA     | Apache      |
| 14 | Ecircle.com       | 120   | Germany | Apache      |
| 15 | Spotify.com       | 120   | USA     | Nginx       |
| 16 | A9.com            | 69    | USA     | Server      |
| 17 | Ara.com.tr        | 100   | USA     | NA          |
| 18 | Cornell.com       | 100   | USA     | NA          |
| 19 | Last.fm           | 100   | USA     | Nginx       |
| 20 | Tint.com          | 94    | USA     | Nginx       |
see dataset

Id, Company, Nodes, Country, Server
1, Linkedin.com, 4100, USA, Play
2, Facebook.com, 1400, USA, NA
3, NetSeer.com, 1050, USA, Nginx
4, Ebay.com, 532, USA, NA
5, Crs4.com, 400, USA, Nginx
6, Powerset, 400, USA, NA
7, Aknowledge.com, 400, USA, NA
8, Neptune.com, 200, UK, Cloudflare
9, Aol.com, 1400, CHE, ATS
10, Immobi.com, 150, GER, Apache
11, Fox.com, 140, USA, AkamaiGHost
12, Specificmedia.com, 20, USA, Apache
13, Search.wikia.com, 125, USA, Apache
14, Ecircle.com, 120, Germany, Apache
15, Spotify.com, 120, USA, Nginx
16, A9.com, 69, USA, Server
17, Ara.com.tr, 100, USA, NA
18, Cornell.com, 100, USA, NA
19, Last.fm, 100, USA, Nginx
20, Tint.com, 94, USA, Nginx

See graph

isaac-arnault-datavisualization-using-R-17.png

See answer

The graph shows that 4 companies use 500 to 1500 nodes: Ebay, NetSeer, Facebook, Aol.
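You can also confirm this answer programmatically; a minimal sketch using the node counts from the dataset above:

```r
# The 20-row dataset from the exercise, entered inline
MyData <- data.frame(
  Company = c("Linkedin.com", "Facebook.com", "NetSeer.com", "Ebay.com",
              "CRS4.com", "Powerset", "Aknowledge.com", "Neptune.com",
              "Aol.com", "Immobi.com", "FOX.com", "Specificmedia.com",
              "Search.wikia.com", "Ecircle.com", "Spotify.com", "A9.com",
              "Ara.com.tr", "Cornell.com", "Last.fm", "Tint.com"),
  Nodes = c(4100, 1400, 1050, 532, 400, 400, 400, 200, 1400, 150,
            140, 20, 125, 120, 120, 69, 100, 100, 100, 94)
)

# Companies whose node count falls in the 500-1500 range
hits <- subset(MyData, Nodes >= 500 & Nodes <= 1500)
hits$Company   # Facebook.com, NetSeer.com, Ebay.com, Aol.com
nrow(hits)     # 4
```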

MIT License
Copyright (c) 2018 Isaac Arnault
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

You can ignore the outputs below if your scripts ran correctly.

  • Creating a dataset, saving it in .csv and reading it with Jupyter Notebook
see raw format

# Raw format
Id, Company, Nodes, Server, Version, IP
1, Adobe, 3, Apache, NA, 193
2, Crowdmedia, 5, Apache, NA, 88
3, Beebler, 14, Nginx, 1.11.9, 54
4, Bixolabs, 20, Nginx, 1.14.0, 50
5, Careers, 15, Nginx, NA, 185 
6, Contextweb, 50, Openresty, NA, 52
7, Criteo, 2000, Nginx, NA, 178
8, Ebay, 532, NA, NA, 66
9, Facebook, 1400, NA, NA, 31
10, Infochimps, 30, Nginx, NA, 23
11, Lastfm, 100, Nginx, NA, 64
12, Mercadolibre, 20, Tengine, NA, 54
13, Openneptune, 200, Apache, NA, 103
14, Quantcast, 3000, Apache, NA, 34
15, Rackspace, 30, Akamaighost, NA, 173
16, Rakuten, 69, Akamaighost, NA, 203
17, Spotify, 1650, Nginx, NA, 104
18, Telenav, 60, CentOS, 2.4.6, 35
19, Worldlingo, 44, Nginx, NA, 204

isaac-arnault-datavisualization-using-R-0.png

Data Exploration

  • Reading our dataset
first argument

# 1. Reading dataset using Jupyter Notebook
MyData <- read.csv(file="dataset_hadoop.csv")
MyData

isaac-arnault-datavisualization-using-R-2.png

  • Exploring our dataset
dim() function

# 2. Showing the dimensions of the dataset by variables (columns) and observations (rows)
MyData <- read.csv(file="dataset_hadoop.csv")

dim(MyData)

isaac-arnault-datavisualization-using-R-3.png

  • Exploring our dataset
str() function

# 3. Showing the structure of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

str(MyData)

isaac-arnault-datavisualization-using-R-4.png

  • Exploring our dataset
summary() function

# 4 Summary statistics on the variables (columns) of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

summary(MyData)

isaac-arnault-datavisualization-using-R-5.png

  • Exploring our dataset
colnames() function

# 5 Showing the name of each variable (column) of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

colnames(MyData)

isaac-arnault-datavisualization-using-R-6.png

  • Exploring our dataset
head() function

# 6  Showing the first 6 observations (rows) of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

head(MyData)

isaac-arnault-datavisualization-using-R-7.png

  • Exploring our dataset
tail() function

# 7  Showing the last 6 observations (rows) of the dataset
MyData <- read.csv(file="dataset_hadoop.csv")

tail(MyData)

isaac-arnault-datavisualization-using-R-8.png
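Beyond head() and tail(), it can help to count the missing values, since this dataset deliberately contains NA cells; a small sketch on an inline example (the same two calls apply to MyData after read.csv):

```r
# Missing-value audit on a tiny inline data frame;
# run the same calls on MyData once the .csv is loaded
df <- data.frame(Server  = c("Apache", NA, "Nginx"),
                 Version = c(NA, NA, "1.14.0"))
colSums(is.na(df))           # NAs per column: Server 1, Version 2
sum(complete.cases(df))      # rows with no missing value: 1
```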

Data Visualization

  • Visualizing our dataset
basic plotting

# ggvis: first visualization using `layer_points` function
MyData %>% 
  ggvis(~Server, ~Nodes) %>%
layer_points()

isaac-arnault-datavisualization-using-R-9.png

  • Visualizing our dataset
improved plotting

# ggvis: improving the above script by coloring the points per Company
MyData %>% 
  ggvis(~Server, ~Nodes) %>%
  layer_points() %>%
  layer_points(fill = ~Company)

isaac-arnault-datavisualization-using-R-10.png

  • Visualizing our dataset
IP per Nodes, diamond

# ggvis: third visualization using layer_points, diamond shape
MyData %>% 
  ggvis(~IP, ~Nodes) %>% 
  layer_points(size := 25, shape := "diamond", stroke := "red", fill := NA)

isaac-arnault-datavisualization-using-R-11.png

  • Visualizing our dataset
IP per Nodes, triangles

# ggvis: fourth visualization using layer_lines, layer_points, triangle shape
MyData %>%
  ggvis(~IP, ~Nodes, stroke := "skyblue",
        strokeOpacity := 0.5, strokeWidth := 5) %>%
  layer_lines() %>%
  layer_points(fill = ~Company,
               shape := "triangle-up",
               size := 300)

isaac-arnault-datavisualization-using-R-12.png

  • Visualizing our dataset
Company per Nodes using geom_bar

# ggplot: first visualization using geom_bar
g <- ggplot(MyData, aes(Company))
g + geom_bar(aes(fill=Nodes), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Using geom_bar", 
       subtitle="Company Vs Nodes", 
       caption="Author: Isaac Arnault")

isaac-arnault-datavisualization-using-R-14.png

  • Visualizing our dataset
IP per Nodes using geom_violin

# ggplot: second visualization using geom_violin
g <- ggplot(MyData, aes(IP, Nodes))
g + geom_violin(trim=FALSE, fill='#ffffff', color="black") + 
  labs(title="Using geom_violin" , 
       subtitle="IP Vs Nodes",
       caption="Author: Isaac Arnault",
       x="IP",
       y="Nodes")

isaac-arnault-datavisualization-using-R-15.png

  • Visualizing our dataset
IP per Nodes using geom_point, geom_segment

# ggplot: third visualization using geom_point, geom segment, shape: tomato
ggplot(MyData, aes(x=IP, y=Nodes)) + 
  geom_point(col="tomato2", size=3) + 
  geom_segment(aes(x=IP, 
                   xend=IP, 
                   y=min(Nodes), 
                   yend=max(Nodes)), 
               linetype="dashed", 
               size=0.1) +  
  labs(title="Using geom_points and geom_segment", 
       subtitle="IP Vs Nodes",
       caption="Author: Isaac Arnault")

isaac-arnault-datavisualization-using-R-16.png
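The exercise above calls for geom_dotplot, which the walkthrough does not show; a minimal sketch, assuming the dataset file from earlier and ggplot2 attached:

```r
library(ggplot2)
MyData <- read.csv(file = "dataset_hadoop.csv")

# geom_dotplot: one dot per observation, binned along the y axis;
# reorder() sorts the companies by their node count
ggplot(MyData, aes(x = reorder(Company, Nodes), y = Nodes)) +
  geom_dotplot(binaxis = "y", stackdir = "center", dotsize = 0.8) +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  labs(title = "Using geom_dotplot",
       subtitle = "Company per Nodes",
       x = "Company")
```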

# Reading our dataset
MyData <- read.csv(file="dataset_hadoop.csv",
header=TRUE, sep=",")
MyData
# ggvis: first visualization using `layer_points` function
MyData %>%
ggvis(~Server, ~Nodes) %>%
layer_points()
# ggvis: improving the above script by coloring the points per Company
MyData %>%
ggvis(~Server, ~Nodes) %>%
layer_points() %>%
layer_points(fill = ~Company)
# ggvis: third visualization using layer_points, shape: diamond
MyData %>%
ggvis(~IP, ~Nodes) %>%
layer_points(size := 25, shape := "diamond", stroke := "red", fill := NA)
# ggvis: fourth visualization using layer_lines, layer_points, triangle shape
MyData %>%
ggvis(~IP, ~Nodes, stroke := "skyblue",
strokeOpacity := 0.5, strokeWidth := 5) %>%
layer_lines() %>%
layer_points(fill = ~Company,
shape := "triangle-up",
size := 300)
# ggplot: first visualization using geom_bar
g <- ggplot(MyData, aes(Company))
g + geom_bar(aes(fill=Nodes), width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Using geom_bar",
subtitle="Company Vs Nodes",
caption="Author: Isaac Arnault")
# ggplot: second visualization using geom_violin
g <- ggplot(MyData, aes(IP, Nodes))
g + geom_violin(trim=FALSE, fill='#ffffff', color="black") +
labs(title="Using geom_violin" ,
subtitle="IP Vs Nodes",
caption="Author: Isaac Arnault",
x="IP",
y="Nodes")
# ggplot: third visualization using geom_point, geom segment, shape: tomato
ggplot(MyData, aes(x=IP, y=Nodes)) +
geom_point(col="tomato2", size=3) +
geom_segment(aes(x=IP,
xend=IP,
y=min(Nodes),
yend=max(Nodes)),
linetype="dashed",
size=0.1) +
labs(title="Using geom_points and geom_segment",
subtitle="IP Vs Nodes",
caption="Author: Isaac Arnault")
  • Install and load the packages we are using. Get them from the CRAN, or install and load them from your R script
install ggplot2

install.packages('ggplot2')

activate ggplot2

library('ggplot2')


install ggvis

install.packages('ggvis')

activate ggvis

library('ggvis')


Note: there is no CRAN package named BoxPlot. Box plots are available out of the box through base R's boxplot() function or ggplot2's geom_boxplot(), so no extra installation is needed.

  • Open your .csv in RStudio and explore the data. You have the file hosted on your PC.
Dataset_1 <- read.csv(file="filepath/myfile.csv", header=TRUE, sep=",")
  • Or you can read your file in RStudio if having the .csv hosted on a remote site
Dataset_1 <- read.csv("http://fileurl/myfile.csv", header=TRUE, sep=",")
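Because some cells in the dataset were filled with the literal string "NA", it can help to tell read.csv explicitly which tokens count as missing; a sketch using the same hypothetical file path as above:

```r
# Treat both the "NA" token and empty cells as missing values,
# and trim stray whitespace around fields (e.g. "USA ")
Dataset_1 <- read.csv(file = "filepath/myfile.csv",
                      header = TRUE, sep = ",",
                      na.strings = c("NA", ""),
                      strip.white = TRUE)
```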

  • Launching R using your terminal
see code

R

  • Launching Jupyter using your terminal
see code

jupyter notebook

  • If you don't have R and Jupyter installed, you can still use the following links:
    RStudio - Download RStudio Desktop (Open Source Licence)
    Jupyter Notebook - You don't need to install it (see Tips.md)

  • If you need to access / search data related to this gist and its exercise, please check:
    Who.is
    Wiki.apache.org
    Hostingchecker.com
