FIESTA-IoT Analytics



Introduction

To maximise the value that experimenters gain from the data extracted from FIESTA-IoT testbeds, it is important to provide data analysis tools and services. To this end, the Knowledge Acquisition Toolkit (KAT) web service is being developed for FIESTA-IoT to give data consumers open access to data analysis tools. It offers two main benefits: novice data consumers get tools that enable them to analyse their data and obtain useful information, while more experienced users get access to the most effective tools for a given data set.

For example, the service would provide documentation for beginner data consumers, with examples of data processing workflows. For more experienced users, advanced tools developed in academic institutions can be evaluated by a wide range of users, so that the most useful data analysis tool for a given data set can be identified. Furthermore, by providing data analysis as a web service, FIESTA-IoT enables a wider range of experimenters to access the FIESTA-IoT platform.

Background on Data Analysis methods

Data Pre-Processing Techniques

  • Digital Filtering – A finite impulse response (FIR) filter for lowpass, bandpass and highpass filtering. A lowpass filter removes high-frequency components, a highpass filter removes low-frequency components, and a bandpass filter keeps only the components between two cutoff frequencies. Filtering is important for removing unwanted signal components that corrupt the desired signal. The user needs to specify the type of filtering required (lowpass, bandpass or highpass) as well as the cutoff frequencies that define what is removed. Please see Figure 3 for an illustrative example.
  • Outlier Removal – Many machine learning algorithms are sensitive to outliers; that is, their performance degrades as the number of outliers increases. Accordingly, we provide an outlier tool based on winsorization [17]. The user defines only one parameter: the percentage of the highest and lowest values in the data set to be clipped (an example is shown in Figure 4).
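
As a rough illustration of the winsorization step described above, the following plain-Python sketch clips the lowest and highest fraction of a series to the nearest retained values. It is not the KAT implementation: the function name `winsorize`, the per-tail interpretation of the threshold and the rounding of the tail size are all assumptions made for the example, and the readings are made-up values.

```python
def winsorize(values, thresh):
    """Clip the lowest and highest `thresh` fraction of `values`.

    Assumption: `thresh` is applied to each tail separately and must be
    below 0.5; the KAT parameter may be interpreted differently.
    """
    if not 0 <= thresh < 0.5:
        raise ValueError("thresh must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(thresh * len(ordered))  # number of points clipped in each tail
    if k == 0:
        return list(values)
    low, high = ordered[k], ordered[-k - 1]
    # Values outside [low, high] are replaced by the nearest bound,
    # so the series keeps its length and ordering in time.
    return [min(max(v, low), high) for v in values]


# Made-up temperature readings with two obvious outliers.
readings = [21.0, 21.1, 21.2, 95.0, 21.3, 21.1, -40.0, 21.2, 21.0, 21.4]
clipped = winsorize(readings, 0.1)
```

Unlike simply deleting outliers, clipping keeps one output row per input row, which matches the `{row_number}` style of result described for the pre-processing methods.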

Machine Learning Techniques

Supervised learning

Supervised learning seeks to identify a functional relationship in the data when the experimenter requires an input-output relationship for the data set being analysed. For example, the output data points Y may be obtained from a sensor measuring air pollution, while the input variable X may be the number of cars. One can then find a functional relationship (either linear or nonlinear) between the output Y and the input X. Such a problem is generally referred to as supervised learning: training is first carried out to identify the parameters of the functional form specified by the experimenter, so that inference or prediction can then be carried out. Supervised learning algorithms that will initially be included in the FIESTA-IoT Analytics platform include the following [18]:

  • Linear Regression – This supervised learning method seeks to find a linear functional relationship between the output Y and the set of input variables X_p (where the subscript p is the variable index). An advantage of linear regression is that the model is relatively simple to interpret while maintaining reasonable prediction performance. For example, suppose we wish to fit the linear model Y = aX, where a is the parameter that relates the input X to the output Y. The following inference (interpretation) can then be made: if the input variable changes by ∆X, the output changes by a∆X. A simple illustration of a linear regression model is shown in Figure 5.
  • K-Nearest Neighbors (K-NN) regression – While linear regression may provide an interpretable relationship between the input and output variables, its performance when predicting the output from the input data points may degrade. This may arise owing to a non-linear relationship with respect to the model parameters (note that transformations of the input variables themselves can be applied while still carrying out linear regression). Given the training data (X_train, Y_train), we seek to estimate the output Ŷ_test for the test data X_test (which is a subset of X_train) by averaging the K training outputs whose corresponding training inputs in X_train are nearest (in Euclidean distance) to X_test. An illustrative example of K-NN is provided in Figure 6.
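
The K-NN estimate described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (brute-force search, unweighted averaging); the function name `knn_predict` and the sample values are hypothetical and not taken from KAT.

```python
import math


def knn_predict(x_train, y_train, x_query, k):
    """Average the outputs of the k training points nearest to x_query."""
    if not 1 <= k <= len(x_train):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each training point's Euclidean distance to the query with its output.
    by_distance = sorted(
        (math.dist(x, x_query), y) for x, y in zip(x_train, y_train)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


# Made-up one-dimensional training data.
x_train = [(1.0,), (2.0,), (3.0,), (10.0,)]
y_train = [1.1, 1.9, 3.2, 9.8]
y_hat = knn_predict(x_train, y_train, (2.5,), k=2)  # mean of 1.9 and 3.2
```

A small K follows the training data closely, while a larger K gives a smoother but less local estimate.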

Unsupervised learning

Unsupervised learning seeks to identify structures and patterns in the data set where an explicit input-output relationship is not known (or not required). In such problems, where only the input data is available with no explicit output, the challenge is to identify groups in the data, or to cluster it, in order to understand the relationship between the variables (this is often carried out in exploratory data analysis). To this end, we have also initially included the following unsupervised learning algorithms:

  • K-Means Clustering – This algorithm seeks to identify clusters, or subgroups, of the data points being analysed. Clustering is often performed as part of exploratory analysis of data sets, where the experimenter seeks to identify group structure within the data. An example of clustering analysis given two sensors is shown in Figure 7, where each point corresponds to one observation time index. The output of the two sensors can therefore be assigned to one of the two clusters shown in Figure 7, thus providing unsupervised classification of data. K-means clustering seeks to identify a set of K non-overlapping clusters, where each data point is assigned to exactly one of the K clusters. The algorithm achieves this by iteratively re-estimating the clusters so that the total within-cluster distance (the distance between each point and the centre of its cluster) is minimized.
  • Principal Component Analysis (PCA) – Given a large number of sensors, it is often not practical (or, for a very large number of sensors, not possible) to visualize the structure in the data. Accordingly, methods such as PCA seek to identify a lower dimensional representation of the data that captures the most significant variability of the data. For example, consider Figure 8, where there are two sensors such that the data has a large variation along one direction (illustrated by the red arrow). By projecting the original data of the two sensors onto this direction of highest variation (the projected values are referred to as the principal component scores), we can obtain a lower dimensional representation of the data, thus enabling more effective analysis of the original data set.
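
To make the iterative nature of K-means concrete, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) in plain Python. The initialisation from the first K points is an assumption chosen to keep the example deterministic; the source does not say how KAT initialises or stops the algorithm, and the two-sensor points are made up.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]  # deterministic start (assumption)
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels


# Made-up readings from two sensors; each tuple is one observation time.
points = [(1.0, 1.1), (5.0, 5.2), (1.2, 0.9), (4.8, 5.1), (0.9, 1.0), (5.1, 4.9)]
centroids, labels = kmeans(points, k=2)
```

The returned `labels` list has one entry per observation, which corresponds to the `{row_number}, {Cluster Label}` style of result.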

Methods and Parameters

Pre-Processing Methods ("cleaning the data")

| Method | Parameter | Description | Result | Possible Subsequent Methods |
| --- | --- | --- | --- | --- |
| Outlier | Thresh | Value between 0 and 1 that selects the fraction of tail values to remove from the ordered time series data. | {row_number}, {sensor1_outliers_clipped}, {sensor2_outliers_clipped}... {sensorN_outliers_clipped} | Preprocessing, Unsupervised, Supervised, Other Methods |
| FilterData | Type | Select between “B” (bandpass filter), “L” (lowpass filter) and “H” (highpass filter). | {row_number}, {sensor1_post_filter}, {sensor2_post_filter}... {sensorN_post_filter} | |
| | cutoff_1 | The first normalised cutoff frequency for the selected filter, between 0 and 0.5. | | |
| | cutoff_2 | For the bandpass filter only: the second cutoff frequency. | | |
| | numtaps | Filter length. Usually 30. | | |
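
The filter parameters above can be related to a concrete FIR design. The sketch below builds a lowpass filter with the windowed-sinc method and applies it by direct convolution. It is an illustration only: the source does not state which design method KAT uses, the Hamming window is a choice made here, and the reading of the normalised cutoff as cycles per sample (so that 0.5 is the Nyquist frequency) is an assumption.

```python
import math


def lowpass_taps(cutoff, numtaps=30):
    """Windowed-sinc lowpass FIR design with a Hamming window.

    Assumption: `cutoff` is in cycles per sample, 0 < cutoff < 0.5.
    """
    if not 0 < cutoff < 0.5:
        raise ValueError("cutoff must be between 0 and 0.5")
    if numtaps < 2:
        raise ValueError("numtaps must be at least 2")
    mid = (numtaps - 1) / 2
    taps = []
    for n in range(numtaps):
        t = n - mid
        # Ideal lowpass impulse response, sin(2*pi*fc*t) / (pi*t).
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (numtaps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]  # unit gain at zero frequency


def fir_filter(signal, taps):
    """Direct-form convolution; samples before the start count as zero."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(taps):
            if i - j >= 0:
                acc += h * signal[i - j]
        out.append(acc)
    return out


# A slow sine wave corrupted by a fast alternating component (made-up data).
noisy = [math.sin(2 * math.pi * 0.02 * i) + 0.5 * (-1) ** i for i in range(200)]
smooth = fir_filter(noisy, lowpass_taps(cutoff=0.1, numtaps=30))
```

The first `numtaps - 1` output samples are affected by the zero padding at the start, so they are usually discarded or treated with care.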

Unsupervised Learning

| Method | Parameter | Description | Result | Subsequent Methods |
| --- | --- | --- | --- | --- |
| Kmeans | NumClusters | The number of clusters to select. An integer value. | {row_number}, {Cluster Label} | |
| PCA | Mode | Select either ExpVar, the explained variance of the different principal components, or Comp, the principal component loadings, that is, the directions in the data that correspond to the highest variance. | | |
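
For the two-sensor case, both PCA outputs mentioned above (explained variance and component direction) can be computed in closed form from the 2x2 covariance matrix. The sketch below is a hand-rolled illustration with made-up data; the function name `pca_2d` and the return format are assumptions and do not reflect how KAT formats its ExpVar or Comp results.

```python
import math


def pca_2d(xs, ys):
    """PCA for two variables via the eigen-decomposition of the 2x2
    sample covariance matrix.

    Returns (explained_variance_ratio, first_component).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector belonging to the largest eigenvalue.
    if sxy != 0:
        vx, vy = lam1 - syy, sxy
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    total = lam1 + lam2  # zero only if both series are constant
    return [lam1 / total, lam2 / total], (vx / norm, vy / norm)


# Made-up readings where sensor 2 is roughly twice sensor 1.
sensor1 = [1.0, 2.0, 3.0, 4.0, 5.0]
sensor2 = [2.1, 3.9, 6.2, 7.8, 10.1]
explained, component = pca_2d(sensor1, sensor2)
```

Here almost all of the variance lies along a single direction with slope close to 2, so one principal component is enough to describe both sensors.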

Supervised Learning

| Method | Parameter | Description | Result | Possible Subsequent Methods |
| --- | --- | --- | --- | --- |
| LinReg | Type | Select between “Param”, the estimated parameters of the regression model, and “Predict”, the estimate of the output given the test data. | {row_number}, {coefficients} | |
| | Dependant | Select the column index corresponding to the dependent variable. | | |
| | Ratio | Select the ratio of the training data to test data. Value between 0 and 1. | | |
| KNNreg | Num | Selects the number of nearest neighbours. | {row_number}, {predicted}, {true} | |
| | Dependant | Select the column index corresponding to the dependent variable. | | |
| | Ratio | Select the ratio of the training data to test data. Value between 0 and 1. | | |
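
The interplay between a train/test ratio and a regression fit can be sketched as follows. This is a hedged illustration, not KAT's code: it assumes that Ratio is the fraction of rows used for training, that rows are split in their original order, and that the model includes an intercept (y = a*x + b). The helper names and the car/pollution numbers are made up.

```python
def split_rows(xs, ys, ratio):
    """Use the first `ratio` fraction of rows for training, the rest for testing."""
    cut = int(ratio * len(xs))
    return (xs[:cut], ys[:cut]), (xs[cut:], ys[cut:])


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one input variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


cars = [10, 20, 30, 40, 50, 60, 70, 80]                       # input X (made up)
pollution = [12.0, 22.5, 31.0, 42.0, 51.5, 62.0, 71.0, 82.5]  # output Y (made up)

(train_x, train_y), (test_x, test_y) = split_rows(cars, pollution, 0.75)
a, b = fit_line(train_x, train_y)           # comparable to the "Param" output
predicted = [a * x + b for x in test_x]     # comparable to the "Predict" output
```

Comparing `predicted` with `test_y` gives a simple check of how well the fitted model generalises to rows it was not trained on.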

Other Methods

| Method | Parameter | Description | Result | Possible Subsequent Methods |
| --- | --- | --- | --- | --- |
| FFT | N/A | N/A | {frequency}, {sensor1_fft}, {sensor2_fft} | |
| Periodogram | N/A | N/A | ??? | |
| Correlation | N/A | N/A | {sensor1}, {sensor2_correlation}, {sensor1_correlation} | |
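
As a rough guide to what frequency-domain and correlation outputs look like, the sketch below computes a naive discrete Fourier transform and a Pearson correlation coefficient in plain Python. Both are stand-ins chosen for clarity: a real service would use a fast FFT routine, and the source does not state which correlation measure KAT computes, so Pearson correlation is an assumption.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Naive O(N^2) discrete Fourier transform.

    Returns (frequency, magnitude) pairs for the non-negative
    frequencies, with frequency in cycles per sample.
    """
    n = len(signal)
    result = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        result.append((k / n, abs(coeff)))
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# A sine wave with exactly 4 cycles in 32 samples (made-up data).
sensor1 = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
sensor2 = [2 * x + 1 for x in sensor1]

spectrum = dft_magnitudes(sensor1)
peak_frequency, peak_magnitude = max(spectrum, key=lambda fm: fm[1])
correlation = pearson(sensor1, sensor2)
```

The spectrum peaks at the frequency of the sine wave, and the correlation is close to 1 because the second series is a linear function of the first.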

Data Input/Output

Input Output
CSV CSV

Standard Web Service

kat-restful-1

URL:

POST http://{serverRoot}/AnalyseData

Request body:

{
	"Method": ["Non","FFT","Non"], 
	"Parameters":["B, 0.1, 0.3, 30", "0,1", "5"],
	"DataPointer":["http://iot1.ee.surrey.ac.uk/fiesta/data/samples/TestResample.csv"]	
}
  • Method: A sequence of methods to apply. Note that the methods cannot be combined in an arbitrary order.
  • Parameters: One parameter string per method, declared in the same order as the sequence of methods, so that the array index of each method matches the array index of its parameters.
  • DataPointer: The URL of the target dataset, whose response is in CSV format. Furthermore, the data must be numerical (float, int, etc.).
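
A client could assemble such a request along the following lines. This is a sketch using only the Python standard library: the host name `kat.example.org` is a placeholder for `{serverRoot}`, the `Content-Type` header is an assumption (the source does not specify one), and the helper name `build_kat_request` is hypothetical.

```python
import json
import urllib.request


def build_kat_request(server_root, methods, parameters, data_pointer):
    """Build (but do not send) a POST request for the AnalyseData resource."""
    if len(methods) != len(parameters):
        # Method i is paired with parameter string i.
        raise ValueError("Method and Parameters must have the same length")
    body = {
        "Method": methods,
        "Parameters": parameters,
        "DataPointer": [data_pointer],
    }
    return urllib.request.Request(
        url=f"http://{server_root}/AnalyseData",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption
        method="POST",
    )


request = build_kat_request(
    "kat.example.org",  # placeholder for {serverRoot}
    ["FFT"],
    [""],
    "http://iot1.ee.surrey.ac.uk/fiesta/data/samples/TestResample.csv",
)

# Sending requires a reachable KAT server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     csv_text = response.read().decode("utf-8")
```

Checking the array lengths before sending catches the most likely mistake, a method without a matching parameter entry.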

User Interface

An example of a possible layout for the KAT web service GUI is shown in Figure 1. In the top left-hand corner, the user specifies the input CSV data file to be processed along with the directory for storing the processed data. The user can then load the data into the KAT web server and select the methods to apply and their parameters. The processed data can then be visualized in the right-hand panel of the GUI.

kat-gui

Dataset Delivery

There are several options for delivering the dataset (in a pre-defined structure) to the KAT tool:

  • A pointer to the dataset can be passed.
  • The dataset can be passed to the KAT tool with the request.
  • A pointer to the SPARQL endpoint can be passed together with the SPARQL query (in a pre-defined structure).

General Workflow

For the KAT to process a dataset, the dataset must be in CSV format, with the columns arranged as consecutive pairs of an observation data value and its corresponding timestamp:

| sensor1DataValue | sensor1Timestamp | sensor2DataValue | sensor2Timestamp |
| --- | --- | --- | --- |
| 21.0 | 2017-04-21T13:57:00Z | 22.2 | 2017-04-21T13:57:00Z |
| 21.1 | 2017-04-21T13:58:00Z | 22.3 | 2017-04-21T13:58:00Z |
| 21.2 | 2017-04-21T13:59:00Z | 22.2 | 2017-04-21T13:59:00Z |
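
The paired-column layout above can be read with a few lines of Python. The sketch assumes a comma-separated file with a header row and UTC timestamps ending in `Z`; the actual delimiter and header handling expected by KAT are not stated in the source, and the function name `read_kat_csv` is hypothetical.

```python
import csv
import io
from datetime import datetime

SAMPLE = """sensor1DataValue,sensor1Timestamp,sensor2DataValue,sensor2Timestamp
21.0,2017-04-21T13:57:00Z,22.2,2017-04-21T13:57:00Z
21.1,2017-04-21T13:58:00Z,22.3,2017-04-21T13:58:00Z
21.2,2017-04-21T13:59:00Z,22.2,2017-04-21T13:59:00Z
"""


def read_kat_csv(text):
    """Return {sensor_name: [(timestamp, value), ...]} from value/timestamp pairs."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    series = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Columns come in (value, timestamp) pairs: 0-1, 2-3, ...
        for i in range(0, len(header) - 1, 2):
            name = header[i].replace("DataValue", "")
            stamp = datetime.strptime(row[i + 1], "%Y-%m-%dT%H:%M:%SZ")
            series.setdefault(name, []).append((stamp, float(row[i])))
    return series


series = read_kat_csv(SAMPLE)
```

Converting each value with `float` also acts as a basic check that the data is numerical, and raises an error on any non-numeric cell.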

It is possible to invoke a SPARQL endpoint to return the result of a query in CSV format.

  1. An experimenter discovers sensor devices of interest
  2. Experimenter defines time interval for dataset
  3. Experimenter invokes KAT by passing:
    • list of SensorDevices
    • time interval
    • methods + parameters

Resource Discovery

A SPARQL query of the form below can be used to discover one or more SensingDevice resources.

Query 1: Resources within a geographical area (bounding box) measuring certain phenomena
PREFIX iot-lite: <http://purl.oclc.org/NET/UNIS/fiware/iot-lite#>
PREFIX m3-lite: <http://purl.org/iot/vocab/m3-lite#>
PREFIX ssn: <http://purl.oclc.org/NET/ssnx/ssn#>
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT  ?sensingDev
WHERE {
    ?sensingDev a m3-lite:EnergyMeter .
    ?sensingDev iot-lite:hasQuantityKind ?qk .
    ?qk a m3-lite:Power .
    ?sensingDev iot-lite:hasUnit ?unit .
    ?unit a m3-lite:Watt .
    ?sensingDev iot-lite:isSubSystemOf ?dev .
    ?dev a ssn:Device .
    ?dev ssn:onPlatform ?platform .
    ?platform geo:location ?point .
    ?point geo:lat ?lat .
    ?point geo:long ?long .    
  FILTER ( 
       (xsd:double(?lat) >= "0"^^xsd:double) 
    && (xsd:double(?lat) <= "60"^^xsd:double) 
    && ( xsd:double(?long) < "10"^^xsd:double)  
    && ( xsd:double(?long) > "-6"^^xsd:double)
    )     
}ORDER BY ASC(?sensingDev)

Dataset retrieval

Approach 1:

A SPARQL query can be sent repeatedly to the SPARQL endpoint, once per sensingDevice, to retrieve its data values. For example:

PREFIX iot-lite: <http://purl.oclc.org/NET/UNIS/fiware/iot-lite#>
PREFIX m3-lite: <http://purl.org/iot/vocab/m3-lite#>
PREFIX ssn: <http://purl.oclc.org/NET/ssnx/ssn#>
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dul: <http://www.loa.istc.cnr.it/ontologies/DUL.owl#>
PREFIX time: <http://www.w3.org/2006/time#>

SELECT  ?dataValue ?dateTime
WHERE {
?observation ssn:observedBy <http://smart-ics.surrey.ac.uk/fiesta-iot/resource/SensingDevice1> .
?observation ssn:observationResult ?sensorOutput .
?sensorOutput ssn:hasValue ?obsValue .
?obsValue dul:hasDataValue ?dataValue .
?observation ssn:observationSamplingTime ?instant .
?instant time:inXSDDateTime ?dateTime .
FILTER ( 
       ( xsd:dateTime(?dateTime) > xsd:dateTime("2017-05-05T14:10:00Z"))
    && ( xsd:dateTime(?dateTime) < xsd:dateTime("2017-05-05T14:20:00Z"))
    ) .
}order by ASC(?dateTime)
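
Repeating the query for each device amounts to filling the device URI and time interval into a template. The following sketch shows one way to generate the query strings; the template mirrors the query above, while the names `QUERY_TEMPLATE` and `queries_for_devices` are hypothetical. Sending each query to the endpoint is left out.

```python
# Doubled braces are literal braces for str.format().
QUERY_TEMPLATE = """PREFIX ssn: <http://purl.oclc.org/NET/ssnx/ssn#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dul: <http://www.loa.istc.cnr.it/ontologies/DUL.owl#>
PREFIX time: <http://www.w3.org/2006/time#>

SELECT ?dataValue ?dateTime
WHERE {{
  ?observation ssn:observedBy <{device}> .
  ?observation ssn:observationResult ?sensorOutput .
  ?sensorOutput ssn:hasValue ?obsValue .
  ?obsValue dul:hasDataValue ?dataValue .
  ?observation ssn:observationSamplingTime ?instant .
  ?instant time:inXSDDateTime ?dateTime .
  FILTER (
       ( xsd:dateTime(?dateTime) > xsd:dateTime("{start}"))
    && ( xsd:dateTime(?dateTime) < xsd:dateTime("{end}"))
  ) .
}} ORDER BY ASC(?dateTime)"""


def queries_for_devices(devices, start, end):
    """Return one retrieval query per sensing device URI."""
    return {
        device: QUERY_TEMPLATE.format(device=device, start=start, end=end)
        for device in devices
    }


devices = [
    "http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power",
    "http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power",
]
queries = queries_for_devices(devices, "2017-05-05T14:10:00Z", "2017-05-05T14:20:00Z")
```

This approach costs one round trip per device, which is the main motivation for the single-query approaches that follow.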

Approach 2:

Another approach is to retrieve a dataset from all sensor devices within a time period in a single query, by limiting the sensing devices to those returned by the discovery process.

PREFIX iot-lite: <http://purl.oclc.org/NET/UNIS/fiware/iot-lite#>
PREFIX m3-lite: <http://purl.org/iot/vocab/m3-lite#>
PREFIX ssn: <http://purl.oclc.org/NET/ssnx/ssn#>
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dul: <http://www.loa.istc.cnr.it/ontologies/DUL.owl#>
PREFIX time: <http://www.w3.org/2006/time#>

SELECT  ?sensingDevice ?dataValue ?dateTime
WHERE {
?observation ssn:observedBy ?sensingDevice .
VALUES ?sensingDevice { 
<http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power>
<http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power>
} .
?observation ssn:observationResult ?sensorOutput .
?sensorOutput ssn:hasValue ?obsValue .
?obsValue dul:hasDataValue ?dataValue .
?observation ssn:observationSamplingTime ?instant .
?instant time:inXSDDateTime ?dateTime .
  FILTER ( 
       ( xsd:dateTime(?dateTime) > xsd:dateTime("2017-05-05T14:10:00Z"))
    && ( xsd:dateTime(?dateTime) < xsd:dateTime("2017-05-05T14:20:00Z"))
    ) .
}ORDER BY ?sensingDevice ASC(?dateTime)   

# timezone might affect result if the reading is not in that format
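
The `VALUES` block in the query above can be generated from the list of URIs returned by the discovery step. The helper below is a hypothetical sketch; the character check is a simple safeguard added here so that a malformed URI cannot break out of the `<...>` delimiters, not something prescribed by the source.

```python
def values_clause(variable, uris):
    """Render a SPARQL VALUES block restricting `variable` to the given URIs."""
    if not uris:
        raise ValueError("at least one URI is required")
    for uri in uris:
        # Reject characters that are not allowed inside a SPARQL IRI reference.
        if any(ch in uri for ch in '<>"{}|^`\\ '):
            raise ValueError(f"invalid character in URI: {uri!r}")
    lines = "\n".join(f"<{uri}>" for uri in uris)
    return f"VALUES ?{variable} {{\n{lines}\n}} ."


discovered = [
    "http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power",
    "http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power",
]
clause = values_clause("sensingDevice", discovered)
```

The resulting text can be placed directly inside the `WHERE` block, in the position where the hand-written `VALUES` block appears in the query above.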

Approach 3:

Another approach is to merge the discovery and the retrieval of a dataset for a set of sensing devices into a single query.

PREFIX iot-lite: <http://purl.oclc.org/NET/UNIS/fiware/iot-lite#>
PREFIX m3-lite: <http://purl.org/iot/vocab/m3-lite#>
PREFIX ssn: <http://purl.oclc.org/NET/ssnx/ssn#>
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dul: <http://www.loa.istc.cnr.it/ontologies/DUL.owl#>
PREFIX time: <http://www.w3.org/2006/time#>
PREFIX sics: <http://smart-ics.ee.surrey.ac.uk/fiesta-iot#>

SELECT  ?sensingDevice ?dataValue ?dateTime
WHERE {
    ?sensingDevice a m3-lite:EnergyMeter .
    ?sensingDevice iot-lite:hasQuantityKind ?qk .
    ?qk a m3-lite:Power .
    ?sensingDevice iot-lite:hasUnit ?unit .
    ?unit a m3-lite:Watt .
    ?sensingDevice iot-lite:isSubSystemOf ?device .
    ?device a ssn:Device .
    ?device ssn:onPlatform ?platform .
    ?platform geo:location ?point .
    ?point geo:lat ?lat .
    ?point geo:long ?long .
    ?observation ssn:observedBy ?sensingDevice .    
    ?observation ssn:observationResult ?sensorOutput .
    ?sensorOutput ssn:hasValue ?obsValue .
    ?obsValue dul:hasDataValue ?dataValue .
    ?observation ssn:observationSamplingTime ?instant .
    ?instant time:inXSDDateTime ?dateTime .
    #set interval
    FILTER ( 
         ( xsd:dateTime(?dateTime) > xsd:dateTime("2017-05-05T14:10:00Z"))
      && ( xsd:dateTime(?dateTime) < xsd:dateTime("2017-05-05T14:20:00Z"))
      ) . 
    #set location bounding box 
    FILTER ( 
         (xsd:double(?lat) >= "0"^^xsd:double) 
      && (xsd:double(?lat) <= "60"^^xsd:double) 
      && ( xsd:double(?long) < "10"^^xsd:double)  
      && ( xsd:double(?long) > "-6"^^xsd:double)
      )  .   
} ORDER BY ?sensingDevice ASC(?dateTime)  
LIMIT 100000
  

Example result:

sensingDevice, dataValue, dateTime
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:11:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:12:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:13:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:14:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:15:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:16:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:17:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:18:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:19:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,9.82058E-1,2017-05-05T14:11:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,9.24422E-1,2017-05-05T14:12:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,8.06907E-1,2017-05-05T14:13:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,7.59163E-1,2017-05-05T14:14:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,7.81057E-1,2017-05-05T14:15:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,8.71402E-1,2017-05-05T14:16:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,8.47793E-1,2017-05-05T14:17:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,1.096935E0,2017-05-05T14:18:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,9.27258E-1,2017-05-05T14:19:00Z  
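
The result above has one row per observation, whereas the KAT input layout has one value/timestamp column pair per sensor. A conversion between the two could look like the sketch below. It assumes that rows of different devices can be paired by position (true here because both devices report once per minute) and truncates to the shortest series; the function name `pivot_result` and the generated column names are illustrative.

```python
import csv
import io


def pivot_result(csv_text):
    """Turn (sensingDevice, dataValue, dateTime) rows into the wide layout
    with one (value, timestamp) column pair per device."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    by_device = {}
    for row in reader:
        by_device.setdefault(row["sensingDevice"], []).append(
            (float(row["dataValue"]), row["dateTime"].strip())
        )
    devices = list(by_device)  # keeps the order of first appearance
    rows = min(len(v) for v in by_device.values())  # truncate to shortest series
    header = []
    for i in range(1, len(devices) + 1):
        header += [f"sensor{i}DataValue", f"sensor{i}Timestamp"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for r in range(rows):
        line = []
        for device in devices:
            value, stamp = by_device[device][r]
            line += [value, stamp]
        writer.writerow(line)
    return out.getvalue()


# First rows of each device from the example result.
result = """sensingDevice, dataValue, dateTime
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:11:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-001-power,0.0E0,2017-05-05T14:12:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,9.82058E-1,2017-05-05T14:11:00Z
http://smart-ics.surrey.ac.uk/fiesta-iot/resource/sc-sics-sp-002-power,9.24422E-1,2017-05-05T14:12:00Z
"""
wide = pivot_result(result)
```

If devices report at different rates, pairing by position is no longer valid and the series would need to be aligned or resampled on their timestamps first.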

FIESTA-IoT Web Service

Based on the above analysis, the FIESTA-IoT Web Service can be invoked as shown below:

kat-restful-2

URL:

POST http://{serverRoot}/AnalyseData

Request body (1):

{
    "Method": ["FFT"],
    "Parameters":[""],
    "SPARQLquery":["PREFIX iot-lite: <http://purl.oclc.org/NET/UNIS/fiware/iot-lite#>\r\nPREFIX m3-lite: <http://purl.org/iot/vocab/m3-lite#>\r\nPREFIX ssn: <http://purl.oclc.org/NET/ssnx/ssn#>\r\nPREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>\r\nPREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>\r\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\r\nPREFIX dul: <http://www.loa.istc.cnr.it/ontologies/DUL.owl#>\r\nPREFIX time: <http://www.w3.org/2006/time#>\r\nPREFIX sics: <http://smart-ics.ee.surrey.ac.uk/fiesta-iot#>\r\n\r\nSELECT  ?sensingDevice ?dataValue ?dateTime\r\nWHERE {\r\n    ?sensingDevice a m3-lite:EnergyMeter .\r\n    ?sensingDevice iot-lite:hasQuantityKind ?qk .\r\n    ?qk a m3-lite:Power .\r\n    ?sensingDevice iot-lite:hasUnit ?unit .\r\n    ?unit a m3-lite:Watt .\r\n    ?sensingDevice iot-lite:isSubSystemOf ?device .\r\n    ?device a ssn:Device .\r\n    ?device ssn:onPlatform ?platform .\r\n    ?platform geo:location ?point .\r\n    ?point geo:lat ?lat .\r\n    ?point geo:long ?long .\r\n    ?observation ssn:observedBy ?sensingDevice .    \r\n    ?observation ssn:observationResult ?sensorOutput .\r\n    ?sensorOutput ssn:hasValue ?obsValue .\r\n    ?obsValue dul:hasDataValue ?dataValue .\r\n    ?observation ssn:observationSamplingTime ?instant .\r\n    ?instant time:inXSDDateTime ?dateTime .\r\n    FILTER ( \r\n         ( xsd:dateTime(?dateTime) > xsd:dateTime(\"2017-05-01T12:10:00Z\"))\r\n      && ( xsd:dateTime(?dateTime) < xsd:dateTime(\"2017-05-01T14:20:00Z\"))\r\n      ) .  \r\n  FILTER ( \r\n       (xsd:double(?lat) >= \"0\"^^xsd:double) \r\n    && (xsd:double(?lat) <= \"60\"^^xsd:double) \r\n    && ( xsd:double(?long) < \"10\"^^xsd:double)  \r\n    && ( xsd:double(?long) > \"-6\"^^xsd:double)\r\n   )  .   \r\n}ORDER BY ?sensingDevice ASC(?dateTime)"],
    "SPARQLendpoint":["http://smart-ics.ee.surrey.ac.uk/srd/sparql/test"]
}
  • Method: A sequence of methods to apply. Note that the methods cannot be combined in an arbitrary order.
  • Parameters: One parameter string per method, declared in the same order as the sequence of methods, so that the array index of each method matches the array index of its parameters.
  • SPARQLquery: The SPARQL query (with double quotes and line breaks escaped).
  • SPARQLendpoint: The URL of the SPARQL endpoint.
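
Escaping the query by hand is error-prone, so it may be easier to let a JSON library do it. The sketch below builds the request body with Python's `json` module, which escapes double quotes and line breaks automatically. The helper name `build_sparql_body` is hypothetical and the query is shortened for readability.

```python
import json


def build_sparql_body(methods, parameters, query, endpoint):
    """Serialise a request body; json.dumps escapes quotes and newlines."""
    if len(methods) != len(parameters):
        raise ValueError("Method and Parameters must have the same length")
    body = {
        "Method": methods,
        "Parameters": parameters,
        "SPARQLquery": [query],
        "SPARQLendpoint": [endpoint],
    }
    return json.dumps(body, indent=4)


# Shortened query, for illustration only.
query = """PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?dateTime
WHERE {
  ?instant ?p ?dateTime .
  FILTER ( xsd:dateTime(?dateTime) > xsd:dateTime("2017-05-01T12:10:00Z") )
}"""

body_text = build_sparql_body(
    ["FFT"],
    [""],
    query,
    "http://smart-ics.ee.surrey.ac.uk/srd/sparql/test",
)
```

Parsing `body_text` back with `json.loads` returns the original query string unchanged, which is a quick way to confirm that the escaping is lossless.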