# Gist Yashwants19/4967c5939ba054bc868415d1e3e5c956, created August 7, 2020.
#' @title Large Margin Nearest Neighbors (LMNN)
#'
#' @description
#' An implementation of Large Margin Nearest Neighbors (LMNN), a distance
#' learning technique. Given a labeled dataset, this learns a transformation of
#' the data that improves k-nearest-neighbor performance; this can be useful as
#' a preprocessing step.
#'
#' @param input Input dataset to run LMNN on (numeric matrix).
#' @param batch_size Batch size for mini-batch SGD. Default value "50"
#' (integer).
#' @param center Perform mean-centering on the dataset. It is useful when
#' the centroid of the data is far from the origin. Default value "FALSE"
#' (logical).
#' @param distance Initial distance matrix to be used as a starting point
#' (numeric matrix).
#' @param k Number of target neighbors to use for each datapoint. Default
#' value "1" (integer).
#' @param labels Labels for input dataset (integer row).
#' @param linear_scan Don't shuffle the order in which data points are
#' visited for SGD or mini-batch SGD. Default value "FALSE" (logical).
#' @param max_iterations Maximum number of iterations for L-BFGS (0
#' indicates no limit). Default value "100000" (integer).
#' @param normalize Use a normalized starting point for optimization. It is
#' useful when points are far apart, or when SGD is returning NaN.
#' Default value "FALSE" (logical).
#' @param optimizer Optimizer to use; 'amsgrad', 'bbsgd', 'sgd', or
#' 'lbfgs'. Default value "amsgrad" (character).
#' @param passes Maximum number of full passes over the dataset for AMSGrad,
#' BB_SGD and SGD. Default value "50" (integer).
#' @param print_accuracy Print accuracies on the initial and transformed
#' datasets. Default value "FALSE" (logical).
#' @param range Number of iterations after which impostors need to be
#' recalculated. Default value "1" (integer).
#' @param rank Rank of the distance matrix to be optimized. Default value "0"
#' (integer).
#' @param regularization Regularization for the LMNN objective function.
#' Default value "0.5" (numeric).
#' @param seed Random seed. If 0, 'std::time(NULL)' is used. Default
#' value "0" (integer).
#' @param step_size Step size for AMSGrad, BB_SGD and SGD (alpha). Default
#' value "0.01" (numeric).
#' @param tolerance Maximum tolerance for termination of AMSGrad, BB_SGD,
#' SGD or L-BFGS. Default value "1e-07" (numeric).
#' @param verbose Display informational messages and the full list of
#' parameters and timers at the end of execution. Default value "FALSE"
#' (logical).
#'
#' @return A list with several components:
#' \item{centered_data}{Output matrix for mean-centered dataset (numeric
#' matrix).}
#' \item{output}{Output matrix for learned distance matrix (numeric
#' matrix).}
#' \item{transformed_data}{Output matrix for transformed dataset (numeric
#' matrix).}
#'
#' @details
#' This program implements Large Margin Nearest Neighbors, a distance learning
#' technique. The method seeks to improve k-nearest-neighbor classification on
#' a dataset. It employs the strategy of reducing the distance between
#' similarly labeled data points (a.k.a. target neighbors) and increasing the
#' distance between differently labeled points (a.k.a. impostors), using
#' standard optimization techniques over the gradient of the distance between
#' data points.
#'
#' To work, this algorithm needs labeled data. It can be given as the last row
#' of the input dataset (specified with "input"), or alternatively as a separate
#' matrix (specified with "labels"). Additionally, a starting point for
#' optimization (specified with "distance") can be given, having (r x d)
#' dimensionality. Here r should satisfy 1 <= r <= d; consequently, a low-rank
#' matrix will be optimized. Alternatively, a low-rank distance can be learned
#' by specifying the "rank" parameter (a low-rank matrix with uniformly
#' distributed values will be used as the initial learning point).
#'
#' The program also requires the number of target neighbors to work with
#' (specified with "k"). A regularization parameter can also be passed; it acts
#' as a trade-off between the pulling and pushing terms (specified with
#' "regularization"). In addition, this implementation of LMNN includes a
#' parameter to decide the interval after which impostors must be re-calculated
#' (specified with "range").
#'
#' Output can either be the learned distance matrix (specified with "output"),
#' or the transformed dataset (specified with "transformed_data"), or both.
#' Additionally, the mean-centered dataset (specified with "centered_data") can
#' be accessed, given that mean-centering (specified with "center") is
#' performed on the dataset. Accuracy on the initial dataset and the final
#' transformed dataset can be printed by specifying the "print_accuracy"
#' parameter.
#'
#' This implementation of LMNN uses AMSGrad, BigBatch_SGD, stochastic gradient
#' descent, mini-batch stochastic gradient descent, or the L-BFGS optimizer.
#'
#' AMSGrad, specified by the value 'amsgrad' for the parameter "optimizer",
#' uses the maximum of past squared gradients. It depends primarily on three
#' parameters: the step size (specified with "step_size"), the batch size
#' (specified with "batch_size"), and the maximum number of passes (specified
#' with "passes"). In addition, a normalized starting point can be used by
#' specifying the "normalize" parameter.
#'
#' BigBatch_SGD, specified by the value 'bbsgd' for the parameter "optimizer",
#' depends primarily on three parameters: the step size (specified with
#' "step_size"), the batch size (specified with "batch_size"), and the maximum
#' number of passes (specified with "passes"). In addition, a normalized
#' starting point can be used by specifying the "normalize" parameter.
#'
#' Stochastic gradient descent, specified by the value 'sgd' for the parameter
#' "optimizer", depends primarily on three parameters: the step size (specified
#' with "step_size"), the batch size (specified with "batch_size"), and the
#' maximum number of passes (specified with "passes"). In addition, a
#' normalized starting point can be used by specifying the "normalize"
#' parameter. Furthermore, mean-centering can be performed on the dataset by
#' specifying the "center" parameter.
#'
#' The L-BFGS optimizer, specified by the value 'lbfgs' for the parameter
#' "optimizer", uses a back-tracking line search algorithm to minimize a
#' function. The following parameters are used by L-BFGS: "max_iterations" and
#' "tolerance" (the optimization is terminated when the gradient norm is below
#' this value). For more details on the L-BFGS optimizer, consult either the
#' mlpack L-BFGS documentation (in lbfgs.hpp) or the vast set of published
#' literature on L-BFGS. In addition, a normalized starting point can be used
#' by specifying the "normalize" parameter.
#'
#' By default, the AMSGrad optimizer is used.
#'
#' @author
#' mlpack developers
#'
#' @export
#' @examples
#' # Example - Let's say we want to learn a distance on the iris dataset, with
#' # the number of targets as 3, using the BigBatch_SGD optimizer. A simple
#' # call for that will look like:
#'
#' \donttest{
#' output <- lmnn(input=iris, labels=iris_labels, k=3, optimizer="bbsgd")
#' output <- output$output
#' }
#'
#' # Another program call, making use of the range & regularization parameters
#' # with a dataset having labels as the last column, can be made as:
#'
#' \donttest{
#' output <- lmnn(input=letter_recognition, k=5, range=10,
#'                regularization=0.4)
#' output <- output$output
#' }
lmnn <- function(input,
                 batch_size=NA,
                 center=FALSE,
                 distance=NA,
                 k=NA,
                 labels=NA,
                 linear_scan=FALSE,
                 max_iterations=NA,
                 normalize=FALSE,
                 optimizer=NA,
                 passes=NA,
                 print_accuracy=FALSE,
                 range=NA,
                 rank=NA,
                 regularization=NA,
                 seed=NA,
                 step_size=NA,
                 tolerance=NA,
                 verbose=FALSE) {
  # Restore IO settings.
  IO_RestoreSettings("Large Margin Nearest Neighbors (LMNN)")
  # Process each input argument before calling mlpackMain().
  IO_SetParamMat("input", to_matrix(input))
  if (!identical(batch_size, NA)) {
    IO_SetParamInt("batch_size", batch_size)
  }
  if (!identical(center, FALSE)) {
    IO_SetParamBool("center", center)
  }
  if (!identical(distance, NA)) {
    IO_SetParamMat("distance", to_matrix(distance))
  }
  if (!identical(k, NA)) {
    IO_SetParamInt("k", k)
  }
  if (!identical(labels, NA)) {
    IO_SetParamURow("labels", to_matrix(labels))
  }
  if (!identical(linear_scan, FALSE)) {
    IO_SetParamBool("linear_scan", linear_scan)
  }
  if (!identical(max_iterations, NA)) {
    IO_SetParamInt("max_iterations", max_iterations)
  }
  if (!identical(normalize, FALSE)) {
    IO_SetParamBool("normalize", normalize)
  }
  if (!identical(optimizer, NA)) {
    IO_SetParamString("optimizer", optimizer)
  }
  if (!identical(passes, NA)) {
    IO_SetParamInt("passes", passes)
  }
  if (!identical(print_accuracy, FALSE)) {
    IO_SetParamBool("print_accuracy", print_accuracy)
  }
  if (!identical(range, NA)) {
    IO_SetParamInt("range", range)
  }
  if (!identical(rank, NA)) {
    IO_SetParamInt("rank", rank)
  }
  if (!identical(regularization, NA)) {
    IO_SetParamDouble("regularization", regularization)
  }
  if (!identical(seed, NA)) {
    IO_SetParamInt("seed", seed)
  }
  if (!identical(step_size, NA)) {
    IO_SetParamDouble("step_size", step_size)
  }
  if (!identical(tolerance, NA)) {
    IO_SetParamDouble("tolerance", tolerance)
  }
  if (verbose) {
    IO_EnableVerbose()
  } else {
    IO_DisableVerbose()
  }
  # Mark all output options as passed.
  IO_SetPassed("centered_data")
  IO_SetPassed("output")
  IO_SetPassed("transformed_data")
  # Call the program.
  lmnn_mlpackMain()
  # Add ModelType as attribute to the model pointer, if needed.
  # Extract the results in order.
  out <- list(
    "centered_data" = IO_GetParamMat("centered_data"),
    "output" = IO_GetParamMat("output"),
    "transformed_data" = IO_GetParamMat("transformed_data")
  )
  # Clear the parameters.
  IO_ClearSettings()
  return(out)
}
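# The learned matrix in output$output defines the new metric: each point x is
# mapped to L x, which for row-wise data in R is X %*% t(L). The base-R sketch
# below shows that transformation together with a leave-one-out 1-NN accuracy
# check of the kind "print_accuracy" reports; the data and the stand-in matrix
# L are hypothetical and not produced by lmnn().

```r
set.seed(1)
X <- matrix(rnorm(40), ncol = 2)   # 20 hypothetical points in 2-D
y <- rep(c(0, 1), each = 10)       # two class labels
L <- diag(2)                       # stand-in for a learned distance matrix

# Transform the data into the learned metric space (rows are points).
Xt <- X %*% t(L)

# Leave-one-out 1-NN accuracy: for each point, find its nearest other
# point and check whether the labels agree.
nn_acc <- function(Z, y) {
  D <- as.matrix(dist(Z))   # pairwise Euclidean distances
  diag(D) <- Inf            # a point must not match itself
  mean(y[apply(D, 1, which.min)] == y)
}
acc <- nn_acc(Xt, y)
```

# With the identity stand-in, Xt equals X, so acc is just the baseline 1-NN
# accuracy; a matrix actually learned by lmnn() would typically raise it.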