Neighbr is a package for performing k-nearest neighbor classification and regression. Models built with neighbr can also be converted to the PMML (Predictive Model Markup Language) standard using the `pmml` R package.
This vignette provides examples and advice on using the package.
First, load the necessary libraries and set the seed and number display options. `knitr::kable` is used to display data frames.
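A minimal setup sketch follows; the specific seed and display settings are illustrative rather than required.

```r
library(neighbr)
library(knitr)      # kable() is used to display data frames

set.seed(123)       # illustrative seed value
options(digits=3)   # illustrative number display option
```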
This example uses squared euclidean distance with 3 neighbors to classify the `Species` of flowers in the `iris` dataset.
Each training instance consists of 4 features and 1 class variable. The categorical target is predicted by a majority vote among the closest `k` neighbors. The `knn()` function requires that all columns in `test_set` be feature columns, with the same names and in the same order as the features in `train_set`. The `train_set` is assumed to contain only features and targets (one categorical, one continuous, and/or an ID column for neighbor ranking); i.e., if a column name is not specified as a target, it is assumed to be a feature. The `fit` object contains predictions for `test_set` in `fit$test_set_scores` (there is no `predict` method for `knn`).
```r
data(iris)

train_set <- iris[1:147,]  # train set contains all targets and features
test_set <- iris[148:150, !names(iris) %in% c("Species")]  # test set does not contain any targets

# run knn function
fit <- knn(train_set=train_set, test_set=test_set,
           k=3,
           categorical_target="Species",
           comparison_measure="squared_euclidean")

# show predictions
kable(fit$test_set_scores)
```
|     | categorical_target |
|-----|--------------------|
| 148 | virginica          |
| 149 | virginica          |
| 150 | virginica          |
The returned data frame contains predictions for the categorical target (`Species`).
It is possible to predict categorical and continuous targets simultaneously, as well as to return the IDs of the closest neighbors of a given instance. In the next example, an ID column is added to the data for ranking, and `Petal.Width` is used as a continuous target. By default, the prediction for the continuous target is calculated by averaging the target values of the closest `k` neighbors.
```r
data(iris)

iris$ID <- c(1:150)  # an ID column is necessary if ranks are to be calculated

train_set <- iris[1:147,]  # train set contains all predicted variables, features, and ID column
test_set <- iris[148:150, !names(iris) %in% c("Petal.Width","Species","ID")]  # test set does not contain predicted variables or ID column

fit <- knn(train_set=train_set, test_set=test_set,
           k=3,
           categorical_target="Species",
           continuous_target="Petal.Width",
           comparison_measure="squared_euclidean",
           return_ranked_neighbors=3,
           id="ID")

kable(fit$test_set_scores)
```
|     | categorical_target | continuous_target | neighbor1 | neighbor2 | neighbor3 |
|-----|--------------------|-------------------|-----------|-----------|-----------|
| 148 | virginica          | 2.20              | 146       | 111       | 116       |
| 149 | virginica          | 2.17              | 137       | 116       | 138       |
| 150 | virginica          | 1.93              | 115       | 128       | 84        |
The ranked neighbor IDs are returned along with the categorical and continuous targets, with `neighbor1` being the closest in terms of distance. If a similarity measure were being used, `neighbor1` would be the most similar. Any number of neighbors can be returned, as long as `return_ranked_neighbors <= k`.
It is possible to get neighbor ranks without a target variable. In this unsupervised learning case, `continuous_target` and `categorical_target` are left as `NULL` by default.
```r
data(iris)

iris$ID <- c(1:150)  # an ID column is necessary if ranks are to be calculated

train_set <- iris[1:147, -c(5)]  # remove `Species` categorical variable
test_set <- iris[148:150, !names(iris) %in% c("Species","ID")]  # test set does not contain predicted variables or ID column

fit <- knn(train_set=train_set, test_set=test_set,
           k=5,
           comparison_measure="squared_euclidean",
           return_ranked_neighbors=4,
           id="ID")

kable(fit$test_set_scores)
```
|     | neighbor1 | neighbor2 | neighbor3 | neighbor4 |
|-----|-----------|-----------|-----------|-----------|
| 148 | 111       | 112       | 117       | 146       |
| 149 | 137       | 116       | 111       | 141       |
| 150 | 128       | 139       | 102       | 143       |
The package supports logical features, to be used with an appropriate similarity measure. This example demonstrates predicting a categorical target and ranking neighbors for the `HouseVotes84` dataset (from the `mlbench` package). The features may be logical vectors consisting of `{TRUE, FALSE}` or numeric vectors consisting of `{0,1}`, but not factors. In this example, the factor features are converted to numeric vectors.
```r
library(mlbench)
data(HouseVotes84)

dat <- HouseVotes84[complete.cases(HouseVotes84),]  # remove any rows with N/A elements

# change all {yes,no} factors to {0,1}
feature_names <- names(dat)[!names(dat) %in% c("Class","ID")]
for (n in feature_names) {
  levels(dat[,n])[levels(dat[,n])=="n"] <- 0
  levels(dat[,n])[levels(dat[,n])=="y"] <- 1
}

# change factors to numeric
for (n in feature_names) {dat[,n] <- as.numeric(levels(dat[,n]))[dat[,n]]}

dat$ID <- c(1:nrow(dat))  # an ID column is necessary if ranks are to be calculated

train_set <- dat[1:225,]
test_set <- dat[226:232, !names(dat) %in% c("Class","ID")]  # test set does not contain predicted variables or ID column

fit <- knn(train_set=train_set, test_set=test_set,
           k=7,
           categorical_target="Class",
           comparison_measure="jaccard",
           return_ranked_neighbors=3,
           id="ID")

kable(fit$test_set_scores)
```
|     | categorical_target | neighbor1 | neighbor2 | neighbor3 |
|-----|--------------------|-----------|-----------|-----------|
| 422 | democrat           | 209       | 109       | 149       |
| 423 | democrat           | 114       | 148       | 106       |
| 424 | democrat           | 114       | 96        | 112       |
| 427 | democrat           | 5         | 47        | 91        |
| 428 | republican         | 70        | 156       | 155       |
| 431 | republican         | 115       | 117       | 152       |
| 432 | democrat           | 57        | 130       | 135       |
Categorical features are not directly supported: categorical numeric features are assumed to be continuous, and if `comparison_measure` is a similarity measure, only logical features are allowed. However, categorical features may be transformed into the required form with one-hot encoding (for example, using the `dummies` package), as in the sketch below.

The algorithm will not work as-is on a dataset with a mix of categorical and continuous features: all features must be either logical or continuous.
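As a hedged illustration, one way to one-hot encode a factor is with base R's `model.matrix()`; the data frame and column below are made up for this example, and the `dummies` package mentioned above is an alternative.

```r
# One-hot encode a factor feature into {0,1} columns using base R
# (illustrative data; `- 1` drops the intercept so every level gets a column)
df <- data.frame(color = factor(c("red","green","blue","green")))
encoded <- as.data.frame(model.matrix(~ color - 1, data = df))
encoded
#>   colorblue colorgreen colorred
#> 1         0          0        1
#> 2         0          1        0
#> 3         1          0        0
#> 4         0          1        0
```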
Distance measures are used for vectors with continuous elements. Similarity measures are used for logical vectors. The comparison measures used in `neighbr` are based on those defined in the PMML standard.

Functions in `neighbr` can be used to calculate distances or similarities between vectors directly:
```r
distance(c(1,2,3), c(2,3,4), "squared_euclidean")
#> [1] 3
similarity(c(0,1,0,0), c(1,1,1,0), "simple_matching")
#> [1] 0.5
```
The next two sections show the formulas used in measure calculation.
For two vectors x and y of length n, distances are calculated as follows:

- Euclidean: $(\sum_{i=1}^{n}(x_i - y_i)^2)^{1/2}$
- Squared euclidean: $\sum_{i=1}^{n}(x_i - y_i)^2$
For two vectors x and y of length n, let:

- $a_{11}$ = the number of positions where $x_i = 1$ and $y_i = 1$
- $a_{10}$ = the number of positions where $x_i = 1$ and $y_i = 0$
- $a_{01}$ = the number of positions where $x_i = 0$ and $y_i = 1$
- $a_{00}$ = the number of positions where $x_i = 0$ and $y_i = 0$

Then, similarities are calculated as follows:

- Simple matching: $(a_{11} + a_{00}) / (a_{11} + a_{10} + a_{01} + a_{00})$
- Jaccard: $a_{11} / (a_{11} + a_{10} + a_{01})$
- Tanimoto: $(a_{11} + a_{00}) / (a_{11} + 2(a_{10} + a_{01}) + a_{00})$
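As a quick sanity check (an illustrative example, not from the original vignette), the Jaccard formula can be verified by hand against the `similarity()` function:

```r
x <- c(1,1,0,0)
y <- c(1,0,1,0)
# a11 = 1 (position 1), a10 = 1 (position 2), a01 = 1 (position 3)
# Jaccard = a11 / (a11 + a10 + a01) = 1/3
similarity(x, y, "jaccard")  # expect approximately 0.333
```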
When two (or more) training instances are the same distance from a test instance, a tie occurs. In this case, the training example that appears first in `train_set` will be first in the list of nearest neighbors. If ranked neighbors are being output, that training example will be assigned the lower rank.
For categorical targets, a tie occurs when no single class has the highest frequency of occurrence in the majority vote among the `k` neighbors; that is, two (or more) classes receive the same number of votes, regardless of the distances or similarities involved. In this case, the tie-breaking procedure follows the PMML specification:

> In case of a tie, the category with the largest number of cases in the training data is the winner. If multiple categories are tied on the largest number of cases in the training data, then the category with the smallest data value (in lexical order) among the tied categories is the winner.
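The following sketch illustrates the rule with a hypothetical helper (`break_tie` is not part of the package):

```r
# Illustrative sketch of the PMML tie-breaking rule:
# among tied classes, pick the one most frequent in the training data;
# break any remaining tie by lexical (alphabetical) order.
break_tie <- function(tied_classes, train_classes) {
  counts <- table(train_classes)[tied_classes]     # training-data frequency of each tied class
  winners <- names(counts)[counts == max(counts)]  # classes with the largest number of cases
  sort(winners)[1]                                 # smallest in lexical order wins
}

break_tie(c("virginica", "versicolor"), iris$Species)
#> [1] "versicolor"
```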
The package does not directly support missing data. Various imputation techniques may be used (e.g., the average for continuous features), or rows with N/A elements may be deleted, before the data is passed to `knn()`.
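For instance, a minimal mean-imputation sketch (the column and the introduced missing values are illustrative):

```r
dat <- iris
dat$Sepal.Length[c(3, 7)] <- NA                  # introduce missing values for the example
col_mean <- mean(dat$Sepal.Length, na.rm=TRUE)   # column mean, ignoring NAs
dat$Sepal.Length[is.na(dat$Sepal.Length)] <- col_mean

# alternatively, drop incomplete rows:
# dat <- dat[complete.cases(dat), ]
```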
This package was developed following the KNN specification in the PMML (Predictive Model Markup Language) standard. The models produced by `neighbr` can be converted to PMML (using the `pmml` R package).

Some parts of the package are only used for conversion to PMML. For example, the `function_name` field returned by `knn()` corresponds to a field required by PMML.
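A hedged sketch of the conversion, assuming (per the text above) that the `pmml` package's `pmml()` generic accepts fitted `neighbr` models; the output file name is illustrative:

```r
library(pmml)

fit_pmml <- pmml(fit)                       # convert the fitted knn model to PMML
# XML::saveXML(fit_pmml, "knn_model.pmml")  # optionally write the PMML to a file (illustrative path)
```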