In this article, we will look at different similarity measures in data science and the top similarity techniques in machine learning. These are useful not only for machine learning and deep learning practitioners but also for business analysts who want to extract insights from their data.
The goal of data science is to extract useful information from large sets of data. While data scientists must use their intuition, they usually also have access to statistical packages and some automation to achieve their goals.
Similarity measures are an important topic in data science, and in my opinion one of the underrated ideas in the field. Similarity measures (particularly when used to define similarity between objects) have been a vital part of machine learning since its inception, so it is not surprising that there is a variety of similarity techniques in machine learning. They are used to understand the similarities and differences between objects, or between elements of a set, and they allow us to answer questions such as: “How similar is group A to group B?”
This is done by calculating metrics for each pair of data points in the two sets and comparing those metrics. Some common similarity measures include cosine similarity, the Pearson correlation coefficient, entropy, mutual information, and many more.
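To make this concrete, here is a minimal sketch of comparing two groups pairwise with cosine similarity; the helper name mean_pairwise_cosine and the sample vectors are ours, purely for illustration:

import numpy as np

def mean_pairwise_cosine(group_a, group_b):
    # Average the cosine similarity over every (a, b) pair drawn from the two groups
    sims = [np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            for a in group_a for b in group_b]
    return float(np.mean(sims))

group_a = np.array([[1.0, 0.0], [0.9, 0.1]])
group_b = np.array([[0.8, 0.2], [1.0, 0.1]])
print(mean_pairwise_cosine(group_a, group_b))  # close to 1.0: the groups point in similar directions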
Similarity measures use the distance between two objects to gauge their similarity. The properties of similarity include:
- mutual exclusivity: two objects cannot be compared unless they belong to mutually exclusive categories. For example, if you have the task of classifying images as cats or dogs, then you cannot compare a cat to a dog, because there is no such thing as “a cat that looks like a dog.”
- exhaustiveness: the categories must cover every object being compared. If you have the task of classifying images as cats or dogs, then no matter how many other features you add, each image must still be classified one way or the other.
- degrees of freedom: two objects may only be compared with respect to one property at a time, which eliminates any ambiguity in interpreting the results of such comparisons.
TensorFlow is a popular open-source framework for building machine learning systems. It allows you to build, train, and evaluate machine learning models that run on devices ranging from smartphones to desktop computers.
TensorFlow uses the concept of similarity to measure how alike two datasets are. Similarity can be thought of as an operation on a pair of data points that returns a number between -1 and +1: the closer the number is to +1, the more similar the points are, and the closer it is to -1, the more dissimilar they are.
The TensorFlow library contains several different similarity algorithms, including cosine distance and Pearson correlation coefficient.
The main difference between TensorFlow Similarity and other similarity measures is that it does not require pre-processing steps such as tokenization or hand-crafted features. This makes it well suited to processing large amounts of unstructured data without slowing down your machine learning pipeline.
TensorFlow Similarity also takes into account how similar the data looks when viewed in a certain format. For example, an image that has been cropped or resized will give different results than the same image at its original size.
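As a concrete illustration of the cosine measure mentioned above, here is a minimal sketch using basic TensorFlow ops; this is not TensorFlow Similarity’s own API, and the function name and sample vectors are ours:

import tensorflow as tf

def cosine_similarity_tf(a, b):
    # dot(a, b) / (||a|| * ||b||), a value in [-1, 1]
    a = tf.constant(a, dtype=tf.float32)
    b = tf.constant(b, dtype=tf.float32)
    return tf.reduce_sum(a * b) / (tf.norm(a) * tf.norm(b))

print(cosine_similarity_tf([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]).numpy())  # 1.0: parallel vectors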
The following are some examples of similarity measures implemented in TensorFlow:
- Sum of squared differences: the squared differences between two sets of features are summed, and the total is used to judge how similar the two sets are. This measure is usually applied to small datasets.
- Euclidean distance: this measure compares two vectors by taking the square root of the summed squared differences between their components, which yields a real number representing how far apart the vectors are in the vector space (i.e., how similar they are); see the sketch after this list. It is commonly used for large datasets where more than one dimension is needed for the analysis (e.g., when comparing two groups based on several attributes such as age).
- Mahalanobis distance: this metric is similar to Euclidean distance, except that it compares vectors along multiple, possibly correlated, dimensions at once, whereas Euclidean distance treats each dimension independently.
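As promised in the list above, here is a short sketch of the first two measures expressed with plain TensorFlow ops (the tensors are arbitrary examples):

import tensorflow as tf

a = tf.constant([1.0, 2.0, 3.0])
b = tf.constant([4.0, 6.0, 8.0])

sum_sq_diff = tf.reduce_sum(tf.square(a - b))  # sum of squared differences: 50.0
euclidean = tf.sqrt(sum_sq_diff)               # Euclidean distance: ~7.071

print(sum_sq_diff.numpy(), euclidean.numpy())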
Now, let’s look at the top similarity techniques in machine learning, along with their implementations:
1. Euclidean Distance:
The distance between two points in Euclidean space is referred to as the “Euclidean distance.” It is essentially the length of the shortest path between the two points, and it is often simply called “distance.” It is the most effective proximity measure when the data is dense or continuous.
from math import sqrt

def euclidean_distance(x, y):
    # Square root of the sum of squared component-wise differences
    return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))

print(euclidean_distance([2.3, 5, 0], [4, 3, 5]))
2. Manhattan Distance (Taxicab Geometry):
In an N-dimensional vector space, the Manhattan distance measures the distance between two points as the total length of the line segments projected from the points onto the coordinate axes. Simply put, it is the sum of the absolute differences between the two points’ coordinates in every dimension. Consider two points, A and B: to calculate the Manhattan distance between them, we only need the absolute variation along the x-axis and the y-axis, that is, how far apart A and B are along each axis. The Manhattan distance is therefore the distance between two places measured along axes at right angles.
def manhattan_distance(x, y):
    # Sum of absolute component-wise differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance([1, 2, 3], [4, 0, 3]))
3. Minkowski Distance:
The Minkowski distance generalizes the Manhattan and Euclidean distances by adding an order parameter p. The value of p is determined experimentally; for the majority of problems it is 2, 3, or 4, with the exact value depending on the application. Choosing the right value of p is essential for achieving the desired outcome for the application’s goal: the Minkowski distance is equivalent to the Euclidean distance when p = 2 and to the Manhattan distance when p = 1.
from decimal import Decimal

def nth_root(value, n_root):
    # p-th root of value, rounded to three decimal places
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x, y, p_value):
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))
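As a quick sanity check on the equivalence noted above, p = 1 should reproduce the Manhattan distance and p = 2 the Euclidean distance, assuming the functions from the earlier sections are still in scope (the example vectors are arbitrary):

x, y = [0, 3, 4, 5], [7, 6, 3, -1]
print(minkowski_distance(x, y, 1), manhattan_distance(x, y))  # both give the L1 distance, 17
print(minkowski_distance(x, y, 2), euclidean_distance(x, y))  # both give the L2 distance, ~9.747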
4. Jaccard Distance:
The Jaccard similarity calculates the degree of similarity between two collections by identifying their shared and distinct members. It is a common proximity measure used to determine how similar two objects, such as two texts, are. It is typically used in data science applications such as text mining, e-commerce, and recommendation systems.
def jaccard(list1, list2):
    # Size of the intersection divided by the size of the union
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1).union(list2))
    return intersection / union

print(jaccard(['a', 'b', 'c'], ['b', 'c', 'd']))  # 0.5
5. Cosine Distance:
Cosine similarity is a metric of similarity between two numerical sequences. It is used to determine how similar two documents are regardless of their size: the metric computes the normalized dot product of the two vectors, which is equivalent to finding the cosine of the angle between them.
from math import sqrt

def square_rooted(v):
    return sqrt(sum(a * a for a in v))  # Euclidean (L2) norm

def cosine_similarity(x, y):
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / denominator, 3)
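A quick usage example with arbitrary vectors: parallel vectors give a similarity of 1.0, and orthogonal vectors give 0.0.

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal)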
6. Chebyshev Distance:
Maximum value distance is another name for the Chebyshev distance. It looks at the absolute size of the coordinate differences between two objects and takes the largest one. It is challenging to use Chebyshev as a general-purpose distance measure because it is often only useful in a small number of use cases. A classic example is the smallest number of moves a king needs to make to travel between two squares on a chessboard, which is exactly the Chebyshev distance between the squares.
import numpy as np

def chebyshev_distance(x, y):
    # Largest absolute coordinate difference across all dimensions
    return np.max(np.abs(np.array(x) - np.array(y)))
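Tying this back to the king-move example (the board coordinates are hypothetical): a king on square (1, 1) needs exactly five moves to reach (4, 6).

print(chebyshev_distance([1, 1], [4, 6]))  # max(|4 - 1|, |6 - 1|) = 5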
7. Mahalanobis Distance:
The Mahalanobis distance similarity measure (MDSM) is frequently used to compare the similarity of two vectors in terms of distance. Its advantage is that it weighs the absolute distance between two points by the variance of the dataset, taking into account how dispersed the points are within it.
import numpy as np

def mahalanobis(x=None, data=None, cov=None):
    # x: observations (rows); data: reference DataFrame; cov: optional covariance matrix
    x_mu = x - np.mean(data, axis=0)
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(x_mu, inv_covmat)
    mahal = np.dot(left, x_mu.T)
    return mahal.diagonal()  # squared Mahalanobis distance of each row
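A minimal usage sketch, assuming the reference data lives in a pandas DataFrame; the column names and values here are made up:

import pandas as pd

df = pd.DataFrame({'height': [170, 165, 180, 175], 'weight': [65, 59, 82, 73]})
print(mahalanobis(x=df, data=df))  # one squared distance per row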
8. Levenshtein Distance:
The Levenshtein distance is a value that indicates how unalike two strings are: it counts the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other, so the greater the value, the more the two strings differ. It has several uses, such as text autocomplete and autocorrection.
import enchant

# pyenchant ships a small Levenshtein utility
string1, string2 = "kitten", "sitting"
print(enchant.utils.levenshtein(string1, string2))  # 3
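If you would rather not depend on pyenchant, here is a minimal pure-Python sketch of the classic dynamic-programming formulation; the function name is ours:

def levenshtein(s, t):
    # prev[j] holds the edit distance between the current prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, 1):
        curr = [i]
        for j, tc in enumerate(t, 1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3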
For either of these use cases, the user-entered word is matched against words in a dictionary to find the closest match, at which point a suggestion or suggestions are made.
Whether you are classifying data into two classes or searching for the best predictive model, there are many distance measures for calculating the similarity between two objects in machine learning. This article has explained many of them, including cosine similarity, Euclidean distance, Manhattan distance, Minkowski distance, Chebyshev distance, Mahalanobis distance, Jaccard distance, and Levenshtein distance. Hope you enjoyed this article at mldots.