Jaccard

Bjørn Kjos-Hanssen

This project contains an excerpt from the paper “Interpolating between the Jaccard distance and an analogue of the normalized information distance”. That paper is unique in that a majority of its results were formalized at the time of publication. Therefore it is especially suitable for a Lean blueprint project.

Abstract.

Jiménez, Becerra, and Gelbukh (2013) defined a family of “symmetric Tversky ratio models” $S_{\alpha,\beta}$, $0\le\alpha\le 1$, $\beta>0$. Each function $D_{\alpha,\beta}=1-S_{\alpha,\beta}$ is a semimetric on the powerset of a given finite set.

We show that $D_{\alpha,\beta}$ is a metric if and only if $0\le\alpha\le\frac12$ and $\beta\ge 1/(1-\alpha)$. This result is formally verified in the Lean proof assistant.

The extreme points of this parametrized space of metrics are $V_1=D_{1/2,2}$, the Jaccard distance, and $V_\infty=D_{0,1}$, an analogue of the normalized information distance of M. Li, Chen, X. Li, Ma, and Vitányi (2004).

As a second interpolation, we also show that $V_p$ is a metric for $1\le p\le\infty$, where

\[
\Delta_p(A,B)=\left(|B\setminus A|^p+|A\setminus B|^p\right)^{1/p},\qquad
V_p(A,B)=\frac{\Delta_p(A,B)}{|A\cap B|+\Delta_p(A,B)}.
\]

0.1 Introduction

Distance measures (metrics) are used in a wide variety of scientific contexts. In bioinformatics, M. Li, Badger, Chen, Kwong, and Kearney [13] introduced an information-based sequence distance. In an information-theoretic setting, M. Li, Chen, X. Li, Ma, and Vitányi [14] rejected the distance of [13] in favor of a normalized information distance (NID). The Encyclopedia of Distances [3] describes the NID, on page 205, as

\[
\frac{\max\{K(x\mid y^*),\,K(y\mid x^*)\}}{\max\{K(x),K(y)\}}
\]

where $K(x\mid y^*)$ is the Kolmogorov complexity of $x$ given a shortest program $y^*$ to compute $y$. It is equivalent to be given $y$ itself in hard-coded form:

\[
\frac{\max\{K(x\mid y),\,K(y\mid x)\}}{\max\{K(x),K(y)\}}
\]

Another formulation (see [14, page 8]) is

\[
\frac{K(x,y)-\min\{K(x),K(y)\}}{\max\{K(x),K(y)\}}.
\]

The fact that the NID is in a sense a normalized metric is proved in [14]. Then in 2017, while studying malware detection, Raff and Nicholas [15] suggested the Lempel–Ziv Jaccard distance (LZJD) as a practical alternative to the NID. As we shall see, this is a metric. In a way this constitutes a full circle: the distance in [13] is itself essentially a Jaccard distance, and the LZJD is related to it as Lempel–Ziv complexity is to Kolmogorov complexity. In the present paper we aim to shed light on this back-and-forth by showing that the NID and Jaccard distances constitute the endpoints of a parametrized family of metrics.

For comparison, the Jaccard distance between two sets $X$ and $Y$, and our analogue of the NID, are as follows:

\[
J_1(X,Y)=\frac{|X\setminus Y|+|Y\setminus X|}{|X\cup Y|}=1-\frac{|X\cap Y|}{|X\cup Y|},\qquad
J_\infty(X,Y)=\frac{\max\{|X\setminus Y|,\,|Y\setminus X|\}}{\max\{|X|,\,|Y|\}}.
\]
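As a concrete illustration (the sets below are invented for the example, not taken from the paper), both distances can be computed in a few lines of Python:

```python
from fractions import Fraction

def j1(x, y):
    """Jaccard distance: (|X \\ Y| + |Y \\ X|) / |X ∪ Y|."""
    return Fraction(len(x - y) + len(y - x), len(x | y))

def j_inf(x, y):
    """Analogue of the NID: max of the set differences over max of the sizes."""
    return Fraction(max(len(x - y), len(y - x)), max(len(x), len(y)))

x, y = {1, 2, 3}, {2, 3, 4, 5}
print(j1(x, y))     # 3/5
print(j_inf(x, y))  # 1/2
```

Exact rational arithmetic via `Fraction` avoids any floating-point noise in such small examples.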

Our main result, Theorem 20, shows which interpolations between these two are metrics. The way we arrived at $J_\infty$ as an analogue of the NID is via Lempel–Ziv complexity. While there are several variants [12, 19, 20], the LZ 1978 complexity [20] of a sequence is the cardinality of a certain set, the dictionary.

Definition 1

Let $\mathrm{LZSet}(A)$ be the Lempel–Ziv dictionary for a sequence $A$. We define the LZ–Jaccard distance $\mathrm{LZJD}$ by

\[
\mathrm{LZJD}(A,B)=1-\frac{|\mathrm{LZSet}(A)\cap\mathrm{LZSet}(B)|}{|\mathrm{LZSet}(A)\cup\mathrm{LZSet}(B)|}.
\]
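The following Python sketch is one plausible reading of this definition, not the implementation of [15]; in particular, LZ78 variants differ in how they treat a leftover phrase at the end of the input, which is simply dropped here:

```python
def lz_set(s):
    """LZ78-style dictionary: greedily extend the current phrase until it is new."""
    phrases = set()
    w = ""
    for c in s:
        w += c
        if w not in phrases:
            phrases.add(w)
            w = ""
    return phrases  # a leftover phrase already in the dictionary is ignored

def lzjd(a, b):
    """LZ-Jaccard distance between two strings."""
    sa, sb = lz_set(a), lz_set(b)
    return 1 - len(sa & sb) / len(sa | sb)

print(lz_set("abab"))        # {'a', 'b', 'ab'}
print(lzjd("abab", "abba"))  # 0.5
```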

It is shown in [13, Theorem 1] that the triangle inequality holds for a function which they call an information-based sequence distance. Later papers give it the notation $d_s$ in [14, Definition V.1], and call their normalized information distance $d$. Raff and Nicholas [15] introduced the LZJD and did not discuss the appearance of $d_s$ in [14, Definition V.1], even though they do cite [14] (but not [13]).

Kraskov et al. [11, 10] use $D$ and $D'$ for continuous analogues of $d_s$ and $d$ in [14] (which they cite). The Encyclopedia calls it the normalized information metric,

\[
\frac{H(X\mid Y)+H(Y\mid X)}{H(X,Y)}=1-\frac{I(X;Y)}{H(X,Y)}
\]

or Rajski distance [16].

This $d_s$ was called $d$ by [13]; see Table 1.

| Reference | Jaccard notation | NID notation |
| --------- | ---------------- | ------------ |
| [13]      | $d$              |              |
| [14]      | $d_s$            | $d$          |
| [10]      | $D$              | $D'$         |
| [15]      | LZJD             | NCD          |

Table 1: Overview of notation used in the literature. (It seems that authors use simple names for their favored notions.)

Conversely, [14, near Definition V.1] mentions mutual information.

Remark 2

Ridgway [4] observed that the entropy-based distance $D$ is essentially a Jaccard distance. No explanation was given, but we attempt one as follows. Suppose $X_1,X_2,X_3,X_4$ are i.i.d. Bernoulli($p=1/2$) random variables, $Y$ is the random vector $(X_1,X_2,X_3)$, and $Z$ is $(X_2,X_3,X_4)$. Then $Y$ and $Z$ have two bits of mutual information, $I(Y;Z)=2$, and three bits of entropy each, $H(Y)=H(Z)=3$. Thus the relationship $H(Y,Z)=H(Y)+H(Z)-I(Y;Z)$ becomes the Venn diagram relationship
\[
|\{X_1,X_2,X_3,X_4\}|=|\{X_1,X_2,X_3\}|+|\{X_2,X_3,X_4\}|-|\{X_2,X_3\}|.
\]
The relationship to Jaccard distance may not have been well known, as it is not mentioned in [10, 2, 13, 1].
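This entropy identity can be checked exactly by enumeration; the following Python snippet (an illustration, not part of the paper) computes $H(Y)$, $H(Z)$, and $H(Y,Z)$ over the 16 equally likely outcomes:

```python
from itertools import product
from math import log2
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of equally likely samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

outcomes = list(product([0, 1], repeat=4))  # all values of (X1, X2, X3, X4)
ys = [(x1, x2, x3) for (x1, x2, x3, x4) in outcomes]
zs = [(x2, x3, x4) for (x1, x2, x3, x4) in outcomes]
yz = list(zip(ys, zs))

h_y, h_z, h_yz = entropy(ys), entropy(zs), entropy(yz)
print(h_y, h_z, h_yz)    # 3.0 3.0 4.0
print(h_y + h_z - h_yz)  # 2.0, the mutual information I(Y;Z)
```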

A more general setting is that of symmetric Tversky ratio models (STRMs), Definition 17. These are variants of the Tversky index (Definition 14) proposed in [7].

0.1.1 Generalities about metrics

Definition 3

Let $X$ be a set. A metric on $X$ is a function $d:X\times X\to\mathbb{R}$ such that

  1. $d(x,y)\ge 0$,

  2. $d(x,y)=0$ if and only if $x=y$,

  3. $d(x,y)=d(y,x)$ (symmetry),

  4. $d(x,y)\le d(x,z)+d(z,y)$ (the triangle inequality)

for all $x,y,z\in X$. If $d$ satisfies (1), (2), and (3) but not necessarily (4), then $d$ is called a semimetric.

A basic exercise in Definition 3 that we will make use of is Theorem 4.

Theorem 4

If $d_1$ and $d_2$ are metrics and $a,b$ are nonnegative constants, not both zero, then $ad_1+bd_2$ is a metric.

Lemma 5

Let $d(x,y)$ be a metric and let $a(x,y)$ be a nonnegative symmetric function. If $a(x,z)\le a(x,y)+d(y,z)$ for all $x,y,z$, then $d'(x,y)=\dfrac{d(x,y)}{a(x,y)+d(x,y)}$, with $d'(x,y)=0$ if $d(x,y)=0$, is a metric.


0.1.2 Metrics on a family of finite sets

Lemma 6

For sets $A,B,C$, we have $|A\setminus B|\le|A\setminus C|+|C\setminus B|$.

Lemma 7

Let $f(A,B)=|A\setminus B|+|B\setminus A|$. Then $f$ is a metric.

Lemma 8

Let $f(A,B)=\max\{|A\setminus B|,\,|B\setminus A|\}$. Then $f$ is a metric.


For a real number $\alpha$, we write $\bar\alpha=1-\alpha$. For finite sets $X,Y$ we define

\[
\tilde m(X,Y)=\min\{|X\setminus Y|,\,|Y\setminus X|\},\qquad
\tilde M(X,Y)=\max\{|X\setminus Y|,\,|Y\setminus X|\}.
\]
Lemma 9

Let $\delta:=\alpha\tilde m+\bar\alpha\tilde M$. Let $X=\{0\}$, $Y=\{1\}$, $Z=\{0,1\}$. Then $\delta(X,Y)=1$ and $\delta(X,Z)=\delta(Y,Z)=\bar\alpha$.

The proof of Lemma 9 is an immediate calculation.
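For a numerical sanity check of Lemma 9 (illustrative Python; $\alpha=0.25$ is an arbitrary sample value):

```python
def delta(alpha, x, y):
    """delta_alpha = alpha * min + (1 - alpha) * max of the two set differences."""
    m = min(len(x - y), len(y - x))
    M = max(len(x - y), len(y - x))
    return alpha * m + (1 - alpha) * M

X, Y, Z = {0}, {1}, {0, 1}
alpha = 0.25
print(delta(alpha, X, Y))  # 1.0
print(delta(alpha, X, Z))  # 0.75, i.e. 1 - alpha
# The triangle inequality delta(X,Y) <= delta(X,Z) + delta(Z,Y) reads 1 <= 2*(1 - alpha),
# which holds precisely when alpha <= 1/2, foreshadowing Theorem 10.
```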

Theorem 10

$\delta_\alpha=\alpha\tilde m+\bar\alpha\tilde M$ satisfies the triangle inequality if and only if $0\le\alpha\le 1/2$.

Lemma 11

Suppose $d$ is a metric on a collection $\mathcal X$ of nonempty sets, with $d(X,Y)\le 2$ for all $X,Y\in\mathcal X$. Let $\hat{\mathcal X}=\mathcal X\cup\{\emptyset\}$ and define $\hat d:\hat{\mathcal X}\times\hat{\mathcal X}\to\mathbb R$ by stipulating that for $X,Y\in\mathcal X$,

\[
\hat d(X,Y)=d(X,Y);\qquad \hat d(X,\emptyset)=1=\hat d(\emptyset,X);\qquad \hat d(\emptyset,\emptyset)=0.
\]

Then $\hat d$ is a metric on $\hat{\mathcal X}$.

Theorem 12

Let $f(A,B)$ be a metric such that

\[
|B\setminus A|\le f(A,B)
\]

for all $A,B$. Then the function $d$ given by

\[
d(A,B)=\begin{cases}\dfrac{f(A,B)}{|A\cap B|+f(A,B)},&\text{if }|A\cap B|+f(A,B)>0,\\[1ex] 0,&\text{otherwise,}\end{cases}
\]

is a metric.

Theorem 13

Let $f(A,B)=m\min\{|A\setminus B|,\,|B\setminus A|\}+M\max\{|A\setminus B|,\,|B\setminus A|\}$ with $0<m\le M$ and $1\le M$. Then the function $d$ given by

\[
d(A,B)=\begin{cases}\dfrac{f(A,B)}{|A\cap B|+f(A,B)},&\text{if }A\ne B,\\[1ex] 0,&\text{otherwise,}\end{cases}
\]

is a metric.


0.1.3 Tversky indices

Definition 14 [18]

For sets $X$ and $Y$, the Tversky index with parameters $\alpha,\beta\ge 0$ is a number between 0 and 1 given by

\[
S(X,Y)=\frac{|X\cap Y|}{|X\cap Y|+\alpha|X\setminus Y|+\beta|Y\setminus X|}.
\]

We also define the corresponding Tversky dissimilarity $d^T_{\alpha,\beta}$ by

\[
d^T_{\alpha,\beta}(X,Y)=\begin{cases}1-S(X,Y)&\text{if }X\cup Y\ne\emptyset;\\ 0&\text{if }X=Y=\emptyset.\end{cases}
\]
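A small Python rendering of Definition 14 (an illustration with an invented example pair), which also confirms that $\alpha=\beta=1$ recovers the Jaccard distance:

```python
from fractions import Fraction

def tversky_dissimilarity(alpha, beta, x, y):
    """d^T_{alpha,beta} = 1 - S(X,Y), with d = 0 when X = Y = empty set."""
    if not x and not y:
        return Fraction(0)
    inter = len(x & y)
    s = Fraction(inter, inter + alpha * len(x - y) + beta * len(y - x))
    return 1 - s

x, y = {1, 2, 3}, {2, 3, 4, 5}
d = tversky_dissimilarity(1, 1, x, y)
jaccard = Fraction(len(x | y) - len(x & y), len(x | y))
print(d, jaccard)  # both 3/5
```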
Definition 15

The Szymkiewicz–Simpson coefficient is defined by

\[
\mathrm{overlap}(X,Y)=\frac{|X\cap Y|}{\min(|X|,|Y|)}.
\]

We may note that $\mathrm{overlap}(X,Y)=1$ whenever $X\subseteq Y$ or $Y\subseteq X$, so that $1-\mathrm{overlap}$ is not a metric.

Definition 16

The Sørensen–Dice coefficient is defined by

\[
\frac{2|X\cap Y|}{|X|+|Y|}.
\]
Definition 17 [7, Section 2]

Let $\mathcal X$ be a collection of finite sets. We define $S:\mathcal X\times\mathcal X\to\mathbb R$ as follows. The symmetric Tversky ratio model is defined by

\[
\mathrm{strm}(X,Y)=\frac{|X\cap Y|+\mathrm{bias}}{|X\cap Y|+\mathrm{bias}+\beta(\alpha\tilde m+(1-\alpha)\tilde M)}.
\]

The unbiased symmetric TRM ($\mathrm{ustrm}$) is the case where $\mathrm{bias}=0$; we shall assume this case for the rest of this paper. The Tversky semimetric $D_{\alpha,\beta}$ is defined by $D_{\alpha,\beta}(X,Y)=1-\mathrm{ustrm}(X,Y)$, or more precisely

\[
D_{\alpha,\beta}(X,Y)=\begin{cases}\dfrac{\beta(\alpha\tilde m+(1-\alpha)\tilde M)}{|X\cap Y|+\beta(\alpha\tilde m+(1-\alpha)\tilde M)},&\text{if }X\cup Y\ne\emptyset;\\[1ex] 0,&\text{if }X=Y=\emptyset.\end{cases}
\]

Note that for $\alpha=1/2$, $\beta=1$, the STRM is equivalent to the Sørensen–Dice coefficient. Similarly, for $\alpha=1/2$, $\beta=2$, it is equivalent to Jaccard’s coefficient.
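Both equivalences can be checked in exact rational arithmetic; the Python below is an illustration on an arbitrary example pair of sets:

```python
from fractions import Fraction

def d_tversky(alpha, beta, x, y):
    """Tversky semimetric D_{alpha,beta} of Definition 17 (unbiased case)."""
    if not x and not y:
        return Fraction(0)
    m = min(len(x - y), len(y - x))
    M = max(len(x - y), len(y - x))
    core = beta * (alpha * m + (1 - alpha) * M)
    return core / (len(x & y) + core)

x, y = {1, 2, 3, 4}, {3, 4, 5}
jaccard = 1 - Fraction(len(x & y), len(x | y))
dice = 1 - Fraction(2 * len(x & y), len(x) + len(y))

print(d_tversky(Fraction(1, 2), 2, x, y) == jaccard)  # True: D_{1/2,2} is the Jaccard distance
print(d_tversky(Fraction(1, 2), 1, x, y) == dice)     # True: D_{1/2,1} is Dice dissimilarity
```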

0.2 Tversky metrics

Theorem 18

The function $D_{\alpha,\beta}$ is a metric only if $\beta\ge 1/(1-\alpha)$.


In Theorem 19 we use the interval notation on $\mathbb N$, given by $[a,a]=\{a\}$ and $[a,b]=[a,b-1]\cup\{b\}$.

Theorem 19

The function $D_{\alpha,\beta}$ is a metric on all finite power sets only if $\alpha\le 1/2$.

Theorem 20

Let $0\le\alpha\le 1$ and $\beta>0$. Then $D_{\alpha,\beta}$ is a metric if and only if $0\le\alpha\le 1/2$ and $\beta\ge 1/(1-\alpha)$.


We have formally proved Theorem 20 in the Lean theorem prover. The GitHub repository can be found at [8].

0.2.1 A converse to Gragera and Suppakitpaisarn

Theorem 21 (Gragera and Suppakitpaisarn [5, 6])

The optimal constant $\rho$ such that $d^T_{\alpha,\beta}(X,Z)\le\rho\left(d^T_{\alpha,\beta}(X,Y)+d^T_{\alpha,\beta}(Y,Z)\right)$ for all $X,Y,Z$ is

\[
\frac12\left(1+\sqrt{\frac{1}{\alpha\beta}}\right).
\]
Corollary 22

$d^T_{\alpha,\beta}$ is a metric only if $\alpha=\beta\ge 1$.


Theorem 23 gives a converse to Corollary 22, which was inspired by Gragera and Suppakitpaisarn:

Theorem 23

The Tversky dissimilarity $d^T_{\alpha,\beta}$ is a metric if and only if $\alpha=\beta\ge 1$.


The truth or falsity of Theorem 23 does not arise in Gragera and Suppakitpaisarn’s work, as they require $\alpha,\beta\le 1$ in their definition of the Tversky index. We note that Tversky [18] only required $\alpha,\beta\ge 0$.

0.3 Lebesgue-style metrics

Incidentally, the names of $J_1$ and $J_\infty$ come from the observation that they are special cases of $J_p$, given by

\[
J_p(A,B)=\left(\frac{2\left(|B\setminus A|^p+|A\setminus B|^p\right)}{|A|^p+|B|^p+|B\setminus A|^p+|A\setminus B|^p}\right)^{1/p}
=\begin{cases}J_1(A,B)&p=1,\\ J_\infty(A,B)&p=\infty,\end{cases}
\]

which was suggested in [9] as another possible means of interpolating between $J_1$ and $J_\infty$. We still conjecture that $J_2$ is a metric, but shall not attempt to prove it here. However:

Theorem 24

$J_3$ is not a metric.

Because of Theorem 24, we searched for a better version of $J_p$, and found $V_p$:

Definition 25

For each $1\le p\le\infty$, let¹

\[
\Delta_p(A,B)=\left(|B\setminus A|^p+|A\setminus B|^p\right)^{1/p},\quad\text{and}\quad
V_p(A,B)=\frac{\Delta_p(A,B)}{|A\cap B|+\Delta_p(A,B)}.
\]

We have $V_1=J_1$ and $V_\infty:=\lim_{p\to\infty}V_p=J_\infty$.

In a way, what is going on here is that we consider $L^p$ spaces instead of

\[
\frac1p\,L^1+\left(1-\frac1p\right)L^\infty
\]

spaces.

Theorem 26

For each $1\le p\le\infty$, $\Delta_p$ is a metric.

Theorem 27

For each $1\le p\le\infty$, $V_p$ is a metric.


Of special interest may be $V_2$ as a canonical interpolant between $V_1$, the Jaccard distance, and $V_\infty=J_\infty$, the analogue of the NID. If $|B\setminus A|=3$, $|A\setminus B|=4$, and $|A\cap B|=5$, then

\[
V_1(A,B)=7/12,\qquad V_2(A,B)=1/2,\qquad V_\infty(A,B)=4/9.
\]
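These values can be reproduced by a short computation (the sets below are chosen only to realize the stated cardinalities):

```python
def v_p(p, a, b):
    """V_p(A,B) = Delta_p / (|A ∩ B| + Delta_p); p = float('inf') uses the max."""
    da, db = len(b - a), len(a - b)
    if p == float('inf'):
        delta = max(da, db)
    else:
        delta = (da ** p + db ** p) ** (1 / p)
    return delta / (len(a & b) + delta)

a = {1, 2, 3, 4, 5, 6, 7, 8, 9}  # |A \ B| = 4
b = {1, 2, 3, 4, 5, 10, 11, 12}  # |B \ A| = 3, |A ∩ B| = 5
print(v_p(1, a, b))              # 0.5833... = 7/12
print(v_p(2, a, b))              # 0.5
print(v_p(float('inf'), a, b))   # 0.4444... = 4/9
```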

Note that if $A\subseteq B$ then $V_p(A,B)=V_1(A,B)$ for all $p$.

0.4 Conclusion and applications

Many researchers have considered metrics based on sums or maxima, but we have shown that these need not be considered in isolation: they form the endpoints of a family of metrics.

As an example, the mutations of spike glycoproteins of coronaviruses are of interest in connection with diseases such as COVID-19. We calculated several distance measures between peptide sequences for such proteins. The distance

\[
Z_{2,\alpha}(x_0,x_1)=\alpha\min(|A_1|,|A_2|)+\bar\alpha\max(|A_1|,|A_2|),
\]

where $A_1$ is the set of subwords of length 2 occurring in $x_0$ but not in $x_1$, and $A_2$ is the set of those occurring in $x_1$ but not in $x_0$, counts (in a weighted fashion) how many subwords of length 2 appear in one sequence and not the other.
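A minimal Python sketch of $Z_{2,\alpha}$ on toy strings (hypothetical sequences, not the actual peptide data):

```python
def subwords(s, k=2):
    """Set of length-k subwords (k-grams) of a string."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def z2(alpha, x0, x1):
    """Z_{2,alpha}: weighted combination of the exclusive 2-gram counts."""
    a1 = subwords(x0) - subwords(x1)  # 2-grams in x0 but not in x1
    a2 = subwords(x1) - subwords(x0)  # 2-grams in x1 but not in x0
    return alpha * min(len(a1), len(a2)) + (1 - alpha) * max(len(a1), len(a2))

x0, x1 = "abab", "aabb"
# 2-grams: {"ab", "ba"} vs {"aa", "ab", "bb"}; exclusive counts are 1 and 2
print(z2(0.25, x0, x1))  # 1.75
```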

We used the Ward linkage criterion for producing Newick trees, using the hclust package for the Go programming language. The calculated phylogenetic trees were based on the metric $Z_{2,\alpha}$.

We found one tree isomorphism class each for $0\le\alpha\le 0.21$, $0.22\le\alpha\le 0.36$, and $0.37\le\alpha\le 0.5$, respectively. We see that the various intervals for $\alpha$ can correspond to “better” or “worse” agreement with other distance measures. Thus, we propose that rather than focusing on $\alpha=0$ and $\alpha=1/2$ exclusively, future work may consider the whole interval $[0,1/2]$.

  1. Here, V can stand for Paul M. B. Vitányi, who introduced the author to the normalized information distance at a Dagstuhl workshop in 2006.