Jaccard

Bjørn Kjos-Hanssen

This project contains an excerpt from the paper Interpolating between the Jaccard distance and an analogue of the normalized information distance. That paper is unique in that a majority of the results were formalized at the time of publication; therefore it is especially suitable for a Lean blueprint project.

Abstract.

Jiménez, Becerra, and Gelbukh (2013) defined a family of “symmetric Tversky ratio models” \(S_{\alpha ,\beta }\), \(0\le \alpha \le 1\), \(\beta {\gt}0\). Each function \(D_{\alpha ,\beta }=1-S_{\alpha ,\beta }\) is a semimetric on the powerset of a given finite set.

We show that \(D_{\alpha ,\beta }\) is a metric if and only if \(0\le \alpha \le \frac12\) and \(\beta \ge 1/(1-\alpha )\). This result is formally verified in the Lean proof assistant.

The extreme points of this parametrized space of metrics are \(\mathcal V_1=D_{1/2,2}\), the Jaccard distance, and \(\mathcal V_{\infty }=D_{0,1}\), an analogue of the normalized information distance of M. Li, Chen, X. Li, Ma, and Vitányi (2004).

As a second interpolation, we also show that \(\mathcal V_p\) is a metric for all \(1\le p\le \infty \), where

\[ \Delta _p(A,B)=(\lvert B\setminus A\rvert ^p+\lvert A\setminus B\rvert ^p)^{1/p}, \]
\[ \mathcal V_p(A,B)=\frac{\Delta _p(A,B)}{\lvert A\cap B\rvert + \Delta _p(A,B)}. \]

0.1 Introduction

Distance measures (metrics) are used in a wide variety of scientific contexts. In bioinformatics, M. Li, Badger, Chen, Kwong, and Kearney [ 13 ] introduced an information-based sequence distance. In an information-theoretical setting, M. Li, Chen, X. Li, Ma and Vitányi [ 14 ] rejected the distance of [ 13 ] in favor of a normalized information distance (NID). The Encyclopedia of Distances [ 3 ] describes the NID, on page 205 out of 583, as

\[ \frac{ \max \{ K(x\mid y^*),K(y\mid x^*) \} }{ \max \{ K(x),K(y) \} } \]

where \(K(x\mid y^*)\) is the Kolmogorov complexity of \(x\) given a shortest program \(y^*\) to compute \(y\). It is equivalent to be given \(y\) itself in hard-coded form:

\[ \frac{ \max \{ K(x\mid y),K(y\mid x) \} }{ \max \{ K(x),K(y) \} } \]

Another formulation (see [ 14 , page 8 ] ) is

\[ \frac{K(x,y)-\min \{ K(x),K(y)\} }{\max \{ K(x),K(y)\} }. \]

The fact that the NID is in a sense a normalized metric is proved in  [ 14 ] . Then in 2017, while studying malware detection, Raff and Nicholas [ 15 ] suggested the Lempel–Ziv Jaccard distance (LZJD) as a practical alternative to NID. As we shall see, this is a metric. In a way this constitutes a full circle: the distance in [ 13 ] is itself essentially a Jaccard distance, and the LZJD is related to it as Lempel–Ziv complexity is to Kolmogorov complexity. In the present paper we aim to shed light on this back-and-forth by showing that the NID and Jaccard distances constitute the endpoints of a parametrized family of metrics.

For comparison, the Jaccard distance between two sets \(X\) and \(Y\), and our analogue of the NID, are as follows:

\begin{eqnarray} J_1(X,Y)& =& \frac{\lvert X\setminus Y\rvert +\lvert Y\setminus X\rvert }{\lvert X\cup Y\rvert } = 1 - \frac{\lvert X\cap Y\rvert }{\lvert X\cup Y\rvert }\\ J_{\infty }(X,Y)& =& \frac{\max \{ \lvert X\setminus Y\rvert , \lvert Y\setminus X\rvert \} }{\max \{ \lvert X\rvert ,\lvert Y\rvert \} }\label{setNID} \end{eqnarray}
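For concreteness, here is a minimal Python sketch (ours, not part of the paper or the Lean formalization) that computes \(J_1\) and \(J_{\infty }\) directly from the definitions above; the convention of returning 0 when a denominator vanishes is our assumption.

```python
def j1(X: set, Y: set) -> float:
    """Jaccard distance J_1: 1 - |X ∩ Y| / |X ∪ Y| (taken to be 0 when both sets are empty)."""
    union = X | Y
    return 0.0 if not union else 1 - len(X & Y) / len(union)

def j_inf(X: set, Y: set) -> float:
    """Set analogue of the NID, J_infinity: larger set difference over larger set size."""
    denom = max(len(X), len(Y))
    return 0.0 if denom == 0 else max(len(X - Y), len(Y - X)) / denom

X, Y = {1, 2, 3}, {2, 3, 4, 5}
print(j1(X, Y))     # (1 + 2) / 5 = 0.6
print(j_inf(X, Y))  # max(1, 2) / max(3, 4) = 0.5
```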

Our main result, Theorem 20, shows which interpolations between these two are metrics. The way we arrived at \(J_{\infty }\) as an analogue of NID is via Lempel–Ziv complexity. While there are several variants [ 12 , 19 , 20 ] , the LZ 1978 complexity [ 20 ] of a sequence is the cardinality of a certain set, the dictionary.

Definition 1

Let \(\operatorname{LZSet}(A)\) be the Lempel–Ziv dictionary for a sequence \(A\). We define the LZ–Jaccard distance \(\operatorname{LZJD}\) by

\[ \operatorname{LZJD}(A,B) = 1- \frac{\lvert \operatorname{LZSet}(A)\cap \operatorname{LZSet}(B)\rvert }{\lvert \operatorname{LZSet}(A)\cup \operatorname{LZSet}(B)\rvert }. \]
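A small Python sketch of Definition 1 (ours, not from the paper): we use a greedy LZ78-style parsing as one concrete choice of Lempel–Ziv dictionary, which is an assumption about the parsing convention.

```python
def lz_set(s: str) -> set:
    """Greedy LZ78-style parsing: extend the current phrase until it is new,
    then add it to the dictionary; any trailing incomplete phrase is dropped."""
    phrases, current = set(), ""
    for ch in s:
        current += ch
        if current not in phrases:
            phrases.add(current)
            current = ""
    return phrases

def lzjd(a: str, b: str) -> float:
    """LZ-Jaccard distance: one minus the Jaccard similarity of the two dictionaries."""
    da, db = lz_set(a), lz_set(b)
    union = da | db
    return 0.0 if not union else 1 - len(da & db) / len(union)

print(lz_set("ababab"))          # {'a', 'b', 'ab'}
print(lzjd("ababab", "abbabb"))  # 1 - 2/5 = 0.6
```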

It is shown in [ 13 , Theorem 1 ] that the triangle inequality holds for a function which they call an information-based sequence distance. Later papers give it the notation \(d_s\) in [ 14 , Definition V.1 ] , and call their normalized information distance \(d\). Raff and Nicholas [ 15 ] introduced the LZJD and did not discuss the appearance of \(d_s\) in [ 14 , Definition V.1 ] , even though they do cite [ 14 ] (but not [ 13 ] ).

Kraskov et al.  [ 11 , 10 ] use \(D\) and \(D'\) for continuous analogues of \(d_s\) and \(d\) in [ 14 ] (which they cite). The Encyclopedia calls it the normalized information metric,

\[ \frac{H(X\mid Y)+H(Y\mid X)}{H(X,Y)} = 1 - \frac{I(X;Y)}{H(X,Y)} \]

or Rajski distance [ 16 ] .

This \(d_s\) was called \(d\) by [ 13 ] — see Table 1.

Reference | Jaccard notation | NID notation
[ 13 ] | \(d\) |
[ 14 ] | \(d_s\) | \(d\)
[ 10 ] | \(D\) | \(D'\)
[ 15 ] | \(\operatorname{LZJD}\) | NCD

Table 1 Overview of notation used in the literature. (It seems that authors use simple names for their favored notions.)

Conversely, [ 14 , near Definition V.1 ] mentions mutual information.

Remark 2

Ridgway [ 4 ] observed that the entropy-based distance \(D\) is essentially a Jaccard distance. No explanation was given, but we attempt one as follows. Suppose \(X_1,X_2,X_3,X_4\) are iid Bernoulli(\(p=1/2\)) random variables, \(Y\) is the random vector \((X_1,X_2,X_3)\) and \(Z\) is \((X_2,X_3,X_4)\). Then \(Y\) and \(Z\) have mutual information \(I(Y,Z)=2\) bits and entropies \(H(Y)=H(Z)=3\) bits. Thus the relationship \(H(Y,Z)=H(Y)+H(Z)-I(Y,Z)\) becomes a Venn diagram relationship \(\lvert \{ X_1,X_2,X_3,X_4\} \rvert = \lvert \{ X_1,X_2,X_3\} \rvert + \lvert \{ X_2,X_3,X_4\} \rvert - \lvert \{ X_2,X_3\} \rvert \). The relationship to the Jaccard distance may not have been well known, as it is not mentioned in [ 10 , 2 , 13 , 1 ] .

A more general setting is that of STRM (Symmetric Tversky Ratio Models), Definition 17. These are variants of the Tversky index (Definition 14) proposed in [ 7 ] .

0.1.1 Generalities about metrics

Definition 3

Let \(\mathcal X\) be a set. A metric on \(\mathcal X\) is a function \(d : \mathcal X \times \mathcal X \to \mathbb R\) such that

  1. \(d(x, y) \ge 0\),

  2. \(d(x, y) = 0\) if and only if \(x = y\),

  3. \(d(x, y) = d(y, x)\) (symmetry),

  4. \(d(x,y)\le d(x,z)+d(z,y)\) (the triangle inequality)

for all \(x,y,z\in \mathcal X\). If \(d\) satisfies items 1, 2, and 3 but not necessarily item 4, then \(d\) is called a semimetric.

A basic exercise in Definition 3 that we will make use of is Theorem 4.

Theorem 4

If \(d_1\) and \(d_2\) are metrics and \(a,b\) are nonnegative constants, not both zero, then \(ad_1+bd_2\) is a metric.

Proof

Item 1 is immediate from item 1 for \(d_1\) and \(d_2\).

Item 2: Assume \(ad_1(x,y)+bd_2(x,y)=0\). Then \(ad_1(x,y)=0\) and \(bd_2(x,y)=0\). Since \(a,b\) are not both 0, we may assume \(a{\gt}0\). Then \(d_1(x,y)=0\) and hence \(x=y\) by item 2 for \(d_1\).

Item 3: We have \(ad_1(x,y)+bd_2(x,y)=ad_1(y,x)+bd_2(y,x)\) by item 3 for \(d_1\) and \(d_2\).

Item 4: By item 4 for \(d_1\) and \(d_2\) we have

\begin{eqnarray*} ad_1(x,y)+bd_2(x,y)& \le & a(d_1(x,z)+d_1(z,y))+b(d_2(x,z)+d_2(z,y))\\ & =& (ad_1(x,z)+bd_2(x,z)) + (ad_1(z,y)+bd_2(z,y)).\qedhere \end{eqnarray*}
Lemma 5

Let \(d(x,y)\) be a metric and let \(a(x,y)\) be a nonnegative symmetric function. If \(a(x,z)\le a(x,y)+d(y,z)\) for all \(x,y,z\), then \(d'(x,y)=\frac{d(x,y)}{a(x,y)+d(x,y)}\), with \(d'(x,y)=0\) if \(d(x,y)=0\), is a metric.

Proof

As a piece of notation, let us write \(d_{xy}=d(x,y)\) and \(a_{xy}=a(x,y)\). As observed by [ 17 ] , in order to show

\[ \frac{d_{xy}}{a_{xy}+d_{xy}}\le \frac{d_{xz}}{a_{xz}+d_{xz}} + \frac{d_{yz}}{a_{yz}+d_{yz}}, \]

it suffices to show the following pair of inequalities:

\begin{eqnarray} \frac{d_{xy}}{a_{xy}+d_{xy}}& \le & \frac{d_{xz}+d_{yz}}{a_{xy}+d_{xz}+d_{yz}},\label{1}\\ \frac{d_{xz}+d_{yz}}{a_{xy}+d_{xz}+d_{yz}}& \le & \frac{d_{xz}}{a_{xz}+d_{xz}} + \frac{d_{yz}}{a_{yz}+d_{yz}}.\label{2} \end{eqnarray}

Here the first inequality (3) follows from \(d\) being a metric, i.e., \(d_{xy}\le d_{xz}+d_{yz}\), since

\[ c\ge 0,\quad 0{\lt}a\le b\implies \frac{a}{a+c}\le \frac{b}{b+c}. \]

Next, the second inequality (4) would follow from \(a_{xy}+d_{yz}\ge a_{xz}\) and \(a_{xy}+d_{xz}\ge a_{yz}\). By symmetry between \(x\) and \(y\) and since \(a_{xy}=a_{yx}\) by assumption, it suffices to prove the first of these, \(a_{xy}+d_{yz}\ge a_{xz}\), which holds by assumption.

0.1.2 Metrics on a family of finite sets

Lemma 6

For sets \(A,B,C\), we have \(\lvert A\setminus B\rvert \le \lvert A\setminus C\rvert +\lvert C\setminus B\rvert \).

Proof

We have \(A\setminus B\subseteq (A\setminus C)\cup (C\setminus B)\). Therefore, the result follows from the union bound for cardinality.

Lemma 7

Let \(f(A,B)=\lvert A\setminus B\rvert +\lvert B\setminus A\rvert \). Then \(f\) is a metric.

Proof

The most nontrivial part is to prove the triangle inequality,

\[ \lvert A\setminus B\rvert +\lvert B\setminus A\rvert \le \lvert A\setminus C\rvert +\lvert C\setminus A\rvert + \lvert C\setminus B\rvert +\lvert B\setminus C\rvert . \]

By the “rotation identity” \(\lvert A\setminus C\rvert +\lvert C\setminus B\rvert +\lvert B\setminus A\rvert = \lvert A\setminus B\rvert +\lvert B\setminus C\rvert +\lvert C\setminus A\rvert \), this is equivalent to

\[ 2(\lvert A\setminus B\rvert +\lvert B\setminus A\rvert )\le 2(\lvert A\setminus C\rvert + \lvert C\setminus B\rvert + \lvert B\setminus A\rvert ), \]

which is immediate from Lemma 6.

Lemma 8

Let \(f(A,B)=\max \{ \lvert A\setminus B\rvert ,\lvert B\setminus A\rvert \} \). Then \(f\) is a metric.

Proof

For the triangle inequality, we need to show

\[ \max \{ \lvert A\setminus B\rvert ,\lvert B\setminus A\rvert \} \le \max \{ \lvert A\setminus C\rvert ,\lvert C\setminus A\rvert \} + \max \{ \lvert C\setminus B\rvert ,\lvert B\setminus C\rvert \} . \]

By symmetry we may assume that \(\max \{ \lvert A\setminus B\rvert ,\lvert B\setminus A\rvert \} =\lvert A\setminus B\rvert \). Then, the result is immediate from Lemma 6.

For a real number \(\alpha \), we write \(\overline{\alpha }=1-\alpha \). For finite sets \(X,Y\) we define

\[ \tilde m(X,Y)=\min \{ \lvert X\setminus Y\rvert , \lvert Y\setminus X\rvert \} , \]
\[ \tilde M(X,Y)=\max \{ \lvert X\setminus Y\rvert , \lvert Y\setminus X\rvert \} . \]
Lemma 9

Let \(\delta :=\alpha \tilde m +\overline{\alpha }\tilde M\). Let \(X=\{ 0\} , Y=\{ 1\} , Z=\{ 0,1\} \). Then \(\delta (X,Y)=1\), \(\delta (X,Z)=\delta (Y,Z)=\overline{\alpha }\).

The proof of Lemma 9 is an immediate calculation.
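The calculation, and the role of the threshold \(\alpha \le 1/2\) in Theorem 10 below, can also be checked mechanically; a small Python sketch (ours, purely illustrative):

```python
def delta(alpha: float, X: set, Y: set) -> float:
    """delta_alpha(X, Y) = alpha * min(|X \\ Y|, |Y \\ X|) + (1 - alpha) * max(|X \\ Y|, |Y \\ X|)."""
    a, b = len(X - Y), len(Y - X)
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

X, Y, Z = {0}, {1}, {0, 1}
for alpha in (0.25, 0.5, 0.75):
    lhs = delta(alpha, X, Y)                       # equals 1
    rhs = delta(alpha, X, Z) + delta(alpha, Z, Y)  # equals 2 * (1 - alpha)
    print(alpha, lhs <= rhs)                       # True, True, False: fails once alpha > 1/2
```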

Theorem 10

\(\delta _{\alpha }=\alpha \tilde m +\overline{\alpha }\tilde M\) satisfies the triangle inequality if and only if \(0\le \alpha \le 1/2\).

Proof

We first show the only if direction. By Lemma 9 the triangle inequality only holds for the example given there if \(1\le 2\overline{\alpha }\), i.e., \(\alpha \le 1/2\).

Now let us show the if direction. If \(\alpha \le 1/2\) then \(\alpha \le \overline{\alpha }\), so \(\delta _{\alpha }=\alpha (\tilde m + \tilde M) + (\overline{\alpha }-\alpha ) \tilde M\) is a nontrivial nonnegative linear combination. Since \((\tilde m+\tilde M)(A,B)=\lvert A\setminus B\rvert +\lvert B\setminus A\rvert \) (Lemma 7) and \(\tilde M(A,B)=\max \{ \lvert A\setminus B\rvert ,\lvert B\setminus A\rvert \} \) (Lemma 8) are both metrics, the result follows from Theorem 4.

Lemma 11

Suppose \(d\) is a metric on a collection of nonempty sets \(\mathcal X\), with \(d(X,Y)\le 2\) for all \(X,Y\in \mathcal X\). Let \(\hat{\mathcal X}=\mathcal X\cup \{ \emptyset \} \) and define \(\hat d:\hat{\mathcal X}\times \hat{\mathcal X}\to \mathbb R\) by stipulating that for \(X,Y\in \mathcal X\),

\[ \hat d(X,Y)=d(X,Y);\quad \hat d(X,\emptyset )=1=\hat d(\emptyset ,X);\quad \hat d(\emptyset ,\emptyset )=0. \]

Then \(\hat d\) is a metric on \(\hat{\mathcal X}\).

Theorem 12

Let \(f(A,B)\) be a metric such that

\[ \lvert B\setminus A\rvert \le f(A,B) \]

for all \(A,B\). Then the function \(d\) given by

\[ d(A,B)=\begin{cases} \frac{f(A,B)}{\lvert A\cap B\rvert +f(A,B)},& \text{if }\lvert A\cap B\rvert +f(A,B){\gt}0,\\ 0,& \text{otherwise}, \end{cases} \]

is a metric.

Proof

By Lemma 5 (with \(a(X,Y)=\lvert X\cap Y\rvert \)) we only need to verify that for all sets \(A,B,C\),

\[ \lvert A\cap C\rvert +f(A,B)\ge \lvert B\cap C\rvert . \]

And indeed, since tautologically \(B\cap C\subseteq (B\setminus A)\cup (A\cap C)\), by the union bound we have \(\lvert B\cap C\rvert -\lvert A\cap C\rvert \le \lvert B\setminus A\rvert \le f(A,B)\).

Theorem 13

Let \(f(A,B)=m\min \{ \lvert A\setminus B\rvert ,\lvert B\setminus A\rvert \} +M\max \{ \lvert A\setminus B\rvert ,\lvert B\setminus A\rvert \} \) with \(0{\lt}m\le M\) and \(1\le M\). Then the function \(d\) given by

\[ d(A,B)= \begin{cases} \frac{f(A,B)}{\lvert A\cap B\rvert + f(A,B)},& \text{if } A\cup B\ne \emptyset ,\\ 0,& \text{otherwise}, \end{cases} \]

is a metric.

Proof

We have \(f(A,B)=(m+M) \delta _{\alpha }(A,B)\) where \(\alpha =\frac{m}{m+M}\). Since \(m\le M\), \(\alpha \le 1/2\), so \(f\) satisfies the triangle inequality by Theorem 10. Since \(m{\gt}0\), in fact \(f\) is a metric. Using \(M\ge 1\),

\[ f(A,B)\ge M\max \{ \lvert A\setminus B\rvert ,\lvert B\setminus A\rvert \} \ge M \lvert B\setminus A\rvert \ge \lvert B\setminus A\rvert , \]

so that by Theorem 12, \(d\) is a metric.

0.1.3 Tversky indices

Definition 14 [ 18 ]

For sets \(X\) and \(Y\) the Tversky index with parameters \(\alpha ,\beta \ge 0\) is a number between 0 and 1 given by

\[ S(X, Y) = \frac{\lvert X \cap Y\rvert }{\lvert X \cap Y\rvert + \alpha \lvert X \setminus Y\rvert + \beta \lvert Y \setminus X\rvert }. \]

We also define the corresponding Tversky dissimilarity \(d^T_{\alpha ,\beta }\) by

\[ d^{T}_{\alpha ,\beta }(X,Y) = \begin{cases} 1-S(X,Y)& \text{if }X\cup Y\ne \emptyset ;\\ 0& \text{if }X=Y=\emptyset . \end{cases} \]
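A direct Python transcription of Definition 14, for experimentation (function names are ours; we assume the denominator of \(S\) is positive, e.g. \(X\cap Y\ne \emptyset \) or \(\alpha ,\beta {\gt}0\)):

```python
def tversky_index(X: set, Y: set, alpha: float, beta: float) -> float:
    """Tversky index S(X, Y) with parameters alpha, beta >= 0 (denominator assumed positive)."""
    inter = len(X & Y)
    return inter / (inter + alpha * len(X - Y) + beta * len(Y - X))

def tversky_dissimilarity(X: set, Y: set, alpha: float, beta: float) -> float:
    """Tversky dissimilarity d^T_{alpha,beta}(X, Y)."""
    return 0.0 if not (X | Y) else 1 - tversky_index(X, Y, alpha, beta)

X, Y = {1, 2}, {2, 3, 4}
print(tversky_dissimilarity(X, Y, 1.0, 1.0))  # 1 - 1/(1 + 1 + 2) = 0.75
```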
Definition 15

The Szymkiewicz–Simpson coefficient is defined by

\[ \operatorname {overlap}(X,Y) = \frac{\lvert X \cap Y\rvert }{\min (\lvert X\rvert ,\lvert Y\rvert )} \]

We may note that \(\operatorname {overlap}(X,Y)=1\) whenever \(X\subseteq Y\) or \(Y\subseteq X\), so that \(1-\operatorname {overlap}\) is not a metric.

Definition 16

The Sørensen–Dice coefficient is defined by

\[ \frac{2\lvert X\cap Y\rvert }{\lvert X\rvert +\lvert Y\rvert }. \]
Definition 17 [ 7 , Section 2 ]

Let \(\mathcal X\) be a collection of finite sets. The symmetric Tversky ratio model is the function \(\mathbf{strm}:\mathcal X\times \mathcal X\to \mathbb R\) defined by

\[ \mathbf{strm}(X,Y)=\frac{| X \cap Y |+\mathrm{bias}}{| X \cap Y |+\mathrm{bias}+\beta \left(\alpha \tilde m+(1-\alpha )\tilde M\right)} \]

The unbiased symmetric TRM (\(\mathbf{ustrm}\)) is the case where \(\mathrm{bias}=0\); we assume this case for the rest of this paper. The Tversky semimetric \(D_{\alpha ,\beta }\) is defined by \(D_{\alpha ,\beta }(X,Y)=1-\mathbf{ustrm}(X,Y)\), or more precisely

\[ D_{\alpha ,\beta }(X,Y) = \begin{cases} \beta \frac{\alpha \tilde m + (1-\alpha )\tilde M}{\lvert X\cap Y\rvert +\beta (\alpha \tilde m + (1-\alpha )\tilde M)}, & \text{if }X\cup Y\ne \emptyset ;\\ 0 & \text{if }X=Y=\emptyset . \end{cases} \]

Note that for \(\alpha = 1/2\), \(\beta = 1\), the STRM is equivalent to the Sørensen–Dice coefficient. Similarly, for \(\alpha = 1/2\), \(\beta = 2\), it is equivalent to Jaccard’s coefficient.
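The two special cases just mentioned are easy to check numerically. The following Python sketch (ours, not part of the formalization) implements the unbiased \(D_{\alpha ,\beta }\) and compares it with the Jaccard and Sørensen–Dice expressions.

```python
def D(alpha: float, beta: float, X: set, Y: set) -> float:
    """Tversky semimetric D_{alpha,beta}(X, Y) = 1 - ustrm(X, Y), with bias = 0."""
    if not (X | Y):
        return 0.0
    a, b = len(X - Y), len(Y - X)
    blend = alpha * min(a, b) + (1 - alpha) * max(a, b)
    return beta * blend / (len(X & Y) + beta * blend)

X, Y = {1, 2, 3}, {3, 4}
# alpha = 1/2, beta = 2: the Jaccard distance 1 - |X ∩ Y| / |X ∪ Y|.
print(D(0.5, 2, X, Y), 1 - len(X & Y) / len(X | Y))             # 0.75 0.75
# alpha = 1/2, beta = 1: one minus the Sørensen–Dice coefficient.
print(D(0.5, 1, X, Y), 1 - 2 * len(X & Y) / (len(X) + len(Y)))  # 0.6 0.6
```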

0.2 Tversky metrics

Theorem 18

The function \(D_{\alpha ,\beta }\) is a metric only if \(\beta \ge 1/(1-\alpha )\).

Proof

Recall that with \(D=D_{\alpha ,\beta }\),

\[ D(X,Y)=\frac{\beta \delta }{\lvert X\cap Y\rvert +\beta \delta }. \]

By Lemma 9, for the example given there we have

\begin{eqnarray*} D(X,Y) & =& \frac{\beta \cdot 1}{0+\beta \cdot 1} = 1, \\ D(X,Z) = D(Y,Z) & =& \frac{\beta \cdot \overline{\alpha }}{1+\beta \cdot \overline{\alpha }}. \end{eqnarray*}

The triangle inequality is then equivalent to:

\[ 1\le 2\frac{\beta \overline{\alpha }}{1+\beta \overline{\alpha }}\iff \beta \overline{\alpha }\ge 1 \iff \beta \ge 1/(1-\alpha ).\qedhere \]

In Theorem 19 we use the interval notation on \(\mathbb N\), given by \([a,a]=\{ a\} \) and \([a,b]=[a,b-1]\cup \{ b\} \).

Theorem 19

The function \(D_{\alpha ,\beta }\) is a metric on all finite power sets only if \(\alpha \le 1/2\).

Proof

Suppose \(\alpha {\gt}1/2\). Then \(2\overline{\alpha }{\lt}1\). Let \(n\) be an integer with \(n{\gt}\frac{\beta \overline{\alpha }}{1-2\overline{\alpha }}\). Let \(X_n=[0,n]\), and \(Y_n=[1,n+1]\), and \(Z_n=[1,n]\). The triangle inequality says

\begin{eqnarray*} \beta \frac{1}{n+\beta \cdot 1} = D(X_n,Y_n)& \le & D(X_n,Z_n)+D(Z_n,Y_n)=2\beta \frac{\overline{\alpha }}{n+\beta \overline{\alpha }} \\ n+\beta \overline{\alpha }& \le & 2 \overline{\alpha }(n+\beta ) \\ n(1-2\overline{\alpha })& \le & \beta \overline{\alpha }\end{eqnarray*}

Since \(n(1-2\overline{\alpha }){\gt}\beta \overline{\alpha }\) by the choice of \(n\), the triangle inequality does not hold, so \(D_{\alpha ,\beta }\) is not a metric on the power set of \([0,n+1]\).
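The counterexample in this proof can be verified numerically; a Python sketch (ours), taking \(\alpha =3/4\), \(\beta =2\) and \(n=5\) as one admissible choice:

```python
def D(alpha: float, beta: float, X: set, Y: set) -> float:
    """Unbiased Tversky semimetric D_{alpha,beta}, as in Definition 17."""
    if not (X | Y):
        return 0.0
    a, b = len(X - Y), len(Y - X)
    blend = alpha * min(a, b) + (1 - alpha) * max(a, b)
    return beta * blend / (len(X & Y) + beta * blend)

alpha, beta, n = 0.75, 2.0, 5   # alpha > 1/2 and n > beta*(1-alpha)/(1 - 2*(1-alpha)) = 1
Xn, Yn, Zn = set(range(0, n + 1)), set(range(1, n + 2)), set(range(1, n + 1))
print(D(alpha, beta, Xn, Yn))                           # 2/7 ≈ 0.286
print(D(alpha, beta, Xn, Zn) + D(alpha, beta, Zn, Yn))  # 2/11 ≈ 0.182, so the triangle inequality fails
```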

Theorem 20

Let \(0\le \alpha \le 1\) and \(\beta {\gt} 0\). Then \(D_{\alpha ,\beta }\) is a metric if and only if \(0\le \alpha \le 1/2\) and \(\beta \ge 1/(1-\alpha )\).

Proof

Theorem 18 and Theorem 19 give the necessary condition. Since

\[ D_{\alpha ,\beta }(X,Y) = \begin{cases} \beta \frac{\alpha \tilde m + (1-\alpha )\tilde M}{\lvert X\cap Y\rvert +\beta (\alpha \tilde m + (1-\alpha )\tilde M)}, & \text{if }X\cup Y\ne \emptyset ,\\ 0 & \text{otherwise}, \end{cases} \]

where \(\tilde m\) is the minimum of the set differences and \(\tilde M\) is the maximum, we can let \(f(A,B)=\beta (\alpha \tilde m(A,B)+(1-\alpha )\tilde M(A,B))\). Then with the constants \(m=\beta \alpha \) and \(M=\beta \overline{\alpha }\), we can apply Theorem 13.

We have formally proved Theorem 20 in the Lean theorem prover. The GitHub repository can be found at [ 8 ] .
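Independently of the Lean proof, the parameter region of Theorem 20 can be sanity-checked by brute force over all subsets of a small universe; a Python sketch (ours, purely illustrative and of course no substitute for the proof):

```python
from itertools import combinations, product

def D(alpha: float, beta: float, X: frozenset, Y: frozenset) -> float:
    """Unbiased Tversky semimetric D_{alpha,beta}."""
    if not (X | Y):
        return 0.0
    a, b = len(X - Y), len(Y - X)
    blend = alpha * min(a, b) + (1 - alpha) * max(a, b)
    return beta * blend / (len(X & Y) + beta * blend)

def triangle_ok(alpha: float, beta: float, universe=range(4), tol=1e-9) -> bool:
    """Exhaustively test the triangle inequality on the power set of a small universe."""
    subsets = [frozenset(c) for r in range(len(universe) + 1)
               for c in combinations(universe, r)]
    return all(D(alpha, beta, X, Y) <= D(alpha, beta, X, Z) + D(alpha, beta, Z, Y) + tol
               for X, Y, Z in product(subsets, repeat=3))

print(triangle_ok(0.5, 2.0))   # inside the region of Theorem 20: True
print(triangle_ok(0.5, 1.0))   # beta < 1/(1 - alpha) = 2: a violation exists, so False
```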

0.2.1 A converse to Gragera and Suppakitpaisarn

Theorem 21 Gragera and Suppakitpaisarn [ 5 , 6 ]

The optimal constant \(\rho \) such that \(d^T_{\alpha ,\beta }(X,Z)\le \rho (d^T_{\alpha ,\beta }(X,Y)+d^T_{\alpha ,\beta }(Y,Z))\) for all \(X,Y,Z\) is

\[ \frac12\left(1+\sqrt{\frac{1}{\alpha \beta }}\right). \]
Corollary 22

\(d^T_{\alpha ,\beta }\) is a metric only if \(\alpha =\beta \ge 1\).

Proof

Clearly, \(\alpha =\beta \) is necessary to ensure \(d^T_{\alpha ,\beta }(X,Y)=d^T_{\alpha ,\beta }(Y,X)\). Moreover \(\rho \le 1\) is necessary, so Theorem 21 gives \(\alpha \beta \ge 1\).

Theorem 23 gives the converse to Corollary 22, which was inspired by Gragera and Suppakitpaisarn:

Theorem 23

The Tversky dissimilarity \(d^T_{\alpha ,\beta }\) is a metric iff \(\alpha =\beta \ge 1\).

Proof

Suppose the Tversky dissimilarity \(d^T_{\alpha ,\beta }\) is a semimetric. Let \(X,Y\) be sets with \(\lvert X\cap Y\rvert =\lvert X\setminus Y\rvert =1\) and \(\lvert Y\setminus X\rvert =0\). Then

\[ 1-\frac1{1+\beta } = d^T_{\alpha ,\beta }(Y,X) = d^T_{\alpha ,\beta }(X,Y)=1-\frac{1}{1+\alpha }, \]

hence \(\alpha =\beta \). Let \(\gamma =\alpha =\beta \).

Now, \(d^T_{\gamma ,\gamma }=D_{\alpha _0,\beta _0}\) where \(\alpha _0=1/2\) and \(\beta _0 = 2\gamma \). Indeed, with \(\tilde m=\min \{ \lvert X\setminus Y\rvert , \lvert Y\setminus X\rvert \} \) and \(\tilde M=\max \{ \lvert X\setminus Y\rvert , \lvert Y\setminus X\rvert \} \), we compute

\[ D_{\alpha _0,\beta _0} = \beta _0\frac{ \alpha _0 \tilde m + (1-\alpha _0)\tilde M }{ \lvert X\cap Y\rvert + \beta _0\left[\alpha _0 \tilde m + (1-\alpha _0)\tilde M\right] }, \]
\[ D_{\frac12,2\gamma } = 2\gamma \frac{ \frac12 \tilde m + (1-\frac12)\tilde M }{ \lvert X\cap Y\rvert + 2\gamma \left[\frac12 \tilde m + (1-\frac12)\tilde M\right] } \]
\[ = \gamma \frac{ \lvert X\setminus Y\rvert +\lvert Y\setminus X\rvert }{ \lvert X\cap Y\rvert + \gamma \left[\lvert X\setminus Y\rvert +\lvert Y\setminus X\rvert \right] } = 1 - \frac{\lvert X \cap Y\rvert }{\lvert X \cap Y\rvert + \gamma \lvert X \setminus Y\rvert + \gamma \lvert Y \setminus X\rvert } = d^T_{\gamma ,\gamma }. \]

By Theorem 20, \(d^T_{\gamma ,\gamma }\) is a metric if and only if \(\beta _0\ge 1/(1-\alpha _0)\). This is equivalent to \(2\gamma \ge 2\), i.e., \(\gamma \ge 1\).

The question settled by Theorem 23 does not arise in Gragera and Suppakitpaisarn’s work, as they require \(\alpha ,\beta \le 1\) in their definition of the Tversky index. We note that Tversky [ 18 ] only required \(\alpha ,\beta \ge 0\).

0.3 Lebesgue-style metrics

Incidentally, the names of \(J_1\) and \(J_{\infty }\) come from the observation that they are special cases of \(J_p\) given by

\[ J_p(A,B)= \left(2\cdot \frac{ \lvert B\setminus A\rvert ^p+\lvert A\setminus B\rvert ^p }{ \lvert A\rvert ^p+\lvert B\rvert ^p+\lvert B\setminus A\rvert ^p+\lvert A\setminus B\rvert ^p } \right)^{1/p} = \begin{cases} J_1(A,B) & p=1\\ J_{\infty }(A,B) & p\to \infty \end{cases} \]

which was suggested in [ 9 ] as another possible means of interpolating between \(J_1\) and \(J_{\infty }\). We still conjecture that \(J_2\) is a metric, but shall not attempt to prove it here. However:

Theorem 24

\(J_3\) is not a metric.
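We do not reproduce a counterexample here, but one can look for violations of the triangle inequality by brute force. The following Python sketch (ours) searches over subsets of a small universe and prints the first violating triple it finds; since the other metric axioms clearly hold for \(J_3\), Theorem 24 guarantees that some finite violation exists, though the universe may need to be enlarged for the search to succeed.

```python
from itertools import combinations, product

def Jp(A: frozenset, B: frozenset, p: float) -> float:
    """J_p(A, B) as defined above; assumes A ∪ B ≠ ∅ so the denominator is positive."""
    a, b = len(A - B), len(B - A)
    num = 2 * (b ** p + a ** p)
    den = len(A) ** p + len(B) ** p + b ** p + a ** p
    return (num / den) ** (1 / p)

universe = range(6)
subsets = [frozenset(c) for r in range(1, len(universe) + 1)
           for c in combinations(universe, r)]
for A, B, C in product(subsets, repeat=3):
    if Jp(A, B, 3) > Jp(A, C, 3) + Jp(C, B, 3) + 1e-9:
        print("violation:", sorted(A), sorted(B), sorted(C))
        break
else:
    print("no violation among subsets of", list(universe), "- try a larger universe")
```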

Because of Theorem 24, we searched for a better version of \(J_p\), and found \(\mathcal V_p\):

Definition 25

For each \(1\le p\le \infty \), let¹

\begin{eqnarray*} \Delta _p(A,B)& =& (\lvert B\setminus A\rvert ^p+\lvert A\setminus B\rvert ^p)^{1/p},\text{ and} \\ \mathcal V_p(A,B)& =& \frac{\Delta _p(A,B)}{\lvert A\cap B\rvert + \Delta _p(A,B)}. \end{eqnarray*}

We have \(\mathcal V_1=J_1\) and \(\mathcal V_{\infty }:=\lim _{p\to \infty }\mathcal V_p=J_{\infty }\).

In a way what is going on here is that we consider \(L^p\) spaces instead of

\[ \frac1p L^1 + \left(1-\frac1p\right)L^{\infty } \]

spaces.

Theorem 26

For each \(1\le p\le \infty \), \(\Delta _p\) is a metric.

Theorem 27

For each \(1\le p\le \infty \), \(\mathcal V_p\) is a metric.

Proof

By Theorem 26 and Theorem 12, we only have to check \(\lvert B\setminus A\rvert \le \Delta _p(A,B)\), which is immediate for \(1\le p\le \infty \).

Of special interest may be \(\mathcal V_2\) as a canonical interpolant between \(\mathcal V_1\), the Jaccard distance, and \(\mathcal V_{\infty }=J_{\infty }\), the analogue of the NID. If \(\lvert B\setminus A\rvert =3\), \(\lvert A\setminus B\rvert =4\), and \(\lvert A\cap B\rvert =5\), then

\begin{eqnarray*} \mathcal V_1(A,B)& =& 7/12,\\ \mathcal V_2(A,B)& =& 1/2,\\ \mathcal V_{\infty }(A,B)& =& 4/9. \end{eqnarray*}

Note that if \(A\subseteq B\) then \(\mathcal V_p(A,B)=\mathcal V_1(A,B)\) for all \(p\).
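A quick Python check of Definition 25 against the worked example above; the function works directly from the three relevant cardinalities (the signature is ours):

```python
import math

def V(p: float, a_minus_b: int, b_minus_a: int, inter: int) -> float:
    """V_p from the cardinalities |A \\ B|, |B \\ A| and |A ∩ B|; p = math.inf gives V_infinity."""
    if p == math.inf:
        delta = max(a_minus_b, b_minus_a)
    else:
        delta = (b_minus_a ** p + a_minus_b ** p) ** (1 / p)
    return delta / (inter + delta)

# |B \ A| = 3, |A \ B| = 4, |A ∩ B| = 5, as above.
print(V(1, 4, 3, 5))          # 7/12 ≈ 0.583
print(V(2, 4, 3, 5))          # 1/2
print(V(math.inf, 4, 3, 5))   # 4/9 ≈ 0.444
# If A ⊆ B then |A \ B| = 0 and V_p does not depend on p:
print(V(1, 0, 3, 5) == V(2, 0, 3, 5) == V(math.inf, 0, 3, 5))  # True
```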

0.4 Conclusion and applications

Many researchers have considered metrics based on sums or maxima, but we have shown that these need not be considered in isolation: they form the endpoints of a family of metrics.

As an example, the mutations of spike glycoproteins of coronaviruses are of interest in connection with diseases such as COVID-19. We calculated several distance measures between peptide sequences for such proteins. The distance

\[ Z_{2,\alpha }(x_0,x_1)=\alpha \min (\lvert A_0\rvert ,\lvert A_1\rvert )+\overline{\alpha }\max (\lvert A_0\rvert ,\lvert A_1\rvert ), \]

where \(A_i\) is the set of subwords of length 2 occurring in \(x_i\) but not in \(x_{1-i}\), counts how many subwords of length 2 appear in one sequence and not the other.
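A sketch of this 2-gram distance in Python (ours; we treat the sequences simply as strings, and the example inputs are artificial):

```python
def two_grams(x: str) -> set:
    """The set of length-2 subwords (2-grams) of the string x."""
    return {x[i:i + 2] for i in range(len(x) - 1)}

def Z2(alpha: float, x0: str, x1: str) -> float:
    """Z_{2,alpha}: blend of the sizes of the sets of 2-grams unique to each sequence."""
    A0 = two_grams(x0) - two_grams(x1)   # 2-grams in x0 but not in x1
    A1 = two_grams(x1) - two_grams(x0)   # 2-grams in x1 but not in x0
    return alpha * min(len(A0), len(A1)) + (1 - alpha) * max(len(A0), len(A1))

print(Z2(0.25, "MFVFLVLLPL", "MFVFFVLLPL"))  # 0.25 * 1 + 0.75 * 2 = 1.75
```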

We used the Ward linkage criterion to produce Newick trees with the hclust package for the Go programming language. The calculated phylogenetic trees were based on the metric \(Z_{2,\alpha }\).

We found one tree isomorphism class for each of the ranges \(0\le \alpha \le 0.21\), \(0.22\le \alpha \le 0.36\), and \(0.37\le \alpha \le 0.5\). We see that the various intervals for \(\alpha \) can correspond to “better” or “worse” agreement with other distance measures. Thus, we propose that rather than focusing on \(\alpha =0\) and \(\alpha =1/2\) exclusively, future work may consider the whole interval \([0,1/2]\).

  1. Here, \(\mathcal V\) can stand for Paul M. B. Vitányi, who introduced the author to the normalized information distance at a Dagstuhl workshop in 2006.