Jaccard

Bjørn Kjos-Hanssen

This project contains an excerpt from the paper “Interpolating between the Jaccard distance and an analogue of the normalized information distance”. That paper is unique in that a majority of its results were formalized at the time of publication. Therefore it is especially suitable for a Lean blueprint project.

Abstract.

Jiménez, Becerra, and Gelbukh (2013) defined a family of “symmetric Tversky ratio models” $S_{\alpha,\beta}$, $0\le\alpha\le 1$, $\beta>0$. Each function $D_{\alpha,\beta}=1-S_{\alpha,\beta}$ is a semimetric on the powerset of a given finite set.

We show that $D_{\alpha,\beta}$ is a metric if and only if $0\le\alpha\le\frac12$ and $\beta\ge 1/(1-\alpha)$. This result is formally verified in the Lean proof assistant.

The extreme points of this parametrized space of metrics are $V_1=D_{1/2,2}$, the Jaccard distance, and $V_\infty=D_{0,1}$, an analogue of the normalized information distance of M. Li, Chen, X. Li, Ma, and Vitányi (2004).

As a second interpolation, we also show that $V_p$ is a metric for $1\le p\le\infty$, where

\[
\Delta_p(A,B)=\left(|B\setminus A|^p+|A\setminus B|^p\right)^{1/p},\qquad
V_p(A,B)=\frac{\Delta_p(A,B)}{|A\cap B|+\Delta_p(A,B)}.
\]

0.1 Introduction

Distance measures (metrics) are used in a wide variety of scientific contexts. In bioinformatics, M. Li, Badger, Chen, Kwong, and Kearney [13] introduced an information-based sequence distance. In an information-theoretic setting, M. Li, Chen, X. Li, Ma, and Vitányi [14] rejected the distance of [13] in favor of a normalized information distance (NID). The Encyclopedia of Distances [3] describes the NID, on page 205, as

\[
\frac{\max\{K(x\mid y^*),\,K(y\mid x^*)\}}{\max\{K(x),K(y)\}}
\]

where $K(x\mid y^*)$ is the Kolmogorov complexity of $x$ given a shortest program $y^*$ to compute $y$. It is equivalent to be given $y$ itself in hard-coded form:

\[
\frac{\max\{K(x\mid y),\,K(y\mid x)\}}{\max\{K(x),K(y)\}}
\]

Another formulation (see [14, page 8]) is

\[
\frac{K(x,y)-\min\{K(x),K(y)\}}{\max\{K(x),K(y)\}}.
\]

The fact that the NID is in a sense a normalized metric is proved in [14]. Then in 2017, while studying malware detection, Raff and Nicholas [15] suggested the Lempel–Ziv Jaccard distance (LZJD) as a practical alternative to the NID. As we shall see, this is a metric. In a way this constitutes a full circle: the distance in [13] is itself essentially a Jaccard distance, and the LZJD is related to it as Lempel–Ziv complexity is to Kolmogorov complexity. In the present paper we aim to shed light on this back-and-forth by showing that the NID and Jaccard distances constitute the endpoints of a parametrized family of metrics.

For comparison, the Jaccard distance between two sets $X$ and $Y$, and our analogue of the NID, are as follows:

\[
J_1(X,Y)=\frac{|X\setminus Y|+|Y\setminus X|}{|X\cup Y|}=1-\frac{|X\cap Y|}{|X\cup Y|},\qquad
J_\infty(X,Y)=\frac{\max\{|X\setminus Y|,\,|Y\setminus X|\}}{\max\{|X|,\,|Y|\}}.
\]
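As a concrete illustration (the sets below are invented for the example, not taken from the paper), both distances can be computed in a few lines of Python:

```python
from fractions import Fraction

def j1(x, y):
    """Jaccard distance: (|X \\ Y| + |Y \\ X|) / |X ∪ Y|."""
    return Fraction(len(x - y) + len(y - x), len(x | y))

def j_inf(x, y):
    """Analogue of the NID: max of the set differences over max of the sizes."""
    return Fraction(max(len(x - y), len(y - x)), max(len(x), len(y)))

x, y = {1, 2, 3}, {2, 3, 4, 5}
print(j1(x, y))     # 3/5
print(j_inf(x, y))  # 1/2
```

Exact rational arithmetic via `Fraction` avoids any floating-point noise in such small examples.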

Our main result, Theorem 20, shows which interpolations between these two are metrics. The way we arrived at $J_\infty$ as an analogue of the NID is via Lempel–Ziv complexity. While there are several variants [12, 19, 20], the LZ 1978 complexity [20] of a sequence is the cardinality of a certain set, the dictionary.

Definition 1

Let $\mathrm{LZSet}(A)$ be the Lempel–Ziv dictionary for a sequence $A$. We define the LZ–Jaccard distance $\mathrm{LZJD}$ by

\[
\mathrm{LZJD}(A,B)=1-\frac{|\mathrm{LZSet}(A)\cap\mathrm{LZSet}(B)|}{|\mathrm{LZSet}(A)\cup\mathrm{LZSet}(B)|}.
\]
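The following Python sketch is one plausible reading of this definition, not the implementation of [15]; in particular, LZ78 variants differ in how they treat a leftover phrase at the end of the input, which is simply dropped here:

```python
def lz_set(s):
    """LZ78-style dictionary: greedily extend the current phrase until it is new."""
    phrases = set()
    w = ""
    for c in s:
        w += c
        if w not in phrases:
            phrases.add(w)
            w = ""
    return phrases  # a leftover phrase already in the dictionary is ignored

def lzjd(a, b):
    """LZ-Jaccard distance between two strings."""
    sa, sb = lz_set(a), lz_set(b)
    return 1 - len(sa & sb) / len(sa | sb)

print(lz_set("abab"))        # {'a', 'b', 'ab'}
print(lzjd("abab", "abba"))  # 0.5
```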

It is shown in [13, Theorem 1] that the triangle inequality holds for a function which they call an information-based sequence distance. Later papers give it the notation $d_s$ in [14, Definition V.1], and call their normalized information distance $d$. Raff and Nicholas [15] introduced the LZJD and did not discuss the appearance of $d_s$ in [14, Definition V.1], even though they do cite [14] (but not [13]).

Kraskov et al. [11, 10] use $D$ and $D'$ for continuous analogues of $d_s$ and $d$ in [14] (which they cite). The Encyclopedia calls it the normalized information metric,

\[
\frac{H(X\mid Y)+H(Y\mid X)}{H(X,Y)}=1-\frac{I(X;Y)}{H(X,Y)}
\]

or Rajski distance [16].

This $d_s$ was called $d$ by [13]; see Table 1.

| Reference | Jaccard notation | NID notation |
| --------- | ---------------- | ------------ |
| [13]      | $d$              |              |
| [14]      | $d_s$            | $d$          |
| [10]      | $D$              | $D'$         |
| [15]      | LZJD             | NCD          |

Table 1: Overview of notation used in the literature. (It seems that authors use simple names for their favored notions.)

Conversely, [14, near Definition V.1] mentions mutual information.

Remark 2

Ridgway [4] observed that the entropy-based distance $D$ is essentially a Jaccard distance. No explanation was given, but we attempt one as follows. Suppose $X_1,X_2,X_3,X_4$ are i.i.d. Bernoulli($p=1/2$) random variables, $Y$ is the random vector $(X_1,X_2,X_3)$, and $Z$ is $(X_2,X_3,X_4)$. Then $Y$ and $Z$ have two bits of mutual information, $I(Y;Z)=2$, and three bits of entropy each, $H(Y)=H(Z)=3$. Thus the relationship $H(Y,Z)=H(Y)+H(Z)-I(Y;Z)$ becomes the Venn diagram relationship
\[
|\{X_1,X_2,X_3,X_4\}|=|\{X_1,X_2,X_3\}|+|\{X_2,X_3,X_4\}|-|\{X_2,X_3\}|.
\]
The relationship to Jaccard distance may not have been well known, as it is not mentioned in [10, 2, 13, 1].
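This entropy identity can be checked exactly by enumeration; the following Python snippet (an illustration, not part of the paper) computes $H(Y)$, $H(Z)$, and $H(Y,Z)$ over the 16 equally likely outcomes:

```python
from itertools import product
from math import log2
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of equally likely samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

outcomes = list(product([0, 1], repeat=4))  # all values of (X1, X2, X3, X4)
ys = [(x1, x2, x3) for (x1, x2, x3, x4) in outcomes]
zs = [(x2, x3, x4) for (x1, x2, x3, x4) in outcomes]
yz = list(zip(ys, zs))

h_y, h_z, h_yz = entropy(ys), entropy(zs), entropy(yz)
print(h_y, h_z, h_yz)    # 3.0 3.0 4.0
print(h_y + h_z - h_yz)  # 2.0, the mutual information I(Y;Z)
```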

A more general setting is that of symmetric Tversky ratio models (STRMs), Definition 17. These are variants of the Tversky index (Definition 14) proposed in [7].

0.1.1 Generalities about metrics

Definition 3

Let $X$ be a set. A metric on $X$ is a function $d:X\times X\to\mathbb{R}$ such that

  1. $d(x,y)\ge 0$,

  2. $d(x,y)=0$ if and only if $x=y$,

  3. $d(x,y)=d(y,x)$ (symmetry),

  4. $d(x,y)\le d(x,z)+d(z,y)$ (the triangle inequality)

for all $x,y,z\in X$. If $d$ satisfies (1), (2), and (3) but not necessarily (4), then $d$ is called a semimetric.

A basic exercise in Definition 3 that we will make use of is Theorem 4.

Theorem 4

If $d_1$ and $d_2$ are metrics and $a,b$ are nonnegative constants, not both zero, then $ad_1+bd_2$ is a metric.

Lemma 5

Let $d(x,y)$ be a metric and let $a(x,y)$ be a nonnegative symmetric function. If $a(x,z)\le a(x,y)+d(y,z)$ for all $x,y,z$, then $d'(x,y)=\dfrac{d(x,y)}{a(x,y)+d(x,y)}$, with $d'(x,y)=0$ if $d(x,y)=0$, is a metric.


0.1.2 Metrics on a family of finite sets

Lemma 6

For sets $A,B,C$, we have $|A\setminus B|\le|A\setminus C|+|C\setminus B|$.

Lemma 7

Let $f(A,B)=|A\setminus B|+|B\setminus A|$. Then $f$ is a metric.

Lemma 8

Let $f(A,B)=\max\{|A\setminus B|,\,|B\setminus A|\}$. Then $f$ is a metric.


For a real number $\alpha$, we write $\bar\alpha=1-\alpha$. For finite sets $X,Y$ we define

\[
\tilde m(X,Y)=\min\{|X\setminus Y|,\,|Y\setminus X|\},\qquad
\tilde M(X,Y)=\max\{|X\setminus Y|,\,|Y\setminus X|\}.
\]
Lemma 9

Let $\delta:=\alpha\tilde m+\bar\alpha\tilde M$. Let $X=\{0\}$, $Y=\{1\}$, $Z=\{0,1\}$. Then $\delta(X,Y)=1$ and $\delta(X,Z)=\delta(Y,Z)=\bar\alpha$.

The proof of Lemma 9 is an immediate calculation.
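For a numerical sanity check of Lemma 9 (illustrative Python; $\alpha=0.25$ is an arbitrary sample value):

```python
def delta(alpha, x, y):
    """delta_alpha = alpha * min + (1 - alpha) * max of the two set differences."""
    m = min(len(x - y), len(y - x))
    M = max(len(x - y), len(y - x))
    return alpha * m + (1 - alpha) * M

X, Y, Z = {0}, {1}, {0, 1}
alpha = 0.25
print(delta(alpha, X, Y))  # 1.0
print(delta(alpha, X, Z))  # 0.75, i.e. 1 - alpha
# The triangle inequality delta(X,Y) <= delta(X,Z) + delta(Z,Y) reads 1 <= 2*(1 - alpha),
# which holds precisely when alpha <= 1/2, foreshadowing Theorem 10.
```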

Theorem 10

$\delta_\alpha=\alpha\tilde m+\bar\alpha\tilde M$ satisfies the triangle inequality if and only if $0\le\alpha\le 1/2$.

Lemma 11

Suppose $d$ is a metric on a collection $\mathcal X$ of nonempty sets, with $d(X,Y)\le 2$ for all $X,Y\in\mathcal X$. Let $\hat{\mathcal X}=\mathcal X\cup\{\emptyset\}$ and define $\hat d:\hat{\mathcal X}\times\hat{\mathcal X}\to\mathbb R$ by stipulating that for $X,Y\in\mathcal X$,

\[
\hat d(X,Y)=d(X,Y);\qquad \hat d(X,\emptyset)=1=\hat d(\emptyset,X);\qquad \hat d(\emptyset,\emptyset)=0.
\]

Then $\hat d$ is a metric on $\hat{\mathcal X}$.

Theorem 12

Let $f(A,B)$ be a metric such that

\[
|B\setminus A|\le f(A,B)
\]

for all $A,B$. Then the function $d$ given by

\[
d(A,B)=\begin{cases}\dfrac{f(A,B)}{|A\cap B|+f(A,B)},&\text{if }|A\cap B|+f(A,B)>0,\\[1ex] 0,&\text{otherwise,}\end{cases}
\]

is a metric.

Theorem 13

Let $f(A,B)=m\min\{|A\setminus B|,\,|B\setminus A|\}+M\max\{|A\setminus B|,\,|B\setminus A|\}$ with $0<m\le M$ and $1\le M$. Then the function $d$ given by

\[
d(A,B)=\begin{cases}\dfrac{f(A,B)}{|A\cap B|+f(A,B)},&\text{if }A\ne B,\\[1ex] 0,&\text{otherwise,}\end{cases}
\]

is a metric.


0.1.3 Tversky indices

Definition 14 [18]

For sets $X$ and $Y$, the Tversky index with parameters $\alpha,\beta\ge 0$ is a number between 0 and 1 given by

\[
S(X,Y)=\frac{|X\cap Y|}{|X\cap Y|+\alpha|X\setminus Y|+\beta|Y\setminus X|}.
\]

We also define the corresponding Tversky dissimilarity $d^T_{\alpha,\beta}$ by

\[
d^T_{\alpha,\beta}(X,Y)=\begin{cases}1-S(X,Y)&\text{if }X\cup Y\ne\emptyset;\\ 0&\text{if }X=Y=\emptyset.\end{cases}
\]
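A small Python rendering of Definition 14 (an illustration with an invented example pair), which also confirms that $\alpha=\beta=1$ recovers the Jaccard distance:

```python
from fractions import Fraction

def tversky_dissimilarity(alpha, beta, x, y):
    """d^T_{alpha,beta} = 1 - S(X,Y), with d = 0 when X = Y = empty set."""
    if not x and not y:
        return Fraction(0)
    inter = len(x & y)
    s = Fraction(inter, inter + alpha * len(x - y) + beta * len(y - x))
    return 1 - s

x, y = {1, 2, 3}, {2, 3, 4, 5}
d = tversky_dissimilarity(1, 1, x, y)
jaccard = Fraction(len(x | y) - len(x & y), len(x | y))
print(d, jaccard)  # both 3/5
```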
Definition 15

The Szymkiewicz–Simpson coefficient is defined by

\[
\mathrm{overlap}(X,Y)=\frac{|X\cap Y|}{\min(|X|,|Y|)}.
\]

We may note that $\mathrm{overlap}(X,Y)=1$ whenever $X\subseteq Y$ or $Y\subseteq X$, so that $1-\mathrm{overlap}$ is not a metric.

Definition 16

The Sørensen–Dice coefficient is defined by

\[
\frac{2|X\cap Y|}{|X|+|Y|}.
\]
Definition 17 [7, Section 2]

Let $\mathcal X$ be a collection of finite sets. We define $S:\mathcal X\times\mathcal X\to\mathbb R$ as follows. The symmetric Tversky ratio model is defined by

\[
\mathrm{strm}(X,Y)=\frac{|X\cap Y|+\mathrm{bias}}{|X\cap Y|+\mathrm{bias}+\beta(\alpha\tilde m+(1-\alpha)\tilde M)}.
\]

The unbiased symmetric TRM ($\mathrm{ustrm}$) is the case where $\mathrm{bias}=0$; we shall assume this case for the rest of this paper. The Tversky semimetric $D_{\alpha,\beta}$ is defined by $D_{\alpha,\beta}(X,Y)=1-\mathrm{ustrm}(X,Y)$, or more precisely

\[
D_{\alpha,\beta}(X,Y)=\begin{cases}\dfrac{\beta(\alpha\tilde m+(1-\alpha)\tilde M)}{|X\cap Y|+\beta(\alpha\tilde m+(1-\alpha)\tilde M)},&\text{if }X\cup Y\ne\emptyset;\\[1ex] 0,&\text{if }X=Y=\emptyset.\end{cases}
\]

Note that for $\alpha=1/2$, $\beta=1$, the STRM is equivalent to the Sørensen–Dice coefficient. Similarly, for $\alpha=1/2$, $\beta=2$, it is equivalent to Jaccard’s coefficient.
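Both equivalences can be checked in exact rational arithmetic; the Python below is an illustration on an arbitrary example pair of sets:

```python
from fractions import Fraction

def d_tversky(alpha, beta, x, y):
    """Tversky semimetric D_{alpha,beta} of Definition 17 (unbiased case)."""
    if not x and not y:
        return Fraction(0)
    m = min(len(x - y), len(y - x))
    M = max(len(x - y), len(y - x))
    core = beta * (alpha * m + (1 - alpha) * M)
    return core / (len(x & y) + core)

x, y = {1, 2, 3, 4}, {3, 4, 5}
jaccard = 1 - Fraction(len(x & y), len(x | y))
dice = 1 - Fraction(2 * len(x & y), len(x) + len(y))

print(d_tversky(Fraction(1, 2), 2, x, y) == jaccard)  # True: D_{1/2,2} is the Jaccard distance
print(d_tversky(Fraction(1, 2), 1, x, y) == dice)     # True: D_{1/2,1} is Dice dissimilarity
```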

0.2 Tversky metrics

Theorem 18

The function $D_{\alpha,\beta}$ is a metric only if $\beta\ge 1/(1-\alpha)$.


In Theorem 19 we use the interval notation on $\mathbb N$, given by $[a,a]=\{a\}$ and $[a,b]=[a,b-1]\cup\{b\}$.

Theorem 19

The function $D_{\alpha,\beta}$ is a metric on all finite power sets only if $\alpha\le 1/2$.

Theorem 20

Let $0\le\alpha\le 1$ and $\beta>0$. Then $D_{\alpha,\beta}$ is a metric if and only if $0\le\alpha\le 1/2$ and $\beta\ge 1/(1-\alpha)$.


We have formally proved Theorem 20 in the Lean theorem prover. The GitHub repository can be found at [8].

0.2.1 A converse to Gragera and Suppakitpaisarn

Theorem 21 (Gragera and Suppakitpaisarn [5, 6])

The optimal constant $\rho$ such that $d^T_{\alpha,\beta}(X,Z)\le\rho\left(d^T_{\alpha,\beta}(X,Y)+d^T_{\alpha,\beta}(Y,Z)\right)$ for all $X,Y,Z$ is

\[
\frac12\left(1+\sqrt{\frac{1}{\alpha\beta}}\right).
\]
Corollary 22

$d^T_{\alpha,\beta}$ is a metric only if $\alpha=\beta\ge 1$.


Theorem 23 gives a converse to Corollary 22, which was inspired by Gragera and Suppakitpaisarn:

Theorem 23

The Tversky dissimilarity $d^T_{\alpha,\beta}$ is a metric if and only if $\alpha=\beta\ge 1$.


The truth or falsity of Theorem 23 does not arise in Gragera and Suppakitpaisarn’s work, as they require $\alpha,\beta\le 1$ in their definition of the Tversky index. We note that Tversky [18] only required $\alpha,\beta\ge 0$.

0.3 Lebesgue-style metrics

Incidentally, the names of $J_1$ and $J_\infty$ come from the observation that they are special cases of $J_p$, given by

\[
J_p(A,B)=\left(\frac{2\left(|B\setminus A|^p+|A\setminus B|^p\right)}{|A|^p+|B|^p+|B\setminus A|^p+|A\setminus B|^p}\right)^{1/p}
=\begin{cases}J_1(A,B)&p=1,\\ J_\infty(A,B)&p=\infty,\end{cases}
\]

which was suggested in [9] as another possible means of interpolating between $J_1$ and $J_\infty$. We still conjecture that $J_2$ is a metric, but shall not attempt to prove it here. However:

Theorem 24

$J_3$ is not a metric.

Because of Theorem 24, we searched for a better version of $J_p$, and found $V_p$:

Definition 25

For each $1\le p\le\infty$, let¹

\[
\Delta_p(A,B)=\left(|B\setminus A|^p+|A\setminus B|^p\right)^{1/p},\quad\text{and}\quad
V_p(A,B)=\frac{\Delta_p(A,B)}{|A\cap B|+\Delta_p(A,B)}.
\]

We have $V_1=J_1$ and $V_\infty:=\lim_{p\to\infty}V_p=J_\infty$.

In a way, what is going on here is that we consider $L^p$ spaces instead of

\[
\frac1p\,L^1+\left(1-\frac1p\right)L^\infty
\]

spaces.

Theorem 26

For each $1\le p\le\infty$, $\Delta_p$ is a metric.

Theorem 27

For each $1\le p\le\infty$, $V_p$ is a metric.


Of special interest may be $V_2$ as a canonical interpolant between $V_1$, the Jaccard distance, and $V_\infty=J_\infty$, the analogue of the NID. If $|B\setminus A|=3$, $|A\setminus B|=4$, and $|A\cap B|=5$, then

\[
V_1(A,B)=7/12,\qquad V_2(A,B)=1/2,\qquad V_\infty(A,B)=4/9.
\]
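These values can be reproduced by a short computation (the sets below are chosen only to realize the stated cardinalities):

```python
def v_p(p, a, b):
    """V_p(A,B) = Delta_p / (|A ∩ B| + Delta_p); p = float('inf') uses the max."""
    da, db = len(b - a), len(a - b)
    if p == float('inf'):
        delta = max(da, db)
    else:
        delta = (da ** p + db ** p) ** (1 / p)
    return delta / (len(a & b) + delta)

a = {1, 2, 3, 4, 5, 6, 7, 8, 9}  # |A \ B| = 4
b = {1, 2, 3, 4, 5, 10, 11, 12}  # |B \ A| = 3, |A ∩ B| = 5
print(v_p(1, a, b))              # 0.5833... = 7/12
print(v_p(2, a, b))              # 0.5
print(v_p(float('inf'), a, b))   # 0.4444... = 4/9
```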

Note that if $A\subseteq B$ then $V_p(A,B)=V_1(A,B)$ for all $p$.

0.4 Conclusion and applications

Many researchers have considered metrics based on sums or maxima, but we have shown that these need not be considered in isolation: they form the endpoints of a family of metrics.

As an example, the mutations of spike glycoproteins of coronaviruses are of interest in connection with diseases such as COVID-19. We calculated several distance measures between peptide sequences for such proteins. The distance

\[
Z_{2,\alpha}(x_0,x_1)=\alpha\min(|A_1|,|A_2|)+\bar\alpha\max(|A_1|,|A_2|),
\]

where $A_1$ is the set of subwords of length 2 occurring in $x_0$ but not in $x_1$, and $A_2$ is the set of those occurring in $x_1$ but not in $x_0$, counts (in a weighted fashion) how many subwords of length 2 appear in one sequence and not the other.
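A minimal Python sketch of $Z_{2,\alpha}$ on toy strings (hypothetical sequences, not the actual peptide data):

```python
def subwords(s, k=2):
    """Set of length-k subwords (k-grams) of a string."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def z2(alpha, x0, x1):
    """Z_{2,alpha}: weighted combination of the exclusive 2-gram counts."""
    a1 = subwords(x0) - subwords(x1)  # 2-grams in x0 but not in x1
    a2 = subwords(x1) - subwords(x0)  # 2-grams in x1 but not in x0
    return alpha * min(len(a1), len(a2)) + (1 - alpha) * max(len(a1), len(a2))

x0, x1 = "abab", "aabb"
# 2-grams: {"ab", "ba"} vs {"aa", "ab", "bb"}; exclusive counts are 1 and 2
print(z2(0.25, x0, x1))  # 1.75
```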

We used the Ward linkage criterion for producing Newick trees, using the hclust package for the Go programming language. The calculated phylogenetic trees were based on the metric $Z_{2,\alpha}$.

We found one tree isomorphism class each for $0\le\alpha\le 0.21$, $0.22\le\alpha\le 0.36$, and $0.37\le\alpha\le 0.5$, respectively. We see that the various intervals for $\alpha$ can correspond to “better” or “worse” agreement with other distance measures. Thus, we propose that rather than focusing on $\alpha=0$ and $\alpha=1/2$ exclusively, future work may consider the whole interval $[0,1/2]$.

  1. Here, V can stand for Paul M. B. Vitányi, who introduced the author to the normalized information distance at a Dagstuhl workshop in 2006.