# 浅谈sklearn中DBSCAN的欧式距离(Euclidean Distance)的计算

DBSCAN根据密度聚类，常用欧式距离（Euclidean Distance）来度量数据之间的距离。笔者学习发现DBSCAN中欧式距离的计算先后调用了sklearn.neighbors.NearestNeighbors和sklearn.metrics.pairwise.euclidean_distances，并浅谈其中sklearn.metrics.pairwise.euclidean_distances对欧式距离计算公式的优化和不足。

## 1 sklearn.cluster.DBSCAN

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric=‘euclidean’, metric_params=None, algorithm=‘auto’, leaf_size=30, p=None, n_jobs=None)

# 节选自Libsite-packagessklearncluster_dbscan.py
class DBSCAN(ClusterMixin, BaseEstimator):
def fit(self, X, y=None, sample_weight=None):
.......
neighbors_model = NearestNeighbors(radius=self.eps, algorithm=self.algorithm, leaf_size=self.leaf_size, metric=self.metric, metric_params=self.metric_params, p=self.p, n_jobs=self.n_jobs)
neighbors_model.fit(X)
# This has worst case O(n^2) memory complexity
......


## 2 sklearn.neighbors.NearestNeighbors

class sklearn.neighbors.NearestNeighbors(*, n_neighbors=5, radius=1.0, algorithm=‘auto’, leaf_size=30, metric=‘minkowski’, p=2, metric_params=None, n_jobs=None)

# 节选自Libsite-packagessklearnneighbors_unsupervised.py
from ._base import NeighborsBase
from ._base import KNeighborsMixin
def fit(self, X, y=None):
return self._fit(X)

# 节选自Libsite-packagessklearnneighbors_base.py
from ..metrics import pairwise_distances_chunked
class NeighborsBase(MultiOutputMixin, BaseEstimator, metaclass=ABCMeta):
def _fit(self, X, y=None):
......
self.effective_metric_ = self.metric
......
self._fit_X = X
......
return self

......
elif self._fit_method == 'brute':
# for efficiency, use squared euclidean distances
if self.effective_metric_ == 'euclidean':
kwds = {'squared': True}
else:
kwds = self.effective_metric_params_
chunked_results = pairwise_distances_chunked(X, self._fit_X, reduce_func=reduce_func, metric=self.effective_metric_, n_jobs=self.n_jobs, **kwds)
......


## 3 sklearn.metrics.pairwise

# 节选自Libsite-packagessklearnmetircspairwise.py
def pairwise_distances_chunked(X, Y=None, *, reduce_func=None, metric='euclidean', n_jobs=None, working_memory=None, **kwds):
......
for sl in slices:
......
D_chunk = pairwise_distances(X_chunk, Y, metric=metric, n_jobs=n_jobs, **kwds)
if ((X is Y or Y is None)
and PAIRWISE_DISTANCE_FUNCTIONS.get(metric, None)
is euclidean_distances):
# zeroing diagonal, taking care of aliases of "euclidean",
# i.e. "l2"
D_chunk.flat[sl.start::_num_samples(X) + 1] = 0
......
yield D_chunk

def pairwise_distances(X, Y=None, metric="euclidean", *, n_jobs=None, force_all_finite=True, **kwds):
......
elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
......
return _parallel_pairwise(X, Y, func, n_jobs, **kwds)

PAIRWISE_DISTANCE_FUNCTIONS = {......'euclidean': euclidean_distances,......} # 字典, value对应函数

def _parallel_pairwise(X, Y, func, n_jobs, **kwds):
if Y is None:
Y = X
X, Y, dtype = _return_float_dtype(X, Y)
if effective_n_jobs(n_jobs) == 1:
return func(X, Y, **kwds)  # 这里func为euclidean_distances
......

def euclidean_distances(X, Y=None, *, Y_norm_squared=None, squared=False, X_norm_squared=None):
......#具体见4 sklearn.metrics.pairwise.euclidean_distances


## 4 sklearn.metrics.pairwise.euclidean_distances

sklearn.metrics.pairwise.euclidean_distances(X, Y=None, *, Y_norm_squared=None, squared=False, X_norm_squared=None)

euclidean_distances是一个函数，输入向量数组X和Y， 计算向量数组X和Y每对（行）向量之间的欧式距离，返回距离矩阵。若仅输入X而Y=None（相当于Y=X），则返回X自身向量间的邻接矩阵。根据官方文档，每对向量x和y的欧式距离计算公式为：

dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))

a

=

[

x

a

y

a

]

a=begin{bmatrix}x_a\y_aend{bmatrix}

a=[xaya],

b

=

[

x

b

y

b

]

b=begin{bmatrix}x_b\y_bend{bmatrix}

b=[xbyb]为例。

d

o

t

(

a

,

a

)

=

x

a

2

+

y

a

2

dot(a, a) = x_a^2+y_a^2

dot(a,a)=xa2+ya2

d

o

t

(

a

,

b

)

=

x

a

x

b

+

y

a

y

b

dot(a, b) = {x_ax_b}+{y_ay_b}

dot(a,b)=xaxb+yayb

d

o

t

(

b

,

b

)

=

x

b

2

+

y

b

2

dot(b, b) = x_b^2+y_b^2

dot(b,b)=xb2+yb2，因此

d

i

s

t

(

a

,

b

)

=

x

a

2

+

y

a

2

2

(

x

a

x

b

+

y

a

y

b

)

+

x

b

2

+

y

b

2

=

(

x

a

x

b

)

2

+

(

y

a

y

b

)

2

dist(a, b)=sqrt{x_a^2+y_a^2-2ast(x_ax_b+y_ay_b)+x_b^2+y_b^2}=sqrt{{{{(x_a}-x_b})}^2+{{(y_a}-y_b)}^2}

dist(a,b)=xa2+ya22(xaxb+yayb)+xb2+yb2

=(xaxb)2+(yayb)2

。采用该公式实际是出于计算效率的考虑，如果预先算出每个向量（数据）自身的点乘dot(x,x)和dot(y,y)，则能缩减不同向量间距离的计算时间（原文见下）。比如，对于任意i来说，计算

d

i

s

t

(

x

1

,

x

i

)

dist(oversetrightharpoonup{x_1}, oversetrightharpoonup{x_i})

dist(x1,xi)涉及

d

o

t

(

x

1

,

x

1

)

dot(oversetrightharpoonup{x_1}, oversetrightharpoonup{x_1})

dot(x1,x1)这一中间值，若提前算好并临时保存这个点乘结果，则能更快地算出与向量

x

1

oversetrightharpoonup{x_1}

x1相关的所有距离。而这一点乘结果实际对应函数输入中的关键词参数X_norm_squared和Y_norm_squared。

This formulation has two advantages over other ways of computing distances. First, it is computationally efficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then dot(x, x) and/or dot(y, y) can be pre-computed.

x

i

oversetrightharpoonup{x_i}

xi

x

j

oversetrightharpoonup{x_j}

xj之间的欧式距离，理论上M应为对称矩阵。

X

=

[

x

1

x

2

x

m

]

=

[

x

11

x

12

x

1

n

x

21

x

22

x

2

n

x

m

1

x

m

2

x

m

n

]

X;=;begin{bmatrix}oversetrightharpoonup{x_1}\oversetrightharpoonup{x_2}\cdots\oversetrightharpoonup{x_m}end{bmatrix};=;begin{bmatrix}x_{11}&x_{12}&cdots&x_{1n}\x_{21}&x_{22}&cdots&x_{2n}\cdots&cdots&cdots&cdots\x_{m1}&x_{m2}&cdots&x_{mn}end{bmatrix}

X=x1x2xm=x11x21xm1x12x22xm2x1nx2nxmn

x

1

oversetrightharpoonup{x_1}

x1

x

2

oversetrightharpoonup{x_2}

x2之间的距离为dist(

x

1

oversetrightharpoonup{x_1}

x1,

x

2

oversetrightharpoonup{x_2}

x2)，对应M的

m

12

m_{12}

m12以及

m

21

m_{21}

m21元素。

However, this is not the most precise way of doing this computation, because this equation potentially suffers from “catastrophic cancellation”. Also, the distance matrix returned by this function may not be exactly symmetric as required by, e.g., scipy.spatial.distance functions.

u

v

2

=

(

(

w

i

u

i

v

i

2

)

)

1

/

2

{parallel u-vparallel}_2={({textstylesum_{}}(w_ileft|u_i-v_iright|^2))}^{1/2}

uv2=((wiuivi2))1/2，其中w指u和v之间的权重，默认

w

i

w_i

wi为1.0。

（1）sklearn.neighbors.NearestNeighbors, https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html?highlight=nearestneighbors#sklearn.neighbors.NearestNeighbors
（2）sklearn.metrics.pairwise.euclidean_distances, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html#sklearn.metrics.pairwise.euclidean_distances
（3）sklearn.cluster.DBSCAN, https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
（4）scipy.spatial.distance.euclidean, https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean

THE END