KDTree
如果您可以选择多个邻居来保证阵列中的每个项目都具有唯一的邻居,则使用的解决方案可以表现良好。使用以下代码:
def nearest_neighbors_kd_tree(x, y, k) :
x, y = map(np.asarray, (x, y))
tree =scipy.spatial.cKDTree(y[:, None])
ordered_neighbors = tree.query(x[:, None], k)[1]
nearest_neighbor = np.empty((len(x),), dtype=np.intp)
nearest_neighbor.fill(-1)
used_y = set()
for j, neigh_j in enumerate(ordered_neighbors) :
for k in neigh_j :
if k not in used_y :
nearest_neighbor[j] = k
used_y.add(k)
break
return nearest_neighbor
和一个n=1000
点样本,我得到:
In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
Out[9]: True
In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
Out[10]: False
因此,最佳值为k=13
,然后时间为:
In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
100 loops, best of 3: 9.26 ms per loop
但在最坏的情况下,您可能需要k=1000
,然后:
In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
1 loops, best of 3: 424 ms per loop
这比其他选项要慢:
In [13]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 60 ms per loop
In [14]: %timeit nearest_neighbors_sorted(x, y)
10 loops, best of 3: 47.4 ms per loop
在搜索之前对数组进行排序可以得到1000个以上的数组的数组:
def nearest_neighbors_sorted(x, y) :
x, y = map(np.asarray, (x, y))
y_idx = np.argsort(y)
y = y[y_idx]
nearest_neighbor = np.empty((len(x),), dtype=np.intp)
for j, xj in enumerate(x) :
idx = np.searchsorted(y, xj)
if idx == len(y) or idx != 0 and y[idx] - xj > xj - y[idx-1] :
idx -= 1
nearest_neighbor[j] = y_idx[idx]
y = np.delete(y, idx)
y_idx = np.delete(y_idx, idx)
return nearest_neighbor
具有10000个元素的长数组:
In [2]: %timeit nearest_neighbors_sorted(x, y)
1 loops, best of 3: 557 ms per loop
In [3]: %timeit nearest_neighbors(x, y)
1 loops, best of 3: 1.53 s per loop
对于较小的阵列,其性能稍差一些。
如果只丢弃重复项,则必须遍历所有项以实现贪婪的最近邻居算法。考虑到这一点,这是我能想到的最快的方法:
def nearest_neighbors(x, y) :
x, y = map(np.asarray, (x, y))
y = y.copy()
y_idx = np.arange(len(y))
nearest_neighbor = np.empty((len(x),), dtype=np.intp)
for j, xj in enumerate(x) :
idx = np.argmin(np.abs(y - xj))
nearest_neighbor[j] = y_idx[idx]
y = np.delete(y, idx)
y_idx = np.delete(y_idx, idx)
return nearest_neighbor
现在有:
n = 1000
x = np.random.rand(n)
y = np.random.rand(2*n)
我得到:
In [11]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 52.4 ms per loop