You can improve on @Bill's solution by reducing the intermediate storage to the diagonal elements only:
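import numpy as np
from numpy.core.umath_tests import inner1d

m, n = 1000, 500
a = np.random.rand(m, n)
b = np.random.rand(n, m)

# All three should give the same result
print(np.trace(a.dot(b)))         # forms the full (m, m) product first
print(np.sum(a * b.T))            # (m, n) elementwise intermediate
print(np.sum(inner1d(a, b.T)))    # only the m diagonal elements are stored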
%timeit np.trace(a.dot(b))
10 loops, best of 3: 34.7 ms per loop
%timeit np.sum(a*b.T)
100 loops, best of 3: 4.85 ms per loop
%timeit np.sum(inner1d(a, b.T))
1000 loops, best of 3: 1.83 ms per loop
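To see why the three expressions agree: the i-th diagonal entry of a.dot(b) is just row i of a dotted with column i of b, so only m inner products are needed rather than the full m-by-m product. Below is a minimal sketch of that idea. The helper name trace_of_product and the small shapes are hypothetical, chosen for illustration only, and the explicit Python loop is meant to show the logic, not to be fast.

```python
import numpy as np

def trace_of_product(a, b):
    """Trace of a.dot(b) without forming the full (m, m) product.

    Hypothetical helper for illustration; not part of NumPy.
    """
    # Row i of `a` dotted with column i of `b` is the i-th diagonal entry.
    diag = np.array([a[i, :].dot(b[:, i]) for i in range(a.shape[0])])
    return diag.sum()

rng = np.random.RandomState(0)
a_small = rng.rand(6, 4)
b_small = rng.rand(4, 6)

reference = np.trace(a_small.dot(b_small))
assert np.allclose(trace_of_product(a_small, b_small), reference)
# The elementwise form sums the same products, just grouped differently.
assert np.allclose(np.sum(a_small * b_small.T), reference)
```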
Another option is to use np.einsum
and have no explicit intermediate storage at all:
# Will print the same as the others:
print(np.einsum('ij,ji->', a, b))
On my system it runs slightly slower than using inner1d
, but that may not hold on every system; see this question:
%timeit np.einsum('ij,ji->', a, b)
100 loops, best of 3: 1.91 ms per loop
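Because inner1d is imported from an internal NumPy module, it may not be importable in every NumPy version. As a hedged alternative, the same row-wise inner products can be written with np.einsum using the subscripts 'ij,ij->i'. The sketch below assumes only public NumPy plus the standard timeit module. The function names are hypothetical, and no timings are quoted because they will differ from system to system.

```python
import timeit
import numpy as np

m, n = 1000, 500
a = np.random.rand(m, n)
b = np.random.rand(n, m)

def via_rowwise_einsum(a, b):
    # 'ij,ij->i' yields one inner product per row, i.e. the diagonal of a.dot(b).
    return np.sum(np.einsum('ij,ij->i', a, b.T))

def via_full_einsum(a, b):
    # Contracts both indices at once; no explicit intermediate array.
    return np.einsum('ij,ji->', a, b)

reference = np.trace(a.dot(b))
assert np.allclose(via_rowwise_einsum(a, b), reference)
assert np.allclose(via_full_einsum(a, b), reference)

# Rough timing outside IPython; results vary by machine and NumPy build.
for fn in (via_rowwise_einsum, via_full_einsum):
    seconds = timeit.timeit(lambda: fn(a, b), number=20) / 20
    print("%s: %.3f ms per call" % (fn.__name__, seconds * 1e3))
```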