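If you are interested in performance, an IntervalIndex is optimized for searching. Using .get_loc or .get_indexer relies on an internally built IntervalTree (similar to a binary tree), which is constructed on first use.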
In [29]: idx = pd.IntervalIndex.from_tuples(data*10000)
In [30]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
92.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [40]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
42.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# construct tree and search
In [31]: %timeit -n 1 -r 1 idx.get_loc(900)
4.55 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# subsequently
In [32]: %timeit -n 1 -r 1 idx.get_loc(900)
137 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# for a single indexer you can do even better (note that this is
# dipping into the implementation a bit)
In [27]: %timeit np.arange(len(idx))[(900 > idx.left) & (900 <= idx.right)]
203 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that .get_loc() returns an indexer (which is actually more useful than a boolean array, but the two can be converted into each other).
In [38]: idx.map(lambda x: 900 in x)
...:
Out[38]:
Index([ True, False, False, True, False, False, True, False, False, True,
...
False, True, False, False, True, False, False, True, False, False], dtype='object', length=30000)
In [39]: idx.get_loc(900)
...:
Out[39]: array([29997, 9987, 10008, ..., 19992, 19989, 0])
A boolean array can be converted to an array of indexers:
In [5]: np.arange(len(idx))[idx.map(lambda x: 900 in x).values.astype(bool)]
Out[5]: array([ 0, 3, 6, ..., 29991, 29994, 29997])
This is what .get_loc() and .get_indexer() return:
In [6]: np.sort(idx.get_loc(900))
Out[6]: array([ 0, 3, 6, ..., 29991, 29994, 29997])
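For anyone who wants to try the boolean-mask / indexer conversion without the original `data`, here is a minimal, self-contained sketch. The `data` list below is hypothetical (three right-closed intervals chosen only for illustration, not the data used in the timings above). The mask is built directly from `idx.left` and `idx.right`, because the exact return type of `.get_loc()` on an overlapping index has differed between pandas versions, so the sketch does not rely on it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three right-closed intervals, repeated to get a
# non-unique index (not the data from the session above).
data = [(0, 1000), (1000, 2000), (2000, 3000)]
idx = pd.IntervalIndex.from_tuples(data * 4)  # closed='right' by default

key = 900

# Boolean mask: True where the interval (left, right] contains key.
mask = np.asarray((idx.left < key) & (key <= idx.right))

# Boolean mask -> integer indexer (positions of the True entries).
indexer = np.flatnonzero(mask)

# Integer indexer -> boolean mask.
roundtrip = np.zeros(len(idx), dtype=bool)
roundtrip[indexer] = True

print(indexer)
print(np.array_equal(mask, roundtrip))
```

With this hypothetical data every third interval contains 900, so the indexer picks positions 0, 3, 6 and 9, and converting back reproduces the original mask.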