Indexing in NumPy
Some use cases
Created: 2021-04-30
Compress overlapping indices
Background
matches
below is the (truncated) result of calling spacy.Matcher
on a Doc.
Each tuple corresponds to a single match. So to return the result of the first match we run:
doc[29:30]
Remember, the end slice is exclusive - we expect only one token.
Notice there is some duplication here in the spans. What we want to do is consolidate our matches:
[(29, 30), (29, 31)]
Should become:
[(29, 31)]
Method
There several methods to solve this - here's one that's efficient and relatively clean. I take advantage of numpy
indexing.
import numpy as np
doc_len = 700
# Matches
matches = \
[(29, 30),
(29, 31),
(30, 31),
(32, 33),
(86, 87),
(96, 97),
(192, 193),
(194, 195),
(196, 197),
(196, 198),
(197, 198),
(199, 200),
(201, 202),
(209, 210),
(268, 269),
(270, 271),
(313, 314),
(328, 329),
(330, 331),
(332, 333),
(332, 334),
(333, 334),
(354, 355),
(357, 358),
(381, 382),
(396, 397),
(399, 400),
(430, 431),
(433, 434),
(445, 446),
(457, 458),
(461, 462),
(461, 463),
(462, 463),
(482, 483),
(484, 485),
(486, 487),
(489, 490),
(544, 545)]
z = np.zeros(doc_len)
i = np.unique(np.hstack([np.arange(*x) for x in matches]))
z[i] = 1
result = np.argwhere(z).ravel()
What we have now is our z
array where values of 1
indicate that a token was a member of a match.
Since we expanded the ranges to get to this point, we need to get them back into a (start, stop)
format.
from toolz import sliding_window
spans = []
start_idx = None
for a, b in sliding_window(2, result):
if a + 1 == b: # is it continuous?
if start_idx:
continue
else:
start_idx = a
continue
else:
if start_idx:
spans.append((start_idx, a))
start_idx = None
continue
else:
spans.append(a)
# We have left AND right side inclusive
span_range_form = []
for span in spans:
if isinstance(span, tuple):
span_range_form.append((span[0], span[1] + 1))
else:
span_range_form.append((span, span+ 1))
span_range_form
[(29, 31), (32, 33), (86, 87), (96, 97), (192, 193), (194, 195), (196, 198), (199, 200), (201, 202), (209, 210), (268, 269), (270, 271), (313, 314), (328, 329), (330, 331), (332, 334), (354, 355), (357, 358), (381, 382), (396, 397), (399, 400), (430, 431), (433, 434), (445, 446), (457, 458), (461, 463), (482, 483), (484, 485), (486, 487), (489, 490)]
Speed Test
%%timeit -r50
z = np.zeros(doc_len)
i = np.unique(np.hstack([np.arange(*x) for x in matches]))
z[i] = 1
result = np.argwhere(z).ravel()
spans = []
start_idx = None
for a, b in sliding_window(2, result):
if a + 1 == b: # is it continuous?
if start_idx:
continue
else:
start_idx = a
continue
else:
if start_idx:
spans.append((start_idx, a))
start_idx = None
continue
else:
spans.append(a)
# We have left AND right side inclusive
span_range_form = []
for span in spans:
if isinstance(span, tuple):
span_range_form.append((span[0], span[1] + 1))
else:
span_range_form.append((span, span+ 1))
119 µs ± 4.42 µs per loop (mean ± std. dev. of 50 runs, 10000 loops each)