Indexing in NumPy

Some use cases

Created: 2021-04-30

Compress overlapping indices

Background

matches below is the (truncated) result of calling spacy.Matcher on a Doc.

Each tuple corresponds to a single match. So to return the result of the first match we run:

doc[29:30]

Remember, the end slice is exclusive - we expect only one token.

Notice there is some duplication here in the spans. What we want to do is consolidate our matches:

[(29, 30), (29, 31)]

Should become:

[(29, 31)]

Method

There several methods to solve this - here's one that's efficient and relatively clean. I take advantage of numpy indexing.

import numpy as np
doc_len = 700

# Matches
matches = \
[(29, 30),
 (29, 31),
 (30, 31),
 (32, 33),
 (86, 87),
 (96, 97),
 (192, 193),
 (194, 195),
 (196, 197),
 (196, 198),
 (197, 198),
 (199, 200),
 (201, 202),
 (209, 210),
 (268, 269),
 (270, 271),
 (313, 314),
 (328, 329),
 (330, 331),
 (332, 333),
 (332, 334),
 (333, 334),
 (354, 355),
 (357, 358),
 (381, 382),
 (396, 397),
 (399, 400),
 (430, 431),
 (433, 434),
 (445, 446),
 (457, 458),
 (461, 462),
 (461, 463),
 (462, 463),
 (482, 483),
 (484, 485),
 (486, 487),
 (489, 490),
 (544, 545)]

z = np.zeros(doc_len)
i = np.unique(np.hstack([np.arange(*x) for x in matches]))
z[i] = 1
result = np.argwhere(z).ravel()

What we have now is our z array where values of 1 indicate that a token was a member of a match.

Since we expanded the ranges to get to this point, we need to get them back into a (start, stop) format.

from toolz import sliding_window
spans = []
start_idx = None
for a, b in sliding_window(2, result):
    if a + 1 == b:  # is it continuous?
        if start_idx:
            continue
        else:
            start_idx = a
            continue
    else:
        if start_idx:
            spans.append((start_idx, a))
            start_idx = None
            continue
        else:
            spans.append(a)

# We have left AND right side inclusive
span_range_form = []
for span in spans:
    if isinstance(span, tuple):
        span_range_form.append((span[0], span[1] + 1))
    else:
        span_range_form.append((span, span+ 1))
span_range_form

[(29, 31),
 (32, 33),
 (86, 87),
 (96, 97),
 (192, 193),
 (194, 195),
 (196, 198),
 (199, 200),
 (201, 202),
 (209, 210),
 (268, 269),
 (270, 271),
 (313, 314),
 (328, 329),
 (330, 331),
 (332, 334),
 (354, 355),
 (357, 358),
 (381, 382),
 (396, 397),
 (399, 400),
 (430, 431),
 (433, 434),
 (445, 446),
 (457, 458),
 (461, 463),
 (482, 483),
 (484, 485),
 (486, 487),
 (489, 490)]

Speed Test

%%timeit -r50
z = np.zeros(doc_len)
i = np.unique(np.hstack([np.arange(*x) for x in matches]))
z[i] = 1
result = np.argwhere(z).ravel()
spans = []
start_idx = None
for a, b in sliding_window(2, result):
    if a + 1 == b:  # is it continuous?
        if start_idx:
            continue
        else:
            start_idx = a
            continue
    else:
        if start_idx:
            spans.append((start_idx, a))
            start_idx = None
            continue
        else:
            spans.append(a)

# We have left AND right side inclusive
span_range_form = []
for span in spans:
    if isinstance(span, tuple):
        span_range_form.append((span[0], span[1] + 1))
    else:
        span_range_form.append((span, span+ 1))

119 µs ± 4.42 µs per loop (mean ± std. dev. of 50 runs, 10000 loops each)

My Notes

Indexing in NumPy

Compress overlapping indices

Background

Method

Speed Test