Regex Named Capture Groups

And Using Them with Pandas

Created: 2021-04-30

groupdict() is one of those features I had seen before but never had a real use case for until recently.

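Python's re module lets you name capture groups with (?P<name>...), and calling groupdict() on the resulting match returns them as a dictionary:

>>> import re
>>> re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds").groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}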
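One thing the one-liner glosses over: re.match returns None when the text does not match, so calling groupdict() directly on the result would raise an AttributeError. A minimal sketch of a guard, where NAME_RE and parse_name are names made up for this illustration:

```python
import re

# Same pattern as above, compiled once so it can be reused.
NAME_RE = re.compile(r"(?P<first_name>\w+) (?P<last_name>\w+)")

def parse_name(text):
    """Return the named groups as a dict, or None when the text does not match."""
    match = NAME_RE.match(text)
    return match.groupdict() if match else None

print(parse_name("Malcolm Reynolds"))
print(parse_name("Malcolm"))  # no space and second word, so no match
```

The dictionary keys come straight from the group names in the pattern, which is what makes the pandas use case below work.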

Use Case: Pandas

An example nginx log entry:

000.00.000.00 - [US] [04/Jan/2021:19:33:35 +0000] "GET /api/count HTTP/1.1" 200 19 "/some/path" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "000.00.000.00"

entry = ["""000.00.000.00 - [US] [04/Jan/2021:19:33:35 +0000] "GET /api/count HTTP/1.1" 200 19 "/some/path" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "000.00.000.00"
"""]

import pandas as pd
import re
df = pd.DataFrame(entry); df

	0
0	000.00.000.00 - [US] [04/Jan/2021:19:33:35 +00...

# [A-Z] rather than [A-z]: the latter also matches the punctuation between 'Z' and 'a' in ASCII.
nginx_parse = re.compile(r'(?P<realip>[\d.]+)\s-\s\[(?P<country_code>\w+)]\s\[(?P<day>\d+)/(?P<month>[A-Z][a-z]+)/(?P<year>\d{4}):(?P<time>[\d:]+)\s(?P<tzoffset>\+\d+)]\s"(?P<method>[A-Z]+)\s(?P<path>[/\w]+)\s.+?(?P<status>\d{3})')

# Extracting into columns based on group name

df[0].str.extract(nginx_parse)

	realip	country_code	day	month	year	time	tzoffset	method	path	status
0	000.00.000.00	US	04	Jan	2021	19:33:35	+0000	GET	/api/count	200
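For comparison, here is roughly the same extraction done by hand with groupdict(), without pandas. This is only a sketch: NGINX_RE and parse_lines are hypothetical names, the pattern is the one above split across lines for readability, and the sample entry is a shortened copy of the log line. Lines that do not match are simply skipped here, which is an assumption about how you would want to treat them.

```python
import re

# Same named groups as nginx_parse above, split into pieces for readability.
NGINX_RE = re.compile(
    r'(?P<realip>[\d.]+)\s-\s\[(?P<country_code>\w+)]\s'
    r'\[(?P<day>\d+)/(?P<month>[A-Z][a-z]+)/(?P<year>\d{4}):'
    r'(?P<time>[\d:]+)\s(?P<tzoffset>\+\d+)]\s'
    r'"(?P<method>[A-Z]+)\s(?P<path>[/\w]+)\s.+?(?P<status>\d{3})'
)

def parse_lines(lines):
    """Collect one dict of named groups per matching line; skip the rest."""
    rows = []
    for line in lines:
        match = NGINX_RE.search(line)
        if match:
            rows.append(match.groupdict())
    return rows

sample = [
    '000.00.000.00 - [US] [04/Jan/2021:19:33:35 +0000] "GET /api/count HTTP/1.1" 200 19 "/some/path"',
    'not a log line',
]
rows = parse_lines(sample)
print(rows[0]["path"], rows[0]["status"])
```

A list of dicts like rows can be passed to pd.DataFrame(rows) to get one column per group name. str.extract gets there in a single call, which is why the named groups are so convenient in pandas.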
