Regex Named Capture Groups
And Using Them with Pandas
Created: 2021-04-30
Groupdict is one of those features I'd seen before but never had a use case for until recently.
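Python's re module lets you define named capture groups with (?P<name>...), and calling groupdict() on the match returns the captured text keyed by group name:
>>> import re
>>> re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds").groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}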
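For reference, here is a small sketch of the other ways to read a named group, plus a guard for the no-match case. The variable names are illustrative only, and the "???" input is a made-up string chosen because it cannot match.

```python
import re

# Same pattern and sample name as in the example above.
name_pattern = re.compile(r"(?P<first_name>\w+) (?P<last_name>\w+)")

match = name_pattern.match("Malcolm Reynolds")

# Three ways to read the named groups from a match object
first_by_method = match.group("first_name")
first_by_index = match["first_name"]  # subscript access, Python 3.6+
names = match.groupdict()

# re.match returns None when nothing matches, so guard before calling groupdict()
no_match = name_pattern.match("???")
fallback = no_match.groupdict() if no_match else {}
```

The guard matters in practice: calling groupdict() directly on a failed match raises AttributeError, because the result is None.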
Use Case: Pandas
An example nginx entry:
000.00.000.00 - [US] [04/Jan/2021:19:33:35 +0000] "GET /api/count HTTP/1.1" 200 19 "/some/path" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "000.00.000.00"
entry = ["""000.00.000.00 - [US] [04/Jan/2021:19:33:35 +0000] "GET /api/count HTTP/1.1" 200 19 "/some/path" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "000.00.000.00"
"""]
import pandas as pd
import re
df = pd.DataFrame(entry); df
| | 0 |
|---|---|
| 0 | 000.00.000.00 - [US] [04/Jan/2021:19:33:35 +00... |
# One named group per field; each group name becomes a column name
nginx_parse = re.compile(r'(?P<realip>[\d.]+)\s-\s\[(?P<country_code>\w+)]\s\[(?P<day>\d+)/(?P<month>[A-Z][a-z]+)/(?P<year>\d{4}):(?P<time>[\d:]+)\s(?P<tzoffset>[+-]\d+)]\s"(?P<method>[A-Z]+)\s(?P<path>[/\w]+)\s.+?(?P<status>\d{3})')
# Extracting into columns based on group name
df[0].str.extract(nginx_parse)
| | realip | country_code | day | month | year | time | tzoffset | method | path | status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000.00.000.00 | US | 04 | Jan | 2021 | 19:33:35 | +0000 | GET | /api/count | 200 |
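Every extracted column is a string, and a line that does not match the pattern comes back as a row of NaN. As a rough follow-up sketch (not part of the original notebook), the snippet below drops unmatched rows and converts a couple of columns. It assumes a shortened pattern passed as a plain string, a hypothetical second log line that is deliberately malformed, and an English locale for the month abbreviation.

```python
import pandas as pd

# Hypothetical two-line log; the second line is deliberately malformed.
lines = pd.Series([
    '000.00.000.00 - [US] [04/Jan/2021:19:33:35 +0000] "GET /api/count HTTP/1.1" 200 19',
    'not a log line',
])

# Shortened pattern (timestamp, method, path, status only), as a plain string.
pattern = (
    r'\[(?P<day>\d+)/(?P<month>[A-Z][a-z]+)/(?P<year>\d{4}):(?P<time>[\d:]+)\s'
    r'(?P<tzoffset>[+-]\d+)\]\s"(?P<method>[A-Z]+)\s(?P<path>[/\w]+)\s.+?(?P<status>\d{3})'
)

parsed = lines.str.extract(pattern)

# Rows that did not match are all-NaN; drop them before converting types.
parsed = parsed.dropna(how="all").copy()

# Extracted values are strings, so convert the numeric and date-like ones.
parsed["status"] = parsed["status"].astype(int)
parsed["timestamp"] = pd.to_datetime(
    parsed["day"] + "/" + parsed["month"] + "/" + parsed["year"]
    + ":" + parsed["time"] + " " + parsed["tzoffset"],
    format="%d/%b/%Y:%H:%M:%S %z",  # %b assumes English month abbreviations
)
```

Splitting the date into separate groups and then rejoining it is only one option; capturing the whole bracketed timestamp in a single group and parsing that would work just as well.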