home
lars.st0ne.at
Linux And Related Stuff
[BLOG] |ARCHIVE| |TAG MAP| |OTHER STUFF| |ME|
  • boost python regex expand performance


    #python #regex

    Yesterday while playing around with pythons regex library re, i came across the expand function of the Match-Object class. It uses a handy template syntax supporting named backreferences to capturing groups.

    Example:

    $ python
    >>> import re
    >>> re.match("(?P<fst>.*?) (?P<snd>.*)", "a b").expand( "\g<snd> \g<fst>" )
    'b a'
    

    ... "\g<fst>" expands to the first capturing group ...

    ... "\g<snd>" expands to the second capturing group ...

    Using this kind of string expansion i my script, i encountered that the performance was rather poor. So i was searching for a better/faster solution and started to measure execution times:

    inital version - re expand with named backreferences:

    $ python -m timeit "import re" 're.match("(?P<fst>.)(?P<snd>.)", "ab").expand("\g<snd>\g<fst>")'
    10000 loops, best of 3: 34.1 usec per loop
    

    improvement 1 - re expand with numeric backreferences #1:

    $ python -m timeit "import re" 're.match("(?P<fst>.)(?P<snd>.)", "ab").expand("\g<2>\g<1>")'
    10000 loops, best of 3: 25.2 usec per loop
    

    ... this version shows a slight performance improvement. But "\g<2>" can be written as "\2".

    improvement 2 - re expand with numeric backreferences #2:

    $ python -m timeit "import re" 're.match("(?P<fst>.)(?P<snd>.)", "ab").expand("\2\1")'
    100000 loops, best of 3: 12.8 usec per loop
    

    ... this version is performing approximately 2.5 times better than the first. Anyway the handy named-backreference syntax is gone. Therefore i tried a complete different approach using the str.format function and the groupdict() function.

    improvement 3 - string format:

    $ python -m timeit "import re" '"{b}{a}".format( **re.match("(?P<a>.)(?P<b>.)", "ab").groupdict())'
    100000 loops, best of 3: 3.78 usec per loop
    

    Conclusio:

    ... the final version above scores with performance and the named-backreferenc syntax by extracting the capturing groups with the groupdict() and feeds it to the string format function. The performance improvement is about factor 10 without loosing any feature.

    Dec 17 2013 12:26
    by st0ne
    ? hits
    • comment @Twitter
    • share on Twitter
©2012 Robert Steininger aka st0ne
CC-BY-SA