The fun of premature optimization

Last week I deployed a redesigned website that included a new search feature. Looking at the server logs, I noticed that users of the site use search. A lot.

I wanted to know more about what kind of questions end users were asking, so I wrote a short program to parse the logs, extracting query strings formed with the HTTP GET variable q=query+text. It was fairly easy to do, especially with apache_log_parser (PyPI, GitHub), which works with any web server that supports the common log format. Nginx is compatible; its default access log line can be parsed with this format string: %h %l %u %t "%r" %>s %O "%{Referer}i" "%{User-Agent}i"
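To get a feel for what the parser returns, here is a minimal sketch. The sample log line and the printed values are made up for illustration; the field names are the ones the program below relies on.

import apache_log_parser

# Nginx's default "combined" log format, expressed as Apache log directives
nginx_parser = apache_log_parser.make_parser(
    '%h %l %u %t "%r" %>s %O "%{Referer}i" "%{User-Agent}i"')

# a made-up log line, only to show which fields the parser exposes
sample = ('203.0.113.7 - - [10/Oct/2016:13:55:36 +0200] '
          '"GET /search/?q=premature+optimization HTTP/1.1" 200 512 '
          '"-" "Mozilla/5.0"')

parsed = nginx_parser(sample)
print(parsed['remote_host'])             # '203.0.113.7'
print(parsed['request_url_path'])        # '/search/'
print(parsed['request_url_query_list'])  # key/value pairs, e.g. [('q', 'premature optimization')]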

The program returns an array of records, each containing the query string, the remote IP address and a timestamp. I then toyed with the idea of measuring its performance, so I tested several possibilities with the crudest tool ever: the time command, using its real time value.

The setup: the program is implemented in Python 3, it was run on an Intel Celeron 2955U, and the input log file is 15 MB. All the code changes were cumulative: the performance was measured with all the changes up to that point.

Source code from the initial version:

from pprint import pprint
import apache_log_parser
nginx_parser = apache_log_parser.make_parser(
    '%h %l %u %t "%r" %>s %O "%{Referer}i" "%{User-Agent}i"')
log_file = "access.log"


def parse_queries(in_query):
    """Return one or more quieries in the request stripping 'q' member"""
    for q in in_query:
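        # each q is a key/value pair from the query string, e.g. ('q', 'search text');
        # turning it into a set and dropping 'q' leaves only the query text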
        q = set(q)
        if 'q' in q:
            out = q - {'q'}
            yield out.pop()


def prep_item(label, parsed_line):
    """create a dictinary with label, timestamp and IP address members"""
    timestamp = parsed_line['time_received_tz_datetimeobj']
    remote_ip = parsed_line['remote_host']
    return dict(query=label, time=timestamp, ip=remote_ip)


def extract_searches(filename):
    """parse the logfile and return list of dictionaries with query, time ip"""
    search_list = []
    with open(filename, 'r') as log:
        for line in log:
            outp = nginx_parser(line)
            if 'request_url_path' not in outp:
                continue
            if '/search/' not in outp['request_url_path']:
                continue
            for item in parse_queries(outp['request_url_query_list']):
                search_list.append(prep_item(item, outp))
    return search_list

if __name__ == "__main__":
    searches = extract_searches(log_file)
    pprint(searches)

Running this script under time gave the following results.

Run # time (seconds)
1 11.052
2 11.139
3 11.019
4 11.073
average 11.0708

What would happen if the list accumulator in extract_searches were replaced by a generator, consumed outside the function by a list comprehension? The code would look much cleaner. Generators are a little slower than in-memory operations, but list comprehensions are faster than repeated list appends, so on average this should be about the same or slightly faster.

def extract_searches(log_fd):
    for line in log_fd:
        outp = nginx_parser(line)
        if ('request_url_path' not in outp) or \
                ('/search/' not in outp['request_url_path']):
            continue
        for item in parse_queries(outp['request_url_query_list']):
            yield prep_item(item, outp)

...

if __name__ == "__main__":
    with open(log_file, 'r') as logfile:
        searches = [x for x in extract_searches(logfile)]
    print(searches)

As it turns out, on average this particular case of nicer Python code is marginally slower.

Run # time (seconds)
1 10.991
2 11.219
3 11.431
4 11.141
average 11.1955

Another possible optimization was to replace the if 'q' in q: statement with a try .. except block, since at that point in the code the condition should mostly evaluate to True. After running the program without the conditional, it turned out my guess was wrong: the condition always evaluated to True, so the check could simply be dropped.
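For reference, the try .. except variant I had in mind would have looked roughly like this (only a sketch, never benchmarked, since simply removing the check turned out to be enough):

def parse_queries(in_query):
    for q in in_query:
        out = set(q) - {'q'}
        try:
            # pop() raises KeyError when nothing besides 'q' was present
            yield out.pop()
        except KeyError:
            continue

The version that actually went in just drops the conditional: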

def parse_queries(in_query):
    for q in in_query:
        out = set(q)
        out -= {'q'}
        yield out.pop()

After this modification, the average runtime improved enough to make up for the performance hit from the Pythonic changes.

Run # time (seconds)
1 10.950
2 11.119
3 11.090
4 11.028
average 11.0468

Finally, since I'm only interested in URLs containing '/search/', what if the whole log line is searched for '/search/' without attempting to find the request URL string first?

def extract_searches(log_fd):
    for line in log_fd:
        if '/search/' not in line:
            continue
        outp = nginx_parser(line)
        if ('request_url_path' not in outp) or \
                ('/search/' not in outp['request_url_path']):
            continue
        for item in parse_queries(outp['request_url_query_list']):
            yield prep_item(item, outp)

The improvement was an order of magnitude. It turns out the common log format parser is just a nicely abstracted set of regular expressions (a rather neat set; see __init__.py line 141), but running several regular expressions to parse a line is much slower than a simple string comparison.

Run # time (seconds)
1 1.025
2 1.022
3 1.026
4 1.040
average 1.0283
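To get a sense of the regex-versus-substring gap on its own, here is a small illustrative micro-benchmark. The regular expression is a toy stand-in, not one of the parser's actual patterns, and the sample line is made up, so the numbers only show the shape of the difference rather than the real parsing cost.

import re
import timeit

# a made-up combined-format log line, the same as in the earlier sketch
line = ('203.0.113.7 - - [10/Oct/2016:13:55:36 +0200] '
        '"GET /search/?q=premature+optimization HTTP/1.1" 200 512 '
        '"-" "Mozilla/5.0"')

# a toy stand-in for one of the parser's request-line regular expressions
request_re = re.compile(r'"(?P<method>\w+) (?P<url>\S+) (?P<proto>[^"]+)"')

print(timeit.timeit(lambda: request_re.search(line), number=100000))
print(timeit.timeit(lambda: '/search/' in line, number=100000))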

I can think of two things after doing this exercise. First, the law of leaky abstractions is still with us: even if the abstracted code is relatively straightforward and works well, there are still implementation details to be aware of. Second, fixing up code and making it better and more Pythonic helps me think about program flow differently, with much greater potential benefits in the long run, even if in the short run the program's performance may decrease.