GoAccess as a Google Analytics replacement

Intro

Since removing Google Analytics from my blog a few months ago, I have been looking for a tool that would give me some idea of who is reading my blog while respecting visitors' privacy.

I have found that tool and it is GoAccess.

Side-by-side view of GoAccess TUI and web interface

GoAccess is an open-source web analytics tool that has text-based (TUI) and single-page web-based user interfaces. It is packaged for most Linux distros [1], including all of the Debian releases [2].

Installation and basic usage

GoAccess is installed with the following command.

# apt install goaccess

Installing the GeoIP database adds a regional breakdown of incoming requests to the stats.

# apt install geoip-database

I use Nginx as a static server and I'm interested in analyzing its access log files. The log file in question is set in the 'server' section of the Nginx configuration.

access_log  /var/log/nginx/flamy_ca.access.log;

For more details on setting up Nginx, see the Nginx config section of the 'Moving pelican blog to own server' article [3].

Opening an access log file with the goaccess command gives statistics for the day -- from midnight until now.

You need to be root or belong to the 'adm' group in Debian to be able to read the logs.

# goaccess /var/log/nginx/flamy_ca.access.log

That produces a nice summary of site visitors, requested files, referring sites, etc.

Sample goAccess TUI view

Navigation in GoAccess:

  • up/down arrows scroll up and down line-by-line.
  • tab scrolls down section-by-section.
  • 'enter' or right arrow expands each section to show top 10 entries.
  • keys 1-9 jump to sections 1 to 9. This doesn't work beyond section 9 -- 'Virtual hosts'.

GoAccess updates the summaries live as the Nginx process appends to the logs.

Configuration

In my use case, GoAccess only displays stats for Nginx logs. Setting the following defaults in /etc/goaccess.conf means I don't have to enter the same configuration settings on every invocation of the program.

time-format %H:%M:%S
date-format %d/%b/%Y
log-format %h %^[%d:%t %^] "%r" %s %b "%R" "%u"
color-scheme 3
config-dialog false
hl-header true
agent-list true
ignore-crawlers true
std-geoip true
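To make the log-format line less cryptic, here is a minimal sketch that splits one access log line into the pieces the format tokens refer to. The sample line is made up, the helper name parse_combined is hypothetical, and the regular expression only approximates what GoAccess does internally; it assumes Nginx's default "combined" log format.

```shell
#!/usr/bin/env bash

# Made-up sample line in Nginx's default "combined" format.
sample='203.0.113.7 - - [10/Feb/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com/" "Mozilla/5.0"'

# Hypothetical helper: print the fields that the log-format tokens describe.
#   %h host   %d date   %t time   "%r" request
#   %s status %b bytes  "%R" referrer  "%u" user agent
# The parts skipped by %^ (identity, user, timezone) are matched but not captured.
parse_combined() {
    local line="$1"
    local re='^([^ ]+) [^[]*\[([^:]+):([^ ]+) [^]]*\] "([^"]*)" ([0-9]+) ([0-9]+|-) "([^"]*)" "([^"]*)"$'
    if [[ "${line}" =~ ${re} ]]; then
        printf 'host=%s\ndate=%s\ntime=%s\nrequest=%s\nstatus=%s\nbytes=%s\nreferrer=%s\nagent=%s\n' \
            "${BASH_REMATCH[@]:1}"
    else
        echo "line does not match the combined format" >&2
        return 1
    fi
}

parse_combined "${sample}"
```

The date and time values printed here (10/Feb/2020 and 13:55:36) are the ones that date-format %d/%b/%Y and time-format %H:%M:%S have to describe.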

This is all the configuration needed for basic monitoring. The following command displays the summary for the past two days.

# goaccess /var/log/nginx/flamy_ca.access.log /var/log/nginx/flamy_ca.access.log.1

Both types of logs -- recent (uncompressed) and rotated (compressed) -- can be viewed in a single report. This is accomplished with bash process substitution: the rotated log files are decompressed into a single stream, and goaccess reads that stream as if it were one more log file.

# cd /var/log/nginx
# goaccess flamy_ca.access.log flamy_ca.access.log.1 <(zcat flamy_ca.access.log.*.gz)
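The `<(...)` part of the command above is bash process substitution. The sketch below shows what it does, using throwaway files and cat as a stand-in for goaccess; the file names and contents are made up for the demonstration, and `gzip -dc` is used as the equivalent of zcat.

```shell
#!/usr/bin/env bash

# Work in a throwaway directory with two made-up log files.
work_dir="$(mktemp -d)"
printf 'line from current log\n' > "${work_dir}/access.log"
printf 'line from rotated log\n' | gzip > "${work_dir}/access.log.2.gz"

# <(...) is replaced by a path (such as /dev/fd/63) that the command opens like a file.
echo "process substitution expands to:" <(true)

# cat stands in for goaccess: it reads one regular file and one pipe.
combined="$(cat "${work_dir}/access.log" <(gzip -dc "${work_dir}/access.log.2.gz"))"
echo "${combined}"

rm -r "${work_dir}"
```

Because the shell hands over a path rather than redirecting stdin, the decompressed stream can be mixed freely with regular file arguments. This only works in shells that support process substitution, such as bash.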

The screenshot below shows multiple days of log files in the 'Unique visitors per day' panel. The pipe is shown as one of the log sources.

Sample goAccess TUI view

Logs are managed by logrotate, and in the default configuration compressed logs are kept for 14 days.

Theming

The GoAccess screenshots in this article use the 256 Tuesday Bright [4] color scheme, which works well with the solarized [5] light [6] theme that I use everywhere.

I made some very minor changes to the 256 Tuesday Bright theme to get it to work in GoAccess 1.2. The following configuration was appended to /etc/goaccess.conf:

color COLOR_MTRC_HITS               color27:color254
color COLOR_MTRC_VISITORS           color161:color254
color COLOR_MTRC_DATA               color28:color254
color COLOR_MTRC_BW                 color173:color254
color COLOR_MTRC_AVGTS              color240:color254
color COLOR_MTRC_CUMTS              color130:color254
color COLOR_MTRC_MAXTS              color92:color254
color COLOR_MTRC_PROT               color161:color254
color COLOR_MTRC_MTHD               color75:color254
color COLOR_MTRC_HITS_PERC          color92:color254
color COLOR_MTRC_HITS_PERC_MAX      color92:color254 underline,bold
color COLOR_MTRC_HITS_PERC_MAX      color92:color254 underline,bold VISITORS
color COLOR_MTRC_HITS_PERC_MAX      color92:color254 underline,bold OS
color COLOR_MTRC_HITS_PERC_MAX      color92:color254 underline,bold BROWSERS
color COLOR_MTRC_HITS_PERC_MAX      color92:color254 underline,bold VISIT_TIMES
color COLOR_MTRC_VISITORS_PERC      color92:color254
color COLOR_MTRC_VISITORS_PERC_MAX  color92:color254 underline,bold
color COLOR_PANEL_COLS              color242:color254
color COLOR_BARS                    color26:color254
color COLOR_ERROR                   color231:color161
color COLOR_SELECTED                color15:color161
color COLOR_PANEL_ACTIVE            color7:color243
color COLOR_PANEL_HEADER            color234:color249
color COLOR_PANEL_DESC              color237:color254
color COLOR_OVERALL_LBLS            color234:color254
color COLOR_OVERALL_VALS            color27:color254
color COLOR_OVERALL_PATH            color173:color254
color COLOR_ACTIVE_LABEL            color0:color249
color COLOR_BG                      color0:color254
color COLOR_DEFAULT                 color0:color254
color COLOR_PROGRESS                color7:color161

Managing Logs

I would like to keep the logs for longer than 14 days and make GoAccess web reports for periods of months and years. There is a 'right' way of doing this: it involves setting up a remote logging server and then messing with syslog-ng for a while. But I'm implementing this on a single machine and I'm absolutely unwilling to enterprise my way out of it. I'll copy files over ssh.

By default in Debian, compressed logs are numbered in order, i.e. flamy_ca.access.log.2.gz to flamy_ca.access.log.14.gz (with delaycompress, flamy_ca.access.log.1 stays uncompressed). I need to update the nginx logrotate configuration to use the date (yyyy-mm-dd) as a suffix and, while I'm at it, switch to lzma compression (.xz file extension) to save a little on storage and data transfer.

This configuration update is not strictly necessary on Red Hat-based systems, as they use the date as a file suffix by default.

My logrotate configuration in /etc/logrotate.d/nginx looks like the following:

/var/log/nginx/*.log {
        daily
        missingok
        rotate 14
        compress
        delaycompress
        compresscmd /usr/bin/xz
        compressext .xz
        uncompresscmd /usr/bin/unxz
        notifempty
        create 0640 www-data adm
        dateext
        dateformat .%Y-%m-%d
        sharedscripts
        prerotate
                if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
                        run-parts /etc/logrotate.d/httpd-prerotate; \
                fi \
        endscript
        postrotate
                invoke-rc.d nginx rotate >/dev/null 2>&1
        endscript
}

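Beware! logrotate is not a long-running service: it is started periodically by cron or a systemd timer and reads its configuration on each run, so the new settings only show up after the next rotation. It took me a few days to figure that one out. On a systemd-based Debian, the following command starts a run right away instead of waiting for the schedule.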

# service logrotate restart
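Before waiting for a rotation to happen, the configuration can be checked with logrotate's debug mode. The sketch below wraps that in a small helper; the function name check_logrotate_conf is made up, and the default path is the one used in this article.

```shell
#!/usr/bin/env bash

# Hypothetical helper: dry-run a logrotate config file.
# -d is logrotate's debug mode: it prints what would be done and changes nothing.
check_logrotate_conf() {
    local conf="${1:-/etc/logrotate.d/nginx}"
    if ! command -v logrotate >/dev/null 2>&1; then
        echo "logrotate is not installed" >&2
        return 1
    fi
    logrotate -d "${conf}"
}
```

Running `check_logrotate_conf` as root with no argument checks /etc/logrotate.d/nginx. Syntax errors, such as a missing endscript, should be reported here rather than discovered days later. A real rotation can be forced with `logrotate -f` on the same file, but unlike -d that does rotate the logs.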

After running the new logrotate configuration for a while, I started getting the following rotated log file names:

flamy_ca.access.log.2020-02-10.xz
flamy_ca.access.log.2020-02-11.xz
flamy_ca.access.log.2020-02-12.xz
flamy_ca.access.log.2020-02-13.xz
flamy_ca.access.log.2020-02-14.xz
flamy_ca.access.log.2020-02-15.xz
flamy_ca.access.log.2020-02-16.xz
flamy_ca.access.log.2020-02-17.xz
flamy_ca.access.log.2020-02-18.xz
flamy_ca.access.log.2020-02-19.xz
flamy_ca.access.log.2020-02-20.xz
flamy_ca.access.log.2020-02-21.xz
flamy_ca.access.log.2020-02-22.xz
flamy_ca.access.log.2020-02-23.xz
flamy_ca.access.log.2020-02-24.xz

The following script moves compressed logs from /var/log/nginx to a place where the archiving script can pull those files to a remote machine. The script also changes ownership of the files from 'www-data:adm' to the user who archives the logs.

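#!/usr/bin/env bash

# Move this to /usr/local/sbin

# Source config
LOG_SRC_DIR="/var/log/nginx"
LOG_PREFIX="flamy_ca.access.log"
LOG_ARCH_SUFFIX=".xz"

# Destination config
LOG_DEST_DIR="/home/alex/logs/flamy_ca"
LOG_DEST_CRED="alex:alex"    # user:group that will own the archived logs


if [ ! -d "${LOG_SRC_DIR}" ]
then
    echo "Source log directory doesn't exist" >&2
    exit 1
fi


if [ ! -d "${LOG_DEST_DIR}" ]
then
    echo "Destination log directory doesn't exist" >&2
    exit 2
fi

# Collect the compressed logs; with nullglob an unmatched pattern expands to nothing
shopt -s nullglob
archives=("${LOG_SRC_DIR}/${LOG_PREFIX}".*"${LOG_ARCH_SUFFIX}")

# Nothing to do if no log has been compressed yet
[ "${#archives[@]}" -gt 0 ] || exit 0

# Variables are quoted; the glob stays outside the quotes so it still expands
mv "${archives[@]}" "${LOG_DEST_DIR}"
chown "${LOG_DEST_CRED}" "${LOG_DEST_DIR}"/*"${LOG_ARCH_SUFFIX}"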

I set up a daily cron job to move the compressed logs out, but this script can be run at any time within the 14 days before the logs get cleaned up by logrotate. To run it at a longer interval, consider increasing the rotate option in the logrotate configuration.

@daily /usr/local/sbin/archive_logs

To get the logs off the web server, I run the following script from the storage machine.

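#!/usr/bin/env bash

# Place this into /usr/local/bin

REMOTE_SERVER="flamy.ca"
# The tilde is quoted on purpose: it has to be expanded on the remote machine
REMOTE_DIR="~/logs/flamy_ca"

# ${HOME} instead of a quoted "~", which would not be expanded locally
LOCAL_DIR="${HOME}/storage/logs"


# Delete the remote copies only if the transfer succeeded
scp "${REMOTE_SERVER}:${REMOTE_DIR}/*.xz" "${LOCAL_DIR}" &&
    ssh "${REMOTE_SERVER}" "rm -v ${REMOTE_DIR}/*.xz"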

To automate copying the files, I run this script as a cron job on the computer that stores the files.

@weekly /usr/local/bin/pull_logs

You will need to set up ssh public/private keys via ssh-keygen in order to allow passwordless authentication between the servers.

I chose this setup because the storage server is behind a firewall, so it is less likely to be compromised. Also, in case the web server is compromised, the attacker will have a harder time figuring out the external services associated with the server. However, I'm not a security expert, so please take this approach with some skepticism.

Generating HTML reports

When storing logs in a single directory, I use a shell wildcard (glob) to combine one month of logs into a single file:

$ xzcat flamy_ca.access.log.2020-02-* > feb_2020_report.log

Then I generate an HTML report from the monthly log:

$ goaccess feb_2020_report.log -a -o feb_2020_report.html

This should produce an interactive page that looks similar to the following

A crop of a goAccess html report with general statistics and two sample sections.
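The two steps above (concatenate one month of logs, then render the report) can be repeated for every month found in the archive directory. The following is only a sketch under a few assumptions: the function names and the report_yyyy-mm output names are made up, and the log files are expected to follow the flamy_ca.access.log.yyyy-mm-dd.xz naming shown earlier.

```shell
#!/usr/bin/env bash

LOG_PREFIX="flamy_ca.access.log"

# Print each distinct yyyy-mm found in the rotated log names of a directory.
list_months() {
    local dir="$1" file name
    for file in "${dir}/${LOG_PREFIX}".*.xz; do
        [ -e "${file}" ] || continue       # skip the literal pattern if nothing matched
        name="${file##*/}"                 # strip the directory
        name="${name#"${LOG_PREFIX}".}"    # strip the prefix, leaving 2020-02-10.xz
        echo "${name:0:7}"                 # keep yyyy-mm
    done | sort -u
}

# Build one HTML report per month (hypothetical output names).
build_monthly_reports() {
    local dir="$1" month
    for month in $(list_months "${dir}"); do
        xzcat "${dir}/${LOG_PREFIX}.${month}"-*.xz > "${dir}/report_${month}.log"
        goaccess "${dir}/report_${month}.log" -a -o "${dir}/report_${month}.html"
    done
}
```

Calling `build_monthly_reports ~/storage/logs` would then write one report_yyyy-mm.html per month next to the archived logs. The intermediate .log files are uncompressed and can be deleted once the reports exist.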

Privacy implications of storing web server logs

I have an issue with storing user data, and I'm not sure how GDPR-compliant it is -- I have a record of the IP addresses that connected to my site and at what time. I will never use this data for anything other than looking at the aggregate summary of site visitors. Still, I'm going to look for a tool that anonymizes visitor IP addresses and doesn't break GoAccess reports.
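As an illustration of what such anonymization could look like, here is a minimal sketch that zeroes the last octet of IPv4 addresses before the logs are archived. This is one possible approach, not a tool this article has settled on: the function name anonymize_ipv4 is made up, IPv6 addresses are left untouched, and whether this is enough for GDPR purposes is not something the sketch answers.

```shell
#!/usr/bin/env bash

# Hypothetical helper: zero the last octet of a leading IPv4 address
# on every line read from stdin. Lines that do not start with an IPv4
# address (for example IPv6 clients) pass through unchanged.
anonymize_ipv4() {
    sed -E 's/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})\.[0-9]{1,3} /\1.0 /'
}

# Made-up sample line in the combined log format.
echo '203.0.113.77 - - [10/Feb/2020:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"' | anonymize_ipv4
```

The rest of the line is unchanged and the first field is still a valid IP address, so the log-format shown earlier should still apply. The trade-off is that visitors who share the first three octets are counted as one host in the reports.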