GoAccess as a Google Analytics replacement
Intro
Since removing Google Analytics from my blog a few months ago, I have been looking for a tool that would give me some idea of who is reading my blog while respecting visitors' privacy.
I have found that tool and it is GoAccess.
GoAccess is an open-source web analytics tool with a text-based (TUI) interface and a single-page web interface. It is packaged for most Linux distros [1], including all of the Debian releases [2].
Installation and basic usage
GoAccess is installed with the following command.
# apt install goaccess
Installing the GeoIP database adds a regional breakdown of incoming requests to the stats.
# apt install geoip-database
I use Nginx as a static file server and I'm interested in analyzing its access log files. The log file is set in the 'server' section of the Nginx configuration.
access_log /var/log/nginx/flamy_ca.access.log;
For more details on setting up Nginx, see the Nginx config section of the 'Moving pelican blog to own server' article [3].
Opening an access log file with the goaccess command gives statistics for everything in that file. With daily log rotation that is roughly the current day.
On Debian you need to be root or belong to the group 'adm' to be able to read the logs.
# goaccess /var/log/nginx/flamy_ca.access.log
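Before reaching for root, it may help to check whether the current user is already in 'adm'. This is a minimal sketch; the function name `in_group` is made up for illustration.

```bash
# Return 0 if the current user belongs to the group named in $1.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

# Usage:
# in_group adm || echo "not in 'adm'; as root run: usermod -aG adm $(id -un)" >&2
```

A new group membership only applies to sessions started after the change, so log out and back in before trying goaccess again.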
That produces a nice summary of site visitors, requested files, referring sites, etc.
Navigation in GoAccess:
- up/down arrows scroll up and down line-by-line.
- tab scrolls down section-by-section.
- 'enter' or right arrow expands each section to show top 10 entries.
- keys 1-9 to jump to sections 1 to 9. This doesn't work beyond section 9 -- 'Virtual hosts'.
GoAccess updates the summaries live as the Nginx process appends to the logs.
Configuration
In my use case, GoAccess only displays stats for Nginx logs. Setting the following defaults in /etc/goaccess.conf avoids repeating the configuration on every invocation of the program.
time-format %H:%M:%S
date-format %d/%b/%Y
log-format %h %^[%d:%t %^] "%r" %s %b "%R" "%u"
color-scheme 3
config-dialog false
hl-header true
agent-list true
ignore-crawlers true
std-geoip true
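The log-format line above describes the "combined" layout that Nginx writes when access_log is given without a format name. To see which part of a line each specifier refers to, here is a small awk sketch run on a made-up log line. The sample line, its addresses and the function name `parse_line` are assumptions for illustration, not taken from my logs.

```bash
# Hypothetical log line in the "combined" layout (documentation-only addresses).
sample='203.0.113.7 - - [24/Feb/2020:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120 "https://example.org/" "Mozilla/5.0"'

# Print host (%h), date (%d), time (%t), status (%s) and bytes (%b)
# for each log line read from stdin.
parse_line() {
    awk '{
        host   = $1
        day    = substr($4, 2, 11)   # skip "[", take dd/Mon/yyyy
        clock  = substr($4, 14, 8)   # HH:MM:SS after the ":" separator
        status = $9
        bytes  = $10
        print host, day, clock, status, bytes
    }'
}

printf '%s\n' "$sample" | parse_line
```

The date and time pieces correspond to the date-format and time-format settings, which is why those two lines have to agree with the log-format line.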
This is all the configuration needed for basic monitoring. The following command displays the summary for the past two days.
# goaccess /var/log/nginx/flamy_ca.access.log /var/log/nginx/flamy_ca.access.log.1
Both types of logs, recent (uncompressed) and rotated (compressed), can be viewed in a single report. This is done with bash process substitution: zcat decompresses the rotated logs and goaccess reads that stream as one more log file.
# cd /var/log/nginx
# goaccess flamy_ca.access.log flamy_ca.access.log.1 <(zcat flamy_ca.access.log.*.gz)
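Process substitution is a bash feature, so the command above will not work in a plain POSIX shell. A portable alternative is to stream all the logs through one pipe. The sketch below picks a decompressor by file extension. The function name `cat_logs` is hypothetical, and the `goaccess -` form in the usage comment relies on the man page's description of reading the log from stdin, so treat it as an assumption to verify on your version.

```bash
# Print every log given as an argument, decompressing on the fly
# according to the file extension.
cat_logs() {
    for log_file in "$@"; do
        case "$log_file" in
            *.gz) gzip -dc -- "$log_file" ;;
            *.xz) xz -dc -- "$log_file" ;;
            *)    cat -- "$log_file" ;;
        esac
    done
}

# Usage, run from /var/log/nginx:
# cat_logs flamy_ca.access.log.*.gz flamy_ca.access.log.1 flamy_ca.access.log | goaccess -
```

The trade-off is that a report fed from a pipe is a snapshot: the live updates of the current log are lost.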
The screenshot below shows multiple days of log files on the 'Unique visitors per day' panel. The process substitution shows up as a pipe in the list of log sources.
Logs are managed by logrotate, configured in /etc/logrotate.d, and in the default configuration compressed logs are kept for 14 days.
Theming
The GoAccess screenshots in this article show the 256 Tuesday Bright [4] color scheme, which works well with the solarized [5] light [6] theme that I use everywhere.
I made some very minor changes to the 256 Tuesday Bright theme to get it to work in GoAccess 1.2. The following configuration was appended to /etc/goaccess.conf
color COLOR_MTRC_HITS color27:color254
color COLOR_MTRC_VISITORS color161:color254
color COLOR_MTRC_DATA color28:color254
color COLOR_MTRC_BW color173:color254
color COLOR_MTRC_AVGTS color240:color254
color COLOR_MTRC_CUMTS color130:color254
color COLOR_MTRC_MAXTS color92:color254
color COLOR_MTRC_PROT color161:color254
color COLOR_MTRC_MTHD color75:color254
color COLOR_MTRC_HITS_PERC color92:color254
color COLOR_MTRC_HITS_PERC_MAX color92:color254 underline,bold
color COLOR_MTRC_HITS_PERC_MAX color92:color254 underline,bold VISITORS
color COLOR_MTRC_HITS_PERC_MAX color92:color254 underline,bold OS
color COLOR_MTRC_HITS_PERC_MAX color92:color254 underline,bold BROWSERS
color COLOR_MTRC_HITS_PERC_MAX color92:color254 underline,bold VISIT_TIMES
color COLOR_MTRC_VISITORS_PERC color92:color254
color COLOR_MTRC_VISITORS_PERC_MAX color92:color254 underline,bold
color COLOR_PANEL_COLS color242:color254
color COLOR_BARS color26:color254
color COLOR_ERROR color231:color161
color COLOR_SELECTED color15:color161
color COLOR_PANEL_ACTIVE color7:color243
color COLOR_PANEL_HEADER color234:color249
color COLOR_PANEL_DESC color237:color254
color COLOR_OVERALL_LBLS color234:color254
color COLOR_OVERALL_VALS color27:color254
color COLOR_OVERALL_PATH color173:color254
color COLOR_ACTIVE_LABEL color0:color249
color COLOR_BG color0:color254
color COLOR_DEFAULT color0:color254
color COLOR_PROGRESS color7:color161
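Each entry above pairs a foreground and a background index from the xterm 256-color palette (colorFG:colorBG). To preview a pair before putting it into the config, a terminal that supports 256 colors can render it directly. A minimal sketch; the function name `swatch` is made up.

```bash
# Print a label using the given xterm-256 foreground ($1) and background ($2)
# indexes, the same numbers used in "colorFG:colorBG".
swatch() {
    printf '\033[38;5;%d;48;5;%dm %-28s \033[0m\n' "$1" "$2" "$3"
}

swatch 27  254 'COLOR_MTRC_HITS'
swatch 161 254 'COLOR_MTRC_VISITORS'
swatch 231 161 'COLOR_ERROR'
```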
Managing Logs
I would like to keep the logs for longer than 14 days and make GoAccess web reports for periods of months and years. There is a 'right' way of doing this: it involves setting up a remote logging server and then messing with syslog-ng for a while. But I'm implementing this on a single machine and I'm absolutely unwilling to enterprise my way out of it. I'll copy files over ssh.
By default in Debian, rotated logs are numbered, and with delaycompress the compressed ones run from flamy_ca.access.log.2.gz to flamy_ca.access.log.14.gz. I need to update the nginx logrotate configuration to use the date (yyyy-mm-dd) as a suffix and, while I'm at it, switch to lzma compression (.xz file extension) to save a little on storage and data transfer.
This configuration update is not strictly necessary for RedHat-based systems as they use date as a file suffix by default.
My logrotate configuration in /etc/logrotate.d/nginx looks like the following
/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    compresscmd /usr/bin/xz
    compressext .xz
    uncompresscmd /usr/bin/unxz
    notifempty
    create 0640 www-data adm
    dateext
    dateformat .%Y-%m-%d
    sharedscripts
    prerotate
        if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
            run-parts /etc/logrotate.d/httpd-prerotate; \
        fi \
    endscript
    postrotate
        invoke-rc.d nginx rotate >/dev/null 2>&1
    endscript
}
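Beware! The new configuration does not rename logs that have already been rotated; it only takes effect on the next logrotate run. logrotate is not a daemon: cron or a systemd timer starts it once a day, and it rereads its configuration each time. It took me a few days to figure that one out. On a systemd-based Debian, restarting the logrotate service triggers a run right away.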
# service logrotate restart
After running new logrotate configuration for a while, I started getting the following rotated log file names
flamy_ca.access.log.2020-02-10.xz
flamy_ca.access.log.2020-02-11.xz
flamy_ca.access.log.2020-02-12.xz
flamy_ca.access.log.2020-02-13.xz
flamy_ca.access.log.2020-02-14.xz
flamy_ca.access.log.2020-02-15.xz
flamy_ca.access.log.2020-02-16.xz
flamy_ca.access.log.2020-02-17.xz
flamy_ca.access.log.2020-02-18.xz
flamy_ca.access.log.2020-02-19.xz
flamy_ca.access.log.2020-02-20.xz
flamy_ca.access.log.2020-02-21.xz
flamy_ca.access.log.2020-02-22.xz
flamy_ca.access.log.2020-02-23.xz
flamy_ca.access.log.2020-02-24.xz
The following script moves compressed logs from /var/log/nginx to a place where the archiving script can pull those files to a remote machine. The script also changes ownership of the files from 'www-data' to the user who archives the logs.
#!/usr/bin/env bash
# Move this to /usr/local/sbin

# Source config
LOG_SRC_DIR="/var/log/nginx"
LOG_PREFIX="flamy_ca.access.log"
LOG_ARCH_SUFFIX=".xz"

# Destination config
LOG_DEST_DIR="/home/alex/logs/flamy_ca"
LOG_DEST_CRED="alex:alex"   # user:group that archives the logs

if [ ! -d "${LOG_SRC_DIR}" ]
then
    echo "Source log directory doesn't exist" >&2
    exit 1
fi

if [ ! -d "${LOG_DEST_DIR}" ]
then
    echo "Destination log directory doesn't exist" >&2
    exit 2
fi

# Move the compressed logs, then hand them over to the archiving user
mv "${LOG_SRC_DIR}/${LOG_PREFIX}".*"${LOG_ARCH_SUFFIX}" "${LOG_DEST_DIR}"
chown "${LOG_DEST_CRED}" "${LOG_DEST_DIR}"/*"${LOG_ARCH_SUFFIX}"
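One rough edge in the script above: when no compressed log matches, the shell passes the unexpanded pattern to mv, which fails and makes cron report an error. A possible way around it is to loop over the matches and skip anything that is not a file. This is only a sketch; the function name `move_archives` and its argument order are made up for illustration.

```bash
# Move rotated, compressed logs from one directory to another.
# Arguments: source dir, destination dir, file prefix, archive suffix.
# Prints the number of files moved; an empty source is not an error.
move_archives() (
    src_dir=$1 dest_dir=$2 prefix=$3 suffix=$4
    moved=0
    for log_file in "${src_dir}/${prefix}".*"${suffix}"; do
        # An unmatched glob stays a literal string, so skip non-files.
        [ -f "${log_file}" ] || continue
        mv -- "${log_file}" "${dest_dir}/" && moved=$((moved + 1))
    done
    echo "${moved}"
)

# Usage:
# move_archives /var/log/nginx /home/alex/logs/flamy_ca flamy_ca.access.log .xz
```

The function body runs in a subshell (round brackets instead of braces), so its variables do not leak into the calling script.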
I set up a daily cron job to move the compressed logs, but the script can run at any interval shorter than the 14 days after which logrotate deletes them. To run it less often, increase the rotate option in the logrotate configuration.
@daily /usr/local/sbin/archive_logs
To get the logs off the web server, I run the following script from the storage machine.
#!/usr/bin/env bash
# Place this into /usr/local/bin

REMOTE_SERVER="flamy.ca"
REMOTE_DIR="logs/flamy_ca"          # relative to the remote home directory
LOCAL_DIR="${HOME}/storage/logs"    # a quoted ~ would not be expanded

# Remove the remote copies only if the transfer succeeded
scp "${REMOTE_SERVER}:${REMOTE_DIR}/*.xz" "${LOCAL_DIR}" \
    && ssh "${REMOTE_SERVER}" "rm -v ${REMOTE_DIR}/*.xz"
To automate copying the files, I run this script from a cron job on the computer that stores them.
@weekly /usr/local/bin/pull_logs
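Copying and deleting in two separate steps leaves a small window in which a file that arrives after the copy is deleted without having been transferred. If rsync is installed on both machines (an assumption, my setup uses scp), its --remove-source-files option deletes each source file only after that file has been transferred. A hedged sketch with a made-up function name:

```bash
# Copy the *.xz files from a source directory to a destination directory.
# rsync removes each source file only once it has been transferred.
# $1 may be a local path or a remote one such as host:dir.
pull_logs() {
    rsync -a --remove-source-files --include='*.xz' --exclude='*' "$1/" "$2/"
}

# Usage on the storage machine (remote path relative to the remote home):
# pull_logs flamy.ca:logs/flamy_ca "${HOME}/storage/logs"
```

The include/exclude pair restricts the transfer to .xz files at the top level of the source directory, so anything else stored there is left alone.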
You will need to set up an SSH public/private key pair with ssh-keygen to allow passwordless authentication between the servers.
I chose this setup because the storage server is behind a firewall and so less likely to be compromised. Also, if the web server is compromised, the attacker will have a harder time figuring out which external services are associated with it. However, I'm not a security expert, so please take this approach with some skepticism.
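For a cron job the key cannot ask for a passphrase, so one option is a dedicated key pair used for nothing else. The sketch below assumes an OpenSSH version with ed25519 support; the key path and the function name `make_pull_key` are made up for illustration.

```bash
# Create a dedicated key pair without a passphrase at the path given in $1.
# The public half is written to "$1.pub".
make_pull_key() {
    ssh-keygen -q -t ed25519 -N '' -C 'pull_logs' -f "$1"
}

# Usage on the storage machine, then install the public key on the web server:
# make_pull_key "${HOME}/.ssh/pull_logs"
# ssh-copy-id -i "${HOME}/.ssh/pull_logs.pub" flamy.ca
```

Anyone who can read a passphrase-less private key can log in with it, so it should stay readable by the archiving user only.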
Generating HTML reports
When storing logs in a single directory, I use a shell wildcard to combine a month of logs into one file
$ xzcat flamy_ca.access.log.2020-02-* > feb_2020_report.log
Then generate the HTML report from the monthly log
$ goaccess feb_2020_report.log -a -o feb_2020_report.html
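One caveat with selecting files by name: as far as I can tell, logrotate's dateext stamps a file with the date of the rotation, so a file holds mostly the previous day's requests and the month boundaries are off by a day. If exact months matter, the lines can be filtered by their own timestamp instead. A sketch, assuming the combined log layout used above; the function name `only_month` is made up.

```bash
# Keep only the log lines whose timestamp (4th field) falls in the given
# month and year, e.g. "only_month Feb 2020". Reads stdin, writes stdout.
only_month() {
    awk -v stamp="/$1/$2:" 'index($4, stamp) > 0'
}

# Usage: include the first file of the next month to catch the last day.
# xzcat flamy_ca.access.log.2020-02-*.xz flamy_ca.access.log.2020-03-01.xz \
#     | only_month Feb 2020 > feb_2020_report.log
```

Matching on the fourth field rather than on the whole line avoids false hits from URLs or referrers that happen to contain a similar string.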
This should produce an interactive page that looks similar to the following
Privacy implications of storing web server logs
I have an issue with storing user data, and I'm not sure how GDPR-compliant it is: I have a record of the IP addresses that connected to my site and at what time. I will never use this data for anything other than looking at the aggregate summary of site visitors. Still, I'm going to look for a tool that anonymizes visitor IP addresses without breaking GoAccess reports.
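Until such a tool turns up, one crude stopgap is to zero the last octet of each IPv4 address before a log is archived. This is only a sketch of the idea, not a vetted anonymization scheme: it leaves IPv6 addresses untouched, it assumes the address is the first field of each line, and the function name `anonymize_ipv4` is made up.

```bash
# Replace the last octet of a leading IPv4 address with 0 on every line
# read from stdin. Lines that do not start with an IPv4 address pass through.
anonymize_ipv4() {
    sed -E 's/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})\.[0-9]{1,3} /\1.0 /'
}

# Usage:
# xzcat flamy_ca.access.log.2020-02-10.xz | anonymize_ipv4 | xz > anonymized.log.xz
```

The rewritten lines keep the same layout, so GoAccess should still parse them, but visitors who share the first three octets will be counted as one, which lowers the unique visitor numbers.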