[RADIATOR] mysql DBI timed out connections and too many open files (sockets)

Thu Mar 30 10:54:48 UTC 2017

Hi,

I've tracked down a quite unexpected issue with radiator in our live
environment and I don't know, how to solve it properly.
It's a radiator 4.11 live accounting instance with single mysql DB
configuration:
        <AuthBy SQL>
                DBSource                dbi:mysql:dbname:x.x.x.x:3307
                Timeout                 2
                FailureBackoffTime      10

Every time, when a DBI mysql connection times out, a new socket is
opened. These sockets, which never get a reply (e.g. conn refused) from
server, idle forever - each one consumes an "open file". Since the
server is Debian 8 with systemd, the radiator process is started with
the following systemd service configuration:

[Service]
Type=simple
Restart=on-failure
TimeoutSec=10
User=radius
ExecStart=/usr/bin/radiusd -foreground -config_file /etc/acct.cfg

As the Debian 8 systemd ignores system wide limits and starts the
process with systemd default limit of 1024
$ cat /proc/${pid_of_radiusd}/limits
Limit                     Soft Limit          Hard Limit           Units

Max open files            1024                1024
files

Basically, the radiusd will hit that limit after approx 1020 DBI reconnects:
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = -1 EMFILE (Too many open files)
As there are too many open files, also all logging fails, as the process
is unable to open a log file.

Even is the database comes online again for example 800 reconnects, life
goes on, but the previously opened 800 socket remain there until the
process exists. Problem is, that the old, timed out, sockets are never
freed. Example of a process open file list, which exhausted the resources:
$ lsof -p radiusd_pid
radiusd 6409 radius    0r   CHR                1,3      0t0   6097 /dev/null
radiusd 6409 radius    1u  unix 0xffff88007b997bc0      0t0  57168 socket
radiusd 6409 radius    2u  unix 0xffff88007b997bc0      0t0  57168 socket
radiusd 6409 radius    3u  IPv4              57184      0t0    UDP *:9141
radiusd 6409 radius    4u  IPv4              57183      0t0    UDP
localhost:914
radiusd 6409 radius    5u  sock                0,7      0t0  57195 can't
identify protocol
....1015 lines skipped...
radiusd 6409 radius 1020u  sock                0,7      0t0  70498 can't
identify protocol
radiusd 6409 radius 1021u  sock                0,7      0t0  70516 can't
identify protocol
radiusd 6409 radius 1022u  sock                0,7      0t0  70525 can't
identify protocol
radiusd 6409 radius 1023u  sock                0,7      0t0  70539 can't
identify protocol

I know, that I can tune the systemd service parameter LimitNOFILE=x to a
bigger value, but it doesn't change the behaviour. Eventually the kernel
sockets resource runs out.

Even if the radiusd hits the "Too many open files", it would be a great
help, if the process would exit. all resources would be freed and
systemd would restart it after 10 seconds. But the radiusd doesn't die
nor serve.

Related debian packages and versions:
 perl                           5.20.2-3+deb8u4
 libdbi-perl                    1.631-3+b1
 libdbd-mysql-perl              4.028-2+b1
 libmysqlclient18:amd64         5.5.47-0+deb8u1

With the specified configuration above, major problem occured when the
database server was offline for more than 2..3 hours (all the 1024
reconnection attempts timed out).
Suddenly the instance stopped answering and also the logfile stopped
with the following line and then nothing.
Sat Mar 25 01:29:06 2017: DEBUG: Connecting to
'dbi:mysql:dbname:x.x.x.x:3307'

Increasing the file limits is a poor improvement, not a solution to root
cause.
Is there something wrong with DBI?
Or is there a better way to avoid depletion of resources?

kind regards,

-- 
rauno