[RADIATOR] mysql DBI timed out connections and too many open files (sockets)

Thu Apr 20 10:31:57 UTC 2017

Hi,

The radiusd process has to have effective open file limit, instead of
"unlimited". Debian systemd limits non-root processes to 1024, also the
centos-5 has the same limit even for root user process.
Despite the open file limit, You should see the number of open sockets
for radiusd process increasing.

I did a quick test on a random CentOS release 5.11 box with
Radiator-4.11-1. Installed the rpm, changed only radius.cfg to the
following:

grep -v "^#" /etc/radiator/radius.cfg
LogDir          /var/log/radius
DbDir           /etc/radiator
Trace           4
<Client DEFAULT>
        Secret  asd
        DupInterval 0
</Client>
<Realm DEFAULT>
        <AuthBy SQL>
                DBSource                dbi:mysql:dbname:127.0.0.1:3310
                Timeout                 1
                FailureBackoffTime      1
                DBUsername              someuser
                DBAuth                  pointlesspassword
                AuthSelect      select 'random' from nowhere
                AuthColumnDef   0,      User-Name, check
                NoDefault
        </AuthBy>
</Realm>

Started the service:
# /etc/init.d/radiator start
# ps uaxww | grep radiusd
root     18839  0.2  1.0  27352 15712 pts/0    S    11:52   0:03
/usr/bin/perl /usr/bin/radiusd -pid_file /var/log/radius/radiusd.pid
-config_file /etc/radiator/radius.cfg -daemon

# cat /proc/`pgrep radiusd`/limits | grep files
Max open files            1024                 1024                 files

Killed off the IP level access to non-existing Mysql
# iptables -t filter -A INPUT -p TCP -i lo -s 127.0.0.1 --dport 3310 -j DROP

# head /var/log/radius/logfile
Thu Apr 20 11:52:07 2017: DEBUG: Finished reading configuration file
'/etc/radiator/radius.cfg'
Thu Apr 20 11:52:08 2017: DEBUG: Creating authentication port 0.0.0.0:1645
Thu Apr 20 11:52:08 2017: DEBUG: Creating accounting port 0.0.0.0:1646
Thu Apr 20 11:52:08 2017: NOTICE: Server started: Radiator 4.11
Thu Apr 20 11:56:13 2017: DEBUG: Packet dump:
*** Received from 127.0.0.1 port 50719 ....
Code:       Access-Request
Identifier: 118
Authentic:  <129>)$<226>3<157>"<3><169><226><219><216>'<216><250><225>
Attributes:
        User-Name = "mikem"
        Service-Type = Framed-User
        NAS-IP-Address = 203.63.154.1
        NAS-Identifier = "203.63.154.1"
        NAS-Port = 1234
        Called-Station-Id = "123456789"
        Calling-Station-Id = "987654321"
        NAS-Port-Type = Async
        User-Password = <137>&<8>\<180>_^<201>6mqwT&<207><158>

Thu Apr 20 11:56:13 2017: DEBUG: Handling request with Handler
'Realm=DEFAULT', Identifier ''
Thu Apr 20 11:56:13 2017: DEBUG:  Deleting session for mikem,
203.63.154.1, 1234
Thu Apr 20 11:56:13 2017: DEBUG: Handling with Radius::AuthSQL:
Thu Apr 20 11:56:13 2017: DEBUG: Handling with Radius::AuthSQL:
Thu Apr 20 11:56:13 2017: DEBUG: Connecting to
'dbi:mysql:dbname:127.0.0.1:3310'

Thu Apr 20 11:56:14 2017: ERR: Could not connect to SQL database with
DBI->connect dbi:mysql:dbname:127.0.0.1:3310, someuser, pointlesspassword:
Thu Apr 20 11:56:14 2017: ERR: Could not connect to any SQL database.
Request is ignored. Backing off for 1 seconds
Thu Apr 20 11:56:14 2017: DEBUG: AuthBy SQL result: IGNORE, User
database access error

Then all I had to do, is hit it often with radpwtst.
$ for i in `seq 1 10000`; do radpwtst -secret asd -s 127.0.0.1 -noacct
-nostart -auth_port 1645 -timeout 1; done

The network connections attempts are growing and then the idling "sock"s
remain there.

# lsof -p `pgrep radiusd`
...
radiusd 18839 root  244u  sock     0,5      0t0  7898980 can't identify
protocol
radiusd 18839 root  245u  sock     0,5      0t0  7898989 can't identify
protocol
radiusd 18839 root  246u  sock     0,5      0t0  7899002 can't identify
protocol
radiusd 18839 root  247u  sock     0,5      0t0  7899011 can't identify
protocol
radiusd 18839 root  248u  sock     0,5      0t0  7899020 can't identify
protocol
radiusd 18839 root  249u  IPv4 7899029      0t0      TCP
127.0.0.1:33281->127.0.0.1:dyna-access (SYN_SENT)
radiusd 18839 root  250u  IPv4 7899041      0t0      TCP
127.0.0.1:33282->127.0.0.1:dyna-access (SYN_SENT)
radiusd 18839 root  251u  IPv4 7899050      0t0      TCP
127.0.0.1:33283->127.0.0.1:dyna-access (SYN_SENT)
radiusd 18839 root  252u  IPv4 7899068      0t0      TCP
127.0.0.1:33284->127.0.0.1:dyna-access (SYN_SENT)
...

50 minutes later the logfile stops growing:

Thu Apr 20 12:46:40 2017: DEBUG: Packet dump:
*** Received from 127.0.0.1 port 40107 ....
Code:       Access-Request
Identifier: 85
Authentic:  <177><239><172>6<253>o<251><18><144><157><147>"<163>k<130>.
Attributes:
        User-Name = "mikem"
        Service-Type = Framed-User
        NAS-IP-Address = 203.63.154.1
        NAS-Identifier = "203.63.154.1"
        NAS-Port = 1234
        Called-Station-Id = "123456789"
        Calling-Station-Id = "987654321"
        NAS-Port-Type = Async
        User-Password =
X<187><216><213>g<154><218><222><218><131><156>v<148>p2<198>

Thu Apr 20 12:46:40 2017: DEBUG: Handling request with Handler
'Realm=DEFAULT', Identifier ''
Thu Apr 20 12:46:40 2017: DEBUG:  Deleting session for mikem,
203.63.154.1, 1234
Thu Apr 20 12:46:40 2017: DEBUG: Handling with Radius::AuthSQL:
Thu Apr 20 12:46:40 2017: DEBUG: Handling with Radius::AuthSQL:
Thu Apr 20 12:46:40 2017: DEBUG: Connecting to
'dbi:mysql:dbname:127.0.0.1:3310'

Total list of lsof is 1064 line and the open "sock" count has reached 1019.
# lsof -p `pgrep radiusd` | grep -c sock
1019

The main process has reached "Too many open files" and is unable to
serve anymore.

# strace -p `pgrep radiusd`
stat64("/var/log/radius", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/var/log/radius/logfile", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE,
0666) = -1 EMFILE (Too many open files)
write(2, "Log could not append 'Thu Apr 20"..., 652) = 652
gettimeofday({1492682926, 331375}, NULL) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2161, ...}) = 0
stat64("/var/log/radius", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/var/log/radius/logfile", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE,
0666) = -1 EMFILE (Too many open files)
write(2, "Log could not append 'Thu Apr 20"..., 231) = 231
gettimeofday({1492682926, 332748}, NULL) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2161, ...}) = 0
stat64("/var/log/radius", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0

It took me an hour and a half to reproduce the issue on entirely new
environment to Radiator without any changes on OS side.
The problem exists with debian and centos, so it's not a distro issue.

Please share the details about the try to reproduce the issue when the
problem didn't occur. I really would like to know, where's the difference.

kind regards,

-- 
rauno

On 20.04.2017 11:10, Heikki Vatiainen wrote:
> On 14.4.2017 13.56, Rauno Tuul wrote:
> 
>> The issue occurs, when attempting to establish a connection. For example
>> the host is totally down or firewalled.
>> Is the TCP connection has been established, then there's no problem.
> 
> Thanks for confirming this. I talked with one of our people about this
> and he said the problem could not be reproduced. What he used was not a
> Debian 8. Also, web search did not turn up similar problems, so this
> looks like something that is not a universal problem.
> 
> In other words, no real update on this yet.
> 
>> Easy way to reproduce the issue, is to drop mysql traffic with iptables.
>> Example setup:
>>          <AuthBy SQL>
>>                  DBSource                dbi:mysql:dbname:127.0.0.1:3306
>>                  Timeout                 1
>>                  FailureBackoffTime      1
>> Then drop the traffic with 'iptables -A INPUT -i lo -p tcp -m tcp
>> --dport 3306 -j DROP' and hit the instance with radpwtst every second.
>> If the debian systemd set the open file limit to 1024, then the instance
>> will collapse within 20 minutes.
> 
> Thanks for this additional information. I have made a note about this.
> 
> Thanks,
> Heikki
>