(RADIATOR) Multiple radius instances problem (possible remote consulting and professional services)

Thu Apr 26 16:28:01 CDT 2007

Hello,

A customer has the following configuration to authenticate more than 
20,000 concurrent users every day:

- Sun A: v240Z with 2 Gb RAM, Solaris 9 64bit, Radiator 3.14, Perl 5.8.7
- Sun B: v240Z with 2 Gb RAM, Solaris 9 64bit, Radiator 3.14, Perl 
5.8.7, MySQL Professional 5.0.17c
- Sun C: v880 with 4 Gb RAM, Solaris 9 64bit, Radiator 3.14, Perl 
5.8.7, MySQL Professional 5.0.17c
- Sun D: v440 with 16 Gb Ram, Solaris 9 64bit, Sun LDAP Server 5
- Sun E: v440 with 16 Gb Ram, Solaris 9 64bit, Sun LDAP Server 5

These RADIUS servers answer requests from:

- Around 35 dial-up RASes with more than 150 ports each
- 4 DSL RASes with more than 7,000 ports each

Each RADIUS server runs authentication and accounting instances. The 
authentication instances query the LDAP servers (normally only the 
first one; if it fails, they query the other) and also the MySQL 
servers (in the same fashion: the first one and, if it fails, the 
second).
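
To make the lookup order concrete, here is a minimal sketch of the 
"first, then second" behaviour described above. It is not Radiator 
code or configuration; BackendDown, lookup_with_failover, primary and 
secondary are hypothetical names used only for illustration, and the 
sketch assumes that only an unreachable backend triggers the fallback 
(a "user not found" answer from a reachable backend is final).

```python
from typing import Callable, Optional, Sequence


class BackendDown(Exception):
    """Hypothetical error: the backend could not be reached at all."""


def lookup_with_failover(
    username: str,
    backends: Sequence[Callable[[str], Optional[dict]]],
) -> Optional[dict]:
    """Ask each backend in order; move on only if one is unreachable."""
    for backend in backends:
        try:
            # A reachable backend's answer is final, even if it is None.
            return backend(username)
        except BackendDown:
            continue
    raise BackendDown("all backends unreachable")


def primary(username: str) -> Optional[dict]:
    # Stand-in for the first LDAP/MySQL server being down.
    raise BackendDown("primary unreachable")


def secondary(username: str) -> Optional[dict]:
    # Stand-in for the second server, which knows only one user.
    return {"uid": username} if username == "alice" else None


print(lookup_with_failover("alice", [primary, secondary]))
```

Note that in a setup like this every request pays the primary's 
timeout before the secondary is tried, which lengthens the time each 
request takes while the primary is down.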

Taking a Trace -1 and a LogMicroseconds from those instances I got:

Dial-up instances: 8 req/sec max. Each authentication request takes 
0.15 sec to complete. This means an instance can handle around 7 
req/sec before requests start waiting in the UDP queue.
DSL instances: 25 req/sec max. Each authentication request takes  sec 
to complete. This means
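
The dial-up arithmetic can be sketched as follows. This assumes each 
instance handles one request at a time, so its ceiling is 1 divided 
by the time per request; max_rate and backlog_after are hypothetical 
helper names. Only the dial-up figures (0.15 sec per request, 8 
req/sec peak) are used, because the DSL timing is not given.

```python
def max_rate(service_time_s: float) -> float:
    """Upper bound on req/sec for an instance serving one request at a time."""
    return 1.0 / service_time_s


def backlog_after(arrival_rate: float, service_time_s: float, seconds: float) -> float:
    """Requests still waiting after `seconds` of sustained arrivals.

    Returns 0 when the instance keeps up with the arrival rate.
    """
    surplus = arrival_rate - max_rate(service_time_s)
    return max(0.0, surplus * seconds)


# 0.15 s per request, as measured on the dial-up instances.
dialup_capacity = max_rate(0.15)
print(round(dialup_capacity, 2))          # 6.67

# One instance facing the full 8 req/sec peak for one minute.
print(round(backlog_after(8, 0.15, 60)))  # 80
```

Under that assumption, a single instance receiving the full 8 req/sec 
peak falls behind by about 1.3 requests every second, so the queue 
keeps growing for as long as the peak lasts.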

The auth and acct requests are distributed across the three servers like this:

Sun A: 2 auth instances for Dial-up and 2 acct instances for Dial-up.
Sun B: 2 auth instances for DSL and 2 acct instances for DSL.
Sun C: 2 auth instances for DSL and 2 acct instances for DSL.

Five days ago, the two dial-up auth instances on Sun A stalled. Not 
even radpwtst worked, but judging by the logfile the process seemed 
to be up and running (a lot of entries were being written every 
second, I mean a lot of Access-Accept and Access-Reject, so the whole 
process was working fine from Radiator's point of view). A few 
seconds after the Sun A instances stalled, the auth instances on 
Sun C stalled as well. The only way to recover from the disaster was 
to deploy a config file for those instances with a "bypass" that 
simply accepts every request.

On the three Radiator Sun servers the udp_recv_hiwat parameter is set 
to more than 8 million and the UDP buffer is set to the maximum, 64k 
(the Solaris limit). Also, while the instances are stalled, a lot of 
Access-Accepts never leave the boxes, and a lot of Access-Requests 
coming from the RASes never reach the Radiator application. It looks 
like a socket buffer overflow problem.
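
Since two different sizes are mentioned above (more than 8 million 
and 64k), one way to see what a process really gets is to request a 
receive buffer on a UDP socket and read the value back. The sketch 
below does only that, using the standard SO_RCVBUF socket option. It 
is a generic check, not something Radiator does or exposes, and 
effective_rcvbuf is a hypothetical helper name. The value reported is 
OS-dependent (some systems clamp the request, some report a different 
figure than was requested).

```python
import socket


def effective_rcvbuf(requested_bytes: int) -> int:
    """Request a UDP receive buffer size and return what the OS reports."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        try:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        except OSError:
            # Some systems reject an oversized request instead of clamping it;
            # in that case the default size is reported below.
            pass
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    finally:
        sock.close()


granted = effective_rcvbuf(8 * 1024 * 1024)
print(f"requested 8388608 bytes, OS reports {granted}")
```

If the reported value is far below what was requested, the per-socket 
queue is much smaller than the system-wide tunable suggests, and 
bursts would be dropped before the application ever sees them.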

How do I fix this?

Best Regards.

Sergio Gonzalez
