(RADIATOR) Multiple radius instances problem (possible remote consulting and professional services)

Thu Apr 26 18:02:32 CDT 2007

Hello Hugh

My comments are inline below.

Thanks a lot for the help.

At 04:46 p.m. 26/04/2007, Hugh Irvine wrote:

>Hello Sergio -
>
>Thanks for the additional information.
>
>It is not clear to me why the dialup instances take so much longer
>than the DSL instances to do the authentication. You also don't show
>how long the accounting is taking.

SG: The dial-up instances take longer because the process involves 
both MySQL queries and LDAP binds in a pair of Perl hooks. In general, 
the dial-up authentication process goes like this:

- The RASes send the Access-Request.
- A hook on the handler for this request queries the RADONLINE table 
in the MySQL server for how many users of the same type as this 
request are on the RAS. There are two types of users: regular and 
by-demand. Unfortunately, the same PBX on each RAS answers for both 
types. The hook checks the number of ports for each service, and if 
the limit for the request's user type is exceeded, Radiator sends an 
Access-Reject.
- If the limit for the request's user type has not been reached, the 
hook checks whether the user is regular or by-demand. If by-demand, 
Radiator answers with an Access-Accept.
- If the request is from a regular user, the hook queries MySQL again 
to find out whether the user is a per-hour user or an unlimited user.
- If the user is unlimited, the hook then tries to find which branch 
of the LDAP server matches the username/password pair (this does an 
LDAP search across the whole LDAP server and a bind for each branch 
that matches the username). If one matches, Radiator sends the 
Access-Accept; if not, an Access-Reject.
- If the user is per-hour, the hook queries the Accounting table to 
see whether the user has enough time left to be connected. If so, 
Radiator sends an Access-Accept; if not, an Access-Reject.
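
As a rough illustration only, the decision flow above can be sketched 
as follows. This is Python, not the actual Perl hooks, and every name 
in it (decide, Lookups and its fields) is hypothetical. The callables 
stand in for the RADONLINE, Accounting and LDAP lookups. Treating 
"limit reached" as count >= limit and "enough time left" as more than 
zero seconds are assumptions as well.

```python
from dataclasses import dataclass
from typing import Callable

ACCEPT = "Access-Accept"
REJECT = "Access-Reject"


@dataclass
class Lookups:
    """Hypothetical stand-ins for the MySQL and LDAP calls made by the hooks."""
    online_count: Callable[[str, str], int]   # (ras, user_type) -> sessions in RADONLINE
    port_limit: Callable[[str, str], int]     # (ras, user_type) -> ports for that service
    is_per_hour: Callable[[str], bool]        # MySQL: per-hour (True) or unlimited (False)
    ldap_bind_ok: Callable[[str, str], bool]  # LDAP search plus bind on matching branches
    time_left: Callable[[str], int]           # seconds left, from the Accounting table


def decide(ras: str, user: str, password: str, user_type: str, db: Lookups) -> str:
    # Step 1: reject when the ports for this service on this RAS are used up
    # (assumption: "reached" means count >= limit).
    if db.online_count(ras, user_type) >= db.port_limit(ras, user_type):
        return REJECT
    # Step 2: by-demand users are accepted without further checks.
    if user_type == "by-demand":
        return ACCEPT
    # Step 3: regular users are either per-hour or unlimited.
    if db.is_per_hour(user):
        # Per-hour: needs remaining time (assumption: any positive value is enough).
        return ACCEPT if db.time_left(user) > 0 else REJECT
    # Unlimited: the LDAP search and bind decides.
    return ACCEPT if db.ldap_bind_ok(user, password) else REJECT
```

Laid out this way, the worst case for a single dial-up request is 
two MySQL queries plus either an LDAP search with binds or a third 
MySQL query, all done one after the other.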

>In any case, if the problem is slow LDAP and SQL databases you should
>address those issues first.

SG: Unfortunately, the method for deciding whether the user can be 
accepted cannot be changed. Worse, there is also another type of 
dial-up user, in another handler that invokes an AuthBy Proxy clause.

>I am guessing that there is some event like a DSL RAS rebooting that
>is causing a burst of authentication requests that swamp the
>authentication server(s).

SG: Unfortunately this is not the case. The DSL RASes just have many 
users at peak hours. The service is nationwide. We are talking 
about 60.000 users today and 20.000 concurrent connections. The 
hardware and software I described in my last email were sized 
for 250.000 users and around 90.000 concurrent connections.

>How many RADIUS requests per second are hitting the boxes?

SG: In total (DSL and dial-up), there must be around 60 req/sec at 
peak hours. Those requests are split across the instances I 
mentioned in my last email.

>BTW - the numbers you show for the SUN LDAP server are consistent
>with what I have observed at other sites - it doesn't seem to be able
>to process more than at the most 10 requests per second. This being
>the case, whenever you have more than 10 requests per second arriving
>in Radiator you will have a problem.
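
To put rough numbers on that point, here is a back-of-the-envelope 
sketch in Python. It assumes steady arrivals and strictly serial 
processing, which is a simplification. The function names are made 
up for the example. The 0.15 sec and 10 req/sec figures are the ones 
quoted in this thread, while the 12 req/sec arrival rate is only an 
example value.

```python
def serial_capacity(service_time: float, instances: int = 1) -> float:
    """Requests per second that serial instances can complete."""
    return instances / service_time


def backlog_after(seconds: float, arrival_rate: float, capacity: float) -> float:
    """Requests still waiting after `seconds` of steady arrivals."""
    return max(0.0, (arrival_rate - capacity) * seconds)


# One dial-up auth instance at 0.15 sec per request:
print(round(serial_capacity(0.15), 2))        # 6.67 req/sec

# A backend limited to 10 req/sec while 12 req/sec arrive
# (12 is a made-up example rate): the queue grows by 2 per second.
print(backlog_after(10, arrival_rate=12, capacity=10))
```

Any sustained arrival rate above the capacity makes the queue grow 
without bound, so the waiting requests end up in the UDP socket 
buffer until it fills.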

SG: Would it be advisable to have one auth and one acct instance for, 
let's say, every two or three dial-up RASes? Also, how can the high 
request rate from the DSL RASes be handled? The main goal is to 
optimize the configuration for those three Sun servers.

>There may also be a problem with inserting the accounting data into
>the MySQL database, but you have not provided any information on that.

SG: As we discussed in October last year, the accounting records run 
to around 8 million per month. Each insertion takes the same amount 
of time as in October, around 2 hundredths of a second, and the hit 
rate is at most 8 req/sec.
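
As a quick sanity check on those accounting figures, the arithmetic 
can be worked through like this. It is only a sketch: the 30-day 
month and the assumption that inserts run one at a time are mine, and 
the variable names are made up for the example.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumes a 30-day month

records_per_month = 8_000_000        # accounting records per month
insert_time = 0.02                   # seconds per insert
peak_rate = 8                        # accounting requests per second at peak

avg_rate = records_per_month / SECONDS_PER_MONTH   # average inserts per second
max_serial_rate = 1 / insert_time                  # inserts/sec if done one at a time
peak_utilisation = peak_rate * insert_time         # fraction of time spent inserting

print(f"average rate:     {avg_rate:.1f} req/sec")
print(f"serial capacity:  {max_serial_rate:.0f} req/sec")
print(f"peak utilisation: {peak_utilisation:.0%}")
```

Under these assumptions the average is about 3 req/sec, and a peak of 
8 req/sec uses roughly 16% of what a single serial writer could do, 
so the accounting inserts on their own do not look like the 
bottleneck.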

>regards
>
>Hugh
>
>
>
>On 27 Apr 2007, at 07:28, Sergio Gonzalez wrote:
>
>>Hello,
>>
>>A customer has the following configuration to authenticate more
>>than 20.000 concurrent users every day:
>>
>>- Sun A: v240Z with 2 GB RAM, Solaris 9 64bit, Radiator 3.14, Perl
>>5.8.7
>>- Sun B: v240Z with 2 GB RAM, Solaris 9 64bit, Radiator 3.14, Perl
>>5.8.7, MySQL Professional 5.0.17c
>>- Sun C: v880 with 4 GB RAM, Solaris 9 64bit, Radiator 3.14, Perl
>>5.8.7, MySQL Professional 5.0.17c
>>- Sun D: v440 with 16 GB RAM, Solaris 9 64bit, Sun LDAP Server 5
>>- Sun E: v440 with 16 GB RAM, Solaris 9 64bit, Sun LDAP Server 5
>>
>>Those radius servers answer requests from:
>>
>>- Around 35 dial-up RASes with more than 150 ports each
>>- 4 DSL RASes with more than 7.000 ports each
>>
>>Each RADIUS server has authentication and accounting instances. The
>>authentication instances query the LDAP server (in fact only one,
>>but if the first fails, they query the other) and also the MySQL
>>servers (in the same fashion as LDAP: the first, and if it fails,
>>the second).
>>
>>Running those instances with Trace -1 and LogMicroseconds, I got:
>>
>>Dial-up instances: 8 req/sec max. Each authentication request takes
>>0.15 sec to complete. This means around 7 req/sec before requests
>>start backing up in the UDP queue.
>>DSL instances: 25 req/sec max. Each authentication request takes
>>sec to complete. This means
>>
>>The auth and acct requests are distributed across the three servers
>>like this:
>>
>>Sun A: 2 auth instances for Dial-up and 2 acct instances for Dial-up.
>>Sun B: 2 auth instances for DSL and 2 acct instances for DSL.
>>Sun C: 2 auth instances for DSL and 2 acct instances for DSL.
>>
>>Five days ago, the two dial-up auth instances on Sun A stalled. Not
>>even radpwtst worked, but judging by the logfile, the process
>>seemed to be up and running (a lot of entries were written every
>>second, I mean a lot of Access-Accepts and Access-Rejects, so the
>>whole process was working fine from Radiator's point of view). A few
>>seconds after the Sun A instances stalled, the auth instances on
>>Sun C stalled as well. The only way to recover from the disaster
>>was to deploy a config file for those instances with a "bypass",
>>which simply accepts any request.
>>
>>
>>On the three Radiator Sun servers the udp_recv_hiwat parameter is
>>set to more than 8 million and the UDP buffer is set to the maximum,
>>64k (the Solaris limit). Also, when the instances stall, there
>>are a lot of Access-Accepts that never leave the boxes, and there
>>are also a lot of Access-Requests coming from the RASes that never
>>reach the Radiator application. It seems to be a socket buffer
>>overflow problem.
>>
>>How do I fix this?
>>
>>
>>Best Regards.
>>
>>Sergio Gonzalez
>>
>>
>>
>
>
>
>NB:
>
>Have you read the reference manual ("doc/ref.html")?
>Have you searched the mailing list archive
>(www.open.com.au/archives/radiator)?
>Have you had a quick look on Google (www.google.com)?
>Have you included a copy of your configuration file (no secrets),
>together with a trace 4 debug showing what is happening?
>Have you checked the RadiusExpert wiki:
>http://www.open.com.au/wiki/index.php/Main_Page
>
>--
>Radiator: the most portable, flexible and configurable RADIUS server
>anywhere. Available on *NIX, *BSD, Windows, MacOS X.
>Includes support for reliable RADIUS transport (RadSec),
>and DIAMETER translation agent.
>-
>Nets: internetwork inventory and management - graphical, extensible,
>flexible with hardware, software, platform and database independence.
>-
>CATool: Private Certificate Authority for Unix and Unix-like systems.
>

--
Archive at http://www.open.com.au/archives/radiator/
Announcements on radiator-announce at open.com.au
To unsubscribe, email 'majordomo at open.com.au' with
'unsubscribe radiator' in the body of the message.