(RADIATOR) Input queue size

Guðbjörn S. Hreinsson gsh at centrum.is
Sat Nov 15 04:20:28 CST 2003


Cheers, and thanks for the responses. I decided to combine your 
replies and answer them in a single message, hope that's ok with 
you. 

Comments below.
-GSH

----- Original Message ----- 
From: "Claudio Lapidus" <c_lapidus at hotmail.com>
To: "Guðbjörn S. Hreinsson" <gsh at centrum.is>; <radiator at open.com.au>
Sent: Thursday, November 13, 2003 2:18 AM
Subject: Re: (RADIATOR) Input queue size


> Hello Guðbjörn
> 
> > this may be unrelated, but I am interested in any and all tuning
> > listmembers have done in the OS for Radiator performance. We
> > are running two radiator servers with one proxy radiator in front
> > and a separate sql machine and ldap machine.
> 
> From CL:
> Fine, but what OS do you use? It might be interesting to have a hardware
> summary too.
...
> From HI:
> Yes its useful to know the hardware/software platform and the various 
> versions of Perl, etc.

There are two radius servers; these are HP lp1000r machines with two 
Intel Pentium III 1 GHz processors (Coppermine, 256 KB cache), 1 GB 
of memory and one 36 GB SCSI disk.

These are running RedHat AS 2.1, Radiator 3.7.1 and perl 5.6.1. 

We also have a MySQL database running on a separate server, a Dell 
1650 with two Intel Xeon 2.6 GHz processors (512 KB cache) and 2 GB 
of memory. Same Linux version.

There is a separate RADIUS proxy machine, with the same hardware/OS 
as the MySQL server; it runs the same Radiator and Perl versions as 
the RADIUS application servers.

There is a separate LDAP server, running on an HP L2000/2 with two 
500 MHz PA-RISC 8500 CPUs, 1 GB of memory and four 36 GB disks. There 
is also a similar machine acting as a secondary, holding a replica of 
the database. 

These machines are rarely busy in terms of CPU, memory or disk. Even 
during peak loads, such as when a router has been reloaded, the systems 
are not that loaded.

This RADIUS setup serves xDSL, PSTN/ISDN, cable and hotspot NASes.

> [snip]
> 
> > Lengthening the udp queues seems to really have adverse effects on
> > this situation. We have not really tried shortening the queue which
> > might really have even more adverse effects, without testing though
> > I can't tell.
> 
> From CL:
> I can imagine that lengthening the queue only adds to the effect of the
> server processing "old" packets, i.e. packets whose original timer (at the
> NAS) has already expired. The root problem is the mismatch between the speed
> of the NAS sending packets and the server processing them. Probably is worth
> trying to increase the timeout setting at the NAS, at least to diminish
> retransmissions (but beware of total authentication time then). A quicker
> failover to a less loaded secondary might help too.

Well, this only seems to happen at server reloads or similar problem 
times. Throwing more hardware at the setup is one way out; we would 
also like to investigate improving the setup itself. We think that if 
requests were processed more in parallel (not necessarily threaded or 
multithreaded) this setup could handle a lot more requests. This 
applies to both SQL and LDAP requests. We also think the choice of 
UDP-only transport for RADIUS was not wise. For high loads (many 
packets) TCP is a much better suited protocol: you also get timing 
and can throw away old packets immediately, instead of processing 
them and returning replies that the NAS then throws away... But 
that's nothing we can solve here.

Timeouts are 5 seconds at the NASes; each one retries 3 times, waiting 
5 seconds each time. We don't think changing this (much) is wise.

...
> From HI:
> Claudio is correct, the usual cause of problems of this sort is the 
> backend delay associated with querying the LDAP and/or SQL database. It 
> is very helpful to look at a trace 4 debug with "LogMicroseconds" 
> turned on (requires Time-HiRes from CPAN). This will show exactly how 
> much time is being spent waiting for the queries to complete.

On average, LDAP requests take 35 ms, SQL requests take 13 ms and the 
proxy delay is about 12 ms; altogether it takes about 120 ms on average 
to process a request (from the NAS sending it to receiving the reply). 
We've tuned this quite a bit and it seems to be as good as it's going 
to get.
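
Hugh's suggestion above boils down to a couple of lines in radius.cfg; 
a minimal sketch (the log directory below is just an example path):

    # Full debugging with per-packet timing. Trace 4 logs every request
    # and reply; LogMicroseconds adds microsecond timestamps to each log
    # line (needs Time::HiRes from CPAN), so the time spent waiting on
    # SQL/LDAP queries can be read straight out of the log.
    LogDir          /var/log/radiator      # example path
    Trace           4
    LogMicroseconds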
 
> And you are correct in your observation that increasing the queue size 
> can adversely affect performance due to the increased number of retry 
> requests that build up in the queue.

> > To counter this we have configured multiple instances of radiators
> > for authentication&authorization and accounting and instances for
> > separate NAS's or NAS groups. This in effect simulates having a
> > threaded radiator to reduce the effect of this sequential processing.
> 
> From CL:
> OK, but are you sure that the bottleneck is at the Radiator level or
> might it be at the LDAP server? In the latter case it probably won't be of
> much help anyway.

LDAP requests take a fixed time (35 ms). It's mostly due to the search 
filter we use; if we stripped everything but the uid we would probably 
get down to 5 ms or less. The LDAP server also does not really care how 
many requests you throw at it; all requests get answered in pretty much 
the same 35 ms. It's not multithreaded, by the way.
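
To illustrate the point about the filter: a stripped-down AuthBy LDAP2 
clause matching on uid alone would look roughly like the sketch below. 
The host, BaseDN and attribute names are made up, our real SearchFilter 
matches considerably more than this (which is where most of the 35 ms 
goes), and %1 here stands for the user name as in Radiator's default 
(%0=%1) filter.

    <AuthBy LDAP2>
            Host            ldap.example.com        # made-up host
            BaseDN          ou=subscribers,o=example
            UsernameAttr    uid
            PasswordAttr    userPassword
            # Matching on uid only; a production filter with several
            # extra attribute tests accounts for most of the delay.
            SearchFilter    (uid=%1)
    </AuthBy>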
...
> From HI:
> Correct again. We have observed these problems too, when parallel 
> requests can also slow things down.

Well, there's not much more we can do about SQL or LDAP performance, 
the network, etc. We only observe that each authentication request 
takes a certain time and is handled in sequence. If one request for 
some reason takes longer (this can happen...) the whole queue waits, 
i.e. sequential or linear processing. Add to this that the packets in 
the queue may be old, but you can't really tell whether they are old 
(no timestamps), and you can end up in situations where things are 
really not going well. When a NAS with 3000 users on it is 
power-cycled, you simply have more incoming packets (old and 
retransmitted) than you can handle.

The only solution in that case is to drop the UDP queue and restart 
Radiator. We have a script that detects queue buildup and restarts 
Radiator... 

> BTW - this is one of the strong arguments against a multi-threaded 
> server, which may not help at all in some situations.
> 
> In general it is easier in the first instance to do what you have done 
> with multiple instances and a front end load balancer.
> 
> Just out of interest the largest Radiator setup we are familiar with is 
> using this architecture, with a load balancer feeding 6 Radiator hosts, 
> each one with an authentication and an accounting instance. The backend 
> is a *very* fast Oracle database server and the overall throughput has 
> been tested to over 1200 radius requests per second.

Maybe I'm just banging my head... but we add parallelism to this setup 
by using multiple instances bound to separate ports for different NASes 
and NAS groups. Increasing the SQL/LDAP processing speed would help, 
but not solve this. I don't think there is anything wrong with 
multithreading as such, but from reading the Perl discussion lists I 
don't think threading or multi-threading will make it into the Perl 5 
line, perhaps Perl 6. For processing packets in parallel I don't think 
threading is really needed, though? It might be more optimal, but it's 
also more complicated. 
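
For the curious, the per-instance split is nothing fancy: each NAS 
group gets its own config file with its own port pair, along the lines 
of the sketch below (file names and port numbers are just examples), 
and each one is started as its own radiusd process pointed at its 
config file (radiusd -config_file radius-dsl.cfg, and so on).

    # radius-dsl.cfg -- instance handling the xDSL NAS group
    AuthPort        1645
    AcctPort        1646

    # radius-dial.cfg -- instance handling the PSTN/ISDN NAS group
    AuthPort        1647
    AcctPort        1648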

> > This has not seemed to be related to CPU load or network performance,
> > we have looked at these in detail.
> 
> From CL:
> No, it's probably more I/O bound, (disk, I mean).
...
> From HI:
> I would agree - again a trace 4 debug with LogMicroseconds will show us 
> exactly what is happening.

We have not observed this...

> > If anyone has input on this issue or OS tuning for Radiator I'd love
> > to hear about it. Hope you understand my attempt to explain the above
> > scenario. Basically we have a pretty stable environment today, but
> > perhaps overly complex to manage because of the multiple instances.
> 
> From CL:
> Back to my original question then, I'm struggling to measure the effective
> length of the input queue in Solaris. Linux's netstat shows it readily, and
> I remember Tru64 doing the same. But Solaris' netstat lacks this one,
> apparently. I'll have to continue my quest...

You can do something like "netstat -i -n -f inet -P udp 1" but it's not what 
you want. 

> > Hugh, is a "threaded" ldap handler on the horizon? Is this perl or
> > radiator related?
> 
> From CL:
> From my own corner, I wish it were possible to have more than one
> established connection with the SQL backend, so as to paralellize requests
> to a certain degree. But yes, I suppose that means multithreading, and AFAIK
> that's not possible under perl 5.6 nor 5.8 I think. Perhaps Perl 6 would do
> it?
...
> From HI:
> This topic comes up from time to time and the fundamental problem at 
> the moment is that Perl itself does not currently have "production 
> quality" threading support. This being the case, we have not pursued it 
> actively. And note my previous comments about whether or not this would 
> be a "good thing" in any case.
...
> From FD:
> It's really not that hard. You run a number of Radiator instances, with each
> one having its own connection to the LDAP, SQL, or whatever backend. Then
> you front end those with an instance or two of Radiator running AuthBy
> ROUNDROBIN or AuthBy LOADBALANCE to distribute the requests among them.
> 
> You can process quite a lot of requests simultaneously this way. If your
> current server is not responding fast enough but the CPU utilization is not
> maxed out you are probably just hitting the ceiling on how many requests a
> single instance can process at a time. Start up some more processes on the
> box and use all those processor cycles that you paid for.
...
> From HI:
> As mentioned above, the easiest way to do this currently is with a load 
> balancer (you could use the AuthBy ROUNDROBIN, VOLUMEBALANCE, 
> LOADBALANCE modules) and multiple instances of Radiator. Note that in 
> most cases, at least using one instance for authentication and another 
> for accounting is a good first step.
> 
> We will continue to monitor the Perl support for multi-threading too, 
> of course.

Yes, we do this now: one Radiator proxy (many instances) and two 
Radiator application servers (many instances each). Both the proxy and 
the application servers use different ports for authentication and 
accounting. 
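
For anyone building something similar, the front end essentially boils 
down to an AuthBy ROUNDROBIN (or LOADBALANCE) clause listing the 
back-end instances. A rough sketch, with made-up host names, ports and 
secret, assuming the <Host> sub-clause syntax:

    # Front-end proxy: spread requests over two back-end instances.
    <Handler>
            <AuthBy ROUNDROBIN>
                    Secret          mysecret        # example secret
                    <Host radius1.example.com>
                            AuthPort        1645
                            AcctPort        1646
                    </Host>
                    <Host radius2.example.com>
                            AuthPort        1645
                            AcctPort        1646
                    </Host>
            </AuthBy>
    </Handler>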

Oh, and yes, we may need/want more out of Radiator but we simply 
love it.

Rgds,
-GSH

===
Archive at http://www.open.com.au/archives/radiator/
Announcements on radiator-announce at open.com.au
To unsubscribe, email 'majordomo at open.com.au' with
'unsubscribe radiator' in the body of the message.

