[RADIATOR] radiator Timeout handling

David Zych dmrz at illinois.edu
Wed Apr 6 14:22:43 CDT 2011


On 9/20/2010 12:00 PM, Michael <ringo at vianet.ca> wrote:
>>> I'm having a couple of issues with <AuthBy SQL>. Maybe it would be considered a bug, I'm not sure.
>>>
>>> 1. The Timeout handling.
>>> ------------------------
>>>
>>> From my testing, it appears that Radiator times out at this value, but then retries the SQL query a second time, resulting in another timeout.
>>>
>>> e.g. debug:
>>> Tue Sep 14 12:48:21 2010: DEBUG: Handling accounting with Radius::AuthSQL
>>> Tue Sep 14 12:48:21 2010: DEBUG: do query is: 'insert into `acct`<snip>
>>> Tue Sep 14 12:48:25 2010: ERR: do failed for 'insert into `acct`<snip>   SQL Timeout
>>> Tue Sep 14 12:48:29 2010: ERR: do failed for 'insert into `acct`<snip>   SQL Timeout
>>> Tue Sep 14 12:48:29 2010: DEBUG: AuthBy SQL result: IGNORE, Database failure
>>>
>>> Timeout is set to 4 seconds...
>>> So the query was executed at 12:48:21, the first ERR timed out 4 seconds later, Radiator silently retried, and another ERR timeout followed 4 seconds after that.  That's 8 seconds in total: it doubles the Timeout value.
>>>
>>> This is no good for me.  If I set my SQL Timeout to 4 seconds and my NAS timeout to 5 seconds, I expect Radiator to time out before my NAS retransmits.  Instead, my NAS retries after 5 seconds because Radiator hasn't responded, and Radiator, not having obeyed the Timeout, is still waiting out its 8 seconds.  The same accounting packet therefore enters Radiator again, causing another 8-second delay, and duplicate entries appear in the accounting log, since I'm also using AcctFailedLogFileName and the packet will eventually end up in the SQL table.
>>>
>>>
>>> 2. SQL Timeout issue #2.
>>> ------------------------
>>> Using the same debug example above: when the SQL query times out, Radiator doesn't seem to use the FailureBackoffTime value.  It only seems to use FailureBackoffTime when there is a connection failure, not a timeout.  So every query is still presented to the SQL server.  If the timeout is due to, say, a write lock, then when the lock releases all the queued insert statements are executed, sometimes creating up to 10 duplicate accounting entries.
>> Hello Michael -
>>
>> The behaviour you observe is in fact what the code does - the manual does not correctly describe this behaviour.
>>
>> The manual has been amended for the next release.
>>
>> Thanks for letting us know.
>>
>> regards
>>
>> Hugh
>>
> Can I suggest an option to disable this behavior? In my case, I would
> prefer Radiator to allow only one timeout and, when a timeout occurs,
> to respect the FailureBackoffTime. If it doesn't, Radiator creates a very
> undesirable situation in which it keeps retrying the timed-out SQL
> server for every packet. It basically bottlenecks my whole RADIUS system,
> since all Radiator servers connect to the same accounting MySQL server,
> and all NASes eventually mark each RADIUS server "RADIUS_DEAD", after
> which all authentication seems to stop.
>
> Mike

I just ran into this same problem; my DB got into a state where
DBI->connect was working fine but actual INSERTs were timing out, and
the non-observance of FailureBackoffTime in this situation resulted in
both of my RADIUS servers being effectively stalled for 10 minutes (one
INSERT Timeout at a time) until the DB issue was resolved.

I would like to second Michael's request for a way to alter this behavior.

It appears that right now SqlDb.pm has a single $self->{backoff_until}
timer that applies collectively to all configured DBSources (i.e. it is
set only when all DBSources fail DBI->connect in sequence, and when set
it causes none of them to be tried again in reconnect() until the set
time).  Would it perhaps make more sense that:

1. each configured DBSource gets its own individual backoff_until timer
that is set when that DBSource fails DBI->connect, and when set causes
that DBSource to be skipped in reconnect() until the set time.

2. individual statement timeouts, such as the one in SqlDb::do(), could
also set the backoff_until timer for the individual DBSource currently
in use.  If this is judged not to be desirable in the general case, it
could be controlled by a separate configuration parameter
("TimeoutBackoffTime", perhaps?).
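To make the proposal concrete, here is a minimal sketch of the two points above: per-DBSource backoff timers that are armed by either a connect failure or a statement timeout, and a reconnect() that skips sources still backing off. This is illustrative Python, not Radiator's actual Perl; the class and method names (DbSource, mark_failed, the failure_backoff parameter standing in for FailureBackoffTime) are my own assumptions, not Radiator's API.

```python
import time

class DbSource:
    """Sketch of one configured DBSource with its own backoff timer."""

    def __init__(self, dsn, failure_backoff=600):
        self.dsn = dsn
        self.failure_backoff = failure_backoff  # stand-in for FailureBackoffTime (seconds)
        self.backoff_until = 0                  # per-source timer, not shared globally

    def available(self, now=None):
        """True if this source may be tried (backoff has expired)."""
        now = time.time() if now is None else now
        return now >= self.backoff_until

    def mark_failed(self, now=None):
        # Point 1: called on a DBI->connect failure.
        # Point 2: could equally be called on a statement timeout in do().
        now = time.time() if now is None else now
        self.backoff_until = now + self.failure_backoff

def reconnect(sources, now=None):
    """Return the first DBSource not currently backing off, else None."""
    for source in sources:
        if source.available(now):
            return source
    return None
```

With this shape, one stalled DBSource no longer silences the others: only the source that actually failed is skipped until its own timer expires, and whether statement timeouts arm the timer could be gated by a separate option such as the "TimeoutBackoffTime" suggested above.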

I'm half tempted to try to implement this myself, but I'm not confident
that I fully understand all the potential repercussions for other parts
of Radiator, and I know I'm not in a good position to test it thoroughly.

Thanks,
David
