[RADIATOR] AuthRADIUS MaxFailedGraceTime -1?
David Zych
dmrz at illinois.edu
Thu Oct 9 14:41:46 CDT 2014
I noticed something today that seems very strange to me, and I'm curious
whether it's a bug or just an expectations mismatch.
Given this:
# Helper AuthBy to proxy stateless inner authentication requests to
# worker instances using simple round robin.
<AuthBy ROUNDROBIN>
Identifier workerproxy
include %D/private/localhost.secret
RetryTimeout 3
Retries 0
MaxTargetHosts 2
# don't keep piling real requests onto a stalled worker, but do send
# periodic Status-Server requests so we can continue to monitor its
# condition (otherwise we'll just blindly assume it's working again
# once there are no more proxied requests left to time out).
FailureBackoffTime 2
KeepaliveTimeout 2
# enable workers to identify and proactively IGNORE any proxied
# request old enough that this main instance has already given up on
# it (note: same system, so no worries about clock skew). This
# ensures that we won't exacerbate an external bottleneck scenario
# by wasting our limited resources processing stale requests.
StripFromRequest X-Proxy-Timestamp,X-Proxy-Timeout
AddToRequest X-Proxy-Timestamp=%t,X-Proxy-Timeout=3
ReplyTimeoutHook sub { CITES::proxy_updateTimestamp(@_) }
<Host 127.0.0.1>
AuthPort %{GlobalVar:radius.authworker1.port}
</Host>
<Host 127.0.0.1>
AuthPort %{GlobalVar:radius.authworker2.port}
</Host>
<Host 127.0.0.1>
AuthPort %{GlobalVar:radius.authworker3.port}
</Host>
<Host 127.0.0.1>
AuthPort %{GlobalVar:radius.authworker4.port}
</Host>
</AuthBy>
if one of the Hosts has not been responding for a while (e.g. because I
killed it for testing purposes), my expectation is that every 2 seconds
another Status-Server request (the one that was sent 3 seconds ago)
should time out and cause us to back off from that Host for an
additional 2 seconds.
What actually happens (in Radiator 4.13 patch set $Revision: 1.1690 $)
is this:
2014-10-09T13:31:55.121701-05:00 radius-dev1.cites.illinois.edu radius.auth1[15050]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request (). Now have 2 consecutive failures over 0 seconds. Backing off for 2 seconds
2014-10-09T13:31:57.072759-05:00 radius-dev1.cites.illinois.edu radius.auth1[15050]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request ()
2014-10-09T13:31:59.090082-05:00 radius-dev1.cites.illinois.edu radius.auth1[15050]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request (). Now have 2 consecutive failures over 0 seconds. Backing off for 2 seconds
2014-10-09T13:32:01.069226-05:00 radius-dev1.cites.illinois.edu radius.auth1[15050]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request ()
2014-10-09T13:32:03.103035-05:00 radius-dev1.cites.illinois.edu radius.auth1[15050]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request (). Now have 2 consecutive failures over 0 seconds. Backing off for 2 seconds
2014-10-09T13:32:05.119978-05:00 radius-dev1.cites.illinois.edu radius.auth1[15050]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request ()
2014-10-09T13:32:07.926511-05:00 radius-dev1.cites.illinois.edu radius.auth1[15050]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request (). Now have 2 consecutive failures over 0 seconds. Backing off for 2 seconds
2014-10-09T13:32:09.943017-05:00 radius-dev1.cites.illinois.edu radius.auth1[15050]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request ()
Note that only every _other_ failed request causes a fresh backoff.
The reason for the current behavior is the strict less-than comparison for
start_failure_grace_time in AuthRADIUS::failed (line 832):
elsif ($host->{FailureBackoffTime}
&& ++$host->{failedRequests} >= $host->{MaxFailedRequests}
&& $host->{start_failure_grace_time} < time - $host->{MaxFailedGraceTime})
{
$msg .= ". Now have $host->{failedRequests} consecutive failures over $host->{MaxFailedGraceTime} seconds. Backing off for $host->{FailureBackoffTime} seconds";
$host->{failedRequests} = 0; # For when we restart transmissions
$host->{backoff_until} = $host->{start_failure_grace_time} = time + $host->{FailureBackoffTime};
$host->{is_failed} = 1;
}
The timeout at :55 sets both backoff_until and start_failure_grace_time to :57.
But at :57, the less-than still returns false, so no new backoff is processed.
This strikes me as particularly counterintuitive given that the
time >= $self->{backoff_until} comparison in isWorking *does* permit
real requests to be proxied to the Host again starting at :57 (which
will of course fail, since the Host is still down).
If I add "MaxFailedGraceTime -1" to my AuthBy, I get the behavior I
wanted in the first place:
2014-10-09T14:02:25.203953-05:00 radius-dev1.cites.illinois.edu radius.auth1[24270]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request (). Now have 1 consecutive failures over -1 seconds. Backing off for 2 seconds
2014-10-09T14:02:27.031574-05:00 radius-dev1.cites.illinois.edu radius.auth1[24270]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request (). Now have 1 consecutive failures over -1 seconds. Backing off for 2 seconds
2014-10-09T14:02:29.212947-05:00 radius-dev1.cites.illinois.edu radius.auth1[24270]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request (). Now have 1 consecutive failures over -1 seconds. Backing off for 2 seconds
2014-10-09T14:02:31.008883-05:00 radius-dev1.cites.illinois.edu radius.auth1[24270]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request (). Now have 1 consecutive failures over -1 seconds. Backing off for 2 seconds
2014-10-09T14:02:33.179497-05:00 radius-dev1.cites.illinois.edu radius.auth1[24270]: AuthRADIUS workerproxy: No reply after 3 seconds and 0 retransmissions to 127.0.0.1:10011 for Status-Server request (). Now have 1 consecutive failures over -1 seconds. Backing off for 2 seconds
but it doesn't seem to me like I should have to do that; in my opinion
the comparison in AuthRADIUS::failed should be <= rather than < so that
the default MaxFailedGraceTime 0 really means 0 (right now I feel like 0
actually means 1).
Thanks,
David
More information about the radiator
mailing list