Good Day All-

We’ve been running AuthByLOADBALANCE for some time now and have noticed that a message that never gets a response from the downstream hosts is retried indefinitely. Not only does the message stay around forever, but each failed attempt also increases the failure count for the target host, which makes the hosts more likely to be marked unavailable and causes delivery problems for other requests.

For example, a malformed request may be sent by an upstream client and handled by AuthByLOADBALANCE, where the target hosts simply do not respond to the proxied request because they don’t like it. handle_timeout() retries the request on the current host Retries times and then hands it off to failed(), which tracks MaxFailedRequests for the host and marks it unavailable if applicable. failed() then hands the request to forward(), which calls chooseHost() to find the next available host. The stock chooseHost() in AuthByRADIUS tracks whether the request has reached the end of the host list, but chooseHost() in AuthByLOADBALANCE always returns a host if one is available, and it may even be the same host as the last try if MaxFailedRequests has not been reached for that host. The end result is that the request is retried forever, incrementing the failure count for the downstream hosts each time and causing them to be marked unavailable.
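
To make the failure mode concrete, here is a rough standalone model of the loop described above. It is not Radiator code: simulate_unbounded_retry() and its arguments are hypothetical, host selection is reduced to "any host still below MaxFailedRequests", and the model ignores hosts becoming available again later.

```perl
use strict;
use warnings;

# Hypothetical model of the behaviour described above; not Radiator code.
sub simulate_unbounded_retry {
    my ($host_names, $max_failed, $max_rounds) = @_;
    my %failed = map { $_ => 0 } @$host_names;
    my $rounds = 0;
    while ($rounds < $max_rounds) {    # safety cap; the described loop has none
        # Stand-in for chooseHost(): any host still below MaxFailedRequests.
        my ($host) = grep { $failed{$_} < $max_failed } @$host_names;
        last unless defined $host;     # every host is now marked unavailable
        $failed{$host}++;              # Retries timed out, failed() counts it
        $rounds++;
    }
    return ($rounds, \%failed);
}

my ($rounds, $failed) = simulate_unbounded_retry([qw(hostA hostB)], 3, 100);
print "rounds=$rounds\n";
print "$_ failures=$failed->{$_}\n" for sort keys %$failed;
```

In this model a single unanswerable request is enough to push every host to its failure limit, which matches what we see in production.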

After some looking at the code I think I could override failed() to track the number of unique hosts to which a request has been forwarded with something like


and then add a couple of checks in chooseHost() that are similar to the original one-

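if (scalar(keys %{$fp->{retryHosts}}) < scalar(@{$self->{Hosts}})) {
    # retryHosts is a hash keyed by host; untried hosts remain
    foreach my $host (@{$self->{Hosts}}) {
        next if $fp->{retryHosts}{$host};   # already tried for this request
    }
}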

The end result would be that the request is tried Retries times on each host in turn, with the next-best candidate chosen by the volume algorithm each time, until all hosts have been tried, at which point the request fails. That may not be the optimal behavior, but it beats trying forever.
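
As a rough illustration of the idea, the following standalone sketch tracks which hosts a request has already failed on and stops once the list is exhausted. Everything in it is an assumption for illustration: the host hashes, the "outstanding" field standing in for the volume algorithm, and the helper names note_failed_host() and choose_untried_host() are not part of Radiator.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: hosts are plain hashes and $fp is the
# per-request forwarding context; neither is Radiator's real structure.
my @hosts = map { +{ name => $_, outstanding => 0 } } qw(hostA hostB hostC);

# Record that this request has been tried (and failed) on $host.
sub note_failed_host {
    my ($fp, $host) = @_;
    $fp->{retryHosts}{ $host->{name} } = 1;
}

# Return the least-loaded host not yet tried for this request,
# or undef once every host has been tried.
sub choose_untried_host {
    my ($fp, $hosts) = @_;
    return if scalar(keys %{ $fp->{retryHosts} }) >= scalar(@$hosts);
    my ($best) = sort { $a->{outstanding} <=> $b->{outstanding} }
                 grep { !$fp->{retryHosts}{ $_->{name} } } @$hosts;
    return $best;
}

# Simulate a request that no host ever answers.
sub simulate_dead_request {
    my ($hosts) = @_;
    my $fp = { retryHosts => {} };
    my @tried;
    while (defined(my $host = choose_untried_host($fp, $hosts))) {
        push @tried, $host->{name};
        note_failed_host($fp, $host);   # Retries exhausted on this host
    }
    return @tried;
}

print join(', ', simulate_dead_request(\@hosts)), "\n";
```

Each host appears exactly once in the output, so the request's lifetime is bounded by the number of hosts times Retries.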

Before doing that and bearing the burden of maintaining a custom AuthBy, I figured I’d send it to the list to see whether someone else has already solved this problem or whether Open Systems would be willing to revisit the AuthByLOADBALANCE logic. Perhaps the interpretation of Retries could change to mean the total number of times a request is retried, rather than a per-host count, so that a request has a finite lifetime? In that case chooseHost() could be called for each retry in handle_timeout() to increase the chances of success.
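
A minimal sketch of that alternative interpretation, again as a standalone model rather than Radiator code. send_with_total_retries() is a hypothetical name, the round-robin pick is only a stand-in for chooseHost(), and I am assuming Retries counts retransmissions after the initial send.

```perl
use strict;
use warnings;

# Hypothetical sketch: Retries is a total budget per request, and a
# host is re-chosen on every timeout. Names here are illustrative only.
sub send_with_total_retries {
    my ($hosts, $retries, $answers) = @_;
    my @attempts;
    # One initial transmission plus at most $retries retransmissions.
    for my $attempt (0 .. $retries) {
        # Stand-in for chooseHost(): simple round-robin across hosts.
        my $host = $hosts->[ $attempt % @$hosts ];
        push @attempts, $host;
        return (1, @attempts) if $answers->($host);   # got a reply
    }
    return (0, @attempts);   # budget exhausted, request fails for good
}

my ($ok, @attempts) = send_with_total_retries([qw(hostA hostB)], 3, sub { 0 });
print "ok=$ok attempts=@attempts\n";
```

With Retries set to 3 and no host answering, the request is sent at most four times in total and then fails, regardless of how many hosts are configured.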



