[RADIATOR] Possible internal race condition leading to 'duplicate' packets and 'unknown reply' messages?

Stefan Paetow Stefan.Paetow at jisc.ac.uk
Tue Feb 18 22:49:36 UTC 2020


Hi, 

Apologies for the long introduction to this problem, I'm new to Radiator, so I'm just trying to make sure I explain things correctly.

So us (running the UK eduroam infrastructure) and our equivalent in another country are trying to resolve some spurious 'Unknown reply received in AuthRADSEC for request ...' messages they and us see in our Radiator logs. For clarity, we (Jisc) are running the latest (4.24.1) on both our servers, both accept both RADIUS and RadSec traffic. At this point it's not entirely clear whether this is limited to RadSec or not, but we're just working with the RadSec traffic for now. The other side has raised the issue with one of the servers, so we’ll stick to that for the moment.

We have 24 Radiator processes running on this server. We have a NoReplyHook that will send an Access-Reject packet back to an originating ‘downstream’ when the home server for an Access-Request stops responding to packets in time. The code in the hook has (probably) not changed for several years, so it's assumed it's been fine. So we get a question out of the blue asking whether we filter out the Proxy-State (which would be very unusual given it's fundamental to making RADIUS proxying work). 

So I've switched on the GlobalMessageLog for RadSec (with a hook for the other end's IP) and turns out in our log that our NoReplyHook code fires as appropriate, but because it doesn't include a Proxy-State, it then cannot be matched to the appropriate request packet on the downstream (and they see a 'AuthRADSEC Could not get extended identifier: No Proxy-State attribute found in reply' message, which prompted the question. In *our* log, we then see *another* RADIUS packet, with the same Identifier as the packet created by the NoReplyHook, being fired back to the downstream, this time *with* a Proxy-State that matches the incoming packet. 

So, we seem to first cause this in the other end:

<date time etc> AuthRADSEC Could not get extended identifier: No Proxy-State attribute found in reply from ...
<date time etc> Unknown reply received in AuthRADSEC for request ...

Our side looks like this:

1581703432.398767 Fri Feb 14 18:03:52 2020
RADIUS message from <downstream> port 48376 to <our server> port 2083 ....
Packet length = 516
Code:       Access-Request
Identifier: 209
Authentic:  <215>u<248>-n<185>2<179>3<15><227><243>jI<127><249>
Attributes:
	User-Name = "<redacted>"
	Chargeable-User-Identity = 
	Location-Capable = <0><0><0><1>
	Calling-Station-Id = "18-81-0e-58-3c-7a"
	Called-Station-Id = "bc-26-c7-be-49-10:eduroam"
	NAS-Port = <0><0><0><8>
	cisco-avpair = "audit-session-id=0515c70a0002c48dd4e0465e"
	Acct-Session-Id = "5e46e0d3/18:81:0e:58:3c:7a/196306"
	NAS-IP-Address = <10><199><21><5>
	NAS-Identifier = "HCVDL_WLC5K1"
	Airespace-WLAN-Id = <0><0><0><3>
	Service-Type = <0><0><0><2>
	Framed-MTU = <0><0><5><20>
	NAS-Port-Type = <0><0><0><19>
	Tunnel-Type = <0><0><0><13>
	Tunnel-Medium-Type = <0><0><0><6>
	Tunnel-Private-Group-ID = 921
	EAP-Message = <trimmed>
	State = <trimmed>
	Message-Authenticator = <242><183><209>b<235>_Q<6><171><241><180><208><246><172>g<15>
	Operator-Name = "<redacted>"
	Proxy-State = OSC-Extended-Id=26321

1581703432.498707 Fri Feb 14 18:03:52 2020
RADIUS message from <our server> port 2083 to <downstream> port 48376 ....
Packet length = 131
Code:       Access-Reject
Identifier: 209
Authentic:  ...
Attributes:
	Cisco-Abort-Cause = "Home organisation RADIUS server(s) not responding, so eduroam UK rejected the request"
	Message-Authenticator = ...

1581703432.593273 Fri Feb 14 18:03:52 2020
RADIUS message from <our server> port 2083 to <downstream> port 48376 ....
Packet length = 170
Code:       Access-Reject
Identifier: 209
Authentic:  ...
Attributes:
	Cisco-Abort-Cause = "Home organisation RADIUS server(s) not responding, so eduroam UK rejected the request"
	Message-Authenticator = ...
	Reply-Message = "Request Denied"
	Proxy-State = OSC-Extended-Id=26321

You can see here that there's a duplicate Access-Reject packet. The Proxy-State of the second one matches the Proxy-State of the incoming Access-Request, the Identifier of all three packets is identical.

Is there a race condition thing going on that potentially causes a 'mis-fire' of sorts? To make this more interesting, I then adjusted the NoReplyReject hook code to include the Proxy-State and a Reply-Message (to match the above 'second' packet's contents), but the Reply-Message I send in the hook matches the 'Cisco-Abort-Cause' attribute. So the new packet looks like this:

Packet length = 240
Code:       Access-Reject
Identifier: 218
Authentic:  ...
Attributes:
	Cisco-Abort-Cause = "Home organisation RADIUS server(s) not responding, so eduroam UK rejected the request"
	Proxy-State = ...
	Reply-Message = "Home organisation RADIUS server(s) not responding, so eduroam UK rejected the request"
	Message-Authenticator = ...

This has now resolved the spurious 'No Proxy-State attribute found' messages on the downstream end, but then I still see the occasional duplicate like this (and the downstream sees the 'Unknown reply from...' message for the second packet):

1582046688.805439 Tue Feb 18 17:24:48 2020
RADIUS message from <our server> port 2083 to <downstream> port 35322 ....
Packet length = 240
Code:       Access-Reject
Identifier: 130
Authentic:  <146>{Y<251><133>s Zi<140><22>@<218><21><198><172>
Attributes:
    Cisco-Abort-Cause = "Home organisation RADIUS server(s) not responding, so eduroam UK rejected the request"
    Proxy-State = OSC-Extended-Id=4738
    Reply-Message = "Home organisation RADIUS server(s) not responding, so eduroam UK rejected the request"
    Message-Authenticator = <196><245>^8<231>m<17>.<202>4<196>o<160>:<156>r

1582046688.900040 Tue Feb 18 17:24:48 2020
RADIUS message from <our server> port 2083 to <downstream> port 35322 ....
Packet length = 169
Code:       Access-Reject
Identifier: 130
Authentic:  <135><235>T<0>;<165>Klf<156><135>k<5>B<182><205>
Attributes:
    Cisco-Abort-Cause = "Home organisation RADIUS server(s) not responding, so eduroam UK rejected the request"
    Reply-Message = "Request Denied"
    Message-Authenticator = <255><163><7>QRS<26><181><245><156><251><11>S<8><202><251>
    Proxy-State = OSC-Extended-Id=4738

See? The Identifier is the same, the Proxy-State is the same, but the Reply-Message is different. This leads me to think that there's something happening here that shouldn't?

Given that these things are spurious from this downstream source, but both them and us see this from other sources as well (other eduroam providers elsewhere) which also run Radiator, I wonder... 

This information comes from a live server that handles hundreds of requests a second, so I don't know how we could run this in debug mode without a hit in performance.

Any suggestions are welcome?

With Kind Regards

Stefan Paetow
Federated Roaming Technical Specialist

t: +44 (0)1235 822 125
gpg: 0x3FCE5142
xmpp: stefanp at jabber.dev.ja.net
skype: stefan.paetow.janet

jisc.ac.uk

Jisc is a registered charity (number 1149740) and a company limited by guarantee which is registered in England under Company No. 5747339, VAT No. GB 197 0632 86. Jisc’s registered office is: One Castlepark, Tower Hill, Bristol, BS2 0JA. T 0203 697 5800.





More information about the radiator mailing list