The Problem
I’ve had this issue a couple of times: when you place a call with your MOC client, you hear a few slow beeps before the call actually starts ringing. In general, this isn’t a showstopper. When you are working remotely on public WiFi, I understand that it might take a second or two to set up the call. But recently I’ve seen this happen for internal calls, and the delay has been close to 5 seconds. Worse yet, inbound calls were having the same delay; callers from the outside just heard dead silence while the call was being set up. And even worse: calls to response groups were timing out due to the long delay in the call ringing PLUS a delay between when an agent answered and when the call was connected. All told, there was an unnecessary delay of about 9 seconds.
I was really having a hard time figuring out what the problem was. The customer and I were looking at SIP Stack traces and Wireshark captures, and the strangest things were happening. We saw the call get offered to the Mediation server, and the Mediation server would reply back with a SIP 183 trying. Then dead air for a few seconds… absolutely no message. Then the Front End and Mediation would exchange a SIP message or two. Then more dead air. Then finally, about 6 seconds later, we would get a SIP Ringing message and the call would go through.
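Spotting those silent stretches by eye gets tedious in a long trace. Here is a minimal sketch of how the gaps could be pulled out automatically. It assumes you have already extracted (timestamp, message) pairs from the trace or capture yourself; the `find_gaps` helper, the timestamps, and the message labels below are all made up for illustration and are not taken from the actual logs.

```python
from datetime import datetime, timedelta


def find_gaps(events, threshold=timedelta(seconds=1)):
    """Return (previous_msg, next_msg, gap) for every pair of consecutive
    events that are separated by more than `threshold`."""
    gaps = []
    for (t_prev, msg_prev), (t_next, msg_next) in zip(events, events[1:]):
        gap = t_next - t_prev
        if gap > threshold:
            gaps.append((msg_prev, msg_next, gap))
    return gaps


# Hypothetical timeline, loosely shaped like the call described above.
t0 = datetime(2010, 1, 1, 9, 0, 0)
events = [
    (t0, "INVITE offered to Mediation"),
    (t0 + timedelta(milliseconds=50), "183 from Mediation"),
    (t0 + timedelta(seconds=3), "Front End / Mediation exchange"),
    (t0 + timedelta(seconds=6), "180 Ringing"),
]

for prev_msg, next_msg, gap in find_gaps(events):
    print(f"{gap.total_seconds():.2f}s of silence between {prev_msg!r} and {next_msg!r}")
```

Whatever was logged immediately before each reported gap is the first place to look for the request that is being waited on.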
I looked everywhere in the logs for signs of what was happening. No errors on Mediation or Front End. It really just seemed like OCS had suddenly developed an inexplicable delay on every call.
But I did spot something very odd in the Front End server log – it seemed unrelated at the time, but it turned out to be the key.
Both the Mediation server and the Front End are configured with a parameter for A/V Edge authentication: edge.company.com:5062. Fine and dandy. For every call that’s offered to the Mediation server, it starts up a stream with the Edge server, just in case you happen to be calling from the outside. It gets that stream working so that there’s no delay in the call if/when OCS figures out you are remote, and it does this even if you are placing the call from inside the network.
The Unexpected Breakthrough
So the Front End server sits and waits for the OK from the AV Edge Authentication before proceeding with the call. This generally takes 1/100th of a second or so. But this is exactly where the delay in our calls was occurring. And the SIP Stack log revealed something in that request for AV Authentication:
<credentialsResponse credentialsRequestID="5541212">
<credentials>
<username>AgGGJAHMDLkByxJYc6ZaSgyqvYbLBFrF69SYPK2Eoa0BBBBBBboUW+3f4R7GOrqoXxq1a5dABc123=</username>
<password>+5ljflkjdf$lkjkjdfnvfkdf=</password>
<duration>480</duration>
</credentials>
<mediaRelayList>
<mediaRelay>
<location>intranet</location>
<hostName>EDGESERVER02</hostName>
<udpPort>3478</udpPort>
<tcpPort>443</tcpPort>
</mediaRelay>
</mediaRelayList>
</credentialsResponse>
</response>
Now this may not look odd to you, but it was odd. Because EDGESERVER02 was an old edge server, not in production anymore. The box still existed, but it had its services shut down and was not referred to by any other OCS server. Whaaaaa??? Where did it get this? And why was it using the short name of the defunct edge server? It wasn’t even an FQDN – it was the NetBIOS name of a server not even in the AD domain. I was bewildered. OCS had apparently lost its mind, nostalgic for the days when it had 2 edge servers to pick from. I really expected it to show edge.company.com here in the Media Relay list… that would have made sense, as that was the FQDN of the edge server.
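If you want to check this without squinting at the log, a rough sketch like the following pulls the relay host names out of a credentials response and flags any that don’t match the name you expect. The `unexpected_relays` helper is hypothetical, the sample is a trimmed copy of the response above (credentials removed), and it assumes the XML carries no namespace, as in the excerpt.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above, with the credentials left out.
SAMPLE = """<credentialsResponse credentialsRequestID="5541212">
  <mediaRelayList>
    <mediaRelay>
      <location>intranet</location>
      <hostName>EDGESERVER02</hostName>
      <udpPort>3478</udpPort>
      <tcpPort>443</tcpPort>
    </mediaRelay>
  </mediaRelayList>
</credentialsResponse>"""


def unexpected_relays(xml_text, expected_host):
    """Return every <hostName> value that differs from expected_host
    (compared case-insensitively)."""
    root = ET.fromstring(xml_text)
    hosts = [h.text.strip() for h in root.iter("hostName") if h.text]
    return [h for h in hosts if h.lower() != expected_host.lower()]


print(unexpected_relays(SAMPLE, "edge.company.com"))
```

Anything this returns is a name the Edge is handing out that you did not intend, which is exactly what EDGESERVER02 was here.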
I didn’t mention this before, but we had recently been trying to load balance the edge servers. So we had created a new A record for edge.company.com and applied it to the internal interface of the edge servers, EDGESERVER01 and EDGESERVER02. I told the customer to create a new cert for that name and, just to be sure, add the two different edge server names / FQDNs as SANs. The SAN list looked like this:
- edge.company.com
- EDGESERVER01
- EDGESERVER02
- EDGESERVER01.company.com
- EDGESERVER02.company.com
So he minted that cert, applied it and… that’s what caused the problem.
I should point out that Elan Shudnow’s post on A/V negotiation (thanks Liam!) has a much more detailed view of what happens during Communicator calls. Too bad I didn’t see the part about the SAN messing up A/V calls while troubleshooting this issue! Although it’s interesting to note that the calls actually _do_ work; the SAN just adds delays when making/receiving calls.
The Awful Conclusion
Apparently, during A/V authentication, the edge server reads from its internal edge certificate and offers one of the names in either the subject or the SAN in the “contact me here for A/V auth” setup. Just at random. In my case, the edge server was telling clients to contact EDGESERVER02 – which was no longer in production; just because that name appeared in the SAN list. This really shouldn’t happen. The Edge should be offering the name that’s configured on the internal interface (edge.company.com), regardless of what the cert says. I’d love to see this change in OCS Wave 14. Way too late, I know. But this seems like a bad idea in the way the Edge works.
Anyway, I had the customer re-issue the cert without the SANs and of course this fixed everything. The Edge offered the right name for A/V auth & the calls went through without a delay.
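A pre-deployment sanity check along these lines would have caught it. This is only a sketch under assumptions: `extra_cert_names` is a hypothetical helper, you supply the subject and SAN list yourself (read off the cert by whatever means you prefer), and I’m assuming the cert’s subject was edge.company.com, which the notes above don’t actually state.

```python
def extra_cert_names(subject, sans, configured_fqdn):
    """Return the cert names (subject plus SANs) that differ from the FQDN
    configured on the Edge internal interface, in order, without duplicates."""
    target = configured_fqdn.lower()
    seen = set()
    extras = []
    for name in [subject, *sans]:
        key = name.lower()
        if key != target and key not in seen:
            seen.add(key)
            extras.append(name)
    return extras


# SAN list from the cert described above; the subject is an assumption.
sans = [
    "edge.company.com",
    "EDGESERVER01",
    "EDGESERVER02",
    "EDGESERVER01.company.com",
    "EDGESERVER02.company.com",
]

for name in extra_cert_names("edge.company.com", sans, "edge.company.com"):
    print(f"Edge could offer {name} for A/V auth instead of edge.company.com")
```

An empty result means the only name the Edge can offer from the cert is the one you configured.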
I learned:
- The Edge server is contacted for every single call, no matter if it’s an internal one.
- SANs are bad luck – especially SANs on the edge internal interface cert.
Do yourself a favor and don’t use SANs on the edge internal interface cert.
Can’t thank you enough for this post. We had only re-built our Edge server this week, so I knew it had to be something to do with that appearing on the scene again…
thanks for your time on this.