What was more irritating was that the system was working most of the time, and, of course, there was no way to consistently reproduce the issue.
I will explain the issue here, hoping that people don't make the same mistake again.
Think about a single instance code unit which is receiving an event in order to process the incoming data (this was using the ComCom objects), and, in order to catch possible errors in the AL execution, a NAV Timer (Navision Timer) is used to reply successfully or with an error to the receiving event initiator (the COM component that started the conversation).
Normal flow (from the NAS perspective) was:
Sounds reasonable, and in most cases this would work great. However, people were experiencing mixed up answers especially when errors were to be reported.
Of course, my first thought was that the component connecting to the NAS was the guilty one, as, after all, it is a multi-thread, multi-connection component (it runs as a Windows Service), and getting all this threads straight is always a tricky business. However, after inspecting the components code, and correcting some issues, the problem was still there.
I decided it was time to get the hands really dirty and started making my own component which would simulate the system and start hammering the NAS with requests using multiple threads and inspecting the replies, and sure enough, especially when the NAS would error, the replies would contain a different ID than the original sending ID on the same virtual connection.
So, there must be something in the NAS that does not work correctly. More code inspection, but nothing showed up until.... Hey, wait a second... We are using a global variable to save data and then reusing this data later on when the NTimer event fires, and, since the NAS is single threaded, this should all work fine, right?
Well, of course the first thing that came into my mind was that there was a possibility that a new incoming event would be "queued" before the NTimer event would be scheduled, and this would mean that a new MessageReceived event would be triggered before the NTimer could even process the previous message. Bingo! That is the issue!
After further investigating the NAS code, it turns out that, although the previous scenario is possible, it actually does not queue sequentially. Incoming events are basically random when the host COM cannot handle the call. I could go very deeply here, but what happens is that, when the NAS COM event is called, the NAS actually checks to see if it is already handling another incoming event. If this is the case, it signals that it cannot process the incoming event (via a RPC_E_SERVERCALL_RETRYLATER) and COM basically waits "some time" before it retries the call. The problem is that this "some time" is not known and it changes, and that is when the trouble really starts.
Basically, what happened in our scenario was that 2 consecutive MessageReceived events will fire (before the first NTimer would) and obviously one GlobalVal would be lost and we would reply twice to a single event (the NTimers would fire, but with the same GlobalVal).
The cure was simple. When the MessageReceived event is fired, we process the information and reply to the event on the same connection. After all, ComCom objects do report back AL errors in XML format, so, in principle, there is no need to complicate matters.
So, after getting rid of the NTimer, and replying to the incoming request within the MessageReceived event, everything works as expected.
Jorge Alberto TorresSoftware Developer Engineer
These postings are provided "AS IS" with no warranties and confer no rights. You assume all risk for your use.