Walk-the-Plank Bugs

July 15, 2003*
Walter Oney

There's an entire class of bug that I would call "walk the plank" errors that drivers can have. Picture this: you've inherited an NT4 driver for a PC card from someone else. This driver's DPC routine needed to send an IRP_MJ_INTERNAL_DEVICE_CONTROL request to another driver to get some crucial piece of information. The original programmer, whose stock options vested during the dot-COM boom, is living large in Tahiti and is no longer available to help you port this driver to XP. He did know his stuff well enough, however, that his DPC logic was along these lines: (a) leave the device's interrupt inhibited, (b) queue a work item to finish the work, (c) in the work item callback, call IoBuildDeviceIoControlRequest and KeWaitForSingleObject to talk to that other driver, and then uninhibit the device's interrupt. He used ExInitializeWorkItem and ExQueueWorkItem to create and queue the work item because they were the only routines available in NT version 4. Not knowing any better just yet, you preserve this exact logic in the XP version of the driver.

Time passes, and your QA department goes to work on your driver. Just before your dream vacation to Omaha, you get a call that test systems are crashing with bug code 0xCE -- DRIVER_UNLOADED_WITHOUT_CANCELLING_PENDING_OPERATIONS -- some of the time when people remove your PC card without first going through the little safe removal tray icon thingy.¹ If you're lucky, you'll figure out without undue pain that the crash is occurring at an address that corresponds to your work item callback routine, but that the page containing that address is no longer valid. What has happened is this: pulling the card triggers a series of Plug and Play events that ends with the memory manager unmapping the virtual pages containing your driver code. Unfortunately, a work item you queued via ExQueueWorkItem has been advancing its way through the queue of some worker thread and comes to the fore only after your driver is unloaded. The system worker thread calls a subroutine that isn't there any more. In other words, your driver has forced the system to walk off the end of a plank straight into the briny deep.

In this article, I'll describe a half dozen bugs of this class and discuss how you can prevent them.

Work Items

The first "walk the plank" bug I'll discuss is the one summarized in the introduction to this article. It frequently happens that code running at DISPATCH_LEVEL needs to perform some operation that can only be done at PASSIVE_LEVEL. As you know, lowering the IRQL by calling KeLowerIrql breaks the synchronization assumptions of whoever called you and is, therefore, not allowed. You need to create a work item instead. Conceptually, a work item is a small data structure containing a pointer to a callback routine in your driver. The system queues work items for processing by one of several worker threads that the system creates. Each worker thread spends its life pulling items off its queue and invoking the associated callback routines.

The old way...

NT4 drivers, and WDM drivers designed solely for the Windows 98 and Millennium platforms, would use code like the following to create, queue, and process a work item:

typedef struct _RANDOM_JUNK : public _WORK_QUEUE_ITEM {
<your stuff>
} RANDOM_JUNK, *PRANDOM_JUNK;

VOID DpcForIsr(...)
{
PRANDOM_JUNK item = (PRANDOM_JUNK) ExAllocatePool(NonPagedPool, sizeof(RANDOM_JUNK));
ExInitializeWorkItem(item, (PWORKER_THREAD_ROUTINE) Callback, (PVOID) item);
. . .
ExQueueWorkItem(item, CriticalWorkQueue);
}

VOID Callback(PRANDOM_JUNK item)
{
. . .
ExFreePool(item);
}

If your driver should manage to completely process an IRP_MN_REMOVE_DEVICE request, the PnP Manager will decide your driver is no longer needed in memory and call the Memory Manager to unmap your driver code pages. If this should occur after you queue the work item, you can experience the bug. Some worker thread has a work item queue that contains the item you queued. When that item advances to the head of the queue, the worker thread will attempt to call your Callback routine. It's obvious that something bad will happen if your driver has already unloaded: in this case, the worker thread will jump off into what we driver experts call the Fire Swamp of Driver Code. It could also happen that the worker thread will call your callback routine and that your driver will be unloaded some time before your callback manages to execute the return instruction that passes control back to the worker thread. Here, the system yanks the plank out from under your callback routine between one machine instruction and the next.

An easy mistake to make (I know how easy, because I made it in the first edition of Programming the Microsoft Windows Driver Model) is to try to put some interlock in place to guard your own callback routine. For example, you might try calling IoAcquireRemoveLock just before queuing the work item and making the matching call to IoReleaseRemoveLock at the end of the callback routine. The idea is to hold up the processing of IRP_MN_REMOVE_DEVICE, which will be blocking on a call to IoReleaseRemoveLockAndWait, until your work item returns. The trouble with this scheme is that the driver might be unloaded right after your callback releases the remove lock but before it manages to actually return to the worker thread. It's true that you're probably safe in this situation, especially if your driver never runs on multiprocessor systems, but we'd like to aim for certainly safe if we can.

The new way

Microsoft added three routines to Windows 2000 and later systems to provide a certainly-safe way to use work items. They are:

IoAllocateWorkItem allocates memory for a work item structure.
IoQueueWorkItem queues the work item in a remove-safe way.
IoFreeWorkItem releases the memory previously allocated by IoAllocateWorkItem.

Code to use these new routines would look something like this example:

typedef struct _RANDOM_JUNK {
<your stuff>
PIO_WORKITEM item;
} RANDOM_JUNK, *PRANDOM_JUNK;

VOID DpcForIsr(PKDPC dpc, PDEVICE_OBJECT fdo, ...)
{
. . .
PRANDOM_JUNK ctx = (PRANDOM_JUNK) ExAllocatePool(NonPagedPool, sizeof(RANDOM_JUNK));
if (ctx)
    {
    <initialize ctx structure>
    ctx->item = IoAllocateWorkItem(fdo);
    if (ctx->item)
      IoQueueWorkItem(ctx->item, (PIO_WORKITEM_ROUTINE) Callback, CriticalWorkQueue, ctx);
    else
      ExFreePool(ctx);
    }
. . .
}

VOID Callback(PDEVICE_OBJECT fdo, PRANDOM_JUNK ctx)
{
. . .
IoFreeWorkItem(ctx->item);
ExFreePool(ctx);
}

What's new about this scheme is that IoQueueWorkItem calls ObReferenceObject to claim an extra reference to your device object, fdo. The extra reference on your device object lasts until your callback routine returns. Should it come to pass that you manage to completely process an IRP_MN_REMOVE_DEVICE while the work item is still outstanding, that extra reference will preserve the device object even though you've called IoDeleteDevice and even though every other program has released its reference. So long as the device object exists, the driver code will also stay mapped in memory.
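
The pinning idea can be illustrated outside the kernel. What follows is a user-mode toy model in C, not driver code. The type toy_device and the functions toy_queue_work, toy_delete_device, and toy_run_work are hypothetical names that stand in, very loosely, for the DEVICE_OBJECT, IoQueueWorkItem, IoDeleteDevice, and the system worker thread. Under those assumptions, it sketches how a reference taken at queue time keeps the object alive until the callback has returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DEVICE_OBJECT: a reference count plus flags. */
typedef struct toy_device {
    int  refs;           /* outstanding references                        */
    bool delete_pending; /* set once the owner has asked for deletion     */
    bool destroyed;      /* set when the last reference has been released */
    int  callbacks_run;  /* number of work item callbacks that completed  */
} toy_device;

typedef void (*toy_callback)(toy_device *dev);

typedef struct toy_work_item {
    toy_device  *dev;
    toy_callback routine;
} toy_work_item;

void toy_init_device(toy_device *dev)
{
    dev->refs = 1;               /* the creator's own reference */
    dev->delete_pending = false;
    dev->destroyed = false;
    dev->callbacks_run = 0;
}

void toy_dereference(toy_device *dev)
{
    assert(dev->refs > 0);
    if (--dev->refs == 0)
        dev->destroyed = true;   /* only now could the "driver" go away */
}

/* Loosely models IoQueueWorkItem: take the extra reference when queuing. */
void toy_queue_work(toy_work_item *item, toy_device *dev, toy_callback routine)
{
    item->dev = dev;
    item->routine = routine;
    dev->refs++;
}

/* Loosely models IoDeleteDevice: drop the creator's reference. */
void toy_delete_device(toy_device *dev)
{
    dev->delete_pending = true;
    toy_dereference(dev);
}

/* Loosely models the worker thread: call the routine, then release the pin. */
void toy_run_work(toy_work_item *item)
{
    assert(!item->dev->destroyed);  /* the object is still there to be used */
    item->routine(item->dev);
    toy_dereference(item->dev);     /* released only after the routine returns */
}

/* A trivial callback that records that it ran. */
void toy_count_callback(toy_device *dev)
{
    dev->callbacks_run++;
}
```

Queuing a work item, deleting the device, and only then running the work item leaves the toy device intact until toy_run_work finishes. That ordering is the one the old ExQueueWorkItem scheme could not survive.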

Version Compatibility

The IoXxxWorkItem routines are part of the Windows 2000 and later kernels. Windows 98 and Millennium systems don't support these DDIs. Notwithstanding that, WHQL currently requires that drivers not use ExQueueWorkItem. You can't achieve binary portability by calling MmGetSystemRoutineAddress to make a runtime decision about which work item routines to call because Windows 98/Me doesn't support that routine either. The solution I use in my own drivers is to ship a WDM lower filter driver (WDMSTUB.SYS) that defines the Windows 2000 work item routines in such a way that the system loader finds them when it later loads a function driver. WDMSTUB.SYS is a sample driver accompanying Programming the Microsoft Windows Driver Model Second Edition (Microsoft Press 2003) (hereafter, PMWDM2).²

Completion Routines

Another bug of the same ilk as the work item bug can occur with a standard I/O completion routine. One example of this bug is as follows. I need to suppose that your driver sends an asynchronous IRP down the PnP stack for some reason or another. You follow normal guidelines, which include using an IO_REMOVE_LOCK to make sure that you don't allow the driver underneath you to unload until it's finished handling this IRP. Your code might look something like this. (Refer to PMWDM2 at pp. 294-95 for a detailed explanation of these mechanics.)

VOID SomeFunction(...)
{
PDEVICE_EXTENSION pdx = . . .;
PIRP Irp = IoAllocateIrp(pdx->LowerDeviceObject->StackSize, FALSE);
. . .
NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, Irp);
if (!NT_SUCCESS(status))
    {
    IoFreeIrp(Irp);
    . . .
    }
else
    {
    IoSetCompletionRoutine(Irp, (PIO_COMPLETION_ROUTINE) CompletionRoutine, pdx, TRUE, TRUE, TRUE);
    IoCallDriver(pdx->LowerDeviceObject, Irp);
    }
}

NTSTATUS CompletionRoutine(PDEVICE_OBJECT junk, PIRP Irp, PDEVICE_EXTENSION pdx)
{
. . .
IoFreeIrp(Irp);
IoReleaseRemoveLock(&pdx->RemoveLock, Irp); // <== A
return STATUS_MORE_PROCESSING_REQUIRED;
}

NTSTATUS HandleRemoveDevice(...)
{
. . .
IoReleaseRemoveLockAndWait(&pdx->RemoveLock, Irp); // <== B
IoDetachDevice(pdx->LowerDeviceObject);
IoDeleteDevice(fdo);
. . .
}

Note that I have to leave out quite a bit of the code that would be in a real driver, or else we'd be here until next Wednesday trying to understand all the mechanics. The executive summary of these code fragments is this: SomeFunction creates and forwards the asynchronous IRP, but only after verifying (by a call to IoAcquireRemoveLock) that we haven't yet been sent an IRP_MN_REMOVE_DEVICE. Or, at least, that we haven't reached the point labeled "B" in our handling of such an IRP.

Now suppose that subsequent events unfold in the following order:

1. The PnP Manager sends us an IRP_MN_REMOVE_DEVICE. HandleRemoveDevice's call to IoReleaseRemoveLockAndWait will block for the time being. Just before sending the IRP, the PnP Manager calls ObReferenceObject for each of the DEVICE_OBJECTs in the PnP stack for our device, including our own.
2. The lower driver completes the IRP. IoCompleteRequest will call our CompletionRoutine, which calls IoReleaseRemoveLock.
3. Supposing that the just-completed IRP was the last one pending in or below our driver, the call to IoReleaseRemoveLock causes the thread within which we called HandleRemoveDevice to become eligible to run. Whereupon HandleRemoveDevice will call IoDetachDevice for the lower device object and IoDeleteDevice for the FDO that (I am assuming) this driver previously created.
4. HandleRemoveDevice will return. Real soon now, the PnP Manager will call ObDereferenceObject to dereference all of the DEVICE_OBJECTs.
5. Releasing the last reference to our DEVICE_OBJECT allows the object manager to actually delete the storage.
6. Deleting the last DEVICE_OBJECT created by our driver allows the memory manager to unmap our driver image from memory.
7. IoReleaseRemoveLock returns to the completion routine, which isn't there any more. We get our feet very wet at that point...

IoSetCompletionRoutineEx:

To avoid this bug, you can install the completion routine by calling IoSetCompletionRoutineEx instead of IoSetCompletionRoutine:

. . .
IoSetCompletionRoutineEx(pdx->DeviceObject, Irp, (PIO_COMPLETION_ROUTINE) CompletionRoutine, pdx, TRUE, TRUE, TRUE);
. . .

The extended version of IoSetCompletionRoutine takes an additional PDEVICE_OBJECT argument. Just before the I/O Manager calls your completion routine, it calls ObReferenceObject to take an extra reference to that object. That reference will pin the DEVICE_OBJECT, and your driver code, in memory. The I/O Manager releases the extra reference when your completion routine returns. In the scenario I outlined above, the extra reference would prevent the Object Manager from deleting the DEVICE_OBJECT at step 5. Consequently, the memory manager would not unmap the driver at step 6, and there would be no problem at step 7. The Object Manager would finally delete the device object after your completion routine returns.

Note that IoSetCompletionRoutineEx doesn't reference your device object. The reference occurs just before the I/O Manager calls the completion routine, if it ever does. Therefore, you need to use some other means to guarantee that your driver is still in memory when the IRP completes. In the preceding example, we used the IO_REMOVE_LOCK object to provide that guarantee.

Since IoSetCompletionRoutineEx allocates a small memory block, there are two fine points about this API that you should pay attention to. First of all, don't call it and then change your mind about actually submitting the IRP -- that will "leak pool" in the quaint vernacular of driver programmers.³ Second, you need to check the NTSTATUS return value. If it indicates an error, it did not install a completion routine.

Version Compatibility:

IoSetCompletionRoutineEx is available in Windows XP and later systems. You can use MmGetSystemRoutineAddress to get a pointer to this function at run time.⁴
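
The run-time decision has a simple shape: look the routine up by name once, and treat a null result as "not available on this system." The following is a user-mode sketch of that shape only. The export table, toy_get_routine_address, and the two toy routines are hypothetical stand-ins and are not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Every routine in this toy returns a tag saying which version ran. */
typedef int (*toy_routine)(void);

enum { TOY_PLAIN = 1, TOY_EXTENDED = 2 };

int toy_plain_routine(void)    { return TOY_PLAIN; }
int toy_extended_routine(void) { return TOY_EXTENDED; }

/* One entry in a hypothetical export table. */
typedef struct toy_export {
    const char *name;
    toy_routine address;
} toy_export;

/* Stand-in for a by-name lookup: NULL means "not exported here". */
toy_routine toy_get_routine_address(const toy_export *table, size_t count,
                                    const char *name)
{
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(table[i].name, name) == 0)
            return table[i].address;
    }
    return NULL;
}

/* Resolve the extended routine once; fall back when the lookup fails. */
toy_routine toy_choose_routine(const toy_export *table, size_t count)
{
    toy_routine found = toy_get_routine_address(table, count, "ToyExtendedRoutine");
    return found != NULL ? found : toy_plain_routine;
}
```

Falling back to the plain routine is only one possible policy. In a real driver it would leave open the window described above on systems that lack the extended routine, so the fallback path needs its own thought.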

Deferred Procedure Crashes -- uh, Calls

Our next bug involves the Deferred Procedure Call (DPC) mechanism. Here's a summary of what you need to know about DPCs:

To use a DPC, you create and initialize a KDPC object. You specify the DPC routine address in a call to the initialization function, KeInitializeDpc. Note that IoInitializeDpcRequest is simply a macro that initializes the KDPC object built into the standard DEVICE_OBJECT structure.
You schedule a deferred procedure call by calling KeInsertQueueDpc. This DDI checks first to see if the KDPC object is already on the DPC queue for some CPU. If not, it puts the KDPC object onto a queue. KeInsertQueueDpc normally queues a DPC for the CPU on which you call it, but you can call KeSetTargetProcessorDpc beforehand to direct a DPC onto a specific CPU. The "importance" attribute of a KDPC object governs the queue position and influences how soon the deferred call will occur. Note that IoRequestDpc is simply a macro that calls KeInsertQueueDpc to queue the KDPC object built into the standard DEVICE_OBJECT structure.
Running at DISPATCH_LEVEL, the DPC dispatcher removes KDPC objects from a queue and calls the associated DPC routines.

The DPC mechanism can give rise to the same kind of problem as a work item. Namely, while a DPC object is on a queue for some CPU, it's possible that the system will unload the driver containing the callback routine. When the DPC dispatcher gets around to calling the DPC routine, the routine is gone. Alternatively, the driver could be unloaded while a DPC routine is actually running, so that the code disappears between one instruction and the next.

KeFlushQueuedDpcs allows you to avoid this kind of bug. You could use this routine in a code path that releases I/O resources. For example:

VOID StopDevice(PDEVICE_OBJECT fdo, BOOLEAN oktouch)
{
ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
. . .
IoDisconnectInterrupt(. . .);
KeFlushQueuedDpcs();
}

You must call KeFlushQueuedDpcs at PASSIVE_LEVEL. It returns after making sure that no code is executing at a higher IRQL on any of the CPUs in the computer. As you know, DPC routines execute at DISPATCH_LEVEL. Consequently, you can be sure that, by the time this DDI returns, any DPC that was queued for any CPU at the time of the call has been executed. In particular, any DPC you queued has executed.
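
The two properties that matter here are that an object already on the queue isn't queued a second time, and that after a flush nothing of yours is left waiting to run. The sketch below is a single-threaded, user-mode toy that models only those two effects. toy_dpc, toy_insert_queue_dpc, and toy_flush_queued_dpcs are hypothetical names, the fixed queue depth is an artifact of the toy, and nothing here reflects how the kernel implements either routine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_DEPTH 8

/* Hypothetical stand-in for a KDPC object. */
typedef struct toy_dpc {
    void (*routine)(struct toy_dpc *dpc);
    bool queued;   /* true while the object sits on the queue  */
    int  runs;     /* number of times the routine has executed */
} toy_dpc;

typedef struct toy_dpc_queue {
    toy_dpc *slots[TOY_QUEUE_DEPTH];
    size_t   count;
} toy_dpc_queue;

/* A trivial DPC routine that records that it ran. */
void toy_count_run(toy_dpc *dpc)
{
    dpc->runs++;
}

void toy_init_dpc(toy_dpc *dpc, void (*routine)(toy_dpc *))
{
    dpc->routine = routine;
    dpc->queued = false;
    dpc->runs = 0;
}

/* Models the queuing rule: refuse an object that is already queued. */
bool toy_insert_queue_dpc(toy_dpc_queue *q, toy_dpc *dpc)
{
    if (dpc->queued || q->count == TOY_QUEUE_DEPTH)
        return false;
    dpc->queued = true;
    q->slots[q->count++] = dpc;
    return true;
}

/* Models the effect of a flush: on return, everything queued has run. */
void toy_flush_queued_dpcs(toy_dpc_queue *q)
{
    for (size_t i = 0; i < q->count; ++i) {
        toy_dpc *dpc = q->slots[i];
        dpc->queued = false;   /* mark as dequeued, then call the routine */
        dpc->routine(dpc);
    }
    q->count = 0;
}
```

In the toy, a second insert of the same object fails while the first is pending, and after the flush the object is idle and safe to discard.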

Version compatibility:

KeFlushQueuedDpcs is part of the Windows Server 2003 product. You can use MmGetSystemRoutineAddress to get a pointer to this function at run time.⁴ NDAs to which I'm a party forbid me from telling you exactly how you could write your own version of this function for an earlier platform, and I certainly could not conscientiously advise you to reverse engineer the implementation of this function in a multiprocessor Server 2003 kernel.

Kernel Thread Termination

Kernel threads that you create by calling PsCreateSystemThread have long posed the kind of problem we're talking about in this article. While your thread is still running, the PnP Manager can unload your driver. The driver code will then disappear in between two instructions in your thread routine.

This particular walk-the-plank bug is made worse by the fact that there isn't any way to terminate a system thread from outside the thread. As you know, the thread routine itself must call PsTerminateSystemThread. The basic mechanics for starting and stopping a kernel thread are, therefore, as follows (see PMWDM2 at 682-85):

1. Initialize a kernel event object -- I call it the "kill" event -- that will be accessible to the thread routine and to a driver routine like StopDevice or RemoveDevice that will later want to halt the thread.
2. Call PsCreateSystemThread to create the thread. Then call ObReferenceObjectByHandle to "convert" the resulting thread handle into a PKTHREAD that points to the underlying kernel thread object and ZwClose to discard the now-unneeded thread handle.
3. At the time you want to terminate the thread, set the "kill" event and wait until the thread object reaches the "signalled" state.
4. Within the thread routine, use some means to wait for the "kill" event to be signalled. I often organize my kernel threads so there's a "do something" event that gets signalled when there's work for the thread to perform. The thread routine spends much of its time waiting for a call to KeWaitForMultipleObjects to detect that either the "kill" or the "do something" event has been signalled. When the thread detects the "kill" event, it calls PsTerminateSystemThread to terminate the thread.
5. The call to PsTerminateSystemThread signals the thread object, which allows the waiter at step 3 to wake up. At that point, it will be safe to allow the driver code to be removed from memory.
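
The same handshake can be tried out in user mode. The sketch below is an analogy that assumes POSIX threads, not kernel code: a flag and a counter guarded by a mutex play the "kill" and "do something" events, returning from the thread routine plays PsTerminateSystemThread, and pthread_join plays the wait on the thread object. All of the toy_ names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_worker {
    pthread_t       thread;
    pthread_mutex_t lock;
    pthread_cond_t  wake;      /* signalled for either "event"             */
    bool            kill;      /* the "kill" event                         */
    int             pending;   /* the "do something" event, as a counter   */
    int             processed; /* work units the thread has completed      */
} toy_worker;

/* Thread routine: sleep until there is work or a kill request. */
static void *toy_thread_routine(void *arg)
{
    toy_worker *w = arg;
    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (!w->kill && w->pending == 0)
            pthread_cond_wait(&w->wake, &w->lock);
        if (w->pending > 0) {   /* this toy drains work before honoring kill */
            w->pending--;
            w->processed++;
            continue;
        }
        break;                  /* kill requested and nothing left to do */
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;                /* plays the role of terminating the thread */
}

/* Steps 1 and 2: set up the "events" and create the thread. */
int toy_start(toy_worker *w)
{
    w->kill = false;
    w->pending = 0;
    w->processed = 0;
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    return pthread_create(&w->thread, NULL, toy_thread_routine, w);
}

/* Signal the "do something" event. */
void toy_submit(toy_worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->pending++;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Step 3: set the "kill" event, then wait on the thread itself. */
void toy_stop(toy_worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->kill = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
    pthread_join(w->thread, NULL);  /* returns only after the routine exits */
    pthread_mutex_destroy(&w->lock);
    pthread_cond_destroy(&w->wake);
}
```

Draining pending work before honoring the kill request is a choice made for this toy so that its behavior is deterministic. A driver thread might reasonably check for "kill" first. What carries over is that toy_stop does not return until the thread routine has finished, so only after that point would it be safe to let the code go away.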

Version compatibility:

Unfortunately, a kernel thread object is not a "dispatcher object" in Windows 98 or Millennium. If you use the PKTHREAD as an argument to KeWaitForSingleObject on that platform, the system will crash. PMWDM2 outlines a scheme (see p. 695 and the POLLING sample driver) involving an "I'm dead" event and a priority boost that allows you to more-or-less reliably wait for a kernel thread to exit.

Disappearing Objects

The bugs I've discussed to this point in the article all involve situations in which your code disappears too soon. The last bugs I'll mention are a bit different: they arise when you release memory that contains an object to which some kernel function has a pointer.

Kernel timers:

Let's first consider the case of a kernel timer. Don't do the following:

VOID SomeFunction(. . .)
{
KTIMER timer;
KeInitializeTimer(&timer);
LARGE_INTEGER timeout;
timeout.QuadPart = -50000000; // i.e., 5 sec worth of 100-ns units
KeSetTimer(&timer, timeout, NULL);
return; // <== oops
}

I bet it's almost obvious what's wrong here. When you call KeSetTimer, the kernel places your timer object on an internal queue. Each time the system timer interrupts, the interrupt handler walks through the queue looking for objects that have timed out. Since your timer object is an automatic variable, it passes out of scope when SomeFunction returns. It's likely to the point of virtual certainty that the memory originally occupied by timer will be overwritten by other data in the near future, which will cause the timer interrupt handler to crash.

To avoid leaving your timer on a system queue when you return, just be sure that you either wait for the timer to expire or call KeCancelTimer. For example, the following code would be perfectly safe:

VOID SomeFunction(. . .)
{
KTIMER timer;
KeInitializeTimer(&timer);
LARGE_INTEGER timeout;
timeout.QuadPart = -50000000; // i.e., 5 sec worth of 100-ns units
KeSetTimer(&timer, timeout, NULL);
if (<some condition>)
KeWaitForSingleObject(&timer, Executive, KernelMode, FALSE, NULL);
else
KeCancelTimer(&timer);
return; // <== okay now
}
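
The timeout value in the fragment above is easy to get wrong by a factor of ten. A negative due time is relative to the current time, and the unit is 100 nanoseconds, so one millisecond is 10,000 units. The helper below is a hypothetical convenience function, written in plain C with int64_t standing in for the QuadPart member, that does the conversion from milliseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 100-ns intervals in one millisecond. */
#define TOY_100NS_PER_MS 10000LL

/* Hypothetical helper: convert a delay in milliseconds into a relative
   due time, i.e. a negative count of 100-ns units. */
int64_t toy_relative_due_time(int64_t milliseconds)
{
    assert(milliseconds >= 0);
    return -(milliseconds * TOY_100NS_PER_MS);
}
```

Called with 5000, the helper yields -50000000, the same 5-second value that is assigned to timeout.QuadPart above.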

Exactly the same kind of bug can occur if you put a KTIMER in your DEVICE_EXTENSION structure and call IoDeleteDevice without canceling the timer. Or if you allocate a block of memory to hold the KTIMER and release it too soon. I think you get the idea.

Here are two variations on this same theme that are a bit more subtle. Suppose for some reason that you did your wait in user mode:

KeWaitForSingleObject(&timer, Executive, UserMode, FALSE, NULL);

or with the "alertable" flag set:

KeWaitForSingleObject(&timer, Executive, KernelMode, TRUE, NULL);

In the UserMode wait case, the thread's kernel stack would be temporarily pageable. Your timer object might not be in memory when the system timer next interrupts, leading to a page fault at elevated IRQL. In the alertable-wait case, the wait might terminate early due to a thread alert. If you forgot to call KeCancelTimer, the timer would still be ticking when it passed out of scope.

Another subtle bug could arise if you establish a periodic timer:

KeInitializeTimerEx(&timer, SynchronizationTimer);
KeSetTimerEx(&timer, timeout, 5000, NULL);
KeWaitForSingleObject(&timer, Executive, KernelMode, FALSE, NULL);
return; // <== oops

This timer expires initially after the designated timeout value and every 5000 milliseconds thereafter. Because the timer keeps on ticking after each expiration, you must remember to cancel it before it passes out of scope.

Finally, consider what happens if you specify the optional DPC argument to KeSetTimer or KeSetTimerEx. You thereby indicate that you want the system to call the associated DPC routine when the timer expires. I think it would be pretty easy to create a periodic timer with a DPC in your AddDevice function, say, and forget to cancel it in your RemoveDevice function. It would also be easy to forget to call KeFlushQueuedDpcs in order to make sure that any last DPC associated with your timer was done.

Lookaside Lists

When you create a memory lookaside list by calling ExInitialize[N]PagedLookasideList, the system places your lookaside list object on an internal queue that it traverses every so often in order to adjust the list depth based on recent usage. Be sure to make the matching call to ExDelete[N]PagedLookasideList before allowing the list object to pass out of scope.

Summary

The following situations all give rise to "walk the plank" bugs. In the preceding sections, I described these problems and their workarounds.

Unloading a driver with a work item still queued. IoQueueWorkItem (2000 and later) prevents this bug by pinning the driver in memory until the associated work item callback routine returns.
Triggering a driver to unload while in an I/O completion routine. Using IoSetCompletionRoutineEx (XP and later) to install the completion routine will prevent this problem, but you still need to make sure by other means that your driver doesn't unload before the system calls your completion routine.
Unloading a driver with a DPC object still queued on some CPU. KeFlushQueuedDpcs (Server 2003 and later) avoids this embarrassment by waiting until all queued DPCs have executed before returning.
Unloading a driver with a kernel thread still running. To avoid this problem, you need to block your RemoveDevice (or, less likely, your DriverUnload) function until all your threads manage to call PsTerminateSystemThread. You need to use different techniques in NT-type systems than in Windows 98 and Millennium.
Releasing the memory that contains a KTIMER object that is still queued. Use KeCancelTimer to avoid this problem.
Releasing the memory that contains an active lookaside list. Call ExDelete[N]PagedLookasideList first to avoid this bug.

About the author:

Walter Oney is a freelance driver programmer, seminar leader, and author based in Boston, Massachusetts. You can reach him by e-mail at waltoney@oneysoft.com. Information about the Walter Oney Software seminar series, and other services, is available online at http://www.oneysoft.com.

* -- Revised Sept. 4, 2003, to clarify two features about IoSetCompletionRoutineEx.

1 -- Monty Python fans take note: the word "thingy" is not being used here in its strictly literal sense.

2 -- Redistribution of WDMSTUB.SYS requires acceptance of a royalty-free license from the author. The purpose of the license is to ensure that end users don't end up with inconsistent or stale versions.

3 -- If there is absolutely no other way to organize your code, you can force the completion routine to be called by calling IoSetNextIrpStackLocation followed by IoCompleteRequest.

4 -- MmGetSystemRoutineAddress is not available in Windows 98 or Millennium systems. WDMSTUB.SYS does implement MmGetSystemRoutineAddress, however.