In Your Face

Accessing Hardware Registers
July 15, 2003
Mark Roddy

Sometimes we all get it wrong. For years the various gurus of NT kernel development have been preaching the following: "if the device's raw resource is in PORT space, use HAL PORT operations, otherwise it is in REGISTER space and you should use HAL REGISTER operations." It seemed to be sensible advice. However, a recent discussion on the NTDEV email list with Jake Oshins of Microsoft, the lead NT HAL developer, revealed the startling information that this advice, repeated in the DDK, in books by essentially everyone who has written a book on NT device driver programming, and in all of the leading seminars, is wrong.

Introduction and background.

But first some background: what exactly are we talking about here, and what are the issues involved? The NT HAL abstracts the interface between the processor, system memory, and the IO busses (and the chipsets that glue them together), relieving the kernel programmer from having to consider the details of exactly how a particular platform accesses device memory. Specifically, this article discusses how NT supports Programmed IO (PIO): using the system CPUs to transfer data between system memory and device memory.

Consider the following vastly simplified system diagram:

Processors are attached to a system bus, as are the system physical memory array and a HOST-PCI bridge device. Attached to the HOST-PCI bridge device is a PCI bus, and attached to the PCI bus is a PCI device. The PCI device contains a memory location, a register, that must be read and written from the system processors in order to control the device.

The HAL provides support for two separate IO bus address spaces, which we will call REGISTER address space and PORT address space. This terminology reflects the HAL naming conventions, and avoids overloading the terms IO and Memory. Unfortunately, the term REGISTER is itself overloaded, and so the convention in this article is that an all-capitalized REGISTER refers to IO bus REGISTER address space, while the uncapitalized term register refers to memory on a peripheral device that can be accessed from the host.

IO busses may support one or both address spaces. The control registers and data buffers on a device may be implemented (by the hardware designers) in either address space. How an IO bus implements these address spaces is a function of the bus design. Details of how the PCI bus, for example, implements IO bus transactions are discussed exhaustively in the MindShare PCI Systems Architecture series. Note that there may very well be semantic differences between PORT and REGISTER bus transactions. For example, PCI REGISTER write operations can be posted while PORT write operations can never be posted.

When a processor needs to read or write a device register, it does so using one of two methods. Either it uses standard memory LOAD/STORE operations (pointer dereference), or it uses the native processor support for PORT address operations (IN/OUT instructions). Of course not all processors actually support native PORT address operations, which is why this starts to get a little complicated. Consider the simple case where the device register is in REGISTER space. The processor performs a LOAD or STORE operation on the appropriate physical address. This address is guaranteed not to correspond to a valid location in the system physical memory array, and instead is claimed by the HOST-PCI bridge, which then performs the appropriate bus transaction on its attached PCI bus. For processors with native PORT IO operations, a similar process occurs, but the processor is using IN and OUT instructions rather than LOAD/STORE instructions.

What happens on systems that do not have native support for PORT operations? In the simplest cases, the PCI bridge device is programmed to map specific memory addresses to PCI bus PORT addresses. Typically a block of the physical address space is set aside as a map for PORT addresses. Now the processor can simply do a LOAD or STORE operation with an address within the PORT mapping, and the PCI bridge will take care of issuing the appropriate PCI bus transaction. There are more complicated arrangements, particularly in high end systems, but they are all variations on this theme.
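The mapping just described can be modeled in a few lines. This is only a user-mode sketch of the idea, not HAL or bridge code: the PortWindow type and the base and size values used with it are hypothetical, chosen purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a HOST-PCI bridge that exposes PCI PORT space through
// a window of processor physical addresses. Base and size are made-up values.
struct PortWindow {
    std::uint64_t base;  // first physical address set aside for PORT mapping
    std::uint32_t size;  // number of PORT addresses covered by the window

    // Processor side: the physical address a LOAD/STORE must target in order
    // to reach the given PORT address.
    std::uint64_t toPhysical(std::uint32_t port) const {
        assert(port < size);
        return base + port;
    }

    // Bridge side: if the physical address falls inside the window, report
    // the PORT address of the bus transaction to issue and return true.
    bool claims(std::uint64_t physical, std::uint32_t& port) const {
        if (physical < base || physical - base >= size) {
            return false;  // not ours; some other agent decodes this address
        }
        port = static_cast<std::uint32_t>(physical - base);
        return true;
    }
};
```

With a window at an assumed base of 0xF8000000, a STORE to 0xF8000320 would be claimed by the bridge and issued on the PCI bus as a PORT write to 0x320.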

HAL support for device PIO operations is provided through a set of platform specialized functions. REGISTER operations have names like READ_REGISTER_UCHAR while PORT operations have names like READ_PORT_UCHAR. There are equivalent functions for WRITE operations, and there are versions for 8, 16, and 32 bit accesses. In addition buffer transfer operations exist for both PORT and REGISTER operations.

Register Address Space PIO HAL Functions	Port Address Space PIO HAL Functions
READ_REGISTER_UCHAR	READ_PORT_UCHAR
READ_REGISTER_USHORT	READ_PORT_USHORT
READ_REGISTER_ULONG	READ_PORT_ULONG
WRITE_REGISTER_UCHAR	WRITE_PORT_UCHAR
WRITE_REGISTER_USHORT	WRITE_PORT_USHORT
WRITE_REGISTER_ULONG	WRITE_PORT_ULONG
READ_REGISTER_BUFFER_UCHAR	READ_PORT_BUFFER_UCHAR
READ_REGISTER_BUFFER_USHORT	READ_PORT_BUFFER_USHORT
READ_REGISTER_BUFFER_ULONG	READ_PORT_BUFFER_ULONG
WRITE_REGISTER_BUFFER_UCHAR	WRITE_PORT_BUFFER_UCHAR
WRITE_REGISTER_BUFFER_USHORT	WRITE_PORT_BUFFER_USHORT
WRITE_REGISTER_BUFFER_ULONG	WRITE_PORT_BUFFER_ULONG

Common IO busses such as PCI, and of course the hideous ISA bus, provide both REGISTER and PORT address spaces. Platforms and processors however may not directly support both address spaces. X86 processors natively support PORT operations with IN and OUT instructions. IA64 processors also support PORT operations, although this support is virtualized. Historically, NT ran on PowerPC, MIPS, and Alpha systems, and none of these CPUs directly supported PORT operations. The HAL hides these details within both the device access operations referenced above, and through address translation and mapping operations that must be performed by device drivers prior to accessing device memory.

Translating device resources.

Some device driver programmers, particularly those with a DOS background, would like to simply read PCI config space for their device (or know a priori that their ISA device always uses, e.g., port 0x320) and use these values directly to access their device registers and buffers. This approach is forbidden on NT platforms for a variety of reasons. Instead, device drivers are required to translate what are referred to as raw resources into translated resources, and to only ever use the translated resources to access device registers and buffers. Prior to Windows 2000, the translation process was performed by the device driver explicitly using HAL operations. Starting with Windows 2000, the PnP Manager performs almost all of the translation operations for the driver, providing both the original raw resources and the platform specific translated resources to a PnP driver in an IRP_MN_START_DEVICE request. The translation process provides a platform independent mapping of device resources.

Both the raw and translated resources provided in an IRP_MN_START_DEVICE request indicate in which bus address space the resource is located. If the translated resource is in REGISTER space, then the device driver must take the additional step of calling MmMapIoSpace to obtain a virtual address for the translated physical address. As mentioned above, the REGISTER address reported from PCI config space is not directly usable but instead must be converted into a virtual address.
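As a rough illustration of that start-device step, here is a user-mode sketch. Everything in it is a stand-in: a real driver walks the CM_PARTIAL_RESOURCE_DESCRIPTOR entries in the translated resource list and calls MmMapIoSpace, whereas the ResourceDescriptor and MappedResource types, fakeMapIoSpace, and the fixed mapping offset below are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Which bus address space a resource descriptor says it lives in.
enum class AddressSpace { Port, Register };

// Hypothetical, simplified model of one translated resource descriptor.
struct ResourceDescriptor {
    AddressSpace  space;   // address space reported by the descriptor
    std::uint64_t start;   // starting address
    std::uint32_t length;  // size in bytes
};

// What the driver keeps for later PIO: the address to hand to the HAL
// functions, and which family of HAL functions goes with it.
struct MappedResource {
    AddressSpace   space;
    std::uintptr_t accessAddress;
};

// Stand-in for MmMapIoSpace. It only pretends to map: the "virtual address"
// is the physical address plus an arbitrary offset (an assumption made so
// the sketch can run outside the kernel).
static const std::uintptr_t kFakeMapOffset = 0x100000;
static std::uintptr_t fakeMapIoSpace(std::uint64_t physical, std::uint32_t length) {
    assert(length > 0);
    return static_cast<std::uintptr_t>(physical) + kFakeMapOffset;
}

// Only the TRANSLATED descriptor is consulted here. The raw descriptor is
// deliberately not a parameter.
static MappedResource prepareResource(const ResourceDescriptor& translated) {
    if (translated.space == AddressSpace::Register) {
        // REGISTER space: obtain a virtual address before any access.
        return { AddressSpace::Register,
                 fakeMapIoSpace(translated.start, translated.length) };
    }
    // PORT space: the translated port address is used as it is.
    return { AddressSpace::Port, static_cast<std::uintptr_t>(translated.start) };
}
```

The driver would store one MappedResource per device resource and use it for every later access, never going back to the raw values.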

It is the translation process (from raw resources to translated resources) that causes the confusion we are discussing here. A device resource can move from PORT space to REGISTER space when it is translated. (Theoretically a raw resource could move from REGISTER space to PORT space as well, but I know of no platform where this is currently true, and I doubt that there ever will be one in the future.)

The raw resources represent the physical configuration of the device. The device has a set of zero or more PORT space registers and zero or more REGISTER space registers, as a function of its hardware design, and PnP reports those register resources as the raw resources provided to a driver's IRP_MN_START_DEVICE method. The raw resources reflect the actual hardware design. If the hardware device engineer put the registers in PORT space, that is where the raw resource descriptor will report that resource.

The translated resources represent the operating system's effort to make the device hardware registers available to the device driver, taking into account the platform's actual capabilities. On different platforms the same device hardware register could be in different translated resource IO space addresses. On some platforms simply moving a device from one bus to another bus could result in its hardware resources appearing in a different translated address space.

So a device driver is handed two sets of resources, raw and translated, for each device register, and one set can be in PORT space while the other is in REGISTER space. The HAL provides two sets of functions, one set to access PORT space and one to access REGISTER space. The question is, which set of HAL functions should a driver use to access a device resource, the set indicated by the raw resources, or the set indicated by the translated resources?

In the case where the platform does not support PORT address space operations (e.g. the obsolete MIPS and PowerPC platforms), there is always a conversion from raw PORT space to translated REGISTER space. On both of these platforms the HAL defined PORT access operations as pointer dereference operations, so it didn't matter if the device driver used HAL PORT or REGISTER functions to access PORT IO space. So essentially the question posed above was moot on these platforms: use either the raw-indicated HAL functions or the translated-indicated functions.

For example, here is the MIPS version of READ_PORT_UCHAR:

#define READ_PORT_UCHAR(x) \
*(volatile UCHAR * const)(x)

And here is the MIPS version of READ_REGISTER_UCHAR:

#define READ_REGISTER_UCHAR(x) \
*(volatile UCHAR * const)(x)

On this platform I can state with certainty that it didn't matter if you chose the raw resource functions or the translated resource functions :-) The PowerPC HAL likewise provided identical mappings for PORT and REGISTER operations.

Now consider a platform which directly supports PORT access operations, but maps some IO bus PORT addresses into REGISTER space while leaving others in PORT IO space.

On the platform depicted above, PORT registers on PCI bus 0 are accessed using the processor's native PORT IO instructions, while PORT registers on PCI bus 1 are accessed using the processor's REGISTER space LOAD/STORE operations, as the HOST-PCI bridge device for PCI bus 1 translates memory addresses in a specific range into PORT transactions for its child bus.

On such a platform, sometimes PORT raw address space is mapped to PORT translated address space, and sometimes it isn't. (There are examples of such platforms: x86 systems with multiple busses where one root bus PORT address space is reached using x86 IO operations while other root bus PORT address spaces are accessed using memory mapped operations.) The HAL PORT access operations on such a platform would use the native processor IN/OUT instructions. If the driver uses the raw-indicated address space, it would use HAL PORT access operations for all raw PORT resources, and would fail to access the resources of any device whose PORT registers were translated into REGISTER space (in the example above, the devices on PCI bus 1).

The raw resource cannot be used as the guide to which access function to use. On some platforms choosing the HAL PIO function based on the raw resource will cause incorrect results.

Use the translated resource, Luke.

The correct answer is to always use the translated resource address space to decide which HAL PIO function to use to access device registers. The bad news is that this means that, to be completely correct, your driver should be written to use either set of functions. (Using the old advice, one could simply read the specifications for the device and know which HAL function to use, as the raw resource always reflected the hardware design.) This means that PIO operations have to switch at runtime between the PORT and REGISTER HAL functions. As these HAL functions are actually macros on some platforms, some thought has to go into how to correctly and efficiently implement the runtime switch. Perhaps a future release of the new driver framework will provide an efficient encapsulation that will make this boilerplate code.

Why can't the HAL do things the other way? It would be simpler if the old advice was the correct advice. No wrapper functions to accommodate macro functions, no jump tables to efficiently call the right HAL function, no fuss no muss. Unfortunately, as these operations are defined, there simply is not enough information in a pointer for the HAL PORT functions to decode a PORT address and use the correct access method on all platforms. (Of course the address could be used as an index into some table that would produce the correct access method, but this would impose considerable overhead on these operations.)

Although it isn't documented anywhere I know of, there is another complication with the HAL PIO functions. On some platforms PORT buffer functions behave one way while on other platforms they behave another. For example, on a standard x86 platform, READ_PORT_BUFFER_UCHAR reads the specified number of bytes from the specified port address into the specified buffer. The port address is treated like a FIFO: the operation does not increment the port address with each byte fetch. On other platforms, READ_PORT_BUFFER_UCHAR does not consider the port address to be a FIFO, but instead, just like the REGISTER versions of these functions on all platforms, does a buffer copy, incrementing the port address after each byte fetch. I actually have no advice at all to give about how to program this feature correctly, other than to avoid hardware that uses PORT space for buffer transfers.
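The difference between the two behaviors is easy to see in a small simulation. The sketch below is user-mode code with made-up names (FakePortDevice, readBufferFifo, readBufferIncrementing); it models the two semantics described above and is not the HAL implementation of READ_PORT_BUFFER_UCHAR on any platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical device: a handful of byte-wide PORT locations, one of which
// drains a FIFO and so returns a new byte on every read until it is empty.
class FakePortDevice {
public:
    FakePortDevice(std::vector<std::uint8_t> ports, std::size_t fifoPort,
                   std::vector<std::uint8_t> fifoData)
        : ports_(ports), fifo_(fifoData), fifoPort_(fifoPort), next_(0) {}

    std::uint8_t read(std::size_t port) {
        if (port == fifoPort_ && next_ < fifo_.size()) {
            return fifo_[next_++];  // each read consumes one FIFO byte
        }
        return ports_.at(port);     // ordinary location: same value every time
    }

private:
    std::vector<std::uint8_t> ports_;
    std::vector<std::uint8_t> fifo_;
    std::size_t fifoPort_;
    std::size_t next_;
};

// FIFO semantics (the x86 behavior described above): every byte is read
// from the same port address.
void readBufferFifo(FakePortDevice& dev, std::size_t port,
                    std::uint8_t* buffer, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        buffer[i] = dev.read(port);
    }
}

// Buffer-copy semantics: the port address advances after each byte.
void readBufferIncrementing(FakePortDevice& dev, std::size_t port,
                            std::uint8_t* buffer, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        buffer[i] = dev.read(port + i);
    }
}
```

Run against the same simulated device, the first routine drains the FIFO while the second returns one FIFO byte followed by whatever happens to live at the neighboring port addresses, which is why a driver cannot treat the two behaviors as interchangeable.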

So, while theory is nice, practice is bliss. To complete this article I built a simple switching mechanism for the standard HAL PIO functions, and tested it out on my PLX9054 PCI test device. The switch mechanism is constructed using four C++ classes. The following UML diagram illustrates the class relationships:

The HalPio class is an abstract base class. Its methods are all pure virtual and it has no data associated with it. HalPio simply defines a callable interface. Each method corresponds to one of the HAL PIO functions documented in the table earlier in the article. Note that I chose to overload Read and Write operations, having three sets of each, one for UCHAR, USHORT, and ULONG versions of the HAL functions, rather than having separately named methods for each size of operand. You might find this an offensive use of C++ obfuscation, but I like it.

HalPortPio and HalRegisterPio classes inherit from HalPio and provide, respectively, HAL PORT operations and HAL REGISTER operations. They do so by wrapping the appropriate HAL PIO function. This simple architecture allows for runtime switching between PORT and REGISTER HAL PIO functions by using either the HalPortPio class or the HalRegisterPio class to invoke the HAL operation.

The fourth class, DevicePioAccess, uses the HalPio class hierarchy to provide the runtime switch. In start device processing, as a function driver parses its resources, for each device resource it chooses to use it would create a new instance of a DevicePioAccess object for that resource. The only constructor for DevicePioAccess specifies which set of access routines to use. Once constructed, a DevicePioAccess object provides access to the correct HalPio derived class through its Pio() method.

The implementation also creates one instance each of HalPortPio and HalRegisterPio. As these are stateless wrappers to function calls, there is no need to instantiate any more than these two instances. The static methods specifyPortIo() and specifyRegisterIo() are helpers to make it easy to construct a DevicePioAccess instance, as they each return a reference to one of the two HalPio objects instantiated by the implementation.

In start device processing an FDO driver might do the following:

// barIsPort must reflect the translated resource, not the raw one.
bar0_Access = new DevicePioAccess(barIsPort
    ? DevicePioAccess::specifyPortIo()
    : DevicePioAccess::specifyRegisterIo());

To use a DevicePioAccess object a driver could do the following:

ULONG tempReg = bar0_Access->Pio().PioRead(UlongOffset(bar0, PCI9054_PCI_DOORBELL));
bar0_Access->Pio().PioWrite(UlongOffset(bar0, PCI9054_PCI_DOORBELL), tempReg);
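For readers who want to see the shape of the four classes, here is a condensed sketch that can be compiled and run in user mode. It is a reconstruction from the description above, not the code I tested: the Stub* functions are stand-ins for the HAL PIO functions (they simply dereference ordinary memory and count calls), only the UCHAR and ULONG overloads are shown, and the USHORT and buffer methods are omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint8_t  UCHAR;
typedef std::uint32_t ULONG;

// User-mode stand-ins for the HAL PIO functions (assumption: plain memory
// access). The counters exist only so a test can see which family ran.
static int g_portCalls = 0;
static int g_registerCalls = 0;
static UCHAR StubReadPortUchar(volatile UCHAR* p)               { ++g_portCalls; return *p; }
static ULONG StubReadPortUlong(volatile ULONG* p)               { ++g_portCalls; return *p; }
static void  StubWritePortUchar(volatile UCHAR* p, UCHAR v)     { ++g_portCalls; *p = v; }
static void  StubWritePortUlong(volatile ULONG* p, ULONG v)     { ++g_portCalls; *p = v; }
static UCHAR StubReadRegisterUchar(volatile UCHAR* p)           { ++g_registerCalls; return *p; }
static ULONG StubReadRegisterUlong(volatile ULONG* p)           { ++g_registerCalls; return *p; }
static void  StubWriteRegisterUchar(volatile UCHAR* p, UCHAR v) { ++g_registerCalls; *p = v; }
static void  StubWriteRegisterUlong(volatile ULONG* p, ULONG v) { ++g_registerCalls; *p = v; }

// Abstract interface: pure virtual methods and no data.
class HalPio {
public:
    virtual ~HalPio() {}
    virtual UCHAR PioRead(volatile UCHAR* address) = 0;
    virtual ULONG PioRead(volatile ULONG* address) = 0;
    virtual void  PioWrite(volatile UCHAR* address, UCHAR value) = 0;
    virtual void  PioWrite(volatile ULONG* address, ULONG value) = 0;
};

// Wraps the PORT family.
class HalPortPio : public HalPio {
public:
    UCHAR PioRead(volatile UCHAR* a) override { return StubReadPortUchar(a); }
    ULONG PioRead(volatile ULONG* a) override { return StubReadPortUlong(a); }
    void  PioWrite(volatile UCHAR* a, UCHAR v) override { StubWritePortUchar(a, v); }
    void  PioWrite(volatile ULONG* a, ULONG v) override { StubWritePortUlong(a, v); }
};

// Wraps the REGISTER family.
class HalRegisterPio : public HalPio {
public:
    UCHAR PioRead(volatile UCHAR* a) override { return StubReadRegisterUchar(a); }
    ULONG PioRead(volatile ULONG* a) override { return StubReadRegisterUlong(a); }
    void  PioWrite(volatile UCHAR* a, UCHAR v) override { StubWriteRegisterUchar(a, v); }
    void  PioWrite(volatile ULONG* a, ULONG v) override { StubWriteRegisterUlong(a, v); }
};

// The wrappers are stateless, so one instance of each is enough.
static HalPortPio     g_portPio;
static HalRegisterPio g_registerPio;

// The runtime switch: one object per device resource, bound at construction
// to the access family indicated by the translated resource.
class DevicePioAccess {
public:
    explicit DevicePioAccess(HalPio& pio) : pio_(pio) {}
    HalPio& Pio() { return pio_; }
    static HalPio& specifyPortIo()     { return g_portPio; }
    static HalPio& specifyRegisterIo() { return g_registerPio; }
private:
    HalPio& pio_;
};
```

Both helper methods return HalPio&, which is what lets the conditional expression in the start-device fragment above pick one or the other. A kernel build would also need a suitable operator new for the driver's pool allocations, which this sketch does not show.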

The unoptimized (debug version) overhead of using this switch interface is 24 instructions on an x86 build using the W2k3 DDK compiler. No measurements were made of the overhead on the free build, but it should be significantly lower than 24 instructions.

The use of C++ is of course somewhat non-standard, but the same concepts can easily be implemented using standard C structures and functions. C++ just provides a lot of the plumbing for such an interface built into its virtual function mechanism.
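A sketch of what the plain-structure version might look like follows. It is written in a C-like style (structures of function pointers, no classes) but kept in C++ so that it matches the other examples. All of the names are hypothetical, the stub* functions stand in for the HAL ULONG access functions, and only the ULONG operations are shown.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t ULONG;

/* Records which family of stand-in routines ran last (demonstration only). */
enum PioFamily { FamilyNone, FamilyPort, FamilyRegister };
static PioFamily g_lastFamily = FamilyNone;

/* User-mode stand-ins for the HAL ULONG access functions. */
static ULONG stubReadPort(volatile ULONG* a)               { g_lastFamily = FamilyPort; return *a; }
static void  stubWritePort(volatile ULONG* a, ULONG v)     { g_lastFamily = FamilyPort; *a = v; }
static ULONG stubReadRegister(volatile ULONG* a)           { g_lastFamily = FamilyRegister; return *a; }
static void  stubWriteRegister(volatile ULONG* a, ULONG v) { g_lastFamily = FamilyRegister; *a = v; }

/* The hand-built "vtable": a plain structure of function pointers. */
struct PioOps {
    ULONG (*read)(volatile ULONG* address);
    void  (*write)(volatile ULONG* address, ULONG value);
};

/* One table per address space, shared by every device resource. */
static const PioOps g_portOps     = { stubReadPort, stubWritePort };
static const PioOps g_registerOps = { stubReadRegister, stubWriteRegister };

/* Per-resource state, filled in once at start-device time. */
struct DeviceResource {
    const PioOps*   ops;   /* family chosen from the translated resource */
    volatile ULONG* base;  /* address handed to the access routines */
};

static void initResource(DeviceResource* r, int translatedIsPort, volatile ULONG* base) {
    r->ops  = translatedIsPort ? &g_portOps : &g_registerOps;
    r->base = base;
}

/* Accessors: one indirect call through the table selected above. */
static ULONG resourceRead(const DeviceResource* r, unsigned index) {
    assert(r->ops != 0);
    return r->ops->read(r->base + index);
}

static void resourceWrite(const DeviceResource* r, unsigned index, ULONG value) {
    assert(r->ops != 0);
    r->ops->write(r->base + index, value);
}
```

The cost per access is the same kind of indirect call the virtual function version makes. The difference is only that the table is built by hand.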

You might find the overhead of this mechanism unacceptable. In reality, you can safely ignore runtime switching for HAL PIO operations for REGISTER device resources, as they will never get mapped to PORT space (on any current or future platform). You could continue to directly use the HAL PIO functions for all REGISTER resources. So the issue is really what to do with PORT resources. The correct thing to do is to provide a runtime switch, such as I outlined above, and to use it at least for all PORT device resources. For code clarity you might choose to use the switch mechanism for both sets of resources.

About the author:

Mark Roddy is an independent consultant specializing in Windows NT kernel software development. Mark has been working exclusively in the NT kernel since 1994, with a focus on storage subsystems and highly reliable computer platforms. In addition to software development, he has been training developers since 1996, and currently works with Azius to provide Windows NT device driver training.