Writing an AHCI Driver

Published on January 9, 2024

Now that I’ve wrapped up the 0.1.0 Release of AcadiaOS I’m looking to clean up some of the “just get it working” hacks that exist in the codebase. First up on that list is the AHCI Driver.

What is AHCI

AHCI stands for Advanced Host Controller Interface and if you like acronyms boy are you in for a treat. AHCI is a way to interface with SATA (which replaced PATA (a.k.a. IDE)) via its HBA. AHCI has since been superseded by NVMe but is simpler to implement (or so I’ve been told) so I’ve started here.

To try to explain it without acronym soup, AHCI allows you to access disk drives and optical drives (SATA devices) by writing relevant ATA commands to memory addresses that are backed by hardware firmware. There are a wide variety of commands available but best I can tell the main ones used these days are to identify the device and read/write via direct memory access (DMA).

Essentially you give the device an offset to read from as well as a physical memory address to write to. The device firmware copies the amount of data you requested to the physical address then triggers an interrupt to indicate that the operation is complete. Writing via DMA works the same way but in reverse.

Disclaimer, all of the above is basically just summarizing Wikipedia and the OSDev wiki and I don’t really know what I’m talking about.

Current State

The current AHCI implementation in Denali is straightforward but very brittle. It relies on everything following the happy path and is cobbled together based more on trial and error of what worked than on following the specification closely.

As a part of this article we’re going to dive into the related specs and look at how they relate to each other. The trickiest part of writing the driver is the fact that the necessary information is spread across several different specs rather than contained in one place. The specs I reference in this post are:

AHCI 1.3.1
SATA 3.2
ATA/ATAPI Command Set 3 (ACS-3)

The SATA and ACS specs cost money so I can’t link them directly but it isn’t hard to find drafts of them available online.

How AHCI Works

AHCI allows you to control SATA devices by writing commands to memory. The layout of these structures is nicely shown in the AHCI Spec Figure 4:

AHCI Memory

There are several pieces here that I’ve annotated:

The Generic Host Control (GHC) is a set of registers that allow you to manage the whole controller and get its status. These registers are referred to in the spec using GHC.RegisterName so the interrupt status register for instance is “GHC.IS” for short.
Each device (hard disk or optical disc drive) that is attached to the controller is exposed as a “port” with a set of registers to control it individually. These registers are referred to as PxRegisterName so for instance the command issue register is PxCI.
Each port has its own separately allocated piece of memory that can hold up to 32 “commands” to execute.
When the controller is finished executing a command for a device it will write a Frame Information Structure (FIS) and raise an interrupt. (The GHC has a register, GHC.IS, that shows which ports have a pending interrupt.)
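As a rough sketch of that layout, the register blocks can be described with two plain structs. The field names are hypothetical labels loosely following the ones used in the snippets in this post (they are not copied from Denali); the offsets are the ones listed in AHCI 1.3.1 sections 3.1 and 3.3. A real driver would access these through volatile pointers into the mapped ABAR region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Per-port register block, 0x80 bytes each (AHCI 1.3.1 section 3.3).
struct HbaPort {
  uint64_t command_list_base;   // 0x00 PxCLB + PxCLBU
  uint64_t fis_base;            // 0x08 PxFB + PxFBU
  uint32_t interrupt_status;    // 0x10 PxIS
  uint32_t interrupt_enable;    // 0x14 PxIE
  uint32_t command;             // 0x18 PxCMD
  uint32_t reserved0;           // 0x1C
  uint32_t task_file_data;      // 0x20 PxTFD
  uint32_t signature;           // 0x24 PxSIG
  uint32_t sata_status;         // 0x28 PxSSTS
  uint32_t sata_control;        // 0x2C PxSCTL
  uint32_t sata_error;          // 0x30 PxSERR
  uint32_t sata_active;         // 0x34 PxSACT
  uint32_t command_issue;       // 0x38 PxCI
  uint32_t sata_notification;   // 0x3C PxSNTF
  uint32_t fis_switch_control;  // 0x40 PxFBS
  uint32_t device_sleep;        // 0x44 PxDEVSLP
  uint32_t reserved1[10];       // 0x48-0x6F
  uint32_t vendor[4];           // 0x70-0x7F
};

// Generic Host Control registers followed by up to 32 port blocks.
struct HbaMemory {
  uint32_t capabilities;         // 0x00 CAP
  uint32_t global_host_control;  // 0x04 GHC
  uint32_t interrupt_status;     // 0x08 IS
  uint32_t ports_implemented;    // 0x0C PI
  uint32_t version;              // 0x10 VS
  uint32_t ccc_control;          // 0x14 CCC_CTL
  uint32_t ccc_ports;            // 0x18 CCC_PORTS
  uint32_t em_location;          // 0x1C EM_LOC
  uint32_t em_control;           // 0x20 EM_CTL
  uint32_t capabilities_ext;     // 0x24 CAP2
  uint32_t bios_handoff;         // 0x28 BOHC
  uint8_t reserved[0xA0 - 0x2C];
  uint8_t vendor[0x100 - 0xA0];
  HbaPort ports[32];             // 0x100 onwards
};

static_assert(sizeof(HbaPort) == 0x80, "port block must be 0x80 bytes");
static_assert(offsetof(HbaPort, command_issue) == 0x38, "PxCI offset");
static_assert(offsetof(HbaMemory, ports) == 0x100, "ports start at 0x100");
```

The static_asserts are the useful part: they catch an accidentally misplaced field at compile time instead of at "why is the disk not answering" time.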

Each device can support up to 32 pending commands (but only receive one FIS at a time). The memory structure is as follows:

Port Memory

To issue a command we can:

Select a command slot that isn’t currently in use.
Write the command to that command table (the specifics of this are explained later) and then set that command as active in the commands issued register for that port (PxCI).
Wait for an interrupt indicating that this command has finished and read the resulting FIS. We mostly just care about the status byte. The controller will unset the bit for the completed command in the PxCI register when it raises the interrupt so we can disambiguate which command has finished.

Although this sounds straightforward there are a few moving parts here that we need to get up and running.

We need to find where the HBA structure is in memory.
We need to allocate some space for the command tables and received FIS structures.
We need to set up interrupt handling for each port.
Before this we will issue a hardware reset command to the controller to get everything in a clean state.

Finding the HBA structure

We find the HBA structure by iterating through the PCI configuration space and finding the AHCI device. I’m not going to delve too deeply into this because PCI could be a whole separate post and it is a little trickier to explain because it is harder to access the specifications (the good people at the PCI “Special Interest Group” are happy to let you download the PDF for the low low price of $4500 if you are not a member).

The short story is that we are looking for the device with the right class code - Class Code 0x1 (Storage Device), Subclass 0x6 (SATA Controller), Subtype 0x1 (AHCI).

Once we have the correct configuration space we can read the address at offset 0x24 (called the ABAR for AHCI Base Address) which points to the start of the GHC registers.
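As a small sketch of that check, assuming we have already read the 32-bit class register at configuration offset 0x08 and the BAR at offset 0x24: the helper names below are hypothetical, and what this post calls the “Subtype” is the PCI programming interface byte.

```cpp
#include <cassert>
#include <cstdint>

// Class register at config offset 0x08:
// bits 31:24 base class, 23:16 subclass, 15:8 programming interface,
// bits 7:0 revision ID.
constexpr bool IsAhciController(uint32_t class_register) {
  uint8_t base_class = (class_register >> 24) & 0xFF;
  uint8_t subclass = (class_register >> 16) & 0xFF;
  uint8_t prog_if = (class_register >> 8) & 0xFF;
  return base_class == 0x01 && subclass == 0x06 && prog_if == 0x01;
}

// The low 4 bits of a memory BAR are type/prefetch flags rather than
// address bits, so mask them off before mapping the region.
constexpr uint64_t AbarAddress(uint32_t bar5) {
  return static_cast<uint64_t>(bar5 & ~0xFu);
}
```

The revision byte is ignored on purpose, so the same check matches any revision of the controller.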

We can mostly ignore the other information in the configuration space for now as we aren’t dealing with Message Signaled Interrupts yet.

Hardware Reset

Now that we have found the Generic Host Control registers we’re going to initiate a hardware reset of the AHCI Controller. The advantage of this is we will know the exact state that the controller and its ports are in. It also ensures that we aren’t dependent on the specific way limine uses the AHCI controller.

I suspect that this is not how most production operating systems handle things but this should give us a clean slate for now.

From the AHCI spec:

10.4.3 HBA Reset

If the HBA becomes unusable for multiple ports, and a software reset or port reset does not correct the problem, software may reset the entire HBA by setting GHC.HR to ‘1’. When software sets the GHC.HR bit to ‘1’, the HBA shall perform an internal reset action. The bit shall be cleared to ‘0’ by the HBA when the reset is complete. A software write of ‘0’ to GHC.HR shall have no effect. To perform the HBA reset, software sets GHC.HR to ‘1’ and may poll until this bit is read to be ‘0’, at which point software knows that the HBA reset has completed.

If the HBA has not cleared GHC.HR to ‘0’ within 1 second of software setting GHC.HR to ‘1’, the HBA is in a hung or locked state.

When GHC.HR is set to ‘1’, GHC.AE, GHC.IE, the IS register, and all port register fields (except PxFB/PxFBU/PxCLB/PxCLBU) that are not HwInit in the HBA’s register memory space are reset. The HBA’s configuration space and all other global registers/bits are not affected by setting GHC.HR to ‘1’. Any HwInit bits in the port specific registers are not affected by setting GHC.HR to ‘1’. The port specific registers PxFB, PxFBU, PxCLB, and PxCLBU are not affected by setting GHC.HR to ‘1’. If the HBA supports staggered spin-up, the PxCMD.SUD bit will be reset to ‘0’; software is responsible for setting the PxCMD.SUD and PxSCTL.DET fields appropriately such that communication can be established on the Serial ATA link. If the HBA does not support staggered spin-up, the HBA reset shall cause a COMRESET to be sent on the port.

Despite the long text, this process is fairly straightforward. We set the Hardware Reset bit and then poll for it to be set to 0. We then set the AHCI enable bit. For now we can leave interrupts disabled until we have reset the ports. Once this is done we sleep for a few milliseconds to allow the ports time to spin up. For now we are just using 50ms because that is the smallest resolution we support sleeping for (1 scheduling tick) but I think theoretically we could sleep for only a millisecond or two.

ahci_hba_->global_host_control |= kGlobalHostControl_HW_Reset;

while (ahci_hba_->global_host_control & kGlobalHostControl_HW_Reset) {
  continue;
}

ahci_hba_->global_host_control |= kGlobalHostControl_AHCI_Enable;

return static_cast<glcr::ErrorCode>(ZThreadSleep(50));
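The spec text above also says that a controller which has not cleared GHC.HR within 1 second is hung, which the loop above does not check for. Here is a minimal sketch of a bounded poll, written against a hypothetical register accessor (anything with Read() and Write()) so that it can be exercised without hardware. FakeGhc and the constant names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kGhcHbaReset = 1u << 0;     // GHC.HR
constexpr uint32_t kGhcAhciEnable = 1u << 31;  // GHC.AE

// Sets GHC.HR, polls until the HBA clears it, then sets GHC.AE.
// Returns false if the bit is still set after timeout_ms polls.
// sleep_ms is whatever "sleep for N milliseconds" primitive is available.
template <typename Reg, typename SleepFn>
bool ResetHba(Reg& ghc, SleepFn sleep_ms, uint32_t timeout_ms = 1000) {
  ghc.Write(ghc.Read() | kGhcHbaReset);
  for (uint32_t waited = 0; ghc.Read() & kGhcHbaReset; waited++) {
    if (waited >= timeout_ms) {
      return false;  // HBA is hung or locked per section 10.4.3.
    }
    sleep_ms(1);
  }
  ghc.Write(ghc.Read() | kGhcAhciEnable);
  return true;
}

// Simulated GHC register: clears HR after a fixed number of reads,
// or never if constructed with a negative count.
struct FakeGhc {
  explicit FakeGhc(int reads) : reads_until_clear(reads) {}
  uint32_t Read() {
    if (reads_until_clear == 0) value &= ~kGhcHbaReset;
    if (reads_until_clear > 0) reads_until_clear--;
    return value;
  }
  void Write(uint32_t v) { value = v; }
  uint32_t value = 0;
  int reads_until_clear;
};
```

With a coarse scheduler tick like the 50ms one mentioned above, the timeout would be counted in ticks rather than milliseconds, but the shape of the loop stays the same.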

Port Initialization

Now we can initialize each port that is implemented. There are two cases we need to handle. Either the port has received a COMRESET and is running, or staggered spin up is supported and we need to enable the port. As our VM doesn’t require staggered spin up, we will skip it for now and come back to it in the future.

Before initializing each port we need to check if it has a device attached. We can do that by checking the PxSSTS register described in the AHCI spec section 3.3.10.

AHCI 1.3.1 Section 3.3.10

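We are looking for a value of 0x103 once the speed field is masked out: 0x1 in the IPM field (bits 11:8) indicates that the interface is in the active state and 0x3 in the DET field (bits 3:0) indicates that a device is detected and communication is established. The SPD field in bits 7:4 holds the negotiated link speed, so the comparison should be made after masking the register with 0xF0F rather than against the raw value. For each port where we detect this value we continue the initialization process.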
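A small decoding sketch for PxSSTS, using the field positions from AHCI 1.3.1 section 3.3.10. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct SataStatus {
  uint8_t detection;         // DET, bits 3:0
  uint8_t speed;             // SPD, bits 7:4
  uint8_t power_management;  // IPM, bits 11:8
};

constexpr SataStatus DecodeSataStatus(uint32_t ssts) {
  return SataStatus{
      static_cast<uint8_t>(ssts & 0xF),
      static_cast<uint8_t>((ssts >> 4) & 0xF),
      static_cast<uint8_t>((ssts >> 8) & 0xF),
  };
}

// True when a device is present with the link up (DET == 3) and the
// interface is in the active power state (IPM == 1). The negotiated
// speed is deliberately ignored.
constexpr bool DeviceReady(uint32_t ssts) {
  SataStatus s = DecodeSataStatus(ssts);
  return s.detection == 0x3 && s.power_management == 0x1;
}
```

For example, DeviceReady(0x113) and DeviceReady(0x123) are both true, while a port with nothing attached (0x000) is rejected.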

Memory Structures

We need to initialize the memory structures for each active port as shown in the image before (under How AHCI works).

We need a command list structure of length 0x400 (technically it need not be that long if fewer than 32 commands are supported but it doesn’t use much additional memory). Additionally a spot is needed for the received FIS structure of length 0x100. Finally each of the 32 commands in the command list must point to a command table. Technically these can be quite large because each can hold up to 2^16 physical region descriptors (using ~1 MiB of memory). I’ve opted to limit it to just 8 16-byte descriptors so each command table would be length 0x100 as well. For now we don’t support scatter gather buffers and just allocate one contiguous memory section for each read.

In total all of these memory structures take 0x2500 bytes (3 pages of RAM). We allocate them all in one block and manually set up the pointers to their physical addresses in the HBA port control.

// 0x0-0x400 -> Command List
// 0x400-0x500 -> Received FIS
// 0x500-0x2500 -> Command Tables (0x100 each) (Max PRDT Length is 8 for now)
uint64_t paddr;
command_structures_ =
    mmth::OwnedMemoryRegion::ContiguousPhysical(0x2500, &paddr);
command_list_ = reinterpret_cast<CommandList*>(command_structures_.vaddr());
port_struct_->command_list_base = paddr;

received_fis_ =
    reinterpret_cast<ReceivedFis*>(command_structures_.vaddr() + 0x400);
port_struct_->fis_base = paddr + 0x400;
port_struct_->command |= kCommand_FIS_Receive_Enable;

command_tables_ = glcr::ArrayView(
    reinterpret_cast<CommandTable*>(command_structures_.vaddr() + 0x500), 32);

for (uint64_t i = 0; i < 32; i++) {
  // Each 0x100 byte command table leaves space for 8 PRDT entries.
  command_list_->command_headers[i].command_table_base_addr =
      (paddr + 0x500) + (0x100 * i);
  commands_[i] = nullptr;
}
port_struct_->interrupt_enable = 
    kInterrupt_D2H_FIS | kInterrupt_PIO_FIS | kInterrupt_DMA_FIS |
    kInterrupt_DeviceBits_FIS | kInterrupt_Unknown_FIS;
port_struct_->sata_error = -1;
port_struct_->command |= kCommand_Start;

There are a few other things going on here. Once we allocate the space to receive FIS structures we let the port know that it can send FISes using the PxCMD register.

Additionally at the end we enable interrupts, clear the error register, and tell the port it can start processing commands.
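The size arithmetic for that block can be written down as constants and checked at compile time. This is only a sketch with hypothetical constant names. The 0x80-byte command table header (0x40 command FIS, 0x10 ATAPI command, 0x30 reserved) and the alignment rules (1 KiB for the command list, 256 bytes for the received FIS, 128 bytes for command tables) come from the AHCI spec, and the block is assumed to start on a page boundary.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kCommandSlots = 32;
constexpr uint64_t kCommandHeaderSize = 0x20;
constexpr uint64_t kCommandListSize = kCommandSlots * kCommandHeaderSize;
constexpr uint64_t kReceivedFisSize = 0x100;

constexpr uint64_t kPrdtEntries = 8;
constexpr uint64_t kPrdtEntrySize = 0x10;
constexpr uint64_t kCommandTableHeaderSize = 0x80;
constexpr uint64_t kCommandTableSize =
    kCommandTableHeaderSize + kPrdtEntries * kPrdtEntrySize;

// Offsets within the single contiguous allocation.
constexpr uint64_t kReceivedFisOffset = kCommandListSize;
constexpr uint64_t kCommandTablesOffset = kReceivedFisOffset + kReceivedFisSize;
constexpr uint64_t kTotalSize =
    kCommandTablesOffset + kCommandSlots * kCommandTableSize;

constexpr uint64_t CommandTableOffset(uint64_t slot) {
  return kCommandTablesOffset + slot * kCommandTableSize;
}

constexpr uint64_t PagesNeeded(uint64_t bytes) {
  return (bytes + 0xFFF) / 0x1000;
}

static_assert(kCommandListSize == 0x400, "32 headers of 0x20 bytes");
static_assert(kCommandTableSize == 0x100, "0x80 header + 8 PRDT entries");
static_assert(kTotalSize == 0x2500, "matches the layout comment");
static_assert(kReceivedFisOffset % 0x100 == 0, "FIS area is 256-byte aligned");
static_assert(kCommandTablesOffset % 0x80 == 0, "tables are 128-byte aligned");
static_assert(kCommandTableSize % 0x80 == 0, "every table stays aligned");
```

Changing kPrdtEntries later (for scatter gather support, say) then updates every offset in one place.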

Interrupt Handling

Now that the device is initialized we can actually begin to send it commands. To do so we need to register an interrupt handler with the correct PCI interrupt line (for now we will use the direct interrupt line rather than Message Signaled Interrupts). Registering interrupt handlers is a whole other beast so for this post we will just focus on their implementation.

The first step is to de-multiplex the interrupt in the controller by checking the interrupt status register. Each port that has an interrupt pending will raise its corresponding bit in the Interrupt Status register. We can delegate to each port the handling of an interrupt, then clear the interrupt bit once it is done. The relevant code in this case looks like this:

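for (uint64_t i = 0; i < num_ports_; i++) {
  if (!ports_[i].empty() && (ahci_hba_->interrupt_status & (1u << i))) {
    ports_[i]->HandleIrq();
    // IS is write-1-to-clear: writing only this port's bit acknowledges it
    // and leaves the pending bits of other ports alone.
    ahci_hba_->interrupt_status = (1u << i);
  }
}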

Then on the port side we can handle the interrupt. This requires determining what kind of interrupt was generated using the port’s Interrupt Status register (PxIS). Each of the 17 defined bits in this register corresponds to a different interrupt type and can be individually enabled and disabled using the port’s Interrupt Enable register (PxIE). For now, as configured when setting up the port, we will only handle the interrupts related to receiving FISes from the device.

void AhciPort::HandleIrq() {
  uint32_t int_status = port_struct_->interrupt_status;
  port_struct_->interrupt_status = int_status;

  bool has_error = false;
  if (int_status & kInterrupt_D2H_FIS) {
    dbgln("D2H Received");
    // Device to host.
    volatile DeviceToHostRegisterFis& fis =
        received_fis_->device_to_host_register_fis;
    if (!CheckFisType(FIS_TYPE_REG_D2H, fis.fis_type)) {
      return;
    }
    if (fis.error) {
      dbgln("D2H err: {x}", fis.error);
      dbgln("status: {x}", fis.status);
      has_error = true;
    }
  }
  if (int_status & kInterrupt_PIO_FIS) {
    // Like above ...
  }
  if (int_status & kInterrupt_DMA_FIS) {
    // Like above ...
  }
  // ...
}

To handle the interrupt we read the raised interrupts from the PxIS register and write the values back to it to clear them. Then we can specify how to handle each type of interrupt that we receive. For now we will just debug print the type and any errors from the interrupt since we aren’t sending any commands.
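The “write the value back to clear it” idiom works because the interrupt status bits are write-1-to-clear. Below is a toy software model of such a register (a hypothetical class, not real MMIO) showing why writing back exactly what was read acknowledges only the interrupts we actually observed.

```cpp
#include <cassert>
#include <cstdint>

// Minimal model of a write-1-to-clear (W1C) register.
class WriteOneToClear {
 public:
  // What the hardware does when an event occurs.
  void HardwareRaise(uint32_t bits) { value_ |= bits; }
  uint32_t Read() const { return value_; }
  // What a software write does: every 1 bit clears the matching bit,
  // every 0 bit leaves it untouched.
  void Write(uint32_t bits) { value_ &= ~bits; }

 private:
  uint32_t value_ = 0;
};

// Read the pending bits and acknowledge exactly those bits.
inline uint32_t ReadAndAcknowledge(WriteOneToClear& reg) {
  uint32_t seen = reg.Read();
  reg.Write(seen);
  return seen;
}
```

The important property is that an interrupt raised between the read and the write is not lost: its bit was not in the value written back, so it stays pending for the next pass. A read-modify-write such as `reg &= ~bit` does the opposite on a W1C register, clearing every pending bit except the one it meant to clear.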

Something I’m not sure about is that as soon as we enable interrupts we seem to receive a FIS from the device with an error bit set. Both the hard drive and the optical drive on QEMU send a FIS with error bit 0x1 set. Additionally the status field is set to 0x30 for the hard drive and 0x70 for the optical drive.

I was able to find an OSDev Forum post referencing that this behavior is caused by the reset sending an EXECUTE DEVICE DIAGNOSTIC command (0x90) to the device. It notes that this is largely undocumented behavior but at least this information offers some clarity on the outputs. Reading the ATA Command Set section 7.9.4 we can see that the command outputs code 0x01 to the error bits when Device 0 passed, Device 1 passed or not present. According to a footnote we can “See the appropriate transport standard for the definition of device 0 and device 1.” I really thought I was already looking at the “appropriate transport standard” but alas. All that to say we’ll just ignore this interrupt for now.

Sending a Command

Now that the AHCI ports are initialized and can handle an interrupt, we can send commands to them. To start with let’s send the IDENTIFY DEVICE command to each device. This command asks the device to send 512 bytes of information about itself back to us. These bytes contain 40 years of certified-crufty backwards compatibility. I mean just feast your eyes on the number of retired and obsolete fields in just the first page of the spec.

IDENTIFY DEVICE Response

We’ll ignore almost all of this information and just try to get the sector size and sector count from the drive. To do so we need to figure out how to send a command to the device. To be honest I feel like the specs fall down here in actually explaining this. The trick is to send a Register Host to Device FIS in one of the command slots. This FIS type has a field for the command as well as some common parameters such as LBA and count. In retrospect it is fairly clear once you are aware of it, but if you are just reading the SATA spec and looking at the possible commands, making the logical jump to the Register Host To Device FIS feels damn near impossible.

First up we choose an empty command slot to use:

uint64_t slot;
for (slot = 0; slot < 32; slot++) {
  if (!(commands_issued_ & (1 << slot))) {
    break;
  }
}
if (slot == 32) {
  dbgln("All slots full");
  return glcr::INTERNAL;
}

The commands_issued_ variable is just for our own accounting of which slots are currently in use by another command.
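A slightly more defensive variant of the slot search, as a sketch only: it also treats slots still set in the port’s PxCI register as busy and takes the number of supported slots as a parameter instead of assuming 32. The function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Returns the lowest slot that is free both in the driver's own
// bookkeeping and in the value read from PxCI, or nullopt if none is.
inline std::optional<uint32_t> FindFreeSlot(uint32_t driver_in_use,
                                            uint32_t port_command_issue,
                                            uint32_t slot_count) {
  uint32_t busy = driver_in_use | port_command_issue;
  for (uint32_t slot = 0; slot < slot_count && slot < 32; slot++) {
    // Use an unsigned literal so that shifting into bit 31 is well defined.
    if (!(busy & (1u << slot))) {
      return slot;
    }
  }
  return std::nullopt;
}
```

Returning an optional rather than a sentinel value keeps the “all slots full” case from being mistaken for slot 32.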

Next we can populate the FIS for that slot. The spec for the Register Host to Device FIS is as follows:

We don’t need to initialize most of the fields here because the IDENTIFY_DEVICE call doesn’t rely on an LBA or sector count. One of the keys is setting the high bit “C” in the byte that contains PM Port which indicates to the HBA that this FIS contains a new command (I spent a while trying to figure out why this wasn’t working without that). The code for this is relatively straightforward.

auto* fis = reinterpret_cast<HostToDeviceRegisterFis*>(
    command_tables_[slot].command_fis);
*fis = HostToDeviceRegisterFis{
    .fis_type = FIS_TYPE_REG_H2D,
    .pmp_and_c = 0x80,
    .command = kIdentifyDevice, // 0xEC
};
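For reference, here is one way the 20-byte Register Host to Device FIS could be laid out as a struct. The fields used in the snippet above keep their names; the remaining names (features_low, features_high, icc, control, reserved) and the helper function are hypothetical. The byte positions follow the Register Host to Device FIS layout in the SATA spec.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint8_t kFisTypeRegH2D = 0x27;
constexpr uint8_t kCommandBit = 0x80;  // "C" bit: FIS carries a new command.

struct HostToDeviceRegisterFis {
  uint8_t fis_type;       // byte 0
  uint8_t pmp_and_c;      // byte 1: PM port in bits 3:0, C in bit 7
  uint8_t command;        // byte 2
  uint8_t features_low;   // byte 3
  uint8_t lba0;           // byte 4
  uint8_t lba1;           // byte 5
  uint8_t lba2;           // byte 6
  uint8_t device;         // byte 7
  uint8_t lba3;           // byte 8
  uint8_t lba4;           // byte 9
  uint8_t lba5;           // byte 10
  uint8_t features_high;  // byte 11
  uint16_t count;         // bytes 12-13
  uint8_t icc;            // byte 14
  uint8_t control;        // byte 15
  uint32_t reserved;      // bytes 16-19
};
static_assert(sizeof(HostToDeviceRegisterFis) == 20, "FIS is 5 dwords");
static_assert(offsetof(HostToDeviceRegisterFis, count) == 12, "count offset");

// Builds a command FIS with every other field zeroed.
inline HostToDeviceRegisterFis MakeCommandFis(uint8_t command) {
  HostToDeviceRegisterFis fis{};
  fis.fis_type = kFisTypeRegH2D;
  fis.pmp_and_c = kCommandBit;
  fis.command = command;
  return fis;
}
```

Calling MakeCommandFis(0xEC) gives the same FIS as the IDENTIFY DEVICE snippet above.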

We also need to let the HBA know where it can put the result in memory. For this we use the physical region descriptor table corresponding to this command slot. As described before, for simplicity we are only using a single entry for now. We allocate a 512 byte memory region and set its physical address and size in the first entry of the command slot’s PRDT.

uint64_t paddr;
auto region =
    mmth::OwnedMemoryRegion::ContiguousPhysical(0x200, &paddr);
command_tables_[slot].prdt[0].region_address = paddr;
// The PRD byte count field is zero-based per the AHCI spec: 0x1FF is 512 bytes.
command_tables_[slot].prdt[0].byte_count = 0x200 - 1;
command_list_->command_headers[slot].prd_table_length = 1;
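One detail that is easy to miss in the AHCI spec is that the byte count stored in a PRDT entry is zero-based (a stored value of 1 means 2 bytes), must describe an even number of bytes, and tops out at 4 MiB per entry. A sketch of an entry layout and an encoding helper follows; the struct mirrors the field names used above, while the helper and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

struct PhysicalRegionDescriptor {
  uint64_t region_address;  // DBA + DBAU: physical address, word aligned
  uint32_t reserved;
  uint32_t byte_count;      // bits 21:0 zero-based count, bit 31 interrupt
};
static_assert(sizeof(PhysicalRegionDescriptor) == 16, "PRD entry is 16 bytes");

constexpr uint32_t kMaxPrdBytes = 4u * 1024 * 1024;

// Converts a transfer length in bytes to the value stored in the entry.
// Returns nullopt for lengths a single entry cannot describe.
inline std::optional<uint32_t> EncodeByteCount(uint32_t bytes) {
  if (bytes == 0 || bytes > kMaxPrdBytes || (bytes & 1) != 0) {
    return std::nullopt;
  }
  return bytes - 1;
}
```

So a 512 byte IDENTIFY DEVICE buffer is stored as 0x1FF, and a full 4 MiB entry as 0x3FFFFF.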

All that is left to do is to issue the command! We set the size of the command FIS (in double words for some reason?) as well as let the HBA know it can prefetch the data from memory. Then we set the bit for this command slot in the PxCI register which will cause the device to start processing it.

// Set the command FIS length (in double words).
command_list_->command_headers[slot].command =
    (sizeof(HostToDeviceRegisterFis) / 4) & 0x1F;

// Set prefetch bit.
command_list_->command_headers[slot].command |= (1 << 7);

// TODO: Synchronization-wise we need to ensure this is set in the same
// critical section as where we select a slot.
commands_issued_ |= (1 << slot);
port_struct_->command_issue |= (1 << slot);
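The low 16 bits of the first command header dword pack several flags next to the FIS length. A sketch of a helper that builds them, assuming the bit positions from the AHCI command header definition (CFL in bits 4:0, Write in bit 6, Prefetchable in bit 7); the function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t kHeaderWrite = 1u << 6;     // W: data goes host to device
constexpr uint16_t kHeaderPrefetch = 1u << 7;  // P: HBA may prefetch PRDs

// fis_bytes is the size of the command FIS in bytes; the header stores
// it in dwords in a 5 bit field.
constexpr uint16_t MakeCommandHeaderFlags(uint32_t fis_bytes, bool write,
                                          bool prefetch) {
  uint16_t flags = static_cast<uint16_t>((fis_bytes / 4) & 0x1F);
  if (write) {
    flags = static_cast<uint16_t>(flags | kHeaderWrite);
  }
  if (prefetch) {
    flags = static_cast<uint16_t>(flags | kHeaderPrefetch);
  }
  return flags;
}
```

For the 20 byte Register Host to Device FIS with prefetch set this yields 0x85, the same value the two assignments above produce.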

But wait! How will we know when this command has completed? We somehow need to wait until we receive an interrupt for this command to process the data it sent. To handle this we can add a semaphore for each port command slot to allow signalling when we receive a completion interrupt for that command. I think it might make sense to have some sort of callback instead so we can pass errors back to the caller instead of just a completion signal. However I’m not sure what type of errors exist that are resolvable by the caller so for now this works.

void IdentifyDevice() {
...
  // Issue command.
  commands_issued_ |= (1 << slot);
  port_struct_->command_issue |= (1 << slot);

  command_signals_[slot].Wait();

  // Continue processing.
...
}

void AhciPort::HandleIrq() {
  uint32_t int_status = port_struct_->interrupt_status;
  port_struct_->interrupt_status = int_status;

...
  // Parse received FIS.
...

  uint32_t commands_finished = commands_issued_ & ~port_struct_->command_issue;

  for (uint64_t i = 0; i < 32; i++) {
    if (commands_finished & (1 << i)) {
      command_signals_[i].Signal();
      commands_issued_ &= ~(1 << i);
    }
  }
}
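To make the bit arithmetic in the handler concrete, here is the same completion check as a pair of pure functions that can be tried on example masks. The names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// A command has finished when we issued it but the HBA has since
// cleared its bit in PxCI.
constexpr uint32_t FinishedCommands(uint32_t issued_by_driver,
                                    uint32_t port_command_issue) {
  return issued_by_driver & ~port_command_issue;
}

// Calls fn(slot) for every set bit in mask, lowest slot first.
template <typename Fn>
void ForEachSetBit(uint32_t mask, Fn fn) {
  for (uint32_t slot = 0; slot < 32; slot++) {
    if (mask & (1u << slot)) {
      fn(slot);
    }
  }
}
```

If we issued slots 0, 1 and 3 (0b1011) and PxCI still shows slot 1 (0b0010), then slots 0 and 3 have finished (0b1001) and only their waiters should be signalled.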

OK now that we have retrieved the information from the drive we can parse it. Note that the IDENTIFY DEVICE data is indexed in 16-bit words. For the sector size, the default is 512 bytes which we will use unless the LOGICAL SECTOR SIZE SUPPORTED bit is set in word 106, bit 12. If that is set we can read words 117 and 118 to get the 32 bit logical sector size. The spec expresses this value in 16-bit words, so we double it to get bytes. For the sector count, we need to check if the device supports 48 bit addressing using word 83 bit 10. If it does we can get the number of sectors from the 4 words starting at 100. Otherwise we read the number of sectors from the 2 words starting at index 60.

  uint16_t* ident = reinterpret_cast<uint16_t*>(region.vaddr());
  if (ident[106] & (1 << 12)) {
    // Words 117-118 hold the logical sector size in 16-bit words, not bytes.
    // Word 117 is not 4-byte aligned, so combine the two halves by hand.
    sector_size_ = 2 * (ident[117] | (static_cast<uint32_t>(ident[118]) << 16));
  } else {
    sector_size_ = 512;
  }

  if (ident[83] & (1 << 10)) {
    lba_count_ = *reinterpret_cast<uint64_t*>(ident + 100);
  } else {
    lba_count_ = *reinterpret_cast<uint32_t*>(ident + 60);
  }
  dbgln("Sector size: {x}", sector_size_);
  dbgln("LBA Count: {x}", lba_count_);
  is_init_ = true;
}

You might be rightfully thinking that it would be less brittle to make a struct definition that we could point at this address which would implicitly contain these offsets - and you would be correct. But to be honest, I can’t be bothered to create a 256 entry struct definition just to get these values. Maybe in the future.
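A middle ground that avoids the 256 entry struct is to name just the word indices we care about and read multi-word fields through a small helper. This is a sketch with hypothetical names. It additionally checks the validity bits of word 106 (bit 14 set, bit 15 clear), which the ATA command set requires before trusting the rest of that word.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWordSectorCount28 = 60;       // 2 words
constexpr int kWordFeatures83 = 83;          // bit 10: 48 bit addressing
constexpr int kWordSectorCount48 = 100;      // 4 words
constexpr int kWordSectorInfo = 106;         // bit 12: long logical sectors
constexpr int kWordLogicalSectorSize = 117;  // 2 words, in 16-bit words

struct DiskGeometry {
  uint32_t sector_size_bytes;
  uint64_t sector_count;
};

// Combines `count` consecutive 16-bit words, lowest word first.
inline uint64_t ReadWords(const uint16_t* ident, int first, int count) {
  uint64_t value = 0;
  for (int i = count - 1; i >= 0; i--) {
    value = (value << 16) | ident[first + i];
  }
  return value;
}

inline DiskGeometry ParseIdentify(const uint16_t* ident) {
  DiskGeometry geometry{512, 0};
  uint16_t info = ident[kWordSectorInfo];
  bool info_valid = (info & 0xC000) == 0x4000;
  if (info_valid && (info & (1u << 12))) {
    // The field counts 16-bit words, so double it to get bytes.
    geometry.sector_size_bytes =
        static_cast<uint32_t>(ReadWords(ident, kWordLogicalSectorSize, 2)) * 2;
  }
  if (ident[kWordFeatures83] & (1u << 10)) {
    geometry.sector_count = ReadWords(ident, kWordSectorCount48, 4);
  } else {
    geometry.sector_count = ReadWords(ident, kWordSectorCount28, 2);
  }
  return geometry;
}
```

Reading word by word also sidesteps the unaligned 32 bit load at word 117, which sits at byte offset 234.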

Reading Data

Now that we have the ability to read the IDENTIFY DEVICE data we are only a short hop, skip, and jump away from reading data from the drive. The main differences when reading data are (a) the command number, (b) we must specify the Logical Block Address (LBA) we want to read from and the number of sectors to read, and (c) we need to dynamically size the entry in the Physical Region Descriptor Table (we will still use only one entry for now).

Because much of this is similar we can fairly easily create a shared struct with the necessary information and construct the requests in parallel.

struct CommandInfo {
  uint8_t command;
  uint64_t lba;
  uint32_t sectors;
  uint64_t paddr;
};

Then from that we can create an IssueCommand function that constructs the Register Host to Device FIS in a similar way for both. Before that I’d like to take this opportunity to point out how the LBA in this FIS is stored in a way that truly only a mother could love:

That aside we simply update the FIS construction to set the command, LBA, and sector count. Following that we set the PRDT values (although we still only use one slot).

auto* fis = reinterpret_cast<HostToDeviceRegisterFis*>(
    command_tables_[slot].command_fis);
*fis = HostToDeviceRegisterFis{
    .fis_type = FIS_TYPE_REG_H2D,
    .pmp_and_c = 0x80,
    .command = command.command,

    .lba0 = static_cast<uint8_t>(command.lba & 0xFF),
    .lba1 = static_cast<uint8_t>((command.lba >> 8) & 0xFF),
    .lba2 = static_cast<uint8_t>((command.lba >> 16) & 0xFF),
    .device = (1 << 6),  // ATA LBA Mode

    .lba3 = static_cast<uint8_t>((command.lba >> 24) & 0xFF),
    .lba4 = static_cast<uint8_t>((command.lba >> 32) & 0xFF),
    .lba5 = static_cast<uint8_t>((command.lba >> 40) & 0xFF),

    .count = command.sectors,
};

command_tables_[slot].prdt[0].region_address = command.paddr;
// The PRD byte count field is zero-based per the AHCI spec.
command_tables_[slot].prdt[0].byte_count = 512 * command.sectors - 1;
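To see the split LBA layout on its own, here is a sketch that packs a 48 bit LBA and a sector count straight into the byte positions of the 20-byte FIS, plus the inverse for checking. It works on a raw byte array rather than the struct so that the offsets are explicit; the function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using FisBytes = std::array<uint8_t, 20>;

// LBA bits 23:0 live in bytes 4-6, the device byte sits at 7,
// LBA bits 47:24 continue in bytes 8-10, and the count is at 12-13.
inline void PackLbaAndCount(FisBytes& fis, uint64_t lba, uint16_t sectors) {
  fis[4] = static_cast<uint8_t>(lba);
  fis[5] = static_cast<uint8_t>(lba >> 8);
  fis[6] = static_cast<uint8_t>(lba >> 16);
  fis[7] = 1u << 6;  // LBA mode
  fis[8] = static_cast<uint8_t>(lba >> 24);
  fis[9] = static_cast<uint8_t>(lba >> 32);
  fis[10] = static_cast<uint8_t>(lba >> 40);
  fis[12] = static_cast<uint8_t>(sectors);
  fis[13] = static_cast<uint8_t>(sectors >> 8);
}

inline uint64_t UnpackLba(const FisBytes& fis) {
  return static_cast<uint64_t>(fis[4]) |
         (static_cast<uint64_t>(fis[5]) << 8) |
         (static_cast<uint64_t>(fis[6]) << 16) |
         (static_cast<uint64_t>(fis[8]) << 24) |
         (static_cast<uint64_t>(fis[9]) << 32) |
         (static_cast<uint64_t>(fis[10]) << 40);
}
```

Packing LBA 0x665544332211 puts 0x11, 0x22, 0x33 in bytes 4 to 6 and 0x44, 0x55, 0x66 in bytes 8 to 10, with the device byte wedged in between.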

Then issuing either the identify device command or the read command is relatively straightforward:

// IDENTIFY DEVICE
CommandInfo identify{
    .command = kIdentifyDevice,
    .lba = 0,
    .sectors = 1,
    .paddr = 0,
};
auto region =
    mmth::OwnedMemoryRegion::ContiguousPhysical(0x200, &identify.paddr);
ASSIGN_OR_RETURN(auto* sem, IssueCommand(identify));
sem->Wait();

// DMA READ
CommandInfo dma_read{
    .command = kDmaReadExt,
    .lba = lba,
    .sectors = sector_cnt,
    .paddr = 0,
};
auto region =
    mmth::OwnedMemoryRegion::ContiguousPhysical(0x200 * sector_cnt, &dma_read.paddr);
ASSIGN_OR_RETURN(auto* sem, IssueCommand(dma_read));
sem->Wait();

From here the world is our oyster and we can read any arbitrary data from the disk. The bulk of this code isn’t actually all that long (~200 LOC in the AHCI Port implementation). However I probably added and deleted several times that much while trying to get everything working and refactored down to a nice interface.

Coming next

This is nowhere near a full implementation. Among the things we skipped that I plan to come back to at some point are:

Staggered spin up: In controllers that support this, each device is powered down after RESET and must be started individually.
Message Signaled Interrupts: The hot new way to handle PCI device interrupts. Has only been available since 1998 so support may vary.
Port Multiplier Support: Something that gets mentioned all over the specs but I’ve avoided even looking into until this moment. But it looks like it allows several devices behind a single port.
Scatter Gather buffers: For big files we may not always be able to find a sufficient contiguous chunk of physical memory. This means we may have to use more than one entry in the PRDT!
Error Handling & Retry: Even though QEMU may succeed in executing commands 100% of the time, real hardware may not and we should probably handle that.
Fewer than 32 commands supported: We kinda always assume that the device can handle 32 commands even though it may not (how many it supports is exposed in the GHC registers).
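On that last point, the slot count lives in the CAP register at the start of the generic host control block. A sketch of decoding it, with hypothetical function names: both fields are zero-based in the AHCI spec, so a raw value of 31 means 32.

```cpp
#include <cassert>
#include <cstdint>

// CAP.NCS, bits 12:8: number of command slots per port, minus one.
constexpr uint32_t CommandSlotCount(uint32_t cap) {
  return ((cap >> 8) & 0x1F) + 1;
}

// CAP.NP, bits 4:0: number of ports supported by the HBA, minus one.
constexpr uint32_t PortCount(uint32_t cap) {
  return (cap & 0x1F) + 1;
}
```

A driver could pass CommandSlotCount(cap) to its slot search instead of hard-coding 32.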

Tags:

osdev