Creating a full cache model

In this chapter, we will take the framework for a memory object we created in the last chapter and add caching logic to it.

SimpleCache SimObject

After creating the SConscript file, which you can download here, we can create the SimObject Python file. We will call this simple memory object SimpleCache and create its SimObject Python file in src/learning_gem5/simple_cache.

from m5.params import *
from m5.proxy import *
from MemObject import MemObject

class SimpleCache(MemObject):
    type = 'SimpleCache'
    cxx_header = "learning_gem5/simple_cache/simple_cache.hh"

    cpu_side = VectorSlavePort("CPU side port, receives requests")
    mem_side = MasterPort("Memory side port, sends requests")

    latency = Param.Cycles(1, "Cycles taken on a hit or to resolve a miss")

    size = Param.MemorySize('16kB', "The size of the cache")

    system = Param.System(Parent.any, "The system this cache is part of")

There are a couple of differences between this SimObject file and the one from the previous chapter. First, we have two extra parameters: a latency for cache accesses and the size of the cache. The chapter on parameters goes into more detail about these kinds of SimObject parameters.

Next, we include a System parameter, which is a pointer to the main system this cache is connected to. This is needed so we can get the cache block size from the system object when we are initializing the cache. To reference the system object this cache is connected to, we use a special proxy parameter. In this case, we use Parent.any.

In the Python config file, when a SimpleCache is instantiated, this proxy parameter searches through all of the parents of the SimpleCache instance to find a SimObject that matches the System type. Since we often use a System as the root SimObject, you will often see a system parameter resolved with this proxy parameter.

The third and final difference between the SimpleCache and the SimpleMemobj is that instead of having two named CPU ports (inst_port and data_port), the SimpleCache uses another special parameter: the VectorPort. VectorPorts behave similarly to regular ports (e.g., they are resolved via getMasterPort and getSlavePort), but they allow this object to connect to multiple peers. Then, in the resolution functions, the parameter we ignored before (PortID idx) is used to differentiate between the ports. By using a vector port, this cache can be connected into the system more flexibly than the SimpleMemobj.

Implementing the SimpleCache

Most of the code for the SimpleCache is the same as the SimpleMemobj. There are a couple of changes in the constructor and the key memory object functions.

First, we need to create the CPU side ports dynamically in the constructor and initialize the extra member variables based on the SimObject parameters.

SimpleCache::SimpleCache(SimpleCacheParams *params) :
    MemObject(params),
    latency(params->latency),
    blockSize(params->system->cacheLineSize()),
    capacity(params->size / blockSize),
    memPort(params->name + ".mem_side", this),
    blocked(false), outstandingPacket(nullptr), waitingPortId(-1)
{
    for (int i = 0; i < params->port_cpu_side_connection_count; ++i) {
        cpuPorts.emplace_back(name() + csprintf(".cpu_side[%d]", i), i, this);
    }
}

In this function, we use the cacheLineSize from the system parameters to set the blockSize for this cache. We also initialize the capacity based on the block size and the size parameter, and initialize other member variables we will need below. Finally, we must create a number of CPUSidePorts based on the number of connections to this object. Since the cpu_side port was declared as a VectorSlavePort in the SimObject Python file, the generated params struct automatically has a variable port_cpu_side_connection_count, named after the Python parameter. For each of these connections we add a new CPUSidePort to a cpuPorts vector declared in the SimpleCache class.
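For reference, the member variables used above are declared in the SimpleCache class roughly as follows (a sketch; the complete header is linked at the end of the chapter):

class SimpleCache : public MemObject
{
  private:
    ...

    /// Latency to check the cache. Number of cycles for both hit and miss
    const Cycles latency;

    /// The block size for the cache
    const unsigned blockSize;

    /// Number of blocks in the cache (size of cache / block size)
    const unsigned capacity;

    /// Instantiation of the CPU-side ports
    std::vector<CPUSidePort> cpuPorts;

    /// Instantiation of the memory-side port
    MemSidePort memPort;

    /// True if this cache is currently blocked waiting for a response
    bool blocked;

    /// Original packet we are handling, saved when we upgrade a request
    /// to a full cache block
    PacketPtr outstandingPacket;

    /// The port to send the response to when we receive it
    int waitingPortId;

    ...
};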

We also add one extra member variable to the CPUSidePort to save its id, and we add this as a parameter to its constructor.
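A sketch of the updated CPUSidePort declaration, with the new id member and constructor parameter (the other members are as in the SimpleMemobj):

class CPUSidePort : public SlavePort
{
  private:
    /// Since this is a vector port, we need to know which number this is
    int id;

    /// The object that owns this port (SimpleCache)
    SimpleCache *owner;

    /// True if the port needs to send a retry request
    bool needRetry;

    /// If we tried to send a packet and it was blocked, store it here
    PacketPtr blockedPacket;

  public:
    CPUSidePort(const std::string& name, int id, SimpleCache *owner) :
        SlavePort(name, owner), id(id), owner(owner), needRetry(false),
        blockedPacket(nullptr)
    { }

    ...
};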

Next, we need to implement getMasterPort and getSlavePort. The getMasterPort implementation is exactly the same as in the SimpleMemobj (it is reproduced after getSlavePort below). For getSlavePort, we now need to return the port based on the id requested.

BaseSlavePort&
SimpleCache::getSlavePort(const std::string& if_name, PortID idx)
{
    if (if_name == "cpu_side" && idx < cpuPorts.size()) {
        return cpuPorts[idx];
    } else {
        return MemObject::getSlavePort(if_name, idx);
    }
}
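For completeness, the unchanged getMasterPort, as it appeared in the SimpleMemobj:

BaseMasterPort&
SimpleCache::getMasterPort(const std::string& if_name, PortID idx)
{
    panic_if(idx != InvalidPortID, "This object doesn't support vector ports");

    if (if_name == "mem_side") {
        return memPort;
    } else {
        // Pass it along to our super class
        return MemObject::getMasterPort(if_name, idx);
    }
}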

The implementation of the CPUSidePort and the MemSidePort is almost the same as in the SimpleMemobj. The only difference is that we need to add an extra parameter to handleRequest: the id of the port from which the request originated. Without this id, we would not be able to forward the response to the correct port. The SimpleMemobj knew which port to send replies to based on whether the original request was an instruction or data access. However, this information is not useful to the SimpleCache since it uses a vector of ports rather than named ports.

The new handleRequest function does two things differently from the handleRequest function in the SimpleMemobj. First, it stores the port id of the request, as discussed above. Since the SimpleCache is blocking and only allows a single request outstanding at a time, we only need to save a single port id.

Second, it takes time to access a cache. Therefore, we need to take into account the latency to access the cache tags and the cache data for a request. We added an extra latency parameter to the cache object for this, and in handleRequest we now use an event to stall the request for the needed amount of time. We schedule a new event for latency cycles in the future. The clockEdge function returns the tick at which the nth cycle in the future occurs.

bool
SimpleCache::handleRequest(PacketPtr pkt, int port_id)
{
    if (blocked) {
        return false;
    }
    DPRINTF(SimpleCache, "Got request for addr %#x\n", pkt->getAddr());

    blocked = true;
    waitingPortId = port_id;

    schedule(new AccessEvent(this, pkt), clockEdge(latency));

    return true;
}

The AccessEvent is a little more complicated than the EventWrapper we used in the events chapter. Instead of using an EventWrapper, in the SimpleCache we will use a new class. The reason we cannot use an EventWrapper is that we need to pass the packet (pkt) from handleRequest to the event handler function. The following code is the AccessEvent class. We only need to implement the process function, which calls the function we want to use as our event handler, in this case accessTiming. We also pass the AutoDelete flag to the event constructor so we do not need to worry about freeing the memory for the dynamically created object. The event code will automatically delete the object after the process function has executed.

class AccessEvent : public Event
{
  private:
    SimpleCache *cache;
    PacketPtr pkt;
  public:
    AccessEvent(SimpleCache *cache, PacketPtr pkt) :
        Event(Default_Pri, AutoDelete), cache(cache), pkt(pkt)
    { }
    void process() override {
        cache->accessTiming(pkt);
    }
};

Now, we need to implement the event handler, accessTiming.

void
SimpleCache::accessTiming(PacketPtr pkt)
{
    bool hit = accessFunctional(pkt);
    if (hit) {
        pkt->makeResponse();
        sendResponse(pkt);
    } else {
        <miss handling>
    }
}

This function first performs a functional access of the cache. The accessFunctional function (described below) reads or writes the cache on a hit, or returns false if the access is a miss.

If the access is a hit, we simply need to respond to the packet. To respond, you first must call the function makeResponse on the packet. This converts the packet from a request packet to a response packet. For instance, if the memory command in the packet was a ReadReq this gets converted into a ReadResp. Writes behave similarly. Then, we can send the response back to the CPU.

The sendResponse function does the same things as the handleResponse function in the SimpleMemobj except that it uses the waitingPortId to send the packet to the right port. In this function, we need to mark the SimpleCache unblocked before calling sendPacket in case the peer on the CPU side immediately calls sendTimingReq. Then, we try to send retries to the CPU side ports if the SimpleCache can now receive requests and the ports need to be sent retries.

void
SimpleCache::sendResponse(PacketPtr pkt)
{
    int port = waitingPortId;

    // Mark this object unblocked before sending the response, in case the
    // peer on the CPU side immediately calls sendTimingReq.
    blocked = false;
    waitingPortId = -1;

    cpuPorts[port].sendPacket(pkt);

    // Now that this object is unblocked, the CPU side ports may need retries.
    for (auto& cpuPort : cpuPorts) {
        cpuPort.trySendRetry();
    }
}
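The trySendRetry helper called here is unchanged from the SimpleMemobj; as a reminder:

void
SimpleCache::CPUSidePort::trySendRetry()
{
    if (needRetry && blockedPacket == nullptr) {
        // Only send a retry if the port is now completely free
        needRetry = false;
        DPRINTF(SimpleCache, "Sending retry req for %d\n", id);
        sendRetryReq();
    }
}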

Back in the accessTiming function, we now need to handle the cache miss case. On a miss, we first check whether the request is for an entire cache block. If the packet is block-aligned and the size of the request is the size of a cache block, then we can simply forward the request to memory, just like in the SimpleMemobj.

However, if the packet is smaller than a cache block, then we need to create a new packet to read the entire cache block from memory. Here, whether the packet is a read or a write request, we send a read request to memory to load the data for the cache block into the cache. In the case of a write, it will occur in the cache after we have loaded the data from memory.

Then, we create a new packet that is blockSize bytes in size, and we call the allocate function to allocate memory in the Packet object for the data that we will read from memory. Note: this memory is freed when we free the packet. We reuse the original request object in the packet so the memory-side objects know the original requestor and the original request type for statistics.

Finally, we save the original packet pointer (pkt) in a member variable outstandingPacket so we can recover it when the SimpleCache receives a response. Then, we send the new packet across the memory side port.

void
SimpleCache::accessTiming(PacketPtr pkt)
{
    bool hit = accessFunctional(pkt);
    if (hit) {
        pkt->makeResponse();
        sendResponse(pkt);
    } else {
        Addr addr = pkt->getAddr();
        Addr block_addr = pkt->getBlockAddr(blockSize);
        unsigned size = pkt->getSize();
        if (addr == block_addr && size == blockSize) {
            DPRINTF(SimpleCache, "forwarding packet\n");
            memPort.sendPacket(pkt);
        } else {
            DPRINTF(SimpleCache, "Upgrading packet to block size\n");
            panic_if(addr - block_addr + size > blockSize,
                     "Cannot handle accesses that span multiple cache lines");

            assert(pkt->needsResponse());
            MemCmd cmd;
            if (pkt->isWrite() || pkt->isRead()) {
                cmd = MemCmd::ReadReq;
            } else {
                panic("Unknown packet type in upgrade size");
            }

            PacketPtr new_pkt = new Packet(pkt->req, cmd, blockSize);
            new_pkt->allocate();

            outstandingPacket = pkt;

            memPort.sendPacket(new_pkt);
        }
    }
}

On a response from memory, we know that this was caused by a cache miss. The first step is to insert the responding packet into the cache.

Then, there are two cases: if there is an outstandingPacket, we need to respond with that packet to the original requestor; if there is no outstandingPacket, the pkt in the response already contains the data the requestor needs and can be forwarded directly.

If the packet we are receiving as a response was an upgrade packet (because the original request was smaller than a cache line), we functionally access the outstandingPacket against the newly inserted block: on a read this copies the block's data into the packet, and on a write it applies the packet's data to the cache. Then, we need to delete the new packet that we made in the miss handling logic.

bool
SimpleCache::handleResponse(PacketPtr pkt)
{
    assert(blocked);
    DPRINTF(SimpleCache, "Got response for addr %#x\n", pkt->getAddr());
    insert(pkt);

    if (outstandingPacket != nullptr) {
        accessFunctional(outstandingPacket);
        outstandingPacket->makeResponse();
        delete pkt;
        pkt = outstandingPacket;
        outstandingPacket = nullptr;
    } // else, pkt contains the data it needs

    sendResponse(pkt);

    return true;
}

Functional cache logic

Now, we need to implement two more functions: accessFunctional and insert. These two functions make up the key components of the cache logic.

To functionally update the cache, we first need storage for the cache contents. The simplest possible cache storage is a map (hashtable) that maps from block addresses to data. Thus, we will add the following member to the SimpleCache.

std::unordered_map<Addr, uint8_t*> cacheStore;

To access the cache, we first check to see if there is an entry in the map which matches the address in the packet. We use the getBlockAddr function of the Packet type to get the block-aligned address. Then, we simply search for that address in the map. If we do not find the address, then this function returns false, the data is not in the cache, and it is a miss.

Otherwise, if the packet is a write request, we need to update the data in the cache. To do this, we use the writeDataToBlock function, which writes the data in the packet at the correct offset within a potentially larger block of data. This function takes a pointer to the block's data and the block size as parameters, computes the offset of the access within the block, and writes the packet's data at that offset (a simplified sketch of this calculation appears after the listing below).

If the packet is a read request, we need to update the packet's data with the data from the cache. The setDataFromBlock function performs the same offset calculation as the writeDataToBlock function, but copies data from the block into the packet instead.

bool
SimpleCache::accessFunctional(PacketPtr pkt)
{
    Addr block_addr = pkt->getBlockAddr(blockSize);
    auto it = cacheStore.find(block_addr);
    if (it != cacheStore.end()) {
        if (pkt->isWrite()) {
            pkt->writeDataToBlock(it->second, blockSize);
        } else if (pkt->isRead()) {
            pkt->setDataFromBlock(it->second, blockSize);
        } else {
            panic("Unknown packet type!");
        }
        return true;
    }
    return false;
}
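Conceptually, both helpers reduce to an offset calculation plus a copy. The following simplified sketch (for illustration only, not gem5's actual source) shows the write case; since blockSize is a power of two, the offset of the access within the block is just the low-order address bits:

// Simplified sketch of what writeDataToBlock does (illustration only)
void
writeDataToBlockSketch(PacketPtr pkt, uint8_t *blk_data, unsigned blk_size)
{
    // Offset of this access within its cache block
    Addr offset = pkt->getAddr() & Addr(blk_size - 1);
    // Copy the packet's data into the block at that offset
    std::memcpy(blk_data + offset, pkt->getConstPtr<uint8_t>(), pkt->getSize());
}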

Finally, we also need to implement the insert function. This function is called every time the memory side port responds to a request.

The first step is to check if the cache is currently full. If the cache holds as many blocks as the capacity set by the SimObject parameter, then we need to evict something. The following code evicts a random entry by leveraging the hashtable implementation of the C++ unordered_map.

On an eviction, we need to write the data back to the backing memory in case it has been updated. For this, we create a new Request-Packet pair. The packet uses a new memory command: MemCmd::WritebackDirty. Then, we send the packet across the memory side port (memPort) and erase the entry in the cache storage map.

Then, after a block has potentially been evicted, we add the new address to the cache. For this we simply allocate space for the block and add an entry to the map. Finally, we write the data from the response packet into the newly allocated block. This data is guaranteed to be the size of the cache block since we made sure to create a block-sized packet in the cache miss logic whenever the original packet was smaller than a cache block.

void
SimpleCache::insert(PacketPtr pkt)
{
    if (cacheStore.size() >= capacity) {
        // Select random thing to evict. This is a little convoluted since we
        // are using a std::unordered_map. See http://bit.ly/2hrnLP2
        int bucket, bucket_size;
        do {
            bucket = random_mt.random(0, (int)cacheStore.bucket_count() - 1);
        } while ( (bucket_size = cacheStore.bucket_size(bucket)) == 0 );
        auto block = std::next(cacheStore.begin(bucket),
                               random_mt.random(0, bucket_size - 1));

        RequestPtr req = new Request(block->first, blockSize, 0, 0);
        PacketPtr new_pkt = new Packet(req, MemCmd::WritebackDirty, blockSize);
        new_pkt->dataDynamic(block->second); // This will be deleted later

        DPRINTF(SimpleCache, "Writing packet back %s\n", pkt->print());
        memPort.sendTimingReq(new_pkt);

        cacheStore.erase(block->first);
    }
    uint8_t *data = new uint8_t[blockSize];
    cacheStore[pkt->getAddr()] = data;

    pkt->writeDataToBlock(data, blockSize);
}

Creating a config file for the cache

The last step in our implementation is to create a new Python config script that uses our cache. We can use the outline from the last chapter as a starting point. The only differences are that we may want to set the parameters of this cache (e.g., set the size of the cache to 1kB) and that instead of using the named ports (data_port and inst_port), we just use the cpu_side port twice. Since cpu_side is a VectorPort, it will automatically create multiple port connections.

import m5
from m5.objects import *

...

system.cache = SimpleCache(size='1kB')

system.cpu.icache_port = system.cache.cpu_side
system.cpu.dcache_port = system.cache.cpu_side

system.membus = SystemXBar()

system.cache.mem_side = system.membus.slave

...

The Python config file can be downloaded here.

Running this script should produce the expected output from the hello binary.

    gem5 Simulator System.  http://gem5.org
    gem5 is copyrighted software; use the --copyright option for details.

    gem5 compiled Jan 10 2017 17:38:15
    gem5 started Jan 10 2017 17:40:03
    gem5 executing on chinook, pid 29031
    command line: build/X86/gem5.opt configs/learning_gem5/part2/simple_cache.py

    Global frequency set at 1000000000000 ticks per second
    warn: DRAM device capacity (8192 Mbytes) does not match the address range assigned (512 Mbytes)
    0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
    warn: CoherentXBar system.membus has no snooping ports attached!
    warn: ClockedObject: More than one power state change request encountered within the same simulation tick
    Beginning simulation!
    info: Entering event queue @ 0.  Starting simulation...
    Hello world!
    Exiting @ tick 56082000 because target called exit()

Modifying the size of the cache, for instance to 128 KB, should improve the performance of the system.

    gem5 Simulator System.  http://gem5.org
    gem5 is copyrighted software; use the --copyright option for details.

    gem5 compiled Jan 10 2017 17:38:15
    gem5 started Jan 10 2017 17:41:10
    gem5 executing on chinook, pid 29037
    command line: build/X86/gem5.opt configs/learning_gem5/part2/simple_cache.py

    Global frequency set at 1000000000000 ticks per second
    warn: DRAM device capacity (8192 Mbytes) does not match the address range assigned (512 Mbytes)
    0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
    warn: CoherentXBar system.membus has no snooping ports attached!
    warn: ClockedObject: More than one power state change request encountered within the same simulation tick
    Beginning simulation!
    info: Entering event queue @ 0.  Starting simulation...
    Hello world!
    Exiting @ tick 32685000 because target called exit()

Adding statistics to the cache

Knowing the overall execution time of the system is one important metric. However, you may want to include other statistics as well, such as the hit and miss rates of the cache. To do this, we need to add some statistics to the SimpleCache object.

First, we need to declare the statistics in the SimpleCache object. They are part of the Stats namespace. In this case, we’ll make four statistics. The number of hits and the number of misses are just simple Scalar counts. We will also add a missLatency which is a histogram of the time it takes to satisfy a miss. Finally, we’ll add a special statistic called a Formula for the hitRatio that is a combination of other statistics (the number of hits and misses).

class SimpleCache : public MemObject
{
  private:
    ...

    Tick missTime; // To track the miss latency

    Stats::Scalar hits;
    Stats::Scalar misses;
    Stats::Histogram missLatency;
    Stats::Formula hitRatio;

  public:
    ...

    void regStats() override;
};

Next, we have to override the regStats function so the statistics are registered with gem5's statistics infrastructure. Here, for each statistic, we give it a name based on the "parent" SimObject name and a description. For the histogram statistic, we also need to initialize it with the number of buckets we want. Finally, for the formula, we simply need to write the formula down in code.

void
SimpleCache::regStats()
{
    // If you don't do this you get errors about uninitialized stats.
    MemObject::regStats();

    hits.name(name() + ".hits")
        .desc("Number of hits")
        ;

    misses.name(name() + ".misses")
        .desc("Number of misses")
        ;

    missLatency.name(name() + ".missLatency")
        .desc("Ticks for misses to the cache")
        .init(16) // number of buckets
        ;

    hitRatio.name(name() + ".hitRatio")
        .desc("The ratio of hits to the total accesses to the cache")
        ;

    hitRatio = hits / (hits + misses);

}

Finally, we need to update the statistics in our code. In the accessTiming function, we increment hits on a hit and misses on a miss. Additionally, on a miss, we save the current time so we can measure the latency.

void
SimpleCache::accessTiming(PacketPtr pkt)
{
    bool hit = accessFunctional(pkt);
    if (hit) {
        hits++; // update stats
        pkt->makeResponse();
        sendResponse(pkt);
    } else {
        misses++; // update stats
        missTime = curTick();
        ...

Then, when we get a response, we need to add the measured latency to our histogram. For this, we use the sample function, which adds a single point to the histogram. The histogram automatically resizes its buckets to fit the data it receives.

bool
SimpleCache::handleResponse(PacketPtr pkt)
{
    insert(pkt);

    missLatency.sample(curTick() - missTime);
    ...

The complete code for the SimpleCache header file can be downloaded here, and the complete code for the implementation of the SimpleCache can be downloaded here.

Now, if we run the above config file, we can check on the statistics in the stats.txt file. For the 1 KB case, we get the following statistics. 91% of the accesses are hits and the average miss latency is 53334 ticks (or 53 ns).

    system.cache.hits                                8431                       # Number of hits
    system.cache.misses                               877                       # Number of misses
    system.cache.missLatency::samples                 877                       # Ticks for misses to the cache
    system.cache.missLatency::mean           53334.093501                       # Ticks for misses to the cache
    system.cache.missLatency::gmean          44506.409356                       # Ticks for misses to the cache
    system.cache.missLatency::stdev          36749.446469                       # Ticks for misses to the cache
    system.cache.missLatency::0-32767                 305     34.78%     34.78% # Ticks for misses to the cache
    system.cache.missLatency::32768-65535             365     41.62%     76.40% # Ticks for misses to the cache
    system.cache.missLatency::65536-98303             164     18.70%     95.10% # Ticks for misses to the cache
    system.cache.missLatency::98304-131071             12      1.37%     96.47% # Ticks for misses to the cache
    system.cache.missLatency::131072-163839            17      1.94%     98.40% # Ticks for misses to the cache
    system.cache.missLatency::163840-196607             7      0.80%     99.20% # Ticks for misses to the cache
    system.cache.missLatency::196608-229375             0      0.00%     99.20% # Ticks for misses to the cache
    system.cache.missLatency::229376-262143             0      0.00%     99.20% # Ticks for misses to the cache
    system.cache.missLatency::262144-294911             2      0.23%     99.43% # Ticks for misses to the cache
    system.cache.missLatency::294912-327679             4      0.46%     99.89% # Ticks for misses to the cache
    system.cache.missLatency::327680-360447             1      0.11%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::360448-393215             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::393216-425983             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::425984-458751             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::458752-491519             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::491520-524287             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::total                   877                       # Ticks for misses to the cache
    system.cache.hitRatio                        0.905780                       # The ratio of hits to the total access

And when using a 128 KB cache, we get a slightly higher hit ratio. It seems like our cache is working as expected!

    system.cache.hits                                8944                       # Number of hits
    system.cache.misses                               364                       # Number of misses
    system.cache.missLatency::samples                 364                       # Ticks for misses to the cache
    system.cache.missLatency::mean           64222.527473                       # Ticks for misses to the cache
    system.cache.missLatency::gmean          61837.584812                       # Ticks for misses to the cache
    system.cache.missLatency::stdev          27232.443748                       # Ticks for misses to the cache
    system.cache.missLatency::0-32767                   0      0.00%      0.00% # Ticks for misses to the cache
    system.cache.missLatency::32768-65535             254     69.78%     69.78% # Ticks for misses to the cache
    system.cache.missLatency::65536-98303             106     29.12%     98.90% # Ticks for misses to the cache
    system.cache.missLatency::98304-131071              0      0.00%     98.90% # Ticks for misses to the cache
    system.cache.missLatency::131072-163839             0      0.00%     98.90% # Ticks for misses to the cache
    system.cache.missLatency::163840-196607             0      0.00%     98.90% # Ticks for misses to the cache
    system.cache.missLatency::196608-229375             0      0.00%     98.90% # Ticks for misses to the cache
    system.cache.missLatency::229376-262143             0      0.00%     98.90% # Ticks for misses to the cache
    system.cache.missLatency::262144-294911             2      0.55%     99.45% # Ticks for misses to the cache
    system.cache.missLatency::294912-327679             1      0.27%     99.73% # Ticks for misses to the cache
    system.cache.missLatency::327680-360447             1      0.27%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::360448-393215             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::393216-425983             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::425984-458751             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::458752-491519             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::491520-524287             0      0.00%    100.00% # Ticks for misses to the cache
    system.cache.missLatency::total                   364                       # Ticks for misses to the cache
    system.cache.hitRatio                        0.960894                       # The ratio of hits to the total access