An Overview of MPP Systems

by Mike Doubek and Jason Wells

Introduction

Multiprocessor systems have consistently held a place in the commercial sector since their advent in the mid-1960s. Today there are two commonly employed techniques in modern multiprocessor design: Symmetric Multiprocessing (SMP) and Massively Parallel Processing (MPP). The salient difference between the two is that all SMP processors access a single memory device (although this device may be multiported), whereas in an MPP system memory is distributed into nodes, multiple memory devices each holding a different portion of the main memory space. MPP systems, also referred to as distributed memory systems, therefore have the ability to service many accesses to different nodes in parallel. Although SMP is an appropriate choice for small-scale multiprocessing such as workstations, most large-scale multiprocessors take advantage of the scalability and parallelism of MPP design. This paper will profile, compare and contrast three commonly used MPP architectures: UMA (uniform memory access), NUMA (non-uniform memory access), and COMA (cache-only memory architecture).
Before exploring the differences between these architectures, an overview of cache coherency is warranted. By definition all processors in an MPP system should be able to access a flat memory space addressing the entire system memory. This can of course be accomplished by each processor making requests as needed to the memory node containing the specified address, but to achieve acceptable performance a caching scheme becomes necessary. Each of the three aforementioned MPP architectures uses a different caching scheme, but for correct operation all three must assure at a hardware level that all copies of a memory block (or line, as it is often called in MPP design) contain the most recent and correct data. As soon as a processor writes new data to a cached line, all other cached copies of that line must be either invalidated or updated. The method employed to accomplish this is called a cache coherency scheme. The three MPP architectures studied in this paper are all cache coherent (hence the CC- prefix).
There are two basic categories of cache coherency schemes: write invalidate, which invalidates all old cached copies of a changed line, and write update, which updates all old cached copies of a changed line. Both require sending messages over the memory network to inform other caches of the change. Rather than blindly broadcasting these messages, most MPP machines maintain some form of sharing list which can be used to locate all nodes containing the changed or "dirty" line. There are two common protocols for maintaining a sharing structure: a directory-based protocol and a snooping protocol. Directory-based cache coherency maintains a section of memory which contains just memory block sharing information. A snooping protocol maintains a list with each cached line that denotes which processors are sharing that line. Although snooping protocols require significantly more cache memory (wider cache lines), they have the advantage of making a sharing list immediately available at cache write without having to perform a directory lookup as required in directory-based protocols.
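
To make the bookkeeping concrete, the short C sketch below models a generic write-invalidate, directory-based scheme: one directory entry per line holds a bit vector of sharers, and a write sends an invalidate to every other sharer. The names (dir_entry, NUM_NODES, write_invalidate) and the eight-node size are invented for illustration; this is a minimal sketch, not the protocol of any machine discussed later.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_NODES 8                 /* hypothetical node count */

    /* One directory entry per memory line: one bit per node that caches it. */
    struct dir_entry {
        uint8_t sharers;                /* bit i set => node i holds a copy   */
        int     owner;                  /* node holding the dirty copy, or -1 */
    };

    /* Write-invalidate: on a write by 'writer', tell every other sharer to
     * drop its copy, then record the writer as the sole owner. */
    static void write_invalidate(struct dir_entry *e, int writer)
    {
        for (int n = 0; n < NUM_NODES; n++) {
            if (n != writer && (e->sharers & (1u << n)))
                printf("send INVALIDATE to node %d\n", n);  /* network message */
        }
        e->sharers = (uint8_t)(1u << writer);
        e->owner   = writer;
    }

    int main(void)
    {
        struct dir_entry line = { .sharers = 0x2D, .owner = -1 };  /* nodes 0,2,3,5 share */
        write_invalidate(&line, 3);
        return 0;
    }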

COMA

Currently, there are two popular architectures for scalable distributed shared memory machines: CC-NUMA (Cache-Coherent Non-Uniform Memory Access) and COMA (Cache-Only Memory Architecture). Both are intended for large multiprocessor systems which have their main memories distributed amongst the various processing nodes which make up the whole machine. CC-NUMA uses a local cache at each node to hold copies of both local data and data from other processing nodes. This reduces memory access latency, but moving data from a remote node to the local cache is a rather costly operation. The COMA architecture tries to alleviate some of this overhead. In a COMA machine, additional hardware including tag and state memory is added to the DRAM of each processing node to convert it into a kind of cache which we call attraction memory. This additional hardware disassociates the actual data location in the machine from the physical addresses produced by the processors. Doing so enables data to be replicated and migrated automatically upon demand around the machine, creating a much more flexible platform for applications, but requires more complex hardware and data coherence protocols.

Conventional COMA

In a Cache-Only Memory Architecture, each processing node holds a portion of the address space, similar to the NUMA architecture. However, unlike NUMA, the partitioning of data among the memories at the constituent nodes does not have to remain static, because all the memory is organized like a group of large caches. The memory local to one particular processing node serves two functions. First, it is the local data cache for the processor to use as its main memory. Second, it is also a part of the distributed main memory for the entire COMA system. In other words, each node's memory is both a cache and a virtual part of shared memory. This means that each local memory chunk contains both information recently accessed by this node's processor and information that this node has never needed to access. This is the attraction memory mentioned above.
In order to manage this attraction memory correctly, a coherence protocol attracts data needed by a processor to its local attraction memory from other nodes in the system. This protocol works as follows: cached information is divided into items. On a memory reference, a virtual address is translated into an item identifier. The item identifier space is logically the same as the physical address space of conventional machines, but there is no permanent mapping between an item identifier and a physical memory location. Instead, an item identifier corresponds to whatever location in an attraction memory currently has a tag matching that identifier.
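
As a rough illustration of the tag check implied by this scheme, the sketch below models an attraction memory as an array of tagged slots and looks an item identifier up by comparing tags (direct-mapped purely for simplicity). All names (am_slot, AM_SLOTS, am_lookup) are invented; real COMA hardware is set-associative and far more involved.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define AM_SLOTS 1024                      /* hypothetical attraction-memory size */

    /* One attraction-memory slot: the tag records which item currently lives
     * here, since an item identifier has no permanent home location. */
    struct am_slot {
        uint32_t item_id;                      /* tag: identifier of the item held */
        bool     valid;
        uint8_t  data[64];
    };

    static struct am_slot attraction_memory[AM_SLOTS];

    /* Direct-mapped lookup, purely for illustration: the item lives wherever a
     * slot's tag matches the identifier produced by address translation. */
    static struct am_slot *am_lookup(uint32_t item_id)
    {
        struct am_slot *s = &attraction_memory[item_id % AM_SLOTS];
        if (s->valid && s->item_id == item_id)
            return s;                          /* hit: item has been attracted here  */
        return NULL;                           /* miss: must be fetched over the bus */
    }

    int main(void)
    {
        attraction_memory[7].item_id = 7;      /* pretend item 7 was attracted earlier */
        attraction_memory[7].valid   = true;
        printf("item 7: %s\n", am_lookup(7) ? "hit" : "miss");
        printf("item 9: %s\n", am_lookup(9) ? "hit" : "miss");
        return 0;
    }
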
The first model of the conventional COMA architecture is the Single-bus DDM protocol. The DDM (Data Diffusion Machine) bus is a method of interconnection between different processing nodes in the system. It connects directly to the attraction memory at the various nodes and is also connected to a separate control system that handles arbitration, selection, and other protocol issues.

It is via this bus that data residing on a remote node is migrated to the local processor's attraction memory (AM). The distribution and coherence of data among the attraction memories at all the nodes in the system are controlled by a snooping protocol, while a processor's accesses to its own local memory are managed by a lower-level protocol. All transactions between processors and their local memories, and between remote memories via the DDM bus, are done at the item level as mentioned above. The attraction memory adds a small state field to each item that it handles. The states that item slots in the AMs of the system can be in are listed below:

1) Invalid: this AM does not contain the item
2) Exclusive: this AM, and no other AM, contains the item
3) Shared: this AM and possibly other AMs contain the item
4) Reading: this AM is waiting for a data value after having issued a read
5) Waiting: this AM is waiting to change to Exclusive after having issued an erase
6) Reading-and-Waiting: this AM is waiting for a data value later to become Exclusive
7) Answering: this AM has promised to answer a read request
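
For reference in the sketches that follow, the same states can be written as a C enumeration (a direct transcription of the list above):

    /* Attraction-memory item states, transcribed from the list above. */
    enum am_state {
        INVALID,              /* this AM does not contain the item             */
        EXCLUSIVE,            /* this AM, and no other, contains the item      */
        SHARED,               /* this AM and possibly others contain the item  */
        READING,              /* waiting for a data value after issuing a read */
        WAITING,              /* waiting to become Exclusive after an erase    */
        READING_AND_WAITING,  /* waiting for data, then will become Exclusive  */
        ANSWERING             /* has promised to answer a read request         */
    };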

To maintain coherency, all copies of an item except the one being updated are invalidated on a write. This maintains only one changed copy of an item and prohibits two different versions of an item from existing at the same time. The types of operations permitted on the DDM bus are listed below:

1) Erase: erase all copies of this item
2) Exclusive: acknowledge an erase request
3) Read: read a copy of the item
4) Data: Carry the data in reply to an earlier read request
5) Inject: carry the only copy of an item and look for an AM to move into (caused by a replacement)
6) Out: carry the item on its way out of the AM (caused by a replacement; terminates when another copy of the item is found)
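
Transcribed the same way, these transactions can also be written as a C enumeration (the erase acknowledgement is named EXCLUSIVE_ACK here only to keep it distinct from the Exclusive state in the previous enumeration):

    /* DDM bus transaction types, transcribed from the list above. */
    enum ddm_transaction {
        ERASE,           /* erase all copies of this item                    */
        EXCLUSIVE_ACK,   /* acknowledge an erase request                     */
        READ,            /* read a copy of the item                          */
        DATA,            /* carry data in reply to an earlier read request   */
        INJECT,          /* carry the only copy of an item, seeking a new AM */
        OUT              /* carry an item leaving an AM during a replacement */
    };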

A summary of the course of action on the DDM bus is provided in the state diagram below:



As an example, a read attempt of an item marked as being in the Invalid state will cause it to change to the Reading state. The bus selection mechanism will select an AM to service the request, which will eventually cause a Data transaction on the DDM bus. The requesting AM, now in the Reading state, will grab the Data transaction and change to the Shared state. Processors are allowed to write only to items that are marked as being in the Exclusive state. This provides further protection from having two copies of the same item at different nodes which contain different data. If the item that needs to be written to is marked as being in the Shared state, all other copies must be deleted and their deletions acknowledged before the write can take place. This happens by having the node desiring a write send an Erase transaction, enter the Waiting state, and wait for the Exclusive acknowledgement.
Replacing an item that is in the Shared state causes an Out transaction to be triggered on the bus. The space used by the item can now be reclaimed. If an Out transaction sees an AM slot containing the same item in the Shared, Reading, Waiting, or Reading-and-Waiting states, it does nothing. If no other AM slot is found in these states, the transaction becomes an Inject. An Inject transaction can also take place if a replacement is performed on an item marked as being in the Exclusive state, since that is the only copy of the item.
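
The sketch below ties the states and transactions together for the single-bus case: a read miss, the arriving Data reply, a write to a Shared item, and the Exclusive acknowledgement. It is a deliberately simplified model (no arbitration, no Inject or Out handling, an invented helper issue() standing in for the bus), not the actual DDM implementation.

    #include <stdint.h>
    #include <stdio.h>

    /* Condensed copies of the state and transaction enumerations above. */
    enum am_state { INVALID, EXCLUSIVE, SHARED, READING, WAITING,
                    READING_AND_WAITING, ANSWERING };
    enum ddm_txn  { ERASE, EXCLUSIVE_ACK, READ, DATA, INJECT, OUT };

    struct am_slot { uint32_t item_id; enum am_state state; };

    /* Stand-in for placing a transaction on the DDM bus. */
    static void issue(enum ddm_txn t, uint32_t item_id)
    {
        printf("bus: transaction %d for item %u\n", t, item_id);
    }

    /* Processor read miss: go to Reading and ask the bus for the item. */
    static void on_read_miss(struct am_slot *s)
    {
        s->state = READING;
        issue(READ, s->item_id);
    }

    /* A Data transaction for our outstanding read arrives: become Shared. */
    static void on_data(struct am_slot *s)
    {
        if (s->state == READING)
            s->state = SHARED;
    }

    /* Processor write: only Exclusive items may be written; a Shared item
     * must first erase all other copies and wait for the acknowledgement. */
    static void on_write(struct am_slot *s)
    {
        if (s->state == SHARED) {
            s->state = WAITING;
            issue(ERASE, s->item_id);
        }
        /* if already EXCLUSIVE the write proceeds immediately */
    }

    /* The Exclusive acknowledgement arrives: the write may now be performed. */
    static void on_exclusive_ack(struct am_slot *s)
    {
        if (s->state == WAITING)
            s->state = EXCLUSIVE;
    }

    int main(void)
    {
        struct am_slot slot = { 42, INVALID };
        on_read_miss(&slot);        /* Invalid  -> Reading   */
        on_data(&slot);             /* Reading  -> Shared    */
        on_write(&slot);            /* Shared   -> Waiting   */
        on_exclusive_ack(&slot);    /* Waiting  -> Exclusive */
        return 0;
    }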

Hierarchical DDM

So far in our discussion of the DDM bus we have only looked at the single-bus configuration. There is also the hierarchical DDM configuration, which replaces the separate control system that used to reside on the other side of the DDM bus from the AMs with another DDM bus. Levels configured like this are then stacked on top of each other to form the hierarchy of the system. To understand how this configuration of a COMA machine works, consider this example of a read request on a hierarchical DDM configured COMA machine. If no AM connected to the local bus can satisfy the read request, the next higher level in the hierarchy retransmits the request on its DDM bus. The AM slot where the read is destined to reside is changed to the Reading state. When the request reaches a level that has an AM with a copy of the item, the item's state is changed to Answering and the request is retransmitted down onto the lower bus.
To perform a write to an AM slot not on the same hierarchy level as the originator, the request is passed up through each level that does not contain an exclusive copy of the desired item. Along the path to finding this Exclusive item, the intermediate locations containing non-Exclusive copies of the data item are switched to the Waiting state. When the Exclusive copy is finally found, the Waiting state locations are assigned Exclusive copies of the data item as it is passed back down to the AM that requested the write.
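
A minimal sketch of this upward propagation follows, assuming a three-level hierarchy and an invented directory check level_has_copy(); a real hierarchical DDM keeps set-associative directories at each bus level rather than the toy predicate used here.

    #include <stdbool.h>
    #include <stdio.h>

    #define TOP_LEVEL 2

    /* Invented stand-in for the directory check a DDM bus level would do. */
    static bool level_has_copy(int level, unsigned item_id)
    {
        /* pretend only the top level can see a copy of item 7 */
        return level == TOP_LEVEL && item_id == 7;
    }

    /* Each level either finds the item somewhere below it or passes the
     * read up to the next DDM bus in the hierarchy. */
    static void ddm_read(int level, unsigned item_id)
    {
        if (level_has_copy(level, item_id)) {
            printf("level %d: copy found, mark Answering, send Data down\n", level);
        } else if (level < TOP_LEVEL) {
            printf("level %d: miss, retransmit read on the bus above\n", level);
            ddm_read(level + 1, item_id);
        } else {
            printf("level %d: item %u not present anywhere\n", level, item_id);
        }
    }

    int main(void)
    {
        ddm_read(0, 7);   /* requesting AM is at level 0, in the Reading state */
        return 0;
    }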

Simple COMA

Another implementation of the COMA architecture is put forth in the idea of Simple COMA. Simple COMA divides the job of managing the global virtual address space between hardware and software to provide the automatic data migration and replication of a traditional COMA with much less complex hardware and coherence protocols. For example, a standard memory management unit (MMU) replaces the complex hardware used in the other COMA designs (see Figure 7).
The Attraction Memory in this design works by having the MMU map the appropriate virtual addresses to the physical addresses. Whenever an access is made to the AM, the MMU effectively performs the AM tag check by looking for a data item mapped at the given virtual address. If a mapping exists, then the space for the data item has already been allocated to this node's AM. If there is no mapping in the MMU, then the MMU causes a virtual memory system exception to the processor so that a mapping can be created.
The management of an Attraction Memory can be divided into two categories: page allocation and page replacement. Page allocation is handled by the operating system running on the machine. It is triggered when the MMU is unable to find allocated space for the desired data. The page fault handler must also initialize the state of the data item to Invalid. Page replacement is a little more complex. First the MMU removes the virtual address mapping for the item. Then the state of the item to be freed must be set to Invalid, unless it contains data marked as Exclusive, in which case the data must first be relocated. The location can then be replaced.
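
Below is a rough sketch of the software side of this split, under the assumption that the MMU exception simply vectors to an operating system handler; every name here (os_map_page, relocate_item, page_fault_handler) is an invented stand-in rather than a real operating-system interface.

    #include <stdio.h>
    #include <stdlib.h>

    enum item_state { ITEM_INVALID, ITEM_SHARED, ITEM_EXCLUSIVE };
    struct page { void *frame; enum item_state state; };

    /* Invented stand-ins for operating-system services. */
    static struct page *os_map_page(unsigned long vaddr)
    {
        struct page *p = malloc(sizeof *p);
        p->frame = malloc(4096);
        printf("mapped vaddr %#lx to a local frame\n", vaddr);
        return p;
    }
    static void relocate_item(struct page *p)
    {
        printf("pushing the only copy of this item to another node\n");
        (void)p;
    }

    /* Page allocation: the MMU raised an exception because no mapping exists.
     * The handler allocates local space and marks the new item Invalid. */
    static struct page *page_fault_handler(unsigned long vaddr)
    {
        struct page *p = os_map_page(vaddr);
        p->state = ITEM_INVALID;
        return p;
    }

    /* Page replacement: an Exclusive item is the only copy in the machine,
     * so it must be relocated before its slot can be reclaimed. */
    static void replace_page(struct page *p)
    {
        if (p->state == ITEM_EXCLUSIVE)
            relocate_item(p);
        p->state = ITEM_INVALID;
    }

    int main(void)
    {
        struct page *p = page_fault_handler(0x40000000UL);
        p->state = ITEM_EXCLUSIVE;    /* pretend this node later wrote the item */
        replace_page(p);
        return 0;
    }
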
Overall, Simple COMA achieves a simpler design by using less hardware and performing everything it can in software. This may hamper performance somewhat, but the ease of implementation far outweighs this drawback.

DDMlite - An Implementation

Because the description of the COMA architecture given above is fairly abstract, an example of an actual implementation will be discussed. One example of an actual COMA machine is the DDMlite developed by Andreas Moestedt at the Swedish Institute of Computer Science. Each node in this design consisted of four Motorola 88100 microprocessors and 32 MB of DRAM which was used as the Attraction Memory. The DDMlite supported up to six nodes on its DDM bus. The interconnect logic was implemented using Xilinx FPGAs, which allowed for rapid development.
At each node in the DDMlite machine there is a node controller. This device handles all memory transactions between and within nodes. The node controller handles all accesses to the node's local memory (its AM) while also snooping the DDM bus for requests for data contained locally in the DRAM, snooping the DDM bus for replies to its own requests, and replacing already resident data items by moving them to other nodes if there are no free slots when receiving new data.
The protocol used by the DDMlite to control access to its DDM bus contains four states. These are:

1) Exclusive: similar to the Master state, this node has the cache item and no other copies are allowed in the system.
2) Master: this node is the owner of the cache item and there can be other copies in the system. This node is also responsible for servicing all read requests to that line.
3) Shared: this node has a shared copy of the cache item, which is maintained by another node.
4) Invalid: this node's copy is invalid.

When a new item needs to be placed in a node's AM and that AM is currently full, an existing item must be sacrificed. The DDMlite's protocol specifies that the priority for removing items to make room for a newer one is I > S > M > E. An example read from node 1 to node 2 would proceed as follows: first, the node controller of node 1 checks for the data in its local Attraction Memory. Not finding it there, node 1's controller then arbitrates for use of the DDM bus. Once granted use of the bus, the controller sends out a request for the data item. Node 2's controller, snooping the bus, picks up the request from node 1. It then checks its local Attraction Memory and discovers the requested data item, which happens to be marked as being in the Exclusive state. Node 2's controller sends the requested data item over the DDM bus to node 1. Node 1, snooping the bus for a response to its request, receives the data, places it in its AM, and changes the item's state from Exclusive to Shared.
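
Below is a small sketch of that replacement priority, with an invented four-slot set layout; the real DDMlite controller logic lives in FPGAs, so this is only a software analogy.

    #include <stdio.h>

    enum lite_state { E, M, S, I };              /* Exclusive, Master, Shared, Invalid */

    #define NUM_SLOTS 4                          /* hypothetical slots per set */

    static int eviction_rank(enum lite_state st) /* higher rank = evict first */
    {
        switch (st) {
        case I: return 3;
        case S: return 2;
        case M: return 1;
        default: return 0;                       /* E: evict only if forced   */
        }
    }

    /* Pick the slot to sacrifice according to the I > S > M > E priority. */
    static int choose_victim(const enum lite_state set[NUM_SLOTS])
    {
        int victim = 0;
        for (int i = 1; i < NUM_SLOTS; i++)
            if (eviction_rank(set[i]) > eviction_rank(set[victim]))
                victim = i;
        return victim;
    }

    int main(void)
    {
        enum lite_state set[NUM_SLOTS] = { M, S, E, S };
        printf("evict slot %d\n", choose_victim(set));   /* prints 1, a Shared slot */
        return 0;
    }
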
The DDMlite project proved that a COMA multiprocessor was feasible, even with limited resources. It also showed that the single DDM bus configuration is a viable architecture for such a machine.

NUMA

Non-Uniform Memory Access systems are those in which memory access times are not uniform, but rather depend on the route to, and latency at, the node containing the requested memory line. A usual method of NUMA caching is to have each processor maintain its own internal and/or secondary cache, and then associate one or more of these cached processors with a memory node. Memory associated with processors in this way is called local memory from the viewpoint of those processors, because they have direct access to that node's memory ports and need not go through any memory network routing. Each node is also directly connected to a memory network interface which manages all remote (non-local) memory requests. Local memory requests are thus much faster than remote requests, and much effort is placed on designing an intelligent network cache to manage caching of remote memory lines.
The choice of memory network topology is one of the most important aspects of NUMA design. In its simplest form a hierarchical ring topology would be sufficient. Hypercube and grid topologies are also popular in NUMA design. Below is a diagram depicting examples of these three topologies.


NUMAchine

Famous NUMA machines include the University of Toronto's NUMAchine, Stanford's DASH, SGI's Origin series and Hewlett Packard's V, S, and X class servers. To better illustrate the concepts of NUMA we will examine the NUMAchine architecture. The NUMAchine is a scalable NUMA system based on a topology of hierarchical rings. It is implemented as stations, each consisting of 2 MIPS R4400 processors, memory and a network cache, with stations interconnected by a hierarchy of unidirectional circular buses or rings. Several stations form a local ring, then several local rings connect to form a central ring which handles inter-ring communication between the local rings, and so forth, with each ring connecting the group of rings hierarchically below it. Stations are numbered with select bits, as shown in the diagram above, for routing purposes explained later. Memory is thereby implemented as a four-level hierarchy: the on-chip processor cache, a secondary per-processor SRAM cache, the on-station network cache and local memory, and the remainder of the memory modules on remote stations. Below is a diagram of a NUMAchine station.


Off-station memory accesses are transmitted to the ring structure via a ring interface unit in the form of a transaction, which consists of packets: units of data which can be transmitted between stations in one bus cycle. Transactions such as invalidates and read requests require only one packet, but transactions such as writes, which contain data, require multiple packets. These packets are not necessarily consecutive, therefore a transaction identifier must be present to distinguish each packet. Each packet also contains routing information in a routing mask. Each ring position is specified by a unique bit in a field representing that hierarchical level. Thus adding an additional position on a ring would increase by one the length of the field representing rings of that level, and adding an additional ring level would add another field to the routing mask. This scheme requires on the order of log n routing bits, assuming balanced rings and uniquely specifying each station.
The greatest advantage of this routing scheme is the simplicity of the routing logic. Since rings in the NUMAchine are unidirectional, there is only one possible route between any two ring positions. A packet is ignored by the nodes on a ring if the routing fields for the higher levels do not match the pattern for that specific ring. Such packets are routed up, level by level, by the ring interconnects as necessary until they can be distributed back down to all the stations which need them. When a packet's higher-level fields do match a ring, the nodes selected in the current ring's bit mask receive the data and send it downward (either to a station or to a lower-level ring). Each such bit is then cleared from the routing mask of the current level, since the packet has been sent to that sublevel. When all routing bits of the current level are zero, the packet's slot on that ring is freed and may be filled with another packet. This routing is further illustrated in the symmetric multicasting example diagram below.
Another advantage of this routing scheme is that it facilitates symmetric multicasting, meaning that packets with multiple destinations can be routed symmetrically instead of performing a global broadcast. By logically ORing the routing identification bits of all stations which are to receive a packet, a routing mask is built such that at the top level only rings involved in the transaction receive the data. Routing fields for lower levels contain the OR of the routing bits required for all rings of that level, so data will often be sent to subrings where it is not needed. Although this can cause congestion in the lower levels, it prevents the higher levels from becoming as much of a bottleneck as they would be if global broadcasting were used instead. The diagram above is an example of symmetric multicasting on a 3x3x4 NUMAchine. In the diagram, node $ (= 1000 010 100) dirties a line also shared by nodes & (= 0010 001 010) and % (= 1000 010 010). The initial routing mask A (= 1100 011 110) is produced by logically ORing the node addresses of the two destinations (node & OR node %). This packet travels up, level by level, to the topmost ring, where it is then distributed downward. From the diagram it is easy to see the origin of the name "symmetric" multicasting, since all sublevels must be traversed equally.
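
The sketch below illustrates the mask-building and bit-clearing steps for a hypothetical two-level machine; the field widths, struct layout, and helper names are invented and do not reflect the NUMAchine's actual packet format.

    #include <stdint.h>
    #include <stdio.h>

    /* One-hot routing mask for a hypothetical two-level hierarchy. */
    struct route_mask {
        uint8_t ring;      /* bit i set => local ring i must receive the packet  */
        uint8_t station;   /* OR of the station bits needed on those local rings */
    };

    /* Build a multicast mask by ORing the destination stations' identifiers. */
    static struct route_mask multicast(const struct route_mask *dests, int n)
    {
        struct route_mask m = { 0, 0 };
        for (int i = 0; i < n; i++) {
            m.ring    |= dests[i].ring;
            m.station |= dests[i].station;
        }
        return m;
    }

    /* On a local ring, deliver to every selected station, clearing each bit as
     * the packet goes down; when the field reaches zero the ring slot is free. */
    static void deliver_on_local_ring(struct route_mask *m)
    {
        for (int s = 0; s < 8; s++) {
            if (m->station & (1u << s)) {
                printf("send packet down to station %d\n", s);
                m->station &= ~(1u << s);
            }
        }
        if (m->station == 0)
            printf("all bits cleared: slot on this ring is free again\n");
    }

    int main(void)
    {
        /* two hypothetical destination stations, on local rings 0 and 2 */
        struct route_mask dests[2] = { { 1u << 0, 1u << 1 }, { 1u << 2, 1u << 1 } };
        struct route_mask m = multicast(dests, 2);
        printf("ring field %#x, station field %#x\n", m.ring, m.station);
        deliver_on_local_ring(&m);
        return 0;
    }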


NUMAchine uses a hierarchical directory-based cache coherency scheme, maintained on each station for each line in the network cache. Each cache line has an additional two bits stored with it: a global bit and a valid bit. If the global bit is not set, then the station is said to hold the line privately and need never check the sharing directory when updating the line. When a line is global, an invalidate packet is sent to all nodes sharing the line, and a write-back packet containing the updated data is sent to the memory where the address is physically located. NUMAchine also has an on-station coherency agent to maintain consistency between the per-processor caches and the station's network cache. Below is a state transition diagram for cached lines.
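
As a supplement to that diagram, the sketch below models the per-line bookkeeping just described: a valid bit, a global bit, and a sharing mask consulted only when the line is global. The field names and the 16-station mask width are invented for illustration; they are not NUMAchine's actual layout.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct nc_line {
        bool     valid;
        bool     global;      /* set => other stations may share this line       */
        uint16_t sharers;     /* directory: one bit per station sharing the line */
    };

    /* A write to a private line needs no directory work; a write to a global
     * line triggers invalidates plus a write-back to the home memory. */
    static void on_station_write(struct nc_line *line, int home_station)
    {
        if (!line->valid) {
            printf("miss: fetch the line first\n");
            return;
        }
        if (!line->global)
            return;                               /* private: no coherence traffic */
        for (int s = 0; s < 16; s++)
            if (line->sharers & (1u << s))
                printf("send invalidate packet to station %d\n", s);
        printf("send write-back packet to home station %d\n", home_station);
        line->sharers = 0;
        line->global  = false;
    }

    int main(void)
    {
        struct nc_line line = { true, true, 0x0014 };   /* shared by stations 2 and 4 */
        on_station_write(&line, 7);
        return 0;
    }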


A program which is sufficiently local (i.e. one in which processors primarily access memory physically residing on their own station or on other stations of their local ring) will be able to achieve high performance, but heavy remote accessing and sharing can quickly turn the central rings into bottlenecks. Like most parallel machines, the NUMAchine maintains non-intrusive performance monitoring registers. Below are a network cache hit rate chart and a memory network usage chart released by the University of Toronto, produced using these registers to profile the machine's performance running the SPLASH-2 benchmark suite. One can see from these that the central rings increase in congestion more quickly than the local rings, yet manage to stay within a tolerable level even on memory-intensive applications, making the NUMAchine a simple but viable MPP system.


UMA

The UMA (Uniform Memory Access) architecture differs from CC-NUMA machines in that all memory accesses take the same amount of time. This is accomplished by making the memory of the machine a completely separate unit from the processors. In the NUMA architecture, the memory is distributed among the various processor nodes, and accesses to local memory complete faster than accesses that must go to memory stored on a remote processor node.
The UMA architecture has proven to be less popular than NUMA, but some companies and research groups have stuck with UMA. One example of this is the MTA (Multithreaded Architecture) produced by the Tera Computer Company. This machine's main memory is distributed about the machine, but it is interleaved 64 ways, so that memory references from the processors are scattered among all the banks of all the memory units. Since no bank is local to any particular processor, all references see essentially the same latency, which is what makes the MTA a UMA machine.
The MTA is special in that it uses custom processors which are multithreaded and capable of handling up to 128 distinct instruction streams. These processors are able to switch between streams on every clock cycle, which lets the MTA hide up to 384 ns of memory latency when running at the optimal 333 MHz clock rate. This compensates for the drawback of having no local memory and requiring that all memory accesses go out to the distributed memory system. Because of this, the MTA has shown that UMA is also a viable approach to large-scale multiprocessing.
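
A back-of-the-envelope check of that figure: at 333 MHz one cycle is about 3 ns, so cycling round-robin through 128 streams covers roughly 128 x 3 ns = 384 ns of latency for any one stream. The tiny program below simply performs this arithmetic.

    #include <stdio.h>

    /* Check of the latency figure quoted above: 128 instruction streams
     * switched every cycle at a 333 MHz clock. */
    int main(void)
    {
        double clock_hz  = 333e6;
        double cycle_ns  = 1e9 / clock_hz;        /* about 3 ns per cycle        */
        int    streams   = 128;
        double hidden_ns = streams * cycle_ns;    /* cycles of other streams'
                                                     work that cover one miss   */
        printf("cycle = %.1f ns, hidden latency = %.0f ns\n", cycle_ns, hidden_ns);
        return 0;
    }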


Conclusion

In surveying these different multiprocessing architectures, advantages and disadvantages do not appear absolute. Both NUMA and COMA can be designed to work using different types of hardware. While COMA machines can be built more cheaply, they are not as scalable as NUMA machines due to differences in topology. Aside from these differences, neither COMA nor NUMA showed any distinct advantage over the other. They simply represent different design philosophies, each with its own tradeoffs.
NUMA and COMA have both received a considerable amount of research attention, yet UMA seems to have had very little. The only information we could find on this architecture was the commercial information from the Tera Computer Company. Perhaps this is because UMA was a sort of introductory architecture which spawned NUMA and then COMA.
Overall, this survey proved very interesting and gave insight into how an architecture designer can approach the problem of designing a large, scalable multiprocessing machine. There are different options to consider, and the choices depend on the goal of the machine and what the designer has to work with. The studies researched for this survey have all shown that their designs provide feasible, real-world solutions for multiprocessing systems.