May 27, 2025

Predictions About Computer Memory Models

William Collier
collier@acm.org

Outline: It may help you to follow the talk if you read and print out these notes beforehand.

1. Memory models (MM) will continue to evolve. Sequential Consistency (SC) (Lamport, 1979 [8]) says: do not visibly violate Program Order (PO). SCII (RAPA Plus): obey the rules of computation (CMP), program order (PO), and write atomicity (WA) [2]. Show how to unrelax relaxable rules. Scheurich's problem. Memory model by fiat [9].

2. Single copy write atomicity (SCWA) will be seen as simply part of the CMP rule. "Data-race free" will disappear from architecture manuals.

3. Long thread tests (LTT's) will almost entirely replace short thread tests (STT's).

4. Indistinguishable memory models. The fundamental goal of memory models. The WSisWA theorem.

5. The X=X program. What happens when two threads write into each other's instruction stream?

6. Making new entries in www.oeis.org be tables will result in finding many new integer sequences. This has nothing to do with memory models.

7. Work with Bill Rubin to see structures in LTT output.

==============================================

The memory model (MM) for a shared memory multiprocessor (SMMP) defines the SMMP's behavior when two or more processors read and/or write the same operands at the same time.

Some memory models require that all instructions be executed in the order defined by the program they are contained in. This is called the rule of program order (PO). This form of the rule is called "strong behavior." Other memory models allow "relaxed behavior"; that is, they allow some instructions to be executed out of order in order to increase performance.

Some memory models require that every store operation appear to become visible to every part of a machine at the same instant. This is the rule of multiple copy write atomicity (MCWA). This form of the rule is called "strong behavior."
Other memory models allow "relaxed behavior"; that is, they allow a change in the value of an operand to become visible at different times to different parts of the system in order to increase performance.

The term 'computer architecture' for a computer is used here to mean the programming interface for the computer. Out of habit I use the two phrases interchangeably.

The first memory model, called Sequential Consistency (SC), was defined by Leslie Lamport as requiring that violations of the rule of program order do not become visible: "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [8]

In the 1970's, efforts to understand the effects of multiprocessing led to many small test programs being created. The point of such tests was to show that if a program calculated a certain set of answers, then it was possible to deduce that certain events had happened before certain other events. In some cases it became possible to see events happening out of order or nonatomically. Collections of these programs can be found in [2, 10].

Here is a test program created by Ron Smith. The goal was to show that if read events were allowed to occur before logically preceding write events, then this behavior could become visible to users.

Execution 1. Initially, A = B = U = V = 0.

    P0            P1
    L0: A = 1;    L0: B = 1;
    L1: U = B;    L1: V = A;

Terminally, A = B = 1, U = V = 0.

Call Execution 1 a short thread test (STT). The logic showing the existence of a circuit in the data is:

    the write of A in A=1 comes before the read of B in U=B by program order.
    the read of B in U=B comes before the write into B in B=1 by the rule of computation.
    the write of B in B=1 comes before the read of A in V=A by program order.
    the read of A in V=A comes before the write of A in A=1 by the rule of computation.
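The circuit argument above can be mechanized. Below is a minimal sketch of my own (not from RAPA): each ordering claim becomes a directed edge, and a cycle among the edges shows that one of the rules used to justify them must have been visibly violated. The node labels and the `has_cycle` helper are my inventions for illustration.

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as (src, dst) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                return True          # back edge: cycle found
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

# The four ordering claims for the outcome U = V = 0 of Execution 1:
edges = [
    ("write A (A=1)", "read B (U=B)"),   # program order on P0
    ("read B (U=B)",  "write B (B=1)"),  # CMP: U=0, so the read missed B=1
    ("write B (B=1)", "read A (V=A)"),   # program order on P1
    ("read A (V=A)",  "write A (A=1)"),  # CMP: V=0, so the read missed A=1
]
print(has_cycle(edges))  # True: the circuit exists, so some rule was relaxed
```

The same detector applies unchanged to the edges of any other short or long thread test.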
Note: the existence of the circuit reveals weak behavior of the machine, since one of the rules used in constructing the circuit must have been violated.

In [2] I speculated that Smith's STT above could be expanded into a long thread test (LTT) of the following form.

Execution 2. Initially, A = B = U[i] = V[i] = 0, 0 <= i <= N.

    P0                P1
    L11: A = 1;       L11: B = 1;
    L12: U[1] = B;    L12: V[1] = A;
    L21: A = 2;       L21: B = 2;
    L22: U[2] = B;    L22: V[2] = A;
    L31: A = 3;       L31: B = 3;
    L32: U[3] = B;    L32: V[3] = A;
    . . .             . . .
    LN1: A = N;       LN1: B = N;
    LN2: U[N] = B;    LN2: V[N] = A;

Seek, for example, U[3] = 2 and V[3] = 2, to see that a read comes before a write. The logic showing the existence of a circuit in the data is:

    the write of A in A=3 comes before the read of B in U[3]=B by program order.
    the read of B in U[3]=B comes before the write into B in B=3 by the rule of computation.
    the write of B in B=3 comes before the read of A in V[3]=A by program order.
    the read of A in V[3]=A comes before the write of A in A=3 by the rule of computation.

After [2] came out, the program in Execution 2 was run, and its output data confirmed that, as the System/390 architecture allowed, read operations could indeed be executed before logically preceding write operations. See Variation 1, p. 221, of RAPA. See Test 4, p. 224, of RAPA. See Test T400 in http://www.mpdiag.com/analysis.html#t04.

Execution 3. Scheurich's execution (1987): Initially, {A,B,X,Y} = {0,0,0,0}.

    P1           P2           P3
    L1: A = 1;   L1: B = A;   L1: X = B;
                              L2: Y = A;

Terminally, {A,B,X,Y} = {1,1,1,0}.

The logic for Execution 3 is:

    the write of A in A=1 comes before the read of A in B=A by the rule of computation.
    the read of A in B=A comes before the write of B in B=A by the rule of computation.
    the write of B in B=A comes before the read of B in X=B by the rule of computation.
    the read of B in X=B comes before the read of A in Y=A by program order.
    the read of A in Y=A comes before the write of A in A=1 by the rule of computation.
    the write of A in A=1 comes before the read of A in B=A by the rule of computation.

It appears that the execution violates either program order (P3 fetches A well before it fetches B) or write atomicity (P3 sees the old value of A well after it has seen the new value of A, via operand B).

Intel's response to Scheurich's execution at one point was to declare that it was impossible for it to occur [6]. This is memory modeling by fiat. It would be easy to laugh at this, but to do so would be to ignore the fact that the folks who wrote the fiat knew something which they could not express rationally. Perhaps it is something like this: signals take time to propagate. A memory model should allow one to express durations. Such a model would allow one to explain that P3 cannot see the new value of A and then, at a later time, still see the old value of A, in the way described in Execution 3 above. Hopefully this new memory model would still be simple enough for programmers to use in writing shared memory routines.

==============================================

Sequential consistency provided a valuable guide for writing SMMP programs for many years. However, it can now be seen to be missing vital elements. When SC was first formulated, the rule of computation seemed so obvious that it was not thought necessary to state it formally. But, as noted in [2], the events of an execution can always be arranged to obey any set of rules other than the rule of computation. In other words, without using the rule of computation, it is impossible to deduce the existence of a circuit among the events of an execution.

I propose that the definition of SC should be updated as follows:

The rule of computation (CMP). Strong form only. I do not believe that any weak form of CMP can be permitted.

The rule of program order (PO). PO in its strong form, or PO in a weak form plus fences to enable programmers to achieve strong forms of PO where needed.

The rule of write atomicity (WA).
WA in its strong form, or WA in a weak form plus atomic instructions to enable programmers to achieve strong forms of WA where needed.

Call this form of SC, SCII.

==============================================

From the short thread tests which were created at this time it became evident that the phrase "write atomicity" was used to describe two very different behaviors of a machine.

1. Single copy write atomicity (SCWA). If two threads both write to an operand, or one reads the operand and the other writes into it, the resultant value(s) should be one of the original values (not a value which is partly created from one operand and partly from another).

2. Multiple copy write atomicity (MCWA). If several copies of an operand exist in a system (say in distinct copies in several caches), then, if one copy changes value, all of the copies (appear to) change value at the same instant. We have already seen the MCWA rule in the discussion of the Scheurich execution above.

==============================================

Single copy write atomicity (SCWA) and data race freedom (Adve [1]).

The rules of computation require that for Execution 4 the terminal value of A can be either 1 or 2 and no other value:

Execution 4. Initially, A = 0.

    T0           T1
    L1: A = 1;   L2: A = 2;

Terminally, A can be either 1 or 2.

and for Execution 5 the terminal value of A can be either 0 or 1 and no other value:

Execution 5. Initially, A = X = 0.

    T0           T1
    L1: A = X;   L2: X = 1;

Terminally, A can be either 0 or 1.

These rules ensure that the system appears to be data race free.

How valuable is SCWA as a rule for reasoning about the behavior of threads? If a program needs to change a large object which can also be changed by other programs at the same time, then there is no question that a lock needs to be set to allow only one program at a time to access the object. If the object is small and can be changed by an atomic instruction, then the atomic instruction should be used to modify the object.
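The terminal-value claims for Executions 4 and 5 can be checked by brute force. The following is a small sketch of my own, not from the text: it enumerates every sequentially consistent interleaving of the two one-statement threads, treating each statement as atomic (i.e., assuming SCWA holds), and collects the terminal values of A. The helper `terminal_values` and the lambda encodings are my inventions for illustration.

```python
from itertools import permutations

def terminal_values(stmts, var="A"):
    """Run every interleaving of single-statement threads; return the set
    of terminal values of `var`. Each statement mutates a shared dict.
    (Valid only because each thread here has exactly one statement.)"""
    results = set()
    for order in permutations(range(len(stmts))):
        mem = {"A": 0, "X": 0}       # initial values from the executions
        for idx in order:
            stmts[idx](mem)
        results.add(mem[var])
    return results

# Execution 4: T0 does A = 1; T1 does A = 2.
ex4 = [lambda m: m.update(A=1), lambda m: m.update(A=2)]
print(terminal_values(ex4))   # {1, 2}: terminal A = 0 is impossible

# Execution 5: T0 does A = X; T1 does X = 1.
ex5 = [lambda m: m.update(A=m["X"]), lambda m: m.update(X=1)]
print(terminal_values(ex5))   # {0, 1}
```

Any other terminal value (a "torn" value built partly from each write) would be an SCWA violation.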
If the object is small and efforts to change it using nonatomic instructions produce values other than those allowed in Executions 4 and 5 above, the machine does not compute, and engineers should be asked to fix it. This may be a point of some contention.

==============================================

Here is an argument that data race free behavior can be achieved purely via programming means, that is, without any support from system hardware. Dijkstra [5] showed that the function of Test and Set can be achieved using only nonatomic BAL instructions. He used this capability to guard entrance to a critical section. Ron Smith and I [3] sharpened Dijkstra's result by requiring that all of the operands used in controlling access to a critical section be only one bit long. Since they are only one bit long, interactions on these operands are data race free. And there will be only one thread in the critical section at any one time, so its operations will also be data race free.

==============================================

3. Long thread tests will almost entirely replace short thread tests.

The STT's in [2, 10] will be recast as LTT's. In testing CPU-GPU interactions, Srivastava [11] found that system setup time in running one LTT was reduced by a factor of 15,000 over running a long sequence of STT's. (But note that there may be some places where a STT cannot be recast as a LTT.)

==============================================

4. Indistinguishable memory models.

Suppose that MM1 and MM2 are two memory models. Suppose further that any result which can be calculated on MM1 can also be calculated on MM2, and vice versa. Then we say that MM1 and MM2 are indistinguishable. The fundamental goal of computer memory models is to identify indistinguishable memory models. A system can then be presented to customers as embodying the best qualities of both MM's, and no program can be written which reveals the deception. Here are two examples.

Example 1.
If I remember correctly, on the early 360 machines the architecture required that (it appeared that) operands were fetched a word at a time. The rule improved the performance of large machines while hurting the performance of small machines. Then someone realized that the rule could be changed to say that operands were fetched a byte at a time. The small machines would be faster. The large machines could continue to fetch operands a word at a time, but no program could detect the violation of the memory model.

Example 2.

Let MM1 be the MM which obeys this rule: every pair of threads sees all changes in the values of operands in the same order. Call a MM which obeys this rule Write Synchronized (WS). Let MM2 be the MM which obeys this rule: every change in the value of an operand becomes visible (or appears to become visible) to all threads in the system at the same instant. Call a MM which obeys this rule Write Atomic (WA). In RAPA I proved the WSisWA theorem: if a system obeys the rule of WS, it obeys (or appears to obey) the rule of WA.

I submitted an invention disclosure on the WSisWA theorem. It was rated publish. After RAPA came out, I was told that if it had been rated file, the invention would have been worth as much to IBM as the channel patent was. Maybe. Maybe not. Achieving write atomicity is expensive, no matter how it is done.

Note. I recently took another look at the proof of the theorem. I believe the theorem is true, but the proof may need some work.

==============================================

5. The X=X program.

Is it reasonable to test a machine to see what it does when told to divide a number by zero? Of course it is. You need to know that the machine deals gracefully with a potential error situation. Is it reasonable to test a machine to see what it does when a thread modifies its own instruction stream? Yes, this behavior is explicitly allowed (on at least some machines).
Is it reasonable to test a machine to see what it does when two threads each modify the other's instruction stream? Presumably this behavior is not allowed. It is still valid to test the machine to see that it handles the situation gracefully, the same as in the divide by zero test.

If a machine stores into an operand exactly the value which was already in the operand, it has performed a nop-store operation. To software a nop-store appears to be a nop. To hardware it appears to be a store operation. Is it fair to test a machine to see how it handles two threads each doing nop-stores into the other's instruction stream? Of course it is fair, just as it is fair to test the machine for its handling of a divide by zero operation. For an additional dollop of piquancy in the test programs, the instruction streams might contain self-modifying instructions, such as the 370 instructions OI *,1 and XI *,1, which self-modify into each other.

==============================================

6. The Online Encyclopedia of Integer Sequences (OEIS).

This discussion has nothing to do with memory models. In 2018 Jeffery, et al. [7] created the following table. I independently created the same table a few weeks later [4].

    i\j | 0  1    2     3      4       5        6         7          8
    ----+---------------------------------------------------------------
      0 | 2  2    2     2      2       2        2         2          2
      1 | 2  3    7    18     47     123      322       843       2207
      2 | 2  4   14    52    194     724     2702     10084      37634
      3 | 2  5   23   110    527    2525    12098     57965     277727
      4 | 2  6   34   198   1154    6726    39202    228486    1331714
      5 | 2  7   47   322   2207   15127   103682    710647    4870847
      6 | 2  8   62   488   3842   30248   238142   1874888   14760962
      7 | 2  9   79   702   6239   55449   492802   4379769   38925119
      8 | 2 10   98   970   9602   95050   940898   9313930   92198402
      9 | 2 11  119  1298  14159  154451  1684802  18378371  200477279

Let a[i][j] be the element in row i and column j. Let the table be initialized as follows:

    a[i][0] = 2, i >= 0.
    a[i][1] = i + 2, i >= 0.

Then the values in each row are given by:

    a[i][j] = a[i][1] * a[i][j-1] - a[i][j-2], i >= 0, j >= 2.
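The row recurrence can be verified mechanically against the table. This is a sketch of my own; it also checks the column polynomials cited in the next section, using per-sequence index offsets (n = i + 2 for columns 2 and 5, n = i for column 3) that I inferred by matching the table values, not offsets stated in the text.

```python
def build_table(rows=10, cols=9):
    """a[i][0] = 2, a[i][1] = i + 2, a[i][j] = a[i][1]*a[i][j-1] - a[i][j-2]."""
    table = []
    for i in range(rows):
        row = [2, i + 2]
        for j in range(2, cols):
            row.append(row[1] * row[-1] - row[-2])
        table.append(row)
    return table

t = build_table()
print(t[1])      # [2, 3, 7, 18, 47, 123, 322, 843, 2207] -- matches row 1
print(t[9][8])   # 200477279 -- matches the last entry shown

# Column checks against the quoted polynomials; the offsets below are my
# assumption, chosen to match the table and the cited OEIS entries.
for i in range(10):
    n = i + 2
    assert t[i][2] == n**2 - 2                  # A008865
    assert t[i][5] == n**5 - 5*n**3 + 5*n       # A230586
    m = i
    assert t[i][3] == m**3 + 6*m**2 + 9*m + 2   # A058794
print("column polynomials check out")
```

Note that the index offset differs from column to column, following each OEIS entry's own offset convention.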
The values C[i][j] in each column are given by (for n > 0):

    C[0][n] = 2.
    C[1][n] = n + 2.
    C[2][n] = n^2 - 2, from A008865.
    C[3][n] = n^3 + 6*n^2 + 9*n + 2, from A058794.
    C[4][n] = ???
    C[5][n] = n^5 - 5*n^3 + 5*n, from A230586.
    C[6][n] = ???

The point is that the columns are defined by many rules, not just by the above polynomials. You can follow the links to my website to see some of the rules. I would really like to work with someone who'd like to take a look at this.

=============================================

Ideas not covered elsewhere.

Are there forms of data races other than those described in Executions 4 and 5 above?

======================

Single copy write atomicity (SCWA) will be seen as simply part of the CMP rule. "Data-race free" will disappear from architecture manuals.

======================

A test program cannot prove that a machine obeys a rule of architecture. The only thing a test program can do is to show that a machine fails to obey a rule of architecture.

======================

I used the BXLE instruction to test and advance a bit string, saving many bytes in an operating system which had to run in 2K bytes. 1965.

======================

I defined graph_sets and the reification of nondeterminism in order to prove the WSisWA theorem. Nondeterminism disappears. One is left instead to deal only with a very large set of chromatic graphs. 1987.

==============================================

7. Structure in the output from long thread tests.

I have talked about the circuits which Bill Rubin and I see in the output from the LTT's. We think we see more structure in the output data. We plan to share results later this summer. I have two hopes for this work. One, I hope it will enable people to see a little more deeply into the internal workings of a SMMP. Two, I hope the work will grow into a small but interesting subfield of number theory.

==============================================

References

1. S. Adve.
Designing Memory Consistency Models For Shared-Memory Multiprocessors (PhD thesis). 1993.

2. W. W. Collier. Reasoning About Parallel Architectures, Prentice-Hall, 1992.

3. W. W. Collier and R. M. Smith. Interactive Tie-Breaking System. 1992. https://patentimages.storage.googleapis.com/eb/52/ec/67e872a2ce4625/US3676860.pdf

4. W. W. Collier. Entry 299741 in the Online Encyclopedia of Integer Sequences. 2018. https://oeis.org/a299741.

5. E. W. Dijkstra. Solution of a problem in concurrent programming control. Communications of the ACM, Volume 8, Issue 9, Page 569. 1965. https://dl.acm.org/doi/pdf/10.1145/365559.365617

6. Intel. At one point, on page 11 in section 2.5, "Stores are transitively visible", of the document http://developer.intel.com/products/processor/manuals/318147.pdf, Intel explicitly forbade Scheurich's test case from happening.

7. L. E. Jeffery, B. Selcoe, and A. Hone. Entry 298675 in the Online Encyclopedia of Integer Sequences. 2018. https://oeis.org/a298675.

8. L. Lamport. "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", IEEE Trans. Comput. C-28, 9 (Sept. 1979), 690-691.

9. M. Dubois, C. Scheurich, and F. A. Briggs. Memory Access Buffering in Multiprocessors. Proceedings of the Thirteenth Annual International Symposium on Computer Architecture, 14:2 (1986), 434-442.

10. P. Sewell. ARM Litmus Tests. https://www.cl.cam.ac.uk/~pes20/arm-supplemental/

11. S. Srivastava. Testing Memory Models of Heterogeneous CPU-GPU Systems, Master's thesis, University of California, Santa Cruz, June 2024. See https://escholarship.org/uc/item/5hm7q1jv and https://escholarship.org/content/qt5hm7q1jv/qt5hm7q1jv.pdf?t=sgthsf

==============================================