Nov 20, 2023

The X=X test program

William Collier
collier@acm.org

Abstract

The X=X program is a new program to stress test a shared memory multiprocessor (SMMP). Multiple threads execute the same code. Each thread reads from and writes into the shared code. Each such action of each thread is a NOP. However, the interaction among the threads may reveal real problems on real machines.

1. The purpose of the X=X test program.

The X=X test program tries to uncover problems on a SMMP. One group ran it on an IBM Z16 machine (1,000,000,000 iterations over an hour and a half). The part of the machine running the X=X program ran slowly. No other effects were seen. No evidence exists at this time that the X=X program finds problems on real machines.

2. Version 1 of the X=X program.

There are several threads. Each thread has access to a shared data area (DA). All threads execute the same code in a loop. In the loop each thread randomly selects a portion of the DA, reads the data in the portion, and writes the data back into the portion from which it was just read. This is a NOP. And so logically the entire program is a NOP. The machine does not know it is performing NOPs, and therefore it performs all the checks it performs when threads try for real to read and write the same data at the same time. Probably no machine in today's world would fail Version 1 of the X=X test.

3. Version 2 of the X=X program.

Version 2 differs from Version 1 in only one regard. The DA is the area of memory occupied by the program being executed. Each thread reads and writes into its own instruction stream. Each thread experiences other threads reading and writing into its instruction stream. Logically, the program is still a giant NOP, and so, logically, it should run with no problem.

4. The data cache and the instruction cache.

Each processor in a SMMP has a bit of local memory, called a data cache, associated with it.
The processor saves in its data cache the operands which the processor has recently accessed and the operands which the processor thinks the thread will soon access.

Each processor has another bit of local memory, called an instruction cache, associated with it. The processor saves in its instruction cache the instructions which the processor is currently executing and the instructions the processor thinks the thread may soon execute.

In the X=X test program, threads read and write words in the X=X program which the thread is executing. Thus words may simultaneously be in both the data cache and the instruction cache for the thread. This is unusual. Problems may be revealed.

5. Should the X=X program be written in basic assembler language (BAL) or in a higher level language (HLL)?

It is straightforward to write it in BAL, but the result is good for evaluating only a specific machine. If written in a HLL, there are problems to deal with.

a. Find the beginning and the end of the program.
b. Ensure that no one but the X=X program changes the code in the X=X program.
c. Prevent the compiler from recognizing that the program consists of nothing but NOPs.

Composite approach: a HLL program creates threads and does other housekeeping, but then calls subroutines, some of which are HLL and others are BAL.

To be continued.

Please share these notes with anyone you think might find them interesting.

--------------------------------------------------

Bill Collier was a programmer in IBM Poughkeepsie from 1960 to 1993. He wrote "Reasoning About Parallel Architectures" (Prentice-Hall 1992). He holds degrees from Harvard and Syracuse University in math and computer science. He joined ACM in 1960.

7. Write in assembler or in a higher level language?

It is not clear which choice is preferable. If one writes the X=X program in BAL, one has complete control over exactly which areas of memory are copied to themselves and over the order in which the copies are performed.
Unfortunately in this approach the program has to be written afresh in a different BAL for each different machine type to be tested.

If one writes the X=X program in a HLL, the program can be easily compiled and run on different machines. However, compilers for HLL sometimes rearrange code for the presumed benefit of programmers, and compilers for different machines may differ, in ways which turn out to be significant, in the code they generate. Thus the compiler itself may contribute to results which can be erroneously interpreted as evidence of machine malfunction.

For a quick and dirty start to machine testing, consider coding the X=X program in C/C++. The variables in the DA which are being copied to themselves are declared volatile. This should keep the compiler from removing or reordering the accesses to these variables, although it places no constraint on reordering by the hardware. The variables themselves can be declared in a union of the form:

    union ab {
        volatile int a[1000];
        volatile int b[1000];
        volatile int c[1000];
    } *ab_ptr;

Because a, b, and c are members of the same union, they occupy the same storage, and so copying b[i] to a[i] copies a word to itself.

In Version 1 of the X=X program, the X=X program uses malloc() to get an area of storage. Then ab_ptr is set to address the beginning of the malloc() area.

In Version 2 of the X=X program, the X=X program (somehow) gets the address of the beginning of the X=X program and sets ab_ptr to that address.

Here is a sample X=X program. In both Version 1 and Version 2 of the X=X program, there are multiple threads, all simultaneously executing these statements:

    int i;
    int temp[12] = {5,7,9,8,6,4,5,7,9,8,6,4};

    ab_ptr->a[17] = ab_ptr->b[17];                                        // nop
    for (i=0;i<1000;i++) ab_ptr->a[i] = ab_ptr->b[i];                     // nop
    for (i=220;i<800;i++) ab_ptr->c[i] = ab_ptr->b[i];                    // nop
    for (i=800;i>220;i--) ab_ptr->c[i] = ab_ptr->b[i];                    // nop
    for (i=0;i<1000;i+=3) ab_ptr->a[i] = ab_ptr->c[i];                    // nop
    for (i=0;i<1000;i++) ab_ptr->a[i] = ab_ptr->b[i] | ab_ptr->c[i];      // nop, since x | x == x
    for (i = 0; i < 12; i++) ab_ptr->a[temp[i]] = ab_ptr->b[temp[i]];     // nop

Note: In the last 24 hours I have spent time with two people who know far more about C/C++ than I do.
We could not agree on the language needed to accomplish the goals stated above.

8. Version 3 of the X=X program.

One way a SMMP can deal with the contention created by the X=X program is, I am told, for the machine to detect that a thread is seeking to modify the code it is executing and then to enter a super-cautious mode of execution where operations are performed only one, very careful, step at a time.

Here is a way for the program to avoid triggering super-cautious mode while still provoking a lot of the contention of Version 2 of the X=X program. Create two data areas, DA1 and DA2. Each contains a copy of the program being executed. Divide the threads into two groups, the even threads and the odd threads. The even threads execute the code in DA2, but access the code/data in DA1. The odd threads execute the code in DA1, but access the code/data in DA2. Thus no thread reads or writes its own code, and so no thread triggers super-cautious mode, but each thread does read and write the code of other threads.
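
The even/odd division of threads in Version 3 can be sketched in C. This is only a rough model under several assumptions of mine, not a real Version 3: portable C cannot place a copy of the executing program in DA1 and DA2, so here both areas are ordinary data buffers, and only the access pattern (even threads touch DA1, odd threads touch DA2) is modeled. The names run_xx, worker, da1, da2, DA_WORDS, and MAX_THREADS are hypothetical. The sketch uses POSIX threads and C11 relaxed atomics in place of volatile, so that the simultaneous self-copies are well defined in C; that is a substitution, not something the notes above prescribe.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define DA_WORDS    1000   /* words per data area (assumed size) */
#define MAX_THREADS 64     /* arbitrary upper bound for this sketch */

/* Stand-ins for DA1 and DA2. In a real Version 3 these would hold copies
 * of the program itself; here they are plain data (an assumption). */
static _Atomic int da1[DA_WORDS];
static _Atomic int da2[DA_WORDS];

struct worker_arg { int id; int iters; };

/* Each thread repeatedly picks a word and copies it to itself (X = X). */
static void *worker(void *p)
{
    struct worker_arg *w = p;
    /* Even threads access DA1, odd threads access DA2. */
    _Atomic int *da = (w->id % 2 == 0) ? da1 : da2;
    unsigned int seed = 12345u + (unsigned int)w->id;

    for (int n = 0; n < w->iters; n++) {
        /* Small linear congruential generator to choose the word. */
        seed = seed * 1103515245u + 12345u;
        int i = (int)((seed >> 16) % DA_WORDS);
        int v = atomic_load_explicit(&da[i], memory_order_relaxed);
        atomic_store_explicit(&da[i], v, memory_order_relaxed);   /* nop */
    }
    return NULL;
}

/* Runs nthreads workers for iters self-copies each. Returns the number of
 * words whose contents changed, or -1 if nthreads is out of range. */
int run_xx(int nthreads, int iters)
{
    pthread_t tid[MAX_THREADS];
    struct worker_arg arg[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    /* Fill both areas with known patterns so any change can be detected. */
    for (int i = 0; i < DA_WORDS; i++) {
        atomic_store(&da1[i], i);
        atomic_store(&da2[i], ~i);
    }
    for (int t = 0; t < nthreads; t++) {
        arg[t].id = t;
        arg[t].iters = iters;
        int rc = pthread_create(&tid[t], NULL, worker, &arg[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);

    /* Since every store wrote back the value just read, nothing should differ. */
    int bad = 0;
    for (int i = 0; i < DA_WORDS; i++) {
        if (atomic_load(&da1[i]) != i)  bad++;
        if (atomic_load(&da2[i]) != ~i) bad++;
    }
    return bad;
}
```

A call such as run_xx(4, 100000) starts four threads, two on each area. On a correctly functioning machine it should return 0; a nonzero result would be the kind of evidence the X=X program is looking for. With all threads directed at a single area, the same harness would be a model of Version 1.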