Non-blocking

To reap the benefits of parallelism with multi-core and multi-processor machines, and thus achieve higher perceived performance, a program must obviously be designed so that it can be divided into pieces that can run more or less independently on each logical processor. This can be done in several ways. One way is to use a distributed model similar to that used for computer clusters, where the program is divided into separate parts that regularly send messages to each other. However, this can be inefficient, especially if messages must be sent often. More efficient, and perhaps also more intuitive, is to let the program parts communicate through common shared variables. This model fits especially well on the new processors, which have several logical processors directly connected to a common shared working memory. It is within this model that multi-threaded programming fits. Multi-threaded programming has previously been used mostly for structuring programs in a manageable way when there are several tasks that should be performed regularly. Now, it becomes more or less necessary to design programs multi-threaded in order to get increased performance.
 
Parallel programming with common shared data
Let us now describe more concretely what type of problems can show up with this type of parallel programming. Assume that we have a number of threads, i.e., different program pieces that can run in parallel, and that these threads want to increment a common shared counter. This counter we have defined as a global variable:
									
									int volatile counter=0; 
									
To increment the counter, we try to do it as usual; every thread contains a program line as follows:
									
									counter++; 
This can seem perfectly OK, but what really happens when we execute our multi-threaded program and two or more "counter++" are executed simultaneously? Unfortunately, there is a great risk that the counter gets the wrong value, as can be seen in the figure below.
									 
The reason for this is that "counter++" is normally not made up of just one machine instruction, and that operations on the working memory are normally done step by step, with either a read or a write, not both. In the case of incrementing the counter, this is done in three steps: first the variable's current value is read, then the value is incremented by one within the processor's arithmetic unit, and finally the new value is written back to the variable in memory. With several simultaneous executions of "counter++", the internal steps towards the memory can thus be interleaved in such an unfortunate way that the computation becomes wrong.
There have of course been solutions to this problem for a long time. What we want to achieve is that "counter++" runs in one single pass like a transaction, without being interrupted by anything else once the execution of "counter++" has started.
The standard solution for handling critical regions, of which "counter++" is an example in this case, is to use some kind of locking mechanism. The system most often provides locking mechanisms of the "mutex" or "semaphore" type. The greatest difference between a "mutex" and a "semaphore" is normally that a "semaphore" can handle longer waiting periods on a lock in a more efficient way (it simply lets the operating system run another thread while the lock is busy, after having waited a certain while for it to become available). With a lock surrounding "counter++", the increment will work painlessly, as is seen in the figure below.
									 
Now, another problem shows up: what happened to the parallelism? As the purpose of the lock was to stop other threads from changing the variable while we are working on it, the result is that only one thread at a time is doing something. This was no problem with a processor that can only execute one thread at a time, but with a processor with several logical processors this means poor utilization. Moreover, operations on a "mutex" or "semaphore" are relatively expensive in themselves (especially compared with a simple increment of a variable). As locking mechanisms by definition block other threads from running, they can also cause several other problems which are of special importance for time-critical systems. Among these problems are difficulties with safely guaranteeing deadlines (as it is hard to guarantee exactly when a certain lock gets released and becomes available) and avoiding the risk of deadlocks (if the lock is never released).

There are solutions without locks
If we recall our example with "counter++", what we really wanted was to be able to run all the steps of the increment as one single uninterrupted, atomic unit. Fortunately, there is hardware support for doing shorter atomic steps. All modern processors have special machine-code instructions that can perform a read directly followed by a write to memory as one single unit. This type of operation is called atomic and runs like a transaction: either completely, with all steps, or not at all. The most powerful of these atomic instructions is called "compare-and-swap" (CAS), and it is available directly or indirectly in most processors and systems. See the code below for a semantic description of this operation (observe that this is not the actual code; the real implementation is in hardware and/or system libraries).
									
int CAS(int *p, int old, int new)
{
    atomic {
        if (*p == old) {
            *p = new;
            return 1;
        }
        else
            return 0;
    }
}
The operation is mostly used in combination with an earlier read of the variable's value, and makes sure that no one has changed the value between the point where it was read and the point where the newly calculated value is written back.

Larger operations can also be done atomically
The fundamental atomic operations available in hardware are relatively limited; e.g., they can only operate on one single memory word at a time and have limited semantics. By combining several atomic operations, one can nevertheless write programs that can actually replace critical regions of larger size and complexity. However, this is not trivial; it can be seen as an advanced mathematical puzzle where every piece is simple and limited. What makes it so complex is that every single line of code in the program can be executed simultaneously with any other line, or be interrupted for an indefinitely long time. One thus has no clue what happens between every line of code in the program, and must be able to handle this and adjust for possible simultaneous changes to the variables concerned. To reconnect to our problem with "counter++", we will now design a program that can increase a variable by an arbitrary value atomically.
									  
									
int FAA(int volatile *p, int value)
{
    int old;
    do {
        old = *p;
    } while (!CAS(p, old, old + value));
    return old;
}
 
This function first reads the old value of the variable, then calculates the new value, and finally uses CAS to make sure that the variable has not been changed before it gets updated with the new value. If the value has been changed after this function read it, the CAS will fail and the function will repeat. Thus, we can now use "FAA(&counter,1)" and in this way make "counter++" atomic, without any locking mechanism. Now, one might ask whether this function will always succeed for every thread, or whether there is any possibility that some thread spins forever in the loop. In theory this is possible, although in practice it is clearly very unlikely under normal loads.