Computer scientists and engineers have long understood – almost since the inception of computer science as a discipline – that one route to accelerating the execution of a computational task is to exploit the parallelism inherent in its program flow.
If it were possible to identify those aspects of a program that can be executed in parallel, re-organise the program flow accordingly, and provide hardware to host each parallel fragment, the overall task should run to completion in greatly reduced time.
In practice, the problem has proved, over time, to be relatively intractable. For a considerable period, finding a solution was taken off the critical path to improving performance, as the semiconductor industry continued to deliver rapidly increasing processing power through ever-faster CPU clock rates. That era has now come to an end, and microprocessor designers have turned to multicore architectures to increase throughput further. The availability of a suitable hardware platform on which to run parallel-structured tasks once again focuses attention on the desirability of parallelising program flows.
The feasibility of taking a single sequential program and breaking it apart to run various pieces in parallel depends on whether those pieces rely on one another's results. This is determined by the dependencies within the program, and understanding such dependencies is critical to creating an effective parallel implementation that is functionally identical to the original sequential version.
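The contrast can be illustrated with a minimal sketch (the function names and data are illustrative, not from the original text): a loop whose iterations are mutually independent can be distributed across workers, whereas a loop whose each iteration consumes the previous iteration's result cannot simply be split apart.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

data = [1, 2, 3, 4]

# Independent iterations: each result depends only on its own input,
# so the work can be divided among parallel workers.
with ThreadPoolExecutor() as pool:
    squares = list(pool.map(square, data))

# Loop-carried dependency: each iteration needs the total produced by
# the previous one (a running prefix sum), so the iterations cannot
# naively run in parallel without restructuring the computation.
prefix = []
total = 0
for x in data:
    total += x
    prefix.append(total)
```

In the first loop the order of evaluation is immaterial; in the second, the dependency chain through `total` imposes a strict sequential order, which is exactly the kind of constraint a dependency analysis must expose.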