You cannot probe wires or set breakpoints inside a For Loop when iteration parallelism is enabled. To debug the loop temporarily, check the Allow debugging box; the iterations of the loop will execute serially, but the (P) and (C) terminals will remain on the loop. Clear the checkbox when you are finished debugging to re-enable parallel execution.
After you enable iteration parallelism on a For Loop and close the dialog, you can configure the loop further by wiring values to the (P) and (C) terminals.
The number of loop instances used is the minimum of the value entered in the dialog box and the run-time value specified at the (P) terminal. If you leave (P) unwired, the default run-time value is the number of logical processors on the machine; because this default works well for most applications, it is recommended that you leave (P) unwired. To specify a different number of loop instances to use at run time, wire a value to the (P) terminal. See Table 1 below for an explanation of how the number wired to (P) translates to the number of loop instances used at run time. The special cases for -1 and 0 are available in LabVIEW 2010 and later.
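As a rough sketch in a text-based language (not LabVIEW code), the rule above could be modeled as follows; the function name and arguments are hypothetical, and the special cases for -1 and 0 described in Table 1 are omitted:

```python
import os

def resolve_instance_count(dialog_value, p_terminal=None):
    """Hypothetical model of how the number of loop instances is chosen.

    dialog_value -- number of generated parallel instances from the dialog box
    p_terminal   -- run-time value wired to (P), or None if (P) is unwired
    """
    # Leaving (P) unwired defaults to the number of logical processors.
    runtime_value = p_terminal if p_terminal is not None else (os.cpu_count() or 1)
    # The number of instances actually used is the smaller of the two values.
    return min(dialog_value, runtime_value)

print(resolve_instance_count(8))        # e.g. 4 on a machine with 4 logical processors
print(resolve_instance_count(8, 2))     # 2: the run-time value is the limit
```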
If you choose the Specify partitioning with chunk size (C) terminal schedule, you must wire a chunk size to the (C) terminal. Consider the total number of iterations when selecting the chunk size. If the chunk size is too large, it will limit the amount of parallel work available. If the chunk size is too small, it will increase the amount of overhead incurred by requesting the chunks.
For finer control over the chunk sizes, you can wire an array of chunk sizes to the (C) terminal. For example, if you know that the first iterations of the loop take longer than the last iterations, you may want to create an array with small chunk sizes at the beginning to prevent the first chunks from containing too many long iterations and with large chunk sizes at the end to bundle the short iterations together. If you wire too many chunk sizes, LabVIEW ignores the extra values. If you wire too few chunk sizes, LabVIEW uses the last element in the array to determine the size of the remaining chunks of iterations.
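To make the chunking rules concrete, here is a small sketch in a text-based language; make_chunks is a hypothetical helper used only to illustrate how a single chunk size or an array of chunk sizes partitions the iterations, including the handling of too many or too few sizes:

```python
def make_chunks(total_iterations, chunk_sizes):
    """Partition iteration indices 0..total_iterations-1 into chunks."""
    if isinstance(chunk_sizes, int):
        chunk_sizes = [chunk_sizes]
    chunks, start, index = [], 0, 0
    while start < total_iterations:
        # Extra sizes are simply never reached; if the array runs out,
        # the last element determines the size of the remaining chunks.
        size = chunk_sizes[min(index, len(chunk_sizes) - 1)]
        chunks.append(list(range(start, min(start + size, total_iterations))))
        start += size
        index += 1
    return chunks

# Small chunks first (long iterations), larger chunks later (short iterations).
print(make_chunks(10, [1, 1, 2, 3]))
# [[0], [1], [2, 3], [4, 5, 6], [7, 8, 9]]
```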
When to Use Iteration Parallelism
In general, you should only use iteration parallelism when the loop would produce the same results regardless of the order in which the iterations are executed. If the computation in one iteration of the loop relies on a value computed in an earlier iteration, reordering the iterations may produce incorrect results. For example, if an element of an array is written on the i'th iteration and read on the (i+1)'th iteration, parallelizing the loop may cause the read to happen before the write, producing a different value.
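The distinction can be illustrated loosely in a text-based language (these loops are stand-ins, not LabVIEW code):

```python
# Not safe to reorder: each iteration reads the value written by the previous one.
a = [1, 0, 0, 0]
for i in range(1, len(a)):
    a[i] = a[i - 1] * 2          # cross-iteration dependence

# Safe to reorder: each iteration writes only its own element and
# reads nothing produced by other iterations.
b = [0] * 4
for i in range(len(b)):
    b[i] = i * i
```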
Iteration Dependence Analysis
One of the strengths of LabVIEW is that you do not need to analyze loops for dependencies yourself, since LabVIEW automatically determines whether there are dependencies between the iterations. When you enable iteration parallelism on a For Loop, LabVIEW analyzes the reads and writes to the data accessed in the loop to determine if the same data could be written on one iteration and read or written on another, creating a dependence.
When LabVIEW detects an iteration dependence, it breaks the VI and describes the problem in the Error List window. In addition, if the For Loop contains nodes that have side effects, the Error List window displays a warning. (You can configure LabVIEW to show warnings by default in the Debugging section of the Environment category in the Tools>>Options dialog.)
In LabVIEW, most For Loops that do not have shift registers are safe to parallelize. For example, the loop in Figure 3 reads the i'th value of array A and produces the i'th value of the result array on every iteration. These iterations can safely execute in any order.

Figure 3 Loop that can be parallelized
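A rough text-based analogue of the pattern in Figure 3 might look like the following sketch, where scale is a hypothetical per-element computation; because each output element depends only on the corresponding input element, the iterations can run in any order:

```python
from concurrent.futures import ProcessPoolExecutor
import math

def scale(x):
    # Any per-element computation; this one is just an example.
    return math.sqrt(x) * 2.0

if __name__ == "__main__":
    a = [float(i) for i in range(1000)]
    # Each element is processed independently across the worker pool.
    with ProcessPoolExecutor() as pool:
        result = list(pool.map(scale, a))
```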
Certain types of For Loops with shift registers are safe to parallelize. For example, you can rewrite the loop from Figure 3 using shift registers as shown in Figure 4. Each iteration of the loop replaces a different element of the result array. Since the iterations do not depend on values computed in other iterations, it is safe to execute the iterations in parallel.

Figure 4 Loop with shift registers that can be parallelized
Figure 5 shows another example of a parallelizable loop with shift registers. Ordinarily, this loop would not be safe to parallelize because each iteration adds to the result from the previous iteration. However, LabVIEW recognizes this pattern as a reduction and generates code to compute partial sums in parallel and to add the partial sums at the end. See the shipping example Parallel For Loop Reduction.vi for additional information on reductions.

Figure 5 Loop computing a reduction that can be parallelized
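The reduction pattern can be sketched in a text-based language as follows; the per-element work (a square) and the helper names are illustrative, and the sketch only mirrors the idea of combining partial sums, not how LabVIEW generates the code:

```python
from concurrent.futures import ProcessPoolExecutor
import os

def partial_sum(block):
    # Work performed by one parallel loop instance on its block of iterations.
    return sum(x * x for x in block)

if __name__ == "__main__":
    data = list(range(1_000_000))
    workers = os.cpu_count() or 4
    block_size = -(-len(data) // workers)          # ceiling division
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, blocks))
    total = sum(partials)                          # combine the partial sums at the end
```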
Possible Errors and Warnings
Table 2 lists the possible errors and warnings that are reported if the loop may not be safe to parallelize. If there is a shift register on the For Loop, the value must be an array where the iterations access different elements, or the value must be used in a recognized reduction. (When there is a dependence between iterations, consider using another technique for obtaining parallelism, such as pipelining.)
Additionally, the shift register cannot be stacked, and it must be initialized. A For Loop cannot be parallelized if it contains a conditional terminal, a feedback node, or a Boolean control with latching mechanical action.
If the For Loop contains a node that may have side effects, the Error List window will give you a warning. Examples of nodes with side effects include Local Variables and the Write to Text File function. When you see this warning, you should evaluate whether it is safe for your application to execute the operations out of order. For example, the order in which results are written to a file may or may not matter, depending on the application.
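A small sketch (not LabVIEW code) of why the ordering of side effects becomes non-deterministic; it appends log lines to an in-memory list instead of writing to a file, and the names are illustrative:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

lines = []

def log_iteration(i):
    time.sleep(random.random() * 0.01)            # iterations take varying amounts of time
    lines.append(f"iteration {i} finished")       # side effect: order depends on completion time

with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(log_iteration, range(8))             # all tasks complete before the pool closes

print("\n".join(lines))                           # the order can differ from run to run
```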
This feature is supported on desktop and real-time targets, but it is not supported on FPGA, PDA, Touch Panel, or embedded devices.
| Error or warning | Reported as |
| Array dependence between loop iterations | Error |
| Dependence between loop iterations | Error |
| Stacked shift register | Error |
| Uninitialized shift register | Error |
| Conditional terminal | Error |
| Feedback node | Error |
| Boolean control with latch mechanical action | Error |
| Node with side effects | Warning |
| Feature not supported on target | Error |
Table 2 All parallel For Loop errors and warnings
Due to floating-point rounding, performing operations in a different order can produce different results, though the differences are usually only in the lowest-order bits. See the article on precision for more information about how this can occur with iteration parallelism and also in sequential code.
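A deliberately exaggerated example of order-dependent floating-point addition (in practice the differences are usually far smaller):

```python
values = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((values[0] + values[1]) + values[2]) + values[3]   # 1.0: the first 1.0 is lost to rounding
reordered     = ((values[0] + values[2]) + values[1]) + values[3]   # 2.0: the cancellation happens first

print(left_to_right, reordered)
```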
Find Parallelizable Loops Tool
To easily identify opportunities for iteration parallelism, use the Find Parallelizable Loops tool (Tools>>Profile>>Find Parallelizable Loops…). The tool lists all For Loops in the hierarchy of the current VI or project, and marks them as Parallelizable, Possibly not parallelizable, or Not parallelizable. Figure 6 shows a screenshot of this tool.

Figure 6 Find Parallelizable Loops tool
Performance Tips
Iteration parallelism can achieve significant performance gains on multicore machines, as shown in the Performance Study section. However, to get the most out of this feature, there are some performance tips to keep in mind. This section gives a brief overview of the LabVIEW execution system and explains how to configure iteration parallelism for the best performance.
LabVIEW manages a pool of execution system threads for running sections of LabVIEW diagrams. (In LabVIEW 2010, the number of threads is at least the number of cores on your machine, and never fewer than four.) During a sequential phase of a LabVIEW application, some of these threads may be sleeping. When the amount of parallelism increases, LabVIEW must signal the operating system to wake up the idle threads. It is the operating system’s responsibility to resume execution of these idle threads and to preemptively share the available processors with the other threads in the system.
Within each execution system thread, LabVIEW cooperatively multitasks among sections of code called clumps. The LabVIEW compiler creates clumps by determining which sections of your diagram can run in parallel with other sections. At execution time, clumps periodically yield to the scheduler to give other clumps that may be waiting a chance to run. If another clump is waiting, the scheduler pauses the currently running clump and executes the waiting clump. If nothing is waiting, the clump continues running.
With iteration parallelism, the compiler puts each parallel loop instance into its own clump and generates code in each loop instance for getting the next chunk of iterations. This generated code, together with the execution system scheduling, adds a small amount of overhead when iteration parallelism is enabled. Thus, loops that perform a trivial amount of computation will likely not benefit from iteration parallelism. For there to be a performance improvement, the time saved by executing the iterations in parallel must be greater than the time spent scheduling the iterations.
In general, when clumps execute efficiently without blocking, it may not improve performance to have more clumps than there are threads. In this case, task switching does not reduce the overall execution time, and when the number of clumps is significantly larger than the number of threads, it can cause unnecessary overhead.
When For Loops with iteration parallelism are nested, the total number of loop instances is the product of the number of instances for each loop, which can easily exceed the number of threads. Additionally, the overhead of waking up threads for the inner loop is repeated on each iteration of the outer loop. However, if only the outer loop is parallelized, this overhead is only incurred once. As a result, it is usually best to enable parallelism only on the outermost loop.
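A sketch of this recommendation in a text-based language: only the outer loop over rows is parallelized, while the inner loop runs sequentially inside each worker. The matrix and per-element work are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor

def process_row(row):
    # The inner loop stays sequential inside each worker; parallelizing it as well
    # would multiply the number of loop instances and repeat the scheduling
    # overhead on every outer iteration.
    return [x * x for x in row]

if __name__ == "__main__":
    matrix = [[r * 100 + c for c in range(100)] for r in range(100)]
    with ProcessPoolExecutor() as pool:            # parallelism on the outer loop only
        result = list(pool.map(process_row, matrix))
```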
Similarly, if you are aware of other sections of code which will execute at the same time, and you want to limit the amount of resources given to the For Loop, you can wire a fraction of the number of logical processors to (P) using the CPU Information VI. Figure 7 shows two For Loops with iteration parallelism enabled. Since the loops can execute at the same time, it may be beneficial to wire half of the number of logical processors to (P) on each loop.

Figure 7 Limiting the number of workers
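A rough analogue of Figure 7 in a text-based language: each of two concurrently executing loops is limited to about half the logical processors. The workload and pool names are illustrative, and the processor count plays the role of the value supplied by the CPU Information VI:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def work(x):
    return x * x                                   # placeholder per-iteration computation

if __name__ == "__main__":
    half = max(1, (os.cpu_count() or 2) // 2)      # analogous to wiring half the processor count to (P)
    with ProcessPoolExecutor(max_workers=half) as pool_a, \
         ProcessPoolExecutor(max_workers=half) as pool_b:
        results_a = pool_a.map(work, range(1000))  # both loops share the machine,
        results_b = pool_b.map(work, range(1000))  # each using about half the processors
        a, b = list(results_a), list(results_b)
```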
Alternatively, when the computation must wait for something like an I/O operation to complete before proceeding, it can be beneficial to have more clumps available than there are threads. This is commonly referred to as “oversubscribing.” The additional clumps can be swapped in when other clumps are waiting. In the LabVIEW execution system, when a clump executes an operation that causes it to wait, the clump yields to allow other clumps to execute. Thus, if the For Loop contains blocking nodes, like I/O operations, you may want to use more parallel loop instances than the number of cores in your machine.
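A sketch of oversubscription, with a sleep standing in for a blocking operation such as file or network I/O; using more workers than logical processors lets some make progress while others wait:

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_operation(i):
    time.sleep(0.1)                                # stands in for a blocking I/O call
    return i

workers = 4 * (os.cpu_count() or 1)                # deliberately more workers than processors
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(blocking_operation, range(64)))
```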
Finally, you should avoid calling serializing functions, like non-reentrant subVIs, in the loop, since the parallel loop instances would have to take turns executing the function. If possible, make subVIs reentrant to increase the parallelism available. Use the VI Analyzer or the Find Parallelizable Loops tool to find calls to non-reentrant subVIs in For Loops with parallelism enabled.
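A sketch of the serialization effect: a helper guarded by a shared lock plays the role of a non-reentrant subVI, so parallel workers must take turns, while the lock-free version can be called concurrently. The helper names are hypothetical:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()

def serialized_helper(x):
    with lock:                                     # like a non-reentrant subVI: one caller at a time
        return x * x

def reentrant_helper(x):
    return x * x                                   # no shared state, so calls can overlap

with ThreadPoolExecutor() as pool:
    serialized = list(pool.map(serialized_helper, range(1000)))
    concurrent = list(pool.map(reentrant_helper, range(1000)))
```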
In summary,
- only enable iteration parallelism when the loop performs a significant amount of computation to outweigh the scheduling overhead,
- limit the total number of parallel loop instances to the number of cores unless the iterations call blocking nodes, and
- avoid calling serializing nodes in the loop.
Performance Study
The loop shown below calculates the Mandelbrot set. Each iteration of the outer loop computes one row of the result array. Since the values computed for one row do not depend on the values computed for any other row, the iterations of the outer loop can execute in parallel.
Figure 8 Mandelbrot set computation
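For reference, a rough text-based version of the same idea is sketched below, parallelizing over rows because each row is computed independently; the coordinate bounds and iteration cap are illustrative and may differ from the benchmarked VI:

```python
from concurrent.futures import ProcessPoolExecutor

def mandelbrot_row(args):
    row, size, max_iter = args
    y = -2.0 + 4.0 * row / size                    # imaginary coordinate for this row
    counts = []
    for col in range(size):
        x = -2.0 + 4.0 * col / size                # real coordinate for this column
        z, c, n = 0j, complex(x, y), 0
        while abs(z) <= 2.0 and n < max_iter:
            z = z * z + c
            n += 1
        counts.append(n)
    return counts

if __name__ == "__main__":
    size, max_iter = 500, 100
    # Rows do not depend on each other, so the outer loop can run in parallel.
    with ProcessPoolExecutor() as pool:
        image = list(pool.map(mandelbrot_row, [(r, size, max_iter) for r in range(size)]))
```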
| | Sequential Time | Parallel Time | Speedup |
| LabVIEW 2009 | 14.9 s | 5.8 s | 2.6 |
| LabVIEW 2010 | 9.5 s | 2.6 s | 3.7 |
Table 3 Performance results for computing a 500 by 500 Mandelbrot set on a quad-core machine
Using LabVIEW 2010, the sequential version of this algorithm takes 9.46 seconds on a 500 by 500 set. Changing the outer loop to use iteration parallelism reduces the execution time to 2.55 seconds on a quad-core machine, which is 3.7 times faster than the sequential version.
This algorithm highlights the benefit of the dynamic scheduling strategy introduced in LabVIEW 2010. When the same benchmarks are executed using LabVIEW 2009, the parallel version is 2.6 times faster than the sequential version. While this is a significant improvement, the scheduling strategy in LabVIEW 2010 can achieve even higher performance.
Summary
To achieve better performance on multicore machines, consider enabling iteration parallelism on For Loops where the iterations do not depend on each other. The Find Parallelizable Loops tool can help you find loops which are candidates for iteration parallelism in your projects or VI hierarchies. Loops that perform a significant amount of computation per iteration and that do not call serializing nodes, like non-reentrant subVIs, will benefit the most from this feature.