middle-end: Support vectorization of loops with multiple exits.

Hi All,

This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab,
this includes both fully masked and non-masked targets.

Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND.  This is however only checked
when the comparison would produce multiple statements.
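
As an illustration (a hypothetical sketch, not code from this patch), a
compound exit condition such as the one below decomposes into two vector
comparisons whose boolean masks must be combined with an Inclusive OR before
the single vector cbranch can test whether any lane wants to exit:

 for (int i = 0; i < N; i++)
 {
   /* Two compares are generated; their masks are OR'd together before
      the vector cbranch tests "any lane exits".  */
   if (a[i] > x || b[i] > x)
     break;
   c[i] = a[i] + b[i];
 }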

This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit.  Before this patch the vectorizer always picked the
natural loop latch-connected exit as the main exit.

After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit.  This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.

In such situations the loop is reflowed, which enables vectorization of many
other loop forms.
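
For instance (an illustrative sketch, not taken from the patch), the latch
exit of the following loop is not countable, but the early exit is, so the
vectorizer can select the early exit as the main exit:

 for (int i = 0; a[i] != 0; i++)   /* latch exit: not countable */
 {
   if (i == N - 1)                 /* early exit: countable */
     break;
   b[i] = a[i];
 }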

Concretely, the kinds of loops supported are of the form:

 for (int i = 0; i < N; i++)
 {
   <statements1>
   if (<condition>)
     {
       ...
       <action>;
     }
   <statements2>
 }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.
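
For instance, a search loop that uses a return instead of a break falls into
this class (an illustrative example, not taken from the patch):

 int find (unsigned const *a, unsigned x)
 {
   for (int i = 0; i < N; i++)
   {
     if (a[i] == x)
       return i;   /* <action> is a return */
   }
   return -1;
 }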

Since this is an initial version for GCC 14 it has the following limitations and
features:

- Only fixed-size iterations and buffers are supported.  That is to say any
  vectors loaded or stored must be to statically allocated arrays with known
  sizes.  N must also be known.  This limitation exists because our primary
  target for this optimization is SVE.  For VLA SVE we can't easily do
  cross-page iteration checks, and the result is likely to also not be
  beneficial.  For that reason we punt support for variable buffers until we
  have First-Faulting support in GCC 15.
- Any stores in <statements1> should not be to the same objects as in
  <condition>.  Loads are fine as long as they don't have the possibility to
  alias.  More concretely, we block RAW dependencies when the intermediate value
  can't be separated from the store, or the store itself can't be moved (see the
  sketch after this list).
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization.  The only epilogue supported is the
  scalar final one.  The peeling code supports it, but the code motion code
  cannot find instructions to make the move in the epilogue.
- Early breaks are only supported for inner loop vectorization.
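
To illustrate the dependence restriction above, here is a sketch of a
rejected and an accepted pattern (hypothetical, mirroring the example in the
dependence analysis code below):

 /* Rejected: RAW dependency.  The store happens before the condition
    reads the same object, and the store cannot be sunk past the break.  */
 a[i] = 8;
 if (a[i] > x)
   break;

 /* Accepted: WAR dependency.  The load happens before the store, so the
    store can safely be moved to after the break check.  */
 c = a[i];
 if (b[i] > x)
   break;
 a[i] = 8;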

With the help of IPA and LTO this case still gets hit quite often; during
bootstrap it was hit rather frequently.  Additionally TSVC s332, s481 and s482
all pass now, since these are tests for early exit vectorization support.

This implementation does not handle the early break completely inside the
vector loop itself; instead it adds checks such that if we know we have to
exit in the current iteration, we branch to scalar code to actually do the
final VF iterations, which handles all the code in <action>.

For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations.  For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining loop.

That is to say, the first vector loop executes so long as the early exit isn't
needed.  Once the exit is taken, the scalar code will perform at most VF extra
iterations.  The exact number depends on peeling, the iteration start, and which
exit was taken (natural or early).  For this scalar loop, all early exits are
treated the same.
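
Schematically the generated flow looks roughly as follows (a simplified
sketch with VF = 4; any_lane_exits is an illustrative placeholder for the
vector compare plus mask reduction feeding the cbranch):

 int i = 0;
 /* Vector main loop: runs while no lane needs the early exit.  */
 for (; i + 4 <= N; i += 4)
 {
   if (any_lane_exits (&a[i], x))   /* vector cbranch */
     break;                         /* go handle it in scalar code */
   /* ... vectorized statements for lanes i..i+3 ... */
 }
 /* Scalar code: performs at most VF (= 4) further iterations starting
    from the first not-fully-completed iteration, executing <action>
    exactly as the original loop would.  */
 for (; i < N; i++)
 {
   /* ... original scalar body, including the early break ... */
 }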

When we vectorize, we move any statement that is not related to the early break
itself and that would be incorrect to execute before the break (i.e. has side
effects) to after the break.  If this is not possible we decline to vectorize.
The analysis and code motion also ensure that the move does not introduce a RAW
dependency after the stores are moved.

This means that we check at the start of iterations whether we are going to exit
or not.  During the analysis phase we check whether we are allowed to do this
moving of statements.  Also note that we move only the scalar statements, and we
do so after peeling but just before we start transforming statements.
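
Using the test4 example from the codegen section below, conceptually the
store to vect_b[i] is sunk past the exit check (a simplified sketch; the
vector operations are written as illustrative pseudo-helpers):

 /* Scalar order: the store to vect_b[i] precedes the exit check.  */
 vect_b[i] = x + i;
 if (vect_a[i] > x)
   break;
 vect_a[i] = x;

 /* Vector order after code motion: the exit check comes first, so an
    iteration that exits performs no side effects (in the Adv. SIMD
    output below this is the cmhi/umaxp/cbnz sequence before the two
    str instructions).  */
 if (mask_any (vect_a_vec > x_vec))
   goto scalar_code;
 store (vect_b_vec);
 store (vect_a_vec);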

With this, the vector flow no longer necessarily needs to match that of the
scalar code.  In addition most of the infrastructure is in place to support
general control flow safely; however, we are punting this to GCC 15.

Codegen:

For example, for:

unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;

 }
 return ret;
}

We generate for Adv. SIMD:

test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2

and for SVE:

test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, #1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, #1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret

On the workloads this work is based on we see a 2-3x performance uplift
using this patch.

Follow up plan:
 - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
   bigger ones that cause vectorization to fail with this patch.
 - SLP support.  This is planned for GCC 15, as for the majority of cases
   building SLP itself fails.  This means I'll need to spend time making this
   more robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling
   fallouts from this patch for a while.  So I plan to work on the surrounding
   building blocks first for the remainder of the year.

Additionally the patch also contains reduced test cases from issues found
running over various codebases.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Also regtested with:
 -march=armv8.3-a+sve
 -march=armv8.3-a+nosve
 -march=armv9-a
 -mcpu=neoverse-v1
 -mcpu=neoverse-n2

Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.

gcc/ChangeLog:

	* tree-if-conv.cc (ref_within_array_bound): Expose.
	* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
	(vect_analyze_data_ref_dependences): Use it.
	* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
	(vect_set_loop_controls_directly,
	vect_set_loop_condition_partial_vectors,
	vect_set_loop_condition_partial_vectors_avx512,
	vect_set_loop_condition_normal): Support multiple exits.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for
	multiple exits.
	(slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB
	count to instead looking at loop shape.
	(vect_update_ivs_after_vectorizer): Drop asserts.
	(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
	(vect_do_peeling): Support multiple exits.
	(vect_loop_versioning): Likewise.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
	early_breaks.
	(vect_analyze_loop_form): Support loop flows with more than single BB
	loop body.
	(vect_create_loop_vinfo): Support niters analysis for multiple exits.
	(vect_analyze_loop): Likewise.
	(vect_get_vect_def): New.
	(vect_create_epilog_for_reduction): Support early exit reductions.
	(vectorizable_live_operation_1): New.
	(find_connected_edge): New.
	(vectorizable_live_operation): Support early exit live operations.
	(move_early_exit_stmts): New.
	(vect_transform_loop): Use it.
	* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
	(vect_recog_bitfield_ref_pattern): Support gconds and bools.
	(vect_recog_gcond_pattern): New.
	(possible_vector_mask_operation_p): Support gcond masks.
	(vect_determine_mask_precision): Likewise.
	(vect_mark_pattern_stmts): Set gcond def type.
	(can_vectorize_live_stmts): Force early break inductions to be live.
	* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
	early breaks.
	(vect_mark_stmts_to_be_vectorized): Process gcond usage.
	(perm_mask_for_reverse): Expose.
	(vectorizable_comparison_1): New.
	(vectorizable_early_exit): New.
	(vect_analyze_stmt): Support early break and gcond.
	(vect_transform_stmt): Likewise.
	(vect_is_simple_use): Likewise.
	(vect_get_vector_types_for_stmt): Likewise.
	* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
	numbering.
	* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
	LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
	LOOP_VINFO_EARLY_BRK_VUSES): New.
	(is_loop_header_bb_p): Drop assert.
	(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
	early_break_vuses.
	(vect_iv_increment_position, perm_mask_for_reverse,
	ref_within_array_bound): New.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.

TamarChristinaArm committed Dec 24, 2023
1 parent f1dcc0f commit 01f4251
Showing 8 changed files with 1,330 additions and 233 deletions.
2 changes: 1 addition & 1 deletion gcc/tree-if-conv.cc
@@ -844,7 +844,7 @@ idx_within_array_bound (tree ref, tree *idx, void *dta)

/* Return TRUE if ref is a within bound array reference.  */

-static bool
+bool
ref_within_array_bound (gimple *stmt, tree ref)
{
  class loop *loop = loop_containing_stmt (stmt);
237 changes: 237 additions & 0 deletions gcc/tree-vect-data-refs.cc
@@ -613,6 +613,238 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
  return opt_result::success ();
}

/* Function vect_analyze_early_break_dependences.

   Examine all the data references in the loop and make sure that if we have
   multiple exits that we are able to safely move stores such that they become
   safe for vectorization.  The function also calculates the place where to
   move the instructions to and computes what the new vUSE chain should be.

   This works in tandem with the CFG that will be produced by
   slpeel_tree_duplicate_loop_to_edge_cfg later on.

   This function tries to validate whether an early break vectorization
   is possible for the current instruction sequence.  Returns True if
   possible, otherwise False.

   Requirements:
     - Any memory access must be to a fixed size buffer.
     - There must not be any loads and stores to the same object.
     - Multiple loads are allowed as long as they don't alias.

   NOTE:
     This implementation is very conservative.  Any overlapping loads/stores
     that take place before the early break statement get rejected aside from
     WAR dependencies.

     i.e.:

	a[i] = 8
	c = a[i]
	if (b[i])
	  ...

	is not allowed, but

	c = a[i]
	a[i] = 8
	if (b[i])
	  ...

	is, which is the common case.  */

static opt_result
vect_analyze_early_break_dependences (loop_vec_info loop_vinfo)
{
  DUMP_VECT_SCOPE ("vect_analyze_early_break_dependences");

  /* List of all load data references found during traversal.  */
  auto_vec<data_reference *> bases;
  basic_block dest_bb = NULL;

  hash_set <gimple *> visited;
  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
  class loop *loop_nest = loop_outer (loop);

  if (dump_enabled_p ())
    dump_printf_loc (MSG_NOTE, vect_location,
		     "loop contains multiple exits, analyzing"
		     " statement dependencies.\n");

  for (gimple *c : LOOP_VINFO_LOOP_CONDS (loop_vinfo))
    {
      stmt_vec_info loop_cond_info = loop_vinfo->lookup_stmt (c);
      if (STMT_VINFO_TYPE (loop_cond_info) != loop_exit_ctrl_vec_info_type)
	continue;

      gimple_stmt_iterator gsi = gsi_for_stmt (c);

      /* Now analyze all the remaining statements and try to determine which
	 instructions are allowed/needed to be moved.  */
      while (!gsi_end_p (gsi))
	{
	  gimple *stmt = gsi_stmt (gsi);
	  gsi_prev (&gsi);
	  if (!gimple_has_ops (stmt)
	      || is_gimple_debug (stmt))
	    continue;

	  stmt_vec_info stmt_vinfo = loop_vinfo->lookup_stmt (stmt);
	  auto dr_ref = STMT_VINFO_DATA_REF (stmt_vinfo);
	  if (!dr_ref)
	    continue;

	  /* We currently only support statically allocated objects due to
	     not having first-faulting loads support or peeling for
	     alignment support.  Compute the size of the referenced object
	     (it could be dynamically allocated).  */
	  tree obj = DR_BASE_ADDRESS (dr_ref);
	  if (!obj || TREE_CODE (obj) != ADDR_EXPR)
	    {
	      if (dump_enabled_p ())
		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
				 "early breaks only supported on statically"
				 " allocated objects.\n");
	      return opt_result::failure_at (c,
				 "can't safely apply code motion to "
				 "dependencies of %G to vectorize "
				 "the early exit.\n", c);
	    }

	  tree refop = TREE_OPERAND (obj, 0);
	  tree refbase = get_base_address (refop);
	  if (!refbase || !DECL_P (refbase) || !DECL_SIZE (refbase)
	      || TREE_CODE (DECL_SIZE (refbase)) != INTEGER_CST)
	    {
	      if (dump_enabled_p ())
		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
				 "early breaks only supported on"
				 " statically allocated objects.\n");
	      return opt_result::failure_at (c,
				 "can't safely apply code motion to "
				 "dependencies of %G to vectorize "
				 "the early exit.\n", c);
	    }

	  /* Check if vector accesses to the object will be within bounds.
	     must be a constant or assume loop will be versioned or niters
	     bounded by VF so accesses are within range.  */
	  if (!ref_within_array_bound (stmt, DR_REF (dr_ref)))
	    {
	      if (dump_enabled_p ())
		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
				 "early breaks not supported: vectorization "
				 "would %s beyond size of obj.",
				 DR_IS_READ (dr_ref) ? "read" : "write");
	      return opt_result::failure_at (c,
				 "can't safely apply code motion to "
				 "dependencies of %G to vectorize "
				 "the early exit.\n", c);
	    }

	  if (DR_IS_READ (dr_ref))
	    bases.safe_push (dr_ref);
	  else if (DR_IS_WRITE (dr_ref))
	    {
	      /* We are moving writes down in the CFG.  To be sure that this
		 is valid after vectorization we have to check all the loads
		 we are sinking the stores past to see if any of them may
		 alias or are the same object.

		 Same objects will not be an issue because unless the store
		 is marked volatile the value can be forwarded.  If the
		 store is marked volatile we don't vectorize the loop
		 anyway.

		 That leaves the check for aliasing.  We don't really need
		 to care about the stores aliasing with each other since the
		 stores are moved in order so the effects are still observed
		 correctly.  This leaves the check for WAR dependencies
		 which we would be introducing here if the DR can alias.
		 The check is quadratic in loads/stores but I have not found
		 a better API to do this.  I believe all loads and stores
		 must be checked.  We also must check them when we
		 encountered the store, since we don't care about loads past
		 the store.  */

	      for (auto dr_read : bases)
		if (dr_may_alias_p (dr_ref, dr_read, loop_nest))
		  {
		    if (dump_enabled_p ())
		      dump_printf_loc (MSG_MISSED_OPTIMIZATION,
				       vect_location,
				       "early breaks not supported: "
				       "overlapping loads and stores "
				       "found before the break "
				       "statement.\n");

		    return opt_result::failure_at (stmt,
			     "can't safely apply code motion to dependencies"
			     " to vectorize the early exit. %G may alias with"
			     " %G\n", stmt, dr_read->stmt);
		  }
	    }

	  if (gimple_vdef (stmt))
	    {
	      if (dump_enabled_p ())
		dump_printf_loc (MSG_NOTE, vect_location,
				 "==> recording stmt %G", stmt);

	      LOOP_VINFO_EARLY_BRK_STORES (loop_vinfo).safe_push (stmt);
	    }
	  else if (gimple_vuse (stmt))
	    {
	      LOOP_VINFO_EARLY_BRK_VUSES (loop_vinfo).safe_insert (0, stmt);
	      if (dump_enabled_p ())
		dump_printf_loc (MSG_NOTE, vect_location,
				 "marked statement for vUSE update: %G", stmt);
	    }
	}

      /* Save destination as we go, BB are visited in order and the last one
	 is where statements should be moved to.  */
      if (!dest_bb)
	dest_bb = gimple_bb (c);
      else
	{
	  basic_block curr_bb = gimple_bb (c);
	  if (dominated_by_p (CDI_DOMINATORS, curr_bb, dest_bb))
	    dest_bb = curr_bb;
	}
    }

  basic_block dest_bb0 = EDGE_SUCC (dest_bb, 0)->dest;
  basic_block dest_bb1 = EDGE_SUCC (dest_bb, 1)->dest;
  dest_bb = flow_bb_inside_loop_p (loop, dest_bb0) ? dest_bb0 : dest_bb1;
  /* We don't allow outer -> inner loop transitions which should have been
     trapped already during loop form analysis.  */
  gcc_assert (dest_bb->loop_father == loop);

  gcc_assert (dest_bb);
  LOOP_VINFO_EARLY_BRK_DEST_BB (loop_vinfo) = dest_bb;

  if (!LOOP_VINFO_EARLY_BRK_VUSES (loop_vinfo).is_empty ())
    {
      /* All uses shall be updated to that of the first load.  Entries are
	 stored in reverse order.  */
      tree vuse = gimple_vuse (LOOP_VINFO_EARLY_BRK_VUSES (loop_vinfo).last ());
      for (auto g : LOOP_VINFO_EARLY_BRK_VUSES (loop_vinfo))
	{
	  if (dump_enabled_p ())
	    dump_printf_loc (MSG_NOTE, vect_location,
			     "will update use: %T, mem_ref: %G", vuse, g);
	}
    }

  if (dump_enabled_p ())
    dump_printf_loc (MSG_NOTE, vect_location,
		     "recorded statements to be moved to BB %d\n",
		     LOOP_VINFO_EARLY_BRK_DEST_BB (loop_vinfo)->index);

  return opt_result::success ();
}

/* Function vect_analyze_data_ref_dependences.

   Examine all the data references in the loop, and make sure there do not
@@ -657,6 +889,11 @@ vect_analyze_data_ref_dependences (loop_vec_info loop_vinfo,
	return res;
    }

  /* If we have early break statements in the loop, check to see if they
     are of a form we can vectorize.  */
  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
    return vect_analyze_early_break_dependences (loop_vinfo);

  return opt_result::success ();
}
