Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TRT-1811: fallback queries #1993

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

neisw
Copy link
Contributor

@neisw neisw commented Sep 15, 2024

  • Queries previous releases without variant, capability or testid filtering
  • Caches results so navigating down through capabilities doesn't cause the query to run again
  • Compares stats from previous releases based on testIdentification and selects the one with the highest pass rate
  • Includes Advanced Option for Fallback Basis default is ignore currently
  • Fallback test details shows the overriden release status
    • Override Release Assessment
    • Fallback Release Assessment
    • Fallback Release basis with dates
    • Override Release basis with dates (new tab)

In testing I'm not seeing many regressions that only show for fallback (which is good).
But testing with:
Sample 4.18 9/21-9/28
Base 4.17 8/2 - 8/30
Fallback Basis: Keep

image

Will generate a regression for operator conditions etcd that falls back to 4.15 1/29 - 2/28

image

The overriden basis data is still available via the Override Tests tab
image

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 15, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Sep 15, 2024

@neisw: This pull request references TRT-1811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

  • Queries previous releases without variant, capability or testid filtering
  • Caches results so navigating down through capabilities doesn't cause the query to run again
  • Compares stats from previous releases based on testIdentification and selects the one with the highest pass rate
  • Currently just a POC

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 15, 2024
Copy link
Contributor

openshift-ci bot commented Sep 15, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

openshift-ci bot commented Sep 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neisw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2024
@neisw neisw force-pushed the fallback-queries branch 3 times, most recently from 4d62d3a to 5e973fc Compare October 1, 2024 17:36
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 1, 2024

@neisw: This pull request references TRT-1811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

  • Queries previous releases without variant, capability or testid filtering
  • Caches results so navigating down through capabilities doesn't cause the query to run again
  • Compares stats from previous releases based on testIdentification and selects the one with the highest pass rate
  • Includes Advanced Option for Fallback Basis default is ignore currently
  • Fallback test details shows the overriden release status
  • Override Release Assessment
  • Fallback Release Assessment
  • Fallback Release basis with dates
  • Override Release basis with dates (new tab)

In testing I'm not seeing too many over regressions that only show for fallback (which is good).
But testing with:
Sample 4.18 9/21-9/28
Base 4.17 8/2 - 8/30
Fallback Basis: Keep

image

Will generate a regression for operator conditions etcd that falls back to 4.15 1/29 - 2/28

image

The overriden basis data is still available via the Override Tests tab
image

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@neisw neisw changed the title TRT-1811: POC fallback queries TRT-1811: fallback queries Oct 1, 2024
@neisw neisw marked this pull request as ready for review October 1, 2024 19:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 1, 2024

@neisw: This pull request references TRT-1811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

  • Queries previous releases without variant, capability or testid filtering
  • Caches results so navigating down through capabilities doesn't cause the query to run again
  • Compares stats from previous releases based on testIdentification and selects the one with the highest pass rate
  • Includes Advanced Option for Fallback Basis default is ignore currently
  • Fallback test details shows the overriden release status
  • Override Release Assessment
  • Fallback Release Assessment
  • Fallback Release basis with dates
  • Override Release basis with dates (new tab)

In testing I'm not seeing many regressions that only show for fallback (which is good).
But testing with:
Sample 4.18 9/21-9/28
Base 4.17 8/2 - 8/30
Fallback Basis: Keep

image

Will generate a regression for operator conditions etcd that falls back to 4.15 1/29 - 2/28

image

The overriden basis data is still available via the Override Tests tab
image

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

openshift-ci bot commented Oct 1, 2024

@neisw: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/lint 4fd5899 link true /test lint

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Member

@sosiouxme sosiouxme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase to include #2031
(and preferably the real fix for the 4.18 view)

Comment on lines +132 to +138
<p>Fallback Basis: {ignoreFallbackBasis ? 'ignore' : 'keep'}</p>
<Switch
checked={ignoreFallbackBasis}
onChange={handleChangeIgnoreFallbackBasis}
name="ignoreFallbackBasis"
color="primary"
/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could it please have a tooltip to explain what it's for

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also i would find this less confusing if the boolean were reversed and so we had "Enable Fallback Basis" defaulting to false (for now)

Comment on lines +448 to +449
<Tab label="Basis Tests" {...tabProps(0)} />
<Tab label="Override Tests" {...tabProps(1)} />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These titles are too ambiguous for a newb. It's admittedly difficult to come up with better. Maybe:

Suggested change
<Tab label="Basis Tests" {...tabProps(0)} />
<Tab label="Override Tests" {...tabProps(1)} />
<Tab label="Severest Result" {...tabProps(0)} />
<Tab label="Specified Result" {...tabProps(1)} />

Again some kind of explanation for newbs would be helpful... though I'm not sure tooltips work well on tabs

Copy link
Member

@stbenjam stbenjam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can you share a localhost link to a page with a fallback? It'll make it easier if I'm running locally

  • Are you seeing any errors with redis locally? About 25% of the time things fail to cache - maybe it's my local network, the redis is running elsewhere

INFO[2024-10-03T07:28:16.163-04:00] Fallback (4.15) QueryTestStatus completed in 23.094652875s with 180541 base results from db
WARN[2024-10-03T07:28:16.797-04:00] couldn't persist new item to cache            error="write tcp 192.168.1.157:62924->192.168.1.215:6379: i/o timeout"
  • Is there any way to wire up ignoreFallback to the cache being unavailable? Running locally if you don't have a cache, it's very very slow

Some UI comments:

  • I think we should focus on only showing the user one square / one assessment per test. Having 2 muddies the waters: which is the right one? You can put it in a tooltip that it was overridden and what the original result was

  • Would it be possible to show all the releases in tabs? e.g. the tab titles could be the basis release, and let me see all the ones we have (e.g. all 4). It'd help for archeology, i.e. to see how the test did over time

func Prior30Days(gaDate *time.Time) time.Time {
// see if we can leverage ParseCRReleaseTime or other
// existing helper
return gaDate.Add(-1 * 30 * 24 * time.Hour)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed we're inconsistent, sometimes we're showing 30 days before GA, but the buttons in the UI are for 4 weeks (28 days).

Not a new problem but seems like we should pick one or the other.

Comment on lines +751 to +757
// this gets removed when we stop loading fallback data for the current base release
// and the section up to gets uncommented
fallbackRelease, err = previousRelease(fallbackRelease)
if err != nil {
log.WithError(err).Errorf("Failure determining fallback release for %s", fallbackRelease)
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bug here: if previousRelease fails, the loop logs the error and continues -- causing the same release to be processed again in the next iteration, since previousRelease returns it's prior argument on failure.

IMHO it'd be simpler to construct the list of releases you want then range over it. It also avoids the fixed 4 in the for loop, I don't think you can assume there's always that many

Copy link
Member

@stbenjam stbenjam Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g.

var selectedReleases []*crtype.Release
fallbackRelease := f.BaseRelease

// Get base plus up to 3 fallback releases
for i := 0; i < 4; i++ {
	var crRelease *crtype.Release

	for i := range releases {
		if release.Release == fallbackRelease {
			crRelease = &releases[i]
			break
		}
	}

	if crRelease != nil {
		selectedReleases = append(selectedReleases, crRelease)
	}

	// Attempt to get the previous release
	var err error
	fallbackRelease, err = previousRelease(fallbackRelease)
	if err != nil {
		log.WithError(err).Errorf("failure determining fallback release for %s", fallbackRelease)
		break // Stop attempting if previousRelease fails
	}
}

for _, crRelease := range selectedReleases {
	// launch goroutines to fetch the data
}

}
} else {
// if we miss on a prior release we stop
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This break belongs to if cachedTestStatuses, ok := c.cachedFallbackTestStatuses.Releases[priorRelease]; ok

Do you want to break if previousRelease returns err too?

}

wg.Add(1)
go func(queryRelease *crtype.Release, queryStart, queryEnd time.Time) {
Copy link
Member

@stbenjam stbenjam Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the only case where we're not doing it, but we need to start propagating the context from the http Request down into component readiness and using it in all our goroutines.

When a client disconnects, these still continue executing which isn't ideal. At the minimum it makes sippy a target for a ddos attack (clients can quickly reconnect over and over again and spawn off many goroutines that stay running).

i.e.:

func (s *Server) jsonComponentReportFromBigQuery(w http.ResponseWriter, req *http.Request) {
    ctx := req.Context()

    // [....]
    componentreadiness.GetComponentReportFromBigQuery(ctx, ...)
    // [...]
}

func GetComponentReportFromBigQuery(ctx context.Context, ...) {
    // [...]
    someOtherHelperThatUsesGoRoutines(ctx, ....)
    // [...]
}


func someOtherHelperThatUsesGoRoutines(ctx context.Context, ...) {
    // [...]
    go func() {
        select {
        case <-ctx.Done():
             // any clean up required
            return // terminate when context does
        default:
            // actual body of work to do
        }
    }()
    // [...]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed a separate card for this: https://issues.redhat.com/browse/TRT-1845

Comment on lines +724 to +726
for _, release := range releases {
if release.Release == fallbackRelease {
crRelease = &release
Copy link
Member

@stbenjam stbenjam Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't take the address of a loop variable like this, instead:

Suggested change
for _, release := range releases {
if release.Release == fallbackRelease {
crRelease = &release
for i, release := range releases {
if release.Release == fallbackRelease {
crRelease = &releases[i]

Or update go.mod to require 1.22

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants