Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime of CircleCI jobs exceed the limit for the free tier #166

Closed
geedo0 opened this issue Aug 13, 2024 · 7 comments
Closed

Runtime of CircleCI jobs exceed the limit for the free tier #166

geedo0 opened this issue Aug 13, 2024 · 7 comments
Assignees
Labels
wontfix This will not be worked on

Comments

@geedo0
Copy link

geedo0 commented Aug 13, 2024

CircleCI's free tier limits a job's run time to 1 hour doc. The CI checks run by OpenSSH have been approaching that 1 hour limit and are now at the point where they just barely run past that limit and fail the CI checks example. Previous runs were already pretty close to the limit, the addition of new algorithms to test was enough to tip the run-time over the limit.

There's no solution that doesn't come with trade-offs, but here are some possibilities:

  • Migrate to a new CI framework like Github Actions. This seems to be the right choice, but I don't have the context on why CircleCI was chosen in the first place.
  • Upgrade to the paid Performance Plan pricing.
  • Audit the unit tests and skip tests that have long running times and aren't particularly interesting for OQS.
  • Look into refactoring the tests into parallel jobs that fit within the limits.
@geedo0
Copy link
Author

geedo0 commented Aug 20, 2024

@SWilson4 Here's that CircleCI issue I mentioned. I've got a first pass at porting it to Github Actions here. However, there's some difference in how the container/host handles process reaping which breaks a unit test, but I can patch around that.

name: CI Checks
on: [ push, pull_request ]
jobs:
  ubuntu_build:
    runs-on: ubuntu-latest
    container:
      image: openquantumsafe/ci-ubuntu-focal-x86_64:latest
    steps:
    - uses: actions/checkout@v4
    - name: Set up SSH environment
      run: |
        mkdir -p -m 0755 /var/empty
        groupadd sshd
        useradd -g sshd -c 'sshd privsep' -d /var/empty -s /bin/false sshd
    - name: Clone liboqs
      run: ./oqs-scripts/clone_liboqs.sh
    - name: Build liboqs
      run: ./oqs-scripts/build_liboqs.sh
    - name: Build OpenSSH
      run: env WITH_OPENSSL=true ./oqs-scripts/build_openssh.sh
    - name: Run tests documented to pass
      run: ./oqs-test/run_tests.sh
    - name: Ensure we have the ssh and sshd syntax right once for each algorithm
      run: python3 oqs-test/try_connection.py doone

@SWilson4
Copy link
Member

That config looks reasonable to me. Would it help if I made a draft PR to move the liboqs downstream trigger from OQS-v8 (on CircleCI) to OQS-v9 (on GitHub Actions)?

@geedo0
Copy link
Author

geedo0 commented Aug 20, 2024

Yeah, that'll help. I can put together the PR in OpenSSH to get Github Actions working.

geedo0 added a commit to geedo0/openssh that referenced this issue Aug 20, 2024
We are hitting the 1 hour time limit of Circle CI (Issue open-quantum-safe#166). This migrates the existing CircleCI job completely to Github Actions which has a 5 hour time limit.

For the most part, this is pretty much a one-to-one migration. The one oddity I noticed is a regression in one of the unit tests. I've root-caused it to a change in how Github's Ubuntu host + OQS' CI container handles zombie processes. In this configuration, they don't seem to reap zombies as aggressively as it did in CircleCI. This causes the test to "fail" because it counts a zombie as "alive". The creation of a zombie is by design due to how ssh-agent's subprocess mode works (ssh-agent forks into a child process to allow an arbitrary parent to take over, the child then becomes a zombie when it recognizes its orphaned status). To work around this, I added a check to the assertion to count zombies as "not alive" and this allows the test to pass on Github Actions.

Since upstream OpenSSH provided its own set of Github Actions, I simply moved those to the `upstream-github` directory to avoid conflicts and preserve the source.
geedo0 added a commit to geedo0/openssh that referenced this issue Aug 20, 2024
We are hitting the 1 hour time limit of Circle CI (Issue open-quantum-safe#166). This migrates the existing CircleCI job completely to Github Actions which has a 5 hour time limit.

For the most part, this is pretty much a one-to-one migration. The one oddity I noticed is a regression in one of the unit tests. I've root-caused it to a change in how Github's Ubuntu host + OQS' CI container handles zombie processes. In this configuration, they don't seem to reap zombies as aggressively as it did in CircleCI. This causes the test to "fail" because it counts a zombie as "alive". The creation of a zombie is by design due to how ssh-agent's subprocess mode works (ssh-agent forks into a child process to allow an arbitrary parent to take over, the child then becomes a zombie when it recognizes its orphaned status). To work around this, I added a check to the assertion to count zombies as "not alive" and this allows the test to pass on Github Actions.

Since upstream OpenSSH provided its own set of Github Actions, I simply moved those to the `upstream-github` directory to avoid conflicts and preserve the source.

Signed-off-by: gcr <[email protected]>
@baentsch
Copy link
Member

Migrate to a new CI framework like Github Actions.

That's the way to go -- we already did in liboqs (thanks to @SWilson4) and plan doing elsewhere too, e.g., open-quantum-safe/oqs-provider#248.

why CircleCI was chosen in the first place

That was because at the time it had support for platforms other CI systems didn't. For oqs-openssh this is irrelevant as a) GH CI supports more platforms by now and b) there's (currently) no PLATFORMS.md (guaranteed supported systems) this sub project "warrants", so let's get off CCI here rather than spend time working around its quirks.

@baentsch baentsch added the wontfix This will not be worked on label Aug 21, 2024
geedo0 added a commit to geedo0/openssh that referenced this issue Aug 21, 2024
We are hitting the 1 hour time limit of Circle CI (Issue open-quantum-safe#166). This migrates the existing CircleCI job completely to Github Actions which has a 5 hour time limit.

For the most part, this is pretty much a one-to-one migration. Since upstream OpenSSH provided its own set of Github Actions, I simply moved those to the `upstream-github` directory to avoid conflicts and preserve the source. I did run into two issues with getting the integration tests to pass. Beyond that, I ran into two issues that arose from migrating to Github Actions which need to be partched around.

The combination of Github Actions' host with the OQS CI container results in a lazier reaping of zombie processes which breaks this test. In this test, ssh-agent is run as a subprocess to some arbitrary user command. This enables exclusive access to ssh-agent to that specific process. The way this works under the hood is that ssh-agent forks into a child process and the parent process exec's into the arbitrary command ([code ref](https://github.com/open-quantum-safe/openssh/blob/OQS-v9/ssh-agent.c#L2384)) which runs to completion. The child process than polls its parent process until it detects its own orphaned status and terminates itself. This, by design, results in a zombie process which must be reaped. The test's assertion uses `kill -0` to check for liveness, but that counts zombies as "alive". The workaround for this then is to add an additional check to assert that zombies are in fact "dead".

The `percent` test tests % expansions inside SSH config files (e.g. home directory, username, port number). The assertion for the home directory uses the `HOME` environmental variable. Unfortunately, when running a container on a Github Runner, they unconditionally override the value of `HOME` with `/github/home` ([issue ref](actions/runner#863)) and this breaks the test assertion. The fix here is to get a more reliable reference for the home directory and use that for the assertion.

Signed-off-by: gcr <[email protected]>
geedo0 added a commit to geedo0/openssh that referenced this issue Aug 21, 2024
We are hitting the 1 hour time limit of Circle CI (Issue open-quantum-safe#166). This migrates the existing CircleCI job completely to Github Actions which has a 5 hour time limit.

For the most part, this is pretty much a one-to-one migration. Since upstream OpenSSH provided its own set of Github Actions, I simply moved those to the `upstream-github` directory to avoid conflicts and preserve the source. I did run into two issues with getting the integration tests to pass. Beyond that, I ran into two issues that arose from migrating to Github Actions which need to be partched around.

agent-subprocess zombie process reaping

The combination of Github Actions' host with the OQS CI container results in a lazier reaping of zombie processes which breaks this test. In this test, ssh-agent is run as a subprocess to some arbitrary user command. This enables exclusive access to ssh-agent to that specific process. The way this works under the hood is that ssh-agent forks into a child process and the parent process exec's into the arbitrary command ([code ref](https://github.com/open-quantum-safe/openssh/blob/OQS-v9/ssh-agent.c#L2384)) which runs to completion. The child process than polls its parent process until it detects its own orphaned status and terminates itself. This, by design, results in a zombie process which must be reaped. The test's assertion uses `kill -0` to check for liveness, but that counts zombies as "alive". The workaround for this then is to add an additional check to assert that zombies are in fact "dead".

percent expansion is broken due to Github's HOME override

The `percent` test tests % expansions inside SSH config files (e.g. home directory, username, port number). The assertion for the home directory uses the `HOME` environmental variable. Unfortunately, when running a container on a Github Runner, they unconditionally override the value of `HOME` with `/github/home` ([issue ref](actions/runner#863)) and this breaks the test assertion. The fix here is to get a more reliable reference for the home directory and use that for the assertion.

Signed-off-by: gcr <[email protected]>
geedo0 added a commit to geedo0/openssh that referenced this issue Aug 21, 2024
We are hitting the 1 hour time limit of Circle CI (Issue open-quantum-safe#166). This migrates the existing CircleCI job completely to Github Actions which has a 5 hour time limit.

For the most part, this is pretty much a one-to-one migration. Since upstream OpenSSH provided its own set of Github Actions, I simply moved those to the `upstream-github` directory to avoid conflicts and preserve the source. I did run into two issues with getting the integration tests to pass. Beyond that, I ran into two issues that arose from migrating to Github Actions which need to be partched around.

agent-subprocess zombie process reaping

The combination of Github Actions' host with the OQS CI container results in a lazier reaping of zombie processes which breaks this test. In this test, ssh-agent is run as a subprocess to some arbitrary user command. This enables exclusive access to ssh-agent to that specific process. The way this works under the hood is that ssh-agent forks into a child process and the parent process exec's into the arbitrary command ([code ref](https://github.com/open-quantum-safe/openssh/blob/OQS-v9/ssh-agent.c#L2384)) which runs to completion. The child process than polls its parent process until it detects its own orphaned status and terminates itself. This, by design, results in a zombie process which must be reaped. The test's assertion uses `kill -0` to check for liveness, but that counts zombies as "alive". The workaround for this then is to add an additional check to assert that zombies are in fact "dead".

percent expansion is broken due to Github's HOME override

The `percent` test tests % expansions inside SSH config files (e.g. home directory, username, port number). The assertion for the home directory uses the `HOME` environmental variable. Unfortunately, when running a container on a Github Runner, they unconditionally override the value of `HOME` with `/github/home` ([issue ref](actions/runner#863)) and this breaks the test assertion. The fix here is to get a more reliable reference for the home directory and use that for the assertion.

Signed-off-by: Gerardo Ravago <[email protected]>
@geedo0 geedo0 self-assigned this Aug 21, 2024
@SWilson4
Copy link
Member

Merging open-quantum-safe/liboqs#1898 will complete the move to GitHub Actions.

@geedo0
Copy link
Author

geedo0 commented Aug 22, 2024

Thanks everyone for quickly driving this to completion!

@geedo0 geedo0 closed this as completed Aug 22, 2024
@baentsch
Copy link
Member

Thanks everyone for quickly driving this to completion!

The reverse is more due: Thanks for the quick and thorough update (really, upgrade) to oqs-openssh, @geedo0 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

3 participants