Commit Graph

4201 Commits

Author SHA1 Message Date
John Ericson
08c9225a76 Merge branch 'nix-2.20' into hydra.nixos.org 2024-05-21 18:08:58 -04:00
John Ericson
21044bc4d9 flake.lock: Update
Flake lock file updates:

• Updated input 'nix':
    'github:NixOS/nix/8f42912c80c0a03f62f6a3d28a3af05a9762565d' (2024-01-30)
  → 'github:NixOS/nix/ab48ea416a203e9ccefb70aa634e27477e4c1ac4' (2024-05-15)
2024-05-21 18:01:30 -04:00
Pierre Bourdon
b8d03adaf4 queue runner: attempt at slightly smarter scheduling criteria
Instead of just going for "whatever is the oldest build we know of",
use the following first:

- Is the step more constrained? If so, schedule it first to avoid
  filling up "more desirable" build slots with less constrained builds.

- Does the step have more dependents? If so, schedule it first to try
  and maximize open parallelism and breadth of scheduling options.
2024-04-21 17:36:16 +02:00
Pierre Bourdon
ee1a7a7813 web: serveFile: also serve a CSP putting served HTML in its own origin 2024-04-21 16:14:24 +02:00
Pierre Bourdon
5c3e508e55 queue-runner: release machine reservation while copying outputs
This allows for better builder usage when the queue runner is busy. To
avoid running into uncontrollable imbalances between builder/queue
runner, we only release the machine reservation after the local
throttler has found a slot to start copying the outputs for that build.
2024-04-21 01:55:19 +02:00
Pierre Bourdon
026e3a3103 queue-runner: switch to pseudorandom ordering of builds processing
We don't rely on sequential / monotonic build IDs processing anymore, so
randomizing actually has the advantage of mixing builds for different
systems together, to avoid only one chunk of builds for a single system
getting processed while builders for other systems are starved.
2024-04-20 23:05:26 +02:00
Pierre Bourdon
6606a7f86e queue runner: introduce some parallelism for remote paths lookup
Each output for a given step being ingested is looked up in parallel,
which should basically multiply the speed of builds ingestion by the
average number of outputs per derivation.
2024-04-20 22:28:18 +02:00
Pierre Bourdon
f31b95d371 queue-runner: reduce the time between queue monitor restarts
This will induce more DB queries (though these are fairly cheap), but at
the benefit of processing bumps within 1m instead of within 10m.
2024-04-20 16:58:10 +02:00
Pierre Bourdon
54f8daf6b1 queue-runner: remove id > X from new builds query
Running the query with/without it shows that it makes no difference to
postgres, since there's an index on finished=0 already. This allows a
few simplifications, but also paves the way towards running multiple
parallel monitor threads in the future.
2024-04-20 16:53:52 +02:00
Pierre Bourdon
cc6bafe538 queue-runner: add prom metrics to allow detecting internal bottlenecks
By looking at the ratio of running vs. waiting for the dispatcher and
the queue monitor, we should get better visibility into what hydra is
currently bottlenecked on.

There are other side effects we can try to measure to get to the same
result, but having a simple way doesn't cost us much.
2024-04-20 16:48:03 +02:00
Pierre Bourdon
6189ba9c5e web: replace 'errormsg' with 'errormsg IS NULL' in most cases
This is implement in an extremely hacky way due to poor DBIx feature
support. Ideally, what we'd need is a way to tell DBIx to ignore the
errormsg column unless explicitly requested, and to automatically add a
computed 'errormsg IS NULL' column in others. Since it does not support
that, this commit instead hacks some support via method overrides while
taking care to not break anything obvious.
2024-04-12 20:14:09 +02:00
Pierre Bourdon
258e9314a9 web: include current step status on /machines 2024-04-11 17:15:58 +02:00
Pierre Bourdon
a51bd392a2 queue-runner: limit parallelism of CPU intensive operations
My current theory is that running more parallel xz than available CPU
cores is reducing our overall throughput by requiring more scheduling
overhead and more cache thrashing.
2024-04-11 16:43:01 +02:00
John Ericson
b676b08fac Merge pull request #1368 from Ma27/login-submit-btn
Use `submit` event in login form
2024-03-26 11:23:51 -04:00
John Ericson
d614163e9c Merge pull request #1370 from Ma27/reconnect-db
hydra-queue-runner: drop broken connections from pool
2024-03-26 11:21:35 -04:00
Maximilian Bosch
a596d6c3c1 Only show stepname if it doesn't equal the name of the drv
When building e.g. nixpkgs, the "Running builds" view will mostly look
like this

    hello.x86_64-linux (Build of hello-X.Y)
    exa.x86_64-linux (Build of exa-X.Y)
    ...

This doesn't provide any useful information. Showing the step name only
makes sense if it's not a child of the job's derivation. With this
patch, that information will only be shown if the drv name (i.e. w/o
`/nix/store/` prefix, .drv ext & hash) is not equal to the drv name of
the job itself (build.nixname).
2024-03-18 18:46:01 +01:00
Maximilian Bosch
415f9f2daa Running builds view: show build step names
When using Hydra to build machine configurations, you'll often see
"nixosConfigurations.foo" five times, i.e. for each build step being
run. This isn't very helpful I think because in such a case, a single
build step can also be compiling the Linux kernel.

This change also fetches the `drvpath` and `type` from the `buildsteps`
relation. We're already joining it, so this doesn't make much difference
(confirmed via query logging that this doesn't cause extra SQL queries).

Unfortunately build steps don't have a human readable name, so I'm
deriving it from the drvpath by stripping away the hash (assuming that
it'll never contain a `-` and that `/nix/store/` is used as prefix). I
decided against using the Nix bindings for that to avoid too much
overhead due to store operations for each build step.
2024-03-18 18:46:01 +01:00
Maximilian Bosch
9b465e7a67 Make "timed out" and "log limit exceeded" builds aborted
In 73694087a0 I gave builds that failed
because of a timeout or exceeded log limit a stop sign and I stand by
that reasoning: with that it's possible to distinguish between actual
build failures and rather transient things such as timeouts.

Back then I considered it a feature that these are shown in a different
tab, but I don't think that's a good idea anymore. When using a jobset to
e.g. track the regressions from a mass rebuild (like a compiler or gcc
update), "Newly failed builds" should exclusively display regressions (and
flaky builds of course, not much I can do about that).

Also, when a bunch of builds fail in such a jobset because of e.g. a
broken connection to a builder that results in a timeout, I want to be
able to restart them all w/o rebuilding actual regressions.

To make it clear that we not only have "Aborted" builds in the tab, I
renamed the label to "Aborted / Timed out".
2024-03-16 22:10:40 +01:00
Maximilian Bosch
9b62c52e5c hydra-queue-runner: drop broken connections from pool
Closes #1336

When restarting postgresql, the connections are still reused in
`hydra-queue-runner` causing errors like this

    main thread: Lost connection to the database server.
    queue monitor: Lost connection to the database server.

and no more builds being processed.

`hydra-evaluator` doesn't have that issue since it crashes right away.
We could let it retry indefinitely as well (see below), but I don't
want to change too much.

If the DB is still unreachable 10s later, the process will stop with a
non-zero exit code because of a missing DB connection. This however
isn't such a big deal because it will be immediately restarted
afterwards. With the current configuration, Hydra will never give up,
but restart (and retry) infinitely. To me that seems reasonable, i.e. to
retry DB connections on a long-running process. If this doesn't work
out, the monitoring should fire anyways because the queue fills up, but
I'm open to discuss that.

Please note that this isn't reproducible with the DB and the queue
runner on the same machine when using `services.hydra-dev`, because of
the `Requires=` dependency `hydra-queue-runner.service` ->
`hydra-init.service` -> `postgresql.service` that causes the queue
runner to be restarted on `systemctl restart postgresql`.

Internally, Hydra uses Nix's pool data structure: it basically has N
slots (here DB connections) and whenever a new one is requested, an idle
slot is provided or a new one is created (when N slots are active, it'll
be waited until one slot is free). The issue in the code here is however
that whenever an error is encountered, the slot is released, however the
same broken connection will be reused the next time. By using
`Pool::Handle::markBad`, Nix will drop a broken slot. This is now being
done when `pqxx::broken_connection` was caught.
2024-03-16 22:10:40 +01:00
Maximilian Bosch
ef6be80f54 Use submit event in login form
It's a pet peeve from me when logging into my personal Hydra that I
always have to press the button rather than hitting Return after entering
my password.

Reason for that is that the form doesn't have a "submit" button, so far
it was always listened to the "click" event. Submit does that and you
can hit Return alternatively.
2024-03-16 22:10:40 +01:00
K900
969eb3eeac urlencode drv names when fetching logs
Otherwise names with special characters like + break things.
2024-03-16 22:10:40 +01:00
Pierre Bourdon
18466e8326 queue-runner: try larger pipe buffer sizes 2024-03-16 22:10:40 +01:00
ajs124
6ed21490ee lazy-load evaluation errors
Closes #1362
2024-03-16 22:10:40 +01:00
Maximilian Bosch
99afff03b0 hydra-queue-runner: drop broken connections from pool
Closes #1336

When restarting postgresql, the connections are still reused in
`hydra-queue-runner` causing errors like this

    main thread: Lost connection to the database server.
    queue monitor: Lost connection to the database server.

and no more builds being processed.

`hydra-evaluator` doesn't have that issue since it crashes right away.
We could let it retry indefinitely as well (see below), but I don't
want to change too much.

If the DB is still unreachable 10s later, the process will stop with a
non-zero exit code because of a missing DB connection. This however
isn't such a big deal because it will be immediately restarted
afterwards. With the current configuration, Hydra will never give up,
but restart (and retry) infinitely. To me that seems reasonable, i.e. to
retry DB connections on a long-running process. If this doesn't work
out, the monitoring should fire anyways because the queue fills up, but
I'm open to discuss that.

Please note that this isn't reproducible with the DB and the queue
runner on the same machine when using `services.hydra-dev`, because of
the `Requires=` dependency `hydra-queue-runner.service` ->
`hydra-init.service` -> `postgresql.service` that causes the queue
runner to be restarted on `systemctl restart postgresql`.

Internally, Hydra uses Nix's pool data structure: it basically has N
slots (here DB connections) and whenever a new one is requested, an idle
slot is provided or a new one is created (when N slots are active, it'll
be waited until one slot is free). The issue in the code here is however
that whenever an error is encountered, the slot is released, however the
same broken connection will be reused the next time. By using
`Pool::Handle::markBad`, Nix will drop a broken slot. This is now being
done when `pqxx::broken_connection` was caught.
2024-03-15 14:09:31 +01:00
ajs124
8f56209bd6 Merge pull request #1361 from Ma27/fix-gitea-test
flake: fix gitea integration test
2024-03-08 15:28:07 +01:00
Maximilian Bosch
806c375c33 Don't send gitea status update when build is started
This was the source of a flaky test because sometimes hydra-notify was
quick enough to send out `buildStarted` and sometimes it apparently
wasn't which was quickly spottable with `nix build --rebuild`.

Removing that status update doesn't make a difference functionally,
gitea doesn't differentiate between "queued" and "running", so we send
the same status ("pending") out on both events, so we'd even safe one
avoidable request.
2024-03-08 11:07:38 +01:00
Maximilian Bosch
669617ab54 Use submit event in login form
It's a pet peeve from me when logging into my personal Hydra that I
always have to press the button rather than hitting Return after entering
my password.

Reason for that is that the form doesn't have a "submit" button, so far
it was always listened to the "click" event. Submit does that and you
can hit Return alternatively.
2024-03-07 18:49:52 +01:00
Janne Heß
c45c06509a Merge pull request #1364 from K900/urlencode-logs
urlencode drv names when fetching logs
2024-03-01 10:11:08 +01:00
K900
9db5d0a88d urlencode drv names when fetching logs
Otherwise names with special characters like + break things.
2024-02-26 22:48:16 +03:00
Maximilian Bosch
ceff5c5cfe flake: fix gitea integration test
This is an integration test that confirms that jobset definitions from
git repositories are correctly built and status updates pushed to the
gitea instance. The following things needed to be fixed:

* We're still on 23.05 where gitea is marked as insecure. Not going to
  update nixpkgs right now, but going for the quick fix.
* Since gitea 1.19 tokens have scopes that describe what's possible.
  Not specifying the scope in the DB appears to imply that no
  permissions are granted.
* Apparently we have three status updates now (for three status hooks,
  queued/started/finished). No idea why that was broken before, but the
  behavior still looks correct.
2024-02-12 18:30:03 +01:00
John Ericson
c1bd50a80d Merge pull request #1354 from NixOS/nix-2.20
Nix 2.20
2024-01-30 14:07:49 -05:00
John Ericson
14aabc1cc9 Update to released Nix 2.20
Flake lock file updates:

• Updated input 'nix':
    'github:NixOS/nix/8df68a213fc52a57b02a57005b0e06cc8de40ce3' (2024-01-25)
  → 'github:NixOS/nix/8f42912c80c0a03f62f6a3d28a3af05a9762565d' (2024-01-30)
2024-01-30 13:33:20 -05:00
John Ericson
7b826ec5ad Merge branch 'nix-next' into nix-2.20 2024-01-30 13:26:45 -05:00
John Ericson
838648c0ce Merge pull request #1349 from NixOS/ca-no-new-col
Allow building content-addressed derivations with hydra, minimally
2024-01-26 17:54:02 -05:00
John Ericson
6ac4292912 Merge pull request #1351 from Ma27/hacking-fixes
Small fixes for the development environment
2024-01-26 17:22:42 -05:00
John Ericson
b503280256 Add migration to drop non-null constraints 2024-01-26 11:53:58 -05:00
Maximilian Bosch
b4c91b5a6a package: move foreman to nativeCheckInputs
In 1bd195a513 strictDeps was set for the
Hydra package. As a result, `checkInputs` aren't available anymore in
the local dev-shell which is the sole purpose of foreman, to start
services and a database for development.
2024-01-26 17:30:07 +01:00
Maximilian Bosch
8477009310 doc/manual: fix instructions in contribution guidelines
In 5db374cb50 the `bootstrap` script was
removed, however it's still referenced in the contribution guidelines.
Change that to `autoreconfPhase` as intended by the commit.
2024-01-26 17:28:07 +01:00
John Ericson
c62eaf248f Remove now-unneeded workaround 2024-01-26 01:20:07 -05:00
John Ericson
13b5f007ef Merge branch 'master' into ca-no-new-col 2024-01-26 01:19:45 -05:00
John Ericson
7f5889559e Merge pull request #1350 from NixOS/remove-old-workaround
Remove now-unneeded workaround
2024-01-26 01:13:37 -05:00
John Ericson
5ee0e443e4 Remove now-unneeded workaround 2024-01-26 01:08:11 -05:00
John Ericson
323b556dc8 Minimal CA support
This verison has a worse UI, but also chnages the schema less: One
non-null constraint is removed, but no new columns are added.

Co-Authored-By: Andrea Ciceri <andrea.ciceri@autistici.org>
Co-Authored-By: regnat <rg@regnat.ovh>
2024-01-26 00:34:58 -05:00
John Ericson
458b9e4242 Merge pull request #1348 from NixOS/ca-prep
More CA derivations prep
2024-01-25 21:53:40 -05:00
John Ericson
fcde5908d8 More CA derivations prep
Again, with care not to change the schema in any way.
2024-01-25 21:32:22 -05:00
John Ericson
083ef46c12 Merge pull request #1344 from delroth/google-popup
web: disable Sign in with Google popup
2024-01-25 16:36:16 -05:00
John Ericson
7a53b866f6 Merge branch 'master' into nix-next
• Updated input 'nix' (merge):
    'github:NixOS/nix/212ba69e6f995992f8b4e4c0656d19c0156c8714'
    'github:NixOS/nix/2c4bb93ba5a97e7078896ebc36385ce172960e4e' (2024-01-25)
  → 'github:NixOS/nix/8df68a213fc52a57b02a57005b0e06cc8de40ce3' (2024-01-25)
2024-01-25 16:26:07 -05:00
John Ericson
8a02bb7c36 Merge pull request #1347 from NixOS/simplify-req-features
Simplify `StoreConfig::getDefaultSystemFeatures` call
2024-01-25 16:17:25 -05:00
John Ericson
c64eed7d07 Simplify StoreConfig::getDefaultSystemFeatures call
That method is now static.
2024-01-25 15:58:07 -05:00
John Ericson
aed130cd17 flake.lock: Update
Flake lock file updates:

• Updated input 'nix':
    'github:NixOS/nix/03e96b9dc011a16a0f6db9c7cb021ff93f8dcf88' (2024-01-19)
  → 'github:NixOS/nix/2c4bb93ba5a97e7078896ebc36385ce172960e4e' (2024-01-25)
2024-01-25 15:57:39 -05:00