upgrade contint servers to bullseye
Closed, Resolved, Public

Description

The CI servers (Puppet role(ci)) are running Debian Buster and must be upgraded to Bullseye. As of April 2024 the hosts are:

  • contint2002.wikimedia.org | Zuul, Zuul merger, Jenkins, Jenkins agent
  • contint1002.wikimedia.org | Zuul merger, Jenkins agent

Some dependencies must be resolved before the upgrade can happen. Notably, Zuul requires python2.7, which is no longer officially supported on Bullseye and is only included for the purpose of building Chromium.

Runbook

The reimaging of the two hosts is done in three phases:

  1. Reimage contint1002
  2. Switch over services from contint1002 to contint2002
  3. Reimage contint2002

1) reimage contint1002.wikimedia.org

We need to bring down the two services on the host, reimage it and bring back the two services. The host is:

  • attached as a Jenkins agent to the Jenkins controller which runs on the other host
  • running the secondary zuul-merger daemon (and its companion git-daemon)

Disable services

  • Disable the zuul-merger on contint1002 by setting profile::zuul::merger::enable: false. That should stop and mask the service. There is another Zuul merger system running on contint2002.wikimedia.org.
  • Run the host down cookbook to disable monitoring and alarms
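
Once Puppet has applied the Hiera change, the merger state can be confirmed before the reimage starts. The sketch below is only an illustration: it assumes the systemd unit is named zuul-merger (the runbook names the daemon, not the unit) and relies on `systemctl is-enabled` printing `masked` for a masked unit and `systemctl is-active` printing `active` for a running one.

```shell
# Return 0 when the unit is masked and not running.
# $1: output of `systemctl is-enabled <unit>`
# $2: output of `systemctl is-active <unit>`
merger_is_disabled() {
  [ "$1" = "masked" ] && [ "$2" != "active" ]
}

# Usage on the host (the unit name "zuul-merger" is an assumption):
#   merger_is_disabled "$(systemctl is-enabled zuul-merger)" \
#                      "$(systemctl is-active zuul-merger)" \
#     && echo "zuul-merger is stopped and masked"
```

Passing the two states as arguments keeps the check separate from systemd, so it can be tried on any machine.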

Reimage

  • Reimage contint1002 to Bullseye. Data in /srv can be wiped: it is only used for caching (git repos, Docker images and build layers).
  • While the cookbook is still running, but once the host is back up and ssh access has been restored, manually run "sudo a2dismod mpm_event" and run Puppet again. The cookbook should then detect a successful Puppet run and finish cleanly.

Enable services

After host is back and provisioned, verify:

  • /srv is a standalone partition!
  • Docker daemon is started.
  • Zuul has been deployed (not by Puppet): /srv/deployment/zuul/venv/bin/zuul-merger. - FAILED
  • git-daemon is up (systemctl status git-daemon).
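
The four checks above can be run in one pass. This is a minimal sketch, not part of the original runbook: it assumes `mountpoint` (util-linux) is available, that the Docker unit is named docker, and it reuses the zuul-merger path and the git-daemon unit name given in the list.

```shell
# Run a command quietly and report the labelled result as PASS or FAIL.
check() {
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
    return 1
  fi
}

# Run every post-reimage check; return non-zero if any of them failed.
verify_reimaged_host() {
  rc=0
  check "/srv is a standalone partition" mountpoint -q /srv || rc=1
  check "Docker daemon is started" systemctl is-active --quiet docker || rc=1
  check "zuul-merger is deployed" test -x /srv/deployment/zuul/venv/bin/zuul-merger || rc=1
  check "git-daemon is up" systemctl is-active --quiet git-daemon || rc=1
  return $rc
}

# Usage on the reimaged host:
#   verify_reimaged_host
```

All checks run even when an earlier one fails, so a single pass shows everything that still needs attention.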

Enable the services:

  • Enable the Jenkins agent via https://integration.wikimedia.org/ci/computer/contint1002/ (the ssh host key needs to be verified again, since the reimage changes it).
  • Set profile::zuul::merger::enable: true. Running Puppet will unmask it and start the service. It logs in /var/log/zuul/merger.log.

2) Switch over services

Before reimaging contint2002, its services need to be moved to the reimaged contint1002.

Before the maintenance

  • Disable the zuul-merger on contint2002 by setting profile::zuul::merger::enable: false. That should stop and mask the service. There is another Zuul merger system running on contint1002.wikimedia.org.
  • Clean up some of the Jenkins artifacts to reduce the amount of data that will be transferred

Rsync data and states

Synchronize data and states to pre warm the other host:

  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint1002.wikimedia.org/ci--srv-jenkins-
  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint1002.wikimedia.org/ci--var-lib-jenkins-
  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint1002.wikimedia.org/ci--var-lib-zuul-
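
The three commands differ only in the path. In the commands above, each rsync module name looks like "ci-" followed by the path with every "/" replaced by "-"; that is an observed pattern, not a documented rule. Under that assumption, a small sketch can generate the commands so the list is not copied by hand. It only prints them and does not run rsync.

```shell
# Destination host, taken from the runbook commands.
DEST_HOST=contint1002.wikimedia.org

# Derive the rsync module name from a path: "ci-" plus the path with "/"
# replaced by "-". Observed from the runbook, not a documented convention.
rsync_module_for() {
  printf 'ci-%s\n' "$(printf '%s' "$1" | tr '/' '-')"
}

# Print (do not run) the rsync command for one path.
print_sync_cmd() {
  printf 'sudo rsync -ap --whole-file --delete-delay --info=progress2 %s rsync://%s/%s\n' \
    "$1" "$DEST_HOST" "$(rsync_module_for "$1")"
}

for path in /srv/jenkins/ /var/lib/jenkins/ /var/lib/zuul/; do
  print_sync_cmd "$path"
done
```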

Switch over

  • Downtime both contint2002 and contint1002
  • Disable Puppet
  • Stop the services: sudo systemctl stop jenkins and sudo systemctl stop zuul

Rsync data and states

Now that services are stopped, resynchronize all artifacts and states:

  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /srv/jenkins/ rsync://contint1002.wikimedia.org/ci--srv-jenkins-
  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/jenkins/ rsync://contint1002.wikimedia.org/ci--var-lib-jenkins-
  • sudo rsync -ap --whole-file --delete-delay --info=progress2 /var/lib/zuul/ rsync://contint1002.wikimedia.org/ci--var-lib-zuul-
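
After the final sync, a dry run can hint at whether anything is still out of date. This is a sketch rather than a runbook step: it uses rsync's --dry-run and --itemize-changes options, which print one line per item that would be transferred or deleted, and counts those lines.

```shell
# Count pending changes from `rsync --dry-run --itemize-changes` output read
# on stdin: one non-empty line per item that would be changed.
count_pending() {
  grep -c . || true
}

# Usage on the source host, once Jenkins and Zuul are stopped:
#   sudo rsync -ap --whole-file --delete-delay --dry-run --itemize-changes \
#     /var/lib/zuul/ rsync://contint1002.wikimedia.org/ci--var-lib-zuul- \
#     | count_pending
```

A count of 0 suggests the destination matches the source for that path; any other number is the count of items rsync would still touch.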

Change DNS

  • Change contint.wikimedia.org CNAME from contint2002.wikimedia.org to contint1002.wikimedia.org
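
Whether the CNAME change is visible can be checked with dig. The sketch below assumes dig is installed; `dig +short` prints the CNAME target with a trailing dot, which the helper strips before comparing. Resolvers may keep returning the old target until the record's TTL expires.

```shell
# Compare a CNAME target as printed by `dig +short` (ending with a trailing
# dot) against the expected host name.
cname_matches() {
  actual=${1%.}
  [ "$actual" = "$2" ]
}

# Usage:
#   target=$(dig +short contint.wikimedia.org CNAME)
#   cname_matches "$target" contint1002.wikimedia.org && echo "DNS switched"
```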

Change primary host in Puppet/Hiera/CI config

  • profile::ci::manager_host: contint1002.wikimedia.org
  • In profile::zuul::merger::conf change gearman_server to the IP of contint1002.wikimedia.org: 208.80.153.39
  • Run Puppet on contint1002 to point the zuul-merger to the new host
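
Because gearman_server is an IP address, it may be safer to look the address up than to copy it by hand. A minimal sketch, assuming glibc's `getent ahostsv4` is available: each line of its output starts with the address, so the helper prints the first field of the first line.

```shell
# Print the first address from `getent ahostsv4 <name>` output read on stdin.
first_address() {
  awk 'NR == 1 { print $1 }'
}

# Usage:
#   getent ahostsv4 contint1002.wikimedia.org | first_address
```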

Start services

  • Update the Zuul config: from integration/config, run ./fab deploy_zuul
  • Enable and run Puppet on contint1002 which should bring up both Jenkins and Zuul

Verify:

3) reimage contint2002.wikimedia.org

TODO copy paste the 1) reimage contint1002.wikimedia.org checklist here.

Details

Repo | Branch | Lines +/-
operations/puppet | production | +0 -73
operations/puppet | production | +0 -1
operations/puppet | production | +0 -5
operations/puppet | production | +0 -4
operations/puppet | production | +3 -2
operations/puppet | production | +1 -3
operations/puppet | production | +6 -4
operations/puppet | production | +2 -2
operations/puppet | production | +2 -0
integration/config | master | +10 -10
operations/puppet | production | +2 -2
operations/puppet | production | +2 -2
operations/puppet | production | +2 -2
operations/dns | master | +1 -1
operations/puppet | production | +1 -0
operations/puppet | production | +1 -0
operations/puppet | production | +1 -0
operations/puppet | production | +1 -1
operations/puppet | production | +3 -1
operations/puppet | production | +38 -26
Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-04-16T19:47:44Z] <hashar@deploy1002> Started deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517

Mentioned in SAL (#wikimedia-operations) [2024-04-16T19:47:52Z] <hashar@deploy1002> Finished deploy [zuul/deploy@efce3ee]: Redeploy Zuul following host reimaging - T334517 (duration: 00m 08s)

Change #1020344 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: set new default docker version for bullseye

https://gerrit.wikimedia.org/r/1020344

Mentioned in SAL (#wikimedia-operations) [2024-04-16T20:30:21Z] <mutante> CI - jenkins and zuul-merger are re-enabled on contint1002 after distro upgrade to bullseye - T334517

kind of weird: contint1002 and contint2002 both have rsyncd running and rsync snippets once created by puppet but I can't find the code in puppet and there are references to old server contint2001 that are nowhere in the repo either.

it's almost like puppetized rsync existed and was removed and just the remnants weren't deleted by puppet.

data paths and sizes:

existing primary server:

root@contint2002:/# du -hs /var/lib/jenkins
2.2G	/var/lib/jenkins
root@contint2002:/# du -hs /var/lib/zuul/
6.0M	/var/lib/zuul/
root@contint2002:/# du -hs /srv/jenkins
291G	/srv/jenkins

reimaged server:

root@contint1002:/# du -hs /var/lib/jenkins
3.0M	/var/lib/jenkins
root@contint1002:/# du -hs /var/lib/zuul/
44K	/var/lib/zuul/
root@contint1002:/# du -hs /srv/jenkins
4.0K	/srv/jenkins

Mentioned in SAL (#wikimedia-operations) [2024-04-17T23:14:01Z] <mutante> rsyncing jenkins data from contint2002 to contint1002, pre-sync in preparation for migration next week - /srv/jenkins (291G) and much smaller zuul and jenkins data dirs T334517

Change #1020950 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: disable zuul merger on contint2002 for migration

https://gerrit.wikimedia.org/r/1020950

Change #1020951 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch contint.wikimedia.org from contint2002 to contint1002

https://gerrit.wikimedia.org/r/1020951

Change #1020954 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: switch contint manager_host from 2002 to 1002

https://gerrit.wikimedia.org/r/1020954

bugs in runbook:

  • there is no contint.discovery.wmnet name - it's contint.wikimedia.org
  • there is also a "gearman_server" setting which is an IP address hardcoded in Hiera and it's not mentioned in the runbook (profile::zuul::merger::conf)
  • doesn't mention rsync source and dest host settings in Hiera, which need to be switched with the failover
  • at the end, delete host-name based special settings like the docker version that are now globally the same again (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020316)
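
Several of these gaps come from host names or addresses hardcoded in configuration. One way to find such settings before a switchover is to search a checkout for references to the current primary. This is a sketch only: the hieradata/ path in the usage line is an assumption about the operations/puppet layout.

```shell
# List files under a directory that mention a given host name or address.
# -r recurses, -l prints file names only, -F treats the pattern as a fixed
# string so the dots in host names are not regex wildcards.
find_refs() {
  grep -rlF -- "$1" "$2" | sort
}

# Usage from a checkout of operations/puppet (path is an assumption):
#   find_refs contint2002 hieradata/
```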

Change #1020955 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: switch gearman_server IP from contint2002 to contint1002

https://gerrit.wikimedia.org/r/1020955

Change #1020957 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: switch source and destination server for data rsync

https://gerrit.wikimedia.org/r/1020957

Change #1020958 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org

https://gerrit.wikimedia.org/r/1020958

  • confirmed rsync commands are working and executed them as noted in the runbook above - /srv/jenkins , /var/lib/zuul and /var/lib/jenkins have been pre-synced to contint1002

@hashar This happened today:

19:50 < Reedy> !log Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1028587
20:08 <+wikibugs> (Merged) jenkins-bot: jjb: Add some ruby3.1 jobs [integration/config] - https://gerrit.wikimedia.org/r/1028593 (owner: Reedy)
20:08 <+wikibugs> (PS1) Reedy: jjb: Add ruby3.1 mediawiki-ruby-api job [integration/config] - https://gerrit.wikimedia.org/r/1028594

20:29 < Reedy> !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1028596

19:50 < Reedy> !log Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1028587

Do we have to think of that when switching the server over? Asking because the "on contint primary" caught my eye.

on contint primary

That is from the deployment script https://gerrit.wikimedia.org/g/integration/config/+/refs/heads/master/fab which refers to the contint.wikimedia.org hostname. That is thus switching once the DNS switch has settled :)

Mentioned in SAL (#wikimedia-operations) [2024-05-13T14:09:04Z] <dzahn@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on contint2002.wikimedia.org with reason: T334517

Mentioned in SAL (#wikimedia-operations) [2024-05-13T14:09:19Z] <dzahn@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint2002.wikimedia.org with reason: T334517

Mentioned in SAL (#wikimedia-operations) [2024-05-13T14:09:40Z] <dzahn@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on contint1002.wikimedia.org with reason: T334517

Mentioned in SAL (#wikimedia-operations) [2024-05-13T14:09:55Z] <dzahn@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint1002.wikimedia.org with reason: T334517

Change #1020950 merged by Dzahn:

[operations/puppet@production] ci: disable zuul merger on contint2002 for migration

https://gerrit.wikimedia.org/r/1020950

Mentioned in SAL (#wikimedia-operations) [2024-05-13T14:15:11Z] <mutante> CI - migration in progress - stopping jenkins and zuul (T334517)

Change #1020951 merged by Dzahn:

[operations/dns@master] switch contint.wikimedia.org from contint2002 to contint1002

https://gerrit.wikimedia.org/r/1020951

Change #1020954 merged by Dzahn:

[operations/puppet@production] ci: switch contint manager_host from 2002 to 1002

https://gerrit.wikimedia.org/r/1020954

Change #1020955 merged by Dzahn:

[operations/puppet@production] ci: switch gearman_server IP from contint2002 to contint1002

https://gerrit.wikimedia.org/r/1020955

Change #1020957 merged by Dzahn:

[operations/puppet@production] ci: switch source and destination server for data rsync

https://gerrit.wikimedia.org/r/1020957

Mentioned in SAL (#wikimedia-operations) [2024-05-13T14:34:17Z] <mutante> CI - switch over to other contint server finished - T334517

Change #1030991 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: switch contint2002 to contint1002

https://gerrit.wikimedia.org/r/1030991

Change #1030991 merged by jenkins-bot:

[integration/config@master] jjb: switch contint2002 to contint1002

https://gerrit.wikimedia.org/r/1030991

Mentioned in SAL (#wikimedia-releng) [2024-05-13T15:18:42Z] <hashar> deployment-prep: deleted security rule for 208.80.154.17 ssh and port 2 (sic) and allow 208.80.154.132 / contint1002 port 22 instead # T334517

Change #1020958 merged by Dzahn:

[operations/puppet@production] ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org

https://gerrit.wikimedia.org/r/1020958

Change #1032010 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci/zuul: use localhost as gearman server

https://gerrit.wikimedia.org/r/1032010

Change #1032010 merged by Dzahn:

[operations/puppet@production] ci/zuul: use localhost as gearman server

https://gerrit.wikimedia.org/r/1032010

Change #1032013 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] zuul: add DNS lookup for gearman server IP

https://gerrit.wikimedia.org/r/1032013

Change #1032013 merged by Dzahn:

[operations/puppet@production] zuul: add DNS lookup for gearman server IP

https://gerrit.wikimedia.org/r/1032013

Change #1032023 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: set puppet7 on role level

https://gerrit.wikimedia.org/r/1032023

Change #1020344 merged by Dzahn:

[operations/puppet@production] contint: set new default docker version for bullseye

https://gerrit.wikimedia.org/r/1020344

Mentioned in SAL (#wikimedia-operations) [2024-05-16T14:43:02Z] <dzahn@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on contint2002.wikimedia.org with reason: T334517

Mentioned in SAL (#wikimedia-operations) [2024-05-16T14:43:18Z] <dzahn@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on contint2002.wikimedia.org with reason: T334517

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Change #1032023 merged by Dzahn:

[operations/puppet@production] ci: set puppet7 at role level

https://gerrit.wikimedia.org/r/1032023

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye executed with errors:

  • contint2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye executed with errors:

  • contint2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye executed with errors:

  • contint2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS buster

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS buster executed with errors:

  • contint2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint2002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint2002.wikimedia.org with OS bullseye completed:

  • contint2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405162014_dzahn_464740_contint2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

After a couple of problems with reimaging (networking config via DHCP in the Debian installer failing due to networking changes, RAID setup failure due to device names changing because of an attached cdrom drive, cookbook locks override, apache module race condition on first puppet run), contint2002 has now been reimaged with bullseye.

jenkins and zuul services are masked as they should be.

Congratulations on having reimaged contint2002! It is missing steps though:

  • attached as a Jenkins agent to the Jenkins controller which runs on the other host
  • running the secondary zuul-merger daemon (and its companion git-daemon)

And we can also get rid of the Ganeti VM contint1003.eqiad.wmnet which was created for testing Zuul under Bullseye ( T361224 ).

Mentioned in SAL (#wikimedia-releng) [2024-05-22T07:35:42Z] <hashar> Jenkins: added back contint2002 as an Agent. The host got reimaged from Buster to Bullseye # T334517

And we can also get rid of the Ganeti VM contint1003.eqiad.wmnet

I wanted to ask you about this anyways. Would have just used another ticket. I'll get right on that.

Change #1034954 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove contint1003 - former testing machine

https://gerrit.wikimedia.org/r/1034954

Change #1034955 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] role: delete ci_test role, not used anymore

https://gerrit.wikimedia.org/r/1034955

Change #1034954 merged by Dzahn:

[operations/puppet@production] site: remove contint1003 - former testing machine

https://gerrit.wikimedia.org/r/1034954

Change #1034965 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] hieradata: delete host contint1003

https://gerrit.wikimedia.org/r/1034965

Change #1034965 merged by Dzahn:

[operations/puppet@production] hieradata: delete host contint1003

https://gerrit.wikimedia.org/r/1034965

@hashar contint1003 is removed. What's next? Do we need a Hiera change to unmask any services?

Change #1034955 merged by Dzahn:

[operations/puppet@production] role: delete ci_test role, not used anymore

https://gerrit.wikimedia.org/r/1034955

Change #1036762 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: enable zuul-merger daemon on contint2002

https://gerrit.wikimedia.org/r/1036762

Congratulations on having reimaged contint2002! It is missing steps though:

  • attached as a Jenkins agent to the Jenkins controller which runs on the other host

How do you do that? Is this just an action in the integration.wikimedia.org web UI?

  • running the secondary zuul-merger daemon (and its companion git-daemon)

git-daemon is already running.

Active: active (running) since Thu 2024-05-16 20:30:55 UTC; 1 weeks 5 days ago

zuul-merger is masked and I uploaded https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036762 to unmask it.

And we can also get rid of the Ganeti VM contint1003.eqiad.wmnet which was created for testing Zuul under Bullseye ( T361224 ).

done!

@hashar I see jobs at https://integration.wikimedia.org/ci/computer/contint2002/

Doesn't this mean the " attached as a Jenkins agent to the Jenkins controller" is done?

@hashar I see jobs at https://integration.wikimedia.org/ci/computer/contint2002/

Doesn't this mean the " attached as a Jenkins agent to the Jenkins controller" is done?

Yes, see the comment I made above when I added it back:

Mentioned in SAL (#wikimedia-releng) [2024-05-22T07:35:42Z] <hashar> Jenkins: added back contint2002 as an Agent. The host got reimaged from Buster to Bullseye # T334517

Change #1036762 merged by Jelto:

[operations/puppet@production] contint: enable zuul-merger daemon on contint2002

https://gerrit.wikimedia.org/r/1036762

zuul-merger is now running on contint2002.wikimedia.org

That concludes the OS upgrade to Bullseye for those two hosts.

:) cool, thanks for this and the explanation on IRC