User Details
- User Since
- May 11 2015, 8:31 AM (475 w, 2 d)
- Availability
- Available
- IRC Nick
- jynus
- LDAP User
- Jcrespo
- MediaWiki User
- JCrespo (WMF) [ Global Accounts ]
Today
I wasn't worried about immediate alerts. I know those won't change for now.
@ABran-WMF It will very likely change it, because, as you shared, the exporter does:
I don't see a clear difference with the current icinga/perl implementation.
I am also not going to enable any account until the end of the load (including monitoring) to avoid any bad interaction.
The process seems to have failed at the last steps. Retrying with a higher buffer pool and stopping s1.
Yesterday
I am leaving for the day, but there is a chance this is not worth debugging because the hosts are about to be decommissioned (unless it happens on the new ones, too). Filing it in case it could be useful for other perf issues for other hosts.
While technically the host didn't crash- it had an "unscheduled normal shutdown", given it is the source of s3 backups on eqiad, I am going to recover it from backups.
Mon, Jun 17
And this is the wiki distribution:
This is the API request I filed: T267365
@ABran-WMF As you can see, codfw health status is much better (I queried it just before restarting) ^
Fri, Jun 14
Done!
Deleted from zarcillo and stopped.
Wed, Jun 12
The alerts should be configurable by lag and by role from puppet- that means: I don't want alerts for backup sources lagging < 4h, as I regularly stop those while taking the backups. E.g. core db hosts vs misc vs test hosts, etc.
db1205 is the secondary media backups metadata db server, usually just a standby to db1204. Unless it is the active server because the primary is unavailable, the only check needed is that replication restarts correctly after maintenance.
backup1011 is a mediabackups storage server. Ideally, mediabackups are paused during the maintenance to avoid backup errors.
backup1009 is the main backup node for bacula on eqiad. Most backups happen during the night- so just monitoring that it came back and new backups happen normally would be enough.
backup1010 is in intermittent use to support mediabackups disk space, but mostly idle at the moment, so unless its situation changes by July and it finally gets pooled for bacula, it will require no action.
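The role-aware lag alerting described above could be sketched roughly like this (a hypothetical illustration only, not the actual puppet/alerting code; the role names and thresholds are made up for the example):

```python
# Hypothetical sketch: pick a replication-lag alert threshold per host
# role, so backup sources that are intentionally stopped for hours while
# backups run don't page. All numbers here are illustrative.
THRESHOLDS_SECONDS = {
    "core": 300,                # page quickly for core db hosts
    "misc": 900,
    "backup_source": 5 * 3600,  # backups stop replication for hours
    "test": None,               # never page for test hosts
}

def should_alert(role: str, lag_seconds: float) -> bool:
    """Return True if the measured replication lag warrants an alert for this role."""
    threshold = THRESHOLDS_SECONDS.get(role)
    return threshold is not None and lag_seconds > threshold
```

With a mapping like this, a backup source lagging 3h stays quiet while a core host lagging 10 minutes pages.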
@Marostegui, in order to resolve this ticket, now that I assume read activity is lower, do you think I could get a host from es4 and es5 on both dcs depooled for a day, with exclusive usage, in order to take a final, archivable, full backup of those sections? It doesn't have to happen at the same time on the 4 hosts:
@ABran-WMF Thanks for handling it. To confirm, the issue happened at 2024-06-11 13:53:41 (Tuesday), right (or was it earlier)? Because I may recover the host from backups just to be 100% sure there is no leftover corruption.
Fri, Jun 7
This is ready for dc-ops.
This is ready for dc ops.
This is ready for dc ops.
Tue, Jun 4
Then I mistook the ops user for the ops db, sorry.
ops is the user under which the query killer events & logs run. If you drop it, the events will fail and the dbs will get overloaded, as usually happens when the events for a db haven't been loaded properly.
Wed, May 29
I will migrate the backups to 10.6 without removing yet the 10.4 backup sources.
@Volans not Amir, but re: your first question, my understanding is that this was a compromise to make sure there was something good enough and simple short term, rather than overengineering from the start. That doesn't mean that what you suggest is discarded, just that it could be improved later on. For example, I am personally interested in having a queryable service/API for backup checks later, but this is better than nothing ATM, with relatively small effort. Later on, a database could import the file and generate it, for example. So I am a fan of iterating slowly as long as each step is an improvement 0:-D.
Thu, May 23
Wed, May 22
Followup to T361087.
I did a disk stress test for an hour or so and saw no media errors, SMART errors or RAID controller weirdness.
Resolving for now. A disk was rebuilt on the 17th of May:
Tue, May 21
- Stop es4 and es5 backups
- Generate a full clusterX and clusterY last backup
- Archive it into long term backups
- Remove dump user
May 14 2024
May 13 2024
Thanks, the upgrade is no issue, but the data will have a lot of backup errors due to not being depooled before maintenance, so it will need some work.
May 9 2024
All backups will now be generated from 10.6 servers, with the exception of s1. Leaving a couple of hosts on 10.4 before upgrading/decommissioning them.
@Marostegui es6 and es7 backups are enabled, and a first run was done here. They seem mostly empty, though:
May 7 2024
May 6 2024
It was failing back in 2021:
Here are the 2 file versions (the hashes confirm they are the same file):
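Comparing two files by hash, as done above, can be sketched with a short script (a generic illustration, not the tooling actually used here; any hash mismatch means the contents differ):

```python
import hashlib

def file_digest(path: str, algo: str = "sha256") -> str:
    """Hash a file in chunks so large backup files don't need to fit in RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def same_content(path_a: str, path_b: str) -> bool:
    """True if both files have byte-identical contents (same digest)."""
    return file_digest(path_a) == file_digest(path_b)
```

Printing the two digests side by side makes the comparison easy to archive alongside the files.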
Apr 30 2024
Apr 25 2024
In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.
If booted into bullseye.