User Details
- User Since: Oct 3 2014, 8:06 AM (507 w, 2 d)
- Availability: Available
- IRC Nick: godog
- LDAP User: Filippo Giunchedi
- MediaWiki User: FGiunchedi (WMF)
Fri, Jun 14
This is effectively done; what's left on my end is to put the sandbox/filippo/pontoon-puppetserver branch up for review and get it merged, which I'll do in July when I'm back from vacation.
Thu, Jun 13
503s are gone as of ~12:20 UTC
Wed, Jun 12
Hey @Peter, sure, that seems reasonable to me. I have prepped https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042223 to be merged whenever the server is ready to be switched over on your end! I'll be out starting next week until the end of June, though other o11y folks can merge the patch if things are good to go before I'm back. HTH!
{{done}}; resolving, though please reopen if something is amiss.
Indeed, using envoy_cluster_upstream_rq as-is is not going to fly; that is a metric with a lot of cardinality and can't reasonably be used unfiltered. Make sure you are filtering on more tags to get a smaller result back, depending on your needs. For comparison, a metric looks like the sketch below:
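(Illustrative only: the label names follow Envoy's standard Prometheus naming and the label values and numbers are made up, not an actual series from our setup.)

```
# a single series carries several labels (values made up here):
envoy_cluster_upstream_rq{envoy_cluster_name="local_service", envoy_response_code="200", instance="mw1414:9361", site="eqiad"} 123456

# narrowing with more label matchers keeps the result set small, e.g.:
sum by (envoy_cluster_name) (
  rate(envoy_cluster_upstream_rq{site="eqiad", envoy_response_code=~"5.."}[5m])
)
```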
We've definitely made progress here with various limits in place, resolving
Mon, Jun 10
Yes, the tl;dr is unexpectedly high database lag during T367019: Switchover s2 master (db2204 -> db2207). cc @ABran-WMF
I'm optimistically resolving the task, though please feel free to reopen if something is amiss, and thank you for your report.
Calling this one done: the Debian package is uploaded and the container updated.
@Benwing2 things are back to normal!
Thank you for the report @Benwing2; we're experiencing an issue with some wikis, including enwiktionary: https://www.wikimediastatus.net/incidents/94wcdd42gzxz
I've silenced the BenthosKafkaConsumerLag alert for now until we investigate and fix this.
Fri, Jun 7
I tested the package above in pontoon and it works as expected; I'll be importing it into apt and rolling it out to production/baremetal early next week.
Thu, Jun 6
Thank you @MatthewVernon; I'll be trimming the retention further
Wed, Jun 5
I have refreshed the Debian packaging and pushed a new packaging-wikimedia branch to the gerrit repo for prometheus-statsd-exporter; the resulting package is available at /var/cache/pbuilder/result/bookworm-amd64/prometheus-statsd-exporter_0.26.1-1_amd64.deb on build2001. Note that I've patched the source to also accept the --statsd.relay-address argument (as opposed to upstream's --statsd.relay.address) to ease upgrades. Once we have the new version rolled out we can change puppet to use upstream's flag
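To illustrate the flag difference, a hedged invocation sketch; the binary path, listen flag and relay address are placeholders rather than the actual puppetized command line:

```
# Wikimedia package (patched): keeps accepting the flag spelling puppet uses today
/usr/bin/prometheus-statsd-exporter --statsd.listen-udp=:9125 \
    --statsd.relay-address=statsd.example.org:8125

# upstream 0.26.1 spelling, to switch puppet to once the new version is rolled out
/usr/bin/prometheus-statsd-exporter --statsd.listen-udp=:9125 \
    --statsd.relay.address=statsd.example.org:8125
```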
Tue, Jun 4
I'm not sure exactly what happened, though while working today on {T366555} the md1 RAID on centrallog1002 wouldn't come up cleanly. I've assembled it with three disks and then put back the fourth, also correcting this mismatch in the process.
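For the record, roughly the mdadm steps involved (a sketch; device names are examples, not the actual md1 members):

```
# assemble the array from the three healthy members
mdadm --assemble /dev/md1 /dev/sda2 /dev/sdb2 /dev/sdc2
# re-add the fourth disk, triggering a resync
mdadm --manage /dev/md1 --add /dev/sdd2
```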
May 24 2024
Also might be related: https://github.com/rsyslog/rsyslog/issues/5215 https://github.com/rsyslog/rsyslog/issues/5176
Upstream issue: https://github.com/rsyslog/rsyslog/issues/4186
May 23 2024
The change is deployed; it's not a permanent fix, though at least the ongoing toil is reduced for now.
May 22 2024
I can't authoritatively answer the questions, though IIRC from my chat with @Clement_Goubert, using mesh/envoy (vs. not) was for symmetry with the rest of MW. To be clear: I don't feel strongly either way; whichever is best practice in this case works for me (ditto for the first question, FWIW).
May 20 2024
We had a peak of ~700GB used, so we're fine. Wrapping up T361229 and T359449 will eventually lead us to a raid0 of ~2TB
We currently have tracing enabled for cxserver and citoid in staging. As a first step, and to gain confidence, I'll enable tracing for those in production.
For the folks subscribed to this task and interested in beta-testing, please see sandbox/filippo/pontoon-puppetserver branch and its README.md: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/sandbox/filippo/pontoon-puppetserver/modules/pontoon/
May 16 2024
Splitting up curator work SGTM!
May 15 2024
This is done! Latest chart version is deployed
May 13 2024
An example task of such migration is https://phabricator.wikimedia.org/T246998, which basically translates to:
- provision a new oidc client for prometheus in idp
- introduce a prometheus apache configuration to proxy requests for prometheus-SITE.wikimedia.org to oauth2-proxy
- configure oauth2-proxy to proxy authenticated requests to prometheus.svc.SITE.wmnet (see the sketch after this list)
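As a rough sketch of the last step, an oauth2-proxy invocation could look like the following; the issuer URL, addresses and secrets are placeholders (eqiad used as an example SITE), not the actual values:

```
oauth2-proxy \
  --provider=oidc \
  --oidc-issuer-url=https://idp.wikimedia.org/oidc \
  --client-id=prometheus \
  --client-secret=REDACTED \
  --redirect-url=https://prometheus-eqiad.wikimedia.org/oauth2/callback \
  --http-address=127.0.0.1:4180 \
  --upstream=https://prometheus.svc.eqiad.wmnet/ \
  --email-domain=wikimedia.org \
  --cookie-secret=REDACTED
```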
A bit of a different route, though I can't remember if we have tried telling pyrra filesystem that "prometheus" is thanos-rule on localhost, i.e. --prometheus-url http://localhost:17902/rule/ ? In other words, let pyrra filesystem effectively do the reload.
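Concretely the idea would be along these lines (only the flag mentioned above is shown; the rest of pyrra's filesystem configuration is omitted):

```
# sketch: point pyrra's filesystem mode at the local thanos-rule HTTP endpoint,
# so pyrra itself triggers the rule reload
pyrra filesystem --prometheus-url=http://localhost:17902/rule/
```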
May 10 2024
From my POV prometheus in magru is live and working, see also https://prometheus-magru.wikimedia.org/
I've run into this today; what seems to happen is the following:
May 8 2024
Also cc T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends and @elukey since the thanos-fe work here will help with that task too
May 7 2024
Thank you for taking a look at this @andrea.denisse @Dzahn. Filtering the targets by job ncredir (https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&var-datasource=thanos&var-Filters=job%7C%3D%7Cncredir) you'll notice that the port is 3904, which is mtail and not benthos. So indeed this is T362776: replace mtail with benthos on ncredir instances and the missing bits are removing the now-obsolete ncredir job from Prometheus. (cc @Vgutierrez)
May 6 2024
Indeed, I agree the root cause is what @colewhite pointed out. In light of the fact that (as far as I'm aware) we don't have an ETA to tweak the statsd-exporter deployment on wikikube as described in T359640, I think we should go back to the graphite/statsd metric for edits, so numbers are accurate.
I've tried installing prometheus7001 today with help from @Muehlenhoff, although there's no console and some PXE/TFTP interaction with install7001 is suspected. I'll hold off on further steps until VMs can be installed.
Thank you for the investigation @Scott_French! That sounds sensible to me and I'm happy to review patches for the o11y bits; on the general confd bits, though, I'm not sure who owns the system.
Thank you all for looking into this!
Apr 30 2024
lshw as requested
I have spent some time investigating this issue and I believe this is a case of https://github.com/prometheus/common/issues/598. Specifically, Prometheus does reload certs from disk; however, they are only used for new connections, not existing ones. If existing connections are idle for more than 5 minutes they are recycled; if that doesn't happen, existing (possibly expired) certificates keep being used.
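For illustration, a minimal Go sketch (not Prometheus' actual code) of the pattern behind that issue: the client certificate is re-read from disk at TLS handshake time, so only new connections pick up a renewed cert, while kept-alive connections reuse whatever they were established with until the idle timeout recycles them.

```
// Minimal sketch, not Prometheus' actual code: GetClientCertificate re-reads
// the files at TLS handshake time, i.e. once per *new* connection. An
// already-established connection keeps the certificate it was handshaken
// with until the idle timeout recycles it.
package main

import (
	"crypto/tls"
	"net/http"
	"time"
)

func newClient(certFile, keyFile string) *http.Client {
	tlsCfg := &tls.Config{
		GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certFile, keyFile)
			if err != nil {
				return nil, err
			}
			return &cert, nil
		},
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: tlsCfg,
			// connections idle for longer than this are closed; until then the
			// (possibly expired) handshake-time certificate keeps being used
			IdleConnTimeout: 5 * time.Minute,
		},
	}
}

func main() {
	_ = newClient("/etc/ssl/client.pem", "/etc/ssl/client.key")
}
```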