fgiunchedi (Filippo Giunchedi)
/* No comment */

User Details

User Since
Oct 3 2014, 8:06 AM (507 w, 2 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
FGiunchedi (WMF) [ Global Accounts ]

Recent Activity

Fri, Jun 14

fgiunchedi added a comment to T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver.

This is effectively done; what's left on my end is to put the sandbox/filippo/pontoon-puppetserver branch up for review and get it merged, which I'll do in July when I'm back from vacation.

Fri, Jun 14, 8:48 AM · User-fgiunchedi, Pontoon

Thu, Jun 13

fgiunchedi added a comment to T367401: steady increase in 503s from mw-api-ext-ro.discovery.wmnet since 5 UTC.

503s are gone as of ~12:20 UTC

Thu, Jun 13, 12:37 PM · serviceops
fgiunchedi created T367401: steady increase in 503s from mw-api-ext-ro.discovery.wmnet since 5 UTC.
Thu, Jun 13, 12:02 PM · serviceops

Wed, Jun 12

fgiunchedi updated the task description for T356386: Move all o11y services to discovery.wmnet.
Wed, Jun 12, 3:43 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi added a comment to T367064: Change graphite-synthetic-testing to new Graphite instance in Grafana.

Thank you @fgiunchedi. If I do the work tomorrow (Thursday), do you have time to merge it during the day? I can stop the collections early in the morning CET, and then you can do your thing when you have time and sync with me on Slack/IRC?

Wed, Jun 12, 2:58 PM · SRE Observability, Synthetic-Performance-Testing, Quality-and-Test-Engineering-Team
fgiunchedi added a comment to T367064: Change graphite-synthetic-testing to new Graphite instance in Grafana.

hey @Peter, sure that seems reasonable to me. I have prepped https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042223 to be merged whenever the server is ready to be switched over on your end! I'll be out starting next week until the end of June; though other o11y folks can merge the patch if things are good to go before I'm back. HTH!

Wed, Jun 12, 12:21 PM · SRE Observability, Synthetic-Performance-Testing, Quality-and-Test-Engineering-Team
fgiunchedi closed T362633: Growth team product KPI Grafana dashboard has `update_` task type, which does not exist as Resolved.

{{done}}; resolving though reopen if sth is amiss

Wed, Jun 12, 10:25 AM · Grafana, SRE, Growth-Team, GrowthExperiments-Homepage
fgiunchedi added a comment to T367143: Miscweb K8s dashboard loading issues.

Indeed using envoy_cluster_upstream_rq is not going to fly; that is a metric with a lot of cardinality and can't be reasonably used as-is. Make sure you are filtering on more tags to get a smaller result back, depending on needs. For comparison a metric looks like this:

Wed, Jun 12, 10:20 AM · SRE Observability
fgiunchedi closed T349999: Limit thanos-query resource usage as Resolved.

We've definitely made progress here with various limits in place, resolving

Wed, Jun 12, 10:14 AM · Observability-Metrics

Tue, Jun 11

fgiunchedi added a project to T367149: Add "file age" node textfile exporter capability: Observability-Metrics.
Tue, Jun 11, 2:08 PM · Observability-Metrics, Observability-Alerting
fgiunchedi updated the task description for T367076: benthos mw-accesslog-metrics kafka lag and interpolation errors.
Tue, Jun 11, 10:53 AM · Observability-Logging, SRE, serviceops, MW-on-K8s
fgiunchedi created T367149: Add "file age" node textfile exporter capability.
Tue, Jun 11, 9:41 AM · Observability-Metrics, Observability-Alerting

Mon, Jun 10

fgiunchedi updated the task description for T367076: benthos mw-accesslog-metrics kafka lag and interpolation errors.
Mon, Jun 10, 4:00 PM · Observability-Logging, SRE, serviceops, MW-on-K8s
fgiunchedi created T367076: benthos mw-accesslog-metrics kafka lag and interpolation errors.
Mon, Jun 10, 3:47 PM · Observability-Logging, SRE, serviceops, MW-on-K8s
fgiunchedi added a comment to T360895: Memory upgrade request for prometheus200[56].

I got the ram out of the servers. I can do the addition tomorrow morning at 8am CDT (1600 UTC) or another time that works for you.

Mon, Jun 10, 3:30 PM · DC-Ops, SRE, ops-codfw, Observability-Metrics
fgiunchedi created T367065: Move profile::idp::client::httpd::site checks to Prometheus blackbox probes.
Mon, Jun 10, 2:44 PM · Infrastructure-Foundations, Observability-Alerting
fgiunchedi updated subscribers of T367033: page saves to English Wiktionary are getting lost.

Yes, the tl;dr is unexpectedly high database lag during T367019: Switchover s2 master (db2204 -> db2207) cc @ABran-WMF

Mon, Jun 10, 1:34 PM · Wikimedia-Incident
fgiunchedi closed T367033: page saves to English Wiktionary are getting lost as Resolved.

I'm optimistically resolving the task, though please feel free to reopen if sth is amiss and thank you for your report

Mon, Jun 10, 1:01 PM · Wikimedia-Incident
fgiunchedi closed T302373: Upgrade prometheus-statsd-exporter as Resolved.

Calling this one done, debian package is uploaded and container updated

Mon, Jun 10, 11:54 AM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T367033: page saves to English Wiktionary are getting lost.

@Benwing2 things are back to normal!

Mon, Jun 10, 11:52 AM · Wikimedia-Incident
fgiunchedi added a comment to T367033: page saves to English Wiktionary are getting lost.

Thank you for the report @Benwing2; we're experiencing an issue with some wikis including enwiktionary : https://www.wikimediastatus.net/incidents/94wcdd42gzxz

Mon, Jun 10, 10:15 AM · Wikimedia-Incident
fgiunchedi added a comment to T366308: More Benthos instances consumes slower?.

I've silenced BenthosKafkaConsumerLag alert for now re: this until we investigate and fix

Mon, Jun 10, 8:26 AM · Observability-Logging

Fri, Jun 7

fgiunchedi added a comment to T360895: Memory upgrade request for prometheus200[56].

I don't have any new ones on hand, but I can pull some of the extra DIMMs from the decommissioned servers. I'll leave enough to make sure the servers being recycled are still operational, if that's what we want to do. Sound good?

Fri, Jun 7, 3:34 PM · DC-Ops, SRE, ops-codfw, Observability-Metrics
fgiunchedi added a comment to T360895: Memory upgrade request for prometheus200[56].

I didn't realize it at the time, though codfw in T354685 got 192GB per host, whereas a week later in T354684 eqiad got 384GB per host.

Fri, Jun 7, 9:16 AM · DC-Ops, SRE, ops-codfw, Observability-Metrics
fgiunchedi added a comment to T302373: Upgrade prometheus-statsd-exporter.

I tested the package above in pontoon and it works as expected, I'll be importing it into apt and rolling it out to production/baremetal early next week

Fri, Jun 7, 8:19 AM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Observability-Metrics
fgiunchedi created P64240 (An Untitled Masterwork).
Fri, Jun 7, 8:11 AM
fgiunchedi updated the task description for T353912: Observability Bookworm upgrades.
Fri, Jun 7, 7:16 AM · SRE Observability (FY2023/2024-Q4), Patch-For-Review

Thu, Jun 6

fgiunchedi added a comment to T351927: Decide and tweak Thanos retention.

Thank you @MatthewVernon; I'll be trimming the retention further

Thu, Jun 6, 10:30 AM · User-fgiunchedi, Observability-Metrics

Wed, Jun 5

fgiunchedi created T366710: Switch k8s logs to their own kafka topics.
Wed, Jun 5, 3:01 PM · Patch-For-Review, Observability-Logging
fgiunchedi added a comment to T302373: Upgrade prometheus-statsd-exporter.

I have refreshed the Debian packaging and pushed a new packaging-wikimedia branch to the gerrit repo for prometheus-statsd-exporter; the resulting package is available at /var/cache/pbuilder/result/bookworm-amd64/prometheus-statsd-exporter_0.26.1-1_amd64.deb on build2001. Note that I've patched the source to also accept the --statsd.relay-address argument (as opposed to upstream's --statsd.relay.address) to ease upgrades. Once we have the new version rolled out we can change puppet to use upstream's flag
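
As an illustration of the flag-compatibility idea described above, here is a minimal Python sketch in which one option accepts both spellings and stores them in the same place. This is not the actual patch: the real exporter is a separate program with its own flag parser, and the names `build_parser`, `parse_relay_address` and `relay_address` are hypothetical, chosen only for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where two flag spellings map to one setting."""
    parser = argparse.ArgumentParser(prog="exporter-sketch")
    parser.add_argument(
        "--statsd.relay.address",  # upstream spelling (from the comment above)
        "--statsd.relay-address",  # older spelling kept to ease upgrades
        dest="relay_address",      # hypothetical destination name
        default=None,
        help="host:port to relay statsd traffic to",
    )
    return parser


def parse_relay_address(argv):
    """Return the relay address given either spelling, or None if absent."""
    return build_parser().parse_args(argv).relay_address


if __name__ == "__main__":
    print(parse_relay_address(["--statsd.relay-address", "localhost:9125"]))
    print(parse_relay_address(["--statsd.relay.address", "localhost:9125"]))
```

Accepting both spellings lets existing configuration keep working during the rollout, after which the older spelling can be dropped.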

Wed, Jun 5, 1:24 PM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.
  1. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI

A per-group statsd-exporter is necessary only if we continue to use statsd-exporter v0.9.0. Upstream enabled the ability to have inconsistent label sets in v0.10.2. I suggest we upgrade statsd-exporter and eliminate the need for per-group exporters: T302373: Upgrade prometheus-statsd-exporter

Wed, Jun 5, 7:59 AM · Patch-For-Review, MW-on-K8s, serviceops, SRE Observability (FY2023/2024-Q4), Observability-Metrics

Tue, Jun 4

fgiunchedi added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node. For the pods to connect to it, we'll just have to pass down the node ip as an env variable to php-fpm, using the downward api to pass status.hostIP.
  2. This has the added advantage of reproducing the load pattern we have on bare metal, of course
  3. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI
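
To make the daemonset idea in point 1 concrete, here is a minimal Python sketch of a client that reads the node address from the environment and sends a fire-and-forget statsd counter over UDP. It is an illustration only: the real client in this discussion is php-fpm, the environment variable names `STATSD_EXPORTER_HOST` and `STATSD_EXPORTER_PORT` are hypothetical, and port 9125 is an assumed default. In the scenario described, the host variable would be filled from status.hostIP via the downward API.

```python
import os
import socket


def statsd_target(default_host="127.0.0.1", default_port=9125):
    """Resolve the exporter address from (hypothetical) env variables."""
    host = os.environ.get("STATSD_EXPORTER_HOST", default_host)
    port = int(os.environ.get("STATSD_EXPORTER_PORT", default_port))
    return host, port


def send_counter(name, value=1, target=None):
    """Send one statsd counter datagram; UDP gives no delivery guarantee."""
    payload = f"{name}:{value}|c".encode("ascii")
    addr = target or statsd_target()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, addr)
    return payload


if __name__ == "__main__":
    send_counter("example.requests")
```

Because the datagram only travels to the local node, a lost packet is less likely than when it crosses the cluster network, which is the point made in the comment.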
Tue, Jun 4, 2:08 PM · Patch-For-Review, MW-on-K8s, serviceops, SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi updated the task description for T366596: datahub-mce-consumer spamming logstash with messages.
Tue, Jun 4, 1:17 PM · Data-Platform-SRE (2024.05.27 - 2024.06.16)
fgiunchedi created T366596: datahub-mce-consumer spamming logstash with messages.
Tue, Jun 4, 1:14 PM · Data-Platform-SRE (2024.05.27 - 2024.06.16)
fgiunchedi added a comment to T363660: Degraded RAID on centrallog1002.

I'm not sure exactly what happened, though while working today on T366555 the md1 RAID on centrallog1002 wouldn't come up cleanly. I've assembled it with three disks and then put back the fourth, also correcting this mismatch in the process

Tue, Jun 4, 11:07 AM · DC-Ops, SRE, ops-eqiad
fgiunchedi merged T366580: Degraded RAID on centrallog1002 into T363660: Degraded RAID on centrallog1002.
Tue, Jun 4, 11:02 AM · DC-Ops, SRE, ops-eqiad
fgiunchedi merged task T366580: Degraded RAID on centrallog1002 into T363660: Degraded RAID on centrallog1002.
Tue, Jun 4, 11:02 AM · SRE, DC-Ops, ops-eqiad
fgiunchedi created T366573: Make sure burrow starts at boot.
Tue, Jun 4, 9:16 AM · SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi created T366571: enable navtiming and statsv services on systemd to start at boot.
Tue, Jun 4, 9:10 AM · SRE Observability (FY2023/2024-Q4), Observability-Metrics

Mon, Jun 3

fgiunchedi created T366492: grafana-server exploding in memory.
Mon, Jun 3, 3:57 PM · Observability-Metrics

Fri, May 31

fgiunchedi created T366346: Mute helmfile apply notifications from cirrus-streaming-updater deploys.
Fri, May 31, 10:36 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Discovery-Search (Current work), CirrusSearch

May 24 2024

fgiunchedi added a comment to T357616: Logs from containers sometimes not visible in logstash.

Also might be related: https://github.com/rsyslog/rsyslog/issues/5215 https://github.com/rsyslog/rsyslog/issues/5176

May 24 2024, 1:40 PM · Patch-For-Review, Observability-Logging, serviceops
fgiunchedi added a comment to T365791: imfile state files not cleaned up in /var/spool/rsyslog.

Upstream issue: https://github.com/rsyslog/rsyslog/issues/4186

May 24 2024, 1:38 PM · Observability-Logging
fgiunchedi closed T358111: oauth2-proxy config changes don't cause any change in the helm Deployment, a subtask of T321211: distributed tracing v1: tech debt blockers, as Resolved.
May 24 2024, 1:38 PM · Observability-Tracing, Epic
fgiunchedi closed T358111: oauth2-proxy config changes don't cause any change in the helm Deployment as Resolved.
May 24 2024, 1:38 PM · Observability-Tracing, Patch-For-Review
fgiunchedi updated the task description for T356386: Move all o11y services to discovery.wmnet.
May 24 2024, 9:40 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi created T365791: imfile state files not cleaned up in /var/spool/rsyslog.
May 24 2024, 8:49 AM · Observability-Logging

May 23 2024

fgiunchedi added a comment to T343529: Prometheus doesn't reload or alert on expired client certificates.

The change is deployed; it is not a permanent fix, though at least the ongoing toil is reduced now

May 23 2024, 9:39 AM · SRE Observability (FY2023/2024-Q4), Prod-Kubernetes, Observability-Metrics, User-fgiunchedi, Kubernetes, serviceops-radar

May 22 2024

fgiunchedi added a comment to T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors.

hello, a silly question from an uninitiated: will the historic "wrongly labelled" metric data be re-bucketed, or are we to assume that there will be no metric data for the period in time in question?

May 22 2024, 1:17 PM · MW-1.43-notes (1.43.0-wmf.5; 2024-05-14), MediaWiki-REST-API, MW-Interfaces-Team, Release-Engineering-Team
fgiunchedi added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

I can't authoritatively answer the questions though IIRC from my chat with @Clement_Goubert on using mesh/envoy vs not it was for symmetry with the rest of mw. To be clear: I don't feel strongly either way, whichever is best practice in this case works for me (ditto for the first question FWIW)

May 22 2024, 1:11 PM · Patch-For-Review, MW-on-K8s, serviceops, SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi closed T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors as Resolved.

The fix has been backported and deployed. It should now be safe to re-enable the metrics.

May 22 2024, 12:45 PM · MW-1.43-notes (1.43.0-wmf.5; 2024-05-14), MediaWiki-REST-API, MW-Interfaces-Team, Release-Engineering-Team

May 20 2024

andrea.denisse awarded T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver a Love token.
May 20 2024, 6:00 PM · User-fgiunchedi, Pontoon
fgiunchedi renamed T228380: Tech debt: sunsetting of Graphite from Tech debt: sunsetting of Graphite (part 1) to Tech debt: sunsetting of Graphite.
May 20 2024, 1:32 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics
fgiunchedi closed T359068: Not enough space on titan2001 for thanos-compact as Resolved.

We had a peak of ~700GB used, so we're fine. Wrapping up T361229 and T359449 will eventually lead us to a raid0 of ~2TB

May 20 2024, 12:24 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi moved T343529: Prometheus doesn't reload or alert on expired client certificates from Up next to Doing on the User-fgiunchedi board.
May 20 2024, 12:18 PM · SRE Observability (FY2023/2024-Q4), Prod-Kubernetes, Observability-Metrics, User-fgiunchedi, Kubernetes, serviceops-radar
fgiunchedi added a comment to T343529: Prometheus doesn't reload or alert on expired client certificates.

Change #1025682 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use longer-expiration pki client certs for k8s

https://gerrit.wikimedia.org/r/1025682

May 20 2024, 10:20 AM · SRE Observability (FY2023/2024-Q4), Prod-Kubernetes, Observability-Metrics, User-fgiunchedi, Kubernetes, serviceops-radar
fgiunchedi added a comment to T320563: our various Envoys are configured to report traces to local OpenTelemetry Collector.

We currently have tracing enabled for cxserver and citoid in staging. As a first step and to gain confidence I'll enable tracing for those in production.

May 20 2024, 10:09 AM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
fgiunchedi added a comment to T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver.

For the folks subscribed to this task and interested in beta-testing, please see sandbox/filippo/pontoon-puppetserver branch and its README.md: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/sandbox/filippo/pontoon-puppetserver/modules/pontoon/

May 20 2024, 8:20 AM · User-fgiunchedi, Pontoon
fgiunchedi updated the task description for T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver.
May 20 2024, 8:18 AM · User-fgiunchedi, Pontoon

May 16 2024

fgiunchedi updated the task description for T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors.
May 16 2024, 9:43 AM · MW-1.43-notes (1.43.0-wmf.5; 2024-05-14), MediaWiki-REST-API, MW-Interfaces-Team, Release-Engineering-Team
fgiunchedi updated subscribers of T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors.

I've temporarily blackholed the metrics from graphite, though of course mw should stop sending them. cc @hashar @daniel @BPirkle

May 16 2024, 9:33 AM · MW-1.43-notes (1.43.0-wmf.5; 2024-05-14), MediaWiki-REST-API, MW-Interfaces-Team, Release-Engineering-Team
fgiunchedi renamed T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors from Per-page graphite metrics created for MediaWiki.rest_api_latency to Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors.
May 16 2024, 9:24 AM · MW-1.43-notes (1.43.0-wmf.5; 2024-05-14), MediaWiki-REST-API, MW-Interfaces-Team, Release-Engineering-Team
fgiunchedi created T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors.
May 16 2024, 9:17 AM · MW-1.43-notes (1.43.0-wmf.5; 2024-05-14), MediaWiki-REST-API, MW-Interfaces-Team, Release-Engineering-Team
fgiunchedi created T365105: Delete 'monitoring' project.
May 16 2024, 8:50 AM · Cloud-VPS
fgiunchedi added a comment to T364190: Curator fails to complete regularly.

Splitting up curator work SGTM!

May 16 2024, 8:36 AM · Observability-Logging

May 15 2024

fgiunchedi awarded T364973: Zone delegation for o11y.wmcloud.org a Like token.
May 15 2024, 12:14 PM · Cloud-VPS
fgiunchedi created T364973: Zone delegation for o11y.wmcloud.org.
May 15 2024, 10:15 AM · Cloud-VPS
fgiunchedi closed T364477: Upgrade jaeger helm chart version to latest upstream as Resolved.

This is done! Latest chart version is deployed

May 15 2024, 8:21 AM · User-fgiunchedi, Patch-For-Review, Observability-Tracing
fgiunchedi closed T364477: Upgrade jaeger helm chart version to latest upstream, a subtask of T320549: distributed tracing v0 [minimum viable], as Resolved.
May 15 2024, 8:19 AM · Epic, Observability-Tracing

May 14 2024

fgiunchedi moved T364477: Upgrade jaeger helm chart version to latest upstream from Backlog to Doing on the User-fgiunchedi board.
May 14 2024, 12:39 PM · User-fgiunchedi, Patch-For-Review, Observability-Tracing
fgiunchedi added a project to T364477: Upgrade jaeger helm chart version to latest upstream: User-fgiunchedi.
May 14 2024, 12:38 PM · User-fgiunchedi, Patch-For-Review, Observability-Tracing
fgiunchedi added a comment to T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit.

A bit of a different route, though I can't remember if we have tried telling pyrra filesystem about "prometheus" being thanos-rule on localhost? i.e. --prometheus-url http://localhost:17902/rule/ ? In other words let pyrra filesystem effectively do the reload

This would be ideal, although as-is the prometheus-url is used to build links for each panel in the pyrra UI to prometheus (thanos in our case) for detailed query/metric view.

FWIW I opened https://github.com/pyrra-dev/pyrra/issues/986 about this last year which got some support but hasn't made any progress yet.

May 14 2024, 8:09 AM · SRE Observability

May 13 2024

fgiunchedi created P62364 (An Untitled Masterwork).
May 13 2024, 2:31 PM
fgiunchedi added a comment to T326657: Add prometheus-https load balancer.

An example task of such migration is https://phabricator.wikimedia.org/T246998, which basically translates to:

  • provision a new oidc client for prometheus in idp
  • introduce a prometheus apache configuration to proxy requests for prometheus-SITE.wikimedia.org to oauth2-proxy
  • configure oauth2-proxy to proxy authenticated requests to prometheus.svc.SITE.wmnet
May 13 2024, 10:43 AM · Traffic, Patch-For-Review, Observability-Metrics
fgiunchedi updated the task description for T356386: Move all o11y services to discovery.wmnet.
May 13 2024, 10:33 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi updated the task description for T356386: Move all o11y services to discovery.wmnet.
May 13 2024, 10:32 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi added a comment to T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit.

A bit of a different route, though I can't remember if we have tried telling pyrra filesystem about "prometheus" being thanos-rule on localhost? i.e. --prometheus-url http://localhost:17902/rule/ ? In other words let pyrra filesystem effectively do the reload

May 13 2024, 9:06 AM · SRE Observability

May 10 2024

fgiunchedi added a comment to T364016: Q4:magru VM tracking task.

From my POV prometheus in magru is live and working, see also https://prometheus-magru.wikimedia.org/

May 10 2024, 9:23 AM · Traffic, Infrastructure-Foundations
fgiunchedi updated the task description for T364016: Q4:magru VM tracking task.
May 10 2024, 9:22 AM · Traffic, Infrastructure-Foundations
fgiunchedi awarded T364454: mgmt ssh access for prometheus hosts in magru a Like token.
May 10 2024, 9:21 AM · netops, Traffic, Infrastructure-Foundations
fgiunchedi added a comment to T364006: Splunk On-Call resets overrides after changes to the escalation.

I've run into this today, what seems to happen is the following:

May 10 2024, 9:19 AM · SRE Observability (FY2023/2024-Q4), Observability-Alerting
fgiunchedi added a comment to T350192: On-call batphone escalation configuration holidays FY2023-24.

Mentioned in SAL (#wikimedia-operations) [2024-05-10T08:30:40Z] <godog> restore SRE business hours oncall for EMEA - T350192

I noticed that batphone wasn't in the steps for sre business hours escalation, I've added back a step to route to sre emea rotation and this worked. Removing the step I believe resets the overrides when we put back the step again. I've then reset the overrides and put back folks that were there previously, which restored things as they were.

May 10 2024, 9:13 AM · SRE Observability (FY2023/2024-Q4)
fgiunchedi added a comment to T350192: On-call batphone escalation configuration holidays FY2023-24.

Mentioned in SAL (#wikimedia-operations) [2024-05-10T08:30:40Z] <godog> restore SRE business hours oncall for EMEA - T350192

May 10 2024, 9:03 AM · SRE Observability (FY2023/2024-Q4)
fgiunchedi updated the task description for T350192: On-call batphone escalation configuration holidays FY2023-24.
May 10 2024, 8:31 AM · SRE Observability (FY2023/2024-Q4)

May 8 2024

fgiunchedi created T364454: mgmt ssh access for prometheus hosts in magru.
May 8 2024, 9:22 AM · netops, Traffic, Infrastructure-Foundations
fgiunchedi updated subscribers of T360414: Phase out cergen for Observability services.

Also cc T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends and @elukey since the thanos-fe work here will help with that task too

May 8 2024, 9:19 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE
fgiunchedi added a comment to T363660: Degraded RAID on centrallog1002.

@fgiunchedi Good to know, thank you. Do you think we should do the syncing again to the new drive?

May 8 2024, 8:56 AM · DC-Ops, SRE, ops-eqiad
fgiunchedi added a comment to T357333: SystemdUnitFailed alerts are too noisy for data-persistence.

@fgiunchedi this is still causing problems (and getting emailed and poked on IRC every 4 hours is far far too noisy, we do need a more flexible policy here).

May 8 2024, 8:55 AM · Data-Persistence, Observability-Alerting

May 7 2024

fgiunchedi closed T355431: Testing task as Invalid.
May 7 2024, 10:32 AM · User-fgiunchedi
fgiunchedi closed T355432: And the test subtask, a subtask of T355431: Testing task, as Invalid.
May 7 2024, 10:31 AM · User-fgiunchedi
fgiunchedi closed T355432: And the test subtask as Invalid.
May 7 2024, 10:31 AM · User-fgiunchedi
fgiunchedi updated the task description for T356994: Alertmanager IRC notifications feedback and improvements.
May 7 2024, 10:25 AM · Observability-Alerting
fgiunchedi added a comment to T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy.

Thank you for taking a look at this @andrea.denisse @Dzahn. Filtering the targets by job ncredir (https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&var-datasource=thanos&var-Filters=job%7C%3D%7Cncredir) you'll notice that the port is 3904, which is mtail and not benthos. So indeed this is T362776: replace mtail with benthos on ncredir instances and the missing bits are removing the now-obsolete ncredir job from Prometheus. (cc @Vgutierrez)

May 7 2024, 8:12 AM · Traffic

May 6 2024

fgiunchedi added a comment to T363914: Discrepancy between Graphite & Prometheus editResponseTime counts.

Indeed I agree that would be the root cause @colewhite pointed out. In light of the fact that (as far as I'm aware) we don't have an ETA to tweak the statsd-exporter deployment on wikikube as described in T359640, I think we should go back to the graphite/statsd metric for edits, so numbers are accurate

May 6 2024, 2:50 PM · MediaWiki-Platform-Team, Observability-Metrics
fgiunchedi updated subscribers of T364016: Q4:magru VM tracking task.

I've tried installing prometheus7001 today with help from @Muehlenhoff although there's no console and some pxe/tftp interaction with install7001 is suspected. I'll hold off further steps for now until VMs can be installed

May 6 2024, 2:45 PM · Traffic, Infrastructure-Foundations
fgiunchedi added a comment to T363924: confd prom exporter cannot distinguish targets with a common base name.

Thank you for the investigation @Scott_French ! That sounds sensible to me and I'm happy to review patches for the o11y bits; on the general confd bits I'm not sure who owns the system though

May 6 2024, 9:00 AM · SRE Observability, SRE
fgiunchedi added a comment to T363660: Degraded RAID on centrallog1002.

Thank you all for looking into this!

May 6 2024, 8:55 AM · DC-Ops, SRE, ops-eqiad
fgiunchedi added a project to T353457: Karma UI shows duplicate alerts: User-fgiunchedi.
May 6 2024, 8:20 AM · User-fgiunchedi, SRE Observability (FY2023/2024-Q4), cloud-services-team, Observability-Alerting

Apr 30 2024

fgiunchedi added a comment to T363660: Degraded RAID on centrallog1002.

lshw as requested

Apr 30 2024, 3:43 PM · DC-Ops, SRE, ops-eqiad
fgiunchedi added a comment to T343529: Prometheus doesn't reload or alert on expired client certificates.

I have spent some time investigating this issue and I believe this is a case of https://github.com/prometheus/common/issues/598 . Specifically prometheus does reload certs from disk, however they are not used for existing connections, only new ones! If existing connections are idle for > 5 minutes then they are recycled, if that doesn't happen then existing (possibly expired) certificates are used.
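
The behaviour described above can be sketched with a toy Python model. It is a simplified illustration under stated assumptions, not Prometheus code: the `Client` and `Connection` classes are hypothetical, time is passed in explicitly, and the 300-second idle limit stands in for the "> 5 minutes" mentioned in the comment.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_SECONDS = 300  # stands in for the "> 5 minutes" in the comment


@dataclass
class Connection:
    cert: str         # certificate captured when the connection was opened
    last_used: float  # time of the most recent request on this connection


class Client:
    """Toy client: reloads certs from disk but reuses open connections."""

    def __init__(self, cert):
        self.cert_on_disk = cert
        self.conn = None

    def reload_cert(self, cert):
        # Only affects connections opened after this point.
        self.cert_on_disk = cert

    def request(self, now):
        """Return the cert presented for a request made at time `now`."""
        conn = self.conn
        if conn is None or now - conn.last_used > IDLE_TIMEOUT_SECONDS:
            # No connection yet, or the idle one was recycled: open a new one.
            conn = Connection(cert=self.cert_on_disk, last_used=now)
            self.conn = conn
        conn.last_used = now
        return conn.cert


if __name__ == "__main__":
    client = Client("old-cert")
    client.request(0)
    client.reload_cert("new-cert")
    print(client.request(60))   # connection is busy, still the old cert
    print(client.request(400))  # idle for more than 300s, new cert is used
```

In this model a connection that is used at least once every five minutes never picks up the reloaded certificate, which matches the failure mode described in the comment.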

Apr 30 2024, 10:01 AM · SRE Observability (FY2023/2024-Q4), Prod-Kubernetes, Observability-Metrics, User-fgiunchedi, Kubernetes, serviceops-radar