(Go: >> BACK << -|- >> HOME <<)

Page MenuHomePhabricator

scap should have a better error message when it can't find keyholder key
Open, In Progress, LowPublic

Description

Currently, it falls back to trying every key available in keyholder, which will error unless it succeeds before hitting MaxAuthTries (default of 6).

From discussion in #wikimedia-releng, while debugging T313259:

<hasharAway> so maybe on the deployment server manually amend /srv/deployment/phabricator/deployment/scap/scap.cfg
<hasharAway> and add
<hasharAway> `keyholder_key: phabricator`
 * brennen tries that
<hasharAway> $ ls -la /etc/keyholder.d/phabricator
<hasharAway> -r--r----- 1 root keyholder 1766 Nov 30  2020 /etc/keyholder.d/phabricator
<hasharAway> I don't know why it would have broken
<hasharAway> maybe due to some codechange done recently in scap
<brennen> seems like it might be working
<hasharAway> I might be the one to blame
<hasharAway> cause I know close to nothing about scap code and if I know about that get_keyholder_key method it must be that I have altered it recently
<brennen> that did it - thanks hasharAway!
<hasharAway> well
<hasharAway> great :]
<hasharAway> Using key: /etc/keyholder.d/phabricator
<hasharAway> from the scap log
<hasharAway> while previously we had:
<hasharAway> `Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'phabricator/deployment', '-g', 'default', 'fetch', '--refresh-config']`
<hasharAway> `Unable to find keyholder key for phab_deploy`
<hasharAway> `['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'phabricator/deployment', '-g', 'default', 'fetch', '--refresh-config'] (ran as phab-deploy@phab2001.codfw.wmnet) `
<hasharAway> if we can't find a keyholder key using the ssh_user  or the keyholder_key config value if it is set
<hasharAway> then I think scap should abort entirely
<hasharAway> else it tries to do every single keys from the keyholder ( see above logstash link)
<brennen> right, and then just fails on too many auth attempts
<hasharAway> and fails unless you deploy with one of the first 6 keys
<brennen> my guess is that at one time the fallback might have worked because there weren't very many keys to try

Details

TitleReferenceAuthorSource BranchDest Branch
deploy.py: Handle missing keyholder key with clear error loggingrepos/releng/scap!328sandeepsT313624-improve-keyholder-error-handlingmaster
Customize query in GitLab

Event Timeline

brennen moved this task from Backlog to Radar on the User-brennen board.
thcipriani renamed this task from scap should fail if it can't find a keyholder key using ssh_user or keyholder_key values to scap should have a better error message when it can't find keyholder key.May 24 2023, 5:00 PM
Sandeeps changed the task status from Open to In Progress.Apr 19 2024, 6:02 PM

To verify scap behavior and error messaging when it cannot find the specified keyholder key, and to confirm if scap retries all available keys in the keyholder.d, I tested the scenarios regarding the scap keyholder. Here are the results:

Tested with no ssh key in the keyfolder.d and the output is:

{"name": "deploy", "msg": "Unable to find keyholder key for %s", "args": ["scap_deploy"], "levelno": 10, "filename": "cli.py", "exc_text": null, "lineno": 152, "funcName": "get_keyholder_key", "created": 1715803062.4578805, "msecs": 457.88049697875977, "relativeCreated": 1282.752513885498, "host": "deploy"}

Description: scap fails to find the keyholder key and attempts SSH without a key.

Tested with different key in keyholder.d

{"name": "deploy", "msg": "Unable to find keyholder key for %s", "args": ["scap_deploy"], "levelno": 10, "filename": "cli.py", "exc_text": null, "lineno": 152, "funcName": "get_keyholder_key", "created": 1715809119.1792183, "msecs": 179.21829223632812, "relativeCreated": 1296.0443496704102, "host": "deploy"}

Description: Even with a different key, scap searches for the specified key in scap.cfg, fails, and attempts SSH without a key.

Deleted keyholder_key Parameter from /etc/scap.cfg:

  • Output: Same as above.
  • Description: scap looks for the key specified in the project’s scap.cfg (e.g., phabricator), fails to find it, and attempts SSH without a key.

Removed keyholder_key Parameter from project/scap.cfg:

  • Output: Same as above.
  • Description: scap defaults to key_safe_name based on ssh_user, fails to find it, and attempts SSH without a key.

@brennen In all scenarios, scap did not retry all available keys, It Looks like it performs as expected. Please confirm if this is the correct behavior. If so, can we close the story?

Mentioned in SAL (#wikimedia-releng) [2024-05-16T17:52:12Z] <brennen> running a no-op test deployment to phab2002 to reproduce errors for T313624

Mentioned in SAL (#wikimedia-operations) [2024-05-16T17:52:47Z] <brennen@deploy1002> Started deploy [phabricator/deployment@7d858df]: test scap deployment with keyholder key misconfigured for T313624

Mentioned in SAL (#wikimedia-operations) [2024-05-16T17:53:25Z] <brennen@deploy1002> Finished deploy [phabricator/deployment@7d858df]: test scap deployment with keyholder key misconfigured for T313624 (duration: 00m 38s)

A full log of the discussion that led to this task is here:

https://wm-bot.wmcloud.org/browser/index.php?start=07%2F22%2F2022&end=07%2F22%2F2022&display=%23wikimedia-releng

See also: 8a7d4bf15d900d47da6dea78d59171f83ab33a3f

The original problem was that if scap doesn't find a key, you wind up with Too many authentication failures, which is confusing to the deployer.

This still happens - I just reproduced on deploy1002:

Received disconnect from [host] port 22:2: Too many authentication failures                                                                                            
Disconnected from [host] port 22

See P62533 for a detailed log.

@hashar's idea was that scap should abort on not finding a key:

<hasharAway> if we can't find a keyholder key using the ssh_user  or the keyholder_key config value if it is set
<hasharAway> then I think scap should abort entirely

I still think that seems reasonable.

One additional question: Are there any deployment repos currently where scap does not find a key, but the deployment succeeds because a key in keyholder works before MaxAuthTries is reached? If so, those might need fixed.

@brennen In my case on Scap3-dev env if Scap doesn't find a key it tries to do ssh without a key, and yes it succeeding the deployment. Do you think we need a code change for this behavior?