Deployment Runbook — New Host

When to Use This Runbook

Follow this runbook when adding a new Ubuntu host to the SSH baseline. Preflight (Step 5) validates the procedure's prerequisites before any changes are made.


Step 1: Bootstrap the ansible automation account

On the target host, as root (e.g. via console, ScreenConnect, or your initial admin SSH session):

# Copy the bootstrap script to the host. Easiest: paste via SSH session or
# fetch from the repo.
curl -fsSL https://raw.githubusercontent.com/Puffing-Billy-Railway/pbr-infra/main/scripts/bootstrap-ansible-user.sh \
    -o /tmp/bootstrap-ansible-user.sh

# Inspect it before running
less /tmp/bootstrap-ansible-user.sh

# Run as root
sudo bash /tmp/bootstrap-ansible-user.sh

The script is idempotent. It creates the local ansible account, adds it to the sudo group, locks the password (key auth only), installs the control node's public key at ~ansible/.ssh/authorized_keys, and writes /etc/sudoers.d/ansible with NOPASSWD.

Full source:

#!/bin/bash
# Run as root on a fresh host before adding to ssh-baseline inventory.
# Creates the local ansible automation user with sudo group membership,
# key-only auth, and NOPASSWD sudoers. Idempotent.
set -e

PUBKEY="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBMc7IDlr/IZ5M/2HcXU7cGCKZ03SLjpr5cbmiHnokdP ansible-svc@pbr-ansible-kl1"

useradd -m -s /bin/bash -c "Ansible automation" ansible 2>/dev/null || true
usermod -aG sudo ansible
passwd -l ansible

install -d -m 0700 -o ansible -g ansible /home/ansible/.ssh
grep -qxF "$PUBKEY" /home/ansible/.ssh/authorized_keys 2>/dev/null \
    || echo "$PUBKEY" >> /home/ansible/.ssh/authorized_keys
chmod 0600 /home/ansible/.ssh/authorized_keys
chown ansible:ansible /home/ansible/.ssh/authorized_keys

echo "ansible ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/ansible
chmod 0440 /etc/sudoers.d/ansible
visudo -c -f /etc/sudoers.d/ansible

id ansible
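
The `grep -qxF … || echo … >>` line is what makes the key install idempotent. A quick local sketch of the pattern (the temp file and key string are made up for the demo):

```shell
# Demo of the idempotent-append pattern used for authorized_keys above.
# grep -qxF matches the exact whole line, so the echo runs at most once
# no matter how many times the loop (or the bootstrap script) re-runs.
f=$(mktemp)
line="ssh-ed25519 EXAMPLEKEY demo"   # made-up key, demo only
for _ in 1 2 3; do
    grep -qxF "$line" "$f" 2>/dev/null || echo "$line" >> "$f"
done
count=$(wc -l < "$f")
echo "appended $count time(s)"    # appended 1 time(s)
rm -f "$f"
```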

Verify bootstrap success from the control node:

ansible -i 'NEW_HOST_IP,' all -m ping \
    -u ansible --private-key ~/.ssh/ansible_svc

Expected: NEW_HOST_IP | SUCCESS => {"ping": "pong"}. If this fails, fix bootstrap first — do not proceed.


Step 2: Create local pbr_admin break-glass account

On the target host, as root:

useradd -m -s /bin/bash -c "PBR break-glass admin" pbr_admin
usermod -aG sudo pbr_admin
# Set the password from 1Password (PBR > Linux > pbr_admin)
passwd pbr_admin
id pbr_admin

This account must exist before the baseline role runs; preflight verifies it.


Step 3: Pre-clean AD (PowerShell, on a domain-joined Windows host with AD module)

If the host has ever been joined to AD — even an aborted attempt — the AD computer object must be deleted before re-joining. Always check, even for fresh hosts (the name may collide with a decommissioned host).

# Check whether the computer object exists. Use -Filter rather than a bare
# -Identity lookup: -Filter simply returns nothing when the object is absent,
# whereas an -Identity miss throws even with -ErrorAction SilentlyContinue.
Get-ADComputer -Filter "Name -eq 'NEW-HOSTNAME'"

# If it exists and you're sure it's safe to delete
Get-ADComputer -Filter "Name -eq 'NEW-HOSTNAME'" | Remove-ADComputer -Confirm:$false

# Confirm gone (no output expected)
Get-ADComputer -Filter "Name -eq 'NEW-HOSTNAME'"

Note: Even with proper pre-clean, the first realm join attempt may fail due to AD multi-master replication lag. See Step 6 for the expected retry behaviour.


Step 4: Add host to inventory

On pbr-ansible-kl1, edit ~/pbr-infra/inventory/hosts.yml. The host must be added in two places:

  1. Under all.children.linux.hosts (with ansible_host: <IP>)
  2. Under all.children.targets.hosts (no ansible_host — inherited)
---
all:
  children:
    linux:
      hosts:
        # ... existing hosts ...
        pbr-NEWHOST-kl1:
          ansible_host: 10.1.X.Y          # <-- add here

    targets:
      hosts:
        # ... existing hosts ...
        pbr-NEWHOST-kl1:                  # <-- and here

Why two places: The linux group lists known hosts (used for ad-hoc commands, monitoring, fact-gathering). The targets group is the deployment scope — playbooks use hosts: targets to ensure the control node and any informational-only hosts cannot be hit accidentally.
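
A quick pre-commit sanity check: the new hostname should appear exactly twice in hosts.yml, one key per group. The sketch below runs against a toy inventory; on the control node, point HOSTS_YML at ~/pbr-infra/inventory/hosts.yml instead.

```shell
# Toy inventory standing in for ~/pbr-infra/inventory/hosts.yml (demo only).
HOSTS_YML=$(mktemp)
cat > "$HOSTS_YML" <<'EOF'
all:
  children:
    linux:
      hosts:
        pbr-NEWHOST-kl1:
          ansible_host: 10.1.0.50
    targets:
      hosts:
        pbr-NEWHOST-kl1:
EOF
# Count hostname keys: 2 means the host is in both linux and targets.
hits=$(grep -c '^ *pbr-NEWHOST-kl1:' "$HOSTS_YML")
echo "entries: $hits"    # entries: 2
rm -f "$HOSTS_YML"
```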

Commit and push the inventory change:

cd ~/pbr-infra
git add inventory/hosts.yml
git commit -m "inventory: add pbr-NEWHOST-kl1"
git push origin main

Step 5: Run preflight (no-changes verification)

cd ~/pbr-infra
ansible-playbook playbooks/preflight.yml -l pbr-NEWHOST-kl1 \
    --vault-password-file ~/.ansible_vault_pass

Preflight is read-only — it makes zero changes to the host while validating the prerequisites.

If preflight fails, fix the cause and re-run. Do not proceed to the baseline step until preflight is clean.


Step 6: Run the baseline role

cd ~/pbr-infra
ansible-playbook playbooks/ssh-baseline.yml -l pbr-NEWHOST-kl1 \
    --vault-password-file ~/.ansible_vault_pass

The playbook runs preflight again (defence in depth) then applies the role. Expected duration: ~3-5 minutes per host on a typical KVM VM.

Expected behaviour: realm join may fail on first attempt

Despite a clean AD pre-clean, the first realm join attempt sometimes fails. This is a known pattern caused by AD multi-master replication lag — the join hits a DC that hasn't yet seen the deletion of the pre-cleaned computer object. The output looks like this (with no_log: true hiding the actual error):

TASK [ssh-baseline : Join Active Directory domain] *****************************
fatal: [pbr-NEWHOST-kl1]: FAILED! => changed=true
  censored: 'the output has been hidden due to the fact that no_log: true was specified for this result'

Fix: Just re-run the playbook. The role is idempotent and the second attempt almost always succeeds:

ansible-playbook playbooks/ssh-baseline.yml -l pbr-NEWHOST-kl1 \
    --vault-password-file ~/.ansible_vault_pass

If the second attempt also fails, dig deeper (see Troubleshooting in the Known Limitations page). The most common diagnostic is to read the host's journalctl for adcli/realmd/Kerberos errors:

ansible pbr-NEWHOST-kl1 -m shell -a '
    journalctl --since "10 minutes ago" --no-pager 2>&1 \
        | grep -iE "realm|adcli|krb5|sssd|kerberos" | tail -40
    timedatectl status | head -8
' --become --vault-password-file ~/.ansible_vault_pass
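
The grep filter itself is easy to sanity-check locally against canned journal lines (the sample messages below are invented):

```shell
# Canned log lines (invented) to exercise the service-name filter above.
matched=$(printf '%s\n' \
  'realmd[310]: joining domain pbr.org.au' \
  'systemd[1]: Started unrelated service.' \
  'sssd_be[412]: krb5 credentials renewed' \
  'kernel: audit: type=1101' \
  | grep -icE "realm|adcli|krb5|sssd|kerberos")
echo "matched $matched lines"    # matched 2 lines
```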

Step 7: Run post-deployment verification

cd ~/pbr-infra
ansible-playbook playbooks/verify.yml -l pbr-NEWHOST-kl1 \
    -e verify_test_user=a.mfraser \
    --vault-password-file ~/.ansible_vault_pass

Replace a.mfraser with any AD username that is a member of SG_ServerAccess or SG_Sudo and has an sshPublicKey populated.

Verify checks realm membership, AD user and SSH key resolution, sshd config validity, and service state (ssh, sssd, fail2ban, auditd).

The verification summary at the end looks like:

TASK [Verification summary] ****************************************************
ok: [pbr-NEWHOST-kl1] =&gt;
  msg:
  - '==================== VERIFICATION PASSED ===================='
  - 'Joined to realm:        pbr.org.au'
  - 'AD user resolves:       a.mfraser (1234:5678)'
  - 'SSH key retrieved:      ssh-ed25519 AAAAC3...'
  - 'sshd config valid:      yes'
  - 'All services running:   ssh, sssd, fail2ban, auditd'
  - ''
  - 'Next: SSH from your workstation as a.mfraser@pbr-NEWHOST-kl1'
  - 'Expect: key auth + Duo push for SSH; Duo push + AD password for sudo'

Step 8: Manual SSH validation from your workstation

This step proves the end-user experience actually works. From your workstation:

Test 1: AD user via SSH

ssh a.mfraser@pbr-NEWHOST-kl1.pbr.org.au

Expected: SSH key auth completes (no password prompt), then a Duo push to your phone. Approve the push, you land in a shell as your AD user.

Test 2: sudo as AD user

sudo whoami

Expected: Duo push prompt (auto-pushed), then AD password prompt, then root. Within the 30-minute timestamp window, subsequent sudo commands skip both prompts.

Test 3: pbr_admin break-glass

ssh pbr_admin@pbr-NEWHOST-kl1.pbr.org.au

Expected: Password-only prompt (no key, no Duo) — local password from 1Password.

sudo whoami

Expected: Local password prompt only (no Duo). Returns root.

Test 4: Ansible NOPASSWD path still works

From the control node (already validated by verify.yml but worth a manual check):

ansible pbr-NEWHOST-kl1 -m shell -a 'sudo -n true' --become

Expected: Success. Confirms PAM stack hasn't broken automation.


Step 9: Clean up tee'd log files (if any)

If you piped playbook output to a log file during deployment:

# Check whether any log contains the AD service account password
grep -l "MDT_JD\|--login-user" /tmp/*.log 2&gt;/dev/null

# Shred any logs created during this deployment
shred -u /tmp/NEWHOST-*.log 2&gt;/dev/null

Even with no_log: true restored, transient diagnostic logs from troubleshooting may contain sensitive material. Always scrub.
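
Unlike plain rm, shred overwrites the file contents before unlinking, so the sensitive material is not left recoverable on disk. A demo on a throwaway temp file:

```shell
# Demo: shred -u overwrites the file's contents, then removes it.
log=$(mktemp /tmp/NEWHOST-demo-XXXXXX)
echo "pretend sensitive playbook output" > "$log"
shred -u "$log"
[ -e "$log" ] && state=present || state=gone
echo "log is $state"    # log is gone
```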


Royal TS Connection Notes

Royal TS 7's Rebex SSH library has a constraint: it does not support OpenSSH's AuthenticationMethods publickey,keyboard-interactive directive natively. Without configuration, Royal TS will fail to connect to baselined hosts.
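
For reference, the server-side directive in question looks like this in sshd_config (a fragment for illustration only; the real file is managed by the ssh-baseline role, not hand-edited):

```
# /etc/ssh/sshd_config (excerpt)
# Require BOTH a valid key and a keyboard-interactive (Duo) exchange:
AuthenticationMethods publickey,keyboard-interactive
```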

Workaround: set Authentication Method to "Any"

  1. Open the host's Royal TS connection properties
  2. Navigate to Advanced > Security
  3. Set Authentication method to Any
  4. Save and reconnect

This lets Rebex accept whichever method the server offers at each step; the server's AuthenticationMethods directive still enforces both, so the security posture is unchanged.

Auto-push approval

Royal TS's keyboard-interactive UI cannot pre-fill the Duo response, so you will press Enter once at the Duo prompt to confirm the push, an acceptable trade-off for a single round-trip of MFA.

Alternative: External Application launching Windows OpenSSH

If Rebex limitations bite, configure Royal TS to launch Windows' native ssh.exe as an External Application connection instead. The native OpenSSH client handles AuthenticationMethods publickey,keyboard-interactive correctly and integrates with the 1Password SSH agent via the OpenSSH named pipe (\\.\pipe\openssh-ssh-agent).



Revision #1
Created 2026-05-13 05:24:26 UTC by PBR_AI
Updated 2026-05-13 05:24:26 UTC by PBR_AI