Deployment Runbook — New Host
When to Use This Runbook
Follow this runbook when adding a new Ubuntu host to the SSH baseline. The procedure assumes:
- The host runs Ubuntu 22.04 or 24.04 LTS (the role's supported versions)
- The host has a real hostname (not `ubuntu` or `localhost`)
- The host can reach AD DCs on TCP 88 (Kerberos) and 389 (LDAP)
- The host can reach `https://pki.pbr.org.au/ca` (SCEPman root CA)
- The host has NTP synchronisation working (`timedatectl show` reports `NTPSynchronized=yes`)
Preflight will validate all of these before any changes are made.
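Preflight automates all of these checks, but if you want to spot-check the network prerequisites by hand first, a minimal sketch (the DC hostnames are placeholders — substitute your actual domain controllers):

```shell
#!/bin/bash
# Manual spot-check of preflight's network prerequisites.
check_tcp() {  # usage: check_tcp <host> <port>
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "OK:   $1:$2 reachable"
  else
    echo "FAIL: $1:$2 unreachable"
  fi
}
for dc in dc1.pbr.org.au dc2.pbr.org.au; do   # placeholder DC names
  check_tcp "$dc" 88    # Kerberos
  check_tcp "$dc" 389   # LDAP
done
# SCEPman root CA endpoint (should return a certificate, not an error page)
curl -fsS --max-time 5 https://pki.pbr.org.au/ca >/dev/null \
  && echo "OK:   SCEPman /ca" || echo "FAIL: SCEPman /ca"
```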
Step 1: Bootstrap the ansible automation account
On the target host, as root (e.g. via console, ScreenConnect, or your initial admin SSH session):
# Copy the bootstrap script to the host. Easiest: paste via SSH session or
# fetch from the repo.
curl -fsSL https://raw.githubusercontent.com/Puffing-Billy-Railway/pbr-infra/main/scripts/bootstrap-ansible-user.sh \
-o /tmp/bootstrap-ansible-user.sh
# Inspect it before running
less /tmp/bootstrap-ansible-user.sh
# Run as root
sudo bash /tmp/bootstrap-ansible-user.sh
The script is idempotent. It creates the local `ansible` account, adds it to the `sudo` group, locks the password (key auth only), installs the control node's public key at `~ansible/.ssh/authorized_keys`, and writes `/etc/sudoers.d/ansible` with `NOPASSWD`.
Full source:
#!/bin/bash
# Run as root on a fresh host before adding to ssh-baseline inventory.
# Creates the local ansible automation user with sudo group membership,
# key-only auth, and NOPASSWD sudoers. Idempotent.
set -e
PUBKEY="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBMc7IDlr/IZ5M/2HcXU7cGCKZ03SLjpr5cbmiHnokdP ansible-svc@pbr-ansible-kl1"
useradd -m -s /bin/bash -c "Ansible automation" ansible 2>/dev/null || true
usermod -aG sudo ansible
passwd -l ansible
install -d -m 0700 -o ansible -g ansible /home/ansible/.ssh
grep -qxF "$PUBKEY" /home/ansible/.ssh/authorized_keys 2>/dev/null \
|| echo "$PUBKEY" >> /home/ansible/.ssh/authorized_keys
chmod 0600 /home/ansible/.ssh/authorized_keys
chown ansible:ansible /home/ansible/.ssh/authorized_keys
echo "ansible ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/ansible
chmod 0440 /etc/sudoers.d/ansible
visudo -c -f /etc/sudoers.d/ansible
id ansible
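The `grep -qxF … || echo` line is what makes the key install idempotent. A standalone demonstration of the pattern, using a throwaway file and a placeholder key:

```shell
#!/bin/bash
# Append-if-absent pattern from the bootstrap script: running it any number
# of times leaves exactly one copy of the key in the file.
set -e
KEYFILE=$(mktemp)
PUBKEY="ssh-ed25519 AAAAexample placeholder@demo"   # placeholder key
for i in 1 2 3; do
  grep -qxF "$PUBKEY" "$KEYFILE" 2>/dev/null || echo "$PUBKEY" >> "$KEYFILE"
done
grep -cxF "$PUBKEY" "$KEYFILE"   # prints 1, not 3
rm -f "$KEYFILE"
```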
Verify bootstrap success from the control node:
ansible -i 'NEW_HOST_IP,' all -m ping \
-u ansible -e ansible_user=ansible \
--private-key ~/.ssh/ansible_svc
Expected: `NEW_HOST_IP | SUCCESS => {"ping": "pong"}`. If this fails, fix bootstrap first — do not proceed.
Step 2: Create local pbr_admin break-glass account
On the target host, as root:
useradd -m -s /bin/bash -c "PBR break-glass admin" pbr_admin
# Set the password from 1Password (PBR > Linux > pbr_admin)
passwd pbr_admin
usermod -aG sudo pbr_admin
id pbr_admin
This account must exist before the baseline role runs; preflight verifies it.
Step 3: Pre-clean AD (PowerShell, on a domain-joined Windows host with AD module)
If the host has ever been joined to AD — even an aborted attempt — the AD computer object must be deleted before re-joining. Always check, even for fresh hosts (the name may collide with a decommissioned host).
# Check whether the computer object exists (use -Filter: a missing object
# then returns nothing, whereas -Identity throws even with SilentlyContinue)
Get-ADComputer -Filter "Name -eq 'NEW-HOSTNAME'"
# If it exists and you're sure it's safe to delete
Get-ADComputer -Filter "Name -eq 'NEW-HOSTNAME'" | Remove-ADComputer -Confirm:$false
# Confirm gone
Get-ADComputer -Filter "Name -eq 'NEW-HOSTNAME'"
Note: Even with proper pre-clean, the first realm join attempt may fail due to AD multi-master replication lag. See Step 6 for the expected retry behaviour.
Step 4: Add host to inventory
On pbr-ansible-kl1, edit `~/pbr-infra/inventory/hosts.yml`. The host must be added in two places:
- Under `all.children.linux.hosts` (with `ansible_host: <IP>`)
- Under `all.children.targets.hosts` (no `ansible_host` — inherited)
---
all:
children:
linux:
hosts:
# ... existing hosts ...
pbr-NEWHOST-kl1:
ansible_host: 10.1.X.Y # <-- add here
targets:
hosts:
# ... existing hosts ...
pbr-NEWHOST-kl1: # <-- and here
Why two places: The `linux` group lists known hosts (used for ad-hoc commands, monitoring, fact-gathering). The `targets` group is the deployment scope — playbooks use `hosts: targets` to ensure the control node and any informational-only hosts cannot be hit accidentally.
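Before committing, it's worth confirming the name actually landed in both groups. `ansible-inventory --graph` is the authoritative check on the control node; a cruder occurrence-count check is sketched below against a scratch inventory mirroring the layout above (in practice, point the grep at `~/pbr-infra/inventory/hosts.yml`):

```shell
#!/bin/bash
# Sanity check: the new host name should appear exactly twice in hosts.yml
# (once under linux with ansible_host, once under targets).
INV=$(mktemp)
cat > "$INV" <<'EOF'
all:
  children:
    linux:
      hosts:
        pbr-NEWHOST-kl1:
          ansible_host: 10.1.0.10
    targets:
      hosts:
        pbr-NEWHOST-kl1:
EOF
count=$(grep -c 'pbr-NEWHOST-kl1' "$INV")
if [ "$count" -eq 2 ]; then
  echo "OK: host present in both groups"
else
  echo "FAIL: found $count occurrence(s)"
fi
rm -f "$INV"
```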
Commit and push the inventory change:
cd ~/pbr-infra
git add inventory/hosts.yml
git commit -m "inventory: add pbr-NEWHOST-kl1"
git push origin main
Step 5: Run preflight (no-changes verification)
cd ~/pbr-infra
ansible-playbook playbooks/preflight.yml -l pbr-NEWHOST-kl1 \
--vault-password-file ~/.ansible_vault_pass
Preflight is read-only — it makes zero changes to the host. It validates:
- OS is Ubuntu 22.04 or 24.04
- Hostname is set to a real value and resolves
- System clock is NTP-synchronised
- Required local users (`ansible`, `pbr_admin`) exist
- APT Universe component is enabled (for `oddjob`, `oddjob-mkhomedir`)
- `visudo -c` passes (ignoring the known ThreatLocker drop-in permission issue)
- AD DCs are reachable on TCP 88 and 389
- No existing realm membership conflicts
- SCEPman `/ca` endpoint returns a valid CA cert
- AD schema has the `sshPublicKey` attribute
- Vault password file exists with correct permissions
- Required collections are installed on the control node
If preflight fails, fix the cause and re-run. Do not proceed to the baseline step until preflight is clean.
Step 6: Run the baseline role
cd ~/pbr-infra
ansible-playbook playbooks/ssh-baseline.yml -l pbr-NEWHOST-kl1 \
--vault-password-file ~/.ansible_vault_pass
The playbook runs preflight again (defence in depth) then applies the role. Expected duration: ~3-5 minutes per host on a typical KVM VM.
Expected behaviour: realm join may fail on first attempt
Despite a clean AD pre-clean, the first realm join attempt sometimes fails. This is a known pattern caused by AD multi-master replication lag — the join hits a DC that hasn't yet seen the deletion of the pre-cleaned computer object. The output looks like this (with no_log: true hiding the actual error):
TASK [ssh-baseline : Join Active Directory domain] *****************************
fatal: [pbr-NEWHOST-kl1]: FAILED! => changed=true
censored: 'the output has been hidden due to the fact that no_log: true was specified for this result'
Fix: Just re-run the playbook. The role is idempotent and the second attempt almost always succeeds:
ansible-playbook playbooks/ssh-baseline.yml -l pbr-NEWHOST-kl1 \
--vault-password-file ~/.ansible_vault_pass
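Because a single retry is the expected remediation, you can wrap the run in a small retry helper instead of re-typing the command. A sketch (the real `ansible-playbook` invocation is shown in a comment; the demo below uses a deliberately flaky stand-in command):

```shell
#!/bin/bash
# Retry-once helper: run a command; if it fails, wait and try again.
retry_once() {
  local delay=$1; shift
  "$@" && return 0
  echo "first attempt failed; retrying in ${delay}s..."
  sleep "$delay"
  "$@"
}

# Real usage would be (paths as in the runbook, with a pause for AD replication):
#   retry_once 60 ansible-playbook playbooks/ssh-baseline.yml -l pbr-NEWHOST-kl1 \
#       --vault-password-file ~/.ansible_vault_pass

# Demo with a stand-in that fails on the first call only:
marker=$(mktemp)
flaky() { [ -s "$marker" ] && return 0; echo x > "$marker"; return 1; }
retry_once 1 flaky && echo "succeeded after retry"
rm -f "$marker"
```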
If the second attempt also fails, dig deeper (see Troubleshooting in the Known Limitations page). The most common diagnostic is to read the host's journalctl for adcli/realmd/Kerberos errors:
ansible pbr-NEWHOST-kl1 -m shell -a '
journalctl --since "10 minutes ago" --no-pager 2>&1 \
| grep -iE "realm|adcli|krb5|sssd|kerberos" | tail -40
timedatectl status | head -8
' --become --vault-password-file ~/.ansible_vault_pass
Step 7: Run post-deployment verification
cd ~/pbr-infra
ansible-playbook playbooks/verify.yml -l pbr-NEWHOST-kl1 \
-e verify_test_user=a.mfraser \
--vault-password-file ~/.ansible_vault_pass
Replace a.mfraser with any AD username that is a member of SG_ServerAccess or SG_Sudo and has an sshPublicKey populated.
Verify checks:
- Realm membership reports the correct domain
- The test AD user resolves via SSSD (`getent passwd`)
- The test user's SSH public key is retrievable via `sss_ssh_authorizedkeys`
- `sshd -t` passes (full config validates)
- Services `ssh`, `sssd`, `fail2ban` are running
- `auditd` is running on managed hosts (skipped on LXC)
- fail2ban sshd jail is active
- `pam_duo.so` is referenced in `/etc/pam.d/sudo`
- The sudo timestamp_timeout drop-in exists
- The ansible NOPASSWD sudo path still works (proves PAM stack didn't break automation)
- `pbr_admin` is not in `sg_sudo` (would force Duo on break-glass account)
The verification summary at the end looks like:
TASK [Verification summary] ****************************************************
ok: [pbr-NEWHOST-kl1] =>
msg:
- '==================== VERIFICATION PASSED ===================='
- 'Joined to realm: pbr.org.au'
- 'AD user resolves: a.mfraser (1234:5678)'
- 'SSH key retrieved: ssh-ed25519 AAAAC3...'
- 'sshd config valid: yes'
- 'All services running: ssh, sssd, fail2ban, auditd'
- ''
- 'Next: SSH from your workstation as a.mfraser@pbr-NEWHOST-kl1'
- 'Expect: key auth + Duo push for SSH; Duo push + AD password for sudo'
Step 8: Manual SSH validation from your workstation
This step proves the end-user experience actually works. From your workstation:
Test 1: AD user via SSH
ssh a.mfraser@pbr-NEWHOST-kl1.pbr.org.au
Expected: SSH key auth completes (no password prompt), then a Duo push to your phone. Approve the push, you land in a shell as your AD user.
Test 2: sudo as AD user
sudo whoami
Expected: Duo push prompt (auto-pushed), then AD password prompt, then root. Within the 30-minute timestamp window, subsequent sudo commands skip both prompts.
Test 3: pbr_admin break-glass
ssh pbr_admin@pbr-NEWHOST-kl1.pbr.org.au
Expected: Password-only prompt (no key, no Duo) — local password from 1Password.
sudo whoami
Expected: Local password prompt only (no Duo). Returns root.
Test 4: Ansible NOPASSWD path still works
From the control node (already validated by verify.yml but worth a manual check):
ansible pbr-NEWHOST-kl1 -m shell -a 'sudo -n true' --become
Expected: Success. Confirms PAM stack hasn't broken automation.
Step 9: Clean up tee'd log files (if any)
If you piped playbook output to a log file during deployment:
# Check whether any log contains the AD service account password
grep -l "MDT_JD\|--login-user" /tmp/*.log 2>/dev/null
# Shred any logs created during this deployment
shred -u /tmp/NEWHOST-*.log 2>/dev/null
Even with no_log: true restored, transient diagnostic logs from troubleshooting may contain sensitive material. Always scrub.
Royal TS Connection Notes
Royal TS 7's Rebex SSH library has a constraint: it does not support OpenSSH's AuthenticationMethods publickey,keyboard-interactive directive natively. Without configuration, Royal TS will fail to connect to baselined hosts.
Workaround: set Authentication Method to "Any"
- Open the host's Royal TS connection properties
- Navigate to Advanced > Security
- Set Authentication method to `Any`
- Save and reconnect
This lets Rebex negotiate either method per the server's policy, and the server's AuthenticationMethods directive will require both.
Auto-push approval
Royal TS's keyboard-interactive UI does not support pre-filling the Duo response. You will press Enter once at the Duo prompt to confirm the push. This is acceptable for a single round-trip MFA.
Alternative: External Application launching Windows OpenSSH
If Rebex limitations bite, configure Royal TS to launch Windows' native `ssh.exe` as an External Application connection instead. Windows OpenSSH handles `AuthenticationMethods publickey,keyboard-interactive` correctly and integrates with the 1Password SSH agent via the OpenSSH named pipe (`\\.\pipe\openssh-ssh-agent`).
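If you go this route, a matching entry in the Windows user's `~/.ssh/config` might look like the fragment below (the host alias and username are examples; recent Win32-OpenSSH builds accept a named pipe path for `IdentityAgent`, which is how 1Password exposes its agent):

```
Host pbr-NEWHOST-kl1
    HostName pbr-NEWHOST-kl1.pbr.org.au
    User a.mfraser
    IdentityAgent \\.\pipe\openssh-ssh-agent
```

With this in place, the Royal TS External Application only needs to pass the alias (`ssh.exe pbr-NEWHOST-kl1`).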
Where to Read Next
- Known Limitations, Troubleshooting & Version History — detailed troubleshooting if deployment fails
- Configuration Reference — per-host overrides via
host_vars/if a host needs non-default settings - Playbook Reference — details on preflight, verify, and teardown