# Deployment Runbook — New Host

## When to Use This Runbook

Follow this runbook when adding a new Ubuntu host to the SSH baseline. The procedure assumes:

- The host runs Ubuntu 22.04 or 24.04 LTS (the role's supported versions)
- The host has a real hostname (not `ubuntu` or `localhost`)
- The host can reach AD DCs on TCP 88 (Kerberos) and 389 (LDAP)
- The host can reach `https://pki.pbr.org.au/ca` (SCEPman root CA)
- The host has NTP synchronisation working (`timedatectl status` shows `NTPSynchronized=yes`)

Preflight will validate all of these before any changes are made.

## Step 1: Bootstrap the ansible automation account

On the **target host**, as root (e.g. via console, ScreenConnect, or your initial admin SSH session):

```bash
# Copy the bootstrap script to the host. Easiest: paste via SSH session or
# fetch from the repo.
curl -fsSL https://raw.githubusercontent.com/Puffing-Billy-Railway/pbr-infra/main/scripts/bootstrap-ansible-user.sh \
  -o /tmp/bootstrap-ansible-user.sh

# Inspect it before running
less /tmp/bootstrap-ansible-user.sh

# Run as root
sudo bash /tmp/bootstrap-ansible-user.sh
```

The script is idempotent. It creates the local `ansible` account, adds it to the `sudo` group, locks the password (key auth only), installs the control node's public key at `~ansible/.ssh/authorized_keys`, and writes `/etc/sudoers.d/ansible` with NOPASSWD. Full source:

```bash
#!/bin/bash
# Run as root on a fresh host before adding to ssh-baseline inventory.
# Creates the local ansible automation user with sudo group membership,
# key-only auth, and NOPASSWD sudoers. Idempotent.
set -e

PUBKEY="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBMc7IDlr/IZ5M/2HcXU7cGCKZ03SLjpr5cbmiHnokdP ansible-svc@pbr-ansible-kl1"

useradd -m -s /bin/bash -c "Ansible automation" ansible 2>/dev/null || true
usermod -aG sudo ansible
passwd -l ansible

install -d -m 0700 -o ansible -g ansible /home/ansible/.ssh
grep -qxF "$PUBKEY" /home/ansible/.ssh/authorized_keys 2>/dev/null \
  || echo "$PUBKEY" >> /home/ansible/.ssh/authorized_keys
chmod 0600 /home/ansible/.ssh/authorized_keys
chown ansible:ansible /home/ansible/.ssh/authorized_keys

echo "ansible ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/ansible
chmod 0440 /etc/sudoers.d/ansible
visudo -c -f /etc/sudoers.d/ansible

id ansible
```

Verify bootstrap success from the control node:

```bash
ansible -i 'NEW_HOST_IP,' all -m ping \
  -u ansible -e ansible_user=ansible \
  --private-key ~/.ssh/ansible_svc
```

Expected: `NEW_HOST_IP | SUCCESS => {"ping": "pong"}`. If this fails, fix bootstrap first — do not proceed.

## Step 2: Create local pbr_admin break-glass account

On the **target host**, as root:

```bash
useradd -m -s /bin/bash -c "PBR break-glass admin" pbr_admin
passwd pbr_admin
# Set the password from 1Password (PBR > Linux > pbr_admin)
usermod -aG sudo pbr_admin
id pbr_admin
```

This account must exist before the baseline role runs; preflight verifies it.

## Step 3: Pre-clean AD (PowerShell, on a domain-joined Windows host with AD module)

If the host has ever been joined to AD — even an aborted attempt — the AD computer object must be deleted before re-joining. Always check, even for fresh hosts (the name may collide with a decommissioned host).
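One wrinkle when checking: AD stores the computer account under a NetBIOS-style name, which (by standard NetBIOS/adcli convention, not something this repo configures) is the hostname uppercased and truncated to 15 characters. A small sketch to compute the name to query; the helper name `ad_computer_name` is illustrative:

```shell
# Sketch (assumed NetBIOS behaviour): the computer account a realm join
# registers is the short hostname, uppercased, truncated to 15 chars.
# Compute it so the AD checks below query the right identity.
ad_computer_name() {
  # strip any DNS suffix, uppercase, truncate to the 15-char NetBIOS limit
  printf '%s' "${1%%.*}" | tr '[:lower:]' '[:upper:]' | cut -c1-15
}

# Example:
#   ad_computer_name pbr-newhost-kl1.pbr.org.au   # -> PBR-NEWHOST-KL1
```

This matters for hostnames longer than 15 characters, where the AD object name will not match the full hostname.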
```powershell
# Check whether the computer object exists
Get-ADComputer NEW-HOSTNAME -ErrorAction SilentlyContinue

# If it exists and you're sure it's safe to delete
Get-ADComputer NEW-HOSTNAME -ErrorAction SilentlyContinue |
    Remove-ADComputer -Confirm:$false

# Confirm it is gone
Get-ADComputer NEW-HOSTNAME -ErrorAction SilentlyContinue
```

Note: even with a proper pre-clean, the first realm join attempt may fail due to AD multi-master replication lag. See Step 6 for the expected retry behaviour.

## Step 4: Add host to inventory

On **pbr-ansible-kl1**, edit `~/pbr-infra/inventory/hosts.yml`. The host must be added in **two places**:

1. Under `all.children.linux.hosts` (with `ansible_host:`)
2. Under `all.children.targets.hosts` (no `ansible_host` — inherited)

```yaml
---
all:
  children:
    linux:
      hosts:
        # ... existing hosts ...
        pbr-NEWHOST-kl1:
          ansible_host: 10.1.X.Y   # <-- add here
    targets:
      hosts:
        # ... existing hosts ...
        pbr-NEWHOST-kl1:           # <-- and here
```

Why two places: the `linux` group lists known hosts (used for ad-hoc commands, monitoring, fact-gathering). The `targets` group is the deployment scope — playbooks use `hosts: targets` to ensure the control node and any informational-only hosts cannot be hit accidentally.

Commit and push the inventory change:

```bash
cd ~/pbr-infra
git add inventory/hosts.yml
git commit -m "inventory: add pbr-NEWHOST-kl1"
git push origin main
```

## Step 5: Run preflight (no-changes verification)

```bash
cd ~/pbr-infra
ansible-playbook playbooks/preflight.yml -l pbr-NEWHOST-kl1 \
  --vault-password-file ~/.ansible_vault_pass
```

Preflight is read-only — it makes zero changes to the host.
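Before running preflight, it is worth confirming that the Step 4 inventory edit landed in both groups. A quick sanity-check sketch; it assumes the `hosts.yml` layout shown in Step 4, and the function name is illustrative:

```shell
# Sanity check (sketch): confirm the new host appears in both inventory
# groups (linux + targets) before running preflight. Assumes the
# hosts.yml layout from Step 4; pass a different path as $2 if needed.
check_inventory() {
  local host="$1" inv="${2:-$HOME/pbr-infra/inventory/hosts.yml}"
  local count
  count=$(grep -c "^[[:space:]]*${host}:" "$inv" || true)
  if [ "$count" -ge 2 ]; then
    echo "OK: ${host} listed ${count} times (linux + targets)"
  else
    echo "MISSING: ${host} listed ${count} time(s); expected 2" >&2
    return 1
  fi
}
```

If this reports MISSING, fix the inventory before proceeding; preflight then validates the host-side preconditions.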
Preflight validates:

- OS is Ubuntu 22.04 or 24.04
- Hostname is set to a real value and resolves
- System clock is NTP-synchronised
- Required local users (`ansible`, `pbr_admin`) exist
- APT Universe component is enabled (for `oddjob`, `oddjob-mkhomedir`)
- `visudo -c` passes (ignoring the known ThreatLocker drop-in permission issue)
- AD DCs are reachable on TCP 88 and 389
- No existing realm membership conflicts
- SCEPman `/ca` endpoint returns a valid CA cert
- AD schema has the `sshPublicKey` attribute
- Vault password file exists with correct permissions
- Required collections are installed on the control node

If preflight fails, fix the cause and re-run. Do not proceed to the baseline step until preflight is clean.

## Step 6: Run the baseline role

```bash
cd ~/pbr-infra
ansible-playbook playbooks/ssh-baseline.yml -l pbr-NEWHOST-kl1 \
  --vault-password-file ~/.ansible_vault_pass
```

The playbook runs preflight again (defence in depth), then applies the role. Expected duration: ~3-5 minutes per host on a typical KVM VM.

### Expected behaviour: realm join may fail on first attempt

Despite a clean AD pre-clean, the first realm join attempt sometimes fails. This is a known pattern caused by AD multi-master replication lag — the join hits a DC that hasn't yet seen the deletion of the pre-cleaned computer object. The output looks like this (with `no_log: true` hiding the actual error):

```text
TASK [ssh-baseline : Join Active Directory domain] *****************************
fatal: [pbr-NEWHOST-kl1]: FAILED! => changed=true
  censored: 'the output has been hidden due to the fact that no_log: true was specified for this result'
```

Fix: just re-run the playbook. The role is idempotent and the second attempt almost always succeeds:

```bash
ansible-playbook playbooks/ssh-baseline.yml -l pbr-NEWHOST-kl1 \
  --vault-password-file ~/.ansible_vault_pass
```

If the second attempt also fails, dig deeper (see Troubleshooting in the Known Limitations page).
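If you deploy hosts often, the run-then-rerun pattern can be wrapped in a small retry loop. This is a convenience sketch, not part of the repo: the attempt cap, the 30-second pause, and the `ANSIBLE_PLAYBOOK` override (there only so the loop can be exercised without a real control node) are all assumptions.

```shell
# Retry wrapper (sketch): absorbs the known first-attempt realm-join
# failure by re-running the playbook. Playbook and vault paths match the
# commands above; attempt cap and pause length are assumptions.
run_baseline() {
  local host="$1" max="${2:-2}" attempt=1
  local play="${ANSIBLE_PLAYBOOK:-ansible-playbook}"  # overridable for testing
  while [ "$attempt" -le "$max" ]; do
    echo "baseline attempt ${attempt}/${max} for ${host}"
    if "$play" playbooks/ssh-baseline.yml -l "$host" \
        --vault-password-file ~/.ansible_vault_pass; then
      echo "baseline OK on attempt ${attempt}"
      return 0
    fi
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$max" ]; then
      sleep 30   # give AD replication a moment before retrying
    fi
  done
  echo "baseline failed after ${max} attempts" >&2
  return 1
}
```

A bounded loop keeps the replication-lag retry from masking a genuinely broken join: two attempts, then you fall through to the diagnostics below.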
The most common diagnostic is to read the host's journal (via `journalctl`) for adcli/realmd/Kerberos errors:

```bash
ansible pbr-NEWHOST-kl1 -m shell -a '
  journalctl --since "10 minutes ago" --no-pager 2>&1 \
    | grep -iE "realm|adcli|krb5|sssd|kerberos" | tail -40
  timedatectl status | head -8
' --become --vault-password-file ~/.ansible_vault_pass
```

## Step 7: Run post-deployment verification

```bash
cd ~/pbr-infra
ansible-playbook playbooks/verify.yml -l pbr-NEWHOST-kl1 \
  -e verify_test_user=a.mfraser \
  --vault-password-file ~/.ansible_vault_pass
```

Replace `a.mfraser` with any AD username that is a member of `SG_ServerAccess` or `SG_Sudo` and has an `sshPublicKey` populated.

Verify checks:

- Realm membership reports the correct domain
- The test AD user resolves via SSSD (`getent passwd`)
- The test user's SSH public key is retrievable via `sss_ssh_authorizedkeys`
- `sshd -t` passes (full config validates)
- Services `ssh`, `sssd`, `fail2ban` are running
- `auditd` is running on managed hosts (skipped on LXC)
- fail2ban `sshd` jail is active
- `pam_duo.so` is referenced in `/etc/pam.d/sudo`
- The sudo `timestamp_timeout` drop-in exists
- The ansible NOPASSWD sudo path still works (proves the PAM stack didn't break automation)
- `pbr_admin` is not in `sg_sudo` (which would force Duo on the break-glass account)

The verification summary at the end looks like:

```text
TASK [Verification summary] ****************************************************
ok: [pbr-NEWHOST-kl1] =>
  msg:
  - '==================== VERIFICATION PASSED ===================='
  - 'Joined to realm: pbr.org.au'
  - 'AD user resolves: a.mfraser (1234:5678)'
  - 'SSH key retrieved: ssh-ed25519 AAAAC3...'
  - 'sshd config valid: yes'
  - 'All services running: ssh, sssd, fail2ban, auditd'
  - ''
  - 'Next: SSH from your workstation as a.mfraser@pbr-NEWHOST-kl1'
  - 'Expect: key auth + Duo push for SSH; Duo push + AD password for sudo'
```

## Step 8: Manual SSH validation from your workstation

This step proves the end-user experience actually works.
From your workstation:

### Test 1: AD user via SSH

```bash
ssh a.mfraser@pbr-NEWHOST-kl1.pbr.org.au
```

Expected: SSH key auth completes (no password prompt), then a Duo push to your phone. Approve the push and you land in a shell as your AD user.

### Test 2: sudo as AD user

```bash
sudo whoami
```

Expected: Duo push prompt (auto-pushed), then AD password prompt, then `root`. Within the 30-minute timestamp window, subsequent sudo commands skip both prompts.

### Test 3: pbr_admin break-glass

```bash
ssh pbr_admin@pbr-NEWHOST-kl1.pbr.org.au
```

Expected: password-only prompt (no key, no Duo) — local password from 1Password.

```bash
sudo whoami
```

Expected: local password prompt only (no Duo). Returns `root`.

### Test 4: Ansible NOPASSWD path still works

From the control node (already validated by verify.yml, but worth a manual check):

```bash
ansible pbr-NEWHOST-kl1 -m shell -a 'sudo -n true' --become
```

Expected: success. Confirms the PAM stack hasn't broken automation.

## Step 9: Clean up tee'd log files (if any)

If you piped playbook output to a log file during deployment:

```bash
# Check whether any log contains the AD service account password
grep -l "MDT_JD\|--login-user" /tmp/*.log 2>/dev/null

# Shred any logs created during this deployment
shred -u /tmp/NEWHOST-*.log 2>/dev/null
```

Even with `no_log: true` restored, transient diagnostic logs from troubleshooting may contain sensitive material. Always scrub.

## Royal TS Connection Notes

Royal TS 7's Rebex SSH library has a constraint: it does not natively support OpenSSH's `AuthenticationMethods publickey,keyboard-interactive` directive. Without configuration, Royal TS will fail to connect to baselined hosts.

### Workaround: set Authentication Method to "Any"

1. Open the host's Royal TS connection properties
2. Navigate to Advanced > Security
3. Set Authentication method to **Any**
4. Save and reconnect

This lets Rebex negotiate either method per the server's policy, and the server's `AuthenticationMethods` directive will require both.

### Auto-push approval

Royal TS's keyboard-interactive UI does not support pre-filling the Duo response.
You will press Enter once at the Duo prompt to confirm the push. This is acceptable for a single round-trip MFA.

### Alternative: External Application launching Windows OpenSSH

If Rebex limitations bite, configure Royal TS to launch Windows' native `ssh.exe` as an External Application connection instead. PowerShell `ssh.exe` handles `AuthenticationMethods publickey,keyboard-interactive` correctly and integrates with the 1Password SSH agent via the OpenSSH named pipe (`\\.\pipe\openssh-ssh-agent`).

## Where to Read Next

- **Known Limitations, Troubleshooting & Version History** — detailed troubleshooting if deployment fails
- **Configuration Reference** — per-host overrides via `host_vars/` if a host needs non-default settings
- **Playbook Reference** — details on preflight, verify, and teardown
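Appendix: if you route Royal TS through Windows `ssh.exe` as described above, a minimal `~/.ssh/config` sketch makes the agent wiring explicit. The `Host` pattern and `User` are illustrative; a recent Win32-OpenSSH build talks to this pipe by default, so naming it mainly documents the 1Password dependency.

```
# C:\Users\<you>\.ssh\config — sketch only; Host pattern and User are
# illustrative, not part of the repo's configuration.
Host *.pbr.org.au
    User a.mfraser
    PreferredAuthentications publickey,keyboard-interactive
    IdentityAgent \\.\pipe\openssh-ssh-agent
```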