# Deployment Runbook — New Host

## When to Use This Runbook

Follow this runbook when adding a new Ubuntu host to the SSH baseline. The procedure assumes:

- The host runs Ubuntu 22.04 or 24.04 LTS (the role's supported versions)
- The host has a real hostname (not `ubuntu` or `localhost`)
- The host can reach AD DCs on TCP 88 (Kerberos) and 389 (LDAP)
- The host can reach `https://pki.pbr.org.au/ca` (SCEPman root CA)
- The host has NTP synchronisation working (`timedatectl status` shows `NTPSynchronized=yes`)

Preflight will validate all of these before any changes are made.

## Step 1: Bootstrap the ansible automation account

On the **target host**, as root (e.g. via console, ScreenConnect, or your initial admin SSH session):

```bash
# Copy the bootstrap script to the host. Easiest: paste via SSH session or
# fetch from the repo.
curl -fsSL https://raw.githubusercontent.com/Puffing-Billy-Railway/pbr-infra/main/scripts/bootstrap-ansible-user.sh \
  -o /tmp/bootstrap-ansible-user.sh

# Inspect it before running
less /tmp/bootstrap-ansible-user.sh

# Run as root
sudo bash /tmp/bootstrap-ansible-user.sh
```

The script is idempotent. It creates the local `ansible` account, adds it to the `sudo` group, locks the password (key auth only), installs the control node's public key at `~ansible/.ssh/authorized_keys`, and writes `/etc/sudoers.d/ansible` with NOPASSWD. Full source:

```bash
#!/bin/bash
# Run as root on a fresh host before adding to ssh-baseline inventory.
# Creates the local ansible automation user with sudo group membership,
# key-only auth, and NOPASSWD sudoers. Idempotent.
set -e

PUBKEY="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBMc7IDlr/IZ5M/2HcXU7cGCKZ03SLjpr5cbmiHnokdP ansible-svc@pbr-ansible-kl1"

useradd -m -s /bin/bash -c "Ansible automation" ansible 2>/dev/null || true
usermod -aG sudo ansible
passwd -l ansible

install -d -m 0700 -o ansible -g ansible /home/ansible/.ssh
grep -qxF "$PUBKEY" /home/ansible/.ssh/authorized_keys 2>/dev/null \
  || echo "$PUBKEY" >> /home/ansible/.ssh/authorized_keys
chmod 0600 /home/ansible/.ssh/authorized_keys
chown ansible:ansible /home/ansible/.ssh/authorized_keys

echo "ansible ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/ansible
chmod 0440 /etc/sudoers.d/ansible
visudo -c -f /etc/sudoers.d/ansible

id ansible
```

Verify bootstrap success from the control node:

```bash
ansible -i 'NEW_HOST_IP,' all -m ping \
  -u ansible -e ansible_user=ansible \
  --private-key ~/.ssh/ansible_svc
```

Expected: `NEW_HOST_IP | SUCCESS => {"ping": "pong"}`. If this fails, fix bootstrap first — do not proceed.

## Step 2: Create local pbr_admin break-glass account

On the **target host**, as root:

```bash
useradd -m -s /bin/bash -c "PBR break-glass admin" pbr_admin
passwd pbr_admin
# Set the password from 1Password (PBR > Linux > pbr_admin)
usermod -aG sudo pbr_admin
id pbr_admin
```

This account must exist before the baseline role runs; preflight verifies it.

## Step 3: Pre-clean AD (PowerShell, on a domain-joined Windows host with AD module)

If the host has ever been joined to AD — even an aborted attempt — the AD computer object must be deleted before re-joining. Always check, even for fresh hosts (the name may collide with a decommissioned host).
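One wrinkle when checking: AD stores the computer account under a NetBIOS-style name, which (by standard NetBIOS/adcli convention, not something this repo configures) is the hostname uppercased and truncated to 15 characters. A small sketch to compute the name to query; the helper name `ad_computer_name` is illustrative:

```shell
# Sketch (assumed NetBIOS behaviour): the computer account a realm join
# registers is the short hostname, uppercased, truncated to 15 chars.
# Compute it so the AD checks below query the right identity.
ad_computer_name() {
  # strip any DNS suffix, uppercase, truncate to the 15-char NetBIOS limit
  printf '%s' "${1%%.*}" | tr '[:lower:]' '[:upper:]' | cut -c1-15
}

# Example:
#   ad_computer_name pbr-newhost-kl1.pbr.org.au   # -> PBR-NEWHOST-KL1
```

This matters for hostnames longer than 15 characters, where the AD object name will not match the full hostname.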
```powershell
# Check whether the computer object exists
Get-ADComputer NEW-HOSTNAME -ErrorAction SilentlyContinue

# If it exists and you're sure it's safe to delete
Get-ADComputer NEW-HOSTNAME -ErrorAction SilentlyContinue |
    Remove-ADComputer -Confirm:$false

# Confirm it is gone
Get-ADComputer NEW-HOSTNAME -ErrorAction SilentlyContinue
```

Note: even with a proper pre-clean, the first realm join attempt may fail due to AD multi-master replication lag. See Step 6 for the expected retry behaviour.

## Step 4: Add host to inventory

On **pbr-ansible-kl1**, edit `~/pbr-infra/inventory/hosts.yml`. The host must be added in **two places**:

1. Under `all.children.linux.hosts` (with `ansible_host:`)
2. Under `all.children.targets.hosts` (no `ansible_host` — inherited)

```yaml
---
all:
  children:
    linux:
      hosts:
        # ... existing hosts ...
        pbr-NEWHOST-kl1:
          ansible_host: 10.1.X.Y   # <-- add here
    targets:
      hosts:
        # ... existing hosts ...
        pbr-NEWHOST-kl1:           # <-- and here
```

Why two places: the `linux` group lists known hosts (used for ad-hoc commands, monitoring, fact-gathering). The `targets` group is the deployment scope — playbooks use `hosts: targets` to ensure the control node and any informational-only hosts cannot be hit accidentally.

Commit and push the inventory change:

```bash
cd ~/pbr-infra
git add inventory/hosts.yml
git commit -m "inventory: add pbr-NEWHOST-kl1"
git push origin main
```

## Step 5: Run preflight (no-changes verification)

```bash
cd ~/pbr-infra
ansible-playbook playbooks/preflight.yml -l pbr-NEWHOST-kl1 \
  --vault-password-file ~/.ansible_vault_pass
```

Preflight is read-only — it makes zero changes to the host.
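Before running preflight, it is worth confirming that the Step 4 inventory edit landed in both groups. A quick sanity-check sketch; it assumes the `hosts.yml` layout shown in Step 4, and the function name is illustrative:

```shell
# Sanity check (sketch): confirm the new host appears in both inventory
# groups (linux + targets) before running preflight. Assumes the
# hosts.yml layout from Step 4; pass a different path as $2 if needed.
check_inventory() {
  local host="$1" inv="${2:-$HOME/pbr-infra/inventory/hosts.yml}"
  local count
  count=$(grep -c "^[[:space:]]*${host}:" "$inv" || true)
  if [ "$count" -ge 2 ]; then
    echo "OK: ${host} listed ${count} times (linux + targets)"
  else
    echo "MISSING: ${host} listed ${count} time(s); expected 2" >&2
    return 1
  fi
}
```

If this reports MISSING, fix the inventory before proceeding; preflight then validates the host-side preconditions.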
Preflight validates:

- OS is Ubuntu 22.04 or 24.04
- Hostname is set to a real value and resolves
- System clock is NTP-synchronised
- Required local users (`ansible`, `pbr_admin`) exist
- APT Universe component is enabled (for `oddjob`, `oddjob-mkhomedir`)
- `visudo -c` passes (ignoring the known ThreatLocker drop-in permission issue)
- AD DCs are reachable on TCP 88 and 389
- No existing realm membership conflicts
- SCEPman `/ca` endpoint returns a valid CA cert
- AD schema has the `sshPublicKey` attribute
- Vault password file exists with correct permissions
- Required collections are installed on the control node

If preflight fails, fix the cause and re-run. Do not proceed to the baseline step until preflight is clean.

## Step 6: Run the baseline role

```bash
cd ~/pbr-infra
ansible-playbook playbooks/ssh-baseline.yml -l pbr-NEWHOST-kl1 \
  --vault-password-file ~/.ansible_vault_pass
```

The playbook runs preflight again (defence in depth), then applies the role. Expected duration: ~3-5 minutes per host on a typical KVM VM.

### Expected behaviour: realm join may fail on first attempt

Despite a clean AD pre-clean, the first realm join attempt sometimes fails. This is a known pattern caused by AD multi-master replication lag — the join hits a DC that hasn't yet seen the deletion of the pre-cleaned computer object. The output looks like this (with `no_log: true` hiding the actual error):

```text
TASK [ssh-baseline : Join Active Directory domain] *****************************
fatal: [pbr-NEWHOST-kl1]: FAILED! => changed=true
  censored: 'the output has been hidden due to the fact that no_log: true was specified for this result'
```

Fix: just re-run the playbook. The role is idempotent and the second attempt almost always succeeds:

```bash
ansible-playbook playbooks/ssh-baseline.yml -l pbr-NEWHOST-kl1 \
  --vault-password-file ~/.ansible_vault_pass
```

If the second attempt also fails, dig deeper (see Troubleshooting in the Known Limitations page).
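If you deploy hosts often, the run-then-rerun pattern can be wrapped in a small retry loop. This is a convenience sketch, not part of the repo: the attempt cap, the 30-second pause, and the `ANSIBLE_PLAYBOOK` override (there only so the loop can be exercised without a real control node) are all assumptions.

```shell
# Retry wrapper (sketch): absorbs the known first-attempt realm-join
# failure by re-running the playbook. Playbook and vault paths match the
# commands above; attempt cap and pause length are assumptions.
run_baseline() {
  local host="$1" max="${2:-2}" attempt=1
  local play="${ANSIBLE_PLAYBOOK:-ansible-playbook}"  # overridable for testing
  while [ "$attempt" -le "$max" ]; do
    echo "baseline attempt ${attempt}/${max} for ${host}"
    if "$play" playbooks/ssh-baseline.yml -l "$host" \
        --vault-password-file ~/.ansible_vault_pass; then
      echo "baseline OK on attempt ${attempt}"
      return 0
    fi
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$max" ]; then
      sleep 30   # give AD replication a moment before retrying
    fi
  done
  echo "baseline failed after ${max} attempts" >&2
  return 1
}
```

A bounded loop keeps the replication-lag retry from masking a genuinely broken join: two attempts, then you fall through to the diagnostics below.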
The most common diagnostic is to read the host's journal (via `journalctl`) for adcli/realmd/Kerberos errors:

```bash
ansible pbr-NEWHOST-kl1 -m shell -a '
  journalctl --since "10 minutes ago" --no-pager 2>&1 \
    | grep -iE "realm|adcli|krb5|sssd|kerberos" | tail -40
  timedatectl status | head -8
' --become --vault-password-file ~/.ansible_vault_pass
```

## Step 7: Run post-deployment verification

```bash
cd ~/pbr-infra
ansible-playbook playbooks/verify.yml -l pbr-NEWHOST-kl1 \
  -e verify_test_user=a.mfraser \
  --vault-password-file ~/.ansible_vault_pass
```

Replace `a.mfraser` with any AD username that is a member of `SG_ServerAccess` or `SG_Sudo` and has an `sshPublicKey` populated.

Verify checks:

- Realm membership reports the correct domain
- The test AD user resolves via SSSD (`getent passwd`)
- The test user's SSH public key is retrievable via `sss_ssh_authorizedkeys`
- `sshd -t` passes (full config validates)
- Services `ssh`, `sssd`, `fail2ban` are running
- `auditd` is running on managed hosts (skipped on LXC)
- fail2ban `sshd` jail is active
- `pam_duo.so` is referenced in `/etc/pam.d/sudo`
- The sudo `timestamp_timeout` drop-in exists
- The ansible NOPASSWD sudo path still works (proves the PAM stack didn't break automation)
- `pbr_admin` is not in `sg_sudo` (which would force Duo on the break-glass account)

The verification summary at the end looks like:

```text
TASK [Verification summary] ****************************************************
ok: [pbr-NEWHOST-kl1] =>
  msg:
  - '==================== VERIFICATION PASSED ===================='
  - 'Joined to realm: pbr.org.au'
  - 'AD user resolves: a.mfraser (1234:5678)'
  - 'SSH key retrieved: ssh-ed25519 AAAAC3...'
  - 'sshd config valid: yes'
  - 'All services running: ssh, sssd, fail2ban, auditd'
  - ''
  - 'Next: SSH from your workstation as a.mfraser@pbr-NEWHOST-kl1'
  - 'Expect: key auth + Duo push for SSH; Duo push + AD password for sudo'
```

## Step 8: Manual SSH validation from your workstation

This step proves the end-user experience actually works.
From your workstation:

### Test 1: AD user via SSH

```bash
ssh a.mfraser@pbr-NEWHOST-kl1.pbr.org.au
```

Expected: SSH key auth completes (no password prompt), then a Duo push to your phone. Approve the push and you land in a shell as your AD user.

### Test 2: sudo as AD user

```bash
sudo whoami
```

Expected: Duo push prompt (auto-pushed), then AD password prompt, then `root`. Within the 30-minute timestamp window, subsequent sudo commands skip both prompts.

### Test 3: pbr_admin break-glass

```bash
ssh pbr_admin@pbr-NEWHOST-kl1.pbr.org.au
```

Expected: password-only prompt (no key, no Duo) — local password from 1Password.

```bash
sudo whoami
```

Expected: local password prompt only (no Duo). Returns `root`.

### Test 4: Ansible NOPASSWD path still works

From the control node (already validated by verify.yml, but worth a manual check):

```bash
ansible pbr-NEWHOST-kl1 -m shell -a 'sudo -n true' --become
```

Expected: success. Confirms the PAM stack hasn't broken automation.

## Step 9: Clean up tee'd log files (if any)

If you piped playbook output to a log file during deployment:

```bash
# Check whether any log contains the AD service account password
grep -l "MDT_JD\|--login-user" /tmp/*.log 2>/dev/null

# Shred any logs created during this deployment
shred -u /tmp/NEWHOST-*.log 2>/dev/null
```

Even with `no_log: true` restored, transient diagnostic logs from troubleshooting may contain sensitive material. Always scrub.

## Royal TS Connection Notes

Royal TS 7's Rebex SSH library has a constraint: it does not natively support OpenSSH's `AuthenticationMethods publickey,keyboard-interactive` directive. Without configuration, Royal TS will fail to connect to baselined hosts.

### Workaround: set Authentication Method to "Any"

1. Open the host's Royal TS connection properties
2. Navigate to Advanced > Security
3. Set Authentication method to **Any**
4. Save and reconnect

This lets Rebex negotiate either method per the server's policy, and the server's `AuthenticationMethods` directive will require both.

### Auto-push approval

Royal TS's keyboard-interactive UI does not support pre-filling the Duo response.
You will press Enter once at the Duo prompt to confirm the push. This is acceptable for a single round-trip MFA.

### Alternative: External Application launching Windows OpenSSH

If Rebex limitations bite, configure Royal TS to launch Windows' native `ssh.exe` as an External Application connection instead. PowerShell `ssh.exe` handles `AuthenticationMethods publickey,keyboard-interactive` correctly and integrates with the 1Password SSH agent via the OpenSSH named pipe (`\\.\pipe\openssh-ssh-agent`).

## Where to Read Next

- **Known Limitations, Troubleshooting & Version History** — detailed troubleshooting if deployment fails
- **Configuration Reference** — per-host overrides via `host_vars/` if a host needs non-default settings
- **Playbook Reference** — details on preflight, verify, and teardown
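Appendix: if you route Royal TS through Windows `ssh.exe` as described above, a minimal `~/.ssh/config` sketch makes the agent wiring explicit. The `Host` pattern and `User` are illustrative; a recent Win32-OpenSSH build talks to this pipe by default, so naming it mainly documents the 1Password dependency.

```
# C:\Users\<you>\.ssh\config — sketch only; Host pattern and User are
# illustrative, not part of the repo's configuration.
Host *.pbr.org.au
    User a.mfraser
    PreferredAuthentications publickey,keyboard-interactive
    IdentityAgent \\.\pipe\openssh-ssh-agent
```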