Hermes gateway/cron systemd guardrails

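Purpose

When running the Hermes gateway and cron on a 24/7 server, standardize on more than a plain restart: profile separation, log locations, diagnosing provider and auth failures separately, systemd restart limits, and a ban on self-restarts.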

Basic check commands

# overview of all profiles and gateways
hermes profile list

# default profile gateway
hermes gateway status

# named profile gateway
hermes -p <profile> gateway status
# or, if an alias exists
<alias> gateway status

# Linux user systemd units
systemctl --user list-units 'hermes-gateway*'

# default logs
hermes logs --tail
# or
less ~/.hermes/logs/gateway.log
less ~/.hermes/logs/gateway.error.log

# named profile logs
hermes -p <profile> logs --tail
# or
less ~/.hermes/profiles/<profile>/logs/gateway.log
less ~/.hermes/profiles/<profile>/logs/gateway.error.log

Failure classification checklist

Is it a transport problem?
- Check the Telegram/Slack/Mattermost token, webhook or polling mode, channel/chat ID, and the platform API responses.
Is it a provider/auth problem?
- Look in the gateway/cron logs for OAuth refresh errors, HTTP 401/403, token consumed, or a missing API key.
Is it a profile scope problem?
- Confirm that the default profile and each named profile have separate .env, config.yaml, auth.json, and service.
Is it a scheduler problem?
- Check hermes cron list, whether the job is enabled, duplicate or stale jobs, timezone conversion, and the workdir setting.
Is it a delivery problem?
- Verify job execution success and outbound delivery success separately.
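
As a rough aid for the provider/auth question above, the following sketch counts log lines that hint at an auth failure. The function name auth_failure_count and the grep patterns are assumptions made for illustration; the actual wording in Hermes logs may differ, so adjust the patterns after looking at a real failure.

```shell
#!/bin/sh
# Sketch: count log lines that hint at provider/auth failures.
# The patterns below are assumptions, not the confirmed Hermes log format.

auth_failure_count() {
  # $1: path to a gateway or cron log file
  # grep -c prints the number of matching lines; it exits 1 when that
  # number is 0, so "|| true" keeps a calling script from aborting.
  grep -c -i -E 'HTTP 40[13]|oauth.*refresh|token consumed|missing api key' "$1" || true
}
```

For example, auth_failure_count ~/.hermes/logs/gateway.error.log prints a single number. A non-zero count suggests looking at provider/auth before touching transport or scheduler settings.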

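systemd operating guardrails

Manage the long-running gateway as a systemd user service.
Automatic crash recovery is needed, but repeated failures within a short window should stop the unit.
General direction:
- Restart=on-failure
- RestartSec=5s or longer
- StartLimitIntervalSec=...
- StartLimitBurst=...
The Hermes CLI generates the service file, so if you need an override, consider a drop-in override first.
Right after a gateway restart, a transient activating (auto-restart) state may appear; check the status again before treating it as a failure.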
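
The re-check of the transient state can be scripted. The sketch below reads the KEY=VALUE output of systemctl show and reports whether the unit has settled; the function name unit_state and its output labels are illustrative, not part of Hermes or systemd.

```shell
#!/bin/sh
# Sketch: decide from `systemctl show` output whether a unit has settled.

unit_state() {
  # stdin: KEY=VALUE lines, e.g. from
  #   systemctl --user show -p ActiveState -p SubState <unit>
  active="" sub=""
  while IFS='=' read -r key value; do
    case "$key" in
      ActiveState) active="$value" ;;
      SubState)    sub="$value" ;;
    esac
  done
  if [ "$active" = "active" ] && [ "$sub" = "running" ]; then
    echo "running"
  elif [ "$active" = "activating" ] && [ "$sub" = "auto-restart" ]; then
    # Transient: systemd is waiting out RestartSec before the next start.
    echo "auto-restart"
  else
    echo "other:${active}/${sub}"
  fi
}
```

A possible use is systemctl --user show -p ActiveState -p SubState "$UNIT" | unit_state, repeated after RestartSec has elapsed if the first answer is auto-restart.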

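Prohibited and caution patterns

Do not let an agent inside the gateway create a cron/job that stops or restarts the gateway itself.
Do not run gateways for multiple profiles at the same time while they share one Telegram bot token.
Do not conclude that named profile gateways are healthy from checking only the default profile.
Do not run a cron report prompt that lacks delivery, archive, source, and failure contracts.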
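
The shared-token rule can be checked mechanically. The sketch below assumes each profile keeps its token in a .env file as TELEGRAM_BOT_TOKEN=<value>; that key name, the function name shared_token_files, and the .env paths you pass in are assumptions to adapt to your setup. It prints a checksum-grouped list of files rather than the token itself, and it does not handle paths that contain spaces.

```shell
#!/bin/sh
# Sketch: report .env files whose Telegram bot token value is identical.
# Assumption: the token is stored as TELEGRAM_BOT_TOKEN=<value>.

shared_token_files() {
  # args: paths to .env files (paths must not contain spaces)
  for f in "$@"; do
    [ -r "$f" ] || continue
    token="$(sed -n 's/^TELEGRAM_BOT_TOKEN=//p' "$f" | head -n 1)"
    [ -n "$token" ] || continue
    # Print a checksum instead of the token so the secret stays off stdout.
    printf '%s %s\n' "$(printf '%s' "$token" | cksum | cut -d' ' -f1)" "$f"
  done | sort | awk '
    { n[$1]++; files[$1] = files[$1] " " $2 }
    END { for (h in n) if (n[h] > 1) print "shared:" files[h] }'
}
```

Empty output means no two files share a token. cksum is a CRC, so treat a reported match as a prompt to compare the files by hand rather than as proof.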

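No-agent watchdog pattern

For monitoring tasks that need no LLM judgment, prefer a no_agent script-only cron over a regular agent cron.

Suitable tasks:

RAM/disk/GPU usage threshold alerts
HTTP health checks, port checks, service heartbeats
CI/deploy completion status alerts
External API pollers that only detect clear state changes

Operating contract:

Place scripts under ~/.hermes/scripts/.
Keep stdout empty in the healthy state. Empty stdout is a silent tick.
Print short, actionable stdout only when an alert is needed.
A non-zero exit or a timeout is raised as an error alert, so failures do not go unnoticed.
If summarizing, filtering, or root-cause judgment is needed, switch from no-agent to an LLM cron.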

hermes cron create "every 10m" \
  --no-agent \
  --script disk-watchdog.sh \
  --deliver telegram \
  --name "disk-watchdog"

hermes cron list
hermes cron run disk-watchdog
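
A minimal sketch of what a disk-watchdog.sh that follows the operating contract could look like: silent when healthy, one short line when over the threshold, non-zero exit when the check itself fails. MOUNT and THRESHOLD are illustrative environment variables, not Hermes settings, and the alert wording is only an example.

```shell
#!/bin/sh
# disk-watchdog.sh (sketch): empty stdout when healthy, one actionable
# line when usage reaches the threshold, non-zero exit if the check breaks.
# MOUNT and THRESHOLD are illustrative names, not Hermes settings.

disk_used_percent() {
  # POSIX `df -P`: column 5 of line 2 is the capacity, e.g. "42%".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_disk() {
  # $1: mount point, $2: threshold in percent
  used="$(disk_used_percent "$1")"
  if [ -z "$used" ]; then
    echo "cannot read disk usage for $1" >&2
    return 1
  fi
  if [ "$used" -ge "$2" ]; then
    echo "disk usage on $1 is ${used}% (threshold $2%): free space or extend the volume"
  fi
}

check_disk "${MOUNT:-/}" "${THRESHOLD:-90}"
```

Keeping the threshold comparison in the script, not in a prompt, is what makes this a no-agent job: there is nothing for an LLM to judge.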

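Operating contract to include in cron prompts

- timezone: write times in Asia/Seoul and state the difference from the server's cron expression
- delivery: the final response is delivered automatically, so do not call send_message separately
- archive: save the result to <absolute-path>/YYYY-MM-DD.md
- source rule: official documentation first; mark unofficial sources as unverified
- failure rule: when evidence is insufficient, return [SILENT] or an explicit partial report
- provider/model: consider a job-level override when unattended reliability is required
- duplicate rule: after a change, check for old duplicate jobs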
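
For the timezone item, the difference between Asia/Seoul and the server can be worked out once and written into the prompt. The sketch below assumes the server clock runs in UTC; if it runs in another zone, the offset differs. The function name kst_hour_to_utc is illustrative.

```shell
#!/bin/sh
# Sketch: convert an Asia/Seoul hour to the hour a UTC server would use.
# Assumption: the server evaluates schedules in UTC.

kst_hour_to_utc() {
  # $1: hour in KST as a plain integer 0-23 (no leading zero).
  # KST is UTC+9 and has no daylight saving time.
  echo $(( ($1 + 24 - 9) % 24 ))
}
```

For example, kst_hour_to_utc 9 prints 0, so a 09:00 KST report corresponds to 00:00 UTC. For KST hours 0 through 8 the UTC time falls on the previous calendar day, which matters for day-of-week schedules and for the YYYY-MM-DD archive name.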

Basic systemd drop-in override

Do not edit the gateway service file that Hermes generated; when needed, override only the restart policy with a user unit drop-in. Confirm the actual unit name with systemctl --user list-units 'hermes-gateway*'.

UNIT=hermes-gateway.service   # for a named profile, e.g. hermes-gateway-code-senior.service
mkdir -p "$HOME/.config/systemd/user/${UNIT}.d"
cat > "$HOME/.config/systemd/user/${UNIT}.d/10-restart-guardrails.conf" <<'EOF'
[Unit]
StartLimitIntervalSec=10min
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=10s
EOF
systemctl --user daemon-reload
systemctl --user restart "$UNIT"
systemctl --user status "$UNIT" --no-pager

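Verification points:

The systemd.service documentation recommends Restart=on-failure for long-running services as a way to recover automatically from errors.
The default for RestartSec= is 100ms, which can be too aggressive in a crash loop, so raise it explicitly.
StartLimitIntervalSec=/StartLimitBurst= set the unit start rate limit. If starts within the interval exceed the burst, further starts are not permitted.
If the user service must survive SSH logout, also confirm the loginctl enable-linger <user> prerequisite.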
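
To confirm that the drop-in and lingering actually took effect, the effective values can be read back instead of trusting the file on disk. The sketch below parses KEY=VALUE lines from systemctl show and loginctl show-user; the function name guardrail_warnings and the warning texts are illustrative, and it checks only a few properties.

```shell
#!/bin/sh
# Sketch: warn when the effective restart policy or lingering looks wrong.

guardrail_warnings() {
  # stdin: KEY=VALUE lines, e.g. from
  #   systemctl --user show "$UNIT" -p Restart -p StartLimitBurst
  #   loginctl show-user "$USER" -p Linger
  restart="" burst="" linger=""
  while IFS='=' read -r key value; do
    case "$key" in
      Restart)         restart="$value" ;;
      StartLimitBurst) burst="$value" ;;
      Linger)          linger="$value" ;;
    esac
  done
  [ "$restart" = "on-failure" ] || echo "Restart is '${restart}', expected on-failure"
  [ -n "$burst" ] || echo "StartLimitBurst was not reported"
  [ "$linger" = "yes" ] || echo "lingering is not enabled for this user"
  return 0
}
```

One way to feed it is { systemctl --user show "$UNIT" -p Restart -p StartLimitBurst; loginctl show-user "$USER" -p Linger; } | guardrail_warnings. No output means the three checks passed.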
