Cohere Forward Deployed Engineer Hiring Guide: A Complete Roadmap to Becoming an AI Platform Deployment Specialist

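1. Cohere and the Agentic Platform Team
2. The Job Description, Line by Line
- Key Responsibilities
- Required Qualifications
3. Tech Stack Deep Dive
4. 25 Likely Interview Questions
5. Eight-Month Study Roadmap
- Month-by-Month Plan
6. Three Portfolio Project Ideas
7. Resume Strategy
8. What Working at Cohere Is Like
9. Quiz
10. References
Closing Thoughts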

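1. Cohere and the Agentic Platform Team

What Is Cohere

Cohere is an enterprise AI company founded in Toronto in 2019 by Aidan Gomez (a co-author of the Transformer paper "Attention Is All You Need", written during his time at Google Brain), Ivan Zhang, and Nick Frosst. Its key differentiator is that it has focused on enterprise B2B from the start, while OpenAI went after the consumer market.

Main product lineup:

Command R+: an enterprise-grade large language model (LLM) optimized for RAG (Retrieval-Augmented Generation)
Embed v3: a multilingual embedding model supporting 100+ languages, specialized for search and classification
Rerank v3: a reranking model that re-scores the relevance of search results
Aya: an open-source multilingual model project covering 101 languages

Cohere's strong position in the enterprise market comes from its rigorous approach to data privacy. Private cloud and on-premises deployment are central to its strategy, so that customer data never leaves the customer's environment.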

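What Is the North AI Platform

The North AI Platform is Cohere's enterprise AI deployment platform. It evolved from what was previously called Cohere Toolkit and lets enterprises run Cohere's AI models securely inside their own infrastructure.

Key characteristics of North:

Full AI stack deployment in private cloud and on-premises environments
Kubernetes-based architecture that can be deployed on a wide range of infrastructure
Standardized deployment process through Helm Charts
GPU resource management and model serving optimization
Integrated enterprise-grade security, monitoring, and logging

Mission of the Agentic Platform Team

The Agentic Platform team is one of the most customer-facing teams at Cohere. Its mission is to enable enterprise customers to operate AI agents safely and efficiently in their own environments.

An AI agent goes beyond a simple chatbot: it uses tools, performs multi-step reasoning, and automates real business workflows. Typical examples include document analysis agents at financial institutions, medical record summarization agents in healthcare, and customer service automation agents at telecom operators.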

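Key Customer Analysis

Cohere's enterprise customers span many industries, and each has its own infrastructure requirements.

RBC (Royal Bank of Canada) - Finance

Canada's largest bank; must comply with financial and data regulations (OSFI, GDPR, SOX)
AI deployment in air-gapped environments is the central challenge
Data residency requirement: data must be stored within Canadian territory
Extreme security requirements due to the sensitivity of financial data

Dell Technologies - IT/Hardware

AI deployed on Dell's own server and storage infrastructure
On-premises AI infrastructure built on Dell PowerEdge + NVIDIA GPUs
Isolated AI workload management in multi-tenant environments
Second-tier deployment scenarios in which Dell offers AI solutions to its own customers

LG CNS - IT Services/Korea

Compliance with Korea's Personal Information Protection Act (PIPA) and the "Three Data Laws"
Demand for optimized Korean-language model performance
Coverage of diverse industries such as finance (LG affiliates), manufacturing, and logistics
Accommodation of the Korean market's strong preference for on-premises deployment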

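Origins of the Forward Deployed Engineer

The title Forward Deployed Engineer (FDE) originated at Palantir Technologies. Inspired by the military term "forward deployed", it expresses the idea of placing engineers on the front line at the customer site.

Whereas a typical software engineer builds the product in-house, an FDE works directly in the customer's environment and builds the solution on top of it. Palantir's FDEs worked directly with the US government, military, and intelligence agencies to deploy data analytics platforms.

As this model proved successful, many AI and data companies such as Databricks, Scale AI, and Anyscale introduced similar roles, and Cohere adopted it as well.

FDE vs. SE vs. SA Role Comparison

Category	Forward Deployed Engineer (FDE)	Solutions Engineer (SE)	Solutions Architect (SA)
Core work	Hands-on deployment and build-out at the customer site	Technical sales support, demos, PoCs	Architecture design, technical advisory
Share of coding	60-70% (production implementation)	30-40% (demos, scripts)	10-20% (prototypes)
Customer contact	Deep (embedded for weeks to months)	Concentrated in the sales phase	Concentrated in the design phase
Technical depth	Very deep (infrastructure + code)	Broad, moderate depth	Broad, deep at the architecture level
Travel	20-40% (customer sites)	10-20%	10-15%
Reporting line	Engineering organization	Sales / pre-sales organization	Sales or CTO organization
Success metrics	Deployment success rate, uptime	Deal close rate, PoC conversion rate	Technology adoption rate, customer satisfaction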

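2. The Job Description, Line by Line

Key Responsibilities

"Lead North AI platform deployments across private cloud and on-premises environments"

This single line is the core of the position. "Lead" means owning the entire deployment process, not merely taking part in it. You must handle both private cloud (Azure Private Cloud, AWS Outposts, GCP Anthos, and so on) and on-premises (customer data center) environments.

In practice this means:

Pre-assessment of the customer's infrastructure
Deployment architecture design and documentation
Kubernetes cluster configuration and validation
Deploying the North platform with Helm Charts
GPU node setup and model loading
Integration testing and performance validation
Handover to the customer's operations team

"Partner with enterprise IT teams on infrastructure and security assessments"

Working with enterprise IT teams demands communication skills as much as technical ones. IT teams at large companies enforce strict security policies, network architectures, and change management processes.

What to check in an infrastructure assessment:

Network topology: VPC/VLAN layout, subnets, firewall rules
Security policy: authentication/authorization mechanisms, TLS certificate management
Compute resources: CPU/GPU specifications, memory, storage IOPS
Kubernetes environment: version, CNI plugin, ingress controller
Regulatory compliance: data residency, audit logs, access control

"Design tailored deployment strategies ensuring data privacy compliance"

The deployment strategy differs for every customer. Financial institutions require air-gapped environments, healthcare requires HIPAA compliance, and EU customers must satisfy GDPR.

Factors to consider when designing a deployment strategy:

Data flow: the paths that training data and inference data take
Encryption: at rest and in transit
Access control: who may access the models and call the APIs
Audit trail: logging of every access and change
Data retention: retention periods and deletion policies

"Troubleshoot deployment issues and minimize system downtime"

Troubleshooting in production is a core FDE competency. When an incident hits a customer's production environment, an immediate response is required.

Common troubleshooting scenarios:

Pod CrashLoopBackOff: OOM, configuration errors, dependency problems
GPU allocation failure: driver mismatch, insufficient resources
Network connectivity problems: DNS, ingress, service mesh configuration
Model loading failure: storage access permissions, corrupted model files
Performance degradation: resource contention, scaling issues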
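
The first scenario in the list above can be turned into a small first-pass triage helper. The sketch below is illustrative only: `diagnose` is a hypothetical function, and it assumes its input is one entry of a Pod's `status.containerStatuses` as returned by `kubectl get pod -o json`. It only inspects the waiting reason and the last termination state, so it narrows the search rather than replacing `kubectl describe` and log inspection.

```python
def diagnose(container_status: dict) -> str:
    """Return a first hypothesis for a failing container (hypothetical helper)."""
    waiting = container_status.get("state", {}).get("waiting", {})
    last = container_status.get("lastState", {}).get("terminated", {})
    reason = waiting.get("reason", "")

    if reason in ("ImagePullBackOff", "ErrImagePull"):
        return "image pull failure: check registry access and pull secrets"
    if reason == "CrashLoopBackOff":
        if last.get("reason") == "OOMKilled":
            return "out of memory: raise the memory limit or reduce the footprint"
        if last.get("exitCode", 0) != 0:
            return "application error: inspect the previous container's logs"
        return "crash loop: check configuration and dependencies"
    return "no known pattern: collect events and logs"


status = {
    "state": {"waiting": {"reason": "CrashLoopBackOff"}},
    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
}
print(diagnose(status))
```
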

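Required Qualifications

"Direct customer-facing experience"

Customer-facing experience means more than having talked with customers. You must be able to explain technically complex material to non-technical decision makers and also hold in-depth technical discussions with engineering teams.

"Production Kubernetes cluster administration and Helm expertise"

This asks for production-grade Kubernetes operations experience: not minikube on a personal project, but managing multi-node clusters that serve real traffic. For Helm, the expectation is chart development and customization, not just usage.

"Cloud infrastructure (Azure, AWS, GCP), networking, virtualization"

Multi-cloud knowledge is essential. Each customer uses a different cloud, so you need at least a working understanding of all three. Networking (VPCs, subnets, peering, private endpoints) and virtualization (VMware, KVM) matter in particular.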

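3. Tech Stack Deep Dive

3-1. Advanced Kubernetes (Production Cluster Operations)

Kubernetes is the single most important technology for this position. Your ability to operate clusters in production largely decides whether you get the offer.

Understanding the Cluster Architecture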

                     ┌─────────────────────────────┐
                     │       Control Plane          │
                     │  ┌─────────┐ ┌───────────┐  │
                     │  │kube-api │ │ scheduler │  │
                     │  │ server  │ │           │  │
                     │  └────┬────┘ └───────────┘  │
                     │  ┌────┴────┐ ┌───────────┐  │
                     │  │  etcd   │ │controller │  │
                     │  │         │ │ manager   │  │
                     │  └─────────┘ └───────────┘  │
                     └──────────────┬──────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
     ┌────────┴────────┐  ┌────────┴────────┐  ┌────────┴────────┐
     │   Worker Node 1  │  │   Worker Node 2  │  │  GPU Node (AI)  │
     │  ┌─────┐┌─────┐ │  │  ┌─────┐┌─────┐ │  │  ┌─────┐┌─────┐│
     │  │ Pod ││ Pod │ │  │  │ Pod ││ Pod │ │  │  │ Pod ││ Pod ││
     │  └─────┘└─────┘ │  │  └─────┘└─────┘ │  │  │ GPU ││ GPU ││
     │  kubelet+kproxy  │  │  kubelet+kproxy  │  │  └─────┘└─────┘│
     └──────────────────┘  └──────────────────┘  └─────────────────┘

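Core control plane components:

kube-apiserver: the entry point for every API request; manages cluster state through a REST API
etcd: the distributed key-value store that holds all cluster state; backups are critical
kube-scheduler: places Pods on suitable nodes; Pods that request GPUs are placed on GPU nodes
kube-controller-manager: reconciles the state of Deployments, ReplicaSets, DaemonSets, and so on

Production Deployment Strategies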

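# Rolling Update - the most common strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: north-ai-api
spec:
  replicas: 3
  selector: # required in apps/v1; must match the Pod template labels
    matchLabels:
      app: north-ai-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: north-ai-api
    spec: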
      containers:
        - name: north-api
          image: cohere/north-api:v2.1.0
          resources:
            requests:
              memory: '4Gi'
              cpu: '2'
            limits:
              memory: '8Gi'
              cpu: '4'
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30

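Deployment strategy comparison:

Strategy	Downtime	Rollback speed	Resource usage	Best suited for
Rolling Update	None	Moderate	Gradual increase	Routine updates
Blue-Green	None	Immediate	2x	Critical updates
Canary	None	Immediate	Slight increase	High-risk changes
Recreate	Yes	Slow	Unchanged	Compatibility problems between versions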
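
To make the Rolling Update row concrete, the following sketch computes the Pod-count bounds that `maxSurge` and `maxUnavailable` impose during a rollout. The function names are hypothetical. It relies on the Kubernetes rule that a percentage `maxSurge` is rounded up and a percentage `maxUnavailable` is rounded down.

```python
import math


def _resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or an 'N%' string into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rolling_update_bounds(replicas: int, max_surge, max_unavailable):
    """Return (minimum available Pods, maximum total Pods) during the rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


# The Deployment above: 3 replicas, maxSurge 1, maxUnavailable 1
print(rolling_update_bounds(3, 1, 1))
```

With the manifest's settings, at least 2 Pods stay available and at most 4 exist at any moment, which is why a rolling update needs only a small amount of spare capacity compared with Blue-Green.
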

RBAC (Role-Based Access Control)

In enterprise environments, RBAC configuration is the foundation of security.

# Per-customer namespace isolation
apiVersion: v1
kind: Namespace
metadata:
  name: customer-rbc
  labels:
    customer: rbc
    environment: production
---
# Role defined according to the principle of least privilege
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: customer-rbc
  name: north-deployer
rules:
  - apiGroups: ['apps']
    resources: ['deployments', 'statefulsets']
    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch']
  - apiGroups: ['']
    resources: ['pods', 'services', 'configmaps', 'secrets']
    verbs: ['get', 'list', 'watch', 'create', 'update']
  - apiGroups: ['']
    resources: ['pods/log']
    verbs: ['get']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: customer-rbc
  name: north-deployer-binding
subjects:
  - kind: ServiceAccount
    name: north-deploy-sa
    namespace: customer-rbc
roleRef:
  kind: Role
  name: north-deployer
  apiGroup: rbac.authorization.k8s.io

Network Isolation with NetworkPolicy

# Allow traffic only between North AI platform Pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: north-platform-policy
  namespace: customer-rbc
spec:
  podSelector:
    matchLabels:
      app: north-platform
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: north-platform
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: north-platform
      ports:
        - protocol: TCP
          port: 8080
    - to: # allow DNS lookups
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP # DNS falls back to TCP for large responses
          port: 53

Resource Management

# LimitRange sets per-container defaults for the namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: north-limits
  namespace: customer-rbc
spec:
  limits:
    - default:
        memory: '2Gi'
        cpu: '1'
      defaultRequest:
        memory: '512Mi'
        cpu: '250m'
      type: Container
---
# ResourceQuota caps total consumption for the whole namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: north-quota
  namespace: customer-rbc
spec:
  hard:
    requests.cpu: '32'
    requests.memory: '64Gi'
    limits.cpu: '64'
    limits.memory: '128Gi'
    requests.nvidia.com/gpu: '8'
    pods: '50'
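
Before a deployment, it helps to check whether the planned Pods fit inside the namespace quota. The sketch below is a simplified illustration, not the admission controller's actual logic: the function names are hypothetical, it handles only the `m` CPU suffix and binary memory suffixes, and it checks requests and Pod count only.

```python
_MEMORY_FACTORS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> float:
    """'250m' -> 0.25 cores, '2' -> 2.0 cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """'512Mi' -> bytes; a plain number is taken as bytes."""
    for suffix, factor in _MEMORY_FACTORS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)


def fits_quota(pod_requests: list, quota: dict) -> bool:
    """Check summed CPU/memory requests and the Pod count against a quota."""
    total_cpu = sum(parse_cpu(p["cpu"]) for p in pod_requests)
    total_memory = sum(parse_memory(p["memory"]) for p in pod_requests)
    return (
        total_cpu <= parse_cpu(quota["requests.cpu"])
        and total_memory <= parse_memory(quota["requests.memory"])
        and len(pod_requests) <= int(quota["pods"])
    )


quota = {"requests.cpu": "32", "requests.memory": "64Gi", "pods": "50"}
planned = [{"cpu": "2", "memory": "4Gi"}] * 3
print(fits_quota(planned, quota))
```
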

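GPU Node Management

GPU management is especially important for AI model serving.

# tolerations + nodeSelector to place Pods on GPU nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: north-model-server
spec:
  selector: # required in apps/v1; must match the Pod template labels
    matchLabels:
      app: north-model-server
  template:
    metadata:
      labels:
        app: north-model-server
    spec: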
      nodeSelector:
        accelerator: nvidia-a100
      tolerations:
        - key: 'nvidia.com/gpu'
          operator: 'Exists'
          effect: 'NoSchedule'
      containers:
        - name: model-server
          image: cohere/north-model:latest
          resources:
            limits:
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-storage
              mountPath: /models
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: '16Gi'

Monitoring Stack

# Prometheus ServiceMonitor configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: north-platform-monitor
  namespace: monitoring
spec:
  namespaceSelector: # target Services live outside the monitoring namespace
    matchNames:
      - customer-rbc
  selector:
    matchLabels:
      app: north-platform
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Key monitoring metrics:

Pod level: CPU/memory utilization, restart count, OOM kill count
Node level: node availability, disk usage, GPU utilization
Cluster level: scheduling latency, etcd latency, API server response time
AI workload: inference latency, tokens-per-second throughput, queue depth

Incident Response Procedure

# 1. Identify the problem node
kubectl get nodes -o wide
kubectl describe node problematic-node

# 2. Cordon the node (block new Pod scheduling)
kubectl cordon problematic-node

# 3. Move existing workloads off safely (drain)
kubectl drain problematic-node --ignore-daemonsets --delete-emptydir-data

# 4. After fixing the problem, bring the node back
kubectl uncordon problematic-node

# 5. Check Pod status
kubectl get pods -n customer-rbc -o wide
kubectl logs -n customer-rbc pod-name --previous

Recommended certification path:

CKA (Certified Kubernetes Administrator): focused on cluster administration - essential
CKAD (Certified Kubernetes Application Developer): application deployment - recommended
CKS (Certified Kubernetes Security Specialist): security - a plus
3-2. Helm Master Class

Helm is the package manager for Kubernetes and a tool you will use daily in this position. Because the North AI platform is deployed as a Helm Chart, chart development skills are essential.

Helm Chart Structure

north-ai-platform/
  Chart.yaml          # chart metadata
  Chart.lock          # dependency lock file
  values.yaml         # default values
  values-production.yaml   # production overrides
  values-staging.yaml      # staging overrides
  templates/
    _helpers.tpl      # shared template helpers
    deployment.yaml   # Deployment resource
    service.yaml      # Service resource
    ingress.yaml      # Ingress resource
    configmap.yaml    # ConfigMap
    secret.yaml       # Secret
    hpa.yaml          # HorizontalPodAutoscaler
    pdb.yaml          # PodDisruptionBudget
    networkpolicy.yaml
    serviceaccount.yaml
    NOTES.txt         # post-install notes
  charts/             # subcharts (dependencies)
  tests/
    test-connection.yaml

Writing Chart.yaml

apiVersion: v2
name: north-ai-platform
description: Cohere North AI Platform for Enterprise Deployment
type: application
version: 2.1.0 # chart version
appVersion: '3.5.0' # application version
keywords:
  - ai
  - llm
  - cohere
  - enterprise
dependencies:
  - name: postgresql
    version: '12.x.x'
    repository: 'https://charts.bitnami.com/bitnami'
    condition: postgresql.enabled
  - name: redis
    version: '17.x.x'
    repository: 'https://charts.bitnami.com/bitnami'
    condition: redis.enabled

Go Templates and Sprig Functions

# templates/_helpers.tpl
{{- define "north.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name .Chart.Name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}

{{- define "north.labels" -}}
helm.sh/chart: {{ printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" }}
app.kubernetes.io/name: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

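{{/* GPU settings are read from modelServer.gpu, matching values.yaml */}}
{{- define "north.gpuResources" -}}
{{- if .Values.modelServer.gpu.enabled }}
resources:
  limits:
    nvidia.com/gpu: {{ .Values.modelServer.gpu.count | default 1 }}
  requests:
    nvidia.com/gpu: {{ .Values.modelServer.gpu.count | default 1 }}
{{- end }}
{{- end }}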

Per-Environment values.yaml Override Strategy

# values.yaml (defaults)
replicaCount: 1
image:
  repository: cohere/north-platform
  tag: 'latest'
  pullPolicy: IfNotPresent

modelServer:
  replicas: 1
  gpu:
    enabled: false
    count: 1
    type: ''
  resources:
    requests:
      memory: '8Gi'
      cpu: '4'
    limits:
      memory: '16Gi'
      cpu: '8'

ingress:
  enabled: true
  className: nginx
  tls:
    enabled: true

monitoring:
  enabled: true
  prometheus:
    scrape: true

security:
  networkPolicy:
    enabled: true
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 1000

# values-production-rbc.yaml (overrides for the RBC customer)
replicaCount: 3

image:
  repository: harbor.rbc.internal/cohere/north-platform
  tag: '3.5.0'
  pullPolicy: Always
  pullSecret: rbc-harbor-secret

modelServer:
  replicas: 2
  gpu:
    enabled: true
    count: 4
    type: nvidia-a100-80gb
  resources:
    requests:
      memory: '64Gi'
      cpu: '16'
    limits:
      memory: '128Gi'
      cpu: '32'

ingress:
  enabled: true
  className: nginx
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: 'true'
    nginx.ingress.kubernetes.io/proxy-body-size: '100m' # sets nginx client_max_body_size
  hosts:
    - host: north-ai.rbc.internal
  tls:
    enabled: true
    secretName: rbc-tls-secret

persistence:
  enabled: true
  storageClass: rbc-premium-ssd
  size: 500Gi

security:
  networkPolicy:
    enabled: true
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
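
How an override file combines with the defaults is easy to get wrong, so here is a minimal sketch of the merge behavior. It approximates how Helm coalesces values files passed with `-f`: maps are merged recursively, scalars and lists from the later file replace the earlier value, and a `null` override removes the key. `merge_values` is a hypothetical name, not part of Helm.

```python
import copy


def merge_values(base: dict, override: dict) -> dict:
    """Approximate Helm's values merge: deep-merge maps, replace everything else."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if value is None:
            result.pop(key, None)  # a null override drops the default
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            result[key] = copy.deepcopy(value)  # scalars and lists are replaced
    return result


defaults = {
    "replicaCount": 1,
    "modelServer": {"replicas": 1, "gpu": {"enabled": False, "count": 1, "type": ""}},
}
rbc = {"replicaCount": 3, "modelServer": {"gpu": {"enabled": True, "count": 4}}}
print(merge_values(defaults, rbc))
```

The practical consequence is that a list such as `ingress.hosts` must be restated in full in the override file, because lists are never merged element by element.
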

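Using Helm Hooks

# pre-install/pre-upgrade hook: database migration
# (assumes the database is already reachable; a database created by this same
# release does not exist yet when pre-install hooks run)
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
  annotations:
    'helm.sh/hook': pre-install,pre-upgrade
    'helm.sh/hook-weight': '-5' # lower weights run first
    'helm.sh/hook-delete-policy': hook-succeeded # keep failed Jobs for debugging
spec:
  backoffLimit: 1 # do not retry a failing migration indefinitely
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migration
          image: cohere/north-migration:latest
          command: ['./migrate', '--direction', 'up']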
Testing the Chart

# Lint check
helm lint ./north-ai-platform

# Inspect the rendered templates
helm template my-release ./north-ai-platform -f values-production-rbc.yaml

# Simulate an install with a dry run
helm install my-release ./north-ai-platform --dry-run --debug

# Run the chart tests
helm test my-release

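Distributing Charts through an OCI Registry

# Package the chart
helm package ./north-ai-platform

# Push to the OCI registry
helm push north-ai-platform-2.1.0.tgz oci://registry.example.com/charts

# Install from the OCI registry
helm install north oci://registry.example.com/charts/north-ai-platform --version 2.1.0

Managing Multiple Charts with Helmfile

# helmfile.yaml
repositories:
  - name: bitnami
    url: https://charts.bitnami.com/bitnami
  # repositories for the prometheus-community/ and ingress-nginx/ charts used below
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts
  - name: ingress-nginx
    url: https://kubernetes.github.io/ingress-nginx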

releases:
  - name: north-platform
    chart: ./charts/north-ai-platform
    namespace: north-system
    values:
      - values/common.yaml
      - values/production.yaml

  - name: monitoring
    chart: prometheus-community/kube-prometheus-stack
    namespace: monitoring
    values:
      - values/monitoring.yaml

  - name: ingress
    chart: ingress-nginx/ingress-nginx
    namespace: ingress
    values:
      - values/ingress.yaml

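3-3. Cloud Infrastructure (Azure/AWS/GCP)

Managed Kubernetes Comparison

Feature	AKS (Azure)	EKS (AWS)	GKE (Google)
Control plane cost	Free tier available; Standard tier billed hourly	About 0.10 USD per hour	About 0.10 USD per hour (a free-tier credit covers one zonal or Autopilot cluster)
GPU support	A100, H100	A100, H100 (P5 instances)	A100, H100, TPU
Max nodes	5,000	500 (managed node groups)	15,000
Private clusters	Supported	Supported	Supported
Service mesh	Istio, OSM	App Mesh, Istio	Anthos SM
GitOps integration	Flux (native)	ArgoCD	Config Sync
ML platform	Azure ML	SageMaker	Vertex AI
Hybrid / on-premises option	Azure Stack	Outposts	Anthos

Pricing, node limits, and service mesh offerings change frequently (Open Service Mesh, for example, has since been archived), so verify every row against the providers' current documentation before quoting it to a customer.

Core Networking Concepts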

┌─────────────────────────────────────────────┐
│                    VPC                       │
│  ┌──────────────────┐ ┌──────────────────┐  │
│  │  Public Subnet    │ │  Public Subnet    │  │
│  │  (AZ-1)          │ │  (AZ-2)          │  │
│  │  Load Balancer   │ │  Load Balancer   │  │
│  └────────┬─────────┘ └────────┬─────────┘  │
│           │                     │            │
│  ┌────────┴─────────┐ ┌────────┴─────────┐  │
│  │  Private Subnet   │ │  Private Subnet   │  │
│  │  (AZ-1)          │ │  (AZ-2)          │  │
│  │  K8s Worker      │ │  K8s Worker      │  │
│  │  Nodes           │ │  Nodes           │  │
│  └────────┬─────────┘ └────────┬─────────┘  │
│           │                     │            │
│  ┌────────┴─────────┐ ┌────────┴─────────┐  │
│  │  Data Subnet      │ │  Data Subnet      │  │
│  │  (AZ-1)          │ │  (AZ-2)          │  │
│  │  DB, Storage     │ │  DB, Storage     │  │
│  └──────────────────┘ └──────────────────┘  │
│                                              │
│  ┌──────────────────────────────────────┐   │
│  │  Private Endpoint (Storage/DB)       │   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

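Core networking elements:

VPC Peering / Transit Gateway: connectivity between multiple VPCs
Private Endpoint / PrivateLink: service access that never crosses the public internet
Network Security Group / Security Group: inbound/outbound traffic control
DNS: internal service name resolution through a Private DNS Zone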
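
The three-tier, two-zone layout in the diagram above can be planned with Python's standard `ipaddress` module. This is a sketch under assumptions: the `10.0.0.0/16` range, the `/20` subnet size, and the tier and zone names are illustrative choices, and `plan_subnets` is a hypothetical helper.

```python
import ipaddress


def plan_subnets(vpc_cidr: str,
                 tiers=("public", "private", "data"),
                 zones=("az1", "az2"),
                 new_prefix: int = 20) -> dict:
    """Carve consecutive, non-overlapping subnets out of a VPC range.

    Raises StopIteration if the VPC range is too small for all subnets.
    """
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = {}
    for tier in tiers:
        for zone in zones:
            plan[(tier, zone)] = str(next(blocks))
    return plan


for (tier, zone), cidr in plan_subnets("10.0.0.0/16").items():
    print(f"{tier:8} {zone}  {cidr}")
```

Leaving unallocated blocks at the end of the range, as this plan does, keeps room for additional zones or a dedicated GPU node subnet later.
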

IAM Strategy

Workload identity mapping on each cloud:

Azure: Managed Identity -> Microsoft Entra Workload ID (successor to the deprecated AAD Pod Identity)
AWS:   IAM Role -> IRSA (IAM Roles for Service Accounts)
GCP:   Service Account -> Workload Identity Federation for GKE

Recommended certifications:

Azure: AZ-104 (Azure Administrator) or AZ-305 (Solutions Architect)
AWS: SAA-C03 (Solutions Architect Associate)
GCP: Professional Cloud Architect

3-4. DevOps and CI/CD

GitOps Workflow

Developer -> Git Push -> GitHub/GitLab
                             │
                    ┌────────┴────────┐
                    │                 │
              CI Pipeline        ArgoCD/Flux
              (Build/Test)     (watches repo)
                    │                 │
              Container        Sync to K8s
              Registry         Cluster
                    │                 │
                    └────────┬────────┘
                             │
                      K8s Cluster
                    (Desired State)

Example CI/CD Pipeline (GitHub Actions)

name: North Platform CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Helm Lint
        run: helm lint ./charts/north-ai-platform
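      - name: Template Validation
        # Note: client-side dry-run still contacts an API server for resource
        # discovery; on a runner without cluster access, use an offline schema
        # validator such as kubeconform instead.
        run: |
          helm template test ./charts/north-ai-platform \
            -f values/test.yaml \
            | kubectl apply --dry-run=client -f -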

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
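      - name: Trivy Chart Scan
        # In production pipelines, pin this action to a release tag or commit
        # SHA rather than the moving master branch.
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: config
          scan-ref: ./charts/north-ai-platform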

  build-and-push:
    needs: [lint-and-test, security-scan]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and Push Chart
        run: |
          helm package ./charts/north-ai-platform
          helm push north-ai-platform-*.tgz oci://registry.example.com/charts

IaC: Provisioning a K8s Cluster with Terraform

# Provision an Azure AKS cluster
resource "azurerm_kubernetes_cluster" "north" {
  name                = "north-aks-cluster"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  dns_prefix          = "north"
  kubernetes_version  = "1.29"

  default_node_pool {
    name       = "system"
    node_count = 3
    vm_size    = "Standard_D4s_v3"
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

  private_cluster_enabled = true
}

# GPU node pool
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpupool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.north.id
  vm_size              = "Standard_NC24ads_A100_v4"
  node_count           = 2

  node_taints = [
    "nvidia.com/gpu=present:NoSchedule"
  ]

  node_labels = {
    "accelerator" = "nvidia-a100"
  }
}

Secret Management

# Integrating HashiCorp Vault with K8s (Secrets Store CSI Driver)
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: north-secrets
spec:
  provider: vault
  parameters:
    vaultAddress: 'https://vault.internal:8200'
    roleName: 'north-platform'
    objects: |
      - objectName: "db-password"
        secretPath: "secret/data/north/database"
        secretKey: "password"
      - objectName: "api-key"
        secretPath: "secret/data/north/api"
        secretKey: "key"

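3-5. Private Cloud and On-Premises Deployment

This section covers the most important differentiating skill for the position. Many engineers can run Kubernetes on a public cloud, but experience deploying into air-gapped environments is rare.

What Makes Air-Gapped Environments Special

An air-gapped environment is a network completely isolated from the external internet. It is used mainly by financial (RBC), military, and medical institutions.

Main challenges in an air-gapped environment:

Delivering container images: no access to Docker Hub or GitHub Container Registry
Installing packages: no access to apt, yum, pip, or npm repositories
Downloading Helm Charts: no access to chart repositories
Certificate management: external CAs such as Let's Encrypt cannot be used
Time synchronization: access to NTP servers may be restricted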

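Building a Harbor Mirror Registry

# Staging Harbor images for an air-gapped install
# (illustrative subset; Harbor's official offline installer bundles the full image set)
# 1. On an internet-connected machine, pull every image
docker pull goharbor/harbor-core:v2.10.0
docker pull goharbor/harbor-db:v2.10.0
docker pull goharbor/harbor-jobservice:v2.10.0
docker pull goharbor/harbor-portal:v2.10.0
docker pull goharbor/nginx-photon:v2.10.0
docker pull goharbor/registry-photon:v2.10.0

# 2. Save all pulled images into a single tar archive
docker save -o harbor-images.tar \
  goharbor/harbor-core:v2.10.0 \
  goharbor/harbor-db:v2.10.0 \
  goharbor/harbor-jobservice:v2.10.0 \
  goharbor/harbor-portal:v2.10.0 \
  goharbor/nginx-photon:v2.10.0 \
  goharbor/registry-photon:v2.10.0

# 3. Transfer the archive into the air-gapped environment on physical media

# 4. Load the images inside the air-gapped environment
docker load -i harbor-images.tar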

Offline Helm Chart Bundle

# On an internet-connected machine, download the chart and its dependencies
helm repo add bitnami https://charts.bitnami.com/bitnami
helm pull oci://registry.example.com/charts/north-ai-platform --version 2.1.0
helm pull bitnami/postgresql --version 12.5.0
helm pull bitnami/redis --version 17.3.0

# Extract the list of all container images
# ($NF handles both "image: x" and "- image: x"; tr strips surrounding quotes)
helm template north ./north-ai-platform-2.1.0.tgz | \
  awk '/image:/ {print $NF}' | tr -d "\"'" | sort -u > image-list.txt

# Pull and save the images in bulk
while read -r image; do
  docker pull "$image"
done < image-list.txt

xargs docker save -o north-platform-images.tar < image-list.txt

# Load inside the air-gapped environment
docker load -i north-platform-images.tar
# Retag for the internal Harbor registry, then push
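
The final retagging step is where image paths most often go wrong, so a sketch of the name rewriting may help. `retarget_image` is a hypothetical helper and `harbor.internal` is a placeholder hostname. It assumes Docker's convention that the first path component is a registry host only if it contains a dot or a colon or equals `localhost`, and it maps bare Docker Hub names to a `library/` project, which is a layout choice rather than a Harbor requirement.

```python
def retarget_image(image: str, internal_registry: str) -> str:
    """Rewrite an image reference so that it points at the internal registry."""
    first, _, rest = image.partition("/")
    has_registry = rest != "" and ("." in first or ":" in first or first == "localhost")
    if has_registry:
        path = rest                   # drop the original registry host
    elif "/" not in image:
        path = "library/" + image     # Docker Hub official image, e.g. nginx:1.25
    else:
        path = image                  # Docker Hub user/org image
    return f"{internal_registry}/{path}"


for ref in ("docker.io/bitnami/redis:7.2", "bitnami/postgresql:15", "nginx:1.25"):
    target = retarget_image(ref, "harbor.internal")
    print(f"docker tag {ref} {target} && docker push {target}")
```

The same mapping has to be applied to the image repositories in the values file, otherwise the chart will still try to pull from the public registry.
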

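On-Premises Kubernetes Options

Tool	Characteristics	Best suited for
RKE2	Rancher's security-hardened K8s with FIPS 140-2 compliance support	Finance, government
Kubespray	Flexible, Ansible-based installation	Custom environments
Tanzu	VMware-integrated K8s	VMware customers
OpenShift	Red Hat's enterprise K8s	Large enterprises
k3s	Lightweight K8s	Edge, IoT

Data Residency and Regulatory Compliance

Regulation	Region	Key requirements
GDPR	EU	Lawful basis such as consent for processing, right to be forgotten, DPO appointment
PIPA	Korea	Consent for collection and use of personal information, restrictions on overseas transfer
FISC	Japan	Security guidelines for financial systems; domestic data storage is commonly expected
OSFI	Canada	Technology risk management for financial institutions
HIPAA	USA	Protection of health data, including encryption safeguards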
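GPU Infrastructure Management

# NVIDIA device plugin for an air-gapped cluster (image served from the internal registry).
# The namespace below is reserved for the full GPU Operator, which is normally
# installed separately through its Helm chart.
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
# NVIDIA Device Plugin DaemonSet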
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin
          image: harbor.internal/nvidia/k8s-device-plugin:v0.14.0
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ['ALL']
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

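NVIDIA MIG (Multi-Instance GPU) configuration:

MIG partitions a large GPU such as the A100 into several isolated instances, so that multiple models can be served at the same time.

A100 80GB GPU
├── MIG 1g.10gb (small model serving)
├── MIG 2g.20gb (medium model serving)
└── MIG 4g.40gb (large model serving)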

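3-6. AI Model Deployment

LLM Serving Frameworks

Framework	Characteristics	Best suited for
vLLM	High throughput through PagedAttention	General-purpose LLM serving
TensorRT-LLM	NVIDIA-optimized, highest performance	NVIDIA GPU environments
Triton Inference Server	Multi-model, multi-framework	Composite model pipelines
Text Generation Inference	Integrated with the Hugging Face ecosystem	Serving Hugging Face models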

RAG Architecture

User query
    │
    ▼
┌───────────┐    ┌──────────────┐
│ Embed v3  │───▶│ Vector DB    │
│ (query    │    │ (search)     │
│ embedding)│    │ Pinecone/    │
└───────────┘    │ Weaviate/    │
                 │ pgvector     │
                 └──────┬───────┘
                        │ Top-K documents
                        ▼
                 ┌──────────────┐
                 │ Rerank v3    │
                 │ (reranking)  │
                 └──────┬───────┘
                        │ refined documents
                        ▼
                 ┌──────────────┐
                 │ Command R+   │
                 │ (generation) │
                 └──────────────┘
                        │
                        ▼
                   Final answer

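Model Optimization Techniques

Quantization: FP32 -> FP16 -> INT8 -> INT4, reducing model size and memory footprint
Knowledge distillation: transferring the knowledge of a large model into a smaller one
KV cache optimization: improved memory efficiency through PagedAttention
Batching: maximizing throughput with continuous batching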
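
The effect of quantization on GPU sizing is simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below covers weights only and ignores the KV cache, activations, and framework overhead, so real deployments need additional headroom. The 7-billion-parameter figure is just an example size, not a statement about any Cohere model.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```
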

Monitoring Metrics

Key AI serving metrics:
├── Latency
│   ├── Time to First Token (TTFT)
│   ├── Inter-Token Latency (ITL)
│   └── End-to-End Latency
├── Throughput
│   ├── Tokens per Second (TPS)
│   ├── Requests per Second (RPS)
│   └── Concurrent Requests
├── Resources
│   ├── GPU Memory Utilization
│   ├── GPU Compute Utilization
│   └── KV Cache Hit Rate
└── Quality
    ├── Error Rate
    ├── Timeout Rate
    └── Queue Depth
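
The latency and throughput metrics above can all be derived from one request's token arrival times. The sketch below assumes you can record a timestamp (in seconds) for the request start and for each streamed token; `serving_metrics` is a hypothetical helper, and TPS here is per request, not aggregated across the server.

```python
def serving_metrics(request_start: float, token_times: list) -> dict:
    """Compute TTFT, mean ITL, end-to-end latency, and per-request TPS."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    e2e = token_times[-1] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    tps = len(token_times) / e2e if e2e > 0 else float("inf")
    return {"ttft": ttft, "itl": itl, "e2e": e2e, "tps": tps}


print(serving_metrics(0.0, [0.5, 0.75, 1.0, 1.25]))
```

A high TTFT with a low ITL usually points at queueing or prefill cost, while a rising ITL under load suggests the decode batch is saturated.
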

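3-7. Customer Engagement (Soft Skills)

For an FDE, customer engagement skills matter as much as technical ability.

Technical Presentation Skills

Tailor to the audience: architecture level for the CTO, implementation level for the IT team
Demo preparation: always have a recorded backup for any live demo
Handling questions: admit honestly what you do not know and commit to a follow-up

Customer Requirements Gathering Framework

1. Understand the current state (As-Is)
   - Existing infrastructure
   - AI/ML tools in use
   - Team structure and skills

2. Define the target state (To-Be)
   - Desired AI use cases
   - Performance and scalability requirements
   - Security and regulatory requirements

3. Gap analysis
   - Technical gaps
   - Process gaps
   - Staffing gaps

4. Implementation plan
   - Phased milestones
   - Risks and mitigations
   - Timeline and resources

Language Skills

This position is based in Japan, so:

English: technical writing and communication with global teams (required)
Japanese: working with Japanese customers (a major advantage)
Korean: working with Korean customers such as LG CNS (a bonus)

Customer Communication Templates for Incident Response

[On detecting the incident] (within 5 minutes)
"We have confirmed [symptom] on [system name].
We have started root cause analysis and will provide an update within [expected time]."

[On identifying the cause]
"The cause has been identified as [root cause].
We are applying [remediation], and the estimated time to recovery is [ETA]."

[On resolution]
"Service was restored to normal at [time].
Root cause: [detailed explanation]
Prevention: [measures]
We will deliver a detailed RCA (Root Cause Analysis) report by [deadline]."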

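4. 25 Likely Interview Questions

Kubernetes and Infrastructure (8 questions)

Q1. Describe your response procedure when a node fails in a production Kubernetes cluster.

Key points for a strong answer: detect the failure (monitoring alert) -> assess the blast radius (which Pods run on the node) -> cordon to block new scheduling -> drain to move workloads -> fix or replace the node -> uncordon to restore it -> post-mortem

Q2. Describe your etcd backup and recovery strategy.

Key points for a strong answer: regular etcd snapshot backups, criteria for choosing the backup interval, the recovery procedure, maintaining quorum in a distributed setup, and the impact of etcd performance on the whole cluster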
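
The quorum point in Q2 is worth being able to compute on the spot. etcd uses Raft, so a cluster of n members needs a majority of floor(n/2) + 1 to accept writes; the sketch below shows why odd member counts are preferred.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while the cluster still has quorum."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Four members tolerate the same single failure as three, so the extra member adds replication cost without adding resilience.
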

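Q3. Describe how you debug a Pod that is stuck in the Pending state.

Key points for a strong answer: check events with kubectl describe pod -> check for resource shortages (CPU/memory/GPU) -> check PVC binding -> check nodeSelector/affinity/taints -> check scheduler logs

Q4. Explain how to control communication between microservices with NetworkPolicy.

Key points for a strong answer: start from a default deny-all policy -> allowlist only the required traffic -> isolate namespaces from each other -> restrict external traffic with egress rules

Q5. Explain how GPU resources are managed in Kubernetes.

Key points for a strong answer: NVIDIA Device Plugin, resource limits, nodeSelector, tolerations, MIG configuration, monitoring (DCGM exporter)

Q6. Explain the role of a PodDisruptionBudget (PDB) and how you would configure it.

Key points for a strong answer: guarantees a minimum number of available Pods during voluntary disruptions (upgrades, scale-downs), minAvailable vs. maxUnavailable, interaction with StatefulSets

Q7. Explain the difference between the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler, and when to use each.

Key points for a strong answer: HPA adjusts the number of Pods, VPA adjusts resource requests. AI inference services mostly rely on horizontal scaling of GPU Pods. Scaling on custom metrics (queue length, latency)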
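
For Q7 it helps to know the HPA scaling rule itself: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below implements only that core formula and leaves out stabilization windows, min/max replica bounds, and readiness handling.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Core HPA formula: scale in proportion to the metric ratio."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


# Queue depth per Pod is 200 against a target of 100, with 4 replicas running
print(desired_replicas(4, 200, 100))
```
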

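Q8. Explain how to implement namespace-level isolation in a multi-tenant environment.

Key points for a strong answer: a combination of namespaces + RBAC + NetworkPolicy + ResourceQuota + LimitRange + Pod Security Standards

Helm and Deployment Strategy (5 questions)

Q9. Explain your values.yaml override strategy for Helm Charts. How do you manage multiple environments (dev/staging/prod)?

Key points for a strong answer: a base values.yaml plus per-environment override files, use of Helmfile, a secret management strategy (Sealed Secrets, SOPS)

Q10. Share your experience automating deployments with Helm Hooks.

Key points for a strong answer: DB migrations with pre-install/pre-upgrade, initial setup with post-install, controlling execution order with hook weights

Q11. Describe the end-to-end process of deploying a Helm Chart into an air-gapped environment.

Key points for a strong answer: package the chart -> extract the image list -> pull and save the images -> transfer on physical media -> load into the internal registry -> adjust values (image paths) -> install

Q12. Explain how to implement a canary deployment with Kubernetes and Helm.

Key points for a strong answer: deploy the canary version as a separate Deployment, use the Service selector, control the traffic split with an Istio VirtualService, promote or roll back automatically based on metrics

Q13. Explain your strategy for managing CRDs (Custom Resource Definitions) in a Helm Chart.

Key points for a strong answer: use of the crds/ directory, caveats around CRD upgrades, the fact that Helm neither upgrades nor deletes CRDs installed from crds/, the pattern of managing CRDs in a separate chart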

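Cloud and Networking (5 questions)

Q14. Compare the strengths and weaknesses of the managed Kubernetes services on Azure, AWS, and GCP.

Key points for a strong answer: the comparison table in section 3-3 above, plus insights from real operational experience

Q15. What networking factors must be considered for AI workloads when designing a VPC?

Key points for a strong answer: subnet separation (public/private/data), bandwidth requirements (communication between GPU nodes), Private Endpoints, NAT Gateway, DNS resolution

Q16. Explain how to build a secure connection between on-premises and cloud in a hybrid cloud environment.

Key points for a strong answer: VPN Gateway, ExpressRoute/Direct Connect/Interconnect, mTLS, certificate management, bandwidth planning

Q17. If you have managed multi-cloud infrastructure with Terraform, share that experience.

Key points for a strong answer: separate modules per provider, remote state management (backend), workspace strategy, module reuse patterns

Q18. Describe your experience configuring storage access through private endpoints.

Key points for a strong answer: Azure Private Endpoint / AWS VPC Endpoint / GCP Private Service Connect, DNS configuration, network rules

AI Deployment and Troubleshooting (4 questions)

Q19. What are the main challenges of serving a large LLM on Kubernetes, and how do you address them?

Key points for a strong answer: GPU memory management, model loading time, cold starts, batching optimization, model update strategy (minimizing downtime)

Q20. Share your experience designing and deploying a RAG (Retrieval-Augmented Generation) system.

Key points for a strong answer: criteria for choosing a vector DB, serving the embedding model, the retrieve-rerank-generate pipeline, document chunking strategy, performance tuning

Q21. Describe how you debug recurring OOM (Out of Memory) errors when serving an AI model.

Key points for a strong answer: distinguish GPU memory from system memory, check GPU usage with nvidia-smi, adjust batch size, apply quantization, cap the KV cache size, configure shared memory

Q22. Explain your strategy for updating a model with zero downtime.

Key points for a strong answer: prepare the new model version with a Blue-Green deployment, switch traffic only after a health check confirms the model has finished loading, a rollback plan, the option of A/B testing

Customer Engagement and Situational (3 questions)

Q23. How would you respond to an urgent incident in a customer's production environment?

Key points for a strong answer: immediate response (within 10 minutes), assessing the blast radius, applying a workaround, customer communication (regular updates), root cause analysis, an RCA report, prevention of recurrence

Q24. You need to deploy the North Platform for a customer whose IT team has little Kubernetes experience. How do you approach it?

Key points for a strong answer: assess the customer team's skills, build a phased training plan, documentation (deployment guide, operations guide), an operations handover plan, a post-deployment support period

Q25. How would you respond when a customer asks for something that is technically impossible (for example, real-time model updates in an air-gapped environment)?

Key points for a strong answer: identify the underlying need behind the request, propose alternatives (a regular update cycle, a semi-air-gapped setup), explain the trade-offs clearly, document the technical rationale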

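5. Eight-Month Study Roadmap

Month	Topic	Goal	Key Project
Month 1	Kubernetes fundamentals + CKA prep	Cluster installation; understanding Pods/Services/Deployments	Deploy a 3-tier app on minikube
Month 2	Advanced Kubernetes + CKA certification	RBAC, NetworkPolicy, storage, monitoring	Build a multi-node cluster with kubeadm
Month 3	Helm mastery + cloud fundamentals	Chart development, template engine, AKS/EKS experience	Write and deploy a custom Helm Chart
Month 4	Cloud infrastructure + Terraform	VPC design, IAM, hands-on IaC with Terraform	Provision a K8s cluster with Terraform
Month 5	CI/CD + GitOps	GitHub Actions, ArgoCD, container security	Build a GitOps pipeline
Month 6	Air-gapped/on-premises deployment	Harbor setup, offline deployment, RKE2	Simulate an air-gapped environment
Month 7	AI model deployment	vLLM, RAG architecture, GPU management	Build an LLM serving pipeline
Month 8	Integration project + interview prep	Finish portfolio, mock interviews	Full-stack AI deployment platform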

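Month-by-Month Detailed Plan

Month 1: Kubernetes Fundamentals

Weeks 1-2: Understand K8s architecture, get fluent with kubectl
Weeks 3-4: Hands-on with Deployment, Service, ConfigMap, Secret
1 hour of CKA practice problems daily
Tools: minikube, kind

Month 2: Advanced Kubernetes + CKA

Week 1: RBAC, ServiceAccount
Week 2: NetworkPolicy, Ingress
Week 3: PV/PVC, StorageClass, StatefulSet
Week 4: Sit the CKA exam
Tools: kubeadm, Vagrant

Month 3: Helm + Cloud Introduction

Weeks 1-2: Helm basics, analyze existing charts
Week 3: Custom chart development, template engine
Week 4: First hands-on with AKS or EKS
Project: Package a microservices app as a Helm Chart

Month 4: Cloud Infrastructure + Terraform

Week 1: VPC design, subnets, security groups
Week 2: IAM, service accounts, workload identity
Weeks 3-4: Terraform basics, writing modules
Project: Build a private K8s cluster with Terraform

Month 5: CI/CD + GitOps

Week 1: Build a GitHub Actions pipeline
Week 2: Install and configure ArgoCD
Week 3: Container image scanning with Trivy
Week 4: Integrated CI/CD + GitOps pipeline
Project: Implement push-to-deploy automation

Month 6: Air-Gapped/On-Premises

Week 1: Install and operate Harbor
Week 2: Create offline image/chart bundles
Week 3: Build an air-gapped K8s cluster with RKE2
Week 4: Security hardening (Falco, OPA/Gatekeeper)
Project: Deploy an app in a fully air-gapped environment

Month 7: AI Model Deployment

Week 1: LLM serving with vLLM
Week 2: GPU management, NVIDIA Device Plugin
Week 3: Build a RAG pipeline
Week 4: Monitoring and optimization
Project: Build LLM serving with K8s + Helm

Month 8: Integration + Interview Prep

Weeks 1-2: Finish portfolio projects
Week 3: Mock technical interviews
Week 4: Behavioral interview prep, final resume review
Daily: Rehearse the 25 interview questions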

6. Three Portfolio Project Ideas

Project 1: LLM Serving Pipeline with K8s + Helm

Goal: Build an end-to-end pipeline that deploys and operates an open-source LLM on Kubernetes using a Helm Chart

Tech stack: Kubernetes, Helm, vLLM, Prometheus, Grafana, NVIDIA GPU Operator

Implementation:

Custom Helm Chart for vLLM-based model serving
GPU node management and autoscaling configuration
Inference metrics dashboard with Prometheus + Grafana
Health checks and automatic recovery mechanisms
Blue-Green deployment strategy for model updates
Architecture diagram and deployment guide in the README

GitHub repository structure:

llm-serving-platform/
  charts/
    llm-server/        # Main Helm Chart
    monitoring/         # Monitoring Chart
  terraform/           # Infrastructure provisioning
  docs/
    architecture.md
    deployment-guide.md
  scripts/
    setup.sh
    test.sh

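Project 2: Simulated Air-Gapped Deployment

Goal: Simulate an environment fully isolated from the internet and practice the entire process of deploying an AI service inside it

Tech stack: Vagrant, RKE2, Harbor, Helm, container image bundling

Implementation:

Network-isolated VM environment built with Vagrant
Harbor mirror registry installation and configuration
Air-gapped K8s cluster built with RKE2
All images and charts packaged as an offline bundle
Automation scripts that deploy the application from the bundle
TLS certificate management with an internal CA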
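
For the Harbor mirror and RKE2 items above, RKE2 reads a containerd registry configuration from `/etc/rancher/rke2/registries.yaml` on each node. The sketch below assumes a hypothetical Harbor hostname and CA file path, and assumes Harbor hosts projects that mirror the upstream repository paths; treat it as a starting point rather than a verified configuration for any particular environment.

```yaml
# /etc/rancher/rke2/registries.yaml (per node)
# Hostname and CA path are hypothetical placeholders.
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.airgap.internal"
configs:
  "harbor.airgap.internal":
    tls:
      # CA certificate issued by the internal CA that signed Harbor's TLS cert
      ca_file: /etc/rancher/rke2/certs/internal-ca.crt
```

The file is read at startup, so the `rke2-server` or `rke2-agent` service has to be restarted on each node after a change.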

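Project 3: Multi-Cloud AI Deployment Automation

Goal: An automation system that deploys the same AI service to Azure, AWS, and GCP using Terraform and Helm

Tech stack: Terraform, Helm, Helmfile, GitHub Actions, AKS/EKS/GKE

Implementation:

Terraform modules that provision a K8s cluster per cloud
Cloud-agnostic application deployment with Helmfile
CI/CD pipeline integration with GitHub Actions
Abstraction of per-cloud differences (storage, IAM, networking)
Cost comparison analysis document
Disaster recovery (DR) strategy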

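7. Resume Writing Strategy

Organize Experience in STAR Format

Each experience on your resume should follow the STAR (Situation-Task-Action-Result) format.

Good example:

"Led the migration from a legacy VM-based deployment system to Kubernetes (S/T). Developed Helm Charts and built an ArgoCD-based GitOps pipeline to automate deployments (A). As a result, cut deployment time from 2 hours to 15 minutes (an 87.5% reduction) and reduced deployment-related incidents from 3 per month to 0 (R)."

Core Keywords for the FDE Position

Keywords your resume must include:

Infrastructure: Kubernetes, Helm, Terraform, Docker, CI/CD
Cloud: Azure/AWS/GCP, VPC, IAM, private cloud
Security: RBAC, NetworkPolicy, TLS, air-gapped, regulatory compliance
AI/ML: LLM deployment, GPU management, model serving, RAG
Soft skills: customer engagement, technical presentations, documentation, troubleshooting

Emphasize Customer-Facing Experience

The most differentiating factor for an FDE position is customer-facing experience.

Points to emphasize:

Direct communication with customers (technical meetings, demos, training)
Hands-on deployment and operations in customer environments
Incident response and RCA (Root Cause Analysis) experience
Writing and delivering technical documentation
Explaining technology to non-technical decision-makers

Recommended Resume Structure

1. Summary (3 lines)
   - Years of core experience + area of expertise
   - Your single most impressive achievement
   - Enthusiasm for the FDE role

2. Tech stack
   - Programming: Python, Go, Bash
   - Infrastructure: K8s, Helm, Docker, Terraform
   - Cloud: Azure, AWS, GCP
   - AI/ML: LLM deployment, vLLM, RAG
   - Tools: Git, ArgoCD, Prometheus, Grafana

3. Work experience (STAR format, 3-5 entries)

4. Projects (with GitHub links, 2-3 entries)

5. Certifications
   - CKA, CKAD, cloud certifications

6. Education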

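8. What It Is Like to Work at Cohere

Benefits and Culture

Cohere's benefits rank among the best even for AI startups.

Key benefits:

6 weeks of vacation: twice South Korea's statutory minimum (15 days), and generous by Japanese standards as well
Parental leave with 100% top-up (6 months): tops up government benefits to 100% of salary
Remote work flexibility: work from anywhere in Japan, with travel in the EMEA/APAC regions
Health insurance: comprehensive health, dental, and vision coverage
Learning support: certifications, conferences, and book purchases
Stock options: a chance to build wealth as the startup grows

Growth Opportunities

Experience at the AI frontier:

Hands-on work handling and deploying world-class LLMs
Learning the latest trends and techniques in enterprise AI deployment
Exposure to AI use cases across finance, healthcare, telecommunications, and other industries

Global network:

Collaboration with global enterprise customers such as RBC, Dell, and LG CNS
Experience working with multinational teams (Canada, US, Japan, Europe)
Opportunities to join global AI conferences and communities

Challenges

To be candid, you should also know the challenges of this position.

Travel (20-40%)

You may be at customer sites 1-2 weeks per month
Travel within Japan and across the Asia-Pacific region
Needs to be balanced against remote work

Pressure at customer sites

Incidents in customer production environments are highly stressful
You must adapt quickly to many different customers and environments
You mediate between technical limits and customer expectations

Breadth of the technical scope

You have to cover K8s, Helm, cloud, networking, security, and AI
Continuous learning is essential
Maintaining breadth and depth at the same time is the challenge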

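9. Quiz

Let's check what you have learned.

Q1. Which company first created the Forward Deployed Engineer (FDE) concept, and what is the biggest difference from a regular software engineer?

A: The concept was first created at Palantir Technologies. Whereas a regular software engineer builds the product in-house, an FDE builds and deploys technical solutions directly at the customer site. The key difference is understanding the customer's infrastructure environment and delivering a tailored solution on top of it.

Q2. What process is required to deploy container images in an air-gapped environment?

A: 1) Download all required container images in an internet-connected environment. 2) Save the images to tar files with docker save. 3) Transfer them into the air-gapped environment on physical media (USB, external disk, etc.). 4) Load the images with docker load. 5) Push the images to an internal registry such as Harbor. 6) Change the image paths in the Helm Chart to the internal registry and deploy.

Q3. For a Rolling Update in a production Kubernetes environment, what do maxUnavailable and maxSurge mean, and what values are recommended?

A: maxUnavailable is the maximum number of Pods that may be unavailable at once during the update, and maxSurge is the maximum number of Pods that may be created above the desired count. A common setting is maxUnavailable: 1, maxSurge: 1, which keeps at least replicas-1 Pods serving while the update proceeds gradually. Where availability is critical, as with AI inference services, setting maxUnavailable: 0, maxSurge: 1 allows updates without downtime.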
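
The availability-first setting from Q3 looks like this as a Deployment fragment. It is a sketch with an illustrative replica count; the comments spell out the arithmetic.

```yaml
# Deployment strategy fragment for an availability-sensitive service.
# The replica count is an illustrative assumption.
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below 3 ready Pods
      maxSurge: 1         # at most 1 extra Pod, so up to 4 exist during the rollout
```

For GPU workloads, `maxSurge: 1` only works if the cluster has spare GPU capacity for the extra Pod; otherwise the new Pod stays Pending and the rollout stalls.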

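Q4. In Cohere's RAG architecture, what roles do Embed v3, Rerank v3, and Command R+ each play?

A: Embed v3 converts user queries and documents into vectors, enabling similarity-based search. Rerank v3 re-evaluates the relevance of the initial search results and places the best-matching documents at the top. Command R+ then uses the selected documents as context to generate an accurate answer to the user's question. Together the three models form a retrieve-rerank-generate pipeline.

Q5. In a Helm Chart values.yaml override strategy, why is separating settings per environment (dev/staging/prod) important, and how should secrets be managed?

A: Separating settings per environment lets you deploy the same chart consistently to several environments while applying settings suited to each one (resource sizes, replica counts, image tags, and so on). Put common settings in the base values.yaml and override them with values-dev.yaml, values-staging.yaml, and values-production.yaml. Secrets must never be stored in plaintext in values files; use tools such as Sealed Secrets (Bitnami), SOPS (originally from Mozilla, now a CNCF project), or External Secrets Operator (integrating with AWS/Azure/GCP secret managers).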
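
As an illustration of the External Secrets Operator option in Q5, the manifest below asks the operator to copy one value from an external secret manager into a Kubernetes Secret. It is a sketch under assumptions: the store name `cloud-secret-store`, the remote key `north/database`, and the namespace are hypothetical, and the `apiVersion` depends on the operator version installed in the cluster.

```yaml
# Hypothetical ExternalSecret; all names and keys are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: north-db-credentials
  namespace: north-system
spec:
  refreshInterval: 1h            # how often to re-sync from the external store
  secretStoreRef:
    name: cloud-secret-store     # a ClusterSecretStore configured separately
    kind: ClusterSecretStore
  target:
    name: north-db-credentials   # name of the Kubernetes Secret to create
  data:
    - secretKey: password        # key inside the resulting Secret
      remoteRef:
        key: north/database      # entry in the external secret manager
        property: password
```

The chart then references the resulting Secret by name, so no secret value ever appears in a values file or in Git.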

10. References

Official Documentation

Cohere official docs - docs.cohere.com - Cohere API and model guides
Kubernetes official docs - kubernetes.io/docs - Full Kubernetes reference
Helm official docs - helm.sh/docs - Helm chart development guide
NVIDIA GPU Operator - docs.nvidia.com/datacenter/cloud-native - GPU management guide

Certification Prep

CKA exam guide - training.linuxfoundation.org - CNCF Certified Kubernetes Administrator
CKAD exam guide - training.linuxfoundation.org - CNCF Certified Kubernetes Application Developer
Azure AZ-104 - learn.microsoft.com - Azure Administrator certification
AWS SAA-C03 - aws.amazon.com/certification - AWS Solutions Architect

Learning Resources

Kubernetes The Hard Way - Kelsey Hightower's in-depth K8s tutorial
Helm Chart development best practices - helm.sh/docs/chart_best_practices
Terraform Up and Running - by Yevgeniy Brikman, the IaC bible
vLLM official docs - docs.vllm.ai - LLM serving framework

Industry Trends

Cohere blog - cohere.com/blog - Latest model and technology updates
CNCF Landscape - landscape.cncf.io - Map of the cloud native ecosystem
AI Infrastructure Alliance - ai-infrastructure.org - AI infrastructure trends

Communities

CNCF Slack - slack.cncf.io - Cloud native community
Kubernetes subreddit - reddit.com/r/kubernetes - K8s community
MLOps Community - mlops.community - ML operations community

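Conclusion

Cohere's Forward Deployed Engineer (Infrastructure Specialist) position is one of the most interesting roles of the AI era. The job is not simply writing code; it is bringing world-class AI technology directly into enterprise environments.

To summarize what this guide covered:

Cohere is a company focused on enterprise AI, and the North platform is its core product
The FDE is a distinctive role that deploys directly at customer sites
Kubernetes and Helm are the technical core, and air-gapped deployment experience is a major differentiator
Eight months of structured study is enough to build the required skills
Customer-facing ability matters as much as technical skill

Following this roadmap systematically will give you a solid foundation for starting a career as a Forward Deployed Engineer. AI infrastructure is growing quickly, and demand for engineers with these skills will keep rising.

Good luck!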

Cohere Forward Deployed Engineer Complete Guide: Roadmap to Becoming an AI Platform Deployment Specialist

1. Understanding Cohere and the Agentic Platform Team
2. Line-by-Line JD Analysis
- Key Responsibilities Breakdown
- Required Qualifications Analysis
3. Tech Stack Deep Dive
4. 25 Expected Interview Questions
5. Eight-Month Study Roadmap
- Month-by-Month Detailed Plan
6. Three Portfolio Project Ideas
7. Resume Writing Strategy
8. What It Is Like to Work at Cohere
9. Quiz
10. References
Conclusion

1. Understanding Cohere and the Agentic Platform Team

What Is Cohere

Cohere is an enterprise AI company founded in Toronto in 2019. It was co-founded by Aidan Gomez (co-author of the Transformer paper "Attention Is All You Need" from Google Brain), Ivan Zhang, and Nick Frosst. While OpenAI targets the consumer market, Cohere has focused on enterprise B2B from day one, which is its key differentiator.

Key Product Lineup:

Command R+: Enterprise-grade large language model (LLM) optimized for RAG (Retrieval-Augmented Generation)
Embed v3: Multilingual embedding model supporting 100+ languages, specialized for search and classification
Rerank v3: A reranking model that re-evaluates the relevance of search results
Aya: An open-source multilingual model project supporting 101 languages

The reason Cohere holds a strong position in the enterprise market is its rigorous approach to data privacy. Its core strategy ensures customer data never leaves their environment through private cloud and on-premises deployments.

What Is the North AI Platform

The North AI Platform is Cohere's enterprise AI deployment platform. It evolved from what was previously known as the Cohere Toolkit, enabling companies to securely run Cohere's AI models within their own infrastructure.

Key Features of North:

Complete AI stack deployment in private cloud and on-premises environments
Kubernetes-based architecture deployable across diverse infrastructure
Standardized deployment process through Helm Charts
GPU resource management and model serving optimization
Enterprise-grade security, monitoring, and logging integration

The Agentic Platform Team's Mission

The Agentic Platform team is one of the most customer-facing teams within Cohere. Their mission is to enable enterprise customers to operate AI agents safely and efficiently within their own environments.

AI agents go beyond simple chatbots: they use tools, perform multi-step reasoning, and automate real business workflows. Representative use cases include document analysis agents in financial institutions, medical record summarization agents in healthcare, and customer service automation agents in telecommunications.

Key Client Analysis

Cohere's enterprise clients span various industries, each with unique infrastructure requirements.

RBC (Royal Bank of Canada) - Finance

Canada's largest bank, subject to Canadian and international regulatory requirements (OSFI oversight, GDPR, SOX)
AI deployment in air-gapped environments is a core challenge
Data residency requirements: mandatory data storage within Canadian territory
Extreme security demands due to financial data sensitivity

Dell Technologies - IT/Hardware

AI deployment on Dell's own server and storage infrastructure
On-premises AI infrastructure combining Dell PowerEdge with NVIDIA GPUs
Isolated AI workload management in multi-tenancy environments
Secondary deployment scenarios providing AI solutions to Dell's own customers

LG CNS - IT Services/South Korea

Compliance with South Korea's Personal Information Protection Act (PIPA)
Korean language-specific AI model performance optimization requirements
Serving diverse industry verticals: finance (LG affiliates), manufacturing, logistics
Accommodating South Korea's strong preference for on-premises deployments

Origins of the Forward Deployed Engineer Role

The Forward Deployed Engineer (FDE) title was first created by Palantir Technologies. Inspired by the military term "forward deployed," it signifies placing engineers at the frontlines of customer engagements.

Unlike traditional software engineers who build products internally, FDEs directly understand customer environments and build solutions on top of them. Palantir's FDEs worked directly with the U.S. government, military, and intelligence agencies to deploy data analytics platforms.

As this model proved successful, many AI and data companies like Databricks, Scale AI, and Anyscale adopted similar roles, and Cohere followed suit.

FDE vs SE vs SA Role Comparison

Category	Forward Deployed Engineer (FDE)	Solutions Engineer (SE)	Solutions Architect (SA)
Core Work	Direct deployment/building at customer sites	Technical sales support, demos, PoC	Architecture design, technical consulting
Coding Ratio	60-70% (production implementation)	30-40% (demos, scripts)	10-20% (prototypes)
Customer Contact	Deep (weeks to months on-site)	Focused on sales stage	Focused on design stage
Technical Depth	Very deep (infrastructure + code)	Broad but moderate depth	Broad and deep at architecture level
Travel	20-40% (customer sites)	10-20%	10-15%
Reporting Line	Engineering organization	Sales/pre-sales organization	Sales or CTO organization
Success Metrics	Deployment success rate, uptime	Deal closure rate, PoC conversion	Technology adoption, customer satisfaction

2. Line-by-Line JD Analysis

Key Responsibilities Breakdown

"Lead North AI platform deployments across private cloud and on-premises environments"

This single line captures the essence of the position. "Lead" means not just participating but driving the entire deployment process. You must handle both private cloud (Azure Stack, AWS Outposts, GCP Anthos, etc.) and on-premises environments (customer data centers).

In practice, this means:

Customer infrastructure pre-assessment
Deployment architecture design and documentation
Kubernetes cluster setup and validation
North platform deployment via Helm Charts
GPU node configuration and model loading
Integration testing and performance validation
Handoff to customer operations team

"Partner with enterprise IT teams on infrastructure and security assessments"

Collaborating with enterprise IT teams demands communication skills as much as technical expertise. Large enterprise IT teams have strict security policies, network architectures, and change management processes.

Infrastructure assessment checklist:

Network topology: VPC/VLAN configuration, subnets, firewall rules
Security policies: authentication/authorization mechanisms, TLS certificate management
Compute resources: CPU/GPU specifications, memory, storage IOPS
Kubernetes environment: version, CNI plugin, ingress controller
Regulatory compliance: data residency, audit logging, access control

"Design tailored deployment strategies ensuring data privacy compliance"

Deployment strategies are tailored to each client. Financial institutions often require air-gapped environments, healthcare customers need HIPAA compliance, and EU customers must meet GDPR requirements.

Deployment strategy design considerations:

Data flow: movement paths for training and inference data
Encryption: at rest and in transit encryption
Access control: who can access models and invoke APIs
Audit trail: logging all access and changes
Data retention: storage duration and deletion policies

"Troubleshoot deployment issues and minimize system downtime"

Troubleshooting ability in production environments is a core FDE competency. When failures occur in customer production environments, immediate response is required.

Common troubleshooting scenarios:

Pod CrashLoopBackOff: OOM, configuration errors, dependency issues
GPU allocation failures: driver mismatches, resource exhaustion
Network connectivity issues: DNS, ingress, service mesh configuration
Model loading failures: storage access permissions, model file corruption
Performance degradation: resource contention, scaling issues

Required Qualifications Analysis

"Direct customer-facing experience"

Customer-facing experience is not just about talking to customers. You need to explain technically complex topics to non-technical decision-makers while engaging in deep technical discussions with engineering teams.

"Production Kubernetes cluster administration and Helm expertise"

This requires production-level K8s operational experience: not a personal minikube project, but managing multi-node clusters that handle real traffic. Helm expertise means chart development and customization, not just basic usage.

"Cloud infrastructure (Azure, AWS, GCP), networking, virtualization"

Multi-cloud knowledge is essential. Since each customer uses different clouds, basic understanding of all three is needed. Networking (VPC, subnets, peering, private endpoints) and virtualization (VMware, KVM) knowledge are particularly important.

3. Tech Stack Deep Dive

3-1. Kubernetes Deep Dive (Production Cluster Operations)

Kubernetes is the most critical technology for this position. Production cluster management capability will make or break your candidacy.

Cluster Architecture Understanding

                     ┌─────────────────────────────┐
                     │       Control Plane          │
                     │  ┌─────────┐ ┌───────────┐  │
                     │  │kube-api │ │ scheduler │  │
                     │  │ server  │ │           │  │
                     │  └────┬────┘ └───────────┘  │
                     │  ┌────┴────┐ ┌───────────┐  │
                     │  │  etcd   │ │controller │  │
                     │  │         │ │ manager   │  │
                     │  └─────────┘ └───────────┘  │
                     └──────────────┬──────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
     ┌────────┴────────┐  ┌────────┴────────┐  ┌────────┴────────┐
     │   Worker Node 1  │  │   Worker Node 2  │  │  GPU Node (AI)  │
     │  ┌─────┐┌─────┐ │  │  ┌─────┐┌─────┐ │  │  ┌─────┐┌─────┐│
     │  │ Pod ││ Pod │ │  │  │ Pod ││ Pod │ │  │  │ Pod ││ Pod ││
     │  └─────┘└─────┘ │  │  └─────┘└─────┘ │  │  │ GPU ││ GPU ││
     │  kubelet+kproxy  │  │  kubelet+kproxy  │  │  └─────┘└─────┘│
     └──────────────────┘  └──────────────────┘  └─────────────────┘

Control Plane Core Components:

kube-apiserver: Entry point for all API requests. Manages cluster state via REST API
etcd: Distributed key-value store holding all cluster state. Backups are critical
kube-scheduler: Schedules Pods onto appropriate nodes. Places GPU-requesting Pods on GPU nodes
kube-controller-manager: Manages state of Deployments, ReplicaSets, DaemonSets, etc.

Production Deployment Strategies

# Rolling Update - Most common
apiVersion: apps/v1
kind: Deployment
metadata:
  name: north-ai-api
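spec:
  replicas: 3
  # selector is required for apps/v1 and must match the Pod template labels
  selector:
    matchLabels:
      app: north-platform
      component: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: north-platform
        component: api
    spec: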
      containers:
        - name: north-api
          image: cohere/north-api:v2.1.0
          resources:
            requests:
              memory: '4Gi'
              cpu: '2'
            limits:
              memory: '8Gi'
              cpu: '4'
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30

Deployment Strategy Comparison:

Strategy	Downtime	Rollback Speed	Resource Usage	Best For
Rolling Update	None	Moderate	Gradual increase	General updates
Blue-Green	None	Instant	2x resources	Critical updates
Canary	None	Instant	Slight increase	High-risk changes
Recreate	Yes	Slow	Same	Compatibility issues

RBAC (Role-Based Access Control)

RBAC configuration is fundamental to security in enterprise environments.

# Per-customer namespace isolation
apiVersion: v1
kind: Namespace
metadata:
  name: customer-rbc
  labels:
    customer: rbc
    environment: production
---
# Role definition following least privilege principle
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: customer-rbc
  name: north-deployer
rules:
  - apiGroups: ['apps']
    resources: ['deployments', 'statefulsets']
    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch']
  - apiGroups: ['']
    resources: ['pods', 'services', 'configmaps', 'secrets']
    verbs: ['get', 'list', 'watch', 'create', 'update']
  - apiGroups: ['']
    resources: ['pods/log']
    verbs: ['get']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: customer-rbc
  name: north-deployer-binding
subjects:
  - kind: ServiceAccount
    name: north-deploy-sa
    namespace: customer-rbc
roleRef:
  kind: Role
  name: north-deployer
  apiGroup: rbac.authorization.k8s.io

Network Isolation with NetworkPolicy

# Allow only inter-North AI platform Pod communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: north-platform-policy
  namespace: customer-rbc
spec:
  podSelector:
    matchLabels:
      app: north-platform
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: north-platform
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: north-platform
      ports:
        - protocol: TCP
          port: 8080
    - to: # Allow DNS (UDP and TCP; large responses fall back to TCP)
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
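
NetworkPolicies are additive, and the policy above only restricts Pods labelled `app: north-platform`; any other Pod in the namespace stays unrestricted. A common complement is a namespace-wide default-deny policy, sketched below with a hypothetical name. It only takes effect when the cluster's CNI plugin enforces NetworkPolicy (Calico, for example).

```yaml
# Default-deny for every Pod in the namespace; more specific policies
# (such as the one above) then open only the traffic that is needed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: customer-rbc
spec:
  podSelector: {}      # empty selector matches all Pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```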

Resource Management

# LimitRange for namespace defaults
apiVersion: v1
kind: LimitRange
metadata:
  name: north-limits
  namespace: customer-rbc
spec:
  limits:
    - default:
        memory: '2Gi'
        cpu: '1'
      defaultRequest:
        memory: '512Mi'
        cpu: '250m'
      type: Container
---
# ResourceQuota for namespace-wide limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: north-quota
  namespace: customer-rbc
spec:
  hard:
    requests.cpu: '32'
    requests.memory: '64Gi'
    limits.cpu: '64'
    limits.memory: '128Gi'
    requests.nvidia.com/gpu: '8'
    pods: '50'

GPU Node Management

GPU management is especially critical for AI model serving.

# Tolerations + nodeSelector for GPU node placement
apiVersion: apps/v1
kind: Deployment
metadata:
  name: north-model-server
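spec:
  # selector is required for apps/v1 and must match the Pod template labels
  selector:
    matchLabels:
      app: north-platform
      component: model-server
  template:
    metadata:
      labels:
        app: north-platform
        component: model-server
    spec: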
      nodeSelector:
        accelerator: nvidia-a100
      tolerations:
        - key: 'nvidia.com/gpu'
          operator: 'Exists'
          effect: 'NoSchedule'
      containers:
        - name: model-server
          image: cohere/north-model:latest
          resources:
            limits:
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-storage
              mountPath: /models
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: '16Gi'

Monitoring Stack

# Prometheus ServiceMonitor configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: north-platform-monitor
  namespace: monitoring
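spec:
  # The ServiceMonitor lives in "monitoring", so the namespace of the
  # target Services must be listed explicitly
  namespaceSelector:
    matchNames:
      - customer-rbc
  selector:
    matchLabels:
      app: north-platform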
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Key monitoring metrics:

Pod level: CPU/memory utilization, restart count, OOM kill count
Node level: Node availability, disk usage, GPU utilization
Cluster level: Scheduling latency, etcd latency, API server response time
AI workload: Inference latency, tokens/sec throughput, queue depth
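
The Pod-level metrics above (restart count, OOM kills) can be turned into alerts with a PrometheusRule. This is a sketch, assuming the Prometheus Operator and kube-state-metrics are installed (both ship with kube-prometheus-stack); the rule names, thresholds, and namespace filter are illustrative, and the operator may require extra labels on the object depending on its `ruleSelector`.

```yaml
# Hypothetical alert rules built on kube-state-metrics series.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: north-platform-alerts
  namespace: monitoring
spec:
  groups:
    - name: north-platform.pods
      rules:
        - alert: NorthPodRestartingFrequently
          # More than 3 container restarts within 15 minutes
          expr: increase(kube_pod_container_status_restarts_total{namespace="customer-rbc"}[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in Pod {{ $labels.pod }} is restarting repeatedly"
        - alert: NorthContainerOOMKilled
          # The container's most recent termination was an OOM kill
          expr: kube_pod_container_status_last_terminated_reason{namespace="customer-rbc", reason="OOMKilled"} == 1
          labels:
            severity: critical
          annotations:
            summary: "Container {{ $labels.container }} in Pod {{ $labels.pod }} was OOM killed"
```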

Incident Response Procedures

# 1. Identify problematic node
kubectl get nodes -o wide
kubectl describe node problematic-node

# 2. Cordon node (prevent new Pod scheduling)
kubectl cordon problematic-node

# 3. Safely move existing workloads (drain)
kubectl drain problematic-node --ignore-daemonsets --delete-emptydir-data

# 4. Restore node after fixing the issue
kubectl uncordon problematic-node

# 5. Verify Pod status
kubectl get pods -n customer-rbc -o wide
kubectl logs -n customer-rbc pod-name --previous
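
`kubectl drain` evicts Pods through the Eviction API, which honours PodDisruptionBudgets, so a PDB is what keeps step 3 from taking down too many replicas at once. The sketch below assumes the API Pods carry the labels `app: north-platform` and `component: api` and run with 3 replicas; both are assumptions for illustration.

```yaml
# Keep at least 2 API Pods running during voluntary disruptions such as drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: north-api-pdb
  namespace: customer-rbc
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: north-platform
      component: api
```

If a drain appears to hang, a PDB that cannot currently be satisfied is a common cause; `kubectl get pdb -n customer-rbc` shows the allowed disruptions.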

Recommended Certification Path:

CKA (Certified Kubernetes Administrator): Cluster management focused - Essential
CKAD (Certified Kubernetes Application Developer): Application deployment - Recommended
CKS (Certified Kubernetes Security Specialist): Security - Preferred

3-2. Helm Master Class

Helm is Kubernetes' package manager and a tool you will use daily in this position. Since the North AI platform's deployment unit is a Helm Chart, chart development capability is essential.

Helm Chart Structure

north-ai-platform/
  Chart.yaml          # Chart metadata
  Chart.lock          # Dependency lock file
  values.yaml         # Default values
  values-production.yaml   # Production override
  values-staging.yaml      # Staging override
  templates/
    _helpers.tpl      # Common template functions
    deployment.yaml   # Deployment resource
    service.yaml      # Service resource
    ingress.yaml      # Ingress resource
    configmap.yaml    # ConfigMap
    secret.yaml       # Secret
    hpa.yaml          # HorizontalPodAutoscaler
    pdb.yaml          # PodDisruptionBudget
    networkpolicy.yaml
    serviceaccount.yaml
    NOTES.txt         # Post-install instructions
    tests/
      test-connection.yaml   # helm test hook (must live under templates/)
  charts/             # Sub-charts (dependencies)

values.yaml Environment-Specific Override Strategy

# values.yaml (defaults)
replicaCount: 1
image:
  repository: cohere/north-platform
  tag: 'latest'
  pullPolicy: IfNotPresent

modelServer:
  replicas: 1
  gpu:
    enabled: false
    count: 1
    type: ''
  resources:
    requests:
      memory: '8Gi'
      cpu: '4'
    limits:
      memory: '16Gi'
      cpu: '8'

ingress:
  enabled: true
  className: nginx
  tls:
    enabled: true

monitoring:
  enabled: true
  prometheus:
    scrape: true

security:
  networkPolicy:
    enabled: true
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 1000

# values-production-rbc.yaml (RBC customer override)
replicaCount: 3

image:
  repository: harbor.rbc.internal/cohere/north-platform
  tag: '3.5.0'
  pullPolicy: Always
  pullSecret: rbc-harbor-secret

modelServer:
  replicas: 2
  gpu:
    enabled: true
    count: 4
    type: nvidia-a100-80gb
  resources:
    requests:
      memory: '64Gi'
      cpu: '16'
    limits:
      memory: '128Gi'
      cpu: '32'

ingress:
  enabled: true
  className: nginx
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: 'true'
    nginx.ingress.kubernetes.io/proxy-body-size: '100m'
  hosts:
    - host: north-ai.rbc.internal
  tls:
    enabled: true
    secretName: rbc-tls-secret

persistence:
  enabled: true
  storageClass: rbc-premium-ssd
  size: 500Gi

Helm Hooks

# pre-install/pre-upgrade hook: database migration
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
  annotations:
    'helm.sh/hook': pre-install,pre-upgrade
    'helm.sh/hook-weight': '-5'
    'helm.sh/hook-delete-policy': hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migration
          image: cohere/north-migration:latest
          command: ['./migrate', '--direction', 'up']

Chart Testing

# Lint check
helm lint ./north-ai-platform

# Template rendering verification
helm template my-release ./north-ai-platform -f values-production-rbc.yaml

# Dry-run install simulation
helm install my-release ./north-ai-platform --dry-run --debug

# Run chart tests
helm test my-release

Helmfile for Multi-Chart Management

# helmfile.yaml
repositories:
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts
  - name: ingress-nginx
    url: https://kubernetes.github.io/ingress-nginx

releases:
  - name: north-platform
    chart: ./charts/north-ai-platform
    namespace: north-system
    values:
      - values/common.yaml
      - values/production.yaml

  - name: monitoring
    chart: prometheus-community/kube-prometheus-stack
    namespace: monitoring
    values:
      - values/monitoring.yaml

  - name: ingress
    chart: ingress-nginx/ingress-nginx
    namespace: ingress
    values:
      - values/ingress.yaml

3-3. Cloud Infrastructure (Azure/AWS/GCP)

Managed Kubernetes Comparison

Feature	AKS (Azure)	EKS (AWS)	GKE (Google)
Control Plane Cost	Free tier available; ~$0.10/hr on Standard tier	~$0.10/hr	~$0.10/hr (free-tier credit covers one zonal or Autopilot cluster)
GPU Support	A100, H100	A100, H100, P5	A100, H100, TPU
Max Nodes	5,000	500 (managed node groups)	15,000
Private Cluster	Supported	Supported	Supported
Service Mesh	Istio, OSM	App Mesh, Istio	Anthos SM
GitOps Integration	Flux (native)	ArgoCD	Config Sync
ML Platform	Azure ML	SageMaker	Vertex AI
Air-Gap Support	Azure Stack	Outposts	Anthos

Networking Core Concepts

┌─────────────────────────────────────────────┐
│                    VPC                       │
│  ┌──────────────────┐ ┌──────────────────┐  │
│  │  Public Subnet    │ │  Public Subnet    │  │
│  │  (AZ-1)          │ │  (AZ-2)          │  │
│  │  Load Balancer   │ │  Load Balancer   │  │
│  └────────┬─────────┘ └────────┬─────────┘  │
│           │                     │            │
│  ┌────────┴─────────┐ ┌────────┴─────────┐  │
│  │  Private Subnet   │ │  Private Subnet   │  │
│  │  (AZ-1)          │ │  (AZ-2)          │  │
│  │  K8s Worker      │ │  K8s Worker      │  │
│  │  Nodes           │ │  Nodes           │  │
│  └────────┬─────────┘ └────────┬─────────┘  │
│           │                     │            │
│  ┌────────┴─────────┐ ┌────────┴─────────┐  │
│  │  Data Subnet      │ │  Data Subnet      │  │
│  │  (AZ-1)          │ │  (AZ-2)          │  │
│  │  DB, Storage     │ │  DB, Storage     │  │
│  └──────────────────┘ └──────────────────┘  │
│                                              │
│  ┌──────────────────────────────────────┐   │
│  │  Private Endpoint (Storage/DB)       │   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

Key Networking Elements:

VPC Peering / Transit Gateway: Connecting multiple VPCs
Private Endpoint / PrivateLink: Service access without traversing public internet
Network Security Group / Security Group: Inbound/outbound traffic control
DNS: Internal service name resolution with Private DNS Zones

IAM Strategy

Workload identity mapping across clouds:

Azure: Managed Identity -> Workload Identity (Microsoft Entra Workload ID; replaces the deprecated AAD Pod Identity)
AWS:   IAM Role -> IRSA (IAM Roles for Service Accounts)
GCP:   Service Account -> Workload Identity Federation
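
On the Kubernetes side, each of these mappings usually comes down to an annotation on the ServiceAccount the Pods run as. The three manifests below are alternatives, one per cloud, not a set to apply together. The account ID, client ID, project, and role names are placeholders, and each cloud also needs its own IAM-side trust configuration, which is not shown.

```yaml
# AWS EKS: IRSA (role ARN is a placeholder)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: north-platform
  namespace: north-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/north-platform
---
# Azure AKS: Workload Identity (client ID is a placeholder).
# Pods additionally need the label azure.workload.identity/use: "true".
apiVersion: v1
kind: ServiceAccount
metadata:
  name: north-platform
  namespace: north-system
  annotations:
    azure.workload.identity/client-id: 00000000-0000-0000-0000-000000000000
---
# GCP GKE: Workload Identity (service account email is a placeholder)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: north-platform
  namespace: north-system
  annotations:
    iam.gke.io/gcp-service-account: north-platform@example-project.iam.gserviceaccount.com
```

Keeping the ServiceAccount name identical across clouds lets the same Helm chart be reused, with only the annotation supplied through per-cloud values.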

Recommended Certifications:

Azure: AZ-104 (Azure Administrator) or AZ-305 (Solutions Architect)
AWS: SAA-C03 (Solutions Architect Associate)
GCP: Professional Cloud Architect

3-4. DevOps and CI/CD

GitOps Workflow

Developer -> Git Push -> GitHub/GitLab
                             │
                    ┌────────┴────────┐
                    │                 │
              CI Pipeline        ArgoCD/Flux
              (Build/Test)     (watches repo)
                    │                 │
              Container        Sync to K8s
              Registry         Cluster
                    │                 │
                    └────────┬────────┘
                             │
                      K8s Cluster
                    (Desired State)
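
The "ArgoCD watches repo, syncs to cluster" half of the diagram corresponds to an Argo CD `Application` resource. The manifest below is a minimal sketch: the repository URL, chart path, values file, and target namespace are hypothetical.

```yaml
# Hypothetical Argo CD Application that keeps the cluster in sync with Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: north-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.internal/platform/north-deploy.git
    targetRevision: main
    path: charts/north-ai-platform
    helm:
      valueFiles:
        - values-production.yaml   # resolved relative to the chart path
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: north-system
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true
```

With `selfHeal` enabled, Git is the only supported way to change the deployment, which is the behavior auditors in regulated environments tend to want.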

CI/CD Pipeline Example (GitHub Actions)

name: North Platform CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Helm Lint
        run: helm lint ./charts/north-ai-platform
      - name: Template Validation
        run: |
          helm template test ./charts/north-ai-platform \
            -f values/test.yaml \
            | kubectl apply --dry-run=client -f -

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trivy Chart Scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: config
          scan-ref: ./charts/north-ai-platform

  build-and-push:
    needs: [lint-and-test, security-scan]
    # Publish only on pushes to main, not on pull requests
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and Push Chart
        run: |
          helm package ./charts/north-ai-platform
          helm push north-ai-platform-*.tgz oci://registry.example.com/charts

IaC: Kubernetes Cluster Provisioning with Terraform

# Azure AKS cluster provisioning
resource "azurerm_kubernetes_cluster" "north" {
  name                = "north-aks-cluster"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  dns_prefix          = "north"
  kubernetes_version  = "1.29"

  default_node_pool {
    name       = "system"
    node_count = 3
    vm_size    = "Standard_D4s_v3"
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

  private_cluster_enabled = true
}

# GPU node pool
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpupool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.north.id
  vm_size               = "Standard_NC24ads_A100_v4"
  node_count            = 2

  node_taints = [
    "nvidia.com/gpu=present:NoSchedule"
  ]

  node_labels = {
    "accelerator" = "nvidia-a100"
  }
}

Secret Management

# HashiCorp Vault with K8s integration (CSI Driver)
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: north-secrets
spec:
  provider: vault
  parameters:
    vaultAddress: 'https://vault.internal:8200'
    roleName: 'north-platform'
    objects: |
      - objectName: "db-password"
        secretPath: "secret/data/north/database"
        secretKey: "password"
      - objectName: "api-key"
        secretPath: "secret/data/north/api"
        secretKey: "key"
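
On the application side, the Secrets Store CSI driver exposes each entry in `objects` as a file named after its `objectName` inside the mounted volume. A minimal Python sketch of reading such a file follows. The mount path `/mnt/secrets-store` and the helper name `read_mounted_secret` are assumptions for illustration; the real path is whatever the Pod's `volumeMounts` specifies.

```python
from pathlib import Path


def read_mounted_secret(name: str, mount_dir: str = "/mnt/secrets-store") -> str:
    """Read one secret that the CSI driver exposed as a file named after objectName."""
    path = Path(mount_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"secret {name!r} is not mounted under {mount_dir}")
    # Drop only a trailing newline; other whitespace may be part of the secret.
    return path.read_text(encoding="utf-8").rstrip("\n")
```

A service would call, for example, `read_mounted_secret("db-password")` at startup. Reading from the file rather than an environment variable keeps the value out of process listings and picks up rotated secrets on the next read.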

3-5. Private Cloud and On-Premises Deployment

This section covers the most critical differentiating skill for this position. Many engineers can operate K8s on public cloud, but air-gapped deployment experience is rare.

Air-Gapped Environment Specifics

An air-gapped environment is a network completely isolated from the external internet. It is primarily used in finance (RBC), military, and healthcare organizations.

Key Challenges in Air-Gapped Environments:

Container image delivery: No access to Docker Hub, GitHub Container Registry
Package installation: No access to apt, yum, pip, npm repositories
Helm Chart downloads: No access to chart repositories
Certificate management: Cannot use external CAs like Let's Encrypt
Time synchronization: NTP server access may be restricted

Harbor Mirror Registry Setup

# Harbor installation (for air-gapped environments)
# 1. Download the images on an internet-connected machine
#    (core components shown; Harbor also publishes an offline installer bundle)
docker pull goharbor/harbor-core:v2.10.0
docker pull goharbor/harbor-db:v2.10.0
docker pull goharbor/harbor-jobservice:v2.10.0
docker pull goharbor/harbor-portal:v2.10.0
docker pull goharbor/nginx-photon:v2.10.0
docker pull goharbor/registry-photon:v2.10.0

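# 2. Save images as tar (include every image pulled in step 1)
docker save -o harbor-images.tar \
  goharbor/harbor-core:v2.10.0 \
  goharbor/harbor-db:v2.10.0 \
  goharbor/harbor-jobservice:v2.10.0 \
  goharbor/harbor-portal:v2.10.0 \
  goharbor/nginx-photon:v2.10.0 \
  goharbor/registry-photon:v2.10.0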

# 3. Transfer to air-gapped environment via physical media

# 4. Load images in air-gapped environment
docker load -i harbor-images.tar

Helm Chart Offline Bundle

# Download charts and dependencies on internet-connected machine
# (assumes the bitnami repository was already added with `helm repo add`)
helm pull oci://registry.example.com/charts/north-ai-platform --version 2.1.0
helm pull bitnami/postgresql --version 12.5.0
helm pull bitnami/redis --version 17.3.0

# Extract all container image list
# (take the last field so "- image:" list items work, and strip quotes)
helm template north ./north-ai-platform-2.1.0.tgz | \
  grep -E '^[[:space:]]*(- )?image:' | awk '{print $NF}' | tr -d "\"'" | \
  sort -u > image-list.txt

# Batch download and save images
while read -r image; do
  docker pull "$image"
done < image-list.txt

xargs docker save -o north-platform-images.tar < image-list.txt

# Load and retag in air-gapped environment
docker load -i north-platform-images.tar
# Retag to internal Harbor registry and push
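
The image-list extraction and the retag step above can also be scripted, which helps when the list has to be reviewed or regenerated for each release. The sketch below is one possible approach, not part of any official tooling: `extract_images` and `retarget_image` are hypothetical helpers, the regex covers only plain `image:` keys in rendered manifests, and `harbor.internal` is the example registry name used elsewhere in this guide.

```python
import re
from typing import List

# Matches `image: ref` and `- image: ref`, with or without quotes.
IMAGE_LINE = re.compile(r"""^\s*(?:-\s*)?image:\s*["']?([^"'\s#]+)["']?""", re.MULTILINE)


def extract_images(rendered_manifests: str) -> List[str]:
    """Collect unique image references from `helm template` output."""
    return sorted(set(IMAGE_LINE.findall(rendered_manifests)))


def retarget_image(image: str, internal_registry: str) -> str:
    """Rewrite an image reference so it points at the internal registry."""
    first, _, rest = image.partition("/")
    # A first path segment containing "." or ":" (or "localhost") is a registry host.
    has_registry = bool(rest) and ("." in first or ":" in first or first == "localhost")
    repository = rest if has_registry else image
    return f"{internal_registry}/{repository}"


sample = """
      containers:
        - name: api
          image: "ghcr.io/example/north-api:2.1.0"
        - image: bitnami/redis:7.0.5
          name: cache
"""

for ref in extract_images(sample):
    print(ref, "->", retarget_image(ref, "harbor.internal"))
```

The rewritten names assume that a matching Harbor project (here `example` and `bitnami`) already exists; the same mapping is what the chart's image values need to point to at install time.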

On-Premises Kubernetes Options

Tool	Characteristics	Best For
RKE2	Rancher's security-hardened K8s with FIPS 140-2 compliance support	Finance, government
Kubespray	Ansible-based flexible installation	Custom environments
Tanzu	VMware-integrated K8s	VMware customers
OpenShift	Red Hat enterprise K8s	Large enterprises
k3s	Lightweight K8s	Edge, IoT

Data Residency and Regulatory Compliance

Regulation	Region	Key Requirements
GDPR	EU	Data processing consent, right to be forgotten, DPO appointment
PIPA	South Korea	Personal info collection consent, cross-border transfer restrictions
FISC	Japan	Financial system security standards, domestic data storage
OSFI	Canada	Financial institution technology risk management
HIPAA	USA	Protection of health information (PHI), access controls, encryption safeguards

GPU Infrastructure Management

# NVIDIA GPU support (air-gapped): a namespace for the GPU Operator, plus a
# standalone device plugin DaemonSet pulled from the internal registry
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
# NVIDIA Device Plugin DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin
          image: harbor.internal/nvidia/k8s-device-plugin:v0.14.0
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ['ALL']
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

NVIDIA MIG (Multi-Instance GPU) Configuration:

MIG allows partitioning large GPUs like the A100 into multiple isolated instances to serve multiple models simultaneously.

A100 80GB GPU
├── MIG 1g.10gb (small model serving)
├── MIG 2g.20gb (medium model serving)
└── MIG 4g.40gb (large model serving)
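
Whether a planned mix of MIG instances can fit on one GPU can be sanity-checked by counting slices. The sketch below assumes the A100 80GB exposes 7 compute slices and 8 memory slices of 10 GB each, and that each profile consumes the slice counts shown in the table. It is a necessary-condition check only: actual placement rules are stricter, so the layout should still be confirmed on the hardware. The function name `fits_on_gpu` is hypothetical.

```python
from typing import Dict, List, Tuple

# (compute slices, memory slices) consumed by each profile on an A100 80GB
A100_80GB_PROFILES: Dict[str, Tuple[int, int]] = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
TOTAL_COMPUTE_SLICES = 7
TOTAL_MEMORY_SLICES = 8


def fits_on_gpu(layout: List[str]) -> bool:
    """Check that a list of MIG profiles does not exceed the GPU's slice budget."""
    compute = sum(A100_80GB_PROFILES[profile][0] for profile in layout)
    memory = sum(A100_80GB_PROFILES[profile][1] for profile in layout)
    return compute <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_SLICES


# The layout from the diagram above uses all 7 compute slices.
print(fits_on_gpu(["1g.10gb", "2g.20gb", "4g.40gb"]))
# Two 4g.40gb instances would need 8 compute slices.
print(fits_on_gpu(["4g.40gb", "4g.40gb"]))
```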

3-6. AI Model Deployment Technologies

LLM Serving Frameworks

Framework	Characteristics	Best For
vLLM	High throughput via PagedAttention	General LLM serving
TensorRT-LLM	NVIDIA-optimized, best performance	NVIDIA GPU environments
Triton Inference Server	Multi-model, multi-framework	Complex model pipelines
Text Generation Inference	HuggingFace ecosystem integration	HF model usage

RAG Architecture

User Query
    │
    ▼
┌────────────┐    ┌──────────────┐
│ Embed v3   │───▶│ Vector DB    │
│ (query     │    │ (search)     │
│ embedding) │    │ Pinecone/    │
└────────────┘    │ Weaviate/    │
                  │ pgvector     │
                  └──────┬───────┘
                         │ Top-K docs
                         ▼
                  ┌──────────────┐
                  │ Rerank v3    │
                  │ (relevance   │
                  │  reranking)  │
                  └──────┬───────┘
                         │ Refined docs
                         ▼
                  ┌──────────────┐
                  │ Command R+   │
                  │ (answer gen) │
                  └──────────────┘
                         │
                         ▼
                    Final Answer

Model Optimization Techniques

Quantization: FP32 -> FP16 -> INT8 -> INT4 to reduce model size and memory
Knowledge Distillation: Transferring knowledge from a large model to a smaller one
KV Cache Optimization: Memory efficiency improvement via PagedAttention
Batch Processing: Throughput maximization via Continuous Batching
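
To make the effect of quantization concrete, weight memory can be approximated as parameter count times bytes per parameter. The sketch below uses a hypothetical 7-billion-parameter model and ignores activations, KV cache, and quantization metadata, so real footprints will be somewhat higher.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical 7B-parameter model
for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

Each step down the precision ladder halves the weight footprint, which is often what decides whether a model fits on a single GPU or a single MIG instance.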

Monitoring Metrics

Core AI Serving Metrics:
├── Latency
│   ├── Time to First Token (TTFT)
│   ├── Inter-Token Latency (ITL)
│   └── End-to-End Latency
├── Throughput
│   ├── Tokens per Second (TPS)
│   ├── Requests per Second (RPS)
│   └── Concurrent Requests
├── Resources
│   ├── GPU Memory Utilization
│   ├── GPU Compute Utilization
│   └── KV Cache Hit Rate
└── Quality
    ├── Error Rate
    ├── Timeout Rate
    └── Queue Depth
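
The latency and throughput metrics in the tree above can all be derived from a request start time and the arrival time of each generated token. The following is a minimal sketch with a hypothetical `serving_metrics` helper; production systems typically export these as histograms from the serving framework rather than computing them per request like this.

```python
from typing import Dict, List


def serving_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, mean ITL, end-to-end latency, and tokens/s (times in seconds)."""
    if not token_times:
        raise ValueError("at least one generated token is required")
    ttft = token_times[0] - request_start
    e2e = token_times[-1] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    tps = len(token_times) / e2e if e2e > 0 else float("inf")
    return {"ttft_s": ttft, "itl_s": itl, "e2e_s": e2e, "tokens_per_s": tps}


# Four tokens: the first arrives after 0.5 s, then one every 0.25 s
print(serving_metrics(0.0, [0.5, 0.75, 1.0, 1.25]))
```

TTFT is dominated by queueing and prompt processing, while ITL reflects decode speed, so tracking them separately shows which stage is the bottleneck.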

3-7. Customer-Facing Skills (Soft Skills)

The FDE position values customer-facing ability as much as technical capability.

Technical Presentation Skills

Audience adaptation: Architecture level for CTOs, implementation level for IT teams
Demo preparation: Always have backup recordings for live demos
Handling questions: Honestly admit unknowns and promise follow-up

Customer Requirements Gathering Framework

1. Current State Assessment (As-Is)
   - Existing infrastructure configuration
   - Current AI/ML tools in use
   - Team composition and capabilities

2. Target State Definition (To-Be)
   - Desired AI use cases
   - Performance/scalability requirements
   - Security/regulatory requirements

3. Gap Analysis
   - Technical gaps
   - Process gaps
   - Talent gaps

4. Implementation Plan
   - Phased milestones
   - Risks and mitigation strategies
   - Timeline and resources

Language Skills

As this position is Japan-based:

English: Technical documentation, global team communication (required)
Japanese: Japanese customer engagement (highly advantageous)
Korean: LG CNS and Korean customer engagement (bonus)

Incident Communication Template

[Upon incident detection] (within 5 minutes)
"We have identified [symptom] in [system name].
Root cause analysis has begun, and we will provide an update within [estimated time]."

[When cause is identified]
"The root cause has been identified as [root cause].
[Resolution approach] is in progress, with an estimated recovery time of [ETA]."

[Upon resolution]
"Service was restored to normal at [time].
Root cause: [detailed explanation]
Prevention measures: [countermeasures]
A detailed RCA (Root Cause Analysis) report will be delivered by [deadline]."

4. 25 Expected Interview Questions

Kubernetes and Infrastructure (8 Questions)

Q1. Describe the incident response procedure when a node fails in a production Kubernetes cluster.

Key answer points: Failure detection (monitoring alert) -> Impact scope assessment (which Pods are on the node) -> Cordon to block new scheduling -> Drain to migrate workloads -> Fix or replace node -> Uncordon to restore -> Post-mortem analysis

Q2. Explain your etcd backup and recovery strategy.

Key answer points: Regular etcd snapshots, backup frequency criteria, recovery procedure, maintaining quorum in distributed environments, etcd performance impact on the entire cluster
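
The quorum point is easy to state precisely in an interview: etcd needs a majority of members to accept writes, so a cluster of n members tolerates n minus the majority size in failures. A short Python illustration (the function names are just for this example):

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` nodes."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster keeps quorum."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Four members tolerate no more failures than three, which is why odd cluster sizes are the usual recommendation.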

Q3. Describe the debugging procedure when a Pod is stuck in Pending state.

Key answer points: Check events with kubectl describe pod -> Verify resource availability (CPU/memory/GPU) -> Check PVC binding -> Verify nodeSelector/affinity/taint -> Check scheduler logs

Q4. Explain how to control microservice communication using NetworkPolicy.

Key answer points: Set default deny-all policy -> Whitelist only required communication -> Namespace isolation -> Restrict external communication with egress control

Q5. Explain how to manage GPU resources in Kubernetes.

Key answer points: NVIDIA Device Plugin, resource limits, nodeSelector, tolerations, MIG configuration, monitoring (DCGM exporter)

Q6. Explain the role of PodDisruptionBudget (PDB) and configuration strategies.

Key answer points: Ensuring minimum Pod availability during voluntary disruptions (upgrades, scale-down), minAvailable vs maxUnavailable strategies, relationship with StatefulSets

Q7. Explain the differences between Horizontal Pod Autoscaler and Vertical Pod Autoscaler and their respective use cases.

Key answer points: HPA adjusts Pod count, VPA adjusts resource requests. AI inference services primarily use horizontal scaling for GPU Pods. Custom metrics-based scaling (queue depth, latency)
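
For custom-metric scaling it helps to know the core HPA calculation: desired replicas = ceil(current replicas × current metric / target metric), with changes skipped when the ratio is within a tolerance (0.1 by default). The sketch below applies that formula to a hypothetical per-Pod queue-depth metric; it omits stabilization windows and scaling policies, and the function name and replica bounds are assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 inference Pods, average queue depth 30 against a target of 20
print(desired_replicas(4, current_metric=30, target_metric=20))
```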

Q8. Explain how to implement namespace-level isolation in a multi-tenancy environment.

Key answer points: Namespace + RBAC + NetworkPolicy + ResourceQuota + LimitRange + Pod Security Standards combination

Helm and Deployment Strategies (5 Questions)

Q9. Explain the values.yaml override strategy for Helm Charts. How do you manage multiple environments (dev/staging/prod)?

Key answer points: Base values.yaml + environment-specific override files, Helmfile usage, secret management strategy (Sealed Secrets, SOPS)
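
The override strategy is easier to explain once the merge behavior is clear: when several values files are passed, maps are merged key by key, later files win, lists are replaced as a whole, and a null value removes a key. The Python sketch below approximates that behavior for illustration; `merge_values` is a hypothetical helper, not Helm's implementation, and the values shown are made up.

```python
import copy
from typing import Any, Dict


def merge_values(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Approximate how Helm layers an override values file on top of a base file."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if value is None:
            merged.pop(key, None)  # null in an override removes the key
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)  # maps merge recursively
        else:
            merged[key] = copy.deepcopy(value)  # scalars and lists are replaced
    return merged


base_values = {
    "replicaCount": 1,
    "image": {"repository": "north-api", "tag": "latest"},
    "tolerations": [{"key": "dev-only"}],
}
prod_values = {
    "replicaCount": 3,
    "image": {"tag": "2.1.0"},
    "tolerations": [{"key": "nvidia.com/gpu"}],
}

print(merge_values(base_values, prod_values))
```

The list-replacement rule is the usual source of surprises: an environment file that sets `tolerations` must repeat every entry it still needs from the base file.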

Q10. Share your experience with deployment automation using Helm Hooks.

Key answer points: pre-install/pre-upgrade for DB migrations, post-install for initial setup, hook weight for execution order control

Q11. Describe the entire process of deploying a Helm Chart to an air-gapped environment.

Key answer points: Chart packaging -> Image list extraction -> Image download/save -> Physical media transfer -> Internal registry load -> Modify values (image path changes) -> Install

Q12. Explain how to implement Canary deployment with Kubernetes and Helm.

Key answer points: Deploy canary version as separate Deployment, use Service selector, control traffic ratios with Istio VirtualService, metrics-based auto-promotion/rollback

Q13. Describe the strategy for managing CRDs (Custom Resource Definitions) in Helm Charts.

Key answer points: Using the crds/ directory, CRD upgrade considerations (Helm installs CRDs from crds/ on first install but does not upgrade or delete them afterward), pattern of managing CRDs as separate charts

Cloud and Networking (5 Questions)

Q14. Compare the pros and cons of each cloud provider's Managed Kubernetes service (Azure, AWS, GCP).

Key answer points: Content from section 3-3 comparison table + insights from actual operational experience

Q15. Explain the networking elements to consider when designing a VPC for AI workloads.

Key answer points: Subnet separation (public/private/data), bandwidth requirements (inter-GPU node communication), Private Endpoint, NAT Gateway, DNS resolution
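
Subnet separation can be planned with nothing more than Python's standard `ipaddress` module. The sketch below carves equally sized, non-overlapping subnets out of a VPC range for the three tiers mentioned above. The CIDR range, prefix length, and the `plan_subnets` helper are illustrative assumptions, not a recommended layout.

```python
import ipaddress
from typing import Dict, List


def plan_subnets(vpc_cidr: str, tiers: List[str], new_prefix: int) -> Dict[str, str]:
    """Assign one equally sized, non-overlapping subnet to each tier."""
    candidates = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = {tier: str(subnet) for tier, subnet in zip(tiers, candidates)}
    if len(plan) < len(tiers):
        raise ValueError("VPC range is too small for the requested tiers")
    return plan


print(plan_subnets("10.0.0.0/16", ["public", "private", "data"], new_prefix=20))
```

Generating the plan in code makes it straightforward to feed the same ranges into Terraform variables and to check for overlap with on-premises ranges before a hybrid connection is set up.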

Q16. Explain how to establish a secure connection between on-premises and cloud in a hybrid cloud environment.

Key answer points: VPN Gateway, ExpressRoute/Direct Connect/Interconnect, mTLS, certificate management, bandwidth planning

Q17. Share your experience managing multi-cloud infrastructure with Terraform.

Key answer points: Provider-specific module separation, remote state management (Backend), workspace strategy, module reuse patterns

Q18. Describe your experience setting up storage access via Private Endpoints.

Key answer points: Azure Private Endpoint / AWS VPC Endpoint / GCP Private Service Connect, DNS configuration, network rules

AI Deployment and Troubleshooting (4 Questions)

Q19. Explain the key challenges and solutions when serving large LLMs in a Kubernetes environment.

Key answer points: GPU memory management, model loading time, cold start issues, batch processing optimization, model update strategies (minimizing downtime)

Q20. Share your experience designing and deploying a RAG (Retrieval-Augmented Generation) system architecture.

Key answer points: Vector DB selection criteria, embedding model serving, search-reranking-generation pipeline, document chunking strategy, performance tuning

Q21. Describe the debugging process when OOM (Out of Memory) errors occur repeatedly during AI model serving.

Key answer points: Distinguish GPU memory vs system memory, check GPU usage with nvidia-smi, adjust batch size, apply quantization, limit KV cache size, shared memory configuration
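
When GPU OOM is the culprit, a quick estimate of KV cache size often explains it. For a standard transformer, the cache holds one key and one value vector per layer, per KV head, per token. The sketch below computes that size; the model dimensions are hypothetical values typical of a 7B-class dense model, and the formula ignores framework overhead and paged allocation granularity.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,  # FP16
) -> int:
    """Estimate KV cache size in bytes for a batch of sequences."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128
single = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(f"{single / 2**30:.1f} GiB per 4096-token sequence")
```

Because the size scales linearly with both sequence length and batch size, capping either one (or the serving framework's maximum concurrent sequences) is usually the fastest mitigation.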

Q22. Explain the strategy for performing zero-downtime model updates.

Key answer points: Prepare new model version with Blue-Green deployment, confirm model loading completion via Health Check before traffic switching, rollback plan, A/B testing possibility

Customer-Facing and Situational (3 Questions)

Q23. How would you respond to an urgent failure in a customer's production environment?

Key answer points: Immediate response (within 10 minutes), impact scope assessment, temporary fix (workaround) application, customer communication (regular updates), root cause analysis, RCA report, prevention measures

Q24. How would you approach deploying the North Platform when the customer's IT team has limited Kubernetes experience?

Key answer points: Customer team capability assessment, phased training plan development, documentation (deployment guide, operations guide), operations handoff plan, post-deployment support period

Q25. How would you respond when a customer presents technically impossible requirements (e.g., real-time model updates in an air-gapped environment)?

Key answer points: Identify the essential need behind the requirement, propose alternatives (periodic update cycles, semi-air-gapped configuration), clearly explain trade-offs, document technical rationale

5. Eight-Month Study Roadmap

Month	Topic	Goal	Key Project
1	Kubernetes Basics + CKA Prep	Cluster installation, understand Pod/Service/Deployment	Deploy 3-tier app on minikube
2	Kubernetes Advanced + CKA	RBAC, NetworkPolicy, storage, monitoring	Build multi-node cluster with kubeadm
3	Helm Mastery + Cloud Basics	Chart development, template engine, AKS/EKS experience	Create and deploy custom Helm Chart
4	Cloud Infrastructure + Terraform	VPC design, IAM, IaC hands-on with Terraform	Provision K8s cluster with Terraform
5	CI/CD + GitOps	GitHub Actions, ArgoCD, container security	Build GitOps pipeline
6	Air-Gapped/On-Premises Deployment	Harbor setup, offline deployment, RKE2	Air-gapped environment simulation
7	AI Model Deployment	vLLM, RAG architecture, GPU management	Build LLM serving pipeline
8	Integration Project + Interview Prep	Portfolio completion, mock interviews	Full-stack AI deployment platform

Month-by-Month Detailed Plan

Month 1: Kubernetes Basics

Week 1-2: Understand K8s architecture, master kubectl
Week 3-4: Hands-on with Deployment, Service, ConfigMap, Secret
Daily: 1 hour of CKA practice problems
Tools: minikube, kind

Month 2: Kubernetes Advanced + CKA

Week 1: RBAC, ServiceAccount
Week 2: NetworkPolicy, Ingress
Week 3: PV/PVC, StorageClass, StatefulSet
Week 4: Take CKA exam
Tools: kubeadm, Vagrant

Month 3: Helm + Cloud Introduction

Week 1-2: Helm basics, analyze existing charts
Week 3: Custom chart development, template engine
Week 4: First experience with AKS or EKS
Project: Package a microservices app as a Helm Chart

Month 4: Cloud Infrastructure + Terraform

Week 1: VPC design, subnets, security groups
Week 2: IAM, service accounts, workload identity
Week 3-4: Terraform basics, module authoring
Project: Build a private K8s cluster with Terraform

Month 5: CI/CD + GitOps

Week 1: Build GitHub Actions pipeline
Week 2: ArgoCD installation and configuration
Week 3: Container image scanning, Trivy
Week 4: Integrated CI/CD + GitOps pipeline
Project: Push-to-Deploy automation implementation

Month 6: Air-Gapped/On-Premises

Week 1: Harbor installation and operation
Week 2: Create offline image/chart bundles
Week 3: Build air-gapped K8s cluster with RKE2
Week 4: Security hardening (Falco, OPA/Gatekeeper)
Project: Deploy app in fully air-gapped environment

Month 7: AI Model Deployment

Week 1: LLM serving with vLLM
Week 2: GPU management, NVIDIA Device Plugin
Week 3: Build RAG pipeline
Week 4: Monitoring and optimization
Project: Build LLM serving on K8s + Helm

Month 8: Integration + Interview Prep

Week 1-2: Complete portfolio project
Week 3: Technical interview mock practice
Week 4: Behavioral interview prep, resume final review
Daily: Practice the 25 interview questions

6. Three Portfolio Project Ideas

Project 1: LLM Serving Pipeline with K8s + Helm

Goal: Build a complete pipeline for deploying and operating an open-source LLM in a Kubernetes environment using Helm Charts

Tech Stack: Kubernetes, Helm, vLLM, Prometheus, Grafana, NVIDIA GPU Operator

Implementation:

Develop custom Helm Chart for vLLM-based model serving
GPU node management and auto-scaling configuration
Build inference metrics dashboard with Prometheus + Grafana
Health check and auto-recovery mechanisms
Blue-Green deployment strategy for model updates
Include architecture diagram and deployment guide in README

Project 2: Air-Gapped Environment Simulation Deployment

Goal: Simulate a completely internet-isolated environment and practice the entire process of deploying AI services within it

Tech Stack: Vagrant, RKE2, Harbor, Helm, container image bundling

Implementation:

Build network-isolated VM environment with Vagrant
Install and configure Harbor mirror registry
Build air-gapped K8s cluster with RKE2
Package all images and charts as offline bundles
Automate deployment via bundle scripts
Manage TLS certificates with internal CA

Project 3: Multi-Cloud AI Deployment Automation

Goal: Build an automation system that deploys the same AI service to Azure, AWS, and GCP using Terraform and Helm

Tech Stack: Terraform, Helm, Helmfile, GitHub Actions, AKS/EKS/GKE

Implementation:

Provision cloud-specific K8s clusters with Terraform modules
Cloud-agnostic application deployment with Helmfile
CI/CD pipeline integration with GitHub Actions
Abstract cloud differences (storage, IAM, networking)
Cost comparison analysis documentation
Include disaster recovery (DR) strategy

7. Resume Writing Strategy

Organizing Experience in STAR Format

Each experience on your resume should be structured using the STAR (Situation-Task-Action-Result) format.

Good Example:

"Led migration from legacy VM-based deployment to Kubernetes (S/T). Developed Helm Charts and built ArgoCD-based GitOps pipeline to automate deployments (A). Reduced deployment time by 87% from 2 hours to 15 minutes and decreased deployment-related incidents from 3 per month to zero (R)."

Essential Keywords for FDE Position

Keywords to include in your resume:

Infrastructure: Kubernetes, Helm, Terraform, Docker, CI/CD
Cloud: Azure/AWS/GCP, VPC, IAM, private cloud
Security: RBAC, NetworkPolicy, TLS, air-gapped, regulatory compliance
AI/ML: LLM deployment, GPU management, model serving, RAG
Soft skills: Customer-facing, technical presentations, documentation, troubleshooting

Emphasizing Customer-Facing Experience

The most differentiating factor for an FDE position is customer-facing experience.

Points to highlight:

Direct customer communication experience (technical meetings, demos, training)
Direct deployment/operations at customer sites
Incident response and RCA (Root Cause Analysis) experience
Technical documentation creation and delivery
Explaining technical concepts to non-technical decision-makers

Recommended Resume Structure

1. Summary (3 lines)
   - Core experience years + specialty
   - Most impressive achievement (1)
   - Passion for the FDE role

2. Technical Skills
   - Programming: Python, Go, Bash
   - Infrastructure: K8s, Helm, Docker, Terraform
   - Cloud: Azure, AWS, GCP
   - AI/ML: LLM deployment, vLLM, RAG
   - Tools: Git, ArgoCD, Prometheus, Grafana

3. Professional Experience (STAR format, 3-5 items)

4. Projects (with GitHub links, 2-3 items)

5. Certifications
   - CKA, CKAD, cloud certifications

6. Education

8. What It Is Like to Work at Cohere

Benefits and Culture

Cohere's benefits are among the best in the AI startup space.

Key Benefits:

6 weeks vacation: Double the statutory vacation in many countries
100% parental leave top-up (6 months): Full salary guaranteed on top of government support
Remote work flexibility: Work from anywhere in Japan, travel to EMEA/APAC regions
Health insurance: Comprehensive health, dental, and vision coverage
Learning support: Certification, conference, and book purchase support
Stock options: Equity upside as the company grows

Growth Opportunities

Frontline AI Experience:

Hands-on experience deploying world-class LLMs
Learning the latest trends and technologies in enterprise AI deployment
Exposure to AI applications across finance, healthcare, telecom, and more

Global Network:

Collaboration with global enterprise clients like RBC, Dell, LG CNS
Working with multinational teams (Canada, US, Japan, Europe)
Opportunities to participate in global AI conferences and communities

Challenges

Being honest about the position's challenges is important.

Travel (20-40%)

You may spend 1-2 weeks per month at customer sites
Travel within Japan and across the Asia-Pacific region
Balancing remote work with on-site requirements

Customer Site Pressure

Production environment failures at customer sites create high stress
Must quickly adapt to diverse environments across different customers
Navigating between technical limitations and customer expectations

Breadth of Technical Scope

Must handle K8s, Helm, cloud, networking, security, and AI
Continuous learning is mandatory
Maintaining both breadth and depth simultaneously is the challenge

9. Quiz

Let us review what we have learned.

Q1. Which company first created the Forward Deployed Engineer (FDE) concept, and what is the biggest difference from a regular software engineer?

A: Palantir Technologies first created the concept. Unlike regular software engineers who build products internally, FDEs build and deploy technical solutions directly at customer sites. The key differentiator is understanding customer infrastructure environments and providing customized solutions on top of them.

Q2. What process must you follow to deploy container images in an air-gapped environment?

A: 1) Download all required container images on an internet-connected machine. 2) Save images as tar files using docker save. 3) Transfer to the air-gapped environment via physical media (USB, external drive, etc.). 4) Load images using docker load. 5) Push images to an internal registry like Harbor. 6) Modify image paths in the Helm Chart values to point to the internal registry and deploy.

Q3. What do maxUnavailable and maxSurge settings mean in Kubernetes Rolling Update for production environments, and what are the recommended values?

A: maxUnavailable is the maximum number of Pods that can be unavailable simultaneously during an update, and maxSurge is the maximum number of Pods that can be created above the desired count. Setting maxUnavailable: 1 and maxSurge: 1 ensures at least replicas-1 Pods remain serving while progressively updating. For AI inference services where availability is critical, setting maxUnavailable: 0 and maxSurge: 1 enables zero-downtime updates.
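
The bounds described in this answer can be checked with a few lines of arithmetic. The sketch below computes the minimum number of available Pods and the maximum total Pods during a rolling update, accepting either absolute numbers or percentages (percentages round down for maxUnavailable and up for maxSurge). The helper name `rolling_update_bounds` is hypothetical.

```python
import math
from typing import Dict, Union

IntOrPercent = Union[int, str]


def rolling_update_bounds(
    replicas: int, max_unavailable: IntOrPercent, max_surge: IntOrPercent
) -> Dict[str, int]:
    """Return the Pod count limits that hold throughout a rolling update."""

    def resolve(value: IntOrPercent, round_up: bool) -> int:
        if isinstance(value, str) and value.endswith("%"):
            fraction = replicas * int(value[:-1]) / 100
            return math.ceil(fraction) if round_up else math.floor(fraction)
        return int(value)

    unavailable = resolve(max_unavailable, round_up=False)
    surge = resolve(max_surge, round_up=True)
    return {"min_available": replicas - unavailable, "max_total": replicas + surge}


print(rolling_update_bounds(4, max_unavailable=1, max_surge=1))
print(rolling_update_bounds(4, max_unavailable=0, max_surge=1))
```

For GPU-backed inference, note that `max_total` also implies spare GPU capacity: with maxUnavailable set to 0, at least one extra GPU must be free for the surge Pod to schedule.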

Q4. In Cohere's RAG architecture, what roles do Embed v3, Rerank v3, and Command R+ play respectively?

A: Embed v3 converts user queries and documents into vectors to enable similarity-based search. Rerank v3 re-evaluates the relevance of initial search results to place the most relevant documents at the top. Command R+ uses the selected documents as context to generate accurate answers to user questions. These three models form the search-reranking-generation pipeline.

Q5. Why is environment-specific (dev/staging/prod) separation important in Helm Chart values.yaml override strategy, and how should secrets be managed?

A: Environment-specific separation enables consistent deployment across multiple environments using the same chart while applying environment-appropriate settings (resource sizes, replica counts, image tags, etc.). Common settings go in the base values.yaml, with overrides in values-dev.yaml, values-staging.yaml, and values-production.yaml. Secrets must never be stored in plaintext in values files. Instead, use Sealed Secrets (Bitnami), SOPS (originally from Mozilla, now a CNCF project), or External Secrets Operator (integrating with AWS/Azure/GCP secret managers).

10. References

Official Documentation

Cohere Documentation - docs.cohere.com - Cohere API and model guides
Kubernetes Documentation - kubernetes.io/docs - Complete Kubernetes reference
Helm Documentation - helm.sh/docs - Helm chart development guide
NVIDIA GPU Operator - docs.nvidia.com/datacenter/cloud-native - GPU management guide

Certification Preparation

CKA Exam Guide - training.linuxfoundation.org - Certified Kubernetes Administrator
CKAD Exam Guide - training.linuxfoundation.org - Certified Kubernetes Application Developer
Azure AZ-104 - learn.microsoft.com - Azure Administrator certification
AWS SAA-C03 - aws.amazon.com/certification - AWS Solutions Architect

Learning Resources

Kubernetes The Hard Way - Kelsey Hightower's deep K8s learning
Helm Chart Development Best Practices - helm.sh/docs/chart_best_practices
Terraform: Up & Running - by Yevgeniy Brikman, a widely used IaC reference
vLLM Documentation - docs.vllm.ai - LLM serving framework

Industry Trends

Cohere Blog - cohere.com/blog - Latest model and technology updates
CNCF Landscape - landscape.cncf.io - Cloud native ecosystem map
AI Infrastructure Alliance - ai-infrastructure.org - AI infrastructure trends

Communities

CNCF Slack - slack.cncf.io - Cloud native community
Kubernetes subreddit - reddit.com/r/kubernetes - K8s community
MLOps Community - mlops.community - ML operations community

Conclusion

The Cohere Forward Deployed Engineer (Infrastructure Specialist) position is one of the most exciting roles in the AI era. It is not just about writing code; it is about bringing world-class AI technology directly to enterprise environments.

To summarize what we covered in this guide:

Cohere is a company focused on enterprise AI, with the North platform as its core product
FDE is a unique role that deploys directly at customer sites
Kubernetes and Helm are the technical core, and air-gapped deployment experience is a major differentiator
Eight months of systematic study can equip you with the necessary skills
Customer-facing ability is as important as technical skills

Following this roadmap with systematic preparation will provide a solid foundation for starting your career as a Forward Deployed Engineer. The AI infrastructure field is growing rapidly, and demand for engineers with these capabilities will continue to increase.

Best of luck on your journey!