AWS Data Analytics Specialty (DAS-C01) Practice Exam: 65 Questions

Exam Overview
Domain Breakdown
AWS Data Analytics Services Ecosystem Map
65 Practice Questions
Study Resources

Exam Overview

Item	Details
Duration	180 minutes
Questions	65
Passing Score	750 / 1000
Question Types	Single answer, Multiple answer
Exam Cost	USD 300

Domain Breakdown

Domain	Weight
Domain 1: Collection	18%
Domain 2: Storage and Data Management	22%
Domain 3: Processing	24%
Domain 4: Analysis and Visualization	18%
Domain 5: Data Security	18%

AWS Data Analytics Services Ecosystem Map

[Data Sources]
  ├── Streaming: Kinesis Data Streams → Kinesis Data Analytics (Flink)
  │                                   → Kinesis Data Firehose → S3/Redshift/OpenSearch
  ├── Batch: DMS, Snow Family, Direct Connect
  └── SaaS: AppFlow

[Storage]
  ├── Data Lake: S3 + Lake Formation
  ├── Data Warehouse: Redshift (RA3, Spectrum)
  ├── NoSQL: DynamoDB
  └── Search: OpenSearch Service

[Processing]
  ├── Large-scale Batch: EMR (Spark, Hive, Flink)
  ├── Serverless ETL: AWS Glue
  └── Lightweight Transforms: Lambda

[Analysis & Visualization]
  ├── Serverless Query: Athena
  ├── BI Dashboards: QuickSight (SPICE)
  └── Exploratory Analysis: OpenSearch Dashboards

[Security]
  ├── Access Control: Lake Formation, IAM
  ├── Encryption: KMS, SSE
  └── Network: VPC Endpoints, PrivateLink

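65 Practice Questions

Domain 1: Collection

Q1. You need to collect IoT sensor data at 50,000 events per second. Data must be ordered and replayable for 24 hours. Which service is most appropriate?

A) Kinesis Data Firehose B) Kinesis Data Streams C) SQS FIFO Queue D) MSK (Amazon Managed Streaming for Apache Kafka)

Answer: B

Explanation: Kinesis Data Streams provides partition-key-based ordering (records that share a partition key keep their order within a shard), configurable retention from the default 24 hours up to 365 days, and replay within that retention period. Firehose does not support replay, and SQS FIFO has throughput limits.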

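Q2. Your Kinesis Data Streams consumers have reached the shared read throughput limit. Multiple consumer applications read from the same stream. How do you improve read throughput per consumer?

A) Increase the number of shards B) Enable Enhanced Fan-Out C) Switch to Provisioned capacity mode D) Reduce GetRecords API call frequency

Answer: B

Explanation: Enhanced Fan-Out gives each registered consumer a dedicated 2 MB/s of read throughput per shard. With standard GetRecords polling, all consumers share 2 MB/s per shard, whereas Enhanced Fan-Out allocates that throughput to each consumer separately.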

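Q3. You want to use Kinesis Data Firehose to store data in S3, converting the original JSON records to Parquet and routing them to different S3 paths based on a field value. What is the correct configuration?

A) Lambda transformation + prefix expressions B) Format Conversion + Dynamic Partitioning C) Integrate with a Glue ETL job D) Use S3 Object Lambda

Answer: B

Explanation: Firehose Format Conversion converts JSON to Parquet or ORC, and Dynamic Partitioning sets the S3 prefix dynamically from field values inside each record. The partition keys are extracted either with inline parsing (jq expressions) or with a Lambda function.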

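Q4. You need to implement continuous CDC (Change Data Capture) replication from an on-premises Oracle database to Amazon Redshift. What is the most appropriate approach?

A) Use an AWS Glue ETL job to periodically copy full tables B) Use AWS DMS with an ongoing replication task C) Use Kinesis Data Streams + Lambda D) Initial load with Snowball Edge, then Direct Connect

Answer: B

Explanation: AWS DMS (Database Migration Service) implements CDC by reading the source database's transaction logs. After an initial full load, the ongoing replication task synchronizes only the changes in near real time. Because Oracle to Redshift is a heterogeneous migration, SCT (Schema Conversion Tool) is used alongside DMS to convert the schema.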

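Q5. You need to transfer petabytes of data from on-premises to Amazon S3. Internet bandwidth is 1 Gbps and the transfer is expected to take months. What is the most cost-effective and fast approach?

A) Build an AWS Direct Connect dedicated line B) Order multiple Snowball Edge Storage Optimized devices C) Use AWS Snowmobile D) Enable S3 Transfer Acceleration

Answer: C

Explanation: At 1 Gbps, a single petabyte takes roughly 93 days to transfer even at full line rate, so network transfer (internet or Direct Connect) is too slow and too costly at this scale. Snowmobile holds up to 100 PB per unit and is aimed at very large datasets (AWS guidance is roughly 10 PB or more from a single location). For smaller volumes, multiple Snowball devices are the alternative; a Snowball Edge Storage Optimized device holds up to 80 TB, so petabyte-scale transfers need many of them.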

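Q6. You are choosing between Amazon MSK and Kinesis Data Streams. Your team has existing applications built on the Apache Kafka API and needs long message retention (up to 1 year). Which service is appropriate?

A) Kinesis Data Streams: better scalability B) Amazon MSK: Kafka compatibility and long retention support C) Kinesis Data Firehose: fully managed D) SQS: longer message retention

Answer: B

Explanation: Amazon MSK is fully compatible with Apache Kafka, so existing Kafka client code runs unchanged. Message retention can be set to unlimited, which suits long-term retention. Kinesis Data Streams supports retention of up to 365 days but is not compatible with the Kafka API.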

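Q7. ProvisionedThroughputExceededException errors occur frequently on your Kinesis data stream. The partition key distribution shows that traffic is concentrated on a few keys. What is the solution?

A) Double the number of shards B) Add a random prefix to the partition key to spread records evenly across shards C) Enable Enhanced Fan-Out D) Disable Kinesis Producer Library (KPL) aggregation

Answer: B

Explanation: A hot shard (writes concentrated on specific partition keys) is resolved by spreading the partition key. Prepending a random prefix (for example, a number in the range 0 to N) distributes records evenly across shards. The trade-off is that consumers must read from all shards and recombine records for a given logical key.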

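Q8. You are implementing real-time anomaly detection in Kinesis Data Analytics (Apache Flink) and need to compare the current value against the past 30 minutes of data. Which Flink feature should you use?

A) Tumbling Window B) Sliding Window C) Session Window D) Global Window

Answer: B

Explanation: A Sliding Window has a fixed size and advances at a fixed interval. For example, a 30-minute window that slides every 1 minute always evaluates the most recent 30 minutes of data. Tumbling Windows are fixed intervals that do not overlap.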

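Domain 2: Storage and Data Management

Q9. You want to optimize Athena query performance on an S3 data lake. Data is appended daily, and queries mainly filter by date range and region code. What is the optimal partitioning strategy?

A) Single partition: year/month/day B) Composite partition: year/month/day/region C) No partitioning, use Athena Partition Projection D) Store all data under a single prefix with compression

Answer: B

Explanation: Partitioning by both date and region code, in line with the query pattern, minimizes the data Athena scans. A hierarchical year/month/day/region layout lets both filter conditions benefit from partition pruning.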

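Q10. In AWS Lake Formation, you need to make specific columns of a table visible only to specific IAM roles. What is the correct approach?

A) Restrict access to specific prefixes with an S3 bucket policy B) Restrict at the table level with a Glue catalog resource-based policy C) Configure Lake Formation column-level security D) Set query filters per Athena workgroup

Answer: C

Explanation: Lake Formation supports fine-grained access control at the table, column, and row level. Column-level security restricts a given IAM role or user to seeing only specific columns. S3 bucket policies work at the object level, so they cannot control access per column.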

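Q11. Join performance on large tables in a Redshift cluster is poor. Both tables have billions of rows and are joined frequently. What is the best combination of distribution styles?

A) EVEN distribution on both tables B) ALL distribution on both tables C) KEY distribution (on the join key) for large fact tables, ALL distribution for small dimension tables D) AUTO distribution on both tables

Answer: C

Explanation: Large tables that are joined frequently should use KEY distribution on the join column. When both sides share the same distribution key, matching rows are collocated and the join runs locally on each node with minimal network transfer. ALL distribution copies a table to every node, which removes broadcast overhead for small dimension tables but is impractical for tables with billions of rows. EVEN distribution ignores the join key, so rows must be redistributed at query time.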

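Q12. You query a user's order history from a DynamoDB table. The primary key is UserID (partition key) and OrderDate (sort key). How do you efficiently retrieve orders placed after a specific date?

A) Scan with a FilterExpression B) Query with a KeyConditionExpression on UserID and an OrderDate range condition C) Create a GSI, then Query D) Maintain an index with DynamoDB Streams + Lambda

Answer: B

Explanation: A Query that uses the partition key (UserID) and sort key (OrderDate) of the primary key is the most efficient option. The KeyConditionExpression UserID = :uid AND OrderDate >= :date retrieves one user's orders from a given date onward. A Scan reads the entire table and is inefficient.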
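To make the key condition above concrete, here is a minimal sketch that only builds the request parameters for a DynamoDB Query in the low-level client format. The table name `Orders` and the choice to store OrderDate as an ISO-8601 string are assumptions for illustration; nothing is sent to AWS.

```python
def build_order_query(user_id: str, since: str, table: str = "Orders") -> dict:
    """Build Query parameters for one user's orders on or after `since`.

    `table` is a hypothetical table name; OrderDate is assumed to be an
    ISO-8601 string so that lexical order matches chronological order.
    """
    return {
        "TableName": table,
        # Equality on the partition key, range condition on the sort key.
        "KeyConditionExpression": "UserID = :uid AND OrderDate >= :date",
        "ExpressionAttributeValues": {
            ":uid": {"S": user_id},
            ":date": {"S": since},
        },
    }


params = build_order_query("user-123", "2024-01-01")
print(params["KeyConditionExpression"])
```

A dictionary like this could be passed as keyword arguments to a DynamoDB client's `query` call; because the condition touches only key attributes, DynamoDB reads a single partition rather than the whole table.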

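Q13. Which is the most accurate reason to use Redshift RA3 nodes?

A) To scale compute and storage independently B) To achieve the best query performance through in-memory caching C) To parallelize queries automatically D) To access S3 data directly

Answer: A

Explanation: Redshift RA3 nodes separate compute from storage so that each can scale independently. Frequently accessed data is kept in a local SSD cache, and the rest is stored in S3-backed Redshift Managed Storage (RMS). As data grows, storage can expand without adding compute cost.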

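Q14. You want to manage old index data in OpenSearch Service cost-effectively. Data from the last 7 days is queried often, data from 30 days to 1 year old is queried occasionally, and data older than 1 year is almost never queried. What is the optimal storage tier configuration?

A) Keep all data in Hot storage B) Move data through the Hot → UltraWarm → Cold storage tiers C) Export data older than 30 days to S3 D) Delete after 30 days with Index State Management

Answer: B

Explanation: OpenSearch Service offers three storage tiers: Hot (fast SSD, frequent access), UltraWarm (S3-backed, lower cost, occasional access), and Cold (cheaper still, rare access). An Index State Management (ISM) policy automates the tier transitions, balancing cost against access patterns.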

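Q15. You applied Lake Formation Governed Tables to a data lake stored in S3. What is the main benefit of this feature?

A) Lower storage cost through automatic file compression B) ACID transaction support and automatic data compaction C) Real-time streaming data ingestion D) Automatic column-level encryption

Answer: B

Explanation: Lake Formation Governed Tables bring ACID transactions (atomicity, consistency, isolation, durability) to data in S3. They keep data consistent under concurrent reads and writes, and automatic compaction addresses the small-files problem. Row-level security is also supported.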

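Q16. Which is the most effective way to maximize performance when querying an S3 data lake with Redshift Spectrum?

A) Store the S3 data as CSV B) Store the S3 data as Parquet and apply partitioning C) Maximize the number of Redshift cluster nodes D) Limit the number of files per Spectrum slice to one

Answer: B

Explanation: Redshift Spectrum performs best on columnar formats such as Parquet and ORC. Combining column pruning (reading only the needed columns) with partition pruning (reading only the needed partitions) sharply reduces the data scanned. With CSV, entire files must be read.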

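Domain 3: Processing

Q17. You want to minimize cost on an EMR cluster by using Spot Instances. Which configuration optimizes cost while minimizing the risk of job failure?

A) Spot for master, core, and task nodes B) On-Demand for master and core nodes, Spot for task nodes only C) On-Demand for the master node, Spot for core and task nodes D) Reserved Instances for all nodes

Answer: B

Explanation: In EMR, the master node manages the cluster and core nodes store HDFS data. Running master and core nodes On-Demand keeps the cluster stable, while running only the task nodes (extra compute with no HDFS storage) on Spot cuts cost without losing data when a Spot Instance is interrupted.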

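Q18. Which statement most accurately describes the difference between an AWS Glue DynamicFrame and an Apache Spark DataFrame?

A) A DynamicFrame has no schema, so it handles every data type B) A DynamicFrame tolerates schema inconsistencies (the choice type) and provides AWS-specific transforms such as relationalize C) A DataFrame is always faster, so DynamicFrames should not be used D) A DynamicFrame supports only structured streaming

Answer: B

Explanation: An AWS Glue DynamicFrame represents a column that contains mixed types as a choice type. It provides AWS-specific transform methods such as resolveChoice() and relationalize(). When performance matters, you can convert a DynamicFrame to a DataFrame, process it, and convert it back.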

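Q19. An AWS Glue Crawler crawls partitioned data in S3, and rerunning the whole crawler every time a new partition is added is inefficient. What is the alternative?

A) Run the Glue crawler every minute B) Use Athena's MSCK REPAIR TABLE or ADD PARTITION command C) Manage partitions automatically with Lake Formation blueprints D) Use S3 event notifications to update Glue catalog partitions through Lambda

Answer: D

Explanation: Triggering a Lambda function from S3 ObjectCreated events and calling the glue:BatchCreatePartition API adds only the new partitions. Athena's MSCK REPAIR TABLE also works, but it slows down as the number of partitions grows. Automatic partition registration through Lambda is the most efficient approach.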
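The Lambda-based registration described above mostly comes down to turning an S3 object key into partition values. The sketch below shows that step only, under the assumption that keys use Hive-style `name=value` path segments; the bucket and key are made up, and the returned dictionary is an abbreviated stand-in for a Glue PartitionInput (a real one would also copy the table's formats and SerDe into the StorageDescriptor).

```python
def partition_values(key: str) -> dict:
    """Extract Hive-style partition segments (name=value) from an S3 key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def partition_input(bucket: str, key: str) -> dict:
    """Build an abbreviated PartitionInput-like dict for the object's folder."""
    directory = key.rsplit("/", 1)[0]
    return {
        "Values": list(partition_values(key).values()),
        "StorageDescriptor": {"Location": f"s3://{bucket}/{directory}/"},
    }


# Hypothetical bucket and key, as they might appear in an S3 event record.
print(partition_input("my-lake", "sales/year=2024/month=01/day=02/part-0.parquet"))
```

In a real handler, the bucket and key would come from the S3 event, and several of these dictionaries could be sent together in one BatchCreatePartition request.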

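Q20. Which statement correctly describes a key difference between EMR Serverless and EMR on EC2?

A) EMR Serverless keeps a persistent cluster and EMR on EC2 uses transient clusters B) EMR Serverless scales automatically without cluster provisioning and incurs no cost when idle C) EMR Serverless supports only Hive, not Spark D) EMR on EC2 is always more cost-effective

Answer: B

Explanation: With EMR Serverless you submit jobs to an application and resources are provisioned automatically, with no cluster to manage. No charges accrue while no jobs are running. It supports Spark and Hive, and is well suited to intermittent batch workloads.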

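Q21. What is the primary use of AWS Glue DataBrew?

A) Large-scale distributed Spark ETL processing B) No-code visual data preparation and cleaning C) Real-time streaming data transformation D) Data catalog metadata management

Answer: B

Explanation: AWS Glue DataBrew is a service for exploring, cleaning, and normalizing data through a visual interface without writing code. It provides more than 250 prebuilt transformations and also supports data quality rules and profiling. It suits data analysts and data scientists.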

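Q22. You orchestrate a data pipeline with Step Functions and need automatic retry logic when an EMR job fails. What is the correct approach?

A) Poll the EMR status with a Lambda function and restart on failure B) Add a Retry block to the Step Functions state definition C) Detect EMR failures with a CloudWatch Alarm and send an SNS notification D) Wrap the EMR job in a Glue Workflow

Answer: B

Explanation: Adding a Retry block to a Task state makes Step Functions retry automatically for the specified error types. You can set the number of attempts, the retry interval, and the backoff rate. A Catch block can route to a fallback path when the final attempt fails.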
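As an illustration of how the three retry settings interact, the sketch below defines a Retry block as a Python dictionary using the Amazon States Language field names (ErrorEquals, IntervalSeconds, MaxAttempts, BackoffRate) and computes the wait before each retry. The concrete values (30 seconds, 3 attempts, rate 2.0) are assumptions chosen for the example.

```python
# A Retry block as it would appear (in JSON form) inside a Task state.
retry_block = [
    {
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 30,  # wait before the first retry
        "MaxAttempts": 3,       # number of retries after the initial failure
        "BackoffRate": 2.0,     # multiplier applied to each subsequent wait
    }
]


def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list:
    """Return the wait, in seconds, before each retry attempt."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


rule = retry_block[0]
print(retry_delays(rule["IntervalSeconds"], rule["MaxAttempts"], rule["BackoffRate"]))
```

With these values the waits grow as 30, 60, and 120 seconds; once the attempts are exhausted, a Catch block (if present) takes over.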

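Q23. You want to fix a small-files problem in an AWS Glue ETL job. After the Spark job finishes, thousands of small files are created in S3. What is the solution?

A) Reduce the number of Glue workers B) Adjust the number of output partitions with coalesce() or repartition() before writing C) Delete small files automatically with an S3 Lifecycle policy D) Merge files with Kinesis Firehose

Answer: B

Explanation: In Spark, calling coalesce(N) to reduce the number of partitions, or repartition(N) to redistribute data evenly, before writing controls the number of output files. coalesce reduces partitions without a shuffle but can leave them uneven; repartition produces even partitions at the cost of a shuffle. For merging small files, coalesce is usually the better fit.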
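Choosing N for coalesce(N) or repartition(N) is mostly arithmetic: divide the expected output size by the file size you are aiming for. The helper below is a small sketch of that calculation; the 128 MiB default target is an assumption for illustration, not a Glue setting.

```python
import math


def target_partitions(total_bytes: int, target_file_bytes: int = 128 * 1024**2) -> int:
    """Estimate how many output partitions give files of about the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


n = target_partitions(10 * 1024**3)  # roughly 10 GiB of output
print(n)
```

The result would then be passed to `coalesce(n)` or `repartition(n)` on the DataFrame just before the write.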

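Q24. When a Lambda function uses Kinesis Data Streams as its event source, how do you increase processing speed?

A) Increase the Lambda function's memory B) Increase the per-shard Parallelization Factor C) Decrease the Batch Size D) Increase Lambda reserved concurrency

Answer: B

Explanation: In the Kinesis-Lambda event source mapping, raising the Parallelization Factor (1 to 10) lets multiple Lambda invocations process each shard concurrently. The default of 1 means one concurrent invocation per shard. With a Parallelization Factor of 10, up to 10 invocations run in parallel against the same shard.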
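The effect of the setting is easiest to see as a multiplication: the upper bound on concurrent batches is the shard count times the factor. A minimal sketch, with the 1 to 10 range taken from the explanation above:

```python
def max_concurrent_batches(shards: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent Lambda invocations for one event source mapping."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    return shards * parallelization_factor


print(max_concurrent_batches(4))      # default factor of 1
print(max_concurrent_batches(4, 10))  # maximum factor
```

For a 4-shard stream this moves the ceiling from 4 to 40 concurrent invocations without adding shards.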

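Domain 4: Analysis and Visualization

Q25. You want to minimize the cost of frequently run Athena queries by reusing the results of identical queries. Which Athena feature should you use?

A) Athena query caching (Query Result Reuse) B) Athena Federated Query C) Save results with CTAS (Create Table As Select) D) Athena workgroup query queue

Answer: A

Explanation: With Query Result Reuse enabled, Athena reuses the previous result of an identical query within a specified period (up to 7 days). No scan cost is incurred, which greatly reduces the cost of repeated queries. It is effective when the underlying data does not change.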

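Q26. You query partitioned S3 data with Athena and want to manage partition metadata without registering partitions in the Glue catalog. Partitions are added every hour. What is the most efficient method?

A) Run MSCK REPAIR TABLE every hour B) Configure Athena Partition Projection C) Run ALTER TABLE ADD PARTITION every hour D) Manage partitions automatically with Lake Formation

Answer: B

Explanation: With Athena Partition Projection, partition metadata is not stored in the Glue catalog; partitions are computed dynamically from rules in the table definition. It is especially effective for partitions that follow a regular pattern (dates, numeric ranges) and removes the overhead of registering partitions.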
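The "rules in the table definition" are table properties. The sketch below assembles a set of them for an hourly date partition and renders a TBLPROPERTIES clause. The property names follow Athena's `projection.*` convention; the column name `dt`, the start value, and the bucket path are hypothetical.

```python
def projection_properties(column: str, start: str, location_template: str) -> dict:
    """Table properties for an hourly date-type projected partition column."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.format": "yyyy/MM/dd/HH",
        f"projection.{column}.range": f"{start},NOW",
        f"projection.{column}.interval": "1",
        f"projection.{column}.interval.unit": "HOURS",
        "storage.location.template": location_template,
    }


def to_tblproperties(props: dict) -> str:
    """Render the properties as a TBLPROPERTIES clause for a CREATE TABLE statement."""
    body = ",\n".join(f"  '{k}' = '{v}'" for k, v in props.items())
    return f"TBLPROPERTIES (\n{body}\n)"


# Hypothetical column name, start time, and S3 location.
props = projection_properties("dt", "2024/01/01/00", "s3://my-lake/events/${dt}/")
print(to_tblproperties(props))
```

Because the range ends at NOW, each new hourly prefix becomes queryable as soon as data lands, with no ADD PARTITION or MSCK REPAIR TABLE step.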

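Q27. You are diagnosing a Redshift query performance problem. The EXPLAIN output shows the DS_DIST_ALL_NONE distribution step. What does it mean?

A) Both tables use KEY distribution, so the join is efficient B) One table uses ALL distribution and is copied to every node, so the join is efficient C) EVEN distribution requires data redistribution D) There is no distribution key, so a broadcast join is impossible

Answer: B

Explanation: DS_DIST_ALL_NONE means that one table in the join uses ALL distribution (a copy on every node), so no data redistribution is needed. This is a well-performing join pattern. DS_DIST_ALL_INNER, by contrast, indicates a redistribution cost.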

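Q28. You want to visualize a large dataset (hundreds of millions of rows) quickly in QuickSight. The data is updated daily. What is the optimal configuration?

A) Connect directly to Redshift in Direct Query mode B) Import the data into SPICE and schedule refreshes C) Query S3 data directly through Athena D) Use QuickSight Paginated Reports

Answer: B

Explanation: SPICE (Super-fast Parallel In-memory Calculation Engine) is QuickSight's in-memory store, supporting very fast queries and visualization, even over hundreds of millions of rows. Scheduled refreshes keep the data up to date. Direct Query reflects live data but can respond slowly on large datasets.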

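Q29. In Redshift Workload Management (WLM), long-running queries are blocking short queries. What is the best way to solve this?

A) Add nodes to the cluster B) Enable Short Query Acceleration (SQA) C) Give all queries the same priority D) Schedule long-running queries to run only at night

Answer: B

Explanation: Redshift Short Query Acceleration (SQA) uses machine learning to predict which queries will be short and runs them in a dedicated queue with priority. It optimizes mixed short/long workloads without configuring separate WLM queues. Switching to automatic WLM is a further option to consider.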

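Q30. You need to implement Row-Level Security in QuickSight so that sales representatives see only the data for their own region. What is the correct approach?

A) Create a separate dataset for each sales representative B) Apply QuickSight RLS (Row-Level Security) rules to the dataset C) Restrict S3 data access with IAM policies D) Filter data per user with Athena views

Answer: B

Explanation: QuickSight RLS attaches a rules file (a CSV or another dataset) that maps users or groups to filter values. Data is filtered automatically according to the signed-in user. It is an efficient way to show each user only their own data from a single dataset.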

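Q31. What is the main scenario for using Athena Federated Query?

A) Joining an S3 data lake with heterogeneous sources such as RDS and DynamoDB in a single SQL query B) Querying Redshift and S3 data together C) Processing S3 buckets from multiple AWS accounts in a single query D) Analyzing real-time streaming data with SQL

Answer: A

Explanation: Athena Federated Query uses Lambda-based data source connectors to run SQL against sources other than S3 (RDS, DynamoDB, ElastiCache, CloudWatch, Redis, and others). Several heterogeneous sources can be joined in one query, which is useful for integrated analysis.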

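Q32. You use OpenSearch Service with OpenSearch Dashboards (instead of Kibana). How do you optimize indexing performance for high-volume log data?

A) Reduce the Refresh Interval to 1 second B) Use the Bulk API and increase the Refresh Interval C) Call the Index API separately for each document D) Maximize the number of shards

Answer: B

Explanation: For bulk indexing in OpenSearch/Elasticsearch, the Bulk API processes many documents per request and reduces network overhead. Increasing the Refresh Interval (from the default 1 second to 30 seconds or more) lowers the frequency of segment creation and substantially improves indexing throughput. Another option is to call Refresh manually after the initial bulk load.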

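Domain 5: Data Security

Q33. In an S3 data lake, only certain departments should be able to access certain tables. Between Lake Formation and S3 bucket policies, which is the correct approach?

A) S3 bucket policies alone provide fine-grained data access control B) Lake Formation data permissions for table/column/row-level access control C) IAM policies to control Glue catalog access D) S3 Access Points for per-prefix access control

Answer: B

Explanation: Lake Formation integrates with the Glue Data Catalog to provide fine-grained access control over logical tables, columns, and rows. Permissions are granted on the logical data structure rather than on physical S3 paths. S3 bucket policies control access only at the object or prefix level, so column- and row-level control is not possible.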

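Q34. You need to enable encryption of data at rest in Kinesis Data Streams using an AWS managed key. How do you do it?

A) Enable S3 server-side encryption (SSE-S3) B) Enable server-side encryption (SSE) on the Kinesis stream and select the aws/kinesis KMS key C) Create a KMS customer managed key (CMK) and attach it to Kinesis D) Send data with client-side encryption

Answer: B

Explanation: Kinesis Data Streams supports server-side encryption (SSE), which encrypts records at rest inside the stream. Selecting the AWS managed key (aws/kinesis) encrypts all stream data with KMS without further setup. A customer managed key (CMK) allows finer control through key policies but needs additional configuration. Note that SSE covers data at rest; traffic to the Kinesis endpoints is protected in transit by TLS.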

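Q35. A Redshift cluster is deployed in a VPC and loads data from S3 with the COPY command. How do you keep that traffic off the internet?

A) Assign a public IP to the Redshift cluster B) Access S3 through a NAT Gateway C) Configure a VPC endpoint for S3 (gateway endpoint) D) Connect to the AWS network with Direct Connect

Answer: C

Explanation: With an S3 gateway endpoint in the VPC, and Enhanced VPC Routing enabled on the cluster so that COPY/UNLOAD traffic is routed through the VPC, Redshift reaches S3 over the AWS private network. S3 is accessible without an internet gateway or NAT gateway, which improves security and also reduces data transfer cost.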

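Q36. You need to encrypt the metadata in the AWS Glue catalog (table definitions, partition information). What is the correct approach?

A) It is handled automatically by S3 server-side encryption B) Enable metadata encryption with a KMS key in the Glue Data Catalog settings C) Attach a KMS key directly to a Glue catalog table D) Control metadata access with Lake Formation

Answer: B

Explanation: Metadata encryption is enabled in the Glue Data Catalog settings, where you choose the KMS key that encrypts table definitions and partition information. Glue security configurations are a separate feature: they define encryption for the data that ETL jobs and crawlers write to S3 and CloudWatch Logs, and for job bookmarks.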

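Q37. A data analytics team needs access to the production S3 data lake, but personally identifiable information (PII) must be masked. What is an efficient solution?

A) Maintain a separate S3 bucket copy with PII removed for the analytics team B) Lake Formation column-level security + dynamic masking with S3 Object Lambda C) Deny the analytics team's IAM role access to the whole bucket, then allow specific prefixes only D) Generate a PII-free dataset daily with Glue ETL

Answer: B

Explanation: Lake Formation column-level security can block PII columns, and S3 Object Lambda can mask PII dynamically as data is read. Neither requires copying the source data, and the masking logic is managed centrally.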

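Q38. For compliance audits, a company must record all query activity on data stored in Redshift. What is the correct setup?

A) Record Redshift API calls with CloudTrail B) Enable Redshift Audit Logging to S3 C) Record Redshift network traffic with VPC Flow Logs D) Record Redshift connections with CloudWatch Logs

Answer: B

Explanation: Redshift audit logging writes the connection log, the user activity log, and the user log to S3. The user activity log records every SQL query that is run (it requires the enable_user_activity_logging parameter to be turned on). Together these provide the who/what/when information needed for compliance audits. CloudTrail records only Redshift API calls.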

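Advanced Questions (Composite Scenarios)

Q39. An e-commerce company wants to analyze purchase events in real time. Requirements: 1) 100,000 events per second, 2) real-time fraud detection (within 100 ms), 3) daily purchase reports, 4) 3-year data retention. Choose the architecture.

A) Kinesis Firehose → S3 → Athena (meets all requirements) B) Kinesis Data Streams → Lambda (real-time fraud detection) + Firehose → S3 (batch) + Glue + Redshift (reports) C) MSK → Spark Streaming → DynamoDB D) SQS → Lambda → RDS → QuickSight

Answer: B

Explanation: For real-time processing (100 ms), Lambda processes records from Kinesis Data Streams immediately, while Firehose stores the same data in S3. For batch processing, Glue ETL cleans the S3 data and loads it into Redshift to produce the daily reports. The 3 years of data are retained cost-effectively with S3 Intelligent-Tiering.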

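Q40. A healthcare organization wants to analyze EMR (electronic medical record) data on AWS. Which security configurations are required for HIPAA compliance? (Select 2)

A) Enable S3 server-side encryption (KMS) B) Enforce encryption in transit (SSL/TLS) for Redshift C) Distribute data with CloudFront D) Store data in a public S3 bucket E) Use an unencrypted Kinesis stream

Answer: A, B

Explanation: HIPAA compliance requires encryption at rest (S3 SSE-KMS) and encryption in transit (SSL/TLS). Deployment inside a VPC, access logging, and audit trails are also needed, but among these options A and B are the essentials. CloudFront is a content delivery service with no direct bearing on the HIPAA requirements, and public buckets and unencrypted streams are violations.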

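Q41. In a data lake migration project, you must move petabyte-scale data from on-premises Hadoop HDFS to S3. The existing Hive metastore must also be migrated. What is the correct approach?

A) Copy data with S3DistCp, regenerate metadata with a Glue crawler B) Move data physically with the Snow Family, convert metadata with SCT C) Copy data with AWS DataSync, migrate the Hive metastore by importing it into the Glue catalog D) Migrate everything with DMS

Answer: C

Explanation: AWS DataSync transfers large volumes of data from on-premises HDFS to S3 efficiently (parallel transfer, automatic integrity verification). The Glue catalog is compatible with the Hive metastore, and existing metadata can be migrated by importing it. EMR can run the existing Hive queries unchanged.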

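Q42. To reduce Athena query cost, you used CTAS (Create Table As Select) to convert the original CSV to Parquet and partitioned the result. What can further improve performance?

A) Minimize file size to increase the number of files B) Optimize file size to 128 MB to 1 GB while keeping a splittable format C) Disable compression to speed up file reads D) Subdivide into smaller partitions

Answer: B

Explanation: Athena splits files so that it can read them in parallel. Parquet files are splittable by default, but they should be kept at an optimal size (128 MB to 1 GB) for maximum parallelism. Files that are too small add overhead, and files that are too large reduce parallelism. Snappy or ZSTD compression is recommended alongside.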

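Q43. A company's data engineering team has enabled Glue Job Bookmarks. What is the main purpose of this feature?

A) Tracking the run history of Glue jobs B) Tracking already-processed data to implement incremental processing C) Optimizing the cost of Glue jobs D) Storing data quality checkpoints

Answer: B

Explanation: Glue Job Bookmarks track the data processed in previous runs. On the next run, only newly added data is processed, which prevents duplicate processing. For S3 sources, the decision is based on object modification time and file name. This enables efficient incremental ETL pipelines without reprocessing the full dataset.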

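Q44. In a Kinesis Data Analytics (Apache Flink) application, you want to enrich streaming events using an external database as reference data. What is the recommended method?

A) Load all reference data into Flink memory B) Look up the external database asynchronously with Flink Async I/O C) Send the reference data to a Kinesis stream and join D) Enrich with a Lambda function and send the result back to Kinesis

Answer: B

Explanation: Flink Async I/O performs lookups against external stores (Redis, DynamoDB, and so on) asynchronously, minimizing added latency in the stream. With synchronous lookups, throughput drops sharply while the operator waits on the external system; Async I/O keeps many requests in flight at once. Local caching, together with the RocksDB state backend, is also worth considering.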

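Q45. A corporate BI team wants to add QuickSight embedded analytics to a web application. External users (who are not AWS users) must be able to view the dashboards. What is the correct approach?

A) Create a standard QuickSight user account for each external user B) QuickSight embedding URL API + anonymous embedding or Reader sessions C) Share a public QuickSight dashboard link D) Export dashboard images to S3 and display them on the web

Answer: B

Explanation: With the QuickSight embedded analytics API, external users can view dashboards without a QuickSight account. You generate an anonymous embedding (unauthenticated access) or a Reader session (a time-limited URL) and embed it in the web application. A usage-based pricing model applies.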

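Q46. In a real-time data pipeline, Kinesis Data Firehose delivers to S3 and its Lambda transformation function fails. What happens to the failed records?

A) All records are deleted B) Failed records are stored separately under the configured S3 prefix (the processing-failed prefix) C) Firehose retries automatically and waits until it succeeds D) Failed records are returned to the original source

Answer: B

Explanation: Records that fail in a Firehose Lambda transformation are written to an S3 path under the processing-failed prefix. Successful records go to the configured S3 destination, and failed records go to the separate path, where they can be reprocessed later or used for error analysis.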

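Q47. You want to build a centralized data lake in a multi-account AWS environment. Data in each business unit's account must be queryable with Athena from the central data lake account. How do you use Lake Formation?

A) Copy each account's S3 data to the central account B) Use Lake Formation cross-account data sharing C) Set up S3 cross-account replication D) Consolidate data access with AWS Organizations SCPs

Answer: B

Explanation: Lake Formation supports cross-account data sharing. The account that owns the data grants Lake Formation data permissions to IAM roles or users in the central account. The central account can then query the other accounts' Data Catalog resources with Athena or Redshift Spectrum. This achieves centralized governance without moving data.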

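Q48. What is the main benefit of using Apache Hudi on an EMR cluster?

A) Automatically converts Hive queries to Spark B) Supports UPSERT/DELETE and incremental processing on S3 data C) Auto-scales the EMR cluster D) Automatically synchronizes data between HDFS and S3

Answer: B

Explanation: Apache Hudi supports UPSERT (insert/update) and DELETE operations on data lake storage such as S3, overcoming the immutability limitation of traditional data lakes. It offers Copy-on-Write and Merge-on-Read table types, and incremental queries process only the data that changed.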

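Q49. What problem does AWS Glue Elastic Views solve?

A) Out-of-memory problems in Glue ETL jobs B) Provides materialized views that replicate and combine data from multiple sources in near real time C) Fixes schema detection errors in Glue crawlers D) Automatically caches Athena query results

Answer: B

Explanation: AWS Glue Elastic Views automatically replicates source data from DynamoDB, Aurora, RDS, and similar stores to targets such as OpenSearch, S3, and Redshift, keeping materialized views up to date. Views are defined in SQL, so data integration is achieved without building complex ETL pipelines.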

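Q50. What is the main reason to apply S3 Intelligent-Tiering to a data lake?

A) Automatically encrypts all S3 operations B) Automatically monitors access patterns and moves objects to the most cost-effective storage tier C) Creates automatic backups of S3 data D) Automates global data replication

Answer: B

Explanation: S3 Intelligent-Tiering monitors access frequency and moves objects automatically: frequently accessed data stays in Frequent Access, data not accessed for 30 days moves to Infrequent Access, and data not accessed for 90 days moves to Archive Instant Access. It is especially suited to data lakes with unpredictable access patterns, and the only added charge is for monitoring.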

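Q51. A company's data team runs the VACUUM command in Redshift. What is the main purpose of VACUUM?

A) Closing unneeded database connections B) Reclaiming deleted/updated rows and re-sorting data in sort key order C) Clearing the Redshift cluster cache D) Updating table statistics

Answer: B

Explanation: Redshift VACUUM physically removes rows that DELETE and UPDATE only marked for deletion, and re-sorts data in sort key order. Regular vacuuming reclaims storage and maintains query performance. ANALYZE is the command that updates table statistics. Redshift Serverless and recent versions support automatic VACUUM.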

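Q52. You extended the retention period of a Kinesis data stream to 7 days. You want to reprocess data starting from a specific timestamp within that period. How do you do it?

A) Create a new consumer group and it reprocesses from the beginning automatically B) Start reading from a specific time with the AT_TIMESTAMP option of the GetShardIterator API C) Reprocess from S3 through Firehose D) Copy the retained data to a new stream

Answer: B

Explanation: Setting ShardIteratorType to AT_TIMESTAMP in the Kinesis GetShardIterator API and supplying the desired timestamp lets you read from that point onward. If you use the KCL (Kinesis Client Library), the initial position can be set to a timestamp.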
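The request described above has four fields. As a minimal sketch, the helper below builds that parameter set without calling AWS; the stream name and shard ID in the usage line are hypothetical, and requiring a timezone-aware datetime is a precaution added here rather than an API rule.

```python
from datetime import datetime, timezone


def iterator_request(stream: str, shard_id: str, since: datetime) -> dict:
    """Build GetShardIterator parameters that start reading at `since`."""
    if since.tzinfo is None:
        # A naive datetime would be interpreted in local time, which makes
        # replays ambiguous; this sketch insists on an explicit timezone.
        raise ValueError("use a timezone-aware datetime")
    return {
        "StreamName": stream,
        "ShardId": shard_id,
        "ShardIteratorType": "AT_TIMESTAMP",
        "Timestamp": since,
    }


# Hypothetical stream name and shard ID.
start = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
print(iterator_request("orders-stream", "shardId-000000000000", start))
```

One such request is needed per shard, since each shard has its own iterator.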

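Q53. What is the main benefit of sharing one Glue catalog across Athena, Redshift Spectrum, and EMR?

A) A single metadata registry that shares schemas and keeps them consistent across analytics engines B) Faster data processing C) Lower S3 storage cost D) Automatic data quality validation

Answer: A

Explanation: The Glue Data Catalog is AWS's central metadata store. Because Athena, Redshift Spectrum, and EMR all reference the same catalog, table definitions, partition information, and schemas are managed in one place. Schema changes are reflected in every engine automatically, which keeps them consistent.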

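Q54. Data quality problems occur frequently in the data lake. Which AWS service automates quality checks in the data pipeline?

A) Monitor S3 file sizes with CloudWatch Alarms B) Validate automatically with AWS Glue Data Quality rules C) Run manual validation scripts with Lambda D) Notify the data team through SNS for manual review

Answer: B

Explanation: AWS Glue Data Quality lets you define quality rules on a dataset (completeness, uniqueness, referential integrity, and so on) and validates them automatically during ETL jobs. It reports a quality score and per-rule pass/fail results, and integrates with CloudWatch to send alerts when quality degrades.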

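Q55. What is the main reason to choose EMR on EKS?

A) It is always cheaper than EMR on EC2 B) It runs Spark jobs on existing Kubernetes infrastructure and consolidates container-based workloads C) HDFS storage can be configured on the EKS cluster D) It is dedicated to GPU-based deep learning

Answer: B

Explanation: EMR on EKS runs Apache Spark on an existing Amazon EKS cluster. Organizations already using Kubernetes can manage data processing workloads together with their other workloads, without a separate EMR cluster. Benefits include container isolation, resource sharing, and integration with existing Kubernetes tooling (Helm, Argo Workflows, and others).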

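Q56. What is the benefit of enabling Redshift Concurrency Scaling?

A) Adds cluster nodes automatically for permanent expansion B) Automatically provisions additional cluster capacity at peak query concurrency to maintain SLAs C) Upgrades the Redshift cluster automatically D) Schedules nightly batch processing automatically

Answer: B

Explanation: When the number of concurrent queries rises and the main cluster becomes saturated, Redshift Concurrency Scaling automatically adds transient cluster capacity. Each cluster accrues one hour of free Concurrency Scaling credit for every 24 hours the main cluster runs; usage beyond the accrued credits is billed per second. Users see consistent query performance without queuing delays.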

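Q57. When migrating data from Aurora MySQL to Amazon Redshift with DMS, is SCT (Schema Conversion Tool) needed?

A) No, DMS handles schema conversion automatically B) Yes, Aurora MySQL and Redshift are different database engines, so the schema must be converted with SCT C) No, Aurora and Redshift are both AWS services, so it is unnecessary D) SCT is needed only for on-premises databases

Answer: B

Explanation: DMS moves the data, and SCT converts the schema (DDL, procedures, functions, and so on). A heterogeneous migration from Aurora MySQL (OLTP) to Redshift (OLAP) requires converting data types and table structures, so SCT should be run first to generate Redshift-compatible DDL.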

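Q58. You want to share Athena query results with other teams and charge costs back to each team. What is the correct approach?

A) Create a separate AWS account for each team B) Use Athena workgroups to isolate queries per team, track cost, and control data usage C) Monitor per-team budgets with CloudWatch billing alerts D) Track query cost with IAM tags

Answer: B

Explanation: Athena workgroups isolate queries by team or project and track each workgroup's query cost and data scanned separately. Per workgroup you can set a per-query data scan limit, specify the result location, and publish CloudWatch metrics. Combined with cost allocation tags, this supports per-team chargeback.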

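Q59. Which scenario best fits the QuickSight ML Insights feature?

A) Training and deploying machine learning models directly B) Anomaly detection, forecasting, and automatic narratives for time-series data C) Displaying SageMaker model results in QuickSight D) Analyzing A/B test results

Answer: B

Explanation: QuickSight ML Insights provides time-series Anomaly Detection, Forecasting (for example, sales forecasts), and Auto-narratives without code. It uses the built-in Random Cut Forest algorithm internally, so business users can obtain ML-based insights without a data scientist.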

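Q60. What is the Blueprint feature of AWS Lake Formation?

A) A template for data lake security policies B) Automation of ingestion workflows from common data sources (RDS, DMS, and so on) into the S3 data lake C) Automatic generation of Glue ETL job code D) A QuickSight dashboard template

Answer: B

Explanation: Lake Formation Blueprints provide preconfigured workflows for common ingestion patterns (database snapshot, incremental database, log files). Internally they generate Glue crawlers and ETL jobs automatically, reducing the complexity of building a data lake.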

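Q61. A data engineer must optimize a Glue ETL job that converts JSON data in S3 to Parquet. It currently processes the data as a single partition. How do you shorten the processing time?

A) Switch the Glue job to Python Shell B) Group small files with the groupFiles option and increase the number of workers C) Load the data into Redshift first, then unload it as Parquet D) Convert in parallel with a chain of Lambda functions

Answer: B

Explanation: The groupFiles setting in Glue ETL groups small files logically to optimize the number of processing partitions, and raising the number of workers (DPUs) increases parallelism. Spark processes data in parallel partitions internally, so enough DPUs must be allocated for that parallelism to be used.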

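Q62. When analyzing a long-running query in Redshift, which method identifies where the bottleneck is?

A) Check CPUUtilization in CloudWatch metrics B) Analyze the STL_QUERY, STL_EXPLAIN, and SVL_QUERY_SUMMARY system tables C) Search for the query ID in the Redshift audit logs D) Trace API calls with CloudTrail

Answer: B

Explanation: Query performance is analyzed through the Redshift system tables. STL_QUERY holds information on completed queries, STL_EXPLAIN holds query plans, and SVL_QUERY_SUMMARY provides execution statistics for each step. Use the EXPLAIN command to inspect the plan and evaluate data distribution, join strategy, and sort key effectiveness.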

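Q63. In the Buffering Hints of Kinesis Data Firehose, how do Buffer Size and Buffer Interval work together?

A) Data is delivered to S3 only when both conditions are met B) Data is delivered to S3 when either condition is met, whichever comes first C) Delivery is based on Buffer Size only D) Delivery is based on Buffer Interval only

Answer: B

Explanation: Firehose delivers data to S3 as soon as either the Buffer Size (MB) or the Buffer Interval (seconds) condition is met. For example, with Buffer Size 5 MB and Buffer Interval 60 seconds, the buffer is flushed to S3 when 5 MB has accumulated or 60 seconds have elapsed, whichever happens first.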
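The "whichever comes first" rule can be imitated with a small simulation. This is a toy model written for illustration, not Firehose's actual implementation: it assumes the interval timer starts when the first record enters an empty buffer, and it ignores whatever remains buffered when the input ends.

```python
def flush_points(events, size_limit_bytes: int, interval_seconds: int) -> list:
    """Simulate buffer flushes.

    `events` is a time-ordered iterable of (arrival_second, record_bytes).
    Returns a list of (flush_second, reason) tuples.
    """
    flushes = []
    buffered = 0
    window_start = None  # arrival time of the first record in the current buffer
    for t, size in events:
        # Interval condition: the timer expired before this record arrived.
        if window_start is not None and t - window_start >= interval_seconds:
            flushes.append((window_start + interval_seconds, "interval"))
            buffered, window_start = 0, None
        if window_start is None:
            window_start = t
        buffered += size
        # Size condition: the buffer reached the size limit.
        if buffered >= size_limit_bytes:
            flushes.append((t, "size"))
            buffered, window_start = 0, None
    return flushes


# 5 MB size limit, 60 second interval, as in the example above.
events = [(0, 3_000_000), (10, 3_000_000), (20, 1_000_000), (100, 1_000_000)]
print(flush_points(events, 5_000_000, 60))
```

In this run the first flush is triggered by size (6 MB buffered at second 10) and the second by the interval (a 1 MB buffer opened at second 20 flushes at second 80).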

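Q64. What information does DynamoDB Streams capture when enabled?

A) All read operations on the table B) Write changes to the table (Put, Update, Delete), retained for 24 hours C) Table scan operations only D) GSI/LSI creation and deletion events

Answer: B

Explanation: DynamoDB Streams captures item-level changes to a table (INSERT, MODIFY, REMOVE). Before and after images of the changed items are retained for up to 24 hours. Combined with Lambda triggers, it is used for event-driven architectures, real-time aggregation, and synchronization with other systems.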

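Q65. An enterprise-wide analytics platform must analyze diverse data sources (RDS, DynamoDB, S3, Redshift) from a single platform. What is the most comprehensive solution?

A) Move all data into Redshift and analyze it there B) AWS Glue catalog + Athena Federated Query + Lake Formation unified governance C) Use a separate analytics tool for each source D) Consolidate all data into DynamoDB

Answer: B

Explanation: The most comprehensive solution uses the AWS Glue catalog as the central metadata store, Athena Federated Query to run a single SQL statement across heterogeneous sources, and Lake Formation for unified security and governance. Data is analyzed where it lives, without being moved, and a single access control policy applies.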

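Study Resources

AWS DAS-C01 exam guide
AWS data analytics services documentation
AWS Skill Builder official DAS-C01 learning path
AWS big data whitepapers and architecture best practices

This practice exam was created for study purposes. It may differ from the actual exam questions.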

AWS Data Analytics Specialty (DAS-C01) Practice Exam — 65 Questions

Exam Overview
Domain Breakdown
AWS Data Analytics Services Ecosystem
Practice Questions
Study Resources

Exam Overview

Item	Details
Duration	180 minutes
Questions	65
Passing Score	750 / 1000
Question Types	Single answer, Multiple answer
Exam Cost	USD 300

Domain Breakdown

Domain	Weight
Domain 1: Collection	18%
Domain 2: Storage and Data Management	22%
Domain 3: Processing	24%
Domain 4: Analysis and Visualization	18%
Domain 5: Data Security	18%

AWS Data Analytics Services Ecosystem

[Data Sources]
  ├── Streaming: Kinesis Data Streams → Kinesis Data Analytics (Flink)
  │                                   → Kinesis Data Firehose → S3/Redshift/OpenSearch
  ├── Batch: DMS, Snow Family, Direct Connect
  └── SaaS: AppFlow

[Storage]
  ├── Data Lake: S3 + Lake Formation
  ├── Data Warehouse: Redshift (RA3, Spectrum)
  ├── NoSQL: DynamoDB
  └── Search: OpenSearch Service

[Processing]
  ├── Large-scale Batch: EMR (Spark, Hive, Flink)
  ├── Serverless ETL: AWS Glue
  └── Lightweight Transforms: Lambda

[Analysis & Visualization]
  ├── Serverless Query: Athena
  ├── BI Dashboards: QuickSight (SPICE)
  └── Exploratory Analysis: OpenSearch Dashboards

[Security]
  ├── Access Control: Lake Formation, IAM
  ├── Encryption: KMS, SSE
  └── Network: VPC Endpoints, PrivateLink

Practice Questions

Domain 1: Collection

Q1. You need to collect IoT sensor data at 50,000 events per second. Data must be ordered and replayable for 24 hours. Which service is most appropriate?

A) Kinesis Data Firehose B) Kinesis Data Streams C) SQS FIFO Queue D) Amazon MSK

Answer: B

Explanation: Kinesis Data Streams provides partition-key-based ordering (records that share a partition key keep their order within a shard), configurable retention from the default 24 hours up to 365 days, and replay within that retention period. Firehose does not support replay, and SQS FIFO has throughput limits.
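For a sense of scale, the sketch below estimates how many provisioned-mode shards a 50,000 events-per-second workload needs, using the per-shard write limits of 1,000 records per second and 1 MB per second. The average record sizes are assumptions, since the question does not state one, and the megabyte is treated as decimal for simplicity.

```python
import math

# Per-shard write limits for provisioned-mode Kinesis Data Streams.
RECORDS_PER_SEC_PER_SHARD = 1_000
BYTES_PER_SEC_PER_SHARD = 1_000_000  # 1 MB/s, decimal megabyte for simplicity


def min_shards(records_per_sec: int, avg_record_bytes: int) -> int:
    """Smallest shard count that satisfies both the record and byte limits."""
    by_count = math.ceil(records_per_sec / RECORDS_PER_SEC_PER_SHARD)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / BYTES_PER_SEC_PER_SHARD)
    return max(by_count, by_bytes)


print(min_shards(50_000, 200))    # small records: the record-count limit dominates
print(min_shards(50_000, 2_000))  # larger records: the byte limit dominates
```

With 200-byte records the stream needs at least 50 shards; at 2 KB per record the byte limit raises that to 100. These are minimums that assume perfectly even key distribution.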

Q2. Your Kinesis Data Streams consumers are hitting the shared read throughput limit. Multiple consumer applications read from the same stream. How do you improve read throughput per consumer?

A) Increase the number of shards B) Enable Enhanced Fan-Out C) Switch to Provisioned capacity mode D) Reduce GetRecords API call frequency

Answer: B

Explanation: Enhanced Fan-Out provides each registered consumer with a dedicated 2 MB/s throughput per shard. Standard GetRecords shares the 2 MB/s per shard across all consumers, while Enhanced Fan-Out gives each consumer its own dedicated pipe.
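The difference between shared and dedicated throughput is a one-line calculation, sketched here with the 2 MB/s per-shard figure from the explanation; the consumer count of 4 is an arbitrary example.

```python
SHARD_READ_MBPS = 2.0  # read throughput per shard


def per_consumer_mbps(consumers: int, enhanced_fan_out: bool) -> float:
    """Read throughput each consumer can expect from one shard."""
    if consumers < 1:
        raise ValueError("at least one consumer is required")
    # Shared polling splits the shard's 2 MB/s; Enhanced Fan-Out does not.
    return SHARD_READ_MBPS if enhanced_fan_out else SHARD_READ_MBPS / consumers


print(per_consumer_mbps(4, enhanced_fan_out=False))
print(per_consumer_mbps(4, enhanced_fan_out=True))
```

Four polling consumers get about 0.5 MB/s each from a shard, while four Enhanced Fan-Out consumers each keep the full 2 MB/s.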

Q3. You want to use Kinesis Data Firehose to convert JSON records to Parquet and route them to different S3 prefixes based on a field value. What is the correct configuration?

A) Lambda transformation + prefix expressions B) Format Conversion + Dynamic Partitioning C) Integrate with Glue ETL D) Use S3 Object Lambda

Answer: B

Explanation: Kinesis Data Firehose Format Conversion transforms JSON to Parquet/ORC using the Glue Data Catalog schema. Dynamic Partitioning extracts partition keys from each record, either with inline parsing (jq expressions) or with a Lambda function, and uses them to build dynamic S3 prefixes. Both features are configured natively in Firehose.

Q4. You need to implement continuous CDC replication from an on-premises Oracle database to Amazon Redshift. What is the most appropriate approach?

A) Use AWS Glue ETL to periodically copy full tables B) Use AWS DMS with a continuous replication task C) Use Kinesis Data Streams + Lambda D) Initial load with Snowball, then Direct Connect

Answer: B

Explanation: AWS DMS reads transaction logs from the source database to implement CDC. After an initial full load, the ongoing replication task continuously synchronizes changes. For heterogeneous migration (Oracle to Redshift), use SCT (Schema Conversion Tool) alongside DMS to convert the schema.

Q5. You need to transfer petabytes of data from on-premises to S3. Internet bandwidth is 1 Gbps and transfer would take months. What is the most cost-effective and fast approach?

A) Build an AWS Direct Connect dedicated line B) Order multiple Snowball Edge Storage Optimized devices C) Use AWS Snowmobile D) Enable S3 Transfer Acceleration

Answer: C

Explanation: At 1 Gbps, a single petabyte takes roughly 93 days to transfer even at full line rate, so internet or Direct Connect transfers at this scale take too long and cost too much. Snowmobile holds up to 100 PB per unit and is aimed at very large datasets (AWS guidance is roughly 10 PB or more from a single location). Snowball Edge Storage Optimized holds up to 80 TB per device, so petabyte-scale transfers require many devices.
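The "months" in the question can be checked with simple arithmetic. The sketch below converts a data volume and link speed into days, using decimal units (1 PB = 10^15 bytes); the `utilization` parameter is an added assumption to model a link that cannot be fully saturated.

```python
def transfer_days(petabytes: float, gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move `petabytes` over a `gbps` link at the given utilization."""
    bits = petabytes * 1e15 * 8
    seconds = bits / (gbps * 1e9 * utilization)
    return seconds / 86_400


print(round(transfer_days(1, 1), 1))        # 1 PB at a fully used 1 Gbps link
print(round(transfer_days(5, 1, 0.8), 1))   # 5 PB at 80% utilization
```

One petabyte alone takes about 92.6 days at full line rate, which is why physical transfer devices win at this scale.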

Q6. You are choosing between Amazon MSK and Kinesis Data Streams. Your team has existing applications built on Apache Kafka APIs and requires long message retention (up to 1 year). Which service is appropriate?

A) Kinesis Data Streams — better scalability B) Amazon MSK — Kafka compatibility and long retention support C) Kinesis Data Firehose — fully managed D) SQS — longer message retention

Answer: B

Explanation: Amazon MSK is a fully managed Apache Kafka service, enabling existing Kafka client code to work without modification. Retention can be set to unlimited. Kinesis supports up to 365 days retention but is not compatible with the Kafka API.

Q7. ProvisionedThroughputExceededException errors occur frequently on your Kinesis Data Streams. Analysis of the partition key distribution shows concentration on specific keys. What is the solution?

A) Double the number of shards B) Add a random prefix to the partition key for uniform shard distribution C) Enable Enhanced Fan-Out D) Disable KPL aggregation

Answer: B

Explanation: Hot shard problems (write concentration on specific partition keys) are resolved by distributing the partition key. Prepending a random prefix (e.g., a number in range 0 to N) distributes records across multiple shards. On the read side, you must fan-out across all shards to reconstruct the full dataset.
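
The salting idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production producer: the bucket count `SALT_BUCKETS`, the `<bucket>-<key>` format, and the helper names are assumptions made for the example. The MD5 step reflects how Kinesis maps a partition key to a 128-bit hash key.

```python
import hashlib
import random
from typing import List

SALT_BUCKETS = 8  # hypothetical N; size it relative to the shard count


def salted_partition_key(key: str, buckets: int = SALT_BUCKETS) -> str:
    """Prepend a random bucket number so one hot key spreads over many shards."""
    return f"{random.randrange(buckets)}-{key}"


def hash_key(partition_key: str) -> int:
    """Kinesis maps a partition key to a 128-bit integer using MD5."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def all_salted_keys(key: str, buckets: int = SALT_BUCKETS) -> List[str]:
    """Every salted variant a reader must merge to rebuild one logical key."""
    return [f"{i}-{key}" for i in range(buckets)]


if __name__ == "__main__":
    for variant in all_salted_keys("device-42"):
        print(variant, hash_key(variant))
```

The trade-off is that ordering is now guaranteed only within each salted key, so consumers that need per-device ordering must re-sort after merging the variants.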

Q8. In Kinesis Data Analytics (Apache Flink) for real-time anomaly detection, you need to compare current values against the past 30 minutes. Which Flink windowing feature should you use?

A) Tumbling Window B) Sliding Window C) Session Window D) Global Window

Answer: B

Explanation: A Sliding Window has a fixed size that moves forward at a defined interval. For example, a 30-minute window sliding every 1 minute always evaluates the most recent 30 minutes of data. Tumbling Windows are non-overlapping fixed intervals.
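
To make the overlap concrete, the sketch below computes which sliding windows a single event belongs to. It is plain Python rather than Flink code, and the function name and millisecond timestamps are assumptions for illustration; the arithmetic follows the usual start-aligned-to-slide convention.

```python
from typing import List, Tuple

MINUTE_MS = 60_000


def assign_sliding_windows(ts_ms: int, size_ms: int, slide_ms: int) -> List[Tuple[int, int]]:
    """Return every [start, end) window that contains the event timestamp."""
    last_start = ts_ms - (ts_ms % slide_ms)  # latest window start at or before the event
    windows = []
    start = last_start
    while start > ts_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= slide_ms
    return windows


# An event at minute 95 (plus a few seconds), 30-minute windows sliding every minute.
event_ts = 95 * MINUTE_MS + 12_345
windows = assign_sliding_windows(event_ts, size_ms=30 * MINUTE_MS, slide_ms=MINUTE_MS)
print(len(windows))
```

With a 30-minute size and a 1-minute slide, each event lands in 30 overlapping windows. Setting the slide equal to the size reduces this to one window per event, which is the tumbling case.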

Domain 2: Storage and Data Management

Q9. You want to optimize Athena query performance over a partitioned S3 data lake. Data is appended daily and queries typically filter by date range and region code. What is the optimal partitioning strategy?

A) Single partition: year/month/day B) Composite partition: year/month/day/region C) No partitioning, use Athena Partition Projection D) Store all data under a single prefix with compression

Answer: B

Explanation: Aligning the partitioning scheme with query patterns minimizes data scanned by Athena. A year/month/day/region hierarchy enables partition pruning for both filter conditions simultaneously, maximizing performance.

Q10. In AWS Lake Formation, you need to restrict specific columns in a table to be visible only to specific IAM roles. What is the correct approach?

A) Restrict access via S3 bucket policy on specific prefixes B) Apply Glue catalog resource-based policies at the table level C) Configure Lake Formation column-level security D) Set query filters per Athena workgroup

Answer: C

Explanation: Lake Formation provides fine-grained access control at the table, column, and row level. Column-level security restricts which columns are visible to specific IAM roles or users. S3 bucket policies operate only at the bucket, prefix, and object level and cannot enforce column-level restrictions.

Q11. Redshift large-table join performance is degrading. Both tables have billions of rows and are frequently joined. What is the optimal distribution style combination?

A) Both tables: EVEN distribution B) Both tables: ALL distribution C) Large fact table: KEY distribution (join key); small dimension table: ALL distribution D) Both tables: AUTO distribution

Answer: C

Explanation: When large tables are joined on the same key, KEY distribution on that key co-locates matching rows on the same node, enabling local joins without network redistribution. A small dimension table with ALL distribution is stored in full on every node, so joining it to the large table also needs no redistribution at query time.

Q12. Your DynamoDB table stores user orders with UserID (partition key) and OrderDate (sort key). What is the most efficient way to query all orders for a specific user after a specific date?

A) Use a Scan with FilterExpression B) Use a Query with KeyConditionExpression specifying UserID and an OrderDate range C) Create a GSI and query it D) Use DynamoDB Streams + Lambda to maintain an index

Answer: B

Explanation: Using a Query with the table's primary key (UserID as partition key, OrderDate as sort key) is the most efficient access pattern. Set KeyConditionExpression to UserID = :uid AND OrderDate >= :date. Scans read the entire table and are highly inefficient.
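
As a rough sketch, the request for this Query could be assembled as below. The table name `Orders` is hypothetical, and the example assumes both attributes are strings with `OrderDate` stored in ISO 8601 form so that string ordering matches chronological ordering. The dictionary uses the parameter names of the low-level DynamoDB `Query` API; nothing is sent to AWS here.

```python
from typing import Dict


def build_order_query(user_id: str, since_date: str, table_name: str = "Orders") -> Dict:
    """Build Query parameters: one partition key value plus a sort key range."""
    return {
        "TableName": table_name,  # hypothetical table name
        "KeyConditionExpression": "UserID = :uid AND OrderDate >= :date",
        "ExpressionAttributeValues": {
            ":uid": {"S": user_id},
            ":date": {"S": since_date},
        },
    }


params = build_order_query("user-123", "2024-01-01")
print(params["KeyConditionExpression"])
```

With boto3 installed and credentials configured, these parameters could be passed as `client.query(**params)`.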

Q13. What is the primary reason to use Redshift RA3 nodes?

A) To independently scale compute and storage B) For maximum query performance through in-memory caching C) To automatically parallelize queries D) To directly access data in S3

Answer: A

Explanation: Redshift RA3 nodes separate compute and storage, allowing each to scale independently. Frequently accessed data is cached on local NVMe SSD, while the rest resides in Redshift Managed Storage (RMS) backed by S3. You can grow storage without increasing compute costs.

Q14. You need to manage aging index data in OpenSearch Service cost-effectively. The last 7 days are frequently queried, 30 days to 1 year occasionally, and beyond 1 year rarely. What is the optimal storage tier configuration?

A) Keep all data in Hot storage B) Transition data: Hot → UltraWarm → Cold storage C) Export data older than 30 days to S3 D) Use Index State Management to delete data after 30 days

Answer: B

Explanation: OpenSearch Service provides Hot (instance store or EBS volumes on the data nodes, fastest), UltraWarm (S3-backed, lower cost), and Cold (lowest cost) storage tiers. Use Index State Management (ISM) to define automated policies that move indices through these tiers based on age, balancing performance and cost.

Q15. You applied Lake Formation Governed Tables to your S3 data lake. What is the primary benefit?

A) Automatic file compression reduces storage costs B) ACID transaction support and automatic data compaction C) Real-time streaming data ingestion D) Automatic column-level encryption

Answer: B

Explanation: Lake Formation Governed Tables provide ACID transactions (atomicity, consistency, isolation, durability) on S3 data, ensuring data consistency during concurrent reads and writes. Automatic compaction resolves the small files problem. Row-level security is also supported.

Q16. What is the most effective way to maximize Redshift Spectrum performance when querying S3 data lake?

A) Store S3 data in CSV format B) Store S3 data in Parquet format with partitioning C) Maximize the number of Redshift cluster nodes D) Limit each Spectrum slice to one file

Answer: B

Explanation: Redshift Spectrum achieves best performance with columnar formats like Parquet or ORC. Column pruning (reading only required columns) combined with partition pruning (reading only required partitions) drastically reduces the data scanned. CSV requires reading entire files.

Domain 3: Processing

Q17. You want to minimize EMR cluster costs using Spot Instances while minimizing the risk of job failure. What is the correct configuration?

A) Use Spot for master, core, and task nodes B) Use On-Demand for master and core nodes; Spot only for task nodes C) Use On-Demand for master; Spot for core and task nodes D) Use Reserved Instances for all nodes

Answer: B

Explanation: The master node manages the cluster, and core nodes store HDFS data. Using On-Demand for both master and core nodes ensures stability and data durability. Task nodes (additional compute only, no HDFS storage) can safely use Spot. Spot interruption of task nodes does not cause data loss.

Q18. What is the most accurate description of the difference between AWS Glue DynamicFrame and Apache Spark DataFrame?

A) DynamicFrame has no schema and handles all data types B) DynamicFrame tolerates schema inconsistencies (choice type) and provides AWS-specific transforms like relationalize C) DataFrame is always faster so DynamicFrame should be avoided D) DynamicFrame supports only structured streaming

Answer: B

Explanation: AWS Glue DynamicFrame represents columns with mixed types as a "choice" type. It provides AWS-specific transformations such as resolveChoice() and relationalize(). For performance-critical paths, you can convert DynamicFrame to DataFrame, process it, and convert back.

Q19. A Glue Crawler re-crawling your entire S3 partitioned dataset each time a new partition is added is inefficient. What is the alternative?

A) Schedule the Glue Crawler to run every minute B) Use Athena MSCK REPAIR TABLE or ADD PARTITION C) Use Lake Formation blueprints to auto-manage partitions D) Use an S3 event notification to trigger Lambda that updates Glue Catalog partitions

Answer: D

Explanation: An S3 ObjectCreated event triggers a Lambda function that calls glue:BatchCreatePartition to add only the new partition to the Glue Catalog. This is the most efficient approach. Athena MSCK REPAIR TABLE also works but becomes slower as the partition count grows.
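
The core of such a Lambda function is turning the new object's key into partition values. The sketch below shows only that parsing step, assuming Hive-style `name=value` directories in the key; the bucket and key are made-up examples. A real handler would also copy the table's columns, formats, and SerDe into `StorageDescriptor` before calling `glue:BatchCreatePartition`.

```python
from typing import Dict
from urllib.parse import unquote_plus


def partition_input_from_key(bucket: str, key: str) -> Dict:
    """Map an object key such as events/year=2024/month=01/day=15/part-0000.parquet
    to a minimal PartitionInput structure."""
    key = unquote_plus(key)  # S3 event notifications URL-encode object keys
    dirs = key.split("/")[:-1]  # drop the file name
    values = [d.split("=", 1)[1] for d in dirs if "=" in d]
    location = f"s3://{bucket}/{'/'.join(dirs)}/"
    return {"Values": values, "StorageDescriptor": {"Location": location}}


partition = partition_input_from_key(
    "example-bucket", "events/year%3D2024/month%3D01/day%3D15/part-0000.parquet"
)
print(partition)
```

Because several files usually arrive in the same partition, a handler built on this should tolerate "already exists" responses for partitions it has registered before.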

Q20. What is the key difference between EMR Serverless and EMR on EC2?

A) EMR Serverless maintains a permanent cluster; EMR on EC2 uses ephemeral clusters B) EMR Serverless automatically scales without cluster provisioning and has no idle costs C) EMR Serverless only supports Hive, not Spark D) EMR on EC2 is always more cost-effective

Answer: B

Explanation: EMR Serverless automatically provisions resources when a job is submitted without you managing any cluster. There are no charges when no jobs are running. It supports Spark, Hive, and other frameworks. It is ideal for intermittent batch workloads.

Q21. What is the primary use case for AWS Glue DataBrew?

A) Large-scale distributed Spark ETL processing B) No-code visual data preparation and cleaning C) Real-time streaming data transformation D) Data catalog metadata management

Answer: B

Explanation: AWS Glue DataBrew provides a visual, code-free interface for exploring, cleaning, and normalizing data. It offers 250+ pre-built transformations and supports data quality rule definitions and profiling. It is designed for data analysts and data scientists who prefer visual tooling.

Q22. In a Step Functions-orchestrated data pipeline, you need to implement automatic retry logic when an EMR job fails. What is the correct approach?

A) Poll EMR status with a Lambda function and restart on failure B) Add a Retry block to the Step Functions state definition C) Detect EMR failure with a CloudWatch Alarm and send SNS notification D) Wrap the EMR job in a Glue Workflow

Answer: B

Explanation: Adding a Retry block to each task state in Step Functions enables automatic retries for specified error types. You can configure the number of retries, interval, and backoff rate. A Catch block handles the final failure case with an alternative path.
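
A Retry and Catch configuration might look like the following, written here as a Python dictionary that serializes to Amazon States Language JSON. The state names `NotifyFailure` and `LoadResults` and the specific retry numbers are assumptions for the example, and the task's `Parameters` (cluster ID and step definition) are omitted for brevity.

```python
import json
from typing import Dict, List

retry_policy = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 30,  # wait before the first retry
    "MaxAttempts": 3,
    "BackoffRate": 2.0,  # multiply the wait after each retry
}

emr_step_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
    "Retry": [retry_policy],
    # After retries are exhausted, route to a hypothetical failure-handling state.
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
    "Next": "LoadResults",
}


def retry_delays(policy: Dict) -> List[float]:
    """Seconds waited before each retry attempt under exponential backoff."""
    return [
        policy["IntervalSeconds"] * policy["BackoffRate"] ** attempt
        for attempt in range(policy["MaxAttempts"])
    ]


print(json.dumps(emr_step_state, indent=2))
print(retry_delays(retry_policy))
```

With these values the task waits 30, 60, and then 120 seconds between attempts before the Catch block takes over.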

Q23. An AWS Glue ETL job produces thousands of small files in S3 after processing. How do you resolve this small files problem?

A) Reduce the number of Glue workers B) Use coalesce() or repartition() to control the output partition count before writing C) Use S3 Lifecycle policies to automatically delete small files D) Merge files using Kinesis Firehose

Answer: B

Explanation: In Spark, coalesce(N) reduces the number of partitions without a shuffle (may produce uneven sizes), while repartition(N) redistributes evenly (incurs shuffle). Using either before a write operation controls the number of output files. coalesce is generally preferred for merging small files.
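
Choosing N is mostly arithmetic. The helper below is a simple sketch that derives a partition count from the estimated output size, assuming a target of about 128 MB per file; both the target and the function name are illustrative choices, not Glue settings.

```python
import math

DEFAULT_TARGET_BYTES = 128 * 1024 ** 2  # assumed target of ~128 MB per output file


def target_partitions(total_bytes: int, target_file_bytes: int = DEFAULT_TARGET_BYTES) -> int:
    """Number of output partitions so each written file is near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Roughly 10 GiB of output data.
n = target_partitions(10 * 1024 ** 3)
print(n)
```

The result would then be passed to `coalesce(n)` or `repartition(n)` on the DataFrame just before the write.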

Q24. How do you increase the processing speed of a Lambda function used as an event source for Kinesis Data Streams?

A) Increase Lambda function memory B) Increase the Parallelization Factor per shard C) Decrease the batch size D) Increase Lambda reserved concurrency

Answer: B

Explanation: The Parallelization Factor (1–10) for the Kinesis-Lambda event source mapping allows multiple concurrent Lambda invocations per shard. The default value of 1 means one concurrent Lambda execution per shard. Setting it to 10 enables up to 10 parallel Lambda invocations per shard simultaneously.
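
The effect on concurrency is easy to compute. The sketch below builds parameters using the field names of the Lambda `CreateEventSourceMapping` API and estimates the upper bound on concurrent batches; the ARN, function name, and batch size are placeholders, and nothing is sent to AWS.

```python
from typing import Dict


def kinesis_mapping_params(stream_arn: str, function_name: str,
                           parallelization_factor: int = 10,
                           batch_size: int = 100) -> Dict:
    """Parameters for an event source mapping with per-shard parallelism."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("ParallelizationFactor must be between 1 and 10")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": batch_size,
        "ParallelizationFactor": parallelization_factor,
    }


def max_concurrent_batches(shard_count: int, parallelization_factor: int) -> int:
    """Upper bound on batches processed at once across the stream."""
    return shard_count * parallelization_factor


mapping = kinesis_mapping_params(
    "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",  # placeholder ARN
    "example-consumer",  # placeholder function name
)
print(max_concurrent_batches(shard_count=4, parallelization_factor=10))
```

Records that share a partition key are still processed in order, so the gain depends on how many distinct keys each shard carries.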

Domain 4: Analysis and Visualization

Q25. You want to minimize Athena costs for frequently executed queries by reusing previous results. Which Athena feature should you use?

A) Athena Query Result Reuse B) Athena Federated Query C) CTAS to save results D) Athena workgroup query queuing

Answer: A

Explanation: Athena Query Result Reuse reuses previous query results for identical queries within a configurable period (up to 7 days). No scan charges are incurred for reused results. This is highly effective for repeated queries on static data.

Q26. You want to avoid managing partition metadata in the Glue Catalog for hourly partitioned S3 data queried with Athena. What is the most efficient approach?

A) Run MSCK REPAIR TABLE every hour B) Configure Athena Partition Projection C) Run ALTER TABLE ADD PARTITION every hour D) Use Lake Formation for automatic partition management

Answer: B

Explanation: Athena Partition Projection computes partition values dynamically from rules defined in the table properties, without storing partition metadata in the Glue Catalog. It is especially effective for regular patterns (dates, numeric ranges) and eliminates partition registration overhead.
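
For an hourly layout, the table properties could look like the following. The column name `dt`, the bucket path, and the start of the range are assumptions for the example; the property keys are the ones Athena uses for date-type projection. The helper only renders the `TBLPROPERTIES` clause as text.

```python
from typing import Dict

projection_props = {
    "projection.enabled": "true",
    "projection.dt.type": "date",
    "projection.dt.format": "yyyy/MM/dd/HH",
    "projection.dt.range": "2024/01/01/00,NOW",  # assumed start of the data
    "projection.dt.interval": "1",
    "projection.dt.interval.unit": "HOURS",
    "storage.location.template": "s3://example-bucket/events/${dt}/",  # placeholder path
}


def render_tblproperties(props: Dict[str, str]) -> str:
    """Render properties as a TBLPROPERTIES clause for CREATE or ALTER TABLE."""
    body = ",\n".join(f"  '{key}' = '{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n{body}\n)"


clause = render_tblproperties(projection_props)
print(clause)
```

Because the range ends at `NOW`, new hourly prefixes become queryable as soon as objects land in them, with no catalog update.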

Q27. The EXPLAIN output for a Redshift query shows DS_DIST_ALL_NONE. What does this mean?

A) Both tables use KEY distribution for efficient co-located joins B) One table uses ALL distribution so no data redistribution is needed for the join C) EVEN distribution requires data redistribution D) No distribution key, broadcast join not possible

Answer: B

Explanation: DS_DIST_ALL_NONE means one table in the join uses ALL distribution (copied to every node), so no redistribution is needed. This is an efficient join pattern. DS_DIST_ALL_INNER indicates redistribution cost is involved.

Q28. You want to quickly visualize a dataset with hundreds of millions of rows in QuickSight. Data is updated daily. What is the optimal configuration?

A) Direct Query mode connected to Redshift B) Import data into SPICE with a scheduled refresh C) Direct query of S3 data via Athena D) Use QuickSight Paginated Reports

Answer: B

Explanation: SPICE (Super-fast Parallel In-memory Calculation Engine) is QuickSight's in-memory storage that enables ultra-fast queries and visualization even for hundreds of millions of rows. Scheduled refresh keeps SPICE data current. Direct Query mode is real-time but may be slow for very large datasets.

Q29. Long-running Redshift queries are blocking short queries. How do you best resolve this with Workload Management (WLM)?

A) Add nodes to the cluster B) Enable Short Query Acceleration (SQA) C) Assign equal priority to all queries D) Schedule long-running queries for nighttime

Answer: B

Explanation: Redshift Short Query Acceleration (SQA) uses machine learning to predict short-running queries and run them in a dedicated priority queue. It requires no separate WLM queue configuration and automatically optimizes environments where short and long queries coexist.

Q30. You need to implement Row-Level Security in QuickSight so that sales representatives can only see data for their own region. What is the correct approach?

A) Create a separate dataset per sales representative B) Apply a QuickSight RLS (Row-Level Security) rule to the dataset C) Restrict S3 data access via IAM policies D) Use Athena views to filter data per user

Answer: B

Explanation: QuickSight RLS attaches a rules file (CSV or another dataset) mapping users/groups to filter values to a dataset. Data is automatically filtered based on the logged-in user. A single dataset serves all representatives, each seeing only their own region's data.
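
A rules file for this scenario is just a small CSV. The sketch below generates one, assuming user-name-based rules and a dataset field called `Region`; the user names and region values are invented for illustration.

```python
import csv
import io
from typing import Iterable, Tuple


def rls_rules_csv(assignments: Iterable[Tuple[str, str]]) -> str:
    """Build a rules file: one row per user, with the region value they may see."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["UserName", "Region"])  # "Region" must match the dataset field name
    writer.writerows(assignments)
    return buffer.getvalue()


rules = rls_rules_csv([("alice", "us-east"), ("bob", "eu-west")])
print(rules)
```

The resulting file is uploaded as its own dataset and then attached to the sales dataset as the row-level security rules.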

Q31. What is the primary scenario for using Athena Federated Query?

A) Join S3 data lake with RDS, DynamoDB, and other heterogeneous sources in a single SQL statement B) Query Redshift and S3 data together C) Process S3 buckets across multiple AWS accounts in a single query D) Analyze real-time streaming data with SQL

Answer: A

Explanation: Athena Federated Query uses Lambda-based data source connectors to run SQL on data sources beyond S3 (RDS, DynamoDB, ElastiCache, CloudWatch, Redis, etc.). Multiple heterogeneous sources can be joined in a single query, enabling integrated analytical workloads without data movement.

Q32. When indexing large volumes of log data in OpenSearch Service, how do you optimize indexing performance?

A) Reduce the Refresh Interval to 1 second B) Use the Bulk API and increase the Refresh Interval C) Call the Index API for each document individually D) Maximize the number of shards

Answer: B

Explanation: The Bulk API batches multiple documents per request, reducing network overhead. Increasing the Refresh Interval (e.g., from the default 1 second to 30 seconds or more) reduces segment creation frequency and significantly improves indexing throughput. Call a manual refresh after the bulk load completes.
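
The Bulk API expects newline-delimited JSON: an action line followed by the document, with a trailing newline at the end of the body. The sketch below builds such a body and the settings payload for a longer refresh interval; the index name and sample documents are placeholders, and the HTTP calls themselves are left out.

```python
import json
from typing import Dict, Iterable


def bulk_body(index: str, docs: Iterable[Dict]) -> str:
    """Build an NDJSON body for POST /_bulk with one index action per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Settings payload for PUT /<index>/_settings during the bulk load.
refresh_settings = {"index": {"refresh_interval": "30s"}}

body = bulk_body("app-logs", [{"level": "INFO", "msg": "started"},
                              {"level": "ERROR", "msg": "failed"}])
print(body)
```

After the load, the interval can be set back to its previous value so new documents become searchable promptly again.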

Domain 5: Data Security

Q33. You need to ensure that specific departments can only access specific tables in your S3 data lake. Should you use Lake Formation or S3 bucket policies?

A) S3 bucket policies alone provide fine-grained data access control B) Lake Formation data permissions for table/column/row-level access control C) IAM policies to control Glue Catalog access D) S3 Access Points to control access by prefix

Answer: B

Explanation: Lake Formation integrates with the Glue Data Catalog to provide logical table, column, and row-level fine-grained access control. Permissions are based on logical data structure rather than physical S3 file paths. S3 bucket policies only control at the file/prefix level and cannot enforce column or row-level restrictions.

Q34. You need to enable encryption at rest for Kinesis Data Streams using an AWS-managed key. What is the correct approach?

A) Enable S3 server-side encryption (SSE-S3) B) Enable Server-Side Encryption (SSE) on the Kinesis stream — select the aws/kinesis KMS key C) Create a customer-managed KMS key (CMK) and attach it to Kinesis D) Use client-side encryption before sending data

Answer: B

Explanation: Kinesis Data Streams supports server-side encryption (SSE). Selecting the AWS-managed key (aws/kinesis) encrypts all stream data with KMS without additional configuration. A customer-managed key (CMK) provides more granular key policy control but requires additional setup.

Q35. Your Redshift cluster is deployed in a VPC. You want COPY commands loading data from S3 to never traverse the public internet. What should you configure?

A) Assign a public IP to the Redshift cluster B) Route S3 traffic through a NAT Gateway C) Configure a VPC endpoint (gateway endpoint) for S3 D) Use Direct Connect for AWS network connectivity

Answer: C

Explanation: An S3 gateway VPC endpoint, combined with Redshift Enhanced VPC Routing, keeps COPY/UNLOAD traffic on the AWS private network. S3 traffic does not traverse the internet gateway or a NAT Gateway, which improves security and avoids NAT Gateway data processing charges.

Q36. How do you encrypt metadata (table definitions, partition information) stored in the AWS Glue Data Catalog?

A) It is automatically handled by S3 server-side encryption B) Enable metadata encryption in the Glue Data Catalog settings with a KMS key C) Directly attach a KMS key to Glue Catalog tables D) Use Lake Formation to control metadata access

Answer: B

Explanation: Metadata encryption for the Glue Data Catalog is enabled in the Data Catalog settings, where you choose a KMS key. Glue security configurations are a separate feature that covers encryption of ETL job data written to S3, CloudWatch logs, and job bookmarks.

Q37. Your analytics team needs access to the production S3 data lake, but PII must be masked. What is an efficient solution?

A) Maintain a separate S3 bucket copy with PII removed for the analytics team B) Lake Formation column-level security + S3 Object Lambda for dynamic masking C) Deny full bucket access in the analytics team IAM role, allow only specific prefixes D) Use Glue ETL to generate a daily PII-free dataset

Answer: B

Explanation: Lake Formation column-level security blocks PII columns. S3 Object Lambda can dynamically mask PII at read time using a Lambda function. This approach avoids duplicating data and allows centralized management of masking logic.

Q38. For compliance auditing, your organization must log all query activity on Redshift. What is the correct configuration?

A) Use CloudTrail to record Redshift API calls B) Enable Redshift Audit Logging to S3 C) Use VPC Flow Logs to record Redshift network traffic D) Use CloudWatch Logs to record Redshift connections

Answer: B

Explanation: Redshift Audit Logging writes connection logs, user activity logs, and user logs to S3. The user activity log records every executed SQL query. This provides complete Who/What/When information needed for compliance auditing. CloudTrail captures only Redshift API calls.

Advanced Scenarios

Q39. An e-commerce company needs to analyze real-time purchase events. Requirements: 1) 100,000 events/sec, 2) real-time fraud detection within 100ms, 3) daily purchase reports, 4) 3-year data retention. Choose the architecture.

A) Kinesis Firehose → S3 → Athena (satisfies all requirements) B) Kinesis Data Streams → Lambda (real-time fraud detection) + Firehose → S3 (batch) + Glue + Redshift (reports) C) MSK → Spark Streaming → DynamoDB D) SQS → Lambda → RDS → QuickSight

Answer: B

Explanation: Kinesis Data Streams feeds Lambda for sub-100ms fraud detection and simultaneously routes to Firehose for S3 storage. Glue ETL processes S3 data and loads it into Redshift for daily reports. S3 Intelligent-Tiering manages 3-year retention cost-effectively.

Q40. A healthcare organization needs to analyze electronic health records on AWS. Which two security configurations are required for HIPAA compliance? (Select 2)

A) Enable S3 server-side encryption (KMS) B) Enforce in-transit encryption (SSL/TLS) for Redshift C) Distribute data via CloudFront D) Store data in a public S3 bucket E) Use Kinesis streams without encryption

Answer: A, B

Explanation: HIPAA compliance requires encryption at rest (S3 SSE-KMS) and in transit (SSL/TLS). VPC deployment, access logging, and audit trails are also required. Options D and E violate HIPAA requirements, and CloudFront is not directly related to HIPAA data protection.

Q41. For a data lake migration project, you need to move petabytes of on-premises Hadoop HDFS data to S3, including the Hive Metastore. What is the correct approach?

A) Copy data with S3 DistCp, regenerate metadata with Glue Crawler B) Physical data transfer with Snowball family, schema conversion with SCT C) Copy data with AWS DataSync, migrate Hive Metastore by importing into Glue Catalog D) Use DMS for the entire migration

Answer: C

Explanation: AWS DataSync efficiently transfers large datasets from on-premises HDFS to S3 with parallel transfer and automatic integrity verification. The Glue Data Catalog is compatible with the Hive Metastore and supports importing metadata. EMR can run existing Hive queries against the Glue Catalog.

Q42. You used Athena CTAS to convert CSV data to Parquet with partitioning. What additional step maximizes performance?

A) Minimize file sizes to increase file count B) Optimize file sizes to 128 MB–1 GB to maintain splittable format C) Disable compression to speed up file reads D) Increase partitioning granularity with smaller partitions

Answer: B

Explanation: Parquet files are splittable by default. Maintaining the optimal file size (128 MB–1 GB) ensures maximum parallelism during Athena scans. Too-small files increase overhead; too-large files reduce parallelism. Combine with Snappy or ZSTD compression for best results.

Q43. The data engineering team has enabled Glue Job Bookmarks. What is the primary purpose of this feature?

A) Track Glue job execution history B) Track already-processed data to implement incremental processing C) Optimize Glue job costs D) Save data quality checkpoints

Answer: B

Explanation: Glue Job Bookmarks track which data was processed in previous runs. Subsequent runs process only newly added data, preventing duplicate processing. They use S3 file modification times and names to determine processing status. This enables efficient incremental ETL pipelines without reprocessing all data.

Q44. In a Kinesis Data Analytics (Apache Flink) application, you need to enrich streaming events using an external database as reference data. What is the recommended approach?

A) Load all reference data into Flink memory B) Use Flink Async I/O to asynchronously query the external database C) Send reference data as a Kinesis stream and join D) Enrich with a Lambda function and re-publish to Kinesis

Answer: B

Explanation: Flink Async I/O processes external database lookups (Redis, DynamoDB, etc.) asynchronously, minimizing processing latency. Synchronous lookups would block on external system responses and severely reduce throughput. Async I/O processes multiple outstanding requests concurrently. Consider local caching with RocksDB State Backend for further optimization.

Q45. A BI team wants to implement embedded analytics in a web application using QuickSight. External users (non-AWS users) must be able to view dashboards. What is the correct approach?

A) Create standard QuickSight user accounts for each external user B) Use the QuickSight embedded URL API with anonymous embedding or Reader sessions C) Share public dashboard links D) Export dashboard images to S3 and display on the web

Answer: B

Explanation: The QuickSight embedding API generates time-limited URLs for external users to view dashboards without a QuickSight account. Anonymous embedding (unauthenticated access) or Reader sessions (per-session pricing) are available. The embedded dashboard is rendered in an iframe within the web application.

Q46. When a Kinesis Data Firehose Lambda transformation function fails, how are failed records handled?

A) All records are deleted B) Failed records are stored in a separate S3 prefix (processing-failed prefix) C) Firehose automatically retries until success D) Failed records are returned to the source

Answer: B

Explanation: Records that fail Lambda transformation in Kinesis Data Firehose are written to the processing-failed S3 prefix. Successfully transformed records go to the configured destination. Failed records are preserved separately for later reprocessing or error analysis.

Q47. In a multi-AWS account environment, you want to build a central data lake in a central account and query business unit account data with Athena. How does Lake Formation support this?

A) Copy each account's S3 data to the central account B) Use Lake Formation cross-account data sharing C) Configure S3 cross-account replication D) Use AWS Organizations SCP to consolidate data access

Answer: B

Explanation: Lake Formation supports cross-account data sharing. Data owner accounts grant Lake Formation data permissions to IAM roles/users in the central account. Athena or Redshift Spectrum in the central account can query other accounts' data catalogs. This achieves centralized governance without data movement.

Q48. What is the primary benefit of using Apache Hudi on EMR?

A) Automatically converts Hive queries to Spark B) Supports UPSERT/DELETE and incremental processing on S3 C) Automatically scales the EMR cluster D) Automatically syncs data between HDFS and S3

Answer: B

Explanation: Apache Hudi enables UPSERT (insert/update), DELETE operations on data lake storage like S3, overcoming the immutability limitation of traditional data lakes. It provides Copy-on-Write and Merge-on-Read table types, and supports incremental queries to efficiently process only changed data.

Q49. What problem does AWS Glue Elastic Views solve?

A) Resolves out-of-memory issues in Glue ETL jobs B) Provides materialized views that replicate and combine data from multiple sources in near-real-time C) Fixes schema detection errors in Glue Crawlers D) Automatically caches Athena query results

Answer: B

Explanation: AWS Glue Elastic Views automatically replicates and combines data from source databases (DynamoDB, Aurora, RDS) to targets (OpenSearch, S3, Redshift) maintaining materialized views. A SQL-based view definition enables data integration without complex ETL pipelines.

Q50. What is the primary reason to apply S3 Intelligent-Tiering to a data lake?

A) Automatically encrypts all S3 operations B) Automatically monitors access patterns and moves data to the most cost-effective storage tier C) Automatically creates backups of S3 data D) Automates global data replication

Answer: B

Explanation: S3 Intelligent-Tiering monitors access frequency and automatically moves data: frequently accessed to Frequent Access tier, data not accessed for 30+ days to Infrequent Access, and 90+ days to Archive Instant Access. It is ideal for data lakes with unpredictable access patterns. Only a monitoring fee is added.

Q51. What is the primary purpose of the VACUUM command in Redshift?

A) Terminate unnecessary database connections B) Reclaim storage from deleted/updated rows and re-sort data by sort key order C) Clear the Redshift cluster cache D) Update table statistics

Answer: B

Explanation: Redshift VACUUM physically removes rows that were only soft-deleted/updated and re-sorts data by the sort key. Regular VACUUM reclaims storage and maintains query performance. ANALYZE updates table statistics. Redshift Serverless and recent versions support automatic VACUUM.

Q52. You extended Kinesis Data Streams retention to 7 days. You need to reprocess data starting from a specific timestamp. How do you do this?

A) Creating a new consumer group automatically starts from the beginning B) Use GetShardIterator API with AT_TIMESTAMP to start reading from a specific time C) Reprocess via S3 through Firehose D) Copy retention data to a new stream

Answer: B

Explanation: Setting ShardIteratorType to AT_TIMESTAMP in the Kinesis GetShardIterator API and specifying the desired timestamp lets you read data starting from that point. When using KCL, set the initial position to a timestamp.
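
The request itself is small. The sketch below builds the parameters using the field names of the Kinesis `GetShardIterator` API; the stream name and shard ID are placeholders, and the timezone check is a defensive choice for the example rather than an API requirement.

```python
from datetime import datetime, timezone
from typing import Dict


def shard_iterator_request(stream_name: str, shard_id: str, start: datetime) -> Dict:
    """Parameters for reading a shard from a given point in time."""
    if start.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid ambiguity")
    return {
        "StreamName": stream_name,
        "ShardId": shard_id,
        "ShardIteratorType": "AT_TIMESTAMP",
        "Timestamp": start,
    }


request = shard_iterator_request(
    "example-stream",  # placeholder stream name
    "shardId-000000000000",
    datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc),
)
print(request["ShardIteratorType"], request["Timestamp"].isoformat())
```

A replay job would issue one such request per shard and then page through GetRecords from each returned iterator.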

Q53. What is the primary benefit of using the Glue Catalog as the shared metastore for Athena, Redshift Spectrum, and EMR?

A) Single metadata registry that shares schemas across multiple analytics engines, ensuring consistency B) Faster data processing speed C) Reduced S3 storage costs D) Automatic data quality validation

Answer: A

Explanation: The Glue Data Catalog is the central metadata store in AWS. Athena, Redshift Spectrum, and EMR all reference the same Glue Catalog, so table definitions, partition information, and schemas are managed in one place. Schema changes are automatically reflected across all analytics engines.

Q54. Data quality issues occur frequently in the data lake pipeline. Which AWS service automates quality checks in the data pipeline?

A) CloudWatch Alarms to monitor S3 file sizes B) AWS Glue Data Quality rules for automated validation C) Run manual data validation scripts with Lambda D) Send manual review notifications to the data team via SNS

Answer: B

Explanation: AWS Glue Data Quality defines quality rules (completeness, uniqueness, referential integrity, etc.) on datasets and automatically validates them during ETL jobs. It provides quality scores and rule pass/fail results. Integration with CloudWatch enables alerts when quality degrades.

Q55. What is the primary reason to choose EMR on EKS?

A) Always cheaper than EMR on EC2 B) Run Spark jobs on existing Kubernetes infrastructure, integrating container-based workloads C) Configure HDFS storage on the EKS cluster D) GPU-based deep learning only

Answer: B

Explanation: EMR on EKS runs Apache Spark on an existing Amazon EKS cluster. Organizations already using Kubernetes can consolidate data processing workloads without managing separate EMR clusters. Benefits include container isolation, resource sharing, and integration with existing Kubernetes tooling (Helm, Argo Workflows, etc.).

Q56. What is the benefit of enabling Redshift Concurrency Scaling?

A) Automatically adds cluster nodes for permanent expansion B) Automatically provisions additional cluster capacity during peak concurrency to maintain SLAs C) Automatically upgrades the Redshift cluster D) Automatically schedules overnight batch processing

Answer: B

Explanation: Redshift Concurrency Scaling automatically adds transient cluster capacity when concurrent queries saturate the main cluster. Each cluster earns up to one hour of free Concurrency Scaling credits per day; usage beyond the accrued credits is billed per second. Users experience consistent, low-latency query performance without manual intervention.

Q57. When migrating from Aurora MySQL to Amazon Redshift using DMS, is the Schema Conversion Tool (SCT) required?

A) No, DMS automatically handles schema conversion B) Yes, Aurora MySQL and Redshift are different engines; SCT is needed for schema conversion C) No, Aurora and Redshift are both AWS services so conversion is unnecessary D) SCT is only required for on-premises databases

Answer: B

Explanation: DMS handles data movement; SCT handles schema (DDL, procedures, functions, etc.) conversion. A heterogeneous migration from Aurora MySQL (OLTP) to Redshift (OLAP) requires data type and table structure conversion. Run SCT first to generate Redshift-compatible DDL, then use DMS to replicate the data.

Q58. You want to share Athena query results with other teams and charge costs back by team. What is the correct approach?

A) Create a separate AWS account per team B) Use Athena Workgroups to isolate queries, track costs, and control data usage per team C) Use CloudWatch cost alarms to monitor budgets per team D) Use IAM tags to track query costs

Answer: B

Explanation: Athena Workgroups isolate queries by team or project and separately track query costs and data scanned per workgroup. You can set per-query data scan limits, specify result storage locations, and integrate with CloudWatch metrics. Combined with cost allocation tags, this enables team-level cost chargebacks.
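
A chargeback report built on per-workgroup scan volumes might look like the sketch below. The price per terabyte, the use of binary terabytes, and the sample records are assumptions for illustration; the record shape loosely mirrors the workgroup name and bytes-scanned statistic that Athena reports for each query execution.

```python
from collections import defaultdict

PRICE_PER_TB_USD = 5.0   # assumed price; check current pricing for your region
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabytes

def chargeback(queries: list[tuple[str, int]]) -> dict[str, float]:
    """Sum bytes scanned per workgroup and convert the totals to USD."""
    scanned: dict[str, int] = defaultdict(int)
    for workgroup, bytes_scanned in queries:
        scanned[workgroup] += bytes_scanned
    return {
        workgroup: total / BYTES_PER_TB * PRICE_PER_TB_USD
        for workgroup, total in scanned.items()
    }

# Hypothetical (workgroup, bytes scanned) records for one billing period.
queries = [
    ("marketing", 2 * BYTES_PER_TB),
    ("finance", BYTES_PER_TB // 2),
    ("marketing", 1 * BYTES_PER_TB),
]

costs = chargeback(queries)
```

In practice the same totals are available from workgroup metrics and tagged billing data, so a script like this is mainly useful for cross-checking or for custom allocation rules.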

Q59. What is the most appropriate use case for QuickSight ML Insights?

A) Directly train and deploy machine learning models B) Detect anomalies, generate forecasts, and auto-narratives from time-series data C) Display SageMaker model results in QuickSight D) Analyze A/B test results

Answer: B

Explanation: QuickSight ML Insights provides code-free Anomaly Detection, Forecasting, and Auto-narratives for time-series data. Both anomaly detection and forecasting rely on the Random Cut Forest (RCF) algorithm built into QuickSight. Business users can obtain ML-driven insights without data scientists.

Q60. What does the Lake Formation Blueprint feature do?

A) Provides data lake security policy templates B) Automates data ingestion workflows from common sources (relational databases such as RDS, log files, etc.) to the S3 data lake C) Automatically generates Glue ETL job code D) Provides QuickSight dashboard templates

Answer: B

Explanation: Lake Formation Blueprints provide pre-configured workflows for common ingestion patterns (Database Snapshot, Incremental Database, Log File). They automatically create Glue crawlers and ETL jobs internally, reducing the complexity of data lake construction.

Q61. A data engineer needs to optimize a Glue ETL job that converts JSON data in S3 to Parquet. The job currently processes the data as a single partition. How can the processing time be reduced?

A) Switch the Glue job to Python Shell B) Use the groupFiles option to group small files and increase worker count C) Load data into Redshift first, then unload as Parquet D) Parallelize conversion with a Lambda function chain

Answer: B

Explanation: Glue ETL's groupFiles setting logically groups small files to optimize processing partition count. Increasing the number of workers (DPUs) enables greater parallelism. Spark internally processes data in parallel partitions, so sufficient DPU allocation is needed to fully utilize the parallelism.
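
The effect of file grouping can be illustrated with a small simulation. The greedy packing below is an assumption chosen for clarity, not Glue's internal algorithm, and the S3 path is hypothetical. The `connection_options` dictionary shows how the `groupFiles` and `groupSize` options are typically passed when reading from S3 in a Glue job.

```python
MB = 1024 * 1024

def group_files(file_sizes: list[int], group_size: int) -> list[list[int]]:
    """Greedily pack files, in listing order, into groups of at most `group_size` bytes.

    A single file larger than `group_size` still gets a group of its own.
    """
    groups: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    for size in file_sizes:
        if current and current_bytes + size > group_size:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

# 1,000 small JSON files of 1 MB each, grouped toward a 128 MB target.
groups = group_files([1 * MB] * 1000, group_size=128 * MB)

# Options in the shape accepted by
# glueContext.create_dynamic_frame.from_options(connection_type="s3", ...).
connection_options = {
    "paths": ["s3://example-bucket/raw/json/"],  # hypothetical path
    "recurse": True,
    "groupFiles": "inPartition",
    "groupSize": str(128 * MB),  # target group size in bytes, given as a string
}
```

Each group then becomes one unit of work instead of one task per file, and adding workers lets those units run in parallel.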

Q62. How do you identify bottlenecks in long-running Redshift queries?

A) Check CPUUtilization in CloudWatch metrics B) Analyze STL_QUERY, STL_EXPLAIN, SVL_QUERY_SUMMARY system tables C) Search by query ID in Redshift audit logs D) Trace API calls with CloudTrail

Answer: B

Explanation: Redshift system tables enable query performance analysis. STL_QUERY contains completed query information, STL_EXPLAIN contains query plans, and SVL_QUERY_SUMMARY provides per-step execution statistics. Use the EXPLAIN command to view the execution plan and analyze data distribution, join strategies, and sort key effectiveness.
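
Once per-step statistics have been fetched from SVL_QUERY_SUMMARY, ranking them is straightforward. The sketch below assumes the rows were already retrieved as dictionaries with the `seg`, `step`, `label`, `maxtime` (microseconds), and `is_diskbased` columns; the helper name and the sample rows are invented for illustration.

```python
from typing import Any

# Query one might run first; %s is a placeholder for the query ID.
SUMMARY_SQL = """
SELECT seg, step, label, maxtime, rows, bytes, is_diskbased
FROM svl_query_summary
WHERE query = %s
ORDER BY maxtime DESC;
"""

def find_bottlenecks(steps: list[dict[str, Any]], top_n: int = 3) -> list[dict[str, Any]]:
    """Return the `top_n` slowest steps, flagging those that spilled to disk."""
    ranked = sorted(steps, key=lambda s: s["maxtime"], reverse=True)[:top_n]
    return [
        {
            "seg": s["seg"],
            "step": s["step"],
            "label": s["label"],
            "seconds": s["maxtime"] / 1_000_000,  # maxtime is in microseconds
            "spilled_to_disk": s["is_diskbased"] == "t",
        }
        for s in ranked
    ]

# Made-up rows standing in for a real query's summary.
sample_steps = [
    {"seg": 0, "step": 0, "label": "scan tbl=100 name=sales", "maxtime": 1_200_000, "is_diskbased": "f"},
    {"seg": 1, "step": 2, "label": "hash tbl=240", "maxtime": 8_500_000, "is_diskbased": "t"},
    {"seg": 2, "step": 1, "label": "sort tbl=242", "maxtime": 3_100_000, "is_diskbased": "f"},
    {"seg": 3, "step": 0, "label": "return", "maxtime": 40_000, "is_diskbased": "f"},
]

findings = find_bottlenecks(sample_steps, top_n=2)
```

A slow step that also spilled to disk usually points to insufficient memory for a join, sort, or aggregation, which is where distribution and sort key choices are worth revisiting.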

Q63. How do Buffering Hints (Buffer Size and Buffer Interval) work in Kinesis Data Firehose?

A) Both conditions must be met before data is delivered to S3 B) Data is delivered to S3 when either condition is met first C) Only Buffer Size is used as the delivery criterion D) Only Buffer Interval is used as the delivery criterion

Answer: B

Explanation: Kinesis Data Firehose delivers data to S3 as soon as either the Buffer Size (MB) or the Buffer Interval (seconds) threshold is reached, whichever comes first. For example, with a 5 MB buffer size and a 60-second interval, data is flushed when 5 MB has accumulated or when 60 seconds have elapsed.
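
The "whichever comes first" rule can be checked with a small simulation. This is a simplified model, not the service's implementation: it assumes the interval timer starts when the first record lands in an empty buffer, and the record arrival times and sizes are invented.

```python
MB = 1024 * 1024

def simulate_flushes(records: list[tuple[int, int]],
                     buffer_size_bytes: int,
                     buffer_interval_s: int) -> list[tuple[int, int, str]]:
    """Simulate buffer flushes for (arrival_time_s, size_bytes) records sorted by time.

    Returns (flush_time_s, bytes_flushed, reason) tuples, where reason is
    "size" or "interval" depending on which threshold was hit first.
    """
    flushes: list[tuple[int, int, str]] = []
    buffered = 0
    window_start = None  # time the first record entered the current buffer
    for arrival, size in records:
        # The interval may have expired before this record arrived.
        if window_start is not None and arrival - window_start >= buffer_interval_s:
            flushes.append((window_start + buffer_interval_s, buffered, "interval"))
            buffered, window_start = 0, None
        if window_start is None:
            window_start = arrival
        buffered += size
        if buffered >= buffer_size_bytes:
            flushes.append((arrival, buffered, "size"))
            buffered, window_start = 0, None
    if window_start is not None:
        # Whatever is left is delivered when its interval expires.
        flushes.append((window_start + buffer_interval_s, buffered, "interval"))
    return flushes

# 5 MB buffer size, 60-second interval: a burst followed by a trickle.
records = [(0, 2 * MB), (10, 2 * MB), (20, 2 * MB), (30, 1 * MB), (100, 1 * MB)]
flushes = simulate_flushes(records, buffer_size_bytes=5 * MB, buffer_interval_s=60)
```

The burst triggers a size-based flush at 20 seconds, while the slower records that follow are each delivered when their 60-second window runs out.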

Q64. What information does enabling DynamoDB Streams capture?

A) All Read operations on the table B) Write change events (Put, Update, Delete) retained for 24 hours C) Only Scan operations D) GSI/LSI creation and deletion events

Answer: B

Explanation: DynamoDB Streams captures item-level changes (INSERT, MODIFY, REMOVE) on the table, retaining them for up to 24 hours. It can capture the before/after images of changed items. Combined with Lambda triggers, it enables event-driven architectures, real-time aggregations, and cross-system synchronization.
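
A Lambda function triggered by the stream receives batches of these change records. The handler below is a minimal sketch that tallies event types and collects before/after images; the sample event is hand-written, and the old and new images are only present when the stream view type includes them.

```python
from typing import Any

def handler(event: dict[str, Any], context: Any = None) -> dict[str, Any]:
    """Summarize a batch of DynamoDB Streams records by event type."""
    summary = {"INSERT": 0, "MODIFY": 0, "REMOVE": 0}
    changes = []
    for record in event.get("Records", []):
        event_name = record["eventName"]  # INSERT, MODIFY, or REMOVE
        data = record["dynamodb"]
        summary[event_name] += 1
        changes.append({
            "event": event_name,
            "keys": data["Keys"],
            "old": data.get("OldImage"),  # absent for INSERT
            "new": data.get("NewImage"),  # absent for REMOVE
        })
    return {"summary": summary, "changes": changes}

# Hand-written sample event using DynamoDB's typed attribute format.
sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "Keys": {"id": {"S": "u1"}},
                "NewImage": {"id": {"S": "u1"}, "score": {"N": "10"}},
            },
        },
        {
            "eventName": "MODIFY",
            "dynamodb": {
                "Keys": {"id": {"S": "u1"}},
                "OldImage": {"id": {"S": "u1"}, "score": {"N": "10"}},
                "NewImage": {"id": {"S": "u1"}, "score": {"N": "25"}},
            },
        },
        {
            "eventName": "REMOVE",
            "dynamodb": {
                "Keys": {"id": {"S": "u1"}},
                "OldImage": {"id": {"S": "u1"}, "score": {"N": "25"}},
            },
        },
    ]
}

result = handler(sample_event)
```

A real consumer would replace the in-memory summary with a write to an aggregate table, a search index, or another downstream system.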

Q65. An enterprise needs to analyze data from various sources (RDS, DynamoDB, S3, Redshift) in a single platform. What is the most comprehensive solution?

A) Move all data to Redshift and analyze there B) AWS Glue Catalog + Athena Federated Query + Lake Formation integrated governance C) Use separate analytics tools for each source D) Consolidate all data into DynamoDB

Answer: B

Explanation: Using the AWS Glue Catalog as a central metadata store, Athena Federated Query for single-SQL access across heterogeneous sources, and Lake Formation for unified security and governance is the most comprehensive solution. Data remains in place; a unified access control policy is applied across all sources.

Study Resources

AWS DAS-C01 Exam Guide
AWS Data Analytics Documentation
AWS Skill Builder DAS-C01 Official Learning Path
AWS Big Data Whitepapers and Architecture Best Practices

This practice exam is created for study purposes. Actual exam questions may differ.