
  <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
      <title>Chaos and Order</title>
      <link>https://www.youngju.dev/blog</link>
      <description>Slowly and correctly. AI Researcher &amp; DevOps Engineer Youngju&#39;s tech blog. GPU/CUDA, LLM, MLOps, Kubernetes AI workloads, distributed training, and data engineering.</description>
      <language>ko</language>
      <managingEditor>fjvbn2003@gmail.com (Youngju Kim)</managingEditor>
      <webMaster>fjvbn2003@gmail.com (Youngju Kim)</webMaster>
      <lastBuildDate>Sat, 16 May 2026 00:00:00 GMT</lastBuildDate>
      <atom:link href="https://www.youngju.dev/tags/mc4/feed.xml" rel="self" type="application/rss+xml"/>
      
  <item>
    <guid>https://www.youngju.dev/blog/culture/2026-05-16-open-source-ai-training-datasets-2026-common-crawl-fineweb-redpajama-dolma-slimpajama-the-stack-laion-coyo-deep-dive.en</guid>
    <title>Open Source AI Training Datasets in 2026 — Common Crawl / FineWeb (HF) / RedPajama-V2 / Dolma / SlimPajama / The Stack v2 / LAION / COYO-700M (Kakao) Deep Dive</title>
    <link>https://www.youngju.dev/blog/culture/2026-05-16-open-source-ai-training-datasets-2026-common-crawl-fineweb-redpajama-dolma-slimpajama-the-stack-laion-coyo-deep-dive.en</link>
    <description>A full atlas of open source AI training datasets in 2026. The foundation Common Crawl and its refined descendants RefinedWeb / RedPajama-V2 / FineWeb / FineWeb-Edu / Dolma / SlimPajama, the academic stacks The Pile / S2ORC / arXiv, code-only datasets The Stack v2 / StarCoder, multimodal LAION-5B / DataComp / COYO-700M, the Korean ecosystem (AI Hub, NIA, KAIST, HyperCLOVA), the Japanese ecosystem (NII, NTT, ABEJA), and the robotics frontier Open X-Embodiment — with licensing, ethics, and right-to-be-forgotten implications throughout.</description>
    <pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate>
    <author>fjvbn2003@gmail.com (Youngju Kim)</author>
    <category>ai-datasets</category><category>training-data</category><category>common-crawl</category><category>refinedweb</category><category>redpajama</category><category>fineweb</category><category>fineweb-edu</category><category>the-pile</category><category>dolma</category><category>slimpajama</category><category>oscar</category><category>c4</category><category>mc4</category><category>commonpile</category><category>roots</category><category>openwebtext</category><category>arxiv</category><category>s2orc</category><category>the-stack-v2</category><category>starcoder</category><category>coyo-700m</category><category>kakao-brain</category><category>laion-5b</category><category>laion-aesthetics</category><category>datacomp</category><category>imagenet</category><category>cc12m</category><category>open-images</category><category>coco</category><category>open-x-embodiment</category><category>2026</category><category>deep-dive</category><category>english</category>
  </item>

  <item>
    <guid>https://www.youngju.dev/blog/culture/2026-05-16-open-source-ai-training-datasets-2026-common-crawl-fineweb-redpajama-dolma-slimpajama-the-stack-laion-coyo-deep-dive.ja</guid>
    <title>Open Source AI Training Datasets 2026: Common Crawl / FineWeb (HF) / RedPajama-V2 / Dolma / SlimPajama / The Stack v2 / LAION / COYO-700M (Kakao) Deep-Dive Guide</title>
    <link>https://www.youngju.dev/blog/culture/2026-05-16-open-source-ai-training-datasets-2026-common-crawl-fineweb-redpajama-dolma-slimpajama-the-stack-laion-coyo-deep-dive.ja</link>
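    <description>A map of the full landscape of open source AI training datasets in 2026. It covers Common Crawl, the foundation of every LLM; its refined derivatives RefinedWeb / RedPajama-V2 / FineWeb / FineWeb-Edu / Dolma / SlimPajama; the academic corpora The Pile / S2ORC / arXiv; the code corpora The Stack v2 / StarCoder; the multimodal LAION-5B / DataComp / COYO-700M; datasets from Korea (AI Hub, NIA, KAIST, HyperCLOVA) and Japan (NII, NTT, ABEJA); and Open X-Embodiment for robotics. A deep-dive guide for practitioners, including licensing, ethics, and the right to be forgotten.</description>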
    <pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate>
    <author>fjvbn2003@gmail.com (Youngju Kim)</author>
    <category>ai-datasets</category><category>training-data</category><category>common-crawl</category><category>refinedweb</category><category>redpajama</category><category>fineweb</category><category>fineweb-edu</category><category>the-pile</category><category>dolma</category><category>slimpajama</category><category>oscar</category><category>c4</category><category>mc4</category><category>commonpile</category><category>roots</category><category>openwebtext</category><category>arxiv</category><category>s2orc</category><category>the-stack-v2</category><category>starcoder</category><category>coyo-700m</category><category>kakao-brain</category><category>laion-5b</category><category>laion-aesthetics</category><category>datacomp</category><category>imagenet</category><category>cc12m</category><category>open-images</category><category>coco</category><category>open-x-embodiment</category><category>2026</category><category>deep-dive</category><category>日本語</category>
  </item>

  <item>
    <guid>https://www.youngju.dev/blog/culture/2026-05-16-open-source-ai-training-datasets-2026-common-crawl-fineweb-redpajama-dolma-slimpajama-the-stack-laion-coyo-deep-dive</guid>
    <title>Open Source AI Training Datasets 2026: Common Crawl / FineWeb (HF) / RedPajama-V2 / Dolma / SlimPajama / The Stack v2 / LAION / COYO-700M (Kakao) In-Depth Guide</title>
    <link>https://www.youngju.dev/blog/culture/2026-05-16-open-source-ai-training-datasets-2026-common-crawl-fineweb-redpajama-dolma-slimpajama-the-stack-laion-coyo-deep-dive</link>
    <description>A map of the full landscape of open source AI training datasets in 2026. It covers Common Crawl, the foundation of every LLM; its refined derivatives RefinedWeb / RedPajama-V2 / FineWeb / FineWeb-Edu / Dolma / SlimPajama; the academic corpora The Pile / S2ORC / arXiv; the code corpora The Stack v2 / StarCoder; the multimodal LAION-5B / DataComp / COYO-700M; Korean and Japanese datasets (AI Hub, NIA, KAIST, HyperCLOVA, NII, NTT, ABEJA); and Open X-Embodiment for robotics. An in-depth guide for practitioners, including licensing, ethics, and opt-out rights.</description>
    <pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate>
    <author>fjvbn2003@gmail.com (Youngju Kim)</author>
    <category>ai-datasets</category><category>training-data</category><category>common-crawl</category><category>refinedweb</category><category>redpajama</category><category>fineweb</category><category>fineweb-edu</category><category>the-pile</category><category>dolma</category><category>slimpajama</category><category>oscar</category><category>c4</category><category>mc4</category><category>commonpile</category><category>roots</category><category>openwebtext</category><category>arxiv</category><category>s2orc</category><category>the-stack-v2</category><category>starcoder</category><category>coyo-700m</category><category>kakao-brain</category><category>laion-5b</category><category>laion-aesthetics</category><category>datacomp</category><category>imagenet</category><category>cc12m</category><category>open-images</category><category>coco</category><category>open-x-embodiment</category><category>2026</category><category>deep-dive</category>
  </item>

    </channel>
  </rss>
