- Author: Youngju Kim (@fjvbn20031)
The Complete Guide to Natural Language Processing (NLP): From Zero to Hero
Natural language processing (NLP) is a core field of artificial intelligence that enables computers to understand and generate human language. Countless services we use every day, including ChatGPT, translation engines, search engines, and sentiment analysis systems, are built on NLP technology. This guide lays out a complete learning path covering everything from basic text preprocessing to the latest large language models (LLMs).
Table of Contents
- NLP fundamentals and text preprocessing
- Text representation
- Word embeddings
- NLP with recurrent neural networks
- Attention mechanisms
- The Transformer architecture in depth
- BERT in depth
- The GPT family of models
- Modern NLP techniques
- Real-world NLP projects
1. NLP Fundamentals and Text Preprocessing
The first step in natural language processing is converting raw text into a form a model can work with. Because text is unstructured data, organizing it into a consistent, structured format is essential.
1.1 Tokenization
Tokenization is the process of splitting text into smaller units called tokens. Depending on the unit, there is word tokenization, character tokenization, and subword tokenization.
Word Tokenization
The most intuitive approach: split text into words based on whitespace and punctuation.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
text = "Natural Language Processing is fascinating! It powers ChatGPT and many AI applications."
# Word tokenization
word_tokens = word_tokenize(text)
print("Word tokens:", word_tokens)
# Output: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!', ...]
# Sentence tokenization
sent_tokens = sent_tokenize(text)
print("Sentence tokens:", sent_tokens)
# Output: ['Natural Language Processing is fascinating!', 'It powers ChatGPT...']
Character Tokenization
Splits text into individual characters. The vocabulary is small and there is no out-of-vocabulary (OOV) problem, but sequences become very long.
text = "Hello NLP"
char_tokens = list(text)
print("Character tokens:", char_tokens)
# Output: ['H', 'e', 'l', 'l', 'o', ' ', 'N', 'L', 'P']
Subword Tokenization
Splits text into units between words and characters. Algorithms such as BPE (Byte Pair Encoding), WordPiece, and SentencePiece are used in modern models like BERT and GPT.
from tokenizers import ByteLevelBPETokenizer
# Train a BPE tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
)
# Encode
encoding = tokenizer.encode("Natural Language Processing")
print("Tokens:", encoding.tokens)
print("IDs:", encoding.ids)
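The library above hides the training loop, but the merge rule BPE actually learns can be sketched in a few lines of pure Python, following the classic pair-counting procedure. The toy corpus (word to frequency, with symbols space-separated) and the `bpe_merges` helper name are illustrative, not part of any library API:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict of space-separated symbols."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge every occurrence of the most frequent pair into one symbol
        # (naive str.replace is fine for this toy corpus, not in general)
        vocab = {w.replace(' '.join(best), ''.join(best)): f for f_w, f in [] or vocab.items() for w in [f_w]}
    return merges, vocab

corpus = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
merges, vocab = bpe_merges(corpus, 3)
print(merges)  # first merges: ('e', 's'), then ('es', 't'), then ('l', 'o')
```

Each learned merge becomes one entry in the tokenizer's vocabulary; real implementations simply run many more merges over a much larger corpus.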
1.2 Stop Word Removal
Stop words are words like "the", "is", and "at" that occur so frequently they carry little meaningful information. Removing them reduces data size and lets the model focus on informative words.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
text = "This is a sample sentence showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print("Original:", word_tokens)
print("Filtered:", filtered_sentence)
# Output: ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration', '.']
1.3 Stemming and Lemmatization
Stemming
Extracts a word's stem by stripping affixes. It is fast but can produce linguistically incorrect results.
from nltk.stem import PorterStemmer, LancasterStemmer
ps = PorterStemmer()
ls = LancasterStemmer()
words = ["running", "runs", "ran", "runner", "easily", "fairly"]
for word in words:
    print(f"{word:15} -> Porter: {ps.stem(word):15} Lancaster: {ls.stem(word)}")
Lemmatization
Extracts a word's base form (lemma). It is slower than stemming but linguistically accurate.
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# Specifying the part of speech gives more accurate results
print(lemmatizer.lemmatize("running", pos='v')) # run
print(lemmatizer.lemmatize("better", pos='a')) # good
print(lemmatizer.lemmatize("dogs")) # dog
print(lemmatizer.lemmatize("went", pos='v')) # go
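As the examples show, the WordNet lemmatizer defaults to treating everything as a noun, so supplying the right POS matters. A common pattern is to map Penn Treebank tags (as produced by `nltk.pos_tag`) onto WordNet's four POS codes; the `penn_to_wordnet` helper below is our own name for that mapping:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code for lemmatization."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default: noun (the lemmatizer's own default)

# e.g. nltk.pos_tag would yield pairs like ('was', 'VBD'), ('running', 'VBG')
for word, tag in [('was', 'VBD'), ('running', 'VBG'), ('better', 'JJR')]:
    print(word, '->', penn_to_wordnet(tag))
```

This plugs directly into the code above as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.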
1.4 Text Cleaning with Regular Expressions
import re
def clean_text(text):
    """Text cleaning function."""
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # Remove special characters (keep alphanumerics and whitespace)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text)
    # Strip surrounding whitespace and lowercase
    text = text.strip().lower()
    return text
sample_text = """
Check out my website at https://example.com!
Email me at user@example.com for <b>more info</b>.
It's really cool!!!
"""
cleaned = clean_text(sample_text)
print(cleaned)
# Output: "check out my website at email me at for more info its really cool"
1.5 Advanced Preprocessing with spaCy
spaCy is an industrial-strength NLP library providing tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and more.
import spacy
# Load the English model
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
print("=== Token information ===")
for token in doc:
    print(f"{token.text:15} | POS: {token.pos_:10} | Lemma: {token.lemma_:15} | Stop: {token.is_stop}")
print("\n=== Named entity recognition ===")
for ent in doc.ents:
    print(f"{ent.text:20} -> {ent.label_} ({spacy.explain(ent.label_)})")
print("\n=== Dependency parsing ===")
for token in doc:
    print(f"{token.text:15} -> {token.dep_:15} (head: {token.head.text})")
2. Text Representation
Converting text into numeric vectors is at the heart of NLP. For a model to process text, it must be converted into a numerical form that supports mathematical operations.
2.1 Bag of Words (BoW)
BoW is the simplest approach: it represents text as a vector of word frequencies. It ignores word order entirely and considers only how often each word appears.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
corpus = [
    "I love natural language processing",
    "Natural language processing is amazing",
    "I love machine learning too",
    "Deep learning is part of machine learning"
]
# BoW vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBoW Matrix:")
print(X.toarray())
print("\nShape:", X.shape)  # (4 documents, n words)
# Inspect a single document
doc_0 = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))
print("\nWord frequencies in document 0:")
for word, count in sorted(doc_0.items()):
    if count > 0:
        print(f"  {word}: {count}")
2.2 TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) improves on BoW by weighting a word not only by how often it appears in a document but also by how rare it is across all documents.
TF (Term Frequency): how often a word appears in a given document. IDF (Inverse Document Frequency): the log-scaled inverse of the fraction of documents containing the word; rarer words score higher.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "the dog sat on the log",
    "the cat wore the hat",
]
# TF-IDF vectorizer
tfidf = TfidfVectorizer(smooth_idf=True, norm='l2')
X = tfidf.fit_transform(corpus)
# Display as a DataFrame
df = pd.DataFrame(
    X.toarray(),
    columns=tfidf.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(len(corpus))]
)
print(df.round(3))
# Manual TF-IDF implementation
def compute_tfidf(corpus):
    from math import log
    vocab = set()
    for doc in corpus:
        vocab.update(doc.split())
    vocab = sorted(vocab)
    def tf(word, doc):
        words = doc.split()
        return words.count(word) / len(words)
    def idf(word, corpus):
        n_docs_with_word = sum(1 for doc in corpus if word in doc.split())
        return log((1 + len(corpus)) / (1 + n_docs_with_word)) + 1
    tfidf_matrix = []
    for doc in corpus:
        row = [tf(word, doc) * idf(word, corpus) for word in vocab]
        tfidf_matrix.append(row)
    return tfidf_matrix, vocab
matrix, vocab = compute_tfidf(corpus)
print("\nManually computed TF-IDF:")
df_manual = pd.DataFrame(matrix, columns=vocab)
print(df_manual.round(3))
2.3 N-grams
An n-gram is a sequence of N consecutive items (words or characters) from a given text. Using n-grams lets a model capture some local word-order information.
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
text = "I love natural language processing and machine learning"
tokens = word_tokenize(text.lower())
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print("Unigrams:", unigrams[:5])
print("Bigrams:", bigrams[:5])
print("Trigrams:", trigrams[:5])
def get_ngram_freq(text, n):
    tokens = word_tokenize(text.lower())
    n_grams = list(ngrams(tokens, n))
    return Counter(n_grams)
large_corpus = """
Machine learning is a subset of artificial intelligence.
Artificial intelligence is transforming many industries.
Natural language processing is a part of machine learning.
Deep learning has revolutionized natural language processing.
"""
bigram_freq = get_ngram_freq(large_corpus, 2)
print("\nMost frequent bigrams:")
for bigram, count in bigram_freq.most_common(10):
    print(f"  {' '.join(bigram)}: {count}")
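These bigram counts are exactly what a classical n-gram language model needs: the maximum-likelihood estimate is P(w2 | w1) = count(w1, w2) / count(w1). A minimal sketch on a toy corpus (no smoothing, so unseen histories get probability zero):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens[:-1])  # the last token never starts a bigram

def bigram_prob(w1, w2):
    """Maximum-likelihood P(w2 | w1); zero if w1 was never seen as a history."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob('the', 'cat'))  # 2 of the 3 'the' occurrences precede 'cat'
```

Real n-gram models add smoothing (e.g. add-one or Kneser-Ney) so that unseen bigrams do not zero out a whole sentence's probability.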
3. Word Embeddings
Word embeddings represent words as dense vectors such that semantically similar words end up close together in the vector space.
3.1 Word2Vec
Word2Vec is a landmark word embedding model published by Tomas Mikolov's team at Google in 2013. It comes in two variants: CBOW (Continuous Bag of Words) and Skip-gram.
CBOW: predicts the center word from the surrounding context words. Skip-gram: predicts the surrounding context words from the center word.
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
class Word2VecSkipGram(nn.Module):
    """Skip-gram Word2Vec implementation."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.center_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.context_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.center_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
        self.context_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
    def forward(self, center, context, negative):
        """
        center: (batch_size,) center word indices
        context: (batch_size,) actual context word indices
        negative: (batch_size, neg_samples) negative sample indices
        """
        center_emb = self.center_embedding(center)  # (batch, dim)
        # Positive sample score
        context_emb = self.context_embedding(context)  # (batch, dim)
        pos_score = torch.sum(center_emb * context_emb, dim=1)  # (batch,)
        pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)
        # Negative sample scores
        neg_emb = self.context_embedding(negative)  # (batch, neg_samples, dim)
        center_emb_expanded = center_emb.unsqueeze(1)  # (batch, 1, dim)
        neg_score = torch.bmm(neg_emb, center_emb_expanded.transpose(1, 2)).squeeze(2)
        neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_score) + 1e-10), dim=1)
        return (pos_loss + neg_loss).mean()
class Word2VecCBOW(nn.Module):
    """CBOW Word2Vec implementation."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
    def forward(self, context):
        """
        context: (batch_size, context_window_size) context word indices
        """
        embedded = self.embedding(context)  # (batch, window, dim)
        mean_embedded = embedded.mean(dim=1)  # (batch, dim)
        output = self.linear(mean_embedded)  # (batch, vocab_size)
        return output
# Practical usage with Gensim
from gensim.models import Word2Vec
sentences = [
    "I love natural language processing".split(),
    "natural language processing is a field of AI".split(),
    "machine learning is part of artificial intelligence".split(),
    "deep learning models process natural language".split(),
    "word embeddings represent words as vectors".split(),
    "Word2Vec learns word representations".split(),
    "semantic similarity between words is captured by embeddings".split(),
]
model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # minimum word frequency
    workers=4,
    epochs=100,
    sg=1,             # 1=Skip-gram, 0=CBOW
    negative=5        # negative sampling
)
print("'language' vector (first 5 dimensions):", model.wv['language'][:5])
similar_words = model.wv.most_similar('language', topn=5)
print("\nWords similar to 'language':")
for word, similarity in similar_words:
    print(f"  {word}: {similarity:.4f}")
result = model.wv.most_similar(
    positive=['artificial', 'language'],
    negative=['natural'],
    topn=3
)
print("\nAnalogy results:")
for word, sim in result:
    print(f"  {word}: {sim:.4f}")
print("\nSimilarity between 'language' and 'processing':", model.wv.similarity('language', 'processing'))
3.2 Visualizing Embeddings with t-SNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
def visualize_embeddings(model, words=None):
    """Visualize Word2Vec embeddings with t-SNE."""
    if words is None:
        words = list(model.wv.key_to_index.keys())[:50]
    vectors = np.array([model.wv[word] for word in words])
    tsne = TSNE(
        n_components=2,
        random_state=42,
        perplexity=min(30, len(words) - 1),
        max_iter=1000  # named n_iter before scikit-learn 1.5
    )
    vectors_2d = tsne.fit_transform(vectors)
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.7)
    for i, word in enumerate(words):
        ax.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]),
                    fontsize=9, ha='center', va='bottom')
    ax.set_title("t-SNE visualization of Word2Vec embeddings")
    plt.tight_layout()
    plt.savefig('word2vec_tsne.png', dpi=150, bbox_inches='tight')
    plt.show()
visualize_embeddings(model)
3.3 GloVe
GloVe (Global Vectors for Word Representation), developed at Stanford, exploits global co-occurrence statistics from the entire corpus.
# Using pretrained GloVe vectors
# pip install torchtext
from torchtext.vocab import GloVe
import torch
glove = GloVe(name='6B', dim=100)
vector = glove['computer']
print("'computer' GloVe vector (first 5 dimensions):", vector[:5].numpy())
def cosine_similarity(v1, v2):
    return torch.nn.functional.cosine_similarity(
        v1.unsqueeze(0), v2.unsqueeze(0)
    ).item()
words_to_compare = [('king', 'queen'), ('man', 'woman'), ('Paris', 'France')]
for w1, w2 in words_to_compare:
    if w1 in glove.stoi and w2 in glove.stoi:
        sim = cosine_similarity(glove[w1], glove[w2])
        print(f"sim('{w1}', '{w2}') = {sim:.4f}")
4. NLP with Recurrent Neural Networks
Recurrent neural networks (RNNs) are designed for sequential data. They carry information from previous time steps into the current step, which makes them well suited to text.
4.1 Basic RNN Architecture
import torch
import torch.nn as nn
class SimpleRNN(nn.Module):
    """A basic RNN implementation."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.input_to_hidden = nn.Linear(input_size + hidden_size, hidden_size)
        self.hidden_to_output = nn.Linear(hidden_size, output_size)
    def forward(self, x, hidden=None):
        """
        x: (seq_len, batch, input_size)
        hidden: (batch, hidden_size) initial hidden state
        """
        batch_size = x.size(1)
        if hidden is None:
            hidden = torch.zeros(batch_size, self.hidden_size)
        outputs = []
        for t in range(x.size(0)):
            combined = torch.cat([x[t], hidden], dim=1)
            hidden = torch.tanh(self.input_to_hidden(combined))
            output = self.hidden_to_output(hidden)
            outputs.append(output)
        return torch.stack(outputs, dim=0), hidden
# Using PyTorch's built-in RNN
rnn = nn.RNN(
    input_size=50,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    dropout=0.3,
    bidirectional=True
)
x = torch.randn(32, 20, 50)  # (batch, seq_len, features)
output, hidden = rnn(x)
print("Output shape:", output.shape)  # (32, 20, 256) - bidirectional: 128*2
print("Hidden shape:", hidden.shape)  # (4, 32, 128) - 2 layers * 2 directions
4.2 LSTM: The Gate Mechanism Explained
LSTM (Long Short-Term Memory) was designed to solve the vanishing gradient problem of plain RNNs. Its gate mechanism lets it capture long-range dependencies.
class LSTMCell(nn.Module):
    """An LSTM cell, implemented by hand to illustrate the gate mechanism."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Compute all four gates at once for efficiency:
        # forget gate, input gate, cell gate, output gate
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
    def forward(self, x, state=None):
        """
        x: (batch, input_size)
        state: (h, c) previous hidden state and cell state
        """
        batch_size = x.size(0)
        if state is None:
            h = torch.zeros(batch_size, self.hidden_size)
            c = torch.zeros(batch_size, self.hidden_size)
        else:
            h, c = state
        combined = torch.cat([x, h], dim=1)
        gates = self.gates(combined)
        f_gate, i_gate, g_gate, o_gate = gates.chunk(4, dim=1)
        f = torch.sigmoid(f_gate)  # forget gate: how much to forget
        i = torch.sigmoid(i_gate)  # input gate: how much new information to store
        g = torch.tanh(g_gate)     # cell gate: new candidate information
        o = torch.sigmoid(o_gate)  # output gate: how much to output
        new_c = f * c + i * g      # forget some of the old, add the new
        new_h = o * torch.tanh(new_c)
        return new_h, (new_h, new_c)
# Bidirectional LSTM for sentiment analysis
class SentimentLSTM(nn.Module):
    """An LSTM-based sentiment analysis model."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout,
            bidirectional=True
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes)
        )
    def forward(self, x, lengths=None):
        embedded = self.dropout(self.embedding(x))
        if lengths is not None:
            packed = nn.utils.rnn.pack_padded_sequence(
                embedded, lengths, batch_first=True, enforce_sorted=False
            )
            lstm_out, (hidden, cell) = self.lstm(packed)
            lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
        else:
            lstm_out, (hidden, cell) = self.lstm(embedded)
        # Last layer's forward and backward hidden states
        forward_hidden = hidden[-2]
        backward_hidden = hidden[-1]
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)
        output = self.classifier(self.dropout(combined))
        return output
4.3 GRU
GRU (Gated Recurrent Unit) is a simpler alternative to LSTM that trains faster while often delivering comparable performance.
class GRUCell(nn.Module):
    """A GRU cell, implemented by hand."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.reset_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.update_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.new_gate_input = nn.Linear(input_size, hidden_size)
        self.new_gate_hidden = nn.Linear(hidden_size, hidden_size)
    def forward(self, x, h=None):
        batch_size = x.size(0)
        if h is None:
            h = torch.zeros(batch_size, self.hidden_size)
        combined = torch.cat([x, h], dim=1)
        r = torch.sigmoid(self.reset_gate(combined))   # reset gate
        z = torch.sigmoid(self.update_gate(combined))  # update gate
        n = torch.tanh(
            self.new_gate_input(x) + r * self.new_gate_hidden(h)
        )
        new_h = (1 - z) * n + z * h
        return new_h
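The speed advantage comes from GRU having three gates to LSTM's four, so for one layer the parameter count differs by exactly that 3:4 ratio. A quick count assuming PyTorch's parameter layout (per layer: weight_ih, weight_hh, and two bias vectors, each stacked across the gates):

```python
def rnn_layer_params(input_size, hidden_size, num_gates):
    """Parameters in one recurrent layer: per-gate input and hidden weights plus two biases."""
    return num_gates * (hidden_size * input_size    # weight_ih
                        + hidden_size * hidden_size  # weight_hh
                        + 2 * hidden_size)           # bias_ih + bias_hh

lstm = rnn_layer_params(100, 256, num_gates=4)
gru = rnn_layer_params(100, 256, num_gates=3)
print(lstm, gru, gru / lstm)  # the ratio is exactly 3/4
```

The same numbers come out of `sum(p.numel() for p in nn.LSTM(100, 256).parameters())`, so this is a handy sanity check when sizing models.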
4.4 Seq2Seq Machine Translation
class Encoder(nn.Module):
    """Seq2Seq encoder."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell
class Decoder(nn.Module):
    """Seq2Seq decoder."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)
    def forward(self, trg, hidden, cell):
        trg = trg.unsqueeze(1)  # (batch, 1)
        embedded = self.dropout(self.embedding(trg))
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))  # (batch, vocab_size)
        return prediction, hidden, cell
class Seq2Seq(nn.Module):
    """Seq2Seq translation model."""
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.fc.out_features
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
        encoder_outputs, hidden, cell = self.encoder(src)
        # First decoder input: the start-of-sequence token
        dec_input = trg[:, 0]
        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(dec_input, hidden, cell)
            outputs[:, t] = output
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1)
            dec_input = trg[:, t] if teacher_force else top1
        return outputs
5. Attention Mechanisms
The attention mechanism lets a model dynamically focus on different parts of the input sequence while generating each output token.
5.1 Bahdanau Attention
class BahdanauAttention(nn.Module):
    """Bahdanau (additive) attention."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_enc = nn.Linear(hidden_dim * 2, hidden_dim)
        self.W_dec = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)
    def forward(self, decoder_hidden, encoder_outputs):
        """
        decoder_hidden: (batch, hidden) current decoder hidden state
        encoder_outputs: (batch, src_len, hidden*2) all encoder outputs
        """
        src_len = encoder_outputs.shape[1]
        decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
        energy = torch.tanh(
            self.W_enc(encoder_outputs) + self.W_dec(decoder_hidden)
        )
        attention = self.v(energy).squeeze(2)  # (batch, src_len)
        attention_weights = torch.softmax(attention, dim=1)
        context = torch.bmm(
            attention_weights.unsqueeze(1),  # (batch, 1, src_len)
            encoder_outputs                  # (batch, src_len, hidden*2)
        ).squeeze(1)
        return context, attention_weights
class LuongAttention(nn.Module):
    """Luong (multiplicative) attention."""
    def __init__(self, hidden_dim, method='dot'):
        super().__init__()
        self.method = method
        if method == 'general':
            self.W = nn.Linear(hidden_dim, hidden_dim)
        elif method == 'concat':
            self.W = nn.Linear(hidden_dim * 2, hidden_dim)
            self.v = nn.Linear(hidden_dim, 1)
    def forward(self, decoder_hidden, encoder_outputs):
        if self.method == 'dot':
            score = torch.bmm(
                encoder_outputs,
                decoder_hidden.unsqueeze(2)
            ).squeeze(2)
        elif self.method == 'general':
            energy = self.W(encoder_outputs)
            score = torch.bmm(
                energy,
                decoder_hidden.unsqueeze(2)
            ).squeeze(2)
        elif self.method == 'concat':
            decoder_expanded = decoder_hidden.unsqueeze(1).expand_as(encoder_outputs)
            energy = torch.tanh(self.W(torch.cat([decoder_expanded, encoder_outputs], dim=2)))
            score = self.v(energy).squeeze(2)
        attention_weights = torch.softmax(score, dim=1)
        context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, attention_weights
5.2 Self-Attention
Self-attention lets each position in a sequence attend to every other position. It is the core building block of the Transformer.
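Stripped of batching, heads, and learned projections, self-attention reduces to softmax(QK^T / sqrt(d)) V. A single-head NumPy sketch of just that computation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d). Returns (output, attention_weights)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, dimension 8; Q = K = V = x for self-attention
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape, w.shape)    # (4, 8) (4, 4)
```

The multi-head module below runs this same computation in parallel over several learned projections of the input.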
class SelfAttention(nn.Module):
    """Multi-head self-attention."""
    def __init__(self, embed_dim, num_heads=8, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        self.W_o = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, mask=None):
        """
        x: (batch, seq_len, embed_dim)
        mask: (batch, seq_len, seq_len) attention mask
        """
        batch_size, seq_len, _ = x.shape
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = self.dropout(torch.softmax(scores, dim=-1))
        attended = torch.matmul(attention_weights, V)
        attended = attended.transpose(1, 2).contiguous()
        attended = attended.view(batch_size, seq_len, self.embed_dim)
        output = self.W_o(attended)
        return output, attention_weights
6. The Transformer Architecture in Depth
Introduced in the 2017 paper "Attention Is All You Need", the Transformer completely changed the NLP paradigm. Using only attention mechanisms, with no recurrence or convolutions, it achieved state-of-the-art translation performance.
6.1 Positional Encoding
Because the Transformer processes all positions in parallel, positional encoding is needed to inject sequence-order information.
import torch
import torch.nn as nn
import math
class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding."""
    def __init__(self, embed_dim, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, embed_dim)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even indices: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices: cos
        pe = pe.unsqueeze(0)  # (1, max_len, embed_dim)
        self.register_buffer('pe', pe)
    def forward(self, x):
        """x: (batch, seq_len, embed_dim)"""
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
class LearnablePositionalEncoding(nn.Module):
    """Learnable positional encoding (BERT-style)."""
    def __init__(self, embed_dim, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, embed_dim)
    def forward(self, x):
        batch_size, seq_len, _ = x.shape
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        return x + self.pe(positions)
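The sinusoidal table can be sanity-checked outside PyTorch: the formulas PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) imply that row 0 must be exactly [0, 1, 0, 1, ...], and every entry must lie in [-1, 1]. A NumPy reconstruction of the same table:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """Build the sinusoidal positional-encoding table from the 'Attention Is All You Need' formulas."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]          # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d)  # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=50, d=16)
print(pe[0])  # alternates 0, 1, 0, 1, ... since sin(0)=0 and cos(0)=1
```

This matches the PyTorch module above, where `div_term` is just 10000^(-i/d) computed in log space.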
6.2 A Complete Transformer Implementation
class MultiHeadAttention(nn.Module):
    """Multi-head attention."""
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_o = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
    def split_heads(self, x):
        """(batch, seq, embed) -> (batch, heads, seq, head_dim)"""
        batch, seq, _ = x.shape
        x = x.view(batch, seq, self.num_heads, self.head_dim)
        return x.transpose(1, 2)
    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))
        scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = self.dropout(torch.softmax(scores, dim=-1))
        output = torch.matmul(attn_weights, V)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_dim)
        return self.W_o(output), attn_weights
class FeedForward(nn.Module):
    """Position-wise feed-forward network."""
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)
class EncoderLayer(nn.Module):
    """Transformer encoder layer."""
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, src_mask=None):
        attn_output, _ = self.self_attention(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)
        return x
class DecoderLayer(nn.Module):
    """Transformer decoder layer."""
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.cross_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, memory, src_mask=None, tgt_mask=None):
        # Masked self-attention (future tokens are masked out)
        self_attn_out, _ = self.self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(self_attn_out))
        # Cross-attention (attend to the encoder output)
        cross_attn_out, cross_attn_weights = self.cross_attention(x, memory, memory, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_out))
        ff_out = self.feed_forward(x)
        x = self.norm3(x + ff_out)
        return x, cross_attn_weights
class Transformer(nn.Module):
    """A complete Transformer implementation."""
    def __init__(
        self,
        src_vocab_size,
        tgt_vocab_size,
        embed_dim=512,
        num_heads=8,
        num_encoder_layers=6,
        num_decoder_layers=6,
        ff_dim=2048,
        max_len=5000,
        dropout=0.1
    ):
        super().__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, embed_dim)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, embed_dim)
        self.pos_encoding = PositionalEncoding(embed_dim, max_len, dropout)
        self.embed_scale = math.sqrt(embed_dim)
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(embed_dim, num_heads, ff_dim, dropout)
            for _ in range(num_encoder_layers)
        ])
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(embed_dim, num_heads, ff_dim, dropout)
            for _ in range(num_decoder_layers)
        ])
        self.output_norm = nn.LayerNorm(embed_dim)
        self.output_projection = nn.Linear(embed_dim, tgt_vocab_size)
        self._init_weights()
    def _init_weights(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
    def make_causal_mask(self, seq_len, device):
        """Autoregressive mask: prevents attention to future tokens."""
        mask = torch.tril(torch.ones(seq_len, seq_len, device=device)).bool()
        return mask.unsqueeze(0).unsqueeze(0)
    def make_pad_mask(self, x, pad_idx=0):
        """Padding mask."""
        return (x != pad_idx).unsqueeze(1).unsqueeze(2)
    def encode(self, src, src_mask=None):
        x = self.pos_encoding(self.src_embedding(src) * self.embed_scale)
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x
    def decode(self, tgt, memory, src_mask=None, tgt_mask=None):
        x = self.pos_encoding(self.tgt_embedding(tgt) * self.embed_scale)
        for layer in self.decoder_layers:
            x, _ = layer(x, memory, src_mask, tgt_mask)
        return self.output_norm(x)
    def forward(self, src, tgt, src_pad_idx=0, tgt_pad_idx=0):
        src_mask = self.make_pad_mask(src, src_pad_idx)
        tgt_len = tgt.shape[1]
        tgt_pad_mask = self.make_pad_mask(tgt, tgt_pad_idx)
        tgt_causal_mask = self.make_causal_mask(tgt_len, tgt.device)
        tgt_mask = tgt_pad_mask & tgt_causal_mask
        memory = self.encode(src, src_mask)
        output = self.decode(tgt, memory, src_mask, tgt_mask)
        logits = self.output_projection(output)
        return logits
# Instantiate and test
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Transformer(
    src_vocab_size=10000,
    tgt_vocab_size=10000,
    embed_dim=512,
    num_heads=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    ff_dim=2048,
    dropout=0.1
).to(device)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
src = torch.randint(1, 10000, (4, 20)).to(device)
tgt = torch.randint(1, 10000, (4, 18)).to(device)
output = model(src, tgt)
print(f"Output shape: {output.shape}")  # (4, 18, 10000)
7. BERT in Depth
BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018, revolutionized the NLP field.
7.1 BERT's Core Ideas
BERT uses two pretraining tasks:
Masked language modeling (MLM): 15% of the input tokens are selected and (mostly) replaced with [MASK], and the model predicts the original words. Next sentence prediction (NSP): given two sentences, the model predicts whether the second actually follows the first.
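What MLM does to the input can be sketched without a model. Of the 15% of selected positions, BERT replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged; the selected positions become the prediction targets, and the rest are ignored by the loss (the -100 label convention below is the one Hugging Face uses). The `mlm_corrupt` helper and toy IDs are illustrative:

```python
import random

MASK_ID = 103       # [MASK] in bert-base-uncased's vocabulary
VOCAB_SIZE = 30522  # bert-base-uncased vocabulary size

def mlm_corrupt(token_ids, mask_prob=0.15, seed=0):
    """Return (corrupted ids, labels); labels are -100 except at selected positions."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                   # not selected: left unchanged, no target
        labels[i] = tok                # selected: predict the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_ID     # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # else 10%: keep the original token
    return corrupted, labels

ids = list(range(1000, 1200))
corrupted, labels = mlm_corrupt(ids)
print(sum(l != -100 for l in labels), "of", len(ids), "positions selected as targets")
```

The 10% of targets left unchanged force the model to produce good representations for every input token, not just the ones showing [MASK].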
from transformers import (
    BertTokenizer,
    BertModel,
    BertForSequenceClassification,
    BertForTokenClassification,
    BertForQuestionAnswering,
    get_linear_schedule_with_warmup
)
# AdamW was removed from recent transformers releases; use the torch.optim version
from torch.optim import AdamW
import torch
from torch.utils.data import Dataset, DataLoader
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Natural language processing is transforming AI."
encoding = tokenizer(
    text,
    add_special_tokens=True,
    max_length=128,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)
print("Input IDs:", encoding['input_ids'][0][:10])
print("Attention mask:", encoding['attention_mask'][0][:10])
print("Decoded:", tokenizer.decode(encoding['input_ids'][0]))
# WordPiece tokenization
complex_words = ["unbelievable", "preprocessing", "transformers", "tokenization"]
for word in complex_words:
    tokens = tokenizer.tokenize(word)
    print(f"{word:20} -> {tokens}")
7.2 Fine-Tuning BERT for Text Classification
class SentimentDataset(Dataset):
"""感情分析データセット"""
def __init__(self, texts, labels, tokenizer, max_length=128):
self.texts = texts
self.labels = labels
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer(
self.texts[idx],
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
return {
'input_ids': encoding['input_ids'].squeeze(),
'attention_mask': encoding['attention_mask'].squeeze(),
'label': torch.tensor(self.labels[idx], dtype=torch.long)
}
def train_bert_classifier(
train_texts,
train_labels,
val_texts,
val_labels,
num_labels=2,
epochs=3,
batch_size=16,
lr=2e-5
):
"""BERT感情分析ファインチューニング"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
val_dataset = SentimentDataset(val_texts, val_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
model = BertForSequenceClassification.from_pretrained(
'bert-base-uncased',
num_labels=num_labels
).to(device)
optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=total_steps // 10,
num_training_steps=total_steps
)
best_val_acc = 0
for epoch in range(epochs):
model.train()
total_loss = 0
correct = 0
for batch in train_loader:
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['label'].to(device)
optimizer.zero_grad()
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
loss = outputs.loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
total_loss += loss.item()
preds = outputs.logits.argmax(dim=1)
correct += (preds == labels).sum().item()
train_acc = correct / len(train_dataset)
avg_loss = total_loss / len(train_loader)
model.eval()
val_correct = 0
with torch.no_grad():
for batch in val_loader:
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['label'].to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
preds = outputs.logits.argmax(dim=1)
val_correct += (preds == labels).sum().item()
val_acc = val_correct / len(val_dataset)
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Train Acc={train_acc:.4f}, Val Acc={val_acc:.4f}")
if val_acc > best_val_acc:
best_val_acc = val_acc
torch.save(model.state_dict(), 'best_bert_classifier.pt')
return model
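The scheduler above ramps the learning rate up over the first 10% of steps and then decays it linearly to zero. The multiplier it applies can be sketched in pure Python (a simplified stand-in for illustration, not the Transformers implementation itself):

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """LR multiplier: linear ramp 0 -> 1 during warmup, then linear decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total, warmup = 100, 10  # e.g. num_warmup_steps = total_steps // 10
print(linear_warmup_multiplier(5, warmup, total))    # 0.5 (mid-warmup)
print(linear_warmup_multiplier(10, warmup, total))   # 1.0 (warmup done)
print(linear_warmup_multiplier(100, warmup, total))  # 0.0 (end of training)
```

Warmup avoids large, destabilizing updates while the randomly initialized classification head is still far from the pre-trained encoder.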
7.3 Named Entity Recognition (NER) with BERT
from transformers import BertForTokenClassification
ner_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label2id = {label: i for i, label in enumerate(ner_labels)}
id2label = {i: label for i, label in enumerate(ner_labels)}
model = BertForTokenClassification.from_pretrained(
'bert-base-uncased',
num_labels=len(ner_labels),
id2label=id2label,
label2id=label2id
)
def predict_ner(text, model, tokenizer):
    """Predict named entities in a text"""
model.eval()
encoding = tokenizer(
text.split(),
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
truncation=True,
return_tensors='pt'
)
with torch.no_grad():
outputs = model(
input_ids=encoding['input_ids'],
attention_mask=encoding['attention_mask']
)
predictions = outputs.logits.argmax(dim=2).squeeze().tolist()
word_ids = encoding.word_ids()
result = []
prev_word_id = None
for pred, word_id in zip(predictions, word_ids):
if word_id is None or word_id == prev_word_id:
continue
word = text.split()[word_id]
label = id2label[pred]
result.append((word, label))
prev_word_id = word_id
return result
# A fast tokenizer is required for return_offsets_mapping / word_ids()
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."
# print(predict_ner(text, model, tokenizer))  # use with a fine-tuned model
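`predict_ner` returns one BIO tag per word; a common post-processing step is grouping those tags into entity spans. A minimal pure-Python sketch, independent of the model:

```python
def bio_to_entities(tagged_words):
    """Group (word, BIO-tag) pairs into (entity_text, entity_type) spans."""
    entities, current_words, current_type = [], [], None
    for word, tag in tagged_words:
        if tag.startswith('B-'):
            if current_words:  # close the previous entity
                entities.append((' '.join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith('I-') and current_type == tag[2:]:
            current_words.append(word)  # continue the current entity
        else:  # 'O' or an inconsistent I- tag ends the current span
            if current_words:
                entities.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((' '.join(current_words), current_type))
    return entities

tagged = [('Barack', 'B-PER'), ('Obama', 'I-PER'), ('was', 'O'),
          ('born', 'O'), ('in', 'O'), ('Hawaii', 'B-LOC')]
print(bio_to_entities(tagged))  # [('Barack Obama', 'PER'), ('Hawaii', 'LOC')]
```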
8. The GPT Family of Models
GPT (Generative Pre-trained Transformer) is a series of autoregressive language models developed by OpenAI.
8.1 The Evolution of GPT
| Model | Year | Parameters | Key Features |
|---|---|---|---|
| GPT-1 | 2018 | 117M | First GPT; unsupervised pre-training |
| GPT-2 | 2019 | 1.5B | Zero-shot transfer; famously "too dangerous to release" |
| GPT-3 | 2020 | 175B | In-context learning, few-shot prompting |
| InstructGPT | 2022 | 1.3B | RLHF, instruction following |
| GPT-4 | 2023 | Undisclosed | Multimodal, stronger reasoning |
8.2 Text Generation with GPT-2
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer.pad_token = tokenizer.eos_token
def generate_text(
prompt,
model,
tokenizer,
max_length=200,
temperature=0.9,
top_p=0.95,
top_k=50,
num_return_sequences=1,
do_sample=True
):
    """Generate text with GPT-2"""
inputs = tokenizer(prompt, return_tensors='pt')
input_ids = inputs['input_ids']
with torch.no_grad():
output = model.generate(
input_ids,
max_length=max_length,
temperature=temperature,
top_p=top_p,
top_k=top_k,
num_return_sequences=num_return_sequences,
do_sample=do_sample,
pad_token_id=tokenizer.eos_token_id,
repetition_penalty=1.2
)
generated_texts = []
for out in output:
text = tokenizer.decode(out, skip_special_tokens=True)
generated_texts.append(text)
return generated_texts
prompt = "Artificial intelligence is transforming the world by"
generated = generate_text(prompt, model, tokenizer, max_length=150, temperature=0.8)
for i, text in enumerate(generated):
    print(f"\nGenerated text {i+1}:")
print(text)
print("-" * 50)
def compare_generation_strategies(prompt, model, tokenizer):
    """Compare different decoding strategies"""
input_ids = tokenizer.encode(prompt, return_tensors='pt')
strategies = {
        "Greedy search": dict(do_sample=False),
        "Beam search": dict(do_sample=False, num_beams=5, early_stopping=True),
        "Temperature sampling": dict(do_sample=True, temperature=0.7),
        "Top-k sampling": dict(do_sample=True, top_k=50),
        "Top-p (nucleus) sampling": dict(do_sample=True, top_p=0.92),
}
    print(f"Prompt: {prompt}\n")
with torch.no_grad():
for name, params in strategies.items():
output = model.generate(
input_ids,
max_new_tokens=50,
pad_token_id=tokenizer.eos_token_id,
**params
)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"[{name}]")
print(text[len(prompt):].strip())
print()
compare_generation_strategies(
"The future of artificial intelligence",
model,
tokenizer
)
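To make the sampling strategies concrete, here is a pure-Python sketch of top-p (nucleus) filtering: keep the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`, then renormalize before sampling. This is an illustration of the idea, not the internals of `generate()`:

```python
def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest probability mass >= top_p; renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # nucleus is complete
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

vocab_probs = [0.5, 0.3, 0.15, 0.05]  # token id -> probability
filtered = nucleus_filter(vocab_probs, top_p=0.9)
print(sorted(filtered))  # [0, 1, 2] survive; the 0.05 tail is cut
```

Unlike top-k, the size of the kept set adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many.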
8.3 Prompt Engineering
# Zero-shot classification
def zero_shot_classification(text, categories, model_name="gpt2"):
    """Zero-shot text classification via language-model scoring"""
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()
scores = {}
for category in categories:
prompt = f"Text: {text}\nCategory: {category}"
inputs = tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
outputs = model(**inputs, labels=inputs['input_ids'])
scores[category] = -outputs.loss.item()
best_category = max(scores, key=scores.get)
return best_category, scores
# Few-shot learning example
few_shot_prompt = """
Classify the sentiment of the following text:
Text: "This movie was amazing!"
Sentiment: Positive
Text: "I hated every minute of it."
Sentiment: Negative
Text: "The film was okay, nothing special."
Sentiment: Neutral
Text: "Absolutely brilliant performance!"
Sentiment:"""
print("Few-shot prompt:")
print(few_shot_prompt)
# Chain-of-Thought prompting
cot_prompt = """
Q: A train travels 120 miles in 2 hours. How long will it take to travel 300 miles?
A: Let me think step by step.
1. First, find the speed: 120 miles / 2 hours = 60 miles per hour
2. Then, calculate the time for 300 miles: 300 miles / 60 mph = 5 hours
Therefore, it will take 5 hours.
Q: If a store sells 3 apples for $2.40, how much do 7 apples cost?
A: Let me think step by step.
"""
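The step-by-step answer the prompt is meant to elicit can be checked directly: find the unit price, then scale it.

```python
price_for_three = 2.40
unit_price = price_for_three / 3           # $0.80 per apple
cost_for_seven = round(unit_price * 7, 2)  # scale to 7 apples
print(f"${cost_for_seven}")  # prints $5.6
```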
9. Modern NLP Techniques
9.1 LoRA / QLoRA Fine-Tuning
LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models efficiently. The original weights are frozen, and only small low-rank matrices added to existing layers are trained.
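Concretely, a frozen weight matrix W of shape d×k is augmented with a trainable update (α/r)·B·A, where B is d×r, A is r×k, and r is much smaller than d or k. A quick illustration of why this saves parameters (illustrative dimensions, not the PEFT internals):

```python
def lora_param_counts(d, k, r):
    """Full fine-tuning trains d*k weights; LoRA trains only the two rank-r factors."""
    full = d * k        # frozen base weight matrix
    lora = d * r + r * k  # trainable B (d x r) and A (r x k)
    return full, lora

full, lora = lora_param_counts(768, 768, 8)  # e.g. a GPT-2-sized attention projection, r=8
print(full, lora, f"{100 * lora / full:.2f}%")  # LoRA trains ~2% of the weights here
```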
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
def setup_lora_model(model_name="gpt2", r=8, lora_alpha=32, lora_dropout=0.1):
    """Set up a LoRA-wrapped model"""
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
        r=r,  # LoRA rank
        lora_alpha=lora_alpha,  # scaling parameter
        target_modules=["c_attn"],  # layers to apply LoRA to
lora_dropout=lora_dropout,
bias="none"
)
model = get_peft_model(model, lora_config)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} ({100 * trainable / total:.2f}%)")
return model, tokenizer
class InstructionDataset(torch.utils.data.Dataset):
    """Instruction-format dataset for supervised fine-tuning (SFT)"""
def __init__(self, data, tokenizer, max_length=512):
self.tokenizer = tokenizer
self.max_length = max_length
self.texts = []
for item in data:
prompt = f"### Instruction:\n{item['instruction']}\n\n### Response:\n{item['response']}"
self.texts.append(prompt)
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer(
self.texts[idx],
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
input_ids = encoding['input_ids'].squeeze()
return {
'input_ids': input_ids,
'attention_mask': encoding['attention_mask'].squeeze(),
'labels': input_ids.clone()
}
9.2 RAG (Retrieval-Augmented Generation)
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
class SimpleRAG:
    """A simple retrieval-augmented generation (RAG) system"""
def __init__(self, embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
self.tokenizer = AutoTokenizer.from_pretrained(embedding_model)
self.model = AutoModel.from_pretrained(embedding_model)
self.knowledge_base = []
self.embeddings = []
def mean_pooling(self, model_output, attention_mask):
token_embeddings = model_output[0]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(
token_embeddings.size()
).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / \
torch.clamp(input_mask_expanded.sum(1), min=1e-9)
def encode(self, texts):
encoded = self.tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
)
with torch.no_grad():
model_output = self.model(**encoded)
embeddings = self.mean_pooling(model_output, encoded['attention_mask'])
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
return embeddings.numpy()
def add_documents(self, documents):
self.knowledge_base.extend(documents)
new_embeddings = self.encode(documents)
if len(self.embeddings) == 0:
self.embeddings = new_embeddings
else:
self.embeddings = np.vstack([self.embeddings, new_embeddings])
        print(f"Added {len(documents)} documents. Total: {len(self.knowledge_base)}.")
def retrieve(self, query, top_k=3):
query_embedding = self.encode([query])
similarities = cosine_similarity(query_embedding, self.embeddings)[0]
top_indices = np.argsort(similarities)[::-1][:top_k]
results = []
for idx in top_indices:
results.append({
'document': self.knowledge_base[idx],
'similarity': similarities[idx]
})
return results
def answer(self, query, generator_model, generator_tokenizer, top_k=3):
        """RAG: retrieve, then generate"""
relevant_docs = self.retrieve(query, top_k=top_k)
context = "\n".join([doc['document'] for doc in relevant_docs])
prompt = f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""
inputs = generator_tokenizer(prompt, return_tensors='pt', max_length=512, truncation=True)
with torch.no_grad():
outputs = generator_model.generate(
inputs['input_ids'],
max_new_tokens=150,
temperature=0.7,
do_sample=True,
pad_token_id=generator_tokenizer.eos_token_id
)
answer = generator_tokenizer.decode(outputs[0], skip_special_tokens=True)
answer = answer[len(prompt):].strip()
return answer, relevant_docs
# Usage example
rag = SimpleRAG()
documents = [
"BERT (Bidirectional Encoder Representations from Transformers) was published by Google in 2018.",
"GPT-3 has 175 billion parameters and was developed by OpenAI in 2020.",
"The Transformer architecture was introduced in the paper 'Attention is All You Need' in 2017.",
"RLHF (Reinforcement Learning from Human Feedback) is used to align language models with human values.",
"LoRA allows efficient fine-tuning by adding low-rank matrices to pre-trained model weights.",
"RAG combines information retrieval with text generation for more accurate responses.",
]
rag.add_documents(documents)
query = "What is BERT and when was it created?"
results = rag.retrieve(query, top_k=2)
print(f"\nQuery: {query}")
print("\nRetrieved documents:")
for r in results:
print(f" [{r['similarity']:.3f}] {r['document']}")
9.3 RLHF (Reinforcement Learning from Human Feedback)
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Reward model: scores the quality of a response"""
def __init__(self, base_model_name="gpt2"):
super().__init__()
from transformers import GPT2Model
self.transformer = GPT2Model.from_pretrained(base_model_name)
hidden_size = self.transformer.config.hidden_size
self.reward_head = nn.Linear(hidden_size, 1)
def forward(self, input_ids, attention_mask=None):
outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
last_hidden = outputs.last_hidden_state[:, -1, :]
reward = self.reward_head(last_hidden)
return reward.squeeze(-1)
class PPOTrainer:
    """PPO-based RLHF training (conceptual implementation)"""
def __init__(self, policy_model, reward_model, ref_model, tokenizer):
self.policy = policy_model
self.reward_model = reward_model
        self.ref_model = ref_model  # reference model for the KL penalty
self.tokenizer = tokenizer
def compute_kl_penalty(self, policy_logprobs, ref_logprobs, kl_coeff=0.1):
        """KL-divergence penalty: keeps the policy from drifting far from the reference"""
kl = policy_logprobs - ref_logprobs
return kl_coeff * kl.mean()
def compute_advantages(self, rewards, values, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation (GAE)"""
advantages = []
last_advantage = 0
for t in reversed(range(len(rewards))):
if t == len(rewards) - 1:
next_value = 0
else:
next_value = values[t + 1]
delta = rewards[t] + gamma * next_value - values[t]
advantage = delta + gamma * lam * last_advantage
advantages.insert(0, advantage)
last_advantage = advantage
return torch.tensor(advantages)
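The GAE recursion in `compute_advantages` can be sanity-checked with a standalone copy on a tiny example; with gamma = lam = 1 it reduces to reward-to-go minus the value estimate.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standalone copy of the GAE recursion above, on plain Python lists."""
    advantages, last_advantage = [], 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t < len(rewards) - 1 else 0.0
        delta = rewards[t] + gamma * next_value - values[t]        # TD error
        last_advantage = delta + gamma * lam * last_advantage       # discounted sum of deltas
        advantages.insert(0, last_advantage)
    return advantages

# gamma = lam = 1: advantage_t = sum of remaining rewards - value_t
print(gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))  # [3.0, 2.0, 1.0]
```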
10. Real-World NLP Projects
10.1 Sentiment Analysis System (Full Pipeline)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
import matplotlib.pyplot as plt
import seaborn as sns
class SentimentAnalysisPipeline:
    """End-to-end sentiment analysis pipeline"""
def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
self.model_name = model_name
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
self.model.eval()
self.label_map = {0: 'Negative', 1: 'Positive'}
def predict(self, texts, batch_size=32):
        """Batch prediction"""
if isinstance(texts, str):
texts = [texts]
all_predictions = []
all_probabilities = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
encoding = self.tokenizer(
batch,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
)
input_ids = encoding['input_ids'].to(self.device)
attention_mask = encoding['attention_mask'].to(self.device)
with torch.no_grad():
outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
probs = torch.softmax(outputs.logits, dim=1)
preds = probs.argmax(dim=1)
all_predictions.extend(preds.cpu().numpy())
all_probabilities.extend(probs.cpu().numpy())
results = []
for pred, prob in zip(all_predictions, all_probabilities):
results.append({
'label': self.label_map[pred],
'score': float(prob[pred]),
'probabilities': {self.label_map[i]: float(p) for i, p in enumerate(prob)}
})
return results if len(results) > 1 else results[0]
def analyze_batch(self, texts):
        """Batch sentiment analysis with summary statistics"""
results = self.predict(texts)
positive_count = sum(1 for r in results if r['label'] == 'Positive')
negative_count = len(results) - positive_count
avg_confidence = np.mean([r['score'] for r in results])
        print("\n=== Sentiment Analysis Results ===")
        print(f"Total texts: {len(texts)}")
        print(f"Positive: {positive_count} ({100*positive_count/len(texts):.1f}%)")
        print(f"Negative: {negative_count} ({100*negative_count/len(texts):.1f}%)")
        print(f"Average confidence: {avg_confidence:.3f}")
return results
pipeline = SentimentAnalysisPipeline()
reviews = [
"This product exceeded all my expectations! Absolutely love it.",
"Terrible quality, broke after one day. Complete waste of money.",
"It's okay, does what it's supposed to do.",
"Best purchase I've made this year!",
"Would not recommend to anyone.",
]
print("=== Individual Predictions ===")
for review in reviews:
result = pipeline.predict(review)
print(f"Text: {review[:50]}...")
print(f" -> {result['label']} ({result['score']:.3f})\n")
print("\n=== Batch Analysis ===")
pipeline.analyze_batch(reviews)
10.2 Text Summarization System
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
class TextSummarizer:
    """Text summarization system"""
def __init__(self, model_name="facebook/bart-large-cnn"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
def summarize(self, text, max_length=130, min_length=30, num_beams=4):
        """Abstractive summarization"""
inputs = self.tokenizer(
text,
max_length=1024,
truncation=True,
return_tensors='pt'
).to(self.device)
with torch.no_grad():
summary_ids = self.model.generate(
inputs['input_ids'],
max_length=max_length,
min_length=min_length,
num_beams=num_beams,
length_penalty=2.0,
early_stopping=True
)
summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
original_words = len(text.split())
summary_words = len(summary.split())
compression = (1 - summary_words / original_words) * 100
return {
'summary': summary,
'original_length': original_words,
'summary_length': summary_words,
'compression_rate': f"{compression:.1f}%"
}
def extractive_summarize(self, text, num_sentences=3):
        """Extractive summarization (TF-IDF based)"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
if len(sentences) <= num_sentences:
return text
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)
similarity_matrix = cosine_similarity(tfidf_matrix)
scores = similarity_matrix.mean(axis=1)
top_indices = sorted(np.argsort(scores)[-num_sentences:].tolist())
summary = ' '.join([sentences[i] for i in top_indices])
return summary
summarizer = TextSummarizer()
long_text = """
Artificial intelligence has made remarkable strides in recent years, particularly in
natural language processing. The development of transformer-based models like BERT and
GPT has revolutionized how machines understand and generate human language. These models,
trained on massive datasets, can perform a wide range of tasks including translation,
summarization, question answering, and sentiment analysis.
The key innovation behind these models is the attention mechanism, which allows the model
to focus on relevant parts of the input when generating each word of the output. This has
enabled much more nuanced understanding of context and semantics compared to earlier
approaches like recurrent neural networks.
However, these advances come with challenges. Training large language models requires
enormous computational resources and energy. There are also concerns about bias in the
training data leading to biased outputs, and the potential for misuse in generating
misinformation. Researchers are actively working on making these models more efficient,
fair, and reliable.
The future of NLP looks promising, with models becoming increasingly capable of
understanding and generating language that is indistinguishable from human writing.
Applications range from customer service chatbots to scientific research assistance,
and the technology continues to evolve rapidly.
"""
result = summarizer.summarize(long_text)
print("=== Abstractive Summary ===")
print(f"Original length: {result['original_length']} words")
print(f"Summary length: {result['summary_length']} words")
print(f"Compression rate: {result['compression_rate']}")
print(f"\nSummary:\n{result['summary']}")
print("\n=== Extractive Summary ===")
extractive = summarizer.extractive_summarize(long_text, num_sentences=2)
print(extractive)
10.3 Question Answering System
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
class QASystem:
    """Question answering system"""
def __init__(self, model_name="deepset/roberta-base-squad2"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForQuestionAnswering.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
self.model.eval()
def answer(self, question, context, max_answer_len=100):
        """Extract an answer span from the context"""
encoding = self.tokenizer(
question,
context,
max_length=512,
truncation=True,
stride=128,
return_overflowing_tokens=True,
return_offsets_mapping=True,
padding='max_length',
return_tensors='pt'
)
offset_mapping = encoding.pop('offset_mapping').cpu()
sample_map = encoding.pop('overflow_to_sample_mapping').cpu()
encoding = {k: v.to(self.device) for k, v in encoding.items()}
with torch.no_grad():
outputs = self.model(**encoding)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
answers = []
for i in range(len(start_logits)):
start_log = start_logits[i].cpu().numpy()
end_log = end_logits[i].cpu().numpy()
offsets = offset_mapping[i].numpy()
start_idx = np.argmax(start_log)
end_idx = np.argmax(end_log)
if start_idx <= end_idx:
start_char = offsets[start_idx][0]
end_char = offsets[end_idx][1]
answer_text = context[start_char:end_char]
score = float(start_log[start_idx] + end_log[end_idx])
if answer_text:
answers.append({
'answer': answer_text,
'score': score,
'start': int(start_char),
'end': int(end_char)
})
if not answers:
            return {'answer': "No answer found.", 'score': 0.0}
return max(answers, key=lambda x: x['score'])
def multi_question_answer(self, questions, context):
        """Answer multiple questions against one context"""
results = []
for question in questions:
answer = self.answer(question, context)
results.append({
'question': question,
'answer': answer['answer'],
'confidence': answer['score']
})
return results
qa_system = QASystem()
context = """
The Transformer model was introduced in a 2017 paper titled "Attention Is All You Need"
by researchers at Google Brain. The model architecture relies entirely on attention
mechanisms, dispensing with recurrence and convolutions. BERT, which stands for
Bidirectional Encoder Representations from Transformers, was introduced in 2018 by
Google AI Language team. BERT achieved state-of-the-art results on eleven NLP tasks.
GPT-3, developed by OpenAI, was released in 2020 and has 175 billion parameters, making
it one of the largest language models at the time of its release.
"""
questions = [
"When was the Transformer model introduced?",
"What does BERT stand for?",
"How many parameters does GPT-3 have?",
"Who developed BERT?",
]
print("=== Question Answering System ===\n")
results = qa_system.multi_question_answer(questions, context)
for r in results:
print(f"Q: {r['question']}")
print(f"A: {r['answer']}")
    print(f"Confidence: {r['confidence']:.2f}")
print()
Conclusion: An NLP Learning Roadmap
This guide has walked the full NLP journey. A summary of each stage:
- Foundations: text preprocessing, BoW, TF-IDF
- Embeddings: Word2Vec, GloVe, FastText
- Sequence models: RNN, LSTM, GRU, Seq2Seq
- The attention revolution: attention mechanisms, self-attention
- Mastering the Transformer: the complete architecture
- Pre-trained models: BERT and GPT in practice
- Modern techniques: LoRA, RAG, RLHF
Recommended Resources
- Hugging Face official documentation
- The "Attention Is All You Need" paper
- The BERT paper
- PyTorch NLP tutorials
- Stanford CS224N: NLP with Deep Learning
- NLTK documentation
- spaCy documentation
- Gensim Word2Vec
NLP is a rapidly evolving field, so continuous learning and practice are essential. Run the code examples yourself, experiment with the parameters, and build your own projects to develop a deep understanding. Happy learning!