Python Complete Guide for AI/ML: Master NumPy, Pandas, Matplotlib, and Scikit-learn
Python is the standard language for AI and machine learning. Its clean syntax, vast library ecosystem, and active community have made it the go-to choice for both researchers and engineers. This guide covers everything you need to master the core Python libraries for AI/ML development.
1. Setting Up Your Python AI/ML Environment
Choosing a Python Version
For AI/ML work, Python 3.10 or later is recommended. Python 3.10+ offers structural pattern matching, clearer error messages, and improved type hints. As of 2026, Python 3.12 is stable and compatible with most ML libraries.
Check Python version
python --version
python3 --version
Install a specific version with pyenv
pyenv install 3.12.0
pyenv global 3.12.0
Setting Up Virtual Environments
Virtual environments isolate dependencies per project.
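A quick way to confirm that a script really runs inside an isolated environment is to compare `sys.prefix` with `sys.base_prefix`. The sketch below assumes a `venv`-style environment (conda environments are organized differently and may report `False`); the helper name `in_virtualenv` is illustrative, not part of any library.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base interpreter installation.
    return sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

Running this before and after activating an environment should show the interpreter path switching to the environment's own `python` binary.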
**venv (Standard Library)**
Create a virtual environment
python -m venv ml_env
Activate (Linux/Mac)
source ml_env/bin/activate
Activate (Windows)
ml_env\Scripts\activate
Deactivate
deactivate
**conda (Anaconda/Miniconda)**
Create environment
conda create -n ml_env python=3.12
Activate
conda activate ml_env
Install packages
conda install numpy pandas scikit-learn matplotlib
List environments
conda env list
Export environment
conda env export > environment.yml
Restore environment
conda env create -f environment.yml
**Poetry (Advanced Dependency Management)**
Install Poetry
curl -sSL https://install.python-poetry.org | python3 -
Initialize a project
poetry new ml_project
cd ml_project
Add packages
poetry add numpy pandas scikit-learn torch
Add dev dependencies
poetry add --group dev pytest black flake8
Run inside the environment
poetry run python train.py
Jupyter Notebook/Lab Setup
Install JupyterLab
pip install jupyterlab
Register kernel
python -m ipykernel install --user --name=ml_env --display-name "ML Environment"
Launch JupyterLab
jupyter lab
Install useful extensions
pip install jupyterlab-git
pip install nbformat
**Jupyter config file (~/.jupyter/jupyter_lab_config.py)**
c.ServerApp.open_browser = True
c.ServerApp.port = 8888
c.ServerApp.ip = '0.0.0.0'  # listens on all network interfaces; use '127.0.0.1' for local-only access
GPU Python Environment (CUDA, cuDNN)
Check CUDA version
nvidia-smi
nvcc --version
Install PyTorch with CUDA (CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Install TensorFlow with GPU
pip install tensorflow[and-cuda]
Verify GPU availability (PyTorch)
python -c "import torch; print(torch.cuda.is_available())"
Check cuDNN
python -c "import torch; print(torch.backends.cudnn.version())"
Essential Package List
requirements.txt
numpy>=1.24.0
pandas>=2.0.0
matplotlib>=3.7.0
seaborn>=0.12.0
scikit-learn>=1.3.0
scipy>=1.11.0
torch>=2.0.0
torchvision>=0.15.0
tensorflow>=2.13.0
xgboost>=1.7.0
lightgbm>=4.0.0
optuna>=3.3.0
wandb>=0.15.0
tqdm>=4.65.0
jupyterlab>=4.0.0
black>=23.0.0
flake8>=6.0.0
pytest>=7.4.0
Install all at once
pip install -r requirements.txt
2. Mastering NumPy
NumPy (Numerical Python) is the foundation of scientific computing in Python. It provides multidimensional arrays and mathematical functions, and most ML libraries use NumPy internally.
Creating ndarrays
Basic array creation
import numpy as np
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr1.shape) # (5,)
print(arr2.shape) # (2, 3)
print(arr2.dtype) # int64 (platform-dependent: int32 on Windows with NumPy < 2.0)
print(arr2.ndim) # 2
print(arr2.size) # 6
Special arrays
zeros = np.zeros((3, 4)) # all zeros
ones = np.ones((2, 3, 4)) # all ones
full = np.full((3, 3), 7) # all sevens
eye = np.eye(4) # identity matrix
empty = np.empty((2, 3)) # uninitialized
Range arrays
arange = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
logspace = np.logspace(0, 3, 4) # [1, 10, 100, 1000]
Random arrays
np.random.seed(42)
rand_uniform = np.random.rand(3, 4) # uniform [0, 1)
rand_normal = np.random.randn(3, 4) # standard normal
rand_int = np.random.randint(0, 10, (3, 4)) # random integers
Modern random API (recommended)
rng = np.random.default_rng(42)
samples = rng.normal(loc=0, scale=1, size=(100, 3))
Basic Operations and Broadcasting
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7, 8, 9], [10, 11, 12]])
Element-wise arithmetic
print(a + b) # element-wise addition
print(a - b) # element-wise subtraction
print(a * b) # element-wise multiplication
print(a / b) # element-wise division
print(a ** 2) # element-wise squaring
print(a % 2) # element-wise modulo
Broadcasting - operations on arrays of different shapes
x = np.array([[1], [2], [3]]) # shape: (3, 1)
y = np.array([10, 20, 30]) # shape: (3,) treated as (1, 3)
Broadcasting result: (3, 3)
result = x + y
print(result)
[[11, 21, 31],
[12, 22, 32],
[13, 23, 33]]
Practical Broadcasting: per-feature standardization (the normalization step used in batch normalization)
data = np.random.randn(100, 10) # 100 samples, 10 features
mean = data.mean(axis=0) # per-feature mean (shape: 10,)
std = data.std(axis=0) # per-feature std (shape: 10,)
normalized = (data - mean) / std # broadcasting normalization
print(normalized.mean(axis=0).round(10)) # approximately 0
print(normalized.std(axis=0).round(10)) # approximately 1
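The broadcasting rule behind the examples above can be checked without building any arrays: shapes are aligned from the last dimension, and each pair of dimensions must either match or contain a 1. The following sketch uses `np.broadcast_shapes` (available in NumPy 1.20 and later); the variable names are illustrative.

```python
import numpy as np

# Shapes are aligned from the trailing dimension; each pair of dimensions
# must be equal, or one of them must be 1.
print(np.broadcast_shapes((3, 1), (3,)))      # (3, 3)
print(np.broadcast_shapes((100, 10), (10,)))  # (100, 10)

# Incompatible shapes raise ValueError instead of silently misaligning data.
try:
    np.broadcast_shapes((100, 10), (100,))
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False

# keepdims=True keeps a length-1 axis so per-row statistics broadcast correctly.
data = np.arange(12, dtype=float).reshape(3, 4)
row_centered = data - data.mean(axis=1, keepdims=True)  # (3, 4) - (3, 1)
```

Per-feature statistics (`axis=0`) broadcast against `(n_samples, n_features)` data as-is, while per-sample statistics (`axis=1`) need `keepdims=True` or an explicit reshape.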
Indexing, Slicing, and Boolean Indexing
arr = np.arange(24).reshape(4, 6)
print(arr)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
Basic indexing
print(arr[0, 0]) # 0
print(arr[3, 5]) # 23
print(arr[-1, -1]) # 23
Slicing
print(arr[1:3, 2:5]) # rows 1-2, cols 2-4
print(arr[:, 0]) # all rows, column 0
print(arr[::2, ::2]) # every 2nd row and column
Fancy indexing
rows = np.array([0, 2])
cols = np.array([1, 4])
print(arr[rows, cols]) # [arr[0,1], arr[2,4]] = [1, 16]
Boolean indexing (masking)
mask = arr > 12
print(arr[mask]) # elements greater than 12
Filtering with conditions
data = np.array([1, -2, 3, -4, 5, -6])
positive = data[data > 0] # [1, 3, 5]
np.where - conditional selection
result = np.where(data > 0, data, 0) # keep positives, zero out negatives
print(result) # [1, 0, 3, 0, 5, 0]
np.where to find indices
indices = np.where(data > 0)
print(indices) # (array([0, 2, 4]),)
Shape Transformations
arr = np.arange(12)
reshape
a = arr.reshape(3, 4)
b = arr.reshape(2, 2, 3)
c = arr.reshape(-1, 4) # -1 infers the size: gives (3, 4)
flatten vs ravel
flat1 = a.flatten() # always returns a copy
flat2 = a.ravel() # returns a view when possible (more memory-efficient)
transpose
mat = np.random.randn(3, 4)
transposed = mat.T
transposed2 = mat.transpose()
transposed3 = np.transpose(mat, (1, 0))
3D transpose
tensor = np.random.randn(2, 3, 4)
batch, channels, spatial -> batch, spatial, channels
reordered = tensor.transpose(0, 2, 1) # (2, 4, 3)
squeeze and expand_dims
x = np.array([[[1, 2, 3]]]) # shape: (1, 1, 3)
squeezed = np.squeeze(x) # (3,)
expanded = np.expand_dims(squeezed, axis=0) # (1, 3)
Stacking arrays
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
hstack = np.hstack([a, b]) # horizontal stack (2, 4)
vstack = np.vstack([a, b]) # vertical stack (4, 2)
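The copy-versus-view distinction between `flatten` and `ravel` matters because writing through a view changes the original array. A small sketch, using `np.shares_memory` to make the difference visible (the array contents are arbitrary):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

flat_copy = a.flatten()  # always a new buffer
flat_view = a.ravel()    # a view here, because `a` is C-contiguous

print(np.shares_memory(a, flat_copy))  # False
print(np.shares_memory(a, flat_view))  # True

# Writing through the view changes the original array.
flat_view[0] = 99
print(a[0, 0])  # 99

# ravel() falls back to a copy when the data is not contiguous in the requested order.
t_flat = a.T.ravel()
print(np.shares_memory(a, t_flat))  # False
```

When the result will be modified independently of the source, `flatten()` (or an explicit `.copy()`) is the safer choice.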
Mathematical Functions
x = np.array([0, np.pi/6, np.pi/4, np.pi/3, np.pi/2])
Trigonometry
sin_x = np.sin(x)
cos_x = np.cos(x)
tan_x = np.tan(x)
Exponential and logarithm
exp_x = np.exp(x) # e^x
log_x = np.log(x + 1) # natural log (ln)
log2_x = np.log2(x + 1) # base-2 log
log10_x = np.log10(x + 1)
Power and root
sqrt_x = np.sqrt(x)
square_x = np.square(x) # x^2
power_x = np.power(x, 3) # x^3
# Sigmoid and softmax
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e_x = np.exp(x - x.max())  # subtract max for numerical stability
    return e_x / e_x.sum()

z = np.array([1.0, 2.0, 3.0])
print(sigmoid(z))  # [0.731, 0.881, 0.953]
print(softmax(z))  # [0.090, 0.245, 0.665]
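To see why the maximum is subtracted before exponentiating, it helps to compare a naive softmax with the shifted version on large inputs. This is a minimal sketch; the function names `softmax_naive` and `softmax_stable` are illustrative.

```python
import numpy as np

def softmax_naive(x):
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_stable(x):
    # Softmax is invariant to adding a constant, so shifting by the max
    # changes nothing mathematically but keeps exp() within float range.
    shifted = x - x.max()
    e_x = np.exp(shifted)
    return e_x / e_x.sum()

big = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(big)  # exp(1000) overflows to inf, and inf / inf is nan
stable = softmax_stable(big)

print(np.isnan(naive).all())  # True
print(stable)                 # approximately [0.090, 0.245, 0.665]
```

The stable result matches the softmax of `[1, 2, 3]`, since the two inputs differ only by a constant offset.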
Linear Algebra
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
Matrix multiplication
C = np.dot(A, B) # classic approach
C = A @ B # preferred in Python 3.5+
C = np.matmul(A, B) # same as np.dot for 2D
Batched matrix multiplication (3D+)
batch_A = np.random.randn(32, 3, 4)
batch_B = np.random.randn(32, 4, 5)
batch_C = batch_A @ batch_B # (32, 3, 5)
Linear algebra functions
det = np.linalg.det(A) # determinant
inv = np.linalg.inv(A) # inverse
rank = np.linalg.matrix_rank(A) # rank
trace = np.trace(A) # trace
Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(A)
Singular Value Decomposition (SVD)
U, S, Vt = np.linalg.svd(A)
Solve linear system Ax = b
b = np.array([5, 6])
x = np.linalg.solve(A, b)
Norms
v = np.array([3, 4])
l1_norm = np.linalg.norm(v, ord=1) # L1 norm: 7
l2_norm = np.linalg.norm(v, ord=2) # L2 norm: 5
inf_norm = np.linalg.norm(v, ord=np.inf) # max norm: 4
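A useful habit with linear-algebra routines is to verify results by substituting them back. The sketch below checks the solution of the 2x2 system used above and compares `np.linalg.solve` with the explicit-inverse route, which gives the same answer here but is generally the less accurate and slower option.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])

x_solve = np.linalg.solve(A, b)  # factorizes A, no explicit inverse
x_inv = np.linalg.inv(A) @ b     # works, but forms the inverse first

print(x_solve)                      # approximately [-4.0, 4.5]
print(np.allclose(A @ x_solve, b))  # True: the solution satisfies Ax = b
print(np.allclose(x_solve, x_inv))  # True

# The L2 norm of v = [3, 4] is sqrt(3^2 + 4^2) = 5.
v = np.array([3.0, 4.0])
print(np.isclose(np.linalg.norm(v), np.sqrt((v ** 2).sum())))  # True
```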
Vectorized Operations vs For Loops
import time
import numpy as np

n = 1_000_000
a = np.random.randn(n)
b = np.random.randn(n)

# For loop
start = time.time()
result_loop = []
for i in range(n):
    result_loop.append(a[i] * b[i])
loop_time = time.time() - start
print(f"For loop: {loop_time:.4f}s")

# Vectorized
start = time.time()
result_vec = a * b
vec_time = time.time() - start
print(f"Vectorized: {vec_time:.4f}s")
print(f"Speedup: {loop_time / vec_time:.1f}x")
# Typically 100-1000x faster
Practical: Neural Network Forward Pass with NumPy
import numpy as np

class SimpleNeuralNetwork:
    """Two-layer neural network implemented with NumPy only"""

    def __init__(self, input_size, hidden_size, output_size, seed=42):
        np.random.seed(seed)
        # He initialization
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def relu(self, z):
        return np.maximum(0, z)

    def relu_derivative(self, z):
        return (z > 0).astype(float)

    def softmax(self, z):
        exp_z = np.exp(z - z.max(axis=1, keepdims=True))
        return exp_z / exp_z.sum(axis=1, keepdims=True)

    def forward(self, X):
        # Layer 1
        self.Z1 = X @ self.W1 + self.b1
        self.A1 = self.relu(self.Z1)
        # Layer 2
        self.Z2 = self.A1 @ self.W2 + self.b2
        self.A2 = self.softmax(self.Z2)
        return self.A2

    def cross_entropy_loss(self, y_pred, y_true):
        m = y_true.shape[0]
        log_probs = -np.log(y_pred[range(m), y_true] + 1e-8)
        return log_probs.mean()

    def backward(self, X, y_true, learning_rate=0.01):
        m = X.shape[0]
        # Output layer gradient (softmax + cross-entropy: probabilities minus one-hot labels)
        dZ2 = self.A2.copy()
        dZ2[range(m), y_true] -= 1
        dZ2 /= m
        dW2 = self.A1.T @ dZ2
        db2 = dZ2.sum(axis=0, keepdims=True)
        # Hidden layer gradient
        dA1 = dZ2 @ self.W2.T
        dZ1 = dA1 * self.relu_derivative(self.Z1)
        dW1 = X.T @ dZ1
        db1 = dZ1.sum(axis=0, keepdims=True)
        # Weight update
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2

    def train(self, X, y, epochs=100, learning_rate=0.01):
        losses = []
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = self.cross_entropy_loss(y_pred, y)
            losses.append(loss)
            self.backward(X, y, learning_rate)
            if epoch % 10 == 0:
                acc = (y_pred.argmax(axis=1) == y).mean()
                print(f"Epoch {epoch:3d}: Loss={loss:.4f}, Acc={acc:.4f}")
        return losses

# Test
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,
                           n_classes=3, n_informative=15,
                           random_state=42)
nn = SimpleNeuralNetwork(input_size=20, hidden_size=64, output_size=3)
losses = nn.train(X, y, epochs=50, learning_rate=0.1)
3. Mastering Pandas
Pandas is the core library for working with tabular data. It provides DataFrame and Series data structures and supports every step of data cleaning, transformation, and analysis.
Series and DataFrame
Creating a Series
import pandas as pd
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s3 = pd.Series({'x': 100, 'y': 200, 'z': 300})
print(s2['a']) # 10
print(s2[['a', 'c']]) # a=10, c=30
Creating a DataFrame
data = {
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'age': [25, 30, 35, 28, 22],
'score': [88.5, 92.3, 78.1, 95.7, 83.2],
'passed': [True, True, False, True, True]
}
df = pd.DataFrame(data)
print(df.head())
print(df.tail(3))
print(df.info())
print(df.describe())
print(df.dtypes)
print(df.shape) # (5, 4)
Reading and Writing Data
CSV
df_csv = pd.read_csv('data.csv',
sep=',',
header=0,
index_col=0,
parse_dates=['date'],
encoding='utf-8',
na_values=['N/A', 'null', ''])
df_csv.to_csv('output.csv', index=False, encoding='utf-8')
Excel
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1', header=0)
df_excel.to_excel('output.xlsx', sheet_name='Result', index=False)
JSON
df_json = pd.read_json('data.json', orient='records')
df_json.to_json('output.json', orient='records', indent=2)
Parquet (high-performance columnar format)
df.to_parquet('data.parquet', engine='pyarrow', compression='snappy')
df_parquet = pd.read_parquet('data.parquet')
SQL (SQLite example)
import sqlite3
conn = sqlite3.connect('database.db')
df_sql = pd.read_sql_query("SELECT * FROM users WHERE age > 25", conn)
df.to_sql('new_table', conn, if_exists='replace', index=False)
Indexing with loc and iloc
df = pd.DataFrame({
'A': range(10),
'B': range(10, 20),
'C': range(20, 30)
}, index=[f'row{i}' for i in range(10)])
loc: label-based indexing
print(df.loc['row0', 'A']) # single value
print(df.loc['row0':'row3', 'A':'B']) # range (end inclusive)
print(df.loc[['row1', 'row5'], 'C']) # list
iloc: position-based indexing
print(df.iloc[0, 0]) # row 0, col 0
print(df.iloc[0:4, 0:2]) # range (end exclusive)
print(df.iloc[[1, 5], 2]) # list
Condition-based selection
mask = df['A'] > 5
filtered = df[mask]
filtered2 = df[df['B'].between(12, 17)]
filtered3 = df.query('A > 5 and B < 18')
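The most common source of off-by-one errors with these two indexers is that `loc` slices include the end label while `iloc` slices exclude the end position. A short sketch on a frame shaped like the one above:

```python
import pandas as pd

df = pd.DataFrame({"A": range(10), "B": range(10, 20)},
                  index=[f"row{i}" for i in range(10)])

by_label = df.loc["row0":"row3", "A"]  # label slice: row3 is included
by_position = df.iloc[0:3, 0]          # position slice: position 3 is excluded

print(len(by_label))     # 4
print(len(by_position))  # 3

# To select the same rows by position, the stop value must be one higher.
print(by_label.equals(df.iloc[0:4, 0]))  # True
```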
Handling Missing Values
Create data with missing values
df = pd.DataFrame({
'age': [25, np.nan, 35, np.nan, 22],
'income': [50000, 60000, np.nan, 80000, np.nan],
'city': ['Seoul', 'Busan', None, 'Incheon', 'Seoul'],
'score': [88.5, 92.3, 78.1, np.nan, 83.2]
})
Inspect missing values
print(df.isnull().sum()) # count per column
print(df.isnull().sum() / len(df) * 100) # missing percentage
Drop missing values
df_dropped_rows = df.dropna() # drop rows with any NaN
df_dropped_cols = df.dropna(axis=1) # drop columns with any NaN
df_thresh = df.dropna(thresh=3) # keep rows with at least 3 non-NaN
Fill missing values
df_filled_0 = df.fillna(0)
df_filled_mean = df.fillna(df.mean(numeric_only=True))  # numeric_only skips the string column 'city'
df_filled_dict = df.fillna({
'age': df['age'].mean(),
'income': df['income'].median(),
'city': 'Unknown',
'score': df['score'].mean()
})
# Forward/backward fill (fillna(method=...) is deprecated in pandas 2.x)
df_ffill = df.ffill()
df_bfill = df.bfill()
# Interpolation (numeric columns only; string columns cannot be interpolated)
df_interpolated = df.select_dtypes(include='number').interpolate(method='linear')
# Smart handling pattern
for col in list(df.columns):
    missing_pct = df[col].isnull().mean()
    if missing_pct > 0.5:
        df = df.drop(columns=[col])
    elif pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])
Data Transformation
df = pd.DataFrame({
'text': ['hello world', 'PYTHON IS GREAT', 'data science'],
'value': [1, 2, 3],
'category': ['A', 'B', 'A']
})
apply: apply a function
df['text_upper'] = df['text'].apply(str.upper)
df['text_length'] = df['text'].apply(len)
# Complex function
def process_text(text):
    return ' '.join(word.capitalize() for word in text.lower().split())

df['text_processed'] = df['text'].apply(process_text)

# Multiple columns simultaneously
def feature_engineer(row):
    return pd.Series({
        'value_squared': row['value'] ** 2,
        'category_is_A': int(row['category'] == 'A')
    })

new_features = df.apply(feature_engineer, axis=1)
df = pd.concat([df, new_features], axis=1)
map: apply a mapping table
category_map = {'A': 'Alpha', 'B': 'Beta', 'C': 'Gamma'}
df['category_name'] = df['category'].map(category_map)
String operations (vectorized)
texts = pd.Series(['Hello World', 'Python 3.12', 'Machine Learning'])
print(texts.str.lower())
print(texts.str.split())
print(texts.str.contains('Python'))
print(texts.str.extract(r'(\w+)\s+(\w+)'))
Groupby Aggregation
np.random.seed(42)
df = pd.DataFrame({
'team': np.random.choice(['A', 'B', 'C'], 100),
'role': np.random.choice(['dev', 'ds', 'pm'], 100),
'score': np.random.randint(60, 100, 100),
'salary': np.random.randint(3000, 8000, 100)
})
Basic groupby
grouped = df.groupby('team')
print(grouped['score'].mean())
print(grouped['salary'].describe())
Multiple keys
multi_grouped = df.groupby(['team', 'role'])
print(multi_grouped['score'].mean().unstack())
Custom aggregation
agg_result = df.groupby('team').agg(
avg_score=('score', 'mean'),
total_salary=('salary', 'sum'),
count=('score', 'count'),
max_score=('score', 'max'),
min_salary=('salary', 'min')
)
# Custom aggregation function
def iqr(x):
    return x.quantile(0.75) - x.quantile(0.25)

custom_agg = df.groupby('team')['score'].agg(['mean', 'median', 'std', iqr])
filter: keep groups matching a condition
large_teams = df.groupby('team').filter(lambda x: len(x) > 30)
Merging DataFrames
users = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'age': [25, 30, 35, 28, 22]
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105, 106],
'user_id': [1, 2, 1, 3, 5, 6],
'amount': [150, 250, 80, 320, 190, 440]
})
Inner join (intersection)
inner = pd.merge(users, orders, on='user_id', how='inner')
Left join
left = pd.merge(users, orders, on='user_id', how='left')
Right join
right = pd.merge(users, orders, on='user_id', how='right')
Outer join (union)
outer = pd.merge(users, orders, on='user_id', how='outer')
concat
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
vertical = pd.concat([df1, df2], axis=0, ignore_index=True)
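When a join produces more or fewer rows than expected, two `pd.merge` parameters help diagnose it: `indicator=True` adds a `_merge` column showing where each row came from, and `validate` raises an error if the key relationship is not what was assumed. The sketch below reuses the users/orders data from above, trimmed to the columns that matter.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3, 4, 5],
                      "name": ["Alice", "Bob", "Charlie", "David", "Eve"]})
orders = pd.DataFrame({"order_id": [101, 102, 103, 104, 105, 106],
                       "user_id": [1, 2, 1, 3, 5, 6],
                       "amount": [150, 250, 80, 320, 190, 440]})

# Row counts per join type: inner 5, left 6, right 6, outer 7
for how in ["inner", "left", "right", "outer"]:
    print(how, len(pd.merge(users, orders, on="user_id", how=how)))

# validate="one_to_many" asserts that user_id is unique in `users`.
merged = pd.merge(users, orders, on="user_id", how="outer",
                  indicator=True, validate="one_to_many")
print(merged["_merge"].value_counts())
```

Here user 4 (David) has no orders and appears as `left_only`, while order 106 refers to a missing user 6 and appears as `right_only`.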
Practical: AI Training Data Preprocessing Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

def preprocess_titanic(df):
    """Titanic dataset preprocessing pipeline"""
    df = df.copy()
    # Feature engineering
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    title_map = {'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3}
    df['Title'] = df['Title'].map(title_map).fillna(4)
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    # Handle missing values (assign back rather than calling fillna(inplace=True) on a column)
    df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('median'))
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    # Encoding
    df['Sex'] = (df['Sex'] == 'male').astype(int)
    df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
    features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked',
                'FamilySize', 'IsAlone', 'Title']
    return df[features]
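The age imputation in the pipeline above relies on `groupby(...).transform('median')`, which returns a Series aligned with the original rows rather than one value per group. A minimal sketch on a made-up six-row frame (the values are illustrative, not taken from the Titanic dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age":   [30.0, np.nan, 40.0, 10.0, 20.0, np.nan],
})

# transform('median') yields one value per row: the median age of that row's group.
group_median = df.groupby("Title")["Age"].transform("median")
df["Age"] = df["Age"].fillna(group_median)

print(df["Age"].tolist())  # [30.0, 35.0, 40.0, 10.0, 20.0, 15.0]
```

Because the result is index-aligned, `fillna` only touches the missing rows and each gap receives the median of its own group.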
4. Matplotlib and Seaborn Visualization
Basic Plots
import numpy as np
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
Line plot
x = np.linspace(0, 2 * np.pi, 100)
axes[0, 0].plot(x, np.sin(x), 'b-', linewidth=2, label='sin(x)')
axes[0, 0].plot(x, np.cos(x), 'r--', linewidth=2, label='cos(x)')
axes[0, 0].set_title('Trigonometric Functions')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
Bar plot
categories = ['Classification', 'Regression', 'Clustering', 'Dim. Reduction']
values = [85, 72, 68, 91]
bars = axes[0, 1].bar(categories, values,
                      color=['#3498db', '#e74c3c', '#2ecc71', '#f39c12'])
axes[0, 1].set_title('Algorithm Accuracy')
axes[0, 1].set_ylabel('Accuracy (%)')
for bar, val in zip(bars, values):
    axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                    f'{val}%', ha='center', va='bottom')
Scatter plot
np.random.seed(42)
x_scatter = np.random.randn(100)
y_scatter = 2 * x_scatter + np.random.randn(100) * 0.5
axes[0, 2].scatter(x_scatter, y_scatter, alpha=0.6, c=y_scatter, cmap='viridis')
axes[0, 2].set_title('Scatter Plot')
Histogram
data = np.concatenate([
np.random.normal(0, 1, 500),
np.random.normal(4, 1.5, 300)
])
axes[1, 0].hist(data, bins=50, density=True, alpha=0.7, color='steelblue')
axes[1, 0].set_title('Data Distribution')
Box plot
box_data = [np.random.normal(i, 1, 100) for i in range(5)]
axes[1, 1].boxplot(box_data, labels=[f'Model{i+1}' for i in range(5)])
axes[1, 1].set_title('Model Performance Distribution')
plt.tight_layout()
plt.savefig('basic_plots.png', dpi=150, bbox_inches='tight')
plt.show()
Seaborn Statistical Visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid', palette='husl', font_scale=1.2)
Sample data
df = pd.DataFrame({
'model': np.repeat(['ResNet', 'VGG', 'EfficientNet', 'ViT'], 50),
'accuracy': np.concatenate([
np.random.normal(92, 2, 50),
np.random.normal(88, 3, 50),
np.random.normal(94, 1.5, 50),
np.random.normal(95, 2.5, 50)
]),
'params_M': np.concatenate([
np.random.normal(25, 2, 50),
np.random.normal(138, 5, 50),
np.random.normal(5.3, 0.3, 50),
np.random.normal(86, 3, 50)
])
})
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
Violin plot
sns.violinplot(data=df, x='model', y='accuracy', ax=axes[0, 0])
axes[0, 0].set_title('Accuracy Distribution by Model')
Heatmap (correlation)
corr_data = df[['accuracy', 'params_M']].corr()
sns.heatmap(corr_data, annot=True, fmt='.2f', cmap='RdYlGn',
center=0, ax=axes[0, 1])
axes[0, 1].set_title('Correlation Matrix')
Scatter with regression line
sns.regplot(data=df, x='params_M', y='accuracy',
scatter_kws={'alpha': 0.4}, ax=axes[1, 0])
axes[1, 0].set_title('Parameters vs Accuracy')
# KDE distribution plot
for model in df['model'].unique():
    subset = df[df['model'] == model]
    sns.kdeplot(data=subset, x='accuracy', label=model, ax=axes[1, 1])
axes[1, 1].set_title('Accuracy KDE by Model')
axes[1, 1].legend()
plt.tight_layout()
plt.savefig('seaborn_plots.png', dpi=150, bbox_inches='tight')
plt.show()
Practical: Learning Curves and Confusion Matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_learning_curve(train_losses, val_losses, train_accs, val_accs):
    """Visualize training and validation learning curves"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    epochs = range(1, len(train_losses) + 1)
    ax1.plot(epochs, train_losses, 'b-', label='Train Loss', linewidth=2)
    ax1.plot(epochs, val_losses, 'r--', label='Val Loss', linewidth=2)
    ax1.fill_between(epochs, train_losses, val_losses, alpha=0.1, color='gray')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.set_title('Loss Curve')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax2.plot(epochs, train_accs, 'b-', label='Train Accuracy', linewidth=2)
    ax2.plot(epochs, val_accs, 'r--', label='Val Accuracy', linewidth=2)
    best_epoch = np.argmax(val_accs)  # zero-based index, so +1 converts to an epoch number
    ax2.axvline(x=best_epoch + 1, color='g', linestyle=':',
                label=f'Best Epoch ({best_epoch+1})')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.set_title('Accuracy Curve')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    plt.tight_layout()
    return fig

def plot_confusion_matrix(y_true, y_pred, class_names):
    """Visualize a confusion matrix"""
    cm = confusion_matrix(y_true, y_pred)
    # Row-normalize: each row (true class) sums to 1
    cm_pct = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names, ax=ax1)
    ax1.set_title('Confusion Matrix (Counts)')
    ax1.set_ylabel('True Label')
    ax1.set_xlabel('Predicted Label')
    sns.heatmap(cm_pct, annot=True, fmt='.2%', cmap='Greens',
                xticklabels=class_names, yticklabels=class_names, ax=ax2)
    ax2.set_title('Confusion Matrix (Rates)')
    ax2.set_ylabel('True Label')
    ax2.set_xlabel('Predicted Label')
    plt.tight_layout()
    return fig
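Row-normalizing a confusion matrix divides each row by the number of true samples in that class, which yields NaN when a class never occurs in `y_true`. The sketch below shows one way to guard against that using NumPy only; the helper names `confusion_counts` and `row_normalize` are illustrative and not part of scikit-learn.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with true labels on rows and predictions on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def row_normalize(cm):
    """Divide each row by its total; rows with no samples stay all zero."""
    totals = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, totals, out=np.zeros(cm.shape, dtype=float), where=totals > 0)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
cm = confusion_counts(y_true, y_pred, n_classes=3)  # class 2 never occurs
rates = row_normalize(cm)
print(cm)
print(rates)
```

Each non-empty row of `rates` sums to 1, and its diagonal entry is the recall for that class.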
5. Machine Learning with Scikit-learn
Data Preprocessing
from sklearn.preprocessing import (
StandardScaler, MinMaxScaler, RobustScaler,
LabelEncoder, OneHotEncoder
)
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)
StandardScaler: mean 0, std 1
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)
MinMaxScaler: scale to [0, 1]
min_max = MinMaxScaler(feature_range=(0, 1))
X_minmax = min_max.fit_transform(X)
RobustScaler: uses median and IQR, robust to outliers
robust = RobustScaler()
X_robust = robust.fit_transform(X)
LabelEncoder
le = LabelEncoder()
labels = ['cat', 'dog', 'bird', 'cat', 'dog']
encoded = le.fit_transform(labels) # [1, 2, 0, 1, 2] (classes are sorted: bird=0, cat=1, dog=2)
OneHotEncoder
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
categories = np.array([['red'], ['green'], ['blue'], ['red']])
encoded_ohe = ohe.fit_transform(categories)
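`LabelEncoder` assigns integer codes in sorted order of the class names, not in order of first appearance, which is easy to get wrong when reading encoded output. A short sketch that inspects the mapping through `classes_` and reverses it with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["cat", "dog", "bird", "cat", "dog"])

print(le.classes_.tolist())                   # ['bird', 'cat', 'dog']
print(codes.tolist())                         # [1, 2, 0, 1, 2]
print(le.inverse_transform([0, 2]).tolist())  # ['bird', 'dog']
```

Keeping the fitted encoder around makes it possible to map model predictions back to the original label names.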
Feature Selection and Extraction
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
X = np.random.randn(200, 20)
y = (X[:, 0] + X[:, 1] + np.random.randn(200) * 0.1 > 0).astype(int)
PCA
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")
SelectKBest
selector = SelectKBest(f_classif, k=5)
X_kbest = selector.fit_transform(X, y)
selected = selector.get_support(indices=True)
print(f"Selected feature indices: {selected}")
Linear Models
from sklearn.linear_model import (
LinearRegression, LogisticRegression, Ridge, Lasso
)
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
Regression
X_reg, y_reg = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2)
lr = LinearRegression()
lr.fit(X_train, y_train)
print(f"Linear R2: {r2_score(y_test, lr.predict(X_test)):.4f}")
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge R2: {r2_score(y_test, ridge.predict(X_test)):.4f}")
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
print(f"Lasso R2: {r2_score(y_test, lasso.predict(X_test)):.4f}")
print(f"Non-zero coefficients: {np.sum(lasso.coef_ != 0)}")
Tree-based Models
from sklearn.ensemble import (
RandomForestClassifier, GradientBoostingClassifier
)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X, y = make_classification(n_samples=1000, n_features=20,
n_classes=2, n_informative=10,
random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
import xgboost as xgb

models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = (model.predict(X_test) == y_test).mean()
    print(f"{name}: {acc:.4f}")
Model Evaluation and Cross-Validation
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 5, 10],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1
)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
n = 500
df = pd.DataFrame({
'age': np.random.randint(18, 70, n).astype(float),
'income': np.random.randint(20000, 100000, n).astype(float),
'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
'city': np.random.choice(['New York', 'Chicago', 'Houston', 'Phoenix'], n),
'target': np.random.randint(0, 2, n)
})
X = df.drop('target', axis=1)
y = df['target']
numeric_features = ['age', 'income']
categorical_features = ['education', 'city']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
from sklearn.model_selection import cross_val_score
scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='accuracy')
print(f"Pipeline CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
6. Python Performance Optimization
List Comprehensions vs map vs for
import time
import numpy as np

n = 1_000_000
data = list(range(n))
# For loop
start = time.time()
result_for = []
for x in data:
    result_for.append(x ** 2)
print(f"For loop: {time.time() - start:.4f}s")
List comprehension
start = time.time()
result_lc = [x ** 2 for x in data]
print(f"List comprehension: {time.time() - start:.4f}s")
map
start = time.time()
result_map = list(map(lambda x: x ** 2, data))
print(f"Map: {time.time() - start:.4f}s")
NumPy vectorization
arr = np.array(data)
start = time.time()
result_np = arr ** 2
print(f"NumPy: {time.time() - start:.4f}s")
Generators
import sys
import numpy as np

# List vs generator memory comparison
list_comp = [x**2 for x in range(1_000_000)]
gen_expr = (x**2 for x in range(1_000_000))
print(f"List size: {sys.getsizeof(list_comp):,} bytes")  # ~8MB (the list's pointer array only)
print(f"Generator size: {sys.getsizeof(gen_expr)} bytes")  # a few hundred bytes at most

# Generator function
def infinite_data_loader(dataset, batch_size=32):
    """Infinite data loader generator"""
    while True:
        indices = np.random.permutation(len(dataset))
        for i in range(0, len(dataset), batch_size):
            batch_indices = indices[i:i + batch_size]
            yield dataset[batch_indices]
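Because a loader like this never terminates on its own, the consumer decides how many batches to draw. The sketch below is a seeded variant (the name `batch_loader` and the `seed` parameter are illustrative additions) and uses `itertools.islice` to take exactly one epoch from it.

```python
import itertools
import numpy as np

def batch_loader(dataset, batch_size=32, seed=0):
    """Yield shuffled batches forever, reshuffling once per pass over the data."""
    rng = np.random.default_rng(seed)
    while True:
        indices = rng.permutation(len(dataset))
        for i in range(0, len(dataset), batch_size):
            yield dataset[indices[i:i + batch_size]]

dataset = np.arange(10)
loader = batch_loader(dataset, batch_size=4)

# islice bounds the infinite generator: 3 batches make up exactly one epoch here.
first_epoch = list(itertools.islice(loader, 3))
print([len(b) for b in first_epoch])  # [4, 4, 2]
```

The last batch of each pass is smaller when the dataset size is not a multiple of `batch_size`; training loops that need fixed-size batches have to drop or pad it.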
Parallel Processing
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_intensive_task(n):
    return sum(i**2 for i in range(n))

def io_bound_task(url):
    time.sleep(0.1)
    return f"Fetched: {url}"

# The __main__ guard is required for ProcessPoolExecutor on platforms that
# start workers with "spawn" (Windows, macOS)
if __name__ == "__main__":
    # ThreadPoolExecutor: best for I/O-bound tasks
    urls = [f"https://example.com/data/{i}" for i in range(20)]
    start = time.time()
    with ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(io_bound_task, urls))
    print(f"ThreadPool time: {time.time() - start:.2f}s")

    # ProcessPoolExecutor: best for CPU-bound tasks
    numbers = [1_000_000] * 8
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(cpu_intensive_task, numbers))
    print(f"ProcessPool time: {time.time() - start:.2f}s")
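One property of `executor.map` worth knowing is that results come back in input order, regardless of which task finishes first. The sketch below demonstrates this with a simulated I/O task; `simulated_fetch` is a hypothetical stand-in for a real network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_fetch(item_id):
    # Later items sleep less, so they finish before earlier ones.
    time.sleep(0.05 * (3 - item_id))
    return item_id * 10

with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(simulated_fetch, [0, 1, 2]))

print(results)  # [0, 10, 20]
```

When results should be handled as soon as each one is ready instead, `concurrent.futures.as_completed` over futures returned by `executor.submit` is the usual alternative.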
Numba JIT Compilation
import time
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fast_matrix_norm(A):
    """Parallelized matrix norm computation"""
    n, m = A.shape
    result = 0.0
    for i in prange(n):      # outer loop runs in parallel; += on a scalar is a supported reduction
        for j in range(m):   # only the outermost prange is parallelized, so a plain range suffices
            result += A[i, j] ** 2
    return result ** 0.5

A = np.random.randn(1000, 1000)
# Warmup (first run triggers JIT compilation)
_ = fast_matrix_norm(A)

start = time.time()
for _ in range(10):
    result = fast_matrix_norm(A)
print(f"Numba: {time.time() - start:.4f}s")

start = time.time()
for _ in range(10):
    result = np.linalg.norm(A)
print(f"NumPy: {time.time() - start:.4f}s")
7. AI/ML Utility Libraries
tqdm - Progress Bars
import time
from tqdm import tqdm, trange

# Basic usage
for i in tqdm(range(100)):
    time.sleep(0.01)

# Custom description
items = list(range(50))
for item in tqdm(items, desc='Processing', unit='sample'):
    pass

# Nested progress bars
for epoch in trange(10, desc='Epochs'):
    for batch in trange(100, desc='Batches', leave=False):
        pass

# Manual update with metrics
with tqdm(total=100, desc='Training') as pbar:
    for i in range(10):
        pbar.update(10)
        pbar.set_postfix({'loss': 0.5 - i * 0.04, 'acc': 0.7 + i * 0.02})
Weights and Biases (wandb) - Experiment Tracking
import numpy as np
import wandb

wandb.init(
    project="ml-experiment",
    name="run-001",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 100,
        "model": "ResNet50",
        "optimizer": "AdamW"
    }
)
# Simulated metrics stand in for a real training loop
for epoch in range(100):
    train_loss = 1.0 - epoch * 0.009 + np.random.normal(0, 0.01)
    val_loss = 1.0 - epoch * 0.008 + np.random.normal(0, 0.02)
    train_acc = epoch * 0.009 + np.random.normal(0, 0.01)
    val_acc = epoch * 0.008 + np.random.normal(0, 0.02)
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "train/acc": train_acc,
        "val/acc": val_acc,
        "learning_rate": 0.001 * (0.95 ** epoch)
    })
wandb.finish()
pytest - Testing
# tests/test_preprocessing.py
import numpy as np
import pandas as pd
import pytest

def normalize(x):
    return (x - x.mean()) / x.std()

class TestNormalize:
    def test_mean_zero(self):
        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        result = normalize(x)
        assert abs(result.mean()) < 1e-10

    def test_std_one(self):
        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        result = normalize(x)
        assert abs(result.std() - 1.0) < 1e-10

    def test_shape_preserved(self):
        x = np.random.randn(10, 5)
        result = normalize(x)
        assert result.shape == x.shape

@pytest.fixture
def sample_df():
    return pd.DataFrame({
        'value': [1.0, 2.0, np.nan, 4.0, 5.0],
        'label': ['a', 'b', 'c', 'd', 'e']
    })

def test_dropna(sample_df):
    result = sample_df.dropna()
    assert result.isnull().sum().sum() == 0

# Run with: pytest tests/ -v --cov (the --cov option requires the pytest-cov plugin)
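Tests are most valuable on edge cases. The `normalize` function above divides by zero when every input value is identical, so a natural follow-up is a guarded variant together with a test that pins the behavior down. This is a sketch only: `safe_normalize` and its `eps` parameter are illustrative, and the test functions are called directly so the snippet also runs without pytest.

```python
import numpy as np

def safe_normalize(x, eps=1e-8):
    """Standardize x; eps keeps the division finite for constant input."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

def test_constant_input_is_finite():
    result = safe_normalize(np.full(5, 3.0))
    assert np.isfinite(result).all()
    assert np.allclose(result, 0.0)

def test_matches_plain_formula():
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    np.testing.assert_allclose(safe_normalize(x), (x - x.mean()) / x.std(), rtol=1e-6)

# pytest would discover these by name; calling them directly also works.
test_constant_input_is_finite()
test_matches_plain_formula()
```

`np.testing.assert_allclose` reports the mismatching elements on failure, which is usually more informative than a bare `assert` on floating-point arrays.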
Conclusion
This guide covered the essential Python ecosystem for AI/ML development:
- **Environment Setup**: Project isolation with venv, conda, and poetry
- **NumPy**: High-performance numerical computation through vectorized operations
- **Pandas**: Building data preprocessing and analysis pipelines
- **Matplotlib/Seaborn**: Rich visualizations to surface insights
- **Scikit-learn**: Complete ML workflow from preprocessing to model evaluation
- **Performance Optimization**: Generators, parallel processing, and Numba JIT
- **Utilities**: tqdm, wandb, and pytest
In real-world ML projects, these tools combine to build a complete pipeline: data loading, preprocessing, feature engineering, model training, evaluation, and deployment. Consult the official documentation for each tool to go deeper.
References
- [NumPy Documentation](https://numpy.org/doc/stable/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Matplotlib Documentation](https://matplotlib.org/stable/)
- [Seaborn Documentation](https://seaborn.pydata.org/)
- [Weights and Biases Documentation](https://docs.wandb.ai/)