Python Complete Guide for AI/ML

Python is the standard language for AI and machine learning. Its clean syntax, vast library ecosystem, and active community have made it the go-to choice for both researchers and engineers. This guide walks through environment setup, the core libraries (NumPy, Pandas, Matplotlib/Seaborn, and scikit-learn), performance optimization, and a few supporting utilities.

1. Setting Up Your Python AI/ML Environment

Choosing a Python Version

For AI/ML work, Python 3.10 or later is recommended. Python 3.10+ offers structural pattern matching, clearer error messages, and improved type hints. As of 2026, Python 3.12 is stable and compatible with most ML libraries.

Check Python version

python --version

python3 --version

Install a specific version with pyenv

pyenv install 3.12.0

pyenv global 3.12.0

Setting Up Virtual Environments

Virtual environments isolate dependencies per project.
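A quick way to confirm that isolation is actually in effect is to ask the interpreter where it is running from. The sketch below relies on the fact that a venv-style environment sets `sys.prefix` to the environment directory while `sys.base_prefix` still points at the base installation. The helper name `in_virtual_environment` is made up for this example, and the check is not expected to detect conda environments, which are usually identified through the `CONDA_PREFIX` environment variable instead.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment.

    venv sets sys.prefix to the environment directory while sys.base_prefix
    keeps pointing at the base installation, so the two values differ.
    """
    return sys.prefix != sys.base_prefix


print(f"Interpreter:   {sys.executable}")
print(f"Environment:   {sys.prefix}")
print(f"Inside a venv: {in_virtual_environment()}")
```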

**venv (Standard Library)**

Create a virtual environment

python -m venv ml_env

Activate (Linux/Mac)

source ml_env/bin/activate

Activate (Windows)

ml_env\Scripts\activate

Deactivate

deactivate

**conda (Anaconda/Miniconda)**

Create environment

conda create -n ml_env python=3.12

Activate

conda activate ml_env

Install packages

conda install numpy pandas scikit-learn matplotlib

List environments

conda env list

Export environment

conda env export > environment.yml

Restore environment

conda env create -f environment.yml

**Poetry (Advanced Dependency Management)**

Install Poetry

curl -sSL https://install.python-poetry.org | python3 -

Initialize a project

poetry new ml_project

cd ml_project

Add packages

poetry add numpy pandas scikit-learn torch

Add dev dependencies

poetry add --group dev pytest black flake8

Run inside the environment

poetry run python train.py

Jupyter Notebook/Lab Setup

Install JupyterLab

pip install jupyterlab

Register kernel

python -m ipykernel install --user --name=ml_env --display-name "ML Environment"

Launch JupyterLab

jupyter lab

Install useful extensions

pip install jupyterlab-git

pip install nbformat

**Jupyter config file (~/.jupyter/jupyter_lab_config.py)**

c.ServerApp.open_browser = True

c.ServerApp.port = 8888

c.ServerApp.ip = '0.0.0.0'

GPU Python Environment (CUDA, cuDNN)

Check CUDA version

nvidia-smi

nvcc --version

Install PyTorch with CUDA (CUDA 12.1)

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Install TensorFlow with GPU

pip install "tensorflow[and-cuda]"

Verify GPU availability (PyTorch)

python -c "import torch; print(torch.cuda.is_available())"

Check cuDNN

python -c "import torch; print(torch.backends.cudnn.version())"

Essential Package List

requirements.txt

numpy>=1.24.0

pandas>=2.0.0

matplotlib>=3.7.0

seaborn>=0.12.0

scikit-learn>=1.3.0

scipy>=1.11.0

torch>=2.0.0

torchvision>=0.15.0

tensorflow>=2.13.0

xgboost>=1.7.0

lightgbm>=4.0.0

optuna>=3.3.0

wandb>=0.15.0

tqdm>=4.65.0

jupyterlab>=4.0.0

black>=23.0.0

flake8>=6.0.0

pytest>=7.4.0

Install all at once

pip install -r requirements.txt
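After installing, it can be useful to confirm which versions actually ended up in the environment. The following sketch uses the standard-library `importlib.metadata` module; the helper name `installed_versions` and the package list are illustrative choices, not part of any tool mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["numpy", "pandas", "scikit-learn", "torch"])
for name, ver in report.items():
    print(f"{name:<14} {ver or 'not installed'}")
```

Note that the names are distribution names as used by pip (`scikit-learn`), not import names (`sklearn`).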

2. Mastering NumPy

NumPy (Numerical Python) is the foundation of scientific computing in Python. It provides multidimensional arrays and mathematical functions, and most ML libraries use NumPy internally.

Creating ndarrays

Basic array creation

import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])

arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(arr1.shape) # (5,)

print(arr2.shape) # (2, 3)

print(arr2.dtype) # int64

print(arr2.ndim) # 2

print(arr2.size) # 6

Special arrays

zeros = np.zeros((3, 4)) # all zeros

ones = np.ones((2, 3, 4)) # all ones

full = np.full((3, 3), 7) # all sevens

eye = np.eye(4) # identity matrix

empty = np.empty((2, 3)) # uninitialized

Range arrays

arange = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]

linspace = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]

logspace = np.logspace(0, 3, 4) # [1, 10, 100, 1000]

Random arrays

np.random.seed(42)

rand_uniform = np.random.rand(3, 4) # uniform [0, 1)

rand_normal = np.random.randn(3, 4) # standard normal

rand_int = np.random.randint(0, 10, (3, 4)) # random integers

Modern random API (recommended)

rng = np.random.default_rng(42)

samples = rng.normal(loc=0, scale=1, size=(100, 3))
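One reason the Generator API is preferred is that each generator carries its own state instead of sharing a global one. The sketch below, with arbitrary seed values, shows that two generators built from the same seed reproduce the same stream, and that `SeedSequence.spawn` can derive separate child streams, for example one per worker.

```python
import numpy as np

# Two generators built from the same seed produce identical streams.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
same = np.array_equal(rng_a.normal(size=5), rng_b.normal(size=5))
print(same)  # True


def worker_draws(seed, n_workers=3):
    """Give each worker its own generator derived from one parent seed."""
    children = np.random.SeedSequence(seed).spawn(n_workers)
    rngs = [np.random.default_rng(child) for child in children]
    return [rng.integers(0, 100, size=4) for rng in rngs]


draws = worker_draws(42)
for i, d in enumerate(draws):
    print(f"worker {i}: {d}")
```

Because no global state is involved, calling `worker_draws(42)` again returns the same arrays regardless of what other code has drawn in between.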

Basic Operations and Broadcasting

a = np.array([[1, 2, 3], [4, 5, 6]])

b = np.array([[7, 8, 9], [10, 11, 12]])

Element-wise arithmetic

print(a + b) # element-wise addition

print(a - b) # element-wise subtraction

print(a * b) # element-wise multiplication

print(a / b) # element-wise division

print(a ** 2) # element-wise squaring

print(a % 2) # element-wise modulo

Broadcasting - operations on arrays of different shapes

x = np.array([[1], [2], [3]]) # shape: (3, 1)

y = np.array([10, 20, 30]) # shape: (3,) treated as (1, 3)

Broadcasting result: (3, 3)

result = x + y

print(result)

[[11, 21, 31],

[12, 22, 32],

[13, 23, 33]]
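The broadcasting rule behind the result above can be checked on shapes alone. As a small sketch: `np.broadcast_shapes` (available in NumPy 1.20 and later) compares shapes from right to left and requires each pair of dimensions to be equal or to contain a 1. The array names in the last part are placeholders.

```python
import numpy as np

# np.broadcast_shapes applies the broadcasting rules without allocating arrays.
print(np.broadcast_shapes((3, 1), (3,)))        # (3, 3)
print(np.broadcast_shapes((100, 10), (10,)))    # (100, 10)
print(np.broadcast_shapes((8, 1, 6), (7, 1)))   # (8, 7, 6)

# (100, 10) against (100,) fails: the trailing dimensions are 10 and 100.
try:
    np.broadcast_shapes((100, 10), (100,))
except ValueError as exc:
    print(f"incompatible: {exc}")

# Fix: give the per-row vector an explicit trailing axis of length 1.
data = np.ones((100, 10))
row_offsets = np.arange(100)
shifted = data - row_offsets[:, np.newaxis]     # (100, 10) - (100, 1)
print(shifted.shape)                            # (100, 10)
```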

Practical Broadcasting: batch normalization

data = np.random.randn(100, 10) # 100 samples, 10 features

mean = data.mean(axis=0) # per-feature mean (shape: 10,)

std = data.std(axis=0) # per-feature std (shape: 10,)

normalized = (data - mean) / std # broadcasting normalization

print(normalized.mean(axis=0).round(10)) # approximately 0

print(normalized.std(axis=0).round(10)) # approximately 1

Indexing, Slicing, and Boolean Indexing

arr = np.arange(24).reshape(4, 6)

print(arr)

[[ 0 1 2 3 4 5]

[ 6 7 8 9 10 11]

[12 13 14 15 16 17]

[18 19 20 21 22 23]]

Basic indexing

print(arr[0, 0]) # 0

print(arr[3, 5]) # 23

print(arr[-1, -1]) # 23

Slicing

print(arr[1:3, 2:5]) # rows 1-2, cols 2-4

print(arr[:, 0]) # all rows, column 0

print(arr[::2, ::2]) # every 2nd row and column

Fancy indexing

rows = np.array([0, 2])

cols = np.array([1, 4])

print(arr[rows, cols]) # [arr[0,1], arr[2,4]] = [1, 16]

Boolean indexing (masking)

mask = arr > 12

print(arr[mask]) # elements greater than 12

Filtering with conditions

data = np.array([1, -2, 3, -4, 5, -6])

positive = data[data > 0] # [1, 3, 5]

np.where - conditional selection

result = np.where(data > 0, data, 0) # keep positives, zero out negatives

print(result) # [1, 0, 3, 0, 5, 0]

np.where to find indices

indices = np.where(data > 0)

print(indices) # (array([0, 2, 4]),)

Shape Transformations

arr = np.arange(12)

reshape

a = arr.reshape(3, 4)

b = arr.reshape(2, 2, 3)

c = arr.reshape(-1, 4) # -1 infers the size: gives (3, 4)

flatten vs ravel

flat1 = a.flatten() # always returns a copy

flat2 = a.ravel() # returns a view when possible (more memory-efficient)
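The copy-versus-view distinction is easy to verify with `np.shares_memory`. This is a small sketch on a toy array; it also shows the case where `ravel()` cannot return a view and silently falls back to a copy.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

flat_copy = a.flatten()   # always a new buffer
flat_view = a.ravel()     # a view here, because `a` is C-contiguous

print(np.shares_memory(a, flat_copy))  # False
print(np.shares_memory(a, flat_view))  # True

# Writing through the view changes the original array.
flat_view[0] = 99
print(a[0, 0])  # 99

# ravel() returns a copy when the memory layout rules out a view,
# for example on a transposed (non-contiguous) array.
ravel_of_transpose = a.T.ravel()
print(np.shares_memory(a, ravel_of_transpose))  # False
```

The practical consequence: modify the result of `ravel()` only if you intend to modify the source array, and use `flatten()` when you need an independent copy.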

transpose

mat = np.random.randn(3, 4)

transposed = mat.T

transposed2 = mat.transpose()

transposed3 = np.transpose(mat, (1, 0))

3D transpose

tensor = np.random.randn(2, 3, 4)

batch, channels, spatial -> batch, spatial, channels

reordered = tensor.transpose(0, 2, 1) # (2, 4, 3)

squeeze and expand_dims

x = np.array([[[1, 2, 3]]]) # shape: (1, 1, 3)

squeezed = np.squeeze(x) # (3,)

expanded = np.expand_dims(squeezed, axis=0) # (1, 3)

Stacking arrays

a = np.array([[1, 2], [3, 4]])

b = np.array([[5, 6], [7, 8]])

hstack = np.hstack([a, b]) # horizontal stack (2, 4)

vstack = np.vstack([a, b]) # vertical stack (4, 2)

Mathematical Functions

x = np.array([0, np.pi/6, np.pi/4, np.pi/3, np.pi/2])

Trigonometry

sin_x = np.sin(x)

cos_x = np.cos(x)

tan_x = np.tan(x)

Exponential and logarithm

exp_x = np.exp(x) # e^x

log_x = np.log(x + 1) # natural log (ln)

log2_x = np.log2(x + 1) # base-2 log

log10_x = np.log10(x + 1)

Power and root

sqrt_x = np.sqrt(x)

square_x = np.square(x) # x^2

power_x = np.power(x, 3) # x^3

Sigmoid and softmax

def sigmoid(x):

return 1 / (1 + np.exp(-x))

def softmax(x):

e_x = np.exp(x - x.max()) # subtract max for numerical stability

return e_x / e_x.sum()

z = np.array([1.0, 2.0, 3.0])

print(sigmoid(z)) # [0.731, 0.881, 0.953]

print(softmax(z)) # [0.090, 0.245, 0.665]
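The "subtract max" step in the softmax above matters as soon as the logits get large. The sketch below compares a naive version with the stable one on deliberately extreme inputs; the function names are chosen for this example only.

```python
import numpy as np


def softmax_naive(x):
    e_x = np.exp(x)              # overflows to inf for large inputs
    return e_x / e_x.sum()


def softmax_stable(x):
    e_x = np.exp(x - x.max())    # largest exponent becomes exp(0) = 1
    return e_x / e_x.sum()


logits = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow warnings so the nan result is easy to see.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)   # inf / inf -> nan

stable = softmax_stable(logits)

print(naive)    # [nan nan nan]
print(stable)   # same probabilities as softmax([1, 2, 3])
```

Subtracting a constant from every logit does not change the result mathematically, because the factor `exp(-max)` cancels between numerator and denominator.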

Linear Algebra

A = np.array([[1, 2], [3, 4]])

B = np.array([[5, 6], [7, 8]])

Matrix multiplication

C = np.dot(A, B) # classic approach

C = A @ B # preferred in Python 3.5+

C = np.matmul(A, B) # same as np.dot for 2D

Batched matrix multiplication (3D+)

batch_A = np.random.randn(32, 3, 4)

batch_B = np.random.randn(32, 4, 5)

batch_C = batch_A @ batch_B # (32, 3, 5)

Linear algebra functions

det = np.linalg.det(A) # determinant

inv = np.linalg.inv(A) # inverse

rank = np.linalg.matrix_rank(A) # rank

trace = np.trace(A) # trace

Eigendecomposition

eigenvalues, eigenvectors = np.linalg.eig(A)

Singular Value Decomposition (SVD)

U, S, Vt = np.linalg.svd(A)

Solve linear system Ax = b

b = np.array([5, 6])

x = np.linalg.solve(A, b)

Norms

v = np.array([3, 4])

l1_norm = np.linalg.norm(v, ord=1) # L1 norm: 7

l2_norm = np.linalg.norm(v, ord=2) # L2 norm: 5

inf_norm = np.linalg.norm(v, ord=np.inf) # max norm: 4
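A useful habit with these routines is to check each result against the identity that defines it. The following sketch does that for the same 2x2 matrix used above; the tolerance defaults of `np.allclose` are assumed to be adequate for a problem this small.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])

# solve: the solution should satisfy A @ x == b (up to rounding)
x = np.linalg.solve(A, b)
assert np.allclose(A @ x, b)

# eig: each column v of `eigenvectors` satisfies A @ v == lambda * v
eigenvalues, eigenvectors = np.linalg.eig(A)
assert np.allclose(A @ eigenvectors, eigenvectors * eigenvalues)

# svd: S is a 1-D array of singular values, so rebuild with np.diag
U, S, Vt = np.linalg.svd(A)
assert np.allclose(U @ np.diag(S) @ Vt, A)

# det(A) equals the product of the eigenvalues, trace(A) their sum
assert np.isclose(np.linalg.det(A), eigenvalues.prod())
assert np.isclose(np.trace(A), eigenvalues.sum())

print(x)  # [-4.   4.5]
```

For solving `Ax = b`, `np.linalg.solve` is generally preferable to computing `np.linalg.inv(A) @ b`, since it avoids forming the inverse explicitly.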

Vectorized Operations vs For Loops

import time

n = 1_000_000

a = np.random.randn(n)

b = np.random.randn(n)

For loop

start = time.time()

result_loop = []

for i in range(n):

result_loop.append(a[i] * b[i])

loop_time = time.time() - start

print(f"For loop: {loop_time:.4f}s")

Vectorized

start = time.time()

result_vec = a * b

vec_time = time.time() - start

print(f"Vectorized: {vec_time:.4f}s")

print(f"Speedup: {loop_time / vec_time:.1f}x")

Typically 100-1000x faster
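Single `time.time()` measurements like the ones above can be noisy. As an alternative sketch, the standard-library `timeit` module repeats each measurement and lets you keep the best run. The array size here is reduced to keep the example quick, and the function names are illustrative.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)
b = rng.standard_normal(100_000)


def loop_product():
    return [a[i] * b[i] for i in range(len(a))]


def vectorized_product():
    return a * b


# repeat() returns one total time per run; the minimum is the least noisy.
loop_s = min(timeit.repeat(loop_product, number=3, repeat=3)) / 3
vec_s = min(timeit.repeat(vectorized_product, number=3, repeat=3)) / 3

print(f"loop:       {loop_s * 1e3:.2f} ms per call")
print(f"vectorized: {vec_s * 1e3:.3f} ms per call")
print(f"speedup:    {loop_s / vec_s:.0f}x")
```

The exact speedup depends on array size, hardware, and NumPy version, so treat any single number as indicative rather than fixed.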

Practical: Neural Network Forward Pass with NumPy

class SimpleNeuralNetwork:

"""Two-layer neural network implemented with NumPy only"""

def __init__(self, input_size, hidden_size, output_size, seed=42):

np.random.seed(seed)

He initialization

self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)

self.b1 = np.zeros((1, hidden_size))

self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)

self.b2 = np.zeros((1, output_size))

def relu(self, z):

return np.maximum(0, z)

def relu_derivative(self, z):

return (z > 0).astype(float)

def softmax(self, z):

exp_z = np.exp(z - z.max(axis=1, keepdims=True))

return exp_z / exp_z.sum(axis=1, keepdims=True)

def forward(self, X):

Layer 1

self.Z1 = X @ self.W1 + self.b1

self.A1 = self.relu(self.Z1)

Layer 2

self.Z2 = self.A1 @ self.W2 + self.b2

self.A2 = self.softmax(self.Z2)

return self.A2

def cross_entropy_loss(self, y_pred, y_true):

m = y_true.shape[0]

log_probs = -np.log(y_pred[range(m), y_true] + 1e-8)

return log_probs.mean()

def backward(self, X, y_true, learning_rate=0.01):

m = X.shape[0]

Output layer gradient

dZ2 = self.A2.copy()

dZ2[range(m), y_true] -= 1

dZ2 /= m

dW2 = self.A1.T @ dZ2

db2 = dZ2.sum(axis=0, keepdims=True)

Hidden layer gradient

dA1 = dZ2 @ self.W2.T

dZ1 = dA1 * self.relu_derivative(self.Z1)

dW1 = X.T @ dZ1

db1 = dZ1.sum(axis=0, keepdims=True)

Weight update

self.W1 -= learning_rate * dW1

self.b1 -= learning_rate * db1

self.W2 -= learning_rate * dW2

self.b2 -= learning_rate * db2

def train(self, X, y, epochs=100, learning_rate=0.01):

losses = []

for epoch in range(epochs):

y_pred = self.forward(X)

loss = self.cross_entropy_loss(y_pred, y)

losses.append(loss)

self.backward(X, y, learning_rate)

if epoch % 10 == 0:

acc = (y_pred.argmax(axis=1) == y).mean()

print(f"Epoch {epoch:3d}: Loss={loss:.4f}, Acc={acc:.4f}")

return losses

Test

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,

n_classes=3, n_informative=15,

random_state=42)

nn = SimpleNeuralNetwork(input_size=20, hidden_size=64, output_size=3)

losses = nn.train(X, y, epochs=50, learning_rate=0.1)
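The `backward` method above relies on the closed-form gradient of softmax combined with cross-entropy, `(softmax(z) - one_hot(y)) / m`. A common way to gain confidence in such a formula is a finite-difference gradient check. The sketch below is a standalone version of that check on random logits; it omits the `1e-8` term inside the logarithm, and all names are chosen for this example.

```python
import numpy as np


def softmax(z):
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)


def loss_fn(z, y):
    """Mean cross-entropy of softmax(z) against integer labels y."""
    p = softmax(z)
    return -np.log(p[np.arange(len(y)), y]).mean()


def analytic_grad(z, y):
    """Closed form: (softmax(z) - one_hot(y)) / m."""
    grad = softmax(z)
    grad[np.arange(len(y)), y] -= 1
    return grad / len(y)


def numeric_grad(z, y, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = np.zeros_like(z)
    for idx in np.ndindex(*z.shape):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[idx] += eps
        z_minus[idx] -= eps
        grad[idx] = (loss_fn(z_plus, y) - loss_fn(z_minus, y)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 3))
labels = np.array([0, 2, 1, 2])

max_error = np.abs(analytic_grad(logits, labels) - numeric_grad(logits, labels)).max()
print(f"max abs difference: {max_error:.2e}")
```

If the two gradients disagree by much more than rounding error, the analytic formula or its implementation is likely wrong.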

3. Mastering Pandas

Pandas is the core library for working with tabular data. It provides DataFrame and Series data structures and supports every step of data cleaning, transformation, and analysis.

Series and DataFrame

Creating a Series

import numpy as np

import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])

s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

s3 = pd.Series({'x': 100, 'y': 200, 'z': 300})

print(s2['a']) # 10

print(s2[['a', 'c']]) # a=10, c=30

Creating a DataFrame

data = {

'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],

'age': [25, 30, 35, 28, 22],

'score': [88.5, 92.3, 78.1, 95.7, 83.2],

'passed': [True, True, False, True, True]

}

df = pd.DataFrame(data)

print(df.head())

print(df.tail(3))

print(df.info())

print(df.describe())

print(df.dtypes)

print(df.shape) # (5, 4)

Reading and Writing Data

CSV

df_csv = pd.read_csv('data.csv',

sep=',',

header=0,

index_col=0,

parse_dates=['date'],

encoding='utf-8',

na_values=['N/A', 'null', ''])

df_csv.to_csv('output.csv', index=False, encoding='utf-8')

Excel

df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1', header=0)

df_excel.to_excel('output.xlsx', sheet_name='Result', index=False)

JSON

df_json = pd.read_json('data.json', orient='records')

df_json.to_json('output.json', orient='records', indent=2)

Parquet (high-performance columnar format)

df.to_parquet('data.parquet', engine='pyarrow', compression='snappy')

df_parquet = pd.read_parquet('data.parquet')

SQL (SQLite example)

import sqlite3

conn = sqlite3.connect('database.db')

df_sql = pd.read_sql_query("SELECT * FROM users WHERE age > 25", conn)

df.to_sql('new_table', conn, if_exists='replace', index=False)

Indexing with loc and iloc

df = pd.DataFrame({

'A': range(10),

'B': range(10, 20),

'C': range(20, 30)

}, index=[f'row{i}' for i in range(10)])

loc: label-based indexing

print(df.loc['row0', 'A']) # single value

print(df.loc['row0':'row3', 'A':'B']) # range (end inclusive)

print(df.loc[['row1', 'row5'], 'C']) # list

iloc: position-based indexing

print(df.iloc[0, 0]) # row 0, col 0

print(df.iloc[0:4, 0:2]) # range (end exclusive)

print(df.iloc[[1, 5], 2]) # list

Condition-based selection

mask = df['A'] > 5

filtered = df[mask]

filtered2 = df[df['B'].between(12, 17)]

filtered3 = df.query('A > 5 and B < 18')

Handling Missing Values

Create data with missing values

df = pd.DataFrame({

'age': [25, np.nan, 35, np.nan, 22],

'income': [50000, 60000, np.nan, 80000, np.nan],

'city': ['Seoul', 'Busan', None, 'Incheon', 'Seoul'],

'score': [88.5, 92.3, 78.1, np.nan, 83.2]

})

Inspect missing values

print(df.isnull().sum()) # count per column

print(df.isnull().sum() / len(df) * 100) # missing percentage

Drop missing values

df_dropped_rows = df.dropna() # drop rows with any NaN

df_dropped_cols = df.dropna(axis=1) # drop columns with any NaN

df_thresh = df.dropna(thresh=3) # keep rows with at least 3 non-NaN

Fill missing values

df_filled_0 = df.fillna(0)

df_filled_mean = df.fillna(df.mean(numeric_only=True)) # numeric_only skips the text column 'city'

df_filled_dict = df.fillna({

'age': df['age'].mean(),

'income': df['income'].median(),

'city': 'Unknown',

'score': df['score'].mean()

})

Forward/backward fill

df_ffill = df.ffill() # fillna(method='ffill') is deprecated in pandas 2.1+

df_bfill = df.bfill()

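Interpolation (numeric columns only)

num_cols = df.select_dtypes(include='number').columns

df_interpolated = df.copy()

df_interpolated[num_cols] = df[num_cols].interpolate(method='linear')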

Smart handling pattern

for col in list(df.columns):

    missing_pct = df[col].isnull().mean()

    if missing_pct > 0.5:

        df = df.drop(columns=[col]) # too sparse to impute

    elif df[col].dtype == 'object':

        df[col] = df[col].fillna(df[col].mode()[0]) # most frequent category

    else:

        df[col] = df[col].fillna(df[col].median()) # robust to outliers

Data Transformation

df = pd.DataFrame({

'text': ['hello world', 'PYTHON IS GREAT', 'data science'],

'value': [1, 2, 3],

'category': ['A', 'B', 'A']

})

apply: apply a function

df['text_upper'] = df['text'].apply(str.upper)

df['text_length'] = df['text'].apply(len)

Complex function

def process_text(text):

return ' '.join(word.capitalize() for word in text.lower().split())

df['text_processed'] = df['text'].apply(process_text)

Multiple columns simultaneously

def feature_engineer(row):

return pd.Series({

'value_squared': row['value'] ** 2,

'category_is_A': int(row['category'] == 'A')

})

new_features = df.apply(feature_engineer, axis=1)

df = pd.concat([df, new_features], axis=1)

map: apply a mapping table

category_map = {'A': 'Alpha', 'B': 'Beta', 'C': 'Gamma'}

df['category_name'] = df['category'].map(category_map)

String operations (vectorized)

texts = pd.Series(['Hello World', 'Python 3.12', 'Machine Learning'])

print(texts.str.lower())

print(texts.str.split())

print(texts.str.contains('Python'))

print(texts.str.extract(r'(\w+)\s+(\w+)'))

Groupby Aggregation

np.random.seed(42)

df = pd.DataFrame({

'team': np.random.choice(['A', 'B', 'C'], 100),

'role': np.random.choice(['dev', 'ds', 'pm'], 100),

'score': np.random.randint(60, 100, 100),

'salary': np.random.randint(3000, 8000, 100)

})

Basic groupby

grouped = df.groupby('team')

print(grouped['score'].mean())

print(grouped['salary'].describe())

Multiple keys

multi_grouped = df.groupby(['team', 'role'])

print(multi_grouped['score'].mean().unstack())

Custom aggregation

agg_result = df.groupby('team').agg(

avg_score=('score', 'mean'),

total_salary=('salary', 'sum'),

count=('score', 'count'),

max_score=('score', 'max'),

min_salary=('salary', 'min')

)

Custom aggregation function

def iqr(x):

return x.quantile(0.75) - x.quantile(0.25)

custom_agg = df.groupby('team')['score'].agg(['mean', 'median', 'std', iqr])

filter: keep groups matching a condition

large_teams = df.groupby('team').filter(lambda x: len(x) > 30)
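Alongside `agg` and `filter`, `transform` is the third common groupby operation: it returns a result with the same length as the input, so group statistics can be attached to each row. The sketch below uses a small hand-made table so the numbers are easy to follow.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "score": [80, 90, 70, 75, 95],
})

# agg-style reduction: one row per group
team_mean = df.groupby("team")["score"].mean()

# transform: same length as the input, so it can be assigned as a column
df["team_mean"] = df.groupby("team")["score"].transform("mean")
df["score_vs_team"] = df["score"] - df["team_mean"]

print(team_mean)
print(df)
```

This pattern is a typical way to build features such as "deviation from the group average" without an explicit merge.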

Merging DataFrames

users = pd.DataFrame({

'user_id': [1, 2, 3, 4, 5],

'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],

'age': [25, 30, 35, 28, 22]

})

orders = pd.DataFrame({

'order_id': [101, 102, 103, 104, 105, 106],

'user_id': [1, 2, 1, 3, 5, 6],

'amount': [150, 250, 80, 320, 190, 440]

})

Inner join (intersection)

inner = pd.merge(users, orders, on='user_id', how='inner')

Left join

left = pd.merge(users, orders, on='user_id', how='left')

Right join

right = pd.merge(users, orders, on='user_id', how='right')

Outer join (union)

outer = pd.merge(users, orders, on='user_id', how='outer')

concat

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

vertical = pd.concat([df1, df2], axis=0, ignore_index=True)
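When joining tables like `users` and `orders` above, it is often worth auditing which keys failed to match. As a sketch, `pd.merge` accepts `indicator=True`, which adds a `_merge` column, and `validate`, which checks the expected key cardinality. The tables below are trimmed copies of the ones above.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3, 4, 5],
                      "name": ["Alice", "Bob", "Charlie", "David", "Eve"]})
orders = pd.DataFrame({"order_id": [101, 102, 103, 104, 105, 106],
                       "user_id": [1, 2, 1, 3, 5, 6],
                       "amount": [150, 250, 80, 320, 190, 440]})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'.
# validate="one_to_many" raises MergeError if user_id repeats in `users`.
audit = pd.merge(users, orders, on="user_id", how="outer",
                 indicator=True, validate="one_to_many")

users_without_orders = audit.loc[audit["_merge"] == "left_only", "name"].tolist()
orphan_orders = (audit.loc[audit["_merge"] == "right_only", "order_id"]
                 .astype(int).tolist())

print(audit["_merge"].value_counts())
print(users_without_orders)  # ['David']
print(orphan_orders)         # [106]
```

The `astype(int)` call is needed because the outer join introduces NaN into `order_id`, which turns the column into floats.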

Practical: AI Training Data Preprocessing Pipeline

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, LabelEncoder

def preprocess_titanic(df):

"""Titanic dataset preprocessing pipeline"""

df = df.copy()

Feature engineering

df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.')

title_map = {'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3}

df['Title'] = df['Title'].map(title_map).fillna(4)

df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

Handle missing values

df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('median'))

df['Fare'] = df['Fare'].fillna(df['Fare'].median())

df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

Encoding

df['Sex'] = (df['Sex'] == 'male').astype(int)

df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked',

'FamilySize', 'IsAlone', 'Title']

return df[features]

4. Matplotlib and Seaborn Visualization

Basic Plots

import matplotlib.pyplot as plt

import numpy as np

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

Line plot

x = np.linspace(0, 2 * np.pi, 100)

axes[0, 0].plot(x, np.sin(x), 'b-', linewidth=2, label='sin(x)')

axes[0, 0].plot(x, np.cos(x), 'r--', linewidth=2, label='cos(x)')

axes[0, 0].set_title('Trigonometric Functions')

axes[0, 0].legend()

axes[0, 0].grid(True, alpha=0.3)

Bar plot

categories = ['Classification', 'Regression', 'Clustering', 'Dim. Reduction']

values = [85, 72, 68, 91]

bars = axes[0, 1].bar(categories, values,

color=['#3498db', '#e74c3c', '#2ecc71', '#f39c12'])

axes[0, 1].set_title('Algorithm Accuracy')

axes[0, 1].set_ylabel('Accuracy (%)')

for bar, val in zip(bars, values):

axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,

f'{val}%', ha='center', va='bottom')

Scatter plot

np.random.seed(42)

x_scatter = np.random.randn(100)

y_scatter = 2 * x_scatter + np.random.randn(100) * 0.5

axes[0, 2].scatter(x_scatter, y_scatter, alpha=0.6, c=y_scatter, cmap='viridis')

axes[0, 2].set_title('Scatter Plot')

Histogram

data = np.concatenate([

np.random.normal(0, 1, 500),

np.random.normal(4, 1.5, 300)

])

axes[1, 0].hist(data, bins=50, density=True, alpha=0.7, color='steelblue')

axes[1, 0].set_title('Data Distribution')

Box plot

box_data = [np.random.normal(i, 1, 100) for i in range(5)]

axes[1, 1].boxplot(box_data, labels=[f'Model{i+1}' for i in range(5)])

axes[1, 1].set_title('Model Performance Distribution')

plt.tight_layout()

plt.savefig('basic_plots.png', dpi=150, bbox_inches='tight')

plt.show()

Seaborn Statistical Visualization

import seaborn as sns

sns.set_theme(style='whitegrid', palette='husl', font_scale=1.2)

Sample data

df = pd.DataFrame({

'model': np.repeat(['ResNet', 'VGG', 'EfficientNet', 'ViT'], 50),

'accuracy': np.concatenate([

np.random.normal(92, 2, 50),

np.random.normal(88, 3, 50),

np.random.normal(94, 1.5, 50),

np.random.normal(95, 2.5, 50)

]),

'params_M': np.concatenate([

np.random.normal(25, 2, 50),

np.random.normal(138, 5, 50),

np.random.normal(5.3, 0.3, 50),

np.random.normal(86, 3, 50)

])

})

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

Violin plot

sns.violinplot(data=df, x='model', y='accuracy', ax=axes[0, 0])

axes[0, 0].set_title('Accuracy Distribution by Model')

Heatmap (correlation)

corr_data = df[['accuracy', 'params_M']].corr()

sns.heatmap(corr_data, annot=True, fmt='.2f', cmap='RdYlGn',

center=0, ax=axes[0, 1])

axes[0, 1].set_title('Correlation Matrix')

Scatter with regression line

sns.regplot(data=df, x='params_M', y='accuracy',

scatter_kws={'alpha': 0.4}, ax=axes[1, 0])

axes[1, 0].set_title('Parameters vs Accuracy')

KDE distribution plot

for model in df['model'].unique():

subset = df[df['model'] == model]

sns.kdeplot(data=subset, x='accuracy', label=model, ax=axes[1, 1])

axes[1, 1].set_title('Accuracy KDE by Model')

axes[1, 1].legend()

plt.tight_layout()

plt.savefig('seaborn_plots.png', dpi=150, bbox_inches='tight')

plt.show()

Practical: Learning Curves and Confusion Matrix

from sklearn.metrics import confusion_matrix

def plot_learning_curve(train_losses, val_losses, train_accs, val_accs):

"""Visualize training and validation learning curves"""

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

epochs = range(1, len(train_losses) + 1)

ax1.plot(epochs, train_losses, 'b-', label='Train Loss', linewidth=2)

ax1.plot(epochs, val_losses, 'r--', label='Val Loss', linewidth=2)

ax1.fill_between(epochs, train_losses, val_losses, alpha=0.1, color='gray')

ax1.set_xlabel('Epoch')

ax1.set_ylabel('Loss')

ax1.set_title('Loss Curve')

ax1.legend()

ax1.grid(True, alpha=0.3)

ax2.plot(epochs, train_accs, 'b-', label='Train Accuracy', linewidth=2)

ax2.plot(epochs, val_accs, 'r--', label='Val Accuracy', linewidth=2)

best_epoch = np.argmax(val_accs)

ax2.axvline(x=best_epoch + 1, color='g', linestyle=':',

label=f'Best Epoch ({best_epoch+1})')

ax2.set_xlabel('Epoch')

ax2.set_ylabel('Accuracy')

ax2.set_title('Accuracy Curve')

ax2.legend()

ax2.grid(True, alpha=0.3)

plt.tight_layout()

return fig

def plot_confusion_matrix(y_true, y_pred, class_names):

"""Visualize a confusion matrix"""

cm = confusion_matrix(y_true, y_pred)

cm_pct = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',

xticklabels=class_names, yticklabels=class_names, ax=ax1)

ax1.set_title('Confusion Matrix (Counts)')

ax1.set_ylabel('True Label')

ax1.set_xlabel('Predicted Label')

sns.heatmap(cm_pct, annot=True, fmt='.2%', cmap='Greens',

xticklabels=class_names, yticklabels=class_names, ax=ax2)

ax2.set_title('Confusion Matrix (Rates)')

ax2.set_ylabel('True Label')

ax2.set_xlabel('Predicted Label')

plt.tight_layout()

return fig
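To see what the two heatmaps in `plot_confusion_matrix` actually display, it helps to build a confusion matrix by hand. The sketch below uses plain NumPy on a tiny made-up label set; `confusion_counts` is a hypothetical helper that follows the same convention as scikit-learn (rows are true labels, columns are predictions).

```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # unbuffered add handles repeated pairs
    return cm


y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])

cm = confusion_counts(y_true, y_pred, n_classes=3)

# Dividing each row by its sum turns counts into per-class rates, so the
# diagonal of cm_pct is the recall of each class.
cm_pct = cm / cm.sum(axis=1, keepdims=True)

print(cm)
print(cm_pct.round(2))
```

Row normalization makes classes of different sizes comparable, which is why the rate view is often more informative than raw counts on imbalanced data.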

5. Machine Learning with Scikit-learn

Data Preprocessing

from sklearn.preprocessing import (

StandardScaler, MinMaxScaler, RobustScaler,

LabelEncoder, OneHotEncoder

)

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)

StandardScaler: mean 0, std 1

scaler = StandardScaler()

X_standard = scaler.fit_transform(X)

MinMaxScaler: scale to [0, 1]

min_max = MinMaxScaler(feature_range=(0, 1))

X_minmax = min_max.fit_transform(X)

RobustScaler: uses median and IQR, robust to outliers

robust = RobustScaler()

X_robust = robust.fit_transform(X)

LabelEncoder

le = LabelEncoder()

labels = ['cat', 'dog', 'bird', 'cat', 'dog']

encoded = le.fit_transform(labels) # [1, 2, 0, 1, 2] (classes are sorted: bird=0, cat=1, dog=2)

OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

categories = np.array([['red'], ['green'], ['blue'], ['red']])

encoded_ohe = ohe.fit_transform(categories)
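The examples above call `fit_transform` on the full array for brevity. In a real train/test workflow the scaling statistics should come from the training split only, otherwise information from the test set leaks into preprocessing. The sketch below shows the idea in plain NumPy on synthetic data; with scikit-learn the equivalent is `fit` (or `fit_transform`) on the training data followed by `transform` on the test data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_train, X_test = X[:160], X[160:]

# "fit": statistics come from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# "transform": the same statistics are applied to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(6))  # ~0 by construction
print(X_test_scaled.mean(axis=0).round(3))   # close to 0, but not exactly
```

The test split ends up only approximately standardized, which is expected: it is being measured against statistics it did not contribute to.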

Feature Selection and Extraction

from sklearn.decomposition import PCA

from sklearn.feature_selection import SelectKBest, f_classif

from sklearn.ensemble import RandomForestClassifier

X = np.random.randn(200, 20)

y = (X[:, 0] + X[:, 1] + np.random.randn(200) * 0.1 > 0).astype(int)

PCA

pca = PCA(n_components=10)

X_pca = pca.fit_transform(X)

print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")

SelectKBest

selector = SelectKBest(f_classif, k=5)

X_kbest = selector.fit_transform(X, y)

selected = selector.get_support(indices=True)

print(f"Selected feature indices: {selected}")

Linear Models

from sklearn.linear_model import (

LinearRegression, LogisticRegression, Ridge, Lasso

)

from sklearn.datasets import make_classification, make_regression

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error, r2_score, accuracy_score

Regression

X_reg, y_reg = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

lr = LinearRegression()

lr.fit(X_train, y_train)

print(f"Linear R2: {r2_score(y_test, lr.predict(X_test)):.4f}")

ridge = Ridge(alpha=1.0)

ridge.fit(X_train, y_train)

print(f"Ridge R2: {r2_score(y_test, ridge.predict(X_test)):.4f}")

lasso = Lasso(alpha=0.01)

lasso.fit(X_train, y_train)

print(f"Lasso R2: {r2_score(y_test, lasso.predict(X_test)):.4f}")

print(f"Non-zero coefficients: {np.sum(lasso.coef_ != 0)}")

Tree-based Models

from sklearn.ensemble import (

RandomForestClassifier, GradientBoostingClassifier

)

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report

import xgboost as xgb

X, y = make_classification(n_samples=1000, n_features=20,

n_classes=2, n_informative=10,

random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {

'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),

'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),

'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),

}

for name, model in models.items():

model.fit(X_train, y_train)

acc = (model.predict(X_test) == y_test).mean()

print(f"{name}: {acc:.4f}")

Model Evaluation and Cross-Validation

from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy', n_jobs=-1)

print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

GridSearchCV

param_grid = {

'n_estimators': [50, 100, 200],

'max_depth': [None, 5, 10],

'min_samples_split': [2, 5, 10]

}

grid_search = GridSearchCV(

RandomForestClassifier(random_state=42),

param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1

)

grid_search.fit(X, y)

print(f"Best params: {grid_search.best_params_}")

print(f"Best CV score: {grid_search.best_score_:.4f}")

Pipelines

from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.impute import SimpleImputer

from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

n = 500

df = pd.DataFrame({

'age': np.random.randint(18, 70, n).astype(float),

'income': np.random.randint(20000, 100000, n).astype(float),

'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),

'city': np.random.choice(['New York', 'Chicago', 'Houston', 'Phoenix'], n),

'target': np.random.randint(0, 2, n)

})

X = df.drop('target', axis=1)

y = df['target']

numeric_features = ['age', 'income']

categorical_features = ['education', 'city']

numeric_transformer = Pipeline(steps=[

('imputer', SimpleImputer(strategy='median')),

('scaler', StandardScaler())

])

categorical_transformer = Pipeline(steps=[

('imputer', SimpleImputer(strategy='most_frequent')),

('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))

])

preprocessor = ColumnTransformer(transformers=[

('num', numeric_transformer, numeric_features),

('cat', categorical_transformer, categorical_features)

])

full_pipeline = Pipeline(steps=[

('preprocessor', preprocessor),

('classifier', RandomForestClassifier(n_estimators=100, random_state=42))

])

from sklearn.model_selection import cross_val_score

scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='accuracy')

print(f"Pipeline CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

6. Python Performance Optimization

List Comprehensions vs map vs for

import time

import numpy as np

n = 1_000_000

data = list(range(n))

For loop

start = time.time()

result_for = []

for x in data:

result_for.append(x ** 2)

print(f"For loop: {time.time() - start:.4f}s")

List comprehension

start = time.time()

result_lc = [x ** 2 for x in data]

print(f"List comprehension: {time.time() - start:.4f}s")

map

start = time.time()

result_map = list(map(lambda x: x ** 2, data))

print(f"Map: {time.time() - start:.4f}s")

NumPy vectorization

arr = np.array(data)

start = time.time()

result_np = arr ** 2

print(f"NumPy: {time.time() - start:.4f}s")

Generators

List vs generator memory comparison

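import sys

list_comp = [x**2 for x in range(1_000_000)]

gen_expr = (x**2 for x in range(1_000_000))

print(f"List size: {sys.getsizeof(list_comp):,} bytes") # ~8 MB (the list's pointer array only, not the integers it holds)

print(f"Generator size: {sys.getsizeof(gen_expr)} bytes") # on the order of 100-200 bytes, depending on the Python version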

Generator function

def infinite_data_loader(dataset, batch_size=32):

"""Infinite data loader generator"""

while True:

indices = np.random.permutation(len(dataset))

for i in range(0, len(dataset), batch_size):

batch_indices = indices[i:i + batch_size]

yield dataset[batch_indices]
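Because a loader like this never terminates on its own, the consumer has to decide when to stop. A minimal usage sketch with `itertools.islice` follows; the loader is restated with a seeded generator (an addition for reproducibility, not part of the version above) so the block runs on its own.

```python
from itertools import islice

import numpy as np


def infinite_data_loader(dataset, batch_size=32, seed=0):
    """Yield shuffled mini-batches forever, reshuffling at every epoch."""
    rng = np.random.default_rng(seed)
    while True:
        indices = rng.permutation(len(dataset))
        for i in range(0, len(dataset), batch_size):
            yield dataset[indices[i:i + batch_size]]


dataset = np.arange(100).reshape(50, 2)   # 50 toy samples, 2 features
loader = infinite_data_loader(dataset, batch_size=16)

# An infinite generator never raises StopIteration, so cap it explicitly.
batches = list(islice(loader, 5))
print([batch.shape for batch in batches])
# [(16, 2), (16, 2), (16, 2), (2, 2), (16, 2)]
```

The fourth batch is smaller because 50 is not a multiple of 16; the fifth batch already belongs to the next, reshuffled epoch.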

Parallel Processing

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_intensive_task(n):

return sum(i**2 for i in range(n))

def io_bound_task(url):

time.sleep(0.1)

return f"Fetched: {url}"

ThreadPoolExecutor: best for I/O-bound tasks

urls = [f"https://example.com/data/{i}" for i in range(20)]

start = time.time()

with ThreadPoolExecutor(max_workers=10) as executor:

results = list(executor.map(io_bound_task, urls))

print(f"ThreadPool time: {time.time() - start:.2f}s")

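ProcessPoolExecutor: best for CPU-bound tasks (on Windows and macOS, put this code under an `if __name__ == "__main__":` guard, because worker processes are started by re-importing the main module)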

numbers = [1_000_000] * 8

start = time.time()

with ProcessPoolExecutor(max_workers=4) as executor:

results = list(executor.map(cpu_intensive_task, numbers))

print(f"ProcessPool time: {time.time() - start:.2f}s")
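`executor.map` returns results in input order and re-raises the first exception it encounters. When tasks can fail independently, a common alternative is `submit` combined with `as_completed`, sketched below. The `fetch` function is a stand-in that sleeps instead of doing real I/O, and its failure on id 3 is contrived for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(item_id, delay=0.05):
    """Stand-in for an I/O call; raises for one id to show error handling."""
    time.sleep(delay)
    if item_id == 3:
        raise ValueError(f"item {item_id} failed")
    return item_id * 10


results, errors = {}, {}
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to the input that produced it.
    futures = {executor.submit(fetch, i): i for i in range(6)}
    for future in as_completed(futures):      # yields futures as they finish
        item_id = futures[future]
        try:
            results[item_id] = future.result()
        except ValueError as exc:
            errors[item_id] = str(exc)

print(sorted(results.items()))  # [(0, 0), (1, 10), (2, 20), (4, 40), (5, 50)]
print(errors)                   # {3: 'item 3 failed'}
```

One failed task no longer aborts the whole batch, and results can be processed as soon as each one is ready.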

Numba JIT Compilation

from numba import njit, prange

@njit(parallel=True)

def fast_matrix_norm(A):

"""Parallelized matrix norm computation"""

n, m = A.shape

result = 0.0

for i in prange(n):

for j in prange(m):

result += A[i, j] ** 2

return result ** 0.5

A = np.random.randn(1000, 1000)

Warmup (first run triggers JIT compilation)

_ = fast_matrix_norm(A)

start = time.time()

for _ in range(10):

result = fast_matrix_norm(A)

print(f"Numba: {time.time() - start:.4f}s")

start = time.time()

for _ in range(10):

result = np.linalg.norm(A)

print(f"NumPy: {time.time() - start:.4f}s")

7. AI/ML Utility Libraries

tqdm - Progress Bars

from tqdm import tqdm, trange

Basic usage

for i in tqdm(range(100)):

time.sleep(0.01)

Custom description

items = list(range(50))

for item in tqdm(items, desc='Processing', unit='sample'):

pass

Nested progress bars

for epoch in trange(10, desc='Epochs'):

for batch in trange(100, desc='Batches', leave=False):

pass

Manual update with metrics

with tqdm(total=100, desc='Training') as pbar:

for i in range(10):

pbar.update(10)

pbar.set_postfix({'loss': 0.5 - i * 0.04, 'acc': 0.7 + i * 0.02})

Weights and Biases (wandb) - Experiment Tracking

import numpy as np

import wandb

wandb.init(

project="ml-experiment",

name="run-001",

config={

"learning_rate": 0.001,

"batch_size": 32,

"epochs": 100,

"model": "ResNet50",

"optimizer": "AdamW"

}

)

for epoch in range(100):

train_loss = 1.0 - epoch * 0.009 + np.random.normal(0, 0.01)

val_loss = 1.0 - epoch * 0.008 + np.random.normal(0, 0.02)

train_acc = epoch * 0.009 + np.random.normal(0, 0.01)

val_acc = epoch * 0.008 + np.random.normal(0, 0.02)

wandb.log({

"epoch": epoch,

"train/loss": train_loss,

"val/loss": val_loss,

"train/acc": train_acc,

"val/acc": val_acc,

"learning_rate": 0.001 * (0.95 ** epoch)

})

wandb.finish()

pytest - Testing

tests/test_preprocessing.py

import numpy as np

import pandas as pd

import pytest

def normalize(x):

return (x - x.mean()) / x.std()

class TestNormalize:

def test_mean_zero(self):

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

result = normalize(x)

assert abs(result.mean()) < 1e-10

def test_std_one(self):

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

result = normalize(x)

assert abs(result.std() - 1.0) < 1e-10

def test_shape_preserved(self):

x = np.random.randn(10, 5)

result = normalize(x)

assert result.shape == x.shape

@pytest.fixture

def sample_df():

return pd.DataFrame({

'value': [1.0, 2.0, np.nan, 4.0, 5.0],

'label': ['a', 'b', 'c', 'd', 'e']

})

def test_dropna(sample_df):

result = sample_df.dropna()

assert result.isnull().sum().sum() == 0

Run with: pytest tests/ -v --cov (the --cov option requires the pytest-cov plugin)

Conclusion

This guide covered the essential Python ecosystem for AI/ML development:

- **Environment Setup**: Project isolation with venv, conda, and Poetry

- **NumPy**: High-performance numerical computation through vectorized operations

- **Pandas**: Building data preprocessing and analysis pipelines

- **Matplotlib/Seaborn**: Rich visualizations to surface insights

- **Scikit-learn**: Complete ML workflow from preprocessing to model evaluation

- **Performance Optimization**: Generators, parallel processing, and Numba JIT

- **Utilities**: tqdm, wandb, and pytest

In real-world ML projects, these tools combine to build a complete pipeline: data loading, preprocessing, feature engineering, model training, evaluation, and deployment. Consult the official documentation for each tool to go deeper.

References

- [NumPy Documentation](https://numpy.org/doc/stable/)

- [Pandas Documentation](https://pandas.pydata.org/docs/)

- [Scikit-learn Documentation](https://scikit-learn.org/stable/)

- [Matplotlib Documentation](https://matplotlib.org/stable/)

- [Seaborn Documentation](https://seaborn.pydata.org/)

- [Weights and Biases Documentation](https://docs.wandb.ai/)