Python Complete Guide for AI/ML: Master NumPy, Pandas, Matplotlib, and Scikit-learn


Python is the standard language for AI and machine learning. Its clean syntax, vast library ecosystem, and active community have made it the go-to choice for both researchers and engineers. This guide covers everything you need to master the core Python libraries for AI/ML development.


1. Setting Up Your Python AI/ML Environment

Choosing a Python Version

For AI/ML work, Python 3.10 or later is recommended. Python 3.10+ offers structural pattern matching, clearer error messages, and improved type hints. As of 2026, Python 3.12 is stable and compatible with most ML libraries.

# Check Python version
python --version
python3 --version

# Install a specific version with pyenv
pyenv install 3.12.0
pyenv global 3.12.0

Setting Up Virtual Environments

Virtual environments isolate dependencies per project.

venv (Standard Library)

# Create a virtual environment
python -m venv ml_env

# Activate (Linux/Mac)
source ml_env/bin/activate

# Activate (Windows)
ml_env\Scripts\activate

# Deactivate
deactivate

conda (Anaconda/Miniconda)

# Create environment
conda create -n ml_env python=3.12

# Activate
conda activate ml_env

# Install packages
conda install numpy pandas scikit-learn matplotlib

# List environments
conda env list

# Export environment
conda env export > environment.yml

# Restore environment
conda env create -f environment.yml

Poetry (Advanced Dependency Management)

# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Initialize a project
poetry new ml_project
cd ml_project

# Add packages
poetry add numpy pandas scikit-learn torch

# Add dev dependencies (Poetry 1.2+; older versions used --dev)
poetry add --group dev pytest black flake8

# Run inside the environment
poetry run python train.py

Jupyter Notebook/Lab Setup

# Install JupyterLab
pip install jupyterlab

# Register kernel
python -m ipykernel install --user --name=ml_env --display-name "ML Environment"

# Launch JupyterLab
jupyter lab

# Install useful extensions
pip install jupyterlab-git
pip install nbformat

Jupyter config file (~/.jupyter/jupyter_lab_config.py)

c.ServerApp.open_browser = False  # don't try to open a browser when serving remotely
c.ServerApp.port = 8888
c.ServerApp.ip = '0.0.0.0'        # listen on all interfaces

GPU Python Environment (CUDA, cuDNN)

# Check CUDA version
nvidia-smi
nvcc --version

# Install PyTorch with CUDA (CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install TensorFlow with GPU
pip install tensorflow[and-cuda]

# Verify GPU availability (PyTorch)
python -c "import torch; print(torch.cuda.is_available())"

# Check cuDNN
python -c "import torch; print(torch.backends.cudnn.version())"

Essential Package List

# requirements.txt
numpy>=1.24.0
pandas>=2.0.0
matplotlib>=3.7.0
seaborn>=0.12.0
scikit-learn>=1.3.0
scipy>=1.11.0
torch>=2.0.0
torchvision>=0.15.0
tensorflow>=2.13.0
xgboost>=1.7.0
lightgbm>=4.0.0
optuna>=3.3.0
wandb>=0.15.0
tqdm>=4.65.0
jupyterlab>=4.0.0
black>=23.0.0
flake8>=6.0.0
pytest>=7.4.0

# Install all at once
pip install -r requirements.txt

2. Mastering NumPy

NumPy (Numerical Python) is the foundation of scientific computing in Python. It provides multidimensional arrays and mathematical functions, and most ML libraries use NumPy internally.

Creating ndarrays

import numpy as np

# Basic array creation
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(arr1.shape)   # (5,)
print(arr2.shape)   # (2, 3)
print(arr2.dtype)   # int64
print(arr2.ndim)    # 2
print(arr2.size)    # 6

# Special arrays
zeros = np.zeros((3, 4))          # all zeros
ones = np.ones((2, 3, 4))         # all ones
full = np.full((3, 3), 7)         # all sevens
eye = np.eye(4)                    # identity matrix
empty = np.empty((2, 3))          # uninitialized

# Range arrays
arange = np.arange(0, 10, 2)      # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)   # [0.0, 0.25, 0.5, 0.75, 1.0]
logspace = np.logspace(0, 3, 4)   # [1, 10, 100, 1000]

# Random arrays
np.random.seed(42)
rand_uniform = np.random.rand(3, 4)           # uniform [0, 1)
rand_normal = np.random.randn(3, 4)           # standard normal
rand_int = np.random.randint(0, 10, (3, 4))  # random integers

# Modern random API (recommended)
rng = np.random.default_rng(42)
samples = rng.normal(loc=0, scale=1, size=(100, 3))

Basic Operations and Broadcasting

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7, 8, 9], [10, 11, 12]])

# Element-wise arithmetic
print(a + b)   # element-wise addition
print(a - b)   # element-wise subtraction
print(a * b)   # element-wise multiplication
print(a / b)   # element-wise division
print(a ** 2)  # element-wise squaring
print(a % 2)   # element-wise modulo

# Broadcasting - operations on arrays of different shapes
x = np.array([[1], [2], [3]])  # shape: (3, 1)
y = np.array([10, 20, 30])     # shape: (3,) treated as (1, 3)

# Broadcasting result: (3, 3)
result = x + y
print(result)
# [[11, 21, 31],
#  [12, 22, 32],
#  [13, 23, 33]]

# Practical Broadcasting: batch normalization
data = np.random.randn(100, 10)  # 100 samples, 10 features
mean = data.mean(axis=0)         # per-feature mean (shape: 10,)
std = data.std(axis=0)           # per-feature std (shape: 10,)

normalized = (data - mean) / std  # broadcasting normalization
print(normalized.mean(axis=0).round(10))  # approximately 0
print(normalized.std(axis=0).round(10))   # approximately 1
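When a shape combination is unclear, `np.broadcast_shapes` (available since NumPy 1.20) applies the same right-to-left rules without allocating any arrays:

```python
import numpy as np

# Shapes are compared right-to-left; each pair of dimensions must be
# equal, or one of them must be 1 (which is then stretched).
print(np.broadcast_shapes((3, 1), (3,)))      # (3, 3)
print(np.broadcast_shapes((100, 10), (10,)))  # (100, 10)

# Incompatible shapes raise a ValueError instead of silently misbehaving
try:
    np.broadcast_shapes((3, 2), (3,))
except ValueError as e:
    print("incompatible:", e)
```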

Indexing, Slicing, and Boolean Indexing

import numpy as np

arr = np.arange(24).reshape(4, 6)
print(arr)
# [[ 0  1  2  3  4  5]
#  [ 6  7  8  9 10 11]
#  [12 13 14 15 16 17]
#  [18 19 20 21 22 23]]

# Basic indexing
print(arr[0, 0])    # 0
print(arr[3, 5])    # 23
print(arr[-1, -1])  # 23

# Slicing
print(arr[1:3, 2:5])  # rows 1-2, cols 2-4
print(arr[:, 0])      # all rows, column 0
print(arr[::2, ::2])  # every 2nd row and column

# Fancy indexing
rows = np.array([0, 2])
cols = np.array([1, 4])
print(arr[rows, cols])  # [arr[0,1], arr[2,4]] = [1, 16]

# Boolean indexing (masking)
mask = arr > 12
print(arr[mask])  # elements greater than 12

# Filtering with conditions
data = np.array([1, -2, 3, -4, 5, -6])
positive = data[data > 0]  # [1, 3, 5]

# np.where - conditional selection
result = np.where(data > 0, data, 0)  # keep positives, zero out negatives
print(result)  # [1, 0, 3, 0, 5, 0]

# np.where to find indices
indices = np.where(data > 0)
print(indices)  # (array([0, 2, 4]),)

Shape Transformations

import numpy as np

arr = np.arange(12)

# reshape
a = arr.reshape(3, 4)
b = arr.reshape(2, 2, 3)
c = arr.reshape(-1, 4)   # -1 infers the size: gives (3, 4)

# flatten vs ravel
flat1 = a.flatten()  # always returns a copy
flat2 = a.ravel()    # returns a view when possible (more memory-efficient)

# transpose
mat = np.random.randn(3, 4)
transposed = mat.T
transposed2 = mat.transpose()
transposed3 = np.transpose(mat, (1, 0))

# 3D transpose
tensor = np.random.randn(2, 3, 4)
# batch, channels, spatial -> batch, spatial, channels
reordered = tensor.transpose(0, 2, 1)  # (2, 4, 3)

# squeeze and expand_dims
x = np.array([[[1, 2, 3]]])  # shape: (1, 1, 3)
squeezed = np.squeeze(x)     # (3,)
expanded = np.expand_dims(squeezed, axis=0)  # (1, 3)

# Stacking arrays
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

hstack = np.hstack([a, b])  # horizontal stack (2, 4)
vstack = np.vstack([a, b])  # vertical stack (4, 2)

Mathematical Functions

import numpy as np

x = np.array([0, np.pi/6, np.pi/4, np.pi/3, np.pi/2])

# Trigonometry
sin_x = np.sin(x)
cos_x = np.cos(x)
tan_x = np.tan(x)

# Exponential and logarithm
exp_x = np.exp(x)        # e^x
log_x = np.log(x + 1)   # natural log (ln)
log2_x = np.log2(x + 1) # base-2 log
log10_x = np.log10(x + 1)

# Power and root
sqrt_x = np.sqrt(x)
square_x = np.square(x)  # x^2
power_x = np.power(x, 3) # x^3

# Sigmoid and softmax
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e_x = np.exp(x - x.max())  # subtract max for numerical stability
    return e_x / e_x.sum()

z = np.array([1.0, 2.0, 3.0])
print(sigmoid(z))   # [0.731, 0.881, 0.953]
print(softmax(z))   # [0.090, 0.245, 0.665]
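To see why the softmax above subtracts the max, compare a naive implementation against the stabilized one on large logits:

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax overflows: exp(1000) is inf, and inf/inf is nan
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z) / np.exp(z).sum()
print(naive)  # [nan nan nan]

# Subtracting the max is mathematically a no-op but keeps exp() finite
shifted = z - z.max()                            # [-2, -1, 0]
stable = np.exp(shifted) / np.exp(shifted).sum()
print(stable.round(3))  # [0.09  0.245 0.665]
```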

Linear Algebra

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication
C = np.dot(A, B)          # classic approach
C = A @ B                 # preferred in Python 3.5+
C = np.matmul(A, B)       # same as np.dot for 2D

# Batched matrix multiplication (3D+)
batch_A = np.random.randn(32, 3, 4)
batch_B = np.random.randn(32, 4, 5)
batch_C = batch_A @ batch_B  # (32, 3, 5)

# Linear algebra functions
det = np.linalg.det(A)               # determinant
inv = np.linalg.inv(A)               # inverse
rank = np.linalg.matrix_rank(A)      # rank
trace = np.trace(A)                   # trace

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(A)

# Singular Value Decomposition (SVD)
U, S, Vt = np.linalg.svd(A)

# Solve linear system Ax = b
b = np.array([5, 6])
x = np.linalg.solve(A, b)

# Norms
v = np.array([3, 4])
l1_norm = np.linalg.norm(v, ord=1)         # L1 norm: 7
l2_norm = np.linalg.norm(v, ord=2)         # L2 norm: 5
inf_norm = np.linalg.norm(v, ord=np.inf)   # max norm: 4
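Norms also yield a quick cosine-similarity helper, a staple for comparing embeddings (a sketch; `cosine_sim` is an illustrative name, not a NumPy function):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| |v|)."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(round(cosine_sim(a, a), 4))  # 1.0
print(round(cosine_sim(a, b), 4))  # 0.7071
```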

Vectorized Operations vs For Loops

import numpy as np
import time

n = 1_000_000
a = np.random.randn(n)
b = np.random.randn(n)

# For loop
start = time.time()
result_loop = []
for i in range(n):
    result_loop.append(a[i] * b[i])
loop_time = time.time() - start
print(f"For loop: {loop_time:.4f}s")

# Vectorized
start = time.time()
result_vec = a * b
vec_time = time.time() - start
print(f"Vectorized: {vec_time:.4f}s")

print(f"Speedup: {loop_time / vec_time:.1f}x")
# Typically 100-1000x faster

Practical: Neural Network Forward Pass with NumPy

import numpy as np

class SimpleNeuralNetwork:
    """Two-layer neural network implemented with NumPy only"""

    def __init__(self, input_size, hidden_size, output_size, seed=42):
        np.random.seed(seed)
        # He initialization
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def relu(self, z):
        return np.maximum(0, z)

    def relu_derivative(self, z):
        return (z > 0).astype(float)

    def softmax(self, z):
        exp_z = np.exp(z - z.max(axis=1, keepdims=True))
        return exp_z / exp_z.sum(axis=1, keepdims=True)

    def forward(self, X):
        # Layer 1
        self.Z1 = X @ self.W1 + self.b1
        self.A1 = self.relu(self.Z1)
        # Layer 2
        self.Z2 = self.A1 @ self.W2 + self.b2
        self.A2 = self.softmax(self.Z2)
        return self.A2

    def cross_entropy_loss(self, y_pred, y_true):
        m = y_true.shape[0]
        log_probs = -np.log(y_pred[range(m), y_true] + 1e-8)
        return log_probs.mean()

    def backward(self, X, y_true, learning_rate=0.01):
        m = X.shape[0]

        # Output layer gradient
        dZ2 = self.A2.copy()
        dZ2[range(m), y_true] -= 1
        dZ2 /= m

        dW2 = self.A1.T @ dZ2
        db2 = dZ2.sum(axis=0, keepdims=True)

        # Hidden layer gradient
        dA1 = dZ2 @ self.W2.T
        dZ1 = dA1 * self.relu_derivative(self.Z1)

        dW1 = X.T @ dZ1
        db1 = dZ1.sum(axis=0, keepdims=True)

        # Weight update
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2

    def train(self, X, y, epochs=100, learning_rate=0.01):
        losses = []
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = self.cross_entropy_loss(y_pred, y)
            losses.append(loss)
            self.backward(X, y, learning_rate)

            if epoch % 10 == 0:
                acc = (y_pred.argmax(axis=1) == y).mean()
                print(f"Epoch {epoch:3d}: Loss={loss:.4f}, Acc={acc:.4f}")
        return losses


# Test
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,
                             n_classes=3, n_informative=15,
                             random_state=42)

nn = SimpleNeuralNetwork(input_size=20, hidden_size=64, output_size=3)
losses = nn.train(X, y, epochs=50, learning_rate=0.1)

3. Mastering Pandas

Pandas is the core library for working with tabular data. It provides DataFrame and Series data structures and supports every step of data cleaning, transformation, and analysis.

Series and DataFrame

import pandas as pd
import numpy as np

# Creating a Series
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s3 = pd.Series({'x': 100, 'y': 200, 'z': 300})

print(s2['a'])          # 10
print(s2[['a', 'c']])  # a=10, c=30

# Creating a DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 22],
    'score': [88.5, 92.3, 78.1, 95.7, 83.2],
    'passed': [True, True, False, True, True]
}
df = pd.DataFrame(data)

print(df.head())
print(df.tail(3))
print(df.info())
print(df.describe())
print(df.dtypes)
print(df.shape)  # (5, 4)

Reading and Writing Data

import pandas as pd

# CSV
df_csv = pd.read_csv('data.csv',
                      sep=',',
                      header=0,
                      index_col=0,
                      parse_dates=['date'],
                      encoding='utf-8',
                      na_values=['N/A', 'null', ''])

df_csv.to_csv('output.csv', index=False, encoding='utf-8')

# Excel
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1', header=0)
df_excel.to_excel('output.xlsx', sheet_name='Result', index=False)

# JSON
df_json = pd.read_json('data.json', orient='records')
df_json.to_json('output.json', orient='records', indent=2)

# Parquet (high-performance columnar format)
df_csv.to_parquet('data.parquet', engine='pyarrow', compression='snappy')
df_parquet = pd.read_parquet('data.parquet')

# SQL (SQLite example)
import sqlite3
conn = sqlite3.connect('database.db')
df_sql = pd.read_sql_query("SELECT * FROM users WHERE age > 25", conn)
df_sql.to_sql('new_table', conn, if_exists='replace', index=False)

Indexing with loc and iloc

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': range(10),
    'B': range(10, 20),
    'C': range(20, 30)
}, index=[f'row{i}' for i in range(10)])

# loc: label-based indexing
print(df.loc['row0', 'A'])            # single value
print(df.loc['row0':'row3', 'A':'B']) # range (end inclusive)
print(df.loc[['row1', 'row5'], 'C'])  # list

# iloc: position-based indexing
print(df.iloc[0, 0])         # row 0, col 0
print(df.iloc[0:4, 0:2])    # range (end exclusive)
print(df.iloc[[1, 5], 2])   # list

# Condition-based selection
mask = df['A'] > 5
filtered = df[mask]
filtered2 = df[df['B'].between(12, 17)]
filtered3 = df.query('A > 5 and B < 18')
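Two more filters that come up constantly: `isin` for membership tests and `~` for negating a mask (a standalone sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': range(10), 'B': range(10, 20)})

# isin: keep rows whose value is in a given set
subset = df[df['A'].isin([2, 5, 7])]

# ~ negates a boolean mask: everything NOT in the set
rest = df[~df['A'].isin([2, 5, 7])]

print(len(subset), len(rest))  # 3 7
```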

Handling Missing Values

import pandas as pd
import numpy as np

# Create data with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 35, np.nan, 22],
    'income': [50000, 60000, np.nan, 80000, np.nan],
    'city': ['Seoul', 'Busan', None, 'Incheon', 'Seoul'],
    'score': [88.5, 92.3, 78.1, np.nan, 83.2]
})

# Inspect missing values
print(df.isnull().sum())                      # count per column
print(df.isnull().sum() / len(df) * 100)      # missing percentage

# Drop missing values
df_dropped_rows = df.dropna()                # drop rows with any NaN
df_dropped_cols = df.dropna(axis=1)          # drop columns with any NaN
df_thresh = df.dropna(thresh=3)              # keep rows with at least 3 non-NaN

# Fill missing values
df_filled_0 = df.fillna(0)
df_filled_mean = df.fillna(df.mean(numeric_only=True))  # mean of numeric columns only
df_filled_dict = df.fillna({
    'age': df['age'].mean(),
    'income': df['income'].median(),
    'city': 'Unknown',
    'score': df['score'].mean()
})

# Forward/backward fill (fillna(method=...) was deprecated in pandas 2.x)
df_ffill = df.ffill()
df_bfill = df.bfill()

# Interpolation
df_interpolated = df.interpolate(method='linear')

# Smart handling pattern
for col in df.columns:
    missing_pct = df[col].isnull().mean()
    if missing_pct > 0.5:
        df.drop(columns=[col], inplace=True)
    elif df[col].dtype == 'object':
        # assign back instead of inplace fillna (unreliable under pandas 2.x copy-on-write)
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())

Data Transformation

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'text': ['hello world', 'PYTHON IS GREAT', 'data science'],
    'value': [1, 2, 3],
    'category': ['A', 'B', 'A']
})

# apply: apply a function
df['text_upper'] = df['text'].apply(str.upper)
df['text_length'] = df['text'].apply(len)

# Complex function
def process_text(text):
    return ' '.join(word.capitalize() for word in text.lower().split())

df['text_processed'] = df['text'].apply(process_text)

# Multiple columns simultaneously
def feature_engineer(row):
    return pd.Series({
        'value_squared': row['value'] ** 2,
        'category_is_A': int(row['category'] == 'A')
    })

new_features = df.apply(feature_engineer, axis=1)
df = pd.concat([df, new_features], axis=1)

# map: apply a mapping table
category_map = {'A': 'Alpha', 'B': 'Beta', 'C': 'Gamma'}
df['category_name'] = df['category'].map(category_map)

# String operations (vectorized)
texts = pd.Series(['Hello World', 'Python 3.12', 'Machine Learning'])
print(texts.str.lower())
print(texts.str.split())
print(texts.str.contains('Python'))
print(texts.str.extract(r'(\w+)\s+(\w+)'))

Groupby Aggregation

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'team': np.random.choice(['A', 'B', 'C'], 100),
    'role': np.random.choice(['dev', 'ds', 'pm'], 100),
    'score': np.random.randint(60, 100, 100),
    'salary': np.random.randint(3000, 8000, 100)
})

# Basic groupby
grouped = df.groupby('team')
print(grouped['score'].mean())
print(grouped['salary'].describe())

# Multiple keys
multi_grouped = df.groupby(['team', 'role'])
print(multi_grouped['score'].mean().unstack())

# Custom aggregation
agg_result = df.groupby('team').agg(
    avg_score=('score', 'mean'),
    total_salary=('salary', 'sum'),
    count=('score', 'count'),
    max_score=('score', 'max'),
    min_salary=('salary', 'min')
)

# Custom aggregation function
def iqr(x):
    return x.quantile(0.75) - x.quantile(0.25)

custom_agg = df.groupby('team')['score'].agg(['mean', 'median', 'std', iqr])

# filter: keep groups matching a condition
large_teams = df.groupby('team').filter(lambda x: len(x) > 30)
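Beyond `agg` and `filter`, `transform` returns a result aligned to the original rows, which makes group-wise feature engineering a one-liner (a small sketch on a toy frame):

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'score': [80.0, 90.0, 60.0, 70.0]
})

# transform broadcasts each group's statistic back to its member rows,
# so we can center every score on its own team's mean
df['score_centered'] = df['score'] - df.groupby('team')['score'].transform('mean')
print(df['score_centered'].tolist())  # [-5.0, 5.0, -5.0, 5.0]
```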

Merging DataFrames

import pandas as pd

users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 22]
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'user_id': [1, 2, 1, 3, 5, 6],
    'amount': [150, 250, 80, 320, 190, 440]
})

# Inner join (intersection)
inner = pd.merge(users, orders, on='user_id', how='inner')

# Left join
left = pd.merge(users, orders, on='user_id', how='left')

# Right join
right = pd.merge(users, orders, on='user_id', how='right')

# Outer join (union)
outer = pd.merge(users, orders, on='user_id', how='outer')

# concat
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

vertical = pd.concat([df1, df2], axis=0, ignore_index=True)
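When a join misbehaves, `indicator=True` labels each row's provenance, which makes key mismatches easy to spot (a sketch on two toy frames):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'x': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'y': ['B', 'C', 'D']})

# The _merge column records whether each row matched
merged = pd.merge(left, right, on='id', how='outer', indicator=True)
counts = merged['_merge'].value_counts()
print(counts['both'], counts['left_only'], counts['right_only'])  # 2 1 1
```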

Practical: AI Training Data Preprocessing Pipeline

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

def preprocess_titanic(df):
    """Titanic dataset preprocessing pipeline"""
    df = df.copy()

    # Feature engineering
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.')
    title_map = {'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3}
    df['Title'] = df['Title'].map(title_map).fillna(4)

    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

    # Handle missing values
    df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('median'))
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

    # Encoding
    df['Sex'] = (df['Sex'] == 'male').astype(int)
    df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

    features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked',
                'FamilySize', 'IsAlone', 'Title']

    return df[features]
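The imports at the top of this snippet hint at the next step. A hedged sketch with a synthetic stand-in for the engineered features: split first, then fit the scaler on the training fold alone, so test-set statistics never leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the engineered feature matrix (illustrative only)
X = np.random.randn(100, 8)
y = np.random.randint(0, 2, 100)

# Split first, then fit the scaler on the training fold only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std from train
X_test_s = scaler.transform(X_test)        # reuse the same statistics
print(X_train_s.shape, X_test_s.shape)     # (80, 8) (20, 8)
```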

4. Matplotlib and Seaborn Visualization

Basic Plots

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Line plot
x = np.linspace(0, 2 * np.pi, 100)
axes[0, 0].plot(x, np.sin(x), 'b-', linewidth=2, label='sin(x)')
axes[0, 0].plot(x, np.cos(x), 'r--', linewidth=2, label='cos(x)')
axes[0, 0].set_title('Trigonometric Functions')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Bar plot
categories = ['Classification', 'Regression', 'Clustering', 'Dim. Reduction']
values = [85, 72, 68, 91]
bars = axes[0, 1].bar(categories, values,
                       color=['#3498db', '#e74c3c', '#2ecc71', '#f39c12'])
axes[0, 1].set_title('Algorithm Accuracy')
axes[0, 1].set_ylabel('Accuracy (%)')
for bar, val in zip(bars, values):
    axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                    f'{val}%', ha='center', va='bottom')

# Scatter plot
np.random.seed(42)
x_scatter = np.random.randn(100)
y_scatter = 2 * x_scatter + np.random.randn(100) * 0.5
axes[0, 2].scatter(x_scatter, y_scatter, alpha=0.6, c=y_scatter, cmap='viridis')
axes[0, 2].set_title('Scatter Plot')

# Histogram
data = np.concatenate([
    np.random.normal(0, 1, 500),
    np.random.normal(4, 1.5, 300)
])
axes[1, 0].hist(data, bins=50, density=True, alpha=0.7, color='steelblue')
axes[1, 0].set_title('Data Distribution')

# Box plot
box_data = [np.random.normal(i, 1, 100) for i in range(5)]
axes[1, 1].boxplot(box_data, tick_labels=[f'Model{i+1}' for i in range(5)])  # 'labels=' before Matplotlib 3.9
axes[1, 1].set_title('Model Performance Distribution')

plt.tight_layout()
plt.savefig('basic_plots.png', dpi=150, bbox_inches='tight')
plt.show()

Seaborn Statistical Visualization

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

sns.set_theme(style='whitegrid', palette='husl', font_scale=1.2)

# Sample data
df = pd.DataFrame({
    'model': np.repeat(['ResNet', 'VGG', 'EfficientNet', 'ViT'], 50),
    'accuracy': np.concatenate([
        np.random.normal(92, 2, 50),
        np.random.normal(88, 3, 50),
        np.random.normal(94, 1.5, 50),
        np.random.normal(95, 2.5, 50)
    ]),
    'params_M': np.concatenate([
        np.random.normal(25, 2, 50),
        np.random.normal(138, 5, 50),
        np.random.normal(5.3, 0.3, 50),
        np.random.normal(86, 3, 50)
    ])
})

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Violin plot
sns.violinplot(data=df, x='model', y='accuracy', ax=axes[0, 0])
axes[0, 0].set_title('Accuracy Distribution by Model')

# Heatmap (correlation)
corr_data = df[['accuracy', 'params_M']].corr()
sns.heatmap(corr_data, annot=True, fmt='.2f', cmap='RdYlGn',
            center=0, ax=axes[0, 1])
axes[0, 1].set_title('Correlation Matrix')

# Scatter with regression line
sns.regplot(data=df, x='params_M', y='accuracy',
            scatter_kws={'alpha': 0.4}, ax=axes[1, 0])
axes[1, 0].set_title('Parameters vs Accuracy')

# KDE distribution plot
for model in df['model'].unique():
    subset = df[df['model'] == model]
    sns.kdeplot(data=subset, x='accuracy', label=model, ax=axes[1, 1])
axes[1, 1].set_title('Accuracy KDE by Model')
axes[1, 1].legend()

plt.tight_layout()
plt.savefig('seaborn_plots.png', dpi=150, bbox_inches='tight')
plt.show()

Practical: Learning Curves and Confusion Matrix

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_learning_curve(train_losses, val_losses, train_accs, val_accs):
    """Visualize training and validation learning curves"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    epochs = range(1, len(train_losses) + 1)

    ax1.plot(epochs, train_losses, 'b-', label='Train Loss', linewidth=2)
    ax1.plot(epochs, val_losses, 'r--', label='Val Loss', linewidth=2)
    ax1.fill_between(epochs, train_losses, val_losses, alpha=0.1, color='gray')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.set_title('Loss Curve')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    ax2.plot(epochs, train_accs, 'b-', label='Train Accuracy', linewidth=2)
    ax2.plot(epochs, val_accs, 'r--', label='Val Accuracy', linewidth=2)
    best_epoch = np.argmax(val_accs)
    ax2.axvline(x=best_epoch + 1, color='g', linestyle=':',
                label=f'Best Epoch ({best_epoch+1})')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.set_title('Accuracy Curve')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


def plot_confusion_matrix(y_true, y_pred, class_names):
    """Visualize a confusion matrix"""
    cm = confusion_matrix(y_true, y_pred)
    cm_pct = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names, ax=ax1)
    ax1.set_title('Confusion Matrix (Counts)')
    ax1.set_ylabel('True Label')
    ax1.set_xlabel('Predicted Label')

    sns.heatmap(cm_pct, annot=True, fmt='.2%', cmap='Greens',
                xticklabels=class_names, yticklabels=class_names, ax=ax2)
    ax2.set_title('Confusion Matrix (Rates)')
    ax2.set_ylabel('True Label')
    ax2.set_xlabel('Predicted Label')

    plt.tight_layout()
    return fig

5. Machine Learning with Scikit-learn

Data Preprocessing

from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder
)
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)

# StandardScaler: mean 0, std 1
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)

# MinMaxScaler: scale to [0, 1]
min_max = MinMaxScaler(feature_range=(0, 1))
X_minmax = min_max.fit_transform(X)

# RobustScaler: uses median and IQR, robust to outliers
robust = RobustScaler()
X_robust = robust.fit_transform(X)

# LabelEncoder
le = LabelEncoder()
labels = ['cat', 'dog', 'bird', 'cat', 'dog']
encoded = le.fit_transform(labels)  # [1, 2, 0, 1, 2] (classes sorted: bird=0, cat=1, dog=2)

# OneHotEncoder
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
categories = np.array([['red'], ['green'], ['blue'], ['red']])
encoded_ohe = ohe.fit_transform(categories)

Feature Selection and Extraction

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.random.randn(200, 20)
y = (X[:, 0] + X[:, 1] + np.random.randn(200) * 0.1 > 0).astype(int)

# PCA
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# SelectKBest
selector = SelectKBest(f_classif, k=5)
X_kbest = selector.fit_transform(X, y)
selected = selector.get_support(indices=True)
print(f"Selected feature indices: {selected}")
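`RandomForestClassifier`, imported above, gives a third route: model-based feature importances (a sketch on synthetic data where only two features carry signal):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # signal lives in features 0 and 1

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importances, highest first
ranked = np.argsort(rf.feature_importances_)[::-1]
print(ranked[:5])  # features 0 and 1 should appear near the top
```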

Linear Models

from sklearn.linear_model import (
    LinearRegression, LogisticRegression, Ridge, Lasso
)
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
import numpy as np

# Regression
X_reg, y_reg = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)
print(f"Linear R2: {r2_score(y_test, lr.predict(X_test)):.4f}")

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge R2: {r2_score(y_test, ridge.predict(X_test)):.4f}")

lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
print(f"Lasso R2: {r2_score(y_test, lasso.predict(X_test)):.4f}")
print(f"Non-zero coefficients: {np.sum(lasso.coef_ != 0)}")
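`LogisticRegression`, imported above but not yet used, rounds out the linear-model family on the classification side (a short sketch):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_clf, y_clf = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_clf, y_clf, test_size=0.2,
                                          random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print(f"Logistic accuracy: {accuracy_score(y_te, clf.predict(X_te)):.4f}")

# predict_proba exposes class probabilities, useful for thresholding and ROC curves
print(clf.predict_proba(X_te[:1]).round(3))
```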

Tree-based Models

from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier
)
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

X, y = make_classification(n_samples=1000, n_features=20,
                             n_classes=2, n_informative=10,
                             random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = (model.predict(X_test) == y_test).mean()
    print(f"{name}: {acc:.4f}")
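`classification_report`, imported in the snippet above but not yet used, breaks performance down per class; a sketch with one of the models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_te, rf.predict(X_te), digits=3))
```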

Model Evaluation and Cross-Validation

from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

# GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1
)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

Pipelines

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

np.random.seed(42)
n = 500
df = pd.DataFrame({
    'age': np.random.randint(18, 70, n).astype(float),
    'income': np.random.randint(20000, 100000, n).astype(float),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
    'city': np.random.choice(['New York', 'Chicago', 'Houston', 'Phoenix'], n),
    'target': np.random.randint(0, 2, n)
})

X = df.drop('target', axis=1)
y = df['target']

numeric_features = ['age', 'income']
categorical_features = ['education', 'city']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

from sklearn.model_selection import cross_val_score
scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='accuracy')
print(f"Pipeline CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
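A fitted pipeline can be persisted and reloaded as a single artifact with `joblib` (installed alongside scikit-learn), so preprocessing and model never drift apart. A minimal sketch with a toy pipeline and the hypothetical filename `pipeline.joblib`:

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.randn(100, 4)
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

# One artifact holds both the fitted scaler and the fitted model
joblib.dump(pipe, 'pipeline.joblib')
loaded = joblib.load('pipeline.joblib')
print((loaded.predict(X) == pipe.predict(X)).all())  # True
```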

6. Python Performance Optimization

List Comprehensions vs map vs for

import time
import numpy as np

n = 1_000_000
data = list(range(n))

# For loop
start = time.time()
result_for = []
for x in data:
    result_for.append(x ** 2)
print(f"For loop: {time.time() - start:.4f}s")

# List comprehension
start = time.time()
result_lc = [x ** 2 for x in data]
print(f"List comprehension: {time.time() - start:.4f}s")

# map
start = time.time()
result_map = list(map(lambda x: x ** 2, data))
print(f"Map: {time.time() - start:.4f}s")

# NumPy vectorization
arr = np.array(data)
start = time.time()
result_np = arr ** 2
print(f"NumPy: {time.time() - start:.4f}s")
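Single-run `time.time()` measurements like the ones above are noisy; the standard-library `timeit` module repeats each statement and lets you take the best of several runs. A minimal sketch:

```python
import timeit

setup = "import numpy as np; data = list(range(100_000)); arr = np.array(data)"

# number=10 executes the statement 10 times per run; repeat=5 gives 5 runs.
# The minimum across runs is the least-noisy figure.
lc_time = min(timeit.repeat("[x ** 2 for x in data]", setup=setup,
                            number=10, repeat=5))
np_time = min(timeit.repeat("arr ** 2", setup=setup, number=10, repeat=5))

print(f"List comprehension: {lc_time:.4f}s  NumPy: {np_time:.4f}s")
```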

Generators

import sys
import numpy as np

# List vs generator memory comparison
list_comp = [x**2 for x in range(1_000_000)]
gen_expr = (x**2 for x in range(1_000_000))

print(f"List size: {sys.getsizeof(list_comp):,} bytes")   # ~8MB (the list's pointer array; the int objects themselves add more)
print(f"Generator size: {sys.getsizeof(gen_expr)} bytes")  # ~200 bytes, regardless of range length

# Generator function
def infinite_data_loader(dataset, batch_size=32):
    """Infinite data loader generator"""
    while True:
        indices = np.random.permutation(len(dataset))
        for i in range(0, len(dataset), batch_size):
            batch_indices = indices[i:i + batch_size]
            yield dataset[batch_indices]
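Because the loader above is infinite, it must be consumed lazily, for example with `itertools.islice`. A small usage sketch, assuming the dataset is a NumPy array:

```python
from itertools import islice

import numpy as np

np.random.seed(0)

def infinite_data_loader(dataset, batch_size=32):
    """Yield random batches forever, reshuffling after each full pass."""
    while True:
        indices = np.random.permutation(len(dataset))
        for i in range(0, len(dataset), batch_size):
            yield dataset[indices[i:i + batch_size]]

dataset = np.arange(100).reshape(100, 1)
loader = infinite_data_loader(dataset, batch_size=32)

# Take exactly 5 batches from the otherwise endless stream
for step, batch in enumerate(islice(loader, 5)):
    # the last batch of each pass may be smaller (100 % 32 = 4)
    print(step, batch.shape)
```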

Parallel Processing

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

def cpu_intensive_task(n):
    return sum(i**2 for i in range(n))

def io_bound_task(url):
    time.sleep(0.1)
    return f"Fetched: {url}"

# ThreadPoolExecutor: best for I/O-bound tasks
urls = [f"https://example.com/data/{i}" for i in range(20)]

start = time.time()
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(io_bound_task, urls))
print(f"ThreadPool time: {time.time() - start:.2f}s")

# ProcessPoolExecutor: best for CPU-bound tasks
# Note: on Windows/macOS (spawn start method), wrap this in an
# `if __name__ == "__main__":` guard when running as a script
numbers = [1_000_000] * 8

start = time.time()
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(cpu_intensive_task, numbers))
print(f"ProcessPool time: {time.time() - start:.2f}s")

Numba JIT Compilation

from numba import njit, prange
import numpy as np
import time

@njit(parallel=True)
def fast_matrix_norm(A):
    """Parallelized matrix norm computation"""
    n, m = A.shape
    result = 0.0
    for i in prange(n):      # outer prange runs in parallel
        for j in range(m):   # inner loop stays serial; Numba only parallelizes the outermost prange
            result += A[i, j] ** 2
    return result ** 0.5

A = np.random.randn(1000, 1000)

# Warmup (first run triggers JIT compilation)
_ = fast_matrix_norm(A)

start = time.time()
for _ in range(10):
    result = fast_matrix_norm(A)
print(f"Numba: {time.time() - start:.4f}s")

start = time.time()
for _ in range(10):
    result = np.linalg.norm(A)
print(f"NumPy: {time.time() - start:.4f}s")

7. AI/ML Utility Libraries

tqdm - Progress Bars

from tqdm import tqdm, trange
import time

# Basic usage
for i in tqdm(range(100)):
    time.sleep(0.01)

# Custom description
items = list(range(50))
for item in tqdm(items, desc='Processing', unit='sample'):
    pass

# Nested progress bars
for epoch in trange(10, desc='Epochs'):
    for batch in trange(100, desc='Batches', leave=False):
        pass

# Manual update with metrics
with tqdm(total=100, desc='Training') as pbar:
    for i in range(10):
        pbar.update(10)
        pbar.set_postfix({'loss': 0.5 - i * 0.04, 'acc': 0.7 + i * 0.02})

Weights and Biases (wandb) - Experiment Tracking

import wandb
import numpy as np

wandb.init(
    project="ml-experiment",
    name="run-001",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 100,
        "model": "ResNet50",
        "optimizer": "AdamW"
    }
)

for epoch in range(100):
    # Simulated metrics for demonstration; log your real values here
    train_loss = 1.0 - epoch * 0.009 + np.random.normal(0, 0.01)
    val_loss = 1.0 - epoch * 0.008 + np.random.normal(0, 0.02)
    train_acc = epoch * 0.009 + np.random.normal(0, 0.01)
    val_acc = epoch * 0.008 + np.random.normal(0, 0.02)

    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "train/acc": train_acc,
        "val/acc": val_acc,
        "learning_rate": 0.001 * (0.95 ** epoch)
    })

wandb.finish()

pytest - Testing

# tests/test_preprocessing.py
import pytest
import numpy as np
import pandas as pd

def normalize(x):
    return (x - x.mean()) / x.std()

class TestNormalize:
    def test_mean_zero(self):
        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        result = normalize(x)
        assert abs(result.mean()) < 1e-10

    def test_std_one(self):
        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        result = normalize(x)
        assert abs(result.std() - 1.0) < 1e-10

    def test_shape_preserved(self):
        x = np.random.randn(10, 5)
        result = normalize(x)
        assert result.shape == x.shape


@pytest.fixture
def sample_df():
    return pd.DataFrame({
        'value': [1.0, 2.0, np.nan, 4.0, 5.0],
        'label': ['a', 'b', 'c', 'd', 'e']
    })


def test_dropna(sample_df):
    result = sample_df.dropna()
    assert result.isnull().sum().sum() == 0

# Run with: pytest tests/ -v --cov  (coverage reporting requires the pytest-cov plugin)

Conclusion

This guide covered the essential Python ecosystem for AI/ML development:

  • Environment Setup: Project isolation with venv, conda, and poetry
  • NumPy: High-performance numerical computation through vectorized operations
  • Pandas: Building data preprocessing and analysis pipelines
  • Matplotlib/Seaborn: Rich visualizations to surface insights
  • Scikit-learn: Complete ML workflow from preprocessing to model evaluation
  • Performance Optimization: Generators, parallel processing, and Numba JIT
  • Utilities: tqdm, wandb, and pytest

In real-world ML projects, these tools combine to build a complete pipeline: data loading, preprocessing, feature engineering, model training, evaluation, and deployment. Consult the official documentation for each tool to go deeper.

References