💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

1. Overview of Tabular Data Machine Learning

Why Do Tree-Based Models Excel at Tabular Data?

Tabular data — structured as rows and columns — represents over 80% of real-world business problems. It appears across domains including financial fraud detection, customer churn prediction, real estate pricing, and medical diagnosis.

While deep learning has revolutionized images, text, and audio, Gradient Boosted Trees still dominate tabular data competitions. Here is why:

**Strengths of Tree-Based Models:**

1. **Irregular decision boundaries**: Trees partition feature space with axis-aligned splits, naturally capturing complex non-linear relationships.

2. **Scale invariance**: Works well without feature normalization since gradient boosting only uses the rank ordering of data.

3. **Missing value handling**: XGBoost, LightGBM, and CatBoost all handle missing values internally.

4. **Categorical features**: Label encoding alone is often sufficient for good performance.

5. **Interpretability**: Feature importance and SHAP values allow explaining model decisions.

6. **Resistance to overfitting**: Ensemble methods are more robust than single models.

EDA (Exploratory Data Analysis) Strategy

Understanding your data deeply before model training is crucial.

from scipy import stats

def eda_overview(df):

"""Print comprehensive overview of a dataframe"""

print("=== Data Overview ===")

print(f"Shape: {df.shape}")

print(f"\nData Types:\n{df.dtypes}")

print(f"\nMissing Values:\n{df.isnull().sum()}")

print(f"\nMissing Ratios:\n{(df.isnull().mean() * 100).round(2)}")

print(f"\nNumeric Summary:\n{df.describe()}")

print(f"\nCategorical Feature Cardinality:")

for col in df.select_dtypes(include='object').columns:

print(f" {col}: {df[col].nunique()} unique values")

def plot_target_distribution(df, target_col):

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df[target_col], bins=50, color='steelblue', edgecolor='white')

axes[0].set_title(f'{target_col} Distribution')

stats.probplot(df[target_col].dropna(), dist="norm", plot=axes[1])

axes[1].set_title('Q-Q Plot (Normality Check)')

plt.tight_layout()

plt.show()

def detect_outliers_iqr(df, columns):

"""Detect outliers using IQR method"""

outlier_info = {}

for col in columns:

Q1 = df[col].quantile(0.25)

Q3 = df[col].quantile(0.75)

IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR

upper = Q3 + 1.5 * IQR

outliers = df[(df[col] < lower) | (df[col] > upper)]

outlier_info[col] = {

'count': len(outliers),

'ratio': len(outliers) / len(df),

'lower_bound': lower,

'upper_bound': upper

}

return pd.DataFrame(outlier_info).T

Missing Value Handling Strategies

from sklearn.impute import SimpleImputer, KNNImputer

from sklearn.experimental import enable_iterative_imputer

from sklearn.impute import IterativeImputer

def simple_imputation(df):

"""Median imputation for numeric, mode for categorical"""

numeric_cols = df.select_dtypes(include=[np.number]).columns

num_imputer = SimpleImputer(strategy='median')

df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])

cat_cols = df.select_dtypes(include='object').columns

cat_imputer = SimpleImputer(strategy='most_frequent')

df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

return df

def knn_imputation(df, n_neighbors=5):

"""KNN-based imputation (effective for small datasets)"""

numeric_cols = df.select_dtypes(include=[np.number]).columns

imputer = KNNImputer(n_neighbors=n_neighbors)

df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

return df

def mice_imputation(df, max_iter=10):

"""MICE: Multiple Imputation by Chained Equations"""

numeric_cols = df.select_dtypes(include=[np.number]).columns

imputer = IterativeImputer(max_iter=max_iter, random_state=42)

df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

return df

def add_missing_indicators(df):

"""Add binary columns indicating where values were missing"""

cols_with_missing = df.columns[df.isnull().any()].tolist()

for col in cols_with_missing:

df[f'{col}_missing'] = df[col].isnull().astype(int)

return df

2. Decision Trees

ID3 and CART Algorithms

Decision trees recursively partition data to build a tree structure. Two major algorithms exist:

**ID3 (Iterative Dichotomiser 3):**

- Uses Information Gain as the splitting criterion

- Only handles categorical features (multi-way splits)

- Only for categorical targets

**CART (Classification and Regression Trees):**

- Classification: Gini Impurity

- Regression: Mean Squared Error

- Always binary splits (two child nodes)

- Algorithm used by scikit-learn

Information Gain and Gini Impurity

**Entropy and Information Gain:**

Entropy measures impurity of a dataset. For k classes:

H(S) = -sum(p_i * log2(p_i))

Information Gain is the entropy reduction after a split:

IG(S, A) = H(S) - sum(|S_v|/|S| * H(S_v))

**Gini Impurity:**

Gini(S) = 1 - sum(p_i^2)

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

from sklearn.tree import export_text, plot_tree

def train_decision_tree(X_train, y_train, task='classification'):

if task == 'classification':

model = DecisionTreeClassifier(

criterion='gini', # 'gini' or 'entropy'

max_depth=5,

min_samples_split=10,

min_samples_leaf=5,

max_features=None,

random_state=42

)

else:

model = DecisionTreeRegressor(

criterion='squared_error',

max_depth=5,

min_samples_split=10,

min_samples_leaf=5,

random_state=42

)

model.fit(X_train, y_train)

return model

def visualize_tree(model, feature_names, class_names=None, max_depth=3):

plt.figure(figsize=(20, 10))

plot_tree(

model, feature_names=feature_names, class_names=class_names,

filled=True, rounded=True, max_depth=max_depth, fontsize=10

)

plt.title('Decision Tree Visualization')

plt.tight_layout()

plt.show()

print(export_text(model, feature_names=feature_names, max_depth=3))

3. Random Forest

Bagging and Feature Randomization

Random Forest combines **Bootstrap Aggregating (Bagging)** with **Feature Randomization**:

1. Generate bootstrap samples (sampling with replacement) from training data

2. Train an independent decision tree on each sample

3. At each split, randomly select only sqrt(n_features) features to consider

4. Aggregate predictions via averaging (regression) or majority voting (classification)

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from sklearn.inspection import permutation_importance

def train_random_forest(X_train, y_train, task='classification'):

if task == 'classification':

model = RandomForestClassifier(

n_estimators=300,

max_depth=None,

min_samples_split=2,

min_samples_leaf=1,

max_features='sqrt',

bootstrap=True,

oob_score=True,

n_jobs=-1,

random_state=42

)

else:

model = RandomForestRegressor(

n_estimators=300,

max_features='sqrt',

oob_score=True,

n_jobs=-1,

random_state=42

)

model.fit(X_train, y_train)

print(f"OOB Score: {model.oob_score_:.4f}")

return model

def plot_feature_importance(model, feature_names, top_n=20):

importances = pd.Series(

model.feature_importances_, index=feature_names

).sort_values(ascending=False)

plt.figure(figsize=(10, 8))

importances.head(top_n).plot(kind='barh', color='steelblue')

plt.title(f'Random Forest Feature Importance (Top {top_n})')

plt.xlabel('Importance')

plt.gca().invert_yaxis()

plt.tight_layout()

plt.show()

return importances

def compute_permutation_importance(model, X_val, y_val, feature_names):

result = permutation_importance(

model, X_val, y_val, n_repeats=10, random_state=42, n_jobs=-1

)

return pd.DataFrame({

'feature': feature_names,

'importance_mean': result.importances_mean,

'importance_std': result.importances_std

}).sort_values('importance_mean', ascending=False)

4. Gradient Boosting

Mathematical Principles

Gradient Boosting minimizes a loss function by sequentially adding trees in the direction of the negative gradient:

F_m(x) = F_{m-1}(x) + learning_rate * h_m(x)

Here, the stage-m tree is fitted to the pseudo-residuals:

r_i = -[dL(y_i, F(x_i)) / dF(x_i)]

For regression with MSE loss, the pseudo-residual is the target minus the current ensemble prediction, which is the ordinary residual.

For binary classification with log loss, the pseudo-residual is the target minus the sigmoid of the current ensemble score.

from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def train_sklearn_gbm(X_train, y_train, task='classification'):

if task == 'classification':

model = GradientBoostingClassifier(

n_estimators=200, learning_rate=0.1, max_depth=3,

subsample=0.8, max_features='sqrt', random_state=42

)

else:

model = GradientBoostingRegressor(

n_estimators=200, learning_rate=0.1, max_depth=3,

subsample=0.8, random_state=42

)

model.fit(X_train, y_train)

return model

5. XGBoost

XGBoost's Innovations

XGBoost (eXtreme Gradient Boosting), developed by Chen & Guestrin (2016), delivers major improvements over vanilla GBM:

1. **Regularization**: Adds L1 and L2 regularization to the objective function to prevent overfitting

2. **Second-order Taylor expansion**: More accurate tree structure search using both first and second derivatives

3. **Missing value learning**: Learns the optimal default direction for missing samples at each split

4. **Parallel processing**: Parallel feature split search (tree construction itself is sequential)

5. **Cache optimization**: Optimized memory access patterns for speed

6. **Block structure**: Compressed column storage for sparse data

Complete Hyperparameter Reference

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score

def train_xgboost_full(X, y, task='binary'):

X_train, X_val, y_train, y_val = train_test_split(

X, y, test_size=0.2, random_state=42,

stratify=y if task == 'binary' else None

)

if task == 'binary':

objective = 'binary:logistic'

eval_metric = 'auc'

elif task == 'multiclass':

objective = 'multi:softprob'

eval_metric = 'mlogloss'

else:

objective = 'reg:squarederror'

eval_metric = 'rmse'

model = xgb.XGBClassifier(

Tree parameters

n_estimators=1000,

max_depth=6, # Default: 6. Deeper = more complex

min_child_weight=1, # Minimum sum of instance weight in a leaf

gamma=0, # Minimum loss reduction for a split

max_delta_step=0, # Max delta step per tree weight (useful for imbalance)

Sampling

subsample=0.8, # Row sampling ratio per tree

colsample_bytree=0.8, # Column sampling ratio per tree

colsample_bylevel=1.0, # Column sampling per level

colsample_bynode=1.0, # Column sampling per node

Regularization

reg_alpha=0.1, # L1 regularization

reg_lambda=1.0, # L2 regularization (default: 1)

Learning

learning_rate=0.01, # Step size shrinkage (eta)

scale_pos_weight=1, # For imbalanced: neg/pos ratio

System

objective=objective,

eval_metric=eval_metric,

tree_method='hist', # 'hist' (fast), 'exact', 'approx', 'gpu_hist'

random_state=42,

n_jobs=-1,

)

model.fit(

X_train, y_train,

eval_set=[(X_train, y_train), (X_val, y_val)],

early_stopping_rounds=50,

verbose=100,

)

print(f"Best iteration: {model.best_iteration}")

print(f"Best score: {model.best_score:.4f}")

return model, X_val, y_val

def explain_with_shap(model, X_val, feature_names):

"""SHAP-based model explanation"""

explainer = shap.TreeExplainer(model)

shap_values = explainer.shap_values(X_val)

plt.figure(figsize=(10, 8))

shap.summary_plot(shap_values, X_val, feature_names=feature_names, show=False)

plt.tight_layout()

plt.show()

mean_abs_shap = pd.DataFrame({

'feature': feature_names,

'importance': np.abs(shap_values).mean(axis=0)

}).sort_values('importance', ascending=False)

return mean_abs_shap

GPU-Accelerated XGBoost

GPU training (requires NVIDIA GPU)

model_gpu = xgb.XGBClassifier(

n_estimators=1000,

tree_method='gpu_hist', # GPU histogram method

predictor='gpu_predictor', # GPU prediction

device='cuda', # XGBoost 2.0+: use device='cuda'

random_state=42

)

6. LightGBM

Leaf-wise vs Level-wise Tree Growth

LightGBM (Light Gradient Boosting Machine), developed by Microsoft, introduces critical innovations for speed and memory efficiency.

**Level-wise (traditional):**

- Splits all leaves simultaneously by depth level

- Produces balanced trees

- Wastes computation on nodes with little gain

**Leaf-wise (LightGBM):**

- Always splits the single leaf with the greatest loss reduction

- May produce highly unbalanced trees

- Achieves lower loss with the same number of leaves

- Requires `max_depth` to prevent overfitting

GOSS and EFB

**GOSS (Gradient-based One-Side Sampling):**

- Keeps all samples with large gradients (high information content)

- Randomly drops samples with small gradients (already well-learned)

- Reduces data size while preserving learning accuracy

**EFB (Exclusive Feature Bundling):**

- Bundles mutually exclusive sparse features (never both non-zero simultaneously)

- Reduces feature count, speeding up training (especially effective for sparse data)

from sklearn.model_selection import StratifiedKFold

def train_lightgbm_cv(X, y, n_folds=5):

"""Train LightGBM with Stratified K-Fold CV"""

skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

oof_preds = np.zeros(len(X))

models = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):

print(f"\n--- Fold {fold + 1}/{n_folds} ---")

X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]

y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

model = lgb.LGBMClassifier(

n_estimators=2000,

num_leaves=31, # Main complexity parameter

max_depth=-1, # -1 means unlimited (control via num_leaves)

min_child_samples=20,

learning_rate=0.05,

subsample=0.8,

subsample_freq=1,

colsample_bytree=0.8,

reg_alpha=0.1,

reg_lambda=0.1,

boosting_type='gbdt', # 'gbdt', 'dart', 'goss', 'rf'

random_state=42,

n_jobs=-1,

verbose=-1,

)

model.fit(

X_train, y_train,

eval_set=[(X_val, y_val)],

callbacks=[

lgb.early_stopping(stopping_rounds=100, verbose=True),

lgb.log_evaluation(period=200),

)

oof_preds[val_idx] = model.predict_proba(X_val)[:, 1]

models.append(model)

fold_auc = roc_auc_score(y_val, oof_preds[val_idx])

print(f"Fold {fold+1} AUC: {fold_auc:.4f}")

overall_auc = roc_auc_score(y, oof_preds)

print(f"\nOverall OOF AUC: {overall_auc:.4f}")

return models, oof_preds

def lgbm_with_categorical(X_train, y_train, X_val, y_val, cat_features):

"""LightGBM with native categorical feature handling"""

for col in cat_features:

X_train[col] = X_train[col].astype('category').cat.codes

X_val[col] = X_val[col].astype('category').cat.codes

train_data = lgb.Dataset(

X_train, y_train,

categorical_feature=cat_features,

free_raw_data=False

)

val_data = lgb.Dataset(X_val, y_val, reference=train_data)

params = {

'objective': 'binary', 'metric': 'auc',

'num_leaves': 31, 'learning_rate': 0.05, 'verbose': -1,

}

model = lgb.train(

params, train_data, num_boost_round=1000,

valid_sets=[val_data],

callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)],

)

return model

7. CatBoost

Automatic Categorical Feature Handling

CatBoost (Categorical Boosting), developed by Yandex, excels at handling categorical features natively.

**CatBoost's categorical encoding:**

1. **Target Statistics (TS)**: Encode each category using the target mean

2. **Ordered Target Statistics**: Leak-free TS using data ordering

3. **One-Hot Encoding**: Automatically applied for low-cardinality features

Ordered Boosting

CatBoost's key innovation, **Ordered Boosting**, resolves prediction shift bias that arises when computing target statistics:

- Arrange data in a random order

- Statistics for sample i are computed using only samples 0 through i-1

- Only uses previously seen samples, eliminating leakage

from catboost import CatBoostClassifier, CatBoostRegressor, Pool

def train_catboost(X_train, y_train, X_val, y_val, cat_features=None):

"""CatBoost training — can pass raw string categories directly"""

train_pool = Pool(data=X_train, label=y_train, cat_features=cat_features)

val_pool = Pool(data=X_val, label=y_val, cat_features=cat_features)

model = CatBoostClassifier(

iterations=1000,

learning_rate=0.05,

depth=6,

l2_leaf_reg=3.0,

subsample=0.8,

colsample_bylevel=0.8,

cat_features=cat_features,

one_hot_max_size=2,

boosting_type='Ordered',

bootstrap_type='Bayesian',

bagging_temperature=1.0,

loss_function='Logloss',

eval_metric='AUC',

task_type='CPU',

random_seed=42,

verbose=100,

use_best_model=True,

early_stopping_rounds=50,

)

model.fit(train_pool, eval_set=val_pool, plot=False)

print(f"Best iteration: {model.get_best_iteration()}")

print(f"Best score: {model.get_best_score()}")

return model

def compare_gbm_models(X, y):

"""Compare performance and speed across the three GBM frameworks"""

from sklearn.model_selection import cross_val_score

results = {}

start = time.time()

xgb_model = xgb.XGBClassifier(

n_estimators=300, max_depth=6, learning_rate=0.1,

subsample=0.8, colsample_bytree=0.8, random_state=42, n_jobs=-1

)

xgb_scores = cross_val_score(xgb_model, X, y, cv=5, scoring='roc_auc')

results['XGBoost'] = {

'mean_auc': xgb_scores.mean(), 'std_auc': xgb_scores.std(),

'time_sec': time.time() - start

}

start = time.time()

lgb_model = lgb.LGBMClassifier(

n_estimators=300, num_leaves=31, learning_rate=0.1,

subsample=0.8, colsample_bytree=0.8, random_state=42,

n_jobs=-1, verbose=-1

)

lgb_scores = cross_val_score(lgb_model, X, y, cv=5, scoring='roc_auc')

results['LightGBM'] = {

'mean_auc': lgb_scores.mean(), 'std_auc': lgb_scores.std(),

'time_sec': time.time() - start

}

start = time.time()

cb_model = CatBoostClassifier(

iterations=300, depth=6, learning_rate=0.1, random_seed=42, verbose=0

)

cb_scores = cross_val_score(cb_model, X, y, cv=5, scoring='roc_auc')

results['CatBoost'] = {

'mean_auc': cb_scores.mean(), 'std_auc': cb_scores.std(),

'time_sec': time.time() - start

}

result_df = pd.DataFrame(results).T

print("=== GBM Model Comparison ===")

print(result_df.to_string())

return result_df

**XGBoost vs LightGBM vs CatBoost Comparison:**

| --------------- | ---------------- | ------------------ | -------------------- |

| GPU support | Yes | Yes | Yes |

8. Feature Engineering

Numeric Feature Transformations

from sklearn.preprocessing import (

StandardScaler, MinMaxScaler, RobustScaler,

PowerTransformer, QuantileTransformer, KBinsDiscretizer

)

from scipy.stats import boxcox

def transform_numeric_features(df, columns):

"""Apply various numeric transformations"""

transformed = df.copy()

for col in columns:

1. Log transform (right-skewed, positive values)

if (df[col] > 0).all():

transformed[f'{col}_log'] = np.log1p(df[col])

2. Square root transform

if (df[col] >= 0).all():

transformed[f'{col}_sqrt'] = np.sqrt(df[col])

3. Box-Cox transform (positive values only)

if (df[col] > 0).all():

transformed[f'{col}_boxcox'], _ = boxcox(df[col] + 1)

4. Yeo-Johnson transform (handles negative values)

pt = PowerTransformer(method='yeo-johnson')

transformed[f'{col}_yeojohnson'] = pt.fit_transform(df[[col]])

5. Quantile transform (maps to normal distribution)

qt = QuantileTransformer(output_distribution='normal', n_quantiles=1000)

transformed[f'{col}_quantile'] = qt.fit_transform(df[[col]])

return transformed

def create_bins(df, col, n_bins=10, strategy='quantile'):

"""Discretize continuous variable into bins"""

kbd = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy=strategy)

df[f'{col}_bin'] = kbd.fit_transform(df[[col]]).astype(int)

return df

Categorical Encoding

from category_encoders import (

TargetEncoder, LeaveOneOutEncoder,

CatBoostEncoder, BinaryEncoder

)

def encode_categorical_features(X_train, X_val, y_train, cat_features):

"""Multiple categorical encoding strategies"""

results = {}

Target Encoding (medium-high cardinality)

Warning: can cause target leakage — only use inside CV folds

te = TargetEncoder(cols=cat_features, smoothing=1.0)

results['target_enc'] = (

te.fit_transform(X_train, y_train),

te.transform(X_val)

)

Leave-One-Out Encoding (prevents target leakage)

loo = LeaveOneOutEncoder(cols=cat_features, sigma=0.05)

results['loo_enc'] = (

loo.fit_transform(X_train, y_train),

loo.transform(X_val)

)

CatBoost Encoding (Ordered Target Statistics)

cbe = CatBoostEncoder(cols=cat_features)

results['catboost_enc'] = (

cbe.fit_transform(X_train, y_train),

cbe.transform(X_val)

)

return results

def extract_datetime_features(df, date_col):

"""Extract rich datetime features"""

df[date_col] = pd.to_datetime(df[date_col])

df[f'{date_col}_year'] = df[date_col].dt.year

df[f'{date_col}_month'] = df[date_col].dt.month

df[f'{date_col}_day'] = df[date_col].dt.day

df[f'{date_col}_dayofweek'] = df[date_col].dt.dayofweek

df[f'{date_col}_dayofyear'] = df[date_col].dt.dayofyear

df[f'{date_col}_weekofyear'] = df[date_col].dt.isocalendar().week.astype(int)

df[f'{date_col}_quarter'] = df[date_col].dt.quarter

df[f'{date_col}_hour'] = df[date_col].dt.hour

df[f'{date_col}_is_weekend'] = (df[date_col].dt.dayofweek >= 5).astype(int)

df[f'{date_col}_is_month_end'] = df[date_col].dt.is_month_end.astype(int)

Cyclical encoding (sin/cos for periodic features)

df[f'{date_col}_month_sin'] = np.sin(2 * np.pi * df[f'{date_col}_month'] / 12)

df[f'{date_col}_month_cos'] = np.cos(2 * np.pi * df[f'{date_col}_month'] / 12)

df[f'{date_col}_hour_sin'] = np.sin(2 * np.pi * df[f'{date_col}_hour'] / 24)

df[f'{date_col}_hour_cos'] = np.cos(2 * np.pi * df[f'{date_col}_hour'] / 24)

return df

def create_aggregate_features(df, group_cols, agg_cols):

"""Create group-level aggregation features"""

for group_col in group_cols:

for agg_col in agg_cols:

prefix = f'{agg_col}_by_{group_col}'

agg = df.groupby(group_col)[agg_col].agg([

'mean', 'std', 'min', 'max', 'median'

])

agg.columns = [

f'{prefix}_mean', f'{prefix}_std',

f'{prefix}_min', f'{prefix}_max', f'{prefix}_median'

]

df = df.join(agg, on=group_col)

return df

9. Ensemble Methods

Stacking (Meta-Learning)

Stacking uses predictions from base models as features to train a meta-model.

from sklearn.model_selection import StratifiedKFold

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_auc_score

class StackingEnsemble:

"""Full stacking ensemble implementation"""

def __init__(self, base_models, meta_model, n_folds=5):

self.base_models = base_models

self.meta_model = meta_model

self.n_folds = n_folds

self.trained_base_models = []

def fit(self, X, y):

meta_features_train = np.zeros((len(X), len(self.base_models)))

skf = StratifiedKFold(n_splits=self.n_folds, shuffle=True, random_state=42)

for i, (name, model) in enumerate(self.base_models):

print(f"Training base model: {name}")

fold_models = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):

if hasattr(X, 'iloc'):

X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]

y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

else:

X_tr, X_val = X[train_idx], X[val_idx]

y_tr, y_val = y[train_idx], y[val_idx]

model_clone = type(model)(**model.get_params())

model_clone.fit(X_tr, y_tr)

fold_models.append(model_clone)

if hasattr(model_clone, 'predict_proba'):

meta_features_train[val_idx, i] = model_clone.predict_proba(X_val)[:, 1]

else:

meta_features_train[val_idx, i] = model_clone.predict(X_val)

self.trained_base_models.append((name, fold_models))

self.meta_model.fit(meta_features_train, y)

return self

def predict_proba(self, X):

meta_features_test = np.zeros((len(X), len(self.base_models)))

for i, (name, fold_models) in enumerate(self.trained_base_models):

fold_preds = []

for model in fold_models:

if hasattr(model, 'predict_proba'):

fold_preds.append(model.predict_proba(X)[:, 1])

else:

fold_preds.append(model.predict(X))

meta_features_test[:, i] = np.mean(fold_preds, axis=0)

return self.meta_model.predict_proba(meta_features_test)

def weighted_blend(predictions, weights):

"""Weighted average blend of multiple model predictions"""

weights = np.array(weights) / sum(weights)

return sum(w * p for w, p in zip(weights, predictions))

10. TabNet (Deep Learning for Tabular Data)

TabNet Architecture

TabNet, published by Google in 2019, applies deep learning to tabular data using a **Sequential Attention** mechanism that dynamically selects which features to focus on at each step.

**Key components:**

1. **Feature Transformer**: Shared and step-specific layers processing selected features

2. **Attentive Transformer**: Generates sparse feature masks for each step

3. **Sequential Steps**: Multiple steps of sequential feature selection

4. **Sparse Attention**: Entropy regularization encourages sparse feature selection

from pytorch_tabnet.tab_model import TabNetClassifier, TabNetRegressor

def train_tabnet(X_train, y_train, X_val, y_val, cat_idxs=None, cat_dims=None):

"""Train TabNet with optional categorical embeddings"""

if cat_idxs is None:

cat_idxs = []

cat_dims = []

model = TabNetClassifier(

Architecture

n_d=64, # Prediction layer dimension

n_a=64, # Attention embedding dimension

n_steps=5, # Number of sequential attention steps

gamma=1.3, # Feature reusage coefficient (1.0-2.0)

n_independent=2, # Number of independent GLU layers

n_shared=2, # Number of shared GLU layers

Categorical embeddings

cat_idxs=cat_idxs,

cat_dims=cat_dims,

cat_emb_dim=1,

Regularization

lambda_sparse=1e-3, # Sparsity regularization coefficient

momentum=0.02,

epsilon=1e-15,

Optimizer

optimizer_fn=torch.optim.Adam,

optimizer_params=dict(lr=2e-2),

scheduler_params=dict(mode='min', patience=5, min_lr=1e-5, factor=0.9),

scheduler_fn=torch.optim.lr_scheduler.ReduceLROnPlateau,

mask_type='entmax', # 'sparsemax' or 'entmax'

verbose=10,

seed=42,

device_name='auto',

)

X_tr = X_train.values if hasattr(X_train, 'values') else X_train

y_tr = y_train.values if hasattr(y_train, 'values') else y_train

X_v = X_val.values if hasattr(X_val, 'values') else X_val

y_v = y_val.values if hasattr(y_val, 'values') else y_val

model.fit(

X_train=X_tr, y_train=y_tr,

eval_set=[(X_v, y_v)],

eval_name=['val'],

eval_metric=['auc'],

max_epochs=200,

patience=20,

batch_size=1024,

virtual_batch_size=128, # Ghost batch normalization size

num_workers=0,

drop_last=False,

)

return model

11. Kaggle-Level Pipeline

Cross-Validation Strategies

from sklearn.model_selection import (

StratifiedKFold, GroupKFold, StratifiedGroupKFold, TimeSeriesSplit

)

def kaggle_cv_pipeline(X, y, model, groups=None, n_folds=5):

"""Production-grade CV pipeline with leakage prevention"""

if groups is not None:

cv = StratifiedGroupKFold(n_splits=n_folds, shuffle=True, random_state=42)

split_args = (X, y, groups)

else:

cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

split_args = (X, y)

oof_predictions = np.zeros(len(X))

feature_importances = []

for fold, (train_idx, val_idx) in enumerate(cv.split(*split_args)):

if hasattr(X, 'iloc'):

X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]

y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

else:

X_tr, X_val = X[train_idx], X[val_idx]

y_tr, y_val = y[train_idx], y[val_idx]

model.fit(X_tr, y_tr)

if hasattr(model, 'predict_proba'):

oof_predictions[val_idx] = model.predict_proba(X_val)[:, 1]

else:

oof_predictions[val_idx] = model.predict(X_val)

if hasattr(model, 'feature_importances_'):

feature_importances.append(model.feature_importances_)

fold_score = roc_auc_score(y_val, oof_predictions[val_idx])

print(f"Fold {fold+1} AUC: {fold_score:.5f}")

overall_score = roc_auc_score(y, oof_predictions)

print(f"Overall OOF AUC: {overall_score:.5f}")

mean_importance = np.mean(feature_importances, axis=0) if feature_importances else None

return oof_predictions, mean_importance, overall_score

def prevent_leakage_pipeline(X_train, X_val):

"""

Leakage-free preprocessing pipeline.

Critical rule: fit only on training data, transform both train and val.

"""

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler

from sklearn.impute import SimpleImputer

pipeline = Pipeline([

('imputer', SimpleImputer(strategy='median')),

('scaler', StandardScaler()),

])

X_train_processed = pipeline.fit_transform(X_train)

X_val_processed = pipeline.transform(X_val)

return X_train_processed, X_val_processed, pipeline

12. Feature Selection

Boruta Algorithm

Boruta is a robust Random Forest-based feature selection algorithm. It creates shadow copies of all features and compares real feature importance against the best shadow feature.

from boruta import BorutaPy

from sklearn.ensemble import RandomForestClassifier

def boruta_feature_selection(X, y, max_iter=100):

rf = RandomForestClassifier(

n_estimators=200, max_depth=7, random_state=42, n_jobs=-1

)

boruta = BorutaPy(

estimator=rf, n_estimators='auto',

perc=100, alpha=0.05, max_iter=max_iter,

random_state=42, verbose=1

)

boruta.fit(X.values, y.values)

feature_ranking = pd.DataFrame({

'feature': X.columns,

'ranking': boruta.ranking_,

'selected': boruta.support_,

'tentative': boruta.support_weak_

}).sort_values('ranking')

selected = X.columns[boruta.support_].tolist()

tentative = X.columns[boruta.support_weak_].tolist()

print(f"Selected: {len(selected)}, Tentative: {len(tentative)}, "

f"Rejected: {len(X.columns) - len(selected) - len(tentative)}")

return selected, tentative, feature_ranking

def shap_feature_selection(model, X_val, threshold=0.01):

"""Select features based on mean absolute SHAP values"""

explainer = shap.TreeExplainer(model)

shap_values = explainer.shap_values(X_val)

shap_abs = np.abs(shap_values[1] if isinstance(shap_values, list) else shap_values)

mean_shap = shap_abs.mean(axis=0)

total_shap = mean_shap.sum()

feature_importance = pd.DataFrame({

'feature': X_val.columns,

'mean_abs_shap': mean_shap,

'shap_ratio': mean_shap / total_shap

}).sort_values('mean_abs_shap', ascending=False)

selected = feature_importance[feature_importance['mean_abs_shap'] > threshold * total_shap]

print(f"SHAP selection: {len(selected)} / {len(X_val.columns)} features")

return selected['feature'].tolist(), feature_importance

Complete Kaggle Pipeline Example

def complete_kaggle_pipeline(train_df, test_df, target_col, cat_features=None):

"""End-to-end Kaggle ML pipeline with ensemble"""

y = train_df[target_col]

X = train_df.drop(columns=[target_col])

Impute missing values

for col in X.select_dtypes(include=[np.number]).columns:

col_median = X[col].median()

X[col] = X[col].fillna(col_median)

test_df[col] = test_df[col].fillna(col_median)

if cat_features:

for col in cat_features:

X[col] = X[col].astype('category').cat.codes

test_df[col] = test_df[col].astype('category').cat.codes

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

xgb_oof = np.zeros(len(X))

lgb_oof = np.zeros(len(X))

cb_oof = np.zeros(len(X))

xgb_test = np.zeros(len(test_df))

lgb_test = np.zeros(len(test_df))

cb_test = np.zeros(len(test_df))

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):

X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]

y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

XGBoost

xgb_m = xgb.XGBClassifier(

n_estimators=1000, max_depth=6, learning_rate=0.05,

subsample=0.8, colsample_bytree=0.8, reg_alpha=0.1,

random_state=42, n_jobs=-1, eval_metric='auc'

)

xgb_m.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],

early_stopping_rounds=50, verbose=False)

xgb_oof[val_idx] = xgb_m.predict_proba(X_val)[:, 1]

xgb_test += xgb_m.predict_proba(test_df)[:, 1] / 5

LightGBM

lgb_m = lgb.LGBMClassifier(

n_estimators=1000, num_leaves=31, learning_rate=0.05,

subsample=0.8, colsample_bytree=0.8, random_state=42,

n_jobs=-1, verbose=-1

)

lgb_m.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],

callbacks=[lgb.early_stopping(50, verbose=False),

lgb.log_evaluation(-1)])

lgb_oof[val_idx] = lgb_m.predict_proba(X_val)[:, 1]

lgb_test += lgb_m.predict_proba(test_df)[:, 1] / 5

CatBoost

cb_m = CatBoostClassifier(

iterations=1000, depth=6, learning_rate=0.05,

random_seed=42, verbose=0, early_stopping_rounds=50

)

cb_m.fit(X_tr, y_tr, eval_set=(X_val, y_val))

cb_oof[val_idx] = cb_m.predict_proba(X_val)[:, 1]

cb_test += cb_m.predict_proba(test_df)[:, 1] / 5

print(f"XGBoost OOF AUC: {roc_auc_score(y, xgb_oof):.5f}")

print(f"LightGBM OOF AUC: {roc_auc_score(y, lgb_oof):.5f}")

print(f"CatBoost OOF AUC: {roc_auc_score(y, cb_oof):.5f}")

ensemble_oof = (xgb_oof + lgb_oof + cb_oof) / 3

ensemble_test = (xgb_test + lgb_test + cb_test) / 3

print(f"Ensemble OOF AUC: {roc_auc_score(y, ensemble_oof):.5f}")

return ensemble_test, ensemble_oof

Conclusion

This guide covered the complete pipeline for tabular data machine learning:

1. **EDA and preprocessing**: Systematic approach to understanding data and handling missing values and outliers

2. **Tree-based models**: Stepwise progression from decision trees to random forests and gradient boosting

3. **Modern GBM frameworks**: Strengths, weaknesses, and use cases for XGBoost, LightGBM, and CatBoost

4. **Feature engineering**: Translating domain knowledge into code with diverse techniques

5. **Ensembling**: Maximizing performance by combining multiple models

6. **TabNet**: Deep learning approach for tabular data

7. **Kaggle pipeline**: A complete production-level workflow

8. **Feature selection**: Removing irrelevant features to improve performance and interpretability

**Key takeaways:**

- Always start with thorough EDA to understand your data

- Design CV carefully to prevent leakage

- Ensembles almost always outperform single models

- Use SHAP values to interpret your models and combine with domain knowledge

- LightGBM excels at speed, CatBoost at categorical handling, XGBoost at stability

References

- [XGBoost Documentation](https://xgboost.readthedocs.io/)

- [LightGBM Documentation](https://lightgbm.readthedocs.io/)

- [CatBoost Documentation](https://catboost.ai/docs/)

- [SHAP GitHub](https://github.com/slundberg/shap)

- [pytorch-tabnet](https://github.com/dreamquark-ai/tabnet)

- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system.

- Ke, G., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree.

- Prokhorenkova, L., et al. (2018). CatBoost: unbiased boosting with categorical features.

- Arik, S. O., & Pfister, T. (2021). TabNet: Attentive Interpretable Tabular Learning.