Introduction to LightGBM with R

Carlos A. Torres Cubilla

What is LightGBM?

LightGBM is a gradient boosting framework that uses tree-based learning algorithms to build predictive models quickly and efficiently.

What does LightGBM mean?

🔦 Light = lightweight, fast, optimized
- It is an optimized implementation of the Gradient Boosting algorithm.
- It is designed to use less memory and run faster than alternatives such as XGBoost or Random Forests.
📈 GBM = Gradient Boosting Machine
- An ensemble learning technique that builds many weak models (usually decision trees) sequentially, where each model learns from the errors of its predecessors to produce a strong model.
- It uses the gradient of the error to improve step by step.

LightGBM reduces the error step by step by combining small decision trees.

Who created LightGBM?

Developed by Microsoft Research Asia as part of DMTK (Distributed Machine Learning Toolkit).

Released in 2016 as an open-source project.

“We created LightGBM to solve real-world problems at scale, efficiently.” (Microsoft DMTK Team)

How does Gradient Boosting work?

Gradient Boosting can be summarized in the following 5 steps:

📅 1. Initial prediction

The mean of the target variable, or the log-odds for classification.

📈 2. Compute the error

The difference between the actual value and the current prediction is computed.

🌳 3. New tree

A tree is trained on the residual errors (gradients).

⚖️ 4. Scale

The tree's output is multiplied by the learning rate to dampen the adjustment.

➕ 5. Add

The tree is added to the existing model.

🔄 The process is repeated many times, adding one new tree per step, until the maximum number of rounds (nrounds) is reached or early stopping (early_stopping_rounds) is triggered because performance on the validation set has stopped improving.

These steps can be written compactly in mathematical form.

Step 1: Initial prediction

The model starts with a constant prediction for all observations.

In regression, it is usually: \(F_0(x) = \bar{y}\)
In binary classification, it is usually: \(F_0(x) = \log\left(\frac{p}{1 - p}\right), \quad p = \frac{\text{number of class-1 observations}}{n}\)
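As a small worked example of the binary case, the snippet below computes the initial log-odds from class counts. The counts (80 positives, 100 negatives) are the ones reported in the training log of the heart-disease example in this document; the variable names are only illustrative.

```r
# Class counts as reported in the training log of the example
n_pos <- 80
n_neg <- 100

p  <- n_pos / (n_pos + n_neg)  # share of class-1 observations
f0 <- log(p / (1 - p))         # log-odds used as the constant starting score

round(p, 6)   # 0.444444
round(f0, 6)  # -0.223144
```

These values agree with the `pavg=0.444444 -> initscore=-0.223144` line that LightGBM prints in that log.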

Step 2: Compute the error or gradient

For each observation, we compute how far off the current model is:

\[ g_i^{1} = y_i - F_{0}(x_i) \]

For squared-error loss, this residual is the negative gradient of the loss function with respect to the current prediction.
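As a short check of why the residual can be treated as a gradient, assume the squared-error loss written with the conventional factor \(\tfrac{1}{2}\) (an assumption made here for convenience). Its negative derivative with respect to the prediction is exactly the residual:

\[ L(y_i, F) = \tfrac{1}{2}\,(y_i - F)^2 \quad\Rightarrow\quad -\frac{\partial L}{\partial F}\bigg|_{F = F_0(x_i)} = y_i - F_0(x_i) \]

For binary classification with log loss, where \(F\) is on the log-odds scale, the same calculation gives the label minus the predicted probability:

\[ L(y_i, F) = -\left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right], \quad p_i = \frac{1}{1 + e^{-F}} \quad\Rightarrow\quad -\frac{\partial L}{\partial F} = y_i - p_i \]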

Step 3: Train a new model on that error

A weak model is trained to correct those errors: \(h_1(x_i) \approx g_i^{1}\)

Step 4: Scale the correction with a learning rate

The size of the correction is adjusted by a learning rate \(\eta\) (for example, 0.1):

\[ \text{correction} = \eta \cdot h_1(x) \]

This dampens learning and helps prevent overfitting.

Step 5: Update the model

The correction is added to the previous accumulated model:

\[ F_1(x) = F_0(x) + \eta \cdot h_1(x) \]

This process is repeated over several iterations, so that \(F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)\), adding a new weak model \(h_m(x)\) each time, until a maximum number of rounds (nrounds) is reached or early stopping (early_stopping_rounds) is triggered.

Final result

The final prediction of the model can be written as:

\[ F_M(x) = F_0(x) + \sum_{m=1}^M \eta \cdot h_m(x) \]

Where:

\(F_M(x)\): final prediction,
\(\eta\): learning rate,
\(h_m(x)\): weak model fitted at iteration \(m\),
\(M\): total number of iterations.

The following image illustrates the step-by-step improvement of a model based on Gradient Boosting.

Each iteration improves the model a little, correcting earlier errors.
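The five steps can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions, not how LightGBM is implemented: it uses squared-error loss, a single numeric feature, synthetic data, and one-split trees ("stumps") as weak models. The helpers `fit_stump()` and `predict_stump()` are hypothetical names introduced for this sketch.

```r
# Fit a one-split regression stump on a single numeric feature (hypothetical helper).
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  best <- list(sse = Inf)
  for (thr in thresholds) {
    left <- x <= thr
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - pred)^2)
    if (sse < best$sse) {
      best <- list(sse = sse, thr = thr,
                   left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}

# Predict with a fitted stump (hypothetical helper).
predict_stump <- function(stump, x) {
  ifelse(x <= stump$thr, stump$left, stump$right)
}

set.seed(1)
x <- runif(200, 0, 10)               # synthetic feature
y <- sin(x) + rnorm(200, sd = 0.2)   # synthetic target

eta  <- 0.1                          # learning rate
M    <- 50                           # number of rounds
pred <- rep(mean(y), length(y))      # step 1: F_0(x) = mean(y)
mse  <- numeric(M)

for (m in seq_len(M)) {
  g    <- y - pred                              # step 2: residuals
  h    <- fit_stump(x, g)                       # step 3: weak model on residuals
  pred <- pred + eta * predict_stump(h, x)      # steps 4-5: scale and add
  mse[m] <- mean((y - pred)^2)                  # training error after round m
}
```

Plotting `mse` against the round number shows the training error shrinking a little with each added stump, which is the behaviour described above.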

Advantages and disadvantages

LightGBM stands out not only for its predictive performance but also for its computational efficiency. Its widespread adoption in real-world machine learning applications can be explained by five key advantages.

Advantages

Advantage	Description
⚡ Speed	Trains models quickly, which makes it well suited to projects with large volumes of data
💾 Memory	Keeps memory usage low, so large datasets can be handled with less RAM
🎯 Accuracy	Delivers accurate, competitive results on prediction tasks
🖥️ Parallelism	Uses multiple cores, clusters and GPU acceleration for faster training
📊 Scalability	Handles large volumes of data without a loss in performance

Disadvantages

Disadvantage	Description
⚠️ Overfitting	Because it is very flexible, it can memorize noise if it is not tuned properly
🔍 Hard to interpret	It is a “black box” compared with simpler models such as linear regression
📉 Problems on small datasets	Simpler models may generalize better on small datasets
🧠 Learning curve	Many hyperparameters that take experience to tune well
🔌 Partial support across environments	APIs are available, but some integrations may be limited or unstable

What is LightGBM used for?

LightGBM is a highly versatile tool that can address a wide range of problems, from traditional tasks such as classification and regression to more complex applications such as ranking, fraud detection and personalized recommender systems. This makes it a powerful, scalable option for machine learning projects in real-world settings.

Many companies around the world use LightGBM because of its speed and accuracy. Some examples of leading companies that have applied LightGBM to these use cases:

Use case	Company	Example application
Classification	Microsoft	Categorizing email and detecting spam
Regression	Uber	Estimating demand and arrival times
Ranking	LinkedIn	Ordering search results and job recommendations
Fraud detection	PayPal	Detecting suspicious transactions in real time
Recommender systems	Netflix	Personalizing content suggestions

LightGBM parameters (I)

When we train a model with lightgbm, we need to tell it how to learn. The model parameters tell LightGBM how, and how much, to learn. For a complete description of the parameters, see the official LightGBM documentation for R.

⚙️ General parameters

1. `objective`

This is the most important parameter. It tells LightGBM what type of problem you are solving so that it uses the appropriate loss function during training.

🎓 Most common `objective` values

Problem type	`objective` value	What does it do internally?
Regression	`"regression"`	Minimizes the mean squared error (MSE)
Binary classification	`"binary"`	Minimizes the log loss
Multiclass classification (3 or more classes)	`"multiclass"`	Uses softmax and log loss over several classes
Regression with outliers	`"huber"`	Uses the Huber loss, which is less sensitive to outliers
Ranking (ordering items)	`"lambdarank"`	Optimizes a ranking function

2. `metric`

Defines how model performance is measured during training. One or several metrics can be used at once, and they are printed at each iteration if verbose > 0.

🎓 Most common metrics by problem type:

🔢 Regression (`objective = "regression"`)

`metric`	What does it measure?
`"l2"`	Mean squared error (MSE)
`"rmse"`	Root mean squared error
`"mae"`	Mean absolute error
`"huber"`	Huber error (less sensitive to outliers)

✅ Binary classification (`objective = "binary"`)

`metric`	What does it measure?
`"binary_logloss"`	Log loss (predicted probabilities vs. true labels)
`"auc"`	Area under the ROC curve
`"binary_error"`	Error rate (share of misclassified 0/1 observations)
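To make the three binary metrics concrete, here is a minimal base R sketch that computes them by hand on a toy set of labels and predicted probabilities. The data are made up for illustration, and the 0.5 cut-off used for the error rate is an assumption of this sketch.

```r
y <- c(0, 0, 1, 1, 0, 1)                     # true labels (toy data)
p <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90)   # predicted probabilities (toy data)

# Log loss: average negative log-likelihood of the true labels
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))

# Error rate: share of observations misclassified at a 0.5 threshold
error_rate <- mean((p > 0.5) != y)

# AUC: share of (positive, negative) pairs in which the positive gets the
# higher score; ties count as one half
pos <- p[y == 1]
neg <- p[y == 0]
score_diff <- outer(pos, neg, "-")
auc <- mean((score_diff > 0) + 0.5 * (score_diff == 0))

error_rate  # 1 of 6 observations is misclassified
auc         # 8 of the 9 positive-negative pairs are ordered correctly
```

Note that AUC depends only on the ordering of the scores, while log loss also penalizes probabilities that are badly calibrated.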

🎯 Multiclass classification (`objective = "multiclass"`)

`metric`	What does it measure?
`"multi_logloss"`	Log loss for multiple classes
`"multi_error"`	Share of incorrect predictions
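As an illustration of how `objective` and `metric` are combined, the parameter lists below sketch one configuration per problem type. The list names are arbitrary; several metrics are given as a comma-separated string, and the multiclass case also needs `num_class` (set to 3 here purely as an example).

```r
# Regression: squared-error objective, evaluated with RMSE
params_regression <- list(
  objective = "regression",
  metric    = "rmse"
)

# Binary classification: two metrics reported at each iteration
params_binary <- list(
  objective = "binary",
  metric    = "auc,binary_logloss"
)

# Multiclass classification: the number of classes must be declared
params_multiclass <- list(
  objective = "multiclass",
  num_class = 3,
  metric    = "multi_logloss"
)
```

Any of these lists can then be passed as the `params` argument of `lgb.train()`.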

3. `boosting`

Selects the boosting algorithm used to train the model. The boosting method affects:

- ⏱️ Training speed
- 🧠 Model accuracy
- 💾 Memory usage
- 🔥 Robustness to overfitting

LightGBM lets you choose among 4 boosting methods:

Value	What does it do?	When to use it?
`"gbdt"`	Gradient Boosted Decision Trees (default)	✔️ Accurate, works well in almost every case
`"dart"`	Like GBDT, but randomly drops some trees	✔️ Good for avoiding overfitting
`"goss"`	Gradient-based One-Side Sampling: trains on a subset of the data	✔️ Faster, useful with large amounts of data
`"rf"`	Random Forest: a bagging method, not boosting in the strict sense	Only if you need random forests

4. `verbosity`

Controls how much information LightGBM prints to the console while training the model.

`verbosity`	What is printed?
🔇 < 0	Fatal errors only. Effectively silent training.
⚠️ 0	Errors and warnings.
ℹ️ 1	Informational messages about training progress (default).
🔍 > 1	Debug messages. Useful for in-depth debugging.

LightGBM parameters (II)

🌲 Tree structure

1. `num_leaves`

Sets the maximum number of leaves in each decision tree that LightGBM builds. A leaf is where a sequence of decisions ends, and it holds the final prediction for one part of the dataset. The more leaves, the more complex and flexible the tree can be.

Situation	Suggested value
Small dataset	15–31 leaves
Large dataset	64–255 leaves
Many categorical variables	Fewer leaves is better
When there is overfitting	Reduce `num_leaves`

⚠️ Important: num_leaves should be smaller than 2^max_depth to avoid overfitting.

2. `max_depth`

Controls how many levels of decisions an individual tree can have. Reducing it prevents trees from becoming too complex. With a value of -1 there is no depth limit, and the tree grows as deep as `num_leaves` and the other constraints allow.
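The relationship between `num_leaves` and `max_depth` is simple arithmetic: a binary tree of depth d has at most 2^d leaves. The sketch below checks a candidate configuration against that bound; `max_leaves_for_depth()` is a hypothetical helper written for this example, not a lightgbm function.

```r
# Hypothetical helper: maximum number of leaves of a binary tree of a given depth
max_leaves_for_depth <- function(max_depth) 2^max_depth

max_depth  <- 5
num_leaves <- 31

max_leaves_for_depth(max_depth)               # 32
num_leaves < max_leaves_for_depth(max_depth)  # TRUE, since 31 < 32
```

With `max_depth = 4` the bound drops to 16, so `num_leaves = 31` would no longer satisfy the rule.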

3. `min_data_in_leaf`

Sets the minimum number of observations (rows) that each leaf of a tree must contain. A leaf is a final group of observations that all receive the same prediction. If a leaf holds very few observations, the tree can memorize the training data (overfitting). This parameter limits how far a tree can specialize.

⚖️ Practical recommendations

Dataset	Suggested value
Small (<10,000)	10–50
Medium	100–500
Large	500 or more

LightGBM parameters (III)

🐢 Learning control

1. `learning_rate`

Defines how much the model is adjusted at each iteration (tree). If learning_rate is too large, the model learns quickly but may overshoot the optimum.

📊 Effect by value

`learning_rate`	What happens?
High (≥ 0.2)	Learns quickly, but may overfit or diverge
Medium (0.05–0.1)	Balance between speed and accuracy
Low (<0.05)	Learns slowly, but with more control and better final accuracy

⚠️ Caution:

Very high values (learning_rate > 0.3) → the model may be unstable and generalize poorly.

Very low values (learning_rate < 0.01) → the model trains very slowly and needs many more rounds to converge.

2. `nrounds`

Defines how many trees are built during training. It is not part of params: it is passed as a separate argument to lgb.train() or lgb.cv(). When a validation set and early stopping are used, it acts as an upper bound on the number of trees.

🧪 Practical recommendation

`learning_rate`	Suggested `nrounds`
`0.1`	100–500
`0.05`	500–1000
`0.01`	1000–3000

✅ Important tip:

📉 Use early stopping to avoid training longer than necessary. It stops training when the model stops improving on a validation set.

LightGBM parameters (IV)

🛑 Regularization and generalization

1. `early_stopping_round`

Stops training automatically if the model does not improve on a validation set for a given number of consecutive iterations. In R it can also be passed as the early_stopping_rounds argument of lgb.train().

⏱️ Saves time: avoids training more rounds than necessary.

❌ Reduces overfitting: training stops once validation performance no longer improves.

🎯 Improves accuracy: helps find a suitable number of trees.
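The stopping rule itself can be sketched in base R by replaying it over a vector of validation scores. This is a simplified illustration that assumes a single metric where higher is better (such as AUC); `early_stop()` is a hypothetical helper, not part of lightgbm.

```r
# Hypothetical helper: replay early stopping over validation scores
# (higher is better). Stops once `patience` rounds pass without a new best.
early_stop <- function(scores, patience) {
  best_iter <- 1
  for (i in seq_along(scores)) {
    if (scores[i] > scores[best_iter]) best_iter <- i
    if (i - best_iter >= patience) {
      return(list(best_iter = best_iter, stopped_at = i))
    }
  }
  # The rule never fired: all rounds were used
  list(best_iter = best_iter, stopped_at = NA_integer_)
}

val_auc <- c(0.80, 0.85, 0.84, 0.83, 0.82)  # toy validation scores
res <- early_stop(val_auc, patience = 2)

res$best_iter   # 2: round with the best validation score
res$stopped_at  # 4: two rounds without improvement after round 2
```

The model kept for prediction corresponds to the best round, not to the round where training stopped.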

2. `bagging_fraction`

Sets the fraction of observations (rows) used to train each tree. This reduces the variance of the model because each tree sees different data. LightGBM samples rows at random without replacement, so a given row appears at most once in the sample used for a given tree. Bagging only takes effect when `bagging_freq` is greater than 0.
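Sampling without replacement can be illustrated with base R. This sketch draws a fixed-size sample, which is a simplification: in the training log of this document's example, LightGBM reports a slightly different number of rows in each re-bagging step (165, 160, 161, ...), so its internal sampling is not fixed-size. The values 180 and 0.9 match the example's training set size and `bagging_fraction`.

```r
set.seed(123456)

n_rows           <- 180   # rows in the training set of the example
bagging_fraction <- 0.9

# Draw row indices without replacement: no row can be picked twice
bag <- sample.int(n_rows, size = round(bagging_fraction * n_rows), replace = FALSE)

length(bag)         # 162 rows go into this tree's sample
anyDuplicated(bag)  # 0: every sampled row appears exactly once
```

Repeating the draw with a different seed gives a different subset, which is what makes each tree see slightly different data.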

3. `feature_fraction`

Sets the fraction of columns (features) selected at random to build each tree. This introduces variability and reduces the correlation between trees. It is especially useful when there are many columns.

4. `min_gain_to_split`

Prevents a new split from being created unless its gain exceeds a given threshold, acting as a filter against uninformative splits. This helps keep trees simple.

5. `lambda_l1`

L1 regularization (Lasso) penalizes the sum of the absolute values of the leaf outputs. This pushes the outputs of weakly supported leaves toward zero and yields a simpler model. It tends to help when many variables carry little signal.

6. `lambda_l2`

L2 regularization (Ridge) penalizes the sum of the squared leaf outputs. This keeps predictions from becoming too extreme, which helps the stability of the model.
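The effect of both penalties on a single leaf can be sketched with the usual second-order formula for a regularized leaf value, where G and H are the sums of the gradients and hessians of the rows in the leaf. Treat this as an assumption about the general technique rather than a description of LightGBM's exact internals, which include further options not modelled here; `leaf_value()` is a hypothetical helper.

```r
# Hypothetical helper: regularized leaf output from summed gradients (G)
# and summed hessians (H) of the rows that fall in the leaf.
leaf_value <- function(G, H, lambda_l1 = 0, lambda_l2 = 0) {
  # L1: soft-threshold the gradient sum, so small sums collapse to zero
  G_shrunk <- sign(G) * max(abs(G) - lambda_l1, 0)
  # L2: enlarge the denominator, shrinking the output toward zero
  -G_shrunk / (H + lambda_l2)
}

leaf_value(G = -10, H = 5)                                # 2
leaf_value(G = -10, H = 5, lambda_l1 = 2, lambda_l2 = 5)  # 0.8
leaf_value(G = -1,  H = 5, lambda_l1 = 2)                 # 0
```

The last call shows the L1 behaviour described above: a leaf with little gradient support is set exactly to zero, while L2 only shrinks the value.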

Data for the example

Source

The data come from the public Cleveland Heart Disease Database, available in the UCI repository and used in the DrivenData competition “Machine Learning with a Heart”.

Goal

The goal of the project is to predict the presence or absence of heart disease (heart_disease_present) from basic clinical information about each patient.

heart_disease_present = 0: no heart disease present
heart_disease_present = 1: heart disease present

Variables

The data has 14 columns: patient_id is a unique, random identifier, and the remaining 13 variables contain the following information:

slope_of_peak_exercise_st_segment: slope of the ST segment of the electrocardiogram at peak exercise. It reflects the quality of blood flow to the heart; a flat or downward slope can be an indicator of heart disease.
- slope_of_peak_exercise_st_segment = 1: Upward slope (Up-sloping)
- slope_of_peak_exercise_st_segment = 2: Flat slope (Flat)
- slope_of_peak_exercise_st_segment = 3: Downward slope (Down-sloping)
thal: result of a thallium stress test that measures blood flow to the heart, with possible values normal, fixed defect and reversible defect.
- Normal: Normal blood flow. No perfusion problems detected.
- Fixed Defect: May indicate an area of the heart that no longer receives blood because of permanent damage.
- Reversible Defect: Indicates ischemia, where blood flow is compromised only during exertion.
resting_blood_pressure: resting blood pressure.
chest_pain_type: type of chest pain (4 values).
num_major_vessels: number of major vessels (0-3) colored by fluoroscopy.
fasting_blood_sugar_gt_120_mg_per_dl: fasting blood sugar > 120 mg/dl.
- fasting_blood_sugar_gt_120_mg_per_dl = 0: fasting glucose is less than or equal to 120 mg/dL
- fasting_blood_sugar_gt_120_mg_per_dl = 1: fasting glucose is greater than 120 mg/dL
resting_ekg_results: resting electrocardiographic results.
- resting_ekg_results = 0: Normal
- resting_ekg_results = 1: ST-T wave abnormality (problems with oxygen supply to the heart)
- resting_ekg_results = 2: Probable left ventricular hypertrophy
serum_cholesterol_mg_per_dl: serum cholesterol in mg/dl.
oldpeak_eq_st_depression: ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms.
sex: sex of the patient.
- sex = 0: female
- sex = 1: male
age: age in years.
max_heart_rate_achieved: maximum heart rate achieved (beats per minute).
exercise_induced_angina: presence of angina (chest pain or discomfort) induced by exercise during a cardiac stress test.
- exercise_induced_angina = 0: False
- exercise_induced_angina = 1: True

Installation

CRAN

LightGBM is available on CRAN and can be installed with the following R code.

install.packages("lightgbm", repos = "https://cran.r-project.org")

GitHub

Every version of the package that has been on CRAN is also available in the LightGBM releases, under a file name of the form lightgbm-{VERSION}-r-cran.tar.gz

devtools::install_github("microsoft/LightGBM", subdir = "R-package")

Example: libraries

Packages used: readr, dplyr, summarytools, fastDummies, lightgbm, pROC, ggplot2.

library(readr)
library(dplyr)
library(summarytools)
library(fastDummies)
library(lightgbm)
library(pROC)
library(ggplot2)

Example: reading the data

train_values <- read_csv("data/train_values.csv")
train_labels <- read_csv("data/train_labels.csv")
test_values <- read_csv("data/test_values.csv")
test_labels <- read_csv("data/test_labels.csv")

Training data

Target variable for training

Example: exploratory data analysis (EDA)

Before processing the data, it is advisable to run a simple exploration that helps us understand the information we are working with.

Target variable

It is very important to know whether the classes are imbalanced, so that we can decide what sampling is needed for training.

train_labels |>
  mutate(heart_disease_present = as.factor(heart_disease_present)) |>
  count(heart_disease_present) |>
  ggplot(aes(x = heart_disease_present, y = n, fill = heart_disease_present)) +
  geom_col() +
  labs(title = "Count of the response variable in the training set") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#00BA38", "#F8766D"))

Categorical variables

It is important to know which categorical variables are available and to look at their distribution. For variables with 3 or more categories, One Hot Encoding is recommended to make learning easier for the model.

slope_of_peak_exercise_st_segment

train_values |>
  mutate(slope_of_peak_exercise_st_segment = as.factor(slope_of_peak_exercise_st_segment)) |>
  count(slope_of_peak_exercise_st_segment) |>
  ggplot(aes(x = slope_of_peak_exercise_st_segment, y = n, fill = slope_of_peak_exercise_st_segment)) +
  geom_col() +
  labs(title = "Count by slope_of_peak_exercise_st_segment") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#00BA38", "#619CFF", "#F8766D"))

thal

train_values |>
  mutate(thal = as.factor(thal)) |>
  count(thal) |>
  ggplot(aes(x = thal, y = n, fill = thal)) +
  geom_col() +
  labs(title = "Count by thal") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5))

chest_pain_type

train_values |>
  mutate(chest_pain_type = as.factor(chest_pain_type)) |>
  count(chest_pain_type) |>
  ggplot(aes(x = chest_pain_type, y = n, fill = chest_pain_type)) +
  geom_col() +
  labs(title = "Count by chest_pain_type") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5))

fasting_blood_sugar_gt_120_mg_per_dl

train_values |>
  mutate(fasting_blood_sugar_gt_120_mg_per_dl = as.factor(fasting_blood_sugar_gt_120_mg_per_dl)) |>
  count(fasting_blood_sugar_gt_120_mg_per_dl) |>
  ggplot(aes(x = fasting_blood_sugar_gt_120_mg_per_dl, y = n, fill = fasting_blood_sugar_gt_120_mg_per_dl)) +
  geom_col() +
  labs(title = "Count by fasting_blood_sugar_gt_120_mg_per_dl") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#619CFF", "#F8766D"))

resting_ekg_results

train_values |>
  mutate(resting_ekg_results = as.factor(resting_ekg_results)) |>
  count(resting_ekg_results) |>
  ggplot(aes(x = resting_ekg_results, y = n, fill = resting_ekg_results)) +
  geom_col() +
  labs(title = "Count by resting_ekg_results") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#00BA38", "#619CFF", "#F8766D"))

sex

train_values |>
  mutate(sex = as.factor(sex)) |>
  count(sex) |>
  ggplot(aes(x = sex, y = n, fill = sex)) +
  geom_col() +
  labs(title = "Count by sex") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#F8766D", "#619CFF"))

exercise_induced_angina

train_values |>
  mutate(exercise_induced_angina = as.factor(exercise_induced_angina)) |>
  count(exercise_induced_angina) |>
  ggplot(aes(x = exercise_induced_angina, y = n, fill = exercise_induced_angina)) +
  geom_col() +
  labs(title = "Count by exercise_induced_angina") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#619CFF", "#F8766D"))

Continuous variables

Descriptive statistics

# Store the selected columns in a variable first. summarytools inspects the
# call to recover variable names, and a long inline pipe expression appears to
# trigger "the condition has length > 1" errors in recent versions of R.
continuous_vars <- train_values |> dplyr::select(
  resting_blood_pressure, 
  num_major_vessels, 
  serum_cholesterol_mg_per_dl, 
  oldpeak_eq_st_depression, 
  age, 
  max_heart_rate_achieved
)

descr(
  continuous_vars,
  stats = "common",
  transpose = TRUE,
  round.digits = 2
) |> as.data.frame()

Descriptive statistics by presence or absence of heart disease

stby(
  data = continuous_vars,
  INDICES = train_labels$heart_disease_present,
  FUN = descr,
  stats = "common",
  transpose = TRUE,
  style = "rmarkdown"
)

## ### Descriptive Statistics  
## 
## |                          &nbsp; |   Mean | Std.Dev |    Min | Median |    Max | N.Valid |
## |--------------------------------:|-------:|--------:|-------:|-------:|-------:|--------:|
## |                         **age** |  53.66 |    9.76 |  29.00 |  53.50 |  74.00 |  100.00 |
## |     **max_heart_rate_achieved** | 156.87 |   19.09 | 105.00 | 160.00 | 202.00 |  100.00 |
## |           **num_major_vessels** |   0.33 |    0.68 |   0.00 |   0.00 |   3.00 |  100.00 |
## |    **oldpeak_eq_st_depression** |   0.63 |    0.79 |   0.00 |   0.25 |   4.20 |  100.00 |
## |      **resting_blood_pressure** | 130.12 |   17.04 |  94.00 | 130.00 | 180.00 |  100.00 |
## | **serum_cholesterol_mg_per_dl** | 245.46 |   57.01 | 126.00 | 237.50 | 564.00 |  100.00 |
## 
## Table: Table continues below
## 
##  
## 
## |                          &nbsp; |      N | Pct.Valid |
## |--------------------------------:|-------:|----------:|
## |                         **age** | 100.00 |    100.00 |
## |     **max_heart_rate_achieved** | 100.00 |    100.00 |
## |           **num_major_vessels** | 100.00 |    100.00 |
## |    **oldpeak_eq_st_depression** | 100.00 |    100.00 |
## |      **resting_blood_pressure** | 100.00 |    100.00 |
## | **serum_cholesterol_mg_per_dl** | 100.00 |    100.00 |
## 
## **Group:** train_labels$heart_disease_present = 0  
## **N:** 80  
## 
## |                          &nbsp; |   Mean | Std.Dev |    Min | Median |    Max | N.Valid |
## |--------------------------------:|-------:|--------:|-------:|-------:|-------:|--------:|
## |                         **age** |  56.25 |    8.62 |  35.00 |  58.00 |  77.00 |   80.00 |
## |     **max_heart_rate_achieved** | 140.25 |   22.16 |  96.00 | 142.00 | 182.00 |   80.00 |
## |           **num_major_vessels** |   1.15 |    1.08 |   0.00 |   1.00 |   3.00 |   80.00 |
## |    **oldpeak_eq_st_depression** |   1.49 |    1.29 |   0.00 |   1.40 |   6.20 |   80.00 |
## |      **resting_blood_pressure** | 132.80 |   16.96 | 100.00 | 130.00 | 180.00 |   80.00 |
## | **serum_cholesterol_mg_per_dl** | 253.90 |   46.73 | 149.00 | 255.50 | 353.00 |   80.00 |
## 
## Table: Table continues below
## 
##  
## 
## |                          &nbsp; |     N | Pct.Valid |
## |--------------------------------:|------:|----------:|
## |                         **age** | 80.00 |    100.00 |
## |     **max_heart_rate_achieved** | 80.00 |    100.00 |
## |           **num_major_vessels** | 80.00 |    100.00 |
## |    **oldpeak_eq_st_depression** | 80.00 |    100.00 |
## |      **resting_blood_pressure** | 80.00 |    100.00 |
## | **serum_cholesterol_mg_per_dl** | 80.00 |    100.00 |

Example: data preprocessing

Preprocessing is the step before training in which the data are cleaned, transformed and encoded so that the model can learn from them effectively.

Preparing the training features

X_train <- train_values |> 
  select(-patient_id) |>             # Drop the ID column, which carries no information
  mutate(thal = as.factor(thal)) |>  # Convert 'thal' to a factor (categorical)
  dummy_cols(
    select_columns = "thal",
    remove_selected_columns = TRUE
  ) |>                               # Create dummy columns for 'thal'
  as.matrix()                        # Convert the data frame to a matrix

Preparing the target variable

y_train <- train_labels |> 
  select(heart_disease_present) |> 
  as.matrix()

The same process is repeated for the validation set

X_test <- test_values |> 
  select(-patient_id) |> 
  mutate(thal = as.factor(thal)) |> 
  dummy_cols(
    select_columns = "thal", 
    remove_selected_columns = TRUE
  ) |>
  as.matrix()

y_test <- test_labels |> 
  select(heart_disease_present) |> 
  as.matrix()
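Because the dummy columns are created separately for each dataset, a category of thal that is missing from one of them would leave the two matrices with different columns. The base R sketch below shows one possible way to guard against that; `align_columns()` is a hypothetical helper and the toy matrices and their column names are made up for illustration.

```r
# Hypothetical helper: give `test` exactly the columns of `train`, in the same
# order, filling columns that are absent from `test` with zeros.
align_columns <- function(train, test) {
  missing_cols <- setdiff(colnames(train), colnames(test))
  if (length(missing_cols) > 0) {
    filler <- matrix(0, nrow = nrow(test), ncol = length(missing_cols),
                     dimnames = list(NULL, missing_cols))
    test <- cbind(test, filler)
  }
  test[, colnames(train), drop = FALSE]
}

# Toy matrices: the test set lacks one dummy column and has a different order
train_m <- matrix(0, nrow = 2, ncol = 3,
                  dimnames = list(NULL, c("age", "thal_normal", "thal_fixed_defect")))
test_m  <- matrix(1, nrow = 2, ncol = 2,
                  dimnames = list(NULL, c("thal_normal", "age")))

aligned <- align_columns(train_m, test_m)
colnames(aligned)  # same names and order as train_m
```

In the example above the call would be `X_test <- align_columns(X_train, X_test)`, applied before the labels and datasets are created.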

Creating the LightGBM datasets

dtrain <- lgb.Dataset(data = X_train, label = y_train)
dtest  <- lgb.Dataset(data = X_test, label = y_test)

Example: training

params <- list(
  objective = "binary",
  metric = "auc",
  boosting = "gbdt",
  verbosity = 3,
  num_leaves = 31,
  learning_rate = 0.05,
  bagging_fraction = 0.9,
  feature_fraction = 0.9,
  bagging_freq = 1,
  seed = 123456,
  bagging_seed = 123456,
  feature_fraction_seed = 123456
)

model <- lgb.train(
  params = params,
  nrounds = 100,
  data = dtrain,
  valids = list(train = dtrain, test = dtest),
  early_stopping_rounds = 10,
  record = TRUE
)

## [LightGBM] [Info] Number of positive: 80, number of negative: 100
## [LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.761111
## [LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.250000
## [LightGBM] [Debug] init for col-wise cost 0.001227 seconds, init for row-wise cost 0.001399 seconds
## [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002406 seconds.
## You can set `force_row_wise=true` to remove the overhead.
## And if memory is not enough, you can set `force_col_wise=true`.
## [LightGBM] [Debug] Using Sparse Multi-Val Bin
## [LightGBM] [Info] Total Bins 201
## [LightGBM] [Info] Number of data points in the train set: 180, number of used features: 14
## [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.444444 -> initscore=-0.223144
## [LightGBM] [Info] Start training from score -0.223144
## [LightGBM] [Debug] Re-bagging, using 165 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [1]:  train's auc:0.864  test's auc:0.85375 
## Will train until there is no improvement in 10 rounds.
## [LightGBM] [Debug] Re-bagging, using 160 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [2]:  train's auc:0.892563  test's auc:0.89775 
## [LightGBM] [Debug] Re-bagging, using 161 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [3]:  train's auc:0.902937  test's auc:0.8995 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [4]:  train's auc:0.907813  test's auc:0.892 
## [LightGBM] [Debug] Re-bagging, using 167 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [5]:  train's auc:0.91025  test's auc:0.894 
## [LightGBM] [Debug] Re-bagging, using 163 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [6]:  train's auc:0.91825  test's auc:0.9035 
## [LightGBM] [Debug] Re-bagging, using 165 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [7]:  train's auc:0.92475  test's auc:0.91275 
## [LightGBM] [Debug] Re-bagging, using 173 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [8]:  train's auc:0.928125  test's auc:0.91 
## [LightGBM] [Debug] Re-bagging, using 161 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [9]:  train's auc:0.9245  test's auc:0.909 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [10]:  train's auc:0.92775  test's auc:0.9145 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [11]:  train's auc:0.927625  test's auc:0.912 
## [LightGBM] [Debug] Re-bagging, using 166 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [12]:  train's auc:0.933125  test's auc:0.9155 
## [LightGBM] [Debug] Re-bagging, using 162 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [13]:  train's auc:0.935937  test's auc:0.9175 
## [LightGBM] [Debug] Re-bagging, using 158 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [14]:  train's auc:0.940625  test's auc:0.91775 
## [LightGBM] [Debug] Re-bagging, using 162 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [15]:  train's auc:0.942  test's auc:0.91825 
## [LightGBM] [Debug] Re-bagging, using 163 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [16]:  train's auc:0.943937  test's auc:0.91875 
## [LightGBM] [Debug] Re-bagging, using 157 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [17]:  train's auc:0.945  test's auc:0.91925 
## [LightGBM] [Debug] Re-bagging, using 161 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [18]:  train's auc:0.94575  test's auc:0.92025 
## [LightGBM] [Debug] Re-bagging, using 166 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 3
## [19]:  train's auc:0.945375  test's auc:0.91775 
## [LightGBM] [Debug] Re-bagging, using 158 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [20]:  train's auc:0.9445  test's auc:0.9195 
## [LightGBM] [Debug] Re-bagging, using 157 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [21]:  train's auc:0.945562  test's auc:0.9205 
## [LightGBM] [Debug] Re-bagging, using 151 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 5 and depth = 3
## [22]:  train's auc:0.948063  test's auc:0.9255 
## [LightGBM] [Debug] Re-bagging, using 161 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [23]:  train's auc:0.948688  test's auc:0.9235 
## [LightGBM] [Debug] Re-bagging, using 170 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [24]:  train's auc:0.951063  test's auc:0.925 
## [LightGBM] [Debug] Re-bagging, using 163 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 3
## [25]:  train's auc:0.951688  test's auc:0.9265 
## [LightGBM] [Debug] Re-bagging, using 158 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 5 and depth = 3
## [26]:  train's auc:0.951313  test's auc:0.9255 
## [LightGBM] [Debug] Re-bagging, using 167 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 3
## [27]:  train's auc:0.951812  test's auc:0.925 
## [LightGBM] [Debug] Re-bagging, using 161 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [28]:  train's auc:0.951562  test's auc:0.924 
## [LightGBM] [Debug] Re-bagging, using 163 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [29]:  train's auc:0.952812  test's auc:0.9295 
## [LightGBM] [Debug] Re-bagging, using 159 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [30]:  train's auc:0.953313  test's auc:0.93 
## [LightGBM] [Debug] Re-bagging, using 166 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [31]:  train's auc:0.954562  test's auc:0.9335 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [32]:  train's auc:0.954812  test's auc:0.9335 
## [LightGBM] [Debug] Re-bagging, using 165 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 3
## [33]:  train's auc:0.95675  test's auc:0.9345 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 5
## [34]:  train's auc:0.95775  test's auc:0.9325 
## [LightGBM] [Debug] Re-bagging, using 162 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [35]:  train's auc:0.95975  test's auc:0.9345 
## [LightGBM] [Debug] Re-bagging, using 168 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [36]:  train's auc:0.96175  test's auc:0.9355 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [37]:  train's auc:0.96125  test's auc:0.9355 
## [LightGBM] [Debug] Re-bagging, using 159 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 3
## [38]:  train's auc:0.9625  test's auc:0.936 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [39]:  train's auc:0.9625  test's auc:0.936 
## [LightGBM] [Debug] Re-bagging, using 170 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [40]:  train's auc:0.9625  test's auc:0.937 
## [LightGBM] [Debug] Re-bagging, using 160 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [41]:  train's auc:0.963625  test's auc:0.9395 
## [LightGBM] [Debug] Re-bagging, using 159 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [42]:  train's auc:0.965625  test's auc:0.939 
## [LightGBM] [Debug] Re-bagging, using 165 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [43]:  train's auc:0.9665  test's auc:0.938 
## [LightGBM] [Debug] Re-bagging, using 160 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [44]:  train's auc:0.967875  test's auc:0.939 
## [LightGBM] [Debug] Re-bagging, using 168 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [45]:  train's auc:0.967875  test's auc:0.938 
## [LightGBM] [Debug] Re-bagging, using 158 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 5 and depth = 3
## [46]:  train's auc:0.967125  test's auc:0.9385 
## [LightGBM] [Debug] Re-bagging, using 160 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 3
## [47]:  train's auc:0.96875  test's auc:0.941 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 5
## [48]:  train's auc:0.9695  test's auc:0.942 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 3
## [49]:  train's auc:0.968875  test's auc:0.9405 
## [LightGBM] [Debug] Re-bagging, using 166 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 3
## [50]:  train's auc:0.97025  test's auc:0.9415 
## [LightGBM] [Debug] Re-bagging, using 158 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [51]:  train's auc:0.97075  test's auc:0.942 
## [LightGBM] [Debug] Re-bagging, using 156 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 5 and depth = 3
## [52]:  train's auc:0.97075  test's auc:0.9415 
## [LightGBM] [Debug] Re-bagging, using 164 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [53]:  train's auc:0.972125  test's auc:0.9425 
## [LightGBM] [Debug] Re-bagging, using 163 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [54]:  train's auc:0.97175  test's auc:0.943 
## [LightGBM] [Debug] Re-bagging, using 163 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [55]:  train's auc:0.971875  test's auc:0.942 
## [LightGBM] [Debug] Re-bagging, using 166 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3
## [56]:  train's auc:0.97275  test's auc:0.942 
## [LightGBM] [Debug] Re-bagging, using 161 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [57]:  train's auc:0.97375  test's auc:0.9405 
## [LightGBM] [Debug] Re-bagging, using 165 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [58]:  train's auc:0.973375  test's auc:0.9405 
## [LightGBM] [Debug] Re-bagging, using 163 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 3
## [59]:  train's auc:0.973875  test's auc:0.939 
## [LightGBM] [Debug] Re-bagging, using 165 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [60]:  train's auc:0.975375  test's auc:0.936 
## [LightGBM] [Debug] Re-bagging, using 162 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [61]:  train's auc:0.9765  test's auc:0.937 
## [LightGBM] [Debug] Re-bagging, using 160 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 7 and depth = 4
## [62]:  train's auc:0.97675  test's auc:0.9375 
## [LightGBM] [Debug] Re-bagging, using 161 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [63]:  train's auc:0.978  test's auc:0.9345 
## [LightGBM] [Debug] Re-bagging, using 163 data to train
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 4
## [64]:  train's auc:0.97775  test's auc:0.937 
## Early stopping, best iteration is: [54]:  train's auc:0.97175  test's auc:0.943
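
The last log line can be checked by hand. The following is a minimal sketch that uses the test AUC values copied from rounds 50 to 64 of the log above. The 10-round patience is inferred from the log (best round 54, training stopped at round 64) rather than stated in it, so treat it as an assumption.

```r
# Test AUC for rounds 50..64, copied from the training log above
iters    <- 50:64
test_auc <- c(0.9415, 0.942, 0.9415, 0.9425, 0.943, 0.942, 0.942, 0.9405,
              0.9405, 0.939, 0.936, 0.937, 0.9375, 0.9345, 0.937)

# Best round in this window: the first position where the test AUC peaks
best_iter <- iters[which.max(test_auc)]

# Rounds trained after the best one without improving on it
rounds_without_gain <- max(iters) - best_iter

cat("best iteration:", best_iter, "\n")
cat("rounds without improvement:", rounds_without_gain, "\n")
```

On the fitted booster, the same information should be available as model$best_iter and model$best_score.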

Example: computing the Gini

The Gini index is a performance metric for binary classification models that measures how well the model separates the two classes. It ranges from 0 for a random classifier (AUC = 0.5) to 1 for perfect separation (AUC = 1).

It is computed from the AUC of the ROC curve:

\[ Gini = 2 \times AUC - 1 \]
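
As a quick check of the formula, here is a small sketch that plugs in the AUC values reported at the best iteration [54] of the training log. It gives a test Gini of 0.886 and a train Gini of 0.9435.

```r
# AUC values at the best iteration [54], copied from the training log
auc_best <- c(train = 0.97175, test = 0.943)

# Gini = 2 * AUC - 1, applied element-wise
gini_best <- 2 * auc_best - 1
print(gini_best)
```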

# Compute the Gini per round from the recorded AUC values
auc_train <- model$record_evals$train$auc$eval
auc_test <- model$record_evals$test$auc$eval
gini_train <- 2 * unlist(auc_train) - 1
gini_test <- 2 * unlist(auc_test) - 1

# Data frame with the Gini values
gini_df <- data.frame(
  iter = seq_along(gini_train),
  train = gini_train,
  test = gini_test
)

# Plot the Gini by round
ggplot(gini_df, aes(x = iter)) +
  ylim(0.5, 1) +
  geom_line(aes(y = train, color = "Train"), size = 1.2) +
  geom_line(aes(y = test, color = "Test"), size = 1.2) +
  scale_color_manual(values = c("Train" = "#1f77b4", "Test" = "#ff7f0e")) +
  labs(
    title = "Gini by boosting round",
    x = "Iteration",
    y = "Gini",
    color = "Dataset"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Example: ROC curve

The ROC (Receiver Operating Characteristic) curve shows how well a binary model distinguishes the positive class from the negative class across all classification thresholds.

🔍 What does the ROC curve represent?

Y axis: Sensitivity (True Positive Rate) → What share of the actual positives do we detect?
X axis: 1 - Specificity (False Positive Rate) → What share of the actual negatives do we misclassify as positive?
Diagonal: Random classifier (no predictive power)

The closer the curve gets to the top-left corner, the better the model.
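
To make the two axes concrete, here is a minimal base R sketch that builds the ROC points by sweeping a threshold over the scores and then integrates the curve with the trapezoidal rule. The labels and scores are hypothetical toy values, not taken from the heart-disease data.

```r
# Toy labels and scores (hypothetical values, for illustration only)
y      <- c(1, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

# Sweep thresholds from high to low; predict positive when score >= threshold
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) sum(scores >= t & y == 1) / sum(y == 1))
fpr <- sapply(thresholds, function(t) sum(scores >= t & y == 0) / sum(y == 0))

# Area under the curve via the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

print(data.frame(threshold = thresholds, fpr = fpr, tpr = tpr))
print(auc)
```

The resulting AUC equals the share of (positive, negative) pairs in which the positive case gets the higher score, here 8 out of 9.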

# Compute the ROC curves (roc() comes from the pROC package)
roc_train <- roc(y_train, predict(model, X_train))
roc_test  <- roc(y_test, predict(model, X_test))

# Data frames with the train and test ROC curves
df_train <- data.frame(
  tpr = rev(roc_train$sensitivities),
  fpr = rev(1 - roc_train$specificities),
  dataset = "Train"
)
df_test <- data.frame(
  tpr = rev(roc_test$sensitivities),
  fpr = rev(1 - roc_test$specificities),
  dataset = "Test"
)

# Combine both data frames
roc_df <- rbind(df_train, df_test)

# ROC curve plot
ggplot(roc_df, aes(x = fpr, y = tpr, color = dataset)) +
  geom_line(size = 1.2) +
  scale_color_manual(values = c("Train" = "#1f77b4", "Test" = "#ff7f0e")) +
  geom_abline(linetype = "dashed", color = "gray") +
  labs(
    title = "ROC curve",
    x = "1 - Specificity",
    y = "Sensitivity"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Example: confusion matrix

The confusion matrix shows how the model's predictions compare with the actual values. The diagonal cells hold the correct predictions. The off-diagonal cells show how many false positives and false negatives there are, which matters when the two kinds of error have different costs.

\[ \begin{array}{c|c|c} \textbf{Actual} \backslash \textbf{Predicted} & \textbf{Positive} & \textbf{Negative} \\ \hline \textbf{Positive} & \begin{array}{c}\text{True}\\ \text{Positives (TP)}\end{array} & \begin{array}{c}\text{False}\\ \text{Negatives (FN)}\end{array} \\ \hline \textbf{Negative} & \begin{array}{c}\text{False}\\ \text{Positives (FP)}\end{array} & \begin{array}{c}\text{True}\\ \text{Negatives (TN)}\end{array} \end{array} \]

# Test set: confusion matrix as a data frame (probabilities rounded to 0/1)
df_conf_matrix_test  <- table(
    Predicted = predict(model, X_test) |> round(),
    Actual = y_test
  ) |>
  as.data.frame() |>
  mutate(
    Predicted = factor(Predicted, levels = c("1", "0")),
    Actual = factor(Actual, levels = c("0", "1"))
  )

# Plot the test set confusion matrix
ggplot(df_conf_matrix_test, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 6) +
  scale_fill_gradient(low = "white", high = "#badb33") +
  labs(title = "Confusion matrix", x = "Actual value", y = "Prediction") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
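
Rounding the predicted probabilities amounts to a cutoff of about 0.5. When false negatives and false positives have different costs, the cutoff can be moved. The sketch below uses hypothetical labels and probabilities, and a hypothetical helper conf_at(), to show how a lower cutoff trades false negatives for false positives.

```r
# Toy labels and predicted probabilities (hypothetical values)
y_true <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob   <- c(0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.45, 0.55)

# Hypothetical helper: confusion matrix at a given cutoff,
# with actual values in rows and predictions in columns
conf_at <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  table(
    Actual    = factor(y_true, levels = c(1, 0)),
    Predicted = factor(pred, levels = c(1, 0))
  )
}

cm_default <- conf_at(0.5)   # standard cutoff
cm_low     <- conf_at(0.35)  # lower cutoff: fewer FN, more FP

print(cm_default)
print(cm_low)
```

In this toy data, lowering the cutoff from 0.5 to 0.35 removes the single false negative and adds one false positive.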

Example: classification metrics

The metrics are computed from the confusion matrix, built much like the one on the previous slide.

# Train
conf_matrix_train <- table(
  Actual    = y_train,
  Predicted = predict(model, X_train) |> round()
) |> as.matrix()

# Test
conf_matrix_test <- table(
  Actual    = y_test,
  Predicted = predict(model, X_test) |> round()
) |> as.matrix()

Both tables follow the layout below, with actual values in rows and predictions in columns. R's table() lists level "0" before "1", so it is safest to index cells by name (for example ["1", "1"] for TP) rather than by position.

\[ \begin{array}{c|c|c} \textbf{Actual} \backslash \textbf{Predicted} & \textbf{Positive} & \textbf{Negative} \\ \hline \textbf{Positive} & \begin{array}{c}\text{True}\\ \text{Positives (TP)}\end{array} & \begin{array}{c}\text{False}\\ \text{Negatives (FN)}\end{array} \\ \hline \textbf{Negative} & \begin{array}{c}\text{False}\\ \text{Positives (FP)}\end{array} & \begin{array}{c}\text{True}\\ \text{Negatives (TN)}\end{array} \end{array} \]

Accuracy

Measures the share of correct predictions, both positive and negative.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

accuracy_train <- sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
accuracy_test  <- sum(diag(conf_matrix_test))  / sum(conf_matrix_test)

Precision

Of all the predicted positives, the share that were actually positive.

\[ \text{Precision} = \frac{TP}{TP + FP} \]

# Rows are Actual and columns are Predicted, so TP + FP is the "1" column
precision_train <- conf_matrix_train["1", "1"] / sum(conf_matrix_train[, "1"])
precision_test  <- conf_matrix_test["1", "1"]  / sum(conf_matrix_test[, "1"])

Recall

Of all the actual positives, the share that were detected correctly.

\[ \text{Recall} = \frac{TP}{TP + FN} \]

# TP + FN is the "1" row (all actual positives)
recall_train <- conf_matrix_train["1", "1"] / sum(conf_matrix_train["1", ])
recall_test  <- conf_matrix_test["1", "1"]  / sum(conf_matrix_test["1", ])

F1-Score

Balances Precision and Recall: it is their harmonic mean.

\[ \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

f1_train <- 2 * precision_train * recall_train / (precision_train + recall_train)
f1_test  <- 2 * precision_test * recall_test / (precision_test + recall_test)

Results

metrics <- data.frame(
  Dataset = c("Train", "Test"),
  Accuracy = c(
    round(accuracy_train * 100, 2), 
    round(accuracy_test * 100, 2)
  ),
  Precision = c(
    round(precision_train * 100, 2), 
    round(precision_test * 100, 2)
  ),
  Recall = c(
    round(recall_train * 100, 2), 
    round(recall_test * 100, 2)
  ),
  F1 = c(
    round(f1_train, 4), 
    round(f1_test, 4)
  )
)

Evaluation metrics (Accuracy, Precision and Recall in %; F1 on a 0 to 1 scale)
Dataset	Accuracy	Precision	Recall	F1
Train	88.89	89.47	85.0	0.8718
Test	85.56	88.57	77.5	0.8267
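
With actual values in rows and predictions in columns (the layout produced by table(Actual = ..., Predicted = ...)), precision reads down the predicted-positive column and recall reads across the actual-positive row. The sketch below wraps the four formulas in a hypothetical helper, classification_metrics(), and applies it to made-up counts chosen only for illustration.

```r
# Hypothetical confusion matrix: rows = Actual, columns = Predicted
conf <- matrix(
  c(50, 10,   # Actual 0: TN = 50, FP = 10
     5, 35),  # Actual 1: FN = 5,  TP = 35
  nrow = 2, byrow = TRUE,
  dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1"))
)

# Hypothetical helper: metrics from a named 2x2 confusion matrix
classification_metrics <- function(conf, positive = "1", negative = "0") {
  tp <- conf[positive, positive]
  fp <- conf[negative, positive]  # actual negative, predicted positive
  fn <- conf[positive, negative]  # actual positive, predicted negative
  tn <- conf[negative, negative]
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(
    accuracy  = (tp + tn) / sum(conf),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall)
  )
}

m <- classification_metrics(conf)
print(round(m, 4))
```

Indexing the cells by name keeps the result independent of the order in which the levels appear in the table.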

Example: feature importance

The feature importance table in lightgbm has three columns, each reported as a share that sums to 1 across features:

Gain: the improvement in the objective function (reduction in loss) obtained from the splits that use the feature. In simple terms, it indicates how much the feature helps improve the model.
Cover: the share of observations that fall in nodes where the feature was used to split. That is, how much of the dataset the feature "covers" during training.
Frequency: the relative number of times the feature was used to split across all the trees in the model. A higher frequency means the model relied on the feature more often.
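
The three columns can be illustrated with a conceptual base R sketch. It assumes each split is recorded with the feature used, the gain of the split, and the number of observations in the split node; the splits data frame below is entirely hypothetical, and this is not meant as the library's exact implementation.

```r
# Hypothetical split records: one row per internal node across all trees
splits <- data.frame(
  feature = c("age", "sex", "age", "chest_pain_type", "chest_pain_type", "age"),
  gain    = c(4, 1, 2, 6, 5, 2),
  count   = c(100, 40, 60, 100, 80, 20)
)

# Aggregate per feature: total gain, total observations, number of splits
gain  <- tapply(splits$gain,  splits$feature, sum)
cover <- tapply(splits$count, splits$feature, sum)
freq  <- tapply(splits$gain,  splits$feature, length)

# Normalize each column so that it sums to 1 across features
importance <- data.frame(
  Feature   = names(gain),
  Gain      = as.vector(gain / sum(gain)),
  Cover     = as.vector(cover / sum(cover)),
  Frequency = as.vector(freq / sum(freq))
)
importance <- importance[order(-importance$Gain), ]
print(importance)
```

In this toy example age is used most often (highest Frequency) while chest_pain_type contributes the most Gain, which is why the three columns can rank features differently.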

lgb.importance(model)

##                               Feature       Gain      Cover  Frequency
##                                <char>      <num>      <num>      <num>
##  1:                   chest_pain_type 0.20326560 0.12694120 0.09154930
##  2:            thal_reversible_defect 0.14266360 0.11227208 0.07746479
##  3:                       thal_normal 0.13405734 0.05580824 0.03169014
##  4:                 num_major_vessels 0.12166727 0.13943864 0.11267606
##  5:          oldpeak_eq_st_depression 0.10139065 0.11903299 0.13380282
##  6:                               age 0.08316405 0.10301168 0.12676056
##  7:                               sex 0.04187116 0.06896128 0.07042254
##  8:       serum_cholesterol_mg_per_dl 0.03798092 0.06998566 0.09507042
##  9:           exercise_induced_angina 0.03763722 0.05367753 0.04225352
## 10: slope_of_peak_exercise_st_segment 0.03220603 0.03843475 0.04225352
## 11:           max_heart_rate_achieved 0.02957810 0.05502971 0.08450704
## 12:               resting_ekg_results 0.02173702 0.02614218 0.03873239
## 13:            resting_blood_pressure 0.01278105 0.03126409 0.05281690

LightGBM can also plot the feature importance.

lgb.importance(model) |> 
  lgb.plot.importance(top_n = 20)

Example: final considerations

Saving the trained model

Once the model is trained, it is a good idea to save it so the training process does not have to be repeated. This is useful when the model will be reused for future predictions.

lgb.save(model, "modelo_lgbm.txt")

Loading a previously saved model

A saved model can be loaded back much like a data frame is read from a file. This is especially useful for using the model in other scripts or sessions, or after restarting R.

modelo_cargado <- lgb.load("modelo_lgbm.txt")

## [LightGBM] [Debug] Skipped header "tree" in file modelo_lgbm.txt
## [LightGBM] [Debug] Skipped header "tree" in file modelo_lgbm.txt
## [LightGBM] [Debug] Time for loading model: 0.006596 seconds

Verifying the loaded model

We can check that the loaded model produces the same predictions as the original.

all.equal(
  predict(model, X_test),
  predict(modelo_cargado, X_test)
)

## [1] TRUE
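
all.equal() compares numeric vectors up to a small tolerance, which suits predictions that went through a save and load cycle. A minimal sketch with hypothetical prediction vectors shows the difference from identical(), which requires an exact match:

```r
# Hypothetical predictions: the second vector carries tiny floating-point noise
pred_original <- c(0.12, 0.87, 0.45)
pred_reloaded <- pred_original + 1e-10

# all.equal() returns TRUE or a character description of the differences,
# so wrap it in isTRUE() when a single logical value is needed
same_within_tolerance <- isTRUE(all.equal(pred_original, pred_reloaded))

# identical() requires an exact match
exactly_identical <- identical(pred_original, pred_reloaded)

print(c(all.equal = same_within_tolerance, identical = exactly_identical))
```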