Analysis of AI-related job postings (USA, 2025)

Fátima I.S — Fri, 01 Aug 2025 15:08:43 +0000

Salary imputation in AI job-board data with Random Forest

🐢 Today I want to share a project I have been working on lately.

I have set myself the goal of showing projects from time to time to improve my skills and learn new things. Besides, some feedback would be a great help to keep growing!

I know this may sound blunt, but I am in the middle of an improvement phase. Anyone who works in a technical field knows this is a constant process of evolution, right?

Context and goal

In this project I analyzed a dataset scraped from US job boards (obtained from Kaggle), focused on AI-related roles (ML engineers, data scientists, research engineers, applied scientists, among others).

Most of the roles were senior and very niche in their specializations.

The main goal was to analyze and draw conclusions about trends in AI-related jobs: the salaries, the technical skills requested in the postings and, ultimately, a panoramic view of where AI jobs are heading. But I ran into two columns that mattered for the analysis and had many nulls (close to 32%): job type (work schedule) and salary.

Challenge: imputing job type

I converted all salaries to daily pay to standardize them. For the job type, I only imputed a missing value when one job type clearly dominated for that position (at least 70% of its postings), and left the rest as unknown.

Here is the loop I used to impute by clearly dominant job type:

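# Assumes jobtype_by_position has one row per position and one column per
# job type, holding the share of postings with that job type.
dominant_jobtypes = {}
threshold = 0.7

for position, row in jobtype_by_position.iterrows():
    top_jobtype = row.idxmax()  # most frequent job type for this position
    top_freq = row.max()        # its share of the postings
    # Keep the position only when one job type clearly dominates.
    if top_freq >= threshold:
        dominant_jobtypes[position] = top_jobtype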
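
For context, the loop expects a table with the share of each job type per position. Here is a minimal sketch of how such a table could be built and how the resulting mapping could be applied. The column names `position` and `job_type`, the toy data and the `"Unknown"` label are assumptions for illustration, not the names used in the real dataset.

```python
import pandas as pd

# Toy data; the column names "position" and "job_type" are assumptions.
df = pd.DataFrame({
    "position": ["ML Engineer"] * 5 + ["Data Scientist"] * 4,
    "job_type": ["Full-time", "Full-time", "Full-time", "Full-time", None,
                 "Full-time", "Contract", "Contract", None],
})

# Share of each job type per position (each row sums to 1).
# Rows with a missing job type are left out of the counts.
jobtype_by_position = pd.crosstab(df["position"], df["job_type"], normalize="index")

threshold = 0.7
dominant_jobtypes = {
    position: row.idxmax()
    for position, row in jobtype_by_position.iterrows()
    if row.max() >= threshold
}

# Fill missing job types only where a dominant one exists;
# everything else is labeled "Unknown".
df["job_type_imputed"] = (
    df["job_type"]
    .fillna(df["position"].map(dominant_jobtypes))
    .fillna("Unknown")
)
print(df)
```

In this toy example only "ML Engineer" passes the 70% threshold, so the missing "Data Scientist" row stays unknown instead of being forced into a category.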

Challenge: imputing missing salaries

About 32% of the salaries were missing, and the rest came in varied formats (per day, per year, per contract, ranges...).
First I unified everything into a single format. I looked for the most common format so as not to manipulate the data too much and to avoid introducing too much noise. Once the values were clean, numeric and in the same format, it was time to deal with the nulls.
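
As an illustration of the unification step, here is a small sketch that converts a salary given per period into daily pay. The conversion factors (8 working hours per day, 260 working days per year), the use of the midpoint for ranges and the choice to leave unrecognized periods such as "contract" as missing are all assumptions of this sketch, not necessarily what the project does.

```python
# Working days covered by one unit of each period.
# These factors are assumptions: 8 hours/day, 5 days/week, 260 days/year.
DAYS_PER_PERIOD = {"hour": 1 / 8, "day": 1, "week": 5, "month": 260 / 12, "year": 260}


def to_daily(low, high, period):
    """Return daily pay for a salary range expressed per period.

    A single value is passed as low == high; a range is reduced to its midpoint.
    Returns None for an unrecognized period, so the value stays missing.
    """
    days = DAYS_PER_PERIOD.get(period)
    if days is None:
        return None
    midpoint = (low + high) / 2
    return midpoint / days


print(to_daily(130_000, 130_000, "year"))  # 500.0
print(to_daily(50, 70, "hour"))            # 480.0
print(to_daily(4_000, 4_000, "contract"))  # None
```
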
After weighing several options, I settled on a Random Forest model to impute the missing salaries. I chose this model because:

It handles non-linear relationships well
It is robust to outliers
It gives an approximate picture, which was enough because I was not aiming for extreme accuracy (salaries depend on many factors that were not in the dataset, such as experience or specialization)

It has other advantages as well.

Process (summarized to keep it short)

I split the data into two groups: known salaries and missing salaries.
From the known group, I held out a subset for validation.
I encoded categorical variables with ordinal encoding and target encoding, using cross-validation to avoid data leakage.
I tuned hyperparameters with RandomizedSearchCV to improve the model.
I evaluated with regression metrics: MAE, RMSE and R². Since salary is highly variable, these metrics served to check that the model captured general trends.
Finally, I imputed the missing values with the model's predictions.
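
The steps above can be sketched roughly as follows with scikit-learn. This is a simplified illustration on synthetic data, not the project's actual code: the column names, the feature set and the hyperparameter grid are assumptions, and only ordinal encoding is shown (recent scikit-learn versions also ship a `TargetEncoder` that cross-fits internally, which could stand in for the target-encoding step).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the real dataset; column names are assumptions.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "position": rng.choice(["ML Engineer", "Data Scientist", "Research Engineer"], n),
    "state": rng.choice(["CA", "NY", "TX", "WA"], n),
})
base = df["position"].map({"ML Engineer": 600, "Data Scientist": 500, "Research Engineer": 700})
df["daily_salary"] = base + rng.normal(0, 50, n)
df.loc[rng.random(n) < 0.32, "daily_salary"] = np.nan  # roughly 32% missing

features = ["position", "state"]
known = df[df["daily_salary"].notna()]
missing = df[df["daily_salary"].isna()]

# Hold out part of the known salaries for validation.
X_train, X_val, y_train, y_val = train_test_split(
    known[features], known["daily_salary"], test_size=0.2, random_state=0
)

# The encoder lives inside the pipeline, so every CV fold fits it
# on its own training part only.
pipeline = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    RandomForestRegressor(random_state=0),
)
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "randomforestregressor__n_estimators": [50, 100, 200],
        "randomforestregressor__max_depth": [None, 5, 10],
        "randomforestregressor__min_samples_leaf": [1, 3, 5],
    },
    n_iter=5, cv=3, scoring="neg_mean_absolute_error", random_state=0,
)
search.fit(X_train, y_train)

pred = search.predict(X_val)
print("MAE :", mean_absolute_error(y_val, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_val, pred)))
print("R2  :", r2_score(y_val, pred))

# Fill only the rows whose salary was missing.
df.loc[missing.index, "daily_salary"] = search.predict(missing[features])
```

Keeping the encoder inside the pipeline is one way to avoid leakage during the search, because the encoding is refit on each training fold rather than on the full table.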

What I learned in this project

This project was a great opportunity to practice:

Data cleaning and preparation
Feature engineering
Modeling with real, complex data
Handling uncertainty in imputation

Also, the topic of the dataset is very interesting and relevant for understanding the AI job market.

You can see the full project and the code in my Analisis empleos ia EE.UU, 2025.

Here are a couple of the charts I obtained for the analysis:

10 Things I Thought Before Starting Data Science (And What I Really Learned)

Fátima I.S — Wed, 23 Jul 2025 12:17:48 +0000

When I started learning Data Science, I had a lot of preconceived ideas. Some of them helped me move forward, others left me quite confused. This path is fascinating, but it's also full of noise, myths, and oversimplifications that aren't always helpful.

Today I’m sharing 10 beliefs I had at the beginning, and what I learned as I moved forward. It's not a list of “mistakes”, but rather key learning moments. Because in the end, messing up is part of growing as a data scientist.

1. `I believed I had to master everything before starting to practice`

I was paralyzed by the feeling of not knowing enough. I thought I had to master Python, statistics, visualization, **machine learning**… all of it, before tackling real projects.

What I learned is that you learn way faster by doing than by waiting to “feel ready.” Practicing from day one—even with doubts—is what actually turns you into a professional.

2. `I thought a good model was the most important thing`

I focused on algorithms, accuracy, techniques… believing that was the core of my value.

What I learned is that the model is just a part of the puzzle. Truly understanding the problem, cleaning and transforming the data with intention, and communicating insights clearly is what really makes the difference.

3. `I believed plots were just visual extras`

I made charts because “you had to,” without really thinking about their purpose.

I learned that visualization is analysis. Being able to see patterns, anomalies, correlations, or errors in charts changes your understanding of the problem entirely.

4. `I thought machine learning and AI were the same thing`

I used both terms interchangeably. They seemed like synonyms.

What I learned is that machine learning is only a part of AI. There’s symbolic AI, logic-based systems, expert systems… and now LLMs. Understanding the difference gives you valuable perspective.

5. `I believed a data scientist was just someone who could code and build models`

I focused only on the technical side, thinking that was enough.

What I learned is that you also need good judgment, **business context**, **critical thinking**, and the **ability to explain complexity with clarity**. A true data scientist connects data with decisions.

6. `I thought complex models were always better`

I was fascinated by powerful models like XGBoost or neural nets, thinking they would always beat the simple ones.

I learned that sometimes a well-thought-out regression or a decision tree can be more useful, more interpretable, and even more accurate—especially if your data doesn’t justify the complexity.

7. `I believed more data = better model`

I saw large datasets as an automatic advantage.

What I learned is that if your data is messy, biased, or irrelevant, more volume only amplifies the problems. Quality matters more than quantity.

8. `I thought machine learning was just statistics`

I believed that if I understood statistics, I was halfway there.

What I learned: you also need engineering, software skills, validation, pipelines, reproducibility… and a whole lot of things that go beyond theory.

9. `I didn’t know what data leakage was`

When I got unrealistically good results, I just celebrated them.

What I learned: if you train with information you shouldn’t have used, your model is useless. Knowing how to separate train, validation, and test isn’t a technical detail—it’s critical.
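
One common form of leakage is computing preprocessing statistics on the full dataset before splitting. Here is a minimal sketch with NumPy, using random data and an 80/20 split chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))

# Split first, then compute any statistics.
train, test = X[:80], X[80:]

# Leaky: mean/std use all rows, so test data shapes the training features.
leaky_mean, leaky_std = X.mean(axis=0), X.std(axis=0)

# Safe: mean/std come from the training rows only and are reused on the test rows.
mean, std = train.mean(axis=0), train.std(axis=0)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("train-only mean:", mean)
print("full-data mean :", leaky_mean)
```

The two means differ slightly, and that difference is exactly the information that would have leaked from the test rows into training.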

10. `I thought cross-validation always worked`

I used it by default without considering the type of data.

What I learned is that, for example, with time series data, cross-validation can lead to totally wrong conclusions. Choosing the right validation strategy for your problem is part of the craft.
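
For time series, one option is an expanding-window split, where every training set ends before its test set begins. Below is a small dependency-free sketch; the function name and parameters are hypothetical, and scikit-learn's `TimeSeriesSplit` offers a similar scheme.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


for train_idx, test_idx in expanding_window_splits(n_samples=10, n_splits=3, test_size=2):
    print(train_idx, test_idx)
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
# [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

Unlike shuffled k-fold, no fold here trains on observations that come after the ones it is evaluated on.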

The bottom line is that every one of these points helped me grow, even if they frustrated me at first. And I’m sure I’ll keep changing my mind about many things, because that’s part of deep learning too.

I’m not sharing this to give advice, but to show that even missteps—if you reflect on them—can bring you closer to your best self as a data professional.

Forem: Fátima I.S
