<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nuria</title>
    <description>The latest articles on Forem by Nuria (@nuriadevs).</description>
    <link>https://forem.com/nuriadevs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F456259%2F75cec192-2d28-45b5-a9f6-aa76f17b5db7.webp</url>
      <title>Forem: Nuria</title>
      <link>https://forem.com/nuriadevs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nuriadevs"/>
    <language>en</language>
    <item>
      <title>Data Analysis with Python: Spotify Songs Dataset</title>
      <dc:creator>Nuria</dc:creator>
      <pubDate>Tue, 03 Feb 2026 19:41:32 +0000</pubDate>
      <link>https://forem.com/nuriadevs/data-analysis-with-python-spotify-songs-dataset-3n4p</link>
      <guid>https://forem.com/nuriadevs/data-analysis-with-python-spotify-songs-dataset-3n4p</guid>
      <description>&lt;h2&gt;
  
  
  Data Analysis with Python: Spotify Songs Dataset
&lt;/h2&gt;

&lt;p&gt;Within the field of &lt;strong&gt;data science&lt;/strong&gt;, loading data and exploratory data analysis are some of the tasks you can perform on a &lt;strong&gt;dataset&lt;/strong&gt;. Depending on the information you need to obtain, you'll have to carry out additional tasks.&lt;/p&gt;

&lt;p&gt;Before starting a data analysis, it helps to know the steps to follow. The following list shows them in the order they are carried out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Loading data (&lt;strong&gt;dataset&lt;/strong&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exploratory data analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data preparation and preprocessing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data visualization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine learning model generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine learning model training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Predictive model definition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluation of the trained model with reserved data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the exercise below, I only want to obtain information about Spotify songs. This is a brief analysis written in Python; if you want to see the complete exercise, you can download it from the repository on GitHub.&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚠️ Before Starting
&lt;/h3&gt;

&lt;p&gt;Before starting a data analysis, it's very important to define the information you need to obtain, because without a clear objective, you won't have a starting point.&lt;/p&gt;




&lt;h3&gt;
  
  
  Loading the Dataset
&lt;/h3&gt;

&lt;p&gt;The dataset (&lt;code&gt;MostStreamedSpotifySongs2024.csv&lt;/code&gt;) consists of several columns that reference the main streaming music platforms. In this case, I only want to explore Spotify data. The information I want to know is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Songs by year&lt;/li&gt;
&lt;li&gt;  Song percentage: Explicit VS Non-Explicit&lt;/li&gt;
&lt;li&gt;  Most listened to songs by year with and without explicit content&lt;/li&gt;
&lt;li&gt;  Song with the most streams&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Importing the Libraries
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;NumPy&lt;/strong&gt;, &lt;strong&gt;Matplotlib&lt;/strong&gt;, and &lt;strong&gt;Seaborn&lt;/strong&gt; libraries make the work much easier due to the large number of methods they offer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Data manipulation with DataFrames.
import pandas as pd

# Numerical operations and array handling.
import numpy as np

# Chart creation.
import matplotlib.pyplot as plt

# Advanced statistical visualization.
import seaborn as sns

# Display charts in Jupyter notebook.
%matplotlib inline

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reading the File
&lt;/h3&gt;

&lt;p&gt;In this exercise, there is a single file in CSV format with ISO-8859-1 encoding. To avoid read errors, specify the encoding explicitly: &lt;code&gt;read_csv&lt;/code&gt; assumes UTF-8 by default, and files with special characters in another encoding can fail to load.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Reading the file, the encoding is ISO-8859-1

file_path = 'MostStreamedSpotifySongs2024.csv'
data = pd.read_csv(file_path, encoding='ISO-8859-1')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
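
&lt;p&gt;If you're not sure which encoding a CSV file uses, one option is to try a few candidates in order. The following is a minimal sketch, not part of the original notebook: &lt;code&gt;read_csv_with_fallback&lt;/code&gt; is a hypothetical helper, and the list of candidate encodings is an assumption you should adapt to your own files.&lt;/p&gt;

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Hypothetical helper: try each encoding until one decodes the file."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as error:
            # This encoding can't decode the file; try the next one.
            last_error = error
    raise ValueError(
        f"None of the encodings {encodings} could decode {path}"
    ) from last_error
```

&lt;p&gt;With this helper, the call would be &lt;code&gt;data = read_csv_with_fallback(file_path)&lt;/code&gt;. Keep in mind that ISO-8859-1 can decode any byte sequence, so it works as a last resort but won't tell you whether it is the right encoding.&lt;/p&gt;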



&lt;h3&gt;
  
  
  Visualizing the Data Table
&lt;/h3&gt;

&lt;p&gt;Once the data is loaded, you need to visualize the information it contains. The &lt;strong&gt;head()&lt;/strong&gt; method displays the first five rows of the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# View the first five rows of the table

data.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dataset Dimensions
&lt;/h3&gt;

&lt;p&gt;Knowing the dimensions of the dataset helps understand the amount of data you'll be working with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Dataset dimensions

print(f'Dataset size: {data.shape}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  DataFrame Observation
&lt;/h3&gt;

&lt;p&gt;Before starting data cleaning, you need to check if there is missing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Column names, non-null counts, and data types

data.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Null Data and Duplicate Data
&lt;/h4&gt;

&lt;p&gt;After observing that data is missing in some columns, the next step is to count the null values and the duplicate rows. To get the totals, chain the &lt;strong&gt;sum()&lt;/strong&gt; method after &lt;strong&gt;isnull()&lt;/strong&gt; and &lt;strong&gt;duplicated()&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sum of null values

data.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sum of duplicate records

data.duplicated().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
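
&lt;p&gt;To see what these two calls return, here is a small self-contained sketch. The &lt;code&gt;sample&lt;/code&gt; DataFrame is invented for illustration and is not taken from the dataset.&lt;/p&gt;

```python
import pandas as pd

# Invented sample data: one missing artist and one fully duplicated row.
sample = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song B", "Song C"],
    "Artist": ["Artist 1", "Artist 2", "Artist 2", None],
})

# isnull() marks each missing cell; sum() counts them per column.
null_counts = sample.isnull().sum()

# duplicated() marks each row that repeats an earlier row; sum() counts them.
duplicate_count = sample.duplicated().sum()

print(null_counts)
print(f"Duplicate rows: {duplicate_count}")
```

&lt;p&gt;Here &lt;code&gt;null_counts&lt;/code&gt; is a Series with one total per column, while &lt;code&gt;duplicate_count&lt;/code&gt; is a single number for the whole table.&lt;/p&gt;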



&lt;h2&gt;
  
  
  Data Cleaning
&lt;/h2&gt;

&lt;p&gt;The following cleaning steps are necessary to obtain a &lt;strong&gt;consistent dataset&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Duplicate Rows
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;drop_duplicates()&lt;/strong&gt; method is used to remove duplicate data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Find all duplicate records

duplicated_rows = data[data.duplicated()]

# Display duplicate records

print(duplicated_rows)

# Remove duplicate rows

print(f'Dataset size before removing duplicate rows: {data.shape}')
data.drop_duplicates(inplace=True) 
print(f'Dataset size after removing duplicate rows: {data.shape}')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Null Rows
&lt;/h3&gt;

&lt;p&gt;The first step is to filter the rows where &lt;em&gt;Artist&lt;/em&gt; is null and remove them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Filter rows where 'Artist' is null

null_artists = data[data['Artist'].isnull()]

# Display the indices of rows with null values in 'Artist'

print("\nIndices of artists that are null:")
print(null_artists.index.tolist())

# Remove null artists

print(f"Number of null artists before removing them: {data['Artist'].isnull().sum()}")
data.dropna(subset=['Artist'], inplace=True)
print(f"Number of null artists after removing them: {data['Artist'].isnull().sum()}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Transforming the Data
&lt;/h2&gt;

&lt;p&gt;Since the objective of the analysis is to explore only Spotify data, the columns corresponding to other music platforms are removed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Remove columns that are not considered for the main objective
# Define the list of columns to remove

columns_to_drop = [
    'YouTube Views', 'YouTube Likes', 'TikTok Posts', 'TikTok Likes', 'TikTok Views', 
    'YouTube Playlist Reach', 'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins', 
    'Deezer Playlist Count', 'Deezer Playlist Reach', 'Amazon Playlist Count', 'Pandora Streams', 
    'Pandora Track Stations', 'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity'
]

# Remove the columns

data.drop(columns=columns_to_drop, inplace=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
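
&lt;p&gt;The charts in the next section rely on a &lt;code&gt;Year&lt;/code&gt; column and on numeric values in &lt;code&gt;Spotify Streams&lt;/code&gt;, but the steps shown in this article don't create or convert them. Here is a minimal sketch of how that preparation might look, assuming the raw file has a &lt;code&gt;Release Date&lt;/code&gt; column in month/day/year format and stores stream counts as text with thousands separators. Both are assumptions, so check them against the output of &lt;code&gt;data.info()&lt;/code&gt; before reusing this. The &lt;code&gt;sample&lt;/code&gt; rows are invented.&lt;/p&gt;

```python
import pandas as pd

# Invented rows that mimic the assumed raw format.
sample = pd.DataFrame({
    "Track": ["Song A", "Song B"],
    "Release Date": ["4/26/2024", "11/3/2019"],
    "Spotify Streams": ["390,470,936", "1,250,000"],
})

# Parse the release date (assumed month/day/year) and keep only the year.
release_dates = pd.to_datetime(sample["Release Date"], format="%m/%d/%Y")
sample["Year"] = release_dates.dt.year

# Strip thousands separators, then convert the text to numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
streams_text = sample["Spotify Streams"].str.replace(",", "", regex=False)
sample["Spotify Streams"] = pd.to_numeric(streams_text, errors="coerce")

print(sample[["Track", "Year", "Spotify Streams"]])
```

&lt;p&gt;Converting the streams to numbers matters for the last step of the analysis: comparing text values would rank them alphabetically rather than by size.&lt;/p&gt;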



&lt;h2&gt;
  
  
  Data Visualization
&lt;/h2&gt;

&lt;p&gt;After performing the data loading, cleaning, and transformation processes, the next step is to visualize the information requested by the exercise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Songs by Year
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Count the number of songs by year

songs_by_year = data['Year'].value_counts().sort_index()

# Create the chart

plt.figure(figsize=(10, 6))
songs_by_year.plot(kind='bar', color='skyblue')
plt.title('Number of Songs by Year')
plt.xlabel('Year')
plt.ylabel('Number of Songs')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='-', alpha=0.7)

# Display the chart

plt.tight_layout()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Song Percentage: Explicit vs Non-Explicit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Total songs with explicit lyrics
# Count the number of occurrences of 0 and 1

value_counts = data['Explicit Track'].value_counts()

# Map binary values to explicit labels

labels = ['Explicit', 'Non-Explicit']
sizes = [value_counts.get(1, 0), value_counts.get(0, 0)]

# Create the pie chart

plt.figure(figsize=(4, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=['skyblue', 'salmon'])
plt.title('Song Distribution: Explicit vs Non-Explicit')

# Display the chart

plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
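
&lt;p&gt;If you want the percentages as numbers rather than only as pie chart labels, &lt;code&gt;value_counts(normalize=True)&lt;/code&gt; returns the proportions directly. A small sketch, with an invented &lt;code&gt;explicit_flags&lt;/code&gt; Series standing in for the &lt;code&gt;Explicit Track&lt;/code&gt; column:&lt;/p&gt;

```python
import pandas as pd

# Invented flags: 1 = explicit, 0 = non-explicit.
explicit_flags = pd.Series([1, 0, 0, 1, 0], name="Explicit Track")

# normalize=True returns proportions that sum to 1; multiply for percentages.
percentages = explicit_flags.value_counts(normalize=True) * 100

# get() with a default avoids a KeyError if one of the values never appears.
explicit_pct = percentages.get(1, 0)
non_explicit_pct = percentages.get(0, 0)

print(f"Explicit: {explicit_pct:.1f}% | Non-explicit: {non_explicit_pct:.1f}%")
```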



&lt;h3&gt;
  
  
  Most Listened to Songs by Year with and without Explicit Content
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Filter explicit and non-explicit songs

explicit_data = data[data['Explicit Track'] == 1]
no_explicit_data = data[data['Explicit Track'] == 0]

# Group by year with explicit content and without explicit content

explicit_track = explicit_data.groupby('Year')['Track'].count().reset_index()
no_explicit_track = no_explicit_data.groupby('Year')['Track'].count().reset_index()

# Rename columns to unify the DataFrame

explicit_track.rename(columns={'Track': 'Count'}, inplace=True)
explicit_track['Explicit'] = 'Yes'
no_explicit_track.rename(columns={'Track': 'Count'}, inplace=True)
no_explicit_track['Explicit'] = 'No'

# Merge the two DataFrames

data_combined = pd.concat([explicit_track, no_explicit_track])

# Create the chart using Seaborn

plt.figure(figsize=(12, 6))
sns.set_style("whitegrid")

# Create bar chart

sns.barplot(data=data_combined, x='Year', y='Count', hue='Explicit')

# Add title and labels

plt.title('Songs by Year According to Their Content')
plt.xlabel('Year')
plt.ylabel('Number of Songs')

# Display the chart

plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
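
&lt;p&gt;The same year-by-content counts can also be built in one step with &lt;code&gt;pd.crosstab&lt;/code&gt;. This is an alternative sketch, not the approach used in the article, and the &lt;code&gt;sample&lt;/code&gt; rows are invented:&lt;/p&gt;

```python
import pandas as pd

# Invented rows: one song per row, with its year and explicit flag.
sample = pd.DataFrame({
    "Year": [2023, 2023, 2024, 2024, 2024],
    "Explicit Track": [1, 0, 1, 1, 0],
})

# Rows are years, columns are the 0/1 flag, and cells are song counts.
counts = pd.crosstab(sample["Year"], sample["Explicit Track"])
counts = counts.rename(columns={0: "No", 1: "Yes"})

print(counts)
```

&lt;p&gt;The result is a wide table with one row per year, which can be plotted as grouped bars with &lt;code&gt;counts.plot(kind='bar')&lt;/code&gt;.&lt;/p&gt;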



&lt;h3&gt;
  
  
  Song with the Most Streams
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Identify the row with the most listened to song

most_listened_song = data.loc[data['Spotify Streams'].idxmax()]
print(f"The song with the most streams is '{most_listened_song['Track']}' by {most_listened_song['Artist']} with {most_listened_song['Spotify Streams']} streams.")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
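
&lt;p&gt;If you also want to see the runners-up, &lt;code&gt;nlargest()&lt;/code&gt; returns the top rows sorted by a numeric column. A short sketch with invented values, assuming the streams are already stored as numbers:&lt;/p&gt;

```python
import pandas as pd

# Invented sample with numeric stream counts.
sample = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C"],
    "Spotify Streams": [1200, 4500, 3100],
})

# Keep the two rows with the highest stream counts, largest first.
top_two = sample.nlargest(2, "Spotify Streams")

print(top_two)
```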



&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;After exploring and visualizing the data of the most listened to songs on Spotify in 2024, I've drawn the following insights.&lt;/p&gt;

&lt;p&gt;In the &lt;em&gt;Songs by Year According to Their Content&lt;/em&gt; chart, you can observe an increase in the number of songs with explicit content from 2015 onwards. This increase may be due to the following factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Increase in new artists who use more explicit language.&lt;/li&gt;
&lt;li&gt;  Emergence or fusion of new musical styles.&lt;/li&gt;
&lt;li&gt;  Reflections of society in song lyrics with advocacy motives.&lt;/li&gt;
&lt;li&gt;  Other reasons.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another result that I found curious is that the song with the most streams is one of my favorites, yet it's not the one with the highest score. This raises the question: what is the key to a song's success?&lt;/p&gt;

&lt;h3&gt;
  
  
  🚀 Want to explore the project further?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;💻 Check out the project on GitHub: &lt;a href="https://github.com/nuriadevs/music-data-analysis" rel="noopener noreferrer"&gt;music-data-analysis&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Watch my favorite, most-streamed song on YouTube: &lt;a href="https://www.youtube.com/watch?v=4NRXx6U8ABQ" rel="noopener noreferrer"&gt;Watch on YouTube&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope this article has been useful to you. 🍀&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>pandas</category>
      <category>numpy</category>
    </item>
  </channel>
</rss>
