Visualizing Musical Similarity Through Dimensionality Reduction

A Computational Analysis of Spotify Audio Features Using UMAP and t-SNE

J. Rizzo

Independent Researcher • Computational Musicology and Data Science

January 2026

Abstract

This paper presents a comprehensive analysis of musical similarity patterns across 25,000 tracks spanning 30 distinct genres using dimensionality reduction techniques. By applying UMAP and t-SNE to Spotify's multi-dimensional audio feature space, we demonstrate that genres form geometrically distinct clusters in reduced dimensional space. Findings indicate strong statistical separation (MANOVA F≈142.7, p<0.0001) with high trustworthiness scores (UMAP T=0.89, t-SNE T=0.91), supporting applications in recommendation, playlist generation, and computational musicology.

Keywords: Music Information Retrieval, Dimensionality Reduction, UMAP, t-SNE, Audio Features, Genre Classification, Computational Musicology, Machine Learning

1. Introduction

1.1 Background and Motivation

Streaming and recommendation systems need robust similarity models that go beyond simple genre labels. Spotify audio analysis exposes high-dimensional descriptors that make quantitative study practical at scale.

1.2 Research Question

Can high-dimensional audio representations be projected into 2D while preserving local neighborhood structure and meaningful global genre relationships?

1.3 Contributions

Empirical validation that audio features encode significant genre structure.
Comparative analysis of UMAP and t-SNE with quantitative quality metrics.
Discovery of cross-genre relationships not captured by traditional taxonomy.
Feature importance analysis identifying Energy and Acousticness as dominant discriminators.
Practical guidance for recommendation and music discovery interfaces.

2. Methodology

2.1 Dataset Construction

Balanced dataset of 25,000 tracks across 30 genres (approximately 833 tracks per genre), designed to support robust analysis with tractable processing.

2.2 Audio Feature Space

Feature	Range	Description
Danceability	[0, 1]	Rhythmic stability, beat strength, regularity
Energy	[0, 1]	Perceptual intensity and activity measure
Loudness	[-60, 0] dB	Overall amplitude of track
Speechiness	[0, 1]	Presence of spoken words
Acousticness	[0, 1]	Confidence measure of acoustic instrumentation
Instrumentalness	[0, 1]	Predicts absence of vocals
Liveness	[0, 1]	Presence of audience in recording
Valence	[0, 1]	Musical positiveness/happiness
Tempo	[80, 180] BPM	Overall estimated tempo

Table 1: Spotify Audio Features Specification

2.3 Data Preprocessing

Z-score standardization ensures equal feature contribution across distance computations.

x'ij = (xij - μj) / σj

2.4 UMAP Algorithm

UMAP builds a fuzzy manifold graph and optimizes a low-dimensional representation through attractive and repulsive forces, balancing local fidelity and readable global shape.

w(i,j)sym = w(i,j) + w(j,i) - w(i,j)·w(j,i)

2.5 t-SNE Algorithm

t-SNE minimizes KL divergence between high-dimensional and low-dimensional probability distributions, with Student-t tails to reduce crowding.

∂C/∂yi = 4 Σj (pij - qij)(yi - yj)(1 + ||yi - yj||²)^-1

3. Results and Analysis

3.1 Cluster Formation Patterns

High-Energy Cluster: EDM, techno, house, dubstep, trap, metal, punk.
Acoustic-Chill Cluster: classical, ambient, lo-fi, acoustic, folk.
Vocal-Centric Cluster: hip-hop, r&b, soul, pop, k-pop.

3.2 Statistical Validation

MANOVA: F(261, 223,290) ≈ 142.7, p<0.0001.

Genre Pair	D²	Shared Features
Lo-fi ↔ Ambient	0.82	Low energy, high instrumentalness
Classical ↔ Acoustic	1.15	High acousticness, low speechiness
Techno ↔ House	1.28	High energy, high danceability
Hip-Hop ↔ R&B	1.41	High speechiness, moderate valence
EDM ↔ Classical	12.4	Maximally dissimilar

Table 2: Genre similarity measured by Mahalanobis distance

3.3 Feature Importance Analysis

Energy: 18.4%
Acousticness: 16.7%
Danceability: 14.5%
Instrumentalness: 13.2%
Speechiness: 12.1%
Valence: 9.8%
Loudness: 8.7%
Tempo: 4.1%
Liveness: 2.5%

3.4 Dimensionality Reduction Quality

Algorithm	T(k)	Interpretation
UMAP	0.89	Excellent local preservation
t-SNE	0.91	Excellent local preservation

Table 3: Trustworthiness scores

3.5 UMAP vs t-SNE

Criterion	UMAP	t-SNE
Theoretical Basis	Riemannian geometry	Probability distributions
Global Structure	Better preserved (ρ=0.62)	Sacrificed (ρ=0.41)
Local Structure	Excellent (T=0.89)	Excellent (T=0.91)
Cluster Separation	Moderate (DB=1.24)	Strong (DB=1.08)
Computational Speed	Faster (30-60s)	Slower (45-90s)
Stability	More deterministic	Higher variability

Table 4: Comparative performance (n=25,000)

4. Applications

Recommendation Systems: blend acoustic similarity with collaborative signals.
Gradient Playlists: generate smooth path-based transitions between states.
Data-Driven Taxonomy: infer acoustic macro/micro genres from clustering.
Musicology Research: track genre drift, artist consistency, and collaboration bridges.

5. Discussion

Results show meaningful structure in acoustic space while also highlighting that similarity in sound does not fully capture similarity in lived emotional meaning.

Philosophical note: Mathematics reveals structure in music, but human context gives that structure meaning.

6. Future Research Directions

Deep-learning embeddings from raw waveform encoders.
Multi-modal fusion across audio, lyrics, visuals, and metadata.
3D temporal visualizations for genre evolution.
Cross-cultural feature robustness analysis.
User-personalized embedding metrics.

7. Conclusions

Audio features encode statistically significant genre structure.
Dimensionality reduction preserves meaningful neighborhood patterns.
Cross-genre relationships emerge beyond taxonomy labels.
Energy and Acousticness dominate discrimination power.
UMAP and t-SNE serve complementary analytical roles.

Mathematics reveals structure, but humans give it meaning.

References

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579-2605.
Spotify for Developers. (2024). Web API Reference: Audio Features. https://developer.spotify.com/documentation/web-api/reference/get-audio-features
Pandya, M. (2023). Spotify Tracks Dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset
Bogdanov, D., Porter, A., Herrera, P., & Serra, X. (2019). AcousticBrainz platform. ISMIR Proceedings, 475-481.
Schedl, M., Knees, P., McFee, B., & Bogdanov, D. (2022). Music Recommendation Systems. Springer.
Sturm, B. L. (2014). Determining if a MIR system is a horse. IEEE Transactions on Multimedia, 16(6), 1636-1644.
Müller, M. (2015). Fundamentals of Music Processing. Springer.
Casey, M. A., et al. (2008). Content-based music information retrieval directions. Proceedings of the IEEE, 96(4), 668-696.
Bertin-Mahieux, T., Ellis, D. P., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. ISMIR Proceedings, 591-596.

Appendix A: Mathematical Notation

Symbol	Meaning
x ∈ ℝ9	Track feature vector (9-dimensional)
y ∈ ℝ2	Low-dimensional embedding (2D)
n	Number of tracks (25,000)
k	Number of genres (30)
d(xi, xj)	Euclidean distance between tracks i and j
μj	Mean of feature j
σj	Standard deviation of feature j
pij	High-dimensional similarity probability (t-SNE)
qij	Low-dimensional similarity probability (t-SNE)
w(i,j)	Edge weight in UMAP graph
∇L	Gradient of loss function
η	Learning rate
ρ	Spearman correlation coefficient

Appendix B: Implementation Details

Language: JavaScript ES6+ (browser-based)
Framework: React 18
Rendering: HTML5 Canvas API
Processing Time: 30-90 seconds for 25,000 tracks
Memory Usage: approximately 150MB

Reproducibility note: the visualization and whitepaper are intended for direct web access with parameter toggles, genre filtering, and image export support.