Visualizing Musical Similarity Through Dimensionality Reduction

A Computational Analysis of Spotify Audio Features Using UMAP and t-SNE

J. Rizzo

Independent Researcher • Computational Musicology and Data Science

January 2026

Abstract

This paper presents a comprehensive analysis of musical similarity patterns across 25,000 tracks spanning 30 distinct genres using dimensionality reduction techniques. By applying UMAP and t-SNE to Spotify's multi-dimensional audio feature space, we demonstrate that genres form geometrically distinct clusters in reduced dimensional space. Findings indicate strong statistical separation (MANOVA F≈142.7, p<0.0001) with high trustworthiness scores (UMAP T=0.89, t-SNE T=0.91), supporting applications in recommendation, playlist generation, and computational musicology.

Keywords: Music Information Retrieval, Dimensionality Reduction, UMAP, t-SNE, Audio Features, Genre Classification, Computational Musicology, Machine Learning

1. Introduction

1.1 Background and Motivation

Streaming and recommendation systems need robust similarity models that go beyond simple genre labels. Spotify's audio analysis exposes high-dimensional track descriptors that make quantitative study of musical similarity practical at scale.

1.2 Research Question

Can high-dimensional audio representations be projected into 2D while preserving local neighborhood structure and meaningful global genre relationships?

1.3 Contributions

  • Empirical validation that audio features encode significant genre structure.
  • Comparative analysis of UMAP and t-SNE with quantitative quality metrics.
  • Discovery of cross-genre relationships not captured by traditional taxonomy.
  • Feature importance analysis identifying Energy and Acousticness as dominant discriminators.
  • Practical guidance for recommendation and music discovery interfaces.

2. Methodology

2.1 Dataset Construction

We constructed a balanced dataset of 25,000 tracks across 30 genres (approximately 833 tracks per genre), designed to support robust analysis while keeping processing tractable.
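
As an illustrative sketch (not necessarily the exact pipeline used here), a genre-balanced subset can be drawn by shuffling each genre's tracks and capping the genre at a fixed quota. The track objects are assumed to carry a genre field:

    // Draw a genre-balanced sample: at most `perGenre` tracks from each genre.
    // Assumes each track object carries a `genre` string field (hypothetical schema).
    function balancedSample(tracks, perGenre) {
      const byGenre = new Map();
      for (const t of tracks) {
        if (!byGenre.has(t.genre)) byGenre.set(t.genre, []);
        byGenre.get(t.genre).push(t);
      }
      const sample = [];
      for (const group of byGenre.values()) {
        const copy = [...group];
        // Fisher-Yates shuffle, then take the first `perGenre` tracks.
        for (let i = copy.length - 1; i > 0; i--) {
          const j = Math.floor(Math.random() * (i + 1));
          [copy[i], copy[j]] = [copy[j], copy[i]];
        }
        sample.push(...copy.slice(0, perGenre));
      }
      return sample;
    }

    // balancedSample(allTracks, 833) yields roughly 25,000 tracks over 30 genres.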

2.2 Audio Feature Space

Feature            Range           Description
Danceability       [0, 1]          Rhythmic stability, beat strength, regularity
Energy             [0, 1]          Perceptual intensity and activity measure
Loudness           [-60, 0] dB     Overall amplitude of track
Speechiness        [0, 1]          Presence of spoken words
Acousticness       [0, 1]          Confidence measure of acoustic instrumentation
Instrumentalness   [0, 1]          Predicts absence of vocals
Liveness           [0, 1]          Presence of audience in recording
Valence            [0, 1]          Musical positiveness/happiness
Tempo              [80, 180] BPM   Overall estimated tempo

Table 1: Spotify Audio Features Specification (Spotify for Developers, 2024)
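
For concreteness, a single track in this feature space can be represented as a plain object whose field names mirror the Spotify Web API audio-features response (the values below are invented for illustration):

    // Illustrative audio-feature record for one track (values invented).
    const exampleTrack = {
      danceability: 0.72,      // [0, 1]
      energy: 0.85,            // [0, 1]
      loudness: -5.3,          // [-60, 0] dB
      speechiness: 0.05,       // [0, 1]
      acousticness: 0.02,      // [0, 1]
      instrumentalness: 0.60,  // [0, 1]
      liveness: 0.11,          // [0, 1]
      valence: 0.43,           // [0, 1]
      tempo: 128.0             // BPM
    };

    // A fixed feature order (assumed here) for turning records into vectors.
    const FEATURES = ["danceability", "energy", "loudness", "speechiness",
                      "acousticness", "instrumentalness", "liveness", "valence", "tempo"];
    const toVector = (t) => FEATURES.map((f) => t[f]);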

2.3 Data Preprocessing

Each feature is z-score standardized so that all nine features contribute on a comparable scale to distance computations:

x'ij = (xij - μj) / σj
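
A direct JavaScript translation of this formula (a minimal sketch using the population standard deviation, with a guard for constant features):

    // Column-wise z-score standardization: x'[i][j] = (x[i][j] - mean_j) / std_j.
    function standardize(X) {
      const n = X.length, d = X[0].length;
      const mean = new Array(d).fill(0);
      const variance = new Array(d).fill(0);
      for (const row of X) row.forEach((v, j) => { mean[j] += v / n; });
      for (const row of X) row.forEach((v, j) => { variance[j] += (v - mean[j]) ** 2 / n; });
      const std = variance.map((v) => Math.sqrt(v) || 1); // guard zero-variance columns
      return X.map((row) => row.map((v, j) => (v - mean[j]) / std[j]));
    }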

2.4 UMAP Algorithm

UMAP (McInnes, Healy, & Melville, 2018) builds a fuzzy graph over the data manifold and optimizes a low-dimensional layout through attractive and repulsive forces, balancing local fidelity against readable global shape. Directed k-nearest-neighbor edge weights are combined into a symmetric graph via the probabilistic t-conorm:

w(i,j)sym = w(i,j) + w(j,i) - w(i,j)·w(j,i)
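
A minimal sketch of this symmetrization step, assuming the directed k-nearest-neighbor weights are stored in a Map keyed by "i,j" strings (library implementations of UMAP operate on sparse matrices instead):

    // Symmetrize directed UMAP edge weights via the probabilistic t-conorm:
    // w_sym(i,j) = w(i,j) + w(j,i) - w(i,j) * w(j,i).
    function symmetrize(w) {
      const sym = new Map();
      for (const [key, wij] of w) {
        const [i, j] = key.split(",");
        const wji = w.get(`${j},${i}`) ?? 0; // missing reverse edge counts as 0
        const s = wij + wji - wij * wji;
        sym.set(key, s);
        sym.set(`${j},${i}`, s);
      }
      return sym;
    }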

2.5 t-SNE Algorithm

t-SNE (van der Maaten & Hinton, 2008) minimizes the KL divergence between high-dimensional and low-dimensional neighbor probability distributions, using heavy-tailed Student-t similarities in the embedding to reduce crowding. The gradient of the cost C with respect to embedded point yi is:

∂C/∂yi = 4 Σj (pij - qij)(yi - yj)(1 + ||yi - yj||²)^-1
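
This gradient translates directly into code. The sketch below assumes dense n × n similarity matrices P and Q and an n × 2 embedding Y, which is only practical for modest n; production t-SNE implementations use sparse or Barnes-Hut approximations:

    // Gradient of the t-SNE cost for point i in 2D:
    // dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2).
    function tsneGradient(i, P, Q, Y) {
      const n = Y.length;
      const grad = [0, 0];
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        const dx = Y[i][0] - Y[j][0];
        const dy = Y[i][1] - Y[j][1];
        const coeff = 4 * (P[i][j] - Q[i][j]) / (1 + dx * dx + dy * dy);
        grad[0] += coeff * dx;
        grad[1] += coeff * dy;
      }
      return grad; // y_i is then moved against this gradient, scaled by learning rate η
    }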

3. Results and Analysis

3.1 Cluster Formation Patterns

  • High-Energy Cluster: EDM, techno, house, dubstep, trap, metal, punk.
  • Acoustic-Chill Cluster: classical, ambient, lo-fi, acoustic, folk.
  • Vocal-Centric Cluster: hip-hop, r&b, soul, pop, k-pop.

3.2 Statistical Validation

MANOVA across the 30 genres: F(261, 223290) ≈ 142.7, p < 0.0001.

Genre Pair             Mahalanobis Distance   Shared Features
Lo-fi ↔ Ambient        0.82                   Low energy, high instrumentalness
Classical ↔ Acoustic   1.15                   High acousticness, low speechiness
Techno ↔ House         1.28                   High energy, high danceability
Hip-Hop ↔ R&B          1.41                   High speechiness, moderate valence
EDM ↔ Classical        12.4                   None (maximally dissimilar pair)

Table 2: Genre similarity measured by Mahalanobis distance
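
The distances in Table 2 follow the standard Mahalanobis form d(a, b) = √((a − b)ᵀ S⁻¹ (a − b)). A minimal sketch between two genre centroids, assuming the inverse pooled covariance matrix Sinv has been computed elsewhere:

    // Mahalanobis distance between genre centroids a and b, given Sinv,
    // the inverse of the pooled feature covariance matrix (precomputed).
    function mahalanobis(a, b, Sinv) {
      const d = a.length;
      const diff = a.map((v, k) => v - b[k]);
      let sum = 0;
      for (let r = 0; r < d; r++) {
        for (let c = 0; c < d; c++) {
          sum += diff[r] * Sinv[r][c] * diff[c];
        }
      }
      return Math.sqrt(sum);
    }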

3.3 Feature Importance Analysis

  1. Energy: 18.4%
  2. Acousticness: 16.7%
  3. Danceability: 14.5%
  4. Instrumentalness: 13.2%
  5. Speechiness: 12.1%
  6. Valence: 9.8%
  7. Loudness: 8.7%
  8. Tempo: 4.1%
  9. Liveness: 2.5%
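
The exact importance estimator behind this ranking is not specified above; one simple stand-in that produces a comparable percentage ranking is the per-feature ratio of between-genre to within-genre variance on standardized data (an assumption, sketched below):

    // Rank features by genre-discriminating power: per-feature ratio of
    // between-genre to within-genre sum of squares, normalized to percentages.
    // Degrees-of-freedom constants are omitted; they are identical for every feature.
    function featureImportance(X, labels) {
      const n = X.length, d = X[0].length;
      const groups = new Map();
      labels.forEach((g, i) => {
        if (!groups.has(g)) groups.set(g, []);
        groups.get(g).push(X[i]);
      });
      const grand = new Array(d).fill(0);
      X.forEach((row) => row.forEach((v, j) => { grand[j] += v / n; }));
      const between = new Array(d).fill(0);
      const within = new Array(d).fill(0);
      for (const rows of groups.values()) {
        const m = new Array(d).fill(0);
        rows.forEach((row) => row.forEach((v, j) => { m[j] += v / rows.length; }));
        for (let j = 0; j < d; j++) between[j] += rows.length * (m[j] - grand[j]) ** 2;
        rows.forEach((row) => row.forEach((v, j) => { within[j] += (v - m[j]) ** 2; }));
      }
      const ratio = between.map((b, j) => b / (within[j] || 1));
      const total = ratio.reduce((a, v) => a + v, 0);
      return ratio.map((v) => (100 * v) / total); // percentage share per feature
    }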

3.4 Dimensionality Reduction Quality

Algorithm   T(k)   Interpretation
UMAP        0.89   Excellent local preservation
t-SNE       0.91   Excellent local preservation

Table 3: Trustworthiness scores
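
Trustworthiness here follows the standard Venna-Kaski definition, which penalizes points that appear among the k nearest neighbors in the embedding but not in the original space. A brute-force sketch (O(n² log n), so better suited to subsamples than to all 25,000 tracks):

    // Trustworthiness T(k). X: n x d original vectors, Y: n x 2 embedded points.
    function trustworthiness(X, Y, k) {
      const n = X.length;
      const sqDist = (a, b) => a.reduce((s, v, t) => s + (v - b[t]) ** 2, 0);
      const ranksX = []; // ranksX[i][j] = rank of j among i's original neighbors (1 = nearest)
      const nnY = [];    // nnY[i] = set of i's k nearest neighbors in the embedding
      for (let i = 0; i < n; i++) {
        const others = [...Array(n).keys()].filter((j) => j !== i);
        const orderX = [...others].sort((a, b) => sqDist(X[i], X[a]) - sqDist(X[i], X[b]));
        const rank = new Array(n);
        orderX.forEach((j, r) => { rank[j] = r + 1; });
        ranksX.push(rank);
        const orderY = [...others].sort((a, b) => sqDist(Y[i], Y[a]) - sqDist(Y[i], Y[b]));
        nnY.push(new Set(orderY.slice(0, k)));
      }
      let penalty = 0;
      for (let i = 0; i < n; i++) {
        for (const j of nnY[i]) {
          if (ranksX[i][j] > k) penalty += ranksX[i][j] - k; // j intruded into the 2D neighborhood
        }
      }
      return 1 - (2 / (n * k * (2 * n - 3 * k - 1))) * penalty;
    }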

3.5 UMAP vs t-SNE

Criterion             UMAP                          t-SNE
Theoretical Basis     Riemannian geometry           Probability distributions
Global Structure      Better preserved (ρ = 0.62)   Sacrificed (ρ = 0.41)
Local Structure       Excellent (T = 0.89)          Excellent (T = 0.91)
Cluster Separation    Moderate (DB = 1.24)          Strong (DB = 1.08)
Computational Speed   Faster (30-60 s)              Slower (45-90 s)
Stability             More deterministic            Higher run-to-run variability

Table 4: Comparative performance (n=25,000)
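
The DB values above are Davies-Bouldin indices (lower means better-separated clusters). A compact sketch over genre-labeled clusters in the 2D embedding:

    // Davies-Bouldin index: mean over clusters of the worst-case ratio
    // (scatter_i + scatter_j) / distance(centroid_i, centroid_j).
    // `clusters` is an array of 2D point arrays, one per genre.
    function daviesBouldin(clusters) {
      const dist = (a, b) => Math.hypot(a[0] - b[0], a[1] - b[1]);
      const centroids = clusters.map((pts) =>
        pts.reduce((c, p) => [c[0] + p[0] / pts.length, c[1] + p[1] / pts.length], [0, 0]));
      const scatter = clusters.map((pts, i) =>
        pts.reduce((s, p) => s + dist(p, centroids[i]), 0) / pts.length);
      const k = clusters.length;
      let db = 0;
      for (let i = 0; i < k; i++) {
        let worst = 0;
        for (let j = 0; j < k; j++) {
          if (j === i) continue;
          worst = Math.max(worst, (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j]));
        }
        db += worst / k;
      }
      return db;
    }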

4. Applications

  • Recommendation Systems: blend acoustic similarity with collaborative signals.
  • Gradient Playlists: generate smooth, path-based transitions between a start and an end point in feature space (see the sketch after this list).
  • Data-Driven Taxonomy: infer acoustic macro/micro genres from clustering.
  • Musicology Research: track genre drift, artist consistency, and collaboration bridges.
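
A minimal sketch of the gradient-playlist idea: walk a straight line in standardized feature space from a start vector to an end vector and greedily pick the nearest unused track at each step (track objects are assumed to carry a precomputed standardized vector in a vec field):

    // Gradient playlist: interpolate between startVec and endVec over `steps`
    // waypoints and select the closest remaining track for each waypoint.
    // Assumes tracks.length > steps and each track has a `vec` feature array.
    function gradientPlaylist(tracks, startVec, endVec, steps) {
      const sqDist = (a, b) => a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0);
      const used = new Set();
      const playlist = [];
      for (let s = 0; s <= steps; s++) {
        const t = s / steps;
        const target = startVec.map((v, i) => v + t * (endVec[i] - v));
        let best = null, bestD = Infinity;
        for (const track of tracks) {
          if (used.has(track)) continue;
          const dd = sqDist(track.vec, target);
          if (dd < bestD) { bestD = dd; best = track; }
        }
        used.add(best);
        playlist.push(best);
      }
      return playlist;
    }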

5. Discussion

Results show meaningful structure in acoustic space while also highlighting that similarity in sound does not fully capture similarity in lived emotional meaning.

Philosophical note: Mathematics reveals structure in music, but human context gives that structure meaning.

6. Future Research Directions

  • Deep-learning embeddings from raw waveform encoders.
  • Multi-modal fusion across audio, lyrics, visuals, and metadata.
  • 3D temporal visualizations for genre evolution.
  • Cross-cultural feature robustness analysis.
  • User-personalized embedding metrics.

7. Conclusions

  1. Audio features encode statistically significant genre structure.
  2. Dimensionality reduction preserves meaningful neighborhood patterns.
  3. Cross-genre relationships emerge beyond taxonomy labels.
  4. Energy and Acousticness dominate discrimination power.
  5. UMAP and t-SNE serve complementary analytical roles.

Mathematics reveals structure, but humans give it meaning.

References

  1. McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
  2. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579-2605.
  3. Spotify for Developers. (2024). Web API Reference: Audio Features. https://developer.spotify.com/documentation/web-api/reference/get-audio-features
  4. Pandya, M. (2023). Spotify Tracks Dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset
  5. Bogdanov, D., Porter, A., Herrera, P., & Serra, X. (2019). AcousticBrainz platform. ISMIR Proceedings, 475-481.
  6. Schedl, M., Knees, P., McFee, B., & Bogdanov, D. (2022). Music Recommendation Systems. Springer.
  7. Sturm, B. L. (2014). Determining if a MIR system is a horse. IEEE Transactions on Multimedia, 16(6), 1636-1644.
  8. Müller, M. (2015). Fundamentals of Music Processing. Springer.
  9. Casey, M. A., et al. (2008). Content-based music information retrieval directions. Proceedings of the IEEE, 96(4), 668-696.
  10. Bertin-Mahieux, T., Ellis, D. P., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. ISMIR Proceedings, 591-596.

Appendix A: Mathematical Notation

Symbol      Meaning
x ∈ ℝ⁹      Track feature vector (9-dimensional)
y ∈ ℝ²      Low-dimensional embedding (2D)
n           Number of tracks (25,000)
k           Number of genres (30); also the neighborhood size in T(k)
d(xi, xj)   Euclidean distance between tracks i and j
μj          Mean of feature j
σj          Standard deviation of feature j
pij         High-dimensional similarity probability (t-SNE)
qij         Low-dimensional similarity probability (t-SNE)
w(i,j)      Edge weight in the UMAP graph
∇L          Gradient of the loss function
η           Learning rate
ρ           Spearman correlation coefficient

Appendix B: Implementation Details

  • Language: JavaScript ES6+ (browser-based)
  • Framework: React 18
  • Rendering: HTML5 Canvas API (see the sketch after this list)
  • Processing Time: 30-90 seconds for 25,000 tracks
  • Memory Usage: approximately 150 MB
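
A minimal Canvas rendering sketch consistent with the stack above (assumed data shapes: points is an n × 2 array of embedded coordinates, colors[i] is a CSS color string per point):

    // Scatter-plot a 2D embedding onto an HTML5 canvas.
    function drawEmbedding(canvas, points, colors) {
      const ctx = canvas.getContext("2d");
      ctx.clearRect(0, 0, canvas.width, canvas.height);
      // Fit the embedding's bounding box into the canvas with a small margin.
      const min = (sel) => points.reduce((a, p) => Math.min(a, sel(p)), Infinity);
      const max = (sel) => points.reduce((a, p) => Math.max(a, sel(p)), -Infinity);
      const [minX, maxX] = [min((p) => p[0]), max((p) => p[0])];
      const [minY, maxY] = [min((p) => p[1]), max((p) => p[1])];
      const pad = 10;
      const sx = (canvas.width - 2 * pad) / ((maxX - minX) || 1);
      const sy = (canvas.height - 2 * pad) / ((maxY - minY) || 1);
      points.forEach((p, i) => {
        ctx.fillStyle = colors[i];
        ctx.beginPath();
        ctx.arc(pad + (p[0] - minX) * sx, pad + (p[1] - minY) * sy, 1.5, 0, 2 * Math.PI);
        ctx.fill();
      });
    }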

Reproducibility note: the visualization and whitepaper are intended for direct web access with parameter toggles, genre filtering, and image export support.