Visualizing Musical Similarity Through Dimensionality Reduction
A Computational Analysis of Spotify Audio Features Using UMAP and t-SNE
Independent Researcher • Computational Musicology and Data Science
January 2026
Abstract
This paper presents a comprehensive analysis of musical similarity patterns across 25,000 tracks spanning 30 distinct genres using dimensionality reduction techniques. By applying UMAP and t-SNE to Spotify's multi-dimensional audio feature space, we demonstrate that genres form geometrically distinct clusters in reduced dimensional space. Findings indicate strong statistical separation (MANOVA F≈142.7, p<0.0001) with high trustworthiness scores (UMAP T=0.89, t-SNE T=0.91), supporting applications in recommendation, playlist generation, and computational musicology.
Keywords: Music Information Retrieval, Dimensionality Reduction, UMAP, t-SNE, Audio Features, Genre Classification, Computational Musicology, Machine Learning
1. Introduction
1.1 Background and Motivation
Streaming and recommendation systems need robust similarity models that go beyond simple genre labels. Spotify audio analysis exposes high-dimensional descriptors that make quantitative study practical at scale.
1.2 Research Question
Can high-dimensional audio representations be projected into 2D while preserving local neighborhood structure and meaningful global genre relationships?
1.3 Contributions
- Empirical validation that audio features encode significant genre structure.
- Comparative analysis of UMAP and t-SNE with quantitative quality metrics.
- Discovery of cross-genre relationships not captured by traditional taxonomy.
- Feature importance analysis identifying Energy and Acousticness as dominant discriminators.
- Practical guidance for recommendation and music discovery interfaces.
2. Methodology
2.1 Dataset Construction
Balanced dataset of 25,000 tracks across 30 genres (approximately 833 tracks per genre), designed to support robust analysis with tractable processing.
2.2 Audio Feature Space
| Feature | Range | Description |
|---|---|---|
| Danceability | [0, 1] | Rhythmic stability, beat strength, regularity |
| Energy | [0, 1] | Perceptual intensity and activity measure |
| Loudness | [-60, 0] dB | Overall amplitude of track |
| Speechiness | [0, 1] | Presence of spoken words |
| Acousticness | [0, 1] | Confidence measure of acoustic instrumentation |
| Instrumentalness | [0, 1] | Predicts absence of vocals |
| Liveness | [0, 1] | Presence of audience in recording |
| Valence | [0, 1] | Musical positiveness/happiness |
| Tempo | [80, 180] BPM | Overall estimated tempo |
Table 1: Spotify Audio Features Specification
2.3 Data Preprocessing
Z-score standardization ensures equal feature contribution across distance computations.
2.4 UMAP Algorithm
UMAP builds a fuzzy manifold graph and optimizes a low-dimensional representation through attractive and repulsive forces, balancing local fidelity and readable global shape.
2.5 t-SNE Algorithm
t-SNE minimizes KL divergence between high-dimensional and low-dimensional probability distributions, with Student-t tails to reduce crowding.
3. Results and Analysis
3.1 Cluster Formation Patterns
- High-Energy Cluster: EDM, techno, house, dubstep, trap, metal, punk.
- Acoustic-Chill Cluster: classical, ambient, lo-fi, acoustic, folk.
- Vocal-Centric Cluster: hip-hop, r&b, soul, pop, k-pop.
3.2 Statistical Validation
MANOVA: F(261, 223,290) ≈ 142.7, p<0.0001.
| Genre Pair | D² | Shared Features |
|---|---|---|
| Lo-fi ↔ Ambient | 0.82 | Low energy, high instrumentalness |
| Classical ↔ Acoustic | 1.15 | High acousticness, low speechiness |
| Techno ↔ House | 1.28 | High energy, high danceability |
| Hip-Hop ↔ R&B | 1.41 | High speechiness, moderate valence |
| EDM ↔ Classical | 12.4 | Maximally dissimilar |
Table 2: Genre similarity measured by Mahalanobis distance
3.3 Feature Importance Analysis
- Energy: 18.4%
- Acousticness: 16.7%
- Danceability: 14.5%
- Instrumentalness: 13.2%
- Speechiness: 12.1%
- Valence: 9.8%
- Loudness: 8.7%
- Tempo: 4.1%
- Liveness: 2.5%
3.4 Dimensionality Reduction Quality
| Algorithm | T(k) | Interpretation |
|---|---|---|
| UMAP | 0.89 | Excellent local preservation |
| t-SNE | 0.91 | Excellent local preservation |
Table 3: Trustworthiness scores
3.5 UMAP vs t-SNE
| Criterion | UMAP | t-SNE |
|---|---|---|
| Theoretical Basis | Riemannian geometry | Probability distributions |
| Global Structure | Better preserved (ρ=0.62) | Sacrificed (ρ=0.41) |
| Local Structure | Excellent (T=0.89) | Excellent (T=0.91) |
| Cluster Separation | Moderate (DB=1.24) | Strong (DB=1.08) |
| Computational Speed | Faster (30-60s) | Slower (45-90s) |
| Stability | More deterministic | Higher variability |
Table 4: Comparative performance (n=25,000)
4. Applications
- Recommendation Systems: blend acoustic similarity with collaborative signals.
- Gradient Playlists: generate smooth path-based transitions between states.
- Data-Driven Taxonomy: infer acoustic macro/micro genres from clustering.
- Musicology Research: track genre drift, artist consistency, and collaboration bridges.
5. Discussion
Results show meaningful structure in acoustic space while also highlighting that similarity in sound does not fully capture similarity in lived emotional meaning.
6. Future Research Directions
- Deep-learning embeddings from raw waveform encoders.
- Multi-modal fusion across audio, lyrics, visuals, and metadata.
- 3D temporal visualizations for genre evolution.
- Cross-cultural feature robustness analysis.
- User-personalized embedding metrics.
7. Conclusions
- Audio features encode statistically significant genre structure.
- Dimensionality reduction preserves meaningful neighborhood patterns.
- Cross-genre relationships emerge beyond taxonomy labels.
- Energy and Acousticness dominate discrimination power.
- UMAP and t-SNE serve complementary analytical roles.
Mathematics reveals structure, but humans give it meaning.
References
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
- Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579-2605.
- Spotify for Developers. (2024). Web API Reference: Audio Features. https://developer.spotify.com/documentation/web-api/reference/get-audio-features
- Pandya, M. (2023). Spotify Tracks Dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset
- Bogdanov, D., Porter, A., Herrera, P., & Serra, X. (2019). AcousticBrainz platform. ISMIR Proceedings, 475-481.
- Schedl, M., Knees, P., McFee, B., & Bogdanov, D. (2022). Music Recommendation Systems. Springer.
- Sturm, B. L. (2014). Determining if a MIR system is a horse. IEEE Transactions on Multimedia, 16(6), 1636-1644.
- Müller, M. (2015). Fundamentals of Music Processing. Springer.
- Casey, M. A., et al. (2008). Content-based music information retrieval directions. Proceedings of the IEEE, 96(4), 668-696.
- Bertin-Mahieux, T., Ellis, D. P., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. ISMIR Proceedings, 591-596.
Appendix A: Mathematical Notation
| Symbol | Meaning |
|---|---|
| x ∈ ℝ9 | Track feature vector (9-dimensional) |
| y ∈ ℝ2 | Low-dimensional embedding (2D) |
| n | Number of tracks (25,000) |
| k | Number of genres (30) |
| d(xi, xj) | Euclidean distance between tracks i and j |
| μj | Mean of feature j |
| σj | Standard deviation of feature j |
| pij | High-dimensional similarity probability (t-SNE) |
| qij | Low-dimensional similarity probability (t-SNE) |
| w(i,j) | Edge weight in UMAP graph |
| ∇L | Gradient of loss function |
| η | Learning rate |
| ρ | Spearman correlation coefficient |
Appendix B: Implementation Details
- Language: JavaScript ES6+ (browser-based)
- Framework: React 18
- Rendering: HTML5 Canvas API
- Processing Time: 30-90 seconds for 25,000 tracks
- Memory Usage: approximately 150MB
Reproducibility note: the visualization and whitepaper are intended for direct web access with parameter toggles, genre filtering, and image export support.