Building a lyric search engine

Y’all probably know that my views on AI are somewhat nuanced. I’m not 100% “AI BAD!!!” but I’m also hesitant to rely on AI for a lot of things, and generally do not care for generative AI or any situation where you need AI to “reason” about things.

But recently I’ve been trying to remember the name of a song that I listened to a lot, and the lyrics I can remember don’t come up in any of the major lyrics databases. I listen to a lot of obscure indie music that tends to get lost by the major platforms, and I’ve been packratting music for decades now.

Further, it’s only fairly recently that music started to get lyrics embedded into the ID3 tags (thanks to Bandcamp really pushing for that) and even the streaming platforms have taken forever to pick it up. So a lot of the music I listen to has never had its lyrics entered in any sort of machine-searchable way.

But, hey, there are plenty of AI models for vocal extraction and text transcription… so why not actually use them?

I found two existing Python libraries that could do what I need, Whisper and DEMUCS. Both of them are… fairly badly-written wrappers to AI models (which seem to have been vibe coded using similar programming standards), and they have a lot of jank to work around (for example, their error handling is abysmal), but they got me going well enough.

So, I wrote a pretty basic script which scans a directory of audio files, optionally runs DEMUCS to extract the vocals, runs Whisper to transcribe the lyrics, and plops the results into a text file.
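
The actual script isn't reproduced here, but a minimal sketch of that kind of pipeline might look something like the following. Everything in it is an assumption for illustration: the function names, the extension list, the one-transcript-per-track layout, and the choice of the "base" Whisper model and the "htdemucs" separation model. It assumes the `openai-whisper` package and the `demucs` command-line tool are installed, and that the output directory lives outside the music directory.

```python
import subprocess
from pathlib import Path

# Hypothetical list of extensions to treat as audio; adjust to taste.
AUDIO_EXTENSIONS = {".mp3", ".flac", ".ogg", ".m4a", ".wav"}


def find_audio_files(root):
    """Recursively yield audio files under root, in sorted order."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            yield path


def separate_vocals(track, work_dir, model_name="htdemucs"):
    """Run the demucs CLI and return the path of the isolated vocal stem."""
    subprocess.run(
        ["demucs", "--two-stems", "vocals", "-n", model_name,
         "-o", str(work_dir), str(track)],
        check=True,
    )
    # Assumption: demucs writes <work_dir>/<model>/<track name>/vocals.wav.
    return Path(work_dir) / model_name / track.stem / "vocals.wav"


def whisper_transcriber(model_name="base"):
    """Build a path -> text function backed by a Whisper model."""
    import whisper  # imported lazily so the rest works without it installed

    model = whisper.load_model(model_name)
    return lambda path: model.transcribe(str(path))["text"]


def transcribe_library(root, out_dir, transcribe, use_demucs=False):
    """Write one .txt transcript per track, mirroring the folder layout."""
    root, out_dir = Path(root), Path(out_dir)
    written = []
    for track in find_audio_files(root):
        target = out_dir / track.relative_to(root).with_suffix(".txt")
        if target.exists():
            continue  # already done, so an interrupted run can be resumed
        source = track
        if use_demucs:
            source = separate_vocals(track, out_dir / "_stems")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(transcribe(source), encoding="utf-8")
        written.append(target)
    return written


# Example call (paths are placeholders):
# transcribe_library("music", "lyrics", whisper_transcriber(), use_demucs=True)
```

Passing the transcriber in as a plain function keeps the directory-walking part testable without loading any models, and skipping tracks that already have a transcript matters when a run takes days.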

Here’s a test of some of the output, on my song Behind a Mask. First, here are the original lyrics:

Feeling lost in an elephant’s dream
Caught up in the digital stream
Cave of shadows, can’t say what I mean
A million miles far from home

Best foot forward, take a step back
Play defense while on the attack
Sowing discord, can’t cut them no slack
Everybody so alone

If everyone could see
Themselves through some other eyes
Innocent, caught up in the machinery
An algorithmic disguise

Making everything great again
Forget about your family and friends
Looking for a means to an end
Just don’t want to be alone

Fragile gem, split in two
Internal dialogue of them vs. you
Find acceptance with nothing to do
The voices there are all your own

If everyone could see
Themselves through some other eyes
Innocent, caught up in the machinery
An algorithmic disguise

A deeper body ache, and make a bad mistake
A million jumping on
Sharing a bad take, Nobody can quite shake
The feeling that you’re gone

Feeling lost in an elephant’s dream
Getting washed away by the digital stream
If it feels better to share another meme
Doesn’t make you less alone

If everyone could see
Themselves through some other eyes
Innocent, caught up in the machinery
An algorithmic disguise

If everyone could see
Themselves through some other eyes
Innocent, caught up in the machinery
An algorithmic disguise

Here’s what Whisper did to it on its own (after I reformatted it):

Feeling lost in an elephant's dream
Caught up in the digital stream
Cave of shadows, can't say what I mean
A million miles far from home

Best foot forward, take a step back
Fight offense while on the attack
Sewing discord, can't cut them no slack
Everybody's so alone

If everyone could see
themselves through some other eyes
Innocent, caught up in the machinery
An algorithmic disguise

Making everything great again
Forget about your family and friends
Looking for a means to an end
Just don't want to be alone

Fragile gem split in two
Internal dialogue of them versus you
Find acceptance with nothing to do
The voices there are all your own

If everyone could see
themselves through some other eyes
Innocent, caught up in the machinery
An algorithmic disguise

A deeper body ache, And make a bad mistake
A million jumping on
Sharing a bad shape, Nobody cares what shape
The feeling that you're gone

Feeling lost in an elephant's dream
Getting washed away by the digital stream
If it feels better to share another me
Does it make you less alone?

If everyone could see
themselves through some other eyes
Innocent, caught up in the machinery
An algorithmic disguise
An algorithmic disguise
If everyone could see
themselves through some other eyes
If everyone could see
themselves through some other eyes
Innocent, caught up in the machinery
An algorithmic disguise
Innocent, caught up in the machinery
An algorithmic disguise
What if everyone could see
yourself in a machine array
An algorithmic disguise?
Innocent, caught up in the machinery
An algorithmic disguise
Thank you.

For the most part it’s surprisingly decent, although there are a few things that throw it off, some understandable (such as sowing -> sewing), some a bit less so (such as the bridge, “sharing a bad take, nobody can quite shake” -> “sharing a bad shape, nobody cares what shape”). Then the ending completely falls apart, which I think is the model getting confused by the vocal harmony. It might be trying to process each vocal layer separately or something? But that doesn’t quite make sense since the bridge also has a vocal harmony. It also added “Thank you” to the end; in my other tests I’ve noticed that isolated instruments sometimes cause spurious words to be detected as well.
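
Spotting these by eye gets tedious, so here is a hedged sketch of one way to list the word-level differences between an original line and its transcription, using `difflib` from the standard library. The helper names `words` and `word_diffs` are made up for this example, and the normalization (lowercasing, dropping punctuation, straightening curly apostrophes) is an assumption about what should count as "the same word".

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase, straighten curly apostrophes, and split into words."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def word_diffs(original, transcript):
    """Return (original words, transcribed words) for each mismatched run."""
    a, b = words(original), words(transcript)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# A few line pairs taken from the lyrics and the Whisper-only output above.
pairs = [
    ("Play defense while on the attack", "Fight offense while on the attack"),
    ("Sowing discord, can’t cut them no slack", "Sewing discord, can't cut them no slack"),
    ("Everybody so alone", "Everybody's so alone"),
]
for original, transcript in pairs:
    for expected, heard in word_diffs(original, transcript):
        print(f"{expected!r} -> {heard!r}")
```

Because the comparison ignores case and punctuation, differences like "vs." versus "versus" still show up, but a dropped comma or a lowercase line start does not.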

Anyway. Whisper is trying to do some difficult work here as it has to pick out the vocals from an instrumental backing. But what if we pre-separate the vocals with DEMUCS?

Feeling lost in an elephant's dream
caught up in the digital stream
Cave of shadows, can't say what I mean
a million miles far from home

Best foot forward, take a step back
play defense while on the attack
Sewing discord, can't cut them no slack
everybody's so alone

If everyone could see
themselves through some other eyes
Innocent, caught up in the machinery
an algorithmic disguise

Making everything great again
forget about your family and friends
Looking for a means to an end
just don't want to be alone

Fragile gem, split in two
Internal dialogue of them versus you
Find acceptance with nothing to do
The voices there are all your own

If everyone could see
themselves through some other eyes
Innocent, caught up in the machinery
an algorithmic disguise

A deeper body ache, and make a bad mistake
a million jumping on
Sharing a bad mistake, nobody can part shape
the feeling that you're gone

Feeling lost in an elephant's dream
Getting washed away by the digital stream
If it feels better to share another meme
Does it make you less alone?

If everyone could see
themselves through some other eyes
Innocent, caught up in the machinery
an algorithmic disguise

If everyone could see
themselves through some other eyes
Innocent, caught up in the machinery
an algorithmic disguise

Innocent, caught up in the Become of the Chievery Jethers,
live in the

DEMUCS makes the results significantly better for the most part. The bridge still gets a bit screwed up, but not nearly as badly as without DEMUCS. It also completely invents a couple of lines at the end for some reason, which is again probably it trying to make sense of the harmonies as two separate speakers.
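
To put a rough number on "better", one option is a word error rate: the word-level edit distance between the original line and the transcription, divided by the number of words in the original. The sketch below is only an illustration of that metric, not something the script itself does; the function names and the normalization rules are assumptions.

```python
import re


def words(text):
    """Lowercase, straighten curly apostrophes, and split into words."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the transcript
                current[j - 1] + 1,      # extra word in the transcript
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


original = "Sharing a bad take, Nobody can quite shake"
whisper_only = "Sharing a bad shape, Nobody cares what shape"
with_demucs = "Sharing a bad mistake, nobody can part shape"

print("Whisper alone:", word_error_rate(original, whisper_only))
print("With DEMUCS:  ", word_error_rate(original, with_demucs))
```

On that one bridge line this comes out to 0.5 for Whisper alone and 0.375 with DEMUCS. A single line is far too small a sample to say much, so treat it as a way to compare runs rather than a verdict.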

But anyway, these results are good enough that I should at least be able to build a decent search index on my library, and gosh darnit I’ll be finding that one song I’m trying to remember! Hopefully. And if nothing else I’ll at least have a toy I can put online somewhere.
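
As a rough idea of what such an index could look like, here is a minimal sketch of an inverted index in plain Python. It is an illustration under assumptions rather than a plan: the `LyricIndex` class and its methods are invented for this example, and ranking by the number of matching query words is just one simple way to stay tolerant of misheard words.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, straighten curly apostrophes, and split into words."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


class LyricIndex:
    """Maps each word to the set of tracks whose transcript contains it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, track_id, lyrics):
        for word in set(tokenize(lyrics)):
            self._postings[word].add(track_id)

    def search(self, query, limit=10):
        """Return (track_id, matched word count) pairs, best match first."""
        hits = Counter()
        for word in set(tokenize(query)):
            for track_id in self._postings.get(word, ()):
                hits[track_id] += 1
        return hits.most_common(limit)


index = LyricIndex()
# Transcribed lines from above, including the misheard "me" for "meme".
index.add("Behind a Mask", "Getting washed away by the digital stream "
                           "If it feels better to share another me")
index.add("Some other song", "A million miles far from home")

print(index.search("share another meme digital stream"))
```

Counting matched words instead of requiring an exact phrase means a track can still surface when the transcription got one or two words wrong, which seems to be the normal case here.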

I mean, spot-checking other songs, it’s getting a lot of things completely wrong. So I’m not super optimistic. But it’s a fun experiment. Maybe.

Like, uh, for some of my songs it just completely makes things up, and for some it doesn’t detect any lyrics at all, or the output is just a really long run of “Thank you” ad infinitum.

So, anyway, I’ve had this chewing away at my music library for a couple of days now, and it has so far finished around 7500 tracks. I have around 117963 songs in my library, so it’s gonna be at this for… a while, probably 3-4 weeks in total as a guesstimate.
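
As a sanity check on that guesstimate, here is the back-of-the-envelope arithmetic as a tiny sketch. It assumes "a couple of days" means exactly two and that the processing rate stays constant, neither of which is certain; the function name is made up for this example.

```python
def estimate_days(done, total, days_elapsed):
    """Return (total days, remaining days) at a constant processing rate."""
    rate = done / days_elapsed  # tracks per day so far
    return total / rate, (total - done) / rate


total_days, remaining_days = estimate_days(done=7500, total=117963, days_elapsed=2)
print(f"about {total_days:.0f} days in total, {remaining_days:.0f} still to go")
```

Under those assumptions the total lands a little over four weeks, which is in the same ballpark as the guess; if "a couple of days" was closer to a day and a half, it drops to a bit over three weeks.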

Meanwhile, I’ve ordered parts for a new VR PC, because Furality made me realize just how old this computer is (8 years!) and how overdue it is for a full replacement. So soon this machine can become a dedicated GPU compute server that lives in the basement or something, and I have a bunch of other fun things I want to build that will take advantage of it. Hopefully something a bit more useful than an automatic mondegreen generator, anyway.