Debugging Gapless MP3 Playback with the System Shock (2023) soundtrack

Sunday, 3 September 2023

On many music albums, each track is supposed to play directly into the next, with no gap. Silence between tracks can be added deliberately when the album is produced, but it is not required. Where one track leads directly into the next, a gap is undesirable. When such an album is played on an actual CD player, for example, playback is continuous and the track breaks are not audible.

In the early days of the MP3 format, if you ripped a CD to MP3, there would always be gaps between the tracks. It is technically challenging to encode MP3 files so that they play continuously. But gapless playback is important. Even a very short silence between tracks is noticeable and annoying when it is unintended. Therefore the MP3 format was extended to add metadata ("enc_delay" and "enc_padding") which indicates how many samples should be removed from the beginning and end of the track so that there is no gap. Separately, work was carried out to add a true gapless mode (e.g. LAME's --nogap option), and other music file formats such as Vorbis and FLAC were designed for gapless playback from the beginning.
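As a minimal sketch of the idea (not any particular decoder's implementation), trimming decoded audio with these two values might look like the following. The function name `trim_gapless` and the representation of the audio as a plain list of samples are assumptions made for illustration, and the sketch ignores any additional delay introduced by the decoder itself.

```python
def trim_gapless(samples, enc_delay, enc_padding):
    """Drop enc_delay samples from the start and enc_padding from the end."""
    end = len(samples) - enc_padding
    # Guard against metadata that claims more padding than there is audio.
    return samples[enc_delay:max(end, enc_delay)]


# Ten decoded samples, of which the first 2 and last 3 are encoder filler.
print(trim_gapless(list(range(10)), 2, 3))  # prints [2, 3, 4, 5, 6]
```

If every track is trimmed this way, the last real sample of one track is followed immediately by the first real sample of the next.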

The simplest and most common way to achieve gapless playback with MP3 files is to use "enc_delay" and "enc_padding". The feature was introduced by LAME in October 2001 and many encoders and decoders now support it. In fact, I had thought that support was almost universal, and it was only when I began to investigate the topic that I found that media playing software often doesn't support gapless playback correctly. I also found out a lot about how MP3 actually works, and how the basic standard (created in 1991) was extended over the years. MP3 is now so old that the patents have expired, and there are better music file formats, but MP3 continues to be supported everywhere and is still an important standard.

I usually use Foobar2000 as a music player, both on my desktop PC and mobile phone. Foobar supports gapless playback of MP3 files and many other formats. This has worked perfectly for years, so it's very jarring to hear gaps in an album. It normally means that the album was encoded a long time ago, and needs to be ripped again with modern software.

I certainly didn't expect to hear gaps in a new MP3 album, but this is exactly what happened when I bought the System Shock (2023 remake) soundtrack by Jonathan Peros. I'd completed the game and particularly enjoyed the music, which is very different to the music in the original version of the game. Each level has two soundtracks, one quite ambient and atmospheric, another upbeat and energetic. The game switches between them depending on the context, and on the album, the atmospheric "exploration" music mixes into the "combat" music for each level, then back to the "exploration" music for the next level.

There was a choice about whether to buy the album from Bandcamp or Steam. Bandcamp music is always available in lossless formats and CD quality is the minimum, whereas you take your chances with Steam. However, Steam was cheaper. The Steam version of the soundtrack album turned out to be in MP3 format, though at the maximum bitrate (320 kbit/s), and this seemed fine, since I already knew from ABX testing that I'm quite unable to distinguish between lossless formats and high-bitrate MP3s when they are played normally.

I was definitely surprised to hear gaps between the tracks: for example, between tracks 2 and 3, the music for the "Medical" level segues from "exploration" to "combat". This is supposed to be a continuous transition, but when played in foobar2000 (on PC or mobile), there is an audible gap of 65 milliseconds between the tracks. The gap is visible in Audacity if I record the output of Foobar:

65ms gap between tracks 2 and 3

I wanted to fix this problem, so I began to try to find out how gapless playback actually worked for MP3 files, and why it wasn't working in this case.

Initially, I assumed the gapless metadata was missing. In a typical file, you can see fields like "ENC_DELAY" and "ENC_PADDING" here:

ENC_DELAY and ENC_PADDING in a correctly encoded MP3 file

These seemed to be missing from the tracks I had downloaded:

ENC_DELAY and ENC_PADDING missing from System Shock soundtrack file

How are "enc_delay" and "enc_padding" actually stored in the MP3 file? At first, I thought they might be part of the "ID3" metadata which contains the track name and the artist, but this is not the case: actually, "enc_delay" and "enc_padding" are embedded in the first "frame" of the MP3 data, as part of the "LAME tag". The "LAME tag" is itself an extension to the "Xing tag", which was introduced to support variable bitrate (VBR) MP3 files. They're not part of the metadata as such - rather, they are stored in unused space within a silent MP3 frame at the beginning of the file. Here's a detailed description of the contents of the Xing and LAME tags. Unlike the ID3 metadata, the tag is not supposed to be edited by the user after the MP3 file has been created.

I found that I could restore "enc_delay" and "enc_padding" by using the "Fix VBR MP3 Header..." feature in Foobar, but the values would be 0, and I would have to do some manual steps to restore the correct values. This would be quite time-consuming.

I also noticed that the Xing/LAME tag was present in the files. In track 2, the first MP3 frame begins at byte 0x67d35, and the following hex dump clearly shows the "Info" and "LAME" text (highlighted in red):

Xing/LAME tag is present in the file

From the description of the format here, I know that enc_delay and enc_padding appear at an offset of 0xb1 from the beginning of the MP3 frame (highlighted in blue). They are not zero! The three bytes containing the data are highlighted in green: the values are 0x240 (576) and 0x68a (1674). The question is then: why isn't Foobar able to make use of this data?
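The three bytes hold two 12-bit numbers packed back to back. Here is a minimal sketch of unpacking them, assuming an MPEG-1 stereo frame (where the Xing/Info tag starts 36 bytes into the frame, after the 4-byte header and 32 bytes of side information). The function name `read_gapless_info` is made up for this example, and the test frame is synthetic: its three bytes (0x24, 0x06, 0x8a) are simply the ones implied by the values quoted above.

```python
def read_gapless_info(frame):
    """Return (enc_delay, enc_padding) from the first MP3 frame, or None.

    Assumes an MPEG-1 stereo frame; other modes place the tag elsewhere.
    """
    xing_offset = 36        # 4-byte header + 32 bytes of side information
    gapless_offset = 0xB1   # offset quoted in the text above
    if len(frame) < gapless_offset + 3:
        return None
    if frame[xing_offset:xing_offset + 4] not in (b"Xing", b"Info"):
        return None
    b0, b1, b2 = frame[gapless_offset:gapless_offset + 3]
    enc_delay = (b0 << 4) | (b1 >> 4)         # upper 12 bits
    enc_padding = ((b1 & 0x0F) << 8) | b2     # lower 12 bits
    return enc_delay, enc_padding


# Synthetic frame carrying only the tag marker and the three gapless bytes.
frame = bytearray(1044)
frame[36:40] = b"Info"
frame[0xB1:0xB4] = bytes([0x24, 0x06, 0x8A])
print(read_gapless_info(bytes(frame)))  # prints (576, 1674)
```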

At this point I tried some other MP3 playing programs. I was surprised to find out how few of them actually support gapless MP3 playback at all: the two versions of Windows Media Player on my PC can't do it. iTunes can do it, but not when the MP3 files are short (e.g. a few seconds each). I guess that the buffering only allows a single track lookahead. Spotify didn't play local files gaplessly, though it seems to have no trouble when playing tracks from its cloud service. (Local file playback has always seemed barely supported in Spotify, and as to the feature to transfer local files to mobile... consider yourself lucky if it ever works.)

Foobar, Winamp, mpg123 and LAME all support gapless MP3 playback and work correctly even with short MP3 files. However, only LAME (version 3.100) is able to play the System Shock files gaplessly. Foobar and mpg123 are not able to recognise the Xing/LAME tag, and so they ignore enc_delay and enc_padding completely. Winamp is able to recognise the Xing/LAME tag within its "File info" window, but it does not make use of enc_delay and enc_padding information during playback of the System Shock files. iTunes also plays the files with gaps.

The mystery deepened. I spent a while guessing about what might be wrong: for example, could the Xing/LAME tag fields contain errors preventing decoding (e.g. a CRC error)? But comparisons with working MP3 files didn't show up anything significant, and the CRCs appeared to be correct. (I copied the CRC checking algorithm from LAME in order to check this.)

Foobar itself is not easy to debug, as it is not open source. However, mpg123 and LAME are open source. Gapless playback worked in LAME, but not mpg123, so I just needed to compare their behaviour and understand the reason for the difference. I built both of them from source code, using LAME 3.100 and mpg123 1.26.4, and enabled all of the debug options I could find. I saw an interesting message from mpg123:

        Note: Junk at the beginning (0xfbe06400)

This is part of an MP3 frame header. MP3 files are divided into frames, and each frame begins with a 4 byte header followed by a number of bytes which can be calculated using the header. I dug further, and got more information:

        [../src/libmpg123/libmpg123.c:693] debug: read frame
        [../src/libmpg123/parse.c:535] debug: trying to get frame 0 at 0
        [../src/libmpg123/parse.c:1077] debug: doing ahead check with BPF 1044 at 4
        [../src/libmpg123/parse.c:1096] debug: After fetching next header, at 4
        [../src/libmpg123/parse.c:1103] debug: does next header 0x00000000 match first 0xfffbe044?
        [../src/libmpg123/parse.c:1106] debug: No, the header was not valid, start from beginning...
        [../src/libmpg123/parse.c:535] debug: trying to get frame 0 at 1

The System Shock MP3 files are 320 kbit/s MP3s, so each frame is either 1044 or 1045 bytes in size. Stepping through the code with GDB, I realised that the problem is not the Xing/LAME tag itself, but rather the frame after it. mpg123 and Foobar both expect the next MP3 frame to start at the correct place, exactly 1044 or 1045 bytes after the start of the first one. But no frame is present there: instead, that part of the file contains zeroes. As a result, mpg123 and Foobar both assume that the Xing/LAME tag is invalid and ignore it.
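To make the frame size and the look-ahead check concrete, here is a rough sketch in Python. It is a simplification written for this post, not mpg123's actual code: it handles only MPEG-1 Layer III headers, and it accepts any valid-looking next header rather than comparing it against the first one. The names `frame_length` and `next_frame_is_valid` are made up for the example.

```python
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96,
                 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]


def frame_length(header):
    """Length in bytes of an MPEG-1 Layer III frame, or None if invalid."""
    if (header >> 21) != 0x7FF:          # 11-bit frame sync
        return None
    if (header >> 19) & 0x3 != 0x3:      # MPEG version 1 only
        return None
    if (header >> 17) & 0x3 != 0x1:      # Layer III only
        return None
    bitrate = BITRATES_KBPS[(header >> 12) & 0xF]
    sample_rate = SAMPLE_RATES_HZ[(header >> 10) & 0x3]
    if bitrate is None or sample_rate is None:
        return None
    padding = (header >> 9) & 0x1        # one extra byte when set
    return 144 * bitrate * 1000 // sample_rate + padding


def next_frame_is_valid(data, offset):
    """Is there a plausible frame header right after the frame at offset?"""
    length = frame_length(int.from_bytes(data[offset:offset + 4], "big"))
    if length is None:
        return False
    following = data[offset + length:offset + length + 4]
    return frame_length(int.from_bytes(following, "big")) is not None


print(frame_length(0xFFFBE044))  # prints 1044
print(frame_length(0x00000000))  # prints None
```

The header 0xfffbe044 from the debug output decodes to 320 kbit/s at 44100 Hz with no padding byte, which gives the 1044 bytes seen in the log. A header of 0x00000000 fails the sync check, which is why the look-ahead rejects the tag frame.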

For some reason, these files are malformed. The Xing/LAME tag is followed by 680 zero bytes instead of an MP3 frame. This is followed by a second silent MP3 frame. Finally, the actual audio begins. Here's the structure of the file containing track 2, "The Ruined Infirmary".

        0x000000  ID3v2 data
        0x067d35  Silent MP3 frame containing LAME/Xing tag - 1044 bytes
        0x068149  Extra zeroes - 680 bytes (why?)
        0x0683f1  Additional silent MP3 frame - 1044 bytes (why?)
        0x068805  First MP3 frame containing audio
        0xa647d3  End of file

Neither the extra zeroes nor the additional silent MP3 frame is normal. I have searched all of the MP3 files I have, and none of them share this strange feature, though I do have some other files with incorrect Xing/LAME tags. Most commonly, the stream size field is incorrect, though some have CRC errors. The System Shock files appear to have been encoded with LAME 3.101 beta 2, but when I downloaded and tested that LAME version, I found that it encoded files normally. Whatever the problem is, it seems to be systematic, because the number of additional bytes is the same in all of the System Shock files (1724 bytes: 680 zeroes plus a 1044-byte silent frame). This is true even though the number of ID3v2 bytes varies.

While it is hard to explain this encoding error, I can at least fix it. Both the extra zeroes and the extra silent frame need to be removed in order for playback to work correctly. A program can remove this data. It should:

  • search for the LAME/Xing tag (e.g. at 0x67d35)
  • move to the position where the next MP3 frame should start (e.g. at 0x67d35 + 1044)
  • remove bytes until the header of the first MP3 frame containing audio is reached
  • update the stream size, music size, and CRC fields
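
The first three steps can be sketched roughly as follows. This is an illustration written for this post, not the contents of fix-gapless.py: the names `frame_length_at` and `strip_junk_after_tag` are made up, only MPEG-1 Layer III frames are handled, and the `extra_frames=1` default is an assumption matching the layout shown above. It does not update the stream size, music size or CRC fields, and it should only be applied to files with this exact fault, since on a normal file it would remove a real audio frame.

```python
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96,
                 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]


def frame_length_at(data, pos):
    """Byte length of the MPEG-1 Layer III frame at pos, or None if invalid."""
    header = int.from_bytes(data[pos:pos + 4], "big")
    if (header >> 17) != 0x7FFD:         # frame sync + MPEG-1 + Layer III
        return None
    bitrate = BITRATES_KBPS[(header >> 12) & 0xF]
    sample_rate = SAMPLE_RATES_HZ[(header >> 10) & 0x3]
    if bitrate is None or sample_rate is None:
        return None
    return 144 * bitrate * 1000 // sample_rate + ((header >> 9) & 0x1)


def strip_junk_after_tag(data, tag_pos, extra_frames=1):
    """Return (fixed_bytes, removed_count).

    tag_pos is the offset of the frame holding the Xing/LAME tag.
    extra_frames is the number of surplus silent frames after the zeroes.
    """
    tag_len = frame_length_at(data, tag_pos)
    if tag_len is None:
        raise ValueError("no valid MP3 frame header at tag_pos")
    keep_until = tag_pos + tag_len       # where the next frame should start
    pos = keep_until
    while pos < len(data) and data[pos] == 0:    # skip the zero bytes
        pos += 1
    for _ in range(extra_frames):                # skip surplus silent frames
        length = frame_length_at(data, pos)
        if length is None:
            raise ValueError("expected a frame header at offset %d" % pos)
        pos += length
    return data[:keep_until] + data[pos:], pos - keep_until
```

With the layout shown above (680 zero bytes followed by one 1044-byte silent frame), calling `strip_junk_after_tag(data, 0x67d35)` on track 2 would be expected to report 680 + 1044 = 1724 bytes removed.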

I wrote this program in Python, borrowing parts from LAME and mpg123 in order to correctly encode and decode files. Those interested can find it here: fix-gapless.py. I've also collected the MD5 sums of the MP3 files before and after applying the program, so that you can check if you have the same versions of the files that I used.

After processing, the System Shock files play back correctly in Foobar, Winamp, mpg123 and iTunes.

I still can't explain why the System Shock files were encoded like this. Perhaps it is relevant that LAME is a library as well as a program, and various programs use LAME as an encoder. Sometimes this has odd results. I suspect that it is up to the program using LAME to set some of the metadata fields correctly. For example, in older Audacity versions, the "Export Multiple..." option will create MP3 files in which the stream size is the entire size of the input rather than the size of each individual track, even though LAME is used. Perhaps the production software used to make the System Shock soundtrack also has some strange features like this.

In any case, having fixed the files, I am happy with them, and I have also learned more about the MP3 format than I knew beforehand. MP3 has been part of my life since the 1990s, and yet I have never known much about it. I do remember needing to concatenate tracks in order to get them to play gaplessly in the 1990s, and I remember seeing the first variable-bitrate files and how an old version of Winamp was not really able to understand them, the seek bar jumping around crazily as the bitrate changed. Things have moved on since then - though, while variable bitrate is generally supported, gapless playback isn't. Perhaps it's not something that most people notice or care about. But I think it matters, and that it's worth fixing.