TagLib Unicode Madness.

September 21, 2012 @ 21:00 - Development

Lately I’ve been working on a Cocoa MP3 tagger/renamer app that gathers features from various useful programs, none of which made the cut on its own because none had them all. It was all fun and games until I ran into Unicode weirdness when saving tags [via TagLib].

The scenario is:

  • fetch data from the Discogs API into the NSString fields of a model;
  • convert those strings into C strings [const char*] via the UTF8String method of the NSString class;
  • set them on the TagLib::Tag of each file.

Everything was fine for plain English releases [English being almost free of diacritical marks], then I stumbled upon an Italian record: data fetching went flawlessly, but when I persisted the data to files and checked the Xcode console I met Mr. √® [the UTF-8 bytes of è read as MacRoman; è is Italian for the third-person singular of to be]. Shouldn’t UTF8String take care of encoding non-ASCII characters?
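To see where √® comes from, here is a minimal C++ sketch, assuming the console decoded the two UTF-8 bytes of è [0xC3 0xA8] as MacRoman. The helper names macRomanByteToUtf8 and misreadAsMacRoman are made up for illustration, and the lookup covers only the two high bytes involved here, not the full MacRoman table.

```cpp
#include <string>

// Returns the UTF-8 text for one MacRoman byte.
// Illustration only: just 0xC3 and 0xA8 are mapped; a real table
// has 128 entries for the high half of the byte range.
std::string macRomanByteToUtf8(unsigned char byte) {
    if (byte < 0x80) {
        return std::string(1, static_cast<char>(byte)); // ASCII is unchanged
    }
    switch (byte) {
        case 0xC3: return "\xE2\x88\x9A"; // U+221A SQUARE ROOT
        case 0xA8: return "\xC2\xAE";     // U+00AE REGISTERED SIGN
        default:   return "?";            // not covered by this sketch
    }
}

// Treats every byte of the input as a separate MacRoman character,
// which is what happens when UTF-8 data is decoded with the wrong charset.
std::string misreadAsMacRoman(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        out += macRomanByteToUtf8(b);
    }
    return out;
}
```

Feeding it the UTF-8 encoding of è yields the two-character string √®, so the bytes produced by UTF8String were fine; only their interpretation was off.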

At first glance I suspected a library issue, but even the minimal NSLog(@"%s", [@"è" UTF8String]) example was broken! I then tried messing with TagLib parameters, but I had no clue at all; after a googling session, I understood I needed wide characters, which are accepted by the TagLib::String constructor that takes a wstring.
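For context, "wide characters" here means one fixed-size code unit per character instead of a variable number of bytes. A rough sketch of that idea in plain C++ follows; decodeUtf8 is a hypothetical helper, it uses std::u32string rather than TagLib's wstring, and it skips the checks for overlong forms and surrogates that a production decoder would need.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-32: one 32-bit element per code point.
// Simplified: validates lead and continuation bytes only.
std::u32string decodeUtf8(const std::string& utf8) {
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F); // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With this, the two bytes 0xC3 0xA8 collapse into the single code point U+00E8, which is the shape of data a wide-string constructor expects.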

How do you convert an NSString to a wstring?

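// Encode the NSString as UTF-32 in the host byte order (see kEncoding_wchar_t).
NSData* asData = [string dataUsingEncoding:kEncoding_wchar_t];

// Reinterpret the bytes as wchar_t; this assumes a 32-bit wchar_t, as on OS X.
TagLib::wstring ws = TagLib::wstring(
	(const wchar_t*)[asData bytes],
	[asData length] / sizeof(wchar_t)
);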

where wstring is the TagLib-provided implementation of std::wstring [which is not defined on all systems], and kEncoding_wchar_t is defined as follows:

#if TARGET_RT_BIG_ENDIAN
	const NSStringEncoding kEncoding_wchar_t =
	CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingUTF32BE);
#else
	const NSStringEncoding kEncoding_wchar_t =
	CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingUTF32LE);
#endif
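The byte-order switch above matters because the cast to wchar_t* reads the encoded bytes in the host’s native order. Here is a rough sketch of that in plain C++, assuming a 32-bit wide character; encodeUtf32, hostIsBigEndian and readNative32 are hypothetical names used only for this illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Serializes one code point as 4 bytes of UTF-32 in the requested byte order.
std::vector<unsigned char> encodeUtf32(char32_t cp, bool bigEndian) {
    std::vector<unsigned char> bytes(4);
    for (int i = 0; i < 4; ++i) {
        const int shift = bigEndian ? 8 * (3 - i) : 8 * i;
        bytes[i] = static_cast<unsigned char>((cp >> shift) & 0xFF);
    }
    return bytes;
}

// Runtime counterpart of the compile-time TARGET_RT_BIG_ENDIAN check:
// inspects which byte of the integer 1 is stored first in memory.
bool hostIsBigEndian() {
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 0;
}

// Reads 4 bytes as a native 32-bit integer, mimicking what happens when
// an encoded buffer is reinterpreted as an array of 32-bit wide characters.
std::uint32_t readNative32(const std::vector<unsigned char>& bytes) {
    std::uint32_t value = 0;
    std::memcpy(&value, bytes.data(), sizeof value);
    return value;
}
```

Encoding è with the host’s own byte order and reading it back gives 0xE8 again; picking the opposite order would give 0xE8000000 instead, which is why the encoding constant has to follow the target’s endianness.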

Finally I could build the desired string:

TagLib::String tagString(ws, TagLib::String::Type::Latin1);

and set it on the files’ tags, which are correctly saved and rendered by the program itself, by QuickLook and by external media players.

One mystery remains: I had to use the TagLib::String::Type::Latin1 encoding flag rather than the expected TagLib::String::Type::UTF8, which threw a “Unicode conversion error” exception. Maybe I will ask Stack Overflow later.