Unicode, UTF-8, GEDCOM 5.5.1, GEDCOMs and Macs

If you’re here, it’s because you have questions about Unicode, UTF-8, GEDCOM 5.5.1, or other similar GEDCOM issues on your Mac.

First, a couple of relevant links:
Mac Genealogy Software – Unicode, UTF-8, and GEDCOMs
Conversion of Unicode GEDCOM Files on a Mac

Unicode and UTF-8 – Abbreviations
Before we get started, here are some abbreviations you might see, both in this article and elsewhere.
– UCS = Universal Character Set
– UTF = Unicode Transformation Format

What is Unicode?
Wikipedia provides this definition: “Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world’s writing systems.” I don’t want to go much further into Unicode on its own; as the Wikipedia article shows, it’s an extensive topic.

Why even use Unicode?
Unicode GEDCOMs are needed by those dealing with Asian and Eastern European characters/names and in other situations (including HTML/web-based systems). As a matter of fact, Unicode was supposed to be the default character set in the GEDCOM 6 format, which never materialized. Unicode support has also come up in genealogy software discussions on the internet.

Unicode vs UTF-8 – Differences, Explanation?
Unicode is the character standard itself; UTF-8 is one of several ways to encode Unicode text as bytes. UTF-8 (Wikipedia) is probably the most common form of Unicode on the internet, since it’s extensively used in HTML pages that support Unicode. It is backwards compatible with ASCII, which is partly why it’s so popular and fairly easy to support. UTF-8 encoding was introduced in the draft of GEDCOM 5.5.1 (more on GEDCOM 5.5.1 below). Unicode can be supported without UTF-8 being supported: a program can handle Unicode text through another encoding, such as UTF-16, without reading or writing UTF-8. UTF-8 can also be supported without Unicode being fully supported: a program can read UTF-8 files while handling only part of the Unicode character range. Confusing? Yes, but it’s not necessarily as large an issue as it may seem. For those who need such support, though, it is an issue.
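For readers who want to see the difference concretely, here is a minimal Python sketch. It is only an illustration: the names "Smith" and "Müller" are made-up sample data, and the sketch simply compares the bytes produced by two Unicode encodings.

```python
# Sample names chosen for illustration only.
ascii_only = "Smith"
name = "Müller"

# Pure ASCII text produces identical bytes under ASCII and UTF-8,
# which is what "backwards compatible with ASCII" means in practice.
assert ascii_only.encode("ascii") == ascii_only.encode("utf-8")

# The same Unicode text, written out with two different encodings.
utf8_bytes = name.encode("utf-8")       # "ü" becomes two bytes: 0xC3 0xBC
utf16_bytes = name.encode("utf-16-le")  # each of these characters takes two bytes

print(utf8_bytes)        # b'M\xc3\xbcller'
print(len(utf8_bytes))   # 7
print(len(utf16_bytes))  # 12
```

Both byte strings represent the same Unicode text, so a program that understands only one of the two encodings would misread the other.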

Why are there problems, why is it a big deal? And GEDCOM 5.5.1?
Let me start by explaining why there may be problems, or at least offer a logical reason why some programs lack such support. All currently developed Mac genealogy applications, and all Windows genealogy applications that are in widespread use, support GEDCOM 5.5. GEDCOM 5.5 came out in 1996. Three years later, in 1999, GEDCOM 5.5.1 (draft) was released. Note that it’s a draft. Among other things, GEDCOM 5.5.1 added support for UTF-8 encoding and for internet-based tags such as WWW and EMAIL. It was never formally approved as a standard, but many genealogy applications do support it, as does FamilySearch.org. Unfortunately, ever since 1999, we’ve been in a holding pattern when it comes to GEDCOM formats.
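A GEDCOM file states its encoding in its header, on a level-1 CHAR line such as "1 CHAR UTF-8". The sketch below shows one way a program might read that declaration. The function name declared_charset and the sample file are hypothetical, and the sketch assumes the header is stored in an ASCII-compatible encoding (so it would not work on a UTF-16 file).

```python
import codecs
from typing import Optional


def declared_charset(gedcom_bytes: bytes) -> Optional[str]:
    """Return the value of the header's CHAR line, or None if there is none."""
    # Drop a UTF-8 byte order mark if the exporting program wrote one.
    if gedcom_bytes.startswith(codecs.BOM_UTF8):
        gedcom_bytes = gedcom_bytes[len(codecs.BOM_UTF8):]

    in_header = False
    for raw_line in gedcom_bytes.splitlines():
        parts = raw_line.split(maxsplit=2)  # level, tag, optional value
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == b"0":
            if tag != b"HEAD":
                break  # the header record is over (or was never there)
            in_header = True
        elif in_header and level == b"1" and tag == b"CHAR" and len(parts) == 3:
            return parts[2].decode("ascii", errors="replace").strip()
    return None


# Made-up sample file for illustration.
sample = (
    b"0 HEAD\n"
    b"1 GEDC\n"
    b"2 VERS 5.5.1\n"
    b"2 FORM LINEAGE-LINKED\n"
    b"1 CHAR UTF-8\n"
    b"0 @I1@ INDI\n"
    b"1 NAME Jos\xc3\xa9 /Garc\xc3\xada/\n"
    b"0 TRLR\n"
)

print(declared_charset(sample))  # UTF-8
```

Working on raw bytes rather than decoded text avoids having to guess the encoding before the header has been read.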

Unicode, UTF-8, and Mac OS X
Unicode/UTF-8 isn’t an issue for Mac OS X itself for a simple reason: Mac OS X is considered to be “Unix-like” and has many of the underpinnings of Unix-based systems, and Unicode support is very important for Unix systems. This may also be why the vast majority of Mac OS X genealogy apps have Unicode/UTF-8 support.

Personal Ancestral File 5.2 and UTF-8
To further confuse the issue, Personal Ancestral File 5.2, while officially supporting GEDCOM 5.5, uses UTF-8 internally, and can output UTF-8 GEDCOM files. For those migrating from Personal Ancestral File to Mac OS X genealogy apps, it could be an issue if they need to export with UTF-8.
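Whether this matters for a particular file depends on whether the export contains anything beyond plain ASCII. The following is a rough Python sketch for checking that, under the assumption that the export really is UTF-8; the function name non_ascii_lines and the sample data are made up for illustration.

```python
from typing import List, Tuple


def non_ascii_lines(gedcom_bytes: bytes) -> List[Tuple[int, str]]:
    """List (line number, text) for every line containing non-ASCII characters.

    Raises UnicodeDecodeError if the data is not valid UTF-8.
    """
    # "utf-8-sig" decodes UTF-8 and silently drops a leading byte order mark.
    text = gedcom_bytes.decode("utf-8-sig")
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")  # tolerate Windows line endings
        if not line.isascii():
            hits.append((number, line))
    return hits


# Made-up sample export, starting with a UTF-8 byte order mark.
sample = (
    b"\xef\xbb\xbf0 HEAD\n"
    b"1 CHAR UTF-8\n"
    b"0 @I1@ INDI\n"
    b"1 NAME Jos\xc3\xa9 /Garc\xc3\xada/\n"
    b"0 TRLR\n"
)

for number, line in non_ascii_lines(sample):
    print(number, line)  # 4 1 NAME José /García/
```

If the list comes back empty, the file is plain ASCII and should read the same way whether or not the receiving app understands UTF-8.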

Conspiracy Theories
Let’s get this out of the way now: There is a viewpoint among some in the genealogy community that some genealogy software developers don’t support certain formats, encodings, etc., to keep you locked into their “walled garden” (Wikipedia explanation). Whether that’s true or not, it should not stop them from supporting the import of such GEDCOM files. Of course, if they supported importing such GEDCOM files, there would be no logical reason for them not to support exporting them as well.

Should Developers Support GEDCOM 5.5.1?
Yes. Many genealogy applications support it already, Personal Ancestral File 5.2 supports parts of it, and FamilySearch.org supports it as well. GEDCOM 5.5.1 adds needed tags for websites and email addresses, as well as geolocation information. I’m not a developer, but look at the existing genealogy applications out there: geolocation information alone is very important to most genealogists these days.
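To give a sense of what those additions look like in a file, here is a small Python sketch that scans GEDCOM text for a few of the tags GEDCOM 5.5.1 introduced: WWW and EMAIL, plus MAP, LATI, and LONG for coordinates. The tag list is deliberately partial, and the function name tags_needing_551 and the sample records are invented for illustration.

```python
from typing import Set

# A partial list of tags added in the GEDCOM 5.5.1 draft (not exhaustive).
TAGS_NEW_IN_551 = {"EMAIL", "WWW", "MAP", "LATI", "LONG"}


def tags_needing_551(gedcom_text: str) -> Set[str]:
    """Return which of the listed 5.5.1 tags appear in the given GEDCOM text."""
    found = set()
    for line in gedcom_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        # A line is "level [@xref@] TAG [value]"; skip the optional cross-reference id.
        if parts[1].startswith("@") and len(parts) >= 3:
            tag = parts[2]
        else:
            tag = parts[1]
        if tag in TAGS_NEW_IN_551:
            found.add(tag)
    return found


# Made-up sample records.
sample = """\
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 PLAC Salt Lake City, Utah, USA
3 MAP
4 LATI N40.7608
4 LONG W111.8910
0 @R1@ REPO
1 NAME Example Archive
1 EMAIL archive@example.org
1 WWW https://example.org
"""

print(sorted(tags_needing_551(sample)))  # ['EMAIL', 'LATI', 'LONG', 'MAP', 'WWW']
```

An application that accepts only strict GEDCOM 5.5 may not recognize these lines, which is one way information gets lost when moving between programs.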

GEDCOM X is a proposed new GEDCOM format supported by FamilySearch and a few others. I have not seen anything relating to Unicode and GEDCOM X, but I do wish them well in their endeavor. We need a new standard, and GEDCOM X has better source and record support, including digital records, and better sharing and linking. We have pushed the existing GEDCOM 5.5 and GEDCOM 5.5.1 formats to their limits. Regardless of the conspiracy theories about developers deliberately trying to lock you in to their software, it’s hard for genealogy software developers to maintain strict support for GEDCOM 5.5 while supporting the many types of data that genealogists take for granted these days, such as new types of records and geolocation/GPS information.

These days, you start to lose a lot of information when you export to GEDCOM unless your genealogy application sticks to GEDCOM 5.5/5.5.1 as its format. For some packages, it can reach a point where the problems with moving over to a new genealogy program outweigh the benefits.

I have no doubt that most Mac genealogy software developers would quickly hop on board a better industry-wide standard for the exchange of genealogy information – look at how fast many embraced the iOS (iPad, iPhone, iPod touch) platform. Look at how fast many made the switch from Classic Mac OS 9 to Mac OS X, and from PowerPC to Intel.