Thursday 26 November 2015

Our Days of Future Passed — Part I



In this three-part article, I want to summarise the current state of the STEMMA® research project. Changes on the Web site have been deliberately infrequent of late to enable me to find the time to finish it, but that closure is now in sight. This data-model specification was designed to represent our “days of future passed”,[1] or the future laying-down of our daily events — using less surreal syntax.

The changes in the latest version of the specification were recently summarised at STEMMA V4.0, but I want to take this extra time to put those changes, and the overall philosophy of STEMMA, into a perspective that can be recognised by genealogists and software designers alike.

The original goals of STEMMA were twofold: to develop a data model that could represent what I was already doing in genealogy, and to investigate innovations that were not constrained by legacy products or models. Too often, research for the future is limited by the legacies of the past, but I found that I was in an ideal situation to try new solutions, and to look at software genealogy from a different perspective.

The development of the specification, and of associated software, has involved several iterations, as one would expect from cycles of experimentation. However, as I became a better genealogist my requirements also changed, and that meant unpicking some parts in order to re-knit them differently. Early incarnations were primarily conclusion-based, but as I tried to link conclusions (including so-called “facts”) to their supporting evidence, and ultimately to the information in the underlying sources, I realised that this was a huge field that needed considerable thought, not just a bunch of citations and some data links.

I tried hard to accommodate the main approaches to genealogy in the data model (e.g. family trees), in addition to my own much-broader scope, and to some of the more prevalent software schemes, albeit with much generalisation. I did wonder, several times, whether it was indeed possible to have a single model encompassing all of:

  • Family Trees and pedigrees.
  • Event-based genealogy, where we look at the events in the lives of the persons, or other subjects.
  • One-name and one-place studies.
  • Handling of non-family and non-familial relationships.
  • Generalised micro-history, including additional subjects such as places, animals, and groups.
  • Looking at places as another hierarchical type of subject rather than simply a name, and using a similar approach for groups.
  • Clear separation of conclusions from evidence, and from source material.
  • Personae, including multi-tier ones.
  • Source-based genealogy, where conclusions are built from source information, rather than simply tacking citations onto conclusions.
  • My bottom-up non-goal-directed approach to assimilating sources, described as Source mining.
  • Integration of stories, research, proof arguments and other forms of narrative.
  • Representation of diplomatic transcriptions.
  • International applicability.
  • Extensibility of type systems using namespaces.
  • Generalised approach to sources & citations that accommodates layers, analytical notes, and even attribution.

You may think that such an ambitious set of goals would yield a hugely complex Frankenstein’s monster of a model, but the more I worked on it, the more things would slot into place. At a certain point, a design — any design — reaches a level of order and elegance that compares favourably with its functionality and capabilities, and I believe it’s about there.

Previous attempts to describe STEMMA haven’t gained much traction and this is partly due to the prevailing notion that genealogical software products merely maintain a database of discrete data items. For instance, QuickLesson 20: Research Reports for Research Success, on the Evidence Explained site, relegates software considerations (other than using a word-processor) to the final step: “Step 4: Data entry?” on the basis that you will want to “… cherry-pick individual bits of data and record them in a spread sheet or other data-management software”. If you’re going to read on then you will need to exorcise all such notions and familiarities for this will be fundamentally different!

One of the foundational elements of the STEMMA design is that there are multiple independent sets of linkages within the model. What this means is that the various entities, such as Persons, Places, Events, etc., are linked in multiple ways, each according to some real-world rationale, and these cooperate to deliver a very rich structure. For instance, the lineage of a Person is a set of hierarchical linkages that is independent of any association with Events, and that means that the same model can be applied to a tree-based arboreal genealogy or an event-based history, or a combination of these. Also, the endless ways in which these linkages can be visualised is not prescribed by the data model; that’s the prerogative of the software product.
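As a rough illustration of independent linkages, here is a minimal Python sketch. It is not STEMMA syntax, and every class, field, and key name in it is hypothetical; it only shows how lineage links and Event links can sit side by side on the same entities without either depending on the other.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory entities; names are illustrative, not STEMMA's.
@dataclass
class Person:
    key: str
    name: str
    parent_keys: list = field(default_factory=list)  # lineage linkage only

@dataclass
class Event:
    key: str
    title: str
    person_keys: list = field(default_factory=list)  # event linkage only

people = {
    "p1": Person("p1", "Ann"),
    "p2": Person("p2", "Ben"),
    "p3": Person("p3", "Cat", parent_keys=["p1", "p2"]),
}
events = {
    "e1": Event("e1", "Marriage", person_keys=["p1", "p2"]),
    "e2": Event("e2", "Baptism", person_keys=["p3", "p1"]),
}

def ancestors(key):
    """Follow the lineage linkage; Events are never consulted."""
    found = []
    for parent in people[key].parent_keys:
        found.append(parent)
        found.extend(ancestors(parent))
    return found

def events_for(key):
    """Follow the event linkage; lineage is never consulted."""
    return [e.key for e in events.values() if key in e.person_keys]

print(ancestors("p3"))    # tree-style view of the data
print(events_for("p1"))   # event-style view of the same data
```

Either function could be removed without affecting the other, which is the sense in which a tree view and an event view are both just visualisations of the same data.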

This concept was eventually used to provide another set of linkages that connected conclusions to evidence, to information, and to sources. All the right concepts were there in the earlier incarnations, but it wasn’t until v4.0 that they were connected properly.[2]

On the surface of it, the direction in which the specification has proceeded has widened the scope of a data model far beyond what many genealogists and software vendors have considered, or would like to have considered. Indeed, it was pointed out to me during discussions within FHISO that I have the luxury of not having to worry about backwards compatibility. This is partly why I now wish to illustrate how this one data model can be applied to each of the main genealogical approaches, and implicitly to suggest that these approaches do not have to be exclusive of each other; we need to avoid the little-endian versus big-endian[3] arguments and see that they all have merit.

STEMMA has two notional sub-models: conclusional[4] and informational, and the following sections will make reference to them.

Arboreal Genealogy

Arboreal (tree) genealogy is characterised by a focus on biological lineage. This is often mirrored by an underpinning database schema designed specifically to support a tree-based view of lineage, or of pedigree.

What the diagram below illustrates is that each Person entity in a STEMMA tree can be associated with multiple Source entities, each describing a specific source of information, and encapsulating the relevant resources (such as images, documents, and artefacts) and citations.

A lineage hierarchy for Persons or Animals

Each of those sources can yield Properties — items of extracted and summarised information — for the corresponding Person. For instance, if a person was mentioned in multiple census sources then each of them might yield a different residential address, differing ages (of course), but even conflicting places of birth. Properties are one of two mechanisms for associating information with a subject entity. The other (via the Source entity) is part of the informational sub-model but Properties are part of the conclusional sub-model. That is because they represent normalised information, and any relationship or other subject identification involves a direct connection to a conclusion subject entity, such as another Person. Each Property basically consists of a name and one-or-more values (see Is That a Fact?), and may be used to represent simple information, such as a name or age, or a relationship to another subject, such as a Person or Place. Although each Property can also retain a copy of the associated source fragment, indicating how the information was originally expressed, the overall mechanism is primarily designed for database-orientated products. They are useful for presenting a synopsis of that subject, but they cannot be used for detailed analysis or correlation.

As shown here, they are an ideal mechanism for arboreal approaches where information is directly associated with the relevant Persons. Although this linkage was designed to represent static Properties (ones that do not change over time, such as a blood group), it could be used to represent dynamic ones, such as a marriage date — but more on that later.
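A Property, as described above, might be sketched in Python along the following lines. This is only an assumption about shape, not the STEMMA representation: the class, the field names, the source keys, and the census values are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical shape: a name, one-or-more values, the Source that yielded
# them, and an optional copy of the original source fragment.
@dataclass(frozen=True)
class Property:
    name: str
    values: tuple
    source_key: str
    fragment: str = ""

# Invented sample data: two census sources for one Person.
props = [
    Property("BirthPlace", ("Nottingham",), "census-A", "Nottm"),
    Property("BirthPlace", ("Derby",), "census-B", "Derby, Derbys"),
    Property("Age", (34,), "census-A", "34"),
    Property("Age", (44,), "census-B", "44"),
]

def synopsis(properties):
    """Group Property values by name, keeping track of each source."""
    grouped = defaultdict(list)
    for p in properties:
        grouped[p.name].append((p.source_key, p.values))
    return dict(grouped)

def distinct_values(properties, name):
    """Return the set of differing values recorded under one name."""
    return {p.values for p in properties if p.name == name}

print(synopsis(props))
print(distinct_values(props, "BirthPlace"))
```

A grouping like this is enough to present a synopsis and to flag that two sources disagree, but it carries none of the context needed to decide which one is right.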

STEMMA V4.0 introduced the Animal entity as another subject type, in addition to the existing Person, Place, and Group. Some might ask ‘why animals’ but they are important to a great many people’s history. If anyone ever writes about me in the (far-off) future, and fails to mention my dogs, then I would haunt their hard-drive. Interestingly, it wasn’t difficult to generalise the software support for Person entities to include Animal entities; they both have biological lineage, and STEMMA’s name support already coped with their differences.

The astute reader may be asking where marriages fit into this arboreal scheme. It’s true that I mentioned handling a marriage date as a static Property, but ideally they would be handled as Events (next section), along with every other thing that happened in their lives. Not making it a fundamental part of a tree actually made the inclusion of Animals easier since it emphasised that marriage is not a prerequisite for lineage — trying to blend the concepts together will fail, and quickly so!

One subtle but important point to note here: there is no “STEMMA tree”, per se; a tree is just a way of visualising the hierarchical linkages associated with lineage. All Person entities may or may not be linked in such a way, and that implicitly means that a STEMMA Document (i.e. a file) can describe multiple independent trees.

Just as Persons and Animals share many characteristics, and especially their lineage-based hierarchies, so too do Groups and Places; they both have a type of hierarchy that is time-based. With lineage, every subject has just two parent subjects — one male and one female — but with organisational hierarchies, each subject has just one parent that may change over time.
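The difference between the two kinds of hierarchy can be sketched in Python as follows. The structures and names are hypothetical (including the made-up places), and the year-based date handling is a simplification chosen for brevity rather than anything taken from the specification.

```python
from dataclasses import dataclass, field
from typing import Optional

# Lineage: at most two fixed parents, one of each sex.
@dataclass
class Person:
    key: str
    father_key: Optional[str] = None
    mother_key: Optional[str] = None

# Organisational: one parent at a time, but the parent may change.
@dataclass
class ParentLink:
    parent_key: str
    start_year: Optional[int] = None  # None means open-ended
    end_year: Optional[int] = None

@dataclass
class Place:
    key: str
    parents: list = field(default_factory=list)

    def parent_in(self, year):
        """Return the key of the parent that applied in the given year."""
        for link in self.parents:
            started = link.start_year is None or link.start_year <= year
            not_ended = link.end_year is None or year <= link.end_year
            if started and not_ended:
                return link.parent_key
        return None

# Invented example: a village transferred between two counties.
village = Place("village", parents=[
    ParentLink("oldshire", end_year=1973),
    ParentLink("newshire", start_year=1974),
])

print(village.parent_in(1900))
print(village.parent_in(1980))
```

The same `ParentLink` idea would apply to a Group, such as a regiment that moved between brigades.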

The following diagram illustrates how a place hierarchy has a very similar relationship to sources and Properties in STEMMA.
An organisational hierarchy for Places or Groups

Event-Based Genealogy

Events are something that happened in a given place on a particular date, or range of dates. Event-based genealogy gives a more dynamic representation of information related to Persons, or other subjects, and so is more applicable to family history than to genealogy in its limited literal sense (i.e. lineage).

Organising information both by geography and by time is an essential step in the representation of history. The following diagram illustrates a single Event that is supported by two sources. As above, the associated Source entities can embrace multiple resources and citations. The diagram shows that these sources may make reference to subjects of each of the types supported by STEMMA: persons, animals, places, and groups; but they can now yield dynamic Properties rather than the static ones mentioned above. That is, each of the Property values can be traced to a particular time and place via the Event entity and its supporting sources.

Event linkages to the relevant subject entities

The Event entity is still part of the conclusional sub-model, even though the Source entities encapsulate details of the supporting sources. For instance, two mentions of a marriage date or place, say from a certificate and from a newspaper announcement, may differ slightly, and yet the Event entity would represent the conclusions about the true details.

Note, too, that each of the subject types in the above diagram indicates that it is still part of its respective hierarchy. In other words, the Event linkages are independent of the hierarchical linkages of those subjects.

Unusually, STEMMA Events are also hierarchical. This means that a complex event (one with structure that can be broken into separate phases or layers) can be represented as a whole. A simple example of this involves a voyage event whose embarkation and disembarkation occurred at different times and places, and which can be represented as child events.
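The voyage example might be sketched in Python like this. The `Event` class here is hypothetical, the ports and dates are invented, and deriving the parent's date range from its children is my own illustrative choice rather than a rule stated in the specification.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Hypothetical hierarchical event: any event may own child events.
@dataclass
class Event:
    title: str
    place: Optional[str] = None
    when: Optional[date] = None
    children: list = field(default_factory=list)

    def walk(self, depth=0):
        """Yield (depth, event) for this event and all its descendants."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)

    def span(self):
        """Earliest and latest dates found anywhere in this hierarchy."""
        dates = [e.when for _, e in self.walk() if e.when is not None]
        return (min(dates), max(dates)) if dates else None

voyage = Event("Voyage", children=[
    Event("Embarkation", place="Port A", when=date(1850, 3, 1)),
    Event("Disembarkation", place="Port B", when=date(1850, 4, 12)),
])

for depth, event in voyage.walk():
    print("  " * depth + event.title)
print(voyage.span())
```

The parent event carries no place or date of its own here; it simply holds the two phases together so that the voyage can be referenced as a whole.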

Narrative Genealogy

Narrative genealogy involves the use of humanly-generated natural language to describe the persons, and other subjects, in our history, as well as all the events that touched them. In common with a number of other people, I strongly believe that software cannot generate anything resembling readable narrative, and that advocates demonstrate more misplaced pride than real-life use-cases.

Narrative can be used for essays, notes, reports, and many other purposes, but STEMMA also includes transcriptions. This includes their structural and presentational aspects such as paragraphing and line-counting, original emphasis such as underlining or italics, corrections and other annotation, and marginalia or footnotes. Since transcribed extracts will often appear in essays or reports, narrative and transcription are both supported as a single feature.

I will continue discussing this genealogical approach in Part II of this series.

Source-Based Genealogy

Source-based genealogy involves a focus on the source, and the assimilation of the information therein. For instance, working with a simple birth certificate might yield the names of the child and parents, mother’s maiden name, father’s occupation, birth sex, the date and place of birth, name and residence of the informant, and the date of registration. Beginning from the source means that we can organise our copies of the information (usually images and transcripts), create a source citation, and have all of that information available before we start any detailed analysis.

Conversely, and with online genealogy especially, the norm is to cherry-pick selected names that have been extracted and entered into some index for the user’s benefit. This divorces those names from any context associated with the source, and so is insufficient for a detailed analysis. Unfortunately, the underlying source is too-often ignored leaving users working with only partial information. It also means that citations are generally an afterthought.

I will continue discussing this genealogical approach in Part III of this series.

Software Design

I want to round-off the first part of this series of blog-posts by making some observations about genealogical software.

There are two broad approaches to any software design: the first involves designing the code to provide specific product functionality, usually as dictated by some product manager. The second involves taking a step-back and designing for the bigger picture. This usually involves a software architect and results in a more adaptable design with greater potential for evolution. A case where the former has happened in genealogy is where products were designed to support trees, and hence the biological lineage of persons. Notwithstanding that lineage is not a true tree, those designs then found it hard to represent history, evidence & sources, geography, reports & essays, or anything other than persons (see The Lineage Trap).

A User Interface (UI) is a crucial part of a software product, not just because it can make a product easier or harder to use, but because a well-designed UI can give a sense of the physicality of the data being manipulated. When the computer world introduced Graphical User Interfaces (GUIs), it became possible not only to depict things using pictures rather than text, but also to give graphical control to the end-user. That meant the ability to do such things as drag-and-drop or manipulate parts of a picture. A simple example might be to indicate a data relationship by drawing a line between two entities on the screen, as opposed to filling in a textual field. Unfortunately, genealogical products tend to use a lot of form-fill, and present a bunch of boxes rather than a tangible UI. Part of the reason for this may be that such UIs are harder to create for the world of the Web, and harder to use on hand-held devices. A consequence, though, is that those products largely solicit conclusions. When asked to provide details of a spouse, say, an end-user is typically invited to provide name, date-of-birth, etc., without having to say where the information came from. At best, the user can tack on some citation, or electronic bookmark.

Although STEMMA was initially conceived as supporting import/export or long-term storage of data, that quickly became a secondary feature. Its deep level of representation meant that no database-orientated product could index it adequately to achieve its full potential. However, indexing it into memory, on-the-fly, meant that (a) full and efficient indexing was possible, (b) no import/export was necessary as the definitive source format could be exchanged, and (c) no special consideration was needed for long-term storage or backup of database content. The article Do Genealogists Really Need a Database? explained why reliance on a conventional database is folly: it introduces performance degradation, risk of corruption, and incompatibility between different database vendors or proprietary schemas, and it forces the need to invent other representations for import/export, etc.
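To give a feel for what indexing into memory on-the-fly can look like, here is a minimal Python sketch using the standard-library XML parser. The sample document is deliberately generic and is not the STEMMA schema: the `doc`, `entity`, and `link` elements and their attributes are invented purely for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample; these element and attribute names are NOT STEMMA's.
SAMPLE = """
<doc>
  <entity kind="person" key="p1"><link to="s1"/></entity>
  <entity kind="person" key="p2"><link to="s1"/><link to="s2"/></entity>
  <entity kind="source" key="s1"/>
  <entity kind="source" key="s2"/>
</doc>
"""

def build_indexes(xml_text):
    """Parse the text once and build look-up tables held only in memory."""
    root = ET.fromstring(xml_text.strip())
    by_key = {}                          # key -> parsed element
    by_kind = defaultdict(list)          # kind -> keys, in document order
    referenced_by = defaultdict(list)    # target key -> keys linking to it
    for entity in root.iter("entity"):
        key = entity.get("key")
        by_key[key] = entity
        by_kind[entity.get("kind")].append(key)
        for link in entity.iter("link"):
            referenced_by[link.get("to")].append(key)
    return by_key, by_kind, referenced_by

by_key, by_kind, referenced_by = build_indexes(SAMPLE)
print(by_kind["person"])     # every person in the file
print(referenced_by["s1"])   # every entity that cites source s1
```

The file itself remains the definitive copy; the indexes are rebuilt whenever it is loaded, so there is nothing separate to export, migrate, or back up.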




[1] No, not the X-Men film title, which uses “past” rather than “passed”; I am from a different generation. The title borrows heavily from the 1967 concept album called Days of Future Passed by the English rock group: The Moody Blues, of whom I was, and still am, a huge fan.
[2] In the words of British comedians Morecambe & Wise, I wasn’t “playing all the wrong notes”, I was “playing all the right notes but not necessarily in the right order”.
[3] This terminology comes from the satirical novel Gulliver's Travels by Irish writer and clergyman Jonathan Swift, in which two religious sects of Lilliputians are divided between those who crack open their soft-boiled eggs from the little end, and those who crack from the big end.
[4] The word conclusional is not in most English dictionaries. The usage here ("of or pertaining to a conclusion") may be found in: Bryan A. Garner, Garner on Language and Writing: Selected Essays and Speeches (American Bar Association, 2009), p.330, where it compares the use of: conclusory, conclusional, and conclusionary.

Sunday 22 November 2015

STEMMA V4.0



A little later than I had expected, but I have now completed the changes necessary for STEMMA V4.0. This specification is now published on the STEMMA Web site and is anticipated to be the last major revision necessary for this micro-history data model (small refinements continuing).

The main focus of this change has been the correct separation of conclusion from information and evidence, and allowing them to support drill-down (inspecting a conclusion to see the associated how and why), and to support the alternative bottom-up approach of Source Mining. Although this has been a goal from the earliest work on this project, the associated research and experimentation hasn’t always taken the correct path, but then that’s the nature of research, and the model is better for it.

Much of the text on the Web site has been revised, often with significant re-wording, and similarly with some of my older blog-posts. Although this particular subject sits between two different worlds (genealogy and software), each with their own vocabulary that may clash or cause ambiguity, I also admit that some of my older word choices were the result of genealogical inexperience.

Changes to the data model include:

  • Introduction of a new Source entity that embraces both Citations and Resources for a particular information source. Citation and Resource entities are now connected to the Source entity rather than to each other.
  • Support for source assimilation & analysis, source mining, and the ability to drill-down on conclusions, all provided via the Source entity.
  • The <References> element, within Events, is now superseded by <SourceLnk> which links to the new Source entity. Enclosed *Ref elements (e.g. <PersonRef>) changed to *Lnk elements for consistency. Removal of the ID attribute introduced in V3.0.
  • Support for cross-source analysis and correlation via a new Matrix entity.
  • Support for a generalised approach to multi-tier personae.
  • Addition of an Animal entity, strongly modelled on the Person entity, including related mark-up and namespaces.
  • <CitationLnk>/<ResourceLnk> from Person, Place, Group, and Event entities, changed to <SourceLnk>.
  • Reviewed the goal of sticking to XHTML tags for presentation, replacement of the <Hi> element with HTML-like ones, and the addition of support for <sup>/<sub> elements, columnar text, simple tables, and indentation.
  • Removal of ‘Unreadable’ mode from the <Anom> element.
  • Support for distinguishing manuscript and typescript transcriptions in the <Text> element. Support for numbering lines and pages in transcriptions. Positional control over annotations such as marginalia.
  • <FromText> element added to <Narrative> in order to share re-usable sections of text. This has meant that the NoteKey attribute, in the semantic mark-up, was no longer required and so was deleted.
  • Categorisation of the layers in a Citation chain.
  • The optional <DisplayFormat> element of the Citation entity has been re-interpreted as a set of pre-formatted language-specific strings. This may exist in addition to the mandatory set of named parameter values, and the two together can also be used as a simple citation-template.
  • The Intrinsic Functions, mentioned at the end of Semantic Mark-up, have been changed to Intrinsic Methods in preparation for defining a run-time object model. The set is also supplemented by ones for accessing subject-entity names.
  • Small changes to subject-entity *-name-mode vocabularies to factor-out a generic name-mode (missing from previous specification).
  • Place coordinates (including bounding shapes) are now time-dependent, the same as any parent-Place link.
  • Added Canton and Colony to place-type vocabulary. The place-type of House is now replaced by Number and Apartment for flexibility.
  • <Quality>, <Reliability>, and <Credibility> elements moved from the Citation entity to the new Source entity.

Although small refinements will continue, I want to concentrate subsequent efforts on describing advantages and philosophy of the data model, and in providing more worked examples.

There will be a series of blog-posts following this one that will provide a high-level introduction in order to set the scene.