non-printing delimiter characters

Sometimes it is convenient to produce a string or text file that contains multiple items, or other rich structure, in a delimited format.

When you have such information, you typically need to parse it back into individual fields. Common formats, such as quotes and commas, or tabs and line breaks, suffer from two problems. First, you need to provide a way to escape out the delimiter character itself, so that it may be present in the data. This complicates the otherwise simple problem of parsing the string. Second, these formats don't provide for nested data.

In the old days before modern networking concepts, text data contained embedded controls that "meant something" rather than being a printed character. The '\n' is a remnant of this concept. However, on PCs the entire 256-character character set was assigned printable glyphs, so sometimes these "control characters" were indeed found in strings. The idea of a reserved range of control characters had been lost.

In Unicode, there is a reserved range of control characters that indeed have no printable glyph and are meant as "controls" in the old sense of the word. Unicode has assigned codes for every character, so you have no business grabbing these reserved characters for other needs. Because they have no printable image, they have no business ever appearing in a string of text.

Instead, these characters are ideal for use as delimiters to separate and mark information stored as text. The Classics library defines the following characters for the purpose:

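static const wchar_t ESC=    L'\x001b';  // U+001B ESCAPE
static const wchar_t Split4= L'\x001c';  // U+001C INFORMATION SEPARATOR FOUR
static const wchar_t Split3= L'\x001d';  // U+001D INFORMATION SEPARATOR THREE
static const wchar_t Split2= L'\x001e';  // U+001E INFORMATION SEPARATOR TWO
static const wchar_t Split1= L'\x001f';  // U+001F INFORMATION SEPARATOR ONE
static const wchar_t Open1=  L'\x0011';  // U+0011 DEVICE CONTROL ONE
static const wchar_t Close1= L'\x0012';  // U+0012 DEVICE CONTROL TWO
static const wchar_t Open2=  L'\x0013';  // U+0013 DEVICE CONTROL THREE
static const wchar_t Close2= L'\x0014';  // U+0014 DEVICE CONTROL FOUR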

There are two kinds of markers here: Splits and Groupers.

A Split character, because it's a control code with no printable representation, has no business being part of a text message, and is therefore free to be used by the higher-level protocol that combines messages. The point is that the Split character should be unique in the passage of text to be parsed. It is then a simple process to split the text up at the Split characters, without worrying about escaping out delimiters that appear within the text.
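To show how little code this takes, here is a minimal sketch of splitting a string at a Split character. The function name split_fields is my own invention for illustration and is not part of the string_marker module; the sketch assumes the delimiter never occurs inside a field.

```cpp
#include <string>
#include <vector>

// Same value as Split1 in the listing above.
static const wchar_t Split1 = L'\x001f';

// Hypothetical helper: cut `text` at every occurrence of `delim`.
// No quoting or escape handling is needed, because the delimiter is
// assumed never to occur inside a field.
std::vector<std::wstring> split_fields(const std::wstring& text,
                                       wchar_t delim = Split1)
{
    std::vector<std::wstring> fields;
    std::wstring::size_type start = 0;
    for (;;) {
        const std::wstring::size_type pos = text.find(delim, start);
        if (pos == std::wstring::npos) {
            fields.push_back(text.substr(start));   // last (or only) field
            break;
        }
        fields.push_back(text.substr(start, pos - start));
        start = pos + 1;                            // step over the delimiter
    }
    return fields;
}
```

Note that commas and quotes inside a field need no special treatment, and an empty field between two adjacent Splits simply comes back as an empty string.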

So, since this system is designed to be general and extendable, what happens if software component A generates strings containing Splits, and passes them to software component B, which uses Splits for its own purpose and prohibits Splits within its underlying data? Allowing for general use seems to undermine the very idea of having a reserved character.

That is why this complete description (along with code known as the string_marker module) is provided. A more elaborate standard is needed so that Splits are indeed generally useful, yet easy and efficient to use.

A component that uses Splits will prohibit the appearance of Splits in its input data. So another component that generated data with Splits needs to escape out the offending characters first. The "escape" process defined here avoids complicating the parser because the offending character does not appear (in an escaped-out context) in the resulting text.

All of the characters listed above must be escaped out. To escape out character X, replace X with ESC Y, where Y is 128+X. For example, Split1 (0x1f) becomes ESC followed by 0x9f. Because ESC is itself escaped out, any existing ESC Y is replaced with ESC ESC-prime Y, which does not alter Y, but introduces another ESC-prime (128+ESC, or 0x9b) which preserves the level of escaping, and can be easily and precisely reversed. Code that simply looks for the Split character can be oblivious to the presence of escaped-out Splits within its contained data.

The functions escape and unescape in the string_marker module perform this function.
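The rule can be sketched in a few lines. This is my own illustration of the scheme as described, not the string_marker source; the names is_reserved, escape_markers and unescape_markers are hypothetical, and the sketch assumes that a stray ESC at the very end of the input is simply copied through.

```cpp
#include <string>

static const wchar_t ESC = L'\x001b';

// True for the nine reserved characters in the listing above:
// Open1..Close2 (0x11-0x14) and ESC, Split4..Split1 (0x1b-0x1f).
bool is_reserved(wchar_t c)
{
    return (c >= 0x11 && c <= 0x14) || (c >= 0x1b && c <= 0x1f);
}

// Replace every reserved character X with the pair ESC, 128+X.
std::wstring escape_markers(const std::wstring& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        const wchar_t c = in[i];
        if (is_reserved(c)) {
            out += ESC;
            out += static_cast<wchar_t>(c + 128);
        } else {
            out += c;   // includes any 128+X left by an earlier pass
        }
    }
    return out;
}

// Undo exactly one level of escaping: ESC, Y becomes Y-128.
std::wstring unescape_markers(const std::wstring& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == ESC && i + 1 < in.size()) {
            out += static_cast<wchar_t>(in[i + 1] - 128);
            ++i;        // consume the character after ESC
        } else {
            out += in[i];
        }
    }
    return out;
}
```

Escaping the three-character string "a", Split1, "b" gives "a", ESC, 0x9f, "b", which contains no Split1. Escaping that result again gives "a", ESC, 0x9b, 0x9f, "b", and each call to unescape_markers peels off exactly one level.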

Although this works in the general case, it's a little cumbersome for data that is essentially hierarchical in the first place. So, another concept, Groupers, is also available.

While Splits are used to simply separate data, Groupers surround data in balanced pairs, and therefore easily nest. There are two Groupers, which I think of as two different kinds of parentheses (like () and {}) which nest independently.

When looking for Splits, balanced pairs of Groupers are skipped. So a Split within a group is ignored. Only a Split at the same level is significant. Likewise, looking for balanced Groupers will automatically skip nested Groupers contained within them. The functions scan_for and scan_for_match are used to look for Splits and balanced Group pairs, respectively, while ignoring internal nested groupings.
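The scanning rules above might look something like the following. This is only a sketch of the idea: I have not reproduced the real signatures of scan_for and scan_for_match, so the names find_top_level_split and find_matching_close, and their parameters, are hypothetical. It also assumes the input is well formed (every opener has its closer) and that scanning starts outside any group.

```cpp
#include <string>

static const wchar_t Open1 = L'\x0011', Close1 = L'\x0012';
static const wchar_t Open2 = L'\x0013', Close2 = L'\x0014';
static const wchar_t Split1 = L'\x001f';

typedef std::wstring::size_type pos_t;

// Index of the first `split` at or after `from` that lies outside every
// group, or npos if there is none. Each kind of Grouper is counted
// separately, since the two kinds nest independently.
pos_t find_top_level_split(const std::wstring& s, wchar_t split, pos_t from = 0)
{
    int depth1 = 0, depth2 = 0;
    for (pos_t i = from; i < s.size(); ++i) {
        const wchar_t c = s[i];
        if (c == Open1) ++depth1;
        else if (c == Close1) --depth1;
        else if (c == Open2) ++depth2;
        else if (c == Close2) --depth2;
        else if (c == split && depth1 == 0 && depth2 == 0) return i;
    }
    return std::wstring::npos;
}

// `open_pos` must index an Open1 or Open2. Returns the index of the
// closer that balances it, skipping nested groups of the same kind,
// or npos if there is no opener there or no matching closer.
pos_t find_matching_close(const std::wstring& s, pos_t open_pos)
{
    if (open_pos >= s.size()) return std::wstring::npos;
    const wchar_t open = s[open_pos];
    wchar_t close;
    if (open == Open1) close = Close1;
    else if (open == Open2) close = Close2;
    else return std::wstring::npos;

    int depth = 0;
    for (pos_t i = open_pos; i < s.size(); ++i) {
        if (s[i] == open) ++depth;
        else if (s[i] == close && --depth == 0) return i;
    }
    return std::wstring::npos;
}
```

Given the text x Open1 a Split1 b Open1 c Close1 Close1 Split1 y, the first function skips the Split1 inside the group and reports the one just before y, and the second pairs the outer Open1 with the second Close1.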

Page content copyright 1998 by John M. Dlugosz. Home:http://www.dlugosz.com, email:mailto:john@dlugosz.com