See more on Chunks in general.
A key definition of this KEYD subtype is used for accepting a passphrase from the user. The String To Key Algorithm then produces the key from the user’s input.
The payload definition consists of a Prompt string, Salt, and Stretchcount.
The prompt string is an lpstring containing the text to display to the user. It identifies which passphrase is being requested and means nothing to the algorithm. The user enters the passphrase in response to this prompt.
The prompt string is encoded as UTF-8 text, like any lpstring, and additionally uses the character codes hex 1A and hex 1B to support multi-language prompts.
If the prompt string begins with a hex 1A character, it is followed by a language specifier, terminated by a hex 1B character, and then the prompt in that language. Then it may contain another hex 1A character introducing another such unit. The best match is displayed, based on the user’s settings. An empty language string will always match the user’s settings, but always be the worst available match.
The language specifier is specified using RFC 3066, Tags for the Identification of Languages.
For example, using C string literal syntax to specify the control codes, “\x1Aen-US\x1BEnter your supervisor password\x1Afr\x1BEntrez votre mot de passe de surveillant”.
If the string does not begin with the hex 1A control code, then it may not contain any control codes and the entire string is taken as the prompt, unconditionally.
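The prompt rules above can be sketched in Python. This is a minimal illustration, not a reference implementation: the names parse_prompt and choose_prompt are hypothetical, and the matching rule (exact tag, then same primary subtag, then the empty tag) is an assumption, since the text only says that the best match is displayed.

```python
LANG_START = "\x1a"  # introduces a language specifier
LANG_END = "\x1b"    # terminates the language specifier


def parse_prompt(raw):
    """Split a decoded prompt string into a {language: prompt text} dict.

    A string that does not begin with hex 1A is one unconditional prompt,
    stored here under the empty language tag.
    """
    if not raw.startswith(LANG_START):
        if LANG_START in raw or LANG_END in raw:
            raise ValueError("control codes are not allowed in a plain prompt")
        return {"": raw}
    prompts = {}
    # Each unit after a hex 1A is "<language>\x1b<prompt text>".
    for unit in raw[1:].split(LANG_START):
        language, sep, text = unit.partition(LANG_END)
        if not sep:
            raise ValueError("language specifier not terminated by hex 1B")
        prompts[language.lower()] = text
    return prompts


def choose_prompt(prompts, user_language):
    """Assumed matching rule: exact tag, same primary subtag, then ""."""
    wanted = user_language.lower()
    primary = wanted.split("-")[0]
    if wanted in prompts:
        return prompts[wanted]
    for language, text in prompts.items():
        if language and language.split("-")[0] == primary:
            return text
    if "" in prompts:
        return prompts[""]
    # Nothing matched at all; fall back to the first prompt listed.
    return next(iter(prompts.values()))
```

With the two-language example above, a user whose setting is fr-CA would be shown the French prompt under this assumed rule.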
The prompt is wrapped in appropriate user-interface elements by the program, so it should not contain a trailing delimiter or other “prompt” marks. These will be added as appropriate by the user interface in use, whether it be a dialog box, a console input prompt, or something else.
The salt is an lpbinary value containing some random bytes. Different salt values used with the same passphrase will generate different keys.
The Stretchcount is based on the treatment in Secure Applications of Low-Entropy Keys by J. Kelsey, B. Schneier, C. Hall, and D. Wagner. It is a uintV containing the value t such that the iteration count used in the string-to-key algorithm is 2^t (2 raised to the power t). For example, encode the value 7 to indicate an iteration count of 128. The iteration count can be interpreted as the number of bits to stretch a low-entropy password (or perhaps two times the number of bits?).
The user’s input is taken as a UTF-8-encoded string. It does not include any terminating end-of-line character or nul, even though hitting the ENTER key may be part of the user interface for inputting the string.
The string shall be encoded in Unicode in Normalization Form D. This is important, since different systems may encode characters differently, and the user will just type the proper keys on the keyboard. If the archive creator used a program that expressed, for example, “é” using the single precomposed character U+00E9 and then gave the archive to someone who tried to extract it using a system whose user interface expressed the same character using separate base and accent codes (U+0065 U+0301), the strings would not match even though both users typed the same passphrase! So, the implementation must fully expand all characters into base plus modifiers and put the modifiers in canonical order.
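As a small illustration of why this matters, the sketch below uses the normalize function of Python's standard unicodedata module. The helper name normalize_passphrase is hypothetical; the point is only that both spellings of “é” produce the same bytes once Form D is applied.

```python
import unicodedata


def normalize_passphrase(text):
    """Put the text into Normalization Form D, then encode it as UTF-8."""
    return unicodedata.normalize("NFD", text).encode("utf-8")


precomposed = "caf\u00e9"   # é as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by combining acute accent U+0301

# The raw strings differ, but their normalized byte forms are identical.
same_after_nfd = normalize_passphrase(precomposed) == normalize_passphrase(decomposed)
```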
All Unicode code points are allowed except for U+0000, U+FFFF, the range U+D800—U+DFFF, and the range U+0001—U+001F except for TAB and LF which are allowed. The largest Unicode code value is U+10FFFF. Although UTF-8 can encode up to 31-bit values, values larger than the Unicode range are not allowed.
The range U+D800 to U+DFFF is for Surrogate Pairs, which are 2048 values reserved for representing codes larger than U+FFFF in UTF-16 encoding. Any occurrence of a value in this range, when using 16 bits per value, is a way of escaping-out values that don’t fit into 16 bits. Codes in this range, when encoded in a way that could represent them (e.g. 32 bits per value, or UTF-8 variable-length), are illegal. If the program used UTF-16 internally to gather the prompt (e.g. Windows uses UTF-16LE natively), it could be a serious burden on the implementation to support characters in this range.
The characters U+0000 and U+FFFF are not allowed because the implementation may use them as flags. For example, a program may use a nul-terminated string rather than a counted string, and these cannot contain the nul character.
The C0 control codes are not allowed (except for LF and TAB) so they may be used as special control codes or reserved values for the implementation. For example, it may use a ctrl-G (U+0007) as an escape code to indicate that the next 4 characters are the hex value of a code point, as discussed in the notes below. When processing the string, the implementation has a set of “safe” code values that may be used for delimiters without worrying about how to escape them should they happen to appear in the actual string.
The LF character (U+000A) and the TAB character (U+0009) are permitted, as these are commonly used in text and easily typed. They represent a line-break and the Tab key on the keyboard, respectively. If the passphrase contains multiple lines, with linebreaks as part of the passphrase (e.g. the user types Enter within a multi-line text box), then each linebreak should be represented as a U+000A character. It is important that different implementations don’t represent this in different ways, since that would cause confusion and invisible interoperability problems. So, take the end-of-line marks from the text you get back from the user interface and convert them to LFs, if the user interface uses something different (e.g. Windows uses CR LF in its text widgets).
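The end-of-line conversion might look like the following sketch. The function name to_lf is hypothetical, and treating a lone CR as a line break is an assumption; the text above only names CR LF as an example of what a user interface may return.

```python
def to_lf(text):
    """Convert end-of-line marks from the user interface to a single LF."""
    # Handle CR LF first so that it becomes one LF rather than two.
    text = text.replace("\r\n", "\n")
    # Assumption: a remaining lone CR is also an end-of-line mark.
    return text.replace("\r", "\n")
```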
Other than that, anything is allowed. The passphrase may contain characters that are unknown to the implementation (e.g. the creator of the archive had a more updated Unicode definition). Characters that are used for semantic meaning in rendering are treated as just ordinary characters, since the passphrase string is typically not rendered! They are just a list of code values. You can even have a modifier as the first character, with nothing preceding it to modify. That is, the program will convert the list of codes to UTF-8 and not care about their meaning or try to interpret it in any way. The limitations above are present to make it easier to implement on a variety of platforms and to prevent invisible incompatibility.
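The restrictions on code points described above can be collected into one check. This is a sketch only; is_allowed and check_passphrase are hypothetical names, and the check is meant for the final passphrase, after any implementation-specific control codes have been processed.

```python
def is_allowed(code_point):
    """Return True if the code point may appear in a passphrase."""
    if code_point in (0x09, 0x0A):        # TAB and LF are explicitly permitted
        return True
    if code_point <= 0x1F:                # U+0000 and the other C0 controls
        return False
    if 0xD800 <= code_point <= 0xDFFF:    # surrogate range
        return False
    if code_point == 0xFFFF:              # reserved as an implementation flag
        return False
    return code_point <= 0x10FFFF         # nothing beyond the Unicode range


def check_passphrase(text):
    """Raise ValueError naming every disallowed code point in the text."""
    bad = ["U+%04X" % ord(ch) for ch in text if not is_allowed(ord(ch))]
    if bad:
        raise ValueError("disallowed code points: " + ", ".join(bad))
```

Note that no check is made for whether a code point is assigned or what it means, in keeping with the rule that unknown characters are accepted.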
A user interface should provide a way to input non-standard characters or codes that cannot be typed. Systems may differ in what is easily typed or supported by the user interface, and we don’t want to restrict users to using ASCII-only for interoperability. Using more creative characters is a principal way of getting better (harder to guess or brute-force) passwords out of a user.
Alice is a phonetician, and has software on her computer to facilitate the easy typing of IPA symbol characters in the range U+0250 to U+02AF. She uses some in her passphrase.
Later, she needs to extract a file on a computer other than her normal one. Without the IPA keyboard driver, how to type her passphrase? She opens a document in the local word processor on the subject of phonetics, locates the necessary character, copies it to the clipboard, and pastes it at the proper point when typing the rest of the passphrase.
Alexis (Алексей) created a passphrase in Russian. Later, Bob uses the key-cap applet to browse the Cyrillic characters and copy them to the clipboard, and then paste that into the program. (As an aside, it’s interesting to note that the last character in his name (й) can be encoded two different ways: U+0439 or the pair U+0438 + U+0306. This is another example of why Normalization is important!)
Aaron uses a character that does not exist on Bob’s computer, such as U+143E. The key-cap applet on Bob’s computer doesn't show Canadian Aboriginal Syllabics and his has-everything font is based on Unicode 2.0, not Unicode 3.0, as are all the tables on his computer relating to looking up characters. Bob has to use the escape mechanism built into the extraction program, since he can’t type that character in his user interface at all.
So Bob types Ctrl-G and then the 4 hex digits 143E. The program receives 5 characters from the GUI, but is programmed to note the Ctrl-G as an escape code and replaces this with a single U+143E during processing.
Achan purposefully uses a string that is not in Normalization Form D. He uses U+0308 (combining diaeresis) as the first character of the string, and uses the precomposed character U+00CF (letter I with diaeresis). Any attempt by Bob to paste a diaeresis into the user interface as the first character is taken as U+00A8 (¨), the diaeresis alone. If he typed Ï in any manner, the program would express it in Normalization Form D: a regular I followed by the combining diaeresis, as two separate codes.
However, the escape mechanism allows entry of any legal code value, and codes entered this way are immune from the normalization process that converts the input to Normalization Form D. The only way to enter the passphrase is with Ctrl-G followed by 0308 and Ctrl-G followed by 00CF, respectively.
A program must provide some way to avoid the limitations of the host platform in being able to enter any sequence of allowed Unicode characters for a passphrase. This particular way is a suggestion only.
The Escape System uses a Control-G character (U+0007, BEL) as a special flag. The user types CTRL-G followed by 4 hex digits, and these 5 characters are entered into the string. For example, U+143E would be encoded as the five characters U+0007 U+0031 U+0034 U+0033 U+0045 (alternatively, the last code may be U+0065).
For codes larger than can be expressed as 4 hex digits, use two CTRL-G characters followed by 6 hex digits. For example, U+20000 would be entered as eight characters U+0007 U+0007 U+0030 U+0032 U+0030 U+0030 U+0030 U+0030.
The implementation takes the string from the user interface. Ignoring the concept of using CTRL-G as an escape mechanism, it normalizes the string to Form D and transforms it to UTF-8. The hex digits are seen as just ordinary letters by this phase, and pass through without any changes, and are encoded as UTF-8 as one byte each. Then, it looks for any CTRL-G characters and replaces the 5 or 8 bytes with the properly-encoded value indicated. The escape mechanism is applied last, so any characters so-expressed will naturally be taken exactly as you coded them.
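The processing order just described (normalize, encode, then expand escapes) can be sketched as follows. This is one possible reading of the suggested mechanism, not a definitive implementation; the names ESCAPE and passphrase_bytes are hypothetical, and rejecting a CTRL-G that is not followed by valid hex digits is an assumption.

```python
import re
import unicodedata

# Two BELs plus 6 hex digits, or one BEL plus 4 hex digits.
# The longer form is listed first so that it is tried first.
ESCAPE = re.compile(rb"\x07\x07([0-9A-Fa-f]{6})|\x07([0-9A-Fa-f]{4})")


def _expand(match):
    """Replace one escape sequence with the UTF-8 form of its code point."""
    digits = match.group(1) or match.group(2)
    code_point = int(digits.decode("ascii"), 16)
    # chr() and encode() raise a ValueError subclass for out-of-range
    # values and for surrogates, which are illegal here anyway.
    return chr(code_point).encode("utf-8")


def passphrase_bytes(ui_text):
    """Normalize to Form D, encode as UTF-8, then apply the escapes last."""
    raw = unicodedata.normalize("NFD", ui_text).encode("utf-8")
    result = ESCAPE.sub(_expand, raw)
    if b"\x07" in result:
        # Assumption: a CTRL-G that survives expansion is an input error.
        raise ValueError("stray or illegal CTRL-G in passphrase")
    return result
```

Because the escapes are expanded after normalization, passphrase_bytes("\x070308\x0700CF") yields the two code points exactly as Achan entered them, while typing Ï directly yields the decomposed form.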
Let X ← CrypHash (“KEYD64” || Passphrase || Salt).
For i (1..iteration_count)
    X ← CrypHash (X).
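The algorithm above translates almost directly into code. In this sketch SHA-256 from Python's hashlib stands in for CrypHash purely as an assumption, since the hash function is not named in this section; string_to_key is likewise a hypothetical name. The passphrase is expected as the already-normalized UTF-8 bytes.

```python
import hashlib


def string_to_key(passphrase, salt, stretch_count):
    """Derive a key from passphrase bytes, salt bytes, and the value t."""
    # The stored Stretchcount t means 2**t iterations.
    iteration_count = 2 ** stretch_count
    # Assumption: SHA-256 is used here only as a stand-in for CrypHash.
    x = hashlib.sha256(b"KEYD64" + passphrase + salt).digest()
    for _ in range(iteration_count):
        x = hashlib.sha256(x).digest()
    return x
```

For instance, string_to_key(b"secret", bytes.fromhex("4E09F154"), 7) hashes the initial value 128 more times. Changing only the salt gives an unrelated key, which is the purpose of the salt.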
0000: xx                    ; length is
0001: 00 2F                 ; Type is KEYD-b
0003: 01                    ; Instance #1
0004: 06 50 72 6F 6D 70 74  ; length=6, "Prompt"
000B: 04 4E 09 F1 54        ; length=4, 4 random bytes of salt
0010: 07                    ; means 2**7 iterations
yyyy: xx                    ; checksum
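The payload bytes of this example (offsets 0004 through 0010) can be reproduced with a short sketch. It assumes that an lpstring or lpbinary is a uintV length followed by the data, and that a uintV below hex 80 is stored as a single byte. Both assumptions are consistent with the dump above but are not defined in this section, so the hypothetical helper small_uintv refuses larger values rather than guess at the multi-byte form.

```python
def small_uintv(value):
    """Encode a small value as one byte (assumed uintV form below hex 80)."""
    if not 0 <= value < 0x80:
        raise ValueError("multi-byte uintV encoding is not covered by this sketch")
    return bytes([value])


def build_payload(prompt, salt, stretch_count):
    """Concatenate the Prompt string, Salt, and Stretchcount fields."""
    text = prompt.encode("utf-8")
    return (
        small_uintv(len(text)) + text      # prompt as a length-prefixed string
        + small_uintv(len(salt)) + salt    # salt as length-prefixed binary
        + small_uintv(stretch_count)       # t, meaning 2**t iterations
    )


payload = build_payload("Prompt", bytes.fromhex("4E09F154"), 7)
```

The chunk length, type, instance number, and checksum that surround the payload are left out, since their encoding is covered by the general description of chunks.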
Page content copyright 2003 by John M. Dlugosz. Home: http://www.dlugosz.com, email: email@example.com