Computer systems require a way to identify the people associated with them. These identifiers have been called "user names" or "account names." The identifiers are typically short, alphanumeric strings. In general, these identifiers must be unique.
The uniqueness is usually achieved in one of three ways:
1) The identifiers are assigned in a unique manner without using information associated with the individual. Example identifiers are:
ax54tv cs00034
This method was often used by large timesharing systems. While it achieved the uniqueness property, there was no way of guessing the identifier without knowing it through other means.
2) The identifiers are assigned in a unique manner where the bulk of the identifier is algorithmically derived from the individual's name. Example identifiers are:
Craig.A.Finseth-1 Finseth1 caf-1 fins0001
3) The identifiers are in general not assigned in a unique manner: the identifier is algorithmically derived from the individual's name
Finseth [Page 1]
RFC 1439 Uniqueness of Unique Identifiers March 1993
and duplicates are handled in an ad-hoc manner. Example identifiers are:
Craig.Finseth caf
Now that we have widespread electronic mail, an important feature of an identifier system is the ability to predict the identifier based on other information associated with the individual. This other information is typically the person's name.
Methods two and three make such predictions possible, especially if you have one example mapping from a person's name to the identifier. Method two relies on using some or all of the name and algorithmically varying it to ensure uniqueness (for example, by appending an integer). Method three relies on using some or all of the name and selecting an alternate identifier when a duplicate occurs.
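The two name-based methods can be sketched as follows. This is a minimal illustration, not a description of any particular site's allocator: the function names, the First.M.Last base format, and the idea of passing a list of hand-picked alternates for method three are all assumptions made for the example.

```python
def base_identifier(first: str, middle: str, last: str) -> str:
    """Name-derived part in First.M.Last form, e.g. 'Craig.A.Finseth'."""
    parts = [first, middle[:1], last]
    return ".".join(p for p in parts if p)


def assign_method_two(base: str, taken: set) -> str:
    """Method two: always append an integer; it exceeds 1 only on a collision."""
    n = 1
    while f"{base}-{n}" in taken:
        n += 1
    ident = f"{base}-{n}"
    taken.add(ident)
    return ident


def assign_method_three(base: str, alternates: list, taken: set) -> str:
    """Method three: use the bare name; on a duplicate, fall back to a
    hand-picked alternate (the ad-hoc step, hypothetical here)."""
    for candidate in [base, *alternates]:
        if candidate not in taken:
            taken.add(candidate)
            return candidate
    raise ValueError(f"no free identifier for {base!r}")


if __name__ == "__main__":
    taken = set()
    base = base_identifier("Craig", "A", "Finseth")
    print(assign_method_two(base, taken))                        # Craig.A.Finseth-1
    print(assign_method_two(base, taken))                        # Craig.A.Finseth-2
    print(assign_method_three("Craig.Finseth", ["caf"], taken))  # Craig.Finseth
    print(assign_method_three("Craig.Finseth", ["caf"], taken))  # caf
```

In both cases the second person with the same name receives an identifier that cannot be predicted from the name alone, which is the adjustment the following paragraph seeks to minimize.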
For both methods, it is important to minimize the need for making the adjustments required to ensure uniqueness (i.e., an integer that is not 1 or an alternate identifier). The probability that an adjustment will be required depends on the format of the identifier and the size of the organization.
There are a number of popular identifier formats. This section lists some of them and supplies both typical and maximum values for the number of possible identifiers. A "typical" value is the number that you are likely to run into in real life. A "maximum" value is the largest plausible number of possible values (without getting extreme about it). All ranges are expressed as a number of bits (a format with b bits allows about 2^b distinct identifiers).
There are three popular formats based on initials: those with one, two, or three letters. (The number of people with more than three initials is assumed to be small.) Values:
     format     typical   maximum
     I             4         5
     II            8        10
     III          12        15
You can also think of these as first, middle, and last initials:
Here are all possible combinations of nothing, initial, and full name for first, middle, and last. The number of Middle names is assumed to be the same as the number of First names. Values:
     format                    typical   maximum
     _      _       _             0         0
     _      _       L             4         5
     _      _       Last          9        13

     _      M       _             4         5
     _      M       L             8        10
     _      M       Last         13        18

     _      Middle  _             8        14
     _      Middle  L            12        19
     _      Middle  Last         17        27

     F      _       _             4         5
     F      _       L             8        10
     F      _       Last         13        18

     F      M       _             8        10
     F      M       L            12        15
     F      M       Last         17        23

     F      Middle  _            12        19
     F      Middle  L            16        24
     F      Middle  Last         21        32

     First  _       _             8        14
     First  _       L            12        19
     First  _       Last         17        27

     First  M       _            12        19
     First  M       L            16        24
     First  M       Last         21        32

     First  Middle  _            16        28
     First  Middle  L            20        33
     First  Middle  Last         26        40
As can be seen, the information content in these identifiers in no case exceeds 40 bits and the typical information content never exceeds 26 bits. The content of most of them is in the 8 to 20 bit range. Duplicates are thus not only possible but likely.
The method used to compute the probability of duplicates is the same as that of the well-known "birthday" problem. For a universe of N items, the probability that X members are all distinct is expressed by:

     N   N-1   N-2         N-(X-1)
     - x --- x --- x ... x -------
     N    N     N             N

The probability of at least one duplicate among the X members is one minus this product.
A program to compute this function for selected values of N is given in the appendix, as is its complete output.
The "1%" column is the number of items (people) before an organization of that (universe) size has a 1% chance of a duplicate. Similarly for 2%, 5%, 10%, and 20%.
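As a rough illustration of how such columns can be produced, the sketch below multiplies out the birthday product term by term and reports the smallest group size at which the chance of a duplicate reaches a given threshold. It is not the memo's own program. The function names are made up for the example, and the convention used (the first X at or above the threshold, rather than the last X below it) is an assumption.

```python
def no_duplicate_probability(n: int, x: int) -> float:
    """Probability that x members drawn from a universe of n are all distinct."""
    p = 1.0
    for i in range(x):
        p *= (n - i) / n
    return p


def smallest_group(n: int, threshold: float) -> int:
    """Smallest x for which the probability of a duplicate is >= threshold."""
    p_distinct = 1.0
    x = 0
    while 1.0 - p_distinct < threshold:
        p_distinct *= (n - x) / n  # account for member number x + 1
        x += 1
    return x


if __name__ == "__main__":
    bits = 17  # typical information content of the "First Last" form
    for threshold in (0.01, 0.02, 0.05, 0.10, 0.20):
        x = smallest_group(2 ** bits, threshold)
        print(f"{bits} bits, {threshold:.0%} chance of a duplicate at {x} people")
```

The loop always terminates, because the product reaches zero once the group is larger than the universe.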
For example, assume an organization were to select the "First Last" form. This form has 17 bits (typical) and 27 bits (maximum) of information. The relevant line is:
For an organization with 100 people, the probability of a duplicate would be between 2% and 5% (probably around 4%). If the organization had 1,000 people, the probability of a duplicate would be much greater than 20%.
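These figures can be checked directly from the birthday product. The sketch below is only a check under stated assumptions: it treats a format with b bits as a universe of 2^b equally likely identifiers, and the function name is hypothetical.

```python
import math


def duplicate_probability(bits: int, people: int) -> float:
    """Chance of at least one duplicate among `people` identifiers drawn
    uniformly from a universe of 2**bits (an idealizing assumption)."""
    n = 2 ** bits
    all_distinct = math.prod((n - i) / n for i in range(people))
    return 1.0 - all_distinct


if __name__ == "__main__":
    for bits in (17, 27):          # typical and maximum for "First Last"
        for people in (100, 1000):
            p = duplicate_probability(bits, people)
            print(f"{bits} bits, {people:5d} people: {p:.2%}")
```

Real names are far from uniformly distributed, so the typical (17 bit) figures are the more realistic guide and the true collision rate is likely to be higher still.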
Appendix: Reuse of Identifiers and Privacy Issues
Let's say that an organization were to select the format:
First.M.Last-#
as my own organization has. Is the -# required, or can one simply do:
Craig.A.Finseth
for the first one and
Craig.A.Finseth-2
(or -1) for the second? The answer is "no": the suffix must always be present, although for non-obvious reasons.
Assume that the organization has made this selection and a third party wants to send e-mail to Craig.A.Finseth. Because of the Electronic Communications Privacy Act of 1986, an organization must treat electronic mail with care. In this case, the third party has no reliable way to know that Craig.A.Finseth may be the wrong recipient. On the other hand, if the -# suffix is always present and attempts to send mail to the non-suffix form are rejected, the third party will realize that they must have the suffix in order to have a unique identifier.
For similar reasons, identifiers in this form should not be re-used in the life of the mail system.
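The two rules above (always require the suffix, and never re-use an identifier) can be sketched as a small directory. This is an illustrative sketch only: the class, its method names, and the mailbox strings are hypothetical, and a real mail system would enforce these rules in its own address resolution.

```python
import re

# An identifier is acceptable only if it ends in -<positive integer>.
SUFFIXED = re.compile(r"^.+-[1-9][0-9]*$")


class Directory:
    def __init__(self):
        self.active = {}     # identifier -> mailbox
        self.issued = set()  # every identifier ever issued, retired or not

    def issue(self, base: str, mailbox: str) -> str:
        """Issue the lowest suffix that has never been used for this base."""
        n = 1
        while f"{base}-{n}" in self.issued:
            n += 1
        ident = f"{base}-{n}"
        self.issued.add(ident)
        self.active[ident] = mailbox
        return ident

    def retire(self, ident: str) -> None:
        """Deactivate an identifier; it stays in `issued`, so it is never re-used."""
        self.active.pop(ident, None)

    def resolve(self, ident: str) -> str:
        """Reject the non-suffix form outright instead of guessing a recipient."""
        if not SUFFIXED.match(ident):
            raise LookupError(f"{ident!r} rejected: the -# suffix is required")
        if ident not in self.active:
            raise LookupError(f"{ident!r} is unknown or retired")
        return self.active[ident]


if __name__ == "__main__":
    d = Directory()
    first = d.issue("Craig.A.Finseth", "mailbox-a")
    d.issue("Craig.A.Finseth", "mailbox-b")
    d.retire(first)
    print(d.issue("Craig.A.Finseth", "mailbox-c"))  # Craig.A.Finseth-3
```

Because retired identifiers are remembered, mail addressed to a former holder is refused rather than delivered to a newer person with the same name.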
Bruce Lansky (1984). The Best Baby Name Book. Deephaven, MN: Meadowbrook. ISBN 0-671-54463-2.
Lareina Rule (1988). Name Your Baby. Bantam. ISBN 0-553-27145-8.
Security Considerations
Security issues are not discussed in this memo.
Author's Address
Craig A. Finseth
Networking Services
Computer and Information Services
University of Minnesota
130 Lind Hall
207 Church St. SE
Minneapolis, MN 55455-0134
EMail: Craig.A.Finseth-1@umn.edu or fin@unet.umn.edu