Network Working Group A. Surtees Request for Comments: 4464 M. West Category: Informational Siemens/Roke Manor Research May 2006
Signaling Compression (SigComp) Users' Guide
Status of This Memo
This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2006).
Abstract
This document provides an informational guide for users of the Signaling Compression (SigComp) protocol. The aim of the document is to assist users when making SigComp implementation decisions, for example, the choice of compression algorithm and the level of robustness against lost or misordered packets.
This document provides an informational guide for users of the SigComp protocol, RFC 3320 [2]. The idea behind SigComp is to standardize a Universal Decompressor Virtual Machine (UDVM) that can be programmed to understand the output of many well-known compressors including DEFLATE [8] and LZW [7]. The bytecode for the chosen compression algorithm is uploaded to the UDVM as part of the compressed data.
The basic SigComp RFC describes the actions that an endpoint must take upon receiving a SigComp message. However, the entity responsible for generating new SigComp messages (the SigComp compressor) is left as an implementation decision; any compressor can be used provided that it generates SigComp messages that can be successfully decompressed by the receiving endpoint.
This document gives examples of a number of different compressors that can be used by the SigComp protocol. It also gives examples of how to use some of the mechanisms (such as acknowledgements) described in RFC 3321 [3].
When implementing a SigComp compressor, the first step is to choose a compression algorithm that can encode the application messages into a (hopefully) smaller form. Since SigComp can upload bytecode for new algorithms to the receiving endpoint, arbitrary compression algorithms can be supported provided that suitable bytecode has been written for the corresponding decompressor.
This document provides example bytecode for the following algorithms:
Any of the above algorithms may be useful depending on the desired compression ratio, processing and memory requirements, code size, implementation complexity, and Intellectual Property (IPR) considerations.
As well as encoding the application messages using the chosen algorithm, the SigComp compressor is responsible for ensuring that messages can be correctly decompressed even if packets are lost or misordered during transmission. The SigComp feedback mechanism can
Surtees & West Informational [Page 3]
RFC 4464 SigComp Users' Guide May 2006
be used to acknowledge successful decompression at the remote endpoint.
The following robustness techniques and other mechanisms specific to the SigComp environment are covered in this document:
Any or all of the above mechanisms can be implemented in conjunction with the chosen compression algorithm. An example subroutine of UDVM bytecode is provided for each of the mechanisms; these subroutines can be added to the bytecode for one of the basic compression algorithms. (Note: The subroutine or the basic algorithm may require minor modification to ensure they work together correctly.)
Writing UDVM programs directly in bytecode would be a daunting task, so a simple assembly language is provided to facilitate the creation of new decompression algorithms. The assembly language includes mnemonic codes for each of the UDVM instructions, as well as simple directives for evaluating integer expressions, padding the bytecode, and so forth.
The syntax of the UDVM assembly language uses the customary two-level description technique, partitioning the grammar into a lexical and a syntactic level.
On a lexical level, a string of assembly consists of zero or more tokens optionally separated by whitespace. Each token can be a text name, an instruction opcode, a delimiter, or an integer (specified as decimal, binary, or hex).
Surtees & West Informational [Page 4]
RFC 4464 SigComp Users' Guide May 2006
The following ABNF description, RFC 4234 [1], specifies the syntax of a token:
Whitespace that matches <ws> is skipped between tokens, but serves to terminate the longest match for a token.
Comments are specified by the symbol ";" and are terminated by the end of the line, for example:
LOAD (temp, 1) ; This is a comment.
Any other input is a syntax error.
When parsing on the lexical level, the string of assembly should be divided up into a list of successive tokens. The whitespace and comments should also be deleted. The assembly should then be parsed on the syntactic level as explained in Section 3.2.
Once the string of assembly has been divided into tokens as per Section 3.1, the next step is to convert the assembly into a string of UDVM bytecode.
Surtees & West Informational [Page 5]
RFC 4464 SigComp Users' Guide May 2006
On a syntactic level, a string of assembly consists of zero or more instructions, directives, or labels, each of which is itself built up from one or more lexical tokens.
The following ABNF description specifies the syntax of the assembly language. Note that the lexical parsing step is assumed to have been carried out; so in particular, the boundaries between tokens are already known, and the comments and whitespace have been deleted:
assembly = *(instruction / directive / label) instruction = opcode ["(" operand *("," operand) ")"] operand = [["$"] expression] ; Operands can be left blank if they can ; be automatically inferred by the ; compiler, e.g., a literal operand ; that specifies the total number of ; operands for the instruction. ; When "$" is prepended to an operand, ; the corresponding integer is an ; address rather than the actual operand ; value. This symbol is mandatory for ; reference operands, optional for ; multitypes and addresses, and ; disallowed for literals. label = ":" name directive = padding / data / set / readonly / unknown-directive unknown-directive = name ["(" expression *("," expression) ")"] ; The parser can ignore unknown ; directives. The resulting bytecode ; may or may not generate the expected ; results. padding = ("pad" / "align" / "at") "(" expression ")" data = ("byte" / "word") "(" expression *("," expression) ")" readonly = "readonly" "(" "0" / "1" ")" set = "set" "(" name "," expression ")" expression = value / "(" expression operator expression ")" value = dec / bin / hex / name / "." / "!" ; "." is the location of this ; instruction/directive, whereas "!" is ; the location of the closest ; DECOMPRESSION-FAILURE
The following sections define how to convert the instructions, labels and directives into UDVM bytecode:
The operand values needed by particular instructions or directives can be given in the form of expressions. An expression can include one or more values specified as decimal, binary, or hex (binary values are preceded by "0b" and hex values are preceded by "0x"). The expression may also include one or more of the following operators:
"+" Addition "-" Subtraction "*" Multiplication "/" Integer division "%" Modulo arithmetic (a%b := a modulo b) "&" Binary AND "|" Binary OR "^" Binary XOR "~" Binary XNOR "<<" Binary LSHIFT ">>" Binary RSHIFT
The operands for each operator must always be surrounded by parentheses so that the order in which the operators should be evaluated is clear. For example:
((1 + (2 * 3)) & (0xabcd - 0b00101010)) gives the result 3.
Expressions can also include the special values "." and "!". When the symbol "." is encountered, it is replaced by the location in the bytecode of the current instruction/directive. When the symbol "!" is encountered it is replaced by the location in the bytecode of the closest DECOMPRESSION-FAILURE instruction (i.e., the closest zero byte). This can be useful when writing UDVM instructions that call a decompression failure, for example:
INPUT-BYTES (1, temp, !)
The above instruction causes a decompression failure to occur if it tries to input data from beyond the end of the compressed message.
Note: When using "!" in the assembly language to generate bytecode, care must be taken to ensure that the address of the zero used at bytecode generation time will still contain zero when the bytecode is run. The readonly directive (see Section 3.2.3) can be used to do this.
Surtees & West Informational [Page 7]
RFC 4464 SigComp Users' Guide May 2006
It is also possible to assign integer values to text names: when a text name is encountered in an expression, it is replaced by the integer value assigned to it. Section 3.2.3 explains how to assign integer values to text names.
A UDVM instruction is specified by the instruction opcode followed by zero or more operands. The instruction operands are enclosed in parentheses and separated by commas, for example:
ADD ($3, 4)
When generating the bytecode, the parser should replace the instruction opcode with the corresponding 1-byte value as per Figure 11 of SigComp [2].
Each operand consists of an expression that evaluates to an integer, optionally preceded by the symbol "$". This symbol indicates that the supplied integer value must be interpreted as the memory address at which the operand value can be found, rather than the actual operand value itself.
When converting each instruction operand to bytecode, the parser first determines whether the instruction expects the operand to be a literal, a reference, a multitype, or an address. If the operand is a literal, then, as per Figure 8 of SigComp, the parser inserts bytecode (usually the shortest) capable of encoding the supplied operand value.
Since literal operands are used to indicate the total number of operands for an instruction, it is possible to leave a literal operand blank and allow its value to be inferred automatically by the assembler. For example:
MULTILOAD (64, , 1, 2, 3, 4)
The missing operand should be given the value 4 because it is followed by a total of 4 operands.
If the operand is a reference, then, as per Figure 9 of SigComp, the parser inserts bytecode (usually the shortest) capable of encoding the supplied memory address. Note that reference operands will always be preceded by the symbol "$" in assembly because they always encode memory addresses rather than actual operand values.
Surtees & West Informational [Page 8]
RFC 4464 SigComp Users' Guide May 2006
If the operand is a multitype, then the parser first checks whether the symbol "$" is present. If so, then, as per Figure 10 of SigComp, it inserts bytecode (usually the shortest) capable of encoding the supplied integer as a memory address. If not, then, as per Figure 10 of SigComp, it inserts bytecode (usually the shortest) that encodes the supplied integer as an operand value.
If the operand is an address, then the parser checks whether the symbol "$" is present. If so, then the supplied integer is encoded as a memory address, just as for the multitype instruction above. If not, then the byte position of the opcode is subtracted from the supplied integer modulo 16, and the result is encoded as an operand value as per Figure 10 of SigComp.
The length of the resulting bytecode is dependent on the parser in use. There can be several correct and usable representations of the same instruction.
The assembly language provides a number of directives for evaluating expressions, moving instructions to a particular memory address, etc.
The directives "pad", "align", and "at" can be used to add padding to the bytecode.
The directive "pad (n)" appends n consecutive padding bytes to the bytecode. The actual value of the padding bytes is unimportant, so when the bytecode is uploaded to the UDVM, the padding bytes can be set to the initial values contained in the UDVM memory (this helps to reduce the size of a SigComp message).
The directive "align (n)" appends the minimum number of padding bytes to the bytecode such that the total number of bytes of bytecode generated so far is a multiple of n bytes. If the bytecode is already aligned to a multiple of n bytes, then no padding bytes are added.
The directive "at (n)" appends enough padding bytes to the bytecode such that the total number of bytes of bytecode generated so far is exactly n bytes. If more than n bytes have already been generated before the "at" directive is encountered then the assembly code contains an error.
The directives "byte" and "word" can be used to add specific data strings to the bytecode.
Surtees & West Informational [Page 9]
RFC 4464 SigComp Users' Guide May 2006
The directive "byte (n[0],..., n[k-1])" appends k consecutive bytes to the bytecode. The byte string is supplied as expressions that evaluate to give integers n[0],..., n[k-1] from 0 to 255.
The directive "word (n[0],..., n[k-1])" appends k consecutive 2-byte words to the bytecode. The word string is supplied as expressions that evaluate to give integers n[0],..., n[k-1] from 0 to 65535.
The directive "set (name, n)" assigns an integer value n to a specified text name. The integer value can be supplied in the form of an expression.
The directive "readonly (n)" where n is 0 or 1 can be used to indicate that an area of memory could be changed (0) or will not be changed (1) during the execution of the UDVM. This directive could be used, for example, in conjunction with "!" to ensure that the address of the zero used will still contain zero when the bytecode is executed. If no readonly directive is used, then any address containing zero can be used by "!" (i.e., by default, there is assumed to be a readonly (1) directive at Address 0) and it is up to the author of the assembly code to ensure that the address in question will still contain zero when the bytecode is executed. If the readonly directive is used, then bytes between a readonly (0) and readonly (1) pair are NOT to be used by "!". When a readonly directive has been used, the bytes obey that directive from that address to either another readonly directive or the end of UDVM memory, whichever comes first.
A label is a special directive used to assign memory addresses to text names.
Labels are specified by a single colon followed by the text name to be defined. The (absolute) position of the byte immediately following the label is evaluated and assigned to the text name. For example:
:start
LOAD (temp, 1)
Since the label "start" occurs at the beginning of the bytecode, it is assigned the integer value 0.
Note that writing the label ":name" has exactly the same behavior as writing the directive "set (name, .)".
Once the parser has converted a string of assembly into the corresponding bytecode, it must be copied to the UDVM memory beginning at Address 0 and then executed, beginning from the first UDVM instruction in the bytecode.
SigComp provides the following message format for uploading bytecode to the UDVM:
The destination field should be set to the memory address of the first UDVM instruction. Note that if this address cannot be represented by the destination field, then the bytecode cannot be uploaded to the UDVM using the standard SigComp header. In particular, the memory address of the first UDVM instruction must always be a multiple of 64 bytes or the standard SigComp header cannot be used. Of course, there may be other ways to upload the bytecode to the UDVM, such as retrieving the bytecode directly via the INPUT-BYTES instruction.
Additionally, all memory addresses between Address 0 and Address 31 inclusive are initialized to endpoint-specific values by the UDVM, so they must be specified as padding in the bytecode, or the standard SigComp header cannot be used. Memory addresses from Address 32 to Address (destination - 1) inclusive are initialized to 0, so they must be specified either as padding or as 0s if the bytecode is to be successfully uploaded using the standard SigComp header.
Surtees & West Informational [Page 11]
RFC 4464 SigComp Users' Guide May 2006
The code_len field should be set to the smallest value such that all memory addresses beginning at Address (destination + code_len) are either as initialised by the UDVM (to 0) or as set by the bytecode at runtime.
The "uploaded UDVM bytecode" should be set to contain the segment of bytecode that lies between Address (destination) and Address (destination + code_len - 1) inclusive.
This section describes a number of compression algorithms that can be used by a SigComp compressor. In each case, the document provides UDVM bytecode for the corresponding decompression algorithm, which can be uploaded to the receiving endpoint as part of a SigComp message. Each algorithm (as written in this section) assumes that there is a 16K decompression memory size, there are 16 cycles per bit, and there is an 8K state memory size. Decompression will succeed with a smaller value for state memory size; however, the full state will not be created.
Section 4.1.1 covers a simple algorithm in some detail, including the steps required to compress and decompress a SigComp message. The remaining sections cover well-known compression algorithms that can be adapted for use in SigComp with minimal modification.
This section describes how to implement a very simple compression algorithm based on LZ77 [5].
A compressed message generated by the simplified LZ77 scheme consists of a sequence of 4-byte characters, where each character contains a 2-byte position value followed by a 2-byte length value. Each pair of integers identifies a byte string in the UDVM memory; when concatenated, these byte strings form the decompressed message.
When implementing a bytecode decompressor for the simplified LZ77 scheme, the UDVM memory is partitioned into five distinct areas, as shown below:
The first 128 bytes are used to hold the 2-byte variables needed by the LZ77 decompressor. Within this memory, the first 64 bytes are used as a scratch-pad, holding the 2-byte variables that can be discarded between SigComp messages. In contrast, the next 64 bytes (and in fact all of the UDVM memory starting from Address 64) should be saved after decompressing a SigComp message to improve the compression ratio of subsequent messages.
The bytecode for the LZ77 decompressor is stored beginning at Address 128. A total of 128 bytes are reserved for the bytecode although the LZ77 decompressor requires less; this allows room for adding additional features to the decompressor at a later stage.
The next 256 bytes are initialized by the bytecode to contain the integers 0 to 255 inclusive. The purpose of this memory area is to provide a dictionary of all possible uncompressed characters; this is important to ensure that the compressor can always generate a sequence of position/length pairs that encode a given message. For example, a byte with value 0x41 (corresponding to the ASCII character "A") can be found at Address 0x0141 of the UDVM memory, so the compressed character 0x0141 0001 will decompress to give this ASCII character. Note that encoding each byte in the application message as a separate 4-byte compressed character is not recommended, however, as the resulting "compressed" message is four times as large as the original uncompressed message.
The compression ratio of LZ77 is improved by the remaining UDVM memory, which is used to store a history buffer containing the previously decompressed messages. Compressed characters can point to strings that have previously been decompressed and stored in the buffer, so the overall compression ratio of the LZ77 algorithm improves as the decompressor "learns" more text strings and is able to encode longer strings using a single compressed character. The buffer is circular, so older messages are overwritten by new data when the buffer becomes full.
The steps required to implement an LZ77 compressor and decompressor are similar, although compression is more processor-intensive as it requires a searching operation to be performed. Assembly for the simplified LZ77 decompressor is given below:
; Variables that do not need to be stored after decompressing each ; SigComp message are stored here:
at (32)
:position_value pad (2) :length_value pad (2)
Surtees & West Informational [Page 13]
RFC 4464 SigComp Users' Guide May 2006
at (42)
set (requested_feedback_location, 0)
; The UDVM registers must be stored beginning at Address 64:
at (64)
; Variables that should be stored after decompressing a message are ; stored here. These variables will form part of the SigComp state ; item created by the bytecode:
:byte_copy_left pad (2) :byte_copy_right pad (2) :decompressed_pointer pad (2)
set (returned_parameters_location, 0)
align (64)
:initialize_memory
set (udvm_memory_size, 8192) set (state_length, (udvm_memory_size - 64))
; The UDVM registers byte_copy_left and byte_copy_right are set to ; indicate the bounds of the circular buffer in the UDVM memory. A ; variable decompressed_pointer is also created and set pointing to ; the start of the circular buffer:
; The "dictionary" area of the UDVM memory is initialized to contain ; the values 0 to 255 inclusive:
MEMSET (static_dictionary, 256, 0, 1)
:decompress_sigcomp_message
:next_character
; The next character in the compressed message is read by the UDVM ; and the position and length integers are stored in the variables ; position_value and length_value, respectively. If no more ; compressed data is available, the decompressor jumps to the ; "end_of_message" subroutine:
INPUT-BYTES (4, position_value, end_of_message)
Surtees & West Informational [Page 14]
RFC 4464 SigComp Users' Guide May 2006
; The position_value and length_value point to a byte string in the ; UDVM memory, which is copied into the circular buffer at the ; position specified by decompressed_pointer. This allows the string ; to be referenced by later characters in the compressed message:
; Memory for the dictionary and the circular buffer are reserved by ; the following statements:
:static_dictionary pad (256) :circular_buffer
The task of an LZ77 compressor is simply to discover a sequence of 4-byte compressed characters that the above bytecode will decompress to give the desired application message. As an example, a message compressed using the simplified LZ77 algorithm is given below:
The uncompressed message is "The Restaurant at the End of the Universe\n".
Surtees & West Informational [Page 15]
RFC 4464 SigComp Users' Guide May 2006
The bytecode for the LZ77 decompressor can be uploaded as part of the compressed message, as specified in Section 3.3. However, in order to improve the overall compression ratio, it is important to avoid uploading bytecode in every compressed message. For this reason, SigComp allows the UDVM to save an area of its memory as a state item between compressed messages. Once a state item has been created, it can be retrieved by sending the corresponding state identifier using the following SigComp message format:
The partial_state_identifier field must contain the first 6 bytes of the state identifier for the state item to be accessed (see [2] for details of how state identifiers are derived).
Note that the partial_state_identifier field could be 9 or 12 bytes and that in these cases, bits 6 and 7 of the first byte of the message would be 10 or 11, respectively.
This section provides UDVM bytecode for the simple but effective LZSS compression algorithm [6].
The principal improvement offered by LZSS over LZ77 is that each compressed character begins with a 1-bit indicator flag to specify whether the character is a literal or an offset/length pair. A literal value is simply a single uncompressed byte that is appended directly to the decompressed message.
An offset/length pair contains a 12-bit offset value from 1 to 4096 inclusive, followed by a 4-bit length value from 3 to 18 inclusive. Taken together, these values specify one of the previously received
Surtees & West Informational [Page 16]
RFC 4464 SigComp Users' Guide May 2006
text strings in the circular buffer, which is then appended to the end of the decompressed message.
Assembly for an LZSS decompressor is given below:
at (32) readonly (0)
:index pad (2) :length_value pad (2) :old_pointer pad (2)
at (42)
set (requested_feedback_location, 0)
at (64)
:byte_copy_left pad (2) :byte_copy_right pad (2) :input_bit_order pad (2) :decompressed_pointer pad (2)
set (returned_parameters_location, 0)
align (64) readonly (1)
:initialize_memory
set (udvm_memory_size, 8192) set (state_length, (udvm_memory_size - 64))
This section provides UDVM bytecode for the well-known LZW compression algorithm LZW [7]. This algorithm is used in a number of standards including the GIF image format.
LZW compression operates in a similar manner to LZ77 in that it maintains a circular buffer of previously received decompressed data, and each compressed character references exactly one byte string from the circular buffer. However, LZW also maintains a "codebook" containing 1024 position/length pairs that point to byte strings that LZW believes are most likely to occur in the uncompressed data.
The byte strings stored in the LZW codebook can be referenced by sending a single 10-bit value from 0 to 1023 inclusive. The UDVM extracts the corresponding text string from the codebook and appends it to the end of the decompressed message. It then creates a new codebook entry containing the current text string and the next character to occur in the decompressed message.
Surtees & West Informational [Page 18]
RFC 4464 SigComp Users' Guide May 2006
Assembly for an LZW decompressor is given below:
at (32)
:length_value pad (2) :position_value pad (2) :index pad (2)
at (42)
set (requested_feedback_location, 0)
at (64)
:byte_copy_left pad (2) :byte_copy_right pad (2) :input_bit_order pad (2)
:codebook_next pad (2) :current_length pad (2) :decompressed_pointer pad (2)
set (returned_parameters_location, 0)
align (64)
:initialize_memory
set (udvm_memory_size, 8192) set (state_length, (udvm_memory_size - 64))
This section provides UDVM bytecode for the DEFLATE compression algorithm. DEFLATE is the algorithm used in the well-known "gzip" file format.
The following bytecode will decompress the DEFLATE compressed data format [8] with the following modifications:
1. The DEFLATE compressed data format separates blocks of compressed data by transmitting 7 consecutive zero bits. Each SigComp message is assumed to contain a separate block of compressed data, so the end-of-block bits are implicit and do not need to be transmitted at the end of a SigComp message.
2. This bytecode supports only DEFLATE block type 01 (data compressed with fixed Huffman codes).
Assembly for the DEFLATE decompressor is given below:
at (32) readonly (0)
:index pad (2) :extra_length_bits pad (2) :length_value pad (2) :extra_distance_bits pad (2) :distance_value pad (2)
at (42)
set (requested_feedback_location, 0)
at (64)
:byte_copy_left pad (2) :byte_copy_right pad (2) :input_bit_order pad (2) :decompressed_pointer pad (2)
Surtees & West Informational [Page 21]
RFC 4464 SigComp Users' Guide May 2006
:length_table pad (116) :distance_table pad (120)
set (returned_parameters_location, 0)
align (64)
readonly (1) :initialize_memory
set (udvm_memory_size, 8192) set (state_length, (udvm_memory_size - 64)) set (length_table_start, (((length_table - 4) + 65536) / 4)) set (length_table_mid, (length_table_start + 24)) set (distance_table_start, (distance_table / 4))
This section provides UDVM bytecode for the LZJH compression algorithm. LZJH is the algorithm adopted by the International Telecommunication Union (ITU-T) Recommendation V.44 [9].
Assembly for the LZJH decompressor is given below:
at (32) readonly (0)
; The following 2-byte variables are stored in the scratch-pad memory ; area because they do not need to be saved after decompressing a ; SigComp message:
:length_value pad (2) :position_value pad (2) :index pad (2) :extra_extension_bits pad (2) :codebook_old pad (2)
at (42)
set (requested_feedback_location, 0)
at (64)
; UDVM_registers
:byte_copy_left pad (2) :byte_copy_right pad (2)
:input_bit_order pad (2)
Surtees & West Informational [Page 24]
RFC 4464 SigComp Users' Guide May 2006
; The following 2-byte variables are saved as state after ; decompressing a SigComp message:
:current_length pad (2) :decompressed_pointer pad (2) :ordinal_length pad (2) :codeword_length pad (2) :codebook_next pad (2)
set (returned_parameters_location, 0)
align (64) readonly (1)
:initialize_memory
; The following constants can be adjusted to configure the LZJH ; decompressor. The current settings are as recommended in the V.44 ; specification (given that a total of 8K UDVM memory is available):
set (udvm_memory_size, 8192) ; sets the total memory for LZJH set (max_extension_length, 8) ; sets the maximum string extension set (min_ordinal_length, 7) ; sets the minimum ordinal length set (min_codeword_length, 6) ; sets the minimum codeword length
set (codebook_start, 4492) set (first_codeword, (codebook_start - 12)) set (state_length, (udvm_memory_size - 64))
; The following code decompresses the standard 1-bit LZJH prefix ; that specifies whether the next character is an ordinal or a ; codeword/control value:
; The following code decompresses the special LZJH prefix that only ; occurs after a codeword. It specifies whether the next character ; is an ordinal, a codeword/control value, or a string extension:
; The following code decompresses an ordinal character and creates ; a new codebook entry consisting of the ordinal character and the ; next character to be decompressed:
set (index_lsb, (index + 1)) set (current_length_lsb, (current_length + 1))
; The following code interprets a codeword as an index into the LZJH ; codebook. It extracts the position/length pair from the specified ; codebook entry; the position/length pair points to a byte string ; in the circular buffer, which is then copied to the end of the ; decompressed message. The code also creates a new codebook entry ; consisting of the byte string plus the next character to be ; decompressed:
; The code can handle all of the control characters in V.44 except ; for ETM (Enter Transparent Mode), which is not required for ; message-based protocols such as SigComp.
COMPARE ($index, 1, !, flush, stepup)
:flush
; The FLUSH control character jumps to the beginning of the next ; complete byte in the compressed message:
INPUT-BYTES (0, 0, 0) JUMP (standard_prefix)
:stepup
Surtees & West Informational [Page 27]
RFC 4464 SigComp Users' Guide May 2006
; The STEPUP control character increases the number of bits used to ; encode an ordinal value or a codeword:
Alternative algorithms can also be used with SigComp. This section shows a modified version of the DEFLATE [8] algorithm. The two-stage encoding of DEFLATE is replaced by a single step with a discrete Huffman code for each symbol. The literal/length symbol probabilities are dependent upon whether the previous symbol was a literal or a match. Bit handling is also simpler, in that all bits are input using the INPUT-HUFFMAN instruction and the value of the H bit does not change so all bits are input, read, and interpreted in the same order.
Surtees & West Informational [Page 28]
RFC 4464 SigComp Users' Guide May 2006
Assembly for the algorithm is given below. String matching rules are the same as for the other LZ-based algorithms, with the alternative encoding of the literals and length/distance pairs.
at (32) readonly (0)
:index pad (2) :distance_value pad (2) :old_pointer pad (2)
at (42)
set (requested_feedback_location, 0)
at (64)
:byte_copy_left pad (2) :byte_copy_right pad (2) :input_bit_order pad (2) :decompressed_pointer pad (2)
set (returned_parameters_location, 0)
at (128) readonly (1)
:initialize_memory
set (udvm_memory_size, 8192) set (state_length, (udvm_memory_size - 64))
This section covers the additional mechanisms that can be employed by SigComp to improve the overall compression ratio, including the use of acknowledgements, dictionaries, and sharing state between two directions of a compressed message flow.
An example of assembly code is provided for these mechanisms. Depending on the mechanism and basic algorithm in use, the assembly code for either the mechanism or the basic algorithm may require modification (e.g., if the algorithm uses 'no more input' to jump to end_of_message, following end_of_message with an input instruction for CRC will not work). In any case, these are examples and there may be alternative ways to make use of the mechanisms.
Surtees & West Informational [Page 31]
RFC 4464 SigComp Users' Guide May 2006
When each of the compression algorithms described in Section 4 has successfully decompressed the current SigComp message, the contents of the UDVM memory are saved as a SigComp state item. Subsequent messages can access this state item by uploading the correct state identifier to the receiving endpoint, which avoids the need to upload the bytecode for the compression algorithm on a per-message basis. However, before a state item can be accessed, the compressor must first ensure that it is available at the receiving endpoint.
For each SigComp compartment, the receiving endpoint maintains a list of currently available states (where the total amount of state saved does not exceed the state_memory_size for the compartment). The SigComp compressor should maintain a similar list containing the states that it has instructed the receiving endpoint to save.
As well as tracking the list of state items that it has saved at the remote endpoint, the compressor also maintains a flag for each state item indicating whether or not the state can safely be accessed. State items should not be accessed until they have been acknowledged (e.g., by using the SigComp feedback mechanism as per Section 5.1).
State items are deleted from the list when adding a new piece of state when the total state_memory_size for the compartment is full. The state to be deleted is determined according to age and retention priority as discussed in SigComp [2]. The SigComp compressor should not attempt to access any state items that have been deleted in this manner, as they may no longer be available at the receiving endpoint.
SigComp [2] defines a feedback mechanism to allow the compressor to request feedback from the decompressor, to give the compressor indication that a message has been received and correctly decompressed and that state storage has been attempted. (Note: This mechanism cannot convey the success or failure of individual state creation requests.) In order to invoke the feedback mechanism, the following fields must be reserved in the UDVM memory:
These fields can be reserved in any of the algorithms of Section 4 by replacing the line "set (requested_feedback_location, 0)" with the following assembly:
:requested_feedback_location pad (1) :requested_feedback_length pad (1) :requested_feedback_field pad (12) :hash_start pad (8)
When a SigComp message is successfully decompressed and saved as state, the following bytecode instructs the receiving endpoint to return the first 6 bytes of the corresponding state identifier. The bytecode can be added to any of the compression algorithms of Section 4 immediately following the ":end_of_message" label:
The receiving endpoint then returns the state identifier in the "returned feedback field" of the next SigComp message to be transmitted in the reverse direction.
When the state identifier is returned, the compressor can set the availability flag for the corresponding state to 1.
Certain protocols that can be compressed using SigComp offer a fixed, mandatory state item known as a static dictionary. This dictionary contains a number of text strings that commonly occur in messages generated by the protocol in question. The overall compression ratio can often be improved by accessing the text phrases from this static dictionary rather than by uploading them as part of the compressed message.
As an example, a static dictionary is provided for the protocols SIP and SDP, RFC 3485 [4]. This dictionary is designed for use by a wide range of compression algorithms including all of the ones covered in Section 4.
Surtees & West Informational [Page 33]
RFC 4464 SigComp Users' Guide May 2006
In any of the compression algorithms of Section 4, the static dictionary can be accessed by inserting the following instruction immediately after the ":initialize_memory" label:
STATE-ACCESS (dictionary_id, 6, 0, 0, 1024, 0)
The parameters of STATE-ACCESS instruction will depend on the compression algorithm in use.
The following lines should also be inserted immediately after the END-MESSAGE instruction:
:dictionary_id
byte (0xfb, 0xe5, 0x07, 0xdf, 0xe5, 0xe6)
The text strings contained in the static dictionary can then be accessed in exactly the same manner as the text strings from previously decompressed messages (see Section 5.1 for further details).
Note that in some cases it is sufficient to load only part of the static dictionary into the UDVM memory. Further information on the contents of the SIP and SDP static dictionary can be found in the relevant document, RFC 3485 [4].
The acknowledgement scheme of Section 5.1 is designed to indicate the successful decompression of a message. However, it does not guarantee that the decompressed message is identical to the original message, since decompression of a corrupted message could succeed but with some characters being incorrect. This could lead to an incorrect message being passed to the application or unexpected contents of state to be stored. In order to prevent this happening, a CRC check could be used.
If an additional CRC check is required, then the following bytecode can be inserted after the ":end_of_message" label:
The bytecode extracts a 2-byte CRC from the end of the SigComp message and compares it with a CRC calculated over the UDVM memory. Decompression failure occurs if the two CRC values do not match.
Surtees & West Informational [Page 34]
RFC 4464 SigComp Users' Guide May 2006
A definition of the CRC polynomial used by the CRC instruction can be found in SigComp [2].
If a particular endpoint is able to offer more processing or memory resources than the mandatory minimum, the SigComp feedback mechanism can be used to announce that these resources are available to the remote endpoint. This may help to improve the overall compression ratio between the two endpoints.
Additionally, if an endpoint has any pieces of state that may be useful for the remote endpoint to reference, it can advertise the identifiers for the states. The remote endpoint can then make use of any that it also knows about (i.e., knows the contents of), for example, a dictionary or shared mode state (see Section 5.5).
The values of the following SigComp parameters can be announced using the SigComp advertisement mechanism:
cycles_per_bit decompression_memory_size state_memory_size SigComp_version state identifiers
Surtees & West Informational [Page 35]
RFC 4464 SigComp Users' Guide May 2006
As explained in SigComp, in order to announce the values of these parameters, the following fields must be reserved in the UDVM memory:
These fields can be reserved in any of the algorithms of Section 4 by replacing the line "set (returned_parameters_location, 0)" with the following piece of assembly:
:adverts_len pad (1) :adverts_len_lsb pad (1) :returned_parameters_location pad (1) :returned_sigcomp_version pad (1) :state_ids pad (x)
where x is enough space for the number state identifiers that the endpoint wishes to advertise.
When a SigComp message is successfully decompressed and saved as state, the following bytecode announces to the receiving endpoint that additional resources and pieces of state are available at the sending endpoint:
Note that the integer value "N" should be set equal to the amount of resources available at the sending endpoint. N should be expressed as a 2-byte integer with the most significant bits corresponding to the cycles_per_bit parameter and the least significant bits corresponding to the SigComp_version parameter.
The length of the state identifiers followed by the state identifiers in the format shown are appended to the end of the compressed message.
This section provides bytecode for implementing the SigComp shared compression mechanism, RFC 3321 [3]. If two endpoints A and B are communicating via SigComp, shared compression allows the messages sent from Endpoint A to Endpoint B to be compressed relative to the messages sent from Endpoint B to Endpoint A (and vice versa). This may improve the overall compression ratio by reducing the need to transmit the same information in both directions.
As described in RFC 3321 [3], two steps must be taken to implement shared compression at an endpoint.
First, it is necessary to announce to the remote endpoint that shared compression is available. This is done by announcing the state identifier as an available piece of state. This can be done using the returned_parameters_location announcement as in Section 5.4.
Second, assuming that such an announcement is received from the remote endpoint, then the state created by shared compression needs to be accessed by the message sent in the opposite direction. This can be done in a similar way to accessing the static dictionary (see Section 5.2), but using the appropriate state identifier, for example, by using the INPUT-BYTES instruction as below:
This document describes implementation options for the SigComp protocol [2]. Consequently, the security considerations for this document match those of SigComp.
Thanks to Richard Price, Carsten Bormann, Adam Roach, Lawrence Conroy, Christian Schmidt, Max Riegel, Lars-Erik Jonsson, Jonathan Rosenberg, Stefan Forsgren, Krister Svanbro, Miguel Garcia, Christopher Clanton, Khiem Le, Ka Cheong Leung, and Zoltan Barczikay for valuable input and review.
Special thanks to Pekka Pessi and Cristian Constantin, who served as committed working group document reviewers.
The IETF has been notified of intellectual property rights claimed in regard to some or all of the specification contained in this document. For more information consult the online list of claimed rights.
[1] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 4234, October 2005.
[2] Price, R., Bormann, C., Christoffersson, J., Hannu, H., Liu, Z., and J. Rosenberg, "Signaling Compression (SigComp)", RFC 3320, January 2003.
[3] Hannu, H., Christoffersson, J., Forsgren, S., Leung, K.-C., Liu, Z., and R. Price, "Signaling Compression (SigComp) - Extended Operations", RFC 3321, January 2003.
[4] Garcia-Martin, M., Bormann, C., Ott, J., Price, R., and A.B. Roach, "The Session Initiation Protocol (SIP) and Session Description Protocol (SDP) Static Dictionary for Signaling Compression (SigComp)", RFC 3485, February 2003.
[5] Ziv, J. and A. Lempel, "A universal algorithm for sequential data compression", IEEE 23:337-343, 1977.
[6] Storer, J., "Data Compression: Methods and Theory", Computer Science Press ISBN 0-88175-161-8, 1998.
Surtees & West Informational [Page 38]
RFC 4464 SigComp Users' Guide May 2006
[7] Nelson, M., "LZW Data Compression", Dr Dobb's Journal, October 1989.
[8] Deutsch, P., "DEFLATE Compressed Data Format Specification version 1.3", RFC 1951, May 1996.
[9] "Data Compression Procedures", ITU-T Recommendation V.44, November 2000.
Surtees & West Informational [Page 39]
RFC 4464 SigComp Users' Guide May 2006
Appendix A. UDVM Bytecode for the Compression Algorithms
The following sections list the UDVM bytecode generated for each compression algorithm of Section 4.
Note that the different assemblers can output different bytecode for the same piece of assembly code, so a valid assembler can produce results different from those presented below. However, the following bytecode should always generate the same decompressed messages on any UDVM.
This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.
Acknowledgement
Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA).