Network Working Group M. Chadalapaka Request for Comments: 3783 R. Elliott Category: Informational Hewlett-Packard Co. May 2004
Small Computer Systems Interface (SCSI) Command Ordering Considerations with iSCSI
Status of this Memo
This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.
Copyright (C) The Internet Society (2004). All Rights Reserved.
Internet Small Computer Systems Interface (iSCSI) is a Small Computer Systems Interface (SCSI) transport protocol designed to run on top of TCP. The iSCSI session abstraction is equivalent to the classic SCSI "I_T nexus", which represents the logical relationship between an Initiator and a Target (I and T) required in order to communicate via the SCSI family of protocols. The iSCSI session provides an ordered command delivery from the SCSI initiator to the SCSI target. This document goes into the design considerations that led to the iSCSI session model as it is defined today, relates the SCSI command ordering features defined in T10 specifications to the iSCSI concepts, and finally provides guidance to system designers on how true command ordering solutions can be built based on iSCSI.
iSCSI is a SCSI transport protocol ([iSCSI]) designed to enable running SCSI application protocols on TCP/IP networks, including potentially the Internet. Given the size and scope of the Internet, iSCSI thus enables some exciting new SCSI applications. Potential new application areas for exploiting iSCSI's value include the following:
a) Larger (diameter) Storage Area Networks (SANs) than had been possible until now b) Asynchronous remote mirroring c) Remote tape vaulting
Each of these applications takes advantage of the practically unlimited geographical distance that iSCSI enables between a SCSI initiator and a SCSI target. In each of these cases, because of the long delays involved, there is a very high incentive for the initiator to stream SCSI commands back-to-back without waiting for the SCSI status of previous commands. Command streaming may be employed primarily by two classes of applications - while one class may not particularly care about ordered command execution, the other class does rely on ordered command execution (i.e. there is an application-level dependency on the ordering among SCSI commands). As an example, cases b) and c) listed earlier clearly require ordered command execution. A mirroring application does not want the writes to be committed out of order on the remote SCSI target, so as to
Chadalapaka & Elliott Informational [Page 2]
RFC 3783 Command Ordering May 2004
preserve the transactional integrity of the data on that target. To summarize, SCSI command streaming, when coupled with the guarantee of ordered command execution on the SCSI target, is extremely valuable for a critical class of applications in long-latency networks.
This document reviews the various protocol considerations in designing storage solutions that employ SCSI command ordering. This document also analyzes and explains the design intent of [iSCSI] with respect to command ordering.
- I_T nexus: [SAM2] defines the I_T nexus as a relationship between a SCSI initiator port and a SCSI target port. [iSCSI] defines an iSCSI session as the iSCSI representation of an I_T nexus. In the iSCSI context, the I_T nexus (i.e. the iSCSI session) is a relationship between an iSCSI initiator's end of the session (SCSI Initiator Port) and the iSCSI target's Portal Group (SCSI Target Port).
- PDU (Protocol Data Unit): An iSCSI initiator and iSCSI target communicate using iSCSI protocol messages. These messages are called "iSCSI protocol data units" (iSCSI PDUs).
- SCSI device: A SCSI device is an entity that contains one or more SCSI ports that are connected to a service delivery subsystem and supports SCSI application protocols. In the iSCSI context, the SCSI Device is the component within an iSCSI Node that provides the SCSI functionality. The SCSI Device Name is defined to be the iSCSI Name of the node.
- Session: A group of logically related iSCSI connections that link an initiator with a target form a session (equivalent to a SCSI I-T nexus). The number of participating iSCSI connections within an iSCSI session may vary over time. The multiplicity of connections at the iSCSI level is completely hidden for the SCSI layer - each SCSI port in an I_T nexus sees only one peer SCSI port across all the connections of a session.
Acronym Definition -------------------------------------------------------------- ACA Auto Contingent Allegiance ASC Additional Sense Code ASCQ Additional Sense Code Qualifier CRN Command Reference Number IETF Internet Engineering Task Force ISID Initiator Session Identifier ITT Initiator Task Tag LU Logical Unit LUN Logical Unit Number NIC Network Interface Card PDU Protocol Data Unit TMF Task Management Function TSIH Target Session Identifying Handle SAN Storage Area Network SCSI Small Computer Systems Interface TCP Transmission Control Protocol UA Unit Attention WG Working Group
The iSCSI protocol is a mapping of the SCSI remote procedure invocation model (see [SAM2]) over the TCP protocol.
SCSI's notion of a task maps to an iSCSI task. Each iSCSI task is uniquely identified within that I_T nexus by a 32-bit unique identifier called Initiator Task Tag (ITT). The ITT is both an iSCSI identifier of the task and a classic SCSI task tag.
SCSI commands from the initiator to the target are carried in iSCSI requests called SCSI Command PDUs. SCSI status back to the initiator is carried in iSCSI responses called SCSI Response PDUs. SCSI Data- out from the initiator to the target is carried in SCSI Data-Out PDUs, and the SCSI Data-in back to the initiator is carried in SCSI Data-in PDUs.
In the iSCSI model, the SCSI I_T nexus maps directly to the iSCSI session, which is an iSCSI protocol abstraction spanning one or more TCP connections. The iSCSI protocol defines the semantics in order to realize one logical flow of bidirectional communication on the I_T nexus, potentially spanning multiple TCP connections (as many as 2^16). The multiplicity of iSCSI connections is thus completely contained at the iSCSI layer, while the SCSI layer is presented with a single I_T nexus, even in a multi-connection session. A session between a pair of given iSCSI nodes is identified by the session identifier (SSID) and each connection within a given session is uniquely identified by a connection identifier (CID) in iSCSI. The SSID itself has two components - Initiator Session Identifier (ISID) and a Target Session Identifying Handler (TSIH) - each identifying one end of the same session.
There are four crucial functional facets of iSCSI that together present this single logical flow abstraction to the SCSI layer, even with an iSCSI session spanning across multiple iSCSI connections.
a) Ordered command delivery: A sequence of SCSI commands that is striped across all the connections in the session is "reordered" by the target iSCSI layer into an identical sequence based on a Command Sequence Number (CmdSN) that is unique across the session. The goal is to achieve bandwidth aggregation from multiple TCP connections, but to still make it appear to the target SCSI layer as if all the commands had travelled in one flow.
b) Connection allegiance: All the PDU exchanges for a SCSI Command, up to and including the SCSI Response PDU for the Command, are required to flow on the same iSCSI connection at any given time. This again is intended to hide the multi- connection nature of a session because the SCSI layer on either side will never see the PDU contents out of order (e.g., status cannot bypass read data for an initiator).
c) Task set management function handling: [iSCSI] specifies an ordered sequence of steps for the iSCSI layer on the SCSI target in handling the two SCSI task management functions (TMFs) that manage SCSI task sets. The two TMFs are ABORT TASK SET that aborts all active tasks in a session, and CLEAR TASK SET that clears the tasks in the task set. The goal of the sequence of steps is to guarantee that the initiator receives the SCSI Response PDUs of all unaffected tasks before the TMF Response itself arrives, regardless of the number of connections in the iSCSI session. This operational model is
Chadalapaka & Elliott Informational [Page 5]
RFC 3783 Command Ordering May 2004
again intended to preserve the single flow abstraction to the SCSI layer.
d) Immediate task management function handling: Even when a TMF request is marked as "immediate" (i.e. only has a position in the command stream, but does not consume a CmdSN), [iSCSI] defines semantics that require the target iSCSI layer to ensure that the TMF request is executed as if the commands and the TMF request were all flowing on a single logical channel. This ensures that the TMF request will act on tasks that it was meant to manage.
The following sections will analyze the "Ordered command delivery" aspect in more detail, since command ordering is the focus of this document.
A couple of important questions related to iSCSI command ordering were considered early on in the design of the iSCSI protocol. The questions were:
a) What should be the command ordering behavior required of iSCSI implementations in the presence of transport errors, such as errors that corrupt the data in a fashion that is not detected by the TCP checksum (e.g., two offsetting bit flips in the same bit position), but is detected by the iSCSI CRC digest?
b) Should [iSCSI] require both initiators and targets to use ordered command delivery?
Since the answers to these questions are critical to the understanding of the ordering behavior required by the iSCSI protocol, the following sub-sections consider them in more detail.
The final disposition of question a) in section 3.3.1 was reflected in [RFC3347], "iSCSI MUST specify strictly ordered delivery of SCSI commands over an iSCSI session between an initiator/target pair, even in the presence of transport errors." Stated differently, an iSCSI digest failure, or an iSCSI connection termination, must not cause the iSCSI layer on a target to allow executing the commands in an order different from that intended (as indicated by the CmdSN order) by the initiator. This design choice is enormously helpful in building storage systems and solutions that can now always assume
Chadalapaka & Elliott Informational [Page 6]
RFC 3783 Command Ordering May 2004
command ordering to be a service characteristic of an iSCSI substrate.
Note that by taking the position that an iSCSI session always guarantees command ordering, [iSCSI] was indirectly implying that the principal reason for the multi-connection iSCSI session abstraction was to allow ordered bandwidth aggregation for an I_T nexus. In deployment models where this cross-connection ordering mandated by [iSCSI] is deemed expensive, a serious consideration should be given to deploying multiple single-connection sessions instead.
The final resolution of b) in section 3.3.1 by the iSCSI protocol designers was in favor of not always requiring the initiators to use command ordering. This resolution is reflected in dropping the mandatory ACA usage requirement on the initiators, and allowing an ABORT TASK TMF to plug a command hole etc., since these are conscious choices an initiator makes in favor of not using ordered command delivery. The net result can be discerned by a careful reader of [iSCSI] - the onus of ensuring ordered command delivery is always on the iSCSI targets, while the initiators may or may not utilize command ordering. iSCSI targets, being the servers in the client- server model, do not really attempt to establish whether or not a client (initiator) intends to take advantage of command ordering service, but instead simply always provide the guaranteed delivery service. The rationale here is that there are inherent SCSI and application-level dependencies, as we shall see in building a command ordered solution, that are beyond the scope of [iSCSI], to mandate or even discern the intent with respect to the usage of command ordering.
The service delivery subsystem (see [SAM2]) abstraction provided by an iSCSI session is guaranteed to have the intrinsic property of ordered delivery of commands to the target SCSI layer under all conditions. Consequently, the guarantee of the ordered command delivery is across the entire I_T nexus spanning all the LUs that the nexus is authorized to access. It is the initiator's discretion as to whether or not this property will be used.
A storage systems designer working with SCSI and iSCSI has to consider the following protocol features in SCSI and iSCSI layers, each of which has a role to play in realizing the command ordering goal.
CRN is an ordered sequence number which, when enabled for a device server, increments by one for each I_T_L nexus (see [SAM2]). The one notable drawback with CRN is that there is no SCSI-generic way (such as through mode pages) to enable or disable the CRN feature. [SAM2] also leaves the usage semantics of CRN for the SCSI transport protocol, such as iSCSI, to specify. [iSCSI] chose not to support the CRN feature for various reasons.
[SAM2] defines the following four task attributes - SIMPLE, ORDERED, HEAD OF QUEUE, and ACA. Each task to an LU may be assigned an attribute. [SAM2] defines the ordering constraints that each of these attributes conveys to the device server that is servicing the task. In particular, judicious use of ORDERED and SIMPLE attributes applied to a stream of pipelined commands could convey the precise execution schema for the commands that the initiator issues, provided the commands are received in the same order on the target.
ACA is an LU-level condition that is triggered when a command (with the NACA bit set to 1) completes with CHECK CONDITION. When ACA is triggered, it prevents all commands other than those with the ACA attribute from executing until the CLEAR ACA task management function is executed, while blocking all the other tasks already in the task set. See [SAM2] for the detailed semantics of ACA. Since ACA is closely tied to the notion of a task set, one would ideally have to select the scope of the task set (by setting the TST bit to 1 in the control mode page of the LU) to be per-initiator in order to prevent command failures in one I_T_L nexus from impacting other I_T_L nexuses through ACA.
When UA interlock is enabled, the logical unit does not clear any standard Unit Attention condition reported with autosense, and in addition, establishes a Unit Attention condition when a task is terminated with one of BUSY, TASK SET FULL, or RESERVATION CONFLICT statuses. This so-called "interlocked UA" is cleared only when the device server executes an explicit REQUEST SENSE ([SPC3]) command from the same initiator. From a functionality perspective, the scope of UA interlock today is slightly different from ACA's because it enforces ordering behavior for completion statuses other than CHECK CONDITION, but otherwise conceptually has the same design intent as ACA. On the other hand, ACA is somewhat more sophisticated because it allows special "cleanup" tasks (ones with ACA attribute) to execute when ACA is active. One of the principal reasons UA interlock came into being was that SCSI designers wanted a command ordering feature without the side effects of using the aforementioned TST bit in the control mode page.
As noted in section 3.2 and section 3.3, the iSCSI protocol enforces and guarantees ordered command delivery per iSCSI session using the CmdSN, and this is an attribute of the SCSI transport layer. Note further that any command ordering solution that seeks to realize ordering from the initiator SCSI layer to the target SCSI layer would be of practical value only when the command ordering is guaranteed by the SCSI transport layer. In other words, the related SCSI application layer protocol features such as ACA etc. are based on the premise of an ordered SCSI transport. Thus, iSCSI's command ordering is the last piece in completing the puzzle of building solutions that rely on ordered command execution, by providing the crucial guarantee that all the commands handed to the initiator iSCSI layer will be transported and handed to the target SCSI layer in the same order.
[iSCSI] mandates that when an iSCSI connection fails, the active tasks on that connection must be terminated if not recovered within a certain negotiated time limit. When an iSCSI target does terminate some subset of tasks due to iSCSI connection dynamics, there is a danger that the SCSI layer would simply move on to the next tasks waiting to be processed and execute them out-of-order unbeknownst to the initiator SCSI layer. To preclude this danger, [iSCSI] further mandates the following:
Chadalapaka & Elliott Informational [Page 9]
RFC 3783 Command Ordering May 2004
a) The tasks terminated due to the connection failure must be internally terminated by the iSCSI target "as if" due to a CHECK CONDITION. While this particular completion status is never communicated back to the initiator, the "as if" is still meaningful and required because if the initiator were using ACA as the command ordering mechanism of choice, a SCSI-level ACA will be triggered due to this mandatory CHECK CONDITION. This addresses the aforementioned danger.
b) After the tasks are terminated due to the connection failure, the iSCSI target must report a Unit Attention condition on the next command processed on any connection for each affected I_T_L nexus of that session. This is required because if the initiator were using UA interlock as the command ordering mechanism of choice, a SCSI-level UA will trigger a UA- interlock. This again addresses the aforementioned danger. iSCSI targets must report this UA with the status of CHECK CONDITION, and the ASC/ASCQ value of 47h/7Fh ("SOME COMMANDS CLEARED BY ISCSI PROTOCOL EVENT").
In general, command ordering is automatically enforced if targets and initiators comply with the iSCSI specification. However, listed below are certain additional related implementation considerations for the iSCSI initiators and targets to take note of.
a) Even when all iSCSI and SCSI command ordering considerations earlier noted in this document were applied, it is beneficial for iSCSI initiators to proactively avoid scenarios that would otherwise lead to out-of-order command execution. This is simply because the SCSI command ordering features such as UA interlock are likely to be costlier in performance when they are allowed to be triggered. [iSCSI] provides enough guidance on how to implement this proactive detection of PDU ordering errors.
b) The whole notion of command streaming does of course assume that the target in question supports command queueing. An iSCSI target desirous of supporting command ordering solutions should ensure that the SCSI layer on the target supports command queuing. The remote backup (tape vaulting) applications that iSCSI enables make an especially compelling case that tape devices should give a very serious consideration to supporting command queuing, at least when used in conjunction with iSCSI.
Chadalapaka & Elliott Informational [Page 10]
RFC 3783 Command Ordering May 2004
c) An iSCSI target desirous of supporting high-performance command ordering solutions that involve specifying a description of execution schema should ensure that the SCSI layer on the target in fact does support the ORDERED and SIMPLE task attributes.
d) There is some consideration of expanding the scope of UA interlock to encompass CHECK CONDITION status, and thus make it the only required command ordering functionality of implementations to build command ordering solutions. Until this is resolved in T10, the currently defined semantics of UA interlock and ACA warrant implementing both features by iSCSI targets desirous of supporting command ordering solutions.
[iSCSI] describes a "principle of conservative reuse" that encourages iSCSI initiators to reuse the same ISIDs (see section 3.2) to various SCSI target ports, in order to present the same SCSI initiator port name to those target ports. This is in fact a very crucial implementation consideration that must be complied with. [SPC3] mandates the SCSI targets to associate persistent reservations and the related registrations with the SCSI initiator port names whenever they are required by the SCSI transport protocol. Since [iSCSI] requires the mandatory SCSI initiator port names based on ISIDs, iSCSI targets are required to work off the SCSI initiator port names, and thus indirectly the ISIDs, in enforcing the persistent reservations.
This fact has the following implications for the implementations:
a) If a persistent reservation/registration is intended to be used across multiple SCSI ports of a SCSI device, the initiator iSCSI implementation must use the same ISID across associated iSCSI sessions connecting to different iSCSI target portal groups of the SCSI device.
b) If a persistent reservation/registration is intended to be used across the power loss of a SCSI target, the initiator iSCSI implementation must use the same ISIDs as before in re-establishing the associated iSCSI sessions upon subsequent reboot in order to rely on the persist through power loss capability.
For security considerations in using the iSCSI protocol, refer to the Security Considerations section in [iSCSI]. This document does not introduce any additional security considerations other than those already discussed in [iSCSI].
We are grateful to the IPS working group whose work defined the iSCSI protocol. Thanks also to David Black (EMC) who encouraged the publication of this document. Special thanks to Randy Haagens (HP) for his insights on the topic of command ordering. Thanks are also due to Elizabeth Rodriguez for carefully reviewing this document.
Copyright (C) The Internet Society (2004). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf- email@example.com.
Funding for the RFC Editor function is currently provided by the Internet Society.