Internet Engineering Task Force (IETF)                    D. Noveck, Ed.
Request for Comments: 8881                                        NetApp
Obsoletes: 5661                                                 C. Lever
Category: Standards Track                                         ORACLE
ISSN: 2070-1721                                              August 2020


      Network File System (NFS) Version 4 Minor Version 1 Protocol

Abstract



   This document describes the Network File System (NFS) version 4 minor
   version 1, including features retained from the base protocol (NFS
   version 4 minor version 0, which is specified in RFC 7530) and
   protocol extensions made subsequently.  The later minor version has
   no dependencies on NFS version 4 minor version 0, and is considered a
   separate protocol.

   This document obsoletes RFC 5661.  It substantially revises the
   treatment of features relating to multi-server namespace, superseding
   the description of those features appearing in RFC 5661.

Status of This Memo



   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 7841.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   https://www.rfc-editor.org/info/rfc8881.

Copyright Notice



   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents



   1.  Introduction
     1.1.  Introduction to This Update
     1.2.  The NFS Version 4 Minor Version 1 Protocol
     1.3.  Requirements Language
     1.4.  Scope of This Document
     1.5.  NFSv4 Goals
     1.6.  NFSv4.1 Goals
     1.7.  General Definitions
     1.8.  Overview of NFSv4.1 Features
     1.9.  Differences from NFSv4.0
   2.  Core Infrastructure
     2.1.  Introduction
     2.2.  RPC and XDR
     2.3.  COMPOUND and CB_COMPOUND
     2.4.  Client Identifiers and Client Owners
     2.5.  Server Owners
     2.6.  Security Service Negotiation
     2.7.  Minor Versioning
     2.8.  Non-RPC-Based Security Services
     2.9.  Transport Layers
     2.10. Session
   3.  Protocol Constants and Data Types
     3.1.  Basic Constants
     3.2.  Basic Data Types
     3.3.  Structured Data Types
   4.  Filehandles
     4.1.  Obtaining the First Filehandle
     4.2.  Filehandle Types
     4.3.  One Method of Constructing a Volatile Filehandle
     4.4.  Client Recovery from Filehandle Expiration
   5.  File Attributes
     5.1.  REQUIRED Attributes
     5.2.  RECOMMENDED Attributes
     5.3.  Named Attributes
     5.4.  Classification of Attributes
     5.5.  Set-Only and Get-Only Attributes
     5.6.  REQUIRED Attributes - List and Definition References
     5.7.  RECOMMENDED Attributes - List and Definition References
     5.8.  Attribute Definitions
     5.9.  Interpreting owner and owner_group
     5.10. Character Case Attributes
     5.11. Directory Notification Attributes
     5.12. pNFS Attribute Definitions
     5.13. Retention Attributes
   6.  Access Control Attributes
     6.1.  Goals
     6.2.  File Attributes Discussion
     6.3.  Common Methods
     6.4.  Requirements
   7.  Single-Server Namespace
     7.1.  Server Exports
     7.2.  Browsing Exports
     7.3.  Server Pseudo File System
     7.4.  Multiple Roots
     7.5.  Filehandle Volatility
     7.6.  Exported Root
     7.7.  Mount Point Crossing
     7.8.  Security Policy and Namespace Presentation
   8.  State Management
     8.1.  Client and Session ID
     8.2.  Stateid Definition
     8.3.  Lease Renewal
     8.4.  Crash Recovery
     8.5.  Server Revocation of Locks
     8.6.  Short and Long Leases
     8.7.  Clocks, Propagation Delay, and Calculating Lease Expiration
     8.8.  Obsolete Locking Infrastructure from NFSv4.0
   9.  File Locking and Share Reservations
     9.1.  Opens and Byte-Range Locks
     9.2.  Lock Ranges
     9.3.  Upgrading and Downgrading Locks
     9.4.  Stateid Seqid Values and Byte-Range Locks
     9.5.  Issues with Multiple Open-Owners
     9.6.  Blocking Locks
     9.7.  Share Reservations
     9.8.  OPEN/CLOSE Operations
     9.9.  Open Upgrade and Downgrade
     9.10. Parallel OPENs
     9.11. Reclaim of Open and Byte-Range Locks
   10. Client-Side Caching
     10.1.  Performance Challenges for Client-Side Caching
     10.2.  Delegation and Callbacks
     10.3.  Data Caching
     10.4.  Open Delegation
     10.5.  Data Caching and Revocation
     10.6.  Attribute Caching
     10.7.  Data and Metadata Caching and Memory Mapped Files
     10.8.  Name and Directory Caching without Directory Delegations
     10.9.  Directory Delegations
   11. Multi-Server Namespace
     11.1.  Terminology
     11.2.  File System Location Attributes
     11.3.  File System Presence or Absence
     11.4.  Getting Attributes for an Absent File System
     11.5.  Uses of File System Location Information
     11.6.  Trunking without File System Location Information
     11.7.  Users and Groups in a Multi-Server Namespace
     11.8.  Additional Client-Side Considerations
     11.9.  Overview of File Access Transitions
     11.10. Effecting Network Endpoint Transitions
     11.11. Effecting File System Transitions
     11.12. Transferring State upon Migration
     11.13. Client Responsibilities When Access Is Transitioned
     11.14. Server Responsibilities Upon Migration
     11.15. Effecting File System Referrals
     11.16. The Attribute fs_locations
     11.17. The Attribute fs_locations_info
     11.18. The Attribute fs_status
   12. Parallel NFS (pNFS)
     12.1.  Introduction
     12.2.  pNFS Definitions
     12.3.  pNFS Operations
     12.4.  pNFS Attributes
     12.5.  Layout Semantics
     12.6.  pNFS Mechanics
     12.7.  Recovery
     12.8.  Metadata and Storage Device Roles
     12.9.  Security Considerations for pNFS
   13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type
     13.1.  Client ID and Session Considerations
     13.2.  File Layout Definitions
     13.3.  File Layout Data Types
     13.4.  Interpreting the File Layout
     13.5.  Data Server Multipathing
     13.6.  Operations Sent to NFSv4.1 Data Servers
     13.7.  COMMIT through Metadata Server
     13.8.  The Layout Iomode
     13.9.  Metadata and Data Server State Coordination
     13.10. Data Server Component File Size
     13.11. Layout Revocation and Fencing
     13.12. Security Considerations for the File Layout Type
   14. Internationalization
     14.1.  Stringprep Profile for the utf8str_cs Type
     14.2.  Stringprep Profile for the utf8str_cis Type
     14.3.  Stringprep Profile for the utf8str_mixed Type
     14.4.  UTF-8 Capabilities
     14.5.  UTF-8 Related Errors
   15. Error Values
     15.1.  Error Definitions
     15.2.  Operations and Their Valid Errors
     15.3.  Callback Operations and Their Valid Errors
     15.4.  Errors and the Operations That Use Them
   16. NFSv4.1 Procedures
     16.1.  Procedure 0: NULL - No Operation
     16.2.  Procedure 1: COMPOUND - Compound Operations
   17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL
   18. NFSv4.1 Operations
     18.1.  Operation 3: ACCESS - Check Access Rights
     18.2.  Operation 4: CLOSE - Close File
     18.3.  Operation 5: COMMIT - Commit Cached Data
     18.4.  Operation 6: CREATE - Create a Non-Regular File Object
     18.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting
             Recovery
     18.6.  Operation 8: DELEGRETURN - Return Delegation
     18.7.  Operation 9: GETATTR - Get Attributes
     18.8.  Operation 10: GETFH - Get Current Filehandle
     18.9.  Operation 11: LINK - Create Link to a File
     18.10. Operation 12: LOCK - Create Lock
     18.11. Operation 13: LOCKT - Test for Lock
     18.12. Operation 14: LOCKU - Unlock File
     18.13. Operation 15: LOOKUP - Lookup Filename
     18.14. Operation 16: LOOKUPP - Lookup Parent Directory
     18.15. Operation 17: NVERIFY - Verify Difference in Attributes
     18.16. Operation 18: OPEN - Open a Regular File
     18.17. Operation 19: OPENATTR - Open Named Attribute Directory
     18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
     18.19. Operation 22: PUTFH - Set Current Filehandle
     18.20. Operation 23: PUTPUBFH - Set Public Filehandle
     18.21. Operation 24: PUTROOTFH - Set Root Filehandle
     18.22. Operation 25: READ - Read from File
     18.23. Operation 26: READDIR - Read Directory
     18.24. Operation 27: READLINK - Read Symbolic Link
     18.25. Operation 28: REMOVE - Remove File System Object
     18.26. Operation 29: RENAME - Rename Directory Entry
     18.27. Operation 31: RESTOREFH - Restore Saved Filehandle
     18.28. Operation 32: SAVEFH - Save Current Filehandle
     18.29. Operation 33: SECINFO - Obtain Available Security
     18.30. Operation 34: SETATTR - Set Attributes
     18.31. Operation 37: VERIFY - Verify Same Attributes
     18.32. Operation 38: WRITE - Write to File
     18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control
     18.34. Operation 41: BIND_CONN_TO_SESSION - Associate Connection
             with Session
     18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID
     18.36. Operation 43: CREATE_SESSION - Create New Session and
             Confirm Client ID
     18.37. Operation 44: DESTROY_SESSION - Destroy a Session
     18.38. Operation 45: FREE_STATEID - Free Stateid with No Locks
     18.39. Operation 46: GET_DIR_DELEGATION - Get a Directory
             Delegation
     18.40. Operation 47: GETDEVICEINFO - Get Device Information
     18.41. Operation 48: GETDEVICELIST - Get All Device Mappings for
             a File System
     18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a
             Layout
     18.43. Operation 50: LAYOUTGET - Get Layout Information
     18.44. Operation 51: LAYOUTRETURN - Release Layout Information
     18.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed
             Object
     18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing
             and Control
     18.47. Operation 54: SET_SSV - Update SSV for a Client ID
     18.48. Operation 55: TEST_STATEID - Test Stateids for Validity
     18.49. Operation 56: WANT_DELEGATION - Request Delegation
     18.50. Operation 57: DESTROY_CLIENTID - Destroy a Client ID
     18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims
             Finished
     18.52. Operation 10044: ILLEGAL - Illegal Operation
   19. NFSv4.1 Callback Procedures
     19.1.  Procedure 0: CB_NULL - No Operation
     19.2.  Procedure 1: CB_COMPOUND - Compound Operations
   20. NFSv4.1 Callback Operations
     20.1.  Operation 3: CB_GETATTR - Get Attributes
     20.2.  Operation 4: CB_RECALL - Recall a Delegation
     20.3.  Operation 5: CB_LAYOUTRECALL - Recall Layout from Client
     20.4.  Operation 6: CB_NOTIFY - Notify Client of Directory
             Changes
     20.5.  Operation 7: CB_PUSH_DELEG - Offer Previously Requested
             Delegation to Client
     20.6.  Operation 8: CB_RECALL_ANY - Keep Any N Recallable Objects
     20.7.  Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources
             for Recallable Objects
     20.8.  Operation 10: CB_RECALL_SLOT - Change Flow Control Limits
     20.9.  Operation 11: CB_SEQUENCE - Supply Backchannel Sequencing
             and Control
     20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending
             Delegation Wants
     20.11. Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible
             Lock Availability
     20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify Client of Device
             ID Changes
     20.13. Operation 10044: CB_ILLEGAL - Illegal Callback Operation
   21. Security Considerations
   22. IANA Considerations
     22.1.  IANA Actions
     22.2.  Named Attribute Definitions
     22.3.  Device ID Notifications
     22.4.  Object Recall Types
     22.5.  Layout Types
     22.6.  Path Variable Definitions
   23. References
     23.1.  Normative References
     23.2.  Informative References
   Appendix A.  The Need for This Update
   Appendix B.  Changes in This Update
     B.1.  Revisions Made to Section 11 of RFC 5661
     B.2.  Revisions Made to Operations in RFC 5661
     B.3.  Revisions Made to Error Definitions in RFC 5661
     B.4.  Other Revisions Made to RFC 5661
   Appendix C.  Security Issues That Need to Be Addressed
   Acknowledgments

   Authors' Addresses



1.  Introduction



1.1.  Introduction to This Update



   Two important features previously defined in minor version 0 but
   never fully addressed in minor version 1 are trunking, which is the
   simultaneous use of multiple connections between a client and server,
   potentially to different network addresses, and Transparent State
   Migration, which enables a file system to be transferred between
   servers in a way that allows the client to maintain its existing
   locking state across the transfer.

   The revised description of the NFS version 4 minor version 1
   (NFSv4.1) protocol presented in this update is necessary to enable
   full use of these features together with other multi-server namespace
   features.  This document is in the form of an updated description of
   the NFSv4.1 protocol previously defined in RFC 5661 [66].  RFC 5661
   is obsoleted by this document.  However, the update has a limited
   scope and is focused on enabling full use of trunking and Transparent
   State Migration.  The need for these changes is discussed in
   Appendix A.  Appendix B describes the specific changes made to arrive
   at the current text.

   This limited-scope update replaces the current NFSv4.1 RFC with the
   intention of providing an authoritative and complete specification,
   the motivation for which is discussed in [36], addressing the issues
   within the scope of the update.  However, it will not address known
   issues that are outside of this limited scope, as a full update of
   the protocol would be expected to do.  Below are some areas that are
   known to need addressing in a future update of the protocol:

   *  Work needs to be done with regard to RFC 8178 [67], which
      establishes NFSv4-wide versioning rules.  As RFC 5661 is currently
      inconsistent with that document, changes are needed in order to
      arrive at a situation in which there would be no need for RFC 8178
      to update the NFSv4.1 specification.

   *  Work needs to be done with regard to RFC 8434 [70], which
      establishes the requirements for parallel NFS (pNFS) layout types,
      which are not clearly defined in RFC 5661.  When that work is done
      and the resulting documents approved, the new NFSv4.1
      specification document will provide a clear set of requirements
      for layout types and a description of the file layout type that
      conforms to those requirements.  Other layout types will have
      their own specification documents that conform to those
      requirements as well.

   *  Work needs to be done to address many errata reports relevant to
      RFC 5661, other than errata report 2006 [64], which is addressed
      in this document.  Addressing that report was not deferrable
      because of the interaction of the changes suggested there and the
      newly described handling of state and session migration.

      The errata reports that have been deferred and that will need to
      be addressed in a later document include reports currently
      assigned a range of statuses in the errata reporting system,
      including reports marked Accepted and those marked Hold For
      Document Update because the change was too minor to address
      immediately.

      In addition, there is a set of other reports, including at least
      one in state Rejected, that will need to be addressed in a later
      document.  This will involve making changes to consensus decisions
      reflected in RFC 5661, in situations in which the working group
      has decided that the treatment in RFC 5661 is incorrect and needs
      to be revised to reflect the working group's new consensus and to
      ensure compatibility with existing implementations that do not
      follow the handling described in RFC 5661.

      Note that it is expected that all such errata reports will remain
      relevant to implementors and the authors of an eventual
      rfc5661bis, despite the fact that this document obsoletes RFC 5661
      [66].

   *  There is a need for a new approach to the description of
      internationalization since the current internationalization
      section (Section 14) has never been implemented and does not meet
      the needs of the NFSv4 protocol.  Possible solutions are to create
      a new internationalization section modeled on that in [68] or to
      create a new document describing internationalization for all
      NFSv4 minor versions and reference that document in the RFCs
      defining both NFSv4.0 and NFSv4.1.

   *  There is a need for a revised treatment of security in NFSv4.1.
      The issues with the existing treatment are discussed in
      Appendix C.

   Until the above work is done, there will not be a consistent set of
   documents that provides a description of the NFSv4.1 protocol, and
   any full description would involve documents updating other documents
   within the specification.  The updates applied by RFC 8434 [70] and
   RFC 8178 [67] to RFC 5661 also apply to this specification, and will
   apply to any subsequent v4.1 specification until that work is done.

1.2.  The NFS Version 4 Minor Version 1 Protocol



   The NFS version 4 minor version 1 (NFSv4.1) protocol is the second
   minor version of the NFS version 4 (NFSv4) protocol.  The first minor
   version, NFSv4.0, is now described in RFC 7530 [68].  It generally
   follows the guidelines for minor versioning that are listed in
   Section 10 of RFC 3530 [37].  However, it diverges from guidelines 11
   ("a client and server that support minor version X must support minor
   versions 0 through X-1") and 12 ("no new features may be introduced
   as mandatory in a minor version").  These divergences are due to the
   introduction of the sessions model for managing non-idempotent
   operations and the RECLAIM_COMPLETE operation.  These two new
   features are infrastructural in nature and simplify implementation of
   existing and other new features.  Making them anything but REQUIRED
   would add undue complexity to protocol definition and implementation.
   NFSv4.1 accordingly updates the minor versioning guidelines
   (Section 2.7).

   As a minor version, NFSv4.1 is consistent with the overall goals for
   NFSv4, but extends the protocol so as to better meet those goals,
   based on experiences with NFSv4.0.  In addition, NFSv4.1 has adopted
   some additional goals, which motivate some of the major extensions in
   NFSv4.1.

1.3.  Requirements Language



   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

1.4.  Scope of This Document



   This document describes the NFSv4.1 protocol.  With respect to
   NFSv4.0, this document does not:

   *  describe the NFSv4.0 protocol, except where needed to contrast
      with NFSv4.1.

   *  modify the specification of the NFSv4.0 protocol.

   *  clarify the NFSv4.0 protocol.

1.5.  NFSv4 Goals



   The NFSv4 protocol is a further revision of the NFS protocol
   previously defined by NFSv3 [38].  It retains the essential
   characteristics of
   previous versions: easy recovery; independence of transport
   protocols, operating systems, and file systems; simplicity; and good
   performance.  NFSv4 has the following goals:

   *  Improved access and good performance on the Internet

      The protocol is designed to transit firewalls easily, perform well
      where latency is high and bandwidth is low, and scale to very
      large numbers of clients per server.

   *  Strong security with negotiation built into the protocol

      The protocol builds on the work of the ONCRPC working group in
      supporting the RPCSEC_GSS protocol.  Additionally, the NFSv4.1
      protocol provides a mechanism that allows clients and servers to
      negotiate security, and it requires clients and servers to support
      a minimal set of security schemes.

   *  Good cross-platform interoperability

      The protocol features a file system model that provides a useful,
      common set of features that does not unduly favor one file system
      or operating system over another.

   *  Designed for protocol extensions

      The protocol is designed to accept standard extensions within a
      framework that enables and encourages backward compatibility.

1.6.  NFSv4.1 Goals



   NFSv4.1 has the following goals, within the framework established by
   the overall NFSv4 goals.

   *  To correct significant structural weaknesses and oversights
      discovered in the base protocol.

   *  To add clarity and specificity to areas left unaddressed or not
      addressed in sufficient detail in the base protocol.  However, as
      stated in Section 1.4, it is not a goal to clarify the NFSv4.0
      protocol in the NFSv4.1 specification.

   *  To add specific features based on experience with the existing
      protocol and recent industry developments.

   *  To provide protocol support to take advantage of clustered server
      deployments including the ability to provide scalable parallel
      access to files distributed among multiple servers.

1.7.  General Definitions



   The following definitions provide an appropriate context for the
   reader.

   Byte:  In this document, a byte is an octet, i.e., a datum exactly 8
      bits in length.

   Client:  The client is the entity that accesses the NFS server's
      resources.  The client may be an application that contains the
      logic to access the NFS server directly.  The client may also be
      the traditional operating system client that provides remote file
      system services for a set of applications.

      A client is uniquely identified by a client owner.

      With reference to byte-range locking, the client is also the
      entity that maintains a set of locks on behalf of one or more
      applications.  This client is responsible for crash or failure
      recovery for those locks it manages.

      Note that multiple clients may share the same transport and
      connection and multiple clients may exist on the same network
      node.

   Client ID:  The client ID is a 64-bit quantity used as a unique,
      short-hand reference to a client-supplied verifier and client
      owner.  The server is responsible for supplying the client ID.

   Client Owner:  The client owner is a unique string, opaque to the
      server, that identifies a client.  Multiple network connections
      and source network addresses originating from those connections
      may share a client owner.  The server is expected to treat
      requests from connections with the same client owner as coming
      from the same client.

   File System:  The file system is the collection of objects on a
      server (as identified by the major identifier of a server owner,
      which is defined later in this section) that share the same fsid
      attribute (see Section 5.8.1.9).

   Lease:  A lease is an interval of time defined by the server for
      which the client is irrevocably granted locks.  At the end of a
      lease period, locks may be revoked if the lease has not been
      extended.  A lock must be revoked if a conflicting lock has been
      granted after the lease interval.

      A server grants a client a single lease for all state.

   Lock:  The term "lock" is used to refer to byte-range (in UNIX
      environments, also known as record) locks, share reservations,
      delegations, or layouts unless specifically stated otherwise.

   Secret State Verifier (SSV):  The SSV is a unique secret key shared
      between a client and server.  The SSV serves as the secret key for
      an internal (that is, internal to NFSv4.1) Generic Security
      Services (GSS) mechanism (the SSV GSS mechanism; see
      Section 2.10.9).  The SSV GSS mechanism uses the SSV to compute
      message integrity code (MIC) and Wrap tokens.  See
      Section 2.10.8.3 for more details on how NFSv4.1 uses the SSV and
      the SSV GSS mechanism.

   Server:  The Server is the entity responsible for coordinating client
      access to a set of file systems and is identified by a server
      owner.  A server can span multiple network addresses.

   Server Owner:  The server owner identifies the server to the client.
      The server owner consists of a major identifier and a minor
      identifier.  When the client has two connections each to a peer
      with the same major identifier, the client assumes that both peers
      are the same server (the server namespace is the same via each
      connection) and that lock state is shareable across both
      connections.  When each peer has both the same major and minor
      identifiers, the client assumes that each connection might be
      associable with the same session.

   Stable Storage:  Stable storage is storage from which data stored by
      an NFSv4.1 server can be recovered without data loss from multiple
      power failures (including cascading power failures, that is,
      several power failures in quick succession), operating system
      failures, and/or hardware failure of components other than the
      storage medium itself (such as disk, nonvolatile RAM, flash
      memory, etc.).

      Some examples of stable storage that are allowable for an NFS
      server include:

      1.  Media commit of data; that is, the modified data has been
          successfully written to the disk media, for example, the disk
          platter.

      2.  An immediate reply disk drive with battery-backed, on-drive
          intermediate storage or uninterruptible power system (UPS).

      3.  Server commit of data with battery-backed intermediate storage
          and recovery software.

      4.  Cache commit with uninterruptible power system (UPS) and
          recovery software.

   Stateid:  A stateid is a 128-bit quantity returned by a server that
      uniquely defines the open and locking states provided by the
      server for a specific open-owner or lock-owner/open-owner pair for
      a specific file and type of lock.

   Verifier:  A verifier is a 64-bit quantity generated by the client
      that the server can use to determine if the client has restarted
      and lost all previous lock state.

1.8.  Overview of NFSv4.1 Features



   The major features of the NFSv4.1 protocol are reviewed here in
   brief, to provide an appropriate context both for the reader who is
   familiar with previous versions of the NFS protocol and for the
   reader who is new to the NFS protocols.  For the reader new to the
   NFS protocols, there is still a set of fundamental knowledge
   that is expected.  The reader should be familiar with the External
   Data Representation (XDR) and Remote Procedure Call (RPC) protocols
   as described in [2] and [3].  A basic knowledge of file systems and
   distributed file systems is expected as well.

   In general, this specification of NFSv4.1 will not distinguish those
   features added in minor version 1 from those present in the base
   protocol but will treat NFSv4.1 as a unified whole.  See Section 1.9
   for a summary of the differences between NFSv4.0 and NFSv4.1.

1.8.1.  RPC and Security



   As with previous versions of NFS, the External Data Representation
   (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFSv4.1
   protocol are those defined in [2] and [3].  To meet end-to-end
   security requirements, the RPCSEC_GSS framework [4] is used to extend
   the basic RPC security.  With the use of RPCSEC_GSS, various
   mechanisms can be provided to offer authentication, integrity, and
   privacy to the NFSv4 protocol.  Kerberos V5 is used as described in
   [5] to provide one security framework.  With the use of RPCSEC_GSS,
   other mechanisms may also be specified and used for NFSv4.1 security.

   To enable in-band security negotiation, the NFSv4.1 protocol has
   operations that provide the client a method of querying the server
   about its policies regarding which security mechanisms must be used
   for access to the server's file system resources.  With this, the
   client can securely match the security mechanism that meets the
   policies specified at both the client and server.

   NFSv4.1 introduces parallel access (see Section 1.8.2.2), which is
   called pNFS.  The security framework described in this section is
   significantly modified by the introduction of pNFS (see
   Section 12.9), because data access is sometimes not over RPC.  The
   level of significance varies with the storage protocol (see
   Section 12.2.5) and can be as low as zero impact (see Section 13.12).

1.8.2.  Protocol Structure



1.8.2.1.  Core Protocol



   Unlike NFSv3, which used a series of ancillary protocols (e.g., NLM,
   NSM (Network Status Monitor), MOUNT), within all minor versions of
   NFSv4 a single RPC protocol is used to make requests to the server.
   Facilities that had been separate protocols, such as locking, are now
   integrated within a single unified protocol.

1.8.2.2.  Parallel Access



   Minor version 1 supports high-performance data access to a clustered
   server implementation by enabling a separation of metadata access and
   data access, with the latter done to multiple servers in parallel.

   Such parallel data access is controlled by recallable objects known
   as "layouts", which are integrated into the protocol locking model.
   Clients direct requests for data access to a set of data servers
   specified by the layout via a data storage protocol, which may be
   NFSv4.1 or another protocol.

   Because the protocols used for parallel data access are not
   necessarily RPC-based, the RPC-based security model (Section 1.8.1)
   is obviously impacted (see Section 12.9).  The degree of impact
   varies with the storage protocol (see Section 12.2.5) used for data
   access, and can be as low as zero (see Section 13.12).

1.8.3.  File System Model



   The general file system model used for the NFSv4.1 protocol is the
   same as previous versions.  The server file system is hierarchical
   with the regular files contained within being treated as opaque byte
   streams.  In a slight departure, file and directory names are encoded
   with UTF-8 to deal with the basics of internationalization.

   The NFSv4.1 protocol does not require a separate protocol to provide
   for the initial mapping between path name and filehandle.  All file
   systems exported by a server are presented as a tree so that all file
   systems are reachable from a special per-server global root
   filehandle.  This allows LOOKUP operations to be used to perform
   functions previously provided by the MOUNT protocol.  The server
   provides any necessary pseudo file systems to bridge the gaps that
   arise due to unexported portions of the namespace lying between
   exported file systems.

1.8.3.1.  Filehandles



   As in previous versions of the NFS protocol, opaque filehandles are
   used to identify individual files and directories.  Lookup-type and
   create operations translate file and directory names to filehandles,
   which are then used to identify objects in subsequent operations.

   The NFSv4.1 protocol provides support for persistent filehandles,
   guaranteed to be valid for the lifetime of the file system object
   designated.  In addition, it allows servers to provide filehandles
   with more limited validity guarantees, called volatile filehandles.

1.8.3.2.  File Attributes



   The NFSv4.1 protocol has a rich and extensible file object attribute
   structure, which is divided into REQUIRED, RECOMMENDED, and named
   attributes (see Section 5).

   Several (but not all) of the REQUIRED attributes are derived from the
   attributes of NFSv3 (see the definition of the fattr3 data type in
   [38]).  An example of a REQUIRED attribute is the file object's type
   (Section 5.8.1.2) so that regular files can be distinguished from
   directories (also known as folders in some operating environments)
   and other types of objects.  REQUIRED attributes are discussed in
   Section 5.1.

   Three examples of RECOMMENDED attributes are acl, sacl, and dacl.
   These attributes define an Access Control List (ACL) on a file object
   (Section 6).  An ACL provides directory and file access control
   beyond the model used in NFSv3.  The ACL definition allows for
   specification of specific sets of permissions for individual users
   and groups.  In addition, ACL inheritance allows propagation of
   access permissions and restrictions down a directory tree as file
   system objects are created.  RECOMMENDED attributes are discussed in
   Section 5.2.

   A named attribute is an opaque byte stream that is associated with a
   directory or file and referred to by a string name.  Named attributes
   are meant to be used by client applications as a method to associate
   application-specific data with a regular file or directory.  NFSv4.1
   modifies named attributes relative to NFSv4.0 by tightening the
   allowed operations in order to prevent the development of non-
   interoperable implementations.  Named attributes are discussed in
   Section 5.3.

1.8.3.3.  Multi-Server Namespace



   NFSv4.1 contains a number of features to allow implementation of
   namespaces that cross server boundaries and that allow and facilitate
   a nondisruptive transfer of support for individual file systems
   between servers.  They are all based upon attributes that allow one
   file system to specify alternate, additional, and new location
   information that indicates how the client may access that file
   system.

   For individual active file systems, these attributes can be used to
   provide:

   *  Alternate network addresses to access the current file system
      instance.

   *  The locations of alternate file system instances or replicas to be
      used in the event that the current file system instance becomes
      unavailable.

   These file system location attributes may be used together with the
   concept of absent file systems, in which a position in the server
   namespace is associated with locations on other servers without there
   being any corresponding file system instance on the current server.
   For example,

   *  These attributes may be used with absent file systems to implement
      referrals whereby one server may direct the client to a file
      system provided by another server.  This allows extensive multi-
      server namespaces to be constructed.

   *  These attributes may be provided when a previously present file
      system becomes absent.  This allows nondisruptive migration of
      file systems to alternate servers.

1.8.4.  Locking Facilities



   As mentioned previously, NFSv4.1 is a single protocol that includes
   locking facilities.  These locking facilities include support for
   many types of locks including a number of sorts of recallable locks.
   Recallable locks such as delegations allow the client to be assured
   that certain events will not occur so long as that lock is held.
   When circumstances change, the lock is recalled via a callback
   request.  The assurances provided by delegations allow more extensive
   caching to be done safely when circumstances allow it.

   The types of locks are:

   *  Share reservations as established by OPEN operations.

   *  Byte-range locks.

   *  File delegations, which are recallable locks that assure the
      holder that inconsistent opens and file changes cannot occur so
      long as the delegation is held.

   *  Directory delegations, which are recallable locks that assure the
      holder that inconsistent directory modifications cannot occur so
      long as the delegation is held.

   *  Layouts, which are recallable objects that assure the holder that
      access to the file data may be performed directly by the client
      and that no change to the data's location that is inconsistent
      with that access may be made so long as the layout is held.

   All locks for a given client are tied together under a single client-
   wide lease.  All requests made on sessions associated with the client
   renew that lease.  When the client's lease is not promptly renewed,
   the client's locks are subject to revocation.  In the event of server
   restart, clients have the opportunity to safely reclaim their locks
   within a special grace period.

1.9.  Differences from NFSv4.0



   The following summarizes the major differences between minor version
   1 and the base protocol:

   *  Implementation of the sessions model (Section 2.10).

   *  Parallel access to data (Section 12).

   *  Addition of the RECLAIM_COMPLETE operation to better structure the
      lock reclamation process (Section 18.51).

   *  Enhanced delegation support as follows.

      -  Delegations on directories and other file types in addition to
         regular files (Section 18.39, Section 18.49).

      -  Operations to optimize acquisition of recalled or denied
         delegations (Section 18.49, Section 20.5, Section 20.7).

      -  Notifications of changes to files and directories
         (Section 18.39, Section 20.4).

      -  A method to allow a server to indicate that it is recalling one
         or more delegations for resource management reasons, and thus a
         method to allow the client to pick which delegations to return
         (Section 20.6).

   *  Attributes can be set atomically during exclusive file create via
      the OPEN operation (see the new EXCLUSIVE4_1 creation method in
      Section 18.16).

   *  Open files can be preserved if removed and the hard link count
      ("hard link" is defined in an Open Group [6] standard) goes to
      zero, thus obviating the need for clients to rename deleted files
      to partially hidden names -- colloquially called "silly rename"
      (see the new OPEN4_RESULT_PRESERVE_UNLINKED reply flag in
      Section 18.16).

   *  Improved compatibility with Microsoft Windows for Access Control
      Lists (Section 6.2.3, Section 6.2.2, Section 6.4.3.2).

   *  Data retention (Section 5.13).

   *  Identification of the implementation of the NFS client and server
      (Section 18.35).

   *  Support for notification of the availability of byte-range locks
      (see the new OPEN4_RESULT_MAY_NOTIFY_LOCK reply flag in
      Section 18.16 and see Section 20.11).

   *  In NFSv4.1, LIPKEY and SPKM-3 are not required security mechanisms
      [39].

2.  Core Infrastructure

2.1.  Introduction



   NFSv4.1 relies on core infrastructure common to nearly every
   operation.  This core infrastructure is described in the remainder of
   this section.

2.2.  RPC and XDR



   The NFSv4.1 protocol is a Remote Procedure Call (RPC) application
   that uses RPC version 2 and the corresponding eXternal Data
   Representation (XDR) as defined in [3] and [2].

2.2.1.  RPC-Based Security



   Previous NFS versions have been thought of as having a host-based
   authentication model, where the NFS server authenticates the NFS
   client, and trusts the client to authenticate all users.  Actually,
   NFS has always depended on RPC for authentication.  One of the first
   forms of RPC authentication, AUTH_SYS, had no strong authentication
   and required a host-based authentication approach.  NFSv4.1 also
   depends on RPC for basic security services and mandates RPC support
   for a user-based authentication model.  The user-based authentication
   model has user principals authenticated by a server, and in turn the
   server authenticated by user principals.  RPC provides some basic
   security services that are used by NFSv4.1.

2.2.1.1.  RPC Security Flavors



   As described in "Authentication", Section 7 of [3], RPC security is
   encapsulated in the RPC header, via a security or authentication
   flavor, and information specific to the specified security flavor.
   Every RPC header conveys information used to identify and
   authenticate a client and server.  As discussed in Section 2.2.1.1.1,
   some security flavors provide additional security services.

   NFSv4.1 clients and servers MUST implement RPCSEC_GSS.  (This
   requirement to implement is not a requirement to use.)  Other
   flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well.

2.2.1.1.1.  RPCSEC_GSS and Security Services


   RPCSEC_GSS [4] uses the functionality of GSS-API [7].  This allows
   for the use of various security mechanisms by the RPC layer without
   the additional implementation overhead of adding RPC security
   flavors.

2.2.1.1.1.1.  Identification, Authentication, Integrity, Privacy


   Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate
   users on clients to servers, and servers to users.  It can also
   perform integrity checking on the entire RPC message, including the
   RPC header, and on the arguments or results.  Finally, privacy,
   usually via encryption, is a service available with RPCSEC_GSS.
   Privacy is performed on the arguments and results.  Note that if
   privacy is selected, integrity, authentication, and identification
   are enabled.  If privacy is not selected, but integrity is selected,
   authentication and identification are enabled.  If integrity and
   privacy are not selected, but authentication is enabled,
   identification is enabled.  RPCSEC_GSS does not provide
   identification as a separate service.

   Although GSS-API has an authentication service distinct from its
   privacy and integrity services, GSS-API's authentication service is
   not used for RPCSEC_GSS's authentication service.  Instead, each RPC
   request and response header is integrity protected with the GSS-API
   integrity service, and this allows RPCSEC_GSS to offer per-RPC
   authentication and identity.  See [4] for more information.

   NFSv4.1 clients and servers MUST support RPCSEC_GSS's integrity and
   authentication service.  NFSv4.1 servers MUST support RPCSEC_GSS's
   privacy service.  NFSv4.1 clients SHOULD support RPCSEC_GSS's privacy
   service.

2.2.1.1.1.2.  Security Mechanisms for NFSv4.1


   RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide
   security services.  Therefore, NFSv4.1 clients and servers MUST
   support the Kerberos V5 security mechanism.

   The use of RPCSEC_GSS requires selection of mechanism, quality of
   protection (QOP), and service (authentication, integrity, privacy).
   For the mandated security mechanisms, NFSv4.1 specifies that a QOP of
   zero is used, leaving it up to the mechanism or the mechanism's
   configuration to map QOP zero to an appropriate level of protection.
   Each mandated mechanism specifies a minimum set of cryptographic
   algorithms for implementing integrity and privacy.  NFSv4.1 clients
   and servers MUST be implemented on operating environments that comply
   with the REQUIRED cryptographic algorithms of each REQUIRED
   mechanism.

2.2.1.1.1.2.1.  Kerberos V5


   The Kerberos V5 GSS-API mechanism as described in [5] MUST be
   implemented with the RPCSEC_GSS services as specified in the
   following table:

      column descriptions:
      1 == number of pseudo flavor
      2 == name of pseudo flavor
      3 == mechanism's OID
      4 == RPCSEC_GSS service
      5 == NFSv4.1 clients MUST support
      6 == NFSv4.1 servers MUST support

      1      2        3                    4                     5   6
      ------------------------------------------------------------------
      390003 krb5     1.2.840.113554.1.2.2 rpc_gss_svc_none      yes yes
      390004 krb5i    1.2.840.113554.1.2.2 rpc_gss_svc_integrity yes yes
      390005 krb5p    1.2.840.113554.1.2.2 rpc_gss_svc_privacy    no yes

   Note that the number and name of the pseudo flavor are presented here
   as a mapping aid to the implementor.  Because the NFSv4.1 protocol
   includes a method to negotiate security and it understands the GSS-
   API mechanism, the pseudo flavor is not needed.  The pseudo flavor is
   needed for NFSv3 since the security negotiation is done via the
   MOUNT protocol as described in [40].

   At the time NFSv4.1 was specified, the Advanced Encryption Standard
   (AES) with HMAC-SHA1 was a REQUIRED algorithm set for Kerberos V5.
   In contrast, when NFSv4.0 was specified, weaker algorithm sets were
   REQUIRED for Kerberos V5, and were REQUIRED in the NFSv4.0
   specification, because the Kerberos V5 specification at the time did
   not specify stronger algorithms.  The NFSv4.1 specification does not
   specify REQUIRED algorithms for Kerberos V5, and instead, the
   implementor is expected to track the evolution of the Kerberos V5
   standard if and when stronger algorithms are specified.

2.2.1.1.1.2.1.1.  Security Considerations for Cryptographic Algorithms
                  in Kerberos V5


   When deploying NFSv4.1, the strength of the security achieved depends
   on the existing Kerberos V5 infrastructure.  The algorithms of
   Kerberos V5 are not directly exposed to or selectable by the client
   or server, so there is some due diligence required by the user of
   NFSv4.1 to ensure that security is acceptable where needed.

2.2.1.1.1.3.  GSS Server Principal


   Regardless of what security mechanism under RPCSEC_GSS is being used,
   the NFS server MUST identify itself in GSS-API via a
   GSS_C_NT_HOSTBASED_SERVICE name type.  GSS_C_NT_HOSTBASED_SERVICE
   names are of the form:

        service@hostname

   For NFS, the "service" element is

        nfs

   Implementations of security mechanisms will convert nfs@hostname to
   various different forms.  For Kerberos V5, the following form is
   RECOMMENDED:

        nfs/hostname
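
   As a non-normative illustration, an NFSv4.1 client built on a
   standard GSS-API library might construct the server's principal name
   as follows.  This is only a sketch: the helper shown and its error
   handling are illustrative, and the hostname would in practice be
   derived from the mount or trunking configuration.

   #include <string.h>
   #include <stdio.h>
   #include <gssapi/gssapi.h>

   /*
    * Sketch: import the NFS server's GSS principal using the
    * GSS_C_NT_HOSTBASED_SERVICE name type, which takes names of
    * the form service@hostname.  For NFS, the service is "nfs".
    */
   static int
   import_nfs_server_name(const char *hostname, gss_name_t *server_name)
   {
           OM_uint32 major, minor;
           gss_buffer_desc name_buf;
           char name_str[512];

           snprintf(name_str, sizeof(name_str), "nfs@%s", hostname);
           name_buf.value = name_str;
           name_buf.length = strlen(name_str);

           major = gss_import_name(&minor, &name_buf,
                                   GSS_C_NT_HOSTBASED_SERVICE,
                                   server_name);
           return (major == GSS_S_COMPLETE) ? 0 : -1;
   }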

2.3.  COMPOUND and CB_COMPOUND



   A significant departure from the versions of the NFS protocol before
   NFSv4 is the introduction of the COMPOUND procedure.  For the NFSv4
   protocol, in all minor versions, there are exactly two RPC
   procedures, NULL and COMPOUND.  The COMPOUND procedure is defined as
   a series of individual operations and these operations perform the
   sorts of functions performed by traditional NFS procedures.

   The operations combined within a COMPOUND request are evaluated in
   order by the server, without any atomicity guarantees.  A limited set
   of facilities exist to pass results from one operation to another.
   Once an operation returns a failing result, the evaluation ends and
   the results of all evaluated operations are returned to the client.

   With the use of the COMPOUND procedure, the client is able to build
   simple or complex requests.  These COMPOUND requests allow for a
   reduction in the number of RPCs needed for logical file system
   operations.  For example, multi-component lookup requests can be
   constructed by combining multiple LOOKUP operations.  Those can be
   further combined with operations such as GETATTR, READDIR, or OPEN
   plus READ to do more complicated sets of operations without incurring
   additional latency.
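
   As a non-normative illustration, a client resolving the path "a/b/c"
   from the server root and fetching the object's filehandle and
   attributes might send a single COMPOUND whose operation sequence is:

      PUTROOTFH            /* current filehandle <- server root */
      LOOKUP "a"           /* current filehandle <- "a" */
      LOOKUP "b"           /* current filehandle <- "a/b" */
      LOOKUP "c"           /* current filehandle <- "a/b/c" */
      GETFH                /* return the filehandle for "a/b/c" */
      GETATTR attr_request /* return selected attributes */

   The notation above is informal; an actual request encodes these
   operations in the COMPOUND4args structure (see Section 16.2), and
   the pathname "a/b/c" is purely illustrative.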

   NFSv4.1 also contains a considerable set of callback operations in
   which the server makes an RPC directed at the client.  Callback RPCs
   have a similar structure to that of the normal server requests.  In
   all minor versions of the NFSv4 protocol, there are two callback RPC
   procedures: CB_NULL and CB_COMPOUND.  The CB_COMPOUND procedure is
   defined in an analogous fashion to that of COMPOUND with its own set
   of callback operations.

   The addition of new server and callback operations within the
   COMPOUND and CB_COMPOUND request framework provides a means of
   extending the protocol in subsequent minor versions.

   Except for a small number of operations needed for session creation,
   server requests and callback requests are performed within the
   context of a session.  Sessions provide a client context for every
   request and support robust replay protection for non-idempotent
   requests.

2.4.  Client Identifiers and Client Owners



   For each operation that obtains or depends on locking state, the
   specific client needs to be identifiable by the server.

   Each distinct client instance is represented by a client ID.  A
   client ID is a 64-bit identifier representing a specific client at a
   given time.  The client ID is changed whenever the client re-
   initializes, and may change when the server re-initializes.  Client
   IDs are used to support lock identification and crash recovery.

   During steady state operation, the client ID associated with each
   operation is derived from the session (see Section 2.10) on which the
   operation is sent.  A session is associated with a client ID when the
   session is created.

   Unlike NFSv4.0, the only NFSv4.1 operations possible before a client
   ID is established are those needed to establish the client ID.

   A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION
   operation using that client ID (eir_clientid as returned from
   EXCHANGE_ID) is required to establish and confirm the client ID on
   the server.  Establishment of identification by a new incarnation of
   the client also has the effect of immediately releasing any locking
   state that a previous incarnation of that same client might have had
   on the server.  Such released state would include all byte-range
   lock, share reservation, layout state, and -- where the server
   supports neither the CLAIM_DELEGATE_PREV nor CLAIM_DELEG_CUR_FH claim
   types -- all delegation state associated with the same client with
   the same identity.  For discussion of delegation state recovery, see
   Section 10.2.1.  For discussion of layout state recovery, see
   Section 12.7.1.

   Releasing such state requires that the server be able to determine
   that one client instance is the successor of another.  Where this
   cannot be done, for any of a number of reasons, the locking state
   will remain for a time subject to lease expiration (see Section 8.3)
   and the new client will need to wait for such state to be removed, if
   it makes conflicting lock requests.

   Client identification is encapsulated in the following client owner
   data type:

   struct client_owner4 {
           verifier4       co_verifier;
           opaque          co_ownerid<NFS4_OPAQUE_LIMIT>;
   };

   The first field, co_verifier, is a client incarnation verifier,
   allowing the server to distinguish successive incarnations (e.g.,
   reboots) of the same client.  The server will start the process of
   canceling the client's leased state if co_verifier is different than
   what the server has previously recorded for the identified client (as
   specified in the co_ownerid field).

   The second field, co_ownerid, is a variable length string that
   uniquely defines the client so that subsequent instances of the same
   client bear the same co_ownerid with a different verifier.

   There are several considerations for how the client generates the
   co_ownerid string:

   *  The string should be unique so that multiple clients do not
      present the same string.  The consequences of two clients
      presenting the same string range from one client getting an error
      to one client having its leased state abruptly and unexpectedly
      cancelled.

   *  The string should be selected so that subsequent incarnations
      (e.g., restarts) of the same client cause the client to present
      the same string.  The implementor is cautioned against an approach
      that requires the string to be recorded in a local file because
      this precludes the use of the implementation in an environment
      where there is no local disk and all file access is from an
      NFSv4.1 server.

   *  The string should be the same for each server network address that
      the client accesses.  This way, if a server has multiple
      interfaces, the client can trunk traffic over multiple network
      paths as described in Section 2.10.5.  (Note: the precise opposite
      was advised in the NFSv4.0 specification [37].)

   *  The algorithm for generating the string should not assume that the
      client's network address will not change, unless the client
      implementation knows it is using statically assigned network
      addresses.  This includes changes between client incarnations and
      even changes while the client is still running in its current
      incarnation.  Thus, with dynamic address assignment, if the client
      includes just the client's network address in the co_ownerid
      string, there is a real risk that after the client gives up the
      network address, another client, using a similar algorithm for
      generating the co_ownerid string, would generate a conflicting
      co_ownerid string.

   Given the above considerations, an example of a well-generated
   co_ownerid string is one that includes the following (see the sketch
   after this list):

   *  If applicable, the client's statically assigned network address.

   *  Additional information that tends to be unique, such as one or
      more of:

      -  The client machine's serial number (for privacy reasons, it is
         best to perform some one-way function on the serial number).

      -  A Media Access Control (MAC) address (again, a one-way function
         should be performed).

      -  The timestamp of when the NFSv4.1 software was first installed
         on the client (though this is subject to the previously
         mentioned caution about using information that is stored in a
         file, because the file might only be accessible over NFSv4.1).

      -  A true random number.  However, since this number ought to be
         the same between client incarnations, this shares the same
         problem as that of using the timestamp of the software
         installation.

   *  For a user-level NFSv4.1 client, it should contain additional
      information to distinguish the client from other user-level
      clients running on the same host, such as a process identifier or
      other unique sequence.
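
   The following non-normative C sketch illustrates one way to combine
   such elements into a co_ownerid string.  The helpers
   get_install_uuid() and mac_digest_hex() are hypothetical stand-ins
   for client routines returning an installation-time identifier and a
   one-way digest of a MAC address; the "example.client.nfs" prefix is
   likewise illustrative.

   #include <stdio.h>

   extern const char *get_install_uuid(void); /* hypothetical helper */
   extern const char *mac_digest_hex(void);   /* hypothetical helper */

   /*
    * Sketch: build a co_ownerid from stable, host-unique inputs.
    * The result deliberately omits any dynamically assigned network
    * address, so the same string can be presented to every server
    * network address, permitting trunking (Section 2.10.5).
    */
   static int
   make_co_ownerid(char *buf, size_t buflen)
   {
           return snprintf(buf, buflen, "example.client.nfs/%s/%s",
                           get_install_uuid(), mac_digest_hex());
   }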

   The client ID is assigned by the server (the eir_clientid result from
   EXCHANGE_ID) and should be chosen so that it will not conflict with a
   client ID previously assigned by the server.  This applies across
   server restarts.

   In the event of a server restart, a client may find out that its
   current client ID is no longer valid when it receives an
   NFS4ERR_STALE_CLIENTID error.  The precise circumstances depend on
   the characteristics of the sessions involved, specifically whether
   the session is persistent (see Section 2.10.6.5), but in each case
   the client will receive this error when it attempts to establish a
   new session with the existing client ID; the error indicates that a
   new client ID needs to be obtained via EXCHANGE_ID and a new session
   established with that client ID.

   When a session is not persistent, the client will find out that it
   needs to create a new session as a result of getting an
   NFS4ERR_BADSESSION, since the session in question was lost as part of
   a server restart.  When the existing client ID is presented to a
   server as part of creating a session and that client ID is not
   recognized, as would happen after a server restart, the server will
   reject the request with the error NFS4ERR_STALE_CLIENTID.

   In the case of the session being persistent, the client will re-
   establish communication using the existing session after the restart.
   This session will be associated with the existing client ID but may
   only be used to retransmit operations that the client previously
   transmitted and did not see replies to.  Replies to operations that
   the server previously performed will come from the reply cache;
   otherwise, NFS4ERR_DEADSESSION will be returned.  Hence, such a
   session is referred to as "dead".  In this situation, in order to
   perform new operations, the client needs to establish a new session.
   If an attempt is made to establish this new session with the existing
   client ID, the server will reject the request with
   NFS4ERR_STALE_CLIENTID.

   When NFS4ERR_STALE_CLIENTID is received in either of these
   situations, the client needs to obtain a new client ID by use of the
   EXCHANGE_ID operation, then use that client ID as the basis of a new
   session, and then proceed to any other necessary recovery for the
   server restart case (see Section 8.4.2).

   See the descriptions of EXCHANGE_ID (Section 18.35) and
   CREATE_SESSION (Section 18.36) for a complete specification of these
   operations.
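
   The recovery sequence can be summarized in a short sketch.  The
   functions below are hypothetical stand-ins for a client's
   EXCHANGE_ID and CREATE_SESSION calls; only the control flow is of
   interest.

   #include <stdio.h>
   #include <stdint.h>

   enum nfsstat { NFS4_OK, NFS4ERR_STALE_CLIENTID };

   /* Hypothetical stubs; client ID 1 is the pre-restart ID, which
    * the restarted server no longer recognizes. */
   static enum nfsstat create_session(uint64_t clientid)
   {
       return clientid == 1 ? NFS4ERR_STALE_CLIENTID : NFS4_OK;
   }

   static uint64_t exchange_id(void)
   {
       return 2;   /* server assigns a fresh eir_clientid */
   }

   int main(void)
   {
       uint64_t clientid = 1;  /* client ID held before the restart */

       /* The old session is unusable (NFS4ERR_BADSESSION, or a dead
        * persistent session), so try to create a new session with
        * the existing client ID. */
       if (create_session(clientid) == NFS4ERR_STALE_CLIENTID) {
           /* Server restarted: obtain a new client ID, establish
            * the session on it, and then proceed to lock reclaim
            * (Section 8.4.2). */
           clientid = exchange_id();
           create_session(clientid);
       }
       printf("using client ID %llu\n",
              (unsigned long long)clientid);
       return 0;
   }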

2.4.1.  Upgrade from NFSv4.0 to NFSv4.1



   To facilitate upgrade from NFSv4.0 to NFSv4.1, a server may compare a
   value of data type client_owner4 in an EXCHANGE_ID with a value of
   data type nfs_client_id4 that was established using the SETCLIENTID
   operation of NFSv4.0.  A server that does so will allow an upgraded
   client to avoid waiting until the lease (i.e., the lease established
   by the NFSv4.0 client instance) expires.  This requires that the
   value of data type client_owner4 be constructed the same way as the
   value of data type nfs_client_id4.  If the latter's contents included
   the server's network address (per the recommendations of the NFSv4.0
   specification [37]), and the NFSv4.1 client does not wish to use a
   client ID that prevents trunking, it should send two EXCHANGE_ID
   operations.  The first EXCHANGE_ID will have a client_owner4 equal to
   the nfs_client_id4.  This will clear the state created by the NFSv4.0
   client.  The second EXCHANGE_ID will not have the server's network
   address.  The state created for the second EXCHANGE_ID will not have
   to wait for lease expiration, because there will be no state to
   expire.
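
   A sketch of the two-step sequence follows; the helper is a
   hypothetical stand-in that only shows which co_ownerid string each
   EXCHANGE_ID would carry.

   #include <stdio.h>

   /* Hypothetical stub: sends an EXCHANGE_ID carrying the given
    * co_ownerid string. */
   static void exchange_id(const char *co_ownerid)
   {
       printf("EXCHANGE_ID co_ownerid=\"%s\"\n", co_ownerid);
   }

   int main(void)
   {
       /* First call: byte-for-byte identical to the old
        * nfs_client_id4, which (per the NFSv4.0 advice) embedded
        * the server's network address.  This clears the NFSv4.0
        * state without waiting for lease expiration. */
       exchange_id("192.0.2.1 client.example.com");

       /* Second call: no server network address, so the resulting
        * client ID does not preclude trunking. */
       exchange_id("client.example.com");
       return 0;
   }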

2.4.2.  Server Release of Client ID



   NFSv4.1 introduces a new operation called DESTROY_CLIENTID
   (Section 18.50), which the client SHOULD use to destroy a client ID
   it no longer needs.  This permits graceful, bilateral release of a
   client ID.  The operation cannot be used if there are sessions
   associated with the client ID, or state with an unexpired lease.

   If the server determines that the client holds no associated state
   for its client ID (associated state includes unrevoked sessions,
   opens, locks, delegations, layouts, and wants), the server MAY choose
   to unilaterally release the client ID in order to conserve resources.
   If the client contacts the server after this release, the server MUST
   ensure that the client receives the appropriate error so that it will
   use the EXCHANGE_ID/CREATE_SESSION sequence to establish a new client
   ID.  The server ought to be very hesitant to release a client ID
   since the resulting work on the client to recover from such an event
   will be the same burden as if the server had failed and restarted.
   Typically, a server would not release a client ID unless there had
   been no activity from that client for many minutes.  As long as there
   are sessions, opens, locks, delegations, layouts, or wants, the
   server MUST NOT release the client ID.  See Section 2.10.13.1.4 for
   discussion on releasing inactive sessions.

2.4.3.  Resolving Client Owner Conflicts



   When the server gets an EXCHANGE_ID for a client owner that currently
   has no state, or that has state but the lease has expired, the server
   MUST allow the EXCHANGE_ID and confirm the new client ID if followed
   by the appropriate CREATE_SESSION.

   When the server gets an EXCHANGE_ID for a new incarnation of a client
   owner that currently has an old incarnation with state and an
   unexpired lease, the server is allowed to dispose of the state of the
   previous incarnation of the client owner if one of the following is
   true:

   *  The principal that created the client ID for the client owner is
      the same as the principal that is sending the EXCHANGE_ID
      operation.  Note that if the client ID was created with
      SP4_MACH_CRED state protection (Section 18.35), the principal MUST
      be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used
      MUST be integrity or privacy, and the same GSS mechanism and
      principal MUST be used as that used when the client ID was
      created.

   *  The client ID was established with SP4_SSV protection
      (Section 18.35, Section 2.10.8.3) and the client sends the
      EXCHANGE_ID with the security flavor set to RPCSEC_GSS using the
      GSS SSV mechanism (Section 2.10.9).

   *  The client ID was established with SP4_SSV protection, and under
      the conditions described herein, the EXCHANGE_ID was sent with
      SP4_MACH_CRED state protection.  Because the SSV might not persist
      across client and server restart, and because the first time a
      client sends EXCHANGE_ID to a server it does not have an SSV, the
      client MAY send the subsequent EXCHANGE_ID without an SSV
      RPCSEC_GSS handle.  Instead, as with SP4_MACH_CRED protection, the
      principal MUST be based on RPCSEC_GSS authentication, the
      RPCSEC_GSS service used MUST be integrity or privacy, and the same
      GSS mechanism and principal MUST be used as that used when the
      client ID was created.

   If none of the above situations apply, the server MUST return
   NFS4ERR_CLID_INUSE.

   If the server accepts the principal and co_ownerid as matching that
   which created the client ID, and the co_verifier in the EXCHANGE_ID
   differs from the co_verifier used when the client ID was created,
   then after the server receives a CREATE_SESSION that confirms the
   client ID, the server deletes state.  If the co_verifier values are
   the same (e.g., the client either is updating properties of the
   client ID (Section 18.35) or is attempting trunking
   (Section 2.10.5)), the server MUST NOT delete state.
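
   In outline, the server-side check reads as follows; the structure
   is a hypothetical reduction of the relevant EXCHANGE_ID arguments.

   #include <stdio.h>
   #include <string.h>
   #include <stdbool.h>

   struct eia {                   /* hypothetical reduction */
       char co_ownerid[64];
       unsigned char co_verifier[8];
   };

   /* Decide, once a CREATE_SESSION has confirmed the client ID,
    * whether the old incarnation's state is to be deleted. */
   static bool delete_old_state(const struct eia *prev,
                                const struct eia *cur)
   {
       /* Same owner, new verifier: a new client incarnation, so
        * the previous incarnation's state goes away.  Same
        * verifier: a properties update or a trunking attempt, so
        * state MUST be kept. */
       return memcmp(prev->co_verifier, cur->co_verifier,
                     sizeof(prev->co_verifier)) != 0;
   }

   int main(void)
   {
       struct eia prev = { "client.example.com", "AAAAAAA" };
       struct eia cur  = { "client.example.com", "BBBBBBB" };
       printf("delete state: %s\n",
              delete_old_state(&prev, &cur) ? "yes" : "no");
       return 0;
   }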

2.5.  Server Owners



   The server owner is similar to a client owner (Section 2.4), but
   unlike the client owner, there is no shorthand server ID.  The server
   owner is defined in the following data type:

   struct server_owner4 {
       uint64_t       so_minor_id;
       opaque         so_major_id<NFS4_OPAQUE_LIMIT>;
   };

   The server owner is returned from EXCHANGE_ID.  When the so_major_id
   fields are the same in two EXCHANGE_ID results, the connections that
   each EXCHANGE_ID was sent over can be assumed to address the same
   server (as defined in Section 1.7).  If the so_minor_id fields are
   also the same, then not only do both connections connect to the same
   server, but the session can be shared across both connections.  The
   reader is cautioned that multiple servers may deliberately or
   accidentally claim to have the same so_major_id or so_major_id/
   so_minor_id; the reader should examine Sections 2.10.5 and 18.35 in
   order to avoid acting on falsely matching server owner values.
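
   A sketch of the comparison a client might perform on two
   EXCHANGE_ID results (still subject to the verification cautioned
   above and in Section 2.10.5.1) follows; the structure mirrors
   server_owner4, with server scope checks omitted for brevity.

   #include <stdio.h>
   #include <string.h>
   #include <stdint.h>
   #include <stdbool.h>

   struct server_owner {          /* mirrors server_owner4 */
       uint64_t so_minor_id;
       char     so_major_id[64];
   };

   static bool same_server(const struct server_owner *a,
                           const struct server_owner *b)
   {
       return strcmp(a->so_major_id, b->so_major_id) == 0;
   }

   static bool session_shareable(const struct server_owner *a,
                                 const struct server_owner *b)
   {
       return same_server(a, b) &&
              a->so_minor_id == b->so_minor_id;
   }

   int main(void)
   {
       struct server_owner x = { 1, "cluster.example.com" };
       struct server_owner y = { 2, "cluster.example.com" };
       printf("same server: %d, session shareable: %d\n",
              same_server(&x, &y), session_shareable(&x, &y));
       return 0;
   }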

   The considerations for generating an so_major_id are similar to
   those for generating a co_ownerid string (see Section 2.4).  The
   consequences of two servers generating conflicting so_major_id values
   are less dire than they are for co_ownerid conflicts because the
   client can use RPCSEC_GSS to compare the authenticity of each server
   (see Section 2.10.5).

2.6.  Security Service Negotiation



   With the NFSv4.1 server potentially offering multiple security
   mechanisms, the client needs a method to determine or negotiate which
   mechanism is to be used for its communication with the server.  The
   NFS server may have multiple points within its file system namespace
   that are available for use by NFS clients.  These points can be
   considered security policy boundaries, and, in some NFS
   implementations, are tied to NFS export points.  In turn, the NFS
   server may be configured such that each of these security policy
   boundaries may have different or multiple security mechanisms in use.

   The security negotiation between client and server SHOULD be done
   with a secure channel to eliminate the possibility of a third party
   intercepting the negotiation sequence and forcing the client and
   server to choose a lower level of security than required or desired.
   See Section 21 for further discussion.

2.6.1.  NFSv4.1 Security Tuples



   An NFS server can assign one or more "security tuples" to each
   security policy boundary in its namespace.  Each security tuple
   consists of a security flavor (see Section 2.2.1.1) and, if the
   flavor is RPCSEC_GSS, a GSS-API mechanism Object Identifier (OID), a
   GSS-API quality of protection, and an RPCSEC_GSS service.

2.6.2.  SECINFO and SECINFO_NO_NAME



   The SECINFO and SECINFO_NO_NAME operations allow the client to
   determine, on a per-filehandle basis, what security tuple is to be
   used for server access.  In general, the client will not have to use
   either operation except during initial communication with the server
   or when the client crosses security policy boundaries at the server.
   However, the server's policies may also change at any time and force
   the client to negotiate a new security tuple.

   Where the use of different security tuples would affect the type of
   access that would be allowed if a request was sent over the same
   connection used for the SECINFO or SECINFO_NO_NAME operation (e.g.,
   read-only vs. read-write access), security tuples that allow greater
   access should be presented first.  Where the general level of access
   is the same and different security flavors limit the range of
   principals whose privileges are recognized (e.g., allowing or
   disallowing root access), flavors supporting the greatest range of
   principals should be listed first.
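
   The ordering guidance amounts to a sort, sketched below with a
   hypothetical tuple representation that records an access level and
   the breadth of principals a flavor recognizes.

   #include <stdio.h>
   #include <stdlib.h>

   struct sec_tuple {             /* hypothetical representation */
       const char *name;
       int access;     /* e.g., 2 = read-write, 1 = read-only   */
       int breadth;    /* larger = more principals recognized   */
   };

   /* Greater access first; among tuples with equal access, the
    * flavor recognizing the widest range of principals first. */
   static int cmp(const void *pa, const void *pb)
   {
       const struct sec_tuple *a = pa, *b = pb;
       if (a->access != b->access)
           return b->access - a->access;
       return b->breadth - a->breadth;
   }

   int main(void)
   {
       struct sec_tuple list[] = {
           { "krb5  (read-only)",        1, 2 },
           { "krb5p (read-write)",       2, 1 },
           { "krb5i (read-write, root)", 2, 2 },
       };
       qsort(list, 3, sizeof(list[0]), cmp);
       for (int i = 0; i < 3; i++)
           printf("%d: %s\n", i + 1, list[i].name);
       return 0;
   }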

2.6.3.  Security Error



   Based on the assumption that each NFSv4.1 client and server MUST
   support a minimum set of security (i.e., Kerberos V5 under
   RPCSEC_GSS), the NFS client will initiate file access to the server
   with one of the minimal security tuples.  During communication with
   the server, the client may receive an NFS error of NFS4ERR_WRONGSEC.
   This error allows the server to notify the client that the security
   tuple currently being used contravenes the server's security policy.
   The client is then responsible for determining (see Section 2.6.3.1)
   what security tuples are available at the server and choosing one
   that is appropriate for the client.

2.6.3.1.  Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME



   This section explains the mechanics of NFSv4.1 security negotiation.

2.6.3.1.1.  Put Filehandle Operations


   The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH,
   PUTFH, and RESTOREFH.  Each of the subsections herein describes how
   the server handles a subseries of operations that starts with a put
   filehandle operation.

2.6.3.1.1.1.  Put Filehandle Operation + SAVEFH


   The client is saving a filehandle for a future RESTOREFH, LINK, or
   RENAME.  SAVEFH MUST NOT return NFS4ERR_WRONGSEC.  To determine
   whether or not the put filehandle operation returns NFS4ERR_WRONGSEC,
   the server implementation pretends SAVEFH is not in the series of
   operations and examines which of the situations described in the
   other subsections of Section 2.6.3.1.1 apply.

2.6.3.1.1.2.  Two or More Put Filehandle Operations


   For a series of N put filehandle operations, the server MUST NOT
   return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations.
   The Nth put filehandle operation is handled as if it is the first in
   a subseries of operations.  For example, if the server received a
   COMPOUND request with this series of operations -- PUTFH, PUTROOTFH,
   LOOKUP -- then the PUTFH operation is ignored for NFS4ERR_WRONGSEC
   purposes, and the PUTROOTFH, LOOKUP subseries is processed
   according to Section 2.6.3.1.1.3.
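
   In effect, only the last put filehandle operation in such a run is
   evaluated; a sketch of the scan, with a hypothetical operation
   encoding:

   #include <stdio.h>
   #include <stdbool.h>

   enum op { OP_PUTFH, OP_PUTROOTFH, OP_PUTPUBFH, OP_RESTOREFH,
             OP_LOOKUP, OP_READ };

   static bool is_putfh_op(enum op o)
   {
       return o == OP_PUTFH || o == OP_PUTROOTFH ||
              o == OP_PUTPUBFH || o == OP_RESTOREFH;
   }

   int main(void)
   {
       /* PUTFH, PUTROOTFH, LOOKUP: the leading PUTFH is skipped
        * for NFS4ERR_WRONGSEC purposes. */
       enum op ops[] = { OP_PUTFH, OP_PUTROOTFH, OP_LOOKUP };
       int n = 3, i = 0;

       /* Advance to the Nth (last) put filehandle operation; the
        * subseries beginning there is what gets evaluated. */
       while (i + 1 < n && is_putfh_op(ops[i + 1]))
           i++;
       printf("evaluate subseries starting at op index %d\n", i);
       return 0;
   }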

2.6.3.1.1.3.  Put Filehandle Operation + LOOKUP (or OPEN of an Existing
              Name)


   This situation also applies to a put filehandle operation followed by
   a LOOKUP or an OPEN operation that specifies an existing component
   name.

   In this situation, the client is potentially crossing a security
   policy boundary, and the set of security tuples the parent directory
   supports may differ from those of the child.  The server
   implementation may decide whether to impose any restrictions on
   security policy administration.  There are at least three approaches
   (sec_policy_child is the tuple set of the child export,
   sec_policy_parent is that of the parent).

   (a)  sec_policy_child <= sec_policy_parent (<= for subset).  This
        means that the set of security tuples specified on the security
        policy of a child directory is always a subset of its parent
        directory.

   (b)  sec_policy_child ^ sec_policy_parent != {} (^ for intersection,
        {} for the empty set).  This means that the set of security
        tuples specified on the security policy of a child directory
        always has a non-empty intersection with that of the parent.

   (c)  sec_policy_child ^ sec_policy_parent == {}.  This means that the
        set of security tuples specified on the security policy of a
        child directory may not intersect with that of the parent.  In
        other words, there are no restrictions on how the system
        administrator may set up these tuples.

   In order for a server to support approaches (b) (for the case when a
   client chooses a flavor that is not a member of sec_policy_parent)
   and (c), the put filehandle operation cannot return NFS4ERR_WRONGSEC
   when there is a security tuple mismatch.  Instead, it should be
   returned from the LOOKUP (or OPEN by existing component name) that
   follows.

   Since the above guideline does not contradict approach (a), it should
   be followed in general.  Even if approach (a) is implemented, it is
   possible for the security tuple used to be acceptable for the target
   of LOOKUP but not for the filehandles used in the put filehandle
   operation.  The put filehandle operation could be a PUTROOTFH or
   PUTPUBFH, where the client cannot know the security tuples for the
   root or public filehandle.  Or the security policy for the filehandle
   used by the put filehandle operation could have changed since the
   time the filehandle was obtained.

   Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in
   response to the put filehandle operation if the operation is
   immediately followed by a LOOKUP or an OPEN by component name.
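
   The approaches above reduce to set relations on the exports' tuple
   lists.  A minimal sketch, with each security tuple represented as
   one bit of a mask:

   #include <stdio.h>
   #include <stdbool.h>

   static bool is_subset(unsigned child, unsigned parent)
   {
       return (child & ~parent) == 0;     /* approach (a) */
   }

   static bool intersects(unsigned child, unsigned parent)
   {
       return (child & parent) != 0;      /* approach (b) */
   }

   int main(void)
   {
       unsigned parent = 0x3;   /* tuples {0, 1} */
       unsigned child  = 0x6;   /* tuples {1, 2} */

       printf("(a) subset:     %d\n", is_subset(child, parent));
       printf("(b) intersects: %d\n", intersects(child, parent));
       printf("(c) disjoint:   %d\n", !intersects(child, parent));
       return 0;
   }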

2.6.3.1.1.4.  Put Filehandle Operation + LOOKUPP


   Since SECINFO only works its way down, there is no way LOOKUPP can
   return NFS4ERR_WRONGSEC without SECINFO_NO_NAME.  SECINFO_NO_NAME
   solves this issue via style SECINFO_STYLE4_PARENT, which works in
   the opposite direction from SECINFO.  As with Section 2.6.3.1.1.3, a
   put
   filehandle operation that is followed by a LOOKUPP MUST NOT return
   NFS4ERR_WRONGSEC.  If the server does not support SECINFO_NO_NAME,
   the client's only recourse is to send the put filehandle operation,
   LOOKUPP, GETFH sequence of operations with every security tuple it
   supports.

   Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server
   MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle
   operation if the operation is immediately followed by a LOOKUPP.

2.6.3.1.1.5.  Put Filehandle Operation + SECINFO/SECINFO_NO_NAME


   A security-sensitive client is allowed to choose a strong security
   tuple when querying a server to determine a file object's permitted
   security tuples.  The security tuple chosen by the client does not
   have to be included in the tuple list of the security policy of
   either the parent directory indicated in the put filehandle operation
   or the child file object indicated in SECINFO (or any parent
   directory indicated in SECINFO_NO_NAME).  Of course, the server has
   to be configured for whatever security tuple the client selects;
   otherwise, the request will fail at the RPC layer with an appropriate
   authentication error.

   In theory, there is no connection between the security flavor used by
   SECINFO or SECINFO_NO_NAME and those supported by the security
   policy.  But in practice, the client may start looking for strong
   flavors from those supported by the security policy, followed by
   those in the REQUIRED set.

   The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put
   filehandle operation that is immediately followed by SECINFO or
   SECINFO_NO_NAME.  The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC
   from SECINFO or SECINFO_NO_NAME.

2.6.3.1.1.6.  Put Filehandle Operation + Nothing


   The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC.

2.6.3.1.1.7.  Put Filehandle Operation + Anything Else


   "Anything Else" includes OPEN by filehandle.

   The security policy enforcement applies to the filehandle specified
   in the put filehandle operation.  Therefore, the put filehandle
   operation MUST return NFS4ERR_WRONGSEC when there is a security tuple
   mismatch.  This avoids the complexity of adding NFS4ERR_WRONGSEC as
   an allowable error to every other operation.

   A COMPOUND containing the series put filehandle operation +
   SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way
   for the client to recover from NFS4ERR_WRONGSEC.

   The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation
   other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by
   component name).

2.6.3.1.1.8.  Operations after SECINFO and SECINFO_NO_NAME


   Suppose a client sends a COMPOUND procedure containing the series
   SEQUENCE, PUTFH, SECINFO_NO_NAME, READ, and suppose the security
   tuple used does not match that required for the target file.  By
   rule (see
   Section 2.6.3.1.1.5), neither PUTFH nor SECINFO_NO_NAME can return
   NFS4ERR_WRONGSEC.  By rule (see Section 2.6.3.1.1.7), READ cannot
   return NFS4ERR_WRONGSEC.  The issue is resolved by the fact that
   SECINFO and SECINFO_NO_NAME consume the current filehandle (note that
   this is a change from NFSv4.0).  This leaves no current filehandle
   for READ to use, and READ returns NFS4ERR_NOFILEHANDLE.
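
   A sketch of why the READ fails: the hypothetical evaluator below
   models only the current-filehandle slot, which SECINFO_NO_NAME
   consumes.

   #include <stdio.h>
   #include <stdbool.h>

   static bool have_current_fh;

   static void op_putfh(void)
   {
       have_current_fh = true;
   }

   static void op_secinfo_no_name(void)
   {
       /* NFSv4.1 change: SECINFO and SECINFO_NO_NAME consume the
        * current filehandle. */
       have_current_fh = false;
   }

   static const char *op_read(void)
   {
       return have_current_fh ? "NFS4_OK" : "NFS4ERR_NOFILEHANDLE";
   }

   int main(void)
   {
       op_putfh();
       op_secinfo_no_name();
       printf("READ: %s\n", op_read());
       return 0;
   }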

2.6.3.1.2.  LINK and RENAME


   The LINK and RENAME operations use both the current and saved
   filehandles.  Technically, the server MAY return NFS4ERR_WRONGSEC
   from LINK or RENAME if the security policy of the saved filehandle
   rejects the security flavor used in the COMPOUND request's
   credentials.  If the server does so and there is no intersection
   between the security policies of the saved and current filehandles,
   it will be impossible for the client to perform the intended LINK
   or RENAME operation.

   For example, suppose the client sends this COMPOUND request:
   SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", where
   filehandles bFH and aFH refer to different directories.  Suppose no
   common security tuple exists between the security policies of aFH and
   bFH.  If the client sends the request using credentials acceptable to
   bFH's security policy but not aFH's policy, then the PUTFH aFH
   operation will fail with NFS4ERR_WRONGSEC.  After a SECINFO_NO_NAME
   request, the client sends SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH,
   RENAME "c" "d", using credentials acceptable to aFH's security policy
   but not bFH's policy.  The server returns NFS4ERR_WRONGSEC on the
   RENAME operation.

   To prevent a client from being trapped in an endless sequence of a
   request containing LINK or RENAME, followed by a request containing
   SECINFO_NO_NAME or SECINFO, the server MUST detect when the security
   policies of the current and saved filehandles have no mutually
   acceptable security tuple, and MUST NOT return NFS4ERR_WRONGSEC from
   LINK or RENAME in that situation.  Instead, the server MUST do one
   of two things (see the sketch after this list):

   *  The server can return NFS4ERR_XDEV.

   *  The server can allow the security policy of the current filehandle
      to override that of the saved filehandle, and so return NFS4_OK.
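
   Sketched below, with tuple sets again as bitmasks, is the check
   made before returning NFS4ERR_WRONGSEC from RENAME; the choice
   between the two permitted outcomes is shown as a hypothetical
   server policy flag.

   #include <stdio.h>
   #include <stdbool.h>

   enum nfsstat { NFS4_OK, NFS4ERR_WRONGSEC, NFS4ERR_XDEV };

   /* 'saved' and 'current' are the policies of the two filehandles;
    * 'used' is the tuple carried by the request's credentials. */
   static enum nfsstat check_rename(unsigned saved, unsigned current,
                                    unsigned used, bool prefer_xdev)
   {
       if ((saved & current) == 0)
           /* No mutually acceptable tuple: NFS4ERR_WRONGSEC is
            * forbidden; either fail with NFS4ERR_XDEV or let the
            * current filehandle's policy override. */
           return prefer_xdev ? NFS4ERR_XDEV : NFS4_OK;
       if ((saved & used) == 0)
           return NFS4ERR_WRONGSEC;   /* client can renegotiate */
       return NFS4_OK;
   }

   int main(void)
   {
       printf("status: %d\n", check_rename(0x1, 0x2, 0x2, true));
       return 0;
   }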

2.7.  Minor Versioning



   To address the requirement of an NFS protocol that can evolve as the
   need arises, the NFSv4.1 protocol contains the rules and framework to
   allow for future minor changes or versioning.

   The base assumption with respect to minor versioning is that any
   future accepted minor version will be documented in one or more
   Standards Track RFCs.  Minor version 0 of the NFSv4 protocol is
   represented by [37], and minor version 1 is represented by this RFC.
   The COMPOUND and CB_COMPOUND procedures support the encoding of the
   minor version being requested by the client.

   The following items represent the basic rules for the development of
   minor versions.  Note that a future minor version may modify or add
   to the following rules as part of the minor version definition.

   1.   Procedures are not added or deleted.

        To maintain the general RPC model, NFSv4 minor versions will not
        add to or delete procedures from the NFS program.

   2.   Minor versions may add operations to the COMPOUND and
        CB_COMPOUND procedures.

        The addition of operations to the COMPOUND and CB_COMPOUND
        procedures does not affect the RPC model.

        *  Minor versions may append attributes to the bitmap4 that
           represents sets of attributes and to the fattr4 that
           represents sets of attribute values.

           This allows for the expansion of the attribute model to allow
           for future growth or adaptation.

        *  Minor version X must append any new attributes after the last
           documented attribute.

           Since attribute results are specified as an opaque array of
           per-attribute, XDR-encoded results, the complexity of adding
           new attributes in the midst of the current definitions would
           be too burdensome.

   3.   Minor versions must not modify the structure of an existing
        operation's arguments or results.

        Again, the complexity of handling multiple structure definitions
        for a single operation is too burdensome.  New operations should
        be added instead of modifying existing structures for a minor
        version.

        This rule does not preclude the following adaptations in a minor
        version:

        *  adding bits to flag fields, such as new attributes to
           GETATTR's bitmap4 data type, and providing corresponding
           variants of opaque arrays, such as a notify4 used together
           with such bitmaps

        *  adding bits to existing attributes like ACLs that have flag
           words

        *  extending enumerated types (including NFS4ERR_*) with new
           values

        *  adding cases to a switched union

   4.   Minor versions must not modify the structure of existing
        attributes.

   5.   Minor versions must not delete operations.

        This prevents the potential reuse of a particular operation
        "slot" in a future minor version.

   6.   Minor versions must not delete attributes.

   7.   Minor versions must not delete flag bits or enumeration values.

   8.   Minor versions may declare an operation MUST NOT be implemented.

        Specifying that an operation MUST NOT be implemented is
        equivalent to obsoleting an operation.  For the client, it means
        that the operation MUST NOT be sent to the server.  For the
        server, an NFS error can be returned as opposed to "dropping"
        the request as an XDR decode error.  This approach allows for
        the obsolescence of an operation while maintaining its structure
        so that a future minor version can reintroduce the operation.

        1.  Minor versions may declare that an attribute MUST NOT be
            implemented.

        2.  Minor versions may declare that a flag bit or enumeration
            value MUST NOT be implemented.

   9.   Minor versions may downgrade features from REQUIRED to
        RECOMMENDED, or RECOMMENDED to OPTIONAL.

   10.  Minor versions may upgrade features from OPTIONAL to
        RECOMMENDED, or RECOMMENDED to REQUIRED.

   11.  A client and server that support minor version X SHOULD support
        minor versions zero through X-1 as well.

   12.  Except for infrastructural changes, a minor version must not
        introduce REQUIRED new features.

        This rule allows for the introduction of new functionality and
        forces the use of implementation experience before designating a
        feature as REQUIRED.  On the other hand, some classes of
        features are infrastructural and have broad effects.  Allowing
        infrastructural features to be RECOMMENDED or OPTIONAL
        complicates implementation of the minor version.

   13.  A client MUST NOT attempt to use a stateid, filehandle, or
        similar returned object from the COMPOUND procedure with minor
        version X for another COMPOUND procedure with minor version Y,
        where X != Y.

2.8.  Non-RPC-Based Security Services



   As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for
   identification, authentication, integrity, and privacy.  NFSv4.1
   itself provides or enables additional security services as described
   in the next several subsections.

2.8.1.  Authorization



   Authorization to access a file object via an NFSv4.1 operation is
   ultimately determined by the NFSv4.1 server.  A client can
   predetermine its access to a file object via the OPEN (Section 18.16)
   and the ACCESS (Section 18.1) operations.

   Principals with appropriate access rights can modify the
   authorization on a file object via the SETATTR (Section 18.30)
   operation.  Attributes that affect access rights include mode, owner,
   owner_group, acl, dacl, and sacl.  See Section 5.

2.8.2.  Auditing



   NFSv4.1 provides auditing on a per-file object basis, via the acl and
   sacl attributes as described in Section 6.  It is outside the scope
   of this specification to specify audit log formats or management
   policies.

2.8.3.  Intrusion Detection



   NFSv4.1 provides alarm control on a per-file object basis, via the
   acl and sacl attributes as described in Section 6.  Alarms may serve
   as the basis for intrusion detection.  It is outside the scope of
   this specification to specify heuristics for detecting intrusion via
   alarms.

2.9.  Transport Layers



2.9.1.  REQUIRED and RECOMMENDED Properties of Transports



   NFSv4.1 works over Remote Direct Memory Access (RDMA) and non-RDMA-
   based transports with the following attributes:

   *  The transport supports reliable delivery of data, which NFSv4.1
      requires but neither NFSv4.1 nor RPC has facilities for ensuring
      [41].

   *  The transport delivers data in the order it was sent.  Ordered
      delivery simplifies detection of transmit errors, and simplifies
      the sending of arbitrary sized requests and responses via the
      record marking protocol [3].

   Where an NFSv4.1 implementation supports operation over the IP
   network protocol, any transport used between NFS and IP MUST be among
   the IETF-approved congestion control transport protocols.  At the
   time this document was written, the only two transports that had the
   above attributes were TCP and the Stream Control Transmission
   Protocol (SCTP).  To enhance the possibilities for interoperability,
   an NFSv4.1 implementation MUST support operation over the TCP
   transport protocol.

   Even if NFSv4.1 is used over a non-IP network protocol, it is
   RECOMMENDED that the transport support congestion control.

   It is permissible for a connectionless transport to be used under
   NFSv4.1; however, reliable and in-order delivery of data combined
   with congestion control by the connectionless transport is REQUIRED.
   As a consequence, UDP by itself MUST NOT be used as an NFSv4.1
   transport.  NFSv4.1 assumes that a client transport address and
   server transport address used to send data over a transport together
   constitute a connection, even if the underlying transport eschews the
   concept of a connection.

2.9.2.  Client and Server Transport Behavior



   If a connection-oriented transport (e.g., TCP) is used, the client
   and server SHOULD use long-lived connections for at least three
   reasons:

   1.  This will prevent the weakening of the transport's congestion
       control mechanisms via short-lived connections.

   2.  This will improve performance for the WAN environment by
       eliminating the need for connection setup handshakes.

   3.  The NFSv4.1 callback model differs from NFSv4.0, and requires the
       client and server to maintain a client-created backchannel (see
       Section 2.10.3.1) for the server to use.

   In order to reduce congestion, if a connection-oriented transport is
   used, and the request is not the NULL procedure:

   *  A requester MUST NOT retry a request unless the connection the
      request was sent over was lost before the reply was received.

   *  A replier MUST NOT silently drop a request, even if the request is
      a retry.  (The silent drop behavior of RPCSEC_GSS [4] does not
      apply because this behavior happens at the RPCSEC_GSS layer, a
      lower layer in the request processing.)  Instead, the replier
      SHOULD return an appropriate error (see Section 2.10.6.1), or it
      MAY disconnect the connection.

   When sending a reply, the replier MUST send the reply to the same
   full network address (e.g., if using an IP-based transport, the
   source port of the requester is part of the full network address)
   from which the requester sent the request.  If using a connection-
   oriented transport, replies MUST be sent on the same connection from
   which the request was received.

   If a connection is dropped after the replier receives the request but
   before the replier sends the reply, the replier might have a pending
   reply.  If a connection is established with the same source and
   destination full network address as the dropped connection, then the
   replier MUST NOT send the reply until the requester retries the
   request.  The reason for this prohibition is that the requester MAY
   retry a request over a different connection (provided that connection
   is associated with the original request's session).

   When using RDMA transports, there are other reasons for not
   tolerating retries over the same connection:

   *  RDMA transports use "credits" to enforce flow control, where a
      credit is a right to a peer to transmit a message.  If one peer
      were to retransmit a request (or reply), it would consume an
      additional credit.  If the replier retransmitted a reply, it would
      certainly result in an RDMA connection loss, since the requester
      would typically only post a single receive buffer for each
      request.  If the requester retransmitted a request, the additional
      credit consumed on the server might lead to RDMA connection
      failure unless the client accounted for it and decreased its
      available credit, leading to wasted resources.

   *  RDMA credits present a new issue to the reply cache in NFSv4.1.
      The reply cache may be used when a connection within a session is
      lost, such as after the client reconnects.  Credit information is
      a dynamic property of the RDMA connection, and stale values must
      not be replayed from the cache.  This implies that the reply cache
      contents must not be blindly used when replies are sent from it,
      and credit information appropriate to the channel must be
      refreshed by the RPC layer.

   In addition, as described in Section 2.10.6.2, while a session is
   active, the NFSv4.1 requester MUST NOT stop waiting for a reply.

2.9.3.  Ports



   Historically, NFSv3 servers have listened over TCP port 2049.  The
   registered port 2049 [42] for the NFS protocol should be the default
   configuration.  NFSv4.1 clients SHOULD NOT use the RPC binding
   protocols as described in [43].

2.10.  Session



   NFSv4.1 clients and servers MUST support and MUST use the session
   feature as described in this section.

2.10.1.  Motivation and Overview



   Previous versions and minor versions of NFS have suffered from the
   following:

   *  Lack of support for Exactly Once Semantics (EOS).  This includes
      lack of support for EOS through server failure and recovery.

   *  Limited callback support, including no support for sending
      callbacks through firewalls, and races between replies to normal
      requests and callbacks.

   *  Limited trunking over multiple network paths.

   *  Requiring machine credentials for fully secure operation.

   Through the introduction of a session, NFSv4.1 addresses the above
   shortfalls with practical solutions:

   *  EOS is enabled by a reply cache with a bounded size, making it
      feasible to keep the cache in persistent storage and enable EOS
      through server failure and recovery.  One reason that previous
      revisions of NFS did not support EOS was because some EOS
      approaches often limited parallelism.  As will be explained in
      Section 2.10.6, NFSv4.1 supports both EOS and unlimited
      parallelism.

   *  The NFSv4.1 client (defined in Section 1.7) creates transport
      connections and provides them to the server to use for sending
      callback requests, thus solving the firewall issue
      (Section 18.34).  Races between responses from client requests and
      callbacks caused by the requests are detected via the session's
      sequencing properties that are a consequence of EOS
      (Section 2.10.6.3).

   *  The NFSv4.1 client can associate an arbitrary number of
      connections with the session, and thus provide trunking
      (Section 2.10.5).

   *  The NFSv4.1 client and server produce a session key independent of
      client and server machine credentials which can be used to compute
      a digest for protecting critical session management operations
      (Section 2.10.8.3).

   *  The NFSv4.1 client can also create secure RPCSEC_GSS contexts for
      use by the session's backchannel that do not require the server to
      authenticate to a client machine principal (Section 2.10.8.2).

   A session is a dynamically created, long-lived server object created
   by a client and used over time from one or more transport
   connections.  Its function is to maintain the server's state relative
   to the connection(s) belonging to a client instance.  This state is
   entirely independent of the connection itself, and indeed the state
   exists whether or not the connection exists.  A client may have one
   or more sessions associated with it so that client-associated state
   may be accessed using any of the sessions associated with that
   client's client ID, when connections are associated with those
   sessions.  When no connections are associated with any of a client
   ID's sessions for an extended time, such objects as locks, opens,
   delegations, layouts, etc. are subject to expiration.  The session
   serves as an object representing a means of access by a client to the
   associated client state on the server, independent of the physical
   means of access to that state.

   A single client may create multiple sessions.  A single session
   MUST NOT serve multiple clients.

2.10.2.  NFSv4 Integration



   Sessions are part of NFSv4.1 and not NFSv4.0.  Normally, a major
   infrastructure change such as sessions would require a new major
   version number to an Open Network Computing (ONC) RPC program like
   NFS.  However, because NFSv4 encapsulates its functionality in a
   single procedure, COMPOUND, and because COMPOUND can support an
   arbitrary number of operations, sessions have been added to NFSv4.1
   with little difficulty.  COMPOUND includes a minor version number
   field, and for NFSv4.1 this minor version is set to 1.  When the
   NFSv4 server processes a COMPOUND with the minor version set to 1, it
   expects a different set of operations than it does for NFSv4.0.
   NFSv4.1 defines the SEQUENCE operation, which is required for every
   COMPOUND that operates over an established session, with the
   exception of some session administration operations, such as
   DESTROY_SESSION (Section 18.37).

2.10.2.1.  SEQUENCE and CB_SEQUENCE



   In NFSv4.1, when the SEQUENCE operation is present, it MUST be the
   first operation in the COMPOUND procedure.  The primary purpose of
   SEQUENCE is to carry the session identifier.  The session identifier
   associates all other operations in the COMPOUND procedure with a
   particular session.  SEQUENCE also contains required information for
   maintaining EOS (see Section 2.10.6).  Session-enabled NFSv4.1
   COMPOUND requests thus have the form:

       +-----+--------------+-----------+------------+-----------+----
       | tag | minorversion | numops    |SEQUENCE op | op + args | ...
       |     |   (== 1)     | (limited) |  + args    |           |
       +-----+--------------+-----------+------------+-----------+----

   and the replies have the form:

       +------------+-----+--------+-------------------------------+--//
       |last status | tag | numres |status + SEQUENCE op + results |  //
       +------------+-----+--------+-------------------------------+--//
               //-----------------------+----
               // status + op + results | ...
               //-----------------------+----

   A CB_COMPOUND procedure request and reply have a similar form to
   COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE
   operation.  CB_COMPOUND also has an additional field called
   "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored
   by the client.  CB_SEQUENCE has the same information as SEQUENCE, and
   also includes other information needed to resolve callback races
   (Section 2.10.6.3).

2.10.2.2.  Client ID and Session Association



   Each client ID (Section 2.4) can have zero or more active sessions.
   A client ID and associated session are required to perform file
   access in NFSv4.1.  Each time a session is used (whether by a client
   sending a request to the server or the client replying to a callback
   request from the server), the state leased to its associated client
   ID is automatically renewed.

   State (which can consist of share reservations, locks, delegations,
   and layouts (Section 1.8.4)) is tied to the client ID.  Client state
   is not tied to any individual session.  Successive state-changing
   operations from a given state owner MAY go over different sessions,
   provided the session is associated with the same client ID.  A
   callback MAY arrive over a different session than that of the request
   that originally acquired the state pertaining to the callback.  For
   example, if session A is used to acquire a delegation, a request to
   recall the delegation MAY arrive over session B if both sessions are
   associated with the same client ID.  Sections 2.10.8.1 and 2.10.8.2
   discuss the security considerations around callbacks.

2.10.3.  Channels



   A channel is not a connection.  A channel represents the direction
   in which ONC RPC requests are sent.

   Each session has one or two channels: the fore channel and the
   backchannel.  Because there are at most two channels per session, and
   because each channel has a distinct purpose, channels are not
   assigned identifiers.

   The fore channel is used for ordinary requests from the client to the
   server, and carries COMPOUND requests and responses.  A session
   always has a fore channel.

   The backchannel is used for callback requests from server to client,
   and carries CB_COMPOUND requests and responses.  Whether or not there
   is a backchannel is decided by the client; however, many features of
   NFSv4.1 require a backchannel.  NFSv4.1 servers MUST support
   backchannels.

   Each session has resources for each channel, including separate reply
   caches (see Section 2.10.6.1).  Note that even the backchannel
   requires a reply cache (or, at least, a slot table in order to detect
   retries) because some callback operations are non-idempotent.

2.10.3.1.  Association of Connections, Channels, and Sessions



   Each channel is associated with zero or more transport connections
   (whether of the same transport protocol or different transport
   protocols).  A connection can be associated with one channel or both
   channels of a session; the client and server negotiate whether a
   connection will carry traffic for one channel or both channels via
   the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION
   (Section 18.34) operations.  When a session is created via
   CREATE_SESSION, the connection that transported the CREATE_SESSION
   request is automatically associated with the fore channel, and
   optionally the backchannel.  If the client specifies no state
   protection (Section 18.35) when the session is created, then when
   SEQUENCE is transmitted on a different connection, the connection is
   automatically associated with the fore channel of the session
   specified in the SEQUENCE operation.

   A connection's association with a session is not exclusive.  A
   connection associated with the channel(s) of one session may be
   simultaneously associated with the channel(s) of other sessions
   including sessions associated with other client IDs.

   It is permissible for connections of multiple transport types to be
   associated with the same channel.  For example, both TCP and RDMA
   connections can be associated with the fore channel.  In the event an
   RDMA and non-RDMA connection are associated with the same channel,
   the maximum number of slots SHOULD be at least one more than the
   total number of RDMA credits (Section 2.10.6.1).  This way, if all
   RDMA credits are used, the non-RDMA connection can have at least one
   outstanding request.  If a server supports multiple transport types,
   it MUST allow a client to associate connections from each transport
   to a channel.
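
   As a worked example of the slot guidance above: a channel mixing
   an RDMA connection granting 32 credits with a TCP connection would
   be configured with at least 33 slots.

   #include <stdio.h>

   int main(void)
   {
       int rdma_credits = 32;             /* hypothetical total */

       /* One more slot than the RDMA credit total, so the non-RDMA
        * connection can always have one request outstanding even
        * when every credit is in use. */
       int min_slots = rdma_credits + 1;
       printf("minimum slots: %d\n", min_slots);
       return 0;
   }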

   It is permissible for a connection of one type of transport to be
   associated with the fore channel, and a connection of a different
   type to be associated with the backchannel.

2.10.4.  Server Scope



   Servers each specify a server scope value in the form of an opaque
   string eir_server_scope returned as part of the results of an
   EXCHANGE_ID operation.  The purpose of the server scope is to allow a
   group of servers to indicate to clients that a set of servers sharing
   the same server scope value has arranged to use distinct values of
   opaque identifiers so that no two servers in the set ever assign
   the same value to two distinct objects.  Thus, the identifiers
   generated by
   two servers within that set can be assumed compatible so that, in
   certain important cases, identifiers generated by one server in that
   set may be presented to another server of the same scope.

   The use of such compatible values does not imply that a value
   generated by one server will always be accepted by another.  In most
   cases, it will not.  However, a server will not inadvertently accept
   a value generated by another server.  When it does accept it, it will
   be because it is recognized as valid and carrying the same meaning as
   on another server of the same scope.

   When servers are of the same server scope, this compatibility of
   values applies to the following identifiers:

   *  Filehandle values.  A filehandle value accepted by two servers of
      the same server scope denotes the same object.  A WRITE operation
      sent to one server is reflected immediately in a READ sent to the
      other.

   *  Server owner values.  When the server scope values are the same,
      server owner values may be validly compared.  In cases where the
      server scope values are different, server owner values are treated
      as different even if they contain identical strings of bytes.

   The coordination among servers required to provide such compatibility
   can be quite minimal, and limited to a simple partition of the ID
   space.  The recognition of common values requires additional
   implementation, but this can be tailored to the specific situations
   in which that recognition is desired.

   Clients will have occasion to compare the server scope values of
   multiple servers under a number of circumstances, each of which will
   be discussed under the appropriate functional section:

   *  When server owner values received in response to EXCHANGE_ID
      operations sent to multiple network addresses are compared for the
      purpose of determining the validity of various forms of trunking,
      as described in Section 11.5.2.

   *  When network or server reconfiguration causes the same network
      address to possibly be directed to different servers, with the
      necessity for the client to determine when lock reclaim should be
      attempted, as described in Section 8.4.2.1.

   When two replies from EXCHANGE_ID, each from a different server
   network address, have the same server scope, there are a number of
   ways a client can validate that the common server scope is due to two
   servers cooperating in a group.

   *  If both EXCHANGE_ID requests were sent with RPCSEC_GSS ([4], [9],
      [27]) authentication and the server principal is the same for both
      targets, the equality of server scope is validated.  It is
      RECOMMENDED that two servers intending to share the same server
      scope and server_owner major_id also share the same principal
      name.  In some cases, this simplifies the client's task of
      validating server scope.

   *  The client may accept the appearance of the second server in the
      fs_locations or fs_locations_info attribute for a relevant file
      system.  For example, if there is a migration event for a
      particular file system or there are locks to be reclaimed on a
      particular file system, the attributes for that particular file
      system may be used.  The client sends the GETATTR request to the
      first server for the fs_locations or fs_locations_info attribute
      with RPCSEC_GSS authentication.  It may need to do this in advance
      of the need to verify the common server scope.  If the client
      successfully authenticates the reply to GETATTR, and the GETATTR
      request and reply containing the fs_locations or fs_locations_info
      attribute refers to the second server, then the equality of server
      scope is supported.  A client may choose to limit the use of this
      form of support to information relevant to the specific file
      system involved (e.g., a file system being migrated).

2.10.5.  Trunking



   Trunking is the use of multiple connections between a client and
   server in order to increase the speed of data transfer.  NFSv4.1
   supports two types of trunking: session trunking and client ID
   trunking.

   In the context of a single server network address, it can be assumed
   that all connections are accessing the same server, and NFSv4.1
   servers MUST support both forms of trunking.  When multiple
   connections use a set of network addresses to access the same server,
   the server MUST support both forms of trunking.  NFSv4.1 servers in a
   clustered configuration MAY allow network addresses for different
   servers to use client ID trunking.

   Clients may use either form of trunking as long as they do not, when
   trunking between different server network addresses, violate the
   servers' mandates as to the kinds of trunking to be allowed (see
   below).  With regard to callback channels, the client MUST allow the
   server to choose among all callback channels valid for a given client
   ID and MUST support trunking when the connections supporting the
   backchannel allow session or client ID trunking to be used for
   callbacks.

   Session trunking is essentially the association of multiple
   connections, each with potentially different target and/or source
   network addresses, to the same session.  When the target network
   addresses (server addresses) of the two connections are the same, the
   server MUST support such session trunking.  When the target network
   addresses are different, the server MAY indicate such support using
   the data returned by the EXCHANGE_ID operation (see below).

   Client ID trunking is the association of multiple sessions to the
   same client ID.  Servers MUST support client ID trunking for two
   target network addresses whenever they allow session trunking for
   those same two network addresses.  In addition, a server MAY, by
   presenting the same major server owner ID (Section 2.5) and server
   scope (Section 2.10.4), allow an additional case of client ID
   trunking.  When two servers return the same major server owner and
   server scope, it means that the two servers are cooperating on
   locking state management, which is a prerequisite for client ID
   trunking.

   Distinguishing when the client is allowed to use session and client
   ID trunking requires understanding how the results of the EXCHANGE_ID
   (Section 18.35) operation identify a server.  Suppose a client sends
   EXCHANGE_IDs over two different connections, each with a possibly
   different target network address, but each EXCHANGE_ID operation has
   the same value in the eia_clientowner field.  If the same NFSv4.1
   server is listening over each connection, then each EXCHANGE_ID
   result MUST return the same values of eir_clientid,
   eir_server_owner.so_major_id, and eir_server_scope.  The client can
   then treat each connection as referring to the same server (subject
   to verification; see Section 2.10.5.1 below), and it can use each
   connection to trunk requests and replies.  The client's choice is
   whether session trunking or client ID trunking applies.

   Session Trunking.  If the eia_clientowner argument is the same in two
      different EXCHANGE_ID requests, and the eir_clientid,
      eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and
      eir_server_scope results match in both EXCHANGE_ID results, then
      the client is permitted to perform session trunking.  If the
      client has no session mapping to the tuple of eir_clientid,
      eir_server_owner.so_major_id, eir_server_scope, and
      eir_server_owner.so_minor_id, then it creates the session via a
      CREATE_SESSION operation over one of the connections, which
      associates the connection to the session.  If there is a session
      for the tuple, the client can send BIND_CONN_TO_SESSION to
      associate the connection to the session.

      Of course, if the client does not desire to use session trunking,
      it is not required to do so.  It can invoke CREATE_SESSION on the
      connection.  This will result in client ID trunking as described
      below.  It can also decide to drop the connection if it does not
      choose to use trunking.

   Client ID Trunking.  If the eia_clientowner argument is the same in
      two different EXCHANGE_ID requests, and the eir_clientid,
      eir_server_owner.so_major_id, and eir_server_scope results match
      in both EXCHANGE_ID results, then the client is permitted to
      perform client ID trunking (regardless of whether the
      eir_server_owner.so_minor_id results match).  The client can
      associate each connection with different sessions, where each
      session is associated with the same server.

      The client completes the act of client ID trunking by invoking
      CREATE_SESSION on each connection, using the same client ID that
      was returned in eir_clientid.  These invocations create two
      sessions and also associate each connection with its respective
      session.  The client is free to decline to use client ID trunking
      by simply dropping the connection at this point.

      When doing client ID trunking, locking state is shared across
      sessions associated with that same client ID.  This requires the
      server to coordinate state across sessions and the client to be
      able to associate the same locking state with multiple sessions.
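
   The decision implied by the two cases above can be sketched as
   follows; the structure is a hypothetical reduction of the relevant
   EXCHANGE_ID results, and verification per Section 2.10.5.1 still
   applies before any trunking is attempted.

   #include <stdio.h>
   #include <string.h>
   #include <stdint.h>

   struct eir {                   /* hypothetical reduction */
       uint64_t clientid;
       uint64_t so_minor_id;
       char     so_major_id[64];
       char     server_scope[64];
   };

   /* 0 = neither; 1 = client ID trunking only; 2 = session
    * trunking (which also permits client ID trunking). */
   static int trunking_allowed(const struct eir *a,
                               const struct eir *b)
   {
       if (a->clientid != b->clientid ||
           strcmp(a->so_major_id, b->so_major_id) != 0 ||
           strcmp(a->server_scope, b->server_scope) != 0)
           return 0;
       return a->so_minor_id == b->so_minor_id ? 2 : 1;
   }

   int main(void)
   {
       struct eir a = { 7, 1, "srv.example.com", "scope-A" };
       struct eir b = { 7, 2, "srv.example.com", "scope-A" };
       printf("trunking mode: %d\n", trunking_allowed(&a, &b));
       return 0;
   }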

   It is always possible that, as a result of various sorts of
   reconfiguration events, eir_server_scope and eir_server_owner values
   may be different on subsequent EXCHANGE_ID requests made to the same
   network address.

   In most cases, such reconfiguration events will be disruptive and
   indicate that an IP address formerly connected to one server is now
   connected to an entirely different one.

   Some guidelines on client handling of such situations follow:

   *  When eir_server_scope changes, the client has no assurance that
      any IDs that it obtained previously (e.g., filehandles) can be
      validly used on the new server, and, even if the new server
      accepts them, there is no assurance that this is not due to
      accident.  Thus, it is best to treat all such state as lost or
      stale, although a client may assume that the probability of
      inadvertent acceptance is low and treat this situation as within
      the next case.

   *  When eir_server_scope remains the same and
      eir_server_owner.so_major_id changes, the client can use the
      filehandles it has, consider its locking state lost, and attempt
      to reclaim or otherwise re-obtain its locks.  It might find that
      its filehandle is now stale.  However, if NFS4ERR_STALE is not
      returned, it can proceed to reclaim or otherwise re-obtain its
      open locking state.

   *  When eir_server_scope and eir_server_owner.so_major_id remain the
      same, the client has to use the now-current values of
      eir_server_owner.so_minor_id in deciding on appropriate forms of
      trunking.  This may result in connections being dropped or new
      sessions being created.

2.10.5.1.  Verifying Claims of Matching Server Identity



   When the server responds using two different connections that claim
   matching or partially matching eir_server_owner, eir_server_scope,
   and eir_clientid values, the client does not have to trust the
   servers' claims.  The client may verify these claims before trunking
   traffic in the following ways:

   *  For session trunking, clients SHOULD reliably verify if
      connections between different network paths are in fact associated
      with the same NFSv4.1 server and usable on the same session, and
      servers MUST allow clients to perform reliable verification.  When
      a client ID is created, the client SHOULD specify that
      BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or
      SP4_MACH_CRED (Section 18.35) state protection options.  For
      SP4_SSV, reliable verification depends on a shared secret (the
      SSV) that is established via the SET_SSV (see Section 18.47)
      operation.

      When a new connection is associated with the session (via the
      BIND_CONN_TO_SESSION operation, see Section 18.34), if the client
      specified SP4_SSV state protection for the BIND_CONN_TO_SESSION
      operation, the client MUST send the BIND_CONN_TO_SESSION with
      RPCSEC_GSS protection, using integrity or privacy, and an
      RPCSEC_GSS handle created with the GSS SSV mechanism (see
      Section 2.10.9).

      If the client mistakenly tries to associate a connection to a
      session of a wrong server, the server will either reject the
      attempt because it is not aware of the session identifier of the
      BIND_CONN_TO_SESSION arguments, or it will reject the attempt
      because the RPCSEC_GSS authentication fails.  Even if the server
      mistakenly or maliciously accepts the connection association
      attempt, the RPCSEC_GSS verifier it computes in the response will
      not be verified by the client, so the client will know it cannot
      use the connection for trunking the specified session.

      If the client specified SP4_MACH_CRED state protection, the
      BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or
      privacy, using the same credential that was used when the client
      ID was created.  Mutual authentication via RPCSEC_GSS assures the
      client that the connection is associated with the correct session
      of the correct server.

   *  For client ID trunking, the client has at least two options for
      verifying that the same client ID obtained from two different
      EXCHANGE_ID operations came from the same server.  The first
      option is to use RPCSEC_GSS authentication when sending each
      EXCHANGE_ID operation.  Each time an EXCHANGE_ID is sent with
      RPCSEC_GSS authentication, the client notes the principal name of
      the GSS target.  If the EXCHANGE_ID results indicate that client
      ID trunking is possible, and the GSS targets' principal names are
      the same, the servers are the same and client ID trunking is
      allowed.

      The second option for verification is to use SP4_SSV protection.
      When the client sends EXCHANGE_ID, it specifies SP4_SSV
      protection.  The first EXCHANGE_ID the client sends always has to
      be confirmed by a CREATE_SESSION call.  The client then sends
      SET_SSV.  Later, the client sends EXCHANGE_ID to a second
      destination network address different from the one the first
      EXCHANGE_ID was sent to.  The client checks that each EXCHANGE_ID
      reply has the same eir_clientid, eir_server_owner.so_major_id, and
      eir_server_scope.  If so, the client verifies the claim by sending
      a CREATE_SESSION operation to the second destination address,
      protected with RPCSEC_GSS integrity using an RPCSEC_GSS handle
      returned by the second EXCHANGE_ID.  If the server accepts the
      CREATE_SESSION request, and if the client verifies the RPCSEC_GSS
      verifier and integrity codes, then the client has proof the second
      server knows the SSV, and thus the two servers are cooperating for
      the purposes of specifying server scope and client ID trunking.
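
   To make the comparison step concrete, the following C sketch shows
   the three-way match a client might perform on the results of two
   EXCHANGE_ID operations before attempting verification.  This is
   purely illustrative: the structure and function names are invented,
   and the fields are simplified stand-ins for the XDR types.

   #include <stdbool.h>
   #include <stdint.h>
   #include <string.h>

   /* Invented, simplified view of the EXCHANGE_ID results that
    * matter for trunking decisions. */
   struct eid_results {
       uint64_t eir_clientid;
       char     so_major_id[64];   /* eir_server_owner.so_major_id */
       char     server_scope[64];  /* eir_server_scope             */
   };

   /* Client ID trunking is worth verifying (e.g., by the protected
    * CREATE_SESSION probe described above) only when all three
    * values match across the two destination addresses. */
   static bool maybe_client_id_trunkable(const struct eid_results *a,
                                         const struct eid_results *b)
   {
       return a->eir_clientid == b->eir_clientid &&
              strcmp(a->so_major_id, b->so_major_id) == 0 &&
              strcmp(a->server_scope, b->server_scope) == 0;
   }

   int main(void)
   {
       struct eid_results a = { 1, "srvA", "scope1" };
       struct eid_results b = { 1, "srvA", "scope1" };

       return maybe_client_id_trunkable(&a, &b) ? 0 : 1;
   }

   A match by itself proves nothing; it only makes the protected
   CREATE_SESSION probe described above worth sending.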

2.10.6.  Exactly Once Semantics



   Via the session, NFSv4.1 offers exactly once semantics (EOS) for
   requests sent over a channel.  EOS is supported on both the fore
   channel and backchannel.

   Each COMPOUND or CB_COMPOUND request that is sent with a leading
   SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver
   exactly once.  This requirement holds regardless of whether the
   request is sent with reply caching specified (see
   Section 2.10.6.1.3).  The requirement holds even if the requester is
   sending the request over a session created between a pNFS data client
   and pNFS data server.  To understand the rationale for this
   requirement, divide the requests into three classifications:

   *  Non-idempotent requests.

   *  Idempotent modifying requests.

   *  Idempotent non-modifying requests.

   An example of a non-idempotent request is RENAME.  Obviously, if a
   replier executes the same RENAME request twice, and the first
   execution succeeds, the re-execution will fail.  If the replier
   returns the result from the re-execution, this result is incorrect.
   Therefore, EOS is required for non-idempotent requests.

   An example of an idempotent modifying request is a COMPOUND request
   containing a WRITE operation.  Repeated execution of the same WRITE
   has the same effect as execution of that WRITE a single time.
   Nevertheless, enforcing EOS for WRITEs and other idempotent modifying
   requests is necessary to avoid data corruption.

   Suppose a client sends WRITE A to a noncompliant server that does not
   enforce EOS, and receives no response, perhaps due to a network
   partition.  The client reconnects to the server and re-sends WRITE A.
   The server now has two outstanding instances of A.  The server can
   be in a situation in which it executes and replies to the retry of A,
   while the first A is still waiting in the server's internal I/O
   system for some resource.  Upon receiving the reply to the second
   attempt of WRITE A, the client believes its WRITE is done so it is
   free to send WRITE B, which overlaps the byte-range of A.  When the
   original A is dispatched from the server's I/O system and executed
   (thus the second time A will have been written), then what has been
   written by B can be overwritten and thus corrupted.

   An example of an idempotent non-modifying request is a COMPOUND
   containing SEQUENCE, PUTFH, READLINK, and nothing else.  The re-
   execution of such a request will not cause data corruption or produce
   an incorrect result.  Nonetheless, to keep the implementation simple,
   the replier MUST enforce EOS for all requests, whether or not
   idempotent and non-modifying.

   Note that true and complete EOS is not possible unless the server
   persists the reply cache in stable storage, and unless the server is
   somehow implemented to never require a restart (indeed, if such a
   server exists, the distinction between a reply cache kept in stable
   storage versus one that is not is one without meaning).  See
   Section 2.10.6.5 for a discussion of persistence in the reply cache.
   Regardless, even if the server does not persist the reply cache, EOS
   improves robustness and correctness over previous versions of NFS
   because the legacy duplicate request/reply caches were based on the
   ONC RPC transaction identifier (XID).  Section 2.10.6.1 explains the
   shortcomings of the XID as a basis for a reply cache and describes
   how NFSv4.1 sessions improve upon the XID.

2.10.6.1.  Slot Identifiers and Reply Cache



   The RPC layer provides a transaction ID (XID), which, while required
   to be unique, is not convenient for tracking requests for two
   reasons.  First, the XID is only meaningful to the requester; it
   cannot be interpreted by the replier except to test for equality with
   previously sent requests.  When consulting an RPC-based duplicate
   request cache, the opaqueness of the XID requires a computationally
   expensive look up (often via a hash that includes XID and source
   address).  NFSv4.1 requests use a non-opaque slot ID, which is an
   index into a slot table, which is far more efficient.  Second,
   because RPC requests can be executed by the replier in any order,
   there is no bound on the number of requests that may be outstanding
   at any time.  To achieve perfect EOS, using ONC RPC would require
   storing all replies in the reply cache.  XIDs are 32 bits; storing
   over four billion (2^(32)) replies in the reply cache is not
   practical.  In practice, previous versions of NFS have chosen to
   store a fixed number of replies in the cache, and to use a least
   recently used (LRU) approach to replacing cache entries with new
   entries when the cache is full.  In NFSv4.1, the number of
   outstanding requests is bounded by the size of the slot table, and a
   sequence ID per slot is used to tell the replier when it is safe to
   delete a cached reply.

   In the NFSv4.1 reply cache, when the requester sends a new request,
   it selects a slot ID in the range 0..N, where N is the replier's
   current maximum slot ID granted to the requester on the session over
   which the request is to be sent.  The value of N starts out as equal
   to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the
   response to SEQUENCE or CB_SEQUENCE as described later in this
   section.  The slot ID must be unused by any of the requests that the
   requester has already active on the session.  "Unused" here means the
   requester has no outstanding request for that slot ID.

   A slot contains a sequence ID and the cached reply corresponding to
   the request sent with that sequence ID.  The sequence ID is a 32-bit
   unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^(32) -
   1).  The first time a slot is used, the requester MUST specify a
   sequence ID of one (Section 18.36).  Each time a slot is reused, the
   request MUST specify a sequence ID that is one greater than that of
   the previous request on the slot.  If the previous sequence ID was
   0xFFFFFFFF, then the next request for the slot MUST have the sequence
   ID set to zero (i.e., (2^(32) - 1) + 1 mod 2^(32)).
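
   Because the sequence ID is a 32-bit unsigned value, the wraparound
   rule amounts to ordinary modular arithmetic.  A minimal C sketch
   follows (illustrative only; seqid_next is an invented name):

   #include <assert.h>
   #include <stdint.h>

   /* Next sequence ID for a slot: one greater than the previous one,
    * wrapping from 0xFFFFFFFF back to zero.  Unsigned 32-bit
    * arithmetic in C performs the "mod 2^32" step implicitly.  (The
    * first use of a slot is a separate case: it MUST use one.) */
   static uint32_t seqid_next(uint32_t prev)
   {
       return prev + 1;            /* wraps to 0 after 0xFFFFFFFF */
   }

   int main(void)
   {
       assert(seqid_next(1) == 2);
       assert(seqid_next(0xFFFFFFFFu) == 0);
       return 0;
   }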

   The sequence ID accompanies the slot ID in each request.  It is used
   for the critical check at the replier: it efficiently determines
   whether a request using a certain slot ID is a retransmit or a new,
   never-before-seen request.
   assert that it is retransmitting to implement this, because for any
   given request the requester cannot know whether the replier has seen
   it unless the replier actually replies.  Of course, if the requester
   has seen the reply, the requester would not retransmit.

   The replier compares each received request's sequence ID with the
   last one previously received for that slot ID, to see if the new
   request is:

   *  A new request, in which the sequence ID is one greater than that
      previously seen in the slot (accounting for sequence wraparound).
      The replier proceeds to execute the new request, and the replier
      MUST increase the slot's sequence ID by one.

   *  A retransmitted request, in which the sequence ID is equal to that
      currently recorded in the slot.  If the original request has
      executed to completion, the replier returns the cached reply.  See
      Section 2.10.6.2 for direction on how the replier deals with
      retries of requests that are still in progress.

   *  A misordered retry, in which the sequence ID is less than
      (accounting for sequence wraparound) that previously seen in the
      slot.  The replier MUST return NFS4ERR_SEQ_MISORDERED (as the
      result from SEQUENCE or CB_SEQUENCE).

   *  A misordered new request, in which the sequence ID is two or more
      greater than (accounting for sequence wraparound) that previously
      seen in
      the slot.  Note that because the sequence ID MUST wrap around to
      zero once it reaches 0xFFFFFFFF, a misordered new request and a
      misordered retry cannot be distinguished.  Thus, the replier MUST
      return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or
      CB_SEQUENCE).
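
   The comparison above can be sketched in C.  This fragment is
   illustrative only: the names are invented, and a real replier would
   also consult the slot's cached reply and any in-progress state
   before answering a retry.

   #include <stdint.h>

   /* Invented classification of an arriving sequence ID against the
    * sequence ID last recorded in the slot. */
   enum seq_class {
       SEQ_NEW,        /* execute; bump the slot's sequence ID     */
       SEQ_RETRY,      /* return cached reply, or handle a retry of
                          a request still in progress              */
       SEQ_MISORDERED  /* return NFS4ERR_SEQ_MISORDERED            */
   };

   static enum seq_class classify(uint32_t slot_seqid,
                                  uint32_t req_seqid)
   {
       uint32_t delta = req_seqid - slot_seqid;   /* wraps mod 2^32 */

       if (delta == 1)
           return SEQ_NEW;
       if (delta == 0)
           return SEQ_RETRY;
       /* A stale retry and a request that skipped ahead are
        * indistinguishable once wraparound is allowed, so both map
        * to NFS4ERR_SEQ_MISORDERED. */
       return SEQ_MISORDERED;
   }

   int main(void)
   {
       return (classify(4, 5) == SEQ_NEW &&
               classify(5, 5) == SEQ_RETRY &&
               classify(5, 4) == SEQ_MISORDERED &&
               classify(0xFFFFFFFFu, 0) == SEQ_NEW) ? 0 : 1;
   }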

   Unlike the XID, the slot ID is always within a specific range; this
   has two implications.  The first implication is that for a given
   session, the replier need only cache the results of a limited number
   of COMPOUND requests.  The second implication derives from the first,
   which is that unlike XID-indexed reply caches (also known as
   duplicate request caches, or DRCs), the slot ID-based reply cache
   cannot be overflowed.  Through use of the sequence ID to identify
   retransmitted requests, the replier does not need to actually cache
   the request itself, reducing the storage requirements of the reply
   cache further.  These facilities make it practical to maintain all
   the required entries for an effective reply cache.

   The slot ID, sequence ID, and session ID therefore take over the
   traditional role of the XID and source network address in the
   replier's reply cache implementation.  This approach is considerably
   more portable and completely robust -- it is not subject to the
   reassignment of ports as clients reconnect over IP networks.  In
   addition, the RPC XID is not used in the reply cache, enhancing
   robustness of the cache in the face of any rapid reuse of XIDs by the
   requester.  While the replier does not care about the XID for the
   purposes of reply cache management (but the replier MUST return the
   same XID that was in the request), nonetheless there are
   considerations for the XID in NFSv4.1 that are the same as all other
   previous versions of NFS.  The RPC XID remains in each message and
   needs to be formulated in NFSv4.1 requests as in any other ONC RPC
   request.  The reasons include:

   *  The RPC layer retains its existing semantics and implementation.

   *  The requester and replier must be able to interoperate at the RPC
      layer, prior to the NFSv4.1 decoding of the SEQUENCE or
      CB_SEQUENCE operation.

   *  If an operation is being used that does not start with SEQUENCE or
      CB_SEQUENCE (e.g., BIND_CONN_TO_SESSION), then the RPC XID is
      needed for correct operation to match the reply to the request.

   *  The SEQUENCE or CB_SEQUENCE operation may generate an error.  If
      so, the embedded slot ID, sequence ID, and session ID (if present)
      in the request will not be in the reply, and the requester has
      only the XID to match the reply to the request.

   Given that well-formulated XIDs continue to be required, this raises
   the question: why do SEQUENCE and CB_SEQUENCE replies have a session
   ID, slot ID, and sequence ID?  Having the session ID in the reply
   means that the requester does not have to use the XID to look up the
   session ID, which would be necessary if the connection were
   associated with multiple sessions.  Having the slot ID and sequence
   ID in the reply means that the requester does not have to use the XID
   to look up the slot ID and sequence ID.  Furthermore, since the XID
   is only 32 bits, it is too small to guarantee the re-association of a
   reply with its request [44]; having session ID, slot ID, and sequence
   ID in the reply allows the client to validate that the reply in fact
   belongs to the matched request.

   The SEQUENCE (and CB_SEQUENCE) operation also carries a
   "highest_slotid" value, which carries additional requester slot usage
   information.  The requester MUST always indicate the slot ID
   representing the outstanding request with the highest-numbered slot
   value.  The requester should in all cases provide the most
   conservative value possible, although it can be increased somewhat
   above the actual instantaneous usage to maintain some minimum or
   optimal level.  This provides a way for the requester to yield unused
   request slots back to the replier, which in turn can use the
   information to reallocate resources.

   The replier responds with both a new target highest_slotid and an
   enforced highest_slotid, described as follows:

   *  The target highest_slotid is an indication to the requester of the
      highest_slotid the replier wishes the requester to be using.  This
      permits the replier to withdraw (or add) resources from a
      requester that has been found to not be using them, in order to
      more fairly share resources among a varying level of demand from
      other requesters.  The requester must always comply with the
      replier's value updates, since they indicate newly established
      hard limits on the requester's access to session resources.
      However, because of request pipelining, the requester may have
      active requests in flight reflecting prior values; therefore, the
      replier must not immediately require the requester to comply.

   *  The enforced highest_slotid indicates the highest slot ID the
      requester is permitted to use on a subsequent SEQUENCE or
      CB_SEQUENCE operation.  The replier's enforced highest_slotid
      SHOULD be no less than the highest_slotid the requester indicated
      in the SEQUENCE or CB_SEQUENCE arguments.

      A requester can be intransigent with respect to lowering its
      highest_slotid argument to a Sequence operation; i.e., the
      requester continues to ignore the target highest_slotid in the
      response to a Sequence operation, and continues to set its
      highest_slotid argument to be higher than the target
      highest_slotid.  This can be considered particularly egregious
      behavior when the replier knows there are no outstanding requests
      with slot IDs higher than its target highest_slotid.  When faced
      with such intransigence, the replier is free to take more forceful
      action, and MAY reply with a new enforced highest_slotid that is
      less than its previous enforced highest_slotid.  Thereafter, if
      the requester continues to send requests with a highest_slotid
      that is greater than the replier's new enforced highest_slotid,
      the server MAY return NFS4ERR_BAD_HIGH_SLOT, unless the slot ID in
      the request is greater than the new enforced highest_slotid and
      the request is a retry.

      The replier SHOULD retain the slots it wants to retire until the
      requester sends a request with a highest_slotid less than or equal
      to the replier's new enforced highest_slotid.

      The requester can also be intransigent with respect to sending
      non-retry requests that have a slot ID that exceeds the replier's
      highest_slotid.  Once the replier has forcibly lowered the
      enforced highest_slotid, the requester is only allowed to send
      retries on slots that exceed the replier's highest_slotid.  If a
      request is received with a slot ID that is higher than the new
      enforced highest_slotid, and the sequence ID is one higher than
      what is in the slot's reply cache, then the server can both retire
      the slot and return NFS4ERR_BADSLOT (however, the server MUST NOT
      do one and not the other).  The reason it is safe to retire the
      slot is because by using the next sequence ID, the requester is
      indicating it has received the previous reply for the slot.

   *  The requester SHOULD use the lowest available slot when sending a
      new request.  This way, the replier may be able to retire slot
      entries faster.  However, where the replier is actively adjusting
      its granted highest_slotid, it will not be able to use only the
      receipt of the slot ID and highest_slotid in the request.  Neither
      the slot ID nor the highest_slotid used in a request may reflect
      the replier's current idea of the requester's session limit,
      because the request may have been sent from the requester before
      the update was received.  Therefore, in the downward adjustment
      case, the replier may have to retain a number of reply cache
      entries at least as large as the old value of maximum requests
      outstanding, until it can infer that the requester has seen a
      reply containing the new granted highest_slotid.  The replier can
      infer that the requester has seen such a reply when it receives a
      new request with the same slot ID as the request replied to and
      the next higher sequence ID.
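
   The requester-side guidance in the last item, together with the
   enforced limit, might look like the following sketch.  The slot
   table layout and function names are invented for the example.

   #include <stdbool.h>
   #include <stdint.h>

   #define NSLOTS 64   /* ca_maxrequests for the example */

   struct slot_table {
       bool     busy[NSLOTS];      /* request outstanding on slot? */
       uint32_t enforced_highest;  /* replier's enforced value     */
   };

   /* Pick the lowest unused slot that does not exceed the enforced
    * highest_slotid.  Returns -1 when every permitted slot is in
    * use, i.e., the requester must wait for a reply first. */
   static int slot_acquire(struct slot_table *t)
   {
       for (uint32_t i = 0;
            i <= t->enforced_highest && i < NSLOTS; i++) {
           if (!t->busy[i]) {
               t->busy[i] = true;
               return (int)i;
           }
       }
       return -1;
   }

   int main(void)
   {
       struct slot_table t = { .enforced_highest = 7 };

       t.busy[0] = true;                  /* slot 0 is in flight */
       return slot_acquire(&t) == 1 ? 0 : 1;
   }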

2.10.6.1.1.  Caching of SEQUENCE and CB_SEQUENCE Replies


   When a SEQUENCE or CB_SEQUENCE operation is successfully executed,
   its reply MUST always be cached.  Specifically, session ID, sequence
   ID, and slot ID MUST be cached in the reply cache.  The reply from
   SEQUENCE also includes the highest slot ID, target highest slot ID,
   and status flags.  Instead of caching these values, the server MAY
   re-compute the values from the current state of the fore channel,
   session, and/or client ID as appropriate.  Similarly, the reply from
   CB_SEQUENCE includes a highest slot ID and target highest slot ID.
   The client MAY re-compute the values from the current state of the
   session as appropriate.

   Regardless of whether or not a replier is re-computing highest slot
   ID, target slot ID, and status on replies to retries, the requester
   MUST NOT assume that the values are being re-computed whenever it
   receives a reply after a retry is sent, since it has no way of
   knowing whether the reply it has received was sent by the replier in
   response to the retry or is a delayed response to the original
   request.  Therefore, the highest slot ID, target slot ID, or status
   bits may reflect the state of affairs when the request was first
   executed.  Although acting based on such delayed
   information is valid, it may cause the receiver of the reply to do
   unneeded work.  Requesters MAY choose to send additional requests to
   get the current state of affairs or use the state of affairs reported
   by subsequent requests, in preference to acting immediately on data
   that might be out of date.

2.10.6.1.2.  Errors from SEQUENCE and CB_SEQUENCE


   Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence ID of
   the slot MUST NOT change.  The replier MUST NOT modify the reply
   cache entry for the slot whenever an error is returned from SEQUENCE
   or CB_SEQUENCE.

2.10.6.1.3.  Optional Reply Caching


   On a per-request basis, the requester can choose to direct the
   replier to cache the reply to all operations after the first
   operation (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or
   csa_cachethis fields of the arguments to SEQUENCE or CB_SEQUENCE.
   The reason it would not direct the replier to cache the entire reply
   is that the request is composed of all idempotent operations [41].
   Caching the reply may offer little benefit.  If the reply is too
   large (see Section 2.10.6.4), it may not be cacheable anyway.  Even
   if the reply to an idempotent request is small enough to cache,
   unnecessarily caching the reply slows down the server and increases
   RPC latency.

   Whether or not the requester requests the reply to be cached has no
   effect on the slot processing.  If the result of SEQUENCE or
   CB_SEQUENCE is NFS4_OK, then the slot's sequence ID MUST be
   incremented by one.  If a requester does not direct the replier to
   cache the reply, the replier MUST do one of the following:

   *  The replier can cache the entire original reply.  Even though
      sa_cachethis or csa_cachethis is FALSE, the replier is always free
      to cache.  It may choose this approach in order to simplify
      implementation.

   *  The replier enters into its reply cache a reply consisting of the
      original results to the SEQUENCE or CB_SEQUENCE operation, with
      the next operation in COMPOUND or CB_COMPOUND having the error
      NFS4ERR_RETRY_UNCACHED_REP (a sketch of such a stub entry follows
      this list).  Thus, if the requester later retries the request, it
      will get NFS4ERR_RETRY_UNCACHED_REP.  If a replier receives a
      retried Sequence operation where the reply to the COMPOUND or
      CB_COMPOUND was not cached, then the replier,

      -  MAY return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence
         operation if the Sequence operation is not the first operation
         (granted, a requester that does so is in violation of the
         NFSv4.1 protocol).

      -  MUST NOT return NFS4ERR_RETRY_UNCACHED_REP in reply to a
         Sequence operation if the Sequence operation is the first
         operation.

   *  If the second operation is an illegal operation, or an operation
      that was legal in a previous minor version of NFSv4 and MUST NOT
      be supported in the current minor version (e.g., SETCLIENTID), the
      replier MUST NOT ever return NFS4ERR_RETRY_UNCACHED_REP.  Instead
      the replier MUST return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or
      NFS4ERR_NOTSUPP as appropriate.

   *  If the second operation can result in another error status, the
      replier MAY return a status other than NFS4ERR_RETRY_UNCACHED_REP,
      provided the operation is not executed in such a way that the
      state of the replier is changed.  Examples of such an error status
      include: NFS4ERR_NOTSUPP returned for an operation that is legal
      but not REQUIRED in the current minor version, and thus not
      supported by the replier; NFS4ERR_SEQUENCE_POS; and
      NFS4ERR_REQ_TOO_BIG.
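
   The stub entry mentioned in the second item of the list above might
   be recorded as in the following sketch.  The structure is invented,
   and the status values are stand-ins rather than the assigned
   protocol values.

   #include <stdbool.h>

   /* Stand-in status codes; the assigned numeric values live in the
    * protocol's XDR, not here. */
   enum { NFS4_OK = 0, NFS4ERR_RETRY_UNCACHED_REP };

   struct cached_reply {
       int  seq_status;    /* cached result of SEQUENCE          */
       bool body_cached;   /* were the remaining results kept?   */
       int  stub_status;   /* status served for op 2 on a retry  */
   };

   /* When sa_cachethis is FALSE, the replier may record only the
    * SEQUENCE results plus this stub; a later retry then receives
    * NFS4ERR_RETRY_UNCACHED_REP for the second operation. */
   static void cache_uncached_stub(struct cached_reply *e,
                                   int seq_status)
   {
       e->seq_status  = seq_status;
       e->body_cached = false;
       e->stub_status = NFS4ERR_RETRY_UNCACHED_REP;
   }

   int main(void)
   {
       struct cached_reply e;

       cache_uncached_stub(&e, NFS4_OK);
       return e.stub_status == NFS4ERR_RETRY_UNCACHED_REP ? 0 : 1;
   }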

   The discussion above assumes that the retried request matches the
   original one.  Section 2.10.6.1.3.1 discusses what the replier might
   do, and what it MUST do, when the original and retried requests do
   not match.
   Since the replier may only cache a small amount of the information
   that would be required to determine whether this is a case of a false
   retry, the replier may send to the client any of the following
   responses:

   *  The cached reply to the original request (if the replier has
      cached it in its entirety and the users of the original request
      and retry match).

   *  A reply that consists only of the Sequence operation with the
      error NFS4ERR_SEQ_FALSE_RETRY.

   *  A reply consisting of the response to Sequence with the status
      NFS4_OK, together with the second operation as it appeared in the
      retried request with an error of NFS4ERR_RETRY_UNCACHED_REP or
      other error as described above.

   *  A reply that consists of the response to Sequence with the status
      NFS4_OK, together with the second operation as it appeared in the
      original request with an error of NFS4ERR_RETRY_UNCACHED_REP or
      other error as described above.

2.10.6.1.3.1.  False Retry


   If a requester sent a Sequence operation with a slot ID and sequence
   ID that are in the reply cache, but the replier detected that the
   retried request is not the same as the original request, then this
   is a false retry.  Examples include a retry whose operations or
   operation arguments differ from the original's, and a retry that
   uses a different principal in the RPC request's credential field
   that translates to a different user.  When the replier detects a
   false retry, it is permitted (but not always obligated) to return
   NFS4ERR_SEQ_FALSE_RETRY in response to the Sequence operation.

   Translations of particularly privileged user values to other users
   due to the lack of appropriately secure credentials, as configured on
   the replier, should be applied before determining whether the users
   are the same or different.  If the replier determines the users are
   different between the original request and a retry, then the replier
   MUST return NFS4ERR_SEQ_FALSE_RETRY.

   If an operation of the retry is an illegal operation, or an operation
   that was legal in a previous minor version of NFSv4 and MUST NOT be
   supported in the current minor version (e.g., SETCLIENTID), the
   replier MAY return NFS4ERR_SEQ_FALSE_RETRY (and MUST do so if the
   users of the original request and retry differ).  Otherwise, the
   replier MAY return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or
   NFS4ERR_NOTSUPP as appropriate.  Note that this handling contrasts
   with how the replier deals with retries of requests with no cached
   reply.  The difference is due to NFS4ERR_SEQ_FALSE_RETRY being
   a valid error for only Sequence operations, whereas
   NFS4ERR_RETRY_UNCACHED_REP is a valid error for all operations except
   illegal operations and operations that MUST NOT be supported in the
   current minor version of NFSv4.
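
   Since the replier typically does not retain the entire original
   request, one plausible approach is to keep a digest of the request
   and the post-translation principal for each slot.  The following
   sketch is hypothetical; a real server would use a cryptographic
   digest rather than the toy checksum shown.

   #include <stdbool.h>
   #include <stdint.h>
   #include <string.h>

   /* Invented per-slot record of what the original request looked
    * like.  FNV-1a stands in for a real digest purely for
    * illustration. */
   struct slot_record {
       uint32_t seqid;
       uint64_t req_digest;    /* digest of the request's operations */
       char     principal[64]; /* user after any server-side mapping */
   };

   static uint64_t toy_digest(const unsigned char *buf, size_t len)
   {
       uint64_t h = 1469598103934665603ull;   /* FNV offset basis */

       for (size_t i = 0; i < len; i++)
           h = (h ^ buf[i]) * 1099511628211ull;
       return h;
   }

   /* A retry whose digest or (post-translation) principal differs
    * from the original is a false retry; for a differing user the
    * replier MUST answer NFS4ERR_SEQ_FALSE_RETRY. */
   static bool is_false_retry(const struct slot_record *rec,
                              const unsigned char *req, size_t len,
                              const char *principal)
   {
       return rec->req_digest != toy_digest(req, len) ||
              strcmp(rec->principal, principal) != 0;
   }

   int main(void)
   {
       unsigned char req[] = "SEQUENCE+PUTFH+READ";
       struct slot_record rec = { 5, 0, "alice" };

       rec.req_digest = toy_digest(req, sizeof(req));
       return is_false_retry(&rec, req, sizeof(req), "bob") ? 0 : 1;
   }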

2.10.6.2.  Retry and Replay of Reply



   A requester MUST NOT retry a request, unless the connection it used
   to send the request disconnects.  The requester can then reconnect
   and re-send the request, or it can re-send the request over a
   different connection that is associated with the same session.

   If the requester is a server wanting to re-send a callback operation
   over the backchannel of a session, the requester of course cannot
   reconnect because only the client can associate connections with the
   backchannel.  The server can re-send the request over another
   connection that is bound to the same session's backchannel.  If there
   is no such connection, the server MUST indicate that the session has
   no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag
   bit in the response to the next SEQUENCE operation from the client.
   The client MUST then associate a connection with the session (or
   destroy the session).

   Note that it is not fatal for a requester to retry without a
   disconnect between the request and retry.  However, the retry does
   consume resources, especially with RDMA, where each request, retry or
   not, consumes a credit.  Retries for no reason, especially retries
   sent shortly after the previous attempt, are a poor use of network
   bandwidth and defeat the purpose of a transport's inherent congestion
   control system.

   A requester MUST wait for a reply to a request before using the slot
   for another request.  If it does not wait for a reply, then the
   requester does not know what sequence ID to use for the slot on its
   next request.  For example, suppose a requester sends a request with
   sequence ID 1, and does not wait for the response.  The next time it
   uses the slot, it sends the new request with sequence ID 2.  If the
   replier has not seen the request with sequence ID 1, then the replier
   is not expecting sequence ID 2, and rejects the requester's new
   request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or
   CB_SEQUENCE).

   RDMA fabrics do not guarantee that the memory handles (Steering Tags)
   within each RPC/RDMA "chunk" [32] are valid on a scope outside that
   of a single connection.  Therefore, handles used by the direct
   operations become invalid after connection loss.  The server must
   ensure that any RDMA operations that must be replayed from the reply
   cache use the newly provided handle(s) from the most recent request.

   A retry might be sent while the original request is still in progress
   on the replier.  The replier SHOULD deal with the issue by returning
   NFS4ERR_DELAY as the reply to the SEQUENCE or CB_SEQUENCE operation,
   but implementations MAY return NFS4ERR_SEQ_MISORDERED.  Since errors
   from
   SEQUENCE and CB_SEQUENCE are never recorded in the reply cache, this
   approach allows the results of the execution of the original request
   to be properly recorded in the reply cache (assuming that the
   requester specified the reply to be cached).

2.10.6.3.  Resolving Server Callback Races



   It is possible for server callbacks to arrive at the client before
   the reply from related fore channel operations.  For example, a
   client may have been granted a delegation to a file it has opened,
   but the reply to the OPEN (informing the client of the granting of
   the delegation) may be delayed in the network.  If a conflicting
   operation arrives at the server, it will recall the delegation using
   the backchannel, which may be on a different transport connection,
   perhaps even a different network, or even a different session
   associated with the same client ID.

   The presence of a session between the client and server alleviates
   this issue.  When a session is in place, each client request is
   uniquely identified by its { session ID, slot ID, sequence ID }
   triple.  By the rules under which slot entries (reply cache entries)
   are retired, the server has knowledge whether the client has "seen"
   each of the server's replies.  The server can therefore provide
   sufficient information to the client to allow it to determine
   whether a callback has in fact raced the reply to a related fore
   channel operation.

   For each client operation that might result in some sort of server
   callback, the server SHOULD "remember" the { session ID, slot ID,
   sequence ID } triple of the client request until the slot ID
   retirement rules allow the server to determine that the client has,
   in fact, seen the server's reply.  Until the time the { session ID,
   slot ID, sequence ID } request triple can be retired, any recalls of
   the associated object MUST carry an array of these referring
   identifiers (in the CB_SEQUENCE operation's arguments), for the
   benefit of the client.  After this time, it is not necessary for the
   server to provide this information in related callbacks, since it is
   certain that a race condition can no longer occur.

   The CB_SEQUENCE operation that begins each server callback carries a
   list of "referring" { session ID, slot ID, sequence ID } triples.  If
   the client finds the request corresponding to the referring session
   ID, slot ID, and sequence ID to be currently outstanding (i.e., the
   server's reply has not been seen by the client), it can determine
   that the callback has raced the reply, and act accordingly.  If the
   client does not find the request corresponding to the referring
   triple to be outstanding (including the case of a session ID
   referring to a destroyed session), then there is no race with respect
   to this triple.  The server SHOULD limit the referring triples to
   just those that apply to the objects referred to in the CB_COMPOUND
   procedure.

   The client must not simply wait forever for the expected server reply
   to arrive before responding to the CB_COMPOUND that won the race,
   because it is possible that it will be delayed indefinitely.  The
   client should assume the likely case that the reply will arrive
   within the average round-trip time for COMPOUND requests to the
   server, and wait that period of time.  If that period of time
   expires, it can respond to the CB_COMPOUND with NFS4ERR_DELAY.  There
   are other scenarios under which callbacks may race replies.  Among
   them are pNFS layout recalls as described in Section 12.5.5.2.
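
   A hedged sketch of the client-side check follows.  The structures
   and the in-memory table of outstanding requests are invented; a
   real client would also implement the bounded wait described above
   before answering NFS4ERR_DELAY.

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* A "referring" triple from the CB_SEQUENCE arguments.  The
    * session ID is reduced to a small handle for the sketch. */
   struct referring_call {
       uint32_t sessionid, slotid, seqid;
   };

   /* Requests sent by this client whose replies it has not seen. */
   static const struct referring_call outstanding[] = {
       { 7, 0, 42 },
       { 7, 3, 17 },
   };

   /* If any referring triple names a request still outstanding, the
    * callback has raced the reply; the client then waits roughly one
    * average round trip and answers NFS4ERR_DELAY rather than
    * blocking forever. */
   static bool callback_races_reply(const struct referring_call *refs,
                                    size_t n)
   {
       size_t nout = sizeof(outstanding) / sizeof(outstanding[0]);

       for (size_t i = 0; i < n; i++)
           for (size_t j = 0; j < nout; j++)
               if (refs[i].sessionid == outstanding[j].sessionid &&
                   refs[i].slotid == outstanding[j].slotid &&
                   refs[i].seqid == outstanding[j].seqid)
                   return true;
       return false;
   }

   int main(void)
   {
       struct referring_call refs[] = { { 7, 3, 17 } };

       return callback_races_reply(refs, 1) ? 0 : 1;
   }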

2.10.6.4.  COMPOUND and CB_COMPOUND Construction Issues



   Very large requests and replies may pose both buffer management
   issues (especially with RDMA) and reply cache issues.  When the
   session is created (Section 18.36), for each channel (fore and back),
   the client and server negotiate the maximum-sized request they will
   send or process (ca_maxrequestsize), the maximum-sized reply they
   will return or process (ca_maxresponsesize), and the maximum-sized
   reply they will store in the reply cache (ca_maxresponsesize_cached).

   If a request exceeds ca_maxrequestsize, the reply will have the
   status NFS4ERR_REQ_TOO_BIG.  A replier MAY return NFS4ERR_REQ_TOO_BIG
   as the status for the first operation (SEQUENCE or CB_SEQUENCE) in
   the request (which means that no operations in the request executed
   and that the state of the slot in the reply cache is unchanged), or
   it MAY opt to return it on a subsequent operation in the same
   COMPOUND or CB_COMPOUND request (which means that at least one
   operation did execute and that the state of the slot in the reply
   cache does change).  The replier SHOULD set NFS4ERR_REQ_TOO_BIG on
   the operation that exceeds ca_maxrequestsize.

   If a reply exceeds ca_maxresponsesize, the reply will have the status
   NFS4ERR_REP_TOO_BIG.  A replier MAY return NFS4ERR_REP_TOO_BIG as the
   status for the first operation (SEQUENCE or CB_SEQUENCE) in the
   request, or it MAY opt to return it on a subsequent operation (in the
   same COMPOUND or CB_COMPOUND reply).  A replier MAY return
   NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if
   the response would still exceed ca_maxresponsesize.

   If sa_cachethis or csa_cachethis is TRUE, then the replier MUST cache
   a reply except if an error is returned by the SEQUENCE or CB_SEQUENCE
   operation (see Section 2.10.6.1.2).  If the reply exceeds
   ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis is
   TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE.
   Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that
   matter) is returned on an operation other than the first operation
   (SEQUENCE or CB_SEQUENCE), then the reply MUST be cached if
   sa_cachethis or csa_cachethis is TRUE.  For example, if a COMPOUND
   has eleven operations, including SEQUENCE, the fifth operation is a
   RENAME, and the tenth operation is a READ for one million bytes, the
   server may return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth
   operation.  Since the server executed several operations, especially
   the non-idempotent RENAME, the client's request to cache the reply
   needs to be honored in order for exactly once semantics to operate
   correctly.  If the client retries the request, the server will
   have cached a reply that contains results for ten of the eleven
   requested operations, with the tenth operation having a status of
   NFS4ERR_REP_TOO_BIG_TO_CACHE.
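
   The size checks described in this section reduce to two comparisons
   made as the replier accumulates a reply.  The following sketch uses
   invented names; only the channel attribute names come from the
   protocol.

   #include <stdbool.h>
   #include <stddef.h>

   struct chan_attrs {               /* names from CREATE_SESSION */
       size_t ca_maxresponsesize;
       size_t ca_maxresponsesize_cached;
   };

   enum verdict { REPLY_OK, REPLY_TOO_BIG, REPLY_TOO_BIG_TO_CACHE };

   /* REPLY_TOO_BIG maps to NFS4ERR_REP_TOO_BIG.
    * REPLY_TOO_BIG_TO_CACHE maps to NFS4ERR_REP_TOO_BIG_TO_CACHE;
    * note the reply carrying that error MUST still be cached when
    * cachethis is TRUE. */
   static enum verdict check_reply_size(const struct chan_attrs *ca,
                                        size_t reply_len,
                                        bool cachethis)
   {
       if (reply_len > ca->ca_maxresponsesize)
           return REPLY_TOO_BIG;
       if (cachethis && reply_len > ca->ca_maxresponsesize_cached)
           return REPLY_TOO_BIG_TO_CACHE;
       return REPLY_OK;
   }

   int main(void)
   {
       struct chan_attrs ca = { 1024 * 1024, 4096 };

       return check_reply_size(&ca, 8192, true)
              == REPLY_TOO_BIG_TO_CACHE ? 0 : 1;
   }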

   A client needs to take care that, when sending operations that change
   the current filehandle (except for PUTFH, PUTPUBFH, PUTROOTFH, and
   RESTOREFH), it does not exceed the maximum reply buffer before the
   GETFH operation.  Otherwise, the client will have to retry the
   operation that changed the current filehandle, in order to obtain the
   desired filehandle.  For the OPEN operation (see Section 18.16),
   retry is not always available as an option.  The following guidelines
   for the handling of filehandle-changing operations are advised:

   *  Within the same COMPOUND procedure, a client SHOULD send GETFH
      immediately after a current filehandle-changing operation.  A
      client MUST send GETFH after a current filehandle-changing
      operation that is also non-idempotent (e.g., the OPEN operation),
      unless the operation is RESTOREFH.  RESTOREFH is an exception,
      because even though it is non-idempotent, the filehandle RESTOREFH
      produced originated from an operation that is either idempotent
      (e.g., PUTFH, LOOKUP), or non-idempotent (e.g., OPEN, CREATE).  If
      the origin is non-idempotent, then because the client MUST send
      GETFH after the origin operation, the client can recover if
      RESTOREFH returns an error.

   *  A server MAY return NFS4ERR_REP_TOO_BIG or
      NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
      filehandle-changing operation if the reply would be too large on
      the next operation.

   *  A server SHOULD return NFS4ERR_REP_TOO_BIG or
      NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
      filehandle-changing, non-idempotent operation if the reply would
      be too large on the next operation, especially if the operation is
      OPEN.

   *  A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent
      current filehandle-changing operation, if it looks at the next
      operation (in the same COMPOUND procedure) and finds it is not
      GETFH.  The server SHOULD do this if it is unable to determine in
      advance whether the total response size would exceed
      ca_maxresponsesize_cached or ca_maxresponsesize.

2.10.6.5.  Persistence



   Since the reply cache is bounded, it is practical for the reply cache
   to persist across server restarts.  The replier MUST persist the
   following information if it agreed to persist the session (when the
   session was created; see Section 18.36):

   *  The session ID.

   *  The slot table including the sequence ID and cached reply for each
      slot.

   The above are sufficient for a replier to provide EOS semantics for
   any requests that were sent and executed before the server restarted.
   If the replier is a client, then there is no need for it to persist
   any more information, unless the client will be persisting all other
   state across client restart, in which case, the server will never see
   any NFSv4.1-level protocol manifestation of a client restart.  If the
   replier is a server, with just the slot table and session ID
   persisting, any requests the client retries after the server restart
   will return the results that are cached in the reply cache, and any
   new requests (i.e., the sequence ID is one greater than the slot's
   sequence ID) MUST be rejected with NFS4ERR_DEADSESSION (returned by
   SEQUENCE).  Such a session is considered dead.  A server MAY re-
   animate a session after a server restart so that the session will
   accept new requests as well as retries.  To re-animate a session, the
   server needs to persist additional information through server
   restart:

   *  The client ID.  This is a prerequisite to let the client create
      more sessions associated with the same client ID as the re-
      animated session.

   *  The client ID's sequence ID that is used for creating sessions
      (see Sections 18.35 and 18.36).  This is a prerequisite to let the
      client create more sessions.

   *  The principal that created the client ID.  This allows the server
      to authenticate the client when it sends EXCHANGE_ID.

   *  The SSV, if SP4_SSV state protection was specified when the client
      ID was created (see Section 18.35).  This lets the client create
      new sessions, and associate connections with the new and existing
      sessions.

   *  The properties of the client ID as defined in Section 18.35.
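
   The items above might be gathered into a single record written to
   stable storage.  The following layout is purely illustrative; the
   field sizes for the session ID and client ID follow the protocol's
   XDR, but everything else is an invented simplification.

   #include <stdint.h>

   #define NSLOTS 64   /* example slot count */

   /* Invented layout of what a server could write to stable storage.
    * The sessionid and slot table suffice for EOS on retries; the
    * remaining fields re-animate the session after a restart. */
   struct persisted_slot {
       uint32_t seqid;
       /* ...plus the cached reply bytes for this slot... */
   };

   struct persisted_session {
       unsigned char         sessionid[16];
       struct persisted_slot slots[NSLOTS];

       uint64_t      clientid;
       uint32_t      clientid_seqid;  /* for CREATE_SESSION       */
       char          principal[64];   /* creator of the client ID */
       unsigned char ssv[32];         /* if SP4_SSV was specified */
       /* ...client ID properties from EXCHANGE_ID...             */
   };

   int main(void)
   {
       struct persisted_session s = { { 0 } };

       return (int)s.clientid_seqid;  /* 0: record zero-initialized */
   }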

   A persistent reply cache places certain demands on the server.  The
   execution of the sequence of operations (starting with SEQUENCE) and
   placement of its results in the persistent cache MUST be atomic.  If
   a client retries a sequence of operations that was previously
   executed on the server, the only acceptable outcomes are either the
   original cached reply or an indication that the client ID or session
   has been lost (indicating a catastrophic loss of the reply cache or a
   session that has been deleted because the client failed to use the
   session for an extended period of time).

   A server could fail and restart in the middle of a COMPOUND procedure
   that contains one or more non-idempotent or idempotent-but-modifying
   operations.  This creates an even higher challenge for atomic
   execution and placement of results in the reply cache.  One way to
   view the problem is as a single transaction consisting of each
   operation in the COMPOUND followed by storing the result in
   persistent storage, then finally a transaction commit.  If there is a
   failure before the transaction is committed, then the server rolls
   back the transaction.  If the server itself fails, then when it
   restarts, its recovery logic could roll back the transaction before
   starting the NFSv4.1 server.

   While the description of the implementation for atomic execution of
   the request and caching of the reply is beyond the scope of this
   document, an example implementation for NFSv2 [45] is described in
   [46].

2.10.7.  RDMA Considerations



   A complete discussion of the operation of RPC-based protocols over
   RDMA transports is in [32].  A discussion of the operation of NFSv4,
   including NFSv4.1, over RDMA is in [33].  Where RDMA is considered,
   this specification assumes the use of such a layering; it addresses
   only the upper-layer issues relevant to making best use of RPC/RDMA.

2.10.7.1.  RDMA Connection Resources



   RDMA requires its consumers to register memory and post buffers of a
   specific size and number for receive operations.

   Registration of memory can be a relatively high-overhead operation,
   since it requires pinning of buffers, assignment of attributes (e.g.,
   readable/writable), and initialization of hardware translation.
   Preregistration is desirable to reduce overhead.  These registrations
   are specific to hardware interfaces and even to RDMA connection
   endpoints; therefore, negotiation of their limits is desirable to
   manage resources effectively.

   Following basic registration, these buffers must be posted by the RPC
   layer to handle receives.  These buffers remain in use by the RPC/
   NFSv4.1 implementation; the size and number of them must be known to
   the remote peer in order to avoid RDMA errors that would cause a
   fatal error on the RDMA connection.

   NFSv4.1 manages slots as resources on a per-session basis (see
   Section 2.10), while RDMA connections manage credits on a per-
   connection basis.  This means that in order for a peer to send data
   over RDMA to a remote buffer, it has to have both an NFSv4.1 slot and
   an RDMA credit.  If multiple RDMA connections are associated with a
   session, then if the total number of credits across all RDMA
   connections associated with the session is X, and the number of slots
   in the session is Y, then the maximum number of outstanding requests
   is the lesser of X and Y.
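
   In other words, the effective pipeline depth is the minimum of the
   two resources, as in this small sketch (function name invented):

   #include <stdint.h>

   /* The effective pipeline depth over RDMA is bounded by both the
    * session's slots and the aggregate RDMA credits of the session's
    * connections. */
   static uint32_t max_outstanding(uint32_t rdma_credits_total,
                                   uint32_t session_slots)
   {
       return rdma_credits_total < session_slots ? rdma_credits_total
                                                 : session_slots;
   }

   int main(void)
   {
       /* Two connections granting 10 credits each, 16 slots. */
       return max_outstanding(10 + 10, 16) == 16 ? 0 : 1;
   }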

2.10.7.2.  Flow Control



   Previous versions of NFS do not provide flow control; instead, they
   rely on the windowing provided by transports like TCP to throttle
   requests.  This does not work with RDMA, which provides no operation
   flow control and will terminate a connection in error when limits are
   exceeded.  Limits such as maximum number of requests outstanding are
   therefore negotiated when a session is created (see the
   ca_maxrequests field in Section 18.36).  These limits then provide
   the maxima within which each connection associated with the session's
   channel(s) must remain.  RDMA connections are managed within these
   limits as described in Section 3.3 of [32]; if there are multiple
   RDMA connections, then the maximum number of requests for a channel
   will be divided among the RDMA connections.  Put a different way, the
   onus is on the replier to ensure that the total number of RDMA
   credits across all connections associated with the replier's channel
   does not exceed the channel's maximum number of outstanding
   requests.

   The limits may also be modified dynamically at the replier's choosing
   by manipulating certain parameters present in each NFSv4.1 reply.  In
   addition, the CB_RECALL_SLOT callback operation (see Section 20.8)
   can be sent by a server to a client to return RDMA credits to the
   server, thereby lowering the maximum number of requests a client can
   have outstanding to the server.

2.10.7.3.  Padding



   Header padding is requested by each peer at session initiation (see
   the ca_headerpadsize argument to CREATE_SESSION in Section 18.36),
   and subsequently used by the RPC RDMA layer, as described in [32].
   Zero padding is permitted.

   Padding leverages the useful property that RDMA transfers preserve
   the alignment of data, even when the data are placed into anonymous
   (untagged) buffers.
   If requested, client inline writes will insert appropriate pad bytes
   within the request header to align the data payload on the specified
   boundary.  The client is encouraged to add sufficient padding (up to
   the negotiated size) so that the "data" field of the WRITE operation
   is aligned.  Most servers can make good use of such padding, which
   allows them to chain receive buffers in such a way that any data
   carried by client requests will be placed into appropriate buffers at
   the server, ready for file system processing.  The receiver's RPC
   layer encounters no overhead from skipping over pad bytes, and the
   RDMA layer's high performance makes the insertion and transmission of
   padding on the sender a significant optimization.  In this way, the
   need for servers to perform RDMA Read to satisfy all but the largest
   client writes is obviated.  An added benefit is the reduction of
   message round trips on the network -- a potentially good trade, where
   latency is present.

   The value to choose for padding is subject to a number of criteria.
   A primary source of variable-length data in the RPC header is the
   authentication information, the form of which is client-determined,
   possibly in response to server specification.  The contents of
   COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
   go into the determination of a maximal NFSv4.1 request size and
   therefore minimal buffer size.  The client must select its offered
   value carefully, so as to avoid overburdening the server, and vice
   versa.  The benefit of an appropriate padding value is higher
   performance.

                    Sender gather:
        |RPC Request|Pad  bytes|Length| -> |User data...|
        \------+----------------------/      \
                \                             \
                 \    Receiver scatter:        \-----------+- ...
            /-----+----------------\            \           \
            |RPC Request|Pad|Length|   ->  |FS buffer|->|FS buffer|->...

   In the above case, the server may recycle buffers left unused by the
   actual received request to the next posted receive, or may pass the
   now-complete buffers by reference for normal write processing.
   For a server that can make use of it, this removes any need for data
   copies of incoming data, without resorting to complicated end-to-end
   buffer advertisement and management.  This includes most kernel-based
   and integrated server designs, among many others.  The client may
   perform similar optimizations, if desired.
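
   The pad computation itself is simple.  In the sketch below, the
   alignment value plays the role of the negotiated ca_headerpadsize;
   the function name is invented.

   #include <stddef.h>

   /* Pad bytes needed so that a payload starting 'offset' bytes into
    * the message lands on an 'align'-byte boundary; 'align' must be
    * a power of two. */
   static size_t pad_for_alignment(size_t offset, size_t align)
   {
       return (align - (offset & (align - 1))) & (align - 1);
   }

   int main(void)
   {
       /* e.g., a 148-byte RPC header, data aligned to 256 bytes */
       return pad_for_alignment(148, 256) == 108 ? 0 : 1;
   }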

2.10.7.4.  Dual RDMA and Non-RDMA Transports



   Some RDMA transports (e.g., RFC 5040 [8]) permit a "streaming" (non-
   RDMA) phase, where ordinary traffic might flow before "stepping up"
   to RDMA mode, commencing RDMA traffic.  Some RDMA transports start
   connections always in RDMA mode.  NFSv4.1 allows, but does not
   assume, a streaming phase before RDMA mode.  When a connection is
   associated with a session, the client and server negotiate whether
   the connection is used in RDMA or non-RDMA mode (see Sections 18.36
   and 18.34).

2.10.8.  Session Security



2.10.8.1.  Session Callback Security



   Via session/connection association, NFSv4.1 improves security over
   that provided by NFSv4.0 for the backchannel.  The connection is
   client-initiated (see Section 18.34) and subject to the same firewall
   and routing checks as the fore channel.  At the client's option (see
   Section 18.35), connection association is fully authenticated before
   being activated (see Section 18.34).  Traffic from the server over
   the backchannel is authenticated exactly as the client specifies (see
   Section 2.10.8.2).

2.10.8.2.  Backchannel RPC Security



   When the NFSv4.1 client establishes the backchannel, it informs the
   server of the security flavors and principals to use when sending
   requests.  If the security flavor is RPCSEC_GSS, the client expresses
   the principal in the form of an established RPCSEC_GSS context.  The
   server is free to use any of the flavor/principal combinations the
   client offers, but it MUST NOT use unoffered combinations.  This way,
   the client need not provide a target GSS principal for the
   backchannel as it did with NFSv4.0, nor does the server have to
   implement an RPCSEC_GSS initiator as it did with NFSv4.0 [37].

   The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL
   (Section 18.33) operations allow the client to specify flavor/
   principal combinations.

   Also note that the SP4_SSV state protection mode (see Sections 18.35
   and 2.10.8.3) has the side benefit of providing SSV-derived
   RPCSEC_GSS contexts (Section 2.10.9).

2.10.8.3.  Protection from Unauthorized State Changes



   As described to this point in the specification, the state model of
   NFSv4.1 is vulnerable to an attacker that sends a SEQUENCE operation
   with a forged session ID and with a slot ID that it expects the
   legitimate client to use next.  When the legitimate client uses the
   slot ID with the same sequence number, the server returns the
   attacker's result from the reply cache, which disrupts the legitimate
   client and thus denies service to it.  Similarly, an attacker could
   send a CREATE_SESSION with a forged client ID to create a new session
   associated with the client ID.  The attacker could send requests
   using the new session that change locking state, such as LOCKU
   operations to release locks the legitimate client has acquired.
   Setting a security policy on the file that requires RPCSEC_GSS
   credentials when manipulating the file's state is one potential
   workaround, but it has the disadvantage of preventing a legitimate
   client from releasing state when RPCSEC_GSS is required to do so but
   a GSS context cannot be obtained (possibly because the user has
   logged off the client).

   NFSv4.1 provides three options to a client for state protection,
   which are specified when a client creates a client ID via EXCHANGE_ID
   (Section 18.35).

   The first (SP4_NONE) is to simply waive state protection.

   The other two options (SP4_MACH_CRED and SP4_SSV) share several
   traits:

   *  An RPCSEC_GSS-based credential is used to authenticate client ID
      and session maintenance operations, including creating and
      destroying a session, associating a connection with the session,
      and destroying the client ID.

   *  Because RPCSEC_GSS is used to authenticate client ID and session
      maintenance, the attacker cannot associate a rogue connection with
      a legitimate session, or associate a rogue session with a
      legitimate client ID in order to maliciously alter the client ID's
      lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc.

   *  In cases where the server's security policies on a portion of its
      namespace require RPCSEC_GSS authentication, a client may have to
      use an RPCSEC_GSS credential to remove per-file state (e.g.,
      LOCKU, CLOSE, etc.).  The server may require that the principal
      that removes the state match certain criteria (e.g., the principal
      might have to be the same as the one that acquired the state).
      However, the client might not have an RPCSEC_GSS context for such
      a principal, and might not be able to create such a context
      (perhaps because the user has logged off).  When the client
      establishes SP4_MACH_CRED or SP4_SSV protection, it can specify a
      list of operations that the server MUST allow using the machine
      credential (if SP4_MACH_CRED is used) or the SSV credential (if
      SP4_SSV is used).

   The SP4_MACH_CRED state protection option uses a machine credential
   where the principal that creates the client ID MUST also be the
   principal that performs client ID and session maintenance operations.
   The security of the machine credential state protection approach
   depends entirely on safeguarding the per-machine credential.
   Assuming a proper safeguard, using the per-machine credential for
   operations like CREATE_SESSION, BIND_CONN_TO_SESSION,
   DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from
   associating a rogue connection with a session, or associating a rogue
   session with a client ID.

   There are at least three scenarios for the SP4_MACH_CRED option:

   1.  The system administrator configures a unique, permanent per-
       machine credential for one of the mandated GSS mechanisms (e.g.,
       if Kerberos V5 is used, a "keytab" containing a principal derived
       from a client host name could be used).

   2.  The client is used by a single user, and so the client ID and its
       sessions are used by just that user.  If the user's credential
       expires, then session and client ID maintenance cannot occur, but
       since the client has a single user, only that user is
       inconvenienced.

   3.  The physical client has multiple users, but the client
       implementation has a unique client ID for each user.  This is
       effectively the same as the second scenario, but a disadvantage
       is that each user needs to be allocated at least one session, so
       the approach suffers from a lack of economy.

   The SP4_SSV protection option uses the SSV (Section 1.7), via
   RPCSEC_GSS and the SSV GSS mechanism (Section 2.10.9), to protect
   state from attack.  The SP4_SSV protection option is intended for the
   situation comprised of a client that has multiple active users and a
   system administrator who wants to avoid the burden of installing a
   permanent machine credential on each client.  The SSV is established
   and updated on the server via SET_SSV (see Section 18.47).  To
   prevent eavesdropping, a client SHOULD send SET_SSV via RPCSEC_GSS
   with the privacy service.  Several aspects of the SSV make it
   intractable for an attacker to guess the SSV, and thus associate
   rogue connections with a session, and rogue sessions with a client
   ID:

   *  The arguments to and results of SET_SSV include digests of the old
      and new SSV, respectively.

   *  Because the initial value of the SSV is zero and therefore known,
      the
      client that opts for SP4_SSV protection and opts to apply SP4_SSV
      protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST send at
      least one SET_SSV operation before the first BIND_CONN_TO_SESSION
      operation or before the second CREATE_SESSION operation on a
      client ID.  If it does not, the SSV mechanism will not generate
      tokens (Section 2.10.9).  A client SHOULD send SET_SSV as soon as
      a session is created.

   *  A SET_SSV request does not replace the SSV with the argument to
      SET_SSV.  Instead, the current SSV on the server is logically
      exclusive ORed (XORed) with the argument to SET_SSV (see the
      sketch following this list).  Each time a new principal uses the
      client ID for the first time, the client SHOULD send a SET_SSV
      with that principal's RPCSEC_GSS credentials, with the RPCSEC_GSS
      service set to RPC_GSS_SVC_PRIVACY.
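
   The XOR update can be illustrated with a minimal, non-normative C
   sketch; it simply shows that the server folds the SET_SSV argument
   into the current SSV rather than overwriting it:

   /* Fold a SET_SSV argument into the current SSV by XOR. */
   #include <stddef.h>
   #include <stdint.h>

   void apply_set_ssv(uint8_t *ssv, const uint8_t *arg, size_t len)
   {
           for (size_t i = 0; i < len; i++)
                   ssv[i] ^= arg[i];  /* current SSV XOR argument */
   }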

   Here are the types of attacks that can be attempted by an attacker
   named Eve on a victim named Bob, and how SP4_SSV protection foils
   each attack:

   *  Suppose Eve is the first user to log into a legitimate client.
      Eve's use of an NFSv4.1 file system will cause the legitimate
      client to create a client ID with SP4_SSV protection, specifying
      that the BIND_CONN_TO_SESSION operation MUST use the SSV
      credential.  Eve's use of the file system also causes an SSV to be
      created.  The SET_SSV operation that creates the SSV will be
      protected by the RPCSEC_GSS context created by the legitimate
      client, which uses Eve's GSS principal and credentials.  Eve can
      eavesdrop on the network while her RPCSEC_GSS context is created
      and the SET_SSV using her context is sent.  Even if the legitimate
      client sends the SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve
      knows her own credentials, she can decrypt the SSV.  Eve can
      compute an RPCSEC_GSS credential that BIND_CONN_TO_SESSION will
      accept, and so associate a new connection with the legitimate
      session.  Eve can change the slot ID and sequence state of a
      legitimate session, and/or the SSV state, in such a way that when
      Bob accesses the server via the same legitimate client, the
      legitimate client will be unable to use the session.

      The client's only recourse is to create a new client ID for Bob to
      use, and establish a new SSV for the client ID.  The client will
      be unable to delete the old client ID, and will let the lease on
      the old client ID expire.

      Once the legitimate client establishes an SSV over the new session
      using Bob's RPCSEC_GSS context, Eve can use the new session via
      the legitimate client, but she cannot disrupt Bob.  Moreover,
      because the client SHOULD have modified the SSV due to Eve using
      the new session, Bob cannot get revenge on Eve by associating a
      rogue connection with the session.

      How does the legitimate client detect that Eve has hijacked the
      old session?  When the client detects that a new principal, Bob,
      wants to use the session, it SHOULD send a SET_SSV, which leads
      to the following sub-scenarios:

      -  Let us suppose that from the rogue connection, Eve sent a
         SET_SSV with the same slot ID and sequence ID that the
         legitimate client later uses.  The server will assume the
         SET_SSV sent with Bob's credentials is a retry, and return to
         the legitimate client the reply it sent Eve.  However, unless
         Eve can correctly guess the SSV the legitimate client will use,
         the digest verification checks in the SET_SSV response will
         fail.  That is an indication to the client that the session has
         apparently been hijacked.

      -  Alternatively, Eve sent a SET_SSV with a different slot ID than
         the legitimate client uses for its SET_SSV.  Then the digest
         verification of the SET_SSV sent with Bob's credentials fails
         on the server, and the error returned to the client makes it
         apparent that the session has been hijacked.

      -  Alternatively, Eve sent an operation other than SET_SSV, but
         with the same slot ID and sequence that the legitimate client
         uses for its SET_SSV.  The server returns to the legitimate
         client the response it sent Eve.  The client sees that the
         response is not at all what it expects.  The client assumes
         either session hijacking or a server bug, and either way
         destroys the old session.

   *  Eve associates a rogue connection with the session as above, and
      then destroys the session.  Again, Bob goes to use the server from
      the legitimate client, which sends a SET_SSV using Bob's
      credentials.  The client receives an error that indicates that the
      session does not exist.  When the client tries to create a new
      session, this will fail because the SSV it has does not match that
      which the server has, and now the client knows the session was
      hijacked.  The legitimate client establishes a new client ID.

   *  If Eve creates a connection before the legitimate client
      establishes an SSV, because the initial value of the SSV is zero
      and therefore known, Eve can send a SET_SSV that will pass the
      digest verification check.  However, because the new connection
      has not been associated with the session, the SET_SSV is rejected
      for that reason.

   In summary, an attacker's disruption of state when SP4_SSV protection
   is in use is limited to the formative period of a client ID, its
   first session, and the establishment of the SSV.  Once a non-
   malicious user uses the client ID, the client quickly detects any
   hijack and rectifies the situation.  Once a non-malicious user
   successfully modifies the SSV, the attacker cannot use NFSv4.1
   operations to disrupt the non-malicious user.

   Note that neither the SP4_MACH_CRED nor SP4_SSV protection approaches
   prevent hijacking of a transport connection that has previously been
   associated with a session.  If the goal of a counter-threat strategy
   is to prevent connection hijacking, the use of IPsec is RECOMMENDED.

   If a connection hijack occurs, the hijacker could in theory change
   locking state and negatively impact the service to legitimate
   clients.  However, if the server is configured to require the use of
   RPCSEC_GSS with integrity or privacy on the affected file objects,
   and if EXCHGID4_FLAG_BIND_PRINC_STATEID capability (Section 18.35) is
   in force, this will thwart unauthorized attempts to change locking
   state.

2.10.9.  The Secret State Verifier (SSV) GSS Mechanism



   The SSV provides the secret key for a GSS mechanism internal to
   NFSv4.1 that NFSv4.1 uses for state protection.  Contexts for this
   mechanism are not established via the RPCSEC_GSS protocol.  Instead,
   the contexts are automatically created when EXCHANGE_ID specifies
   SP4_SSV protection.  The only tokens defined are the PerMsgToken
   (emitted by GSS_GetMIC) and the SealedMessage token (emitted by
   GSS_Wrap).

   The mechanism OID for the SSV mechanism is
   iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech
   (1.3.6.1.4.1.28882.1.1).  While the SSV mechanism does not define any
   initial context tokens, the OID can be used to let servers indicate
   that the SSV mechanism is acceptable whenever the client sends a
   SECINFO or SECINFO_NO_NAME operation (see Section 2.6).

   The SSV mechanism defines four subkeys derived from the SSV value.
   Each time SET_SSV is invoked, the subkeys are recalculated by the
   client and server.  The calculation of each of the four subkeys
   depends on each of the four respective ssv_subkey4 enumerated values.
   The calculation uses the HMAC [52] algorithm, using the current SSV
   as the key, the one-way hash algorithm as negotiated by EXCHANGE_ID,
   and the input text as represented by the XDR encoded enumeration
   value for that subkey of data type ssv_subkey4.  If the length of
   the output of the HMAC algorithm exceeds the key length of the
   encryption algorithm (which is also negotiated by EXCHANGE_ID), then
   the subkey MUST be truncated from the HMAC output; i.e., if the
   subkey is N bytes long, then the first N bytes of the HMAC output
   MUST be used for the subkey.  The specification of EXCHANGE_ID states
   that the length of the output of the HMAC algorithm MUST NOT be less
   than the length of subkey needed for the encryption algorithm (see
   Section 18.35).

   /* Input for computing subkeys */
   enum ssv_subkey4 {
           SSV4_SUBKEY_MIC_I2T     = 1,
           SSV4_SUBKEY_MIC_T2I     = 2,
           SSV4_SUBKEY_SEAL_I2T    = 3,
           SSV4_SUBKEY_SEAL_T2I    = 4
   };

   The subkey derived from SSV4_SUBKEY_MIC_I2T is used for calculating
   message integrity codes (MICs) that originate from the NFSv4.1
   client, whether as part of a request over the fore channel or a
   response over the backchannel.  The subkey derived from
   SSV4_SUBKEY_MIC_T2I is used for MICs originating from the NFSv4.1
   server.  The subkey derived from SSV4_SUBKEY_SEAL_I2T is used for
   encryption text originating from the NFSv4.1 client, and the subkey
   derived from SSV4_SUBKEY_SEAL_T2I is used for encryption text
   originating from the NFSv4.1 server.
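
   The derivation can be sketched in C.  The non-normative sketch
   below assumes SHA-256 was negotiated as the one-way hash algorithm
   and uses OpenSSL's one-shot HMAC(); the input text is the 4-byte
   XDR encoding of the ssv_subkey4 enumerated value, and the result is
   truncated to the negotiated key length:

   /* Derive one SSV subkey: HMAC(SSV, XDR(enum)), truncated. */
   #include <stdint.h>
   #include <string.h>
   #include <openssl/evp.h>
   #include <openssl/hmac.h>

   void derive_subkey(const uint8_t *ssv, size_t ssv_len,
                      uint32_t subkey,  /* e.g., SSV4_SUBKEY_MIC_I2T */
                      size_t key_len,   /* encryption key length N   */
                      uint8_t *out)     /* receives key_len bytes    */
   {
           /* XDR encodes the enum as a 4-byte big-endian word. */
           uint8_t text[4] = {
                   (uint8_t)(subkey >> 24), (uint8_t)(subkey >> 16),
                   (uint8_t)(subkey >> 8),  (uint8_t)subkey
           };
           uint8_t md[EVP_MAX_MD_SIZE];
           unsigned int md_len = 0;

           HMAC(EVP_sha256(), ssv, (int)ssv_len, text, sizeof(text),
                md, &md_len);

           /* EXCHANGE_ID guarantees md_len >= key_len; take the
            * first N bytes as the subkey. */
           memcpy(out, md, key_len);
   }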

   The PerMsgToken description is based on an XDR definition:

   /* Input for computing smt_hmac */
   struct ssv_mic_plain_tkn4 {
     uint32_t        smpt_ssv_seq;
     opaque          smpt_orig_plain<>;
   };

   /* SSV GSS PerMsgToken token */
   struct ssv_mic_tkn4 {
     uint32_t        smt_ssv_seq;
     opaque          smt_hmac<>;
   };

   The field smt_hmac is an HMAC calculated by using the subkey derived
   from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I as the key, the one-
   way hash algorithm as negotiated by EXCHANGE_ID, and the input text
   as represented by data of type ssv_mic_plain_tkn4.  The field
   smpt_ssv_seq is the same as smt_ssv_seq.  The field smpt_orig_plain
   is the "message" input passed to GSS_GetMIC() (see Section 2.3.1 of
   [7]).  The caller of GSS_GetMIC() provides a pointer to a buffer
   containing the plain text.  The SSV mechanism's entry point for
   GSS_GetMIC() encodes this into an opaque array, and the encoding will
   include an initial four-byte length, plus any necessary padding.
   Prepended to this will be the XDR encoded value of smpt_ssv_seq, thus
   making up an XDR encoding of a value of data type ssv_mic_plain_tkn4,
   which in turn is the input into the HMAC.

   The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type
   ssv_mic_tkn4.  The field smt_ssv_seq comes from the SSV sequence
   number, which is equal to one after SET_SSV (Section 18.47) is called
   the first time on a client ID.  Thereafter, the SSV sequence number
   is incremented on each SET_SSV.  Thus, smt_ssv_seq represents the
   version of the SSV at the time GSS_GetMIC() was called.  As noted in
   Section 18.35, the client and server can maintain multiple concurrent
   versions of the SSV.  This allows the SSV to be changed without
   serializing all RPC calls that use the SSV mechanism with SET_SSV
   operations.  Once the HMAC is calculated, it is XDR encoded into
   smt_hmac, which will include an initial four-byte length, and any
   necessary padding.  Prepended to this will be the XDR encoded value
   of smt_ssv_seq.
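
   As a non-normative illustration, the HMAC input for a PerMsgToken
   can be assembled as follows; put32() is a local helper that writes
   a 4-byte big-endian (XDR) word, and the output buffer is assumed to
   be large enough:

   /* Build the XDR-encoded ssv_mic_plain_tkn4 used as HMAC input. */
   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   static size_t put32(uint8_t *p, uint32_t v)
   {
           p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
           p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
           return 4;
   }

   size_t build_mic_input(uint32_t ssv_seq, const uint8_t *plain,
                          size_t plain_len, uint8_t *out)
   {
           size_t off = 0;
           off += put32(out + off, ssv_seq);        /* smpt_ssv_seq */
           off += put32(out + off, (uint32_t)plain_len);
           memcpy(out + off, plain, plain_len);  /* smpt_orig_plain */
           off += plain_len;
           while (off % 4)                          /* XDR padding  */
                   out[off++] = 0;
           return off;    /* length of the buffer to be HMACed */
   }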

   The SealedMessage description is based on an XDR definition:

   /* Input for computing ssct_encr_data and ssct_hmac */
   struct ssv_seal_plain_tkn4 {
     opaque          sspt_confounder<>;
     uint32_t        sspt_ssv_seq;
     opaque          sspt_orig_plain<>;
     opaque          sspt_pad<>;
   };

   /* SSV GSS SealedMessage token */
   struct ssv_seal_cipher_tkn4 {
     uint32_t      ssct_ssv_seq;
     opaque        ssct_iv<>;
     opaque        ssct_encr_data<>;
     opaque        ssct_hmac<>;
   };

   The token emitted by GSS_Wrap() is XDR encoded and of XDR data type
   ssv_seal_cipher_tkn4.

   The ssct_ssv_seq field has the same meaning as smt_ssv_seq.

   The ssct_encr_data field is the result of encrypting a value of the
   XDR encoded data type ssv_seal_plain_tkn4.  The encryption key is the
   subkey derived from SSV4_SUBKEY_SEAL_I2T or SSV4_SUBKEY_SEAL_T2I, and
   the encryption algorithm is that negotiated by EXCHANGE_ID.

   The ssct_iv field is the initialization vector (IV) for the
   encryption algorithm (if applicable) and is sent in clear text.  The
   content and size of the IV MUST comply with the specification of the
   encryption algorithm.  For example, the id-aes256-CBC algorithm MUST
   use a 16-byte initialization vector (IV), which MUST be unpredictable
   for each instance of a value of data type ssv_seal_plain_tkn4 that is
   encrypted with a particular SSV key.

   The ssct_hmac field is the result of computing an HMAC using the
   value of the XDR encoded data type ssv_seal_plain_tkn4 as the input
   text.  The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or
   SSV4_SUBKEY_MIC_T2I, and the one-way hash algorithm is that
   negotiated by EXCHANGE_ID.

   The sspt_confounder field is a random value.

   The sspt_ssv_seq field is the same as ssct_ssv_seq.

   The sspt_orig_plain field is the original plaintext and is the
   "input_message" input passed to GSS_Wrap() (see Section 2.3.3 of
   [7]).  As with the handling of the plaintext by the SSV mechanism's
   GSS_GetMIC() entry point, the entry point for GSS_Wrap() expects a
   pointer to the plaintext, and will XDR encode an opaque array into
   sspt_orig_plain representing the plain text, along with the other
   fields of an instance of data type ssv_seal_plain_tkn4.

   The sspt_pad field is present to support encryption algorithms that
   require inputs to be in fixed-sized blocks.  The content of sspt_pad
   is zero filled except for the length.  Beware that the XDR encoding
   of ssv_seal_plain_tkn4 contains three variable-length arrays, and so
   each array consumes four bytes for an array length, and each array
   that follows the length is always padded to a multiple of four bytes
   per the XDR standard.

   For example, suppose the encryption algorithm uses 16-byte blocks,
   and the sspt_confounder is three bytes long, and the sspt_orig_plain
   field is 15 bytes long.  The XDR encoding of sspt_confounder uses
   eight bytes (4 + 3 + 1-byte pad), the XDR encoding of sspt_ssv_seq
   uses four bytes, the XDR encoding of sspt_orig_plain uses 20 bytes (4
   + 15 + 1-byte pad), and the smallest XDR encoding of the sspt_pad
   field is four bytes.  This totals 36 bytes.  The next multiple of 16
   is 48; thus, the length field of sspt_pad needs to be set to 12
   bytes, or a total encoding of 16 bytes.  The total number of XDR
   encoded bytes is thus 8 + 4 + 20 + 16 = 48.
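
   The arithmetic generalizes to the following non-normative sketch,
   which assumes the cipher block size is a multiple of four (true of
   common block ciphers such as AES):

   /* Length for the sspt_pad array so the encoding fills blocks. */
   #include <stddef.h>

   static size_t xdr_opaque_size(size_t n)
   {
           /* 4-byte length word plus data padded to 4 bytes */
           return 4 + ((n + 3) & ~(size_t)3);
   }

   size_t sspt_pad_len(size_t confounder_len, size_t plain_len,
                       size_t block_size)
   {
           size_t total = xdr_opaque_size(confounder_len)
                        + 4                         /* sspt_ssv_seq */
                        + xdr_opaque_size(plain_len)
                        + 4;                /* sspt_pad length word */
           size_t rem = total % block_size;
           return rem ? block_size - rem : 0;
   }

   With the example above, sspt_pad_len(3, 15, 16) returns 12, and the
   sspt_pad encoding occupies 4 + 12 = 16 bytes, for a 48-byte total.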

   GSS_Wrap() emits a token that is an XDR encoding of a value of data
   type ssv_seal_cipher_tkn4.  Note that regardless of whether or not
   the caller of GSS_Wrap() requests confidentiality, the token always
   has confidentiality.  This is because the SSV mechanism is for
   RPCSEC_GSS, and RPCSEC_GSS never produces GSS_Wrap() tokens without
   confidentiality.
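
   For the encryption step itself, a minimal sketch using OpenSSL
   follows.  It assumes id-aes256-CBC was negotiated; because the XDR
   encoding of ssv_seal_plain_tkn4 is already block-aligned via
   sspt_pad, cipher-level padding is disabled.  Error handling is
   omitted for brevity:

   /* Encrypt the XDR-encoded ssv_seal_plain_tkn4 into
    * ssct_encr_data. */
   #include <openssl/evp.h>

   int encrypt_seal_body(const unsigned char key[32],
                         const unsigned char iv[16],
                         const unsigned char *plain_xdr, int len,
                         unsigned char *cipher_out)
   {
           EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
           int n = 0, total = 0;

           EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv);
           EVP_CIPHER_CTX_set_padding(ctx, 0); /* sspt_pad suffices */
           EVP_EncryptUpdate(ctx, cipher_out, &n, plain_xdr, len);
           total = n;
           EVP_EncryptFinal_ex(ctx, cipher_out + total, &n);
           total += n;
           EVP_CIPHER_CTX_free(ctx);
           return total;          /* length of ssct_encr_data */
   }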

   There is one SSV per client ID.  There is a single GSS context for a
   client ID / SSV pair.  All SSV mechanism RPCSEC_GSS handles of a
   client ID / SSV pair share the same GSS context.  SSV GSS contexts do
   not expire except when the SSV is destroyed (causes would include the
   client ID being destroyed or a server restart).  Since one purpose
   of context expiration is to replace keys that have been in use for
   "too long", and hence are vulnerable to compromise by brute force or
   accident, the client can replace the SSV key by sending periodic
   SET_SSV operations, cycling through different users' RPCSEC_GSS
   credentials as it does so.  This way, the SSV is replaced without
   destroying the SSV's GSS contexts.

   SSV RPCSEC_GSS handles can be expired or deleted by the server at any
   time, and the EXCHANGE_ID operation can be used to create more SSV
   RPCSEC_GSS handles.  Expiration of SSV RPCSEC_GSS handles does not
   imply that the SSV or its GSS context has expired.

   The client MUST establish an SSV via SET_SSV before the SSV GSS
   context can be used to emit tokens from GSS_Wrap() and GSS_GetMIC().
   If SET_SSV has not been successfully called, attempts to emit tokens
   MUST fail.

   The SSV mechanism does not support replay detection and sequencing in
   its tokens because RPCSEC_GSS does not use those features (see
   "Context Creation Requests", Section 5.2.2 of [4]).  However,
   Section 2.10.10 discusses special considerations for the SSV
   mechanism when used with RPCSEC_GSS.

2.10.10.  Security Considerations for RPCSEC_GSS When Using the SSV
          Mechanism



   When a client ID is created with SP4_SSV state protection (see
   Section 18.35), the client is permitted to associate multiple
   RPCSEC_GSS handles with the single SSV GSS context (see
   Section 2.10.9).  Because of the way RPCSEC_GSS (both version 1 and
   version 2; see [4] and [9]) calculates the verifier of the reply,
   special care must be taken by the implementation of the NFSv4.1
   client to prevent attacks by a man-in-the-middle.  The verifier of an
   RPCSEC_GSS reply is the output of GSS_GetMIC() applied to the input
   value of the seq_num field of the RPCSEC_GSS credential (data type
   rpc_gss_cred_ver_1_t) (see Section 5.3.3.2 of [4]).  If multiple
   RPCSEC_GSS handles share the same GSS context, then if one handle is
   used to send a request with the same seq_num value as another handle,
   an attacker could block the reply, and replace it with the verifier
   used for the other handle.

   There are multiple ways to prevent the attack on the SSV RPCSEC_GSS
   verifier in the reply.  The simplest is believed to be as follows.

   *  Each time one or more new SSV RPCSEC_GSS handles are created via
      EXCHANGE_ID, the client SHOULD send a SET_SSV operation to modify
      the SSV.  By changing the SSV, the new handles will not result in
      the re-use of an SSV RPCSEC_GSS verifier in a reply.

   *  When a requester decides to use N SSV RPCSEC_GSS handles, it
      SHOULD assign a unique and non-overlapping range of seq_nums to
      each SSV RPCSEC_GSS handle (a sketch of such a partition follows
      this list).  The size of each range SHOULD be equal to MAXSEQ / N
      (see Section 5 of [4] for the definition of MAXSEQ).  When an SSV
      RPCSEC_GSS handle reaches its maximum, it SHOULD force the
      replier to destroy the handle by sending a NULL RPC request with
      seq_num set to MAXSEQ + 1 (see Section 5.3.3.3 of [4]).

   *  When the requester wants to increase or decrease N, it SHOULD
      force the replier to destroy all N handles by sending a NULL RPC
      request on each handle with seq_num set to MAXSEQ + 1.  If the
      requester is the client, it SHOULD send a SET_SSV operation before
      using new handles.  If the requester is the server, then the
      client SHOULD send a SET_SSV operation when it detects that the
      server has forced it to destroy a backchannel's SSV RPCSEC_GSS
      handle.  By sending a SET_SSV operation, the SSV will change, and
      so the attacker will be unable to successfully replay a previous
      verifier in a reply to the requester.
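
   The seq_num partitioning in the second strategy can be sketched as
   follows; MAXSEQ is taken from RFC 2203, the types are illustrative,
   and n is assumed to be at least one:

   /* Assign handle i (0 <= i < n) a disjoint seq_num range. */
   #include <stdint.h>

   #define MAXSEQ 0x80000000u          /* per Section 5 of [4] */

   struct seq_range { uint32_t lo; uint32_t hi; };

   struct seq_range handle_range(uint32_t n, uint32_t i)
   {
           uint32_t span = MAXSEQ / n;
           struct seq_range r;
           r.lo = i * span;        /* first seq_num for handle i  */
           r.hi = r.lo + span - 1; /* last before the handle must */
                                   /* be retired via MAXSEQ + 1   */
           return r;
   }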

   Note that if the replier carefully creates the SSV RPCSEC_GSS
   handles, the related risk of a man-in-the-middle splicing a forged
   SSV RPCSEC_GSS credential with a verifier for another handle does not
   exist.  This is because the verifier in an RPCSEC_GSS request is
   computed from input that includes both the RPCSEC_GSS handle and
   seq_num (see Section 5.3.1 of [4]).  Provided the replier takes care
   to avoid re-using the value of an RPCSEC_GSS handle that it creates,
   such as by including a generation number in the handle, the man-in-
   the-middle will not be able to successfully replay a previous
   verifier in the request to a replier.

2.10.11.  Session Mechanics - Steady State



2.10.11.1.  Obligations of the Server



   The server has the primary obligation to monitor the state of
   backchannel resources that the client has created for the server
   (RPCSEC_GSS contexts and backchannel connections).  If these
   resources vanish, the server takes action as specified in
   Section 2.10.13.2.

2.10.11.2.  Obligations of the Client



   The client SHOULD honor the following obligations in order to utilize
   the session:

   *  Keep a necessary session from going idle on the server.  A client
      that requires a session but nonetheless is not sending operations
      risks having the session be destroyed by the server.  This is
      because sessions consume resources, and resource limitations may
      force the server to cull an inactive session.  A server MAY
      consider a session to be inactive if the client has not used the
      session before the session inactivity timer (Section 2.10.12) has
      expired.

   *  Destroy the session when not needed.  If a client has multiple
      sessions, one of which has no requests waiting for replies, and
      has been idle for some period of time, it SHOULD destroy the
      session.

   *  Maintain GSS contexts and RPCSEC_GSS handles for the backchannel.
      If the client requires the server to use the RPCSEC_GSS security
      flavor for callbacks, then it needs to be sure the RPCSEC_GSS
      handles and/or their GSS contexts that are handed to the server
      via BACKCHANNEL_CTL or CREATE_SESSION are unexpired.

   *  Preserve a connection for a backchannel.  The server requires a
      backchannel in order to gracefully recall recallable state or
      notify the client of certain events.  Note that if the connection
      is not being used for the fore channel, there is no way for the
      client to tell if the connection is still alive (e.g., the server
      restarted without sending a disconnect).  The onus is on the
      server, not the client, to determine if the backchannel's
      connection is alive, and to indicate in the response to a SEQUENCE
      operation when the last connection associated with a session's
      backchannel has disconnected.

2.10.11.3.  Steps the Client Takes to Establish a Session



   If the client does not have a client ID, the client sends EXCHANGE_ID
   to establish a client ID.  If it opts for SP4_MACH_CRED or SP4_SSV
   protection, in the spo_must_enforce list of operations, it SHOULD at
   minimum specify CREATE_SESSION, DESTROY_SESSION,
   BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID.  If it
   opts for SP4_SSV protection, the client needs to ask for SSV-based
   RPCSEC_GSS handles.

   The client uses the client ID to send a CREATE_SESSION on a
   connection to the server.  The results of CREATE_SESSION indicate
   whether or not the server will persist the session reply cache
   through a server restart, and the client notes this for future
   reference.

   If the client specified SP4_SSV state protection when the client ID
   was created, then it SHOULD send SET_SSV in the first COMPOUND after
   the session is created.  Each time a new principal goes to use the
   client ID, it SHOULD send a SET_SSV again.

   If the client wants to use delegations, layouts, directory
   notifications, or any other state that requires a backchannel, then
   it needs to add a connection to the backchannel if CREATE_SESSION did
   not already do so.  The client creates a connection, and calls
   BIND_CONN_TO_SESSION to associate the connection with the session and
   the session's backchannel.  If CREATE_SESSION did not already do so,
   the client MUST tell the server what security is required in order
   for the client to accept callbacks.  The client does this via
   BACKCHANNEL_CTL.  If the client selected SP4_MACH_CRED or SP4_SSV
   protection when it called EXCHANGE_ID, then the client SHOULD specify
   that the backchannel use RPCSEC_GSS contexts for security.

   If the client wants to use additional connections for the
   backchannel, then it needs to call BIND_CONN_TO_SESSION on each
   connection it wants to use with the session.  If the client wants to
   use additional connections for the fore channel, then it needs to
   call BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED
   state protection when the client ID was created.

   At this point, the session has reached steady state.
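
   The sequence above can be outlined in C-like pseudocode.  The nfs_*
   calls below are hypothetical client-library helpers declared only
   as prototypes; they are not part of any specified API:

   struct nfs_client;

   void nfs_exchange_id(struct nfs_client *, int state_protect_how);
   void nfs_create_session(struct nfs_client *);
   void nfs_set_ssv(struct nfs_client *);
   void nfs_bind_conn_to_session(struct nfs_client *);
   void nfs_backchannel_ctl(struct nfs_client *);

   void establish_session(struct nfs_client *clp)
   {
           /* EXCHANGE_ID: request SP4_SSV and list CREATE_SESSION,
            * DESTROY_SESSION, BIND_CONN_TO_SESSION, BACKCHANNEL_CTL,
            * and DESTROY_CLIENTID in spo_must_enforce. */
           nfs_exchange_id(clp, /* SP4_SSV = */ 2);

           /* CREATE_SESSION; note whether the reply cache will
            * persist through a server restart. */
           nfs_create_session(clp);

           /* With SP4_SSV, SET_SSV goes in the first COMPOUND. */
           nfs_set_ssv(clp);

           /* If CREATE_SESSION did not supply a backchannel, bind a
            * connection to it and declare callback security. */
           nfs_bind_conn_to_session(clp);
           nfs_backchannel_ctl(clp);
   }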

2.10.12.  Session Inactivity Timer



   The server MAY maintain a session inactivity timer for each session.
   If the session inactivity timer expires, then the server MAY destroy
   the session.  To avoid losing a session due to inactivity, the client
   MUST renew the session inactivity timer.  The length of the session
   inactivity timer MUST NOT be less than the lease_time attribute
   (Section 5.8.1.11).  As with lease renewal (Section 8.3), when the
   server receives a SEQUENCE operation, it resets the session
   inactivity timer, and MUST NOT allow the timer to expire while the
   rest of the operations in the COMPOUND procedure's request are still
   executing.  Once the last operation has finished, the server MUST set
   the session inactivity timer to expire no sooner than the sum of the
   current time and the value of the lease_time attribute.
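
   A minimal, non-normative sketch of the server-side bookkeeping,
   assuming one-second clock resolution and illustrative types:

   /* Session inactivity timer handling on the server. */
   #include <stdbool.h>
   #include <time.h>

   struct session {
           bool   in_compound; /* a COMPOUND on this session is
                                * still executing               */
           time_t expiry;      /* earliest permissible expiry   */
   };

   void sequence_received(struct session *s)
   {
           s->in_compound = true;  /* timer MUST NOT expire now */
   }

   void compound_finished(struct session *s, time_t lease_time)
   {
           s->in_compound = false;
           /* expire no sooner than now + lease_time */
           s->expiry = time(NULL) + lease_time;
   }

   bool may_destroy_idle(const struct session *s)
   {
           return !s->in_compound && time(NULL) >= s->expiry;
   }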

2.10.13.  Session Mechanics - Recovery



2.10.13.1.  Events Requiring Client Action



   The following events require client action to recover.

2.10.13.1.1.  RPCSEC_GSS Context Loss by Callback Path


   If all RPCSEC_GSS handles granted by the client to the server for
   callback use have expired, the client MUST establish a new handle via
   BACKCHANNEL_CTL.  The sr_status_flags field of the SEQUENCE results
   indicates when callback handles are nearly expired, or fully expired
   (see Section 18.46.3).

2.10.13.1.2.  Connection Loss


   If the client loses the last connection of the session and wants to
   retain the session, then it needs to create a new connection, and if,
   when the client ID was created, BIND_CONN_TO_SESSION was specified in
   the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION
   to associate the connection with the session.

   If there was a request outstanding at the time of connection loss,
   then if the client wants to continue to use the session, it MUST
   retry the request, as described in Section 2.10.6.2.  Note that it is
   not necessary to retry requests over a connection with the same
   source network address or the same destination network address as the
   lost connection.  As long as the session ID, slot ID, and sequence ID
   in the retry match that of the original request, the server will
   recognize the request as a retry if it executed the request prior to
   disconnect.

   If the connection that was lost was the last one associated with the
   backchannel, and the client wants to retain the backchannel and/or
   prevent revocation of recallable state, the client needs to
   reconnect, and if it does, it MUST associate the connection to the
   session and backchannel via BIND_CONN_TO_SESSION.  The server SHOULD
   indicate when it has no callback connection via the sr_status_flags
   result from SEQUENCE.

2.10.13.1.3.  Backchannel GSS Context Loss


   Via the sr_status_flags result of the SEQUENCE operation or other
   means, the client will learn if some or all of the RPCSEC_GSS
   contexts it assigned to the backchannel have been lost.  If the
   client wants to retain the backchannel and/or avoid putting
   recallable state at risk of revocation, the client needs to use
   BACKCHANNEL_CTL to assign new contexts.

2.10.13.1.4.  Loss of Session


   The replier might lose a record of the session.  Causes include:

   *  Replier failure and restart.

   *  A catastrophe that causes the reply cache to be corrupted or lost
      on the media on which it was stored.  This applies even if the
      replier indicated in the CREATE_SESSION results that it would
      persist the cache.

   *  The server purges the session of a client that has been inactive
      for a very extended period of time.

   *  As a result of configuration changes among a set of clustered
      servers, a network address previously connected to one server
      becomes connected to a different server that has no knowledge of
      the session in question.  Such a configuration change will
      generally only happen when the original server ceases to function
      for a time.

   Loss of reply cache is equivalent to loss of session.  The replier
   indicates loss of session to the requester by returning
   NFS4ERR_BADSESSION on the next operation that uses the session ID
   that refers to the lost session.

   After an event like a server restart, the client may have lost its
   connections.  The client assumes for the moment that the session has
   not been lost.  It reconnects, and if it specified connection
   association enforcement when the session was created, it invokes
   BIND_CONN_TO_SESSION using the session ID.  Otherwise, it invokes
   SEQUENCE.  If BIND_CONN_TO_SESSION or SEQUENCE returns
   NFS4ERR_BADSESSION, the client knows the session is not available to
   it when communicating with that network address.  If the connection
   survives session loss, then the next SEQUENCE operation the client
   sends over the connection will get back NFS4ERR_BADSESSION.  The
   client again knows the session was lost.

   Here is one suggested algorithm for the client when it gets
   NFS4ERR_BADSESSION.  It is not obligatory: a client that does not
   want to take advantage of such features as trunking may omit parts
   of it.  However, it is a useful example that draws attention to
   various possible recovery issues, and a pseudocode outline follows
   the list:

   1.  If the client has other connections to other server network
       addresses associated with the same session, attempt a COMPOUND
       with a single operation, SEQUENCE, on each of the other
       connections.

   2.  If the attempts succeed, the session is still alive, and this is
       a strong indicator that the server's network address has moved.
       The client might send an EXCHANGE_ID on the connection that
       returned NFS4ERR_BADSESSION to see if there are opportunities for
       client ID trunking (i.e., the same client ID and so_major_id
       value are returned).  The client might use DNS to see if the
       moved network address was replaced with another, so that the
       performance and availability benefits of session trunking can
       continue.

   3.  If the SEQUENCE requests fail with NFS4ERR_BADSESSION, then the
       session no longer exists on any of the server network addresses
       for which the client has connections associated with that session
       ID.  It is possible the session is still alive and available on
       other network addresses.  The client sends an EXCHANGE_ID on all
       the connections to see if the server owner is still listening on
       those network addresses.  If the same server owner is returned
       but a new client ID is returned, this is a strong indicator of a
       server restart.  If both the same server owner and same client ID
       are returned, then this is a strong indication that the server
       did delete the session, and the client will need to send a
       CREATE_SESSION if it has no other sessions for that client ID.
       If a different server owner is returned, the client can use DNS
       to find other network addresses.  If it does not, or if DNS does
       not find any other addresses for the server, then the client will
       be unable to provide NFSv4.1 service, and fatal errors should be
       returned to processes that were using the server.  If the client
       is using a "mount" paradigm, unmounting the server is advised.

   4.  If the client knows of no other connections associated with the
       session ID and server network addresses that are, or have been,
       associated with the session ID, then the client can use DNS to
       find other network addresses.  If it does not, or if DNS does not
       find any other addresses for the server, then the client will be
       unable to provide NFSv4.1 service, and fatal errors should be
       returned to processes that were using the server.  If the client
       is using a "mount" paradigm, unmounting the server is advised.
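
   The outline below condenses the four steps; the helpers are
   hypothetical (prototypes only) and stand in for the SEQUENCE,
   EXCHANGE_ID, and DNS probes described above:

   #include <stdbool.h>

   struct nfs_client;

   bool sequence_ok_elsewhere(struct nfs_client *);   /* step 1 */
   bool same_owner_and_clientid(struct nfs_client *); /* step 3 */
   bool dns_finds_other_address(struct nfs_client *);

   void recover_bad_session(struct nfs_client *clp)
   {
           if (sequence_ok_elsewhere(clp)) {
                   /* Step 2: the session is alive; the network
                    * address moved.  Consider EXCHANGE_ID for client
                    * ID trunking and DNS for a replacement address. */
                   return;
           }
           if (same_owner_and_clientid(clp)) {
                   /* Step 3: the server deleted the session; send
                    * CREATE_SESSION if no other session exists for
                    * the client ID. */
                   return;
           }
           /* Steps 3-4: different or unknown server owner.  Try DNS;
            * if that fails, return fatal errors to processes using
            * the server (and unmount under a "mount" paradigm). */
           (void)dns_finds_other_address(clp);
   }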

   If there is a reconfiguration event that results in the same network
   address being assigned to servers where the eir_server_scope value
   is different, it cannot be guaranteed that a session ID generated by
   the first server will be recognized as invalid by the second.
   Therefore, in
   managing server reconfigurations among servers with different server
   scope values, it is necessary to make sure that all clients have
   disconnected from the first server before effecting the
   reconfiguration.  Nonetheless, clients should not assume that servers
   will always adhere to this requirement; clients MUST be prepared to
   deal with unexpected effects of server reconfigurations.  Even where
   a session ID is inappropriately recognized as valid, it is likely
   either that the connection will not be recognized as valid or that a
   sequence value for a slot will not be correct.  Therefore, when a
   client receives results indicating such unexpected errors, the use of
   EXCHANGE_ID to determine the current server configuration is
   RECOMMENDED.

   A variation on the above is that after a server's network address
   moves, there is no NFSv4.1 server listening, e.g., no listener on
   port 2049.  In this example, one of the following occurs: the NFSv4
   server returns NFS4ERR_MINOR_VERS_MISMATCH, the NFS server returns a
   PROG_MISMATCH error, the RPC listener on 2049 returns PROG_UNAVAIL,
   or attempts to reconnect to the network address time out.  These
   SHOULD
   be treated as equivalent to SEQUENCE returning NFS4ERR_BADSESSION for
   these purposes.

   When the client detects session loss, it needs to call CREATE_SESSION
   to recover.  Any non-idempotent operations that were in progress
   might have been performed on the server at the time of session loss.
   The client has no general way to recover from this.

   Note that loss of session does not imply loss of byte-range lock,
   open, delegation, or layout state, because locks, opens,
   delegations, and layouts are tied to the client ID, not the
   session.  Nor does loss of byte-range lock, open, delegation,
   or layout state imply loss of session state, because the session
   depends on the client ID; loss of client ID however does imply loss
   of session, byte-range lock, open, delegation, and layout state.  See
   Section 8.4.2.  A session can survive a server restart, but lock
   recovery may still be needed.

   It is possible that CREATE_SESSION will fail with
   NFS4ERR_STALE_CLIENTID (e.g., the server restarts and does not
   preserve client ID state).  If so, the client needs to call
   EXCHANGE_ID, followed by CREATE_SESSION.

2.10.13.2.  Events Requiring Server Action



   The following events require server action to recover.

2.10.13.2.1.  Client Crash and Restart


   As described in Section 18.35, a restarted client sends EXCHANGE_ID
   in such a way that it causes the server to delete any sessions it
   had.

2.10.13.2.2.  Client Crash with No Restart


   If a client crashes and never comes back, it will never send
   EXCHANGE_ID with its old client owner.  Thus, the server has session
   state that will never be used again.  After an extended period of
   time, and if the server has resource constraints, it MAY destroy the
   old session as well as locking state.

2.10.13.2.3.  Extended Network Partition


   To the server, the extended network partition may be no different
   from a client crash with no restart (see Section 2.10.13.2.2).
   Unless the server can discern that there is a network partition, it
   is free to treat the situation as if the client has crashed
   permanently.

2.10.13.2.4.  Backchannel Connection Loss


   If there were callback requests outstanding at the time of a
   connection loss, then the server MUST retry the requests, as
   described in Section 2.10.6.2.  Note that it is not necessary to
   retry requests over a connection with the same source network address
   or the same destination network address as the lost connection.  As
   long as the session ID, slot ID, and sequence ID in the retry match
   that of the original request, the callback target will recognize the
   request as a retry even if it did see the request prior to
   disconnect.

   If the connection lost is the last one associated with the
   backchannel, then the server MUST indicate that in the
   sr_status_flags field of every SEQUENCE reply until the backchannel
   is re-established.  There are two situations, each of which uses
   different status flags: no connectivity for the session's backchannel
   and no connectivity for any session backchannel of the client.  See
   Section 18.46 for a description of the appropriate flags in
   sr_status_flags.

2.10.13.2.5.  GSS Context Loss


   The server SHOULD monitor when the number of RPCSEC_GSS handles
   assigned to the backchannel reaches one, and when that one handle is
   near expiry (i.e., between one and two periods of lease time), and
   indicate so in the sr_status_flags field of all SEQUENCE replies.
   The server MUST indicate when all of the backchannel's assigned
   RPCSEC_GSS handles have expired via the sr_status_flags field of all
   SEQUENCE replies.

2.10.14.  Parallel NFS and Sessions



   A client and server can each be a non-pNFS implementation, a
   metadata server implementation, a data server implementation, or a
   combination of these types.  The EXCHGID4_FLAG_USE_NON_PNFS,
   EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not
   mutually exclusive) are passed in the EXCHANGE_ID arguments and
   results to allow the client to indicate how it wants to use sessions
   created under the client ID, and to allow the server to indicate how
   it will allow the sessions to be used.  See Section 13.1 for pNFS
   sessions considerations.

3.  Protocol Constants and Data Types

   The syntax and semantics to describe the data types of the NFSv4.1
   protocol are defined in the XDR (RFC 4506 [2]) and RPC (RFC 5531 [3])
   documents.  The next sections build upon the XDR data types to define
   constants, types, and structures specific to this protocol.  The full
   list of XDR data types is in [10].

3.1.  Basic Constants



   const NFS4_FHSIZE               = 128;
   const NFS4_VERIFIER_SIZE        = 8;
   const NFS4_OPAQUE_LIMIT         = 1024;
   const NFS4_SESSIONID_SIZE       = 16;

   const NFS4_INT64_MAX            = 0x7fffffffffffffff;
   const NFS4_UINT64_MAX           = 0xffffffffffffffff;
   const NFS4_INT32_MAX            = 0x7fffffff;
   const NFS4_UINT32_MAX           = 0xffffffff;

   const NFS4_MAXFILELEN           = 0xffffffffffffffff;
   const NFS4_MAXFILEOFF           = 0xfffffffffffffffe;

   Except where noted, all these constants are defined in bytes.

   *  NFS4_FHSIZE is the maximum size of a filehandle.

   *  NFS4_VERIFIER_SIZE is the fixed size of a verifier.

   *  NFS4_OPAQUE_LIMIT is the maximum size of certain opaque
      information.

   *  NFS4_SESSIONID_SIZE is the fixed size of a session identifier.

   *  NFS4_INT64_MAX is the maximum value of a signed 64-bit integer.

   *  NFS4_UINT64_MAX is the maximum value of an unsigned 64-bit
      integer.

   *  NFS4_INT32_MAX is the maximum value of a signed 32-bit integer.

   *  NFS4_UINT32_MAX is the maximum value of an unsigned 32-bit
      integer.

   *  NFS4_MAXFILELEN is the maximum length of a regular file.

   *  NFS4_MAXFILEOFF is the maximum offset into a regular file.

3.2.  Basic Data Types



   These are the base NFSv4.1 data types.

     +===============+==============================================+
     | Data Type     | Definition                                   |
     +===============+==============================================+
     | int32_t       | typedef int int32_t;                         |
     +---------------+----------------------------------------------+
     | uint32_t      | typedef unsigned int uint32_t;               |
     +---------------+----------------------------------------------+
     | int64_t       | typedef hyper int64_t;                       |
     +---------------+----------------------------------------------+
     | uint64_t      | typedef unsigned hyper uint64_t;             |
     +---------------+----------------------------------------------+
     | attrlist4     | typedef opaque attrlist4<>;                  |
     |               |                                              |
     |               | Used for file/directory attributes.          |
     +---------------+----------------------------------------------+
     | bitmap4       | typedef uint32_t bitmap4<>;                  |
     |               |                                              |
     |               | Used in attribute array encoding.            |
     +---------------+----------------------------------------------+
     | changeid4     | typedef uint64_t changeid4;                  |
     |               |                                              |
     |               | Used in the definition of change_info4.      |
     +---------------+----------------------------------------------+
     | clientid4     | typedef uint64_t clientid4;                  |
     |               |                                              |
     |               | Shorthand reference to client                |
     |               | identification.                              |
     +---------------+----------------------------------------------+
     | count4        | typedef uint32_t count4;                     |
     |               |                                              |
     |               | Various count parameters (READ, WRITE,       |
     |               | COMMIT).                                     |
     +---------------+----------------------------------------------+
     | length4       | typedef uint64_t length4;                    |
     |               |                                              |
     |               | The length of a byte-range within a file.    |
     +---------------+----------------------------------------------+
     | mode4         | typedef uint32_t mode4;                      |
     |               |                                              |
     |               | Mode attribute data type.                    |
     +---------------+----------------------------------------------+
     | nfs_cookie4   | typedef uint64_t nfs_cookie4;                |
     |               |                                              |
     |               | Opaque cookie value for READDIR.             |
     +---------------+----------------------------------------------+
     | nfs_fh4       | typedef opaque nfs_fh4<NFS4_FHSIZE>;         |
     |               |                                              |
     |               | Filehandle definition.                       |
     +---------------+----------------------------------------------+
     | nfs_ftype4    | enum nfs_ftype4;                             |
     |               |                                              |
     |               | Various defined file types.                  |
     +---------------+----------------------------------------------+
     | nfsstat4      | enum nfsstat4;                               |
     |               |                                              |
     |               | Return value for operations.                 |
     +---------------+----------------------------------------------+
     | offset4       | typedef uint64_t offset4;                    |
     |               |                                              |
     |               | Various offset designations (READ, WRITE,    |
     |               | LOCK, COMMIT).                               |
     +---------------+----------------------------------------------+
     | qop4          | typedef uint32_t qop4;                       |
     |               |                                              |
     |               | Quality of protection designation in         |
     |               | SECINFO.                                     |
     +---------------+----------------------------------------------+
     | sec_oid4      | typedef opaque sec_oid4<>;                   |
     |               |                                              |
     |               | Security Object Identifier.  The sec_oid4    |
     |               | data type is not really opaque.  Instead, it |
     |               | contains an ASN.1 OBJECT IDENTIFIER as used  |
     |               | by GSS-API in the mech_type argument to      |
     |               | GSS_Init_sec_context.  See [7] for details.  |
     +---------------+----------------------------------------------+
     | sequenceid4   | typedef uint32_t sequenceid4;                |
     |               |                                              |
     |               | Sequence number used for various session     |
     |               | operations (EXCHANGE_ID, CREATE_SESSION,     |
     |               | SEQUENCE, CB_SEQUENCE).                      |
     +---------------+----------------------------------------------+
     | seqid4        | typedef uint32_t seqid4;                     |
     |               |                                              |
     |               | Sequence identifier used for locking.        |
     +---------------+----------------------------------------------+
     | sessionid4    | typedef opaque                               |
     |               | sessionid4[NFS4_SESSIONID_SIZE];             |
     |               |                                              |
     |               | Session identifier.                          |
     +---------------+----------------------------------------------+
     | slotid4       | typedef uint32_t slotid4;                    |
     |               |                                              |
     |               | Sequencing artifact for various session      |
     |               | operations (SEQUENCE, CB_SEQUENCE).          |
     +---------------+----------------------------------------------+
     | utf8string    | typedef opaque utf8string<>;                 |
     |               |                                              |
     |               | UTF-8 encoding for strings.                  |
     +---------------+----------------------------------------------+
     | utf8str_cis   | typedef utf8string utf8str_cis;              |
     |               |                                              |
     |               | Case-insensitive UTF-8 string.               |
     +---------------+----------------------------------------------+
     | utf8str_cs    | typedef utf8string utf8str_cs;               |
     |               |                                              |
     |               | Case-sensitive UTF-8 string.                 |
     +---------------+----------------------------------------------+
     | utf8str_mixed | typedef utf8string utf8str_mixed;            |
     |               |                                              |
     |               | UTF-8 strings with a case-sensitive prefix   |
     |               | and a case-insensitive suffix.               |
     +---------------+----------------------------------------------+
     | component4    | typedef utf8str_cs component4;               |
     |               |                                              |
     |               | Represents pathname components.              |
     +---------------+----------------------------------------------+
     | linktext4     | typedef utf8str_cs linktext4;                |
     |               |                                              |
     |               | Symbolic link contents ("symbolic link" is   |
     |               | defined in an Open Group [11] standard).     |
     +---------------+----------------------------------------------+
     | pathname4     | typedef component4 pathname4<>;              |
     |               |                                              |
     |               | Represents pathname for fs_locations.        |
     +---------------+----------------------------------------------+
     | verifier4     | typedef opaque                               |
     |               | verifier4[NFS4_VERIFIER_SIZE];               |
     |               |                                              |
     |               | Verifier used for various operations         |
     |               | (COMMIT, CREATE, EXCHANGE_ID, OPEN, READDIR, |
     |               | WRITE).  NFS4_VERIFIER_SIZE is defined as 8. |
     +---------------+----------------------------------------------+

                                 Table 1

   End of Base Data Types

3.3.  Structured Data Types



3.3.1.  nfstime4



   struct nfstime4 {
           int64_t         seconds;
           uint32_t        nseconds;
   };

   The nfstime4 data type gives the number of seconds and nanoseconds
   since midnight or zero hour January 1, 1970 Coordinated Universal
   Time (UTC).  Values greater than zero for the seconds field denote
   dates after the zero hour January 1, 1970.  Values less than zero for
   the seconds field denote dates before the zero hour January 1, 1970.
   In both cases, the nseconds field is to be added to the seconds field
   for the final time representation.  For example, if the time to be
   represented is one-half second before zero hour January 1, 1970, the
   seconds field would have a value of negative one (-1) and the
   nseconds field would have a value of one-half second (500000000).
   Values greater than 999,999,999 for nseconds are invalid.

   This data type is used to pass time and date information.  A server
   converts to and from its local representation of time when processing
   time values, preserving as much accuracy as possible.  If the
   precision of timestamps stored for a file system object is less than
   defined, loss of precision can occur.  An adjunct time maintenance
   protocol is RECOMMENDED to reduce client and server time skew.
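
   The interpretation of negative seconds values can be illustrated
   with a small, non-normative conversion helper:

   /* Convert nfstime4 to seconds since the epoch as a double. */
   #include <stdint.h>

   struct nfstime4 {
           int64_t  seconds;
           uint32_t nseconds;   /* valid range: 0..999999999 */
   };

   double nfstime4_to_seconds(struct nfstime4 t)
   {
           /* nseconds is added to seconds in both directions of
            * the epoch, so one-half second before the epoch is
            * {seconds = -1, nseconds = 500000000} = -0.5. */
           return (double)t.seconds + (double)t.nseconds / 1e9;
   }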

3.3.2.  time_how4



   enum time_how4 {
           SET_TO_SERVER_TIME4 = 0,
           SET_TO_CLIENT_TIME4 = 1
   };

3.3.3.  settime4



   union settime4 switch (time_how4 set_it) {
    case SET_TO_CLIENT_TIME4:
            nfstime4       time;
    default:
            void;
   };

   The time_how4 and settime4 data types are used for setting timestamps
   in file object attributes.  If set_it is SET_TO_SERVER_TIME4, then
   the server uses its local representation of time for the time value.

3.3.4.  specdata4



   struct specdata4 {
    uint32_t specdata1; /* major device number */
    uint32_t specdata2; /* minor device number */
   };

   This data type represents the device numbers for the device file
   types NF4CHR and NF4BLK.

3.3.5.  fsid4



   struct fsid4 {
           uint64_t        major;
           uint64_t        minor;
   };

3.3.6.  change_policy4



   struct change_policy4 {
           uint64_t        cp_major;
           uint64_t        cp_minor;
   };

   The change_policy4 data type is used for the change_policy
   RECOMMENDED attribute.  It provides change sequencing indication
   analogous to the change attribute.  To enable the server to present a
   value valid across server re-initialization without requiring
   persistent storage, two 64-bit quantities are used, allowing one to
   be a server instance ID and the second to be incremented non-
   persistently, within a given server instance.

3.3.7.  fattr4



   struct fattr4 {
           bitmap4         attrmask;
           attrlist4       attr_vals;
   };

   The fattr4 data type is used to represent file and directory
   attributes.

   The bitmap is a counted array of 32-bit integers used to contain bit
   values.  The position of the integer in the array that contains bit n
   can be computed from the expression (n / 32), and its bit within that
   integer is (n mod 32).

                     0            1
   +-----------+-----------+-----------+--
   |  count    | 31  ..  0 | 63  .. 32 |
   +-----------+-----------+-----------+--
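
   A non-normative pair of helpers shows the (n / 32) and (n mod 32)
   rule for locating attribute bit n:

   /* Test and set attribute bit n in a bitmap4 word array. */
   #include <stdbool.h>
   #include <stdint.h>

   bool bitmap4_test(const uint32_t *words, uint32_t nwords,
                     uint32_t n)
   {
           uint32_t word = n / 32;   /* containing array element */
           uint32_t bit  = n % 32;   /* bit within that element  */
           return word < nwords &&
                  (words[word] & (UINT32_C(1) << bit)) != 0;
   }

   void bitmap4_set(uint32_t *words, uint32_t n)
   {
           words[n / 32] |= UINT32_C(1) << (n % 32);
   }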



3.3.8.  change_info4



   struct change_info4 {
           bool            atomic;
           changeid4       before;
           changeid4       after;
   };

   This data type is used with the CREATE, LINK, OPEN, REMOVE, and
   RENAME operations to let the client know the value of the change
   attribute for the directory in which the target file system object
   resides.

3.3.9.  netaddr4



   struct netaddr4 {
           /* see struct rpcb in RFC 1833 */
           string na_r_netid<>; /* network id */
           string na_r_addr<>;  /* universal address */
   };

   The netaddr4 data type is used to identify network transport
   endpoints.  The na_r_netid and na_r_addr fields respectively contain
   a netid and uaddr.  The netid and uaddr concepts are defined in [12].
   The netid and uaddr formats for TCP over IPv4 and TCP over IPv6 are
   defined in [12], specifically Tables 2 and 3 and in Sections 5.2.3.3
   and 5.2.3.4.

3.3.10.  state_owner4



   struct state_owner4 {
           clientid4       clientid;
           opaque          owner<NFS4_OPAQUE_LIMIT>;
   };

   typedef state_owner4 open_owner4;
   typedef state_owner4 lock_owner4;

   The state_owner4 data type is the base type for the open_owner4
   (Section 3.3.10.1) and lock_owner4 (Section 3.3.10.2).

3.3.10.1.  open_owner4



   This data type is used to identify the owner of OPEN state.

3.3.10.2.  lock_owner4



   This structure is used to identify the owner of byte-range locking
   state.

3.3.11.  open_to_lock_owner4



   struct open_to_lock_owner4 {
           seqid4          open_seqid;
           stateid4        open_stateid;
           seqid4          lock_seqid;
           lock_owner4     lock_owner;
   };

   This data type is used for the first LOCK operation done for an
   open_owner4.  It provides both the open_stateid and lock_owner, such
   that the transition is made from a valid open_stateid sequence to
   that of the new lock_stateid sequence.  Using this mechanism avoids
   the confirmation of the lock_owner/lock_seqid pair since it is tied
   to established state in the form of the open_stateid/open_seqid.

3.3.12.  stateid4



   struct stateid4 {
           uint32_t        seqid;
           opaque          other[12];
   };

   This data type is used for the various state sharing mechanisms
   between the client and server.  The client never modifies a value of
   data type stateid.  The starting value of the "seqid" field is
   undefined.  The server is required to increment the "seqid" field by
   one at each transition of the stateid.  This is important since the
   client will inspect the seqid in OPEN stateids to determine the order
   of OPEN processing done by the server.

3.3.13.  layouttype4



   enum layouttype4 {
           LAYOUT4_NFSV4_1_FILES   = 0x1,
           LAYOUT4_OSD2_OBJECTS    = 0x2,
           LAYOUT4_BLOCK_VOLUME    = 0x3
   };

   This data type indicates what type of layout is being used.  The file
   server advertises the layout types it supports through the
   fs_layout_type file system attribute (Section 5.12.1).  A client asks
   for layouts of a particular type in LAYOUTGET, and processes those
   layouts in its layout-type-specific logic.

   The layouttype4 data type is 32 bits in length.  The range
   represented by the layout type is split into three parts.  Type 0x0
   is reserved.  Types within the range 0x00000001-0x7FFFFFFF are
   globally unique and are assigned according to the description in
   Section 22.5; they are maintained by IANA.  Types within the range
   0x80000000-0xFFFFFFFF are site specific and for private use only.

   The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file
   layout type, as defined in Section 13, is to be used.  The
   LAYOUT4_OSD2_OBJECTS enumeration specifies that the object layout, as
   defined in [47], is to be used.  Similarly, the LAYOUT4_BLOCK_VOLUME
   enumeration specifies that the block/volume layout, as defined in
   [48], is to be used.

3.3.14.  deviceid4



   const NFS4_DEVICEID4_SIZE = 16;

   typedef opaque  deviceid4[NFS4_DEVICEID4_SIZE];

   Layout information includes device IDs that specify a storage device
   through a compact handle.  Addressing and type information is
   obtained with the GETDEVICEINFO operation.  Device IDs are not
   guaranteed to be valid across metadata server restarts.  A device ID
   is unique per client ID and layout type.  See Section 12.2.10 for
   more details.

3.3.15.  device_addr4



   struct device_addr4 {
           layouttype4             da_layout_type;
           opaque                  da_addr_body<>;
   };

   The device address is used to set up a communication channel with the
   storage device.  Different layout types will require different data
   types to define how they communicate with storage devices.  The
   opaque da_addr_body field is interpreted based on the specified
   da_layout_type field.

   This document defines the device address for the NFSv4.1 file layout
   (see Section 13.3), which identifies a storage device by network IP
   address and port number.  This is sufficient for the clients to
   communicate with the NFSv4.1 storage devices, and may be sufficient
   for other layout types as well.  Device types for object-based
   storage devices and block storage devices (e.g., Small Computer
   System Interface (SCSI) volume labels) are defined by their
   respective layout specifications.

3.3.16.  layout_content4



   struct layout_content4 {
           layouttype4 loc_type;
           opaque      loc_body<>;
   };

   The loc_body field is interpreted based on the layout type
   (loc_type).  This document defines the loc_body for the NFSv4.1 file
   layout type; see Section 13.3 for its definition.

3.3.17.  layout4



   struct layout4 {
           offset4                 lo_offset;
           length4                 lo_length;
           layoutiomode4           lo_iomode;
           layout_content4         lo_content;
   };

   The layout4 data type defines a layout for a file.  The layout type
   specific data is opaque within lo_content.  Since layouts are sub-
   dividable, the offset and length together with the file's filehandle,
   the client ID, iomode, and layout type identify the layout.

3.3.18.  layoutupdate4



   struct layoutupdate4 {
           layouttype4             lou_type;
           opaque                  lou_body<>;
   };

   The layoutupdate4 data type is used by the client to return updated
   layout information