Tree Hash EXchange format (THEX)

Abstract

The following memo presents the Tree Hash Exchange (THEX) format, for exchanging Merkle Hash Trees built up from the subrange digests of discrete digital files. Such tree hash data structures assist in file integrity verification, allowing arbitrary subranges of bytes to be verified before the entire file has been received.

Description and references

To get the latest complete version of the THEX specification, visit:
http://open-content.net/specs/draft-jchapweske-thex-01.html

THEX can be used to exchange Merkle Hash Trees computed with various message digest algorithms and various digest sizes (including "CRC32", "MD5", "SHA1" or "Tiger" with all their variants).

Although THEX trees built with CRC32 are very fast to compute and can detect most transmission errors, they offer no security against deliberate tampering with file contents. In addition, a CRC32 value tends to be too short for the large files where THEX is typically needed, so stronger digest algorithms with longer outputs are highly preferable.

Most THEX applications will therefore use the 160-bit "SHA1" message digest algorithm, or the faster and stronger 192-bit "Tiger" message digest algorithm, as both are currently considered irreversible.

It is possible to use reduced versions of these two message digests to minimize the storage space used by serialized THEX tree data, but a message digest should produce at least 128 bits, each bit with approximately equal cryptographic strength.

The standard THEX tree data exchange format uses XML in its encapsulation layer according to the Direct Internet Message Encapsulation (DIME) specification used for XML Web Services, initially developed by Henrik Frystyk Nielsen, and developed as an IETF draft by Microsoft/IBM for SOAP. Visit:
http://msdn.microsoft.com/webservices/understanding/gxa/default.asp?pull=/library/en-us/dnglobspec/html/dimeindex.asp

The THEX serialized tree data transported in a DIME encapsulation should be accessible in a location-independent way. For example, it can be identified by the secure "urn:sha1:" URN or the very secure "urn:bitprint:" URN (both require precomputing the digests of the fully serialized tree), or by a simpler "uuid:" URI as defined in SOAP and referenced in DIME. A UUID can be generated independently of the serialized tree data content, and may reduce the time needed to generate the DIME encapsulation because it does not require an additional hash, but it makes THEX serialized trees less secure.

For some distributed applications (peer-to-peer file exchange protocols or distributed file systems), the THEX encapsulation in XML with DIME may be unnecessary if all user-agents agree on the message digest algorithm to use and on its tree data serialization format. In that case, only the tree data URN may be necessary, and it can be transported, for example, in connection handshake headers (if using HTTP-like protocols that allow transporting such extensions before the actual file content data). Note, however, that DIME allows further extension to stronger or faster alternate algorithms if they become necessary.

Applications of THEX for "TigerTree" file digests

A typical application of Merkle Hash Trees is "TigerTree", another file digest that can complement "SHA1" file digests.

A "TigerTree" digest differs from a full "Tiger" digest because it is NOT computed by digesting the full file, but by computing "Tiger" digests on individual 1KB blocks and combining them in a Merkle Hash Tree. The "TigerTree" digest of the file is the root hash of the Merkle Hash Tree computed with the standard "Tiger" digest.
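The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: Python's standard library does not reliably provide Tiger, so SHA-1 is used here purely as a stand-in digest; the function name tree_root and its parameters are hypothetical; an unpaired node is assumed to be promoted unchanged to the next level; and an empty file is assumed to have a single empty leaf. Some revisions of the THEX draft prepend a distinct byte to leaf and internal-node inputs, which the optional prefix parameters can model.

```python
import hashlib

BLOCK_SIZE = 1024  # 1KB leaf blocks, as described in the text


def tree_root(data, digest=lambda b: hashlib.sha1(b).digest(),
              leaf_prefix=b"", node_prefix=b""):
    """Return the Merkle Hash Tree root of `data`.

    `digest` is a stand-in (SHA-1); a real TigerTree would plug in Tiger.
    `leaf_prefix` / `node_prefix` are hypothetical hooks for THEX revisions
    that tag leaf and internal-node inputs differently.
    """
    # Leaf level: one digest per 1KB block (assumption: an empty file
    # is treated as one empty block).
    blocks = [data[i:i + BLOCK_SIZE]
              for i in range(0, len(data), BLOCK_SIZE)] or [b""]
    level = [digest(leaf_prefix + block) for block in blocks]

    # Combine pairs of digests until a single root remains.
    while len(level) > 1:
        parents = [digest(node_prefix + level[i] + level[i + 1])
                   for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # Assumption: an unpaired node is promoted unchanged.
            parents.append(level[-1])
        level = parents
    return level[0]
```

With the default (empty) prefixes, a file that fits in a single block has a root equal to the plain digest of the file, while larger files produce a root that differs from the full-file digest.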

Bitzi's "bitprint:" URNs and TigerTree file digests

Bitzi's "bitprint:" URN scheme uses the "TigerTree" file digest, NOT the "Tiger" file digest. The two will most often differ for any file larger than exactly 1024-9=1013 bytes, and will always be identical ONLY for small files of up to 1013 bytes.

"bitprint:" URNs can be computed without generating and serializing the full Merkle Hash Tree. But for applications in Gnutella with swarmed downloads, it is best to keep a store of intermediate hash values that complies with the THEX binary serialization format.

Note: Bitzi's "bitprint:" URNs use the following format:

"urn:bitprint:SHA1.TigerTree"

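where "SHA1" is the Base32-encoded 160-bit SHA1 digest of the file content, and "TigerTree" is the Base32-encoded 192-bit root hash of its TigerTree.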

Note: The shorter (but less secure) "sha1:" URN for any file content can be inferred simply from an existing "bitprint:" URN for the same file content by replacing the URN scheme name, and stripping the "." and the TigerTree part. So transporting both the "sha1:" URN and the "bitprint:" URN is not needed, as the latter will suffice in most cases.
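The URN handling described above can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes that both digests are Base32-encoded without "=" padding (giving 32 characters for a 160-bit SHA1 digest and 39 characters for a 192-bit TigerTree root); check the "bitprint:" specification before relying on these details.

```python
import base64


def b32_unpadded(digest):
    """Base32-encode raw digest bytes, dropping '=' padding (assumption)."""
    return base64.b32encode(digest).decode("ascii").rstrip("=")


def bitprint_urn(sha1_digest, tigertree_root):
    """Build 'urn:bitprint:SHA1.TigerTree' from two raw digests."""
    if len(sha1_digest) != 20 or len(tigertree_root) != 24:
        raise ValueError("expected a 20-byte SHA1 and a 24-byte TigerTree root")
    return "urn:bitprint:%s.%s" % (b32_unpadded(sha1_digest),
                                   b32_unpadded(tigertree_root))


def sha1_urn_from_bitprint(urn):
    """Infer the 'urn:sha1:' URN by dropping the '.' and the TigerTree part."""
    prefix = "urn:bitprint:"
    if not urn.lower().startswith(prefix):
        raise ValueError("not a bitprint URN")
    sha1_part, dot, _tigertree_part = urn[len(prefix):].partition(".")
    if not dot or not sha1_part:
        raise ValueError("malformed bitprint URN")
    return "urn:sha1:" + sha1_part
```

The reverse direction is not possible: a "sha1:" URN alone does not carry enough information to rebuild the TigerTree part.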

To get the latest complete specification of the "bitprint:" URN scheme, visit:
http://bitzi.com/developer/bitprint.

To get reference documentation about the standard "Tiger" message digest, and a sample C implementation, visit:
http://www.cs.technion.ac.il/~biham/Reports/Tiger/.

To get reference documentation about the standard "Base32" encoding, visit:
http://www.ietf.org/internet-drafts/draft-josefsson-base-encoding-03.txt.

To get a sample Public Domain implementation in Java of the "Tiger" Digest and of simple Base16, Base32, Base64 encoders/decoders, visit: http://groups.yahoo.com/group/the_gdf/files/Proposals/HUGE/com.bitzi.util/.

Applications of THEX for peer-to-peer file exchanges

THEX works best with peer-to-peer file exchanges and distributed filesystems.

The "swarmed downloads" feature of Gnutella benefits most from THEX, as it allows verifying the integrity of files downloaded in fragments from multiple sources, as discovered with the "HUGE" protocol extension proposal for Gnutella, which is widely approved by most Gnutella servent vendors.

For a complete specification of the HUGE protocol extension (Hash/URN Generic Extension) for Gnutella servents, visit:
http://groups.yahoo.com/group/the_gdf/files/Proposals/HUGE/.

For a complete specification of the PFSP protocol extension (Partial File Sharing Protocol) for Gnutella servents, visit:
http://groups.yahoo.com/group/the_gdf/files/Proposals/PFSP/.

For a complete specification of the standard Gnutella protocol, visit:
http://groups.yahoo.com/group/the_gdf/files/Development/.

For technical discussions about the evolution of the Gnutella protocol (developers only), visit:
http://groups.yahoo.com/group/the_gdf/ (may require user registration on the Yahoo! service).

THEX specification Table of Contents