PKWare Data Compression
This is the "unofficial" specification of the compression algorithm as used in the PKWare Data Compression Library. For an explanation on how it is used in the game files, see either the general info page or the "saved game" and/or "scenario" pages per game.
Ben Rudiak-Gould posted the following specification of it in the comp.compression newsgroup. This specification is in the public domain.
The following description assumes that the reader is familiar with Lempel-Ziv sliding dictionary compression. A PKWARE Data Compression Library compressed stream consists of a two-byte header followed by a bitstream of arbitrary length. The first byte of the header is either 0 or 1, and determines the way in which literal bytes are represented in the bitstream (see below). The second byte of the header is equal to 4, 5, or 6, and determines the size of the sliding-window dictionary: 4, 5, and 6 indicate a 1K, 2K and 4K dictionary respectively. The bitstream is stored in little-endian order. For example, the bytes 0F 15 represent the bitstream 1111000010101000. The bitstream consists of a sequence of standard Lempel-Ziv tokens, represented by a prefix-free code. That is, each code word represents either a literal byte or a (length,offset) pair. One code word is reserved to indicate end-of-stream. Tokens are represented as follows: If the first bit of the token's code word is 0, the token is a literal byte. If it is 1, the token is a (length,offset) pair or end-of-stream. Literal bytes are represented in one of two different ways. If the first byte of the header is equal to 0, the bytes are represented as eight-bit little-endian binary numbers in the obvious way. (The total code word length in this case is 9 bits, including the leading 0 bit.) If the first byte of the header is equal to 1, literal bytes are represented by the following bit sequences: byte representation 20 1111 45 11101 61 11100 65 11011 69 11010 6c 11001 6e 11000 6f 10111 72 10110 73 10101 74 10100 75 10011 2d 100101 31 100100 41 100011 43 100010 44 100001 49 100000 4c 011111 4e 011110 4f 011101 52 011100 53 011011 54 011010 62 011001 63 011000 64 010111 66 010110 67 010101 68 010100 6d 010011 70 010010 0a 0100011 0d 0100010 28 0100001 29 0100000 2c 0011111 2e 0011110 30 0011101 32 0011100 33 0011011 34 0011010 35 0011001 37 0011000 38 0010111 3d 0010110 42 0010101 46 0010100 4d 0010011 50 0010010 55 0010001 6b 0010000 77 0001111 09 00011101 22 00011100 27 00011011 2a 00011010 2f 00011001 36 00011000 39 00010111 3a 00010110 47 00010101 48 00010100 57 00010011 5b 00010010 5f 00010001 76 00010000 78 00001111 79 00001110 2b 000011011 3e 000011010 4b 000011001 56 000011000 58 000010111 59 000010110 5d 000010101 21 0000101001 24 0000101000 26 0000100111 71 0000100110 7a 0000100101 00 00001001001 3c 00001001000 3f 00001000111 4a 00001000110 51 00001000101 5a 00001000100 5c 00001000011 6a 00001000010 7b 00001000001 7c 00001000000 01 000001111111 02 000001111110 03 000001111101 04 000001111100 05 000001111011 06 000001111010 07 000001111001 08 000001111000 0b 000001110111 0c 000001110110 0e 000001110101 0f 000001110100 10 000001110011 11 000001110010 12 000001110001 13 000001110000 14 000001101111 15 000001101110 16 000001101101 17 000001101100 18 000001101011 19 000001101010 1b 000001101001 1c 000001101000 1d 000001100111 1e 000001100110 1f 000001100101 23 000001100100 25 000001100011 3b 000001100010 40 000001100001 5e 000001100000 60 000001011111 7d 000001011110 7e 000001011101 7f 000001011100 b0 000001011011 b1 000001011010 b2 000001011001 b3 000001011000 b4 000001010111 b5 000001010110 b6 000001010101 b7 000001010100 b8 000001010011 b9 000001010010 ba 000001010001 bb 000001010000 bc 000001001111 bd 000001001110 be 000001001101 bf 000001001100 c0 000001001011 c1 000001001010 c2 000001001001 c3 000001001000 c4 000001000111 c5 000001000110 c6 000001000101 c7 000001000100 c8 000001000011 c9 000001000010 ca 000001000001 cb 000001000000 cc 000000111111 cd 000000111110 ce 000000111101 cf 000000111100 d0 000000111011 d1 000000111010 d2 000000111001 d3 000000111000 d4 000000110111 d5 000000110110 d6 000000110101 d7 000000110100 d8 000000110011 d9 000000110010 da 000000110001 db 000000110000 dc 000000101111 dd 000000101110 de 000000101101 df 000000101100 e1 000000101011 e5 000000101010 e9 000000101001 ee 000000101000 f2 000000100111 f3 000000100110 f4 000000100101 1a 0000001001001 80 0000001001000 81 0000001000111 82 0000001000110 83 0000001000101 84 0000001000100 85 0000001000011 86 0000001000010 87 0000001000001 88 0000001000000 89 0000000111111 8a 0000000111110 8b 0000000111101 8c 0000000111100 8d 0000000111011 8e 0000000111010 8f 0000000111001 90 0000000111000 91 0000000110111 92 0000000110110 93 0000000110101 94 0000000110100 95 0000000110011 96 0000000110010 97 0000000110001 98 0000000110000 99 0000000101111 9a 0000000101110 9b 0000000101101 9c 0000000101100 9d 0000000101011 9e 0000000101010 9f 0000000101001 a0 0000000101000 a1 0000000100111 a2 0000000100110 a3 0000000100101 a4 0000000100100 a5 0000000100011 a6 0000000100010 a7 0000000100001 a8 0000000100000 a9 0000000011111 aa 0000000011110 ab 0000000011101 ac 0000000011100 ad 0000000011011 ae 0000000011010 af 0000000011001 e0 0000000011000 e2 0000000010111 e3 0000000010110 e4 0000000010101 e6 0000000010100 e7 0000000010011 e8 0000000010010 ea 0000000010001 eb 0000000010000 ec 0000000001111 ed 0000000001110 ef 0000000001101 f0 0000000001100 f1 0000000001011 f5 0000000001010 f6 0000000001001 f7 0000000001000 f8 0000000000111 f9 0000000000110 fa 0000000000101 fb 0000000000100 fc 0000000000011 fd 0000000000010 fe 0000000000001 ff 0000000000000 (length,offset) pairs are represented by a bit sequence for the length, followed by a bit sequence for the most significant six bits of the offset, followed by the remaining bits of the offset in little-endian order. (Note that the number of remaining bits is equal to the value in the second byte of the header.) There is one exception to this: if the copy length is equal to 2, then only two low-order offset bits are encoded instead of 4, 5, or 6, limiting the offset to the most recent 256 bytes regardless of dictionary size. The copy length is represented as follows: length representation 2 101 3 11 4 100 5 011 6 0101 7 0100 8 0011 9 00101 10-11 00100x 12-15 00011xx 16-23 00010xxx 24-39 000011xxxx 40-71 000010xxxxx 72-135 000001xxxxxx 136-263 0000001xxxxxxx 264-518 0000000xxxxxxxx In this table the bits represented by 'x' contain the offset within the corresponding range, represented as a little-endian binary number. For example, the line reading: 16-23 00010xxx is shorthand for this: 16 00010000 17 00010100 18 00010010 19 00010110 20 00010001 21 00010101 22 00010011 23 00010111 The sequence 000000011111111, which would indicate a copy length of 519 if the table above were extended in the natural way, instead indicates end-of-stream. The upper six bits of the copy offset are represented as follows: offset bits representation 00 11 01 1011 02 1010 03 10011 04 10010 05 10001 06 10000 07 011111 08 011110 09 011101 0a 011100 0b 011011 0c 011010 0d 011001 0e 011000 0f 010111 10 010110 11 010101 12 010100 13 010011 14 010010 15 010001 16 0100001 17 0100000 18 0011111 19 0011110 1a 0011101 1b 0011100 1c 0011011 1d 0011010 1e 0011001 1f 0011000 20 0010111 21 0010110 22 0010101 23 0010100 24 0010011 25 0010010 26 0010001 27 0010000 28 0001111 29 0001110 2a 0001101 2b 0001100 2c 0001011 2d 0001010 2e 0001001 2f 0001000 30 00001111 31 00001110 32 00001101 33 00001100 34 00001011 35 00001010 36 00001001 37 00001000 38 00000111 39 00000110 3a 00000101 3b 00000100 3c 00000011 3d 00000010 3e 00000001 3f 00000000 Once these bits are combined with the low-order bits, the resulting number is an index from the end of the dictionary. That is, 0 represents the most-recently decoded byte, 1 the next-most-recently, and so on. The range to be copied begins at the indexed byte. The range to be copied can extend past the end of the dictionary. This is handled in the usual Lempel-Ziv way: the bytes within the dictionary are repeated cyclically as necessary to fill the rest of the range. The dictionary is not pre-initialized, so the range cannot extend past the beginning of the file. Here's a sample compressed stream and how it would be decoded. The stream is 00 04 82 24 25 c7 80 7f. The first byte of the header is 0, so the fixed-width representation is used for literal bytes. The second byte is 4, so the dictionary is 1K in size. The bitstream portion breaks down as follows: 0 10000010 literal byte 41 (ASCII 'A') 0 10010010 literal byte 49 (ASCII 'I') 1 001001 110001 copy 11 bytes starting at dictionary byte 1 (counting from the end starting with 0) 1 000000011111111 end of stream 0 padding to multiple of 8 bits (ignored) This stream would decompress to "AIAIAIAIAIAIA".
Note: the example is incorrect. The byte stream should be: 00 04 82 24 25 8f 80 7f
Last modified on 2011-12-08 19:34:58