Cache format specification
For updating purposes, HTTrack stores the original (untouched) HTML data,
references to downloaded files, and other meta-data (especially parts of the HTTP headers) in a cache
located in the hts-cache directory. Because local HTML pages are always modified to "fit" the local
filesystem structure, and because meta-data such as the Last-Modified date and ETag cannot be stored
with the associated files, the cache is mandatory for the reprocessing (update/continue) phases.
The (new) cache.zip format
The 3.31 release of HTTrack introduces a new cache format, more extensible and efficient than the previous one (ndx/dat format).
The main advantages of this cache are:
- One single file for a complete website cache archive
- Standard ZIP format, which can easily be reused on most platforms and in most languages
- Compressed data with the efficient and open zlib format
The cache is made of ZIP file entries, with one entry per fetched URL (whether fetched successfully or not: errors are also stored).
For each entry:
- The ZIP file name is the original URL [see notes below]
- The ZIP file contents, if available, are the original data (compressed using the deflate algorithm)
- The ZIP file extra field (in the local file header) contains a list of meta-fields, very similar to HTTP header fields. See also RFC 2616.
- The ZIP file timestamp follows the "Last-Modified" field given for this URL, if any
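The per-entry layout above can be explored with a short Python sketch. It is only an illustration, not HTTrack's own reader. Python's standard zipfile module exposes the extra field of the central directory (ZipInfo.extra), not the one in the local file header, so the sketch reads the local header by hand. The names read_local_extra and iter_cache_entries are hypothetical, and the sketch assumes the archive can be opened by the zipfile module at all.

```python
import struct
import zipfile

# Fixed-size part of a ZIP local file header (30 bytes, little-endian):
# signature, 5 x 16-bit fields, 3 x 32-bit fields, name length, extra length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")
LOCAL_SIGNATURE = b"PK\x03\x04"


def read_local_extra(fp, header_offset):
    """Return (name, extra) as raw bytes from the local header at header_offset."""
    fp.seek(header_offset)
    raw = fp.read(LOCAL_HEADER.size)
    if len(raw) != LOCAL_HEADER.size:
        raise ValueError("truncated local file header")
    fields = LOCAL_HEADER.unpack(raw)
    if fields[0] != LOCAL_SIGNATURE:
        raise ValueError("not a local file header")
    name_len, extra_len = fields[9], fields[10]
    name = fp.read(name_len)
    extra = fp.read(extra_len)
    return name, extra


def iter_cache_entries(path):
    """Yield (url, zip_timestamp, extra_bytes) for each entry of a cache archive."""
    # zipfile locates the entries; the raw file handle reads the local headers.
    with zipfile.ZipFile(path) as zf, open(path, "rb") as fp:
        for info in zf.infolist():
            _, extra = read_local_extra(fp, info.header_offset)
            yield info.filename, info.date_time, extra
```

Calling iter_cache_entries("hts-cache/new.zip") would then yield one tuple per fetched URL, with the meta-fields still in their raw text form.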
Example of cache file:
$ unzip -l hts-cache/new.zip
Archive: hts-cache/new.zip
HTTrack Website Copier/3.31-ALPHA-4 mirror complete in 3 seconds : 5 links scanned,
3 files written (16109 bytes overall) [17690 bytes received at 5896 bytes/sec]
(1 errors, 0 warnings, 0 messages)
  Length     Date   Time    Name
 --------    ----   ----    ----
       94  07-18-03 08:59   http://www.httrack.com/robots.txt
     9866  01-17-04 01:09   http://www.httrack.com/html/cache.html
        0  05-11-03 13:31   http://www.httrack.com/html/images/bg_rings.gif
      207  01-19-04 05:49   http://www.httrack.com/html/fade.gif
        0  05-11-03 13:31   http://www.httrack.com/html/images/header_title_4.gif
 --------                   -------
    10167                   5 files
Example of cache file meta-data:
HTTP/1.1 200 OK
X-In-Cache: 1
X-StatusCode: 200
X-StatusMessage: OK
X-Size: 94
Content-Type: text/plain
Last-Modified: Fri, 18 Jul 2003 08:59:11 GMT
Etag: "40ebb5-5e-3f17b6df"
X-Addr: www.httrack.com
X-Fil: /robots.txt
There are also specific issues regarding this format:
- The data in the central directory (such as CD extra field, and CD comments) are not used
- The ZIP archive is allowed to contain more than 65535 (2^16 - 1) files; in that case the total number of entries recorded in the 32-bit central directory is 65535 (0xffff), but the presence of the 64-bit central directory is not mandatory
- The ZIP archive is allowed to contain more than 2^32 bytes (4 GiB); in that case the 64-bit central directory must be present (not currently supported)
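As an illustration of the entry-count limit, the following Python sketch reads the 16-bit total from the end of central directory record and reports whether it has reached 0xffff, in which case the real number of entries has to be found by other means (for example by walking the entries). The function name eocd_entry_count is hypothetical, and the backwards search for the record signature is a simplification that could be fooled by an archive comment containing the same bytes.

```python
import struct

# End of central directory record: signature, 4 x 16-bit fields,
# 2 x 32-bit fields, comment length (22 bytes, little-endian).
EOCD = struct.Struct("<4s4H2LH")
EOCD_SIGNATURE = b"PK\x05\x06"


def eocd_entry_count(fp):
    """Return (count, saturated) read from the end of central directory record."""
    fp.seek(0, 2)  # seek to the end to learn the file size
    size = fp.tell()
    # The record is followed only by a comment of at most 65535 bytes.
    fp.seek(max(0, size - (EOCD.size + 0xFFFF)))
    tail = fp.read()
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD.size > len(tail):
        raise ValueError("end of central directory record not found")
    total = EOCD.unpack_from(tail, pos)[4]  # total number of entries
    return total, total == 0xFFFF
```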
Meta-data stored in the "extra field" of the local file headers
The extra field is composed of text data, and this text data is composed of distinct lines of headers.
The end of the text, or a double CR/LF, marks the end of this zone.
This makes it possible to optionally store the original HTTP headers just after the "meta-data" headers, for informational use.
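The rules above can be sketched as a small Python parser. This is a minimal illustration under stated assumptions, not HTTrack's implementation: the function name parse_meta is hypothetical, the text is decoded as Latin-1, bare LF line endings are tolerated in addition to CR/LF, and field names are lower-cased so that spellings such as "Etag" and "ETag" map to the same key.

```python
def parse_meta(extra):
    """Split extra-field bytes into (status_line, fields)."""
    text = extra.decode("latin-1")  # assumption: one byte per character
    # Normalize line endings, then keep only the zone before the first
    # blank line; anything after it (optional original headers) is ignored.
    text = text.replace("\r\n", "\n")
    zone = text.split("\n\n", 1)[0]
    lines = [line for line in zone.split("\n") if line]
    if not lines:
        raise ValueError("empty meta-data zone")
    status_line, fields = lines[0], {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line; skipped in this sketch
        fields[name.strip().lower()] = value.strip()
    return status_line, fields
```

Because Python dictionaries keep insertion order, the first key of the returned mapping is also the first field of the zone, which is convenient when checking the position of X-In-Cache.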
The status line (the first header line)
Status-Line = HTTP-Version SP Status-Code SP X-Reason-Phrase CRLF
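A status line of this shape can be split on its first two spaces. The following Python sketch is one possible reading of the grammar; StatusLine and parse_status_line are hypothetical names, and an absent reason phrase is treated as empty.

```python
from typing import NamedTuple


class StatusLine(NamedTuple):
    version: str
    code: int
    reason: str


def parse_status_line(line):
    """Parse 'HTTP-Version SP Status-Code SP X-Reason-Phrase'."""
    # Split at most twice so that spaces inside the reason phrase survive.
    parts = line.rstrip("\r\n").split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/") or not parts[1].isdigit():
        raise ValueError(f"malformed status line: {line!r}")
    reason = parts[2] if len(parts) == 3 else ""
    return StatusLine(parts[0], int(parts[1]), reason)
```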
Other lines:
Specific fields:
- X-In-Cache
Indicates if the data are present (value=1) in the cache (that is, as ZIP data), or in an external file (value=0).
This field MUST be the first field.
- X-StatusCode
The status code as modified by HTTrack after processing. 304 ("Not Modified") status codes, for example, are transformed into "200" codes after processing.
- X-StatusMessage
The modified (by httrack) status message.
- X-Size
The stored (either in cache, or in an external file) data size.
- X-Charset
The original charset.
- X-Addr
The original URL address part.
- X-Fil
The original URL path part.
- X-Save
The local filename, depending on user's "build structure" preferences.
Standard (RFC 2616) "useful" fields:
- Content-Type
- Last-Modified
- Etag
- Location
- Content-Disposition
Specific fields in "BNF-like" grammar:
X-In-Cache = "X-In-Cache" ":" 1*DIGIT
X-StatusCode = "X-StatusCode" ":" 1*DIGIT
X-StatusMessage = "X-StatusMessage" ":" *<TEXT, excluding CR, LF>
X-Size = "X-Size" ":" 1*DIGIT
X-Charset = "X-Charset" ":" value
X-Addr = "X-Addr" ":" scheme ":" "//" authority
X-Fil = "X-Fil" ":" rel_path
X-Save = "X-Save" ":" rel_path
RFC standard fields:
Content-Type = "Content-Type" ":" media-type
Last-Modified = "Last-Modified" ":" HTTP-date
Etag = "ETag" ":" entity-tag
Location = "Location" ":" absoluteURI
Content-Disposition = "Content-Disposition" ":" disposition-type *( ";" disposition-parm )
And, for your information,
X-Reason-Phrase = *<TEXT, with a maximum of 32 characters, and excluding CR, LF>
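A few of these constraints lend themselves to a mechanical check. The Python sketch below verifies only three of them: X-In-Cache being the first field, the 1*DIGIT fields, and the 32-character limit on the reason phrase. The name check_meta is hypothetical, optional whitespace after the colon is accepted as an assumption, and the remaining productions (media-type, HTTP-date, and so on) are deliberately not validated.

```python
import re

# Fields whose value must be 1*DIGIT according to the grammar.
DIGIT_FIELDS = ("x-in-cache", "x-statuscode", "x-size")
FIELD_RE = re.compile(r"^([A-Za-z0-9-]+):[ \t]*(.*)$")


def check_meta(lines):
    """Return a list of problems found in the header lines (status line first)."""
    if not lines:
        return ["no status line"]
    problems = []
    # The reason phrase is everything after the second space of the status line.
    reason = lines[0].split(" ", 2)[2] if lines[0].count(" ") >= 2 else ""
    if len(reason) > 32:
        problems.append("reason phrase longer than 32 characters")
    fields = []
    for line in lines[1:]:
        match = FIELD_RE.match(line)
        if match is None:
            problems.append(f"not a header field: {line!r}")
            continue
        fields.append((match.group(1).lower(), match.group(2).strip()))
    if not fields or fields[0][0] != "x-in-cache":
        problems.append("X-In-Cache is not the first field")
    for name, value in fields:
        if name in DIGIT_FIELDS and not re.fullmatch(r"[0-9]+", value):
            problems.append(f"{name} is not 1*DIGIT: {value!r}")
    return problems
```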
Note: Because the URLs may have an unexpected format, especially with a double "/" inside and other reserved characters ("?", "&", ...),
various ZIP extraction tools may have trouble accessing or decompressing the data.
Libraries should generally handle this peculiar format, however.
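As an example of library access, the following Python sketch reads an entry by its URL entirely in memory, which sidesteps the problem of mapping such names onto a local filesystem. The function name read_cached_body is hypothetical; only zipfile calls from the Python standard library are used, and entries with X-In-Cache set to 0 (data stored in an external file) are not handled here.

```python
import zipfile


def read_cached_body(zf, url):
    """Return the stored bytes for url, or None if the archive has no such entry."""
    try:
        info = zf.getinfo(url)  # entry names are the original URLs
    except KeyError:
        return None
    return zf.read(info)  # decompresses in memory, nothing is written to disk
```

With an archive opened as zipfile.ZipFile("hts-cache/new.zip"), a call such as read_cached_body(zf, "http://www.httrack.com/robots.txt") would return the original data for that URL.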