| |
|
Author's note This page describes a early prototype SourceSafe search engine which provided valuable feedback on scalability, storage, and tuning issues. The SourceSafe search engine ZDS Search available for sale on this site is based on new technology and incorporates the lessons learned.
This search engine is designed to return results in the minimum time. Provisions are made to tune the storage required by the index, but optimization of this is secondary Index Lookup The index server is based on the principle of least action, and does no work, and opens no files until it has to. Given a keyword it first tries see if the term is too long, short, contains no letter characters, or fails a 'digraph' check. If the term passes this check then:
The eight byte hash entry is broken into two components:
The hash COLLISION flag will normally not be set, as the system is configured by default to keep collisions below 1 percent. Opening the Terms.dat file to resolve a collision is therefore uncommon. Forty percent of the terms occur only in one file. If we find the SINGLEFILE flag set we can open the Files.dat file, do a direct lookup by offset to get the file information (about 60 bytes) format the result to STDOUT, and we are done. If not, we do a direct lookup by offset of the bit-map pages in XFiles_Yindex.dat. The offsets of bits we find set in XFiles_Yindex.dat each represent a direct index into xdex_files.dat. We do a direct lookup of the 4 byte entry in this index. The entries contain direct offsets of the file's info in Files.dat. The high order bits are overstored with DELETED | OBSOLETE | EXTENSION flags. If none of these flags are set, then we do a direct lookup by offset into Files.dat to get the file information (about 60 bytes), format the result to STDOUT, and we are done See the graphic's side-bar for a detailed IO budget. The links shown on the search page returned are to files in the SourceSafe database, not to html pages as is usual. Instead of 'http://' protocol links they are generated as 'sstp://' links. By defining this private protocol in the registry of the client machine, we arrange that when the users click on a link, the SourceSafe client application is launched and opens at the project containing the file. Index Generation The common word problem is resolved without requiring expensive manual discard word list maintenance by allowing the user to configure a common word limit. Terms found with frequencies more than this value in the total set are flagged as COMMON, and the search does not return them. To reduce the number of raw tokens to something more manageable.
|