vspacer
 
vspacer
 

A SourceSafe Search Engine

 

 
Author's note

This page describes a early prototype SourceSafe search engine which provided valuable feedback on scalability, storage, and tuning issues.

The SourceSafe search engine ZDS Search available for sale on this site is based on new technology and incorporates the lessons learned.


SourceSafe Search Engine

This search engine is designed to return results in the minimum time. Provisions are made to tune the storage required by the index, but optimization of this is secondary


Index Lookup

The index server is based on the principle of least action, and does no work, and opens no files until it has to.

Given a keyword it first tries see if the term is too long, short, contains no letter characters, or fails a 'digraph' check.

If the term passes this check then:


*The termhhash_ydex_terms.map file is opened. This is simply an array of hash entries, eight bytes per hash entry. The hash function has been designed so that any term will hash to a value less than the number of entriet in the map file
*The hash is computed for the term.
*A direct lookup is done at the offset of the hash for the eight byte hash entry.

The eight byte hash entry is broken into two components:


*An index to the 'nth' entry in the XTerms_YFiles.dat file if the term occurs multiple times OR a direct offset into the Files.dat file if the term occurs only in one file.
*An offset into the terms.dat file where term information, and hash collision chain information is stored. The high order bits of this value are overstored with COLLISION | COMMON | SINGLEFILE flags

The hash COLLISION flag will normally not be set, as the system is configured by default to keep collisions below 1 percent. Opening the Terms.dat file to resolve a collision is therefore uncommon.

Forty percent of the terms occur only in one file. If we find the SINGLEFILE flag set we can open the Files.dat file, do a direct lookup by offset to get the file information (about 60 bytes) format the result to STDOUT, and we are done.

If not, we do a direct lookup by offset of the bit-map pages in XFiles_Yindex.dat.

The offsets of bits we find set in XFiles_Yindex.dat each represent a direct index into xdex_files.dat. We do a direct lookup of the 4 byte entry in this index. The entries contain direct offsets of the file's info in Files.dat. The high order bits are overstored with DELETED | OBSOLETE | EXTENSION flags.

If none of these flags are set, then we do a direct lookup by offset into Files.dat to get the file information (about 60 bytes), format the result to STDOUT, and we are done

See the graphic's side-bar for a detailed IO budget.

The links shown on the search page returned are to files in the SourceSafe database, not to html pages as is usual. Instead of 'http://' protocol links they are generated as 'sstp://' links. By defining this private protocol in the registry of the client machine, we arrange that when the users click on a link, the SourceSafe client application is launched and opens at the project containing the file.


Index Generation

The common word problem is resolved without requiring expensive manual discard word list maintenance by allowing the user to configure a common word limit. Terms found with frequencies more than this value in the total set are flagged as COMMON, and the search does not return them.

To reduce the number of raw tokens to something more manageable.


*Tokens less than a configurable number of characters long are discarded.
*Tokens longer than a configurable number of characters long are discarded.
*Tokens failing to pass a 'digraph' check are discarded. This check accumulates a score for the token based on the frequency of letter digraphs in the english language. Tokens earning less than a configurable score value are discarded.


   

Back to top | ZDS Home | This article updated January 7 2004.