Where's That Folder?

OpenBSD has a nifty little locate(1) command, and most other Unixes at least have it as an option. It does a great job of locating file or directory names, but it had some issues for the things I wanted to do with it. I started thinking about how to adapt locate(1) for my needs, but decided that would be a complete rewrite, and really a different program.

WTF is not presented as a general purpose file and directory finder, but as a special purpose one for MY uses...but it might give you some ideas for something along this line for your personal use. I would not recommend blindly copy/pasting this script into your general purpose computer.

The database

The "database" is simply a flat file. I puzzled for a long time trying to figure out how to efficiently store and search data, and basically decided that a linear search through a text file was the best bet. But of course, that could be a HUGE file. The obvious solution is to compress the file -- and since it's just doing a linear search through the file using grep, it can be created and searched in compressed form, never having to exist anywhere on the disk in uncompressed form, unlike locate(1). I have seen very good compression here, well over 90% reduction in file size.
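To check the compression claim against your own file system, something like the following sketch might do. The helper name ratio_pct is made up for this example and is not part of WTF. It runs find twice so the list never lands on disk uncompressed.

```shell
#!/bin/sh
# Sketch: report how much gzip shrinks a list of path names.
# ratio_pct is a hypothetical helper name, not part of WTF.

# usage: ratio_pct <directory>
# Prints the percentage reduction in size after compression.
ratio_pct() {
    # tr strips the padding some wc implementations add.
    raw=$(find "$1" 2>/dev/null | wc -c | tr -d ' ')
    packed=$(find "$1" 2>/dev/null | gzip | wc -c | tr -d ' ')
    # Avoid dividing by zero on an empty listing.
    [ "$raw" -gt 0 ] || { echo 0; return; }
    echo $(( (raw - packed) * 100 / raw ))
}
```

For example, `ratio_pct /usr` prints a single number, the percent saved. Very small trees can come out at or below zero because of gzip's fixed header overhead.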

Strategy

The basic strategy is to build the database by running a find command and piping its output through gzip into a file.

Finding the desired output consists of running the compressed data through zcat and grep for the strings desired, with a -i by default so that case is ignored.
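Stripped of all option handling, the build and search steps described above can be sketched like this. The function names build_list and search_list are invented for illustration, and gzip -dc is used as a stand-in for zcat.

```shell
#!/bin/sh
# Sketch of the build-then-search round trip.
# build_list and search_list are hypothetical names for illustration.

# usage: build_list <tree> <listfile.gz>
build_list() {
    # The uncompressed list only ever exists inside the pipe.
    find "$1" | gzip > "$2"
}

# usage: search_list <listfile.gz> <pattern>
search_list() {
    # gzip -dc is used here as a portable stand-in for zcat;
    # -i makes the match case-insensitive.
    gzip -dc "$1" | grep -i -e "$2"
}
```

A trial run on a small tree might look like `build_list /usr/local /tmp/local.gz` followed by `search_list /tmp/local.gz python`.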

Originally, my plan was to have one data file, and then grep for files vs. directories by using different regular expressions. However, testing quickly showed that searching for directories (my common use case) through a file with all the file names was very slow, so I ended up creating TWO files -- a list of file names and a list of just directory names. This gave me very acceptable performance when searching for directories, while still leaving the file search option available.

An effect I have noted a few times in my career is that sometimes performing two searches on the same basic data set ends up efficiently using the system disk cache for the second query. I suspected it might be very possible to get both the complete list of all files AND the list of just all directories with two parallel searches with less total time spent than two sequential searches, and it turns out this is correct in my environment. So the directory list find and the file list find are run in parallel by backgrounding them.
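A minimal sketch of the two parallel finds follows. The name build_both and the output file names are assumptions for this example. Unlike the WTF script, it waits for both jobs to finish, so the caller knows the lists are complete.

```shell
#!/bin/sh
# Sketch: build the file list and the directory list in parallel.
# build_both and the output file names are hypothetical.

# usage: build_both <tree> <outdir>
build_both() {
    # Each pipeline is backgrounded so the two finds overlap and
    # can share whatever the disk cache already holds.
    find "$1" -type f | gzip > "$2/filelist.gz" &
    find "$1" -type d | gzip > "$2/dirlist.gz" &
    wait    # block until both background pipelines are done
}
```

To see whether the parallel version wins in your environment, time it against the same two finds run one after the other. Keep in mind that the second of any two runs benefits from the cache warmed by the first.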

find(1) tends to find files in an order that may not seem obviously logical, and sometimes the order of the results annoys me. I thought about sorting the output of the find command before running it through gzip, and figured that would also improve the compression. Somewhat to my surprise, though, it did just the opposite -- the compressed files increased in size, with a huge penalty in time, and a lot of tmp space being used to sort the massive files. So, I dropped that -- if I really want sorted output, I'll run the search results through sort.
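Sorting only the matches is cheap, since the result of a search is tiny compared to the full list. A sketch, with search_sorted as a made-up name:

```shell
#!/bin/sh
# Sketch: sort the search results rather than the whole database.
# search_sorted is a hypothetical name for illustration.

# usage: search_sorted <listfile.gz> <pattern>
search_sorted() {
    # Only the matching lines reach sort, so no large temporary
    # files are needed.
    gzip -dc "$1" | grep -i -e "$2" | sort
}
```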

A friend and former coworker of mine always likes to go for absolute maximum compression, using bzip instead of gzip, and the -9 option, to get a tiny amount of extra compression in trade for a large amount of additional time ("If they didn't want me to use it, why did they put it there?"). I thought this application might be the time when gzip -9 or even bzip -9 would make sense, where the additional time required to compress the file might be trivial compared to the time to find all the data and the additional compression might be appreciated. However, my testing showed the search times went up a lot when I tried it and the file size went down very little. Wasn't worth it in my opinion.
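If you want to repeat the comparison on your own data, a sketch like this reports the compressed size at a given gzip level. The helper name gz_size is an assumption for this example. It covers only the size half of the test; search time would have to be measured separately, for instance by timing the zcat and grep pipeline against each resulting file.

```shell
#!/bin/sh
# Sketch: compare compressed list sizes at different gzip levels.
# gz_size is a hypothetical helper name, not part of WTF.

# usage: gz_size <level 1-9> <tree>
# Prints the size in bytes of the path list compressed at that level.
gz_size() {
    find "$2" 2>/dev/null | gzip "-$1" | wc -c | tr -d ' '
}

# Example (commented out because it walks a large tree):
#   echo "level 6 (gzip default): $(gz_size 6 /usr) bytes"
#   echo "level 9:                $(gz_size 9 /usr) bytes"
```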

========================================================
#!/bin/ksh

#
# Copyright (c) 2022 Nicholas Holland
#
# Permission to use, copy, modify, and distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
# ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
# ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
#

DATADIR=/var/db/wtf        # where WTF datafiles live.
FILELIST=filelist.text.gz  # All files
DIRLIST=dirlist.text.gz    # All directory names
DATAFILE=$DATADIR/$DIRLIST # Default to directories.
BUILDUSER="root"           # verify user being used to build the lists
ZCAT="/usr/bin/zcat"       # command to 'cat' a compressed file to stdout

GREPCASE="-i"
GREPOPTS=""

if [[ -z $1 ]]; then
    # Plain << is used so the help text does not depend on tab indentation.
    cat <<__ENDHELP
usage:
   $0 [-c] [-dDfF] searchstring
   $0 -build

options:
   -d "dirname"  : search for strings in directory names
   -D "dirname"  : search for exactly specified directory
   -f "filename" : search for string in file names
   -F "filename" : search for exact file name
   -c : case sensitive searches
   -build : rebuild the WTF database (run as $BUILDUSER)

   Note: parser is lame, -c must come first if used.
__ENDHELP
    exit
fi

if [[ "$1" = "-build" ]]; then
    if [[ $(whoami) != "$BUILDUSER" ]]; then
        print "Must run as $BUILDUSER"
        exit 1
    fi
    mkdir -pm 755 $DATADIR
    # Both finds run in parallel in the background; each list is written
    # to a .new file and moved into place only if gzip exits successfully.
    (find / -type f |gzip -v >$DATADIR/$FILELIST.new && mv $DATADIR/$FILELIST.new $DATADIR/$FILELIST) &
    (find / -type d |gzip -v >$DATADIR/$DIRLIST.new && mv $DATADIR/$DIRLIST.new $DATADIR/$DIRLIST) &
    sleep 2   # give the .new files time to be created before the chmod
    print "Database build running in the background."
    chmod 644 $DATADIR/*
    exit
fi

while [[ -n $1 ]]; do
    case $1 in
        -c ) GREPCASE=""
             shift
             ;;
        -f ) GREPOPTS="$GREPCASE -e \"$2[^/]*$\""
             DATAFILE=$DATADIR/$FILELIST
             shift ; shift
             ;;
        -F ) GREPOPTS="$GREPCASE -e \"/$2\$\""
             DATAFILE=$DATADIR/$FILELIST
             shift ; shift
             ;;
        -d ) GREPOPTS="$GREPCASE -e \"$2[^/]*$\""
             DATAFILE=$DATADIR/$DIRLIST
             shift ; shift
             ;;
        -D ) GREPOPTS="$GREPCASE -e \"/$2\$\""
             DATAFILE=$DATADIR/$DIRLIST
             shift ; shift
             ;;
        -* ) print "Invalid option $1"
             exit
             ;;
        * ) GREPOPTS="$GREPCASE -e \"$1\""
            shift
             ;;
    esac
done

COMMAND="$ZCAT $DATAFILE |grep $GREPOPTS"

print "=$COMMAND=" # Diagnostic.  Probably comment out in production.
eval "$COMMAND"

========================================================

Usage

Periodically, you want to update the database -- how often is up to you. OpenBSD's locate(1) database is updated weekly. For my uses, I wanted daily, but I'm also aware this creates huge loads on some of my system's disks while wtf -build is running, so I want it to run when the system is otherwise under low load (i.e., when I'm asleep).

Teaming this up with disknice might be wise.

After that, just run "wtf" and whatever you want to search for. As I've presented it, it searches for directories, not file names, but if you are more often interested in file names, you can change the logic to default to $FILELIST rather than $DIRLIST.
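The difference between the "string" options (-d, -f) and the "exact" options (-D, -F) comes down to the two regular expressions the script builds. The sketch below runs the same two patterns against a small invented list, so you can see what each matches without building a database. The sample paths and function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: what the two pattern styles match, on a made-up list.
# The sample paths and function names are invented for illustration.

sample_list() {
    printf '%s\n' /home/nick/src /home/nick/src/wtf /usr/src /usr/src-old
}

# "string" style (as built for -d and -f): the string must appear
# somewhere in the last path component.
match_contains() {
    sample_list | grep -i -e "${1}[^/]*\$"
}

# "exact" style (as built for -D and -F): the last path component
# must be exactly the given name.
match_exact() {
    sample_list | grep -i -e "/${1}\$"
}
```

Here `match_contains src` prints /home/nick/src, /usr/src and /usr/src-old, while `match_exact src` prints only /home/nick/src and /usr/src. Neither prints /home/nick/src/wtf, because "src" is not in its last component.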

Crap that could be improved

This is not particularly good code. It reached the "good enough for me" stage, and I decided YOU might have too many things to change, so I'm presenting it only as a "starting point" for your needs, rather than as a be-all solution for everyone. Better to push it out as "less than optimal with warnings" than to sit on it forever trying to remove the last bit of embarrassment from it.
 
 


since August 3, 2022

Copyright 2022, Nick Holland, Holland Consulting.