Sunday, November 27, 2005

Indexing .chm Files with ht://Dig

Thanks to the Google cache, I was able to recover this treasure that I first came across long ago. The author appears to be moving or revamping his site, so I figured I'd better save a copy here for future reference in case it doesn't make it to the new site.

Google Cache


Indexing .chm Files with ht://Dig

1. First, install chmlib to get the command-line utility chmextract. On Gentoo, just type emerge chmlib.
2. Apache needs to recognise chm files correctly. Edit /etc/apache2/conf/commonapache2.conf and search for AddType. Add the following line:

AddType application/x-chm .chm

3. Restart Apache with /etc/init.d/apache2 restart. You can verify that the MIME type is correctly identified by using wget to download a .chm file from your server; wget should report application/x-chm as the content type.
4. You can use chmextract to extract the HTML files from .chm files. I wrote this little script, which creates the required output for htdig:

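#!/bin/sh
# Extract a .chm file and write all of its HTML pages to stdout for htdig.
set -e
# htdig passes the file to convert as the first argument
FILE="$1"
# mktemp -d creates the temporary directory safely in a single step
WORKDIR=$(mktemp -d)
# Remove the temporary directory on exit, even if extraction fails
trap 'rm -rf "$WORKDIR"' EXIT
chmextract "$FILE" "$WORKDIR" >/dev/null
# Concatenate every extracted .htm/.html page
find "$WORKDIR" -type f -iname "*.htm*" -exec cat {} \;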

Save this file as /usr/local/share/doc2html/chm2html.sh, and make it executable with chmod a+x chm2html.sh.
5. In /etc/htdig/htdig.conf search for external_parsers: and add the following line:

application/x-chm->text/html /usr/local/share/doc2html/chm2html.sh

Do not forget to add a trailing backslash (\) to the end of the previous line, so that the new entry is read as a continuation of external_parsers!
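
The MIME type check from step 3 can be sketched roughly as follows. This is only a minimal sketch: the function name parse_content_type and the URL in the comment are made up for illustration, and it assumes wget's -S (print server response) and --spider (do not download) options are available in your version.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of the last Content-Type header seen (redirects may produce several).
parse_content_type() {
    tr -d '\r' | awk -F': *' 'tolower($1) ~ /content-type$/ { v = $2 } END { print v }'
}

# Example call (the URL is a placeholder, adjust it to a .chm file on your server):
#   wget -S --spider http://localhost/docs/example.chm 2>&1 | parse_content_type
```

If this prints application/x-chm, Apache picked up the AddType line. Anything else (for example text/plain) probably means the configuration change was not loaded.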

That's it. Happy indexing!
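
Before running a full index, it may be worth trying the parser script by hand on a single file. The sketch below is not part of the original recipe: check_parser is a hypothetical helper, and it simply assumes that useful output contains an <html tag somewhere.

```shell
#!/bin/sh
# Hypothetical helper: run a parser script on one file the way htdig would
# (file name as first argument, converted HTML on stdout) and report
# whether anything that looks like HTML came out.
check_parser() {
    parser=$1
    chm=$2
    if "$parser" "$chm" | grep -qi '<html'; then
        echo "ok: $parser produced HTML"
    else
        echo "problem: no HTML in output of $parser" >&2
        return 1
    fi
}

# Example call (the .chm path is a placeholder):
#   check_parser /usr/local/share/doc2html/chm2html.sh /path/to/manual.chm
```

A failure here usually points at the script or at chmextract rather than at the htdig configuration, which narrows down where to look.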
