Viktor Petersson.com
Create a lightweight intranet search engine with Xapian on FreeBSD
Recently I had to set up an intranet search engine to crawl through thousands of PDF files. There are a ton of commercial solutions (read: $$$$) on the market, ranging from Google Search Appliance to IBM’s OmniFind. There are also a few good Open Source engines, such as Apache’s Lucene. The problem is that these are primarily intended for enterprises with server farms full of data. That’s really not what I was looking for. I was looking for something simple that was easy to set up and maintain. That’s when I came across Xapian. It’s Open Source and lightweight. Combine Xapian with Omega and you’ve got exactly what I was looking for: a lightweight intranet search engine.
This howto will walk you through how to set up Xapian with Omega on FreeBSD. The version I used was FreeBSD 8.1, but I’m sure any recent version of FreeBSD (7.x or later) will do. Please note that I expect you to know your way around FreeBSD, so I’m not going to spend time on simple tasks like how to edit files. I also assume you already have your system up and running.
I’ve called the path we’re going to index (recursively) ‘/path/to/something’. This can be either a local path or something mounted from a remote server. Also, as you’ll see below, a lot of dependencies are installed. This is to increase the number of file formats Xapian will index. It should be able to index PDF, Word and RTF files in addition to plain-text files.
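Before installing anything, it can help to see which file formats actually live under the path you plan to index. The following is a minimal sketch, not part of the original setup: count_by_ext is a hypothetical helper name, and it simply tallies files by lowercase extension.

```shell
#!/bin/sh
# count_by_ext DIR
# Hypothetical helper: tally the files under DIR by (lowercased) extension,
# most common first. Files without a dot in their name are skipped.
count_by_ext() {
    find "$1" -type f -name '*.*' \
        | sed 's/.*\.//' \
        | tr 'A-Z' 'a-z' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Running count_by_ext /path/to/something prints one line per extension with its file count, which gives a rough idea of which of the converters below you really need.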
Let’s get started.
Note: If you don’t have the ports-tree installed (/usr/ports), you can download it by simply running:
portsnap fetch extract
Install Apache
cd /usr/ports/www/apache22
make install
printf '\napache22_enable="YES"\n' >> /etc/rc.conf
Install Xapian with Xapian-Omega
cd /usr/ports/www/xapian-omega
make install
Install Xpdf
Make sure to uncheck X11 and DRAW
cd /usr/ports/graphics/xpdf
make install
Install Catdoc
Uncheck WORDVIEW
cd /usr/ports/textproc/catdoc
make install
Install Unzip
cd /usr/ports/archivers/unzip
make install
Install Gzip
cd /usr/ports/archivers/gzip
make install
Install Antiword
cd /usr/ports/textproc/antiword
make install
Install Unrtf
cd /usr/ports/textproc/unrtf
make install
Install Catdvi
cd /usr/ports/print/catdvi
make install
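The install steps above all follow the same pattern, so they could be sketched as a single loop. This is only a sketch: PORTSDIR and DRY_RUN are hypothetical variables introduced here, and by default the script only prints the commands it would run. The options dialogs (where X11, DRAW and WORDVIEW are unchecked) still appear when the ports are built for real.

```shell
#!/bin/sh
# Ports named in the steps above, in the same order.
PORTS="www/apache22 www/xapian-omega graphics/xpdf textproc/catdoc
archivers/unzip archivers/gzip textproc/antiword textproc/unrtf print/catdvi"

# Hypothetical knobs, not part of the ports system:
#   PORTSDIR - where the ports tree lives
#   DRY_RUN  - 1 prints the commands, anything else runs them
PORTSDIR="${PORTSDIR:-/usr/ports}"
DRY_RUN="${DRY_RUN:-1}"

install_ports() {
    for port in $PORTS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "cd $PORTSDIR/$port && make install"
        else
            # Run in a subshell so the working directory is not changed.
            (cd "$PORTSDIR/$port" && make install) || return 1
        fi
    done
}

install_ports
```

Run it once as-is to review the command list, then again with DRY_RUN=0 as root to build the ports.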
Next we need to edit Apache’s config-file (/usr/local/etc/apache22/httpd.conf)
Change:
ScriptAlias /cgi-bin/ "/usr/local/www/apache22/cgi-bin/"
Into:
ScriptAlias /cgi-bin/ "/usr/local/www/xapian-omega/cgi-bin/"
We also need to create a new config-file for Xapian. Create the file /usr/local/etc/apache22/Includes/xapian.conf:
Alias /something /path/to/something

<Directory "/path/to/something">
    Options Indexes
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

<Directory "/usr/local/www/xapian-omega/cgi-bin/">
    AllowOverride None
    Options None
    Order allow,deny
    Allow from all
</Directory>
With the Apache configuration done, let’s fire up Apache:
/usr/local/etc/rc.d/apache22 start
Create the holding directory
mkdir -p /usr/local/lib/omega/data/
Copy over the templates. For some reason FreeBSD doesn’t do this by default.
cp -rfv /usr/ports/www/xapian-omega/work/xapian-omega-*/templates /usr/local/lib/omega/
We also need to tell Xapian-Omega where to look for the files. Create the file /usr/local/www/xapian-omega/cgi-bin/omega.conf
# Directory containing Xapian databases:
database_dir /usr/local/lib/omega/data

# Directory containing OmegaScript templates:
template_dir /usr/local/lib/omega/templates

# Directory to write Omega logs to:
log_dir /var/log/omega

# Directory containing any cdb files for the $lookup OmegaScript command:
cdb_dir /var/lib/omega/cdb
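The config above names a few directories (such as /var/log/omega) that the earlier steps never create. I am assuming here that Omega does not create them on its own, so a small helper can make sure they exist. The function names omega_conf_dirs and ensure_omega_dirs are hypothetical, and this is only a sketch.

```shell
#!/bin/sh
# omega_conf_dirs CONF
# Print the paths assigned to database_dir, template_dir, log_dir and
# cdb_dir in an omega.conf. Comment lines are ignored because their
# first field is "#".
omega_conf_dirs() {
    awk '$1 ~ /^(database|template|log|cdb)_dir$/ { print $2 }' "$1"
}

# ensure_omega_dirs CONF
# Create any of those directories that do not exist yet.
ensure_omega_dirs() {
    omega_conf_dirs "$1" | while read -r dir; do
        [ -d "$dir" ] || mkdir -p "$dir"
    done
}
```

With the functions loaded, ensure_omega_dirs /usr/local/www/xapian-omega/cgi-bin/omega.conf creates whatever is missing. The log directory may also need to be writable by the user Apache runs as, which this sketch does not handle.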
Create a search page. I’ll just use index.html in Apache’s default DocumentRoot (/usr/local/www/apache22/data/index.html).
<html>
<head>
<title>Intranet Search</title>
</head>
<body bgcolor="#ffffff">
<FORM NAME=P METHOD=GET ACTION="/cgi-bin/omega" TARGET="_top">
<center>
<INPUT NAME=P VALUE="" SIZE=65>
<INPUT TYPE=SUBMIT VALUE="Search">
<hr>
<INPUT TYPE=radio NAME=DEFAULTOP VALUE=or > Match any word
<INPUT TYPE=radio NAME=DEFAULTOP VALUE=and CHECKED> Match all words
</center><br>
<INPUT TYPE=hidden NAME=DB VALUE="default">
<INPUT TYPE=hidden NAME=FMT VALUE="query">
<INPUT TYPE=hidden NAME=xDB VALUE="default">
<INPUT TYPE=hidden NAME=xFILTERS VALUE="--O">
</FORM>
<hr>
</body>
</html>
Lastly, let’s try it by hand. Run:
/usr/local/bin/omindex --db /usr/local/lib/omega/data/default --url /something /path/to/something --depth-limit=0
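Once the index run has finished, the search can also be checked from the shell, since the search form just sends its fields as GET parameters to /cgi-bin/omega. Here is a rough sketch that builds such a URL from the P, DEFAULTOP and DB fields used in the form. The helper name omega_url is hypothetical, and it does no percent-encoding, so it only suits plain words.

```shell
#!/bin/sh
# omega_url HOST WORD [WORD...]
# Hypothetical helper: build the URL the search form would request.
# Words are joined with '+', the usual form encoding for a space.
omega_url() {
    host=$1
    shift
    query=$(printf '%s+' "$@")
    query=${query%+}    # drop the trailing '+'
    echo "http://$host/cgi-bin/omega?P=$query&DEFAULTOP=and&DB=default"
}
```

On FreeBSD the result can be passed to fetch, for example fetch -qo - "$(omega_url localhost annual report)", which should print the HTML of the result page if everything is wired up.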
Now fire up your browser and validate the result by surfing over to the IP address of the server. If that worked out well too, the last step is to add the command to crontab, so that the index refreshes automatically. In my case, once a day is enough, so the entry below refreshes the index at 1:15 AM every night.
Edit crontab (/etc/crontab)
15 1 * * * root /usr/local/bin/omindex --db /usr/local/lib/omega/data/default --url /something /path/to/something --depth-limit=0 > /var/log/index.log
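If indexing ever takes longer than the interval between cron runs, two omindex processes could end up working on the same database. One way to guard against that is a simple lock directory around the command. This is a sketch only: run_locked is a hypothetical helper, not something omindex or cron provides, and a lock left behind by a crashed run has to be removed by hand.

```shell
#!/bin/sh
# run_locked LOCKDIR COMMAND [ARGS...]
# Hypothetical helper: run COMMAND only if LOCKDIR can be created.
# mkdir is atomic, so only one caller at a time gets the lock.
run_locked() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another run holds $lockdir, skipping" >&2
        return 1
    fi
    "$@"
    status=$?
    rmdir "$lockdir"
    return $status
}
```

In a wrapper script called from crontab, the call might look like run_locked /var/run/omindex.lock /usr/local/bin/omindex --db /usr/local/lib/omega/data/default --url /something /path/to/something --depth-limit=0, where the lock path is my own choice.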
That’s it. Good luck!


