
An Official Gippy Endorsement

February 15, 2005 | Web Design & Development

I’ve been testing out many a search script to find something that will run fast, comprehensively index pages, and be highly customizable, because I sure as hell wasn’t going to write one myself. My search took about 36 hours dedicated to installing and running different search applications. I finally found what I was looking for in a free program called htdig.

Even though I haven’t yet put it into production, htdig has proven to be everything I was looking for in a search script. It will spider a site, it respects robots.txt files, and you can specify areas of pages that are not to be indexed (handy for excluding common navigational elements – be careful, though: it won’t follow the links in these areas either, so be sure that anything you comment out can still be reached through a sitemap or something). It will also display results by relevance and apply a five-star rank to listings.

The script is by no means a simple setup, as it requires compiling on the server. That can be a major drawback if your host doesn’t already have it installed and won’t let you compile applications on the server. Fortunately, the host we use at work, Pair Networks, allows it. But once configured it is a very powerful program.

The coolest part is that I didn’t have to break my templates to run it. With a little creative use of PHP you can run the htsearch script and pull the results into an array with the exec command.

I’m not up to writing a full tutorial right now (though if anyone is interested I can write one up later), but the gist of it is that the search script, when run by the exec command, can be configured to return the results (page title, URL, description, rank, etc.) line by line into an array that you can then output in any way that fits the site structure.
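
For the curious, here’s roughly what that looks like. Consider it a sketch of my particular setup rather than a recipe: the htsearch path, the “mysite” config name, the pipe-delimited one-line-per-result template, and the way I pass the query string on the command line are all things I configured on our server, so you’d have to adjust them for your own install.

    <?php
    // Sketch of pulling htsearch results into PHP via exec().
    // Assumptions: htsearch was compiled to the path below, a config file
    // named "mysite" exists, and its result template has been trimmed down
    // to print one pipe-delimited line per match (title|url|excerpt|score).
    $words = isset($_GET['q']) ? $_GET['q'] : '';
    $query = escapeshellarg('config=mysite&words=' . urlencode($words));

    $output = array();
    exec('/usr/local/htdig/bin/htsearch ' . $query, $output);

    $results = array();
    foreach ($output as $line) {
        $parts = explode('|', trim($line));
        if (count($parts) == 4) {
            $results[] = array(
                'title'   => $parts[0],
                'url'     => $parts[1],
                'excerpt' => $parts[2],
                'score'   => $parts[3],
            );
        }
    }

    // $results can now be looped over and dropped into whatever template
    // structure the site already uses.
    ?>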

The only time it gets ugly in my setup is when I have to go to multiple pages of results. I like to keep a clean URL structure and don’t like to pass any $_GET vars. Well, there was no way around it this time, so I had to set up the multiple-page links with $_GET vars rather than create ten forms on a page just to keep a clean URL. I can live with this in a search script, especially since it makes each page of search results bookmarkable.
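
If it helps anyone picture it, the paging links themselves are nothing fancier than plain anchors that carry the search terms and page number as $_GET vars – something along these lines, where the script name and parameter names are just placeholders from my setup:

    <?php
    // Sketch of the paging links: plain anchors carrying the search terms
    // and the page number as $_GET vars, which keeps every results page
    // bookmarkable without needing a form for each page.
    $q = isset($_GET['q']) ? $_GET['q'] : '';
    $total_pages = 5; // however many pages of results the search reported

    for ($page = 1; $page <= $total_pages; $page++) {
        echo '<a href="/search.php?q=' . urlencode($q) . '&amp;page=' . $page . '">'
           . $page . '</a> ';
    }
    ?>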

I’ll hopefully have this in production within a week. We need to get some more comprehensive meta tags on our website now that I have a direct boss who understands their importance and can dedicate the appropriate time to it.

Next project: registration system for events and webinars…

10 Responses


  • Not that I’ve used either, but I was wondering how it might compare to phpdig? That is, assuming that you tried that in your search for a search.

    This comment could get very corny very quickly, so I’ll just leave it at that.

    Pennypacker, February 16, 2005 1:40 pm | permalink

  • phpDig is one that I tried and found that it could not rank results by relevance. It surely was fast and easy to implement but the results just left a lot to be desired.

    After looking around a bit I also found that most of the scripts I found recommendations for around the net were Perl scripts. This one was the one I liked best, and it was actually halfway recommended by our host.

    All the PHP-only options left a lot to be desired. Another benefit of htdig is that it can also do fuzzy searches. I haven’t played around with that yet but plan to look into it soon.

    shawn, February 16, 2005 9:02 pm | permalink

  • It was also a boon that this was in the Portage tree – I could test it very easily by just typing:

    emerge htdig

    and then letting Gentoo do the rest. The production install won’t be that easy but at least testing it was a snap.

    shawn, February 16, 2005 9:03 pm | permalink

  • OK, I take back one thing I said about phpDig – it can rank by relevance. I could not go with it because it required a server setting that our host does not enable: allow_url_fopen.

    I found my notes on the packages I tested and they are as follows:

    Google API: Thorough results, crawler, page rank. Can’t schedule updates or specify areas of page to leave out.

    risearch_php: No relevancy results. Very fast. Pro version apparently sorts by relevancy but failed to run on my system.

    tsep: No crawler. Test not completed.

    phpDig: Slow, inefficient crawler (omitted random pages), thorough results filtering and page rank. Can’t use with current server config.

    iSearch: Fast, decent results. Customization sucks.

    FastFind: Fast, good admin interface, easy install. Results lacking in relevancy. Customization sucks.

    Sphider: Fast, decent results, decent customization. Spider ran into infinite loop on some pages killing processing time and results.

    The next one I tried was htDig, and it was everything I had hoped for. It was also semi-supported by my host (they allowed it to be installed even though it needed to be compiled for the system).

    shawn, February 17, 2005 9:27 am | permalink

  • In regard to Google, isn’t it possible to specify how often Google revisits? I don’t remember, and I’m too lazy to look it up.

    But regardless, I don’t think it’s ever a good idea to rely on an outside resource to make a site work, be it google, mapquest, or a hotlinked background image. That’s not to say I haven’t done it and won’t do it again though.

    I’ll look into these search engines myself as well because I’m curious, I must admit, and while I take your recommendations as first rate, there’s nothing like first hand experience to really get a feel for something.

    I doubt I’ll be as thorough as you were, but I am interested in looking at your top picks. Thanks for the good info!

    John Pennypacker, February 19, 2005 7:33 am | permalink

  • Yeah, if Google could be scheduled it would have been pretty near perfect. They already take navbars and such into account when they crawl a site, so the second part of my beef with them, not being able to designate "no crawl" zones, is moot.

    shawn, February 19, 2005 7:50 am | permalink

  • I did a little digging to satisfy my curiosity, and it was meta tags that I was thinking of.

    Linkie Poo

    There’s some info. Happily, google obeys a good many of them, and there’s info about revisit intervals too.

    But in terms of setting ‘no-crawl’ zones, that can be set in robots.txt like so:

     User-Agent: *
     Disallow: /cgi-bin/
     Disallow: /secretdirectory/
     Disallow: /anothersecretdirectory/

    But then, that’s just google. Not all spider bots are as compliant. And again, providing your own spider that runs on your own server is always a better option.

    John Pennypacker, February 19, 2005 8:05 am | permalink

  • Linkie-poo. Nice!

    John Pennypacker, February 19, 2005 8:06 am | permalink

  • Pretty much all of the above software is robots.txt-aware – but some of them offer a way of excluding common page elements from the search results.

    Example – some of them would see a nav bar and return that as part of the search result if your search term included that text. htDig offers a way to put special comment tags around navbars and breadcrumb bars so that they don’t affect search results. I don’t think Google does that, but then, it seems that Google has something like that built into its logic.

    shawn, February 19, 2005 8:29 am | permalink

  • he he, it stripped the examples… I think I need to edit my code a bit…

    it goes a little something like this:

     <!--htdig_noindex-->
        Text to be ignored
     <!--/htdig_noindex-->

    shawn, February 19, 2005 9:38 am | permalink

Comments are closed