Apache Nutch, a subproject of Apache Lucene, is open source web-search software. It builds on Lucene Java, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats.
I have been waiting for this release for a long time as I made some contributions to this project. These were also my first contributions to an open source project. In 2007, when I was playing with the search engine at Infosys (my previous employer), I found a few things that could be fixed and improved. I submitted these fixes and enhancements to the Nutch community and they were committed to the Subversion repository. Let me list my contributions from the CHANGES.txt file.
Apache Nutch 1.0 contains a number of bug fixes and improvements such as Solr integration, a new indexing framework and a new scoring framework, to mention just a few. Details can be found in the changes file:
http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0/CHANGES.txt
Apache Nutch is available for download from the following download page:
http://www.apache.org/dyn/closer.cgi/lucene/nutch/nutch-1.0.tar.gz
NUTCH-559 was my major contribution. I worked on it when I found that Nutch was unable to authenticate itself to intranet sites protected with the NTLM authentication scheme. I modified the module that deals with the HTTP protocol so that it could authenticate itself with configured credentials when challenged for authentication. While developing this, I also added support for the Basic and Digest authentication schemes. More details can be found in NUTCH-559 (JIRA) and the Nutch wiki entry on HTTP authentication schemes.
62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
server. (Susam Pal via dogacan)
77. NUTCH-44 - Too many search results, limits max results returned from a
single search. (Emilijan Mirceski and Susam Pal via kubes)
80. NUTCH-612 - URL filtering was disabled in Generator when invoked
from Crawl (Susam Pal via ab)
81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
NUTCH-44 and NUTCH-612 were bug fixes. NUTCH-601 removed a minor irritant. In the days of Nutch 0.9, the crawler complained if a directory named 'crawl' already existed in the current directory. As a result, before beginning a re-crawl using the bin/nutch crawl command, we had to move the existing crawl directory to another location. After a discussion in the community, we agreed that it was better to avoid shuffling the crawl directories by allowing re-crawls on the same directory. The change was made and committed.
The Nutch users' mailing list has often received mail from users who wanted to know how to enable support for authentication schemes in Nutch 0.9 by applying the patch in NUTCH-559. Patching Nutch 0.9 was a little cumbersome because the patch was generated against the trunk. With this release, users can simply download Nutch 1.0 and configure the authentication schemes.

3 comments:
Congrats and thanx Sir!
I'm sure it would have been a long wait for a lot of users for configuring their authentication schemes.
Hey Dude...
Finally caught up with your blog. Nice posts, quite informative. Have started following your blog.
Cheers!!
Paritosh
Congrats! Good to see your work at Infosys being useful to the open source community.