Web crawler - Regarding crawling of short URLs using Nutch
I am using the Nutch crawler for an application that needs to crawl a set of URLs which I give in a urls directory, and fetch the contents of those URLs only. I am not interested in the contents of internal or external links, so I have run the crawl command with a depth of 1:
bin/nutch crawl urls -dir crawl -depth 1
Nutch crawls the URLs and gives me the contents of the given URLs. I am reading the content using the readseg utility:
bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata
With this I am able to fetch the content of the webpages.
The problem I am facing is this: if I give direct URLs such as
http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/
then I am able to get the contents of the webpages. But when I give a set of short URLs such as
http://is.gd/jooaa9
http://is.gd/ubhraf
http://is.gd/gifqj9
http://is.gd/h5ruhg
http://is.gd/wvkinl
http://is.gd/k6jtnl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uzf
then I am not able to fetch the contents. When I read the segments, they do not show the content. Please find below the contents of the dump file produced by reading the segments:
Recno:: 0
URL:: http://is.gd/0ykjo6

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407

Content::
Version: -1
url: http://is.gd/0ykjo6
base: http://is.gd/0ykjo6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=utf-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:

Recno:: 1
URL:: http://is.gd/1tpkan

Content::
Version: -1
url: http://is.gd/1tpkan
base: http://is.gd/1tpkan
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=utf-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
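Note that in the dump above each record has status db_unfetched and an empty Content: section, but the metadata does carry a Location= header, i.e. Nutch recorded where the short URL redirects to without following it. As a quick diagnostic, the redirect targets can be pulled out of the dump text with a small script; this is just a hypothetical helper (`redirect_targets` is not part of Nutch), assuming the metadata packs space-separated key=value pairs as shown above:

```python
import re

def redirect_targets(dump_text):
    """Extract the Location= redirect targets recorded in a readseg dump.

    The metadata line packs key=value pairs separated by spaces, so we
    grab everything after 'Location=' up to the next whitespace.
    """
    return re.findall(r"Location=(\S+)", dump_text, re.IGNORECASE)

# Sample metadata line taken from the dump above
sample = (
    "metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 "
    "Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 "
    "_fst_=36 nutch.segment.name=20110125205614"
)
print(redirect_targets(sample))
```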
I have tried setting the max.redirects property in nutch-default.xml to 4, but I did not see any progress. Kindly provide me a solution to this problem.
Thanks and regards,
Arjun Kumar Reddy
Since you are using Nutch 1.2, try editing the file conf/nutch-default.xml: find http.redirect.max and change its value to at least 1 instead of the default 0.
<property>
  <name>http.redirect.max</name>
  <value>2</value><!-- instead of 0 -->
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, the fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
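As a side note, rather than editing nutch-default.xml directly, it is generally safer to put the override in conf/nutch-site.xml, whose values take precedence over nutch-default.xml and survive upgrades. A minimal sketch of that file, assuming the same redirect limit of 2:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Override: follow up to 2 redirects so that URL shorteners
       like is.gd resolve to their target page during the fetch. -->
  <property>
    <name>http.redirect.max</name>
    <value>2</value>
  </property>
</configuration>
```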
Good luck!