Web crawler - Regarding crawling of short URLs using Nutch
I am using the Nutch crawler for an application that needs to crawl a set of URLs which I give in a urls directory, and fetch the contents of those URLs only. I am not interested in the contents of internal or external links, so I have run the crawl command with a depth of 1:
bin/nutch crawl urls -dir crawl -depth 1
Nutch crawls the URLs and gives me the contents of the given URLs. I am reading the content using the readseg utility:
bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata
With this I am able to fetch the content of the webpages.
The problem I am facing is this: if I give direct URLs such as
http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/
then I am able to get the contents of the webpages. But when I give a set of short URLs such as
http://is.gd/jooaa9
http://is.gd/ubhraf
http://is.gd/gifqj9
http://is.gd/h5ruhg
http://is.gd/wvkinl
http://is.gd/k6jtnl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uzf
then I am not able to fetch the contents. When I read the segments, they do not show the content. Please find below the contents of the dump file produced by reading the segments:
Recno:: 0
URL:: http://is.gd/0ykjo6

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407

Content::
Version: -1
url: http://is.gd/0ykjo6
base: http://is.gd/0ykjo6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=utf-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:

Recno:: 1
URL:: http://is.gd/1tpkan

Content::
Version: -1
url: http://is.gd/1tpkan
base: http://is.gd/1tpkan
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=utf-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
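Note that in the dump above each record has status db_unfetched and an empty Content: section, but the metadata does carry a Location= header, i.e. Nutch recorded where the short URL redirects to without following it. As a quick diagnostic, the redirect targets can be pulled out of the dump text with a small script; this is just a hypothetical helper (`redirect_targets` is not part of Nutch), assuming the metadata packs space-separated key=value pairs as shown above:

```python
import re

def redirect_targets(dump_text):
    """Extract the Location= redirect targets recorded in a readseg dump.

    The metadata line packs key=value pairs separated by spaces, so we
    grab everything after 'Location=' up to the next whitespace.
    """
    return re.findall(r"Location=(\S+)", dump_text, re.IGNORECASE)

# Sample metadata line taken from the dump above
sample = (
    "metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 "
    "Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 "
    "_fst_=36 nutch.segment.name=20110125205614"
)
print(redirect_targets(sample))
```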
I have tried setting the max.redirects property in nutch-default.xml to 4, but I did not see any progress. Kindly provide me a solution to this problem.
Thanks and regards,
Arjun Kumar Reddy
Since you are using Nutch 1.2, try editing the file conf/nutch-default.xml: find http.redirect.max and change its value to at least 1 instead of the default 0.
<property>
  <name>http.redirect.max</name>
  <value>2</value><!-- instead of 0 -->
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, the fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
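As a side note, rather than editing nutch-default.xml directly, it is generally safer to put the override in conf/nutch-site.xml, whose values take precedence over nutch-default.xml and survive upgrades. A minimal sketch of that file, assuming the same redirect limit of 2:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Override: follow up to 2 redirects so that URL shorteners
       like is.gd resolve to their target page during the fetch. -->
  <property>
    <name>http.redirect.max</name>
    <value>2</value>
  </property>
</configuration>
```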
Good luck!