Question : urllib.py / urllib2.py errors

Hi, I am using code like the snippet below to crawl Google Groups, but I get the error shown. Is there any way to solve this?

>>> from searchengineWithoutTry import *
>>> pagelist=['http://groups.google.com/groups?as_q=&num=100&scoring=r&hl=en&as_epq=&as_oq=japan+tokyo&as_eq=&as_ugroup=&as_usubject=&as_uauthors=&lr=lang_en&as_drrb=q&as_qdr=&as_mind=1&as_minm=1&as_miny=1981&as_max' ]
>>> webcrawler = crawler("test.db")
>>> webcrawler.createindextables()
>>> webcrawler.crawl(pagelist)

Traceback (most recent call last):
  File "", line 1, in
    webcrawler.crawl(pagelist)
  File "C:\Python25\searchengineWithoutTry.py", line 487, in crawl
    c=urllib2.urlopen(page)
  File "C:\Python25\lib\urllib2.py", line 121, in urlopen
    return _opener.open(url, data)
  File "C:\Python25\lib\urllib2.py", line 380, in open
    response = meth(req, response)
  File "C:\Python25\lib\urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python25\lib\urllib2.py", line 418, in error
    return self._call_chain(*args)
  File "C:\Python25\lib\urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "C:\Python25\lib\urllib2.py", line 499, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
Code Snippet:
#this line is line 481 in searchengineWithoutTry.py
#(requires at the top of the module: import urllib2;
# from urlparse import urljoin; from BeautifulSoup import BeautifulSoup)
#The try/except blocks are shown restored here; they were commented out
#in the run above, which is why the HTTPError propagated to the prompt.
def crawl(self,pages,depth=2):
    # Breadth-first crawl: start from the given pages and follow
    # outbound links for `depth` levels.
    for i in range(depth):
      newpages={}
      for page in pages:
        try:
          c=urllib2.urlopen(page)
        except:
          print "Could not open %s" % page
          continue
        try:
          soup=BeautifulSoup(c.read())
          self.addtoindex(page,soup)

          links=soup('a')
          for link in links:
            if ('href' in dict(link.attrs)):
              url=urljoin(page,link['href'])
              if url.find("'")!=-1: continue
              url=url.split('#')[0]  # remove the fragment portion
              if url[0:4]=='http' and not self.isindexed(url):
                newpages[url]=1
              linkText=self.gettextonly(link)
              self.addlinkref(page,url,linkText)

          self.dbcommit()
        except Exception, e:
          print "Could not parse page %s" % page
          print 'The exception is: ', e

      pages=newpages

Answer : urllib.py / urllib2.py errors

There is a way to fool Google into thinking your program is a web browser rather than a spider, though I don't know its legal status; Google probably has the right to kick you out of all their services if they catch you doing it. That said, it's common practice. You set urllib2's "User-Agent" header to look like a browser rather than a spider. You can see how to do it here:

  http://www.voidspace.org.uk/python/articles/urllib2.shtml#headers

If Google asks, I didn't tell you how to do that.  8-)
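
For concreteness, here is a minimal sketch of that approach in Python 2, matching the urllib2 code above. The query URL and the User-Agent string are just illustrative stand-ins; any plausible browser string works the same way:

import urllib2

# Illustrative query URL; substitute your real Google Groups search.
url = 'http://groups.google.com/groups?as_q=japan+tokyo&hl=en'

# urllib2 identifies itself as "Python-urllib/2.5" by default, which some
# sites (Google included) reject with 403 Forbidden. Supplying a
# browser-like User-Agent header in the Request sidesteps that default.
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)'}
req = urllib2.Request(url, headers=headers)

c = urllib2.urlopen(req)   # urlopen accepts a Request object as well as a URL
print c.read()[:200]       # first 200 bytes, just to confirm the fetch worked

In your crawl method, you would build the same kind of Request for each page and pass it to urlopen, instead of calling urllib2.urlopen(page) directly.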