Sunday, October 19, 2008

Fetch urls anonymously with Python urllib2 and Tor

For various purposes (web scraping, spiders, data extraction, etc.) you may need anonymity. The steps are:
1. Install Tor
2. Check that Tor is working
3. Write your script in Python
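import urllib2

# Route HTTP requests through the local HTTP proxy on port 8118.
# This assumes a proxy such as Privoxy (default port 8118) is forwarding to Tor.
proxy_support = urllib2.ProxyHandler({"http": "http://127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)

url = 'http://whatismyip.com/'
page = opener.open(url)
contents = page.read()
print contents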

And that is all.
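Step 2 (checking that the proxy is up) and the opener setup can be sketched for Python 3, where urllib2 was merged into urllib.request. This is only a rough sketch: the helper names proxy_is_listening and build_proxy_opener are made up for illustration, and the check only confirms that something accepts TCP connections on the port, not that traffic actually leaves through Tor.

```python
import socket
import urllib.request


def proxy_is_listening(host="127.0.0.1", port=8118, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: a successful connection only shows that a
    process is listening there, not that it forwards to Tor.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_proxy_opener(proxy_url="http://127.0.0.1:8118"):
    """Build an opener that sends http:// requests through proxy_url."""
    proxy_support = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(proxy_support)


# Possible usage (performs a real network request, so it is left commented out):
# if proxy_is_listening():
#     opener = build_proxy_opener()
#     page = opener.open("http://whatismyip.com/", timeout=30)
#     print(page.read().decode("utf-8", "replace"))
```

Checking the port first gives a clearer failure than waiting for the request itself to fail with a connection error.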

For web scraping you can use Beautiful Soup. Continuing the script above:
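import re
from BeautifulSoup import BeautifulSoup

# Parse the HTML already read into `contents`; `page` was consumed by page.read().
soup = BeautifulSoup(contents)
h1Tags = soup.findAll('h1')
# The IP address text is in the 2nd h1 tag; strip the markup and keep the text.
ip = re.sub(r'<[^>]*?>', '', str(h1Tags[1]))
print ip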
If you want more readable text, look at the page source: it contains a comment saying which URL you can access to get only the IP address.
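If Beautiful Soup is not installed, the same "text of the 2nd h1 tag" idea can be sketched with the standard library alone. The following is a Python 3 illustration, not the method used above: the class name H1TextCollector is hypothetical, and sample_html is made-up markup (with a documentation-range IP address), not the real page source.

```python
from html.parser import HTMLParser


class H1TextCollector(HTMLParser):
    """Collect the text content of every <h1> element, in document order."""

    def __init__(self):
        super().__init__()
        self.h1_texts = []
        self._inside_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._inside_h1 = True
            self.h1_texts.append("")  # start a new entry for this <h1>

    def handle_endtag(self, tag):
        if tag == "h1":
            self._inside_h1 = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) still belongs to the open <h1>.
        if self._inside_h1:
            self.h1_texts[-1] += data


# Made-up sample markup; the real page layout is an assumption.
sample_html = (
    "<html><body><h1>What Is My IP</h1>"
    "<h1>Your IP is <b>203.0.113.7</b></h1></body></html>"
)

collector = H1TextCollector()
collector.feed(sample_html)
collector.close()

second_h1 = collector.h1_texts[1].strip()
print(second_h1)
```

Because the parser collects text directly, no regular expression is needed to strip the tags afterwards.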
