Scrapy is really amazing. It is a very handy framework for scraping. In this guide, I will illustrate how to create a spider that extracts content spread across multiple pages. I assume that you already know Scrapy and have covered the official tutorial.
I also assume that you're familiar with XPath; if not, please get yourself familiar with it on the w3schools site.
There are multiple ways to do this; however, I found using a CrawlSpider to be the fastest. A CrawlSpider keeps following links based on a set of rules. In this guide I will cover the spider part only, since I believe defining the item is very straightforward.
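For reference, here is roughly what items.py could look like for this guide: a single item class with a title field. MyItem is simply the name assumed by the import further down; rename it to whatever fits your project.

# items.py -- a minimal sketch with the single "title" field used in this guide
from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()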
Let's assume we have a site called example.com. Example.com lists 10 items per page (say, articles), with a next link at the bottom, except of course on the last page. We want to scrape the article titles across all pages. Below is the kind of HTML we want to scrape.
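The exact markup isn't important; assume something along these lines, where the article titles live in li elements and the next link sits inside a div with id "pg" (this is what the XPath expressions later in the guide expect):

<!-- assumed page structure, for illustration only -->
<ul>
  <li>First article title</li>
  <li>Second article title</li>
  <li>...</li>
</ul>
<div id="pg">
  <a href="next.html">next</a>
</div>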
Ready? Let's start writing our spider. First, we should import the following:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scraper.items import MyItem  # this is the item I defined in items.py
Then create your spider class, which inherits from CrawlSpider, and define its main attributes:
class MultiPagesSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
So far our spider is just like the spider covered in the official tutorial, with one exception: our spider inherits from CrawlSpider and not BaseSpider.
Next, define the rules. rules is a tuple that defines which links to follow while scraping. The three main attributes of each Rule are:
- SgmlLinkExtractor: this is where you define the links you want the spider to follow
  - allow: a regular expression (or a list of them) that the link URL must match in order to be followed
  - restrict_xpaths: restricts link extraction to a certain XPath. In our example we want the spider to follow the next link, which resides in the div with id "pg". We might not need to define it here, as we have only one link; however, if the page has multiple links and this value is not defined, the spider will follow all of them!
- callback: the callback function to be called for each scraped page
- follow: instructs the spider to keep following the link through the pages. The spider keeps following the link defined in the SgmlLinkExtractor as long as it exists!
The rest of the spider code should look something like this:
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'next\.html',), restrict_xpaths=('//div[@id="pg"]',)),
             callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = hxs.select('//ul/li')  # adjust this XPath to wherever the articles live in your markup
        for item in items:
            scraped_item = MyItem()  # this is the item object you defined in the items file
            scraped_item["title"] = item.select('text()').extract()  # assuming your item has a "title" field
            yield scraped_item
spider = MultiPagesSpider()
That's it! We're done. Now you can move to your pipeline and write the code to handle the scraped items! Happy scraping!
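If you want a starting point for that pipeline, a minimal sketch could look like the following; the class name and the print call are just placeholders for whatever handling you actually need (storing to a database, writing to a file, and so on).

# pipelines.py -- a minimal sketch; replace the body of process_item with your own handling
class MyPipeline(object):
    def process_item(self, item, spider):
        print(item['title'])   # e.g. just dump the scraped title
        return item            # returning the item passes it on to any further pipelines

Don't forget to enable the pipeline in settings.py via the ITEM_PIPELINES setting, then run the spider with scrapy crawl example.com (or whatever you set as the spider's name).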