Scrapy is really amazing. It is a very handy framework for scraping. In this guide, I will show how to create a spider that extracts content spread across multiple pages. I assume that you already know Scrapy and have worked through the official tutorial.
I also assume that you’re familiar with XPath; if not, please familiarize yourself with it on the w3schools site.
There are multiple ways to do this; however, I found that using a CrawlSpider is the fastest. A CrawlSpider keeps following links based on a set of rules. In this guide I will cover the spider part only, since I believe defining the item is very straightforward.
Let’s assume we have a site called example.com. Example.com lists 10 items (say, articles) per page, with a next link at the bottom, except on the last page of course. We want to scrape article titles across all pages. Below is the HTML of the pages we want to scrape
Ready? Let’s start writing our spider. First we should import the following:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scraper.items import MyItem ##This is the item I defined in items.py
Then create your spider class, which inherits from CrawlSpider, and define its main attributes:
class MultiPagesSpider(CrawlSpider):
    domain_name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
So far our spider is just like the spider covered in the official tutorial, with one exception: our spider inherits from CrawlSpider and not BaseSpider.
Next, define the rules. rules is a tuple of Rule objects that defines which links to follow while scraping. The 3 main parts of each rule are:
- SgmlLinkExtractor: this is where you define the links you want the spider to follow. Its two main arguments are:
  - allow: a regular expression that the link's href must match
  - restrict_xpaths: restricts link following to the region selected by a certain XPath. In our example we want the spider to follow the next link, which resides in the div with id "pg". In our example we might not need to define it, as we have only one matching link. However, if the page has multiple links that match allow and this value is not defined, the spider will follow all of them!
- callback: the callback method called with the response of each page reached through an extracted link
- follow: instructs the spider to keep extracting links from the pages it reaches. The spider keeps following the link defined in the SgmlLinkExtractor as long as it exists!
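To make the roles of allow, restrict_xpaths and follow more concrete, here is a toy, standard-library-only sketch of the same idea. It is not how Scrapy implements link extraction; the `page` helper, the `PAGES` dictionary, the URLs and the "nav" div are all hypothetical and exist only for this illustration. Unlike a CrawlSpider rule callback, this loop also collects titles from the start page.

```python
import re
import xml.etree.ElementTree as ET


def page(titles, next_href=None):
    """Build a tiny, well-formed, hypothetical page with an optional next link."""
    items = "".join(f"<li>{t}</li>" for t in titles)
    nxt = f'<a href="{next_href}">next</a>' if next_href else ""
    return (
        f"<html><body><ul>{items}</ul>"
        f'<div id="nav"><a href="about.html">about</a></div>'
        f'<div id="pg">{nxt}</div></body></html>'
    )


# Hypothetical in-memory "site": three pages chained by a next link.
PAGES = {
    "index.html": page(["Article 1", "Article 2"], "2/next.html"),
    "2/next.html": page(["Article 3", "Article 4"], "3/next.html"),
    "3/next.html": page(["Article 5"]),  # last page: no next link
}

ALLOW = re.compile(r"next\.html")  # plays the role of allow


def extract_links(root):
    """Return hrefs found inside <div id="pg"> that match ALLOW."""
    links = []
    # Looking only inside div#pg plays the role of restrict_xpaths.
    for div in root.findall(".//div[@id='pg']"):
        for a in div.iter("a"):
            href = a.get("href", "")
            if ALLOW.search(href):
                links.append(href)
    return links


def crawl(start):
    """Collect <li> titles, following the next link while it exists."""
    titles, queue, seen = [], [start], set()
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        root = ET.fromstring(PAGES[url])
        titles.extend(li.text for li in root.iter("li"))  # the "callback"
        queue.extend(extract_links(root))  # the "follow" part
    return titles


titles = crawl("index.html")
print(titles)
```

Note that the "about" link is never visited: it sits outside the "pg" div and does not match the allow pattern, so it is filtered out twice over.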
The rest of the spider code should look something like below:
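    rules = (
        Rule(
            SgmlLinkExtractor(allow=(r'next\.html',), restrict_xpaths=('//div[@id="pg"]',)),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # The line that builds `items` is missing from this listing; the XPath
        # below is only a placeholder, adjust it to your page's markup.
        items = hxs.select('//ul')
        for item in items:
            scraped_item = MyItem()  # this is the item object you defined in the items file
            scraped_item["title"] = item.select('li/text()').extract()  # assuming your item object has a "title" field
            yield scraped_item  # hand the item over to the pipeline

spider = MultiPagesSpider()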
That’s it! We’re done. Now you can move to your pipeline and write the code to handle scraped items! Happy scraping!
23 thoughts on “GUIDE: Scrape multi-pages content with Scrapy”
I truly knew about most of this, but that being said, I still believed it had been useful. Nice post!
Others might not know about this 🙂. Thanks for your comment
thanks for your post. I have always missed the new instance of the spider on the end… 🙂
You’re welcome Daniel
You can improve your code:
Wow – exactly what I was looking for – Thank you!
Very nice tuto.
Can we add another layer:
instead of … we have … to give as the page where are our items.
Is it possible with CrawlSpider, or do we have to write our own spider?
Thx (can you please tell me how to do that? I am trying to scrape something like Google search results with multiple pages)
I don’t seem to get your point. Can you please elaborate?
Thank you for replying, here is my point:
The first page does not have the data to scrape; it just has links to detail pages,
and you have to do the following:
1- Find links in the search page
2- Follow them to the detail pages and do the scraping job
3- Once done, go to the next search page: the “next >>” link
4- Redo it all over again
I hope my idea is more clear now 🙂
Sorry, I mean \ with \
What if there is more than one link on a page and I want to follow all of them, and the links on the subsequent pages as well, without revisiting links already visited?
How can I do that? This approach is not helping me do that.
You can define multiple rules. For each rule you can define a different call back. Check this link for more details
Hello, I check your blogs daily. Your writing style is awesome, keep
doing what you’re doing!
Thanks a lot
It is very clear .. it saved me a lot of time and gave me a clear view of CrawlSpider .. thanks a lot
I’m a scraper, scraping the web using core PHP and .NET… now I’m going to try Scrapy for scraping the web… nice post, thanks for sharing…
Thanks for the information. Correction: the latest version of Scrapy insists on getting rid of SgmlLinkExtractor (which is being deprecated) and using the LinkExtractor class instead.
Thanks buddy. I haven’t kept myself updated with Scrapy for a while 🙂.
How can I scrape a page like, for example, domain.com?search=WORDS_TO_SEARCH
and return the data for each search?
Haven’t played with Scrapy for a while. Sorry, cannot help 🙂