Link Extractors

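    Scrapy provides a built-in link extractor, available via from scrapy.linkextractors import LinkExtractor, but you can also create your own custom link extractor to suit your needs by implementing a simple interface.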

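    Link extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even if you do not subclass CrawlSpider, since their purpose is very simple: to extract links.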

    There used to be other link extractor classes in previous Scrapy versions, but they are deprecated now.

    class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

    LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml's robust HTMLParser.