Link Extractors
The __init__ method of LxmlLinkExtractor takes settings that determine which links may be extracted. LxmlLinkExtractor.extract_links returns a list of matching Link objects from a Response object.
Link extractors are used in CrawlSpider spiders through a set of Rule objects.
You can also use link extractors in regular spiders. For example, you can instantiate LinkExtractor as a class variable in your spider and use it from your spider callbacks.
The link extractor class is scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor. For convenience it can also be imported as scrapy.linkextractors.LinkExtractor.
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True)
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.
Parameters
allow (str or list) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
deny (str or list) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won’t exclude any links.
allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links.
deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it defaults to scrapy.linkextractors.IGNORED_EXTENSIONS.
Changed in version 2.0: IGNORED_EXTENSIONS now includes 7z, 7zip, apk, bz2, cdr, dmg, ico, iso, tar, tar.gz, webm, and xz.
restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
restrict_text (str or list) – a single regular expression (or list of regular expressions) that the link’s text must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one.
tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
canonicalize (bool) – canonicalize each extracted URL (using w3lib.url.canonicalize_url). Defaults to False. Note that canonicalize_url is meant for duplicate checking; it can change the URL visible at server side, so the response can be different for requests with canonicalized and raw URLs. If you’re using LinkExtractor to follow links, it is more robust to keep the default canonicalize=False.
process_value (collections.abc.Callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
For example, to extract links from this code:
You can use the following function in process_value:
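As a hedged illustration of this hook, assume the markup wraps the real target in a JavaScript call, for instance <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>. The goToPage name, the sample path, and the handling of other values are assumptions made for this sketch. A function along these lines could then be passed as process_value:

```python
import re

# Assumed attribute shape: javascript:goToPage('<path>'); return false
_GOTO_PAGE = re.compile(r"javascript:goToPage\('(.*?)'")


def process_value(value):
    """Unwrap goToPage('<path>') values; drop other javascript: pseudo-links."""
    match = _GOTO_PAGE.search(value)
    if match:
        # Return the path found inside the JavaScript call.
        return match.group(1)
    if value.startswith("javascript:"):
        # Returning None tells the extractor to ignore this link altogether.
        return None
    # Ordinary attribute values pass through unchanged.
    return value
```

It would be wired in with LinkExtractor(process_value=process_value). Returning the value unchanged for ordinary links is a design choice of this sketch; a function that returns None for every non-matching value would discard those links instead.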
extract_links(response)
Returns a list of Link objects from the specified response.
Only links that match the settings passed to the __init__ method of the link extractor are returned. Duplicate links are omitted.
Link
class scrapy.link.Link(url, text='', fragment='', nofollow=False)
Link objects represent a link extracted by the LinkExtractor.
The anchor tag sample below illustrates the parameters:
<a href="https://example.com/nofollow.html#foo" rel="nofollow">Dont follow this one</a>
Parameters
url – the absolute URL being linked to in the anchor tag. From the sample, this is https://example.com/nofollow.html.
text – the text in the anchor tag. From the sample, this is Dont follow this one.
fragment – the part of the URL after the hash symbol. From the sample, this is foo.