Tool interfaces

This module defines interfaces for each tool or decision maker used in the scraping process. This makes it easy to test new approaches or to tune one aspect of the scraper while keeping most of the code unchanged.

See the swisstext.cmd.scraping.tools module for implementations.
class swisstext.cmd.scraping.interfaces.ICrawler

Bases: abc.ABC

[ABSTRACT] This tool is in charge of crawling a page. More specifically, it should be able to:

1. extract the text of the page (stripped of any HTML or other structural clues),
2. extract links pointing to other pages.
exception ICrawler.CrawlError(name='CrawlError', message='')

Bases: Exception

This wrapper should be used for any exception that arises during scraping.
class ICrawler.CrawlResults(text: str, links: List[str])

Bases: object

Holds the results of a page crawl.

ICrawler.CrawlResults.__init__(text: str, links: List[str])

Initialize self. See help(type(self)) for accurate signature.
ICrawler.CrawlResults.links = None

A list of interesting links found in the page. By interesting, we mean:

* no duplicates,
* different from the current page URL (no anchors!),
* if possible, no links pointing to unparseable resources (zip files, images, etc.).

The method swisstext.cmd.link_utils.filter_links() is available to do the filtering.
ICrawler.CrawlResults.text = None

The clean text found in the page, free of any structural markers such as HTML tags.
abstract ICrawler.crawl(url: str) → swisstext.cmd.scraping.interfaces.ICrawler.CrawlResults

[ABSTRACT] Should crawl the page and extract the text and the links into an ICrawler.CrawlResults instance.
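A minimal crawler sketch is shown below. It is an illustration only, not the bundled implementation: it assumes the third-party packages requests and beautifulsoup4 are available, and the way text and links are extracted is deliberately naive.

    # Hypothetical ICrawler implementation (sketch, not the bundled one).
    # Assumes requests and beautifulsoup4 are installed.
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    from swisstext.cmd.scraping.interfaces import ICrawler


    class SimpleCrawler(ICrawler):

        def crawl(self, url: str) -> ICrawler.CrawlResults:
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
            except Exception as e:
                # Wrap any failure in the dedicated exception type.
                raise ICrawler.CrawlError(name=type(e).__name__, message=str(e))

            soup = BeautifulSoup(response.text, 'html.parser')
            # Plain text, stripped of tags; a real implementation would be smarter.
            text = soup.get_text(separator='\n')
            # Absolute child URLs, without duplicates or same-page anchors.
            links = []
            for a in soup.find_all('a', href=True):
                link = urljoin(url, a['href']).split('#')[0]
                if link and link != url and link not in links:
                    links.append(link)
            return ICrawler.CrawlResults(text=text, links=links)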
class
swisstext.cmd.scraping.interfaces.
IDecider
[source]¶ Bases:
object
A decider should implement the logic behind whether or not a URL is considered interesting/should be crawled.
IDecider.should_children_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool

Decide if the links found on a page should be scraped on this run. Returns true by default.
IDecider.should_page_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool

Decide if a page should be scraped on this run. By default, it always returns true, but other rules could be devised based on the last crawl, etc.
IDecider.should_url_be_blacklisted(page: swisstext.cmd.scraping.data.Page) → bool

Decide if a URL/page should be blacklisted. The default implementation returns true if the URL has never been visited before and contains no Swiss German sentence. The "never visited" criterion is important to avoid blacklisting a page that changed over time but once contained interesting sentences (and thus might be referenced in the persistence layer).
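As an illustration, a decider could skip pages that were crawled recently. The sketch below assumes the Page object exposes a delta_date attribute holding the time elapsed since the last crawl; the actual attribute names on swisstext.cmd.scraping.data.Page may differ.

    # Hypothetical decider (sketch). The page.delta_date attribute used here
    # is an assumption, not part of the documented API above.
    from datetime import timedelta

    from swisstext.cmd.scraping.interfaces import IDecider


    class RecencyDecider(IDecider):

        def __init__(self, min_age=timedelta(days=7)):
            self.min_age = min_age

        def should_page_be_crawled(self, page) -> bool:
            # Re-crawl only pages we have not visited for at least min_age.
            delta = getattr(page, 'delta_date', None)  # assumed attribute
            return delta is None or delta >= self.min_age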
class swisstext.cmd.scraping.interfaces.INormalizer

Bases: object

A normalizer should take a [long] text (extracted from a web page) and clean it in a consistent way, for example: uncurling quotes, normalizing punctuation, fixing unicode, etc. The default implementation just returns the text as-is.
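A normalizer along these lines could apply unicode normalization and uncurl quotes, as in the sketch below. The method name normalize() is an assumption, since the interface's entry point is not shown in this excerpt.

    # Hypothetical normalizer (sketch). The normalize() method name is an
    # assumption -- the interface's entry point is not listed in this excerpt.
    import unicodedata

    from swisstext.cmd.scraping.interfaces import INormalizer

    _QUOTES = {'\u201c': '"', '\u201d': '"', '\u2018': "'", '\u2019': "'"}


    class SimpleNormalizer(INormalizer):

        def normalize(self, text: str) -> str:
            # Fix the unicode representation, then uncurl quotes.
            text = unicodedata.normalize('NFC', text)
            for curly, plain in _QUOTES.items():
                text = text.replace(curly, plain)
            return text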
class swisstext.cmd.scraping.interfaces.ISaver(**kwargs)

Bases: abc.ABC

[ABSTRACT] The saver is responsible for persisting everything somewhere, such as a database, a file or the console.
abstract ISaver.get_page(url: str, **kwargs) → swisstext.cmd.scraping.data.Page

Load a page. The simplest implementation is just return Page(url). If the subclass uses a data store, it should also populate the other page attributes (e.g. the score information) so that the IDecider can make clever decisions.
abstract ISaver.is_url_blacklisted(url: str) → bool

[ABSTRACT] Tells if the given URL is part of the blacklist. This is called at the beginning, to avoid scraping pages unnecessarily.
abstract ISaver.save_page(page: swisstext.cmd.scraping.data.Page)

[ABSTRACT] Persist a page. This is called after the scraping, so all the page's attributes are set, including the list of new swisstext.cmd.scraping.data.Sentence instances found.
abstract ISaver.save_seed(seed: str)

[ABSTRACT] Persist a seed (usually generated by the ISeedGenerator).
ISaver.save_seeds(seeds: List[str])

Persist multiple seeds (see save_seed()).
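A minimal in-memory saver might look like the sketch below. It is an illustration only: real implementations would write to a database or a file, and the page.url attribute used as a dictionary key is an assumption.

    # Hypothetical in-memory saver (sketch). Implements the four abstract
    # methods; save_seeds() is inherited from ISaver.
    from swisstext.cmd.scraping.data import Page
    from swisstext.cmd.scraping.interfaces import ISaver


    class InMemorySaver(ISaver):

        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self.pages = {}
            self.seeds = []
            self.blacklist = set()

        def get_page(self, url: str, **kwargs) -> Page:
            # The simplest possible implementation, as suggested above.
            return self.pages.get(url) or Page(url)

        def is_url_blacklisted(self, url: str) -> bool:
            return url in self.blacklist

        def save_page(self, page: Page):
            self.pages[page.url] = page  # page.url is an assumed attribute

        def save_seed(self, seed: str):
            self.seeds.append(seed)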
class swisstext.cmd.scraping.interfaces.ISeedCreator

Bases: abc.ABC

[ABSTRACT] A seed creator should generate seeds, i.e. search queries, out of Swiss German sentences.
class swisstext.cmd.scraping.interfaces.ISentenceFilter

Bases: object

A sentence filter should be able to tell if a given sentence is well-formed (i.e. valid) or not.
ISentenceFilter.filter(sentences: List[str]) → List[str]

Filter a list of sentences by calling ISentenceFilter.is_valid() on each element.
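Concretely, a custom filter only needs to override is_valid(); filter() then builds on it. The sketch below keeps sentences of a reasonable length that consist mostly of letters; the thresholds are arbitrary, and the is_valid(sentence) signature is assumed from the reference above.

    # Hypothetical sentence filter (sketch). is_valid() is referenced by the
    # documentation above; its (sentence) -> bool signature is an assumption.
    from swisstext.cmd.scraping.interfaces import ISentenceFilter


    class LengthAndLettersFilter(ISentenceFilter):

        def is_valid(self, sentence: str) -> bool:
            sentence = sentence.strip()
            if not 20 <= len(sentence) <= 500:  # arbitrary length bounds
                return False
            letters = sum(c.isalpha() or c.isspace() for c in sentence)
            return letters / len(sentence) >= 0.8  # mostly letters and spaces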
class swisstext.cmd.scraping.interfaces.ISgDetector

Bases: object

[ABSTRACT] An SG detector is a language identifier supporting Swiss German.
class swisstext.cmd.scraping.interfaces.ISplitter

Bases: object

A splitter should take a [long] text (extracted from a web page) and split it into well-formed sentences. The default implementation just splits on the newline character.
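A custom splitter could, for instance, split on sentence-ending punctuation instead of newlines, as sketched below. The split(text) method name is an assumption, since the interface's methods are not shown in this excerpt.

    # Hypothetical splitter (sketch). The split() method name is an assumption --
    # the interface's entry point is not listed in this excerpt.
    import re
    from typing import List

    from swisstext.cmd.scraping.interfaces import ISplitter

    _SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


    class PunctuationSplitter(ISplitter):

        def split(self, text: str) -> List[str]:
            # Split on whitespace that follows sentence-ending punctuation,
            # then drop empty fragments.
            parts = _SENTENCE_END.split(text.replace('\n', ' '))
            return [p.strip() for p in parts if p.strip()]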
class swisstext.cmd.scraping.interfaces.IUrlFilter

Bases: object

A URL filter can transform or remove links found on a page before they are saved/crawled.

Note: IUrlFilter will only be called on URLs discovered during the scraping process (child URLs), not on the URLs used to bootstrap the process (using st_scrape from_file/from_mongo).
IUrlFilter.filter(urls: List[str]) → Set[str]

Fix and filter a list of URLs by calling IUrlFilter.fix() on each element and removing the None results.
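As an example, a fix() implementation could strip URL fragments and drop links to binary resources. fix() is referenced by the documentation above; the (url) -> Optional[str] contract used below, returning None to drop a link, is an assumption.

    # Hypothetical URL filter (sketch). fix() is mentioned above; its
    # (url) -> Optional[str] contract is assumed, not documented here.
    from typing import Optional
    from urllib.parse import urldefrag

    from swisstext.cmd.scraping.interfaces import IUrlFilter

    _SKIPPED_EXTENSIONS = ('.zip', '.pdf', '.jpg', '.jpeg', '.png', '.gif')


    class BasicUrlFilter(IUrlFilter):

        def fix(self, url: str) -> Optional[str]:
            url, _fragment = urldefrag(url)  # drop anchors
            if url.lower().endswith(_SKIPPED_EXTENSIONS):
                return None  # drop unparseable resources
            return url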