Tool interfaces

This module defines interfaces for each tool or decision maker used in the scraping process. This makes it easy to test new approaches or to tune one aspect of the scraper while keeping the rest of the code unchanged.

See the swisstext.cmd.scraping.tools module for implementations.

class swisstext.cmd.scraping.interfaces.ICrawler[source]

Bases: abc.ABC

[ABSTRACT] This tool is in charge of crawling a page. More specifically, it should be able to:

  1. extract the text of the page (stripped of any HTML or other structural markup),

  2. extract links pointing to other pages.

exception CrawlError(name='CrawlError', message='')[source]

Bases: Exception

This wrapper should be used for any exception that arises during scraping.

__init__(name='CrawlError', message='')[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod from_ex(e: Exception)[source]

Create an exception using the original exception's name and repr.

class CrawlResults(text: str, links: List[str])[source]

Bases: object

Holds the results of a page crawl.

__init__(text: str, links: List[str])[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod empty()[source]

links = None

A list of interesting links found in the page. By interesting, we mean:

  • no duplicates

  • different from the current page URL (no anchors!)

  • if possible, no links pointing to unparseable resources (zip files, images, etc.)

The method swisstext.cmd.link_utils.filter_links() is available to do the filtering.

text = None

The clean text found in the page, free of any structural markers such as HTML tags.

abstract crawl(url: str) → swisstext.cmd.scraping.interfaces.ICrawler.CrawlResults[source]

[ABSTRACT] Should crawl the page and extract the text and the links into a ICrawler.CrawlResults instance.
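For illustration, here is a minimal sketch of an ICrawler implementation. It assumes the third-party packages requests and beautifulsoup4 are available and wraps failures with CrawlError.from_ex(); politeness (robots.txt, rate limiting) and link filtering via swisstext.cmd.link_utils.filter_links() are left out for brevity.

```python
from typing import List

import requests
from bs4 import BeautifulSoup

from swisstext.cmd.scraping.interfaces import ICrawler


class SimpleCrawler(ICrawler):
    """A bare-bones crawler sketch based on requests + BeautifulSoup."""

    def crawl(self, url: str) -> ICrawler.CrawlResults:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except Exception as e:
            # wrap any network/HTTP error as suggested by the interface
            raise ICrawler.CrawlError.from_ex(e)

        soup = BeautifulSoup(response.text, 'html.parser')
        # keep only the visible text, stripped of structural markup
        text = soup.get_text(separator='\n')
        # collect raw hrefs; deduplication and removal of unparseable
        # resources could be delegated to link_utils.filter_links()
        links: List[str] = [a['href'] for a in soup.find_all('a', href=True)]
        return ICrawler.CrawlResults(text=text, links=links)
```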

class swisstext.cmd.scraping.interfaces.IDecider[source]

Bases: object

A decider implements the logic for deciding whether a URL is interesting and should be crawled.

should_children_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool[source]

Decide if the links found on a page should be scraped on this run. Returns True by default.

should_page_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool[source]

Decide if a page should be scraped on this run. By default this always returns True, but other rules could be devised based on the last crawl, etc.

should_url_be_blacklisted(page: swisstext.cmd.scraping.data.Page) → bool[source]

Decide if a URL/page should be blacklisted. The default implementation returns True if the URL has never been visited before and contains no Swiss German sentence. The latter criterion is important to avoid blacklisting a page that changed over time but once contained interesting sentences (and thus might be referenced in the persistence layer).
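A custom decider only needs to override the methods it cares about. The sketch below (an assumption, not part of the library) restricts crawling to a whitelist of domains, using only the page's URL, which is assumed to be exposed as page.url.

```python
from urllib.parse import urlparse

from swisstext.cmd.scraping.data import Page
from swisstext.cmd.scraping.interfaces import IDecider


class DomainDecider(IDecider):
    """Only crawl pages hosted on a fixed set of domains (illustrative)."""

    def __init__(self, allowed_domains=('example.ch', 'example.org')):
        self.allowed_domains = set(allowed_domains)

    def should_page_be_crawled(self, page: Page) -> bool:
        host = urlparse(page.url).netloc.lower()
        # treat "www.example.ch" and "example.ch" as the same domain
        if host.startswith('www.'):
            host = host[4:]
        return host in self.allowed_domains
```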

class swisstext.cmd.scraping.interfaces.INormalizer[source]

Bases: object

A normalizer should take a [long] text (extracted from a web page) and clean it in a consistent way, for example by uncurling quotes, normalizing punctuation, or fixing Unicode. The default implementation just returns the text as-is.

normalize(text: str) → str[source]

This should be overridden. The default implementation just returns the text as-is.

normalize_all(texts: List[str]) → List[str][source]

Calls normalize() on each element of texts.
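A minimal normalizer sketch, using only the standard library: Unicode NFC normalization plus a small, illustrative replacement table for curly quotes, dashes and non-breaking spaces (the table is an assumption, not the project's default behaviour).

```python
import unicodedata

from swisstext.cmd.scraping.interfaces import INormalizer


class SimpleNormalizer(INormalizer):
    # curly quotes, dashes and non-breaking spaces -> plain ASCII equivalents
    _REPLACEMENTS = {
        '\u201c': '"', '\u201d': '"',   # curly double quotes
        '\u2018': "'", '\u2019': "'",   # curly single quotes
        '\u2013': '-', '\u2014': '-',   # en/em dashes
        '\u00a0': ' ',                  # non-breaking space
    }

    def normalize(self, text: str) -> str:
        # normalize Unicode composition first, then apply the table
        text = unicodedata.normalize('NFC', text)
        for old, new in self._REPLACEMENTS.items():
            text = text.replace(old, new)
        return text
```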

class swisstext.cmd.scraping.interfaces.ISaver(**kwargs)[source]

Bases: abc.ABC

[ABSTRACT] The saver is responsible for persisting everything somewhere, such as a database, a file or the console.

__init__(**kwargs)[source]
abstract blacklist_url(url: str, **kwargs)[source]

[ABSTRACT] Add the url to a blacklist.

abstract get_page(url: str, **kwargs) → swisstext.cmd.scraping.data.Page[source]

Load a page. The simplest implementation just returns Page(url). If the subclass uses a data store, it should also populate the other page attributes (e.g. the score information) so that the IDecider can make clever decisions.

abstract is_url_blacklisted(url: str) → bool[source]

[ABSTRACT] Tells if the given url is part of the blacklist. This is called at the beginning, to avoid scraping pages unnecessarily.

abstract save_page(page: swisstext.cmd.scraping.data.Page)[source]

[ABSTRACT] Persist a page. This is called after scraping, so all of the page's attributes are set, including the list of new swisstext.cmd.scraping.data.Sentence instances found.

abstract save_seed(seed: str)[source]

[ABSTRACT] Persist a seed (usually generated by the ISeedCreator).

save_seeds(seeds: List[str])[source]

Persist multiple seeds (see save_seed()).

abstract save_url(url: str, parent: str = None)[source]

[ABSTRACT] Save a potentially interesting URL which won’t be visited during this round (e.g. max_depth reached).

Parameters
  • url – the url

  • parent – the parent url

abstract sentence_exists(sentence: str) → bool[source]

[ABSTRACT] Tells if the given Swiss German sentence already exists. Only new sentences will be added to the page’s new_sg attribute.
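As an illustration, the sketch below implements all abstract methods of ISaver with in-memory sets and console output, which can be handy for tests or dry runs. It assumes the crawled URL is available as page.url; everything else follows the signatures documented above.

```python
from swisstext.cmd.scraping.data import Page
from swisstext.cmd.scraping.interfaces import ISaver


class ConsoleSaver(ISaver):
    """Keeps everything in memory and prints pages instead of persisting them."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.blacklist = set()
        self.sentences = set()
        self.urls = set()

    def blacklist_url(self, url: str, **kwargs):
        self.blacklist.add(url)

    def is_url_blacklisted(self, url: str) -> bool:
        return url in self.blacklist

    def get_page(self, url: str, **kwargs) -> Page:
        # the simplest possible implementation, as suggested by the interface
        return Page(url)

    def save_page(self, page: Page):
        print(f'crawled page: {page.url}')

    def save_seed(self, seed: str):
        print(f'new seed: {seed}')

    def save_url(self, url: str, parent: str = None):
        self.urls.add(url)

    def sentence_exists(self, sentence: str) -> bool:
        return sentence in self.sentences
```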

class swisstext.cmd.scraping.interfaces.ISeedCreator[source]

Bases: abc.ABC

[ABSTRACT] A seed creator should generate seeds, i.e. search queries, out of Swiss German sentences.

abstract generate_seeds(sentences: List[str], max=10, **kwargs) → List[str][source]

[ABSTRACT] Should generate interesting seeds.

Parameters
  • sentences – the sentences

  • max – maximum number of seeds to return

Returns

a list of seeds
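A minimal seed creator sketch: it proposes the most frequent word bigrams as search queries, using only the standard library. Real implementations would typically weight candidates (e.g. by tf-idf); this is only meant to show the interface.

```python
from collections import Counter
from typing import List

from swisstext.cmd.scraping.interfaces import ISeedCreator


class NgramSeedCreator(ISeedCreator):

    def generate_seeds(self, sentences: List[str], max=10, **kwargs) -> List[str]:
        bigrams = Counter()
        for sentence in sentences:
            words = sentence.lower().split()
            # count consecutive word pairs as candidate search queries
            bigrams.update(' '.join(pair) for pair in zip(words, words[1:]))
        return [seed for seed, _ in bigrams.most_common(max)]
```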

class swisstext.cmd.scraping.interfaces.ISentenceFilter[source]

Bases: object

A sentence filter should be able to tell if a given sentence is well-formed (i.e. valid) or not.

filter(sentences: List[str]) → List[str][source]

Filter a list of sentences by calling ISentenceFilter.is_valid() on each element.

is_valid(sentence: str) → bool[source]

This should be overridden. The default implementation always returns True.
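A small illustrative filter with two arbitrary rules, a minimum word count and a minimum ratio of alphabetic characters; the thresholds are assumptions chosen for the example.

```python
from swisstext.cmd.scraping.interfaces import ISentenceFilter


class BasicSentenceFilter(ISentenceFilter):

    def __init__(self, min_words=4, min_alpha_ratio=0.8):
        self.min_words = min_words
        self.min_alpha_ratio = min_alpha_ratio

    def is_valid(self, sentence: str) -> bool:
        words = sentence.split()
        if len(words) < self.min_words:
            return False
        # reject sentences dominated by digits, punctuation or symbols
        alpha = sum(c.isalpha() or c.isspace() for c in sentence)
        return alpha / max(len(sentence), 1) >= self.min_alpha_ratio
```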

class swisstext.cmd.scraping.interfaces.ISgDetector[source]

Bases: object

[ABSTRACT] An SG Detector is a Language Identifier supporting Swiss German.

abstract predict(sentences: List[str]) → List[float][source]

Predict the probability (between 0 and 1) that each sentence in a list is Swiss German. Returns 1 for every sentence by default.

predict_one(sentence: str) → float[source]

Call predict() with one sentence.
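The toy detector below scores sentences by the share of words found in a tiny, hand-picked list of Swiss German function words. It is only a placeholder for a real language identifier (e.g. a trained classifier); its word list and scaling are pure assumptions.

```python
from typing import List

from swisstext.cmd.scraping.interfaces import ISgDetector


class KeywordSgDetector(ISgDetector):
    # a few common Swiss German function words, purely illustrative
    _SG_HINTS = {'isch', 'nöd', 'öppis', 'chli', 'gsi', 'hät', 'mir', 'au'}

    def predict(self, sentences: List[str]) -> List[float]:
        probs = []
        for sentence in sentences:
            words = sentence.lower().split()
            hits = sum(w in self._SG_HINTS for w in words)
            # crude proxy for a probability: share of hint words, capped at 1
            probs.append(min(1.0, hits / max(len(words), 1) * 5))
        return probs
```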

class swisstext.cmd.scraping.interfaces.ISplitter[source]

Bases: object

A splitter should take a [long] text (extracted from a web page) and split it into well-formed sentences. The default implementation just splits on the newline character.

split(text: str) → List[str][source]

This should be overridden. The default implementation just splits on newlines.

split_all(texts: List[str]) → List[str][source]

Takes a list of texts and returns a list of sentences (see split()).
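A minimal splitter sketch that cuts on sentence-final punctuation with a regular expression. Proper sentence segmentation (abbreviations, ellipses, quotes) needs a real tokenizer; this only illustrates the interface.

```python
import re
from typing import List

from swisstext.cmd.scraping.interfaces import ISplitter


class RegexSplitter(ISplitter):
    # split after ., ! or ? followed by whitespace, or on newlines
    _BOUNDARY = re.compile(r'(?<=[.!?])\s+|\n+')

    def split(self, text: str) -> List[str]:
        return [s.strip() for s in self._BOUNDARY.split(text) if s.strip()]
```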

class swisstext.cmd.scraping.interfaces.IUrlFilter[source]

Bases: object

A URL filter can transform or remove links found on a page before they are saved/crawled.

Note

IUrlFilter will only be called on URLs discovered during the scraping process (child URLs), not on the URLs used to bootstrap the process (using st_scrape from_file/from_mongo).

filter(urls: List[str]) → Set[str][source]

Fix and filter a list of URLs by calling IUrlFilter.fix() on each element and removing None results.

fix(url: str) → Optional[str][source]

This should be overridden. The default implementation just returns the URL unchanged.

Parameters

url – the current URL

Returns

the potentially transformed URL, or None if it should be ignored
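A minimal URL filter sketch: it strips URL fragments (anchors) and drops links to obviously non-HTML resources. The extension list is an arbitrary example, not the project's default behaviour.

```python
from typing import Optional
from urllib.parse import urldefrag

from swisstext.cmd.scraping.interfaces import IUrlFilter


class BasicUrlFilter(IUrlFilter):
    _SKIP_EXTENSIONS = ('.jpg', '.png', '.gif', '.pdf', '.zip', '.mp3', '.mp4')

    def fix(self, url: str) -> Optional[str]:
        # remove the #fragment part, keeping the rest of the URL intact
        url, _ = urldefrag(url)
        if url.lower().endswith(self._SKIP_EXTENSIONS):
            return None   # ignore links to binary/unparseable resources
        return url
```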