Tool implementations¶
This package contains various implementations of the different pipeline tools.
See also
interfaces
The tool interface definitions
config
The default configuration instantiates tools from this package
Deciders¶
This module contains multiple IDecider implementations.
class swisstext.cmd.scraping.tools.basic_decider.BasicDecider(min_ratio=0, min_recrawl_delta: str = None, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.interfaces.IDecider

A basic decider that:
- crawls URLs that fulfill one of the following criteria:
  - the URL is new,
  - the last visit yielded at least one new sentence,
  - the last visit is older than min_recrawl_delta;
- blacklists URLs with no Swiss German sentences;
- adds child URLs from a page only if the ratio between sentences and Swiss German sentences is greater than or equal to min_ratio.
__init__(min_ratio=0, min_recrawl_delta: str = None, **kwargs)[source]¶
Parameters:
- min_ratio – the minimum sentence_count / sg_count ratio required to add child URLs
- min_recrawl_delta – a string that can be parsed into a time difference using pytimeparse.timeparse
min_ratio = None¶ Child URLs are added only if sentence_count / sg_count > min_ratio.

min_recrawl_delta = None¶ A timedelta. URLs are revisited if now() - last visit > min_recrawl_delta (UTC).
should_children_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool[source]¶ Returns true if the page's sg_count is above 0 and sentence_count / sg_count >= min_ratio.
should_page_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool[source]¶ Returns true if the URL is new, or if the page's delta_count is above 0 and the page's delta_date is older than min_recrawl_delta.
Note that a page will NEVER be recrawled if the last crawl is less than ABSOLUTE_MIN_RECRAWL_DELTA old.
class swisstext.cmd.scraping.tools.basic_decider.OneNewSgDecider(min_ratio=0, min_recrawl_delta: str = None, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.tools.basic_decider.BasicDecider

Same as BasicDecider, but children will be crawled only if at least one NEW Swiss German sentence was found.
should_children_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool[source]¶ Returns true if the page's sg_count is above 0 and sentence_count / sg_count >= min_ratio.
class swisstext.cmd.scraping.tools.basic_decider.OnlyNewDecider(min_ratio=0, min_recrawl_delta: str = None, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.tools.basic_decider.BasicDecider

Same as BasicDecider, but only crawls new URLs.
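A minimal usage sketch (how the Page attributes get updated after a crawl is only indicative):

from swisstext.cmd.scraping.data import Page
from swisstext.cmd.scraping.tools.basic_decider import BasicDecider

# min_recrawl_delta is parsed with pytimeparse, so human-readable strings work
decider = BasicDecider(min_ratio=0, min_recrawl_delta='7 days')

page = Page('http://example.com')  # a new, never-visited page
if decider.should_page_be_crawled(page):
    # ... crawl the page, update sentence_count / sg_count / delta_count ...
    if decider.should_children_be_crawled(page):
        pass  # enqueue the child URLs found on the page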
Seed creators¶
This module contains multiple ISeedCreator implementations.
They all use sklearn vectorizers (word N-grams) under the hood.
Warning
This utility was never really used in experiments. We usually generated the seeds using the scripts in the seeding directory on GitHub.
class swisstext.cmd.scraping.tools.basic_seed_creator.BasicSeedCreator(ngram_range=(3, 3))[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISeedCreator

A basic seed creator that uses an sklearn CountVectorizer to compute frequent word N-grams. The generated seeds are simply the x most frequent N-grams found.

generate_seeds(sentences: List[str], max=10, stopwords: List[str] = None, **kwargs) → List[str][source]¶ Return a list of seeds composed of the most frequent n-grams.

ngram_range = None¶ The ngram_range parameter passed to the CountVectorizer.
class swisstext.cmd.scraping.tools.basic_seed_creator.IdfSeedCreator(sanitize=True, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISeedCreator

A basic seed creator that uses an sklearn TfidfVectorizer to compute frequent word N-grams. The generated seeds are the x n-grams with the highest score.

__init__(sanitize=True, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.

generate_seeds(sentences: List[str], max=10, stopwords: List[str] = None, **kwargs) → List[str][source]¶ Return a list of seeds composed of the most frequent n-grams (weighted by IDF).

kwargs = None¶ The arguments to pass to the vectorizer. They can be overridden in the constructor.

sanitize = None¶ If this flag is set, digits that are not part of a word will be removed before feeding the vectorizer. This is useful because the TfidfVectorizer's default token pattern treats digits as actual words (but ignores punctuation; see the CountVectorizer token_pattern attribute).
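A short usage sketch (the sentences are illustrative):

from swisstext.cmd.scraping.tools.basic_seed_creator import BasicSeedCreator, IdfSeedCreator

sentences = [
    'das isch e churze satz',
    'no e satz uf schwiizerdütsch',
    'und scho wieder e satz',
]

# most frequent word trigrams...
seeds = BasicSeedCreator(ngram_range=(3, 3)).generate_seeds(sentences, max=5)
# ...or the highest-scoring n-grams after TF-IDF weighting
idf_seeds = IdfSeedCreator().generate_seeds(sentences, max=5, stopwords=['e', 'und'])
print(seeds, idf_seeds)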
Crawlers¶
This module contains an implementation of ICrawler that uses BeautifulSoup to extract text and links.
class swisstext.cmd.scraping.tools.bs_crawler.BsCrawler(joiner=' ')[source]¶
Bases: swisstext.cmd.scraping.interfaces.ICrawler

A basic crawler implemented using BeautifulSoup.
Text is extracted by concatenating all pieces of text (except CSS and script) into one string using a space separator (no newlines).

Warning
This crawler implementation returns the page's textual content in one bulk, with no newline characters. Consequently, results won't be exploitable without a clever ISplitter (recall that the default implementation splits text based on newlines…) such as the PunktSplitter.

Todo
Try using just the response.text from requests to get a proper encoding?

crawl(url: str) → swisstext.cmd.scraping.interfaces.ICrawler.CrawlResults[source]¶ Extract links and text from a URL.

classmethod extract_links(url, soup)[source]¶ Get all links from a soup (a href only). Note that links will be resolved (relative to absolute) and filtered (non-HTML removed).

classmethod extract_text_blocks(soup) → Generator[str, None, None][source]¶ Get text blocks from a BeautifulSoup object.
Warning
This method is destructive, as it will first remove script, style and forms from the HTML/soup object!
Todo
Find a way to avoid altering the soup object?

classmethod get_content(url) → Tuple[bytes, str][source]¶ Get the raw content from a URL (as a string), with the response encoding as reported by the requests module. Exceptions may be raised if:
- an error occurs during the GET request (timeout, decoding issue, too many redirects, etc.),
- the content-type is not of a supported type (namely html or text),
- the response body is empty.
class swisstext.cmd.scraping.tools.bs_crawler.CleverBsCrawler(joiner=' ')[source]¶
Bases: swisstext.cmd.scraping.tools.bs_crawler.BsCrawler

Another implementation of the BsCrawler that tries to be more clever during the text extraction step.

Processing steps:
- remove all scripts and CSS content (same as the BsCrawler);
- try to detect the page's main content using common naming schemes (id=main, role=main, etc.); if found, stop the processing and return only the text under it;
- try to detect and remove the header, footer and navigation before returning the text.

Following those heuristics requires more processing power and might miss some sentences (full sentences in the side bar, main content in a poorly coded website, etc.).

Todo
Make more thorough tests to determine whether those heuristics are worth it. If so, make this implementation the default.
swisstext.cmd.scraping.tools.bs_crawler.DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36'}¶ Headers passed with each request.

swisstext.cmd.scraping.tools.bs_crawler.GET_TIMEOUT = 60¶ Timeout (in seconds) used in requests.get.
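A small sketch of the lower-level helpers (the order of the get_content return values and the BeautifulSoup parsing step are assumptions about how the pieces fit together):

from bs4 import BeautifulSoup
from swisstext.cmd.scraping.tools.bs_crawler import BsCrawler

url = 'https://www.example.com'
# raw body plus the encoding reported by requests (assumed order of the returned tuple)
content, encoding = BsCrawler.get_content(url)
soup = BeautifulSoup(content, 'html.parser')

# resolved and filtered child links (a href only)
links = BsCrawler.extract_links(url, soup)

# text blocks, extracted destructively (scripts, styles and forms are removed from the soup)
text = ' '.join(BsCrawler.extract_text_blocks(soup))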
This module contains a subclass of BsCrawler relying on jusText for text extraction. It seems to work better than BsCrawler or CleverBsCrawler.
To get a feel, try the online demo of the original jusText.

Note
- Usually, jusText uses '' (the empty string) to join text nodes inside a paragraph, thus making things like "One sentence.Second sentence" likely. Here, we always use a space to join, then normalize the spaces.
- jusText will throw an error on empty document content, which is wrapped inside a CrawlError.
class swisstext.cmd.scraping.tools.justext_crawler.JustextCrawler(joiner='\n', keep_bad=True, stoplist=None, stopwords_low=0.3, stopwords_high=0.32, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.tools.bs_crawler.BsCrawler

A BsCrawler that relies on jusText to cleverly extract meaningful text from webpages.

__init__(joiner='\n', keep_bad=True, stoplist=None, stopwords_low=0.3, stopwords_high=0.32, **kwargs)[source]¶ Create a crawler instance.
Parameters:
- joiner – character used to join paragraphs
- keep_bad – if set, keep everything; if unset, keep only paragraphs with a context-free class of "neargood" or "good"
- stoplist – see the jusText doc
- stopwords_low – idem
- stopwords_high – idem
- kwargs – unused
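A minimal usage sketch (the attributes of the returned CrawlResults object are assumptions; only crawl() itself is documented here):

from swisstext.cmd.scraping.tools.justext_crawler import JustextCrawler

# keep only "good"/"neargood" paragraphs and join them with newlines
crawler = JustextCrawler(joiner='\n', keep_bad=False)

results = crawler.crawl('https://www.example.com')
# CrawlResults bundles the extracted text and the child links found on the page;
# the attribute names below are assumptions, not part of the documented API
print(results.text[:200])
for link in results.links:
    print(link)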
Normalizers¶
class swisstext.cmd.scraping.tools.norm_punc.Normalizer(**kwargs)[source]¶
Bases: object

A wrapper around normalize_text().

__init__(**kwargs)[source]¶ Initialize a normalizer. The kwargs will be passed to normalize_text() as-is.

kwargs = None¶ Extra options to pass to normalize_text().

normalize(text)[source]¶ Call normalize_text() on text with the extra kwargs options.
swisstext.cmd.scraping.tools.norm_punc.normalize_text(text, fix_encoding=False, strip_emojis=False)[source]¶ Normalize text:
- normalize accents (using the NFC convention),
- strip control/invisible chars and leftover combining diacritics,
- undo ligatures,
- normalize quotes, apostrophes and unicode characters (dashes, etc.),
- normalize spaces (all spaces, including nbsp and tabs, will be encoded as 0x20),
- strip and collapse multiple spaces into one,
- etc.

Optionally:
- try to detect and fix encoding issues (see ftfy.fix_encoding) on a per-sentence basis (delimited by newlines);
- strip emojis (using a simple regex, not all cases are covered!).

Parameters:
- text – the text to normalize; newlines will be preserved
- fix_encoding – if set, use ftfy to fix encoding issues on a per-sentence basis
- strip_emojis – if set, try to find and strip unicode emojis

Returns:
the normalized text
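A short usage sketch (the input string is illustrative):

from swisstext.cmd.scraping.tools.norm_punc import Normalizer, normalize_text

raw = "“Grüezi”\u00a0mitenand …  wie gaht’s?"
print(normalize_text(raw))
# the same call with encoding fixes and emoji stripping enabled
print(normalize_text(raw, fix_encoding=True, strip_emojis=True))

# or through the wrapper class, which stores the options once
normalizer = Normalizer(fix_encoding=True)
print(normalizer.normalize(raw))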
Splitters¶
class swisstext.cmd.scraping.tools.punkt_splitter.PunktSplitter(modelfile=None)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISplitter

A splitter using the PunktSentenceTokenizer, the NLTK implementation of the "Unsupervised Multilingual Sentence Boundary Detection" algorithm (Kiss and Strunk, 2005).

Note
The default implementation uses a model trained on English sentences. This kaggle resource offers pretrained Punkt models for other languages as well, including German. In my tests though, German models performed poorly compared to the default…

Todo
Train a Punkt model for Swiss German. (https://stackoverflow.com/questions/21160310/training-data-format-for-nltk-punkt)
Reimplementation of Moses' split-sentences.perl in pure Python 3. The behavior is equivalent to https://bitbucket.org/luismsgomes/mosestokenizer/src/default/src/mosestokenizer/split-sentences.perl.
Changes compared to the latest split-sentences.perl:
- no support for Chinese, Hindi and Gujarati (the corresponding if blocks were simply ignored);
- addition of a "more" option that splits on ":;" characters;
- split returns a list of sentences instead of a string;
- '<P>' is not added when encountering a "closing" empty line in the input (all empty lines are just ignored).

Note
A better implementation (at least for swisstext) is MocySplitter.
class swisstext.cmd.scraping.tools.moses_splitter.MosesSplitter(lang='de', prefix_file=None, more=True)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISplitter

Python implementation of Moses' split_sentences.perl.

__init__(lang='de', prefix_file=None, more=True)[source]¶
Parameters:
- lang – nonbreaking_prefix file to load (available: en, de)
- prefix_file – path to a custom nonbreaking_prefix file
- more – if set, systematically split on :;

classmethod load_nb_prefixes(lang='en', prefix_file=None)[source]¶ Read the nonbreaking_prefixes from a file.
Parameters:
- lang – the language to load
- prefix_file – a custom file to load (has priority over lang)
Returns:
the nonbreaking_prefix dictionary, with key=prefix and value=1|2; 1 means the prefix applies anywhere, 2 means it only applies when followed by digits.
This splitter implementation was heavily inspired by Moses' split-sentences.perl. It requires only basic Python 3 modules and the regex module (which supports unicode regexes).
Main changes compared to the latest split-sentences.perl:
- implemented with Swiss German (Web) text in mind: no support for Chinese, Hindi and Gujarati; number and quote conventions are based on this language only;
- addition of a "more" option that splits on ":;" characters (but tries to keep emojis and URLs intact);
- lowercase letters can also signal a start of sentence (often seen on the web);
- more aggressive splits on ?! characters;
- split returns a list of sentences instead of a string;
- it is possible to load multiple nonbreaking prefix lists (merged into one lookup table);
- no support for <p> tags (I didn't get the purpose of this option anyway).
class swisstext.cmd.scraping.tools.mocy_splitter.MocySplitter(langs=None, prefix_file=None, more=True, keep_newlines=True)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISplitter

A splitter largely inspired by Moses' split_sentences.perl. The name is a contraction of Moses and Lucy. Most important changes:
- a start of sentence doesn't need to be uppercase, lowercase letters will do too;
- the more parameter lets you choose to break on ';:' as well;
- end-of-sentence '?!' characters are better handled;
- multiple nonbreaking_prefix files can be used at once (rules will be merged).

__init__(langs=None, prefix_file=None, more=True, keep_newlines=True)[source]¶
Parameters:
- langs – a List[str] of language(s) for the nonbreaking_prefix file(s) to load (default: en, de)
- prefix_file – path to a custom nonbreaking_prefix file
- more – if set, systematically split on :;
- keep_newlines – if set, treat newlines as paragraph delimiters that will be preserved; if unset, newlines are ignored and empty lines are treated as paragraph delimiters (Moses' original behavior, see split()).
langs = None¶ The nonbreaking prefix files to load.

classmethod load_nb_prefixes(langs, prefix_file=None)[source]¶ Read the nonbreaking_prefixes from a file or from an array of languages.
Parameters:
- langs – the language(s) to load
- prefix_file – a custom file to load (has priority over langs)
Returns:
the nonbreaking_prefix dictionary, with key=prefix and value=1|2; 1 means the prefix applies anywhere, 2 means it only applies when followed by digits.

more = None¶ Whether or not to split on :;

nb_prefixes = None¶ The nonbreaking prefix lookup table.
split(input_text)[source]¶ Split a text into sentences. Depending on the value of keep_newlines, either split_sentences() or split_text() will be called.
Parameters:
- input_text – the input text
Returns:
a list of sentences (no blank lines)

classmethod split_paragraph(text, nb_prefixes, more=False)[source]¶ Handle one paragraph of text.
Parameters:
- text – the paragraph to split
- nb_prefixes – the dictionary of nonbreaking_prefixes (see the perl implementation/doc)
- more – if set, systematically split on :;
Returns:
a list of sentences
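A short usage sketch (the Swiss German sample text is illustrative):

from swisstext.cmd.scraping.tools.mocy_splitter import MocySplitter

splitter = MocySplitter(langs=['de'], more=True)
text = (
    'Grüezi mitenand! das isch de erscht Satz; und das de zwöit.\n'
    'E nöie Abschnitt uf ere nöie Zile.'
)
for sentence in splitter.split(text):
    print(sentence)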
Sentence Filters¶
This module contains an implementation of ISentenceFilter that uses simple rules to filter "well-formed" sentences.
How it works¶
Each sentence is checked against a list of rules and rejected / considered invalid if any of those rules fail. Rules are thus AND-based.
Rules are defined using a simple YAML syntax and can be of two types: length-based (character count) or pattern-based (regular expressions). They are checked in the same order they are defined.
Note
Regular expressions can be quite expensive, so try to limit their complexity to the minimum required.
Rules are checked in the same order as they are defined, so it is advised to put the most generic / efficient ones first.
Note
This module uses the regex library (version V0) instead of the default re. You can thus freely use unicode regular expressions in your rules.
Rule syntax¶
Length-based rules (length) must specify at least one of min or max length, i.e. the bounds on the number of characters. The rule succeeds if min <= len(s) <= max. Here is an example:

- max_length:
    descr: too long
    length:
      max: 1000
Pattern-based rules (find) are a bit similar, but instead of counting the number of characters, they count the number of occurrences of a pattern (i.e. the number of matches when calling regex.findall(pattern, s)).
The rule succeeds if min <= nb_matches <= max (inclusive!).
Examples:

- dashes:
    descr: too many dashes
    find:
      pattern: '[¯‐―−﹘--]'
      count:
        max: 2

- right_number_of_punctuation:
    descr: punctuation is between 5 and 10
    find:
      pattern: '\p{P}'
      count:
        min: 5
        max: 10
Comparison rules (compare) define both a numerator and a denominator pattern. The number of matches is found for each pattern, then a ratio is computed as ratio = count(num matches) / (count(denom matches) + 1). The matching is once again based on regex.findall.
The rule succeeds if min <= ratio <= max (inclusive!).
Example:

- too_many_commas:
    descr: compare the number of commas against the number of words in the sentence
    compare:
      num: ','
      denom: '\p{L}+'
      ratio:
        max: 0.25
Finally, an if condition can be used. If conditions are checked first, and if the check fails, the rule is simply ignored:

- ellipsis:
    descr: ellipsis on short sentences
    if:
      length:
        max: 30
    find:
      pattern: '(\.\s?){3}$'
      count:
        max: 0
Rules can additionally specify examples and counterexamples that can be used to quickly check that they work (see the Rule.self_check method). For example:

- spelled_words:
    descr: W O R D
    find:
      pattern: ' ([\p{L}] ){3,}'
      count:
        max: 0
    examples:
      - 'you must B E L I E V E me.'
      - 'span spells S P A N.'
      - 's p a n means span!!'
    counterexamples:
      - 'this is O K :)'
class swisstext.cmd.scraping.tools.pattern_sentence_filter.PatternSentenceFilter(rulespath=None)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISentenceFilter

By default, rules are loaded from the default file pattern_sentence_filter.yaml in the current directory. You can override this by passing a path to the constructor (rulespath argument).
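A usage sketch (only the constructor is documented above; the per-sentence check at the end assumes the ISentenceFilter interface exposes such a method, so verify the name against the interface definition):

from swisstext.cmd.scraping.tools.pattern_sentence_filter import PatternSentenceFilter

# use the bundled pattern_sentence_filter.yaml ...
default_filter = PatternSentenceFilter()
# ... or point to a custom rule file (hypothetical path)
custom_filter = PatternSentenceFilter(rulespath='/path/to/my_rules.yaml')

# assumed per-sentence validity check from ISentenceFilter:
# custom_filter.is_valid('a candidate sentence')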
Link Filters¶
It is optionally possible to add custom URL filtering logic that will be called for each new URL found on a page.
This lets you:
- ignore child URLs (by returning None), or
- modify child URLs, for example by normalizing subdomains, stripping URL parameters, etc.
Just create an implementation of the IUrlFilter interface and implement its fix method.
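A minimal sketch of such a filter (it assumes IUrlFilter lives in swisstext.cmd.scraping.interfaces like the other interfaces, and that fix takes the URL string and returns either a string or None, as described above):

from typing import Optional
from urllib.parse import urlsplit, urlunsplit

from swisstext.cmd.scraping.interfaces import IUrlFilter


class StripQueryUrlFilter(IUrlFilter):
    """Drop query strings and fragments, and ignore a (hypothetical) noisy subdomain."""

    def fix(self, url: str) -> Optional[str]:
        parts = urlsplit(url)
        # returning None tells the scraper to ignore this child URL
        if parts.netloc.startswith('shop.'):
            return None
        # otherwise return the (possibly modified) URL
        return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))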
Language Detectors¶
class swisstext.cmd.scraping.tools.swigspot_langid.SwigspotLangid[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISgDetector

This LID model was developed during the SwigSpot project.
In short, it uses 6000 character n-gram features (between 3 and 5 characters long) with TF-IDF scaling and a logistic regression. Sentences are preprocessed by removing everything except letters, spaces, commas and dots.
All the details are available in the SwigSpot repository. The notebook for recreating the model is available here.
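A usage sketch (the prediction call is an assumption about the ISgDetector interface, which is not documented above, so check the interface definition before relying on it):

from swisstext.cmd.scraping.tools.swigspot_langid import SwigspotLangid

lid = SwigspotLangid()
sentences = ['Das isch e Satz uf Schwiizerdütsch.', 'This is an English sentence.']
# assumed API: a per-sentence Swiss German probability, as exposed by ISgDetector
# probas = lid.predict(sentences)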
Savers¶
class swisstext.cmd.scraping.tools.console_saver.ConsoleSaver(sentences_file: str = None, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISaver

Implementation of an ISaver useful for testing and debugging. It does not persist any results, but prints everything to the console instead. Blacklisted URLs and sentences are kept in sets in memory.

__init__(sentences_file: str = None, **kwargs)[source]¶
Parameters:
- sentences_file – optional path to a file where new sentences are written. Note that the file is overwritten on each run.

get_page(url: str, **kwargs) → swisstext.cmd.scraping.data.Page[source]¶ Load a page. The simplest implementation is just return Page(url). If the subclass uses a data store, it should also populate the other page attributes (e.g. the score information) so that the IDecider can make clever decisions.

is_url_blacklisted(url: str) → bool[source]¶ [ABSTRACT] Tells if the given url is part of the blacklist. This is called at the beginning, to avoid scraping pages unnecessarily.

save_page(page)[source]¶ [ABSTRACT] Persist a page. This is called after the scraping, so all the page's attributes are set, including the list of new Sentences found.

save_seed(seed: str)[source]¶ [ABSTRACT] Persist a seed (usually generated by the ISeedGenerator).
class swisstext.cmd.scraping.tools.mongo_saver.MongoSaver(db='st1', **kwargs)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISaver

This ISaver implementation persists everything to a MongoDB database.

See also
swisstext.mongo
Package defining the Mongo collections.

__init__(db='st1', **kwargs)[source]¶
Parameters:
- db – the database to use
- kwargs – may include host and port

blacklist_url(url: str, error_message=None, **kwargs)[source]¶ [ABSTRACT] Add the url to a blacklist.

get_page(url: str, **kwargs) → swisstext.cmd.scraping.data.Page[source]¶ Load a page. The simplest implementation is just return Page(url). If the subclass uses a data store, it should also populate the other page attributes (e.g. the score information) so that the IDecider can make clever decisions.

is_url_blacklisted(url: str)[source]¶ [ABSTRACT] Tells if the given url is part of the blacklist. This is called at the beginning, to avoid scraping pages unnecessarily.

save_page(page: swisstext.cmd.scraping.data.Page)[source]¶ [ABSTRACT] Persist a page. This is called after the scraping, so all the page's attributes are set, including the list of new Sentences found.

save_seed(seed: str)[source]¶ [ABSTRACT] Persist a seed (usually generated by the ISeedGenerator).

save_url(url: str, parent: str = None)[source]¶ [ABSTRACT] Save a potentially interesting URL which won't be visited during this round (e.g. max_depth reached).
Parameters:
- url – the url
- parent – the parent url
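A short end-to-end sketch using the ConsoleSaver (the scraping step in the middle is only indicative):

from swisstext.cmd.scraping.tools.console_saver import ConsoleSaver

saver = ConsoleSaver(sentences_file='/tmp/new_sentences.txt')

url = 'http://example.com'
if not saver.is_url_blacklisted(url):
    page = saver.get_page(url)
    # ... scrape the page and attach the new sentences found ...
    saver.save_page(page)

saver.save_seed('es bitzli schwiizerdütsch')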