Tool implementations¶
This package contains various implementations of the different pipeline tools.
See also
interfaces
The tool interface definitions
config
The default configuration instantiates tools from this package
Deciders¶
This module contains multiple IDecider implementations.
class swisstext.cmd.scraping.tools.basic_decider.BasicDecider(min_ratio=0, min_recrawl_delta: str = None, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.interfaces.IDecider

A basic decider that:
- crawls URLs that fulfill one of the following criteria:
  - the URL is new,
  - the last visit yielded at least one new sentence,
  - the last visit is older than min_recrawl_delta;
- blacklists URLs with no Swiss German sentences;
- adds child URLs from a page only if the ratio between sentences and Swiss German sentences is greater than or equal to min_ratio.
__init__(min_ratio=0, min_recrawl_delta: str = None, **kwargs)[source]¶
Parameters:
- min_ratio – the minimum sentence_count / sg_count ratio required to add child URLs
- min_recrawl_delta – a string that can be parsed into a time difference using pytimeparse.timeparse
min_ratio = None¶ Child URLs are added only if sentence_count / sg_count > min_ratio.

min_recrawl_delta = None¶ A timedelta. URLs are revisited if now() - last visit > min_recrawl_delta (UTC).
should_children_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool[source]¶ Returns true if the page's sg_count is above 0 and sentence_count / sg_count >= min_ratio.
should_page_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool[source]¶ Returns true if the URL is new, or if the page's delta_count is above 0 and the page's delta_date is older than min_recrawl_delta.
Note that a page will NEVER be recrawled if the last crawl is less than ABSOLUTE_MIN_RECRAWL_DELTA old.
class swisstext.cmd.scraping.tools.basic_decider.OneNewSgDecider(min_ratio=0, min_recrawl_delta: str = None, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.tools.basic_decider.BasicDecider

Same as BasicDecider, but children will be crawled only if at least one NEW Swiss German sentence was found.
should_children_be_crawled(page: swisstext.cmd.scraping.data.Page) → bool[source]¶ Returns true if the page's sg_count is above 0 and sentence_count / sg_count >= min_ratio.
class swisstext.cmd.scraping.tools.basic_decider.OnlyNewDecider(min_ratio=0, min_recrawl_delta: str = None, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.tools.basic_decider.BasicDecider

Same as BasicDecider, but only crawls new URLs.
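A minimal usage sketch (how the Page attributes get updated after a crawl is only indicative):

from swisstext.cmd.scraping.data import Page
from swisstext.cmd.scraping.tools.basic_decider import BasicDecider

# min_recrawl_delta is parsed with pytimeparse, so human-readable strings work
decider = BasicDecider(min_ratio=0, min_recrawl_delta='7 days')

page = Page('http://example.com')  # a new, never-visited page
if decider.should_page_be_crawled(page):
    # ... crawl the page, update sentence_count / sg_count / delta_count ...
    if decider.should_children_be_crawled(page):
        pass  # enqueue the child URLs found on the page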
Seed creators¶
This module contains multiple ISeedCreator implementations.
They all use sklearn vectorizers (word N-grams) under the hood.
Warning
This utility was never really used in experiments. We usually generated the seeds using the scripts in the seeding directory on GitHub.
class swisstext.cmd.scraping.tools.basic_seed_creator.BasicSeedCreator(ngram_range=(3, 3))[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISeedCreator

A basic seed creator that uses an sklearn CountVectorizer to compute frequent word N-grams. The generated seeds are simply the x most frequent N-grams found.

generate_seeds(sentences: List[str], max=10, stopwords: List[str] = None, **kwargs) → List[str][source]¶ Return a list of seeds composed of the most frequent n-grams.

ngram_range = None¶ The ngram_range parameter passed to the CountVectorizer.
class swisstext.cmd.scraping.tools.basic_seed_creator.IdfSeedCreator(sanitize=True, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISeedCreator

A basic seed creator that uses an sklearn TfidfVectorizer to compute frequent word N-grams. The generated seeds are the x n-grams with the highest score.

__init__(sanitize=True, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.

generate_seeds(sentences: List[str], max=10, stopwords: List[str] = None, **kwargs) → List[str][source]¶ Return a list of seeds composed of the most frequent n-grams (weighted by IDF).

kwargs = None¶ The arguments to pass to the vectorizer. They can be overridden in the constructor.

sanitize = None¶ If this flag is set, digits that are not part of a word will be removed before feeding the vectorizer. This is useful because the TfidfVectorizer's default token pattern treats digits as actual words (but ignores punctuation; see the CountVectorizer token_pattern attribute).
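A short usage sketch (the sentences are illustrative):

from swisstext.cmd.scraping.tools.basic_seed_creator import BasicSeedCreator, IdfSeedCreator

sentences = [
    'das isch e churze satz',
    'no e satz uf schwiizerdütsch',
    'und scho wieder e satz',
]

# most frequent word trigrams...
seeds = BasicSeedCreator(ngram_range=(3, 3)).generate_seeds(sentences, max=5)
# ...or the highest-scoring n-grams after TF-IDF weighting
idf_seeds = IdfSeedCreator().generate_seeds(sentences, max=5, stopwords=['e', 'und'])
print(seeds, idf_seeds)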
Crawlers¶
This module contains an implementation of ICrawler that uses BeautifulSoup to extract text and links.
class swisstext.cmd.scraping.tools.bs_crawler.BsCrawler(joiner=' ')[source]¶
Bases: swisstext.cmd.scraping.interfaces.ICrawler

A basic crawler implemented using BeautifulSoup.
Text is extracted by concatenating all pieces of text (except CSS and script) into one string using a space separator (no newlines).

Warning
This crawler implementation returns the page's textual content in one bulk, with no newline characters. Consequently, results won't be exploitable without a clever ISplitter (recall that the default implementation splits text based on newlines…) such as the PunktSplitter.

Todo
Try using just the response.text from requests to get a proper encoding?

crawl(url: str) → swisstext.cmd.scraping.interfaces.ICrawler.CrawlResults[source]¶ Extract links and text from a URL.

classmethod extract_links(url, soup)[source]¶ Get all links from a soup (a href only). Note that links will be resolved (relative to absolute) and filtered (non-HTML removed).

classmethod extract_text_blocks(soup) → Generator[str, None, None][source]¶ Get text blocks from a BeautifulSoup object.
Warning
This method is destructive, as it will first remove script, style and forms from the HTML/soup object!
Todo
Find a way to avoid altering the soup object?

classmethod get_content(url) → Tuple[bytes, str][source]¶ Get the raw content from a URL (as a string), with the response encoding as reported by the requests module. Exceptions may be raised if:
- an error occurs during the GET request (timeout, decoding issue, too many redirects, etc.),
- the content-type is not of a supported type (namely html or text),
- the response body is empty.
class swisstext.cmd.scraping.tools.bs_crawler.CleverBsCrawler(joiner=' ')[source]¶
Bases: swisstext.cmd.scraping.tools.bs_crawler.BsCrawler

Another implementation of the BsCrawler that tries to be more clever during the text extraction step.

Processing steps:
- remove all scripts and CSS content (same as the BsCrawler);
- try to detect the page's main content using common naming schemes (id=main, role=main, etc.); if found, stop the processing and return only the text under it;
- try to detect and remove the header, footer and navigation before returning the text.

Following those heuristics requires more processing power and might miss some sentences (full sentences in the side bar, main content in a poorly coded website, etc.).

Todo
Make more thorough tests to determine whether those heuristics are worth it. If so, make this implementation the default.
swisstext.cmd.scraping.tools.bs_crawler.DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36'}¶ Headers passed with each request.

swisstext.cmd.scraping.tools.bs_crawler.GET_TIMEOUT = 60¶ Timeout (in seconds) used in requests.get.
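A small sketch of the lower-level helpers (the order of the get_content return values and the BeautifulSoup parsing step are assumptions about how the pieces fit together):

from bs4 import BeautifulSoup
from swisstext.cmd.scraping.tools.bs_crawler import BsCrawler

url = 'https://www.example.com'
# raw body plus the encoding reported by requests (assumed order of the returned tuple)
content, encoding = BsCrawler.get_content(url)
soup = BeautifulSoup(content, 'html.parser')

# resolved and filtered child links (a href only)
links = BsCrawler.extract_links(url, soup)

# text blocks, extracted destructively (scripts, styles and forms are removed from the soup)
text = ' '.join(BsCrawler.extract_text_blocks(soup))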
This module contains a subclass of BsCrawler relying on jusText for text extraction. It seems to work better than BsCrawler or CleverBsCrawler.
To get a feel, try the online demo of the original jusText.

Note
- Usually, jusText uses '' (the empty string) to join text nodes inside a paragraph, thus making things like "One sentence.Second sentence" likely. Here, we always use a space to join, then normalize the spaces.
- jusText will throw an error on empty document content, which is wrapped inside a CrawlError.
class swisstext.cmd.scraping.tools.justext_crawler.JustextCrawler(joiner='\n', keep_bad=True, stoplist=None, stopwords_low=0.3, stopwords_high=0.32, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.tools.bs_crawler.BsCrawler

A BsCrawler that relies on jusText to cleverly extract meaningful text from webpages.

__init__(joiner='\n', keep_bad=True, stoplist=None, stopwords_low=0.3, stopwords_high=0.32, **kwargs)[source]¶ Create a crawler instance.
Parameters:
- joiner – character used to join paragraphs
- keep_bad – if set, keep everything; if unset, keep only paragraphs with a context-free class of "neargood" or "good"
- stoplist – see the jusText doc
- stopwords_low – idem
- stopwords_high – idem
- kwargs – unused
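A minimal usage sketch (the attributes of the returned CrawlResults object are assumptions; only crawl() itself is documented here):

from swisstext.cmd.scraping.tools.justext_crawler import JustextCrawler

# keep only "good"/"neargood" paragraphs and join them with newlines
crawler = JustextCrawler(joiner='\n', keep_bad=False)

results = crawler.crawl('https://www.example.com')
# CrawlResults bundles the extracted text and the child links found on the page;
# the attribute names below are assumptions, not part of the documented API
print(results.text[:200])
for link in results.links:
    print(link)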
Normalizers¶
class swisstext.cmd.scraping.tools.norm_punc.Normalizer(**kwargs)[source]¶
Bases: object

A wrapper around normalize_text().

__init__(**kwargs)[source]¶ Initialize a normalizer. The kwargs will be passed to normalize_text() as-is.

kwargs = None¶ Extra options to pass to normalize_text().

normalize(text)[source]¶ Call normalize_text() on text with the extra kwargs options.
swisstext.cmd.scraping.tools.norm_punc.normalize_text(text, fix_encoding=False, strip_emojis=False)[source]¶ Normalize text:
- normalize accents (using the NFC convention),
- strip control/invisible chars and leftover combining diacritics,
- undo ligatures,
- normalize quotes, apostrophes and unicode characters (dashes, etc.),
- normalize spaces (all spaces, including nbsp and tabs, will be encoded as 0x20),
- strip and collapse multiple spaces into one,
- etc.

Optionally:
- try to detect and fix encoding issues (see ftfy.fix_encoding) on a per-sentence basis (delimited by newlines);
- strip emojis (using a simple regex, not all cases are covered!).

Parameters:
- text – the text to normalize; newlines will be preserved
- fix_encoding – if set, use ftfy to fix encoding issues on a per-sentence basis
- strip_emojis – if set, try to find and strip unicode emojis

Returns:
the normalized text
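A short usage sketch (the input string is illustrative):

from swisstext.cmd.scraping.tools.norm_punc import Normalizer, normalize_text

raw = "“Grüezi”\u00a0mitenand …  wie gaht’s?"
print(normalize_text(raw))
# the same call with encoding fixes and emoji stripping enabled
print(normalize_text(raw, fix_encoding=True, strip_emojis=True))

# or through the wrapper class, which stores the options once
normalizer = Normalizer(fix_encoding=True)
print(normalizer.normalize(raw))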
Splitters¶
class swisstext.cmd.scraping.tools.punkt_splitter.PunktSplitter(modelfile=None)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISplitter

A splitter using the PunktSentenceTokenizer, the NLTK implementation of the "Unsupervised Multilingual Sentence Boundary Detection" algorithm (Kiss and Strunk, 2005).

Note
The default implementation uses a model trained on English sentences. This kaggle resource offers pretrained Punkt models for other languages as well, including German. In my tests though, German models performed poorly compared to the default…

Todo
Train a Punkt model for Swiss German. (https://stackoverflow.com/questions/21160310/training-data-format-for-nltk-punkt)
Reimplementation of Moses' split-sentences.perl in pure Python 3. The behavior is equivalent to https://bitbucket.org/luismsgomes/mosestokenizer/src/default/src/mosestokenizer/split-sentences.perl.
Changes compared to the latest split-sentences.perl:
- no support for Chinese, Hindi and Gujarati (the corresponding if blocks were simply ignored);
- addition of a "more" option that splits on ":;" characters;
- split returns a list of sentences instead of a string;
- '<P>' is not added when encountering a "closing" empty line in the input (all empty lines are just ignored).

Note
A better implementation (at least for swisstext) is MocySplitter.
class swisstext.cmd.scraping.tools.moses_splitter.MosesSplitter(lang='de', prefix_file=None, more=True)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISplitter

Python implementation of Moses' split_sentences.perl.

__init__(lang='de', prefix_file=None, more=True)[source]¶
Parameters:
- lang – nonbreaking_prefix file to load (available: en, de)
- prefix_file – path to a custom nonbreaking_prefix file
- more – if set, systematically split on :;

classmethod load_nb_prefixes(lang='en', prefix_file=None)[source]¶ Read the nonbreaking_prefixes from a file.
Parameters:
- lang – the language to load
- prefix_file – a custom file to load (has priority over lang)
Returns:
the nonbreaking_prefix dictionary, with key=prefix and value=1|2; 1 means the prefix applies anywhere, 2 means it only applies when followed by digits.
This splitter implementation was heavily inspired by Moses' split-sentences.perl. It requires only basic Python 3 modules and the regex module (which supports unicode regexes).
Main changes compared to the latest split-sentences.perl:
- implemented with Swiss German (Web) text in mind: no support for Chinese, Hindi and Gujarati; number and quote conventions are based on this language only;
- addition of a "more" option that splits on ":;" characters (but tries to keep emojis and URLs intact);
- lowercase letters can also signal a start of sentence (often seen on the web);
- more aggressive splits on ?! characters;
- split returns a list of sentences instead of a string;
- it is possible to load multiple nonbreaking prefix lists (merged into one lookup table);
- no support for <p> tags (I didn't get the purpose of this option anyway).
class swisstext.cmd.scraping.tools.mocy_splitter.MocySplitter(langs=None, prefix_file=None, more=True, keep_newlines=True)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISplitter

A splitter largely inspired by Moses' split_sentences.perl. The name is a contraction of Moses and Lucy. Most important changes:
- a start of sentence doesn't need to be uppercase, lowercase letters will do too;
- the more parameter lets you choose to break on ';:' as well;
- end-of-sentence '?!' characters are better handled;
- multiple nonbreaking_prefix files can be used at once (rules will be merged).

__init__(langs=None, prefix_file=None, more=True, keep_newlines=True)[source]¶
Parameters:
- langs – a List[str] of language(s) for the nonbreaking_prefix file(s) to load (default: en, de)
- prefix_file – path to a custom nonbreaking_prefix file
- more – if set, systematically split on :;
- keep_newlines – if set, treat newlines as paragraph delimiters that will be preserved; if unset, newlines are ignored and empty lines are treated as paragraph delimiters (Moses' original behavior, see split()).
langs = None¶ The nonbreaking prefix files to load.

classmethod load_nb_prefixes(langs, prefix_file=None)[source]¶ Read the nonbreaking_prefixes from a file or from an array of languages.
Parameters:
- langs – the language(s) to load
- prefix_file – a custom file to load (has priority over langs)
Returns:
the nonbreaking_prefix dictionary, with key=prefix and value=1|2; 1 means the prefix applies anywhere, 2 means it only applies when followed by digits.

more = None¶ Whether or not to split on :;

nb_prefixes = None¶ The nonbreaking prefix lookup table.
split(input_text)[source]¶ Split a text into sentences. Depending on the value of keep_newlines, either split_sentences() or split_text() will be called.
Parameters:
- input_text – the input text
Returns:
a list of sentences (no blank lines)

classmethod split_paragraph(text, nb_prefixes, more=False)[source]¶ Handle one paragraph of text.
Parameters:
- text – the paragraph to split
- nb_prefixes – the dictionary of nonbreaking_prefixes (see the perl implementation/doc)
- more – if set, systematically split on :;
Returns:
a list of sentences
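A short usage sketch (the Swiss German sample text is illustrative):

from swisstext.cmd.scraping.tools.mocy_splitter import MocySplitter

splitter = MocySplitter(langs=['de'], more=True)
text = (
    'Grüezi mitenand! das isch de erscht Satz; und das de zwöit.\n'
    'E nöie Abschnitt uf ere nöie Zile.'
)
for sentence in splitter.split(text):
    print(sentence)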
Sentence Filters¶
This module contains an implementation of ISentenceFilter that uses simple rules to filter "well-formed" sentences.
How it works¶
Each sentence is checked against a list of rules and rejected / considered invalid if any of those rules fail. Rules are thus AND-based.
Rules are defined using a simple YAML syntax and can be of two types: length-based (character count) or pattern-based (regular expressions). They are checked in the same order they are defined.
Note
Regular expressions can be quite expensive, so try to limit their complexity to the minimum required.
Rules are checked in the same order as they are defined, so it is advised to put the most generic / efficient ones first.
Note
This module uses the regex library (version V0) instead of the default re. You can thus freely use unicode regular expressions in your rules.
Rule syntax¶
Length-based rules (length) must specify at least one of min or max length, i.e. the bounds on the number of characters. The rule succeeds if min <= len(s) <= max. Here is an example:

- max_length:
    descr: too long
    length:
      max: 1000
Pattern-based rules (find) are a bit similar, but instead of counting the number of characters, they count the number of occurrences of a pattern (i.e. the number of matches when calling regex.findall(pattern, s)).
The rule succeeds if min <= nb_matches <= max (inclusive!).
Examples:

- dashes:
    descr: too many dashes
    find:
      pattern: '[¯‐―−﹘--]'
      count:
        max: 2

- right_number_of_punctuation:
    descr: punctuation is between 5 and 10
    find:
      pattern: '\p{P}'
      count:
        min: 5
        max: 10
Comparison rules (compare) define both a numerator and a denominator pattern. The number of matches is found for each pattern, then a ratio is computed as ratio = count(num matches) / (count(denom matches) + 1). The matching is once again based on regex.findall.
The rule succeeds if min <= ratio <= max (inclusive!).
Example:

- too_many_commas:
    descr: compare the number of commas against the number of words in the sentence
    compare:
      num: ','
      denom: '\p{L}+'
      ratio:
        max: 0.25
Finally, an if condition can be used. If conditions are checked first, and if the check fails, the rule is simply ignored:

- ellipsis:
    descr: ellipsis on short sentences
    if:
      length:
        max: 30
    find:
      pattern: '(\.\s?){3}$'
      count:
        max: 0
Rules can additionally specify examples and counterexamples that can be used to quickly check that they work (see the Rule.self_check method). For example:

- spelled_words:
    descr: W O R D
    find:
      pattern: ' ([\p{L}] ){3,}'
      count:
        max: 0
    examples:
      - 'you must B E L I E V E me.'
      - 'span spells S P A N.'
      - 's p a n means span!!'
    counterexamples:
      - 'this is O K :)'
class swisstext.cmd.scraping.tools.pattern_sentence_filter.PatternSentenceFilter(rulespath=None)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISentenceFilter

By default, rules are loaded from the default file pattern_sentence_filter.yaml in the current directory. You can override this by passing a path to the constructor (rulespath argument).
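A usage sketch (only the constructor is documented above; the per-sentence check at the end assumes the ISentenceFilter interface exposes such a method, so verify the name against the interface definition):

from swisstext.cmd.scraping.tools.pattern_sentence_filter import PatternSentenceFilter

# use the bundled pattern_sentence_filter.yaml ...
default_filter = PatternSentenceFilter()
# ... or point to a custom rule file (hypothetical path)
custom_filter = PatternSentenceFilter(rulespath='/path/to/my_rules.yaml')

# assumed per-sentence validity check from ISentenceFilter:
# custom_filter.is_valid('a candidate sentence')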
Link Filters¶
It is optionally possible to add custom URL filtering logic that will be called for each new URL found on a page.
This lets you:
- ignore child URLs (by returning None), or
- modify child URLs, for example by normalizing subdomains, stripping URL parameters, etc.
Just create an implementation of the IUrlFilter interface and implement its fix method.
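A minimal sketch of such a filter (it assumes IUrlFilter lives in swisstext.cmd.scraping.interfaces like the other interfaces, and that fix takes the URL string and returns either a string or None, as described above):

from typing import Optional
from urllib.parse import urlsplit, urlunsplit

from swisstext.cmd.scraping.interfaces import IUrlFilter


class StripQueryUrlFilter(IUrlFilter):
    """Drop query strings and fragments, and ignore a (hypothetical) noisy subdomain."""

    def fix(self, url: str) -> Optional[str]:
        parts = urlsplit(url)
        # returning None tells the scraper to ignore this child URL
        if parts.netloc.startswith('shop.'):
            return None
        # otherwise return the (possibly modified) URL
        return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))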
Language Detectors¶
class swisstext.cmd.scraping.tools.swigspot_langid.SwigspotLangid[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISgDetector

This LID model was developed during the SwigSpot project.
In short, it uses 6000 character n-gram features (between 3 and 5 characters long) with TF-IDF scaling and a logistic regression. Sentences are preprocessed by removing everything except letters, spaces, commas and dots.
All the details are available in the SwigSpot repository. The notebook for recreating the model is available here.
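A usage sketch (the prediction call is an assumption about the ISgDetector interface, which is not documented above, so check the interface definition before relying on it):

from swisstext.cmd.scraping.tools.swigspot_langid import SwigspotLangid

lid = SwigspotLangid()
sentences = ['Das isch e Satz uf Schwiizerdütsch.', 'This is an English sentence.']
# assumed API: a per-sentence Swiss German probability, as exposed by ISgDetector
# probas = lid.predict(sentences)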
Savers¶
class swisstext.cmd.scraping.tools.console_saver.ConsoleSaver(sentences_file: str = None, **kwargs)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISaver

Implementation of an ISaver useful for testing and debugging. It does not persist any results, but prints everything to the console instead. Blacklisted URLs and sentences are kept in sets in memory.

__init__(sentences_file: str = None, **kwargs)[source]¶
Parameters:
- sentences_file – optional path to a file where new sentences are written. Note that the file is overwritten on each run.

get_page(url: str, **kwargs) → swisstext.cmd.scraping.data.Page[source]¶ Load a page. The simplest implementation is just return Page(url). If the subclass uses a data store, it should also populate the other page attributes (e.g. the score information) so that the IDecider can make clever decisions.

is_url_blacklisted(url: str) → bool[source]¶ [ABSTRACT] Tells if the given url is part of the blacklist. This is called at the beginning, to avoid scraping pages unnecessarily.

save_page(page)[source]¶ [ABSTRACT] Persist a page. This is called after the scraping, so all the page's attributes are set, including the list of new Sentences found.

save_seed(seed: str)[source]¶ [ABSTRACT] Persist a seed (usually generated by the ISeedGenerator).
class swisstext.cmd.scraping.tools.mongo_saver.MongoSaver(db='st1', **kwargs)[source]¶
Bases: swisstext.cmd.scraping.interfaces.ISaver

This ISaver implementation persists everything to a MongoDB database.

See also
swisstext.mongo
Package defining the Mongo collections.

__init__(db='st1', **kwargs)[source]¶
Parameters:
- db – the database to use
- kwargs – may include host and port

blacklist_url(url: str, error_message=None, **kwargs)[source]¶ [ABSTRACT] Add the url to a blacklist.

get_page(url: str, **kwargs) → swisstext.cmd.scraping.data.Page[source]¶ Load a page. The simplest implementation is just return Page(url). If the subclass uses a data store, it should also populate the other page attributes (e.g. the score information) so that the IDecider can make clever decisions.

is_url_blacklisted(url: str)[source]¶ [ABSTRACT] Tells if the given url is part of the blacklist. This is called at the beginning, to avoid scraping pages unnecessarily.

save_page(page: swisstext.cmd.scraping.data.Page)[source]¶ [ABSTRACT] Persist a page. This is called after the scraping, so all the page's attributes are set, including the list of new Sentences found.

save_seed(seed: str)[source]¶ [ABSTRACT] Persist a seed (usually generated by the ISeedGenerator).

save_url(url: str, parent: str = None)[source]¶ [ABSTRACT] Save a potentially interesting URL which won't be visited during this round (e.g. max_depth reached).
Parameters:
- url – the url
- parent – the parent url
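A short end-to-end sketch using the ConsoleSaver (the scraping step in the middle is only indicative):

from swisstext.cmd.scraping.tools.console_saver import ConsoleSaver

saver = ConsoleSaver(sentences_file='/tmp/new_sentences.txt')

url = 'http://example.com'
if not saver.is_url_blacklisted(url):
    page = saver.get_page(url)
    # ... scrape the page and attach the new sentences found ...
    saver.save_page(page)

saver.save_seed('es bitzli schwiizerdütsch')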