swisstext.cmd¶

This package contains everything for running the SwissText Scraping logic.

It is implemented using two command-line programs, each in its own package:

st_scrape, defined in swisstext.cmd.scraping, is used to scrape URLs and detect/save Swiss German sentences,
st_search, defined in swisstext.cmd.searching, is used to submit search queries using seeds and gather new URLs to scrape.

After installing the package, here is an example use:

## To use this script, create a directory and create the following files:
##  * configuration files: searching.yaml and scraping.yaml (see the doc)
##  * bootstrap files: base_seeds.txt, one seed per line to start with

set -e
trap exit SIGINT

base_dir=.  # relative path to the base directory

st_search="st_search -c $base_dir/searching.yaml"
st_scrape="st_scrape -c $base_dir/scraping.yaml"

# search for the bootstrap seeds
$st_search from_file $base_dir/base_seeds.txt

# do five loops. Each loop: (a) scrapes the new URLs, (b) generates random seeds, (c) searches using the new seeds
for i in $(seq 1 5); do
    echo "======= START LOOP"
    # scrape at most 30 new urls
    $st_scrape --no-seed from_mongo --new -n 30
    echo "======= generating SEEDS"
    # generate three random seeds, each time from 150 randomly picked sentences
    $st_scrape gen_seeds -s 150 -n 1
    $st_scrape gen_seeds -s 150 -n 1
    $st_scrape gen_seeds -s 150 -n 1
    echo "======= searching SEEDS"
    # search URLs using the newly generated seeds
    $st_search from_mongo --new -n 3
done

Todo

add pipeline schema

Configuration files¶

This module encapsulates all the configuration needed to run a multi-step toolchain. It is used by both command-line tools.

Config objects can easily be populated from a YAML configuration file. This file should have the following structure:

# This entry defines general options that will be stored in :py:attr:`BaseConfig.options`.
options:
  key1: value1

# This entry defines tools to instantiate
# the actual name is specified in the subclass by overriding :py:attr:`BaseConfig.tool_entry_name`
# valid interface names are specified in the subclass by overriding :py:attr:`valid_tool_entries`
tool_entry_name:
    # optional default package for relative entries
    _base_package: some.default.module
    # list of tools
    interface_name_1: absolute.module.ToolClassName1
    interface_name_2: .ToolClassName2 # will be expanded to some.default.module.ToolClassName2

# any entry in the form [interface_name]_options defines options that will be passed to the
# tool class upon construction. Here, for example, the actual instantiation will be:
#    absolute.module.ToolClassName1(param_1=some_value, param_X=some_other_value)
interface_name_1_options:
    param_1: some_value
    param_X: some_other_value
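
The expansion of relative entries described in the comments above can be illustrated with a small sketch. It is not the BaseConfig implementation: the helper name expand_tool_entries is hypothetical, and raising ValueError for a relative entry without _base_package is an assumption.

```python
from typing import Dict


def expand_tool_entries(tool_entries: Dict[str, str]) -> Dict[str, str]:
    """Expand relative class names (leading '.') using the optional _base_package.

    Hypothetical helper, written only to illustrate the YAML convention.
    """
    entries = dict(tool_entries)  # work on a copy, leave the input untouched
    base_package = entries.pop("_base_package", None)
    expanded = {}
    for interface_name, class_path in entries.items():
        if class_path.startswith("."):
            if base_package is None:
                # Assumption: a relative entry without a base package is an error.
                raise ValueError(f"relative entry {class_path!r} needs a _base_package")
            class_path = base_package + class_path
        expanded[interface_name] = class_path
    return expanded


tools = expand_tool_entries({
    "_base_package": "some.default.module",
    "interface_name_1": "absolute.module.ToolClassName1",
    "interface_name_2": ".ToolClassName2",
})
```

Here tools maps interface_name_2 to some.default.module.ToolClassName2, while the absolute entry is left unchanged.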

class swisstext.cmd.base_config.BaseConfig(default_config_path: str, option_class: classmethod = <class 'dict'>, config: Union[str, dict, io.IOBase] = None)[source]¶

Bases: abc.ABC

This class encapsulates the configuration options of a toolchain. It is able to load options from YAML files and also to instantiate tools from a dictionary of module+class names.

INTERFACE_WILDCARD = '_I_'¶: if this is used, try to instantiate the interface of the tool

__init__(default_config_path: str, option_class: classmethod = <class 'dict'>, config: Union[str, dict, io.IOBase] = None)[source]¶

Create a configuration.

Subclasses should provide the path to a default configuration (YAML file) and optionally an option class to hold general options. The config is provided by the user and can be a file, a path to a file or even a dictionary (as long as it has the correct structure).

Parameters

default_config_path – the absolute path to a default configuration file.
option_class – the class to use for holding global options (a dictionary by default).
config – user configuration that overrides the default options. config can be either a path or a file object of a YAML configuration, or a dictionary

Raises

ValueError – if config is not of a supported type

dumps()[source]¶

get(prop_name, default=None)[source]¶

Get a value from the YAML configuration file. This method supports paths with “.”, for example:

options: returns the top-level “options”
tool_entry.tool_name: returns the value of tool_name under tool_entry

Parameters

prop_name – the path of the property to retrieve
default – the default value if the property is not found

Returns

the property value or default
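
As a rough illustration of the dotted-path lookup, here is a minimal sketch working on a plain nested dictionary. The function name get_by_path is hypothetical, and the real get() may handle edge cases differently.

```python
from typing import Any


def get_by_path(config: dict, prop_name: str, default: Any = None) -> Any:
    """Walk a nested dictionary following a '.'-separated path."""
    node = config
    for key in prop_name.split("."):
        # Stop as soon as the path leaves the dictionary structure.
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


config = {
    "options": {"key1": "value1"},
    "tool_entry": {"tool_name": "some.module.Tool"},
}
top_level = get_by_path(config, "options")
nested = get_by_path(config, "tool_entry.tool_name")
missing = get_by_path(config, "tool_entry.unknown", default="fallback")
```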

instantiate_tools() → List[object][source]¶

For each entry in valid_tool_entries, try to create an instance of the class declared under tool_entry_name. If a tool is not defined and interfaces_package is not None, or if its value is INTERFACE_WILDCARD, the corresponding interface is instantiated instead.

Returns: a list of tool instances, in the same order as valid_tool_entries
Raises: RuntimeError – if a tool could not be instantiated
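
The general technique of creating an instance from a "module.ClassName" string can be sketched with importlib. This is only an illustration under assumptions: the helper name instantiate is hypothetical, the standard-library class fractions.Fraction stands in for a real tool class, and the actual method may report errors differently.

```python
import importlib
from typing import Any, Dict, Optional


def instantiate(class_path: str, options: Optional[Dict[str, Any]] = None) -> object:
    """Import 'package.module.ClassName' and call it with the given keyword options."""
    module_name, _, class_name = class_path.rpartition(".")
    try:
        module = importlib.import_module(module_name)
        cls = getattr(module, class_name)
        return cls(**(options or {}))
    except Exception as e:
        # Mirror the documented behaviour: any failure surfaces as a RuntimeError.
        raise RuntimeError(f"could not instantiate {class_path}") from e


# Stand-in for a tool entry plus its [interface_name]_options dictionary.
third = instantiate("fractions.Fraction", {"numerator": 1, "denominator": 3})
```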

abstract property interfaces_package¶: Should return the complete package where the interfaces are defined, if any. In case this is defined and a tool is missing from the list, we will try to instantiate the interface instead.

classmethod merge_dicts(default, overrides)[source]¶: Merge overrides into default recursively. As the names suggest, if an entry is defined in both, the value in overrides takes precedence.
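
A recursive merge of this kind can be sketched as follows. The function name merge_dicts_sketch is hypothetical, and returning a new dictionary rather than modifying default in place is an assumption; the actual classmethod may behave differently on that point.

```python
import copy


def merge_dicts_sketch(default: dict, overrides: dict) -> dict:
    """Return a new dict where values from overrides win over those in default."""
    merged = copy.deepcopy(default)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: merge them key by key.
            merged[key] = merge_dicts_sketch(merged[key], value)
        else:
            # Scalars, lists, or mismatched types: the override replaces the default.
            merged[key] = copy.deepcopy(value)
    return merged


default = {"options": {"key1": "value1", "key2": "value2"}, "other": 1}
overrides = {"options": {"key2": "changed"}, "extra": True}
merged = merge_dicts_sketch(default, overrides)
```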

set(prop_name, value)[source]¶: Set a value using the property dot syntax. See get() to see how it works.
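
For illustration, setting a value through a dotted path on a plain nested dictionary could look like the sketch below. The name set_by_path is hypothetical, and creating missing intermediate dictionaries is an assumption, not documented behaviour of set().

```python
from typing import Any


def set_by_path(config: dict, prop_name: str, value: Any) -> None:
    """Set a value in a nested dictionary following a '.'-separated path."""
    *parents, last = prop_name.split(".")
    node = config
    for key in parents:
        # Assumption: intermediate dictionaries are created when missing.
        node = node.setdefault(key, {})
    node[last] = value


config = {"options": {"key1": "value1"}}
set_by_path(config, "options.key1", "new_value")
set_by_path(config, "interface_name_1_options.param_1", "some_value")
```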

abstract property tool_entry_name¶: Should return the name of the tool_entries option, i.e. the YAML path to the dictionary of [interface name, canonical class to instantiate].

abstract property valid_tool_entries¶: Should return the list of valid tool entries under the tool_entry_name. Note that the order of tools in the list defines the order of tool instances returned by instantiate_tools().

Link utilities¶

This module provides utilities to deal with links.

Dealing with links in a page¶

By simply extracting all the href attributes in an HTML page, we might end up with a rather large list of children, not all of them interesting to crawl. Indeed, this list would probably contain:

duplicates (or different links pointing to the same page but another anchor),
anchors,
javascript (javascript:),
pointers to images, PDFs or other non-text resources,
etc.

As every implementation of swisstext.cmd.scraping.interfaces.ICrawler has to do the job of filtering this list, we might as well provide a general way to do so.
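
To make the problem concrete, the sketch below collects every href attribute of a small HTML snippet using only the standard library. The class name HrefCollector and the sample markup are made up for the illustration; a real crawler may rely on a different HTML parser.

```python
from html.parser import HTMLParser
from typing import List


class HrefCollector(HTMLParser):
    """Collect the raw value of every href attribute, without any filtering."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


html = """
<a href="#">top</a>
<a href="javascript:void(0)">menu</a>
<a href="/img/logo.png">logo</a>
<a href="../other/">other</a>
<a href="../other/#anchor">other, with anchor</a>
"""
collector = HrefCollector()
collector.feed(html)
raw_links = collector.hrefs
```

Out of the five raw values collected here, only one distinct page is worth crawling, which is why a shared filtering step is useful.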

Dealing with search results¶

When querying Google or another search engine for results, returned URLs might point to PDFs, videos or other uninteresting websites. The fix_url() function is here to help normalize URLs and filter out those “bad” results.

Note

For the stability of the whole system, it is essential that all URLs pass through this module before being added to the persistence layer (e.g. the Mongo database).

swisstext.cmd.link_utils.EXCLUDED_TLDS = {'ac': True, 'ad': True, 'ae': True, 'af': True, 'ag': True, 'ai': True, 'al': True, 'am': True, 'an': True, 'ao': True, 'aq': True, 'ar': True, 'as': True, 'at': True, 'au': True, 'aw': True, 'ax': True, 'az': True, 'ba': True, 'bb': True, 'bd': True, 'be': True, 'bf': True, 'bg': True, 'bh': True, 'bi': True, 'bj': True, 'bl': True, 'bm': True, 'bn': True, 'bo': True, 'bq': True, 'br': True, 'bs': True, 'bt': True, 'bv': True, 'bw': True, 'by': True, 'bz': True, 'ca': True, 'cat': True, 'cc': True, 'cd': True, 'cf': True, 'cg': True, 'ci': True, 'ck': True, 'cl': True, 'cm': True, 'cn': True, 'co': True, 'cr': True, 'cu': True, 'cv': True, 'cw': True, 'cx': True, 'cy': True, 'cz': True, 'dj': True, 'dk': True, 'dm': True, 'do': True, 'dz': True, 'ec': True, 'ee': True, 'eg': True, 'eh': True, 'er': True, 'es': True, 'et': True, 'eus': True, 'fi': True, 'fj': True, 'fk': True, 'fm': True, 'fo': True, 'ga': True, 'gal': True, 'gd': True, 'ge': True, 'gf': True, 'gg': True, 'gh': True, 'gi': True, 'gl': True, 'gm': True, 'gn': True, 'gp': True, 'gq': True, 'gr': True, 'gs': True, 'gt': True, 'gu': True, 'gw': True, 'gy': True, 'hk': True, 'hm': True, 'hn': True, 'hr': True, 'ht': True, 'hu': True, 'id': True, 'ie': True, 'il': True, 'im': True, 'in': True, 'io': True, 'iq': True, 'ir': True, 'is': True, 'je': True, 'jm': True, 'jo': True, 'jp': True, 'ke': True, 'kg': True, 'kh': True, 'ki': True, 'km': True, 'kn': True, 'kp': True, 'kr': True, 'kw': True, 'ky': True, 'kz': True, 'la': True, 'lb': True, 'lc': True, 'li': True, 'lk': True, 'lr': True, 'ls': True, 'lt': True, 'lu': True, 'lv': True, 'ly': True, 'ma': True, 'mc': True, 'md': True, 'me': True, 'mf': True, 'mg': True, 'mh': True, 'mk': True, 'ml': True, 'mm': True, 'mn': True, 'mo': True, 'mp': True, 'mq': True, 'mr': True, 'ms': True, 'mt': True, 'mu': True, 'mv': True, 'mw': True, 'mx': True, 'my': True, 'mz': True, 'na': True, 'nc': True, 'ne': True, 'nf': True, 'ng': True, 'ni': True, 'nl': True, 'no': True, 'np': True, 'nr': True, 'nu': True, 'nz': True, 'om': True, 'pa': True, 'pe': True, 'pf': True, 'pg': True, 'ph': True, 'pk': True, 'pl': True, 'pm': True, 'pn': True, 'pr': True, 'ps': True, 'pt': True, 'pw': True, 'py': True, 'qa': True, 're': True, 'ro': True, 'rs': True, 'ru': True, 'rw': True, 'sa': True, 'sb': True, 'sc': True, 'sd': True, 'se': True, 'sg': True, 'sh': True, 'si': True, 'sj': True, 'sk': True, 'sl': True, 'sm': True, 'sn': True, 'so': True, 'sr': True, 'ss': True, 'st': True, 'sv': True, 'sx': True, 'sy': True, 'sz': True, 'tc': True, 'td': True, 'tf': True, 'tg': True, 'th': True, 'tj': True, 'tk': True, 'tl': True, 'tm': True, 'tn': True, 'to': True, 'tp': True, 'tr': True, 'tt': True, 'tv': True, 'tw': True, 'tz': True, 'ua': True, 'ug': True, 'uy': True, 'uz': True, 'va': True, 'vc': True, 've': True, 'vg': True, 'vi': True, 'vn': True, 'vu': True, 'wf': True, 'ws': True, 'ye': True, 'yt': True, 'za': True, 'zm': True, 'zw': True}¶: Quick lookup dictionary used to exclude URLs with a country-code TLD, except for a short list of pertinent ones (.ch, .eu, .de, …). See this gist for the source of the list.
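
How such a lookup dictionary might be used is sketched below. The helper has_excluded_tld and the three-entry EXCLUDED_TLDS_SAMPLE are hypothetical stand-ins, and the way the module actually extracts the TLD is not shown in this documentation.

```python
from urllib.parse import urlsplit

# Tiny illustrative subset; the real EXCLUDED_TLDS dictionary is much larger.
EXCLUDED_TLDS_SAMPLE = {"ru": True, "jp": True, "cn": True}


def has_excluded_tld(url: str, excluded: dict = EXCLUDED_TLDS_SAMPLE) -> bool:
    """Return True if the last label of the hostname is an excluded TLD."""
    hostname = urlsplit(url).hostname or ""
    tld = hostname.rsplit(".", 1)[-1]
    return tld in excluded


excluded_example = has_excluded_tld("http://example.ru/page")
swiss_example = has_excluded_tld("http://example.ch/page")
subdomain_example = has_excluded_tld("https://ru.wikipedia.org/wiki")
```

Only the last hostname label is checked, so a language subdomain such as ru.wikipedia.org is not caught by this test; Wikipedia subdomains are covered by INCLUDED_WIKI_DOMAINS instead.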

swisstext.cmd.link_utils.INCLUDED_WIKI_DOMAINS = {}¶: Quick lookup dictionary to exclude any URL from wikipedia, except the ones from the given subdomains. See https://en.wikipedia.org/wiki/List_of_Wikipedias for a full list of wikipedia subdomains.

swisstext.cmd.link_utils.filter_links(base_url: str, links: Iterable[str]) → Generator[str, None, None][source]¶

Resolve, clean and filter links found in a page. By links we mean here any value of an href attribute found in a page. This is especially useful for implementations of swisstext.cmd.scraping.interfaces.ICrawler to populate the links attribute of the crawl results.

In addition to what fix_url() does, this method will also exclude:

the base URL
duplicates

For two links to be considered duplicates, they need to match exactly. Exceptions are anchors (stripped automatically) and trailing slashes (in this case, the first encountered link will be returned).

from swisstext.cmd.link_utils import filter_links
base_url = 'http://example.ch/page/1'
hrefs = [
    '#',
    "whatsapp://send?text=Dumoulin verlangt naar",
    '../other/',
    '../other#anchor',
    '?page=2&q=isch',
    '?page=2',
    'https://imgur.org/some-image.png',
    'https://ru.wikipedia.org/wiki',
    'https://als.wikipedia.org',
    'http://other.resource.test',
    'javascript:return false',
    '?p=66383&sid=aaece0dfd1e47e08505dcb5fae2d3f03',

    'http://www.twitter.com/some-hashtag?lang=en-gb',
    'https://twitter.com/share?text=blabla',

    'http://zh-cn.facebook.com/XXXX',
    'https://facebook.com/',
    'https://www.facebook.com',
    'https://graph.facebook.com'
]
links = list(filter_links(base_url, hrefs))

# links contain:
#  http://example.ch/other/
#  http://example.ch/page/1?page=2&q=isch
#  http://example.ch/page/1?page=2
#  http://other.resource.test
#  http://example.ch/page/1?p=66383
#  https://twitter.com/some-hashtag
#  https://www.facebook.com/XXXX
#  https://www.facebook.com/
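
The duplicate rule described above (anchors stripped, trailing slashes ignored, first occurrence kept) can be sketched in isolation. This is not the code of filter_links(): the name unique_links is hypothetical, and the sketch expects links that are already absolute.

```python
from typing import Iterable, Iterator
from urllib.parse import urldefrag


def unique_links(links: Iterable[str]) -> Iterator[str]:
    """Yield each link once, ignoring anchors and trailing slashes when comparing."""
    seen = set()
    for link in links:
        without_anchor = urldefrag(link).url  # drop '#anchor'
        key = without_anchor.rstrip("/")      # comparison key only
        if key and key not in seen:
            seen.add(key)
            yield without_anchor              # the first encountered form is returned


deduplicated = list(unique_links([
    "http://example.ch/other/",
    "http://example.ch/other#anchor",
    "http://example.ch/other",
    "http://example.ch/page/1?page=2",
]))
```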

Parameters

base_url – the URL of the current page (not returned and used to resolve relative links). Use ‘’ to ignore.
links – a list of links found, relative or absolute

Returns

a generator of unique absolute URLs, all beginning with http

swisstext.cmd.link_utils.fix_url(url: str, base_url: str = None) -> (<class 'str'>, <class 'bool'>)[source]¶

Fix/normalize a URL and decide if it is interesting to crawl.

The URL is potentially transformed by:

resolving relative to absolute URLs (only if base_url is set)

removing potential sid query parameters (Magento)

removing anchors

The “is interesting” decision will exclude:

relative URLs (so be sure to provide a base_url if the url is relative)

non HTTP links (mailto:, javascript:, anchors, …)

URLs pointing to non text resources (see EXCLUDED_EXTENSIONS)

URLs with a country-code TLD unlikely to contain Swiss German (see EXCLUDED_TLDS)

Parameters

url – the url
base_url – the base url, required if the url is a relative one

Returns

a tuple (fixed_url, is_interesting)
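
The three transformations listed above can be approximated with urllib.parse, as in the sketch below. It covers only the normalization part, not the “is interesting” decision; the name normalize_sketch is hypothetical and the real fix_url() may differ in details such as query re-encoding.

```python
from typing import Optional
from urllib.parse import parse_qsl, urldefrag, urlencode, urljoin, urlsplit, urlunsplit


def normalize_sketch(url: str, base_url: Optional[str] = None) -> str:
    """Resolve a relative URL, strip its anchor and drop any 'sid' query parameter."""
    if base_url:
        url = urljoin(base_url, url)   # relative to absolute
    url = urldefrag(url).url           # remove '#anchor'
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "sid"                # remove session identifiers
    ]
    # Note: urlencode may re-encode special characters in the remaining parameters.
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "http://example.ch/page/1"
without_sid = normalize_sketch("?p=66383&sid=aaece0dfd1e47e08505dcb5fae2d3f03", base)
without_anchor = normalize_sketch("../other#anchor", base)
```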