swisstext.cmd
This package contains everything for running the SwissText Scraping logic.
It is implemented using two command-line programs, each in its own package:
st_scrape, defined in swisstext.cmd.scraping, is used to scrape URLs and detect/save Swiss German sentences,
st_search, defined in swisstext.cmd.searching, is used to submit search queries using seeds and gather new URLs to scrape.
After installing the package, here is an example use:
## To use this script, create a directory and create the following files:
## * configuration files: searching.yaml and scraping.yaml (see the doc)
## * bootstrap files: base_seeds.txt, one seed per line to start with
set -e
trap exit SIGINT
base_dir=. # relative link to the base directory
st_search="st_search -c $base_dir/searching.yaml"
st_scrape="st_scrape -c $base_dir/scraping.yaml"
# search for the bootstrap seeds
$st_search from_file $base_dir/base_seeds.txt
# do five loops. Each loop: (a) scrapes the new URLs, (b) generates random seeds, (c) searches using the new seeds
for i in $(seq 1 5); do
    echo "======= START LOOP"
    # scrape at most 30 new urls
    $st_scrape --no-seed from_mongo --new -n 30
    echo "======= generating SEEDS"
    # generate three random seeds, each time from 150 randomly picked sentences
    $st_scrape gen_seeds -s 150 -n 1
    $st_scrape gen_seeds -s 150 -n 1
    $st_scrape gen_seeds -s 150 -n 1
    echo "======= searching SEEDS"
    # search URLs using the newly generated seeds
    $st_search from_mongo --new -n 3
done
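For readers who prefer to drive the loop from Python, the sketch below builds the same command sequence as argument lists. The function name build_pipeline_commands is hypothetical; the sub-commands and flags are copied from the script above, and nothing is executed.

```python
from typing import List


def build_pipeline_commands(base_dir: str = ".", loops: int = 5) -> List[List[str]]:
    """Return the argv lists of the example script, in execution order.

    Hypothetical helper: it only mirrors the shell script and runs nothing.
    """
    st_search = ["st_search", "-c", f"{base_dir}/searching.yaml"]
    st_scrape = ["st_scrape", "-c", f"{base_dir}/scraping.yaml"]

    # search for the bootstrap seeds
    commands = [st_search + ["from_file", f"{base_dir}/base_seeds.txt"]]
    for _loop in range(loops):
        # (a) scrape at most 30 new URLs
        commands.append(st_scrape + ["--no-seed", "from_mongo", "--new", "-n", "30"])
        # (b) generate three seeds, each from 150 randomly picked sentences
        for _seed in range(3):
            commands.append(st_scrape + ["gen_seeds", "-s", "150", "-n", "1"])
        # (c) search URLs using the newly generated seeds
        commands.append(st_search + ["from_mongo", "--new", "-n", "3"])
    return commands
```

Each returned list could then be passed to subprocess.run(argv, check=True), which stops at the first failing command much like set -e does in the shell version.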
Todo
add pipeline schema
Configuration files
This module encapsulates all the configuration needed to run a multi-step toolchain. It is used by both command-line tools.
Config objects can easily be populated from a YAML configuration file. This file should have the following structure:
# This entry defines general options that will be stored in :py:attr:`BaseConfig.options`.
options:
  key1: value1

# This entry defines tools to instantiate
# the actual name is specified in the subclass by overriding :py:attr:`BaseConfig.tool_entry_name`
# valid interface names are specified in the subclass by overriding :py:attr:`valid_tool_entries`
tool_entry_name:
  # optional default package for relative entries
  _base_package: some.default.module
  # list of tools
  interface_name_1: absolute.module.ToolClassName1
  interface_name_2: .ToolClassName2 # will be expanded to some.default.module.ToolClassName2
  # any entry in the form [interface_name]_options defines options that will be passed to the
  # tool class upon construction. Here, for example, the actual instantiation will be:
  # absolute.module.ToolClassName1(param_1=some_value, param_X=some_other_value)
  interface_name_1_options:
    param_1: some_value
    param_X: some_other_value
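The expansion rules described in the comments above can be sketched in a few lines of Python. This is an illustration only, not the package's code: the helper name resolve_tool_entries is hypothetical, and it assumes the [interface_name]_options entries sit in the same mapping as the tool entries.

```python
from typing import Any, Dict, Tuple

BASE_PACKAGE_KEY = "_base_package"
OPTIONS_SUFFIX = "_options"


def resolve_tool_entries(tool_entries: Dict[str, Any]) -> Dict[str, Tuple[str, Dict[str, Any]]]:
    """Map each interface name to (absolute class path, constructor options)."""
    base_package = tool_entries.get(BASE_PACKAGE_KEY, "")
    resolved = {}
    for name, value in tool_entries.items():
        if name == BASE_PACKAGE_KEY or name.endswith(OPTIONS_SUFFIX):
            continue  # not a tool declaration
        # relative entries start with a dot and are prefixed with the base package
        class_path = base_package + value if value.startswith(".") else value
        options = tool_entries.get(name + OPTIONS_SUFFIX) or {}
        resolved[name] = (class_path, options)
    return resolved


# the tool_entry_name mapping from the YAML example, as a Python dictionary
example = {
    "_base_package": "some.default.module",
    "interface_name_1": "absolute.module.ToolClassName1",
    "interface_name_2": ".ToolClassName2",
    "interface_name_1_options": {"param_1": "some_value", "param_X": "some_other_value"},
}
resolved = resolve_tool_entries(example)
```

Here resolved maps interface_name_2 to some.default.module.ToolClassName2 with no options, and interface_name_1 to its absolute class path together with the two parameters.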
class swisstext.cmd.base_config.BaseConfig(default_config_path: str, option_class: classmethod = <class 'dict'>, config: Union[str, dict, io.IOBase] = None)
Bases: abc.ABC
This class encapsulates the configuration options of a toolchain. It is able to load options from YAML files and also to instantiate tools from a dictionary of module+class names.
INTERFACE_WILDCARD = '_I_'
If this value is used, try to instantiate the interface of the tool.
__init__(default_config_path: str, option_class: classmethod = <class 'dict'>, config: Union[str, dict, io.IOBase] = None)
Create a configuration.
Subclasses should provide the path to a default configuration (YAML file) and optionally an option class to hold general options. The config is provided by the user and can be a file, a path to a file or even a dictionary (as long as it has the correct structure).
Parameters
  default_config_path – the absolute path to a default configuration file.
  option_class – the class to use for holding global options (a dictionary by default).
  config – user configuration that overrides the default options. config can be either a path or a file object of a YAML configuration, or a dictionary.
Raises
  ValueError – if config is not of a supported type
get(prop_name, default=None)
Get a value from the YAML configuration file. This method supports paths with “.”, for example:
options: returns the top-level “options”
tool_entry.tool_name: returns the value of tool_name under tool_entry
Parameters
  prop_name – the path of the property to retrieve
  default – the default value if the property is not found
Returns
  the property value or default
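The dotted-path lookup can be pictured with the following minimal sketch on a plain dictionary. The function get_by_path is a hypothetical stand-in, not the method's actual implementation.

```python
from typing import Any


def get_by_path(config: dict, prop_name: str, default: Any = None) -> Any:
    """Walk a nested dictionary following a dot-separated path."""
    current: Any = config
    for key in prop_name.split("."):
        # stop as soon as a level is missing or is not a dictionary
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


config = {
    "options": {"key1": "value1"},
    "tool_entry": {"tool_name": "some.module.Tool"},
}
top_level = get_by_path(config, "options")
nested = get_by_path(config, "tool_entry.tool_name")
missing = get_by_path(config, "tool_entry.unknown", default="fallback")
```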
instantiate_tools() → List[object]
For each of the valid_tool_entries under tool_entry_name, try to create an instance. If a tool is not defined and interfaces_package is not None, or if its value is INTERFACE_WILDCARD, the interface with the tool's name is instantiated instead.
Returns
  a list of tool instances, in the same order as valid_tool_entries
Raises
  RuntimeError – if a tool could not be instantiated
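Creating an instance from a module+class name typically relies on importlib. The sketch below shows the general technique with a standard-library class; the helper name instantiate is hypothetical, and the real method may handle errors and interface fallbacks differently.

```python
import importlib
from typing import Any, Dict, Optional


def instantiate(class_path: str, options: Optional[Dict[str, Any]] = None) -> object:
    """Import `package.module.ClassName` and call it with the given options."""
    module_name, _, class_name = class_path.rpartition(".")
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)(**(options or {}))
    except Exception as e:
        # mirror the documented behaviour: any failure surfaces as a RuntimeError
        raise RuntimeError(f"could not instantiate {class_path}: {e}") from e


# fractions.Fraction stands in for a tool class here
tool = instantiate("fractions.Fraction", {"numerator": 3, "denominator": 4})
```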
abstract property interfaces_package
Should return the complete package where the interfaces are defined, if any. In case this is defined and a tool is missing from the list, we will try to instantiate the interface instead.
classmethod merge_dicts(default, overrides)
Merge overrides into default recursively. As the names suggest, if an entry is defined in both, the value in overrides takes precedence.
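A recursive merge of this kind can be sketched as follows. The function merge_dicts_sketch is an illustration; in particular, the documentation does not say whether the real method modifies default in place, so the sketch returns a new dictionary and leaves both inputs untouched.

```python
import copy
from typing import Any, Dict


def merge_dicts_sketch(default: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values from `overrides` win over `default`."""
    merged = copy.deepcopy(default)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are dictionaries: merge them level by level
            merged[key] = merge_dicts_sketch(merged[key], value)
        else:
            # scalar, list or type mismatch: the override replaces the default
            merged[key] = copy.deepcopy(value)
    return merged


default = {"options": {"a": 1, "b": 2}, "pipeline": {"crawler": ".X"}}
overrides = {"options": {"b": 3}, "extra": True}
merged = merge_dicts_sketch(default, overrides)
```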
set(prop_name, value)
Set a value using the property dot syntax. See get() to see how it works.
abstract property tool_entry_name
Should return the name of the tool_entries option, i.e. the YAML path to the dictionary of [interface name, canonical class to instantiate].
abstract property valid_tool_entries
Should return the list of valid tool entries under the tool_entry_name. Note that the order of tools in the list defines the order of tool instances returned by instantiate_tools().
Link utilities
This module provides utilities to deal with links.
Dealing with links in a page
Simply extracting all the href attributes in an HTML page may yield a rather large list of children, not all of them interesting to crawl. Indeed, this list would probably contain:
duplicates (or different links pointing to the same page but another anchor),
anchors,
javascript (javascript:),
pointers to images, PDFs or other non-text resources,
etc.
As every implementation of swisstext.cmd.scraping.interfaces.ICrawler has to filter this list, we might as well provide a general way to do so.
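To make the filtering problem concrete, here is a minimal sketch built only on urllib.parse. It is not the module's implementation: the function name filter_hrefs and the short extension tuple are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator
from urllib.parse import urldefrag, urljoin

SKIPPED_EXTENSIONS = (".png", ".jpg", ".pdf")  # illustrative subset only


def filter_hrefs(base_url: str, hrefs: Iterable[str]) -> Iterator[str]:
    """Yield unique absolute HTTP(S) URLs, ignoring anchors and the base URL."""
    seen = {base_url.rstrip("/")}
    for href in hrefs:
        # resolve relative links, then drop the anchor
        url, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if not url.startswith(("http://", "https://")):
            continue  # javascript:, mailto:, ...
        if url.lower().endswith(SKIPPED_EXTENSIONS):
            continue  # images, PDFs and other non-text resources
        key = url.rstrip("/")  # treat "/other" and "/other/" as duplicates
        if key in seen:
            continue
        seen.add(key)
        yield url


hrefs = [
    "#",
    "../other/",
    "../other#anchor",
    "javascript:return false",
    "https://example.org/some-image.png",
    "?page=2",
    "mailto:someone@example.ch",
]
links = list(filter_hrefs("http://example.ch/page/1", hrefs))
```

With these inputs only http://example.ch/other/ and http://example.ch/page/1?page=2 survive; the anchor variant of /other is dropped as a duplicate of the first link encountered.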
Dealing with search results
When querying Google or another search engine, the returned URLs might point to PDFs, videos or other uninteresting websites. The fix_url() function is here to help normalize URLs and filter out those “bad” results.
Note
For the stability of the whole system, it is essential that all URLs pass through this module before being added to the persistence layer (e.g. the Mongo database).
swisstext.cmd.link_utils.EXCLUDED_TLDS
= {'ac': True, 'ad': True, 'ae': True, 'af': True, 'ag': True, 'ai': True, 'al': True, 'am': True, 'an': True, 'ao': True, 'aq': True, 'ar': True, 'as': True, 'at': True, 'au': True, 'aw': True, 'ax': True, 'az': True, 'ba': True, 'bb': True, 'bd': True, 'be': True, 'bf': True, 'bg': True, 'bh': True, 'bi': True, 'bj': True, 'bl': True, 'bm': True, 'bn': True, 'bo': True, 'bq': True, 'br': True, 'bs': True, 'bt': True, 'bv': True, 'bw': True, 'by': True, 'bz': True, 'ca': True, 'cat': True, 'cc': True, 'cd': True, 'cf': True, 'cg': True, 'ci': True, 'ck': True, 'cl': True, 'cm': True, 'cn': True, 'co': True, 'cr': True, 'cu': True, 'cv': True, 'cw': True, 'cx': True, 'cy': True, 'cz': True, 'dj': True, 'dk': True, 'dm': True, 'do': True, 'dz': True, 'ec': True, 'ee': True, 'eg': True, 'eh': True, 'er': True, 'es': True, 'et': True, 'eus': True, 'fi': True, 'fj': True, 'fk': True, 'fm': True, 'fo': True, 'ga': True, 'gal': True, 'gd': True, 'ge': True, 'gf': True, 'gg': True, 'gh': True, 'gi': True, 'gl': True, 'gm': True, 'gn': True, 'gp': True, 'gq': True, 'gr': True, 'gs': True, 'gt': True, 'gu': True, 'gw': True, 'gy': True, 'hk': True, 'hm': True, 'hn': True, 'hr': True, 'ht': True, 'hu': True, 'id': True, 'ie': True, 'il': True, 'im': True, 'in': True, 'io': True, 'iq': True, 'ir': True, 'is': True, 'je': True, 'jm': True, 'jo': True, 'jp': True, 'ke': True, 'kg': True, 'kh': True, 'ki': True, 'km': True, 'kn': True, 'kp': True, 'kr': True, 'kw': True, 'ky': True, 'kz': True, 'la': True, 'lb': True, 'lc': True, 'li': True, 'lk': True, 'lr': True, 'ls': True, 'lt': True, 'lu': True, 'lv': True, 'ly': True, 'ma': True, 'mc': True, 'md': True, 'me': True, 'mf': True, 'mg': True, 'mh': True, 'mk': True, 'ml': True, 'mm': True, 'mn': True, 'mo': True, 'mp': True, 'mq': True, 'mr': True, 'ms': True, 'mt': True, 'mu': True, 'mv': True, 'mw': True, 'mx': True, 'my': True, 'mz': True, 'na': True, 'nc': True, 'ne': True, 'nf': True, 'ng': True, 'ni': True, 'nl': True, 
'no': True, 'np': True, 'nr': True, 'nu': True, 'nz': True, 'om': True, 'pa': True, 'pe': True, 'pf': True, 'pg': True, 'ph': True, 'pk': True, 'pl': True, 'pm': True, 'pn': True, 'pr': True, 'ps': True, 'pt': True, 'pw': True, 'py': True, 'qa': True, 're': True, 'ro': True, 'rs': True, 'ru': True, 'rw': True, 'sa': True, 'sb': True, 'sc': True, 'sd': True, 'se': True, 'sg': True, 'sh': True, 'si': True, 'sj': True, 'sk': True, 'sl': True, 'sm': True, 'sn': True, 'so': True, 'sr': True, 'ss': True, 'st': True, 'sv': True, 'sx': True, 'sy': True, 'sz': True, 'tc': True, 'td': True, 'tf': True, 'tg': True, 'th': True, 'tj': True, 'tk': True, 'tl': True, 'tm': True, 'tn': True, 'to': True, 'tp': True, 'tr': True, 'tt': True, 'tv': True, 'tw': True, 'tz': True, 'ua': True, 'ug': True, 'uy': True, 'uz': True, 'va': True, 'vc': True, 've': True, 'vg': True, 'vi': True, 'vn': True, 'vu': True, 'wf': True, 'ws': True, 'ye': True, 'yt': True, 'za': True, 'zm': True, 'zw': True}
Quick lookup dictionary to exclude URLs from any country-code TLD, except a short list of pertinent ones (.ch, .eu, .de, …). See this gist for the source of the list.
swisstext.cmd.link_utils.INCLUDED_WIKI_DOMAINS = {}
Quick lookup dictionary to exclude any URL from Wikipedia, except the ones from the given subdomains. See https://en.wikipedia.org/wiki/List_of_Wikipedias for a full list of Wikipedia subdomains.
swisstext.cmd.link_utils.filter_links(base_url: str, links: Iterable[str]) → Generator[str, None, None]
Resolve, clean and filter links found in a page. By links we mean here any value of an href attribute found in a page. This is especially useful for the swisstext.cmd.scraping.interfaces.ICrawler to build and populate the links attribute of the crawl results.
In addition to what fix_url() does, this function will also exclude:
the base URL,
duplicates.
For two links to be considered duplicates, they need to match exactly. Exceptions are anchors (stripped automatically) and trailing slashes (in this case, the first encountered link will be returned).
from swisstext.cmd.link_utils import filter_links

base_url = 'http://example.ch/page/1'
hrefs = [
    '#',
    "whatsapp://send?text=Dumoulin verlangt naar",
    '../other/',
    '../other#anchor',
    '?page=2&q=isch',
    '?page=2',
    'https://imgur.org/some-image.png',
    'https://ru.wikipedia.org/wiki',
    'https://als.wikipedia.org',
    'http://other.resource.test',
    'javascript:return false',
    '?p=66383&sid=aaece0dfd1e47e08505dcb5fae2d3f03',
    'http://www.twitter.com/some-hashtag?lang=en-gb',
    'https://twitter.com/share?text=blabla',
    'http://zh-cn.facebook.com/XXXX',
    'https://facebook.com/',
    'https://www.facebook.com',
    'https://graph.facebook.com'
]
links = list(filter_links(base_url, hrefs))

# links contain:
#   http://example.ch/other/
#   http://example.ch/page/1?page=2&q=isch
#   http://example.ch/page/1?page=2
#   http://other.resource.test
#   http://example.ch/page/1?p=66383
#   https://twitter.com/some-hashtag
#   https://www.facebook.com/XXXX
#   https://www.facebook.com/
Parameters
  base_url – the URL of the current page (not returned and used to resolve relative links). Use ‘’ to ignore.
  links – a list of links found, relative or absolute
Returns
  a generator of unique absolute URLs, all beginning with http
swisstext.cmd.link_utils.
fix_url
(url: str, base_url: str = None) -> (<class 'str'>, <class 'bool'>)[source]¶ Fix/normalize and URL and decide if it is interesting to crawl.
The URL is potentially transformed by:
resolving relative to absolute URLs (only if base_url is set)
removing potential sid query parameters (Magento)
removing anchors
The “is interesting” decision will exclude:
relative URLs (so ensure to provide a base_url if the url is relative)
non HTTP links (mailto:, javascript:, anchors, …)
URLs pointing to non text resources (see
EXCLUDED_EXTENSIONS
)URLs with a Country Code TLD unlikely to contain Swiss German (see
EXCLUDED_EXTENSIONS
)
- Parameters
url – the url
base_url – the base url, required if the url is a relative one
- Returns
a tuple (fixed_url, is_interesting)
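The transformations and exclusions listed above can be approximated with urllib.parse alone. The sketch below is a simplified stand-in, not the function's code: fix_url_sketch and the two small lookup constants are hypothetical, and the real function applies further rules (for example the Wikipedia subdomain check) that are not reproduced here.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qsl, urldefrag, urlencode, urljoin, urlsplit, urlunsplit

# tiny illustrative stand-ins for the module's lookup tables
EXCLUDED_EXTENSIONS_SKETCH = (".png", ".jpg", ".pdf")
EXCLUDED_TLDS_SKETCH = {"ru": True, "jp": True}


def fix_url_sketch(url: str, base_url: Optional[str] = None) -> Tuple[str, bool]:
    """Return (fixed_url, is_interesting) following the documented steps."""
    if base_url:
        url = urljoin(base_url, url)  # relative -> absolute
    url, _fragment = urldefrag(url)  # remove the anchor
    parts = urlsplit(url)
    # drop "sid" query parameters; note that urlencode may re-encode other values
    pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "sid"]
    fixed = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))

    hostname = parts.hostname or ""
    tld = hostname.rsplit(".", 1)[-1]
    interesting = (
        parts.scheme in ("http", "https")  # excludes mailto:, javascript:, relative URLs
        and bool(hostname)
        and not parts.path.lower().endswith(EXCLUDED_EXTENSIONS_SKETCH)
        and tld not in EXCLUDED_TLDS_SKETCH
    )
    return fixed, interesting


fixed, ok = fix_url_sketch("?p=66383&sid=aaece0dfd1e47e08505dcb5fae2d3f03", "http://example.ch/page/1")
```

In this call fixed is http://example.ch/page/1?p=66383 and ok is True, which matches the corresponding entry in the filter_links() example.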