swisstext.cmd

This package contains everything needed to run the SwissText scraping logic.

It is implemented using two command-line programs, each in its own package:

  • st_scrape, defined in swisstext.cmd.scraping, scrapes URLs and detects/saves Swiss German sentences,

  • st_search, defined in swisstext.cmd.searching, submits search queries using seeds and gathers new URLs to scrape.

Once the package is installed, the two programs can be combined as in the following example script:

## To use this script, create a directory and create the following files:
##  * configuration files: searching.yaml and scraping.yaml (see the doc)
##  * bootstrap files: base_seeds.txt, one seed per line to start with

set -e
trap exit SIGINT

base_dir=.  # relative path to the base directory

st_search="st_search -c $base_dir/searching.yaml"
st_scrape="st_scrape -c $base_dir/scraping.yaml"

# search for the bootstrap seeds
$st_search from_file $base_dir/base_seeds.txt

# do five loops. Each loop: (a) scrapes the new URLs, (b) generates random seeds, (c) searches with the new seeds
for i in $(seq 1 5); do
    echo "======= START LOOP"
    # scrape at most 30 new urls
    $st_scrape --no-seed from_mongo --new -n 30
    echo "======= generating SEEDS"
    # generate three random seeds, each time from 150 randomly picked sentences
    $st_scrape gen_seeds -s 150 -n 1
    $st_scrape gen_seeds -s 150 -n 1
    $st_scrape gen_seeds -s 150 -n 1
    echo "======= searching SEEDS"
    # search URLs using the newly generated seeds
    $st_search from_mongo --new -n 3
done

Todo

add pipeline schema

Configuration files

This module encapsulates all the configuration needed to run a multi-step toolchain. It is used by both command-line tools.

Config objects can easily be populated from a YAML configuration file. This file should have the following structure:

# This entry defines general options that will be stored in :py:attr:`BaseConfig.options`.
options:
  key1: value1

# This entry defines tools to instantiate
# the actual name is specified in the subclass by overriding :py:attr:`BaseConfig.tool_entry_name`
# valid interface names are specified in the subclass by overriding :py:attr:`BaseConfig.valid_tool_entries`
tool_entry_name:
    # optional default package for relative entries
    _base_package: some.default.module
    # list of tools
    interface_name_1: absolute.module.ToolClassName1
    interface_name_2: .ToolClassName2 # will be expanded to some.default.module.ToolClassName2

# any entry in the form [interface_name]_options defines options that will be passed to the
# tool class upon construction. Here, for example, the actual instantiation will be:
#    absolute.module.ToolClassName1(param_1=some_value, param_X=some_other_value)
interface_name_1_options:
    param_1: some_value
    param_X: some_other_value
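
The expansion of relative entries and the passing of [interface_name]_options can be illustrated with a minimal sketch. It is not the code used by swisstext: the helpers resolve_class_name and instantiate are hypothetical, the configuration is given as an already-parsed dictionary, and standard-library classes stand in for real tools.

```python
import importlib
from typing import Optional


def resolve_class_name(name: str, base_package: Optional[str] = None) -> str:
    """Expand a relative entry such as '.OrderedDict' using the default package."""
    if name.startswith("."):
        if base_package is None:
            raise ValueError(f"relative entry {name!r} needs a _base_package")
        return base_package + name
    return name


def instantiate(qualified_name: str, **options):
    """Import 'package.module.ClassName' and call the class with the options."""
    module_name, _, class_name = qualified_name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**options)


# Stand-in configuration (assumption): stdlib classes play the role of tools.
config = {
    "tool_entry_name": {
        "_base_package": "collections",
        "interface_name_1": "fractions.Fraction",
        "interface_name_2": ".OrderedDict",  # expands to collections.OrderedDict
    },
    "interface_name_1_options": {"numerator": 3, "denominator": 4},
}

entries = dict(config["tool_entry_name"])
base_package = entries.pop("_base_package", None)

tools = {}
for interface_name, class_name in entries.items():
    # Options are looked up under "<interface_name>_options"; none means no kwargs.
    options = config.get(interface_name + "_options", {})
    qualified_name = resolve_class_name(class_name, base_package)
    tools[interface_name] = instantiate(qualified_name, **options)
```

Here tools["interface_name_1"] is built as fractions.Fraction(numerator=3, denominator=4), mirroring how the options entry is turned into keyword arguments.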

class swisstext.cmd.base_config.BaseConfig(default_config_path: str, option_class: classmethod = <class 'dict'>, config: Union[str, dict, io.IOBase] = None)

Bases: abc.ABC

This class encapsulates the configuration options of a toolchain. It is able to load options from YAML files and also to instantiate tools from a dictionary of module+class names.

INTERFACE_WILDCARD = '_I_'

If this value is used for a tool entry, try to instantiate the interface of the tool.

__init__(default_config_path: str, option_class: classmethod = <class 'dict'>, config: Union[str, dict, io.IOBase] = None)

Create a configuration.

Subclasses should provide the path to a default configuration (YAML file) and optionally an option class to hold general options. The config is provided by the user and can be a file, a path to a file or even a dictionary (as long as it has the correct structure).

Parameters
  • default_config_path – the absolute path to a default configuration file.

  • option_class – the class to use for holding global options (a dictionary by default).

  • config – user configuration that overrides the default options. It can be a path to a YAML configuration file, an open file object for such a file, or a dictionary.

Raises

ValueError – if config is not of a supported type

dumps()

get(prop_name, default=None)

Get a value from the YAML configuration file. This method supports paths with “.”, for example:

  • options: returns the top-level “options”

  • tool_entry.tool_name: returns the value of tool_name under tool_entry

Parameters
  • prop_name – the path of the property to retrieve

  • default – the default value if the property is not found

Returns

the property value or default
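
A dotted-path lookup of this kind can be sketched as follows. This is an illustration only, not the actual implementation of get(): the function get_path is hypothetical and works on a plain nested dictionary.

```python
def get_path(config: dict, prop_name: str, default=None):
    """Walk a nested dictionary following the '.'-separated keys of prop_name."""
    current = config
    for key in prop_name.split("."):
        # Stop as soon as a level is missing or is not a dictionary.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Sample configuration (assumption), shaped like the YAML structure above.
config = {
    "options": {"key1": "value1"},
    "tool_entry": {"tool_name": "some.module.ToolClass"},
}

top_level = get_path(config, "options")
nested = get_path(config, "tool_entry.tool_name")
missing = get_path(config, "tool_entry.unknown", default="fallback")
```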

instantiate_tools() → List[object]

For each entry of valid_tool_entries under tool_entry_name, try to create an instance. If a tool is not defined and interfaces_package is not None, or if its value is INTERFACE_WILDCARD, the interface with the same name as the tool is instantiated instead.

Returns

a list of tool instances, in the same order as valid_tool_entries

Raises

RuntimeError – if a tool could not be instantiated
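
The ordering and fallback rules can be pictured with a simplified sketch. It rests on several assumptions: the function instantiate_in_order and the tool names crawler and splitter are hypothetical, classes are passed directly instead of being imported by name, and a dictionary of interface classes stands in for interfaces_package.

```python
from typing import Callable, Dict, List

INTERFACE_WILDCARD = "_I_"  # same marker value as BaseConfig.INTERFACE_WILDCARD


def instantiate_in_order(
    valid_entries: List[str],
    configured: Dict[str, object],
    interfaces: Dict[str, Callable[[], object]],
) -> List[object]:
    """Create one instance per valid entry, in the order of valid_entries."""
    instances = []
    for name in valid_entries:
        factory = configured.get(name)
        # Tool missing or explicitly set to the wildcard: fall back to the interface.
        if factory is None or factory == INTERFACE_WILDCARD:
            factory = interfaces.get(name)
        if factory is None:
            raise RuntimeError(f"could not instantiate tool {name!r}")
        try:
            instances.append(factory())
        except Exception as exc:
            raise RuntimeError(f"could not instantiate tool {name!r}") from exc
    return instances


# Hypothetical stand-in classes, for illustration only.
class DefaultCrawler: ...
class DefaultSplitter: ...
class CustomSplitter: ...


tools = instantiate_in_order(
    valid_entries=["crawler", "splitter"],
    configured={"splitter": CustomSplitter, "crawler": INTERFACE_WILDCARD},
    interfaces={"crawler": DefaultCrawler, "splitter": DefaultSplitter},
)
```

The result follows the order of valid_entries, not the order of the configured dictionary: a DefaultCrawler first, then a CustomSplitter.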

abstract property interfaces_package

Should return the complete package where the interfaces are defined, if any. In case this is defined and a tool is missing from the list, we will try to instantiate the interface instead.

classmethod merge_dicts(default, overrides)

Merge overrides into default recursively. As the names suggest, if an entry is defined in both, the value in overrides takes precedence.
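
A recursive merge with this precedence rule can be sketched as below. The function merge_nested is a hypothetical stand-in, not the library's merge_dicts; in particular, returning a new dictionary instead of modifying default in place is an assumption of the sketch.

```python
import copy


def merge_nested(default: dict, overrides: dict) -> dict:
    """Return a new dict where values from overrides win over those in default."""
    merged = copy.deepcopy(default)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: merge them key by key.
            merged[key] = merge_nested(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the default value.
            merged[key] = copy.deepcopy(value)
    return merged


default = {"options": {"key1": "value1", "key2": "value2"}, "name": "default"}
overrides = {"options": {"key2": "changed", "key3": "added"}}

merged = merge_nested(default, overrides)
```

Keys present only in default (key1, name) survive, key2 takes the overriding value, and key3 is added.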

set(prop_name, value)

Set a value using the dotted property syntax. See get() for a description of the syntax.
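
Setting a value through a dotted path can be sketched in the same spirit. The function set_path is hypothetical, and creating missing intermediate levels on the fly is an assumption of this sketch, not documented behavior of set().

```python
def set_path(config: dict, prop_name: str, value) -> None:
    """Store value in a nested dictionary at the '.'-separated path prop_name."""
    *parents, last = prop_name.split(".")
    current = config
    for key in parents:
        # Assumption: missing intermediate levels are created as empty dicts.
        current = current.setdefault(key, {})
    current[last] = value


config = {"options": {"key1": "value1"}}
set_path(config, "options.key1", "updated")
set_path(config, "tool_entry.tool_name", "some.module.ToolClass")
```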

abstract property tool_entry_name

Should return the name of the tool_entries option, i.e. the YAML path to the dictionary of [interface name, canonical class to instantiate].

abstract property valid_tool_entries

Should return the list of valid tool entries under the tool_entry_name. Note that the order of tools in the list defines the order of the tool instances returned by instantiate_tools().