Pipeline implementation¶

Configuration¶

class swisstext.cmd.searching.config.Config(config: Union[str, dict, io.IOBase] = None)[source]¶

Bases: swisstext.cmd.base_config.BaseConfig

The default configuration file for the searching pipeline (search engine) is config.yaml. Reading it is the best way to understand which options are available.

class Options(max_fetches=-1, max_results=10, **kwargs)[source]¶

Bases: object

__init__(max_fetches=-1, max_results=10, **kwargs)[source]¶: Initialize self. See help(type(self)) for accurate signature.

max_fetches = None¶: Max number of URLs retrieved in one search (-1 for no limit). This counts only the URLs returned by the search engine, not the URLs added to the database: duplicate or uninteresting URLs are not saved (see max_results).

max_results = None¶: Max number of URLs saved from one search. Unless max_fetches is -1 (no limit), it should always be <= max_fetches.

__init__(config: Union[str, dict, io.IOBase] = None)[source]¶

Create a configuration.

Subclasses should provide the path to a default configuration (YAML file) and optionally an option class to hold general options. The config is provided by the user and can be a file, a path to a file or even a dictionary (as long as it has the correct structure).

Parameters

default_config_path – the absolute path to a default configuration file.
option_class – the class to use for holding global options (a dictionary by default).
config – user configuration that overrides the default options. config can be a path to a YAML configuration file, a file object for such a file, or a dictionary.

Raises

ValueError – if config is not of a supported type
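
The type handling described above can be sketched as follows. This is only an illustration, not the swisstext implementation: load_user_config is a hypothetical helper, and json stands in for the YAML parser so that the sketch runs with the standard library alone.

```python
import io
import json
from typing import Union


def load_user_config(config: Union[str, dict, io.IOBase, None] = None) -> dict:
    """Hypothetical helper: normalize a user configuration into a dictionary."""
    if config is None:
        return {}  # nothing to override: the defaults apply
    if isinstance(config, dict):
        return config  # already a dictionary with the expected structure
    if isinstance(config, str):
        # a string is treated as a path to a configuration file
        with open(config) as f:
            return json.load(f)  # stand-in for a YAML parser
    if isinstance(config, io.IOBase):
        return json.load(config)  # an already opened file object
    raise ValueError("unsupported config type: %s" % type(config).__name__)


# Usage: a dictionary and a file-like object give the same kind of result.
from_dict = load_user_config({"max_results": 5})
from_file = load_user_config(io.StringIO('{"max_results": 5}'))
```

Any other type (an integer, a list) ends in the ValueError branch, which matches the Raises entry above.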

create_search_engine() → swisstext.cmd.searching.pipeline.SearchEngine[source]¶: Instantiate a search engine from the YAML configuration.

property interfaces_package¶: Should return the complete package where the interfaces are defined, if any. In case this is defined and a tool is missing from the list, we will try to instantiate the interface instead.

property tool_entry_name¶: Should return the name of the tool_entries option, i.e. the YAML path to the dictionary of [interface name, canonical class to instantiate].

property valid_tool_entries¶: Should return the list of valid tool entries under the tool_entry_name. Note that the order of tools in the list defines the order of tools instances returned by instantiate_tools().
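
To illustrate how a dictionary of [interface name, canonical class to instantiate] and an ordered list of valid entries could fit together, here is a minimal sketch. The function name instantiate_tools_sketch and the entry names are assumptions made for the example, standard-library classes stand in for real tools, and the fallback to interfaces_package is left out.

```python
import importlib
from typing import Dict, List


def instantiate_tools_sketch(entries: Dict[str, str], valid_entries: List[str]) -> list:
    """Hypothetical: create one instance per valid entry, in list order."""
    tools = []
    for name in valid_entries:
        # split "package.module.ClassName" into module path and class name
        module_name, _, class_name = entries[name].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        tools.append(cls())  # assumes a no-argument constructor
    return tools


# Standard-library classes stand in for real tool implementations.
entries = {
    "searcher": "collections.OrderedDict",
    "saver": "collections.Counter",
}
tools = instantiate_tools_sketch(entries, ["saver", "searcher"])
```

The instances come back in the order of the valid entries list, not in the order of the dictionary.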

Data structures¶

This module defines the generic data structures used across the module and shared between the different tools. They are designed to be decoupled from MongoDB for better flexibility and adaptability.

class swisstext.cmd.searching.data.Seed(query: str)[source]¶

Bases: object

Holds information about a seed, i.e. a query.

__init__(query: str)[source]¶: Initialize self. See help(type(self)) for accurate signature.

new_links = None¶: the new links found after the search

query = None¶: the query terms
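
A stand-in with the same two attributes shows how such an object might be used. SeedSketch is not the real class, and starting new_links as an empty list is an assumption of this sketch.

```python
from typing import List


class SeedSketch:
    """Stand-in mirroring the documented attributes of Seed."""

    def __init__(self, query: str):
        self.query: str = query  # the query terms
        self.new_links: List[str] = []  # assumed to start empty, filled by the search


seed = SeedSketch("some query terms")
seed.new_links.append("http://example.com/page")
```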

Search engine¶

This module contains the core of the searching system.

class swisstext.cmd.searching.pipeline.SearchEngine(query_builder: swisstext.cmd.searching.interfaces.IQueryBuilder, searcher: swisstext.cmd.searching.interfaces.ISearcher, saver: swisstext.cmd.searching.interfaces.ISaver)[source]¶

Bases: object

The search engine implements all the logic of searching seeds and persisting results.

Note that this class is usually instantiated by a Config object.

__init__(query_builder: swisstext.cmd.searching.interfaces.IQueryBuilder, searcher: swisstext.cmd.searching.interfaces.ISearcher, saver: swisstext.cmd.searching.interfaces.ISaver)[source]¶: Initialize self. See help(type(self)) for accurate signature.

check_link(raw_link: str) → Tuple[str, bool][source]¶

Potentially “fix” the URL (see link_utils) and check that it is unique, i.e. not found in a previous search with this instance and not already present in the backend. Turn on debug to follow what is going on.

Parameters: raw_link – the raw URL (must be absolute)
Returns: a tuple with the fixed URL and an “ok” flag
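
The uniqueness check can be sketched as below. This is a hedged illustration rather than the actual method: check_link_sketch, the seen set and the in_backend callback are hypothetical names, and dropping the URL fragment with urldefrag is only a stand-in for whatever link_utils does.

```python
from typing import Callable, Set, Tuple
from urllib.parse import urldefrag


def check_link_sketch(raw_link: str, seen: Set[str],
                      in_backend: Callable[[str], bool]) -> Tuple[str, bool]:
    """Return the fixed URL and an "ok" flag (True if the URL is new)."""
    fixed, _ = urldefrag(raw_link)  # stand-in "fix": drop the #fragment
    if fixed in seen or in_backend(fixed):
        return fixed, False  # found earlier, or already stored
    seen.add(fixed)  # remember it for later calls
    return fixed, True


seen: Set[str] = set()
stored = {"http://example.com/old"}  # pretend backend content

first = check_link_sketch("http://example.com/a#top", seen, stored.__contains__)
again = check_link_sketch("http://example.com/a", seen, stored.__contains__)
known = check_link_sketch("http://example.com/old", seen, stored.__contains__)
```

Only the first call reports the link as new; the second hits the in-memory set and the third hits the pretend backend.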

new_urls = None¶: the list of URLs discovered during the lifetime of the object

process(seeds: List[swisstext.cmd.searching.data.Seed], **kwargs) → int[source]¶

Do the magic.

Note that this method has not been implemented with multithreading in mind, as most of the time search engine APIs are limited and pretty fast…

process_one(seed: swisstext.cmd.searching.data.Seed, max_results=10, max_fetches=-1) → int[source]¶

Process one seed. Note that if called multiple times, the previous results are still saved in SearchEngine.new_urls so duplicate URLs will be skipped.

Parameters

seed – the seed to search for
max_results – the target number of URLs to find
max_fetches – the maximum number of URLs fetched from the search engine

Returns

the number of new URLs found (an int, as declared in the signature), with 0 <= count <= max_results
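
How max_results and max_fetches interact can be shown with a small sketch. It is an assumption-laden illustration, not the library code: process_one_sketch takes a plain iterable of links instead of a seed and a searcher, and it returns the kept URLs so that the bound is easy to check (len() gives the count).

```python
from itertools import islice
from typing import Iterable, List, Set


def process_one_sketch(links: Iterable[str], seen: Set[str],
                       max_results: int = 10, max_fetches: int = -1) -> List[str]:
    """Keep at most max_results new links out of at most max_fetches fetched."""
    # a negative max_fetches means "no limit" on the number of fetched URLs
    fetched = links if max_fetches < 0 else islice(links, max_fetches)
    kept: List[str] = []
    for link in fetched:
        if len(kept) >= max_results:
            break  # target reached
        if link in seen:
            continue  # duplicate of a previous result
        seen.add(link)
        kept.append(link)
    return kept


seen = {"u1"}  # URLs found by earlier calls
results = ["u1", "u2", "u2", "u3", "u4", "u5"]

first = process_one_sketch(results, seen, max_results=2)
second = process_one_sketch(results, seen, max_fetches=4)
third = process_one_sketch(results, seen)
```

The second call keeps nothing because its four fetched links are all known by then, which is why the number of saved URLs can be well below max_fetches.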