Pipeline implementation

Configuration

class swisstext.cmd.searching.config.Config(config: Union[str, dict, io.IOBase] = None)[source]

Bases: swisstext.cmd.base_config.BaseConfig

The default configuration file for the searching pipeline (search engine) is defined in config.yaml. Reading that file is the best way to understand what options are available.

class Options(max_fetches=-1, max_results=10, **kwargs)[source]

Bases: object

__init__(max_fetches=-1, max_results=10, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

max_fetches = None

Max number of URLs retrieved in one search (-1 for no limit). Note that this is only the number of URLs returned by the search engine, not the number of URLs added to the db. Indeed, duplicates or uninteresting URLs won’t be saved (see max_results).

max_results = None

Max number of URLs saved from one search. Note that it should always be <= max_fetches, unless max_fetches is -1 (no limit).

__init__(config: Union[str, dict, io.IOBase] = None)[source]

Create a configuration.

Subclasses should provide the path to a default configuration (YAML file) and optionally an option class to hold general options. The config is provided by the user and can be a file, a path to a file or even a dictionary (as long as it has the correct structure).

Parameters
  • default_config_path – the absolute path to a default configuration file.

  • option_class – the class to use for holding global options (a dictionary by default).

  • config – user configuration that overrides the default options. It can be a path to a YAML configuration file, a file object for one, or a dictionary.

Raises

ValueError – if config is not of a supported type
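The three accepted input types can be illustrated with a minimal sketch. The function name load_user_config and its parse parameter are hypothetical and not part of swisstext; json.load stands in for a YAML loader only to keep the example free of third-party dependencies.

```python
import io
import json
from typing import Any, Callable, Union


def load_user_config(
    config: Union[str, dict, io.IOBase, None] = None,
    parse: Callable[[Any], dict] = json.load,  # assumption: stand-in for a YAML loader
) -> dict:
    """Hypothetical helper: normalise the supported config inputs to a dict."""
    if config is None:
        return {}  # no user overrides
    if isinstance(config, dict):
        return config  # already in the expected structure
    if isinstance(config, str):
        # a string is treated as a path to a configuration file
        with open(config) as f:
            return parse(f)
    if isinstance(config, io.IOBase):
        return parse(config)  # an already opened file object
    raise ValueError(f"unsupported config type: {type(config).__name__}")
```

Any other type, such as an integer, falls through to the ValueError branch, which mirrors the behaviour documented under Raises.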

create_search_engine() → swisstext.cmd.searching.pipeline.SearchEngine[source]

Instantiate a search engine from the YAML configuration.

property interfaces_package

Should return the complete package where the interfaces are defined, if any. In case this is defined and a tool is missing from the list, we will try to instantiate the interface instead.

property tool_entry_name

Should return the name of the tool_entries option, i.e. the YAML path to the dictionary of [interface name, canonical class to instantiate].

property valid_tool_entries

Should return the list of valid tool entries under the tool_entry_name. Note that the order of tools in the list defines the order of tools instances returned by instantiate_tools().
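How a dictionary of [interface name, canonical class] entries could be turned into tool instances is sketched below. This is an illustration only: instantiate_from_entries is a hypothetical helper, the standard-library classes are placeholders for real tools, and the fallback to interfaces_package is left out.

```python
import importlib
from typing import Dict, List


def instantiate_from_entries(entries: Dict[str, str], valid_entries: List[str]) -> list:
    """Hypothetical helper: build one instance per valid entry, in list order."""
    tools = []
    for name in valid_entries:
        # split "package.module.ClassName" into module path and class name
        module_name, _, class_name = entries[name].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        tools.append(cls())
    return tools


# placeholder entries; a real configuration would name searching tools instead
entries = {"counter": "collections.Counter", "ordered": "collections.OrderedDict"}
tools = instantiate_from_entries(entries, ["ordered", "counter"])
```

The instances come back in the order of the valid entries list, not in the order of the dictionary.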

Data structures

This module defines the generic data structures used across the module and shared between the different tools. They are designed to be decoupled from MongoDB for better flexibility and adaptability.

class swisstext.cmd.searching.data.Seed(query: str)[source]

Bases: object

Holds information about a seed, i.e. a query.

__init__(query: str)[source]

Initialize self. See help(type(self)) for accurate signature.

the new links found after the search

query = None

the query terms

Search engine

This module contains the core of the searching system.

class swisstext.cmd.searching.pipeline.SearchEngine(query_builder: swisstext.cmd.searching.interfaces.IQueryBuilder, searcher: swisstext.cmd.searching.interfaces.ISearcher, saver: swisstext.cmd.searching.interfaces.ISaver)[source]

Bases: object

The search engine implements all the logic of searching seeds and persisting results.

Note that this class is usually instantiated by a Config object.

__init__(query_builder: swisstext.cmd.searching.interfaces.IQueryBuilder, searcher: swisstext.cmd.searching.interfaces.ISearcher, saver: swisstext.cmd.searching.interfaces.ISaver)[source]

Initialize self. See help(type(self)) for accurate signature.

Potentially “fix” the URL (see link_utils) and check that it is unique (not found in a previous search with this instance and not already present in the backend). Turn on debug to follow what is going on.

Parameters

raw_link – the raw URL (must be absolute)

Returns

a tuple with the fixed URL and an “ok” flag
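The fix-then-check logic described above could look roughly like the following sketch. Everything here is an assumption made for illustration: check_link, seen and in_backend are hypothetical names, and dropping the URL fragment merely stands in for whatever link_utils actually does.

```python
from typing import Callable, Set, Tuple
from urllib.parse import urldefrag


def check_link(
    raw_link: str,
    seen: Set[str],
    in_backend: Callable[[str], bool],
) -> Tuple[str, bool]:
    """Hypothetical sketch: return the fixed URL and an "ok" flag."""
    # placeholder "fix": drop the #fragment part of the URL
    fixed, _ = urldefrag(raw_link)
    if fixed in seen or in_backend(fixed):
        return fixed, False  # already found earlier or already stored
    seen.add(fixed)  # remember it for the following calls
    return fixed, True
```

Keeping the set of seen URLs on the caller's side makes the uniqueness check hold across several searches, as long as the same set is passed in each time.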

new_urls = None

the list of URLs discovered during the lifetime of the object

process(seeds: List[swisstext.cmd.searching.data.Seed], **kwargs) → int[source]

Do the magic.

Note that this method has not been implemented with multithreading in mind, as most of the time search engine APIs are rate-limited and pretty fast…

process_one(seed: swisstext.cmd.searching.data.Seed, max_results=10, max_fetches=-1) → int[source]

Process one seed. Note that if called multiple times, the previous results are still saved in SearchEngine.new_urls so duplicate URLs will be skipped.

Parameters
  • seed – the seed to search for

  • max_results – the target number of URLs to find

  • max_fetches – the maximum number of URLs fetched from the search engine

Returns

the number of new URLs found, with 0 <= count <= max_results
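The interplay between max_results and max_fetches can be pictured with a small stand-alone sketch. It is not the actual implementation: collect_new_urls is a hypothetical function, results stands for whatever the searcher yields, and known plays the role of SearchEngine.new_urls.

```python
from itertools import islice
from typing import Iterable, List


def collect_new_urls(
    results: Iterable[str],
    known: List[str],
    max_results: int = 10,
    max_fetches: int = -1,
) -> int:
    """Hypothetical sketch: save up to max_results new URLs, return the count."""
    # -1 means no limit on the number of URLs taken from the search engine
    limited = results if max_fetches < 0 else islice(results, max_fetches)
    saved = 0
    for url in limited:
        if saved >= max_results:
            break  # target reached, stop fetching
        if url in known:
            continue  # duplicates count as fetched but are not saved
        known.append(url)
        saved += 1
    return saved
```

Because known is shared between calls, a URL saved by one call is skipped by the next one, while still counting towards max_fetches.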