Pipeline implementation
Configuration
-
class
swisstext.cmd.searching.config.
Config
(config: Union[str, dict, io.IOBase] = None)[source]¶ Bases:
swisstext.cmd.base_config.BaseConfig
The default configuration file for the searching pipeline (search engine) is defined in config.yaml. Reading it is the best way to understand which options are available.
-
class
Options
(max_fetches=-1, max_results=10, **kwargs)[source]¶ Bases:
object
-
__init__
(max_fetches=-1, max_results=10, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
max_fetches
= None¶ Max number of URLs retrieved in one search (-1 for no limit). Note that this is only the number of URLs returned by the search engine, not the number of URLs added to the db. Indeed, duplicates or uninteresting URLs won’t be saved (see max_results).
-
max_results
= None¶ Max number of URLs saved from one search. Note that it should always be <= max_fetches (unless max_fetches is -1, i.e. no limit).
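The relationship between the two options can be sketched as follows. This is an illustrative stand-in that only mirrors the documented signature; the class name `OptionsSketch` and the `is_consistent` helper are assumptions and are not part of swisstext.

```python
class OptionsSketch:
    """Hypothetical stand-in mirroring the documented Options signature."""

    def __init__(self, max_fetches=-1, max_results=10, **kwargs):
        # Extra keyword arguments are accepted and ignored in this sketch.
        self.max_fetches = max_fetches
        self.max_results = max_results

    def is_consistent(self) -> bool:
        """Assumed helper: check that max_results <= max_fetches."""
        # -1 means "no limit", so any max_results is acceptable in that case.
        return self.max_fetches == -1 or self.max_results <= self.max_fetches
```

With the defaults (no fetch limit, ten results) the options are consistent; asking for ten saved URLs out of at most five fetched ones is not.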
-
-
__init__
(config: Union[str, dict, io.IOBase] = None)[source]¶ Create a configuration.
Subclasses should provide the path to a default configuration (YAML file) and optionally an option class to hold general options. The config is provided by the user and can be a file, a path to a file or even a dictionary (as long as it has the correct structure).
- Parameters
default_config_path – the absolute path to a default configuration file.
option_class – the class to use for holding global options (a dictionary by default).
config – user configuration that overrides the default options. config can be either a path or a file object of a YAML configuration, or a dictionary
- Raises
ValueError – if config is not of a supported type
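A minimal sketch of how the three accepted config types could be normalised to a dictionary is shown below. The helper name `load_user_config` and its `parse` parameter are hypothetical. The real class reads YAML; `json.loads` stands in as the default parser here only so that the sketch depends on nothing outside the standard library.

```python
import io
import json
from typing import Callable, Union


def load_user_config(config: Union[str, dict, io.IOBase, None],
                     parse: Callable[[str], dict] = json.loads) -> dict:
    """Hypothetical helper: turn a user config into a plain dictionary."""
    if config is None:
        return {}  # nothing overrides the defaults
    if isinstance(config, dict):
        return config  # already structured
    if isinstance(config, io.IOBase):
        return parse(config.read())  # open file object
    if isinstance(config, str):
        with open(config) as f:  # path to a configuration file
            return parse(f.read())
    # Mirrors the documented behaviour for unsupported types.
    raise ValueError(f"unsupported config type: {type(config).__name__}")
```

Passing a YAML parser as `parse` would bring the sketch closer to the documented behaviour.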
-
create_search_engine
() → swisstext.cmd.searching.pipeline.SearchEngine[source]¶ Instantiate a search engine from the YAML configuration.
-
property
interfaces_package
¶ Should return the complete package where the interfaces are defined, if any. In case this is defined and a tool is missing from the list, we will try to instantiate the interface instead.
-
property
tool_entry_name
¶ Should return the name of the tool_entries option, i.e. the YAML path to the dictionary of [interface name, canonical class to instantiate].
-
property
valid_tool_entries
¶ Should return the list of valid tool entries under the tool_entry_name. Note that the order of tools in the list defines the order of the tool instances returned by instantiate_tools().
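The idea of a dictionary of [interface name, canonical class to instantiate], walked in the order of the valid entries, can be sketched as follows. The function `instantiate_from_entries` is hypothetical and is not the library's `instantiate_tools()`; the fallback to the interfaces package for missing entries is left out, and constructor arguments are assumed to be empty.

```python
import importlib
from typing import Dict, List


def instantiate_from_entries(entries: Dict[str, str],
                             valid_entries: List[str]) -> list:
    """Hypothetical helper: one instance per valid entry, in list order."""
    instances = []
    for name in valid_entries:
        dotted = entries[name]  # canonical class, e.g. "collections.Counter"
        module_name, class_name = dotted.rsplit(".", 1)
        cls = getattr(importlib.import_module(module_name), class_name)
        instances.append(cls())  # assumes a no-argument constructor
    return instances
```

The order of the result follows `valid_entries`, not the order of the dictionary, which matches the ordering note above.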
Data structures
This module defines the generic data structures used across the module and shared between the different tools. They are designed to be decoupled from MongoDB for better flexibility and adaptability.
Search engine
This module contains the core of the searching system.
-
class
swisstext.cmd.searching.pipeline.
SearchEngine
(query_builder: swisstext.cmd.searching.interfaces.IQueryBuilder, searcher: swisstext.cmd.searching.interfaces.ISearcher, saver: swisstext.cmd.searching.interfaces.ISaver)[source]¶ Bases:
object
The search engine implements all the logic of searching seeds and persisting results.
Note that this class is usually instantiated by a Config object.
-
__init__
(query_builder: swisstext.cmd.searching.interfaces.IQueryBuilder, searcher: swisstext.cmd.searching.interfaces.ISearcher, saver: swisstext.cmd.searching.interfaces.ISaver)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
check_link
(raw_link: str) → Tuple[str, bool][source]¶ Potentially “fix” the URL (see link_utils) and check that it is unique (not found in a previous search with this instance and not already present in the backend). Turn on debug to follow what is going on.
- Parameters
raw_link – the raw URL (must be absolute)
- Returns
a tuple with the fixed URL and an “ok” flag
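The uniqueness check described above could look roughly like this. Everything here is an assumption made for illustration: `check_link_sketch`, the `seen` set, the `in_backend` callable and the `fix` callable (which stands in for whatever link_utils does) are hypothetical names, and the sketch does not record the URL itself.

```python
from typing import Callable, Set, Tuple


def check_link_sketch(raw_link: str,
                      seen: Set[str],
                      in_backend: Callable[[str], bool],
                      fix: Callable[[str], str] = str.strip) -> Tuple[str, bool]:
    """Hypothetical sketch: return (fixed URL, ok flag)."""
    url = fix(raw_link)  # stand-in for the real URL fixing
    if url in seen:
        return url, False  # already found by this instance
    if in_backend(url):
        return url, False  # already persisted
    return url, True
```

A caller would typically add the URL to `seen` once the flag comes back true.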
-
new_urls
= None¶ the list of URLs discovered during the lifetime of the object
-
process
(seeds: List[swisstext.cmd.searching.data.Seed], **kwargs) → int[source]¶ Do the magic.
Note that this method has not been implemented with multithreading in mind, as most of the time search engine APIs are limited and pretty fast…
-
process_one
(seed: swisstext.cmd.searching.data.Seed, max_results=10, max_fetches=-1) → int[source]¶ Process one seed. Note that if called multiple times, the previous results are still saved in SearchEngine.new_urls, so duplicate URLs will be skipped.
- Parameters
seed – the seed to search for
max_results – the target number of URLs to find
max_fetches – the maximum number of URLs fetched from the search engine
- Returns
a list of URLs, with 0 <= length <= max_results
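How the two limits interact can be sketched with a plain loop. This is an illustration under stated assumptions and not the library's implementation: `collect_new_urls` is a hypothetical function, `results` stands in for the URLs returned by the search engine for one seed, and `known` plays the role of the URLs remembered across calls.

```python
from typing import Iterable, List, Set


def collect_new_urls(results: Iterable[str],
                     known: Set[str],
                     max_results: int = 10,
                     max_fetches: int = -1) -> List[str]:
    """Hypothetical sketch: keep at most max_results new URLs."""
    kept: List[str] = []
    for fetched, url in enumerate(results):
        if len(kept) >= max_results:
            break  # target number of URLs reached
        if max_fetches != -1 and fetched >= max_fetches:
            break  # fetch budget exhausted (-1 means no limit)
        if url in known:
            continue  # duplicate, possibly from a previous call
        known.add(url)
        kept.append(url)
    return kept
```

Because `known` is shared between calls, a second call skips URLs already kept by the first one, so fewer than `max_results` URLs may come back.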
-