Tool implementations
This package contains various implementations of the different search engine tools.
See also
interfaces
The tool interface definitions
config
The default configuration instantiates tools from this package
Savers
-
class
swisstext.cmd.searching.tools.console_saver.
ConsoleSaver
(**kwargs)[source]¶ Bases:
swisstext.cmd.searching.interfaces.ISaver
Implementation of an
ISaver
useful for testing and debugging. It does not persist any results, but prints everything to the console instead.
-
class
swisstext.cmd.searching.tools.mongo_saver.
MongoSaver
(host='localhost', port=27017, db='st1', **kwargs)[source]¶ Bases:
swisstext.cmd.searching.interfaces.ISaver
This
ISaver
implementation persists everything to a MongoDB database.
See also
swisstext.mongo
Package defining the Mongo collections.
-
__init__
(host='localhost', port=27017, db='st1', **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
link_exists
(url: str) → swisstext.cmd.searching.interfaces.ISaver.LinkStatus[source]¶ Test if the url already exists in the persistence layer. Returns false by default.
Searchers
This module lets you use the Google Custom Search API to retrieve URLs.
Example usages
Create a factory:
factory = GoogleGeneratorFactory(apikey="your apikey")
To retrieve a list of results, you can use the interface’s top_results
method:
# retrieve at most 13 results from a search using the interface
results = factory.top_results(query="es isch sone seich", max_results=13)
If you want to use the iterator, it is advised to use itertools
, since it deals with
StopIteration
exceptions silently:
# retrieve at most 13 results from a search using the builtin python iterator
results_iterator = factory.search(query="es isch sone seich")
import itertools
results = itertools.islice(results_iterator, 13) # won't throw StopIteration
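To see why `itertools.islice` is convenient here, the following self-contained sketch uses a stand-in generator that yields only three URLs. `fake_search` is hypothetical and not part of this package; it merely plays the role of `factory.search`. Asking `islice` for 13 items stops early instead of raising.

```python
import itertools


def fake_search(query):
    """Hypothetical stand-in for factory.search(): yields three made-up URLs."""
    for i in range(3):
        yield f"https://example.com/{i}"


# islice returns a lazy iterator; list() consumes it.
results = list(itertools.islice(fake_search("es isch sone seich"), 13))
print(len(results))  # fewer than 13 items are available, and no exception is raised
```

Note that `islice` itself is lazy: nothing is requested from the underlying iterator until the slice is consumed.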
Iterators are useful when you need to process URLs in a loop, or when your stop criterion is more complex than a simple result count. For that, you can use the builtin iterator interface:
# process results one by one using the builtin python iterator
import sys
results_iterator = factory.search(query="es isch sone seich")
for i in range(13):  # usually, here we have a while some_dynamic_condition
    try:
        url = next(results_iterator)
        print(f"Processing url: {url}")
        # ... do something more with the result ...
    except StopIteration:
        # even though Google is pretty good at retrieving billions of results,
        # you might hit the limit...
        print("Oops, no result left!", file=sys.stderr)
        break
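As an illustration of a stop criterion that is not a simple count, the sketch below keeps pulling URLs until it has seen a given number of distinct hosts. Both the `fake_search` generator and the `collect_until_hosts` helper are hypothetical stand-ins introduced for this example; only the iteration pattern is the point.

```python
from urllib.parse import urlparse


def fake_search(query):
    """Hypothetical stand-in for factory.search()."""
    yield from [
        "https://a.example/1",
        "https://a.example/2",
        "https://b.example/1",
        "https://c.example/1",
        "https://c.example/2",
    ]


def collect_until_hosts(results_iterator, wanted_hosts):
    """Consume URLs until `wanted_hosts` distinct hosts were seen or results run out."""
    hosts, urls = set(), []
    for url in results_iterator:  # the for loop handles StopIteration itself
        urls.append(url)
        hosts.add(urlparse(url).netloc)
        if len(hosts) >= wanted_hosts:
            break
    return urls


urls = collect_until_hosts(fake_search("es isch sone seich"), wanted_hosts=2)
print(urls)
```

Because the loop breaks as soon as the condition is met, the remaining results are never requested from the iterator.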
You can also use the GoogleGenerator.next()
and GoogleGenerator.has_next()
methods
in place of the try-except, like so:
# process results one by one using the Google Iterator methods
import sys
results_iterator = factory.search(query="es isch sone seich")
for i in range(13):
    if not results_iterator.has_next():
        print("Oops, no result left!", file=sys.stderr)
        break
    url = results_iterator.next()
    print(f"Processing url: {url}")
    # ... do something more with the result ...
-
swisstext.cmd.searching.tools.google_search.
BASE_URL
= 'https://www.googleapis.com/customsearch/v1'
The Google API URL. See https://developers.google.com/custom-search/json-api/v1/reference/cse/list#request for the JSON API reference.
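For orientation, a request to this endpoint is an HTTP GET with query parameters. The sketch below only builds the URL and sends nothing. `key`, `cx`, `q` and `start` are parameters of the Custom Search JSON API; the `build_url` helper and the placeholder values are assumptions made for illustration and do not necessarily reflect how this module builds its requests.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://www.googleapis.com/customsearch/v1"


def build_url(query, apikey, context, start=1):
    """Hypothetical helper: assemble a Custom Search request URL."""
    params = {"key": apikey, "cx": context, "q": query, "start": start}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_url("es isch sone seich", apikey="YOUR_KEY", context="YOUR_CX")
print(url)

# Round-trip check: the query survives URL encoding unchanged.
print(parse_qs(urlparse(url).query)["q"])
```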
-
class
swisstext.cmd.searching.tools.google_search.
GoogleGenerator
(query, apikey: str, context='015058622601103575455:cpfpm27mio8', qps=20, qpm=200)[source]¶ Bases:
collections.abc.Iterable
,typing.Generic
A new Google Generator should be created for each query. It allows for lazy loading of results, thus sparing API quotas.
Warning
This generator might raise an
Exception
, for example if you have reached your daily quota limit. In this case, the exception message should contain the error code and error message as delivered by the Google API.
-
__init__
(query, apikey: str, context='015058622601103575455:cpfpm27mio8', qps=20, qpm=200)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
ctx
= None
The context to use (see the official API reference for more info). The default context is usually fine: it is configured to search the entire web.
-
key
= None¶ The Google Custom Search API key
-
next
() → str[source]¶ Note
If you use this method directly (instead of the classic Python iterator interface), you must first check that results are available using the
has_next()
method.
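The lazy-loading idea can be illustrated without touching the network. In this sketch, `LazyPagedResults` and `fake_fetch_page` are hypothetical stand-ins and not the package's implementation: a page of results is fetched only when the previous one is exhausted, so a caller who stops early triggers fewer requests.

```python
class LazyPagedResults:
    """Hypothetical look-ahead iterator that fetches result pages on demand."""

    def __init__(self, fetch_page):
        self._fetch_page = fetch_page  # callable: page index -> list of URLs
        self._page = 0
        self._buffer = []
        self._exhausted = False

    def has_next(self):
        # Fetch a new page only when the buffer is empty.
        if not self._buffer and not self._exhausted:
            self._buffer = list(self._fetch_page(self._page))
            self._page += 1
            self._exhausted = not self._buffer
        return bool(self._buffer)

    def next(self):
        # Callers using next() directly are expected to check has_next() first.
        if not self.has_next():
            raise StopIteration
        return self._buffer.pop(0)

    def __iter__(self):
        while self.has_next():
            yield self.next()


calls = []  # records which pages were actually requested


def fake_fetch_page(page):
    """Pretend API: two pages of two URLs each, then nothing."""
    calls.append(page)
    return [f"https://example.com/{page}-{i}" for i in range(2)] if page < 2 else []


results = LazyPagedResults(fake_fetch_page)
first_three = []
for _ in range(3):
    if not results.has_next():
        break
    first_three.append(results.next())

print(first_three)
print("pages fetched so far:", calls)
```

Here only the pages needed to serve three results are requested; the pretend API is not asked for a third page until the iterator is consumed further.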
-
-
class
swisstext.cmd.searching.tools.google_search.
GoogleGeneratorFactory
(**kwargs)[source]¶ Bases:
swisstext.cmd.searching.interfaces.ISearcher
This factory creates a new
GoogleGenerator
for each query.
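A minimal sketch of the factory idea, assuming nothing about the real implementation: `StubGenerator` and `StubFactory` are hypothetical classes that mirror the `search` and `top_results` calls used in the examples above, and the `max_results=10` default is likewise an assumption. The factory stores shared settings once and hands out a fresh generator per query, so iteration state is never shared between queries.

```python
import itertools


class StubGenerator:
    """Hypothetical per-query generator yielding made-up URLs."""

    def __init__(self, query, apikey):
        self.query = query
        self.apikey = apikey

    def __iter__(self):
        slug = self.query.replace(" ", "-")
        return (f"https://example.com/{slug}/{i}" for i in itertools.count())


class StubFactory:
    """Hypothetical factory: keeps shared settings, creates one generator per query."""

    def __init__(self, **kwargs):
        self._kwargs = kwargs

    def search(self, query):
        # A new generator (and thus new iteration state) for every call.
        return iter(StubGenerator(query, **self._kwargs))

    def top_results(self, query, max_results=10):
        return list(itertools.islice(self.search(query), max_results))


factory = StubFactory(apikey="your apikey")
print(factory.top_results(query="es isch sone seich", max_results=2))
```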
This module implements an ISearcher
using
http://startpage.com/.
Warning
This module hijacks startpage.com! No API is available, so this is really a dirty hack, but we couldn’t find a free API to run tests… So use this module sparingly and ONLY IN DEVELOPMENT. We decline all responsibility in case startpage detects the robot…
-
class
swisstext.cmd.searching.tools.start_page.
StartPageGenerator
(query)[source]¶ Bases:
collections.abc.Iterable
,typing.Generic
A new generator is created for each query. Its usage is similar to the generator described in the
google_search
module.
-
class
swisstext.cmd.searching.tools.start_page.
StartPageGeneratorFactory
(**kwargs)[source]¶ Bases:
swisstext.cmd.searching.interfaces.ISearcher
Implementation of a searcher using startpage. Its usage is similar to the factory described in the
google_search
module.