swisstext.cmd.searching
The searching module uses seeds to query search engines and discover new URLs to scrape.
The magic happens by calling st_search.
st_search
st_search [OPTIONS] COMMAND [ARGS]...
Options
-l, --log-level <log_level>
    [default: info]
    Options: debug|info|warning|fatal
-d, --db <db>
    If set, this will override the database set in the config
-c, --config-path <config_path>
dump_config
Prints the active configuration. If --test is set, the search engine is also instantiated, which checks that all tool names exist and that their options are valid.
st_search dump_config [OPTIONS]
Options
-t, --test
    Also instantiate the tools [default: False]
from_file
Search using seeds from a file.
This script runs the search engine using the query terms present in a file. The file should have one seed per line; any line starting with a space will be ignored. Seeds and results will be persisted using the configured saver.
st_search from_file [OPTIONS] SEEDSFILE
Options
--no-search
    Just save the seeds to Mongo, but don’t search for URLs. [default: False]
Arguments
SEEDSFILE
    Required argument
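The seed file format described above (one seed per line, lines starting with a space ignored) can be illustrated with a short sketch. This is not the module's actual parser: the function name read_seeds is hypothetical, and skipping empty lines is an assumption the description does not state.

```python
from typing import Iterable, List


def read_seeds(lines: Iterable[str]) -> List[str]:
    """Return one seed per line, skipping lines that start with a space."""
    seeds = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Per the description above, a leading space marks a line to ignore.
        if line.startswith(" "):
            continue
        # Assumption: empty lines carry no seed and are skipped as well.
        if not line.strip():
            continue
        seeds.append(line.strip())
    return seeds


sample = ["schwiizerdütsch isch\n", " ignored line\n", "\n", "grüezi mitenand\n"]
print(read_seeds(sample))
```

With a real file, the same function could be fed an open file handle, for example read_seeds(open("seeds.txt", encoding="utf-8")), where seeds.txt stands in for the SEEDSFILE argument.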
from_mongo
This script runs the search engine using -n seeds pulled from Mongo. Seeds are selected according to their number of usages and the date of their last use (least used first, then oldest usage first). If --new is specified, only seeds that have never been used will be queried. If --any is specified, exactly -n seeds are selected.
Note that to connect to MongoDB, it relies on the host, port and db options present in the saver_options property of the configuration. So whatever saver you use, ensure that those properties are correct (default: localhost:27017, db=swisstext).
st_search from_mongo [OPTIONS]
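As a rough illustration of the note on connection settings, the sketch below resolves host, port and db from a saver_options mapping, falling back to the documented defaults. The helper name mongo_target is hypothetical, and treating the global -d/--db option as taking precedence over saver_options is an assumption based on that option's description, not a confirmed detail of the implementation.

```python
from typing import Any, Dict, Optional, Tuple

# Defaults stated in the note above.
DEFAULTS = {"host": "localhost", "port": 27017, "db": "swisstext"}


def mongo_target(
    saver_options: Optional[Dict[str, Any]],
    db_override: Optional[str] = None,
) -> Tuple[str, int, str]:
    """Return (host, port, db), filling gaps with the documented defaults."""
    opts = {**DEFAULTS, **(saver_options or {})}
    # Assumption: a value passed with -d/--db wins over the configured db.
    db = db_override if db_override else opts["db"]
    return opts["host"], int(opts["port"]), db


print(mongo_target({}))
print(mongo_target({"host": "mongo.example", "port": "27018"}, db_override="test"))
```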
Options
-n, --num-seeds <num_seeds>
    Max seeds used. [default: 20]
--new, --any
    Only search new seeds [default: False]
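The selection order described for from_mongo (least used first, then oldest usage first, capped at -n, optionally restricted to unused seeds with --new) can be sketched in plain Python. The Seed class and its field names are hypothetical stand-ins for whatever the Mongo documents actually contain, the real command presumably performs this ordering in a database query, and the behaviour of --any is not modelled here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Seed:
    text: str
    usages: int                    # hypothetical field: how often the seed was queried
    last_used: Optional[datetime]  # hypothetical field: None if never used


def select_seeds(seeds: List[Seed], num_seeds: int = 20, new_only: bool = False) -> List[Seed]:
    """Pick at most num_seeds seeds, least used first, then oldest usage first."""
    if new_only:
        # Mirrors --new: keep only seeds that have never been used.
        seeds = [s for s in seeds if s.usages == 0]
    # Never-used seeds have no date; datetime.min sorts them before dated ones.
    ordered = sorted(seeds, key=lambda s: (s.usages, s.last_used or datetime.min))
    return ordered[:num_seeds]


pool = [
    Seed("a", 2, datetime(2019, 1, 1)),
    Seed("b", 0, None),
    Seed("c", 1, datetime(2019, 3, 1)),
    Seed("d", 1, datetime(2019, 2, 1)),
]
print([s.text for s in select_seeds(pool, num_seeds=3)])
print([s.text for s in select_seeds(pool, new_only=True)])
```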