swisstext.cmd.scraping
The scraping module scrapes the web in order to find new Swiss German sentences.
The magic happens by calling st_scrape.
st_scrape
st_scrape [OPTIONS] COMMAND [ARGS]...
Options

-l, --log-level <log_level>
    [default: info]
    Options: debug|info|warning|fatal

-c, --config-path <config_path>

-d, --db <db>
    If set, this will override the database set in the config.
dump_config
Prints the active configuration. If --test is set, the pipeline is also instantiated, ensuring all tool names exist and use correct options.

st_scrape dump_config [OPTIONS]

Options

-t, --test
    Also instantiate the tools. [default: False]
from_file
Scrape using URLs in a file as base.
This script runs the scraping pipeline using the URLs present in a file as bootstrap URLs. The file should have one URL per line. Any line starting with something other than "http" is ignored.

st_scrape from_file [OPTIONS] URLFILE

Arguments

URLFILE
    Required argument
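As a sketch of the expected file format, here is a hypothetical urls.txt: one URL per line, with non-"http" lines (comments, blanks) skipped. The grep filter below only mimics the documented line filtering; it is not the tool's actual implementation.

```shell
# Build a sample bootstrap file; from_file ignores any line that
# does not start with "http".
cat > urls.txt <<'EOF'
http://www.example.ch/blog
# comment lines are skipped
https://forum.example.ch/thread/42
EOF

# from_file would then be invoked as:
#   st_scrape from_file urls.txt
# The line filter it applies is equivalent to:
grep '^http' urls.txt
```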
from_mongo
Scrape using mongo URLs as base.
This script runs the scraping pipeline using -n bootstrap URLs pulled from Mongo. Those URLs are selected depending on the number of visits and the date of the last visit (least visited first, oldest visit first). The --what argument controls which URLs are pulled from mongo to start the process. Use 'new' for URLs never visited before, 'ext' for non-visited URLs found via an external source (search engine seed, file), or 'any' (default). Except with --what=any, the actual number of URLs pulled is not guaranteed to be -n.
Note that to connect to MongoDB, the script relies on the host, port and db options present in the saver_options property of the configuration. So whatever saver you use, ensure that those properties are correct (default: localhost:27017, db=swisstext).

st_scrape from_mongo [OPTIONS]
Options

-n, --num-urls <num_urls>
    Max URLs crawled in one pass. [default: 20]

--what <what>
    Pull any URLs / new URLs / new URLs from external source (seed, file). [default: any]
    Options: any|new|ext

--how <how>
    Pull URLs oldest first / at random. [default: oldest]
    Options: oldest|random
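The selection order described above (least visited first, then oldest last visit first) can be sketched in plain Python. The visit_count and last_visit field names are illustrative assumptions, not necessarily the actual Mongo schema:

```python
from datetime import date

# Hypothetical URL records; visit_count / last_visit are made-up
# field names used only to illustrate the documented ordering.
urls = [
    {"url": "http://a.ch", "visit_count": 3, "last_visit": date(2019, 5, 1)},
    {"url": "http://b.ch", "visit_count": 0, "last_visit": date(2019, 6, 1)},
    {"url": "http://c.ch", "visit_count": 3, "last_visit": date(2019, 1, 1)},
]

# --how=oldest with -n 2: least visited first, ties broken by
# oldest last visit first.
batch = sorted(urls, key=lambda u: (u["visit_count"], u["last_visit"]))[:2]
print([u["url"] for u in batch])  # → ['http://b.ch', 'http://c.ch']
```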
gen_seeds
Generate seeds from a sample of mongo sentences.
This script generates -n seeds using a given number of sentences (-s) pulled from MongoDB. If --new is specified, the latest sentences are used (date_added). If --any is specified, sentences are selected randomly.
Note that: (1) seeds will be saved to the persistence layer using the saver class specified in the configuration; (2) to connect to MongoDB, the script relies on the host, port and db options present in the saver_options property of the configuration. So whatever saver you use, ensure that those properties are correct (default: localhost:27017, db=swisstext).

st_scrape gen_seeds [OPTIONS]
Options

-s, --num-sentences <num_sentences>
    Number of sentences to use. [default: 100]

-n, --num <num>
    Number of seeds to generate. [default: 5]

--new, --any
    Use the newest sentences. [default: False]

-c, --confirm
    Ask for confirmation before saving. [default: False]
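Both from_mongo and gen_seeds read the Mongo connection from the saver_options property of the configuration. A minimal sketch of that section follows; only host, port and db (and their defaults) come from the notes above, while the YAML layout and any surrounding keys are assumptions that may differ in your setup:

```yaml
# Sketch of the Mongo connection settings read from saver_options.
# Only host, port and db are documented; the exact file layout may vary.
saver_options:
  host: localhost   # MongoDB host [default: localhost]
  port: 27017       # MongoDB port [default: 27017]
  db: swisstext     # database name [default: swisstext]
```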