swisstext.cmd.scraping

The scraping module scrapes the web to find new Swiss German sentences.

The magic happens by calling st_scrape.

st_scrape

st_scrape [OPTIONS] COMMAND [ARGS]...

Options

-l, --log-level <log_level>

[default: info]

Options: debug|info|warning|fatal

-c, --config-path <config_path>
-d, --db <db>

If set, this will override the database defined in the config
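
For example, to run the dump_config subcommand (described below) with debug logging, a custom configuration file and a database override (the file and database names are hypothetical):

# use debug logging, a custom config and override the target database
st_scrape -l debug -c my_config.yaml -d swisstext_test dump_config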

dump_config

Prints the active configuration. If --test is set, the pipeline is also instantiated, ensuring that all tool names exist and that their options are valid.

st_scrape dump_config [OPTIONS]

Options

-t, --test

Also instantiate the tools [default: False]
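
For example, assuming a (hypothetical) configuration file named config.yaml:

# print the active configuration and instantiate the pipeline to validate it
st_scrape -c config.yaml dump_config --test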

from_file

Scrape using URLs in a file as base.

This script runs the scraping pipeline using the URLs present in a file as bootstrap URLs. The file should have one URL per line. Any line starting with something other than “http” will be ignored.

st_scrape from_file [OPTIONS] URLFILE

Arguments

URLFILE

Required argument
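
A minimal example, using a hypothetical urls.txt file:

# one bootstrap URL per line; lines not starting with "http" are ignored
printf 'https://www.example.ch/forum\nhttps://www.example.ch/news\n' > urls.txt
st_scrape from_file urls.txt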

from_mongo

Scrape using mongo URLs as base.

This script runs the scraping pipeline using -n bootstrap URLs pulled from MongoDB. Those URLs are selected based on the number of visits and the date of the last visit (least visited first, oldest visit first). The --what argument controls which URLs are pulled from MongoDB to start the process: use ‘new’ for URLs never visited before, ‘ext’ for non-visited URLs found from an external source (e.g. via the search engine or a file), or ‘any’ (the default). Except with --what any, the actual number of URLs pulled may be less than -n.

Note that to connect to MongoDB, the script relies on the host, port and db options present in the saver_options property of the configuration. So whatever saver you use, ensure that those properties are correct (default: localhost:27017, db=swisstext).

st_scrape from_mongo [OPTIONS]

Options

-n, --num-urls <num_urls>

Max URLs crawled in one pass. [default: 20]

--what <what>

Pull any URLs / new URLs / new URLs from external source (seed, file) [default: any]

Options: any|new|ext

--how <how>

Pull URLs oldest first / at random [default: oldest]

Options: oldest|random
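
For example, to crawl at most 50 URLs that were never visited before, picked at random (the values are illustrative):

# MongoDB connection parameters are read from saver_options in the configuration
st_scrape from_mongo -n 50 --what new --how random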

gen_seeds

Generate seeds from a sample of mongo sentences.

This script generates -n seeds using a given number of sentences (-s) pulled from MongoDB. If --new is specified, the most recently added sentences are used (based on date_added). If --any is specified, sentences are selected at random.

Note that: (1) seeds will be saved to the persistence layer using the saver class specified in the configuration; (2) to connect to MongoDB, the script relies on the host, port and db options present in the saver_options property of the configuration. So whatever saver you use, ensure that those properties are correct (default: localhost:27017, db=swisstext).

st_scrape gen_seeds [OPTIONS]

Options

-s, --num-sentences <num_sentences>

Number of sentences to use. [default: 100]

-n, --num <num>

Number of seeds to generate. [default: 5]

--new, --any

Use the newest sentences [default: False]

-c, --confirm

Ask for confirmation before saving. [default: False]
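
For example (the values are illustrative):

# generate 10 seeds from the 500 most recently added sentences,
# asking for confirmation before saving
st_scrape gen_seeds -s 500 -n 10 --new --confirm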