swisstext.alswiki

The magic happens by calling st_alswiki.

st_alswiki

A suite of tools for downloading, parsing, and processing Wikipedia dumps.

st_alswiki [OPTIONS] COMMAND [ARGS]...

Options

-l, --log-level <log_level>

[default: info]

Choices: debug | info | warning | fatal
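
Since -l is a global option, it is placed before the subcommand. For example, to enable debug output for a single run:

st_alswiki -l debug download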

download

Download the latest dump of als.wikipedia.org.

It will save alswiki-latest-pages-articles.xml.bz2 in the directory given by <dir>.

st_alswiki download [OPTIONS]

Options

-d, --dir <dir>

Directory in which to download the files [default: .]
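
For example, to download the dump into a dumps/ subdirectory (an illustrative path):

st_alswiki download -d dumps
# saves dumps/alswiki-latest-pages-articles.xml.bz2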

parse

Extract articles from a wiki dump.

<dumpfile> must be a wiki-*-pages-articles.xml.bz2 file. The output file will be saved alongside the dump with a .json.bz2 extension. See gensim’s segment_wiki tool for more details (python -m gensim.scripts.segment_wiki -h).

st_alswiki parse [OPTIONS] DUMPFILE

Options

-m, --min-chars <min_chars>

Ignore articles with fewer characters than this (article stubs) [default: 200]

Arguments

DUMPFILE

Required argument
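
For example, to parse the default dump while skipping stubs shorter than 300 characters (an arbitrary threshold chosen for illustration):

st_alswiki parse -m 300 alswiki-latest-pages-articles.xml.bz2
# writes alswiki-latest-pages-articles.json.bz2 next to the dump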

process

Process articles using the scraping pipeline.

<config-path> and <db> are the same as in st_scrape. The <gensimfile> can be either in json or json.bz2. To generate it:

  1. download an alswiki-*-pages-articles.xml.bz2 from https://dumps.wikimedia.org/alswiki (or use the download command),

  2. use gensim’s segment_wiki tool (python -m gensim.scripts.segment_wiki -h) to extract articles (in JSON format) from the raw dump, or use the parse command (the output layout is sketched below).
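
For reference, segment_wiki writes one article per line as a JSON object; based on gensim’s documentation, each object should roughly contain the fields title, section_titles and section_texts. A quick way to peek at the first article:

bzcat alswiki-latest-pages-articles.json.bz2 | head -n 1
# → {"title": "...", "section_titles": [...], "section_texts": [...]}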

st_alswiki process [OPTIONS] GENSIMFILE

Options

-c, --config-path <config_path>
-d, --db <db>

If set, this will override the database set in the config

Arguments

GENSIMFILE

Required argument
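
For example, to process the parsed dump with a custom configuration while overriding the target database (config.yaml and the database name alswiki are placeholders):

st_alswiki process -c config.yaml -d alswiki alswiki-latest-pages-articles.json.bz2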

txt

Process a text file.

This command reads a file and passes it through a simplified pipeline (split -> filter -> sg_detect). The actual splitter/filter/detector used depends on the configuration given by <config-path>. The config is the same as in the backend, but only the options for those three tools are used.

Output is printed to the terminal, in one of the two formats given by the <format> argument:

* txt: prints the sentences
* csv: prints proba, sentence pairs in CSV format (quotes escaped)

st_alswiki txt [OPTIONS] TEXTFILE

Options

-c, --config-path <config_path>
-f, --format <format>

[default: txt]

Choices: txt | csv

-p, --min-proba <min_proba>

Override min_proba in config

Arguments

TEXTFILE

Required argument
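
For example, to print proba, sentence pairs while keeping only reasonably confident matches (the file name and threshold are illustrative):

st_alswiki txt -f csv -p 0.9 input.txt
# each output line: a probability followed by the quoted sentence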

Processing Alswiki dumps

If you want to add the sentences from an alswiki dump to the MongoDB database, you can use:

st_alswiki download
st_alswiki parse alswiki-latest-pages-articles.xml.bz2
st_alswiki process alswiki-latest-pages-articles.json.bz2

Remember that by default, the scraping pipeline is configured to use MongoDB on localhost. If you would rather write the output to a file, pass a custom config.yaml file to the process subcommand using the -c option.

# custom config: output the results to stdout
pipeline:
  saver: .console_saver.ConsoleSaver

saver_options:
  # where to save the output (defaults to stdout)
  sentences_file: alswiki_sentences.txt
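
Assuming this file is saved as config.yaml (any name works), the last step of the workflow above becomes:

st_alswiki process -c config.yaml alswiki-latest-pages-articles.json.bz2
# sentences end up in alswiki_sentences.txt instead of MongoDB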

Note

For old dumps using an older XML syntax, you can use WikiExtractor.py to extract the text from the dump, pull the text out of the resulting JSON files, and then use the txt subcommand, as sketched below.
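
A possible sketch, assuming WikiExtractor’s --json output mode and the jq utility (all paths and flags are illustrative, check WikiExtractor.py -h):

python WikiExtractor.py --json -o extracted old-dump.xml.bz2
jq -r '.text' extracted/AA/wiki_* > articles.txt
# concatenates the plain text of every extracted article
st_alswiki txt articles.txt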

Processing text files

The st_alswiki txt subcommand uses only a subset of the pipeline:

  1. splitter: split text into sentences

  2. filter: filter out ill-formed or invalid sentences

  3. language id: filter out non-Swiss German sentences

The results are written to stdout.

The configuration file is the same as for the scraping tool (see st_scrape config), so it is possible to select which tool implementation to use. To skip any of the above steps, create a custom configuration and:

  • for 1. and 2., specify the interface in the configuration pipeline or use the special _I_ magic;

  • for 3., set options.min_proba to 0 in the configuration (or set the sg_detector to the interface).

For example, this configuration will do nothing:

# turn off EVERY tool (hence doing nothing)
pipeline:
  splitter: swisstext.cmd.scraping.interfaces.ISplitter # or _I_
  sentence_filter: swisstext.cmd.scraping.interfaces.ISentenceFilter # or _I_

options:
  min_proba: 0.0
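
To try it out, save the snippet above as, say, noop.yaml and run (file names are illustrative):

st_alswiki txt -c noop.yaml some_text.txt
# the text should come out essentially untouched: nothing is split, filtered or language-checked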