swisstext.alswiki
The magic happens by calling st_alswiki.
st_alswiki
A suite of tools for downloading, parsing and processing Wikipedia dumps.
st_alswiki [OPTIONS] COMMAND [ARGS]...
Options

-l, --log-level <log_level>
    [default: info]
    Options: debug|info|warning|fatal
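For example, to run any of the subcommands below with verbose logging (a usage sketch; download is just one possible subcommand):

st_alswiki -l debug download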
download
Download the latest dump of als.wikipedia.org.
It will save alswiki-latest-pages-articles.xml.bz2 in the directory given by <dir>.
st_alswiki download [OPTIONS]
Options

-d, --dir <dir>
    Directory where to download files [default: .]
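For instance, to save the dump into a dedicated directory (the dumps/ path is only a placeholder):

st_alswiki download -d dumps/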
parse
Extract articles from a wiki dump.
<dumpfile> must be a wiki-*-pages-articles.xml.bz2 file. The output file will be saved alongside the dump with a .json.bz2 extension.
See gensim’s segment_wiki tool for more details (python -m gensim.scripts.segment_wiki -h).
st_alswiki parse [OPTIONS] DUMPFILE
Options

-m, --min-chars <min_chars>
    Ignore articles with fewer characters than this (article stubs) [default: 200]
Arguments

DUMPFILE
    Required argument
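A quick sketch combining the option and the argument (the 500-character threshold is only an illustrative value):

st_alswiki parse -m 500 alswiki-latest-pages-articles.xml.bz2

This produces alswiki-latest-pages-articles.json.bz2 next to the dump.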
process
Process articles using the scraping pipeline.
<config-path> and <db> are the same as in st_scrape.
The <gensimfile> can be either in json or json.bz2. To generate it:

1. download an alswiki-*-pages-articles.xml.bz2 from https://dumps.wikimedia.org/alswiki (or use the download command),
2. use gensim’s segment_wiki tool (python -m gensim.scripts.segment_wiki -h) to extract articles (in json format) from the raw dump (or use the parse command).
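If you prefer calling gensim directly instead of the parse command, a hedged sketch (the -f/-o flags follow gensim’s documented segment_wiki usage; double-check them with -h):

python -m gensim.scripts.segment_wiki -f alswiki-latest-pages-articles.xml.bz2 -o alswiki-latest-pages-articles.json.bz2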
st_alswiki process [OPTIONS] GENSIMFILE
Options

-c, --config-path <config_path>

-d, --db <db>
    If set, this will override the database set in the config
Arguments

GENSIMFILE
    Required argument
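For instance, assuming a local config file named config.yaml and a target database called alswiki (both names are placeholders):

st_alswiki process -c config.yaml -d alswiki alswiki-latest-pages-articles.json.bz2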
txt
Process a text file.
This command reads a file and passes it through a simplified pipeline (split -> filter -> sg_detect). The actual splitter/filter/detector used depends on the configuration given by <config-path>. The config is the same as in the backend, but only the options for those three tools are used.
Output is printed to the terminal, in one of the two formats given by the <format> option: txt or csv.
st_alswiki txt [OPTIONS] TEXTFILE
Options

-c, --config-path <config_path>

-f, --format <format>
    [default: txt]
    Options: txt|csv

-p, --min-proba <min_proba>
    Override min_proba in config
Arguments

TEXTFILE
    Required argument
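A usage sketch, assuming a config file config.yaml and an input file sentences.txt (both names are placeholders):

st_alswiki txt -c config.yaml -f csv -p 0.9 sentences.txt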
Processing Alswiki dumps
If you want to add the sentences from an alswiki dump to the mongo database, you can use:
st_alswiki download
st_alswiki parse alswiki-latest-pages-articles.xml.bz2
st_alswiki process alswiki-latest-pages-articles.json.bz2
Remember that by default, the scraping pipeline is configured to use MongoDB on localhost.
If you would rather get the output into a file, just pass a custom config.yaml file to the process subcommand using the -c option.
# custom config: save the results to a file instead of MongoDB
pipeline:
  saver: .console_saver.ConsoleSaver

saver_options:
  # where to save the output (defaults to stdout)
  sentences_file: alswiki_sentences.txt
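Assuming this snippet is saved as my_config.yaml (the name is arbitrary), the last step of the pipeline above becomes:

st_alswiki process -c my_config.yaml alswiki-latest-pages-articles.json.bz2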
Note
For old dumps with older XML syntax, you can use WikiExtractor.py to extract text from the dumps, get the text from the JSON file and then use the txt subcommand.
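A rough sketch of that workflow; the WikiExtractor flags, its output layout and the use of jq are assumptions about external tools, not part of swisstext, so check their respective --help first:

python WikiExtractor.py --json -o extracted/ old-alswiki-pages-articles.xml.bz2
jq -r '.text' extracted/*/wiki_* > old_alswiki.txt
st_alswiki txt -c config.yaml old_alswiki.txt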
Processing text files
The st_alswiki txt subcommand uses only a subset of the pipeline:

1. splitter: split text into sentences
2. filter: filter out ill-formed/invalid sentences
3. language id: filter out non-Swiss German sentences

The results are written to stdout.
The configuration file is the same as for the scraping tool (see st_scrape config), so it is possible to select which tool implementation to use.
To skip any of the above steps, create a custom configuration and:
- for 1. and 2., specify the interface in the configuration pipeline or use the special _I_ magic;
- for 3., set the options.min_proba to 0 in the configuration (or set the sg_detector to the interface).
For example, this configuration will do nothing:
# turn off EVERY TOOL (hence doing nothing)
pipeline:
  splitter: swisstext.cmd.scraping.interfaces.ISplitter # or _I_
  sentence_filter: swisstext.cmd.scraping.interfaces.ISentenceFilter # or _I_

options:
  min_proba: 0.0
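To try it, assuming the snippet above is saved as nothing.yaml and some_file.txt is any text file (both names are placeholders):

st_alswiki txt -c nothing.yaml some_file.txt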