Configuration options¶
The behavior of the scraper depends heavily on the tools chosen, which are defined in a configuration YAML file (see swisstext.cmd.base_config).
Some tool options work better together than others, so in this section I try to detail some tips and tricks for getting the best system.
Current “best” pipeline configuration¶
Using default tools¶
The current best configuration for Swiss-German spotting is the following:
# scraping
pipeline:
  crawler: .JustextCrawler
  normalizer: .Normalizer
  splitter: .MocySplitter
  sentence_filter: .PatternSentenceFilter
  sg_detector: .SwigspotLangid
  decider: .OneNewSgDecider
  saver: .MongoSaver

crawler_options:
  keep_bad: false

normalizer_options:
  fix_encoding: true
  strip_emojis: false

splitter_options:
  more: true
  keep_newlines: true

# searching
search_engine:
  query_builder: .QuoteWordsQueryBuilder
  saver: .MongoSaver
  searcher: .StartPageGeneratorFactory
The JustextCrawler is quite efficient at extracting and classifying paragraphs of text. By setting keep_bad to false, we discard paragraphs such as short titles, breadcrumbs, post meta, etc. Setting keep_bad to true should not impact the pipeline much, assuming the sentence filter does its job properly.
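For intuition, here is a minimal, self-contained sketch of how jusText (the library behind the JustextCrawler) classifies paragraphs; keep_bad essentially controls whether the boilerplate ones survive. The URL and the stoplist below are arbitrary examples, not values used by swisstext.

import requests
import justext

# fetch any page and let jusText classify its paragraphs
html = requests.get("https://www.example.com").content
paragraphs = justext.justext(html, justext.get_stoplist("German"))

for p in paragraphs:
    # boilerplate paragraphs are typically short titles, menus, breadcrumbs, post meta, ...
    label = "bad (boilerplate)" if p.is_boilerplate else "good"
    print(f"[{label}] {p.text[:60]}")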
Then, the Normalizer takes care of unifying Unicode codepoints. The fix_encoding option is important, as many GSW sentences are found on old forums whose databases are improperly configured (e.g. celica-t23.ch uses two encodings …).
The strip_emojis option, however, is set to false, because it adds a huge overhead for little gain (not that many pages contain Unicode emojis). The only downside is that emojis left in the text confuse the splitter, which is then unable to segment sentences properly around them.
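Coming back to fix_encoding: to get a feel for the kind of damage it repairs, here is a tiny example using ftfy (whether the Normalizer relies on ftfy internally or on its own logic is an implementation detail; the point is the mojibake involved).

import ftfy

# UTF-8 text that was decoded as Latin-1 somewhere along the way,
# typical of forum databases mixing two encodings
broken = "Das isch es schÃ¶ns TÃ¤gli gsi."
print(ftfy.fix_text(broken))  # -> "Das isch es schöns Tägli gsi."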
The best splitter so far is the MocySplitter, an improvement upon Moses' split-sentences.perl. Note that using a normalizer is vital for it to work well, since the splitter is not configured to handle all possible space/dash/etc. Unicode codepoints.
The more option will also split cleverly on : and ; (no real argument here, just a matter of taste).
The keep_newlines option is really interesting, but only in conjunction with the JustextCrawler: jusText already returns paragraphs delimited by newlines, so it is better to keep them and thus avoid combining a title or button text (both usually lacking punctuation) with the content that follows.
To see the difference, try crawling Twitter with and without keep_newlines! Without it, we basically crawl all possible versions of a page (different lang=xxx) and always find "new" sentences, because button texts are appended to the content and change depending on the language…
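To make the keep_newlines point concrete, here is a contrived, self-contained illustration (plain Python with a toy regex, not the actual MocySplitter code): jusText emits each paragraph on its own line, and a button text usually lacks final punctuation, so discarding newlines before splitting glues it onto the next paragraph.

import re

# two justext "paragraphs": a button text without punctuation, then real content
text = "Uf Dütsch wächsle\nHüt isch es schöns Wätter. Mir gönd go bade."

def naive_split(text: str, keep_newlines: bool) -> list:
    if keep_newlines:
        pattern = r"(?<=[.!?])\s+|\n"   # sentences end on .!? or on a paragraph boundary
    else:
        text = text.replace("\n", " ")  # newlines discarded: paragraphs are merged
        pattern = r"(?<=[.!?])\s+"
    return [s.strip() for s in re.split(pattern, text) if s.strip()]

print(naive_split(text, keep_newlines=False))
# ['Uf Dütsch wächsle Hüt isch es schöns Wätter.', 'Mir gönd go bade.']
print(naive_split(text, keep_newlines=True))
# ['Uf Dütsch wächsle', 'Hüt isch es schöns Wätter.', 'Mir gönd go bade.']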
The remaining tools are less critical, and their behavior does not depend on the tools that precede them.
Using extra tools¶
# scraping
pipeline:
  crawler: .JustextCrawler
  normalizer: .Normalizer
  splitter: .MocySplitter
  sentence_filter: .PatternSentenceFilter
  decider: decider.Decider
  url_filter: extra.sg_url_filter.SgUrlFilter
  sg_detector: extra.bert_torch_predictor.BertPredictor

crawler_options:
  keep_bad: false

normalizer_options:
  fix_encoding: true
  strip_emojis: true

splitter_options:
  keep_newlines: true
  more: true

sg_detector_options:
  chunk_size: 1000

# searching
search_engine:
  query_builder: .QuoteWordsQueryBuilder
  saver: .MongoSaver
  searcher: .StartPageGeneratorFactory

# general options
options:
  crawl_depth: 3
  max_fetches: -1
  max_results: 20
  min_proba: 0.92
  num_workers: 3
The big difference here is the use of a custom URL filter, decider and language detector. More precisely:

- URL filter: the extra directory contains a URL filter implementation that deals with some GSW blogs having strange URL patterns. It ensures we do not scrape for nothing by blacklisting some URLs and normalizing others (a rough sketch of the idea follows after the decider snippet below);
- SG detector: we use a better language identifier, a BERT model finetuned on the task of language identification. The model is available in another repo; the code in extra is just a wrapper around it to make it compatible with swisstext;
- decider: the decider implementation varies depending on the needs. On our latest runs to find Swiss German quickly, we used the following:
# decider.py, referenced as decider.Decider in the config above
# (import paths are indicative and may differ between swisstext versions)
from swisstext.cmd.scraping.data import Page
from swisstext.cmd.scraping.tools import BasicDecider

class Decider(BasicDecider):
    def should_page_be_crawled(self, page: Page) -> bool:
        """Returns true only if the page is new."""
        return page.is_new()

    def should_children_be_crawled(self, page: Page) -> bool:
        if super().should_children_be_crawled(page):
            # just don't try going further if no new sg is found
            return len(page.new_sg) > 2
        return False
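As for the URL filter mentioned above, the actual extra.sg_url_filter.SgUrlFilter is specific to the blogs we encountered, but the idea boils down to something like the following standalone sketch (the patterns and function name are made up for illustration and are not the real implementation):

import re
from typing import Optional

# made-up patterns: drop known-useless URLs, normalize the others
BLACKLIST = [re.compile(r"/tag/"), re.compile(r"\?share=")]

def filter_url(url: str) -> Optional[str]:
    if any(p.search(url) for p in BLACKLIST):
        return None               # blacklisted: never handed to the crawler
    return url.split("#")[0]      # normalized: e.g. strip URL fragments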
More information can be found in our publication: Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German.
Tool dependencies¶
- splitters don’t work well without prior normalization of the text, as they are made to work with the basic character set (vs. all the Unicode variations);
- the MocySplitter.keep_newlines option should only be used in conjunction with the JustextCrawler (and the default joiner);