Ideas

Sentence Filtering

We would need more rules to improve the quality of the extracted sentences. Here are some ideas (a sketch of such rules follows the list below):

  • Exclude sentences containing the word Wikipedia (or Alemannische Wikipedia) (DONE);

  • Exclude short sentences containing the character ":" or parentheses;

  • Exclude long sentences with no punctuation? Note that this would also remove poems…;
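
As an illustration, here is a minimal sketch of how such rules could be implemented as simple regex checks. The function name, patterns, and length thresholds below are assumptions for the example, not part of the existing swisstext code:

    import re

    # Hypothetical extra rules, in the spirit of the ideas above.
    WIKIPEDIA_MENTION = re.compile(r'\b(Alemannische )?Wikipedia\b')
    COLON_OR_PARENS = re.compile(r'[:()]')
    ANY_PUNCTUATION = re.compile(r'[.!?;,]')

    def keep_sentence(sentence: str) -> bool:
        """Return True if the sentence passes all extra filtering rules."""
        n_words = len(sentence.split())
        if WIKIPEDIA_MENTION.search(sentence):
            return False   # mentions (Alemannische) Wikipedia
        if n_words < 8 and COLON_OR_PARENS.search(sentence):
            return False   # short sentence containing ':' or parentheses
        if n_words > 25 and not ANY_PUNCTUATION.search(sentence):
            return False   # long sentence with no punctuation (would also drop poems)
        return True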

URL Filtering

Right now, the swisstext.cmd.link_util package is rather simple. A better way of filtering links beforehand (i.e. without scraping them) could improve performance by (1) avoiding unnecessary scraping and (2) preventing sentences from being wrongly extracted (for example, sentences from Wikipedia URLs containing ISBN are often miscategorized as Swiss German and are of poor quality).
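
For illustration, a simple pre-filter based on URL lookups could look like the sketch below. The blacklist entries and function name are made up for the example and do not reflect the actual link_util API:

    from urllib.parse import urlparse

    # Hypothetical blacklist of URL substrings that should never be scraped.
    BLACKLISTED_SUBSTRINGS = ['ISBN', '/Spezial:', 'action=edit']

    def should_scrape(url: str) -> bool:
        """Decide whether a link is worth scraping, without fetching it."""
        if not urlparse(url).scheme.startswith('http'):
            return False  # skip mailto:, ftp:, etc.
        return not any(s in url for s in BLACKLISTED_SUBSTRINGS)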

Idea: use exactly the same principle as in swisstext.cmd.tools.pattern_sentence_filter. This would make URL filtering very flexible and powerful (and the code already exists)! On the other hand, switching from dictionary lookups to regexes would impact performance…
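
A rough sketch of that idea, assuming the URL rules are regex patterns loaded from a YAML file; the file format, class name, and attribute names below are assumptions, loosely modelled on the sentence filter:

    import re
    import yaml  # PyYAML

    # Example rule file (url_rules.yaml), hypothetical format:
    #   exclude:
    #     - 'ISBN'
    #     - '/Spezial:'
    #     - '\.(pdf|jpg|png)$'

    class PatternUrlFilter:
        def __init__(self, rules_file: str):
            with open(rules_file) as f:
                rules = yaml.safe_load(f)
            # Compile once at startup, so each URL only costs a few regex searches.
            self.exclude = [re.compile(p) for p in rules.get('exclude', [])]

        def is_excluded(self, url: str) -> bool:
            """Return True if the URL matches any exclusion pattern."""
            return any(p.search(url) for p in self.exclude)

Since the patterns are compiled once, the per-URL cost is a handful of regex searches; whether this is acceptable compared to a plain dictionary lookup would need to be measured.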