Abstract Definitions

All defined in the swisstext.mongo.abstract package:

This module contains abstract mongoengine.Document definitions for all the objects in the SwissText system database.

When using mongoengine directly, you can use the concrete classes defined in swisstext.mongo.models.

When using Flask-MongoEngine, just subclass all abstract classes (i.e. the ones prefixed with Abstract) and make them inherit from db.Document as well. For example:

from flask_mongoengine import MongoEngine
db = MongoEngine()

from swisstext.mongo.abstract import AbstractMongoURL

class MongoURL(db.Document, AbstractMongoURL):
    pass

Common structures and embedded documents

Define generic/reused embedded documents as well as constants.

class swisstext.mongo.abstract.generic.CrawlMeta(*args, **kwargs)[source]

Bases: mongoengine.document.EmbeddedDocument

Holds a date and a count of items

count

Items count, for example the number of new URLs found.

date

The creation/added date, in UTC.

hash

Optional hash of the text/results/…

class swisstext.mongo.abstract.generic.Deleted(*args, **kwargs)[source]

Bases: mongoengine.document.EmbeddedDocument

A deleted flag.

by

The ID of the user triggering the deletion (required).

comment

Optional, not recorded to the DB if not specified

date

The datetime, generated automatically on creation in UTC

swisstext.mongo.abstract.generic.Dialects = {'a_l_z': 'Aargau, Luzern, Zug Nord', 'ba_sn': 'Basel, Solothurn Nord', 'bo_fr': 'Berner Oberland, Freiburg', 'g_sgs': 'Glarus und SG Süd', 'graub': 'Graubünden', 'nords': 'Nordostschweiz (SH, TG, SG Nord, AR, AI)', 'ss_bn': 'Solothurn Süd, Bern Nord', 'walli': 'Wallis', 'zentr': 'Zentralschweiz (OW, NW, UR, SZ, ZG Süd)', 'zuric': 'Zürich'}

An ordered dictionary of available dialect tags. The dialect tags come from the master's thesis of Sandra Kellerhals, “Dialektometrische Analyse und Visualisierung von schweizerdeutschen Dialekten auf verschiedenen linguistischen Ebenen”, p. 62 (see category Alle SDS- und SADS- Daten 10 Dialektregionen).

class swisstext.mongo.abstract.generic.Source(*args, **kwargs)[source]

Bases: mongoengine.document.EmbeddedDocument

A source field. It holds two pieces of information: type_ (mapped to type in mongo), which is one of the SourceType values, and extra, an optional piece of additional information.

If the source is attached to a URL, possible values are:
  • (SourceType.UNKNOWN, None)

  • (SourceType.UNKNOWN, "extra info")

  • (SourceType.USER, "user ID")

  • (SourceType.SEED, "seed ID")

  • (SourceType.AUTO, "parent url")

If the source is attached to a Seed, possible values are:
  • (SourceType.UNKNOWN, None)

  • (SourceType.USER, "user ID")

  • (SourceType.AUTO, None)

If the source is attached to a _blacklisted_ URL, possible values are:
  • (SourceType.USER, "user ID")

  • (SourceType.AUTO, None)

  • (SourceType.ERROR, "... details ...")
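The three lists above can be read as a small lookup table. The following is only a sketch of that idea, not part of the package: the container names ("url", "seed", "blacklist") and the helper is_valid_source are hypothetical, and the strings are the SourceType values documented in this module.

```python
# Allowed source types per kind of document, transcribed from the lists above.
# The keys ("url", "seed", "blacklist") are hypothetical names for this sketch.
ALLOWED_SOURCE_TYPES = {
    "url": {"file", "user", "seed", "auto"},   # UNKNOWN, USER, SEED, AUTO
    "seed": {"file", "user", "auto"},          # UNKNOWN, USER, AUTO
    "blacklist": {"user", "auto", "error"},    # USER, AUTO, ERROR
}


def is_valid_source(container: str, type_: str) -> bool:
    """Return True if type_ is a documented source type for this container."""
    return type_ in ALLOWED_SOURCE_TYPES.get(container, set())
```

For instance, a seed source is documented for URLs but not for seeds themselves.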

extra

A unicode string field.

type_

A unicode string field.

class swisstext.mongo.abstract.generic.SourceType[source]

Bases: object

Possible sources for a URL or a Seed.

AUTO = 'auto'

Auto means it has been generated automatically by the system during regular execution

ERROR = 'error'

The URL raised an error while scraping (used for blacklisted URLs). source.extra should contain more details.

SEED = 'seed'

The URL was found by searching with the seed whose ID appears in source.extra.

UNKNOWN = 'file'

This is used when the seeds/urls are read from a file

USER = 'user'

When user is set, the source.extra should contain the user id

Seed collection

Class representing a seed entry in the MongoDatabase.

A seed is a string used as a search engine query in order to find new Swiss German URLs. In the SwissText system, seeds can be added by users or generated automatically. Each seed will potentially be used multiple times.

Seeds are generated automatically in swisstext.cmd.crawling and used in swisstext.cmd.searching. Users can also add seeds manually to the collection using the Frontend (see swisstext.frontend).

class swisstext.mongo.abstract.seeds.AbstractMongoSeed(*args, **values)[source]

Bases: mongoengine.document.Document

An abstract mongoengine.Document for a seed, stored in the seeds collection.

add_search_history(new_links_count)[source]

Add a search history entry. This should be called after each use of the seed.

Parameters

new_links_count – the number of new URLs found.

count

The number of new URLs found. This counter is incremented on each seed use for any new URL, whether the URL is actually a “good URL” (i.e. really contains Swiss German) or not. This is because determining the quality of a URL requires actually crawling it.

classmethod create(seed, source=<Source: Source object>)[source]

Create a seed. Warning: this won’t save the document automatically. You need to call the .save method to persist it into the database.

Parameters
  • seed – the seed

  • source – the source of the seed, defaults to SourceType.UNKNOWN

date_added

When the seed has been added to the collection, in UTC.

deleted

This field is only present in the collection if the seed has been deleted. Use the following query to get deleted seeds:

db.seeds.find({deleted: {$exists: true}})

delta_date

Date of the last use of this seed in a search.

classmethod exists(text) → bool[source]

Test if a seed already exists.

Parameters

text – the seed

classmethod find_similar(seed)[source]

Find similar seeds. Currently, this method is quite dumb: it will search for seeds containing one or more words present in seed using regex.

Note

for a seed like hello world, the regex used is (hello)|(world), so a seed like worldnews may also be included in the results.

Parameters

seed – the seed to search for

Returns

a mongoengine.BaseQuery object for other seeds containing similar words.
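As an illustration of the note above, the alternation can be reproduced with the standard re module. This is a sketch only: similar_seed_pattern is a hypothetical helper, and escaping each word is an assumption, not necessarily what find_similar does internally.

```python
import re


def similar_seed_pattern(seed: str) -> re.Pattern:
    """Build an alternation such as (hello)|(world) from the words of a seed."""
    words = seed.split()
    # Escaping is an assumption of this sketch: it makes punctuation match literally.
    return re.compile("|".join(f"({re.escape(word)})" for word in words))


pattern = similar_seed_pattern("hello world")
candidates = ["hello there", "worldnews", "guete morge"]
# search() matches anywhere in the string, hence "worldnews" is a hit too.
matches = [c for c in candidates if pattern.search(c)]
```

Here matches contains both "hello there" and "worldnews", which is the over-matching behaviour the note warns about.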

classmethod get(seed)[source]

Get a seed by ID.

id

The seed itself is used as the primary key to avoid duplicates.

Note

only true duplicates are detected, so ensure the seed is lowercase and trimmed before insert.
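A minimal sketch of the normalization the note asks for, assuming that lowercasing and trimming are the only steps needed; normalize_seed is a hypothetical helper, not a function of the package.

```python
def normalize_seed(seed: str) -> str:
    """Lowercase and trim a seed so that true duplicates share the same ID."""
    return seed.strip().lower()


# Three spellings of the same seed collapse into a single ID.
seen_ids = set()
for raw in ["Grüezi mitenand", "  grüezi mitenand ", "GRÜEZI MITENAND"]:
    seen_ids.add(normalize_seed(raw))
```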

classmethod mark_deleted(obj, uuid, comment=None)[source]

Mark one or multiple seeds as deleted.

Parameters
  • obj – either a seed ID (i.e. a string), a MongoSeed instance or a QuerySet of seeds.

  • uuid – the ID of the user deleting the seed.

  • comment – an optional comment

search_history

A list of usage. For each use, we record the date and the number of new URLs found. The sum of each search history count should be equal to the count variable.

source

Source of the seed. Possible values are:

  • SourceType.AUTO or SourceType.UNKNOWN: no extra required,

  • SourceType.USER: the extra is the id of the user.

classmethod unmark_deleted(obj)[source]

Undelete a seed. This will unset the deleted flag completely. In Mongo Shell, use:

db.seeds.updateOne({_id: "the seed"}, {$unset: {deleted: 1}})

Parameters

obj – either a seed ID (i.e. a string), a MongoSeed instance or a QuerySet of seeds.

Sentences collection

Classes for interacting with a Swiss German sentence in the MongoDatabase.

Sentences are unique and found automatically using the swisstext crawler (see swisstext.cmd.scraping). Once a sentence is added to the collection, it is never deleted, except if the URL it comes from is blacklisted.

Using the SwissText Frontend (see swisstext.frontend), users can:

  • validate the sentence: i.e. mark it as actually Swiss German,

  • label the sentence: assign a dialect (see DialectInfo) to it,

  • delete a sentence: this will just add a deleted flag, not actually remove the sentence from the collection.

class swisstext.mongo.abstract.sentences.AbstractMongoSentence(*args, **values)[source]

Bases: mongoengine.document.Document

An abstract mongoengine.Document for a sentence, stored in the sentence collection.

To avoid duplicates, the primary key is a hash derived from the text. We currently use CityHash64, a hash function designed for hash tables rather than cryptography. Note that the hash is case-sensitive and that the text should be trimmed before hashing (for example by calling str.strip()). There is also no guarantee that collisions won’t occur.
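The effect of trimming and case sensitivity on the ID can be sketched as follows. Important assumption: to stay within the standard library, this sketch uses an 8-byte BLAKE2b digest as a stand-in for a 64-bit hash. It is NOT CityHash64 and will not produce the same IDs as get_hash(); both helper names are hypothetical.

```python
import hashlib


def sketch_hash(text: str) -> str:
    """64-bit hex digest. BLAKE2b is a stand-in here, NOT CityHash64."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=8).hexdigest()


def sentence_id(text: str) -> str:
    """Trim the text, then hash it. Case is deliberately left untouched."""
    return sketch_hash(text.strip())


same = sentence_id("  Hoi zäme ") == sentence_id("Hoi zäme")   # trimming applies
differs = sentence_id("Hoi zäme") != sentence_id("hoi zäme")   # case matters
```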

classmethod add_label(obj, uuid, label, **kwargs)[source]

Add a dialect label. This method first ensures that the dialect attribute is not null, then forwards the call to DialectInfo.add_label().

Parameters
  • obj – either the ID of a sentence (a string) or an actual sentence.

  • uuid – the ID of the user labelling the sentence.

  • label – the label (see Dialects)

crawl_proba

The probability of Swiss German, as calculated by the LID model during crawl.

classmethod create(text, url, proba)[source]

Create a sentence. Warning: this won’t save the document automatically. You need to call the .save method to persist it into the database.

date_added

The date when the sentence was added to the collection, in UTC.

deleted

A deleted flag (with date, user ID and optional comment) that exists only if the sentence is deleted.

dialect

All the information about dialect tagging (see DialectInfo). Note that this field can be absent (never labelled by anyone) or its label empty (all labels removed).

classmethod exists(text) → bool[source]

Test if a sentence already exists (case sensitive).

static get_hash(text) → str[source]

Hash the given text using CityHash64.

id

The sentence ID, computed by hashing the text using get_hash().

classmethod mark_deleted(obj, uuid, comment=None)[source]

Mark a sentence as deleted.

Parameters
  • obj – either the ID of a sentence (a string) or an actual sentence.

  • uuid – the ID of the user deleting the sentence.

  • comment – an optional comment.

classmethod remove_label(obj, uuid)[source]

Remove a label added by a user. This method just forwards the call to DialectInfo.remove_label().

Parameters
  • obj – either the ID of a sentence (a string) or an actual sentence.

  • uuid – the ID of the user for which to remove the label.

text

The raw text, without any transformation except trimming.

classmethod unmark_deleted(obj)[source]

Restore a deleted sentence.

Parameters

obj – either the ID of a sentence (a string) or an actual sentence.

url

The source URL.

property url_id

validated_by

A list of IDs of the users who have seen the sentence and validated it as Swiss German in the Frontend.

class swisstext.mongo.abstract.sentences.DialectEntry(*args, **kwargs)[source]

Bases: mongoengine.document.EmbeddedDocument

Represents a vote, i.e. dialect label, assigned by a user.

date

The date of the vote, in UTC

label

The label (see swisstext.mongo.abstract.generic.Dialects for a list of available labels)

user

The ID of the user

class swisstext.mongo.abstract.sentences.DialectInfo(*args, **kwargs)[source]

Bases: mongoengine.document.EmbeddedDocument

A mongoengine.EmbeddedDocument that encapsulates all the dialect-tagging information for a sentence.

Usually, a sentence will be presented to a given user only once. If the user knows the dialect, an entry in labels is added. If the user doesn’t know, their ID is added to skipped_by. Thus, a user ID should be present in at most one of the two lists.

The current dialect is stored in label. To judge how reliable it is, two statistics can be used: count is the number of users who voted for the label, while confidence is the ratio between the number of votes for this label and the total number of votes.
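The relation between label, count and confidence can be sketched with a plain list of votes. This is an illustration under assumptions, not the package's implementation: summarize_votes is a hypothetical helper, and ties are resolved by Counter.most_common (first label encountered) rather than randomly.

```python
from collections import Counter


def summarize_votes(votes):
    """Return (label, count, confidence) for a list of dialect tags."""
    if not votes:
        # No vote yet: no label, and nothing to be confident about.
        return None, 0, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count, count / len(votes)


# Two votes for 'zuric' out of three votes in total.
label, count, confidence = summarize_votes(["zuric", "zuric", "walli"])
```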

Warning

Most of the methods here call mongoengine.EmbeddedDocument.save() under the hood, thus persisting the changes automatically to the DB.

This usually works fine, but it means you cannot manipulate a DialectInfo instance that is not attached to a mongoengine.Document.

Moreover, you can sometimes run into a ‘ReferenceError: weakly-referenced object no longer exists’ (possibly only when using Flask-MongoEngine). If this is the case, ensure that you have a strong reference to the parent document in your code. Here is an example:

# let's say MongoSentence implements AbstractMongoSentence
# This can raise a ReferenceError:
MongoSentence.objects.with_id(sid).dialect.add_label(uuid, label)
# while this is fine:
s = MongoSentence.objects.with_id(sid)
s.dialect.add_label(uuid, label) # OK

add_label(uuid, label)[source]

Add a label and save the document. Note the following two special cases:

  • the user has already voted: the old vote is deleted and replaced,

  • the user has previously skipped the sentence: the user ID is removed from skipped_by.

Parameters
  • uuid – the user ID

  • label – the label

Returns

self

confidence

Defined by count / len(labels). Gives a rough estimate of how “good” a label is.

count

Number of people that voted for the current label.

get_label_by(uuid) → str[source]

Get the label given by a user.

Parameters

uuid – the user ID

Returns

the label as a string, or None if the user hasn’t labelled the sentence.

label

Current label, i.e. the label with the highest number of votes. In case of a draw, one label is selected randomly. This is why the confidence information is important.

labels

All the votes, as a list of DialectEntry.

remove_label(uuid)[source]

Remove the label given by a user and save the change to MongoDB. Note that it does nothing if the user didn’t vote.

Parameters

uuid – the user ID

Returns

self

skip(uuid)[source]

Add the user to the skipped_by list. Note that if the user already voted, their vote will be removed from labels to ensure consistency.

Parameters

uuid – the user ID

Returns

self
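The consistency rules between add_label() and skip() can be sketched with plain Python containers. This is a simplified in-memory model under assumptions: DialectVotesSketch is hypothetical, stores votes in a dict instead of a list of DialectEntry, and does not save anything to the database.

```python
class DialectVotesSketch:
    """In-memory stand-in for the labels / skipped_by bookkeeping."""

    def __init__(self):
        self.labels = {}          # user ID -> label, one vote per user
        self.skipped_by = set()   # IDs of users who skipped the sentence

    def add_label(self, uuid, label):
        self.skipped_by.discard(uuid)   # a voter is no longer a skipper
        self.labels[uuid] = label       # replaces any previous vote
        return self

    def skip(self, uuid):
        self.labels.pop(uuid, None)     # drop the vote to stay consistent
        self.skipped_by.add(uuid)
        return self


info = DialectVotesSketch()
info.skip("u1").add_label("u1", "walli").add_label("u1", "zuric")
```

After these calls, user u1 has a single vote (zuric) and no longer appears in skipped_by, so a user ID is never in both containers at once.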

skipped_by

A list of user IDs, corresponding to users who have seen the sentence but were not able to label it.

unskip(uuid)[source]

Remove a user from the skipped_by list.

Parameters

uuid – the user ID

Returns

self

URLs and Blacklist collections

Classes for interacting with URLs in the MongoDatabase.

All visited URLs must be recorded in some way in the database, at least to avoid recrawling the same links over and over. URLs are therefore split into two collections:

  • urls: stores interesting URLs that could/should be crawled again,

  • blacklist: stores URLs that should never be visited again.

At the time of writing, swisstext.cmd.scraping defines interesting URLs as URLs with at least one Swiss German sentence.

Also note that URLs that have never been visited can also be stored in the urls collection. They will be updated or moved to the blacklist after the first visit.

class swisstext.mongo.abstract.urls.AbstractMongoBlacklist(*args, **values)[source]

Bases: mongoengine.document.Document

An abstract mongoengine.Document for uninteresting URLs, stored in the blacklist collection.

classmethod add_url(url: str, source: swisstext.mongo.abstract.generic.Source = None)[source]

Blacklist a URL. This will create and save the entry to mongo automatically.

date_added

When the URL was added to the blacklist

classmethod exists(url: str = None, hash: str = None) → bool[source]

Test if a url is blacklisted.

static get_hash(text) → str[source]

id

The url ID, computed by hashing the text using get_hash().

source

What/who triggered the blacklisting, see swisstext.mongo.abstract.generic.Source.

url

The blacklisted URL, indexed by hash.

class swisstext.mongo.abstract.urls.AbstractMongoURL(*args, **values)[source]

Bases: mongoengine.document.Document

An abstract mongoengine.Document for interesting URLs, stored in the urls collection.

add_crawl_history(new_sg_count, hash=None, sents_count=None, sg_sents_count=None)[source]

Add a crawl history entry. Note that this will update the document instance, but won’t persist the change to mongo. You need to call mongoengine.Document.save() yourself.

Parameters
  • new_sg_count – the number of new sentences found on this crawl.

  • kwargs – see UrlCrawlMeta

Returns

self

count

Total number of new sentences found on this page across all visits.

AbstractMongoURL.count == sum((ch.count for ch in AbstractMongoURL.crawl_history))

crawl_history

One entry for each visit of this URL by the scraper, ordered by the visit date ascending.

classmethod create(url, source=<Source: Source object>) → mongoengine.document.Document[source]

Create a URL. Warning: this won’t save the document automatically. You need to call the .save method to persist it into the database.

date_added

When the URL was added to the collection, in UTC.

delta

The number of new sentences found on the last visit (same as crawl_history[-1].count).

delta_date

The date of the last visit (same as crawl_history[-1].date). Absent if the URL was never crawled.
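How count, delta and delta_date relate to crawl_history can be sketched with dataclasses. This is a simplified model under assumptions: CrawlEntry and UrlSketch are hypothetical stand-ins for UrlCrawlMeta and the URL document, keep only the fields discussed here, and never touch a database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class CrawlEntry:
    """Hypothetical stand-in for UrlCrawlMeta (date and count only)."""
    count: int
    date: datetime


@dataclass
class UrlSketch:
    """Hypothetical stand-in for a URL document."""
    url: str
    count: int = 0
    delta: int = 0
    delta_date: Optional[datetime] = None   # stays None until the first crawl
    crawl_history: List[CrawlEntry] = field(default_factory=list)

    def add_crawl_history(self, new_sg_count: int) -> "UrlSketch":
        entry = CrawlEntry(count=new_sg_count, date=datetime.now(timezone.utc))
        self.crawl_history.append(entry)   # appended last, so dates ascend
        self.count += new_sg_count         # running total over all visits
        self.delta = entry.count           # same as crawl_history[-1].count
        self.delta_date = entry.date       # same as crawl_history[-1].date
        return self


page = UrlSketch("http://example.com")
page.add_crawl_history(4).add_crawl_history(1)
```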

classmethod exists(url: str = None, id: str = None) → bool[source]

Test if a url exists.

classmethod get(url=None, id=None) → mongoengine.document.Document[source]

Get a URL Document instance by ID or url (or None if it does not exist).

static get_hash(text) → str[source]

Hash the given text using CityHash64.

classmethod get_never_crawled(**kwargs) → mongoengine.queryset.queryset.QuerySet[source]

Get a QuerySet of URLs that have never been visited.

id

The url ID, computed by hashing the text using get_hash().

source

The source of the URL (see Source). Possible sources are defined in SourceType.

classmethod try_delete(url: str = None, id: str = None)[source]

Delete a URL if it exists. Otherwise, silently do nothing.

url

The URL, indexed by hash.

class swisstext.mongo.abstract.urls.UrlCrawlMeta(*args, **kwargs)[source]

Bases: swisstext.mongo.abstract.generic.CrawlMeta

Keeps more information about the crawl.

sents_count

Number of quality sentences found on the page.

sg_sents_count

Number of quality sentences spotted as GSW (new or not).

Users collection

Classes for interacting with Users in the MongoDatabase.

A user has a username and a password, as well as a list of roles that allow fine-grained control over access to the frontend.

class swisstext.mongo.abstract.users.AbstractMongoUser(*args, **values)[source]

Bases: mongoengine.document.Document

classmethod get(uuid: str, password: str = None)[source]

Get a user. If password is specified, it is checked against the one in the database (login).

Parameters
  • uuid – the username / ID

  • password – the clear password

Returns

either a user instance or None if the uuid does not exist or the password was specified and incorrect.
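The lookup-then-check behaviour described above can be sketched without a database. Assumptions: USERS is a hypothetical in-memory store and get_user a hypothetical function. Only the MD5 hashing mirrors what this module documents for get_hash().

```python
import hashlib
from typing import Optional


def get_hash(password: str) -> str:
    """MD5 hex digest of the clear password, as documented for this module."""
    return hashlib.md5(password.encode()).hexdigest()


# Hypothetical in-memory user store, keyed by user ID (the username).
USERS = {"alice": {"password": get_hash("s3cret"), "roles": []}}


def get_user(uuid: str, password: Optional[str] = None) -> Optional[dict]:
    """Return the user, or None if unknown or if a given password is wrong."""
    user = USERS.get(uuid)
    if user is None:
        return None
    # The password is only checked when the caller provides one (login).
    if password is not None and get_hash(password) != user["password"]:
        return None
    return user
```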

static get_hash(password) → str[source]

Generate the MD5 hash of the password.

id

The user ID, which is also the username.

password

The password, hashed using MD5.

In Python:

import hashlib
hashlib.md5(password.encode()).hexdigest()

In Mongo Shell:

hex_md5(password)

roles

The list of roles. If the list is empty, the user is considered a USER.

class swisstext.mongo.abstract.users.UserRoles[source]

Bases: object

Currently, users can either be ADMIN or USER.

ADMIN = 'admin'
USER = 'user'
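The rule that an empty roles list means USER can be sketched as follows. The two helpers, effective_roles and is_admin, are hypothetical and only illustrate the convention; the constants repeat the values documented above.

```python
ADMIN = "admin"
USER = "user"


def effective_roles(roles):
    """Treat a missing or empty roles list as a plain USER."""
    return list(roles) if roles else [USER]


def is_admin(roles):
    """True if the (effective) roles include ADMIN."""
    return ADMIN in effective_roles(roles)
```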

MongoEngine-ready classes

Implementation of the classes defined in swisstext.mongo.abstract for use with mongoengine (and not flask-mongoengine).

Example usage:

from mongoengine import connect
from swisstext.mongo.models import *

connect(db='swisstext') # default host and port: localhost:27017

# Get all urls in the urls collection
all_urls = MongoURL.objects
all_urls.count() # returns the number of documents in the queryset

# Get one url by ID (the ID is the hash of the URL, see get_hash)
url = 'http://example.com'
MongoURL.objects.with_id(MongoURL.get_hash(url)) # returns either a MongoURL or None

# A more complex query:
# get all URLs containing 'wikipedia' that have been crawled at least once and
# with fewer than 10 new sentences found, sorted by last crawl date, descending
MongoURL.objects(
    url__icontains="wikipedia",
    crawl_history__0__exists=True,
    count__lt=10
).order_by('-delta_date')

See also

Module swisstext.mongo.abstract

Documentation of all the classes. Simply look for Abstract<Classname>.

MongoEngine documentation

The MongoEngine documentation, including the API reference.

class swisstext.mongo.models.MongoBlacklist(*args, **values)[source]

Bases: swisstext.mongo.abstract.urls.AbstractMongoBlacklist

exception DoesNotExist

Bases: mongoengine.errors.DoesNotExist

exception MultipleObjectsReturned

Bases: mongoengine.errors.MultipleObjectsReturned

class swisstext.mongo.models.MongoSeed(*args, **values)[source]

Bases: swisstext.mongo.abstract.seeds.AbstractMongoSeed

exception DoesNotExist

Bases: mongoengine.errors.DoesNotExist

exception MultipleObjectsReturned

Bases: mongoengine.errors.MultipleObjectsReturned

class swisstext.mongo.models.MongoSentence(*args, **values)[source]

Bases: swisstext.mongo.abstract.sentences.AbstractMongoSentence

exception DoesNotExist

Bases: mongoengine.errors.DoesNotExist

exception MultipleObjectsReturned

Bases: mongoengine.errors.MultipleObjectsReturned

class swisstext.mongo.models.MongoText(*args, **values)[source]

Bases: swisstext.mongo.abstract.text.AbstractMongoText

exception DoesNotExist

Bases: mongoengine.errors.DoesNotExist

exception MultipleObjectsReturned

Bases: mongoengine.errors.MultipleObjectsReturned

class swisstext.mongo.models.MongoURL(*args, **values)[source]

Bases: swisstext.mongo.abstract.urls.AbstractMongoURL

exception DoesNotExist

Bases: mongoengine.errors.DoesNotExist

exception MultipleObjectsReturned

Bases: mongoengine.errors.MultipleObjectsReturned

class swisstext.mongo.models.MongoUser(*args, **values)[source]

Bases: swisstext.mongo.abstract.users.AbstractMongoUser

exception DoesNotExist

Bases: mongoengine.errors.DoesNotExist

exception MultipleObjectsReturned

Bases: mongoengine.errors.MultipleObjectsReturned