Abstract Definitions¶
All defined in the swisstext.mongo.abstract package:
This module contains abstract mongoengine.Document definitions for all the objects in the SwissText system
database.
When using mongoengine directly, you can use the concrete classes defined in swisstext.mongo.models.
When using Flask-MongoEngine, just subclass all abstract classes (i.e. the ones prefixed with Abstract)
and make them inherit from db.Document as well. For example:
from flask_mongoengine import MongoEngine
from swisstext.mongo.abstract import AbstractMongoURL

db = MongoEngine()

class MongoURL(db.Document, AbstractMongoURL):
    pass
Common structures and embedded documents¶
Define generic/reused embedded documents as well as constants.
-
class
swisstext.mongo.abstract.generic.CrawlMeta(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
Holds a date and a count of items.
-
count¶ Items count, for example the number of new URLs found.
-
date¶ The creation/added date, in UTC.
-
hash¶ Optional hash of the text/results/…
-
-
class
swisstext.mongo.abstract.generic.Deleted(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
A deleted flag.
-
by¶ The ID of the user triggering the deletion (required).
-
comment¶ Optional, not recorded to the DB if not specified
-
date¶ The datetime, generated automatically on creation in UTC
-
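The three fields of the deleted flag can be illustrated without a database. The following is a minimal sketch in plain Python, not the library's implementation: make_deleted_flag is a hypothetical helper that builds a dict shaped like the embedded document, assuming the comment key is simply left out when no comment is given.

```python
from datetime import datetime, timezone
from typing import Optional


def make_deleted_flag(by: str, comment: Optional[str] = None) -> dict:
    """Build a dict shaped like the Deleted embedded document (hypothetical helper)."""
    if not by:
        raise ValueError("'by' (the user ID) is required")
    # The date is generated automatically, in UTC.
    flag = {"by": by, "date": datetime.now(timezone.utc)}
    # The comment is only stored when one was actually given.
    if comment is not None:
        flag["comment"] = comment
    return flag


flag = make_deleted_flag("alice")
flag_with_comment = make_deleted_flag("alice", comment="not Swiss German")
```

In this sketch the first flag has only the by and date keys, while the second one also carries the comment.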
-
swisstext.mongo.abstract.generic.Dialects= {'a_l_z': 'Aargau, Luzern, Zug Nord', 'ba_sn': 'Basel, Solothurn Nord', 'bo_fr': 'Berner Oberland, Freiburg', 'g_sgs': 'Glarus und SG Süd', 'graub': 'Graubünden', 'nords': 'Nordostschweiz (SH, TG, SG Nord, AR, AI)', 'ss_bn': 'Solothurn Süd, Bern Nord', 'walli': 'Wallis', 'zentr': 'Zentralschweiz (OW, NW, UR, SZ, ZG Süd)', 'zuric': 'Zürich'}¶ An ordered dictionary of available dialect tags. The dialect tags come from the Master Thesis of Sandra Kellerhals, “Dialektometrische Analyse und Visualisierung von schweizerdeutschen Dialekten auf verschiedenen linguistischen Ebenen”, p. 62 (see category Alle SDS- und SADS- Daten 10 Dialektregionen)
-
class
swisstext.mongo.abstract.generic.Source(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
A source field. It contains two pieces of information: type_ (mapped to type in mongo), one of the SourceType values, and extra, an optional piece of extra information.
- If the source is attached to a URL, possible values are:
(SourceType.UNKNOWN, None)
(SourceType.UNKNOWN, "extra info")
(SourceType.USER, "user ID")
(SourceType.SEED, "seed ID")
(SourceType.AUTO, "parent url")
- If the source is attached to a Seed, possible values are:
(SourceType.UNKNOWN, None)
(SourceType.USER, "user ID")
(SourceType.AUTO, None)
- If the source is attached to a _blacklisted_ URL, possible values are:
(SourceType.USER, "user ID")
(SourceType.AUTO, None)
(SourceType.ERROR, "... details ...")
-
extra¶ A unicode string field.
-
type_¶ A unicode string field.
-
class
swisstext.mongo.abstract.generic.SourceType[source]¶ Bases:
object
Possible sources for a URL or a Seed.
-
AUTO= 'auto'¶ Auto means it has been generated automatically by the system during regular execution
-
ERROR= 'error'¶ The URL raised an error while scraping (used for the blacklist); source.extra should have more info.
-
SEED= 'seed'¶ The URL was found by searching the seed whose ID is given in source.extra
-
UNKNOWN= 'file'¶ This is used when the seeds/urls are read from a file
-
USER= 'user'¶ When user is set, the source.extra should contain the user id
-
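The allowed (type, extra) combinations listed for Source can be written down as a small lookup table. This is only a sketch: the constants copy the SourceType values shown above, while ALLOWED and is_valid_source are hypothetical names that are not part of swisstext.

```python
# Constants mirroring the SourceType values documented above.
UNKNOWN, USER, SEED, AUTO, ERROR = "file", "user", "seed", "auto", "error"

# For each kind of owner: source type -> is `extra` required (True),
# forbidden (False) or optional (None). Transcribed from the documented lists.
ALLOWED = {
    "url": {UNKNOWN: None, USER: True, SEED: True, AUTO: True},
    "seed": {UNKNOWN: False, USER: True, AUTO: False},
    "blacklist": {USER: True, AUTO: False, ERROR: True},
}


def is_valid_source(owner, type_, extra=None):
    """Check a (type_, extra) pair against the table (hypothetical helper)."""
    rules = ALLOWED[owner]
    if type_ not in rules:
        return False
    required = rules[type_]
    if required is None:  # extra is optional
        return True
    return (extra is not None) == required


url_auto_ok = is_valid_source("url", AUTO, "http://example.com/parent")
seed_from_seed_ok = is_valid_source("seed", SEED, "some seed")
```

Here url_auto_ok is True, while seed_from_seed_ok is False because SourceType.SEED is not listed for seeds.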
Seed collection¶
Class representing a seed entry in the MongoDatabase.
A seed is a string used as a search engine query in order to find new Swiss German URLs. In the SwissText system, seeds can be added by users or generated automatically. Each seed will potentially be used multiple times.
Seeds are generated automatically in swisstext.cmd.crawling and used in swisstext.cmd.searching. Users
can also add seeds manually to the collection using the Frontend (see swisstext.frontend).
-
class
swisstext.mongo.abstract.seeds.AbstractMongoSeed(*args, **values)[source]¶ Bases:
mongoengine.document.Document
An abstract mongoengine.Document for a seed, stored in the seeds collection.
-
add_search_history(new_links_count)[source]¶ Add a search history entry. This should be called after each use of the seed.
- Parameters
new_links_count – the number of new URLs found.
-
count¶ The number of new URLs found. This counter is incremented on each seed use for any new URL, whether the URL is actually a “good URL” (i.e. really contains Swiss German) or not. This is because to determine the quality of a URL, one needs to actually crawl it.
-
classmethod
create(seed, source=<Source: Source object>)[source]¶ Create a seed. Warning: this won’t save the document automatically. You need to call the .save method to persist it into the database.
- Parameters
seed – the seed
source – the source of the seed, defaults to SourceType.UNKNOWN
-
date_added¶ When the seed has been added to the collection, in UTC.
-
deleted¶ This field is only present in the collection if the seed has been deleted. Use the following query to get deleted seeds:
db.seeds.find({deleted: {$exists: true}})
-
delta_date¶ Date of the last use of this seed in a search.
-
classmethod
find_similar(seed)[source]¶ Find similar seeds. Currently, this method is quite naive: it searches for seeds containing one or more of the words present in seed, using a regex.
Note
For a seed like hello world, the actual regex is (hello)|(world), so a seed like worldnews may also be included in the results.
- Parameters
seed – the seed to search for
- Returns
a mongoengine.BaseQuery object for other seeds containing similar words.
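The matching behaviour described in the note can be reproduced with the standard re module. This is a sketch, not the actual query code: similar_seed_pattern is a hypothetical helper, and escaping each word with re.escape is an assumption of this example.

```python
import re


def similar_seed_pattern(seed: str) -> str:
    """Build an alternation with one group per word (hypothetical helper)."""
    return "|".join(f"({re.escape(word)})" for word in seed.split())


pattern = similar_seed_pattern("hello world")
candidates = ["hello there", "worldnews", "guete morge"]
# Keep every candidate in which at least one word of the seed appears.
matches = [c for c in candidates if re.search(pattern, c)]
```

Because the pattern is not anchored on word boundaries, worldnews ends up in matches alongside hello there.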
-
id¶ The seed itself is used as a primary key to avoid duplicates.
Note
Only true duplicates are detected, so ensure the seed is lowercase and trimmed before insert.
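Since the seed text is the primary key, normalizing it before insertion is what makes near-identical seeds collapse into one entry. A minimal sketch, where normalize_seed is a hypothetical helper and not a swisstext function:

```python
def normalize_seed(raw: str) -> str:
    """Trim surrounding whitespace and lowercase the seed (hypothetical helper)."""
    return raw.strip().lower()


variants = ["Guete Morge", "  guete morge ", "GUETE MORGE"]
# All three variants map to the same primary key once normalized.
seed_ids = {normalize_seed(v) for v in variants}
```

Without this step, each of the three variants would be stored as a distinct seed.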
-
classmethod
mark_deleted(obj, uuid, comment=None)[source]¶ Mark one or multiple seeds as deleted.
- Parameters
obj – either a seed ID (i.e. a string), a MongoSeed instance or a QuerySet of seeds.
uuid – the ID of the user deleting the seed.
comment – an optional comment
-
search_history¶ A list of usages. For each use, we record the date and the number of new URLs found. The sum of the count of all search history entries should be equal to the count variable.
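The relation between count and search_history can be sketched with plain dataclasses. SeedSketch and SearchEntry are hypothetical stand-ins for the mongoengine documents, and updating delta_date inside add_search_history is an assumption of this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class SearchEntry:
    date: datetime
    count: int


@dataclass
class SeedSketch:
    id: str
    count: int = 0
    delta_date: Optional[datetime] = None
    search_history: List[SearchEntry] = field(default_factory=list)

    def add_search_history(self, new_links_count: int) -> None:
        # Record the use, then keep the running total and last-use date in sync.
        now = datetime.now(timezone.utc)
        self.search_history.append(SearchEntry(now, new_links_count))
        self.count += new_links_count
        self.delta_date = now


seed = SeedSketch("guete morge")
for found in (12, 0, 3):
    seed.add_search_history(found)
total = sum(entry.count for entry in seed.search_history)
```

After the three uses, total and seed.count are both 15, which is the invariant stated above.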
-
source¶ Source of the seed. Possible values are:
SourceType.AUTO or SourceType.UNKNOWN: no extra required,
SourceType.USER: the extra is the ID of the user.
-
Sentences collection¶
Classes for interacting with a Swiss German sentence in the MongoDatabase.
Sentences are unique and found automatically using the swisstext crawler (see swisstext.cmd.scraping).
Once a sentence is added to the collection, it is never deleted, except if the URL it comes from is blacklisted.
Using the SwissText Frontend (see swisstext.frontend), users can:
validate the sentence: i.e. mark it as actually Swiss German,
label the sentence: assign a dialect (see DialectInfo) to it,
delete a sentence: this will just add a deleted flag, not actually remove the sentence from the collection.
-
class
swisstext.mongo.abstract.sentences.AbstractMongoSentence(*args, **values)[source]¶ Bases:
mongoengine.document.Document
An abstract mongoengine.Document for a sentence, stored in the sentence collection.
To avoid duplicates, the primary key is a hash derived from the text. We currently use CityHash64, a hashing function designed for hash tables rather than cryptography. Note that the hash is case-sensitive and that the text should be trimmed before hashing (by calling str.strip() for example). There is also no guarantee that there won’t be collisions.
-
classmethod
add_label(obj, uuid, label, **kwargs)[source]¶ Add a dialect label. This method first ensures that the dialect attribute is not null, then forwards the call to DialectInfo.add_label().
- Parameters
obj – either the ID of a sentence (a string) or an actual sentence.
uuid – the ID of the user labelling the sentence.
label – the label (see Dialects)
-
crawl_proba¶ The probability of Swiss German, as calculated by the LID model during crawl.
-
classmethod
create(text, url, proba)[source]¶ Create a sentence. Warning: this won’t save the document automatically. You need to call the .save method to persist it into the database.
-
date_added¶ The date when the sentence was added to the collection, in UTC.
-
deleted¶ A deleted flag (with date, user ID and optional comment) that exists only if the sentence is deleted.
-
dialect¶ All the information about dialect tagging (see DialectInfo). Note that this field can be absent (never labelled by anyone) or its label empty (all labels removed).
-
static
get_hash(text) → str[source]¶ Hash the given text using CityHash64.
-
id¶ The sentence ID, computed by hashing the text using
get_hash().
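The effect of trimming and case on a hash-derived ID can be shown with the standard library. Note loudly that this sketch does NOT use CityHash64: blake2b with an 8-byte digest is swapped in as a stand-in, and both stand_in_hash and its decimal-string output format are assumptions of this example.

```python
import hashlib


def stand_in_hash(text: str) -> str:
    """64-bit digest as a decimal string. NOT CityHash64: blake2b is a stand-in."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return str(int.from_bytes(digest, "big"))


raw = "  Grüezi mitenand  "
# Trimming first gives the same ID as the clean text.
same_id = stand_in_hash(raw.strip()) == stand_in_hash("Grüezi mitenand")
# The hash is case-sensitive, so a lowercase variant gets another ID.
case_differs = stand_in_hash("Grüezi mitenand") != stand_in_hash("grüezi mitenand")
# Forgetting to trim also yields another ID, i.e. an undetected duplicate.
untrimmed_differs = stand_in_hash(raw) != stand_in_hash(raw.strip())
```

All three flags are True here, which is why the text should be trimmed before it is hashed.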
-
classmethod
mark_deleted(obj, uuid, comment=None)[source]¶ Mark a sentence as deleted.
- Parameters
obj – either the ID of a sentence (a string) or an actual sentence.
uuid – the ID of the user deleting the sentence.
comment – an optional comment.
-
classmethod
remove_label(obj, uuid)[source]¶ Remove a label added by a user. This method just forwards the call to DialectInfo.remove_label().
- Parameters
obj – either the ID of a sentence (a string) or an actual sentence.
uuid – the ID of the user for which to remove the label.
-
text¶ The raw text, without any transformations except trim.
-
classmethod
unmark_deleted(obj)[source]¶ Restore a deleted sentence.
- Parameters
obj – either the ID of a sentence (a string) or an actual sentence.
-
url¶ The source URL.
-
property
url_id¶
-
validated_by¶ A list of IDs of the users who have seen the sentence and validated it as Swiss German in the Frontend.
-
-
class
swisstext.mongo.abstract.sentences.DialectEntry(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
Represents a vote, i.e. dialect label, assigned by a user.
-
date¶ The date of the vote, in UTC
-
label¶ The label (see swisstext.mongo.abstract.generic.Dialects for a list of available labels)
-
user¶ The ID of the user
-
-
class
swisstext.mongo.abstract.sentences.DialectInfo(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
A mongoengine.EmbeddedDocument that encapsulates all the dialect-tagging information for a sentence.
Usually, a sentence will be presented to a given user only once. If the user knows the dialect, an entry in labels is added. If the user doesn’t know, their ID is added to skipped_by. Thus, a user ID should be present in at most one of the two lists.
The current dialect is stored in label. To judge how reliable it is, two statistics can be used: count is the total number of users having voted for the label, while confidence is the ratio between the number of votes for this label and the total number of votes.
Warning
Most of the methods here call mongoengine.EmbeddedDocument.save() under the hood, thus persisting the changes automatically to the DB.
This usually works fine, but it means you CANNOT manipulate a DialectInfo instance if it is not attached to a mongoengine.Document.
Moreover, you can sometimes run into a ‘ReferenceError: weakly-referenced object no longer exists’ (only when using Flask-MongoEngine?). If this is the case, ensure that you have a strong reference to the parent document in your code. Here is an example:
# let's say MongoSentence implements AbstractMongoSentence
# This can raise a ReferenceError:
MongoSentence.objects.with_id(sid).dialect.add_label(uuid, label)
# while this is fine:
s = MongoSentence.objects.with_id(sid)
s.dialect.add_label(uuid, label)  # OK
-
add_label(uuid, label)[source]¶ Add a label and save the document. Note the following two special cases:
1. the user has already voted: the old vote will be replaced,
2. the user has previously skipped the sentence: the user ID will be removed from skipped_by.
- Parameters
uuid – the user ID
label – the label
- Returns
self
-
confidence¶ Defined by
count / len(labels). Gives a rough estimate of how “good” a label is.
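As a worked example of these statistics, the current label, its count and the confidence can be derived from a list of votes with collections.Counter. The votes below are made up for illustration; the real field stores DialectEntry objects rather than bare strings.

```python
from collections import Counter

# One label per voting user (illustrative data, tags taken from Dialects).
votes = ["zuric", "zuric", "graub", "zuric", "walli"]

# The current label is the most frequent one; count is its number of votes.
label, count = Counter(votes).most_common(1)[0]
# confidence = count / len(labels)
confidence = count / len(votes)
```

Here zuric wins with 3 of 5 votes, giving a confidence of 0.6. Counter resolves ties by insertion order, whereas the documentation states that a draw is resolved randomly.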
-
count¶ Number of people that voted for the current label.
-
get_label_by(uuid) → str[source]¶ Get the label given by a user.
- Parameters
uuid – the user ID
- Returns
the label as a string, or None if the user hasn’t labelled the sentence.
-
label¶ Current label, i.e. the label with the highest number of votes. In case of a draw, one label is selected randomly. This is why the confidence information is important.
-
labels¶ All the votes, as a list of
DialectEntry.
-
remove_label(uuid)[source]¶ Remove the label from a user and save the change to MongoDB. Note that it does nothing if the user didn’t vote.
- Parameters
uuid – the user ID
- Returns
self
-
skip(uuid)[source]¶ Add the user to the skipped_by list. Note that if the user already voted, their vote will be removed from labels to ensure consistency.
- Parameters
uuid – the user ID
- Returns
self
-
skipped_by¶ A list of user IDs, corresponding to users who have seen the sentence but were not able to label it.
-
unskip(uuid)[source]¶ Remove a user from the skipped_by list.
- Parameters
uuid – the user ID
- Returns
self
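The bookkeeping rules of add_label, remove_label, skip and unskip can be summarized in a database-free sketch. DialectInfoSketch is a hypothetical class: it keeps votes in a dict keyed by user ID instead of a list of DialectEntry, and it never saves anything.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DialectInfoSketch:
    labels: Dict[str, str] = field(default_factory=dict)  # user ID -> label
    skipped_by: List[str] = field(default_factory=list)

    def add_label(self, uuid: str, label: str) -> "DialectInfoSketch":
        self.labels[uuid] = label  # replaces any previous vote by this user
        if uuid in self.skipped_by:
            self.skipped_by.remove(uuid)
        return self

    def remove_label(self, uuid: str) -> "DialectInfoSketch":
        self.labels.pop(uuid, None)  # does nothing if the user never voted
        return self

    def skip(self, uuid: str) -> "DialectInfoSketch":
        self.labels.pop(uuid, None)  # a skip cancels an earlier vote
        if uuid not in self.skipped_by:
            self.skipped_by.append(uuid)
        return self

    def unskip(self, uuid: str) -> "DialectInfoSketch":
        if uuid in self.skipped_by:
            self.skipped_by.remove(uuid)
        return self

    def get_label_by(self, uuid: str) -> Optional[str]:
        return self.labels.get(uuid)


info = DialectInfoSketch()
info.skip("bob").add_label("bob", "zuric").add_label("bob", "graub")
info.add_label("eve", "walli").skip("eve")
```

After these calls bob has a single vote (graub) and is no longer in skipped_by, while eve only appears in skipped_by, so no user ID is present in both structures.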
-
URLs and Blacklist collections¶
Classes for interacting with URLs in the MongoDatabase.
All visited URLs must be recorded in some way in the database, at least to avoid recrawling the same links over and over. Thus, URLs are split into two collections:
urls: stores interesting URLs that could/should be crawled again,
blacklist: stores URLs that should never be visited again.
At the time of writing, swisstext.cmd.scraping defines interesting URLs as URLs with at least one
Swiss German sentence.
Also note that URLs that have never been visited can also be stored in the urls collection. They will be updated or moved to the blacklist after the first visit.
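The rule for choosing between the two collections after a visit can be sketched as a one-line decision. target_collection is a hypothetical helper, and the threshold simply restates the "at least one Swiss German sentence" definition above; the real scraper may apply further criteria.

```python
def target_collection(sg_sentence_count: int) -> str:
    """Return the collection a visited URL belongs to (hypothetical helper)."""
    return "urls" if sg_sentence_count >= 1 else "blacklist"


# Number of Swiss German sentences found on each page (illustrative data).
visits = {"http://example.com/a": 4, "http://example.com/b": 0}
routing = {url: target_collection(n) for url, n in visits.items()}
```

In this sketch the first page stays in urls and the second one is moved to blacklist.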
-
class
swisstext.mongo.abstract.urls.AbstractMongoBlacklist(*args, **values)[source]¶ Bases:
mongoengine.document.Document
An abstract mongoengine.Document for uninteresting URLs, stored in the blacklist collection.
-
classmethod
add_url(url: str, source: swisstext.mongo.abstract.generic.Source = None)[source]¶ Blacklist a URL. This will create and save the entry to mongo automatically.
-
date_added¶ When the URL was added to the blacklist
-
id¶ The url ID, computed by hashing the text using
get_hash().
-
source¶ What/who triggered the blacklisting, see
swisstext.mongo.abstract.generic.Source.
-
url¶ The blacklisted URL, indexed by hash.
-
-
class
swisstext.mongo.abstract.urls.AbstractMongoURL(*args, **values)[source]¶ Bases:
mongoengine.document.Document
An abstract mongoengine.Document for interesting URLs, stored in the urls collection.
-
add_crawl_history(new_sg_count, hash=None, sents_count=None, sg_sents_count=None)[source]¶ Add a crawl history entry. Note that this will update the document instance, but won’t persist the change to mongo. You need to call mongoengine.Document.save() yourself.
- Parameters
new_sg_count – the number of new sentences found on this crawl.
kwargs – see
UrlCrawlMeta
- Returns
self
-
count¶ Total number of new sentences found on this page across all visits.
AbstractMongoURL.count == sum((ch.count for ch in AbstractMongoURL.crawl_history))
-
crawl_history¶ One entry for each visit of this URL by the scraper, ordered by the visit date ascending.
-
classmethod
create(url, source=<Source: Source object>) → mongoengine.document.Document[source]¶ Create a URL. Warning: this won’t save the document automatically. You need to call the .save method to persist it into the database.
-
date_added¶ When the URL was added to the collection, in UTC.
-
delta¶ The number of new sentences found on the last visit (same as
crawl_history[-1].count).
-
delta_date¶ The date of the last visit (same as
crawl_history[-1].date). Absent if the URL was never crawled.
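How count, delta and delta_date relate to crawl_history can be sketched with plain dataclasses. UrlSketch and CrawlEntry are hypothetical stand-ins for AbstractMongoURL and UrlCrawlMeta, and returning None for delta and delta_date before the first crawl is an assumption of this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class CrawlEntry:
    date: datetime
    count: int  # new Swiss German sentences found on this visit
    hash: Optional[str] = None
    sents_count: Optional[int] = None
    sg_sents_count: Optional[int] = None


@dataclass
class UrlSketch:
    url: str
    count: int = 0
    crawl_history: List[CrawlEntry] = field(default_factory=list)

    def add_crawl_history(self, new_sg_count, hash=None,
                          sents_count=None, sg_sents_count=None):
        now = datetime.now(timezone.utc)
        self.crawl_history.append(
            CrawlEntry(now, new_sg_count, hash, sents_count, sg_sents_count))
        self.count += new_sg_count  # keeps count == sum of history counts
        return self

    @property
    def delta(self) -> Optional[int]:
        return self.crawl_history[-1].count if self.crawl_history else None

    @property
    def delta_date(self) -> Optional[datetime]:
        return self.crawl_history[-1].date if self.crawl_history else None


page = UrlSketch("http://example.com")
never_crawled = page.delta_date is None
page.add_crawl_history(5, sents_count=40, sg_sents_count=7)
page.add_crawl_history(2, sents_count=41, sg_sents_count=9)
```

After two visits, page.count is 7 (the sum over the history) and page.delta is 2 (the last visit only).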
-
classmethod
get(url=None, id=None) → mongoengine.document.Document[source]¶ Get a URL Document instance by ID or URL (or None if it does not exist).
-
static
get_hash(text) → str[source]¶ Hash the given text using CityHash64.
-
classmethod
get_never_crawled(**kwargs) → mongoengine.queryset.queryset.QuerySet[source]¶ Get a QuerySet of URLs that have never been visited.
-
id¶ The url ID, computed by hashing the text using
get_hash().
-
source¶ The source of the URL (see Source). Possible sources are defined in SourceType.
-
classmethod
try_delete(url: str = None, id: str = None)[source]¶ Delete a URL if it exists. Otherwise, do nothing silently.
-
url¶ The URL, indexed by hash.
-
-
class
swisstext.mongo.abstract.urls.UrlCrawlMeta(*args, **kwargs)[source]¶ Bases:
swisstext.mongo.abstract.generic.CrawlMeta
Keeps more information on the crawl.
-
sents_count¶ Number of quality sentences found on the page.
-
sg_sents_count¶ Number of quality sentences spotted as GSW (new or not).
-
Users collection¶
Classes for interacting with Users in the MongoDatabase.
A user has a username and a password, as well as a list of possible roles to fine-tune access to the frontend.
-
class
swisstext.mongo.abstract.users.AbstractMongoUser(*args, **values)[source]¶ Bases:
mongoengine.document.Document
-
classmethod
get(uuid: str, password: str = None)[source]¶ Get a user. If password is specified, it is checked against the one in the database (login).
- Parameters
uuid – the username / ID
password – the clear password
- Returns
either a user instance or None if the uuid does not exist or the password was specified and incorrect.
-
id¶ The user ID, which is also the username.
-
password¶ The password, hashed using MD5.
In python:
import hashlib
hashlib.md5(password.encode()).hexdigest()
In Mongo Shell:
hex_md5(password)
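The login behaviour described for get() can be sketched against an in-memory dict instead of the users collection. USERS and get_user are hypothetical names, and the use of hmac.compare_digest for the comparison is a choice of this example, not something the documentation states.

```python
import hashlib
import hmac
from typing import Dict, Optional

# In-memory stand-in for the users collection: user ID -> MD5 hex digest.
USERS: Dict[str, str] = {"alice": hashlib.md5("s3cret".encode()).hexdigest()}


def get_user(uuid: str, password: Optional[str] = None) -> Optional[str]:
    """Return the user ID, or None if unknown or the password is wrong."""
    stored = USERS.get(uuid)
    if stored is None:
        return None
    # The password is only checked when one is given (login).
    if password is not None:
        candidate = hashlib.md5(password.encode()).hexdigest()
        if not hmac.compare_digest(candidate, stored):
            return None
    return uuid


plain_lookup = get_user("alice")
good_login = get_user("alice", "s3cret")
bad_login = get_user("alice", "wrong")
```

Both plain_lookup and good_login return the user ID, while bad_login is None.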
-
MongoEngine-ready classes¶
Implementation of the classes defined in swisstext.mongo.abstract for use with mongoengine
(and not flask-mongoengine).
Example usage:
from mongoengine import connect
from swisstext.mongo.models import *

connect(db='swisstext')  # default host and port: localhost:27017

# Get all URLs in the urls collection
all_urls = MongoURL.objects
all_urls.count()  # the number of documents in the collection

# Get one URL by its address (the ID itself is a hash of the URL)
url = 'http://example.com'
MongoURL.get(url=url)  # returns either a MongoURL or None

# A more complex query:
# get all URLs containing 'wikipedia' that have been crawled at least once,
# with fewer than 10 new sentences found, sorted by last crawl date, descending
MongoURL.objects(
    url__icontains="wikipedia",
    crawl_history__0__exists=True,
    count__lt=10
).order_by('-delta_date')
See also
- Module
swisstext.mongo.abstract
Documentation of all the classes. Simply look for Abstract<Classname>.
- MongoEngine documentation
The MongoEngine documentation, including the API reference.
-
class
swisstext.mongo.models.MongoBlacklist(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.urls.AbstractMongoBlacklist
-
exception
DoesNotExist¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
-
class
swisstext.mongo.models.MongoSeed(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.seeds.AbstractMongoSeed
-
exception
DoesNotExist¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
-
class
swisstext.mongo.models.MongoSentence(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.sentences.AbstractMongoSentence
-
exception
DoesNotExist¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
-
class
swisstext.mongo.models.MongoText(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.text.AbstractMongoText
-
exception
DoesNotExist¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
-
class
swisstext.mongo.models.MongoURL(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.urls.AbstractMongoURL
-
exception
DoesNotExist¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
-
class
swisstext.mongo.models.MongoUser(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.users.AbstractMongoUser
-
exception
DoesNotExist¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-