Abstract Definitions
All definitions live in the swisstext.mongo.abstract package.
This module contains abstract mongoengine.Document definitions for all the objects in the SwissText system database.
When using mongoengine directly, you can use the concrete classes defined in swisstext.mongo.models.
When using Flask-MongoEngine, subclass each abstract class (i.e. the ones prefixed with Abstract) and make it inherit from db.Document as well. For example:
from flask_mongoengine import MongoEngine
db = MongoEngine()
from swisstext.mongo.abstract import AbstractMongoURL
class MongoURL(db.Document, AbstractMongoURL):
    pass
Common structures and embedded documents
Define generic/reused embedded documents as well as constants.
-
class
swisstext.mongo.abstract.generic.
CrawlMeta
(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
Holds a date and a count of items
-
count
¶ Items count, for example the number of new URLs found.
-
date
¶ The creation/added date, in UTC.
-
hash
¶ Optional hash of the text/results/…
-
-
class
swisstext.mongo.abstract.generic.
Deleted
(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
A deleted flag.
-
by
¶ The ID of the user triggering the deletion (required).
-
comment
¶ Optional, not recorded to the DB if not specified
-
date
¶ The datetime, generated automatically on creation in UTC
-
-
swisstext.mongo.abstract.generic.
Dialects
= {'a_l_z': 'Aargau, Luzern, Zug Nord', 'ba_sn': 'Basel, Solothurn Nord', 'bo_fr': 'Berner Oberland, Freiburg', 'g_sgs': 'Glarus und SG Süd', 'graub': 'Graubünden', 'nords': 'Nordostschweiz (SH, TG, SG Nord, AR, AI)', 'ss_bn': 'Solothurn Süd, Bern Nord', 'walli': 'Wallis', 'zentr': 'Zentralschweiz (OW, NW, UR, SZ, ZG Süd)', 'zuric': 'Zürich'}¶ An ordered dictionary of available dialect tags. The dialect tags come from the Master Thesis of Sandra Kellerhals, “Dialektometrische Analyse und Visualisierung von schweizerdeutschen Dialekten auf verschiedenen linguistischen Ebenen”, p. 62 (see category Alle SDS- und SADS- Daten 10 Dialektregionen)
-
class
swisstext.mongo.abstract.generic.
Source
(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
A source field. It contains two pieces of information: type_ (mapped to type in mongo), one of the SourceType values, and extra, an optional piece of extra information.
- If the source is attached to a URL, possible values are:
(SourceType.UNKNOWN, None)
(SourceType.UNKNOWN, "extra info")
(SourceType.USER, "user ID")
(SourceType.SEED, "seed ID")
(SourceType.AUTO, "parent url")
- If the source is attached to a Seed, possible values are:
(SourceType.UNKNOWN, None)
(SourceType.USER, "user ID")
(SourceType.AUTO, None)
- If the source is attached to a _blacklisted_ URL, possible values are:
(SourceType.USER, "user ID")
(SourceType.AUTO, None)
(SourceType.ERROR, "... details ...")
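The combinations listed above can be read as a small validation table. The following is a minimal sketch, not part of swisstext: the SourceType class is a local stand-in that mirrors the documented constants, and the ALLOWED table and is_valid_source helper are hypothetical names introduced only for illustration.

```python
# Local stand-in mirroring the documented SourceType constants (assumption).
class SourceType:
    UNKNOWN = "file"
    USER = "user"
    SEED = "seed"
    AUTO = "auto"
    ERROR = "error"


# Hypothetical table: for each kind of document, the accepted source types and
# whether `extra` is required (True), must be None (False) or is optional (None).
ALLOWED = {
    "url": {
        SourceType.UNKNOWN: None,
        SourceType.USER: True,
        SourceType.SEED: True,
        SourceType.AUTO: True,
    },
    "seed": {
        SourceType.UNKNOWN: False,
        SourceType.USER: True,
        SourceType.AUTO: False,
    },
    "blacklist": {
        SourceType.USER: True,
        SourceType.AUTO: False,
        SourceType.ERROR: True,
    },
}


def is_valid_source(kind, type_, extra=None):
    """Return True if (type_, extra) is one of the documented pairs for `kind`."""
    rules = ALLOWED[kind]
    if type_ not in rules:
        return False
    needs_extra = rules[type_]
    if needs_extra is None:  # extra is optional for this pair
        return True
    return (extra is not None) == needs_extra
```

For example, is_valid_source("url", SourceType.SEED, "seed ID") is accepted, while a Seed with SourceType.SEED is rejected because that type is not listed for seeds.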
-
extra
¶ A unicode string field.
-
type_
¶ A unicode string field.
-
class
swisstext.mongo.abstract.generic.
SourceType
[source]¶ Bases:
object
Possible sources for a URL or a Seed.
-
AUTO
= 'auto'¶ Auto means it has been generated automatically by the system during regular execution
-
ERROR
= 'error'¶ The URL raised an error while scraping (used for blacklisted URLs); source.extra should contain more details.
-
SEED
= 'seed'¶ The URL was found by searching with the seed whose ID is given in source.extra
-
UNKNOWN
= 'file'¶ This is used when the seeds/urls are read from a file
-
USER
= 'user'¶ When user is set, the source.extra should contain the user id
-
Seed collection
Class representing a seed entry in the MongoDatabase.
A seed is a string used as a search engine query in order to find new Swiss German URLs. In the SwissText system, seeds can be added by users or generated automatically. Each seed will potentially be used multiple times.
Seeds are generated automatically in swisstext.cmd.crawling and used in swisstext.cmd.searching. Users can also add seeds manually to the collection using the Frontend (see swisstext.frontend).
-
class
swisstext.mongo.abstract.seeds.
AbstractMongoSeed
(*args, **values)[source]¶ Bases:
mongoengine.document.Document
An abstract mongoengine.Document for a seed, stored in the seeds collection.
-
add_search_history
(new_links_count)[source]¶ Add a search history entry. This should be called after each use of the seed.
- Parameters
new_links_count – the number of new URLs found.
-
count
¶ The number of new URLs found. This counter is incremented on each seed use for any new URL, whether the URL is actually a “good URL” (i.e. really contains Swiss German) or not. This is because determining the quality of a URL requires actually crawling it.
-
classmethod
create
(seed, source=<Source: Source object>)[source]¶ Create a seed. Warning: this won’t save the document automatically. You need to call the .save method to persist it into the database.
- Parameters
seed – the seed
source – the source of the seed, defaults to SourceType.UNKNOWN
-
date_added
¶ When the seed has been added to the collection, in UTC.
-
deleted
¶ This field is only present in the collection if the seed has been deleted. Use the following query to get deleted seeds:
db.seeds.find({deleted: {$exists: true}})
-
delta_date
¶ Date of the last use of this seed in a search.
-
classmethod
find_similar
(seed)[source]¶ Find similar seeds. Currently, this method is quite naive: it searches for seeds containing one or more of the words present in seed, using a regex.
Note
The regex built for a seed like hello world is (hello)|(world), so a seed like worldnews may also be included in the results.
- Parameters
seed – the seed to search for
- Returns
a mongoengine.BaseQuery object for other seeds containing similar words.
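To make the note about worldnews concrete, here is a minimal sketch of how such an alternation regex behaves, using only Python's re module on an in-memory list. The helper build_similarity_regex and the candidate list are hypothetical; the actual query is run by MongoDB and may be built differently.

```python
import re


def build_similarity_regex(seed):
    """Build an alternation such as '(hello)|(world)' from the words of a seed."""
    words = seed.split()
    # re.escape protects words that contain regex metacharacters.
    return "|".join("({})".format(re.escape(word)) for word in words)


pattern = build_similarity_regex("hello world")

# Hypothetical existing seeds; the regex matches substrings, not whole words.
candidates = ["hello there", "worldnews today", "guete morge"]
matches = [c for c in candidates if re.search(pattern, c)]
```

Here "worldnews today" is matched because the pattern has no word boundaries, which is the behaviour the note describes.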
-
id
¶ The seed itself is used as a primary key to avoid duplicates.
Note
Only exact duplicates are detected, so ensure the seed is lowercased and trimmed before insertion.
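Since the seed text doubles as the primary key, a caller would typically normalize it before creating the document. The following is a minimal sketch of such a normalization step; normalize_seed is a hypothetical helper, not a function of the swisstext package.

```python
def normalize_seed(seed):
    """Lowercase the seed and strip surrounding whitespace (hypothetical helper)."""
    return seed.strip().lower()


# Two spellings of the same query collapse to the same key.
key_a = normalize_seed("  Guete Morge ")
key_b = normalize_seed("guete morge")
```

With this step in place, both inputs map to the same ID, so the second insertion would be detected as a duplicate.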
-
classmethod
mark_deleted
(obj, uuid, comment=None)[source]¶ Mark one or multiple seeds as deleted.
- Parameters
obj – either a seed ID (i.e. a string), a MongoSeed instance or a QuerySet of seeds.
uuid – the ID of the user deleting the seed.
comment – an optional comment
-
search_history
¶ A list of usages. For each use, we record the date and the number of new URLs found. The sum of the count values of all search history entries should be equal to the count variable.
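The relationship between count, delta_date and search_history can be sketched with plain dataclasses. This is an illustration under assumptions, not the mongoengine implementation: HistoryEntry and SeedStats are hypothetical stand-ins for the embedded CrawlMeta entries and the seed document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class HistoryEntry:
    """Stand-in for one search history entry: a count and a UTC date."""
    count: int
    date: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class SeedStats:
    """Stand-in for the counters kept on a seed document."""
    count: int = 0
    delta_date: Optional[datetime] = None
    search_history: List[HistoryEntry] = field(default_factory=list)

    def add_search_history(self, new_links_count):
        # Record the use, then keep the aggregate fields in sync with it.
        entry = HistoryEntry(count=new_links_count)
        self.search_history.append(entry)
        self.count += new_links_count
        self.delta_date = entry.date


stats = SeedStats()
for found in (3, 0, 5):
    stats.add_search_history(found)
```

After the three uses, stats.count equals the sum of the entry counts and stats.delta_date is the date of the last entry.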
-
source
¶ Source of the seed. Possible values are:
SourceType.AUTO or SourceType.UNKNOWN: no extra required,
SourceType.USER: the extra is the ID of the user.
-
Sentences collection
Classes for interacting with a Swiss German sentence in the MongoDatabase.
Sentences are unique and found automatically using the swisstext crawler (see swisstext.cmd.scraping).
Once a sentence is added to the collection, it is never deleted, except if the URL it comes from is blacklisted.
Using the SwissText Frontend (see swisstext.frontend), users can:
- validate the sentence: i.e. mark it as actually Swiss German,
- label the sentence: assign a dialect (see DialectInfo) to it,
- delete a sentence: this will just add a deleted flag, not actually remove the sentence from the collection.
-
class
swisstext.mongo.abstract.sentences.
AbstractMongoSentence
(*args, **values)[source]¶ Bases:
mongoengine.document.Document
An abstract mongoengine.Document for a sentence, stored in the sentence collection.
To avoid duplicates, the primary key is a hash derived from the text. We currently use CityHash64, a hashing function designed for hash tables rather than cryptography. Note that the hash is case-sensitive and that the text should be trimmed before hashing (by calling str.strip() for example). There is also no guarantee that there won’t be collisions.
-
classmethod
add_label
(obj, uuid, label, **kwargs)[source]¶ Add a dialect label. This method first ensures that the dialect attribute is not null, then forwards the call to DialectInfo.add_label().
- Parameters
obj – either the ID of a sentence (a string) or an actual sentence.
uuid – the ID of the user labelling the sentence.
label – the label (see Dialects)
-
crawl_proba
¶ The probability of Swiss German, as calculated by the LID model during crawl.
-
classmethod
create
(text, url, proba)[source]¶ Create a sentence. Warning: this won’t save the document automatically. You need to call the .save method to persist it into the database.
-
date_added
¶ The date when the sentence was added to the collection, in UTC.
-
deleted
¶ A deleted flag (with date, user ID and optional comment) that exists only if the sentence is deleted.
-
dialect
¶ All the information about dialect tagging (see DialectInfo). Note that this field can be absent (never labelled by anyone) or its label empty (all labels removed).
-
static
get_hash
(text) → str[source]¶ Hash the given text using CityHash64.
-
id
¶ The sentence ID, computed by hashing the text using
get_hash()
.
-
classmethod
mark_deleted
(obj, uuid, comment=None)[source]¶ Mark a sentence as deleted.
- Parameters
obj – either the ID of a sentence (a string) or an actual sentence.
uuid – the ID of the user deleting the sentence.
comment – an optional comment.
-
classmethod
remove_label
(obj, uuid)[source]¶ Remove a label added by a user. This method just forwards the call to DialectInfo.remove_label().
- Parameters
obj – either the ID of a sentence (a string) or an actual sentence.
uuid – the ID of the user for which to remove the label.
-
text
¶ The raw text, without any transformations except trim.
-
classmethod
unmark_deleted
(obj)[source]¶ Restore a deleted sentence.
- Parameters
obj – either the ID of a sentence (a string) or an actual sentence.
-
url
¶ The source URL.
-
property
url_id
¶
-
validated_by
¶ A list of IDs of the users who have seen the sentence and validated it as Swiss German in the Frontend.
-
class
swisstext.mongo.abstract.sentences.
DialectEntry
(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
Represents a vote, i.e. dialect label, assigned by a user.
-
date
¶ The date of the vote, in UTC
-
label
¶ The label (see
swisstext.mongo.abstract.generic.Dialects
for a list of available labels)
-
user
¶ The ID of the user
-
-
class
swisstext.mongo.abstract.sentences.
DialectInfo
(*args, **kwargs)[source]¶ Bases:
mongoengine.document.EmbeddedDocument
A mongoengine.EmbeddedDocument that encapsulates all the dialect-tagging information for a sentence.
Usually, a sentence will be presented to a given user only once. If the user knows the dialect, an entry in labels is added. If the user doesn’t know, their ID is added to skipped_by. Thus, a user ID should be present in at most one of the two lists.
The current dialect is stored in label. To judge how reliable this label is, two statistics can be used: count is the total number of users having voted for the label, while confidence is the ratio between the number of votes for this label and the total number of votes.
Warning
Most of the methods here call mongoengine.EmbeddedDocument.save() under the hood, thus persisting the changes automatically to the DB.
This usually works fine, but it means you CANNOT manipulate a DialectInfo instance if it is not attached to a mongoengine.Document.
Moreover, you can sometimes run into a ‘ReferenceError: weakly-referenced object no longer exists’ (only when using Flask-Mongoengine?). If this is the case, ensure that you have a strong reference to the parent document in your code. Here is an example:
# let's say MongoSentence implements AbstractMongoSentence
# This can raise a ReferenceError:
MongoSentence.objects.with_id(sid).dialect.add_label(uuid, label)
# while this is fine:
s = MongoSentence.objects.with_id(sid)
s.dialect.add_label(uuid, label)  # OK
-
add_label
(uuid, label)[source]¶ Add a label and save the document. Note the following two special cases:
1. the user has already voted: the old vote will be deleted / replaced,
2. the user has previously skipped the sentence: the user ID will be removed from skipped_by.
- Parameters
uuid – the user ID
label – the label
- Returns
self
-
confidence
¶ Defined by
count / len(labels)
. Gives a rough estimate of how “good” a label is.
-
count
¶ Number of people that voted for the current label.
-
get_label_by
(uuid) → str[source]¶ Get the label given by a user.
- Parameters
uuid – the user ID
- Returns
the label as a string, or None if the user hasn’t labelled the sentence.
-
label
¶ Current label, i.e. the label with the highest number of votes. In case of a draw, one label is selected randomly. This is why the
confidence
information is important.
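The way label, count and confidence relate to the list of votes can be sketched with collections.Counter. This is a minimal illustration, not the library's code: summarize_votes is a hypothetical helper, votes are modelled as (user ID, label) tuples, and ties are resolved by first-seen order here rather than randomly.

```python
from collections import Counter


def summarize_votes(votes):
    """Return (label, count, confidence) for a list of (user_id, label) votes."""
    if not votes:
        return None, 0, 0.0
    tally = Counter(label for _, label in votes)
    # most_common(1) yields the label with the most votes and its vote count.
    label, count = tally.most_common(1)[0]
    return label, count, count / len(votes)


# Dialect tags taken from the Dialects dictionary.
votes = [("u1", "zuric"), ("u2", "zuric"), ("u3", "walli")]
label, count, confidence = summarize_votes(votes)
```

With two of three users voting for zuric, the confidence is 2/3; a confidence of 0.5 with two votes would signal a tie, in which case the stored label carries little weight.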
-
labels
¶ All the votes, as a list of
DialectEntry
.
-
remove_label
(uuid)[source]¶ Remove the label from a user and save the change to MongoDB. Note that it does nothing if the user didn’t vote.
- Parameters
uuid – the user ID
- Returns
self
-
skip
(uuid)[source]¶ Add the user to the skipped_by list. Note that if the user already voted, their vote will be removed from labels to ensure consistency.
- Parameters
uuid – the user ID
- Returns
self
-
skipped_by
¶ A list of user IDs, corresponding to users who have seen the sentence but were not able to label it.
-
unskip
(uuid)[source]¶ Remove a user from the skipped_by list.
- Parameters
uuid – the user ID
- Returns
self
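The bookkeeping rules described for add_label, remove_label, skip and unskip (a user appears in at most one of labels and skipped_by) can be sketched in plain Python. VoteBook is a hypothetical in-memory stand-in: it does not touch MongoDB, and it stores votes in a dictionary rather than as a list of DialectEntry.

```python
class VoteBook:
    """In-memory stand-in illustrating the labels / skipped_by invariant."""

    def __init__(self):
        self.labels = {}      # user ID -> label (one vote per user)
        self.skipped_by = []  # IDs of users who could not label the sentence

    def add_label(self, uuid, label):
        # A user who labels is no longer considered as having skipped.
        if uuid in self.skipped_by:
            self.skipped_by.remove(uuid)
        self.labels[uuid] = label  # replaces any previous vote
        return self

    def remove_label(self, uuid):
        self.labels.pop(uuid, None)  # no-op if the user did not vote
        return self

    def skip(self, uuid):
        self.labels.pop(uuid, None)  # drop an earlier vote for consistency
        if uuid not in self.skipped_by:
            self.skipped_by.append(uuid)
        return self

    def unskip(self, uuid):
        if uuid in self.skipped_by:
            self.skipped_by.remove(uuid)
        return self


book = VoteBook()
book.skip("u1").add_label("u1", "zuric").add_label("u1", "walli")
```

Each method returns self, as the documented methods do, so calls can be chained as in the last line.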
-
URLs and Blacklist collections
Classes for interacting with URLs in the MongoDatabase.
All visited URLs must be recorded in some way in the database, at least to avoid recrawling the same links over and over. Thus, URLs are split into two collections:
- urls: stores interesting URLs that could/should be crawled again,
- blacklist: stores URLs that should never be visited again.
At the time of writing, swisstext.cmd.scraping defines interesting URLs as URLs with at least one Swiss German sentence.
Also note that URLs that have never been visited can also be stored in the urls collection. They will be updated or moved to the blacklist after the first visit.
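The routing between the two collections after a visit can be sketched as follows. This is an illustration only: route_after_visit is a hypothetical function, the collections are modelled as Python sets, and the "at least one Swiss German sentence" rule is the one quoted above for swisstext.cmd.scraping.

```python
def route_after_visit(url, sg_sentence_count, urls, blacklist):
    """Keep an interesting URL in `urls`, otherwise move it to `blacklist`."""
    if sg_sentence_count >= 1:
        urls.add(url)          # interesting: may be crawled again
    else:
        urls.discard(url)      # it may have been stored before its first visit
        blacklist.add(url)     # never visit again


# Both URLs were stored before ever being crawled.
urls = {"http://example.com/a", "http://example.com/b"}
blacklist = set()

route_after_visit("http://example.com/a", 4, urls, blacklist)
route_after_visit("http://example.com/b", 0, urls, blacklist)
```

After the two visits, the first URL remains in urls and the second has moved to blacklist.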
-
class
swisstext.mongo.abstract.urls.
AbstractMongoBlacklist
(*args, **values)[source]¶ Bases:
mongoengine.document.Document
An abstract mongoengine.Document for uninteresting URLs, stored in the blacklist collection.
-
classmethod
add_url
(url: str, source: swisstext.mongo.abstract.generic.Source = None)[source]¶ Blacklist a URL. This will create and save the entry to mongo automatically.
-
date_added
¶ When the URL was added to the blacklist
-
id
¶ The url ID, computed by hashing the text using
get_hash()
.
-
source
¶ What/who triggered the blacklisting, see
swisstext.mongo.abstract.generic.Source
.
-
url
¶ The blacklisted URL, indexed by hash.
-
class
swisstext.mongo.abstract.urls.
AbstractMongoURL
(*args, **values)[source]¶ Bases:
mongoengine.document.Document
An abstract mongoengine.Document for interesting URLs, stored in the urls collection.
-
add_crawl_history
(new_sg_count, hash=None, sents_count=None, sg_sents_count=None)[source]¶ Add a crawl history entry. Note that this will update the document instance, but won’t persist the change to mongo. You need to call mongoengine.Document.save() yourself.
- Parameters
new_sg_count – the number of new sentences found on this crawl.
kwargs – see
UrlCrawlMeta
- Returns
self
-
count
¶ Total number of new sentences found on this page across all visits.
AbstractMongoURL.count == sum((ch.count for ch in AbstractMongoURL.crawl_history))
-
crawl_history
¶ One entry for each visit of this URL by the scraper, ordered by the visit date ascending.
-
classmethod
create
(url, source=<Source: Source object>) → mongoengine.document.Document[source]¶ Create a URL. Warning: this won’t save the document automatically. You need to call the
.save
method to persist it into the database.
-
date_added
¶ When the URL was added to the collection, in UTC.
-
delta
¶ The number of new sentences found on the last visit (same as
crawl_history[-1].count
).
-
delta_date
¶ The date of the last visit (same as
crawl_history[-1].date
). Absent if the URL was never crawled.
-
classmethod
get
(url=None, id=None) → mongoengine.document.Document[source]¶ Get a URL Document instance by ID or URL (or None if it does not exist).
-
static
get_hash
(text) → str[source]¶ Hash the given text using CityHash64.
-
classmethod
get_never_crawled
(**kwargs) → mongoengine.queryset.queryset.QuerySet[source]¶ Get a
QuerySet
of URLs that have never been visited.
-
id
¶ The url ID, computed by hashing the text using
get_hash()
.
-
source
¶ The source of the URL (see Source). Possible sources are defined in SourceType.
-
classmethod
try_delete
(url: str = None, id: str = None)[source]¶ Delete a URL if it exists. Otherwise, silently do nothing.
-
url
¶ The URL, indexed by hash.
-
-
class
swisstext.mongo.abstract.urls.
UrlCrawlMeta
(*args, **kwargs)[source]¶ Bases:
swisstext.mongo.abstract.generic.CrawlMeta
Keep more information on the crawl
-
sents_count
¶ Number of quality sentences found on the page.
-
sg_sents_count
¶ Number of quality sentences spotted as GSW (new or not).
-
Users collection
Classes for interacting with Users in the MongoDatabase.
A user has a username and a password, as well as a list of possible roles to fine-tune access to the frontend.
-
class
swisstext.mongo.abstract.users.
AbstractMongoUser
(*args, **values)[source]¶ Bases:
mongoengine.document.Document
-
classmethod
get
(uuid: str, password: str = None)[source]¶ Get a user. If password is specified, it is checked against the one in the database (login).
- Parameters
uuid – the username / ID
password – the clear password
- Returns
either a user instance or None if the uuid does not exist or the password was specified and incorrect.
-
id
¶ The user ID, which is also the username.
-
password
¶ The password, hashed using MD5.
In Python:
import hashlib
hashlib.md5(password.encode()).hexdigest()
In Mongo Shell:
hex_md5(password)
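The login behaviour described for get (return the user, or None when the ID is unknown or a supplied password does not match) can be sketched with hashlib alone. This is an illustration under assumptions: the users dictionary and the get_user helper are hypothetical stand-ins for the collection and the classmethod.

```python
import hashlib


def hash_password(password):
    """Hash a clear-text password the way the password field documents it."""
    return hashlib.md5(password.encode()).hexdigest()


def get_user(users, uuid, password=None):
    """Return the user ID, or None if unknown or if the password is wrong."""
    stored_hash = users.get(uuid)
    if stored_hash is None:
        return None
    # The password is only checked when one is supplied (login).
    if password is not None and hash_password(password) != stored_hash:
        return None
    return uuid


# Hypothetical in-memory "collection": user ID -> stored MD5 hex digest.
users = {"alice": hash_password("secret")}
```

Calling get_user(users, "alice") returns the user without any check, whereas get_user(users, "alice", "wrong") returns None.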
MongoEngine-ready classes
Implementation of the classes defined in swisstext.mongo.abstract for use with mongoengine (and not flask-mongoengine).
Example usage:
from mongoengine import connect
from swisstext.mongo.models import *
connect(db='swisstext') # default host and port: localhost:27017
# Get all urls in the urls collection
all_urls = MongoURL.objects
all_urls.count() # print the size of the cursor
# Get one url by ID
url = 'http://example.com'
MongoURL.objects.with_id(url) # returns either a MongoURL or None
# A more complex query:
# get all URLs containing 'wikipedia' that have been crawled at least once and
# with fewer than 10 new sentences found, and sort the results by last crawl date, descending
MongoURL.objects(
    id__icontains="wikipedia",
    crawl_history__0__exists=True,
    count__lt=10
).order_by('-delta_date')
See also
- Module swisstext.mongo.abstract
Documentation of all the classes. Simply look for Abstract<Classname>.
- MongoEngine documentation
The MongoEngine documentation, including the API reference.
-
class
swisstext.mongo.models.
MongoBlacklist
(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.urls.AbstractMongoBlacklist
-
exception
DoesNotExist
¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned
¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
class
swisstext.mongo.models.
MongoSeed
(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.seeds.AbstractMongoSeed
-
exception
DoesNotExist
¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned
¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
class
swisstext.mongo.models.
MongoSentence
(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.sentences.AbstractMongoSentence
-
exception
DoesNotExist
¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned
¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
class
swisstext.mongo.models.
MongoText
(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.text.AbstractMongoText
-
exception
DoesNotExist
¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned
¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
class
swisstext.mongo.models.
MongoURL
(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.urls.AbstractMongoURL
-
exception
DoesNotExist
¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned
¶ Bases:
mongoengine.errors.MultipleObjectsReturned
-
class
swisstext.mongo.models.
MongoUser
(*args, **values)[source]¶ Bases:
swisstext.mongo.abstract.users.AbstractMongoUser
-
exception
DoesNotExist
¶ Bases:
mongoengine.errors.DoesNotExist
-
exception
MultipleObjectsReturned
¶ Bases:
mongoengine.errors.MultipleObjectsReturned