TODOs

Todo

add pipeline schema

(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/__init__.py:docstring of swisstext.cmd, line 43.)

Todo

This piece of code (and the command line) has several flaws and should clearly be improved. For example:

Multithreading: currently, a worker stops when the queue is empty. This also means that if you launch the system with only 3 base URLs but 5 workers, 2 workers will exit immediately instead of waiting for new tasks to be added to the queue.

A better way would be to keep track of active workers and stop only when all workers are idle, or when every worker has reached a task with a depth > max depth…
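One way to get this behaviour with the standard library alone is the `Queue.join()` / `task_done()` idiom: workers block on the queue instead of exiting when it is momentarily empty, and the main thread sends one sentinel per worker once every task (including children enqueued along the way) is done. This is a sketch under assumed names (`fetch` returns an iterable of child URLs), not the actual `PipelineWorker` implementation:

```python
import queue
import threading

def crawl(start_urls, num_workers, fetch):
    """Idle-aware worker pool sketch: 5 workers started with only 3 base
    URLs simply block on the queue instead of exiting immediately."""
    tasks = queue.Queue()
    for url in start_urls:
        tasks.put(url)

    def worker():
        while True:
            url = tasks.get()        # block; do not exit on an empty queue
            if url is None:          # sentinel: the crawl is over
                tasks.task_done()
                return
            try:
                for child in fetch(url):
                    tasks.put(child)  # children are enqueued before
            finally:                  # task_done(), so join() cannot
                tasks.task_done()     # return too early

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    tasks.join()              # returns once every queued task is processed
    for _ in threads:         # one sentinel per worker to shut them down
        tasks.put(None)
    for t in threads:
        t.join()
```

Because a child URL is always put on the queue before its parent calls `task_done()`, the unfinished-task count never drops to zero while work remains, so `join()` only returns when the whole crawl tree is exhausted.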

Behaviour on errors: when fetching a URL triggers an error, we currently just log it. Should we also remove/blacklist the URL? Should we allow the URL to fail X times before removal?

(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/pipeline.py:docstring of swisstext.cmd.scraping.pipeline.PipelineWorker.run, line 23.)
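For the "fail X times before removal" question, a minimal retry-budget sketch (the names `MAX_FAILURES`, `record_failure` and the in-memory `Counter` are assumptions for illustration; the real system would track this in its database):

```python
from collections import Counter

MAX_FAILURES = 3     # hypothetical budget: fetch errors allowed per URL

failures = Counter() # per-URL error count (in-memory for this sketch)
blacklist = set()

def record_failure(url: str) -> None:
    """Count a fetch error; blacklist the URL once its budget is spent."""
    failures[url] += 1
    if failures[url] >= MAX_FAILURES:
        blacklist.add(url)

def should_fetch(url: str) -> bool:
    """A URL stays fetchable until it has failed MAX_FAILURES times."""
    return url not in blacklist
```

This keeps transient network hiccups from permanently removing a URL while still dropping hosts that fail consistently.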

Todo

Try using just response.text from requests to get a proper encoding?

(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/bs_crawler.py:docstring of swisstext.cmd.scraping.tools.bs_crawler.BsCrawler, line 14.)
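One possible reading of this todo, based on requests' documented behaviour: `response.text` decodes `response.content` using `response.encoding`, which comes from the `Content-Type` header, while `response.apparent_encoding` runs charset detection on the body. The helper below is an assumption about what the crawler might do, not its actual code:

```python
import requests

def decode_body(resp: requests.Response) -> str:
    """Prefer the header-declared charset; fall back to charset detection
    when the server omitted it (requests would otherwise assume the HTTP
    default ISO-8859-1, which garbles most non-English pages)."""
    if resp.encoding is None or resp.encoding.lower() == "iso-8859-1":
        resp.encoding = resp.apparent_encoding
    return resp.text

# Simulated response carrying UTF-8 bytes with an explicit charset:
resp = requests.Response()
resp._content = "Grüezi mitenand".encode("utf-8")
resp.encoding = "utf-8"
decoded = decode_body(resp)
```

Letting requests pick the encoding this way avoids decoding the raw bytes by hand in the crawler.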

Todo

Find a way to avoid altering the soup object?

(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/bs_crawler.py:docstring of swisstext.cmd.scraping.tools.bs_crawler.BsCrawler.extract_text_blocks, line 8.)
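One option, relying on bs4's support for `copy.copy` on a `BeautifulSoup` object: mutate a copy and leave the caller's soup untouched. A sketch under assumed tag names, not the crawler's actual implementation:

```python
import copy
from bs4 import BeautifulSoup

html = "<html><body><p>Keep me</p><script>noise()</script></body></html>"
soup = BeautifulSoup(html, "html.parser")

clone = copy.copy(soup)       # BeautifulSoup objects support copy.copy
for tag in clone(["script", "style"]):
    tag.decompose()           # strip noise from the copy only

text = clone.get_text(strip=True)        # extracted from the clone
assert soup.find("script") is not None   # the original soup is untouched
```

Copying costs a re-parse of the document, so it trades some speed for leaving the input soup reusable by other pipeline steps.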

Todo

Run more thorough tests to determine whether these heuristics are worth it. If so, make this implementation the default.

(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/bs_crawler.py:docstring of swisstext.cmd.scraping.tools.bs_crawler.CleverBsCrawler, line 14.)

Todo

Find a way to avoid altering the soup object?

(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/bs_crawler.py:docstring of swisstext.cmd.scraping.tools.bs_crawler.CleverBsCrawler.extract_text_blocks, line 8.)

(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/punkt_splitter.py:docstring of swisstext.cmd.scraping.tools.punkt_splitter.PunktSplitter, line 11.)