TODOs
Todo
add pipeline schema
(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/__init__.py:docstring of swisstext.cmd, line 43.)
Todo
This piece of code (and the command line) has many flaws and should clearly be improved. For example:
Multithreading: currently, a worker stops as soon as the queue is empty. This means that if you launch the system with only 3 base URLs but 5 workers, 2 workers will exit immediately instead of waiting for new tasks to be added to the queue.
A better approach would be to keep track of active workers and stop only when all workers are idle, or when all workers have reached a task with a depth > max depth.
Behavior on errors: if fetching the URL triggers an error, we currently just log it. Should we also remove or blacklist the URL? Should we allow the URL to fail X times before removal?
(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/pipeline.py:docstring of swisstext.cmd.scraping.pipeline.PipelineWorker.run, line 23.)
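One possible way to address the multithreading item is sketched below. It is not the code of PipelineWorker: the names `crawl`, `worker`, `process` and `max_depth` are hypothetical, and base URLs are assumed to start at depth 0. Instead of exiting on an empty queue, each worker blocks on `get()`. The main thread waits on `Queue.join()`, which returns only once every queued task (including tasks added by other workers) has been marked done, and then sends one stop marker per worker.

```python
import logging
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit (hypothetical name)


def worker(tasks, process, results, max_depth):
    while True:
        item = tasks.get()  # blocks; an empty queue no longer ends the worker
        try:
            if item is _STOP:
                return
            url, depth = item
            # process() is a stand-in for "fetch the page and return its links"
            for child in process(url):
                if depth + 1 <= max_depth:
                    # enqueue children BEFORE this task is marked done, so the
                    # unfinished-task counter never drops to zero too early
                    tasks.put((child, depth + 1))
            results.append(url)
        except Exception:
            # mirrors the current behavior described above: just log the error
            logging.exception("error while processing %r", item)
        finally:
            tasks.task_done()


def crawl(base_urls, process, n_workers=5, max_depth=2):
    tasks = queue.Queue()
    results = []
    for url in base_urls:
        tasks.put((url, 0))
    threads = [
        threading.Thread(target=worker, args=(tasks, process, results, max_depth))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    tasks.join()  # returns only when all queued work has been processed
    for _ in threads:
        tasks.put(_STOP)  # now it is safe to stop every worker
    for t in threads:
        t.join()
    return results
```

With 3 base URLs and 5 workers, the 2 surplus workers simply wait on `get()` and pick up child URLs as soon as another worker enqueues them. A per-URL failure counter (the "fail X times before removal" idea) could be added in the `except` branch, but is left out of this sketch.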
Todo
Try using just response.text from requests to get a proper encoding?
(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/bs_crawler.py:docstring of swisstext.cmd.scraping.tools.bs_crawler.BsCrawler, line 14.)
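A caveat worth keeping in mind for this idea: `response.text` decodes with `response.encoding`, and requests falls back to ISO-8859-1 for `text/*` responses whose headers declare no charset, which can garble UTF-8 pages. The sketch below is one hedged way to work around that, not the approach used by BsCrawler. The helper name `decode_body` is hypothetical; it only relies on the `content`, `encoding` and `apparent_encoding` attributes that `requests.Response` provides, so a stub object stands in for a real response here.

```python
from types import SimpleNamespace


def decode_body(response):
    """Decode a response body, distrusting the ISO-8859-1 header default."""
    encoding = response.encoding
    if encoding is None or encoding.lower() == "iso-8859-1":
        # apparent_encoding is guessed from the bytes and may be None
        encoding = response.apparent_encoding or "utf-8"
    try:
        return response.content.decode(encoding, errors="replace")
    except LookupError:
        # unknown codec name in the headers: fall back to UTF-8 (assumption)
        return response.content.decode("utf-8", errors="replace")


# Stub mimicking a UTF-8 page served without a charset in its headers.
fake = SimpleNamespace(
    content="Grüezi mitenand".encode("utf-8"),
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
)
text = decode_body(fake)
```

The trade-off is that a page genuinely encoded in ISO-8859-1 is also re-detected from its bytes, so the result is only as good as the detection.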
Todo
Find a way to avoid altering the soup object?
(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/bs_crawler.py:docstring of swisstext.cmd.scraping.tools.bs_crawler.BsCrawler.extract_text_blocks, line 8.)
Todo
Run more thorough tests to determine whether those heuristics are worth it. If so, make this implementation the default.
(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/bs_crawler.py:docstring of swisstext.cmd.scraping.tools.bs_crawler.CleverBsCrawler, line 14.)
Todo
Find a way to avoid altering the soup object?
(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/bs_crawler.py:docstring of swisstext.cmd.scraping.tools.bs_crawler.CleverBsCrawler.extract_text_blocks, line 8.)
Todo
Train a Punkt model for Swiss-German. (https://stackoverflow.com/questions/21160310/training-data-format-for-nltk-punkt)
(The original entry is located in /Users/lin/git/swisstext/backend/swisstext/cmd/scraping/tools/punkt_splitter.py:docstring of swisstext.cmd.scraping.tools.punkt_splitter.PunktSplitter, line 11.)