prepare release 1.6.0 (#111)
* prepare release 1.6.0

* fix setup

* update benchmark

* update evaluation
adbar authored Nov 21, 2023
1 parent 93530b1 commit 964be3c
Showing 7 changed files with 52 additions and 31 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,12 @@
 ## Changelog
 
+
+## 1.6.0
+- focus on precision, stricter extraction patterns (#103, #105, #106, #112)
+- simplified code base (#108, #109)
+- replaced lxml.html.Cleaner (#104)
+- extended evaluation
+
 ## 1.5.2
 - fix for missing months keys in custom extractor (#100)
 - fix for None in `try_date_expr()` (#101)
16 changes: 8 additions & 8 deletions README.rst
@@ -97,17 +97,17 @@ Performance
 -----------
 
 =============================== ========= ========= ========= ========= =======
-500 web pages containing identifiable dates (as of 2022-11-28 on Python 3.8)
+1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
 -------------------------------------------------------------------------------
 Python Package                  Precision Recall    Accuracy  F-Score   Time
 =============================== ========= ========= ========= ========= =======
-articleDateExtractor 0.20       0.769     0.691     0.572     0.728     4x
-date_guesser 2.1.4              0.738     0.544     0.456     0.626     16x
-goose3 3.1.12                   0.821     0.453     0.412     0.584     14x
-htmldate[all] 1.4.0 (fast)      **0.856** 0.921     0.798     0.888     **1x**
-htmldate[all] 1.4.0 (extensive) 0.847     **0.991** **0.840** **0.913** 2.2x
-newspaper3k 0.2.8               0.729     0.630     0.510     0.675     13x
-news-please 1.5.22              0.769     0.691     0.572     0.728     38x
+articleDateExtractor 0.20       0.803     0.734     0.622     0.767     5x
+date_guesser 2.1.4              0.781     0.600     0.514     0.679     18x
+goose3 3.1.17                   0.869     0.532     0.493     0.660     15x
+htmldate[all] 1.6.0 (fast)      **0.883** 0.924     0.823     0.903     **1x**
+htmldate[all] 1.6.0 (extensive) 0.870     **0.993** **0.865** **0.928** 1.7x
+newspaper3k 0.2.8               0.769     0.667     0.556     0.715     15x
+news-please 1.5.35              0.801     0.768     0.645     0.784     34x
 =============================== ========= ========= ========= ========= =======
 
 For complete results and explanations see the `evaluation page <https://htmldate.readthedocs.io/en/latest/evaluation.html>`_.
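
The F-Score column in the table above appears consistent with the usual harmonic mean of precision and recall. The following is a minimal sketch under that assumption; the helper name `f_score` is hypothetical and is not taken from htmldate's evaluation code. Because the published precision and recall values are already rounded, a recomputed score can differ from the table in the last digit.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); assumed definition."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the "htmldate[all] 1.6.0 (fast)" row of the table
print(round(f_score(0.883, 0.924), 3))  # 0.903
```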
31 changes: 23 additions & 8 deletions docs/evaluation.rst
@@ -42,17 +42,17 @@ The results below show that **date extraction is not a completely solved task**
 
 
 =============================== ========= ========= ========= ========= =======
-500 web pages containing identifiable dates (as of 2022-11-28 on Python 3.8)
+1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
 -------------------------------------------------------------------------------
 Python Package                  Precision Recall    Accuracy  F-Score   Time
 =============================== ========= ========= ========= ========= =======
-articleDateExtractor 0.20       0.769     0.691     0.572     0.728     4x
-date_guesser 2.1.4              0.738     0.544     0.456     0.626     16x
-goose3 3.1.12                   0.821     0.453     0.412     0.584     14x
-htmldate[all] 1.4.0 (fast)      **0.856** 0.921     0.798     0.888     **1x**
-htmldate[all] 1.4.0 (extensive) 0.847     **0.991** **0.840** **0.913** 2.2x
-newspaper3k 0.2.8               0.729     0.630     0.510     0.675     13x
-news-please 1.5.22              0.769     0.691     0.572     0.728     38x
+articleDateExtractor 0.20       0.803     0.734     0.622     0.767     5x
+date_guesser 2.1.4              0.781     0.600     0.514     0.679     18x
+goose3 3.1.17                   0.869     0.532     0.493     0.660     15x
+htmldate[all] 1.6.0 (fast)      **0.883** 0.924     0.823     0.903     **1x**
+htmldate[all] 1.6.0 (extensive) 0.870     **0.993** **0.865** **0.928** 1.7x
+newspaper3k 0.2.8               0.769     0.667     0.556     0.715     15x
+news-please 1.5.35              0.801     0.768     0.645     0.784     34x
 =============================== ========= ========= ========= ========= =======
 
 
@@ -72,6 +72,21 @@ Note on the different versions:
 Older Results
 -------------
 
+=============================== ========= ========= ========= ========= =======
+500 web pages containing identifiable dates (as of 2022-11-28 on Python 3.8)
+-------------------------------------------------------------------------------
+Python Package                  Precision Recall    Accuracy  F-Score   Time
+=============================== ========= ========= ========= ========= =======
+articleDateExtractor 0.20       0.769     0.691     0.572     0.728     4x
+date_guesser 2.1.4              0.738     0.544     0.456     0.626     16x
+goose3 3.1.12                   0.821     0.453     0.412     0.584     14x
+htmldate[all] 1.4.0 (fast)      **0.856** 0.921     0.798     0.888     **1x**
+htmldate[all] 1.4.0 (extensive) 0.847     **0.991** **0.840** **0.913** 2.2x
+newspaper3k 0.2.8               0.729     0.630     0.510     0.675     13x
+news-please 1.5.22              0.769     0.691     0.572     0.728     38x
+=============================== ========= ========= ========= ========= =======
+
+
 
 =============================== ========= ========= ========= ========= =======
 500 web pages containing identifiable dates (as of 2022-03-23 on Python 3.8)
2 changes: 1 addition & 1 deletion htmldate/__init__.py
@@ -7,7 +7,7 @@
 __author__ = "Adrien Barbaresi"
 __license__ = "GNU GPL v3"
 __copyright__ = "Copyright 2017-2023, Adrien Barbaresi"
-__version__ = "1.5.2"
+__version__ = "1.6.0"
 
 
 import logging
7 changes: 3 additions & 4 deletions setup.py
@@ -14,8 +14,7 @@
 extras = {
     "speed": [
         "backports-datetime-fromisoformat; python_version < '3.11'",
-        "cchardet >= 2.1.7; python_version < '3.11'",  # build issue
-        "faust-cchardet >= 2.1.19; python_version >= '3.11'",  # fix for build
+        "faust-cchardet >= 2.1.19",
         "urllib3[brotli]",
     ],
 }
@@ -34,7 +33,7 @@ def get_long_description():
 
 def get_version(package):
     "Return package version as listed in `__version__` in `init.py`"
-    initfile = Path(package, "__init__.py").read_text()  # Python >= 3.5
+    initfile = Path(package, "__init__.py").read_text()
     return re.search("__version__ = ['\"]([^'\"]+)['\"]", initfile)[1]
 
 
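
To illustrate what `get_version` extracts, here is a minimal self-contained sketch that applies the same regular expression to an in-memory string instead of reading `__init__.py`. The helper name `parse_version` and the sample text are assumptions made for this example; unlike the original one-liner, the sketch raises an explicit error when no match is found.

```python
import re

# Same pattern as in setup.py: a quoted value assigned to __version__
VERSION_PATTERN = "__version__ = ['\"]([^'\"]+)['\"]"


def parse_version(source: str) -> str:
    """Return the first quoted value assigned to __version__ in source."""
    match = re.search(VERSION_PATTERN, source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match[1]


sample = '__license__ = "GNU GPL v3"\n__version__ = "1.6.0"\n'
print(parse_version(sample))  # 1.6.0
```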
@@ -117,7 +116,7 @@ def get_version(package):
     install_requires=[
         "backports-datetime-fromisoformat; python_version < '3.7'",
         "charset_normalizer >= 3.0.1; python_version < '3.7'",
-        "charset_normalizer >= 3.3.0; python_version >= '3.7'",
+        "charset_normalizer >= 3.3.2; python_version >= '3.7'",
         "dateparser >= 1.1.2",  # 1.1.3+ slower
         "lxml >= 4.9.3 ; platform_system != 'Darwin'",
         "lxml == 4.9.2 ; platform_system == 'Darwin'",
18 changes: 9 additions & 9 deletions tests/comparison.py
@@ -86,17 +86,14 @@ def run_newspaper(htmlstring):
     # throws error on the eval_default dataset
     try:
         myarticle = Article(htmlstring)
-    except (TypeError, UnicodeDecodeError):
-        return None
-    myarticle.html = htmlstring
-    myarticle.download_state = ArticleDownloadState.SUCCESS
-    try:
+        myarticle.html = htmlstring
+        myarticle.download_state = ArticleDownloadState.SUCCESS
         myarticle.parse()
-    except UnicodeEncodeError:
+    except (UnicodeDecodeError, UnicodeEncodeError):
         return None
     if myarticle.publish_date is None or myarticle.publish_date == "":
         return None
-    return convert_date(myarticle.publish_date, "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")
+    return str(myarticle.publish_date)[0:10]
 
 
 def run_newsplease(htmlstring):
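
The new return statement in `run_newspaper` relies on the string form of a `datetime` object beginning with the ISO date. A small sketch of that behavior, assuming `publish_date` is a `datetime.datetime`; the sample value is made up for illustration.

```python
from datetime import datetime

publish_date = datetime(2023, 11, 21, 14, 30, 0)  # made-up sample value

# str() yields "YYYY-MM-DD HH:MM:SS", so the first ten characters are the date
print(str(publish_date)[0:10])  # 2023-11-21

# Equivalent, more explicit formatting
print(publish_date.strftime("%Y-%m-%d"))  # 2023-11-21
```

One plausible benefit of slicing is that it also works when the value carries a UTC offset or is already a string starting with the date, cases a fixed input format string would reject.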
@@ -129,11 +126,14 @@ def run_dateguesser(htmlstring):
 
 def run_goose(htmlstring):
     """try with the goose algorithm"""
-    article = G.extract(raw_html=htmlstring)
+    try:
+        article = G.extract(raw_html=htmlstring)
+    except (AttributeError, UnicodeDecodeError):
+        return None
     if article.publish_date is None:
         return None
-    datematch = re.match(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", article.publish_date)
     try:
+        datematch = re.match(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", article.publish_date)
         return datematch[0]
     # illogical result
     except TypeError:
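
The pattern used in `run_goose` keeps only a leading `YYYY-MM-DD` prefix of the extracted value. A minimal sketch of that step in isolation; the helper name `leading_iso_date` is hypothetical, and it tests for a failed match explicitly instead of catching `TypeError` as the comparison script does.

```python
import re
from typing import Optional

# Same pattern as in run_goose: four digits, two digits, two digits
ISO_DATE_PREFIX = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def leading_iso_date(value: str) -> Optional[str]:
    """Return the YYYY-MM-DD prefix of value, or None if it does not start with one."""
    match = ISO_DATE_PREFIX.match(value)
    return match[0] if match else None


print(leading_iso_date("2023-11-21T10:15:00Z"))  # 2023-11-21
print(leading_iso_date("21 November 2023"))      # None
```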
2 changes: 1 addition & 1 deletion tests/eval-requirements.txt
@@ -1,5 +1,5 @@
 # package
-htmldate>=1.5.0
+htmldate>=1.6.0
 
 # alternatives
 articleDateExtractor==0.20