[Bug 1958539] Re: Move lxml.html.clean into external project
Launchpad Bug Tracker
1958539 at bugs.launchpad.net
Wed Apr 17 17:41:06 UTC 2024
This bug was fixed in the package lxml - 5.2.1-1
---------------
lxml (5.2.1-1) unstable; urgency=medium
* New upstream version.
-- Matthias Klose <doko at debian.org> Wed, 03 Apr 2024 22:07:13 +0200
** Changed in: lxml (Ubuntu)
Status: New => Fix Released
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to lxml in Ubuntu.
https://bugs.launchpad.net/bugs/1958539
Title:
Move lxml.html.clean into external project
Status in lxml:
Fix Released
Status in lxml package in Ubuntu:
Fix Released
Bug description:
Hi,
Recently at Red Hat, we had to backport fixes for multiple
lxml clean_html() security issues to the lxml versions that we
maintain in Fedora and RHEL. It's a "whack-a-mole" game since the
implementation is based on a block list.
Would it be possible to deprecate, or even consider removing, the
clean_html() function and suggest that developers use the bleach
project instead? The bleach project is based on an allow list and so
is safer.
Bleach project: https://github.com/mozilla/bleach
"Bleach is an allowed-list-based HTML sanitizing library that escapes
or strips markup and attributes"
Bleach seems quite popular: https://libraries.io/pypi/bleach says
11.7K repositories depend on it and 586 packages depend on it.
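To illustrate why an allow list is safer than a block list, here is a
minimal sketch using only the Python standard library. It is not how
bleach or lxml are implemented; the class name, the sanitize() helper
and the ALLOWED_* sets are hypothetical and chosen for illustration.
Anything not explicitly listed is dropped, so a new dangerous tag or
attribute (SVG, a form action, and so on) is rejected by default
instead of requiring a new block-list entry.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow list for illustration; not taken from lxml or bleach.
ALLOWED_TAGS = {"p", "b", "i", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = ("http://", "https://")


class AllowListSanitizer(HTMLParser):
    """Toy sanitizer: emit only allow-listed tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped by default
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # unknown attributes are dropped by default
            if name == "href" and not value.strip().lower().startswith(ALLOWED_SCHEMES):
                continue  # only explicitly allowed URL schemes survive
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always re-escaped, so it can never introduce markup.
        self.out.append(escape(data))


def sanitize(html):
    parser = AllowListSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

In this sketch the text inside a dropped element is kept as escaped
plain text; a production sanitizer would need to decide how to treat
it, and should rely on a maintained library rather than on this toy.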
--
In the last 15 months, 3 vulnerabilities have been found in the lxml
clean_html() function:
* 2021-12-12, CVE-2021-43818 (SVG):
https://github.com/lxml/lxml/security/advisories/GHSA-55x5-fj6c-h6m8
* 2021-03-21, CVE-2021-28957 (HTML action attribute):
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28957
* 2020-11-27, CVE-2020-27783 (lxml 4.6.2):
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27783
--
I ran a code search on PyPI top 5000 projects (at 2021-12-01).
I found the following 10 projects which use the lxml clean_html()
function:
* requests-lxml: find() and xpath() use lxml clean_html() if their clean parameter is true (default: clean=False)
* html-telegraph-poster: html_telegraph_poster.converter.clean_article_html() uses lxml clean_html()
* newspaper3k: OutputFormatter.convert_to_html() always calls Parser.clean_article_html() which uses lxml clean_html()
* readability-lxml: Document._parse() uses lxml clean_html()
* jusText: jusText.core.preprocessor() uses lxml clean_html()
* htmldate: htmldate.core.find_date() uses lxml clean_html() with the comment "# clean before string search".
* trafilatura: tree_cleaning() uses lxml clean_html()
* html_text: _cleaned_html_tree() uses lxml clean_html(), function called by cleaned_selector() and extract_text()
* item: HTMLField uses lxml clean_html()
* extruct: LxmlMicrodataExtractor._extract_textContent() uses lxml clean_html()
The "clean_html" code search also found projects which don't use lxml
to clean HTML:
* nltk.util.clean_html() raises NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
* textblob.blob.BaseBlob(clean_html=False) parameter raises an exception if it's true: NotImplementedError("clean_html has been deprecated. To remove HTML markup, use BeautifulSoup's get_text() function")
* django.utils.html.clean_html() undocumented function was removed in Django 1.8. See https://docs.djangoproject.com/en/dev/releases/1.7/ for details (it announces the deprecation).
* The django-html_sanitizer project is based on bleach.
* yt_dlp.utils.clean_html() uses 3 regex replacements and calls its unescapeHTML() function to replace HTML entities using a 4th regex
* recommender-xblock uses bleach.clean()
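
A plain text search for "clean_html" mixes lxml users with unrelated
functions of the same name, as the two lists above show. The following
is a rough sketch of how the two could be told apart with the standard
ast module. It is not the tool used for the search described above;
the function name uses_lxml_clean_html() is hypothetical, and the
check is only a heuristic (a module that imports lxml and defines its
own clean_html would be a false positive).

```python
import ast


def uses_lxml_clean_html(source):
    """Heuristic: does this module import lxml and reference clean_html?"""
    tree = ast.parse(source)
    imports_lxml = False
    references_clean_html = False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # e.g. "import lxml.html.clean"
            if any(alias.name.split(".")[0] == "lxml" for alias in node.names):
                imports_lxml = True
        elif isinstance(node, ast.ImportFrom):
            # e.g. "from lxml.html.clean import clean_html"
            if (node.module or "").split(".")[0] == "lxml":
                imports_lxml = True
                if any(alias.name == "clean_html" for alias in node.names):
                    references_clean_html = True
        elif isinstance(node, ast.Attribute) and node.attr == "clean_html":
            # e.g. "lxml.html.clean.clean_html(doc)" or "cleaner.clean_html(doc)"
            references_clean_html = True
        elif isinstance(node, ast.Name) and node.id == "clean_html":
            # e.g. a bare "clean_html(doc)" call
            references_clean_html = True
    return imports_lxml and references_clean_html
```

For example, a module containing "from lxml.html.clean import
clean_html" is reported, while a module that only defines its own
clean_html() stub is not.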
To manage notifications about this bug go to:
https://bugs.launchpad.net/lxml/+bug/1958539/+subscriptions