[Bug 1958539] Re: Move lxml.html.clean into external project

Wed Apr 17 17:41:06 UTC 2024

This bug was fixed in the package lxml - 5.2.1-1

---------------
lxml (5.2.1-1) unstable; urgency=medium

  * New upstream version.

 -- Matthias Klose <doko at debian.org>  Wed, 03 Apr 2024 22:07:13 +0200

** Changed in: lxml (Ubuntu)
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to lxml in Ubuntu.
https://bugs.launchpad.net/bugs/1958539

Title:
  Move lxml.html.clean into external project

Status in lxml:
  Fix Released
Status in lxml package in Ubuntu:
  Fix Released

Bug description:
  Hi,

  Recently at Red Hat, we had to fix (by backporting the changes) multiple
  lxml clean_html() security issues in the lxml versions that we
  maintain in Fedora and RHEL. It's a "whack-a-mole" game since the
  implementation is based on a block list.

  Would it be possible to deprecate, or even consider removing, the
  clean_html() function and suggest that developers use the bleach
  project instead? The bleach project is based on an allow list and so
  is safer.

  Bleach project: https://github.com/mozilla/bleach

  "Bleach is an allowed-list-based HTML sanitizing library that escapes
  or strips markup and attributes"

  Bleach seems quite popular: https://libraries.io/pypi/bleach says
  11.7K repositories depend on it and 586 packages depend on it.
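
  To illustrate why an allow list is safer than a block list, here is a
  minimal sketch using only the Python standard library. It is not
  bleach's or lxml's implementation; the names ALLOWED_TAGS,
  ALLOWED_ATTRS, ALLOWED_SCHEMES, AllowListSanitizer and sanitize() are
  hypothetical and chosen for this example only. Anything not explicitly
  listed is dropped, so a newly discovered vector (an SVG element, an
  event handler attribute) is rejected by default rather than needing a
  new block-list entry.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow list for illustration; these are not bleach's defaults.
ALLOWED_TAGS = {"a", "b", "em", "i", "p", "strong"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = ("http://", "https://", "mailto:")


class AllowListSanitizer(HTMLParser):
    """Re-emit only the tags and attributes named in the allow list."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag itself, keep nothing of it
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # unknown attribute (e.g. onclick): drop it
            if name == "href" and not value.strip().lower().startswith(ALLOWED_SCHEMES):
                continue  # unknown URL scheme (e.g. javascript:): drop it
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.parts.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always escaped again, so it can never reintroduce markup.
        self.parts.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi <a href="javascript:alert(1)">there</a></p>'))
```

  This sketch does not balance tags or handle every parsing edge case,
  so it should not be used as a real sanitizer; it only shows the
  default-deny design that an allow list gives.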

  --

  In the last 15 months, 3 vulnerabilities have been found in the lxml
  clean_html() function:

  * 2021-12-12, CVE-2021-43818 (SVG):
    https://github.com/lxml/lxml/security/advisories/GHSA-55x5-fj6c-h6m8
  * 2021-03-21, CVE-2021-28957 (HTML action attribute):
    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28957
  * 2020-11-27, CVE-2020-27783 (lxml 4.6.2):
    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27783

  --

  I ran a code search on the PyPI top 5000 projects (as of 2021-12-01).

  I found the following 10 projects which use the lxml clean_html()
  function:

  * requests-lxml: find() and xpath() use lxml clean_html() if their clean parameter is true (default: clean=False)
  * html-telegraph-poster: html_telegraph_poster.converter.clean_article_html() uses lxml clean_html()
  * newspaper3k: OutputFormatter.convert_to_html() always calls Parser.clean_article_html() which uses lxml clean_html()
  * readability-lxml: Document._parse() uses lxml clean_html()
  * jusText: jusText.core.preprocessor() uses lxml clean_html()
  * htmldate: htmldate.core.find_date() uses lxml clean_html() with the comment "# clean before string search".
  * trafilatura: tree_cleaning() uses lxml clean_html()
  * html_text: _cleaned_html_tree() uses lxml clean_html(), function called by cleaned_selector() and extract_text()
  * item: HTMLField uses lxml clean_html()
  * extruct: LxmlMicrodataExtractor._extract_textContent() uses lxml clean_html()

  The "clean_html" code search also found projects which don't use lxml
  to clean HTML:

  * nltk.util.clean_html() raises NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
  * textblob.blob.BaseBlob(clean_html=False) parameter raises an exception if it's true: NotImplementedError("clean_html has been deprecated. To remove HTML markup, use BeautifulSoup's get_text() function")
  * django.utils.html.clean_html() undocumented function was removed in Django 1.8. See https://docs.djangoproject.com/en/dev/releases/1.7/ for details (it announces the deprecation).
  * The django-html_sanitizer project is based on bleach.
  * yt_dlp.utils.clean_html() uses 3 regex replacements and calls its unescapeHTML() function to replace HTML entities using a 4th regex
  * recommender-xblock uses bleach.clean()
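
  Several of the entries above are about removing markup to get plain
  text rather than about producing safe HTML. As a rough sketch of that
  kind of markup removal, here is a standard-library-only example. It
  is not the code of nltk, textblob, yt_dlp or BeautifulSoup; the names
  TextExtractor and strip_markup() are hypothetical, and skipping the
  contents of script and style elements is an assumption of this
  example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and script/style contents."""

    SKIPPED = {"script", "style"}  # assumption: their contents are not text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_markup(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    print(strip_markup("<p>Hello <b>world</b> &amp; all</p>"))
```

  The result is plain text, not sanitized HTML: it must still be
  escaped before being placed back into an HTML page.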
