- Removed superfluous debug prints.
- `Cleaner()` now scans for hidden JavaScript code embedded within CSS comments. In certain contexts, such as within `<svg>` or `<math>` tags, `<style>` tags may lose their intended function, allowing comments like `/* foo */` to potentially be executed by the browser. If suspicious content is detected, only the comment is removed.
- Do not parse URL addresses when it is not necessary.
- Parsing of URL addresses has been enhanced and Cleaner removes ambiguous URLs.
- sdist now includes all test files and changelog.
- Memory efficiency is now much better for HTML pages where cleaner removes a lot of elements. (#14)
- ASCII control characters (except HT, VT, CR and LF) are now removed from string inputs before they're parsed by lxml/libxml2.
- Regular expression for image data URLs now supports multiple data URLs on a single line.
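
As a rough illustration of the control-character entry above, the filtering could look like the sketch below. The pattern and the helper name `strip_control_chars` are assumptions made for this example, not the library's actual code; in particular, counting DEL (0x7F) among the removed characters is an assumption.

```python
import re

# Assumed pattern: C0 control characters plus DEL,
# minus HT (\x09), LF (\x0a), VT (\x0b) and CR (\x0d).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    """Return text with the matched ASCII control characters removed."""
    return _CONTROL_CHARS.sub("", text)


# NUL and form feed are dropped; the tab and the newline survive.
cleaned = strip_control_chars("<p>a\x00b\tc\x0cd</p>\n")
print(repr(cleaned))
```

Filtering before parsing, rather than after, means the parser never sees the characters that could confuse it.
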
First official release of the split project.
This part contains releases of the lxml project with important changes related to HTML Cleaner functionality.
- The HTML `Cleaner()` interpreted an accidentally provided string parameter for the `host_whitelist` as a list of characters and silently failed to reject any hosts. Passing a non-collection is now rejected.
- A memory leak in `lxml.html.clean` was resolved by switching to Cython 0.29.34+.
- URL checking in the HTML cleaner was improved. Patch by Tim McCormack.
- A vulnerability (GHSL-2021-1038) in the HTML cleaner allowed sneaking script content through SVG images (CVE-2021-43818).
- A vulnerability (GHSL-2021-1037) in the HTML cleaner allowed sneaking script content through CSS imports and other crafted constructs (CVE-2021-43818).
- A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript to pass through. The cleaner now removes the HTML5 `formaction` attribute.
- A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.
- A vulnerability was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.
- `Cleaner()` now validates that only known configuration options can be set.
- `Cleaner.clean_html()` discarded comments and PIs regardless of the corresponding configuration option, if `remove_unknown_tags` was set.
- JavaScript URLs that used URL escaping were not removed by the HTML cleaner. Security problem found by Omar Eissa. (CVE-2018-19787)
- The modules `lxml.builder`, `lxml.html.diff` and `lxml.html.clean` are also compiled using Cython in order to speed them up.
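
To make the URL-escaping entry above (CVE-2018-19787) more concrete, here is a minimal sketch of why a scheme check has to run on the decoded URL. The helper `is_script_url` and the scheme list are hypothetical and written only for this illustration; they are not the cleaner's actual implementation.

```python
from urllib.parse import unquote

# Schemes treated as script-carrying in this sketch (an assumption, not an exhaustive list).
_SCRIPT_SCHEMES = ("javascript:", "vbscript:")


def is_script_url(url: str) -> bool:
    """Report whether url starts with a script scheme once percent-escapes are decoded."""
    decoded = unquote(url)
    # Drop all whitespace so that schemes broken up by spaces or newlines are caught too.
    compact = "".join(decoded.split()).lower()
    return compact.startswith(_SCRIPT_SCHEMES)


# A check on the raw string would miss the first URL, because "%61" hides the letter "a".
print(is_script_url("j%61vascript:alert(1)"))
print(is_script_url("https://example.com/"))
```

Decoding first and comparing second is the point of the sketch: a check against the raw string sees `j%61vascript:` and lets it through.
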