Discuss image strategy #25

rgaudin · 2021-08-23T09:45:05Z

wikiHow uses several image formats throughout the website:

PNG and JPEG obviously
SVG
WebP

SVG

SVG is a text-based vector format for which we have no optimizer (yet) in scraperlib. Unlike bitmap formats, vector formats are expected to scale without quality loss. We can thus assume SVG users expect an unaltered image at any size, whatever the source image sizes are.

Currently, we treat them specially and simply upload them to S3 without change.

Options:

Use source, don't store on S3. Don't like this option because wikiHow websites are slow and requests might be throttled. Not using our cache is high-risk.
Use source, uploaded to S3. What the current code does. Kind of defeats the purpose of an optimization cache.
Convert to WebP. While not direct, there are ways to convert SVG to a bitmap (PNG) and then to WebP. Too risky, as the sizes declared in the SVG are frequently unrelated to the sizes actually used, because of the format's scalability. This would decrease the rendered quality in many scenarios. Not sure it would make sense size-wise.
Optimize SVG losslessly. There are some SVG optimizers around. They usually start by removing the verbose clutter many editors add to the source. Others also include simplifications of drawings, both lossless and destructive.

This last option, optimizing SVG using a lossless tool and uploading to S3, feels like the most appropriate. Even if the source SVGs are already optimized (haven't checked), we'd still benefit from the cache. It stays in line with our objectives and is generic enough to be replicated elsewhere.

svgo probably most popular (node)
scour (python)
svgcleaner cleaning-only (rust)
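To illustrate the "removing the verbose clutter" step mentioned above, here is a minimal stdlib-only sketch. It is not a substitute for any of the tools listed and does not reflect how they work internally; the function name `strip_editor_clutter` and the list of editor namespaces are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Namespaces some editors add to their output (assumed, non-exhaustive list).
EDITOR_NAMESPACES = (
    "http://www.inkscape.org/namespaces/inkscape",
    "http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd",
)
EDITOR_PREFIXES = tuple(f"{{{ns}}}" for ns in EDITOR_NAMESPACES)


def strip_editor_clutter(svg_text: str) -> str:
    """Drop comments, <metadata> and editor-specific elements and attributes.

    Hypothetical helper: drawing elements are left untouched, so the
    rendering should not change.
    """
    # Serialize SVG elements without a prefix (<svg> rather than <ns0:svg>).
    ET.register_namespace("", SVG_NS)
    # The default parser discards XML comments while building the tree.
    root = ET.fromstring(svg_text)
    for parent in list(root.iter()):
        for child in list(parent):
            is_metadata = child.tag == f"{{{SVG_NS}}}metadata"
            if is_metadata or child.tag.startswith(EDITOR_PREFIXES):
                parent.remove(child)
        for name in list(parent.attrib):
            if name.startswith(EDITOR_PREFIXES):
                del parent.attrib[name]
    return ET.tostring(root, encoding="unicode")
```

A real optimizer goes much further (number precision, path data, unused definitions), which is why relying on an existing tool seems preferable to maintaining something like this.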

WebP

Our goal for using WebP is to optimize source bitmaps, as WebP is better in most cases. Having a source WebP might indicate that optimization was already a concern on the source website. It is important to note that wikiHow is not just serving WebP files, it is serving alternatives as well. A typical WebP image is represented as:

<img data-src="some.webp" data-nowebp="some.jpeg" />
<noscript><img src="some.jpeg" /></noscript>

wikiHow is not trying to polyfill but relies on JS to detect WebP support and adjust accordingly, defaulting to JPEG for users without JS enabled. They thus maintain two copies of each of those images.

Current code looks for the WebP URL and passes that to our pipeline, which means it is re-optimized and uploaded.
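For reference, the markup above can be read with the standard library alone. This is only a sketch of pulling out the WebP URL and its non-WebP alternative, not the scraper's actual code; `ImageCollector` and `extract_images` are hypothetical names.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (webp_url, fallback_url) pairs from lazy-loaded <img> tags."""

    def __init__(self):
        super().__init__()
        self.images = []
        self._in_noscript = False

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            # The <noscript> copy only repeats the non-WebP alternative.
            self._in_noscript = True
        elif tag == "img" and not self._in_noscript:
            attributes = dict(attrs)
            webp_url = attributes.get("data-src")
            if webp_url:
                self.images.append((webp_url, attributes.get("data-nowebp")))

    def handle_endtag(self, tag):
        if tag == "noscript":
            self._in_noscript = False


def extract_images(html: str) -> list:
    """Return every (data-src, data-nowebp) pair found in the HTML."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.images
```

Keeping both URLs around means any of the options below can be applied later without parsing the page again.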

What should we do with WebP?

Use non-WebP alt and re-encode/upload?
Use WebP and re-encode/upload (current behavior)?
Use WebP from source URL (no upload)?
Use WebP and upload without re-encode?

FYI, here's an example of an image that was barely readable and is now unreadable after our re-encoding. Probably an edge case though

@Kelson, your input is requested on this

kelson42 · 2021-08-23T10:42:56Z

Is there any reason to have a different strategy than for our other (recent) scrapers? To me the only thing which is not obvious might be SVG handling.

rgaudin · 2021-08-23T10:43:30Z

Please read the ticket

rgaudin · 2021-08-23T11:05:42Z

After discussing this with @kelson42, here's what we have decided:

SVG: we upload the source for now on S3 (no change).
python-scraperlib#80 (Add SVG optimizer) will add an optimizer at some point. Once done, we'll use it.
WebP: re-encode and upload (no change)
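As a rough sketch, the decision above amounts to a per-format dispatch. The function name and the returned labels are hypothetical, and PNG/JPEG are left to a generic "default" label since they are not covered by this decision.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def image_strategy(url: str) -> str:
    """Return a (hypothetical) label for how an image URL should be handled."""
    # Look only at the path so query strings do not hide the extension.
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    if suffix == ".svg":
        # Uploaded unchanged until an SVG optimizer is available.
        return "upload-as-is"
    if suffix == ".webp":
        # Re-encoded through the usual pipeline, then uploaded.
        return "reencode-and-upload"
    # Other formats are not part of this decision.
    return "default"
```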

rgaudin added the question label Aug 23, 2021

rgaudin mentioned this issue Aug 23, 2021

Handle math images correctly #26

Closed

rgaudin closed this as completed Aug 23, 2021
