Discuss image strategy #25

Closed
rgaudin opened this issue Aug 23, 2021 · 3 comments
Labels
question Further information is requested

Comments

@rgaudin
Member

rgaudin commented Aug 23, 2021

wikiHow uses several image formats across the website:

  • PNG and JPEG obviously
  • SVG
  • WebP

SVG

SVG is a text-based vector format for which we have no optimizer (yet) in scraperlib. Unlike bitmap formats, vector formats are expected to scale without quality loss. We can thus assume that SVG users expect an unaltered image at any size, whatever the dimensions declared in the source image.

Currently, we treat them as a special case and simply upload them to S3 unchanged.

Options:

  • Use source, don't store on S3. Don't like this option because wikiHow websites are slow and requests might be throttled. Not using our cache is high-risk.
  • Use source, uploaded to S3. What the current code does. Kind of defeats the purpose of an optimization cache.
  • Convert to WebP. While not direct, there are ways to convert SVG to a bitmap (PNG) and then to WebP. Too risky, as the sizes declared inside the SVG are frequently unrelated to the sizes at which it is displayed, precisely because SVG scales. This would decrease the rendered quality in many scenarios. Not sure it would make sense size-wise either.
  • Optimize SVG losslessly. There are some SVG optimizers around. They usually start by removing the verbose clutter many editors add to the source. Others also simplify the drawings, either losslessly or destructively.

This last option, optimizing SVG with a lossless tool and uploading the result to S3, feels like the most appropriate. Even if source SVGs are already optimized (haven't checked), we'd benefit from the cache. It stays in line with our objectives and is generic enough to be replicated elsewhere.
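To make the "remove editor clutter" idea concrete, here is a minimal sketch using only the Python standard library. It is not the scraperlib optimizer (which does not exist yet) and not a replacement for a real SVG optimizer. The function name `strip_editor_clutter` and the list of editor namespaces are assumptions chosen for illustration. It only drops comments, `<metadata>` elements, and elements or attributes in the listed editor namespaces, none of which should affect rendering.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Namespaces some editors use for their own bookkeeping.
# Illustrative list only (assumption), certainly not exhaustive.
EDITOR_NAMESPACES = (
    "http://www.inkscape.org/namespaces/inkscape",
    "http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd",
)


def _in_editor_ns(name: str) -> bool:
    """True if a tag or attribute name belongs to an editor namespace."""
    return any(name.startswith("{" + ns + "}") for ns in EDITOR_NAMESPACES)


def strip_editor_clutter(svg_text: str) -> str:
    """Hypothetical helper: remove comments, metadata and editor-only data."""
    # Serialize SVG elements without an "ns0:" prefix.
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("xlink", XLINK_NS)

    # The default ElementTree parser already discards XML comments.
    root = ET.fromstring(svg_text)

    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "{" + SVG_NS + "}metadata" or _in_editor_ns(child.tag):
                parent.remove(child)
        for attr in [a for a in parent.attrib if _in_editor_ns(a)]:
            del parent.attrib[attr]

    return ET.tostring(root, encoding="unicode")
```

Drawing elements, `width`, `height` and `viewBox` are left untouched, so the image should still scale as before. Anything beyond this (path simplification, precision reduction) is where optimizers start to become destructive.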

WebP

Our goal in using WebP is to optimize source bitmaps, as WebP generally yields smaller files than PNG or JPEG. Having a source WebP might indicate that optimization was already a concern on the source website. It is important to note that wikiHow is not just serving WebP files; it is serving alternatives as well. A typical WebP image is represented as:

<img data-src="some.webp" data-nowebp="some.jpeg" />
<noscript><img src="some.jpeg" /></noscript>

wikiHow is not trying to polyfill but relies on JS to detect WebP support and adjust accordingly, defaulting to JPEG for users without JS enabled. They thus maintain two copies of each of those images.
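As a rough illustration of how both URLs can be read from that markup, here is a sketch using the standard library's `html.parser`. The class and function names (`LazyImageParser`, `find_image_sources`) are made up for this example and are not the scraper's actual code. It only assumes the `data-src` / `data-nowebp` attribute pair shown above.

```python
from html.parser import HTMLParser


class LazyImageParser(HTMLParser):
    """Collect the WebP URL and its non-WebP fallback for each lazy image."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        webp_url = attributes.get("data-src")
        # The <noscript> copy only has a plain src attribute, so it is skipped.
        if webp_url:
            self.images.append(
                {"webp": webp_url, "fallback": attributes.get("data-nowebp")}
            )


def find_image_sources(html: str) -> list:
    """Hypothetical helper returning one dict per lazy-loaded image."""
    parser = LazyImageParser()
    parser.feed(html)
    parser.close()
    return parser.images
```

Having both URLs at hand is what makes the options below possible: the pipeline can be fed either the WebP or the JPEG alternative.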

The current code looks for the WebP URL and passes it to our pipeline, which means the image is re-optimized and then uploaded.

What should we do with WebP?

  • Use the non-WebP alternative and re-encode/upload?
  • Use the WebP and re-encode/upload (current behavior)?
  • Use the WebP from the source URL (no upload)?
  • Use the WebP and upload without re-encoding?

FYI, here's an example of an image that was barely readable and is now unreadable after our re-encoding. Probably an edge case though

[Image: Screen Shot 2021-08-23 at 09 39 47]

[Image: Screen Shot 2021-08-19 at 19 27 53]

@Kelson, your input is requested on this

rgaudin added the question label Aug 23, 2021
@kelson42
Contributor

kelson42 commented Aug 23, 2021

Is there any reason to have a different strategy than for our other (recent) scrapers? To me, the only thing that is not obvious might be SVG handling.

@rgaudin
Member Author

rgaudin commented Aug 23, 2021

Please read the ticket

@rgaudin
Member Author

rgaudin commented Aug 23, 2021

After discussing this with @kelson42, here's what we have decided:

  • SVG: we upload the source for now on S3 (no change).
  • python-scraperlib#80 ("Add SVG optimizer") will add an optimizer at some point. Once done, we'll use it.
  • WebP: re-encode and upload (no change)
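For reference, the decision above can be sketched as a small dispatch on the detected format. This is only an illustration: the names `Action`, `sniff_format` and `decide` are hypothetical, and treating PNG and JPEG the same way as WebP is an assumption based on the stated goal of optimizing source bitmaps, not something spelled out in the decision. The magic-byte checks are deliberately simple.

```python
from enum import Enum


class Action(Enum):
    UPLOAD_UNCHANGED = "upload source to S3 as-is"
    REENCODE_AND_UPLOAD = "re-encode to WebP, then upload to S3"


def sniff_format(data: bytes) -> str:
    """Guess the image format from its first bytes (simplified)."""
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if b"<svg" in data[:1024].lower():
        return "svg"
    return "unknown"


def decide(data: bytes) -> Action:
    """Map an image payload to the handling agreed on in this ticket."""
    fmt = sniff_format(data)
    if fmt == "svg":
        # No SVG optimizer yet: keep the source untouched.
        return Action.UPLOAD_UNCHANGED
    if fmt in ("webp", "png", "jpeg"):
        # Bitmaps go through the usual optimization pipeline (assumption
        # for PNG/JPEG; the decision explicitly covers WebP).
        return Action.REENCODE_AND_UPLOAD
    raise ValueError("unsupported image format")
```

Once an SVG optimizer exists, only the `svg` branch would need to change.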

@rgaudin rgaudin closed this as completed Aug 23, 2021