[SITES] www.mprnews.org #644

palfrey · 2024-07-14T16:50:37Z

First, please check that it is really an issue with the library and not a special case of the website:

There is no paywall
You do not have to be logged in to see the articles
You tried using a common browser user agent in your configuration / call
The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://www.mprnews.org

Some sample URLs that I have tried:

https://www.mprnews.org/story/2024/07/09/new-minnesota-state-fair-foods
https://www.mprnews.org/story/2024/07/14/severe-storms-barrel-across-minnesota-overnight-leaving-thousands-without-power

The exact code I used to test these articles/website:

I made a script called can_parse.py and ran it against current master with each of the URLs as an argument. It might be worth adding to the repository as a test script.

import sys

from newspaper.article import Article

url = sys.argv[1]
article = Article(url, fetch_images=False, follow_meta_refresh=True)
article.download()
article.parse()

Other information, remarks, messages, etc:

Traceback (most recent call last):
  File "/home/palfrey/src/newspaper4k/can_parse.py", line 8, in <module>
    article.parse()
  File "/home/palfrey/src/newspaper4k/newspaper/article.py", line 466, in parse
    authors = self.extractor.get_authors(self.doc)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palfrey/src/newspaper4k/newspaper/extractors/content_extractor.py", line 59, in get_authors
    return self.author_extractor.parse(doc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palfrey/src/newspaper4k/newspaper/extractors/authors_extractor.py", line 99, in parse
    if "@graph" in script_tag:
       ^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable
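
The traceback suggests that script_tag is None by the time the "@graph" membership test runs. Below is a minimal sketch of the kind of guard that would avoid the crash. It assumes script_tag is the result of decoding the text of a JSON-LD script tag, which I have not verified against the library source or against what mprnews.org actually serves. The helper name authors_from_json_ld is hypothetical and is not part of newspaper4k.

```python
import json


def authors_from_json_ld(raw_json):
    """Hypothetical helper (not newspaper4k code): extract author names from
    the text of one <script type="application/ld+json"> tag."""
    try:
        script_tag = json.loads(raw_json)
    except (TypeError, ValueError):
        # Missing, empty, or malformed script body.
        return []

    # json.loads("null") returns None, and '"@graph" in None' raises the
    # TypeError shown in the traceback. Anything that is not a JSON object
    # is skipped here, which is a conservative assumption.
    if not isinstance(script_tag, dict):
        return []

    # Some sites nest their items under "@graph"; others use a flat object.
    items = script_tag["@graph"] if "@graph" in script_tag else [script_tag]
    if not isinstance(items, list):
        items = [items]

    authors = []
    for item in items:
        if not isinstance(item, dict):
            continue
        author = item.get("author")
        if author is None:
            continue
        # "author" may be a single entry or a list of entries.
        for entry in author if isinstance(author, list) else [author]:
            name = entry.get("name") if isinstance(entry, dict) else entry
            if isinstance(name, str) and name.strip():
                authors.append(name.strip())
    return authors
```

With a guard like this, a script tag that decodes to null yields an empty author list instead of aborting article.parse(). Whether the real fix belongs at this point or earlier, where the script tag is decoded, depends on the actual extractor code.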

palfrey added the "sites not working" label Jul 14, 2024
