[SITES] www.mprnews.org #644

Open
4 tasks done
palfrey opened this issue Jul 14, 2024 · 0 comments

Contributor

palfrey commented Jul 14, 2024

First please check that it is really an issue with the library, and not some special case of website:

  • There is no paywall
  • You do not have to be logged in to see the articles
  • You tried using a common browser user agent in your configuration / call
  • The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://www.mprnews.org

Some sample URLs that I have tried

https://www.mprnews.org/story/2024/07/09/new-minnesota-state-fair-foods
https://www.mprnews.org/story/2024/07/14/severe-storms-barrel-across-minnesota-overnight-leaving-thousands-without-power

The exact code I used to test these articles/website

I made a script called can_parse.py and ran it against current master, passing each of the URLs as an argument. It might be worth adding to the repository as a test script.

import sys

from newspaper.article import Article

url = sys.argv[1]
article = Article(url, fetch_images=False, follow_meta_refresh=True)
article.download()
article.parse()
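
If the script were to become a repository test helper, a multi-URL variant might look roughly like the sketch below. The names `parse_with_newspaper`, `check_urls` and `main` are hypothetical and not part of newspaper4k; only the `Article(...)`, `download()` and `parse()` calls come from the script above.

```python
from typing import Callable, Dict, Iterable, List, Optional


def parse_with_newspaper(url: str) -> None:
    """Download and parse one URL with the same calls as can_parse.py."""
    # Imported lazily so check_urls can be exercised without newspaper installed.
    from newspaper.article import Article

    article = Article(url, fetch_images=False, follow_meta_refresh=True)
    article.download()
    article.parse()


def check_urls(
    urls: Iterable[str], parse_fn: Callable[[str], None]
) -> Dict[str, Optional[str]]:
    """Run parse_fn on every URL; map each URL to None on success, else an error summary."""
    results: Dict[str, Optional[str]] = {}
    for url in urls:
        try:
            parse_fn(url)
            results[url] = None
        except Exception as exc:  # report any failure instead of stopping at the first one
            results[url] = f"{type(exc).__name__}: {exc}"
    return results


def main(argv: List[str]) -> int:
    """Print one line per URL and return a non-zero status if any URL failed."""
    results = check_urls(argv, parse_with_newspaper)
    for url, error in results.items():
        status = "FAIL" if error else "ok"
        print(f"{status}  {url}  {error or ''}")
    return 1 if any(results.values()) else 0
```

To use it from the command line, the script would end with `sys.exit(main(sys.argv[1:]))` after an `import sys`. Passing the parse function in as an argument keeps the reporting logic testable without network access.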

Other information, remarks, messages, etc:

Traceback (most recent call last):
  File "/home/palfrey/src/newspaper4k/can_parse.py", line 8, in <module>
    article.parse()
  File "/home/palfrey/src/newspaper4k/newspaper/article.py", line 466, in parse
    authors = self.extractor.get_authors(self.doc)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palfrey/src/newspaper4k/newspaper/extractors/content_extractor.py", line 59, in get_authors
    return self.author_extractor.parse(doc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palfrey/src/newspaper4k/newspaper/extractors/authors_extractor.py", line 99, in parse
    if "@graph" in script_tag:
       ^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable
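
The traceback shows that `script_tag` was `None` when the `"@graph" in script_tag` check ran. The source does not say why; one plausible cause (an assumption, not verified against this site) is a JSON-LD script whose content is empty or decodes to `null`. A defensive sketch of the kind of guard that would avoid the `TypeError` follows. `iter_jsonld_objects` is a hypothetical helper, not the code in `authors_extractor.py`.

```python
import json
from typing import Iterator, List, Optional


def iter_jsonld_objects(raw_scripts: List[Optional[str]]) -> Iterator[dict]:
    """Yield only dict-shaped JSON-LD payloads, skipping empty, invalid or null ones."""
    for raw in raw_scripts:
        if not raw:
            # None or empty script text: nothing to parse
            continue
        try:
            data = json.loads(raw)
        except ValueError:
            # Malformed JSON (json.JSONDecodeError subclasses ValueError)
            continue
        # A payload may be a single object or a list of objects
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict):
                # e.g. the literal `null`, which decodes to None
                continue
            # The membership test is safe here because item is known to be a dict
            graph = item.get("@graph")
            if isinstance(graph, list):
                yield from (node for node in graph if isinstance(node, dict))
            else:
                yield item
```

With a guard like this, a page whose JSON-LD is missing or null would simply contribute no author candidates instead of aborting `parse()`.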