Releases: gocolly/colly
v2.2.0
Tons of bug fixes, small improvements, and updated dependencies
What's Changed
- Added new project that uses Colly by @twiny in #525
- adding docker-slim to the list of open source projects using Colly by @kcq in #526
- update coursera example by @ysung6 in #537
- Update openedx_courses.go by @lingt-xyz in #539
- feat: support Stop() in queue by @PalanQu in #531
- Fix invalid base URL assignment by @dasshield in #549
- Add context.Context support by @WGH- in #548
- Fix handling of redirected URLs in OnResponseHeaders by @WGH- in #561
- Fix race on count fields of Collector type by @shubham14bajpai in #563
- SetProxyFunc: fix proxies rotation by @wowinter13 in #567
- Update xkcd_store.go by @BigManing in #570
- Updated xmlquery to close a known security vulnerability by @ijomeli in #582
- Fix default User-Agent when using custom headers by @WGH- in #592
- Update openedx_courses.go by @BigManing in #576
- Add httpBackend cache response callback by @slashapp in #574
- Update README.md by @seversky in #595
- Change the type assertion for Attr by @moritamori in #588
- Update User-Agent strings for Desktop by @guessi in #605
- add quote crawler using colly by @eval-exec in #603
- Fix golint issues by @waltton in #620
- Replace Travis with GitHub Actions by @alexhraber in #669
- Use github.com/nlnwa/whatwg-url for URL parsing by @WGH- in #673
- Fix Async ignoring its arguments by @ljbink in #652
- fix: generated code has compile error by @darjun in #629
- Move filtering to one function (checkFilters) and use it for redirects too by @joelazar in #638
- Fix Contributing.md link by @SaladinoBelisario in #641
- Work around invalid URL encoding in path by @WGH- in #675
- Host & Custom Header Support by @ErikOwen in #691
- Code cleanup by @WGH- in #678
- Add GitHub issue templates and nudge questions to Stack Overflow by @WGH- in #702
- Fix redirects ignoring AllowURLRevisit=false by @WGH- in #681
- Bump whatwg-url version by @WGH- in #713
- fix some typos by @cuishuang in #736
- feat: Add MaxRequests parameter by @guilhem in #722
- Update CI configuration by @WGH- in #749
- upgrade Cascadia version by @kinoute in #650
- Bump golang.org/x/net from 0.0.0-20220114011407-0dd24b26b47d to 0.7.0 by @dependabot in #756
- Support websites redirecting to the same page when AllowURLRevisit is disabled by @WGH- in #763
- Update User-Agent strings for 2023 by @guessi in #765
- Fix data races by @WGH- in #771
- Replaced deprecated io/ioutil package with proper implementations of … by @zachary-walters in #785
- Bump golang.org/x/net from 0.7.0 to 0.17.0 by @dependabot in #789
- put accept header before OnRequest by @k4lizen in #786
- Fix bug: retrying a scrape loses POST requestData by @Shinku-Chen in #794
- scrapper -> scraper Update README.md by @ideabrian in #809
- Test on Go 1.22, drop old versions by @WGH- in #810
- Bump google.golang.org/protobuf from 1.24.0 to 1.33.0 by @dependabot in #807
- Implement content sniffing for HTML parsing by @WGH- in #808
- Add headers/ua to robots.txt request by @makew0rld in #838
- doc: mention that OnScraped executes after OnHTML and OnXML by @aladmit in #832
New Contributors
- @twiny made their first contribution in #525
- @kcq made their first contribution in #526
- @ysung6 made their first contribution in #537
- @lingt-xyz made their first contribution in #539
- @PalanQu made their first contribution in #531
- @dasshield made their first contribution in #549
- @shubham14bajpai made their first contribution in #563
- @wowinter13 made their first contribution in #567
- @BigManing made their first contribution in #570
- @ijomeli made their first contribution in #582
- @slashapp made their first contribution in #574
- @seversky made their first contribution in #595
- @moritamori made their first contribution in #588
- @guessi made their first contribution in #605
- @eval-exec made their first contribution in #603
- @waltton made their first contribution in #620
- @alexhraber made their first contribution in #669
- @ljbink made their first contribution in #652
- @darjun made their first contribution in #629
- @joelazar made their first contribution in #638
- @SaladinoBelisario made their first contribution in #641
- @ErikOwen made their first contribution in #691
- @cuishuang made their first contribution in #736
- @guilhem made their first contribution in #722
- @kinoute made their first contribution in #650
- @dependabot made their first contribution in #756
- @zachary-walters made their first contribution in #785
- @k4lizen made their first contribution in #786
- @Shinku-Chen made their first contribution in #794
- @ideabrian made their first contribution in #809
- @makew0rld made their first contribution in #838
- @aladmit made their first contribution in #832
Full Changelog: v2.1.0...v2.2.0
v2.1.0
v2.0.0
- Breaking change: Change Collector.RedirectHandler member to Collector.SetRedirectHandler function
- Go module support
- Collector.HasVisited method added to check whether a URL has already been visited
- Collector.SetClient method introduced
- HTMLElement.ChildTexts method added
- New user agents
- Multiple bugfixes
v1.2.0
v1.1.0
- Appengine integration takes context.Context instead of http.Request (API change)
- Added "Accept" http header by default to every request
- Support slices of pointers and structs in unmarshal
- Fixed a race condition in queues
- ForEachWithBreak method added to HTMLElement
- Added a local file example
- Support gzip decompression of response bodies
- Don't share waitgroup when cloning a collector
- Fixed Instagram example
v1.0.0
We are happy to announce that the first major release of Colly is here. Our goal was to create a scraping framework to speed up development and let its users concentrate on collecting relevant data. There is no need to reinvent the wheel when writing a new collector. Scrapers built on top of Colly support different storage backends, dynamic configuration and running requests in parallel out of the box. It is also possible to run your scrapers in a distributed manner.
Facts about the development
Development started in September 2017 and has not stopped since. Colly has attracted numerous developers who helped by providing valuable feedback and contributing new features. Let's look at the numbers. In the last seven months, 30 contributors have created 338 commits. Users have opened 78 issues, 74 of which were resolved within a few days. Contributors have opened 59 pull requests, and all but one of them have been merged or closed. We would like to thank all of our supporters who contributed code, wrote blog posts about Colly, or helped development in other ways. We would not be here without you.
You might ask why we are releasing it now. Our experience with various production deployments shows that Colly provides a stable and robust platform for developing and running scrapers, both locally and in multi-server configurations. The feature set is complete and ready to support even complex use cases. What are those features?
- Rate limiting: Controlling the number of requests sent to the scraped site can be crucial. We would not want to disrupt the service by overloading it with too many requests; that is bad for the site's operators and also for us, because the data we want to collect becomes inaccessible. The collector provided by Colly can therefore be configured to send only a limited number of requests in parallel.
- Request caching: Response caching is supported to relieve load on external services and decrease the number of outgoing requests.
- Configurable via environment variables: To avoid rebuilding your scraper during fine-tuning, Colly can read configuration options from environment variables, so you can modify its settings without a Go development environment.
- Proxies/proxy switchers: If the address of the scraper has to be hidden, proxies can make requests instead of the machine running the scraping job. Furthermore, to scale Colly without running multiple scraper instances, collectors support proxy switchers, which distribute requests among multiple servers. Processing the collected pages still happens on the machine running the scrapers, but the network traffic is moved to different hosts.
- Storage backend and storage interface: During scraping, various data needs to be stored and sometimes shared. Colly provides a storage interface for accessing these objects. You can create your own storage and use it in your scraper by implementing the required interface. By default Colly keeps everything in memory; additional backend implementations are available for Redis and SQLite3.
- Request queue: Scraping pages asynchronously in parallel is a must-have feature. Colly maintains a request queue where URLs found during scraping are collected; your collector's worker threads take these URLs and create requests.
- Goodies: The package named `extensions` provides multiple helpers for collectors. These are common functions implemented in advance, so you don't have to bloat your scraper code with general implementations. An example extension is `RandomUserAgent`, which generates a random User-Agent for every request. You can find the full list of goodies at https://godoc.org/github.com/gocolly/colly/extensions
- Debuggers: Debugging can be painful. Colly tries to ease the pain by providing debuggers to inspect your scraper. You can simply write debug messages to the console using `LogDebugger`. If you prefer web interfaces, we've got you covered: Colly comes with a web debugger, which you can use by initializing a `WebDebugger`. See how debuggers can be used at https://godoc.org/github.com/gocolly/colly/debug
We, the team behind Colly, believe that it has become a stable and mature scraping framework capable of supporting complex use cases, and we are hoping for an even more productive future. Last but not least, thank you for your support and contributions.