WARC Portal

The University of Alberta and its researchers have a collection of web archives but no easy way to analyze them. The WARC Portal project aims to support extracting, searching, and analyzing web archive files. We plan to give researchers intuitive access for browsing and searching thousands of possibly duplicated webpages, tools for analyzing their collections with an array of searches and filters, and helpful visualizations built by analyzing the keywords used across web pages and over time. We hope it will be a valuable tool for experienced digital humanities and social science researchers. It presents web archive data in an intuitive way that helps researchers find overall patterns and trends.

Features of WARC Portal:

Searching through archived webpages
Searching through archived images
Displaying webpages in original format
Text and Image analysis

Authors:

Cheng Chen: cheng10@ualberta.ca

Adriano Marini: marini@ualberta.ca

Kevin Tang: tkevin@ualberta.ca

Mate Verunica: verunica@ualberta.ca

System Requirements

OS: Linux | Storage: 500GB+ free space

Port Bindings:

:8080 - pywb
:8000 - Django / REST API
:5000 - Front end interface
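
Before starting the services, it can help to confirm that nothing else is already listening on these three ports. The following is a minimal sketch, not part of the project: the helper names `port_is_free` and `busy_ports` are hypothetical, and it assumes all services bind to 127.0.0.1.

```python
import socket

# Ports taken from the list above; the labels are only used for reporting.
REQUIRED_PORTS = {8080: "pywb", 8000: "Django / REST API", 5000: "Front end interface"}


def port_is_free(port, host="127.0.0.1"):
    """Return True if no process is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # which means something is already listening there.
        return sock.connect_ex((host, port)) != 0


def busy_ports(ports=REQUIRED_PORTS, host="127.0.0.1"):
    """Return the subset of ports that already have a listener."""
    return {port: name for port, name in ports.items() if not port_is_free(port, host)}
```

Calling `busy_ports()` returns an empty dictionary when all three ports are available, and otherwise maps each occupied port to the component that needs it.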

Dependencies

Front End
Web server
MySQL
Scala Language Support
Oracle Java JDK
WARCBASE (see warcbase.readme)
Apache Spark
Pywb (see pywb.readme)
npm
node
Apache Maven

Elements

This system consists of 3 major components:

Django back end
MySQL Database
React.js User Interface

In addition to:

CRON scripts
Scala scripts

Installation

Django / REST API

Installation:

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Testing the API (in folder):

./manage.py loaddata testdata.json
./manage.py runserver
curl -H 'Accept: application/json; indent=4' -u admin:adminadmin http://127.0.0.1:8000

Alternatively, open http://127.0.0.1:8000/ in a browser and log in as user admin with password adminadmin.
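
The same check can be scripted. The sketch below mirrors the curl command using only the Python standard library: it sends the same Accept header and the admin:adminadmin test credentials via HTTP Basic authentication. The function names are hypothetical, and it assumes the API root answers with a JSON body, which the README does not state explicitly.

```python
import base64
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/"  # development server address from the README


def build_request(url=API_URL, user="admin", password="adminadmin"):
    """Build a request equivalent to the curl command (headers plus Basic auth)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url)
    req.add_header("Accept", "application/json; indent=4")
    req.add_header("Authorization", f"Basic {token}")
    return req


def fetch_api_root(url=API_URL):
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the development server running and the test data loaded, `fetch_api_root()` should return the decoded response of the API root.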

React User Interface

Development server

npm install
npm start

Production server

npm install
npm run build-prod
node app/server.js

In both cases the client will be hosted on http://127.0.0.1:5000

E2E Testing

You can run our E2E tests with Nightwatch and Selenium. First install Nightwatch globally:

npm install -g nightwatch

Alternatively, if you have already run npm install, you can use the nightwatch binary in node_modules. Next, update your webdrivers for Selenium and Chrome before running the tests found in selenium-tests.

npm run e2e-setup
nightwatch

Scripts

https://www.freebsd.org/cgi/man.cgi?query=cron&sektion=8&apropos=0&manpath=FreeBSD+10.3-RELEASE+and+Ports

http://askubuntu.com/questions/2368/how-do-i-set-up-a-cron-job

The scripts are run by cron. To prepare them:

Choose a directory in which you would like to store everything
Edit the scripts to ensure they point to the correct storage area
Set up cron to execute the job (see the links above)
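
For the last step, a crontab line has five schedule fields (minute, hour, day of month, month, day of week) followed by the command. The sketch below only builds such a line as a string so it can be pasted into `crontab -e`. The helper name `cron_entry`, the script path, and the schedule are all hypothetical placeholders, not values from this project.

```python
import shlex


def cron_entry(script_path, minute="0", hour="2", log_path=None):
    """Build one crontab line that runs script_path every day at hour:minute.

    Note: '%' has a special meaning inside crontab commands and is not
    escaped by this sketch, so avoid it in paths.
    """
    command = shlex.quote(script_path)
    if log_path is not None:
        # Append both stdout and stderr of the job to a log file.
        command += f" >> {shlex.quote(log_path)} 2>&1"
    # The three '*' fields mean: every day of month, every month, every weekday.
    return f"{minute} {hour} * * * {command}"


# Hypothetical example: run a script from the chosen storage directory at 02:00.
print(cron_entry("/srv/warc/scripts/example.sh", log_path="/srv/warc/logs/example.log"))
```

Replace the placeholder paths with the directory chosen in the first step before installing the line.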

Selected Dependencies

Documentation

