KijiChopsticks allows you to write programs using the Scalding API that read from and write to Kiji tables.
This project contains an example that counts the words in the 20Newsgroups data set.
- Set up a functioning KijiBento environment. For installation instructions, see http://www.kiji.org/.
- Install KijiChopsticks and put the chop tool on your $PATH.
- Download and unpack the 20Newsgroups data set. This data set will be loaded into a Kiji table.
  curl -O http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
  tar xvf 20news-18828.tar.gz
- Start a bento cluster:
  bento start
- If you haven't installed the default Kiji instance yet, do so first:
  kiji install
These examples are set up to be built using Apache Maven. To build a jar containing the following examples, run:
git clone git@github.com:kijiproject/kiji-chopsticks-examples.git
cd kiji-chopsticks-examples/
mvn package
The compiled jar can be found in
target/kiji-chopsticks-examples-0.1.0-SNAPSHOT.jar
Next, create and populate the 'postings' table:
kiji-schema-shell --file=ddl/postings.ddl
chop jar lib/kiji-chopsticks-examples-0.1.0-SNAPSHOT.jar \
org.kiji.chopsticks.examples.NewsgroupLoader \
kiji://.env/default/postings <path/to/newsgroups/root/>
This table contains one newsgroup post per row. To check that the table has been populated correctly:
kiji scan kiji://.env/default/postings --max-rows=10
You should see some newsgroup posts printed to the screen.
The following chopsticks word count job reads newsgroup posts from the info:post column of the postings Kiji table, splitting each post up into the words it is composed of. The occurrences of each word are then counted by using the groupBy aggregation method.
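To make the split-then-group logic concrete, here is a minimal sketch in plain Scala collections. It is not the source of NewsgroupWordCount and uses no Kiji or Scalding classes; the object name, the method names, and the tokenization rule (lowercasing and splitting on non-word characters) are all assumptions made for illustration.

```scala
// Illustrative only: plain Scala collections, no Kiji or Scalding dependencies.
object WordCountSketch {

  // Splits a post into lowercase words and drops empty tokens.
  // The tokenization rule is an assumption, not taken from the real job.
  def tokenize(post: String): Seq[String] =
    post.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Flattens all posts into one stream of words, groups identical
  // words together, and counts the size of each group.
  def countWords(posts: Seq[String]): Map[String, Int] =
    posts
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-ins for values read from the info:post column.
    val posts = Seq("foo bar foo", "Bar baz")
    countWords(posts).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(s"$word\t$count")
    }
  }
}
```

The actual job presumably applies the same two steps to the table's rows, with the grouping executed as a distributed aggregation rather than on an in-memory collection.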
Run the word count, writing the output to HDFS:
chop hdfs lib/kiji-chopsticks-examples-0.1.0-SNAPSHOT.jar \
org.kiji.chopsticks.examples.NewsgroupWordCount \
--input kiji://.env/default/postings --output ./wordcounts.tsv
Check the results of the job:
hadoop fs -cat ./wordcounts.tsv/part-00000 | grep "\<foo\>"
You should see:
foo 56
This project also contains an example of writing to a Kiji table. NewsgroupPostCounter reads posts from the info:post column of the postings Kiji table, counts the number of words in each post, and writes that count to the info:postLength column of the same table.
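The per-post computation can be sketched in plain Scala as follows. This is not the source of NewsgroupPostCounter: the object and method names are hypothetical, the whitespace-based word split is an assumption, and an in-memory Map keyed by row key stands in for the Kiji table.

```scala
// Illustrative only: an in-memory Map stands in for the postings table.
object PostLengthSketch {

  // Counts whitespace-separated words in a single post.
  // Splitting on whitespace is an assumption, not taken from the real job.
  def wordCount(post: String): Int =
    post.split("\\s+").count(_.nonEmpty)

  // Maps each row key to the word count of its post, mirroring the
  // info:post -> info:postLength transformation described above.
  def postLengths(posts: Map[String, String]): Map[String, Int] =
    posts.map { case (rowKey, post) => rowKey -> wordCount(post) }

  def main(args: Array[String]): Unit = {
    // Hypothetical row keys and post bodies.
    val posts = Map("row1" -> "hello big world", "row2" -> "")
    postLengths(posts).toSeq.sortBy(_._1).foreach {
      case (rowKey, length) => println(s"$rowKey\t$length")
    }
  }
}
```

Unlike the word count, this computation needs no grouping: each row produces exactly one output value, which is why the result can be written back to the same row it was read from.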
Run the post word counter:
chop hdfs lib/kiji-chopsticks-examples-0.1.0-SNAPSHOT.jar \
org.kiji.chopsticks.examples.NewsgroupPostCounter \
--input kiji://.env/default/postings --output kiji://.env/default/postings
Check the output in Kiji:
kiji scan kiji://.env/default/postings --max-rows=10