How to Word Count with Pig and the Hortonworks Sandbox

So last week I decided I wanted to get outside of the Hortonworks tutorials and try something on my own. To that end I decided to try the “Hello World!” of Hadoop and do a word count against a text file. For this task I wanted to use Pig. I couldn’t find any clear-cut examples of how to do this and struggled for about an hour. Finally, with enough persistence and some trial and error, I got it to work and wanted to share how I went about it.

1) Create a text file with data

This can be anything, but I ended up taking some textual data I had in SQL and dumping it into a text file. It’s definitely a little more interesting if you can work with some data you know or at least have an interest in.

2) Import the file into the Sandbox

Go to the File Browser tab and upload the .txt file. Notice that the default upload location is /user/hue.

3) Write a Pig script to parse the data and dump to a file

I put the script together from snippets I found on the web. The key thing here is to make sure your load statement references the location where your file lives and that you specify an output location to store the results. Note: I didn’t create the /pig_wordcount folder before I ran this; the script created it for me, which was handy. Just hit execute and sit back. You can check the run status on the query history tab.
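
To make the logic concrete, here is a minimal Python sketch of the steps a word count like this performs: read lines, split each line into words, group identical words, and count each group. In Pig these steps are usually written with the TOKENIZE, FLATTEN, GROUP and COUNT operators. This is an illustration of the data flow only, not the Pig script itself. The function name and sample lines are made up, and splitting on whitespace is a simplification of what Pig's tokenizer does.

```python
from collections import Counter


def word_count(lines):
    """Count how often each word appears across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Split on whitespace (a simplified tokenizer) and
        # add one to the running total of every word on the line.
        counts.update(line.split())
    return counts


# Hypothetical sample data standing in for the uploaded .txt file.
sample = ["the quick brown fox", "the lazy dog", "the end"]
counts = word_count(sample)
print(counts["the"])  # 3
```

Running something like this against a small sample of your file is a quick way to sanity-check the numbers that come back from the Sandbox.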

4) Use HCatalog to load the file to a “table”

Being a SQL developer by day, I wanted to query the results in a familiar way, so I decided to create a table using HCatalog to make them easily accessible through Hive. I went into the HCatalog tab, chose the output file from the folder I specified, named the table and columns, and hit create table. It churned for a while but eventually completed.

5) Use Hive to query and sort the data for final output

Finally, I went into the Hive tab and wrote a quick query to return and organize the results. Once it completed, I downloaded the results and put them in Excel so I could print and frame them.
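
As a rough illustration of what "organize" can mean here, the Python sketch below orders word counts from highest to lowest, which is what an ORDER BY on the count column in descending order would do in SQL. It assumes the query sorts by count. The function name, the tie-breaking rule and the sample data are hypothetical.

```python
def top_words(counts, limit=10):
    """Return (word, count) pairs, highest count first; ties are broken alphabetically."""
    ordered = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ordered[:limit]


# Hypothetical counts standing in for the rows of the word count table.
sample_counts = {"the": 3, "dog": 1, "fox": 2}
for word, total in top_words(sample_counts):
    print(word, total)
```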

Conclusion

All in all, this felt like a huge accomplishment for me. I definitely have a better understanding of how things work on the backend of Hadoop now that I’ve struggled through a tutorial of my own design. There’s really something to be said for the self-directed learning process of trial/error/Google.

Posted in Hadoop, Pig
5 comments on “How to Word Count with Pig and the Hortonworks Sandbox”
  1. romaintech says:

    Great blog post that shows how to get started with Pig!

    If you like the Hadoop Web UI (and want a nicer & newer one ;-), it comes from the open source project Hue: http://gethue.com

  2. Cheryle says:

    Hi Fred, Nice tutorial! Have you thought about publishing this in GitHub? We have a new option to allow the members of the community to contribute their knowledge and gain more exposure: https://github.com/hortonworks/hadoop-tutorials

    Cheryle
    @bikergirlsj

  3. Cheryle says:

    Nice job on learning “that forking business”! Has your Hortonworks elephant been named yet? And congratulations on finding your angel investor. :-)

    • Fred says:

      No official name yet, but I’ve been leaning towards Edgar. He’s looking pretty lonely though so I’ll need to finish my MS SQL to Hortonworks Sandbox using Sqoop tutorial (which is nearly done) and see if I can get him a friend.
