For a simple proof of concept I wanted to get data from MS SQL Server into the Hortonworks Sandbox using Sqoop. Apache Sqoop provides a way of efficiently transferring bulk data between Apache Hadoop and relational databases. This tutorial will show you how to use Sqoop to import data into the Hortonworks Sandbox from a Microsoft SQL Server data source.
Preparing the SQL Server Environment
On the SQL Server side there are a few key settings that need to be configured to ensure easy connectivity. For the purposes of this tutorial I’m assuming you’ll be trying to connect to a local install of SQL Server.
Create a login using SQL Server Authentication
When you create this user make sure to put on your DBA hat and grant all the permissions necessary to connect and access the data (Server Roles, User Mapping)
Create a new login using SQL Server Authentication
Enable Mixed Mode Authentication
Make sure Mixed Mode Authentication is enabled at the server level so that the new user can connect.
Enable TCP/IP Network Protocol
If you’re trying this against a local install of SQL Server, chances are good that the default configuration is set to use the Shared Memory network protocol. You can check this setting by running the script below against the server.
SELECT
    CONNECTIONPROPERTY('net_transport') AS net_transport,
    CONNECTIONPROPERTY('protocol_type') AS protocol_type,
    CONNECTIONPROPERTY('auth_scheme') AS auth_scheme,
    CONNECTIONPROPERTY('local_net_address') AS local_net_address,
    CONNECTIONPROPERTY('local_tcp_port') AS local_tcp_port,
    CONNECTIONPROPERTY('client_net_address') AS client_net_address;
If the query returns Shared Memory as the net_transport (as shown below), you’ll need to go into the SQL Server Configuration Manager, disable Shared Memory, make sure TCP/IP is enabled, and restart the SQL Server service.
Once you have Shared Memory disabled the code should return an IP address. Make note of the IP address and the Local TCP Port number as we’ll be using them later in the Sqoop connection string.
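Before moving on, it can’t hurt to confirm that the port is actually reachable over the network from wherever Sqoop will run. Here’s a quick sketch using netcat (assuming nc is installed; the IP and port are placeholders from my setup, so substitute the values you noted above):

```shell
# Probe the SQL Server TCP endpoint without sending any data.
# -z: just scan for a listener; -v: report success or failure verbosely.
nc -vz 192.168.56.1 1433
```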
NOTE: It’s probably a good idea at this point to make sure the user is able to connect to the server and access some data. Log in through Management Studio with the new credentials and run a SELECT against a table you want to import. If you can’t get to the data this way, Sqoop isn’t going to be able to either.
Preparing the Hortonworks Sandbox
By default, the Microsoft SQL Server JDBC driver is not included in the Sqoop library that ships with the Hortonworks Sandbox, so we need to download that driver before we can connect the two systems.
Login To the Virtual Machine
With the Sandbox running, jump into the shell by hitting Alt+F5
The login and password are root/hadoop and will get you to the following screen
Download the SQL Server JDBC Driver
Enter the following command to change to the /usr/local directory
cd /usr/local
Download and unzip the driver
curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz
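As a side note, the one-liner above chains the download and the extraction: curl streams the compressed archive to stdout and tar unpacks it straight from stdin. You can see the same pattern with a throwaway local archive (all the paths below are just scratch locations for the demo):

```shell
# Build a small tarball to stand in for the downloaded driver archive.
mkdir -p /tmp/pipe_demo/src /tmp/pipe_demo/out
echo "driver" > /tmp/pipe_demo/src/fake.jar
tar czf /tmp/pipe_demo/demo.tar.gz -C /tmp/pipe_demo src

# 'cat | tar xz' mimics 'curl -L ... | tar xz': tar reads the compressed
# stream from stdin and unpacks it into the current directory.
cd /tmp/pipe_demo/out
cat ../demo.tar.gz | tar xz
ls src
```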
You should now see a folder in the directory called sqljdbc_4.0 when you run the ls command.
Copy the driver to the Sqoop library
While still in the /usr/local directory, run the following command to copy the driver out to the Sqoop library folder
cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib
Once this runs you should be able to see the driver in the list of available Sqoop drivers
Now that the driver has been added I recommend restarting the Sandbox completely. I’m sure there are ways to restart just the affected services so that the new driver becomes available, but I had a much easier time simply rebooting before continuing on.
Transfer data using Sqoop
Now we’re finally ready to use Sqoop to connect to SQL and pull data out of Microsoft’s sample AdventureWorks database.
Build the connection string
Before we start pulling data, let’s verify the connection from Sqoop to SQL Server by listing the available databases. For the purposes of demonstration I’ll be hard-coding the password into the connection string. The command for listing databases uses the following syntax:
sqoop list-databases --connect jdbc:sqlserver://<IP Address>:<Port Number> --username <Username> --password <Password>
So for my connection I run the following command, using the IP address and port number I saw earlier along with the username and password I set up:
sqoop list-databases --connect jdbc:sqlserver://192.168.56.1:1433 --username hadoop --password hadoop1
With that command, Sqoop should return a listing of all the available databases to connect to.
Import a Table into Hive
With the AdventureWorks database I will import the Sales.SalesOrderDetail table. The command to import the table is structured like this:
sqoop import --connect "jdbc:sqlserver://<IP Address>;database=<Database Name>;username=<Username>;password=<Password>" --table <Table Name> --hive-import -- --schema <Schema Name>
So for my import I end up with the following command:
sqoop import --connect "jdbc:sqlserver://192.168.56.1:1433;database=AdventureWorks2008R2;username=hadoop;password=hadoop1" --table SalesOrderDetail --hive-import -- --schema Sales
Note: the schema argument is specific to the SQL Server connector and must be passed at the end of the command, after a bare -- argument. If you are importing from the default schema of the database (like dbo) you do not need to specify the schema with this additional argument.
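Since that bare -- separator is easy to get wrong, here’s a sketch that assembles the import command from placeholder variables (all values are hypothetical, and it just echoes the command for inspection rather than running it):

```shell
# Placeholder connection details -- substitute your own.
SQL_HOST="192.168.56.1"
SQL_PORT="1433"
SQL_DB="AdventureWorks2008R2"
SQL_USER="hadoop"
SQL_PASS="hadoop1"

# Everything after the bare " -- " is handed to the SQL Server connector
# (which understands --schema) rather than parsed by Sqoop itself.
CMD="sqoop import --connect \"jdbc:sqlserver://${SQL_HOST}:${SQL_PORT};database=${SQL_DB};username=${SQL_USER};password=${SQL_PASS}\" --table SalesOrderDetail --hive-import -- --schema Sales"
echo "$CMD"
```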
After you execute the command, Hadoop is going to go to work and kick off a MapReduce job. You’ll see a lot of information flying past your eyes, but eventually you’ll be left with results indicating your table has been successfully imported. (Use Shift+PageUp/PageDown if you want to navigate back through the command prompt output that flew past, which is especially useful in debugging.)
Querying the Results
At this point we’re ready to query the Hive table through Beeswax. Go back into the Hue interface, click on the Hive tab, and enter a simple SELECT query against the table you imported. After the query runs you’ll see the results.
SELECT * FROM SalesOrderDetail
Sqoop is a powerful tool that can be used with relational databases to provide a number of benefits, including incremental imports and exports as well as loads to HBase. You can find out more by checking out the Apache Sqoop project page.
Last night I managed to get my weekend project finished. I decided it would be a good exercise to figure out how to read data from the Twitter API through a Python script and write the results into MongoDB (getting them into Hive instead is next on the list). This was my first real Python script that wasn’t in the command line or written to complete an exercise from a book. This process of working through use cases I make up for myself is really proving to be the best way for me to learn, and I’d recommend it to anyone who likes to jump in.
Getting into Python
The first thing I realized while figuring out how I was going to do this was that my goal of using Vim to write all my code was causing me trouble. It’s hard enough to learn one thing at a time, but learning two things on top of each other is quite a challenge. With that realization I downloaded PyCharm to use as my IDE and couldn’t have been happier. The code completion, shortcuts, layout, and theme all helped me focus on the code and less on the mechanics of using Vim. I definitely liked it more than using Eclipse. I can imagine a day in the future when I use Vim exclusively, but for now, in spite of all the hardcore programmer advice on StackOverflow, I’m going to use an IDE and worry about the rest later.
Pip those Packages
Next up was getting the necessary libraries to read from Twitter and write to MongoDB. I ended up using the twitter-python package. I tried it first, then read something that said tweepy was better, tried that, and quickly made my way back after feeling a bit lost in the documentation. After that I grabbed the pymongo driver and was off and running. (By the way, I already had a MongoDB instance running on my MacBook so that part took care of itself.)
After a bit (or a lot) of hacking around, setting up a Twitter App on my account to get the credentials, and reading a lot of documentation, I ended up with the code result shown below, which I put up on GitHub as well for fun. Don’t worry, the code over there reads my Twitter keys from a file instead of being hardcoded as in the picture below. For the amount of time it took and all the trial and error, the result may look like “not much” to an outside observer, but that’s okay by me. It’s the journey and not the destination.
Undoubtedly, when you begin to investigate Hadoop and the possibility of adding it to your enterprise ecosystem, the interaction between this newfangled technology and your existing RDBMS may pose some challenges. If you’re like me, coming from a relational database background and looking to bridge the gap between the old and new technologies, some natural initial questions will arise, like “How do I import from my RDBMS into Hadoop?” and “How do I export from Hadoop to my RDBMS?” The answer to both, of course, is Apache Sqoop, which provides these functions in a pretty easy-to-use command-line format.
I was lucky enough to get a copy of Instant Apache Sqoop to review and found it to be a useful reference for my needs and learning style. It jumps right in and shows you different ways to immediately achieve tasks. Short, fast, and focused certainly describe the contents well, which is perfect for my learning style: I learn by doing and prefer to dive in and get my hands dirty as quickly as possible and this book certainly helps with that goal.
The book is broken down into 9 sections covering some critical functions of Sqoop including Importing/Exporting with Hive, Importing/Exporting with HBase, and incremental importing. Up until now, I’ve only used Sqoop to import directly into HDFS and I expect I’ll return to this book when I try going to Hive or doing an incremental load for the first time.
When trying to start Hive for the first time I kept getting the following error message:
notice: /Stage/Hdp-hcat::Hcat::Service_check/Exec[hcatSmoke.sh prepare]/returns:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
It turned out that, for some reason, during the Ambari installation the hive user did not get created in MySQL and therefore couldn’t connect. The following command confirmed the issue:
mysql -u hive -p
ERROR 1045 (28000): Access denied for user 'hive'@'localhost' (using password: YES)
So I jumped directly to the manual RPM documentation for Hive installation, configured the user accordingly, and was able to start Hive successfully. I’m thinking the failure had something to do with the fact that I already had a MySQL instance installed and running, but who knows. Issue corrected, time to move on.
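For anyone hitting the same wall, the manual fix amounts to creating the login yourself. Here’s a sketch of the idea, assuming MySQL root access; the metastore database name and the password are placeholders, so match them to the values in your own Hive configuration and the HDP documentation:

```shell
# All names and passwords below are placeholders -- check hive-site.xml
# (javax.jdo.option.ConnectionUserName/Password) for the real values.
mysql -u root -p <<'SQL'
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivepassword';
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
SQL
```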
After successfully applying for and receiving money from an angel investor who, for the sake of anonymity, we’ll call “my wife”, I was able to complete my Hadoop-At-Home (HAH!) cluster this past week. The major infrastructure upgrade was necessary so I could move to four 1TB drives and stop worrying about running out of space as I get deeper into my learning. I decided to avoid trying to image my drives and instead start with a fresh, clean install. It turned out this was an especially good idea since I had borked a few things along the way, and the time it would take to figure out how to unbork them would probably have been longer than just starting from scratch. I followed my inclination and luckily the reinstall went smoothly and fairly quickly.
But then tragedy struck: last night I finished provisioning my HAH! cluster with Ambari, using the new Hortonworks HDP 2.0 release, and soon discovered that all the glorious gigabytes I had envisioned were sadly only showing up as 50GB per machine.
HDFS Capacity free, 48.2GB per node! Not my intention in buying 1TB drives.
Blakey and Mingus working hard and running Hadoop on the super cheap
Why Even Bother?
When I first became interested in learning Hadoop, I figured out pretty much immediately that I would need to cultivate a deep understanding of the underlying system and its installation and configuration, before I could successfully work up to actually using the system myself. From everything I digested in my early research, it seemed that everyone in the Hadoop community knew the ecosystem from the ground up: it felt imperative that I lay my own groundwork in order to truly understand Hadoop on all its levels. To that end, I decided I would build my own Hadoop cluster at home using cheap components scavenged from eBay. Sure, there are ways to explore Hadoop virtually, but I have always done better learning things in a more hands-on way. I also figured if I had a room full of computer equipment, I would be less likely to get frustrated and wander away.
After a reboot I discovered that most of my datanodes were not running. Here’s how I got them back online using the Ambari interface.
- When I logged into my Ambari server I saw that only one of my datanodes was up and running (the one on the Ambari server itself)
Well, it’s been a while since my last post, and with good reason: I’ve been installing Hadoop. After watching countless YouTube lectures and reading everything I could about Hadoop, it became clear to me that there would be a real benefit to understanding the entire ecosystem, including the hardware and networking aspects. It turns out that I have no experience with those things, but I’ve never let a lack of skills, knowledge, or ability stop me before, so why start now? So I came up with what I thought was a simple plan.
- Buy cheap old servers on eBay.
- Build a LAN at home.
- Install Hadoop.
Here’s how the plan ended up working out.
So last week I decided I wanted to get outside of the Hortonworks tutorials and try something on my own. To that end I decided to try the Hello World! of Hadoop and do a word count against a text file. For this task I wanted to use Pig. I couldn’t find any clear-cut examples of how to do this and struggled for about an hour. Finally, with enough persistence and some trial and error, I got it to work and wanted to share how I went about this task.
1) Create a text file with data
This can be anything, but I ended up taking the output of some textual data I had in SQL and dumping it into a text file. It’s definitely a little more interesting if you can work with some data you know or at least have an interest in.
2) Import the file into the Sandbox
Go to the File Browser tab and upload the .txt file. Notice that the default upload location is /user/hue.
Well, after battling sickness and a cranky baby all week, I finally got a little alone time to continue the quest. This evening I was able to walk through the PIG tutorial and get my feet wet with what seems like a really great language to know. I found it to be clean, intuitive, and easy to digest. I’m sure things get much more involved down the road, but for now, selecting, filtering, and averaging made me quite pleased. Here piggy piggy:
So this little guy to the right came from Yahoo? Kind of looks like he’s got an attitude problem. Oh well, personal stuff aside, I think we can work together nicely. Below is an example of the code I just wrote (copied from the tutorial). Isn’t it grand? I have a feeling I may want to dive in deeper with PIG. Hive seems fine and dandy but a little too close in nature to my SQL day job to get me jazzed. Maybe PIG will take me there. I’m still planning on going through the rest of the Hortonworks tutorials, however, and then when I get a clear hour or two I’ll try to install a server on Amazon. That will be the big fun.