Azure, AWS, GCP Cloud Tips and Tutorials Board

This board is for Azure, AWS, GCP, and other cloud tips and tutorials for developers. If you post tips, tutorials, or news you have picked up while developing for the cloud, it will be a big help to other developers studying cloud platforms. Thank you. SQLER.com strives to put knowledge sharing between developers and IT professionals into practice.

Hello, this is Conan (Kim Dae-woo).

This one may sound a little puzzling at first. If you are interested in cloud computing, chances are you are also quite interested in Hadoop.

Yes, it is an open-source distributed processing project, a platform you can also use to build a private cloud.

 

Hadoop project: http://hadoop.apache.org/

The Apache™ Hadoop™ software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Source: the official Hadoop project site

 

Many companies in Korea also seem quite interested in distributed processing solutions built on Hadoop. And perhaps many of you assume that Azure and Hadoop walk separate paths.

 

But running Hadoop on Azure? Not just hosting it: you create a Hadoop cluster together with its nodes, and you can even adjust the number of Job Tracker and Slave nodes dynamically with the Azure management tools. If you are interested, take a look.

 

Original link: Hadoop in Azure

 

Is it possible to deploy a Hadoop cluster in Azure? It sure is, and setting one up is not difficult. Here's how you do it.

In this post I will demonstrate how to create a typical cluster with a Name Node, a Job Tracker and a customizable number of Slaves. You will also be able to dynamically change the number of Slaves using the Azure Management Portal. I will save the explanation of the mechanics for another post.

Follow these steps to create an Azure package for your Hadoop cluster:

Download all dependencies

  • This Visual Studio 2010 project is pre-configured with Roles for each Hadoop component. Don't worry if you don't have Visual Studio or don't want to install the Express edition; you can do everything from the command line.
  • The cluster configuration templates.
  • Install the latest Azure SDK. As of this writing the latest version was 1.4.
  • The Hadoop binaries. I used version 0.21. Hadoop is distributed as a tar.gz file, so you will need to convert it to a ZIP file. You can use 7-Zip for the task (see the example commands after this list).
  • Now install Cygwin and package it in a single ZIP file. Hadoop 0.21 requires Cygwin under Windows. It's fine if you don't know anything about it; Hadoop uses it behind the scenes, so you won't even need to launch it. There is an ongoing effort to remove this dependency for Hadoop 0.22, but it's not ready yet. Just run the Cygwin installer and accept all defaults. You should end up with Cygwin installed in c:\cygwin. Create a compressed folder of c:\cygwin called cygwin.zip.
  • Download the latest version of Yet Another Java Service Wrapper. Beta-10.6 was the current version as of this writing.
  • The last dependency is a Java VM to host Hadoop and YAJSW. If you don't want to update any of the configuration files in this guide, you will need to bundle your favorite JVM in a ZIP file called jdk.zip. All JVM files must be in a folder called jdk inside the ZIP file. If you have your JVM installed under C:\Program Files\Java\jdk1.6.0_<revision>\, rename (or copy) the jdk1.6.0_<revision> folder to jdk and zip it.
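
If you prefer to script the packaging steps above, a rough sketch using the 7-Zip command line follows (file and folder names are illustrative; adjust them to the versions you downloaded):

rem Convert the Hadoop tar.gz into a ZIP (7-Zip unpacks .gz and .tar in two steps)
7z x hadoop-0.21.0.tar.gz
7z x hadoop-0.21.0.tar
7z a hadoop-0.21.0.zip hadoop-0.21.0\

rem Package the default Cygwin installation as cygwin.zip
7z a cygwin.zip c:\cygwin

rem Copy the JVM into a folder named jdk, then zip it as jdk.zip
xcopy /e /i "C:\Program Files\Java\jdk1.6.0_<revision>" jdk
7z a jdk.zip jdk\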

Configure your cluster

The cluster-config.zip you downloaded has all the files needed to configure your Hadoop cluster. You will find the familiar [core|hdfs|mapred]-site.xml files there. Ignore all other files for now; I'll explain what they are for in a later post. Edit the *-site.xml files as needed for your cluster configuration. Make sure you only add properties and do not change any of the existing ones.

Create a new cluster-config.zip if you updated any of the XML files.
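
If you are doing this from the command line, the round trip might look like the following sketch (it assumes the XML files sit at the root of the archive and that 7-Zip is on your PATH):

7z x cluster-config.zip -ocluster-config
rem ... edit the *-site.xml files in the cluster-config folder ...
cd cluster-config
7z a ..\cluster-config.zip *
cd ..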

Upload all dependencies to your Azure Storage account

Create a container called bin and upload all ZIP files to it. Use your favorite tool for the job; I like ClumsyLeaf's CloudXplorer. You should end up with five ZIP files in the bin container: the Hadoop binaries, Cygwin, the JDK, YAJSW and cluster-config.zip.

Configure your Azure Deployment

Unzip the Visual Studio 2010 project. You can either use Visual Studio from here or update the required files using any text editor. I included a batch file to package the deployment if you are going down the command line route.

If you are using Visual Studio, the only file you must change is NameNode\SetEnvironment.cmd. The projects that require this file have links to it. If you are not using Visual Studio, you have to change it in three other places: NameNode\bin\Debug, JobTracker\bin\Debug and Slave\bin\Debug. Get an access key from your storage account, construct a connection string, and paste it into the first line, replacing [your connection string]. An Azure connection string has this format:

DefaultEndpointsProtocol=http;AccountName=[your_account_name];AccountKey=[key]

Unless you used different versions of any of the dependencies you won’t need to change anything else.

The Azure deployment is set to use 1 Large VM for the Name Node, 1 Large VM for the Job Tracker and 4 Extra Large VMs as Slaves. If you are OK with that configuration, skip to the next step. If you'd like anything different, change the Roles configuration directly in Visual Studio, or edit both HadoopAzure\ServiceDefinition.csdef and HadoopAzure\ServiceConfiguration.cscfg to set the desired VM sizes and instance counts.

Deploy your cluster

Create a new service to host your Hadoop cluster. The project is pre-configured for remote access to the machines in the cluster. If you didn't change the project configuration, you will need to upload the certificate AzureHadoop.pfx, found in the root of the project, to your service. The certificate password is hadoop. The deployment will fail if you don't have this certificate.

If you are using Visual Studio 2010 you can deploy by right-clicking on the Cloud project and selecting Deploy. If not, just run buildPackage.cmd from the root of the project using a Windows Azure SDK Command Prompt. You will get the Azure package Hadoop.cspkg to deploy using the Azure Management Portal.
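
For reference, buildPackage.cmd presumably wraps the SDK's cspack tool. A minimal sketch of an equivalent call, assuming it is run from the project root (the real script may pass additional role arguments), would be:

cspack HadoopAzure\ServiceDefinition.csdef /out:Hadoop.cspkg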

Deploy your service (you can ignore the warning message) and wait for the deployment to complete.

Using your Hadoop cluster

Now that everything is up and running you can navigate to the Name Node Summary page. The URL is http://<your service name>.cloudapp.net:50070.

If you click on "Browse the filesystem" Hadoop will construct a URL with the IP address of one of the Slaves. That IP address is not accessible from the Internet, so you will need to replace it in the URL with <your service name>.cloudapp.net, and then you will be able to browse the file system.
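
To illustrate (the IP address here is made up, and the port and path follow Hadoop's standard datanode browsing URL, which may differ in your deployment), a link such as

http://10.0.0.4:50075/browseDirectory.jsp?dir=%2F

would become

http://<your service name>.cloudapp.net:50075/browseDirectory.jsp?dir=%2F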

Let’s run one of the example jobs Hadoop provides to confirm the cluster is working. As it is configured right now you must log in to the Job Tracker to start a new job. I will present alternatives in a future post (hint Azure Connect).

Go back to the Azure Management Portal and RDP into the Job Tracker by selecting it and clicking "Connect" in the toolbar. The username is hadoop and the password is H1Doop. After you log in, open a command prompt window and execute the following commands:

E:\AppRoot\SetEnvironment.cmd

cd /d %HADOOP_HOME%

Now you are ready to run a job. I put together a hadoop script so you don't have to deal with Cygwin when launching jobs. The syntax is the same as the regular hadoop scripts. Let's launch a small job that estimates pi; the two arguments are the number of map tasks (20) and the number of samples per map (200):

bin\hadoop jar hadoop-mapred-examples-0.21.0.jar pi 20 200

If you navigate to the Job Tracker page you will see the job running. The URL is http://<your service name>.cloudapp.net:50030.
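
While you are still logged in you can use the same wrapper script for HDFS commands as well; for example, a standard listing of the file system root (this assumes the script forwards its arguments to the regular hadoop entry point, as described above):

bin\hadoop fs -ls /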

Congratulations, you just ran your first Hadoop job in Azure!

What can I do with my Hadoop cluster?

The cluster is fully operational and you can run any job you would like. You can also use the Azure Management Portal to dynamically change the number of Slaves. Hadoop will discover the new node(s), or detect that nodes were removed, and reconfigure the cluster accordingly.

For example, I added an extra Slave node through the portal and the cluster reconfigured itself to include it.

If you have used Hadoop in production you know you must take extra steps to prepare the Name Node, mostly around high availability. That's a topic in itself, which I plan to discuss in another post. If you can't wait and want to set up a backup and/or a checkpoint node, go ahead; that's certainly part of the solution. Azure Drive is probably another piece.

Let me know your experiences using Hadoop in Azure.

 
