Enabling Kerberos on Cloudera

We want to use Sentry for security management in CDH, but Kerberos has to be integrated first. Below are the steps for enabling Kerberos on CDH and the problems we ran into.
The process basically follows the Cloudera documentation: Enabling Kerberos Authentication Using the Wizard.

If you need to disable Kerberos, refer to the companion post CDH禁用kerberos (disabling Kerberos in CDH).

WARNING!!! Proceed with caution; there are many pitfalls, and it is best to rehearse in a test environment first.

Overview of the Whole Process

To summarize, here is what the whole process requires:

  1. Set up MIT Kerberos; be sure to save the cloudera-scm/admin password.
  2. Enable Kerberos for the cluster.
  3. Delete the YARN cache /yarn/nm/usercache/ on every NodeManager.
  4. Configure the clients.
  5. Configure the peripheral services: upgrade OpenTSDB to 2.2 and add JAAS; update the Flume configuration.
  6. Verify that MapReduce, Spark, Hive, Impala, Sqoop, Oozie, and shell scripts all still work (see the verification sketch after this list).
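
A minimal verification sketch once Kerberos is enabled (the test principal, example user, and examples-jar path are assumptions; adjust for your environment):

# Obtain a ticket as a test user, check it, then run a trivial HDFS command and MapReduce job
kinit usertest@HADOOP.COM
klist
hdfs dfs -ls /user/usertest
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10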

Problems Encountered While Enabling and Disabling Kerberos

  1. If you have cross-cluster operations, it is best not to enable Kerberos; distcp will fail. See https://issues.apache.org/jira/browse/HDFS-7037
  2. The cluster needs some downtime.
  3. After disabling Kerberos, HBase takes a while to recover.
  4. With HA enabled, both YARN ResourceManagers ended up in standby; the only fix was to disable HA.

Setting Up MIT Kerberos

Internally the company uses Microsoft Active Directory, but we are not allowed to create users there, user names must follow a certain format, and changing the accounts of the existing cluster would be too much trouble, so we set up our own MIT Kerberos.
See http://blog.javachen.com/2014/11/04/config-kerberos-in-cdh-hdfs

The commands used are recorded here; for the detailed configuration and steps, refer to the article above.

# Install the Kerberos server
yum install krb5-server -y
yum install krb5-libs krb5-workstation -y
# Stop the services
service krb5kdc stop
service kadmin stop
# Remove the database files and some old configuration
rm -f /var/kerberos/krb5kdc/principal*
rm -f /var/kerberos/krb5kdc/.k5.*
rm -f /etc/krb5.keytab
# Edit the configuration
## Note: there is a space before the trailing *
vim /var/kerberos/krb5kdc/kadm5.acl
vim /var/kerberos/krb5kdc/kdc.conf
# This file has to be synced to the other machines; if you use Cloudera, you can let Cloudera Manager manage it
vim /etc/krb5.conf
# Create the database
kdb5_util create -r HADOOP.COM -s
# Start the services
service krb5kdc start
service kadmin start
# Add the root admin user
echo -e "root\nroot" | kadmin.local -q "addprinc root/admin"
# Extract the key and store it in the local keytab file /etc/krb5.keytab. The file is owned by the superuser, so you must be root to run the following command in the kadmin shell
kadmin.local -q "ktadd kadmin/admin"
klist -k /etc/krb5.keytab
# Create the admin principal used by Cloudera
kadmin.local -q "addprinc -pw cloudera cloudera-scm/admin@HADOOP.COM"
# Generate a keytab for a user principal
kadmin.local -q "xst -k ${USERNAME}.keytab ${USERNAME}@HADOOP.COM"
# Merge keytabs
$ ktutil
ktutil: rkt hdfs-unmerged.keytab
ktutil: rkt HTTP.keytab
ktutil: wkt hdfs.keytab
ktutil: exit
# Authenticate using a keytab
kinit -k -t hdfs.keytab hdfs@HADOOP.COM
kinit -k -t hdfs.keytab hdfs
# Delete a principal; add the '-force' flag to skip the interactive confirmation
kadmin.local -q "delprinc $princ"

What’s a keytab file?

What’s a keytab file? It’s basically a file that contains a table of user accounts, with an encrypted hash of the user’s password. Why have a keytab file? Well, when you want a server process to automatically logon to Active Directory on startup, you have two options: type the password (in clear text) into a config file somewhere, or store an encrypted hash of the password in a keytab file.

Enabling Kerberos on CDH

Reference: the Cloudera documentation, Enabling Kerberos Authentication Using the Wizard.

Preparation

Install openldap-clients on the Cloudera Manager Server

sudo yum install openldap-clients

Install krb5-workstation and krb5-libs on every machine in the cluster

The Cloudera Manager Server can SSH to the other machines without a password, so the install can be scripted:

for HOST in `cat /etc/hosts | fgrep 172 | awk '{print $1}'`; do
  ssh ${HOST} "sudo yum install krb5-workstation krb5-libs -y"
done

Create the admin account used by Cloudera

Remember this password; the account needs the privilege to create other accounts.

Specifically, the Cloudera Manager Server must have a Kerberos principal that has privileges to create other accounts.

kadmin.local -q "addprinc -pw <Password> cloudera-scm/admin@YOUR-LOCAL-REALM.COM"
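
A quick sanity check (a sketch; the realm is the placeholder above) that the new principal really has admin rights on the KDC, which the wizard requires:

# Should list principals without a permission error if kadm5.acl grants cloudera-scm/admin admin rights
kadmin -p cloudera-scm/admin@YOUR-LOCAL-REALM.COM -q "listprincs"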

Enable Kerberos

On the CDH admin page, open the drop-down menu (the small triangle) to the right of the cluster name, select "Enable Kerberos", and then follow Enabling Kerberos Using the Wizard step by step.

For the YARN configuration, the following changes are needed:
In the configuration file container-executor.cfg:

  1. min.user.id: the minimum Linux user ID allowed, used to block other system accounts; the default is 1000.
  2. allowed.system.users: users explicitly whitelisted to run containers. This setting can whitelist users whose IDs are below the "minimum user ID" setting.
  3. banned.users: users banned from running containers. (A sample container-executor.cfg is sketched after this list.)
  4. Format the ResourceManager state store.
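
For reference, a container-executor.cfg along these lines (illustrative values only; in CDH these properties are normally set through the YARN service configuration in Cloudera Manager rather than by editing the file by hand):

# /etc/hadoop/conf/container-executor.cfg (sketch)
yarn.nodemanager.linux-container-executor.group=yarn
min.user.id=1000
allowed.system.users=nobody
banned.users=mapred,hdfs,bin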

Important:
If you have enabled YARN Resource Manager HA in your non-secure cluster, you should clear the StateStore znode in ZooKeeper before enabling Kerberos. To do this:

  1. Go to the Cloudera Manager Admin Console home page, click to the right of the YARN service and select Stop.
  2. When you see a Finished status, the service has stopped.
  3. Go to the YARN service and select Actions > Format State Store.
  4. When the command completes, click Close.

Client Configuration

Create the hdfs superuser

kadmin.local -q "addprinc hdfs@HADOOP.COM"

Use a reasonably strong password; this user has full privileges over HDFS.

kinit hdfs@HADOOP.COM

This switches to the hdfs user.
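
A quick check (a sketch; the usertest account is an example) that the hdfs principal really has superuser rights:

# Creating and chowning a home directory is something only an HDFS superuser can do
hdfs dfs -mkdir /user/usertest
hdfs dfs -chown usertest:usertest /user/usertest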

Designate an HDFS superuser group

To designate a group of superusers instead of using the default hdfs account, follow these steps:

  1. Navigate to the HDFS Service > Configuration tab.
  2. In the Search field, type Superuser to display the Superuser Group property.
  3. Change the value from the default supergroup to the appropriate group name for your environment.
  4. Click Save Changes.
    For this change to take effect, you must restart the cluster.
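
If you go this route, the Linux side looks roughly like this (group and user names are assumptions; the group must exist on the NameNode hosts):

# Create the group you entered for the Superuser Group property and add an admin user to it
groupadd hdfsadmin
usermod -a -G hdfsadmin adminuser
# After the cluster restart, confirm the membership is visible to HDFS
hdfs groups adminuser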

Create a principal for each user

kadmin.local -q "addprinc user@HADOOP.COM"

Prepare the Cluster for Each User

Each account must have a user ID that is greater than or equal to 1000. In the /etc/hadoop/conf/taskcontroller.cfg file, the default setting for the banned.users property is mapred, hdfs, and bin to prevent jobs from being submitted via those user accounts. The default setting for the min.user.id property is 1000 to prevent jobs from being submitted with a user ID less than 1000, which are conventionally Unix super users.
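
A sketch of preparing an account that satisfies these constraints (user name and UID are assumptions):

# Create the user with an explicit UID >= 1000 so container-executor will accept its jobs
useradd -m -u 2000 usertest
id -u usertest    # must print a value >= min.user.id (1000 by default)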

Peripheral Services

Connecting OpenTSDB to Kerberos-enabled HBase

Download and use the latest OpenTSDB 2.2; the AsyncHBase bundled into earlier versions has no Kerberos support.

  1. Add the following configuration to /etc/opentsdb/opentsdb.conf:

    hbase.security.auth.enable=true
    hbase.security.authentication=kerberos
    hbase.kerberos.regionserver.principal=hbase/_HOST@HADOOP.COM
    hbase.sasl.clientconfig=Client
  2. Create /etc/opentsdb/jaas.conf with the following content:

    Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=false
    useTicketCache=true;
    };
  3. Add -Djava.security.auth.login.config=/etc/opentsdb/jaas.conf to the OpenTSDB startup options.

  4. Run kinit: kinit -k -t hbase.keytab hbase
  5. Restart OpenTSDB.

With the keytab approach (useKeyTab=true) we got errors that HBase could not be connected, so we use useTicketCache and run kinit manually:

KerberosClientAuthProvider: Could not login: the client is being asked for a password, but the client code does not currently support obtaining a password from the user. Make sure that the client is configured to use a ticket cache (using the JAAS configuration setting ‘useTicketCache=true)’ and restart the client. If you still get this message after that, the TGT in the ticket cache has expired and must be manually refreshed. To do so, first determine if you are using a password or a keytab. If the former, run kinit in a Unix shell in the environment of the user who is running this asynchbase client using the command ‘kinit ‘ (where is the name of the client’s Kerberos principal). If the latter, do ‘kinit -k -t ‘ (where is the name of the Kerberos principal, and is the location of the keytab file). After manually refreshing your cache, restart this client. If you continue to see this message after manually refreshing your cache, ensure that your KDC host’s clock is in sync with this host’s clock.
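
Since the ticket cache eventually expires, the kinit from step 4 has to be repeated; a crontab sketch for the user OpenTSDB runs as (keytab path and schedule are assumptions):

# Refresh the hbase TGT every 12 hours so OpenTSDB's ticket cache never goes stale
0 */12 * * * kinit -k -t /etc/opentsdb/hbase.keytab hbase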

Connecting Flume to Kerberos-enabled HDFS

Copy hdfs.keytab into Flume's conf directory, and in flume.hdfs.conf configure the hdfs-sink by adding:

a1.sinks.hdfs-sink.hdfs.kerberosPrincipal = hdfs@HADOOP.COM
a1.sinks.hdfs-sink.hdfs.kerberosKeytab = ./conf/hdfs.keytab

Note the keytab's location.
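
A quick way (a sketch) to confirm the keytab path and principal are usable before restarting the agent:

# Run this from the Flume agent's working directory so the relative ./conf path matches the sink config
kinit -k -t ./conf/hdfs.keytab hdfs@HADOOP.COM && klist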

Errors and Solutions

1. kinit: Cannot find KDC for requested realm while getting initial credentials

The krb5.conf configuration is incorrect; consider letting Cloudera Manager manage krb5.conf.
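
For reference, a minimal krb5.conf along these lines (realm and KDC hostname are assumptions) is enough for clients to find the KDC:

# /etc/krb5.conf (minimal sketch)
[libdefaults]
  default_realm = HADOOP.COM
[realms]
  HADOOP.COM = {
    kdc = kdc-host.example.com
    admin_server = kdc-host.example.com
  }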

2. kinit: KDC reply did not match expectations while getting initial credentials

The realm name must be uppercase in the Kerberos configuration.

3. The biggest pitfall of all: machine hostnames!

Symptom: HDFS and other services in the cluster fail to start, with the error:

java.io.IOException: Login failure for hdfs/vlnx107010@HADOOP.COM from keytab hdfs.keytab: javax.security.auth.login.LoginException: Client not found in Kerberos database (6) - CLIENT_NOT_FOUND
at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:976)
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loginAsNameNodeUser(NameNode.java:613)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:632)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:810)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:794)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1487)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1555)
Caused by: javax.security.auth.login.LoginException: Client not found in Kerberos database (6) - CLIENT_NOT_FOUND
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804)
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:967)
... 7 more
Caused by: KrbException: Client not found in Kerberos database (6) - CLIENT_NOT_FOUND
at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:82)
at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316)
at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361)
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776)
... 20 more
Caused by: KrbException: Identifier doesn't match expected value (906)
at sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
at sun.security.krb5.internal.ASRep.init(ASRep.java:64)
at sun.security.krb5.internal.ASRep.<init>(ASRep.java:59)
at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:60)
... 23 more

Checking the keytab generated for this host:

klist -k hdfs.keytab
Keytab name: FILE:hdfs.keytab
KVNO Principal
---- --------------------------------------------------------------------------
2 HTTP/VLNX107010@HADOOP.COM
2 HTTP/VLNX107010@HADOOP.COM
2 HTTP/VLNX107010@HADOOP.COM
2 HTTP/VLNX107010@HADOOP.COM
2 hdfs/VLNX107010@HADOOP.COM
2 hdfs/VLNX107010@HADOOP.COM
2 hdfs/VLNX107010@HADOOP.COM
2 hdfs/VLNX107010@HADOOP.COM

Note that the principal in the HDFS log is hdfs/vlnx107010@HADOOP.COM, while in the generated hdfs.keytab it is hdfs/VLNX107010@HADOOP.COM. The only difference is the case of the hostname; Kerberos is case-sensitive, so authentication fails.

The hostnames on this cluster, and the entries in /etc/hosts, are all uppercase, so the first idea was to change the hdfs principal to lowercase somehow. After various attempts and research, this turned out not to be feasible; see https://community.cloudera.com/t5/Cloudera-Manager-Installation/javax-security-auth-login-LoginException-Client-not-found-in/td-p/30475

Hadoop in general expects that your hostnames and domain names are all lowercase. When Kerberos is introduced, this becomes important. While it is possible to override this behavior (of expecting lowercase) by doing manual configuration, I recommend ensuring via /etc/hosts or DNS that your host and domain are lower case. After that is corrected, regenerate credentials and that should correct the problem.

So three even more painful steps followed:
1. Change the hostnames and /etc/hosts across the cluster. After editing /etc/hosts on one machine, push the change out in bulk:

for HOST in `cat /etc/hosts | fgrep 172 | awk '{print $2}'`; do
scp /etc/hosts ${HOST}:/etc/hosts
ssh ${HOST} "hostname ${HOST} && sed -i 's/VLNX/vlnx/g' /etc/sysconfig/network"
done

Then disable Kerberos in Cloudera following the steps above and configure it again from scratch. Still no luck: the principals generated in Kerberos were still uppercase.

2. So rename the accounts in Kerberos, changing VLNX to vlnx:

for P in `kadmin.local -q "listprincs" | fgrep VLNX` ; do
UP=`echo $P | sed 's/VLNX/vlnx/g' `
kadmin.local -q "renprinc -force ${P} ${UP}"
done

Still no luck, which I expected, because the account names in the generated keytabs were still uppercase; see the output of the klist -k hdfs.keytab command above.

3. Now for the ultimate fix. When Cloudera generates Kerberos credentials, it uses the hostnames stored in the Cloudera database, so the hostnames in Cloudera itself have to be changed. There is no place in the UI to change a hostname, so it can only be done through the database.

Cloudera's own database is PostgreSQL, with database name scm; the password is in /etc/cloudera-scm-server/db.properties. Enter the postgres CLI with psql -h localhost -p 7432 -U scm.

Look at the hosts table and update the name column:

scm=> select host_id,host_identifier,name,ip_address from hosts;
host_id | host_identifier | name | ip_address
---------+--------------------------------------+------------+----------------
1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX107009 | 192.1.107.9
2 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX107010 | 192.1.107.10
9 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX103122 | 192.1.103.122
7 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX103124 | 192.1.103.124
8 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX103125 | 192.1.103.125
6 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX103126 | 192.1.103.126
5 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX103127 | 192.1.103.127
4 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX103128 | 192.1.103.128
12 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX103129 | 192.1.103.129
3 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | VLNX107011 | 192.1.107.11
scm=> update hosts set name = 'vlnx107011' where host_id = 3;

Once all rows are updated, restart cloudera-scm-server and the cloudera-scm-agent on every machine, then go through the whole process once more.
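
A sketch of the restarts (the host list is derived the same way as in the earlier loops):

service cloudera-scm-server restart
for HOST in `cat /etc/hosts | fgrep 172 | awk '{print $2}'`; do
  ssh ${HOST} "service cloudera-scm-agent restart"
done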

4. Cross-cluster distcp failure

Copying data from a non-secure cluster to a secure cluster is problematic; see https://issues.apache.org/jira/browse/HDFS-7037
This issue has not yet been resolved. Since we have two clusters that exchange data, in the end we did not enable Kerberos.

5. YARN jobs fail with "user id is below 1000"

Diagnostics: Application application_1453449921629_0005 initialization failed (exitCode=255) with output: Requested user hdfs is not whitelisted and has id 496, which is below the minimum allowed 1000

A user's ID on the client must not be below 1000. You can either set min.user.id in container-executor.cfg, or change UID_MIN in Linux's /etc/login.defs.

6. YARN jobs fail with "User <user> not found"

Error message: Diagnostics: Application application_1453449921629_0004 initialization failed (exitCode=255) with output: User usertest not found

Solution: the corresponding user must exist on every NodeManager.

useradd -m usertest && echo <password> | passwd --stdin usertest
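
Since the account has to exist on every NodeManager, the same command can be pushed out over SSH (a sketch reusing the host-list idiom from earlier):

for HOST in `cat /etc/hosts | fgrep 172 | awk '{print $1}'`; do
  ssh ${HOST} "useradd -m usertest && echo <password> | passwd --stdin usertest"
done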

7. YARN error: Can't create directory /yarn/nm/usercache/... Permission denied

Error message: Can't create directory /yarn/nm/usercache/usertest/appcache/application_1453549225084_0001 - Permission denied. Did not create any app directories

Solution:
Manually delete the contents under /yarn/nm/usercache/ on every NodeManager (see the sketch below).

Note: do not delete the usercache directory itself, or you will get "Failed to create directory /yarn/nm/usercache/usertest - No such file or directory". If you do hit that error, simply recreate the directory and chown it to yarn.
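
A sketch of the cleanup across all NodeManagers (same host-list assumption as above); the trailing /* keeps the usercache directory itself:

for HOST in `cat /etc/hosts | fgrep 172 | awk '{print $2}'`; do
  ssh ${HOST} "rm -rf /yarn/nm/usercache/*"
done
# If usercache itself was removed by mistake, recreate it and hand it back to yarn:
# mkdir -p /yarn/nm/usercache && chown yarn:yarn /yarn/nm/usercache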

Reference: http://community.cloudera.com/t5/Batch-Processing-and-Workflow/Can-t-create-directory-yarn-nm-usercache-urika-appcache/td-p/24891

Fix was to remove or move the urika cache directory from all the nodes with (computes in my case). Seems that these directories will get re-created during a run.
Bug when you go from simple AUTH to kerberos AUTH; the cache directories will not work if created under simple AUTH.

References

  1. Cloudera documentation: Enabling Kerberos Authentication Using the Wizard
  2. Configuring Kerberos authentication for HDFS (HDFS配置Kerberos认证)
  3. Problems after configuring Kerberos on CDH (CDH配置Kerberos后的问题)
  4. Troubleshooting kinit verification errors when joining Linux to a domain (Linux加入域kinit命令验证错误解答)
  5. Creating Kerberos Keytab Files Compatible with Active Directory
  6. Cloudera Kerberos hostname case issue (cloudera kerberos hostname大小写问题)
  7. Using kerberos5 for single sign on authentication