RyanCheung Ignition sequence start

Hive On Elasticsearch读写数据


为Hadoop添加ES-Hadoop支持

  1. 根据对应的ElasticSearch版本下载ES-Hadoop包
  2. 解压elasticsearch-hadoop-xxxxx.zip,并提取其中的elasticsearch-hadoop-hive-xxxxx.jar上传至HDFS中;
  3. 在Hive命令行下执行add jar hdfs:///jar/elasticsearch/elasticsearch-hadoop-hive-xxxxx.jar添加此Jar包。

查看ElasticSearch的映射关系

可以在Kibana中使用Dev Tools通过执行下列代码查看某个索引的映社关系(mapping):

GET /{YOUR_INDEX}/{TYPE}/_mapping?pretty

也可以在Kibana=>Management=>Index Management=>{YOUR_INDEX}=>Mapping查看映射关系。

以下给出一个查询后的mapping的样例:

GET company/_doc/_mapping?pretty
{
  "mappings": {
    "_doc": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "birth": {
          "type": "text"
        },
        "addr": {
          "type": "text"
        }
      }
    }
  }
}

在Hive中定义一个外表

add jar hdfs:///jar/elasticsearch/elasticsearch-hadoop-hive-6.8.16.jar;

DROP TABLE IF EXISTS octopus_fetch_test;

CREATE EXTERNAL table IF NOT EXISTS octopus_fetch_test( 
  `@timestamp`  timestamp,
  `@version`    STRING,
  `addr`        STRING,
  `areaid`      STRING,
  `areaname`    STRING,
  `birth`       STRING,
  `boxid`       STRING,
  `cells`       ARRAY<STRUCT<isChange:STRING,location:STRING,value:STRING>>,
  `createtime`  STRING,
  `dzqyid`      STRING,
  `dzqyname`    STRING,
  `fetchdatatime` BIGINT,
  `id`          STRING,
  `kjnd`        STRING,
  `kjqj`        STRING,
  `name`        STRING,
  `nsqxdm`      STRING,
  `parentboxid` STRING,
  `qyid`        STRING,
  `qyname`      STRING,
  `sbszid`      STRING,
  `sheetname`   STRING,
  `systemid`    STRING
)  
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' 
TBLPROPERTIES(
    'es.nodes' = 'http://172.37.4.156',
    'es.port' = '9200',
    'es.net.ssl' = 'true', 
    'es.nodes.wan.only' = 'true', 
    'es.nodes.discovery'='false',
    'es.input.use.sliced.partitions'='false',
    'es.input.json' = 'false',
    'es.resource' = 'octopus_fetch-*/_doc',
    'es.net.http.auth.user' = 'elastic', 
    'es.net.http.auth.pass' = 'esadmin',
    'es.mapping.names' = 'areaid:areaId,areaname:areaName,boxid:boxId,createtime:createTime,dzqyid:dzQyId,dzqyname:dzQyName,fetchdatatime:fetchDataTime,parentboxid:parentBoxId,qyid:qyId,qyname:qyName,sbszid:sbszId,sheetname:sheetName,systemid:systemId'
);

SELECT * FROM octopus_fetch_test LIMIT 100;

其中,需要注意以下:

  • 外表的结构根据在ES中查询到的映射结构(Mapping)定义的,在ES和Hive中类型的对应关系如下:
ES Type Hive Type
null void
boolean boolean
byte tinyint
short smallint
long bigint
double double
float float
int int
string string / varchar / char
binary binary
date timestamp
map struct / map
array array
  • 需要重点注意的是,由于Hive表字段会自动转小写,而ES中字段是区分大小写的,因此直接查询会出现字段无法匹配的问题而查不到内容,因此需要特别指定es.mapping.names,它的格式是'es.mapping.names':'hivekey:esKey[, ...]'

  • 如果ES采用了加密,需要特别指示'es.net.http.auth.user'='username''es.net.http.auth.user'='password'用于指示用户名和密码。
  • ES中的某个数组可以定义为Hive中的类型是array<struct<name:type,...>>

参考连接