decode是一个字符串函数:
decode(binary bin, string charset) STRING 用charset的编码方式对bin进行解码。
select * from (select distinct branch_code,branch_name from am_cfg_code_organization) t1
避免多层嵌套子查询,尽量拆分成单独的 CTE (with tmp1 as (…), tmp2 as (…)
字符串型字段 nvl(col_1, ‘abc’)
改为 case when coalesce(col_1,’’) = ‘’ then ‘abc’ else col_1
因为在oracle中空字符串和Null,都表现为Null,而在hadoop中这是2个不同的东西,故先
把NUll coalesce成 空字符串, 然后判断等于空时,给默认值 abc
ps: 数值型字段不存在这个问题
col_1 is not null --> coalesce(col_1,'' ) <> ''
表关联使用 join , left join, full join ,拒绝直接通过where条件实现
将字符串变量内容插入表时,需要替处理其中包含的单引号
ORA_DESC2 = ORA_DESC.toString().replace("'", "#");
sqlContext.sql(s"insert into table am_dp_in_log select $AL_OP_SEQ,'$P_I_DATE','$AL_SYS_NO','$AL_POPULATION_TIMESTAMP','$PROC_NAME','$START_TIME','$END_TIME','$ORA_CODE2','$ORA_DESC2','$OP_STEP2','$OP_DESC2'")
使用left [anti] semi join 替换 [not] exists
日期类型处理
经常使用 date_format, unix_timestamp(string date)
Oracle 中的to_date, to_char, date’’ 全部需要转换
函数 | 返回值 | 描述 | 注意 |
---|---|---|---|
to_date(string timestamp) | STRING | 返回时间中的年月日,例如:to_date(“1970-01-01 00:00:00”) = “1970-01-01” | |
year(string date) | INT | 返回指定日期中的年份,范围在1000到9999 | |
quarter(date/timestamp/string) | INT | 返回该date/timestamp所在的季度,如quarter(‘2015-04-01’)=2。 | |
month(string date) | INT | 返回指定时间的月份,范围为1至12月。 | |
day(string date) dayofmonth(date) | INT | 返回指定时间的日期。 | |
hour(string date) | INT | 返回指定时间的小时,范围为0到23。 | |
minute(string date) | INT | 返回指定时间的分钟,范围为0到59 | |
second(string date) | INT | 返回指定时间的秒,范围为0到59 | |
weekofyear(string date) | INT | 返回指定日期所在一年中的星期号,范围为0到53 | |
datediff(string enddate, string startdate) | INT | 两个时间参数的日期之差 | |
date_add(string startdate, int days) | STRING | 给定时间,在此基础上加上指定的时间段 | |
date_sub(string startdate, int days) | STRING | 给定时间,在此基础上减去指定的时间段 | |
current_date | DATE | 返回当前日期,如2016-07-04 | |
current_timestamp | TIMESTAMP | 返回当前时间,如2016-07-04 11:18:11.685 | |
add_months(string start_date, int num_months) | STRING | 返回start_date在num_months个月之后的date | |
last_day(string date) | STRING | 返回date所在月份的最后一天,格式为YYYY-MM-DD,如2015-08-31 | |
next_day(string start_date, string day_of_week) | STRING | 返回start_date之后最接近day_of_week的日期,格式为YYYY-MM-DD,day_of_week为一周内的星期表示(如Monday、FRIDAY | |
trunc(string date, string format) | STRING | 将date按照特定的格式进行清零操作,支持格式为MONTH/MON/MM, YEAR/YYYY/YY,如trunc(‘2015-03-17’, ‘MM’) = 2015-03-01 | |
months_between(date1, date2) | DOUBLE | 返回date1与date2之间的月份差 | |
date_format(date/timestamp/string ts, string fmt) | STRING | 返回date/timestamp/string的格式化输出,格式支持JAVA的SimpleDateFormat格式,如date_format(‘2015-04-08’, ‘y’) = ‘2015’。 | “JAVA的SimpleDateFormat格式: yyyy-MM-dd HH:mm:ss;Oracle 日期to_char,全部转换成这个” |
unix_timestamp(string date) | STRING | 指定日期参数调用UNIX_TIMESTAMP(),它返回“1970-01-01 00:00:00”到指定日期的秒数。 | “时间相减:time_field_A - time_field_B --》 转换成 (unix_timestamp(time_field_A)-unix_timestamp(time_field_B))/(3600 * 24) --> 精确到小数” |
避免使用distinct , 尽量使用group by 去重
distinct 只会生成一个reducer,效率低
instr:只接受2个参数
需要计算从某个位置开始的第一个匹配字符串位置时,需要使用locate
函数 | 返回值 | 描述 | 注意 |
---|---|---|---|
instr(string str, string substr) | INT | 返回substr在str中最早出现的下标。当参数中出现NULL时,返回NULL,但str中不存在substr时返回0,注意下标从1开始。 | 只能从1开始,应用场景有限,可以使用locate |
locate(string substr, string str[, int pos]) | INT | 返回在下标pos之后,substr在str中出现的最小下标 |
insert 后的select 语句中的字段需要与插入的目标表中的名称、类型、顺序保持一致,否则数据插入不会报错,但是Hive查询解析文件时,会出现问题,数据为NULL
使用except 代替 minus,except 需要指定对比的字段名称,不可以使用*
select operbiz,branch,src_sys,sequenceno,al_table,pk_name,pk_value,al_day from al_cvt_deleted_list_0920
except
select operbiz,branch,src_sys,sequenceno,al_table,pk_name,pk_value,al_day from al_cvt_deleted_list
//下面2个等同
ROW_NUMBER() OVER(PARTITION BY TOPACTUALID ORDER BY SOLVEDTIME DESC) --大数据
ROW_NUMBER() OVER(PARTITION BY TOPACTUALID ORDER BY SOLVEDTIME DESC NULLS LAST) --Oracle
函数 | 返回值 | 描述 | 注意 |
---|---|---|---|
regexp_extract(string subject, string pattern, int index) | STRING | 通过下标返回正则表达式指定的部分。regexp_extract(′foothebar′, ′foo(.*?)(bar)′, 2) returns ′bar.′ | |
regexp_replace(string A, string B, string C) | STRING | 字符串A中的B字符被C字符替代 | |
A RLIKE B | BOOLEAN | JAVA的LIKE操作,若A或其子字符串与B相匹配,则返回TRUE,否则返回FALSE。若A或B为NULL,则返回NULL |
正则表达式语法
((?!{“ckResult”).)* :表示非字符串{"ckResult"开头的字符
spark.sql("""with tmp1 as (
select '{"checkNo":"123345666","checkTime":"2031.12.28","companyName":"上海公司","finalSummary":"阳性结果和异常情况\n【1】 您本次体检所检项目未见异常,建议您定期健康体检。\n","gender":"","kindList":[{"checkItemList":[{"ckResult":"157.5","itemName":"身高","unit":"cm"},{"ckResult":"53.0","itemName":"体重","unit":"Kg"},{"ckResult":"21.4","itemName":"体重指数","normalVal":"18.5--23.99"},{"ckResult":"105","itemName":"收缩压","normalVal":"90--139","unit":"mmHg"},{"ckResult":"71","itemName":"舒张压","normalVal":"60--89","unit":"mmHg"}],"doctorName":"刘洋","kindName":"一般项目检查","summary":"未见明显异常"},{"checkItemList":[{"ckResult":"无","itemName":"病史"},{"ckResult":"无特殊","itemName":"家族史"},{"ckResult":"68","itemName":"心率(次/分)","unit":"次/分"},{"ckResult":"齐","itemName":"心律"},{"ckResult":"正常","itemName":"心音"},{"ckResult":"双侧呼吸音未闻及异常","itemName":"肺部听诊"},{"ckResult":"肝脏肋下未触及","itemName":"肝脏触诊"},{"ckResult":"脾脏肋下未触及","itemName":"脾脏触诊"},{"ckResult":"双肾区无叩痛","itemName":"肾脏叩诊"},{"ckResult":"无","itemName":"内科其它"}],"doctorName":"周芷萍","kindName":"内科","summary":"未见明显异常女外科\n"},{"checkItemList":[{"ckResult":"5.6","itemName":"白细胞计数","normalVal":"3.5--9.5","shortCode":"WBC","unit":"10^9/L"},{"ckResult":"4.35","itemName":"红细胞计数","normalVal":"3.8--5.1","shortCode":"RBC","unit":"10^12/L"},{"ckResult":"136.0","itemName":"血红蛋白","normalVal":"115--150","shortCode":"Hb","unit":"g/L"},{"ckResult":"0.41","itemName":"红细胞压积","normalVal":"0.35--0.45","shortCode":"HCT","unit":"L/L"},{"ckResult":"93.6","itemName":"平均红细胞体积","normalVal":"82--100","shortCode":"MCV","unit":"fL"},{"ckResult":"31.3","itemName":"平均红细胞血红蛋白含量","normalVal":"27--34","shortCode":"MCH","unit":"pg"},{"ckResult":"334.0","itemName":"平均红细胞血红蛋白浓度","normalVal":"316--354","shortCode":"MCHC","unit":"g/L"},{"ckResult":"12.0","itemName":"RDW-CV","normalVal":"10.9--15.4"},{"ckResult":"287.0","itemName":"血小板计数","normalVal":"125--350","shortCode":"PLT","unit":"10^9/L"},{"ckResult":"9.7","itemName":"平均血小板体积","normalVal":"7.6--13.6","shortCode":"MPV","unit":"fL"},{"ckResult":"11.6","itemName":"血小板分布宽度","normalVal":"9.0--17","shortCode":"PDW"},{"ckResult":"29.1","itemName":"淋巴细胞百分比","normalVal":"20--50%","shortCode":"LYMPH%"},{"ckResult":"5.3","itemName":"中间细胞百分比","normalVal":"3--10%","shortCode":"MON%"},{"ckResult":"65.6","itemName":"中性粒细胞百分比","normalVal":"40--75%","shortCode":"NEUT%"},{"ckResult":"1.6","itemName":"淋巴细胞绝对值","normalVal":"1.1--3.2","shortCode":"LYMPH","unit":"10^9/L"},{"ckResult":"0.3","itemName":"中间细胞绝对值","normalVal":"0.12--1.0","unit":"10^9/L"},{"ckResult":"3.7","itemName":"中性粒细胞绝对值","normalVal":"1.8--6.3","shortCode":"NEUT","unit":"10^9/L"},{"ckResult":"41.0","itemName":"红细胞分布宽度-标准差","normalVal":"39--53.9","shortCode":"RDW-SD","unit":"g/L"},{"ckResult":"0.235","itemName":"血小板压积","normalVal":"0.108--0.282%","shortCode":"PCT"}],"doctorName":"韩秀杰","kindName":"血常规","summary":"未见明显异常"},{"checkItemList":[{"ckResult":"1.020","itemName":"尿比重","normalVal":"1.010--1.025","shortCode":"SG"},{"ckResult":"6.0","itemName":"尿酸碱度","normalVal":"5.4--8.4","shortCode":"PH"},{"ckResult":"阴性","itemName":"尿白细胞","normalVal":"阴性","shortCode":"LEU","unit":"Cell/uL"},{"ckResult":"阴性","itemName":"尿亚硝酸盐","normalVal":"阴性","shortCode":"NIT"},{"ckResult":"阴性","itemName":"尿蛋白质","normalVal":"阴性","shortCode":"PRO","unit":"g/L"},{"ckResult":"阴性","itemName":"尿糖","normalVal":"阴性","shortCode":"GLU","unit":"mmol/L"},{"ckResult":"阴性","itemName":"尿酮体","normalVal":"阴性","shortCode":"KET","unit":"mmol/L"},{"ckResult":"阴性","itemName":"尿胆原","normalVal":"阴性","shortCode":"URO","unit":"umol/L"},{"ckResult":"阴性","itemName":"尿胆红素","normalVal":"阴性","shortCode":"BIL","unit":"umol/L"},{"ckResult":"阴性","itemName":"尿隐血","normalVal":"阴性","shortCode":"BLD","unit":"Cell/uL"}],"doctorName":"韩秀杰","kindName":"尿常规","summary":"未见明显异常"},{"checkItemList":[{"ckResult":"9","itemName":"丙氨酸氨基转移酶","normalVal":"0.00--40","shortCode":"ALT","unit":"U/L"},{"ckResult":"16","itemName":"天门冬氨酸氨基转移酶","normalVal":"0--40","shortCode":"AST","unit":"U/L"},{"ckResult":"4.28","itemName":"尿素","normalVal":"1.7--8.3","shortCode":"UREA","unit":"mmol/L"},{"ckResult":"4.72","itemName":"空腹血葡萄糖","normalVal":"3.90--6.10","shortCode":"FBG","unit":"mmol/L"},{"ckResult":"5.13","itemName":"总胆固醇","normalVal":"3.10--5.69","shortCode":"TC","unit":"mmol/L"},{"ckResult":"0.59","itemName":"甘油三酯","normalVal":"0.4--1.53","shortCode":"TG","unit":"mmol/L"},{"ckResult":"7.69","itemName":"甲胎蛋白定量","normalVal":"0--20","shortCode":"AFP","unit":"ng/ml"},{"ckResult":"1.10","itemName":"癌胚抗原定量","normalVal":"0--5","shortCode":"CEA","unit":"ng/ml"}],"doctorName":"韩秀杰、赵洁","kindName":"实验室检查","summary":"未见明显异常"},{"checkItemList":[{"ckResult":"未见明显异常","itemName":"肝"},{"ckResult":"未见明显异常","itemName":"胆"},{"ckResult":"未见明显异常","itemName":"胰"},{"ckResult":"未见明显异常","itemName":"脾"},{"ckResult":"未见明显异常","itemName":"双肾"}],"doctorName":"蒋海燕","kindName":"超声检查室","summary":"未见明显异常心电图\n"}],"medicalBranchOrg":"爱康国宾","medicalOrgName":"爱康国宾","name":"徐晓莉"}
'
text_info
)
select regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"(中性粒细胞数|中性粒细胞值|中性粒细胞绝对值|中性粒细胞计数|嗜中性粒细胞绝对值)([^"]*)"([^}]*)}',1) na --中性粒细胞绝对值
,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"红细胞计数"([^}]*)}',1) rbc --红细胞计数
,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"血红蛋白"([^}]*)}',1) hb --血红蛋白
,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"血小板计数"([^}]*)}',1) plat --血小板
,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"尿隐血"([^}]*)}',1) ubc --红细胞镜检(尿隐血)
,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"尿蛋白质"([^}]*)}',1) tup --尿蛋白
,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"(尿液管型|尿管型)([^"]*)"([^}]*)}',1) bc --尿管型
,regexp_extract(b.text_info,'"kindName":"宫颈刮片([^"]*)"([^}]*)"summary":"([^"]*)"([^}]*)}',3) css --宫颈刮片
,regexp_extract(b.text_info,'"kindName":"内科([^"]*)"([^}]*)"summary":"([^"]*)"([^}]*)}',3) iea --内科检查异常+B39:E48
,regexp_extract(b.text_info,'"kindName":"外科([^"]*)"([^}]*)"summary":"([^"]*)"([^}]*)}',3) sea --外科检查异常
from tmp1 b
""").show()
spark.sql("""
with tmp1 as (
select 1 id,78 score_english, 88 score_math, 97 score_chinese
union all
select 2 id,48 score_english, 83 score_math, 67 score_chinese
union all
select 3 id,58 score_english, 89 score_math, 85 score_chinese
)
select id,adid,ad from tmp1
lateral view explode(map('english',score_english,'math',score_math,'chinese',score_chinese)) adtable as adid,ad
""").show
spark.sql("""
with tmp1 as (
select 1 id
)
select id,t1 from tmp1
lateral view explode(array(1,2,3)) t as t1
""").show
select
diagnose_code
,diagnose
,accident_ind
,disease_ind
,risk_type
,adid
,ad
from icd10_standard_disease_info_mapping
-- 列转行
lateral view explode (map(
'D001',D001,'D002',D002,'D003',D003,'D004',D004,'D005',D005,'D006',D006,'D007',D007,'D008',D008,'D009',D009,'D010',D010
,'D011',D011,'D012',D012,'D013',D013,'D014',D014,'D015',D015,'D016',D016,'D017',D017,'D018',D018,'D019',D019,'D020',D020
,'D021',D021,'D022',D022,'D023',D023,'D024',D024,'D025',D025,'D026',D026,'D027',D027,'D028',D028,'D029',D029,'D030',D030
,'D031',D031,'D032',D032,'D033',D033,'D034',D034,'D035',D035,'D036',D036,'D037',D037,'D038',D038,'D039',D039,'D040',D040
,'D041',D041,'D042',D042,'D043',D043,'D044',D044,'D045',D045,'D046',D046,'D047',D047,'D048',D048,'D049',D049,'D050',D050
,'D051',D051,'D052',D052,'D053',D053,'D054',D054,'D055',D055,'D056',D056,'D057',D057,'D058',D058,'D059',D059,'D060',D060
,'D061',D061,'D062',D062,'D063',D063,'D064',D064,'D065',D065,'D066',D066,'D067',D067,'D068',D068,'D069',D069,'D070',D070
,'D071',D071,'D072',D072,'D073',D073,'D074',D074,'D075',D075,'D076',D076,'D077',D077,'D078',D078,'D079',D079,'D080',D080
,'D081',D081,'D082',D082,'D083',D083,'D084',D084,'D085',D085,'D086',D086,'D087',D087,'D088',D088,'D089',D089,'D090',D090
,'D091',D091,'D092',D092,'D093',D093,'D094',D094,'D095',D095,'D096',D096,'D097',D097,'D098',D098,'D099',D099,'D100',D100
,'M001',M001,'M002',M002,'M003',M003,'M004',M004,'M005',M005,'M006',M006,'M007',M007,'M008',M008,'M009',M009,'M010',M010
,'M011',M011,'M012',M012,'M013',M013,'M014',M014,'M015',M015,'M016',M016,'M017',M017,'M018',M018,'M019',M019,'M020',M020
,'M021',M021,'M022',M022,'M023',M023,'M024',M024)) adTable AS adid,ad
where pt = '${bizdate}000000'
and ad=1
查看Parquet 表时提示找不到Parquet文件
验证:1.insert…select 语句中是否有重名列;
2.write parquet 前打印schema与目标表结构比对
3.目标表的路径需要删除重新建立
hadoop dfs -rm -r /hdfs/table
hadoop dfs -mkdir /hdfs/table
设置自增序列
row_number() over():什么参数都不加则生成一个从1开始的自增字段
insert into table User_Attribute select row_number() over() as id,customid from tbl_custom;
若下次操作需要在上次自增值的基础上继续自增,实现序列的效果,可以先计算出原有的number: max(id),join on (1=1)关联上去
insert into table User_Attribute
select row_number() over() + id_max as id,customid
from tbl_custom a
join (select max(id) id_max from User_Attribute) b on (1 = 1);