Wednesday, 27 December 2017

Hive table showing stale count after deletion of records

If you have deleted the data files of a Hive external table directly from HDFS, the table is not partitioned, and count(*) still returns the old row count, then use the properties below to refresh the statistics in the metastore.

set hive.compute.query.using.stats=false;
ANALYZE TABLE table_name COMPUTE STATISTICS;


If your table is partitioned, use the command below to analyze a specific partition.

 ANALYZE TABLE table_name partition(id='12',name='xyz') COMPUTE STATISTICS;
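For reference, a minimal end-to-end sketch on a hypothetical non-partitioned external table sales_ext (the table name is illustrative): after its files are deleted from HDFS, count(*) can still be answered from stale statistics until they are recomputed.

set hive.compute.query.using.stats=false;
ANALYZE TABLE sales_ext COMPUTE STATISTICS;
select count(*) from sales_ext;   -- now reflects the files actually present on HDFS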

Sunday, 10 December 2017

Sqoop import of an Oracle table into Hive, partition by partition

#!/bin/bash
# Usage: script.sh <table_name> <oracle_user_id> <oracle_password>
# Loops over the Oracle partition names listed in abc_list.txt and imports each one.
#export path=/xx/xx/
Table_Name=$1
echo $Table_Name
user_id=$2
echo $user_id
password=$3
#cd $path
for i in `cat /xx/xx/abc_list.txt`
do
echo "Importing partition ${i} of ${Table_Name}"
sqoop import --connect 'jdbc:oracle:thin:@(DESCRIPTION =(LOAD_BALANCE = yes)(SDU = 32768) (ADDRESS =(PROTOCOL = TCP) (HOST = xxxx.com)  (PORT = 1526)   ) (CONNECT_DATA =      (SERVICE_NAME = xxx)))' --username "${user_id}" --password "${password}" --query "select * from ${Table_Name} partition(${i}) where \$CONDITIONS" --target-dir /sales/channel/reporting/${Table_Name}/data/${i} --fields-terminated-by '\034' -m 1
if [ $? -eq 0 ]
then
echo "Partition ${i} has been processed successfully."
else
echo "Partition ${i} has not been processed successfully."
echo "Please check the issue."
fi
done

Sqoop import with a free-form query


sqoop import  --connect 'jdbc:oracle:thin:@(DESCRIPTION =
    (enable = broken)
    (ADDRESS =
      (PROTOCOL = TCP)
      (HOST = hostname.xx.xxcorp.net)
      (PORT = 1526)
    )
    (ADDRESS =
      (PROTOCOL = TCP)
      (HOST = hostname.xx.xxcorp.net)
      (PORT = 1526)
    )
    (CONNECT_DATA =
      (SERVICE_NAME = HPIPODSPC)
    )
  )' --username 'xx' --password 'xxx#ahu1vn' --query "select * from db_name.table_name partition(SYS_P297)  where \$CONDITIONS" --fields-terminated-by  '\034'   --target-dir /sales/channel/reporting/folder_name/data/partition_col=SYS_P297  -m 1

Thursday, 2 November 2017

Sqoop import Error with Oracle

sqoop import   --connect 'jdbc:oracle:thin:@ip_add/dbname' --username 'xxx' --password 'xxxx' --table dbname.table_name  --target-dir /xx/xx/Talend/test/test2    -m 1


 Error:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/11/02 14:38:25 ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: Invalid Oracle URL specified
java.sql.SQLException: Invalid Oracle URL specified
        at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)
        at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:133)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:199)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:263)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:271)
     

Solution: In this case we have to add "thin" to the JDBC connection string.
Example: --connect 'jdbc:oracle:thin:@ip_add/dbname'

Again Error:

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/11/02 14:40:55 ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: The Network Adapter could not establish the connection
java.sql.SQLException: The Network Adapter could not establish the connection
        at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)
        at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:133)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:199)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:480)


Solution:
Instead of giving the URL and database name, replace the whole connection string with the TNSNAMES entry of that Oracle database:

Example:

Oracle tnsname entry:

xxx =
  (DESCRIPTION =
    (LOAD_BALANCE = yes)
    (SDU = 32768)
    (ADDRESS =
      (PROTOCOL = TCP)
      (HOST = ip_address)
      (PORT = 1526)
    )
    (CONNECT_DATA =
      (SERVICE_NAME = nnnnnDSI)
    )
  )

Now make the corresponding change in the sqoop command.


sqoop import   --connect 'jdbc:oracle:thin:@(DESCRIPTION =(LOAD_BALANCE = yes)(SDU = 32768) (ADDRESS =(PROTOCOL = TCP) (HOST = ip_addr)  (PORT = 1526)   ) (CONNECT_DATA =      (SERVICE_NAME = HPIPODSI)))' --username 'xxx' --password 'xxxx' --table dbname.table  --target-dir /apps/hive/warehouse/test001.db --escaped-by '\"' --enclosed-by '\"'  --fields-terminated-by '|'  -m 1

Again you may get an error:

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/11/02 14:45:33 INFO manager.OracleManager: Time zone has been set to GMT
17/11/02 14:45:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "AMSPODS_PUB"."end_cust_id" t WHERE 1=0
17/11/02 14:45:33 ERROR manager.SqlManager: Error executing statement: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist

java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist

        at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:91)
        at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:133)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:206)
Solution:

Use uppercase for the Oracle username, database name, and table name.

Example :

sqoop import   --connect 'jdbc:oracle:thin:@(DESCRIPTION =(LOAD_BALANCE = yes)(SDU = 32768) (ADDRESS =(PROTOCOL = TCP) (HOST = ip_address)  (PORT = 1526)   ) (CONNECT_DATA =      (SERVICE_NAME = xx)))' --username 'xx' --password '#' --table dbname.END_CUST_ID  --target-dir /apps/hive/warehouse/test001.db --escaped-by '\"' --enclosed-by '\"'  --fields-terminated-by '|'  -m 1

Now the data has been imported successfully.



Sqoop importing data

Link
http://hortonworks.com/hadoop-tutorial/loading-data-into-the-hortonworks-sandbox/
http://thinkbiganalytics.com/hadoop_nosql_services/freestone-framework/
CREATE USER 'user1'@'localhost' IDENTIFIED BY 'pass1';

sqoop import --connect "jdbc:mysql://localhost/test" --username "root" --table "t1" --target-dir input1 --m 1

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --hive-import -m 1

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --hive-table bigdb.t12 -m 1

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --warehouse-dir /user/hive/warehouse/bigdb --m 1

----- Import using a query -----

bin/sqoop import --connect jdbc:teradata://dnedwt.edc.cingular.net/edwdb --driver com.teradata.jdbc.TeraDriver --username dy662t --password Dhiru+12 --query "select UserName, AccountName, UserOrProfile from DBC.AccountInfoV where \$CONDITIONS" --hive-import --hive-table tada_ipub.TMP_billing --split-by UserName --target-dir /user/hive/warehouse/test_tmp1 -verbose

sqoop export --connect "jdbc:mysql://localhost/db2" --username "root" --table "t5" --export-dir /user/hive/warehouse/t5

sqoop export --connect "jdbc:mysql://localhost/mydb" --username "root" --table "total_gender" --export-dir /user/hive/warehouse/gender

sqoop export --connect "jdbc:mysql://localhost/mydb" --username "root" --password 'root123!' --table "gender" --export-dir /user/hive/warehouse/bigdb.db/user_gender --input-fields-terminated-by ',' -verbose -m 1


--- Incremental load with sqoop ---
sqoop import --connect "jdbc:mysql://localhost/test" --username "root" --table "city5" --hive-import --check-column upd_dt --incremental lastmodified --last-value 2015-05-26


Sqoop the data directly into an external table
-----------------

(1)    Create an external table in Hive (a sample DDL sketch follows after these steps).
Example - dist_daily_temp2
(2)    Run the Sqoop command and keep the "--target-dir" path the same as the location of the external table.
sqoop import   --connect 'jdbc:oracle:thin:@(DESCRIPTION =(LOAD_BALANCE = yes)(SDU = 32768) (ADDRESS =(PROTOCOL = TCP) (HOST = ipadddress)  (PORT = 1526)   ) (CONNECT_DATA =      (SERVICE_NAME = xxxI)))' --username 'AMSPODS_PUB' --password 'xxx' --table PUB_FCT_DSTRB_SLS_DAILY  --fields-terminated-by  '\034'  --target-dir /sales/channel/Talend/test/test2  -m 1

(3)    Keep all columns of the sample table as string so the data loads cleanly from Oracle; the data types can be changed later.
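A minimal sketch of such a DDL, assuming hypothetical column names; all columns are string, the field delimiter matches the '\034' used in the Sqoop command, and the location matches its --target-dir:

create external table dist_daily_temp2 (
  col1 string,
  col2 string,
  col3 string
)
row format delimited fields terminated by '\034'
stored as textfile
location '/sales/channel/Talend/test/test2';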

Tuesday, 24 October 2017

Update Hive table using CASE clause

I have taken two tables, "temp_add_1" and "temp_add_2".

Table temp_add_1 has 11 records and temp_add_2 has 13 records (new records and updated columns).

Records of table temp_add_1 (old records):


5000000196      KAPLAN UNIVERSITY       332 FRONT ST S  STE 501 LA CROSSE       WI      54601
5000000206      GLATFELTER      475 S PAINT ST  STOREROOM RECEIVING     CHILLICOTHE     OH      45601
5000000209      METHODIST HEALTH SYSTEM 1441 N BECKLEY AVE              DALLAS  TX      75203
5000000220      WINDWOOD FARM   4857 WINDWOOD FARM RD           AWENDAW SC      294295951
5000239439      D & H DISTRIBUTING      185 COWETA INDUSTRIAL PAR               NEWNAN  GA      30265   US

Records of table temp_add_2 (new and updated rows):

5000000209      METHODIST HEALTH SYSTEM 1441 N BECKLEY AVE              DALLAS  TX      75203
5000000220      WINDWOOD FARM   4857 WINDWOOD FARM RD           AWENDAW SC      294295951
5000000281      MARCO ST CLOUD OFFICE   4510 HEATHERWOOD RD             ST CLOUD        MN      56301
5000000294      RJ REYNOLDS TOBACCO     1001 REYNOLDS BLVD      BLDG 603-5 DOCK 21      WINSTON SALEM   NC      27105
5000239439      D & H DISTRIBUTING      185 COWETA INDUSTRIAL PARKWAY           NEWNAN  GA      30265   US


Use the command below to update the records in temp_add_1.

I am using a CASE expression with a FULL OUTER JOIN.

insert overwrite table sales_channel.temp_add_1
select 
case 
    when t1.address_identifier = t2.address_identifier then t1.address_identifier else t2.address_identifier end as address_identifier,
case
    when t1.customer_name = t2.customer_name then t1.customer_name else t2.customer_name end as customer_name,
case
    when t1.address1_description = t2.address1_description then t1.address1_description else t2.address1_description end as address1_description,
case
    when t1.address2_description = t2.address2_description then t1.address2_description else t2.address1_description end as address2_description,
case
    when t1.city_name = t2.city_name then t1.city_name else t2.city_name end as city_name,
case
    when t1.state_name = t2.state_name then t1.state_name else t2.state_name end as state_name,
case
    when t1.zip_code = t2.zip_code then t1.zip_code else t2.zip_code end as zip_code,
case
    when t1.iso_country_code = t2.iso_country_code then t1.iso_country_code else t2.iso_country_code end as iso_country_code
from sales_channel.temp_add_1 t1 FULL OUTER JOIN sales_channel.temp_add_2 t2
ON t1.address_identifier = t2.address_identifier;

Result:-


OK

5000000209      METHODIST HEALTH SYSTEM 1441 N BECKLEY AVE              DALLAS  TX      75203
5000000220      WINDWOOD FARM   4857 WINDWOOD FARM RD           AWENDAW SC      294295951
5000000281      MARCO ST CLOUD OFFICE   4510 HEATHERWOOD RD     4510 HEATHERWOOD RD     ST CLOUD        MN      56301
5000000294      RJ REYNOLDS TOBACCO     1001 REYNOLDS BLVD      1001 REYNOLDS BLVD      WINSTON SALEM   NC      27105
5000239439      D & H DISTRIBUTING      185 COWETA INDUSTRIAL PARKWAY           NEWNAN  GA      30265   US

Thursday, 31 August 2017

Hive commands

If you want to list all tables containing a particular string within a database, use the command below.

hive > show tables like '*tablename*';

Similarly for databases:

hive > show databases like '*dbname*';

Remove duplicate (redundancy)

Case: We have multiple columns in a table, but no single column or combination of columns provides a unique row.

Solution: In that case you can use the ROW_NUMBER function, including ROW_NUMBER with OVER (PARTITION BY ...).


Step 1: First generate a unique number for each row in a subquery with ROW_NUMBER() OVER() AS unique_number.

Step 2: Then apply ROW_NUMBER() OVER (PARTITION BY unique_number, table_column1, table_column2).

After step 2, keeping only the rows where this second row number equals 1 gives you unique rows.
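The point of the pattern is to collapse duplicates introduced by a later join back to one row per original record. A minimal sketch with hypothetical tables src and lkp and hypothetical column names (the full production query follows in the example below):

select dedup.*
from
  (select s.*, l.extra_col,
          ROW_NUMBER() OVER (PARTITION BY s.rowid) as rn   -- one row kept per original source row
   from
     (select src.*, ROW_NUMBER() OVER() as rowid           -- step 1: unique number per source row
      from src) s
   LEFT OUTER JOIN lkp l
   ON s.key_col = l.key_col
  ) dedup
where dedup.rn = 1;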

Example:

select pos4.*,
              if(pos4.Partner_Tier_Indicator_Code='T1','sell-thru','sell-to') as partner_selling_motion_measure_name,
              if(pos4.Partner_Tier_Indicator_Code='T1','sold to','sold to R2R') as address_type_name
        from
                  (SELECT pos3.*,
                             ROW_NUMBER() OVER (PARTITION BY pos3.rowid,pos3.partner_sales_transaction_date order by pos3.partner_sales_transaction_date  desc ) as rownum
                           FROM
                                 (SELECT
                                       pos2.*,
                                       if((ph.reporting_source_partner_level_3_identifier <> NULL or ph.reporting_source_partner_level_3_identifier <> ''),
                                       ph.reporting_source_partner_level_3_identifier,pos2.partner_sold_to_matched_siebel_row_identifier) as HQ_partner2
                                  FROM
                                      (SELECT  pos1.*,ROW_NUMBER()  over() as rowid
                                            FROM
                                                (SELECT
                                                  pos.*,
                                                  if((ph.reporting_source_partner_level_3_identifier <> NULL or ph.reporting_source_partner_level_3_identifier <> ''),
                                                  ph.reporting_source_partner_level_3_identifier,pos.reporting_partner_siebel_row_identifier) as HQ_partner
                                                FROM xx.pos_weekly_temp pos --xx.Fact_Channel_Point_Of_Sale_Weekly
                                                      LEFT OUTER JOIN  xx.dim_partner_hierarchy  ph
                                                      ON pos.reporting_partner_siebel_row_identifier=ph.reporting_partner_identifier
                                                      WHERE pos.Partner_Tier_Indicator_Code='T1'
                                                      AND   pos.region_code='AMER'
                                                )  pos1
                                  ) pos2
                                        LEFT OUTER JOIN  xx.dim_partner_hierarchy  ph
                                             ON pos2.partner_sold_to_matched_siebel_row_identifier=ph.reporting_partner_identifier
                                             AND pos2.region_code = ph.region_code
                     )  pos3
                            LEFT OUTER JOIN  xx.PAS_REPORTING_PARTNER_REFERENCE_temp rpr
                            ON pos3.HQ_partner2 = rpr.reporting_partner_siebel_row_identifier
                            AND pos3.channel_sub_segment_code = rpr.channel_sub_segment_code
                          ) pos4
                             where pos4.rownum = 1;

Thursday, 11 May 2017

Alter partition table

If you have a partitioned table and you add a new column with the ALTER command, the value of that column may appear as NULL. In that case you should use the CASCADE clause at the end of the ALTER command so that the change is applied to the metadata of the existing partitions as well.

alter table pos_all_flgs_dh3 add columns(dheeren2 string) CASCADE;

INSERT
INTO pos_all_flgs_dh3 partition
  (
    Region_Code,
    Partner_Tier_Indicator_Code,
    Fiscal_Year_Week_Code
  )
SELECT DISTINCT
  p1.Channel_Sub_Segment_Identifier,
  p1.Source_File_Name,
  if((p1.cross_source_sale_flag <> p2.cross_source_sale_flag),'dheerentest','dheerentest') as dheeren2,
  p1.Region_Code,
  p1.Partner_Tier_Indicator_Code,
  p1.Fiscal_Year_Week_Code
FROM
 pos_4_flg_dh p1 LEFT OUTER JOIN pos_2_flg_dh p2
 ON p1.Partner_Sales_Transaction_Identifier= p2.Partner_Sales_Transaction_Identifier limit 1;
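For reference, a minimal sketch of the CASCADE difference on a hypothetical table part_demo (all names are illustrative):

create table part_demo (a string) partitioned by (dt string);
-- without CASCADE the new column is added only to the table-level schema,
-- so it can read back as NULL for partitions that already existed:
alter table part_demo add columns (b string);
-- with CASCADE the column change is also pushed to the metadata of every partition:
alter table part_demo add columns (c string) CASCADE;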

Thursday, 27 April 2017

ORC format

hive> create external  table test_pos_ext2(
    > trxno string,
    > trx_dt string,
    > reporter_id  string,
    > buyer_id string
    > ) row format delimited fields terminated by ','
    > LOCATION '/gcw/testing/pos_test_ext/test';
OK
Time taken: 0.153 seconds
hive> select * from test_pos_ext2;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file hdfs://hdpdevnn/gcw/testing/pos_test_ext/test/pos3.txt. Invalid postscript.


Solution:

Add STORED AS TEXTFILE to the table definition while creating it.

hive> create external  table test_pos_ext2(
    > trxno string,
    > trx_dt string,
    > reporter_id  string,
    > buyer_id string
    > ) row format delimited fields terminated by ','
    > STORED AS TEXTFILE
    > LOCATION '/gcw/testing/pos_test_ext/test';

hive> select * from test_pos_ext2;
OK
9106956188      3/7/2017        3-HWJW-516      3-2SS-2763
9106956189      3/7/2017        3-HWJW-516      3-2SS-2763
Time taken: 0.12 seconds, Fetched: 2 row(s)
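The likely cause here is that hive.default.fileformat is set to ORC on the cluster, so a table created without an explicit STORED AS clause expects ORC files and fails on plain text. Assuming the standard Hive CLI behavior, the current default can be checked with:

set hive.default.fileformat;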

Sunday, 2 April 2017

Copy Partition table

Here is how to copy an existing partitioned table in Hive from one cluster to another, or from one database to another database on the same cluster.

Suppose you have table t1 in database testdb and you have loaded data into it from a local directory.

create table testdb.t1(a string, b string) row format delimited fields terminated by ',';
load data local inpath '/home/dheerendra/working/Data/data1' overwrite into table testdb.t1;

Insert the data into a partitioned table tp.

set hive.exec.dynamic.partition.mode=nonstrict;
create table testdb.tp(a string) partitioned by (b string);
insert into table testdb.tp partition(b) select a,b from testdb.t1;

Copy the data from one cluster to another using the distcp command. In my case the source and destination are on the same cluster.

hadoop distcp   /user/hive/warehouse/testdb.db/tp /user/hive/warehouse/db1.db/

When the clusters are different:

hadoop distcp   hdfs://cluster1/user/hive/warehouse/testdb.db/tp  hdfs://cluster2/user/hive/warehouse/db1.db/


Create the same table structure in the destination database.

create table db1.tp(a string) partitioned by(b string);


Use the MSCK command to register the partitions with the metastore.


msck repair table tp;

If you get an error like:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
then take the steps below:

set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
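If MSCK still refuses a particular path, the partitions can also be registered explicitly with ADD PARTITION; a hedged sketch with a hypothetical partition value:

ALTER TABLE db1.tp ADD IF NOT EXISTS PARTITION (b='some_value')
LOCATION '/user/hive/warehouse/db1.db/tp/b=some_value';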

Friday, 3 March 2017

Hive output to csv format

Move hive output to csv format

 nohup hive -f /home/dheerendra.a.yadav/query/s3.q | sed 's/[\t]/,/g' > /home/dheerendra.a.yadav/query/s6.csv &
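An alternative to piping through sed, assuming Hive 0.11 or later, is to let Hive write the delimited output itself with INSERT OVERWRITE LOCAL DIRECTORY (the directory path and query here are illustrative):

INSERT OVERWRITE LOCAL DIRECTORY '/home/dheerendra.a.yadav/query/s6_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM some_table;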

Friday, 24 February 2017

Transpose column to row

Transpose or Pivot table or convert column to row


create table twit_seq STORED AS ORC tblproperties ("orc.compress"="NONE") as
select t1.consumer_id,t1.text,t1.rk
from
(SELECT consumer_id,text,postcreationtime1,rank() OVER (partition by consumer_id ORDER BY postcreationtime1 desc ) as rk,
 row_number() over() as cnt
FROM dunami_v1)t1  where t1.rk < 6;


create table customer_tweet_info_dnami STORED AS ORC tblproperties ("orc.compress"="NONE") as
SELECT consumer_id ,
   max(CASE WHEN rk = 1 THEN text else NULL END) AS Tweet_1,
   max(CASE WHEN rk = 2 THEN text else NULL END) AS Tweet_2,
   max(CASE WHEN rk = 3 THEN text else NULL END) AS Tweet_3,
   max(CASE WHEN rk = 4 THEN text else NULL END) AS Tweet_4,
   max(CASE WHEN rk = 5 THEN text else NULL END) AS Tweet_5
FROM twit_seq  group by consumer_id;

week and rank function

Calculate weeks

select abs(floor(datediff(timestamp(from_unixtime(UNIX_TIMESTAMP(trim('01/04/16 00:00'),'MM/dd/yy hh:mm'))),postcreationtime1)/7)) as wk
from dunami_twitter_visual_data_v1 limit 100;


Use rank function to get value in descending order

create table twit_seq STORED AS ORC tblproperties ("orc.compress"="NONE") as
select t1.consumer_id,t1.text
from
(SELECT consumer_id,text,postcreationtime1,rank() OVER (partition by consumer_id ORDER BY postcreationtime1 desc ) as rk
FROM dunami_v1)t1  where t1.rk < 6;

Monday, 13 February 2017

Hive Optimization properties

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set mapreduce.job.maps=1000;
set mapreduce.job.reduces=1000;
 SET hive.exec.parallel=true;
-----------------------------------------------------------

set hive.vectorized.groupby.maxentries=10240;
set hive.vectorized.groupby.flush.percent=0.1;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cli.print.header = true;


Working with Tez you may hit an issue like tez.TezProcessor: java.lang.OutOfMemoryError: Java heap space.

Then set the properties below at the query level so that the container size increases for that particular query.

SET hive.tez.java.opts=-Xmx8192m -Xms4096m

SET hive.tez.container.size=4026

Go through link -
https://discuss.pivotal.io/hc/en-us/articles/226239948-Hive-query-with-TEZ-engine-failed-with-exception-java-lang-OutOfMemoryError-Java-heap-space


If a Hive query's output generates empty files along with the data files, use the properties below.
set hive.merge.tezfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.orcfile.stripe.level=true;

=================================================

Optimization for ORC files

set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.reduce.groupby.enabled = true;
set hive.exec.parallel=true;
SET hive.tez.java.opts=-Xmx8192m -Xms4096m;
SET hive.tez.container.size=8192;
set hive.exec.parallel=true;

Sunday, 5 February 2017

Copy data from S3 to hdfs

hadoop distcp s3n://<access key>:<secret key>@bucket/testhadoop/*/* /user/hadoop_user/test/

Date format in hive

Examples -  5/11/2015       12/31/2014      5/11/2015

select optin_date,cell_phone_date,optin_date_stamp from datetest;

select
to_date(from_unixtime(UNIX_TIMESTAMP(trim(Optin_Date_Stamp), 'MM/dd/yyyy'))),
to_date(from_unixtime(UNIX_TIMESTAMP(trim(cell_phone_date), 'MM/dd/yyyy'))) ,
to_date(from_unixtime(UNIX_TIMESTAMP(trim(optin_date_stamp), 'MM/dd/yyyy')))
FROM datetest;

----------------------------------------------------------------------------------------------------------
/*
3/10/2015  3:04:05 AM
4/11/2015  4:05:06 AM
5/13/2015  5:06:07 AM
5/31/2016  5:18:19 PM
4/29/2016  4:17:18 PM
10/20/2015  10:11:12 AM
*/

select timestamp(from_unixtime(UNIX_TIMESTAMP(trim(a), 'MM/dd/yyyy hh:mm:ss'))) FROM datetest2;

---------------------------------------------------------------------------------------------------------
2017-01-03T23:55:33.000Z

select timestamp(from_unixtime(unix_timestamp(regexp_replace(concat(concat(concat(substr(standardtimestamp,1,10),substr(standardtimestamp,12,2)),
substr(standardtimestamp,15,2)),substr(standardtimestamp,18,2)),'-',''),'yyyymmddhhmmss')))  as  standardtimestamp from kdf_sample;

create table datetest(date1 string,date2 string,date3 string,date4 string) row format delimited fields terminated by ','
location '/user/dheerendra.a.yadav/testdata1/';

select
  to_date(from_unixtime(UNIX_TIMESTAMP(trim(date1),'MM/dd/yyyy'))) as date1,
  to_date(from_unixtime(UNIX_TIMESTAMP(trim(date2),'MM/dd/yyyy'))) as date2,
  timestamp(from_unixtime(UNIX_TIMESTAMP(trim(date3),'MM/dd/yyyy hh:mm:ss'))) as date3,
   timestamp(from_unixtime(unix_timestamp(regexp_replace(concat(concat(concat(substr(date4,1,10),substr(date4,12,2)),
   substr(date4,15,2)),substr(date4,18,2)),'-',''),'yyyyMMddhhmmss')))  as  date4
from datetest;

select standardtimestamp,timestamp(from_unixtime(unix_timestamp(regexp_replace(concat(concat(concat(substr(standardtimestamp,1,10),substr(standardtimestamp,12,2)),
substr(standardtimestamp,15,2)),substr(standardtimestamp,18,2)),'-',''),'yyyyMMddhhmmssss')))  as  standardtimestamp from test limit 2



Compare dates when both values are already in date format:


select
sum(case when price_change_date  > from_unixtime(unix_timestamp('2016-12-31' ,'yyyy-MM-dd'), 'yyyy-MM-dd') then 1 else 0 end)  as price_change_date,
sum(case when listing_date  > from_unixtime(unix_timestamp('2016-12-31' ,'yyyy-MM-dd'), 'yyyy-MM-dd') then 1 else 0 end)  as listing_date,
sum(case when reduced_date  > from_unixtime(unix_timestamp('2016-12-31' ,'yyyy-MM-dd'), 'yyyy-MM-dd') then 1 else 0 end)  as reduced_date,
sum(case when distressed_date  > from_unixtime(unix_timestamp('2016-12-31' ,'yyyy-MM-dd'), 'yyyy-MM-dd') then 1 else 0 end)  as distressed_date
from fmdb.Interlinked_FM_DS_UPDT_V3;


When the date is a timestamp and you want to compare it with a date in string format, use the format below.


select from_unixtime(UNIX_TIMESTAMP(ins_gmt_ts,'yyyy-MM-dd'),'yyyyMMdd') as date1 from dbname_table_name_base limit 1;

========================================================

When you have an integer or decimal value and you need to convert it into date format:

(1) Cast the integer or decimal value to string using the CAST function.
(2) Convert it to a unix timestamp with the matching pattern, e.g. 'YYYYMMDD'.
(3) Convert with from_unixtime().
(4) Convert with date_format() into the required format, e.g. 'yyyy-MM-dd'.

Example:

 Date in decimal -  birth_dt = 1956324.0
 
Solution:

SELECT date_format(from_unixtime(unix_timestamp(cast(birth_date as string),'YYYYMMDD')),'yyyy-MM-dd') as birth_dt from dummy

Result:
birth_dt = 1956-01-01
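Note that these patterns follow Java's SimpleDateFormat, where the lowercase form 'yyyyMMdd' means year, month and day of month (uppercase 'DD' is day-of-year and 'YYYY' is week-year). A hedged sketch with a hypothetical 8-digit value:

SELECT date_format(from_unixtime(unix_timestamp(cast(19560324 as string), 'yyyyMMdd')), 'yyyy-MM-dd');
-- returns 1956-03-24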


Wednesday, 1 February 2017

Json file parsing

{"shortIp":"38.126.127","os":"Windows","refDomain":"www.cnn.com","standardTimestamp":"2016-12-15T19:58:45.000Z","userAgent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0","deviceType":"Personal computer","estid":"CsY3SlcrcMZQT1Uvez0HAg==","browserFamily":"Firefox","mappedEvent":"pview","url":"http://money.cnn.com/2016/12/15/technology/apple-class-action-meal-breaks/index.html?iid=ob_homepage_tech_pool","iabCategories":[{"levelOne":"personal_finance","levelTwo":"financial_news","score":"medium"}],"ip":"38.126.127.126"}

drop table sdf;
create table sdf(
shortIp string
,os string
,refDomain string
,standardTimestamp string
,userAgent string
,deviceType string
,estid string
,browserFamily string
,mappedEvent string
,url string
,iabCategories array<struct<levelOne:string,levelTwo:string,score:string>>
,ip string)ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
-------------------------------------------------------------------------------

drop table ext_sdf;
create external table ext_sdf(
shortIp string
,os string
,refDomain string
,standardTimestamp string
,userAgent string
,deviceType string
,estid string
,browserFamily string
,mappedEvent string
,url string
,iabCategories array<struct<levelOne:string,levelTwo:string,score:string>>
,ip string)ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'hdfs://127.0.0.1:9000/sdf';



select shortip
,os,useragent
,refDomain
,standardTimestamp
,userAgent
,deviceType
,estid
,browserFamily
,mappedEvent
,url
,iabcategories.levelone[0]
,iabcategories.levelTwo[0]
,iabcategories.score[0]
,ip
from ext_sdf;
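If each element of the iabCategories array is needed as its own row rather than indexing position [0], a LATERAL VIEW with explode() can flatten it; a minimal sketch against the same ext_sdf table:

select estid,
       cat.levelOne,
       cat.levelTwo,
       cat.score
from ext_sdf
LATERAL VIEW explode(iabCategories) c AS cat;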