Wednesday, 27 December 2017

Hive table showing stale count after deletion of records

If you have deleted the data files of a Hive external table directly from HDFS, the table is not partitioned, and count(*) still returns the old row count, then use the properties below to refresh the statistics in the metastore.

set hive.compute.query.using.stats=false;
ANALYZE TABLE table_name COMPUTE STATISTICS;


If your table is partitioned, use the command below to analyze a specific partition.

 ANALYZE TABLE table_name partition(id='12',name='xyz') COMPUTE STATISTICS;
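For reference, a minimal end-to-end sketch on a hypothetical non-partitioned external table sales_ext (the table name is illustrative): after its files are deleted from HDFS, count(*) can still be answered from stale statistics until they are recomputed.

set hive.compute.query.using.stats=false;
ANALYZE TABLE sales_ext COMPUTE STATISTICS;
select count(*) from sales_ext;   -- now reflects the files actually present on HDFS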

Sunday, 10 December 2017

Sqoop import of an Oracle table into Hive, partition by partition

#!/bin/bash
# Usage: script.sh <table_name> <oracle_user_id> <oracle_password>
# Loops over the Oracle partition names listed in abc_list.txt and imports each one.
#export path=/xx/xx/
Table_Name=$1
echo $Table_Name
user_id=$2
echo $user_id
password=$3
#cd $path
for i in `cat /xx/xx/abc_list.txt`
do
echo "Importing partition ${i} of ${Table_Name}"
sqoop import --connect 'jdbc:oracle:thin:@(DESCRIPTION =(LOAD_BALANCE = yes)(SDU = 32768) (ADDRESS =(PROTOCOL = TCP) (HOST = xxxx.com)  (PORT = 1526)   ) (CONNECT_DATA =      (SERVICE_NAME = xxx)))' --username "${user_id}" --password "${password}" --query "select * from ${Table_Name} partition(${i}) where \$CONDITIONS" --target-dir /sales/channel/reporting/${Table_Name}/data/${i} --fields-terminated-by '\034' -m 1
if [ $? -eq 0 ]
then
echo "Partition ${i} has been processed successfully."
else
echo "Partition ${i} has not been processed successfully."
echo "Please check the issue."
fi
done

Sqoop import with a free-form query


sqoop import  --connect 'jdbc:oracle:thin:@(DESCRIPTION =
    (enable = broken)
    (ADDRESS =
      (PROTOCOL = TCP)
      (HOST = hostname.xx.xxcorp.net)
      (PORT = 1526)
    )
    (ADDRESS =
      (PROTOCOL = TCP)
      (HOST = hostname.xx.xxcorp.net)
      (PORT = 1526)
    )
    (CONNECT_DATA =
      (SERVICE_NAME = HPIPODSPC)
    )
  )' --username 'xx' --password 'xxx#ahu1vn' --query "select * from db_name.table_name partition(SYS_P297)  where \$CONDITIONS" --fields-terminated-by  '\034'   --target-dir /sales/channel/reporting/folder_name/data/partition_col=SYS_P297  -m 1

Thursday, 2 November 2017

Sqoop import Error with Oracle

sqoop import   --connect 'jdbc:oracle:thin:@ip_add/dbname' --username 'xxx' --password 'xxxx' --table dbname.table_name  --target-dir /xx/xx/Talend/test/test2    -m 1


 Error:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/11/02 14:38:25 ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: Invalid Oracle URL specified
java.sql.SQLException: Invalid Oracle URL specified
        at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)
        at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:133)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:199)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:263)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:271)
     

Solution: In this case we have to add "thin" to the JDBC connection string.
Example: --connect 'jdbc:oracle:thin:@ip_add/dbname'

Again Error:

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/11/02 14:40:55 ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: The Network Adapter could not establish the connection
java.sql.SQLException: The Network Adapter could not establish the connection
        at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)
        at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:133)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:199)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:480)


Solution:
Instead of giving the URL and database name, replace the whole connection string with the TNSNAMES entry of that Oracle database:

Example:

Oracle tnsname entry:

xxx =
  (DESCRIPTION =
    (LOAD_BALANCE = yes)
    (SDU = 32768)
    (ADDRESS =
      (PROTOCOL = TCP)
      (HOST = ip_address)
      (PORT = 1526)
    )
    (CONNECT_DATA =
      (SERVICE_NAME = nnnnnDSI)
    )
  )

Now make the corresponding change in the sqoop command.


sqoop import   --connect 'jdbc:oracle:thin:@(DESCRIPTION =(LOAD_BALANCE = yes)(SDU = 32768) (ADDRESS =(PROTOCOL = TCP) (HOST = ip_addr)  (PORT = 1526)   ) (CONNECT_DATA =      (SERVICE_NAME = HPIPODSI)))' --username 'xxx' --password 'xxxx' --table dbname.table  --target-dir /apps/hive/warehouse/test001.db --escaped-by '\"' --enclosed-by '\"'  --fields-terminated-by '|'  -m 1

Again you may get an error:

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/11/02 14:45:33 INFO manager.OracleManager: Time zone has been set to GMT
17/11/02 14:45:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "AMSPODS_PUB"."end_cust_id" t WHERE 1=0
17/11/02 14:45:33 ERROR manager.SqlManager: Error executing statement: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist

java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist

        at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:91)
        at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:133)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:206)
Solution:

Use uppercase for the Oracle username, database name, and table name.

Example :

sqoop import   --connect 'jdbc:oracle:thin:@(DESCRIPTION =(LOAD_BALANCE = yes)(SDU = 32768) (ADDRESS =(PROTOCOL = TCP) (HOST = ip_address)  (PORT = 1526)   ) (CONNECT_DATA =      (SERVICE_NAME = xx)))' --username 'xx' --password '#' --table dbname.END_CUST_ID  --target-dir /apps/hive/warehouse/test001.db --escaped-by '\"' --enclosed-by '\"'  --fields-terminated-by '|'  -m 1

Now the data has been imported successfully.



Sqoop importing data

Link
http://hortonworks.com/hadoop-tutorial/loading-data-into-the-hortonworks-sandbox/
http://thinkbiganalytics.com/hadoop_nosql_services/freestone-framework/
CREATE USER 'user1'@'localhost' IDENTIFIED BY 'pass1';

sqoop import --connect "jdbc:mysql://localhost/test" --username "root" --table "t1" --target-dir input1 --m 1

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --hive-import -m 1

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --hive-table bigdb.t12 -m 1

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --warehouse-dir /user/hive/warehouse/bigdb --m 1

----- Import using a query -----

bin/sqoop import --connect jdbc:teradata://dnedwt.edc.cingular.net/edwdb --driver com.teradata.jdbc.TeraDriver --username dy662t --password Dhiru+12 --query "select UserName, AccountName, UserOrProfile from DBC.AccountInfoV where \$CONDITIONS" --hive-import --hive-table tada_ipub.TMP_billing --split-by UserName --target-dir /user/hive/warehouse/test_tmp1 -verbose

sqoop export --connect "jdbc:mysql://localhost/db2" --username "root" --table "t5" --export-dir /user/hive/warehouse/t5

sqoop export --connect "jdbc:mysql://localhost/mydb" --username "root" --table "total_gender" --export-dir /user/hive/warehouse/gender

sqoop export --connect "jdbc:mysql://localhost/mydb" --username "root" --password 'root123!' --table "gender" --export-dir /user/hive/warehouse/bigdb.db/user_gender --input-fields-terminated-by ',' -verbose -m 1


--- Incremental load with sqoop ---
sqoop import --connect "jdbc:mysql://localhost/test" --username "root" --table "city5" --hive-import --check-column upd_dt --incremental lastmodified --last-value 2015-05-26


Sqoop the data directly into an external table
-----------------

(1)    Create an external table in Hive (a sample DDL sketch follows after these steps).
Example - dist_daily_temp2
(2)    Run the Sqoop command and keep the "--target-dir" path the same as the location of the external table.
sqoop import   --connect 'jdbc:oracle:thin:@(DESCRIPTION =(LOAD_BALANCE = yes)(SDU = 32768) (ADDRESS =(PROTOCOL = TCP) (HOST = ipadddress)  (PORT = 1526)   ) (CONNECT_DATA =      (SERVICE_NAME = xxxI)))' --username 'AMSPODS_PUB' --password 'xxx' --table PUB_FCT_DSTRB_SLS_DAILY  --fields-terminated-by  '\034'  --target-dir /sales/channel/Talend/test/test2  -m 1

(3)    Keep all columns of the sample table as string so the data loads cleanly from Oracle; the data types can be changed later.
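A minimal sketch of such a DDL, assuming hypothetical column names; all columns are string, the field delimiter matches the '\034' used in the Sqoop command, and the location matches its --target-dir:

create external table dist_daily_temp2 (
  col1 string,
  col2 string,
  col3 string
)
row format delimited fields terminated by '\034'
stored as textfile
location '/sales/channel/Talend/test/test2';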

Tuesday, 24 October 2017

Update Hive table using CASE clause

I have taken two tables, "temp_add_1" and "temp_add_2".

Table temp_add_1 has 11 records and temp_add_2 has 13 records (new records and updated columns).

Records of table temp_add_1 (old records):


5000000196      KAPLAN UNIVERSITY       332 FRONT ST S  STE 501 LA CROSSE       WI      54601
5000000206      GLATFELTER      475 S PAINT ST  STOREROOM RECEIVING     CHILLICOTHE     OH      45601
5000000209      METHODIST HEALTH SYSTEM 1441 N BECKLEY AVE              DALLAS  TX      75203
5000000220      WINDWOOD FARM   4857 WINDWOOD FARM RD           AWENDAW SC      294295951
5000239439      D & H DISTRIBUTING      185 COWETA INDUSTRIAL PAR               NEWNAN  GA      30265   US

Records of table temp_add_2 (new and updated rows):

5000000209      METHODIST HEALTH SYSTEM 1441 N BECKLEY AVE              DALLAS  TX      75203
5000000220      WINDWOOD FARM   4857 WINDWOOD FARM RD           AWENDAW SC      294295951
5000000281      MARCO ST CLOUD OFFICE   4510 HEATHERWOOD RD             ST CLOUD        MN      56301
5000000294      RJ REYNOLDS TOBACCO     1001 REYNOLDS BLVD      BLDG 603-5 DOCK 21      WINSTON SALEM   NC      27105
5000239439      D & H DISTRIBUTING      185 COWETA INDUSTRIAL PARKWAY           NEWNAN  GA      30265   US


Use the command below to update the records in temp_add_1.

I am using a CASE expression with a FULL OUTER JOIN.

insert overwrite table sales_channel.temp_add_1
select 
case 
    when t1.address_identifier = t2.address_identifier then t1.address_identifier else t2.address_identifier end as address_identifier,
case
    when t1.customer_name = t2.customer_name then t1.customer_name else t2.customer_name end as customer_name,
case
    when t1.address1_description = t2.address1_description then t1.address1_description else t2.address1_description end as address1_description,
case
    when t1.address2_description = t2.address2_description then t1.address2_description else t2.address1_description end as address2_description,
case
    when t1.city_name = t2.city_name then t1.city_name else t2.city_name end as city_name,
case
    when t1.state_name = t2.state_name then t1.state_name else t2.state_name end as state_name,
case
    when t1.zip_code = t2.zip_code then t1.zip_code else t2.zip_code end as zip_code,
case
    when t1.iso_country_code = t2.iso_country_code then t1.iso_country_code else t2.iso_country_code end as iso_country_code
from sales_channel.temp_add_1 t1 FULL OUTER JOIN sales_channel.temp_add_2 t2
ON t1.address_identifier = t2.address_identifier;

Result:-


OK

5000000209      METHODIST HEALTH SYSTEM 1441 N BECKLEY AVE              DALLAS  TX      75203
5000000220      WINDWOOD FARM   4857 WINDWOOD FARM RD           AWENDAW SC      294295951
5000000281      MARCO ST CLOUD OFFICE   4510 HEATHERWOOD RD     4510 HEATHERWOOD RD     ST CLOUD        MN      56301
5000000294      RJ REYNOLDS TOBACCO     1001 REYNOLDS BLVD      1001 REYNOLDS BLVD      WINSTON SALEM   NC      27105
5000239439      D & H DISTRIBUTING      185 COWETA INDUSTRIAL PARKWAY           NEWNAN  GA      30265   US

Thursday, 31 August 2017

Hive commands

If you want to list all tables containing a particular string within a database, use the command below.

hive > show tables like '*tablename*';

Similarly for databases:

hive > show databases like '*dbname*';

Remove duplicate (redundancy)

Case: We have multiple columns in a table, but no single column or combination of columns provides a unique row.

Solution: In that case you can use the ROW_NUMBER function, including ROW_NUMBER with OVER (PARTITION BY ...).


Step 1: First generate a unique number for each row in a subquery with ROW_NUMBER() OVER() AS unique_number.

Step 2: Then apply ROW_NUMBER() OVER (PARTITION BY unique_number, table_column1, table_column2).

After step 2, keeping only the rows where this second row number equals 1 gives you unique rows.
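The point of the pattern is to collapse duplicates introduced by a later join back to one row per original record. A minimal sketch with hypothetical tables src and lkp and hypothetical column names (the full production query follows in the example below):

select dedup.*
from
  (select s.*, l.extra_col,
          ROW_NUMBER() OVER (PARTITION BY s.rowid) as rn   -- one row kept per original source row
   from
     (select src.*, ROW_NUMBER() OVER() as rowid           -- step 1: unique number per source row
      from src) s
   LEFT OUTER JOIN lkp l
   ON s.key_col = l.key_col
  ) dedup
where dedup.rn = 1;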

Example:

select pos4.*,
              if(pos4.Partner_Tier_Indicator_Code='T1','sell-thru','sell-to') as partner_selling_motion_measure_name,
              if(pos4.Partner_Tier_Indicator_Code='T1','sold to','sold to R2R') as address_type_name
        from
                  (SELECT pos3.*,
                             ROW_NUMBER() OVER (PARTITION BY pos3.rowid,pos3.partner_sales_transaction_date order by pos3.partner_sales_transaction_date  desc ) as rownum
                           FROM
                                 (SELECT
                                       pos2.*,
                                       if((ph.reporting_source_partner_level_3_identifier <> NULL or ph.reporting_source_partner_level_3_identifier <> ''),
                                       ph.reporting_source_partner_level_3_identifier,pos2.partner_sold_to_matched_siebel_row_identifier) as HQ_partner2
                                  FROM
                                      (SELECT  pos1.*,ROW_NUMBER()  over() as rowid
                                            FROM
                                                (SELECT
                                                  pos.*,
                                                  if((ph.reporting_source_partner_level_3_identifier <> NULL or ph.reporting_source_partner_level_3_identifier <> ''),
                                                  ph.reporting_source_partner_level_3_identifier,pos.reporting_partner_siebel_row_identifier) as HQ_partner
                                                FROM xx.pos_weekly_temp pos --xx.Fact_Channel_Point_Of_Sale_Weekly
                                                      LEFT OUTER JOIN  xx.dim_partner_hierarchy  ph
                                                      ON pos.reporting_partner_siebel_row_identifier=ph.reporting_partner_identifier
                                                      WHERE pos.Partner_Tier_Indicator_Code='T1'
                                                      AND   pos.region_code='AMER'
                                                )  pos1
                                  ) pos2
                                        LEFT OUTER JOIN  xx.dim_partner_hierarchy  ph
                                             ON pos2.partner_sold_to_matched_siebel_row_identifier=ph.reporting_partner_identifier
                                             AND pos2.region_code = ph.region_code
                     )  pos3
                            LEFT OUTER JOIN  xx.PAS_REPORTING_PARTNER_REFERENCE_temp rpr
                            ON pos3.HQ_partner2 = rpr.reporting_partner_siebel_row_identifier
                            AND pos3.channel_sub_segment_code = rpr.channel_sub_segment_code
                          ) pos4
                             where pos4.rownum = 1;

Thursday, 11 May 2017

Alter partition table

If you have a partitioned table and you add a new column with the ALTER command, the value of that column may appear as NULL. In that case you should use the CASCADE clause at the end of the ALTER command so that the change is applied to the metadata of the existing partitions as well.

alter table pos_all_flgs_dh3 add columns(dheeren2 string) CASCADE;

INSERT
INTO pos_all_flgs_dh3 partition
  (
    Region_Code,
    Partner_Tier_Indicator_Code,
    Fiscal_Year_Week_Code
  )
SELECT DISTINCT
  p1.Channel_Sub_Segment_Identifier,
  p1.Source_File_Name,
  if((p1.cross_source_sale_flag <> p2.cross_source_sale_flag),'dheerentest','dheerentest') as dheeren2,
  p1.Region_Code,
  p1.Partner_Tier_Indicator_Code,
  p1.Fiscal_Year_Week_Code
FROM
 pos_4_flg_dh p1 LEFT OUTER JOIN pos_2_flg_dh p2
 ON p1.Partner_Sales_Transaction_Identifier= p2.Partner_Sales_Transaction_Identifier limit 1;
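For reference, a minimal sketch of the CASCADE difference on a hypothetical table part_demo (all names are illustrative):

create table part_demo (a string) partitioned by (dt string);
-- without CASCADE the new column is added only to the table-level schema,
-- so it can read back as NULL for partitions that already existed:
alter table part_demo add columns (b string);
-- with CASCADE the column change is also pushed to the metadata of every partition:
alter table part_demo add columns (c string) CASCADE;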

Thursday, 27 April 2017

ORC format

hive> create external  table test_pos_ext2(
    > trxno string,
    > trx_dt string,
    > reporter_id  string,
    > buyer_id string
    > ) row format delimited fields terminated by ','
    > LOCATION '/gcw/testing/pos_test_ext/test';
OK
Time taken: 0.153 seconds
hive> select * from test_pos_ext2;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file hdfs://hdpdevnn/gcw/testing/pos_test_ext/test/pos3.txt. Invalid postscript.


Solution:

Add STORED AS TEXTFILE to the table definition while creating it.

hive> create external  table test_pos_ext2(
    > trxno string,
    > trx_dt string,
    > reporter_id  string,
    > buyer_id string
    > ) row format delimited fields terminated by ','
    > STORED AS TEXTFILE
    > LOCATION '/gcw/testing/pos_test_ext/test';

hive> select * from test_pos_ext2;
OK
9106956188      3/7/2017        3-HWJW-516      3-2SS-2763
9106956189      3/7/2017        3-HWJW-516      3-2SS-2763
Time taken: 0.12 seconds, Fetched: 2 row(s)
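The likely cause here is that hive.default.fileformat is set to ORC on the cluster, so a table created without an explicit STORED AS clause expects ORC files and fails on plain text. Assuming the standard Hive CLI behavior, the current default can be checked with:

set hive.default.fileformat;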

Sunday, 2 April 2017

Copy Partition table

Here is how to copy an existing partitioned table in Hive from one cluster to another, or from one database to another database on the same cluster.

Suppose you have table t1 in database testdb and you have loaded data into it from a local directory.

create table testdb.t1(a string, b string) row format delimited fields terminated by ',';
load data local inpath '/home/dheerendra/working/Data/data1' overwrite into table testdb.t1;

Insert the data into a partitioned table tp.

set hive.exec.dynamic.partition.mode=nonstrict;
create table testdb.tp(a string) partitioned by (b string);
insert into table testdb.tp partition(b) select a,b from testdb.t1;

Copy the data from one cluster to another using the distcp command. In my case the source and destination are on the same cluster.

hadoop distcp   /user/hive/warehouse/testdb.db/tp /user/hive/warehouse/db1.db/

When the clusters are different:

hadoop distcp   hdfs://cluster1/user/hive/warehouse/testdb.db/tp  hdfs://cluster2/user/hive/warehouse/db1.db/


Create the same table structure in the destination database.

create table db1.tp(a string) partitioned by(b string);


Use the MSCK command to register the partitions with the metastore.


msck repair table tp;

If you get an error like:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
then take the steps below:

set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
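If MSCK still refuses a particular path, the partitions can also be registered explicitly with ADD PARTITION; a hedged sketch with a hypothetical partition value:

ALTER TABLE db1.tp ADD IF NOT EXISTS PARTITION (b='some_value')
LOCATION '/user/hive/warehouse/db1.db/tp/b=some_value';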

Friday, 3 March 2017

Hive output to csv format

Move hive output to csv format

 nohup hive -f /home/dheerendra.a.yadav/query/s3.q | sed 's/[\t]/,/g' > /home/dheerendra.a.yadav/query/s6.csv &
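An alternative to piping through sed, assuming Hive 0.11 or later, is to let Hive write the delimited output itself with INSERT OVERWRITE LOCAL DIRECTORY (the directory path and query here are illustrative):

INSERT OVERWRITE LOCAL DIRECTORY '/home/dheerendra.a.yadav/query/s6_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM some_table;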

Friday, 24 February 2017

Transpose column to row

Transpose or Pivot table or convert column to row


create table twit_seq STORED AS ORC tblproperties ("orc.compress"="NONE") as
select t1.consumer_id,t1.text,t1.rk
from
(SELECT consumer_id,text,postcreationtime1,rank() OVER (partition by consumer_id ORDER BY postcreationtime1 desc ) as rk,
 row_number() over() as cnt
FROM dunami_v1)t1  where t1.rk < 6;


create table customer_tweet_info_dnami STORED AS ORC tblproperties ("orc.compress"="NONE") as
SELECT consumer_id ,
   max(CASE WHEN rk = 1 THEN text else NULL END) AS Tweet_1,
   max(CASE WHEN rk = 2 THEN text else NULL END) AS Tweet_2,
   max(CASE WHEN rk = 3 THEN text else NULL END) AS Tweet_3,
   max(CASE WHEN rk = 4 THEN text else NULL END) AS Tweet_4,
   max(CASE WHEN rk = 5 THEN text else NULL END) AS Tweet_5
FROM twit_seq  group by consumer_id;

week and rank function

Calculate weeks

select abs(floor(datediff(timestamp(from_unixtime(UNIX_TIMESTAMP(trim('01/04/16 00:00'),'MM/dd/yy hh:mm'))),postcreationtime1)/7)) as wk
from dunami_twitter_visual_data_v1 limit 100;


Use rank function to get value in descending order

create table twit_seq STORED AS ORC tblproperties ("orc.compress"="NONE") as
select t1.consumer_id,t1.text
from
(SELECT consumer_id,text,postcreationtime1,rank() OVER (partition by consumer_id ORDER BY postcreationtime1 desc ) as rk
FROM dunami_v1)t1  where t1.rk < 6;

Monday, 13 February 2017

Hive Optimization properties

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set mapreduce.job.maps=1000;
set mapreduce.job.reduces=1000;
 SET hive.exec.parallel=true;
-----------------------------------------------------------

set hive.vectorized.groupby.maxentries=10240;
set hive.vectorized.groupby.flush.percent=0.1;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cli.print.header = true;


Working with Tez you may hit an issue like tez.TezProcessor: java.lang.OutOfMemoryError: Java heap space.

Then set the properties below at the query level so that the container size increases for that particular query.

SET hive.tez.java.opts=-Xmx8192m -Xms4096m

SET hive.tez.container.size=4026

Go through link -
https://discuss.pivotal.io/hc/en-us/articles/226239948-Hive-query-with-TEZ-engine-failed-with-exception-java-lang-OutOfMemoryError-Java-heap-space


If a Hive query's output generates empty files along with the data files, use the properties below.
set hive.merge.tezfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.orcfile.stripe.level=true;

=================================================

Optimization for ORC files

set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.reduce.groupby.enabled = true;
set hive.exec.parallel=true;
SET hive.tez.java.opts=-Xmx8192m -Xms4096m;
SET hive.tez.container.size=8192;
set hive.exec.parallel=true;

Sunday, 5 February 2017

Copy data from S3 to hdfs

hadoop distcp s3n://<access key>:<secret key>@bucket/testhadoop/*/* /user/hadoop_user/test/

Date format in hive

Examples -  5/11/2015       12/31/2014      5/11/2015

select optin_date,cell_phone_date,optin_date_stamp from datetest;

select
to_date(from_unixtime(UNIX_TIMESTAMP(trim(Optin_Date_Stamp), 'MM/dd/yyyy'))),
to_date(from_unixtime(UNIX_TIMESTAMP(trim(cell_phone_date), 'MM/dd/yyyy'))) ,
to_date(from_unixtime(UNIX_TIMESTAMP(trim(optin_date_stamp), 'MM/dd/yyyy')))
FROM datetest;

----------------------------------------------------------------------------------------------------------
/*
3/10/2015  3:04:05 AM
4/11/2015  4:05:06 AM
5/13/2015  5:06:07 AM
5/31/2016  5:18:19 PM
4/29/2016  4:17:18 PM
10/20/2015  10:11:12 AM
*/

select timestamp(from_unixtime(UNIX_TIMESTAMP(trim(a), 'MM/dd/yyyy hh:mm:ss'))) FROM datetest2;

---------------------------------------------------------------------------------------------------------
2017-01-03T23:55:33.000Z

select timestamp(from_unixtime(unix_timestamp(regexp_replace(concat(concat(concat(substr(standardtimestamp,1,10),substr(standardtimestamp,12,2)),
substr(standardtimestamp,15,2)),substr(standardtimestamp,18,2)),'-',''),'yyyymmddhhmmss')))  as  standardtimestamp from kdf_sample;

create table datetest(date1 string,date2 string,date3 string,date4 string) row format delimited fields terminated by ','
location '/user/dheerendra.a.yadav/testdata1/';

select
  to_date(from_unixtime(UNIX_TIMESTAMP(trim(date1),'MM/dd/yyyy'))) as date1,
  to_date(from_unixtime(UNIX_TIMESTAMP(trim(date2),'MM/dd/yyyy'))) as date2,
  timestamp(from_unixtime(UNIX_TIMESTAMP(trim(date3),'MM/dd/yyyy hh:mm:ss'))) as date3,
   timestamp(from_unixtime(unix_timestamp(regexp_replace(concat(concat(concat(substr(date4,1,10),substr(date4,12,2)),
   substr(date4,15,2)),substr(date4,18,2)),'-',''),'yyyyMMddhhmmss')))  as  date4
from datetest;

select standardtimestamp,timestamp(from_unixtime(unix_timestamp(regexp_replace(concat(concat(concat(substr(standardtimestamp,1,10),substr(standardtimestamp,12,2)),
substr(standardtimestamp,15,2)),substr(standardtimestamp,18,2)),'-',''),'yyyyMMddhhmmssss')))  as  standardtimestamp from test limit 2



Compare dates when both values are already in date format:


select
sum(case when price_change_date  > from_unixtime(unix_timestamp('2016-12-31' ,'yyyy-MM-dd'), 'yyyy-MM-dd') then 1 else 0 end)  as price_change_date,
sum(case when listing_date  > from_unixtime(unix_timestamp('2016-12-31' ,'yyyy-MM-dd'), 'yyyy-MM-dd') then 1 else 0 end)  as listing_date,
sum(case when reduced_date  > from_unixtime(unix_timestamp('2016-12-31' ,'yyyy-MM-dd'), 'yyyy-MM-dd') then 1 else 0 end)  as reduced_date,
sum(case when distressed_date  > from_unixtime(unix_timestamp('2016-12-31' ,'yyyy-MM-dd'), 'yyyy-MM-dd') then 1 else 0 end)  as distressed_date
from fmdb.Interlinked_FM_DS_UPDT_V3;


When the date is a timestamp and you want to compare it with a date in string format, use the format below.


select from_unixtime(UNIX_TIMESTAMP(ins_gmt_ts,'yyyy-MM-dd'),'yyyyMMdd') as date1 from dbname_table_name_base limit 1;

========================================================

When you have an integer or decimal value and you need to convert it into date format:

(1) Cast the integer or decimal value to string using the CAST function.
(2) Convert it to a unix timestamp with the matching pattern, e.g. 'YYYYMMDD'.
(3) Convert with from_unixtime().
(4) Convert with date_format() into the required format, e.g. 'yyyy-MM-dd'.

Example:

 Date in decimal -  birth_dt = 1956324.0
 
Solution:

SELECT date_format(from_unixtime(unix_timestamp(cast(birth_date as string),'YYYYMMDD')),'yyyy-MM-dd') as birth_dt from dummy

Result:
birth_dt = 1956-01-01
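Note that these patterns follow Java's SimpleDateFormat, where the lowercase form 'yyyyMMdd' means year, month and day of month (uppercase 'DD' is day-of-year and 'YYYY' is week-year). A hedged sketch with a hypothetical 8-digit value:

SELECT date_format(from_unixtime(unix_timestamp(cast(19560324 as string), 'yyyyMMdd')), 'yyyy-MM-dd');
-- returns 1956-03-24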


Wednesday, 1 February 2017

Json file parsing

{"shortIp":"38.126.127","os":"Windows","refDomain":"www.cnn.com","standardTimestamp":"2016-12-15T19:58:45.000Z","userAgent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0","deviceType":"Personal computer","estid":"CsY3SlcrcMZQT1Uvez0HAg==","browserFamily":"Firefox","mappedEvent":"pview","url":"http://money.cnn.com/2016/12/15/technology/apple-class-action-meal-breaks/index.html?iid=ob_homepage_tech_pool","iabCategories":[{"levelOne":"personal_finance","levelTwo":"financial_news","score":"medium"}],"ip":"38.126.127.126"}

drop table sdf;
create table sdf(
shortIp string
,os string
,refDomain string
,standardTimestamp string
,userAgent string
,deviceType string
,estid string
,browserFamily string
,mappedEvent string
,url string
,iabCategories array<struct<levelOne:string,levelTwo:string,score:string>>
,ip string)ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
-------------------------------------------------------------------------------

drop table ext_sdf;
create external table ext_sdf(
shortIp string
,os string
,refDomain string
,standardTimestamp string
,userAgent string
,deviceType string
,estid string
,browserFamily string
,mappedEvent string
,url string
,iabCategories array<struct<levelOne:string,levelTwo:string,score:string>>
,ip string)ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'hdfs://127.0.0.1:9000/sdf';



select shortip
,os,useragent
,refDomain
,standardTimestamp
,userAgent
,deviceType
,estid
,browserFamily
,mappedEvent
,url
,iabcategories.levelone[0]
,iabcategories.levelTwo[0]
,iabcategories.score[0]
,ip
from ext_sdf;
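If each element of the iabCategories array is needed as its own row rather than indexing position [0], a LATERAL VIEW with explode() can flatten it; a minimal sketch against the same ext_sdf table:

select estid,
       cat.levelOne,
       cat.levelTwo,
       cat.score
from ext_sdf
LATERAL VIEW explode(iabCategories) c AS cat;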