Wednesday, 15 June 2022

Compare an array column with another DataFrame column

If a DataFrame has an array column and you need to check it against another column of the same DataFrame, you can achieve this with the "expr" function. For details, see the code below.


===================================

import pyspark.sql.functions as sf

# FIRSTNAME and MIDDLE are columns of df_trx; split_desc is the description split into words.
df_desc_split = df_trx.withColumn('split_desc', sf.split(sf.col('description'), ' '))

df_name_flg = (df_desc_split
    .withColumn("first_name_flag", sf.expr("array_contains(split_desc, FIRSTNAME)"))
    .withColumn("middle_name_flag", sf.expr("array_contains(split_desc, MIDDLE)")))


Explanation:

I have a description field containing a string like "My name is dheerendra", which I split and store in an array field "split_desc" as ['My','name','is','dheerendra'].

Now suppose the same DataFrame has another column, e.g. "name", which contains "dheerendra".

|name      |split_desc                     |
|dheerendra|['My','name','is','dheerendra']|

To check for the existence of the value of "name" in the split_desc field, use the "expr" function along with array_contains, so that the second argument is evaluated as a column reference rather than a literal:

df_desc_split.withColumn("first_name_flag", sf.expr("array_contains(split_desc, name)"))
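
Putting it together, here is a minimal self-contained sketch of the technique (the SparkSession setup and the sample rows are hypothetical):

import pyspark.sql.functions as sf
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("array-contains-demo").getOrCreate()

# Hypothetical sample data: a name column and a free-text description column.
df_trx = spark.createDataFrame(
    [("dheerendra", "My name is dheerendra"),
     ("rahul", "payment received from suresh")],
    ["name", "description"])

# Split the description into an array, then flag rows whose array contains
# the value of the "name" column of the same row.
df_flag = (df_trx
    .withColumn("split_desc", sf.split(sf.col("description"), " "))
    .withColumn("name_flag", sf.expr("array_contains(split_desc, name)")))

df_flag.show(truncate=False)   # name_flag: true for the first row, false for the second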



Thursday, 16 December 2021

Convert integer or decimal to date format


When you have an integer or decimal value that you need to convert to a date format, follow these steps:

(1) Cast the integer or decimal value to string using the CAST function.
(2) Parse it into epoch seconds with unix_timestamp(), giving the source format, e.g. 'yyyyMMdd'.
(3) Convert the epoch seconds back to a timestamp string with from_unixtime().
(4) Format the result with date_format() in the required format, e.g. 'yyyy-MM-dd'.

(Note that the pattern is case-sensitive: 'yyyy' is the year and 'dd' the day of month, while uppercase 'YYYY'/'DD' mean week-year and day-of-year.)

Example:

Date in decimal: birth_date = 1956324.0
 
Solution:

SELECT date_format(from_unixtime(unix_timestamp(cast(cast(birth_date as bigint) as string),'yyyyMMdd')),'yyyy-MM-dd') as birth_dt FROM dummy

(The inner cast to bigint drops the trailing '.0' that a decimal-to-string cast would otherwise produce.)

Result:
birth_dt = 1956-01-01
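
The same conversion can be done in PySpark; a minimal sketch, assuming a DataFrame df with a decimal birth_date column (the names are illustrative):

import pyspark.sql.functions as sf

# Cast to bigint first to drop the '.0', then to string, then parse and reformat.
df_dates = df.withColumn(
    "birth_dt",
    sf.date_format(
        sf.from_unixtime(
            sf.unix_timestamp(sf.col("birth_date").cast("bigint").cast("string"), "yyyyMMdd")),
        "yyyy-MM-dd"))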

Tuesday, 31 August 2021

Hive with Parquet file

File location on HDFS:

/data44/2/part-00000-ddd75d27-e608-4a17-a96d-4d631c71e875-c000.snappy.parquet

Create an external table in Hive:

create external table test_paq4(deptid string,name string,id int) stored as parquet location 'hdfs://127.0.0.1:9000/data44/2';

hive> select * from test_paq4;


Note: With Parquet you can list the columns in any order, e.g. id,name,deptid or deptid,name,id.

Only the column names have to match the columns in the Parquet file, because each Parquet file carries its own schema (the column names and types).
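
You can verify this by reading the file's embedded schema directly. A minimal PySpark sketch (the HDFS path follows the example above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-schema").getOrCreate()

# Parquet is self-describing: columns are matched by name, not by position.
df = spark.read.parquet("hdfs://127.0.0.1:9000/data44/2")
df.printSchema()

# Any column order works because each file stores its own schema.
df.select("id", "name", "deptid").show()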


Monday, 28 June 2021

Sqoop Commands

Import a MySQL table into a target HDFS directory:

sqoop import --connect "jdbc:mysql://localhost/test" --username "root" --table "t1" --target-dir input1 -m 1

Import a table straight into Hive (default database):

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --hive-import -m 1

Import into a specific Hive table:

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --hive-import --hive-table bigdb.t12 -m 1

Import under an existing warehouse directory:

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --warehouse-dir /user/hive/warehouse/bigdb -m 1

Export a Hive warehouse directory back to MySQL:

sqoop export --connect "jdbc:mysql://localhost/db2" --username "root" --table "t5" --export-dir /user/hive/warehouse/t5

sqoop export --connect "jdbc:mysql://localhost/mydb" --username "root" --table "total_gender" --export-dir /user/hive/warehouse/gender

Export with an explicit password and field delimiter:

sqoop export --connect "jdbc:mysql://localhost/mydb" --username "root" --password 'root123!' --table "gender" --export-dir /user/hive/warehouse/bigdb.db/user_gender --input-fields-terminated-by ',' --verbose -m 1

--- incremental load with sqoop ---

Incremental (lastmodified) import into Hive, picking up rows whose upd_dt is newer than the last value:

sqoop import --connect "jdbc:mysql://localhost/test" --username "root" --table "city5" --hive-import --check-column upd_dt --incremental lastmodified --last-value 2015-05-26 -m 1


Monday, 28 May 2018

Update table (Acid)

Update a table in Hive using ACID properties.

Prerequisites:

(1) The table must be bucketed.
(2) The table must be stored as ORC.
(3) Two properties must be set:
     (a) SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
     (b) SET hive.support.concurrency=true;

Step: 1 => Create the table using TBLPROPERTIES('transactional'='true');


create table testacid(id int, name string, rollno int) clustered by (id) into 4 buckets
stored as ORC tblproperties('transactional'='true') ;


Step: 2 => Insert records into the table.

insert into testacid values(1,'dh1',121),(2,'dh2',122);

Step: 3 => Update a record.

update testacid set name='dheeren' where name='dh2';

Saturday, 12 May 2018

Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Error: Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Solution: This issue is related to the metastore, which is backed by a database, e.g. Derby, MySQL, etc.

First step: verify that hive-site.xml contains the properties below.

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://localhost:9000/user/hive/warehouse</value>
</property>

<property>
    <name>hive.metastore.uris</name>
    <value></value>  <!-- keep it blank for an embedded metastore -->
</property>

Now try to start Hive again. If you still get the same problem, follow the steps below:

(1) Remove the metastore_db directory if it exists:
     rm -r metastore_db

(2) Initialize the schema with the schematool command from the Hive bin directory:
       schematool -initSchema -dbType derby

(3) If schematool completes successfully, start Hive. You should now be able to use databases in Hive.

Monday, 16 April 2018

Extracting and comparing data

To extract the data into a minimal number of part files, use an ORDER BY clause in the query: a global ORDER BY in Hive funnels the result through a single reducer, so the output lands in a single file.

Example:

INSERT OVERWRITE LOCAL DIRECTORY '/dataingestion/folder_name/data/ship' row format delimited FIELDS TERMINATED BY '~'
 SELECT
    wkly_agg.col1,
    wkly_agg.col2,
    SUM(wkly_agg.col3) as net_amount
 FROM db_name.POS_AGGR_FLG_TEMP_T1_2 wkly_agg
   WHERE  wkly_agg.aggregate_sales_flag ='Y'
   AND (wkly_agg.flg1 <>'Y' OR wkly_agg.flg2 <> 'N' OR wkly_agg.seller_flag <> 'Y')
  AND wkly_agg.trx_date between '2017-11-01' and '2017-11-30' 
 GROUP BY
  wkly_agg.col1,
  wkly_agg.col2
Order by 1

Output on the edge node:

/dataingestion/folder_name/data/ship/000000_0

Move this file to Windows with WinSCP or FTP, then convert it to .csv format.
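
If you are working in Spark rather than Hive, a minimal sketch of the same extraction (coalescing to one partition plays the role of the single reducer; the table, columns, and path follow the example above):

import pyspark.sql.functions as sf
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("extract").getOrCreate()

# Aggregate, order globally, and force a single output file.
(spark.table("db_name.POS_AGGR_FLG_TEMP_T1_2")
    .where("aggregate_sales_flag = 'Y' AND (flg1 <> 'Y' OR flg2 <> 'N' OR seller_flag <> 'Y') "
           "AND trx_date BETWEEN '2017-11-01' AND '2017-11-30'")
    .groupBy("col1", "col2")
    .agg(sf.sum("col3").alias("net_amount"))
    .orderBy("col1")
    .coalesce(1)
    .write.option("sep", "~").csv("/dataingestion/folder_name/data/ship_csv"))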

This file now holds the extracted values in one column; next to it, paste the other column whose values you need to compare.

Now use Excel functions to compare the values:
(1) =MATCH(B3,A:A,0) returns the row number of the matched value.
(2) =INDEX(A:A,MATCH(B4,A:A,0)) returns the matched value itself.

Column A:A contains the values extracted by the Hive query.
Cell B3 contains the value you want to match against the list.

Once it returns a correct match, drag the formula down the column of values you want to compare.
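
As an alternative to Excel, the same comparison can be scripted. A minimal pandas sketch (the file names are hypothetical):

import pandas as pd

# hive_extract.csv holds the values produced by the Hive query;
# to_check.csv holds the values you want to look up.
extracted = pd.read_csv("hive_extract.csv", header=None, names=["value"])
to_check = pd.read_csv("to_check.csv", header=None, names=["value"])

# Equivalent of Excel MATCH/INDEX: flag which values exist in the extract.
to_check["matched"] = to_check["value"].isin(extracted["value"])
print(to_check)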