Wednesday, 15 June 2022

Compare an array column with another DataFrame column

If a DataFrame has an array column and you need to check it against another column of the same DataFrame, you can achieve this with the "expr" function. For details, see the code below.


===================================

import pyspark.sql.functions as sf

# FIRSTNAME and MIDDLE are columns of df_trx; split_desc is the description split into words.
df_desc_split = df_trx.withColumn('split_desc', sf.split(sf.col('description'), ' '))

df_name_flg = (df_desc_split
    .withColumn("first_name_flag", sf.expr("array_contains(split_desc, FIRSTNAME)"))
    .withColumn("middle_name_flag", sf.expr("array_contains(split_desc, MIDDLE)")))


Explanation:

I have a description field containing a string like "My name is dheerendra", which I split and store in an array field "split_desc" as ['My','name','is','dheerendra'].

Now suppose the same DataFrame has another column, e.g. "name", which contains "dheerendra".

|name      |split_desc                     |
|dheerendra|['My','name','is','dheerendra']|

To check for the existence of the value of "name" in the split_desc field, use the "expr" function along with array_contains, so that the second argument is evaluated as a column reference rather than a literal:

df_desc_split.withColumn("first_name_flag", sf.expr("array_contains(split_desc, name)"))
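
Putting it together, here is a minimal self-contained sketch of the technique (the SparkSession setup and the sample rows are hypothetical):

import pyspark.sql.functions as sf
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("array-contains-demo").getOrCreate()

# Hypothetical sample data: a name column and a free-text description column.
df_trx = spark.createDataFrame(
    [("dheerendra", "My name is dheerendra"),
     ("rahul", "payment received from suresh")],
    ["name", "description"])

# Split the description into an array, then flag rows whose array contains
# the value of the "name" column of the same row.
df_flag = (df_trx
    .withColumn("split_desc", sf.split(sf.col("description"), " "))
    .withColumn("name_flag", sf.expr("array_contains(split_desc, name)")))

df_flag.show(truncate=False)   # name_flag: true for the first row, false for the second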



Thursday, 16 December 2021

Convert integer or decimal to date format


When you have an integer or decimal value that you need to convert to a date format, follow these steps:

(1) Cast the integer or decimal value to string using the CAST function.
(2) Parse it into epoch seconds with unix_timestamp(), giving the source format, e.g. 'yyyyMMdd'.
(3) Convert the epoch seconds back to a timestamp string with from_unixtime().
(4) Format the result with date_format() in the required format, e.g. 'yyyy-MM-dd'.

(Note that the pattern is case-sensitive: 'yyyy' is the year and 'dd' the day of month, while uppercase 'YYYY'/'DD' mean week-year and day-of-year.)

Example:

Date in decimal: birth_date = 1956324.0
 
Solution:

SELECT date_format(from_unixtime(unix_timestamp(cast(cast(birth_date as bigint) as string),'yyyyMMdd')),'yyyy-MM-dd') as birth_dt FROM dummy

(The inner cast to bigint drops the trailing '.0' that a decimal-to-string cast would otherwise produce.)

Result:
birth_dt = 1956-01-01
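
The same conversion can be done in PySpark; a minimal sketch, assuming a DataFrame df with a decimal birth_date column (the names are illustrative):

import pyspark.sql.functions as sf

# Cast to bigint first to drop the '.0', then to string, then parse and reformat.
df_dates = df.withColumn(
    "birth_dt",
    sf.date_format(
        sf.from_unixtime(
            sf.unix_timestamp(sf.col("birth_date").cast("bigint").cast("string"), "yyyyMMdd")),
        "yyyy-MM-dd"))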

Tuesday, 31 August 2021

Hive with Parquet file

File location on HDFS:

/data44/2/part-00000-ddd75d27-e608-4a17-a96d-4d631c71e875-c000.snappy.parquet

Create an external table in Hive:

create external table test_paq4(deptid string,name string,id int) stored as parquet location 'hdfs://127.0.0.1:9000/data44/2';

hive> select * from test_paq4;


Note: With Parquet you can list the columns in any order, e.g. id,name,deptid or deptid,name,id.

Only the column names have to match the columns in the Parquet file, because each Parquet file carries its own schema (the column names and types).
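
You can verify this by reading the file's embedded schema directly. A minimal PySpark sketch (the HDFS path follows the example above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-schema").getOrCreate()

# Parquet is self-describing: columns are matched by name, not by position.
df = spark.read.parquet("hdfs://127.0.0.1:9000/data44/2")
df.printSchema()

# Any column order works because each file stores its own schema.
df.select("id", "name", "deptid").show()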


Monday, 28 June 2021

Sqoop Commands

Import a MySQL table into a target HDFS directory:

sqoop import --connect "jdbc:mysql://localhost/test" --username "root" --table "t1" --target-dir input1 -m 1

Import a table straight into Hive (default database):

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --hive-import -m 1

Import into a specific Hive table:

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --hive-import --hive-table bigdb.t12 -m 1

Import under an existing warehouse directory:

sqoop import --connect "jdbc:mysql://localhost/db2" --username "root" --table "t4" --warehouse-dir /user/hive/warehouse/bigdb -m 1

Export a Hive warehouse directory back to MySQL:

sqoop export --connect "jdbc:mysql://localhost/db2" --username "root" --table "t5" --export-dir /user/hive/warehouse/t5

sqoop export --connect "jdbc:mysql://localhost/mydb" --username "root" --table "total_gender" --export-dir /user/hive/warehouse/gender

Export with an explicit password and field delimiter:

sqoop export --connect "jdbc:mysql://localhost/mydb" --username "root" --password 'root123!' --table "gender" --export-dir /user/hive/warehouse/bigdb.db/user_gender --input-fields-terminated-by ',' --verbose -m 1

--- incremental load with sqoop ---

Incremental (lastmodified) import into Hive, picking up rows whose upd_dt is newer than the last value:

sqoop import --connect "jdbc:mysql://localhost/test" --username "root" --table "city5" --hive-import --check-column upd_dt --incremental lastmodified --last-value 2015-05-26 -m 1


Monday, 28 May 2018

Update table (Acid)

Update a table in Hive using ACID properties.

Prerequisites:

(1) The table must be bucketed.
(2) The table must be stored as ORC.
(3) Two properties must be set:
     (a) SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
     (b) SET hive.support.concurrency=true;

Step: 1 => Create the table using TBLPROPERTIES('transactional'='true');


create table testacid(id int, name string, rollno int) clustered by (id) into 4 buckets
stored as ORC tblproperties('transactional'='true') ;


Step: 2 => Insert records into the table.

insert into testacid values(1,'dh1',121),(2,'dh2',122);

Step: 3 => Update a record.

update testacid set name='dheeren' where name='dh2';

Saturday, 12 May 2018

Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Error: Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Solution: This issue is related to the metastore, which is backed by a database, e.g. Derby, MySQL, etc.

First step: verify that hive-site.xml contains the properties below.

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://localhost:9000/user/hive/warehouse</value>
</property>

<property>
    <name>hive.metastore.uris</name>
    <value></value>  <!-- keep it blank for an embedded metastore -->
</property>

Now try to start Hive again. If you still get the same problem, follow the steps below:

(1) Remove the metastore_db directory if it exists:
     rm -r metastore_db

(2) Initialize the schema with the schematool command from the Hive bin directory:
       schematool -initSchema -dbType derby

(3) If schematool completes successfully, start Hive. You should now be able to use databases in Hive.

Monday, 16 April 2018

Extracting and comparing data

To extract the data into a minimal number of part files, use an ORDER BY clause in the query: a global ORDER BY in Hive funnels the result through a single reducer, so the output lands in a single file.

Example:

INSERT OVERWRITE LOCAL DIRECTORY '/dataingestion/folder_name/data/ship' row format delimited FIELDS TERMINATED BY '~'
 SELECT
    wkly_agg.col1,
    wkly_agg.col2,
    SUM(wkly_agg.col3) as net_amount
 FROM db_name.POS_AGGR_FLG_TEMP_T1_2 wkly_agg
   WHERE  wkly_agg.aggregate_sales_flag ='Y'
   AND (wkly_agg.flg1 <>'Y' OR wkly_agg.flg2 <> 'N' OR wkly_agg.seller_flag <> 'Y')
  AND wkly_agg.trx_date between '2017-11-01' and '2017-11-30' 
 GROUP BY
  wkly_agg.col1,
  wkly_agg.col2
Order by 1

Output on the edge node:

/dataingestion/folder_name/data/ship/000000_0

Move this file to Windows with WinSCP or FTP, then convert it to .csv format.
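
If you are working in Spark rather than Hive, a minimal sketch of the same extraction (coalescing to one partition plays the role of the single reducer; the table, columns, and path follow the example above):

import pyspark.sql.functions as sf
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("extract").getOrCreate()

# Aggregate, order globally, and force a single output file.
(spark.table("db_name.POS_AGGR_FLG_TEMP_T1_2")
    .where("aggregate_sales_flag = 'Y' AND (flg1 <> 'Y' OR flg2 <> 'N' OR seller_flag <> 'Y') "
           "AND trx_date BETWEEN '2017-11-01' AND '2017-11-30'")
    .groupBy("col1", "col2")
    .agg(sf.sum("col3").alias("net_amount"))
    .orderBy("col1")
    .coalesce(1)
    .write.option("sep", "~").csv("/dataingestion/folder_name/data/ship_csv"))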

This file now holds the extracted values in one column; next to it, paste the other column whose values you need to compare.

Now use Excel functions to compare the values:
(1) =MATCH(B3,A:A,0) returns the row number of the matched value.
(2) =INDEX(A:A,MATCH(B4,A:A,0)) returns the matched value itself.

Column A:A contains the values extracted by the Hive query.
Cell B3 contains the value you want to match against the list.

Once it returns a correct match, drag the formula down the column of values you want to compare.
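
As an alternative to Excel, the same comparison can be scripted. A minimal pandas sketch (the file names are hypothetical):

import pandas as pd

# hive_extract.csv holds the values produced by the Hive query;
# to_check.csv holds the values you want to look up.
extracted = pd.read_csv("hive_extract.csv", header=None, names=["value"])
to_check = pd.read_csv("to_check.csv", header=None, names=["value"])

# Equivalent of Excel MATCH/INDEX: flag which values exist in the extract.
to_check["matched"] = to_check["value"].isin(extracted["value"])
print(to_check)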