Friday, November 18, 2016

Hive - Creating and loading data in an external partitioned table


I have been ramping up on Hive concepts these days.
This post is about one interesting and slightly tricky concept I came across and thought of sharing:
creating and loading data into an external PARTITIONED table, and then querying it.

Hive version: 2.1.0 from Apache

First, let's create an EXTERNAL table with three fields, partitioned on one of them.

CREATE EXTERNAL TABLE my_dest_table (
    primary_alias_type string,
    primary_alias_id string)
PARTITIONED BY (d string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/home/input/testData/my_dest_table/';
As you can see, I have created an external table with partitions, and at the location mentioned I have placed a tab-delimited data file with the two declared columns and plenty of rows (the partition column's value comes from the directory layout, not from the file itself).

Now I run a select statement to view data:
hive> select * from my_dest_table;
OK
Time taken: 1.467 seconds

As you can see, there is no output here. Had this been a non-partitioned external table, the query would have returned results, but a partitioned table needs some more effort: you must create the partitions and load data into them as well.

ALTER TABLE my_dest_table ADD PARTITION (d = '20000')
LOCATION '/home/input/testData/my_dest_table/20000';

As you can see above, I created a partition using the ALTER TABLE command to add this information to the metastore. I also created the corresponding folder and placed the relevant slice of the data file in it. Alternatively, you can create all the partition folders in the filesystem and then use the MSCK REPAIR TABLE command to load the partition info into the metastore in one go.
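For the MSCK REPAIR TABLE route, a sketch (note the directory-naming assumption):

```sql
-- MSCK scans the table's location and registers any partition directories
-- that are missing from the metastore. It expects Hive-style key=value
-- names such as .../my_dest_table/d=20000/, not plain .../20000/.
MSCK REPAIR TABLE my_dest_table;

-- Verify which partitions the metastore now knows about.
SHOW PARTITIONS my_dest_table;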

Now when I do a select, I can see the data in query output.

One more point of note: an external table is only registered in the metastore, and you won't see its data under the warehouse folder (as you would for MANAGED tables).
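You can confirm this from within Hive itself; DESCRIBE FORMATTED prints the table type and the location recorded in the metastore:

```sql
-- 'Table Type' will read EXTERNAL_TABLE, and 'Location' will point at
-- /home/input/testData/my_dest_table rather than the Hive warehouse directory.
DESCRIBE FORMATTED my_dest_table;
```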

Tuesday, September 20, 2016

Effective Interviewing

I have been taking interviews for my organisation for more than two years now and must have taken 80+ interviews. So when I was invited to attend an Effective Interviewing Skills workshop, it felt like a waste of time - after all, what more was there to learn? Thankfully, my bubble of overconfidence burst, and in a pleasant way.

I viewed interviewing as an unavoidable task and a hindrance to completing my work on time. I used to approach interviews with a very straight face and kept them strictly to the point. During this session, however, I realised I had got it all wrong.

An interview is a candidate's first face to face interaction with the company, and first impressions are often hard to change. So it must be given its due importance and time.
I found that it's important for an interviewer to introduce himself as well, and to talk briefly about the organisation, especially points that cannot be found on the internet. It's also good to check whether the candidate understands the role and job profile he is applying for, and to set expectations correctly. Now that I have implemented this, it has become a good way to break the ice, and it gives the candidate some time to settle down as well.

I also found that one should read the candidate's resume in advance (rather than going through it in front of them), so that you are well aware of the candidate's skill set and know what to ask. When I implemented this, I found candidates were often both amused and surprised that I had read their profile so closely.

You might be taking a technical interview, but it's important not to limit the discussion to technical matters only. One must evaluate the candidate's attitude as well, and not leave that to the manager or HR. After all, attitude often plays as important a role as knowledge.
While you can argue that attitude is not easy to judge, BEI, or Behavioural Event Interviewing, plays an important role here. This strategy stresses asking the candidate a behavioural question and having him supplement the answer with actual instances from the past, e.g. "Tell me about a time you were faced with a very challenging situation. What was it, and how did you deal with it?" Since the answer must focus on past experience, there is less scope for manipulation. Also, what a candidate considers challenging may or may not suit your project's needs, which gives you a good idea of whether he/she will fit in.

Last but not least, don't try to cut an interview short. Evaluating a candidate can easily take upwards of 45 minutes, so be patient and keep pending work tasks off your mind :)

Saturday, August 20, 2016

Documenting in the Agile world

Documentation is considered pretty un-Agilish in the Agile development world. However, based on my short tryst with Agile, I feel it's a necessary evil. Without adequate documentation, debugging, bug fixing and maintenance tend to become nightmares. In fact, I have seen developers spend more time trying to understand other developers' code than it would have taken to document it.
I definitely DO NOT support the elaborate documentation style that comes with Waterfall (high-level designs, low-level designs and more), but documentation should be just enough to support the team's needs.

One way to do so is to add more comments than lines of code you write. This way your code reads like a story - easy to read, understand and maintain. Block comments, method comments and class-level comments all add to the readability of your code. This needs to be done while you are coding, not at a later stage (because that time never comes).
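As an illustration (a hypothetical class, not from any real codebase), the three levels of comments might look like this:

```java
/**
 * Class-level comment: explains why the class exists and what it is for.
 * Computes order totals for the billing module.
 */
public class OrderTotals {

    /**
     * Method-level comment: states intent, inputs and output, not mechanics.
     * Returns the sum of all line-item prices for one order.
     */
    static double total(double[] prices) {
        double sum = 0.0;
        // Block comment: accumulate each line item into the running total.
        for (double p : prices) {
            sum += p;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(total(new double[]{1.5, 2.5})); // prints 4.0
    }
}
```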

To document the business aspect of your application, you can maintain a running technical design document in which the developments done in each Sprint are jotted down for later reference.

In these two ways you can keep your application maintenance low-cost even as it expands and still adhere to Agile's short timelines.

This post is based on my experience with Agile. Any thoughts or recommendations are welcome. Please feel free to comment. 

Friday, August 5, 2016

GIT Troubleshooting Guide

We recently moved to Git (or rather EGit, Eclipse's Git integration). The move turned out to be a bit challenging, as we were among the first teams in our organisation to explore it, so we had to rely heavily on Google and Stack Overflow to get past issues. Though we found most of the answers on these websites, they were scattered here and there. This post aims to consolidate all those issues in one place and provide a ready reckoner for those exploring Git (via Eclipse) for the first time.

I will start with the Git Best Practices first.

Git Best Practices


  1. Before starting your work, do a Git Pull.
  2. Do commits often (Commit only, not Commit and Push).
  3. When committing, select only the files that you have changed. (See screenshot below)
  4. Before pushing your changes, do a Synchronize Workspace and make sure there are no conflicts with the remote repository.
  5. Before leaving for the day, 'Push to Upstream' your changes.
  6. After pushing, log in to the Git web UI > Commits and verify that you can see your commits. Also go to the Files tab and verify that your changes are present. We came across situations where code got committed successfully and showed up in the Commits tab, yet the files did not make it to the remote repo; in those cases the changes did not show up in the Files tab either. So it's always better to verify changes in the Files tab.

Troubleshooting Tips


  • Tip 1


Probable Cause: This error can occur if you accidentally saved invalid credentials for Git.

Resolution: Go to Eclipse > Window > Open Perspective > Git.
Open <project repository> > Remotes > origin.
Click on the push branch (the one with the red arrow) > Change Credentials.
Verify your credentials have been entered correctly; if not, correct them and then try the push again.
If this doesn't work, the problem probably lies with the Git credentials themselves - verify them directly against the Git server.



  • Tip 2

When doing Push to Upstream you get below error:


Probable Cause: There have been other commits to the remote repository in the meanwhile, due to which your local repository's head has fallen behind the remote's. You can also see a down arrow in your Project Explorer view, indicating there are changes in the remote repo that are not in your local repo.


Resolution: Do a Git Pull. This brings the latest changes from remote to local, provided there are no conflicting changes.

  • Tip 3

When doing a Git Pull you get below error:


Probable Cause: Another developer has been working on the same file as you and checked it in before you did.

Resolution: Go to Eclipse > <file having conflict> > Replace With > HEAD Revision. This overrides your changes with the ones in the remote repo. Now a Git Pull should succeed.


  • Tip 4

When doing a Git Pull you get below error:

Probable Cause: Your local repo’s head has got out of sync with remote.

Resolution: Team > Reset > <from References, choose the repo that points to the latest code> > Mixed reset type. This brings your local repo's head in sync with the remote's.
Then do a Git Pull, Commit your changes and Push to Upstream.

  • Tip 5

Modified files not showing when doing a Commit.

Probable Cause: You probably did a Commit and Push; though the Commit was successful, the Push failed for one of the reasons above.

Resolution: Go to Git Staging View > Staged files. This shows files that have been committed but are yet to be pushed. If you see your changes here, do a Push to Upstream to push them to the remote repo.

  • Tip 6

Resolving GIT merge conflicts (A conflict is shown by red arrow as shown below)

Probable Cause: Multiple developers working on same file simultaneously.

Resolution: Go to the file having the conflict > Team > Merge Tool. Resolve the conflicts, then Add to Index and do a Commit followed by a Push. Now a Git Pull should succeed.

Thursday, June 25, 2015

Improving Code Maintainability with Sonar and Jenkins

Problem Statement


Application code growing at an uncontrollable pace, few experienced people on the team, challenging deadlines - all of these made a nightmarish mix for our application from a maintenance standpoint.

Issue:
I was tired of the fact that people didn't seem to realize the importance of writing comments. File after file, method after method was being created with no explanation of why it was needed or what purpose it served.

Solution: 
This time I decided to at least try to put a stop to this never-ending nightmare. The solution, in my eyes, lay in automating code reviews rather than relying on manual reviews alone (which was exhausting and boring). We were already using Jenkins to build our code, so (with the help of a very talented colleague) I added SonarQube to it.
Then I tweaked the SonarQube quality rules and made comment density a critical component. Initially I set comment density at 5%, meaning that for every 100 lines there must be at least 5 lines of comments. If this criterion is not fulfilled for any of the files checked in on a given day, the build for that day breaks, requiring urgent attention from team members.
The beauty of this rule is that even if a single line has been modified in a file (say as part of a defect fix), the person checking in the file is forced to bring the entire file's comment density up to par, thereby improving the application's readability and maintainability.
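For intuition, comment density as SonarQube computes it is comment lines divided by comment-plus-code lines; here is a rough sketch (my own helper, not Sonar's API):

```java
/** Sketch of SonarQube's comment-density metric:
 *  density = comment lines / (code lines + comment lines) * 100. */
public class CommentDensity {
    static double density(int commentLines, int codeLines) {
        int total = commentLines + codeLines;
        // Guard against an empty file to avoid division by zero.
        return total == 0 ? 0.0 : 100.0 * commentLines / total;
    }

    public static void main(String[] args) {
        // 5 comment lines alongside 95 code lines gives 5% density.
        System.out.println(density(5, 95)); // prints 5.0
    }
}
```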

The screenshot below shows how to configure comment density in SonarQube.

Go to Dashboards>Quality Profiles> (Your quality profile)
Search for relevant rule by entering 'comment' in Search Box. Activate 'Insufficient comment density' rule and set its Severity.
 


Friday, June 5, 2015

Implement Throttling using Apache Camel

Problem Statement


Our application was getting a huge surge of orders in a short duration, and only at certain periods of the day. To quantify: we were getting around 5000-6000 orders in a span of 30-45 minutes, around three times a day. The rest of the time, order volume was 100-300 orders an hour. To tackle this load, the number of app servers was increased from 4 to 8.

Issue
However, the concern was that this costly infrastructure sat idle for most of the day. Going by the stats above, the servers were idle (usage well below capacity) for roughly 70-80% of the time.

Solution
To optimize this we were asked to explore the option of Throttling.

Throttling: in software terms, throttling is a process that regulates the rate at which an application processes requests.
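Camel's Throttler does this for you, but the underlying idea can be sketched as a simple sliding-window limiter. This is only an illustration of the concept, not Camel's actual implementation:

```java
import java.util.ArrayDeque;

/** Sliding-window throttle: at most maxRequests in any windowMillis span. */
class SlidingWindowThrottle {
    private final int maxRequests;
    private final long windowMillis;
    private final ArrayDeque<Long> stamps = new ArrayDeque<>();

    SlidingWindowThrottle(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    /** Returns true if a request may proceed at time nowMillis. */
    synchronized boolean tryAcquire(long nowMillis) {
        // Discard timestamps that have slid out of the current window.
        while (!stamps.isEmpty() && nowMillis - stamps.peekFirst() >= windowMillis) {
            stamps.pollFirst();
        }
        if (stamps.size() < maxRequests) {
            stamps.addLast(nowMillis);
            return true;   // under the limit: let the request through
        }
        return false;      // over the limit: caller should wait and retry
    }

    public static void main(String[] args) {
        // One request per minute, mirroring the Camel route further below.
        SlidingWindowThrottle t = new SlidingWindowThrottle(1, 60_000);
        System.out.println(t.tryAcquire(0));      // true
        System.out.println(t.tryAcquire(1_000));  // false: window not elapsed
        System.out.println(t.tryAcquire(60_000)); // true: window slid forward
    }
}
```

Requests rejected by `tryAcquire` simply stay where they are (in Camel's case, files waiting in the inbox folder) until a later attempt succeeds.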

Behaviour of order processing before implementing throttling: our application exposes a REST web service. Client systems send order XMLs to this WS, and each order is put on a JMS queue for processing.

Apache Camel Throttling PoC
I began by segregating the above behaviour into two components/routes.
One Camel route listened on the REST WS and wrote requests to a folder (say, inbox). Another route picked up requests from this folder and put them on the JMS queue, but only after passing through Camel's throttler. Camel lets you define the number of requests to be picked up in a given time interval, ensuring that even if order volumes surge, only a predefined number of requests reach the processing stage. The rest stay in the file system waiting to be picked up.

Sample code below:
<camel:route>
  <!-- Reading from REST url -->
  <camel:from uri="<REST WS url>" />
  <to
     uri="file:data/inbox?fileName=${header.OrderNumber}-${header.OrderVersion}.xml"
     pattern="InOut" />
</camel:route>
<camel:route>
  <!-- Route to put message from folder to main queue -->
  <camel:from uri="file:data/inbox" />
  <!-- Camel's built-in throttling: defines the number of requests that can be processed in a given time period -->
  <throttle timePeriodMillis="60000">       
   <constant>1</constant>
   <camel:log
    message="Successfully processing service order ${headers.OrderNumber}-${headers.OrderVersion}.xml" />
   <camel:to uri="file:data/outbox" />
  </throttle>
</camel:route>

Thursday, May 21, 2015

Apache Camel Redelivery Policy

Problem Statement

Continuing from the problem described in my previous post: I wanted that, once an exception occurred in my route, instead of polling the request again immediately, the route should sleep for a while - say half an hour. This would give the servers sufficient time to stabilize and avoid unnecessary polling.

Issue
I used Camel's redeliveryPolicy to do this. Since I wanted the request to be polled until it was successfully processed, not just for some predefined maximum number of retries, I specified only redeliveryDelay.
   <camel:onException>
        <camel:exception>java.lang.Exception</camel:exception>
        <camel:redeliveryPolicy redeliveryDelay="50000" />
        <camel:log message="Default error handler was called"></camel:log>
    </camel:onException>


Even though onException was getting invoked, the delay was not taking effect.

Solution
On my colleague's advice I added the maximumRedeliveries attribute as well, and the delay finally took effect. The code that worked is below:
 <camel:onException>
  <camel:exception>java.lang.Exception</camel:exception>
  <camel:redeliveryPolicy maximumRedeliveries="1" redeliveryDelay="50000" />
  <camel:log message="Default error handler was called"></camel:log>
 </camel:onException>


So it seems that in redeliveryPolicy, maximumRedeliveries and redeliveryDelay go hand in hand. This matches Camel's default of maximumRedeliveries=0 (no redeliveries at all), which leaves redeliveryDelay with nothing to delay; to retry indefinitely, set maximumRedeliveries to -1 instead of a fixed count.