Learning Regex

For the past several days, I have been learning how to automate the mundane activities that my colleagues and I do at work. That is how I came to know about web scraping. I didn’t know there was a specific term for what I was trying to achieve.

Web scraping helps in data collection jobs, where all data is publicly available and you just need to copy/paste the data from one place to another.

Sometimes my job involves the tedious task of copy-pasting from the web into an Excel sheet. So I decided to learn web scraping and let the computer do the brainless work, while I do work that is good food for my brain.

As I started to learn more about web scraping, I discovered that my existing knowledge of HTML/CSS would also help in getting the desired result. This got me interested. I also discovered that web scraping involves extensive use of Python and its libraries, along with a little knowledge of Regex (regular expressions). So, to meet the prerequisites, I took up Python at Learn Python the Hard Way and Regex at RegexOne. Both are really great and engaging resources for beginners.

Thankfully, I was able to complete the Regex lessons in one day. I would like to share some of the solutions I came up with for the Regex practice exercises at RegexOne:

Problem My Solution
Problem 1 (-|)\d+(\.\d+[e]\d+|\.\d+|,\d+\.\d+|)[^p]$
Problem 2  (\d{3})
Problem 3  (\w+\.?\w+)
Problem 4  <(\w+)
Problem 5  (\w+)\.(jpg|png|gif)$
Problem 6  \s+(.+)
Problem 7  \w\/(\w+)?\( \w+\)\:\s+at widget.List.(\w+)\((\w+.java):(\w+)
Problem 8  (\w+)://(\w+\.?-?\w+\.?\w+):?(\d+)?

Some of these solutions differ from the ones originally posted on the website. Seeing alternative patterns that achieve the same objective might help in understanding RegEx better.
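To try a couple of these patterns outside the website, here is a minimal Python sketch using the standard `re` module. The two patterns are my solutions to Problems 5 and 8 from the table; the helper names (`image_name_and_ext`, `url_parts`) and the sample strings are made up for illustration and are not taken from the RegexOne exercises.

```python
import re

# Patterns copied from my solutions to Problem 5 and Problem 8.
IMAGE_FILE = re.compile(r"(\w+)\.(jpg|png|gif)$")
URL_PARTS = re.compile(r"(\w+)://(\w+\.?-?\w+\.?\w+):?(\d+)?")


def image_name_and_ext(filename):
    """Return (name, extension) for image files, or None for anything else."""
    match = IMAGE_FILE.search(filename)
    return match.groups() if match else None


def url_parts(url):
    """Return (scheme, host, port); port is None when the URL has no port."""
    match = URL_PARTS.search(url)
    return match.groups() if match else None


# Illustrative inputs (assumptions, not the official test strings).
print(image_name_and_ext("img0912.jpg"))      # ('img0912', 'jpg')
print(image_name_and_ext("img0912.jpg.tmp"))  # None
print(url_parts("ftp://file_server.com:21/top_secret/plans.pdf"))
# ('ftp', 'file_server.com', '21')
print(url_parts("https://regexone.com/lesson"))
# ('https', 'regexone.com', None)
```

The `$` anchor in the first pattern is what rejects `img0912.jpg.tmp`: the extension has to be the last thing in the string.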

Credits: http://twiki.org/cgi-bin/view/Codev/TWikiPresentation2013x03x07

I really loved the learning experience at this website, and I would recommend this website for beginners to learn RegEx. [RegExOne]

Additional resources for learning Regex include cheat sheets, comics, and videos available on the internet. Check out more resources here.

Companies in Delhi

There is a variety of companies in Delhi, ranging from trading, construction, business services, and real estate to manufacturing, transport, and social services. As of March 2015, there were 272,369 companies registered in Delhi state (this doesn’t include companies in NCR). Not surprisingly, more than 33% of these companies belong to the service sector. Since India has one of the fastest-growing service sectors, this share is likely to increase further. The number of service-oriented companies was about three times that of the next category, trading companies. The infographic below shows the percentage of companies by the type of business they were involved in, as of March 2015.

Companies in Delhi (by business activity – Percentage)

What surprised me was the fact that there were a significant number of social services companies in the state, outnumbering financial companies, machinery manufacturing companies, and transportation companies. This might be due to the fact that a significant number of NGOs registered in Delhi are reportedly used as money laundering devices. [http://www.hindustantimes.com/delhi/99-ngos-are-fraud-money-making-devices-hc/story-2AMyh5VMGA0edUtvtRnAMP.html]

Further, if we look at the number of companies by paid-up capital, we find that most of the companies have exactly the bare minimum paid-up capital of Rs 100,000. This could be because a lot of companies are set up as dummies just to exploit tax laws and government schemes. The data for companies in Delhi (by paid-up capital) is shown below.

Companies in Delhi (by paid up capital – March 2015)

The status of various companies as of March 2015 is listed below:

Status Number of companies
Active 197,051
Active (in progress) 27
Amalgamated 3,078
Converted to LLP 189
Converted to LLP and Dissolved 532
Dissolved 109
Dormant 28,168
Dormant (u/s 455 of CA 2013) 65
Liquidated 54
Strike Off 40,966
Under Liquidation 689
Striking Off (under process) 1,441

Other facts about companies in Delhi:

  • The top four companies (by paid up capital) in Delhi were Air India, BSNL, DMRC, and ONGC.
  • Among the 272,369 companies registered in Delhi, more than 72% were active, 15% had been struck off the register, and 10% were deemed dormant as of March 2015.
  • The number of publicly listed companies was 19,231, while the number of private companies in Delhi was 252,738 as of March 2015; the remaining 400 were One Person Companies.
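The shares quoted above can be re-derived from the status table. Here is a small Python sketch that does the arithmetic; the counts are copied from the table, while the variable names are only illustrative.

```python
# Company counts by status, copied from the March 2015 table above.
status_counts = {
    "Active": 197_051,
    "Active (in progress)": 27,
    "Amalgamated": 3_078,
    "Converted to LLP": 189,
    "Converted to LLP and Dissolved": 532,
    "Dissolved": 109,
    "Dormant": 28_168,
    "Dormant (u/s 455 of CA 2013)": 65,
    "Liquidated": 54,
    "Strike Off": 40_966,
    "Under Liquidation": 689,
    "Striking Off (under process)": 1_441,
}

total = sum(status_counts.values())

# Percentage share of each status, rounded to one decimal place.
shares = {
    status: round(100 * count / total, 1)
    for status, count in status_counts.items()
}

print(total)                 # 272369
print(shares["Active"])      # 72.3
print(shares["Strike Off"])  # 15.0
print(shares["Dormant"])     # 10.3
```

The status counts add up to the 272,369 registered companies, which is a quick consistency check on the table.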

RAM upgradation

Recently, I upgraded the RAM in my new laptop from the built-in 4GB to 8GB using an Amazon lightning deal for a Transcend 4GB RAM module; however, my experience was not really good.

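The good thing is that the new RAM is working fine and my computer now shows 8GB of RAM. However, the two modules do not have matching timings: the new module’s maximum CAS latency is 9, while the pre-installed module’s is 11. Note that a lower CAS latency means fewer clock cycles of delay, so on its own it does not make the new module slower; the real delay also depends on the memory clock. Here is a screenshot showing the details of the current DDR3 RAM.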

Current RAM details
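To compare CAS latency figures properly, they need to be converted from clock cycles to time. Here is a rough Python sketch of that conversion. The data rate of 1600 MT/s used in the example is purely an assumption for illustration; I have not read the actual speed of either module from the screenshot, and the function name is made up.

```python
def cas_latency_ns(cas_cycles, data_rate_mt_s):
    """Approximate CAS delay in nanoseconds for DDR memory.

    DDR memory transfers data twice per clock, so the clock in MHz is
    data_rate_mt_s / 2 and one cycle lasts 2000 / data_rate_mt_s ns.
    """
    return cas_cycles * 2000 / data_rate_mt_s


# Hypothetical comparison: both modules assumed to run at 1600 MT/s.
print(cas_latency_ns(9, 1600))   # 11.25
print(cas_latency_ns(11, 1600))  # 13.75
```

At the same data rate, CL 9 is the shorter delay. A module with a lower CAS latency is only slower if it also runs at a sufficiently lower clock.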

Now my HP 15-ac082tx laptop runs faster with this piece of hardware. 😊

Extract images from office document

At times you might need to extract images from a PowerPoint presentation or a Word document. Extracting those images one by one is tedious and time consuming.
This can be done easily by renaming the PowerPoint/Word file to a .zip file and then extracting its contents. This works for the newer XML-based formats (.pptx, .docx), which are zip archives internally.
Here is how a typical PowerPoint file looks when unzipped.

Directory structure of powerpoint file

The tree structure above reveals how information is organized in a Microsoft Office PowerPoint file.

  • docProps folder is used to store file properties and thumbnail information
  • ppt folder contains information on the various objects used in a presentation
  • _rels folder defines the relationships among the various files

To get all the images in a Microsoft Office document, look in the media folder (ppt/media in a PowerPoint file, word/media in a Word file), which contains all the images used in the document.
Similarly, the other folders contain information about the other objects used in the document.
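The same extraction can be scripted without renaming anything, because Python's standard `zipfile` module can open a .pptx or .docx file directly. Below is a minimal sketch; the function name `extract_media` and the file names in the usage comment are placeholders I made up, and it assumes the images sit in a folder named media as described above.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(office_file, out_dir):
    """Copy every file found in a 'media' folder of an Office document.

    Works for zip-based formats such as .pptx and .docx.
    Returns the list of paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(office_file) as archive:
        for name in archive.namelist():
            parts = PurePosixPath(name).parts
            # Keep only files whose parent folder is 'media',
            # e.g. ppt/media/image1.png or word/media/image1.png.
            if len(parts) >= 2 and parts[-2] == "media" and not name.endswith("/"):
                target = out / parts[-1]
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved


# Example usage (placeholder file names):
# for path in extract_media("slides.pptx", "extracted_images"):
#     print(path)
```

Only the file name is kept when writing, so everything lands flat in the output folder.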

You can read more about this tree structure at MSDN.