Saturday, 8 October 2016

Use of Python to Check URL Indexing

Organic search involves three main stages: crawling, indexing and ranking. Crawling is the process of a search engine arriving at your website and following its links. The search engine then stores those pages in its index; this is indexing. Finally, the search engine uses different signals to decide where your website’s URLs should appear for specific search queries; this is ranking.

Most search engine optimizers give their full attention to the ranking part alone, overlooking the fact that a website cannot appear in search engine result pages at all if the search engine hasn’t crawled and indexed it. With that said, it is important to understand that Indianapolis SEO can succeed only if the search engine has crawled the website and indexed it properly.

How can you determine whether your website has been indexed?

Using Google Search Console, you can see how many pages your XML sitemap contains and how many of them are indexed. However, this feature isn’t designed for a thorough analysis that tells you which specific pages Google hasn’t indexed yet.

This means you will have to dig a bit deeper, and doing it manually would take a lot of time. Fortunately, there is a small tool that can help with the task: Python.

Let’s first discuss how to check whether a single URL is indexed by Google. It can be done with the “info:” search operator, like this:

info:http://domain-name.com

If the URL is indexed, the result page will show one search result, which will be your website’s link. If the URL isn’t indexed, Google will tell you that no documents matched the query.
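For illustration, here is a minimal Python sketch of that single-URL check. It is not the downloadable script itself: it assumes the requests library is installed (pip install requests) and it simply looks for Google’s “did not match any documents” message instead of parsing the page with BeautifulSoup. Keep in mind that Google’s wording and markup can change, and automated queries get blocked quickly, which is exactly why the full workflow below uses a proxy.

# Minimal sketch of a single-URL "info:" check (illustrative only).
# Assumes the requests library is installed: pip install requests
import requests

def is_indexed(url):
    response = requests.get(
        "https://www.google.com/search",
        params={"q": "info:" + url},
        headers={"User-Agent": "Mozilla/5.0"},  # Google rejects requests without a user agent
    )
    # An indexed URL produces a result page listing it; a non-indexed URL
    # produces a "did not match any documents" message.
    return "did not match any documents" not in response.text

print(is_indexed("http://domain-name.com"))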

Using Python to check indexing status of multiple URLs
To check one link, the trick mentioned above is certainly enough. But what do you do when there are more than 1,000 pages to check? If you had 100 people working for you on their own PCs, you could give 10 URLs to each person and get the report within a few minutes. Or you can use Python.

To get started, first install Python 3. Then install a library called BeautifulSoup by running a simple command in Command Prompt:

pip install beautifulsoup4

Now you are ready to download the script. In the folder where the script is stored, create a simple text file and add your links to it, with one URL per line.
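For example, assuming the file is saved as urls.txt next to the script (the actual file name may differ), it can be read like this:

# Read the URL list, one URL per line, skipping blank lines (urls.txt is assumed).
with open("urls.txt") as url_file:
    urls = [line.strip() for line in url_file if line.strip()]

print("Loaded", len(urls), "URLs to check")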

To run this script, you first need a free proxy. For that purpose, download the Tor Expert Bundle, extract it and run it. Then download Polipo, which will expose Tor as an HTTP proxy.


Go to the Polipo folder and create a text file named config.txt. Enter the following settings in that file:

socksParentProxy = "localhost:9050"
socksProxyType = socks5
diskCacheRoot = ""
disableLocalInterface = true

Then open the command prompt, navigate to the Polipo directory and run the following command:


polipo.exe -c config.txt
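By default, Polipo now listens as an HTTP proxy on localhost, port 8123, and forwards traffic to Tor. A Python script can route its Google queries through it roughly as follows (a sketch using the requests library; the downloaded script may handle this differently):

# Sketch: send a request through the local Polipo HTTP proxy (default port 8123).
import requests

PROXIES = {
    "http": "http://localhost:8123",
    "https": "http://localhost:8123",
}

response = requests.get(
    "https://www.google.com/search",
    params={"q": "info:http://domain-name.com"},
    headers={"User-Agent": "Mozilla/5.0"},
    proxies=PROXIES,
)
print(response.status_code)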


Once Polipo is running, you can run the Python script:

python indexchecker.py


After you run the script, you will be asked to enter the number of seconds to wait between checking two URLs. The end result is a CSV file in which every link is listed along with its status: ‘TRUE’ against a URL means the URL is indexed, and ‘FALSE’ means it isn’t.
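Putting the pieces together, the overall flow of such a checker looks roughly like the sketch below. It is an illustration of the behaviour just described, not the downloadable script itself; the requests library and the file names urls.txt and results.csv are assumptions.

# Illustrative end-to-end sketch: ask for a delay, check every URL through
# the local proxy, and write TRUE/FALSE statuses to a CSV file.
import csv
import time

import requests

PROXIES = {"http": "http://localhost:8123", "https": "http://localhost:8123"}
HEADERS = {"User-Agent": "Mozilla/5.0"}

def is_indexed(url):
    # A URL counts as indexed if the "info:" query returns a normal result
    # page rather than a "did not match any documents" message.
    response = requests.get(
        "https://www.google.com/search",
        params={"q": "info:" + url},
        headers=HEADERS,
        proxies=PROXIES,
    )
    return "did not match any documents" not in response.text

delay = int(input("Delay between URLs (seconds): "))

with open("urls.txt") as url_file:
    urls = [line.strip() for line in url_file if line.strip()]

with open("results.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(["URL", "Indexed"])
    for url in urls:
        writer.writerow([url, str(is_indexed(url)).upper()])  # TRUE / FALSE
        time.sleep(delay)

The delay is there so the queries don’t look like a flood; if Google still starts blocking them, increasing the delay is the first thing to try.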