Common mistakes in website deployment

Security during deployment is probably one of the most ignored aspects of website development. Many novice web developers deploy their websites in cPanel by uploading a zip file of the source and extracting it with the file manager. Worse, they sometimes forget to delete the zip file after extraction, leaving the source code available for anyone to download. Many developers also fail to secure their .git folder, which exposes the source in the same way. Some even place API keys in their git source code, which makes the exposure worse.

To see how common these problems are, I downloaded the Alexa list of the top one million websites. The goal was to find mistakes made by website developers during deployment.

The first thing to check is whether the developer forgot to delete the backup zip file of the website. A novice developer will often deploy a website as a compressed zip file of the source code. This is particularly easy for beginners: they can compress all the files into a single archive, upload it in the cPanel file manager and extract it using the online interface, which saves them from uploading each file individually. The zip filename is usually the name of the website itself. For example, for the website example.com, the zip filename is usually example.zip. Other common filenames are www.zip, htdocs.zip, etc.

Another silly mistake is exposing the .git repository of the website. This is common even among fairly adept developers. We can request the HEAD file inside the .git directory to check whether the directory is accessible. For example, for example.com, we can check the response of example.com/.git/HEAD

Another file that can reveal information about a website is the error_log file generated by Apache. Although it doesn’t normally contain confidential information, details such as the database name, the path of the current script and the filenames of scripts may be exposed in it.
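The three checks described so far can be sketched as a small URL generator. This is only an illustration in Python: the helper name candidate_urls and the choice of plain http are my assumptions, and the filenames are just the ones mentioned above.

```python
def candidate_urls(domain):
    """Return the URLs to probe for one domain (hypothetical helper)."""
    host = domain.lower().strip()
    if host.startswith("www."):
        host = host[4:]
    # "example.com" -> "example", used for the site-named archive
    site_name = host.split(".")[0]

    paths = [
        site_name + ".zip",  # archive named after the website
        "www.zip",           # other common archive names
        "htdocs.zip",
        ".git/HEAD",         # exposed git repository
        "error_log",         # Apache error log
    ]
    return ["http://{}/{}".format(host, path) for path in paths]


if __name__ == "__main__":
    for url in candidate_urls("example.com"):
        print(url)
```

Running this over every domain in the list and writing the output to a file gives the input expected by the checking step.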

The first step in analyzing the websites is to prepare a list of URLs to check. A simple script can generate this list; the HTTP response code of each URL then indicates whether the file exists. The following script gets the HTTP response code of each URL in the file “urllist” and saves the results in “results.csv”. It uses curl to get the HTTP response and xargs to run the requests in parallel.

#!/bin/bash
xargs -n1 -P 10 curl -o /dev/null --silent --head --location --write-out '%{url_effective};%{http_code};\n' < urllist | tee results.csv

Leaving this script to run for a few days gave a few million URLs and their response codes. I then filtered the list down to the URLs that returned a 200 code and partitioned the result into “zip files”, “git repos” and “error log files”.
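The filtering and partitioning step might look roughly like the sketch below. It assumes each line of results.csv has the form url;code; as written by the curl command above; the function name and the grouping by URL suffix are my own choices, not necessarily how the original filtering was done.

```python
def partition_results(lines):
    """Group URLs that returned HTTP 200 by the kind of file they point to."""
    groups = {"zip": [], "git": [], "error_log": []}
    for line in lines:
        # Each line looks like "http://example.com/www.zip;200;"
        url, _, code = line.strip().rstrip(";").rpartition(";")
        if code != "200":
            continue
        if url.endswith(".zip"):
            groups["zip"].append(url)
        elif url.endswith("/.git/HEAD"):
            groups["git"].append(url)
        elif url.endswith("/error_log"):
            groups["error_log"].append(url)
    return groups


if __name__ == "__main__":
    sample = [
        "http://example.com/example.zip;200;",
        "http://example.com/.git/HEAD;404;",
        "http://example.org/error_log;200;",
    ]
    print(partition_results(sample))
```

Because curl was run with --location, the recorded URL is the final one after redirects, so a request that was redirected to a different page falls outside all three groups here.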

This gave a list of around 8000 zip URLs. Not all of these URLs point to zip files containing source code, because many websites respond with 200 for any URL. So, I wrote a script to check the header and content of each file to verify that the URL serves a valid zip file. The final filtered list contains around 1600 zip files. I manually checked the content of some of them. Most are WordPress backups, exposing the database name, username and password among other things. Some are harmless, containing only static files. Hence, roughly one out of every 625 websites in the list (1600 out of one million) is exposing its complete or partial source code as a downloadable zip file.
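A check of this kind can be sketched with Python's standard zipfile module. This is not the original verification script, just one plausible way to do it: the function name is hypothetical, and it assumes the whole response body has already been downloaded into memory.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # signature at the start of a non-empty zip file


def looks_like_zip(data):
    """Return True if the downloaded bytes form a readable zip archive."""
    # Cheap header check first: an HTML "not found" page fails here.
    if not data.startswith(ZIP_MAGIC):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() returns None when every member passes its CRC check.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False
```

For large archives it would be cheaper to request only the first few bytes and check the signature, at the cost of accepting truncated or corrupt files.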

Another weak point of website deployment is the git directory (or the repository of any other version control system). Website developers often use git as an easy deployment tool. Although placing secret keys in git is highly discouraged, many developers do it anyway, which increases the damage if an intruder gets hold of the repository. The script gave a list of around 10000 websites that returned a 200 HTTP response for /.git/HEAD. I wrote another program to verify that the git directory is really a valid repo and that the website isn’t just returning a 200 response to every request. That resulted in a list of 3000 exposed git repositories.
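One simple way to tell a real repository from a catch-all 200 page is to look at the body of /.git/HEAD, which in a git repository is either a symbolic ref such as "ref: refs/heads/master" or a 40-character commit hash. The sketch below checks only that; the function name is hypothetical and the original verification program may well have checked more.

```python
import re

_SYMBOLIC_REF = re.compile(r"^ref: refs/\S+$")
_COMMIT_HASH = re.compile(r"^[0-9a-f]{40}$")


def looks_like_git_head(body):
    """Return True if the text looks like the content of a .git/HEAD file."""
    text = body.strip()
    return bool(_SYMBOLIC_REF.match(text) or _COMMIT_HASH.match(text))
```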

The error log URL list contained around 30000 URLs. I didn’t bother to filter the error_log URL list, as the file contains only a small amount of information that might not be very useful or interesting to intruders.

How to prevent these loopholes?

Small precautions taken during website deployment can easily prevent these mistakes. First of all, don’t compress the content of your website to upload or download it. cPanel users can use FTP with a client like FileZilla to upload or download the source code. If possible, use a version control system for deployment. This will save your website from being exposed as a downloadable zip file in case you forget to delete the file.

Also, never place your secret API keys or other keys in source code managed by a version control system. There are other places to put them, such as environment variables or a separate file on the server. If you place the secret keys in the repository, any person who has access to your repository will have access to all the secret keys.
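As a minimal sketch of the environment-variable approach in Python: the helper name get_secret and the variable name PAYMENT_API_KEY are made up for illustration.

```python
import os


def get_secret(name, environ=os.environ):
    """Read a secret from the environment and fail loudly if it is missing."""
    value = environ.get(name)
    if not value:
        raise RuntimeError("Missing required setting: " + name)
    return value


if __name__ == "__main__":
    # PAYMENT_API_KEY is a hypothetical variable set on the server,
    # never committed to the repository.
    try:
        api_key = get_secret("PAYMENT_API_KEY")
        print("key loaded, length", len(api_key))
    except RuntimeError as error:
        print(error)
```

Failing at startup when a key is missing is a design choice: it is easier to notice than a request that fails later with an empty key.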

It is also important to secure your .git directory. The easiest method is to point the domain root to a sub-directory of your repo. For example, consider a project mywebsite managed using git. The directory structure should be more or less like this:

mywebsite (managed by git)
–> .git
–> content (Contains all the website content)

In this way, any outside person will have no way to access the .git directory directly.

If you already have a directory hierarchy where your .git directory is inside your website root, you can prevent access to the directory using Apache or nginx rules. If you use Apache, placing the following in the .htaccess file of the website root will prevent access to the .git folder:

RedirectMatch 404 /\.git

If you use nginx, place the following in your server block:

location ~ /\.git {
  deny all;
}

So, your website nginx configuration will look like this

server {
	listen 80 default_server;
	listen [::]:80 default_server ipv6only=on;

	root /usr/share/nginx/html;
	index index.html index.htm;

	
	server_name localhost;
	# Block .git directory
	location ~ /\.git {
		deny all;
	}
	# Make site accessible from http://localhost/
	location / {
		# First attempt to serve request as file, then
		# as directory, then fall back to displaying a 404.
		try_files $uri $uri/ =404;
		# Uncomment to enable naxsi on this location
		# include /etc/nginx/naxsi.rules
	}
}
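The pattern /\.git used in both rules is an unanchored regular expression, so it matches more than the repository directory itself. The Python sketch below approximates the matching with the re module (nginx and Apache use PCRE, which behaves the same way for a pattern this simple); the function name is only for illustration.

```python
import re

GIT_PATTERN = re.compile(r"/\.git")


def is_blocked(path):
    """Return True if the request path would match the /\\.git rule."""
    return GIT_PATTERN.search(path) is not None


if __name__ == "__main__":
    for path in ["/.git/HEAD", "/.gitignore", "/blog/.git/config", "/index.html"]:
        print(path, is_blocked(path))
```

Paths such as /.gitignore and repositories nested in sub-directories are covered as well, which is usually what you want.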

Installing and managing PostgreSQL Database Server

PostgreSQL (Postgres) is a very popular open source database management system. I use it frequently in my Django projects. Here are a few tips for setting up Postgres from the very beginning. I will also discuss troubleshooting some common problems during its usage.

Installing Postgres

Installing postgres is very straightforward.

sudo apt-get -y install postgresql postgresql-contrib

If you are using postgres in a django project, you might like to install some dependencies for postgres to work with django.

sudo apt-get install -y libpq-dev python-dev

Configure Postgres

After installing postgres, you can use it to manage your databases. You can start by creating a database and creating a user to manage the database. To create a database, first of all switch to user postgres as

sudo su - postgres

Then, create your database using the following command

createdb testdatabase

Replace “testdatabase” with any database name you wish to create. Then, create a user using the command

createuser -P testuser

Replace “testuser” with any username you like. You should be prompted to enter a password for the new user. Enter a strong password.

Now, we want the newly created user to be able to manage the newly created database. So, to grant the user permission for the new database, first of all, start the PostgreSQL interactive terminal as

psql

Then, use the command below to grant the user “testuser” permissions to use database “testdatabase”

GRANT ALL PRIVILEGES ON DATABASE testdatabase TO testuser;

That’s all. Use \q to exit the postgres terminal and then enter exit to exit the postgres user mode.
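If the database is for a Django project, the new database and user could be wired into settings.py along these lines. This is a sketch under a few assumptions: the psycopg2 driver is installed, the server runs on the same machine on the default port, and the password is read from an environment variable that I have called DB_PASSWORD. The engine name shown is the one used by Django releases of the 1.7 era; newer releases call it django.db.backends.postgresql.

```python
import os

# Fragment of a Django settings.py (names match the commands above)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "testdatabase",
        "USER": "testuser",
        # DB_PASSWORD is a hypothetical environment variable
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": "localhost",
        "PORT": "5432",
    }
}
```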

Deploying Django 1.7 with python3 using virtualenv in Ubuntu

This is a tutorial on deploying a Django application created with Django 1.7 and Python 3 using virtualenv on an Ubuntu server. It assumes that you have already done the initial configuration of your server, such as creating a new user and setting up security measures.

First of all, update your packages.

sudo apt-get update
sudo apt-get upgrade

Then, install python3 and python3-pip on your server (if they don’t exist already)
sudo apt-get install -y python3
sudo apt-get install -y python3-pip

Next, install virtualenv
sudo apt-get install python-virtualenv

Let’s assume that you want to install your application in ~/djangoapps/ directory. Then, create a python virtual environment in the directory as
virtualenv --no-site-packages --distribute -p /usr/bin/python3 ~/djangoapps/myappname

Now, let’s get inside that virtual environment

source ~/djangoapps/myappname/bin/activate

Install django and any other requirements for your applications inside the environment

pip3 install django

Next step is to install gunicorn server. You can install gunicorn as

pip3 install gunicorn

That’s all for now. Deactivate the virtual environment as

deactivate

Next step is to install nginx server.

sudo apt-get install nginx

Go to ~/djangoapps/myappname/ and create a file startgunicorn.sh . Replace the project name and path in the code below and copy the content to that file.

#!/bin/bash
NAME="yourappname" # Name of the application
DJANGODIR=/path/to/your/app # Django project directory

# bind gunicorn to this port. Make sure that you use different
# port for different django projects
BINDIP=127.0.0.1:8214
USER=root # the user to run as
GROUP=root # the group to run as
NUM_WORKERS=9 # how many worker processes should Gunicorn spawn
DJANGO_SETTINGS_MODULE=yourappname.settings # which settings file should Django use
DJANGO_WSGI_MODULE=yourappname.wsgi # WSGI module name
 
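echo "Starting $NAME"

# Activate the virtual environment created earlier so that its gunicorn is used
# (adjust the path if your environment lives elsewhere)
source ~/djangoapps/myappname/bin/activate
# Export the settings module so that Django can see it
export DJANGO_SETTINGS_MODULE

cd "$DJANGODIR"
# Start Gunicorn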
# Programs meant to be run under supervisor should not daemonize themselves (do not use --daemon)
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--name $NAME \
--workers $NUM_WORKERS \
--user=$USER --group=$GROUP \
--bind=$BINDIP \
--log-file=-
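The script above hard-codes NUM_WORKERS=9. The Gunicorn documentation suggests (2 x number of cores) + 1 as a starting point, which gives 9 on a four-core machine; whether that is how the value above was chosen is my guess. A small Python sketch of that rule of thumb, with a hypothetical function name:

```python
import multiprocessing


def suggested_workers(cores=None):
    """Starting point suggested by the Gunicorn docs: (2 x cores) + 1."""
    if cores is None:
        cores = multiprocessing.cpu_count()
    return 2 * cores + 1


if __name__ == "__main__":
    print(suggested_workers())
```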

Finally, we have to configure nginx to serve the static files and the Django website. Go to /etc/nginx/sites-available/ and create a file named mysite. Place the following inside that file

    server {
        server_name yourdomainorip.com;

        access_log off;

        location /static/ {
            alias /opt/myenv/static/;
        }

        location / {
                proxy_pass http://127.0.0.1:8214;
                proxy_set_header X-Forwarded-Host $server_name;
                proxy_set_header X-Real-IP $remote_addr;
                add_header P3P 'CP="ALL DSP COR PSAa PSDa OUR NOR ONL UNI COM NAV"';
        }
    }

Make sure to change the server_name to your domain name and the port number to the port number you specified earlier.

After this, go to /etc/nginx/sites-enabled and create a link to the previous file using this command:

sudo ln -s ../sites-available/mysite

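Reload nginx so that it picks up the new site (for example, sudo service nginx reload). Then go to ~/djangoapps/myappname/, make the script executable with chmod +x startgunicorn.sh and start it as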

./startgunicorn.sh

Test from your browser whether the website is running. If it is working, suspend the gunicorn process using Ctrl-Z, then type “bg” and hit enter. This resumes the gunicorn server in the background. (Ctrl-C would stop the server instead of suspending it.)

Nepali Ngram Models

There are lots of resources for English ngram models. Google has a huge repository of ngram data which you can download for free. But you are not so fortunate if you are searching for Nepali ngram models. The Nepali language has received very little research attention and lacks proper data for doing any research. I figured that some people could benefit from this dataset.

I will use Python and urllib to extract data from some national newspapers. Nagariknews URLs follow an integer-indexed pattern for news, so we can run from 0 to some finite value to extract the news articles. Similarly, we can easily extract content from setopati.

We will try to extract news from setopati.com first. Setopati’s URLs look like http://setopati.com/samaj/1231 . Although the URL seems to be organized by category, like samaj and bichar, http://setopati.com/samaj/1231 is the same as http://setopati.com/bichar/1231. Hence, we can run from http://setopati.com/bichar/2000 to http://setopati.com/bichar/12000 to extract almost all the news on setopati up to the current date (June 2014). We start from 2000 because most of the articles with index 0 to 2000 are just blank pages like http://setopati.com/bichar/0 . We can extract all the news from setopati and store it in text files for now. This script downloads the content from setopati.
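The crawl over article IDs can be sketched as follows. This is not the original download script: the function names are mine, the ID range is the one given above, and extracting the article text from the returned HTML is left out.

```python
from urllib.request import urlopen


def setopati_urls(start=2000, stop=12000):
    """Yield one article URL per integer ID, inclusive of both ends."""
    for article_id in range(start, stop + 1):
        yield "http://setopati.com/bichar/{}".format(article_id)


def fetch(url, timeout=10):
    """Download one page and return its HTML, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError, HTTPError and timeouts
        return None


if __name__ == "__main__":
    # Only print the first few URLs here; call fetch(url) to download a page.
    for url in setopati_urls(2000, 2002):
        print(url)
```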

The script yielded a 65 MB text file. But that’s not enough for creating a large ngram model. Hence, our next move is to extract more data from other sources. Our next sources will be Nagariknews and Ekantipur. Both identify their news by an integer ID. We can ignore the other parameters of the URL and just change the ID in order to get all the news. For example, ekantipur’s URLs look like http://www.ekantipur.com/np/2051/2/21/full-story/390317.html . At first glance, it looks like we have to extract (or guess? :D) the publication date to fetch the news with ID 390317. But in fact, the site only considers the ID and ignores the other parameters. Hence, we can set the date to 2051/1/1/ and still get the same news. The same goes for nagariknews.
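Since the date part of the URL is ignored, a URL for any ID can be built with a fixed date. A minimal sketch, with a hypothetical function name and the URL layout taken from the example above:

```python
def ekantipur_url(news_id, date_path="2051/1/1"):
    """Build an article URL from the ID alone, using a fixed placeholder date."""
    return "http://www.ekantipur.com/np/{}/full-story/{}.html".format(
        date_path, news_id
    )


if __name__ == "__main__":
    print(ekantipur_url(390317))
```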

A quick look at ekantipur reveals that lots of IDs yield nothing. As we can’t keep guessing every ID up to 390317, we are going to skip it for the time being. Nagariknews has no dead IDs, so we can extract about 16k news articles from its website. This script is used to download the contents of nagariknews.

The script downloaded around 130 MB of text data. We can join the articles in a single text file for easier manipulation. In total, we have 180 MB of text data. The text data can be downloaded from the links below:

Data from setopati in individual files ( 13.6 MB )
Data from setopati in a single file ( 8.9 MB )
Data from nagariknews in individual files ( 31.0 MB )
Data from nagariknews in a single file ( 22.4 MB )

You can view all the raw files here.

Our next step is to generate language models from the text data. We can mark the start and end of sentences with a symbol. I chose ‘#’ for this. So, we will get lines like
छ # 130060
This means that the word ‘छ’ occurs at the end of a sentence 130060 times.

We assume that the symbols ['?', '!', '.', ';', '\n'] represent the end of a sentence. Similarly, the symbols ['-', ',', '\'', '"', '\t', '(', ')', '<', '>', '‘', '’', '“', '”', '–'] separate the words. This python script is used to generate the ngram language models.
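The counting can be sketched in a few lines of Python. This is not the original script, only an illustration of the scheme described above: the function names are mine, and leaving the ‘#’ marker out of the unigram counts is an assumption.

```python
from collections import Counter

SENTENCE_END = "?!.;\n"
WORD_SEPARATORS = "-,'\"\t()<>‘’“”–"
BOUNDARY = "#"  # marks the start and end of a sentence


def sentences(text):
    """Split text into sentences, each returned as a list of words."""
    for symbol in SENTENCE_END:
        text = text.replace(symbol, "\n")
    for chunk in text.split("\n"):
        for symbol in WORD_SEPARATORS:
            chunk = chunk.replace(symbol, " ")
        words = chunk.split()
        if words:
            yield words


def ngram_counts(text, n):
    """Count n-grams; for n > 1 each sentence is wrapped in '#' markers."""
    counts = Counter()
    for words in sentences(text):
        tokens = words if n == 1 else [BOUNDARY] + words + [BOUNDARY]
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    sample = "ऊ घर गयो. ऊ फर्केर आयो"
    for ngram, count in ngram_counts(sample, 2).most_common():
        print(ngram, count)
```

Each key is the n-gram joined by spaces, so printing a key followed by its count gives lines in the same shape as the model files.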

You can download the language models from the links below:

Unigram Model

The unigram model consists of 430669 words. The first few lines of the model are:

र 182278
छ 155075
पनि 112644
छन् 69376
भने 66268
भएको 56146
हो 53006

Download Unigram Model ( 2.7 MB )

Bigram Model

The bigram model consists of 4196838 combinations of words. The first few lines of the bigram model are:

छ # 130060
छन् # 60878
थियो # 39912
हो # 34355
थिए # 30806
बताए # 24951
# तर 16898

Download Bigram Model ( 37.2 MB )

Trigram Model

Download Trigram Model ( 87.7 MB )

Four-gram Model

Download Four-gram Model ( 123.6 MB )

Five-gram Model

Download Five-gram Model ( 148.7 MB )


Browse all models

Update:

The data above is extracted from Nepali newspaper websites. Hence, it may contain several misspelled words. Another small but rather accurate source of Nepali text is the Nepali Dictionary database from Madanpuraskar. You can download the sql file here.

If you want the content of the database, you can directly download the words list and the meanings list. I just extracted certain rows from the database. The contents of these files are not processed. You can download the ready-made ngram models below:

Download Unigram Model

Download Bigram Model

Download Trigram Model

Download Four-Gram Model

Download Five-Gram Model

Browse All Data