There are lots of resources for English ngram models. Google, for instance, hosts a huge repository of ngram data that you can download for free. But you are not so fortunate if you are searching for Nepali ngram models. Nepali is a sparsely researched language and lacks proper data for doing any research. I figured that some people could benefit from this dataset.
I will use Python and urllib to extract data from some national newspapers. Nagariknews URLs follow an integer-indexed pattern, so we can iterate from 0 up to some finite value to extract the news articles. Similarly, we can easily extract content from Setopati.
We will try to extract news from setopati.com first. Setopati's URLs look like http://setopati.com/samaj/1231 . Although the URL seems to be organized by category, such as samaj and bichar, http://setopati.com/samaj/1231 is the same article as http://setopati.com/bichar/1231. Hence, we can iterate from http://setopati.com/bichar/2000 to http://setopati.com/bichar/12000 to extract almost all of Setopati's news up to the current date (June 2014). We start from 2000 because most of the articles with IDs from 0 to 2000 are just blank pages, like http://setopati.com/bichar/0 . We can extract all the news from Setopati and store it in text files for now. This script downloads the content from Setopati.
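The download script itself is linked rather than shown, but the loop described above could look roughly like this sketch. The function names, output directory, and timeout are my assumptions, and the original script presumably also strips the HTML down to plain article text, a step omitted here:

```python
import os
import urllib.error
import urllib.request

def article_url(article_id):
    # The category segment is ignored by the site, so 'bichar' works for every ID.
    return "http://setopati.com/bichar/{}".format(article_id)

def download_articles(start=2000, end=12000, out_dir="setopati"):
    """Fetch each article page and save it as a text file."""
    os.makedirs(out_dir, exist_ok=True)
    for article_id in range(start, end + 1):
        try:
            with urllib.request.urlopen(article_url(article_id), timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except urllib.error.URLError:
            continue  # skip dead or unreachable IDs
        with open(os.path.join(out_dir, "{}.txt".format(article_id)), "w",
                  encoding="utf-8") as f:
            f.write(html)
```

Running `download_articles()` walks IDs 2000 through 12000 and writes one file per article, quietly skipping IDs that fail to load.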
The script yielded a 65 MB text file. But that's not enough for creating a large ngram model, so our next move is to extract more data from other sources. Our next sources will be Nagariknews and Ekantipur. Both index their news by an integer ID; we can ignore the other parameters of the URL and just change the ID to get all the news. For example, an Ekantipur URL looks like http://www.ekantipur.com/np/2051/2/21/full-story/390317.html . At first glance, it looks like we have to extract (or guess? :D) the publication date to fetch the news with ID 390317. But in fact, the site only considers the ID and ignores the other parameters. Hence, we can set the date to 2051/1/1/ and still get the same news. The same goes for Nagariknews.
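The URL trick above can be captured in a tiny helper. `ekantipur_url` is a hypothetical name of mine; the fixed date 2051/1/1 follows the paragraph above:

```python
def ekantipur_url(article_id, date="2051/1/1"):
    # Only the ID selects the article; the date segment is ignored by the server,
    # so any placeholder date returns the same news.
    return "http://www.ekantipur.com/np/{}/full-story/{}.html".format(date, article_id)
```

So `ekantipur_url(390317)` fetches the same article as the real dated URL http://www.ekantipur.com/np/2051/2/21/full-story/390317.html.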
A quick look at Ekantipur reveals that lots of IDs yield nothing. As we can't keep guessing every ID up to 390317, we are going to skip it for the time being. Nagariknews has no dead IDs, so we can extract about 16k news articles from its website. This script is used to download the contents of Nagariknews.
The script downloaded around 130 MB of text data. We can join the articles into a single text file for easier manipulation. In total, we have 180 MB of text data. The text data can be downloaded from the links below:
Data from setopati in individual files ( 13.6 MB )
Data from setopati in a single file ( 8.9 MB )
Data from nagariknews in individual files ( 31.0 MB )
Data from nagariknews in a single file ( 22.4 MB )
You can view all the raw files here.
Our next step is to generate language models from the text data. We represent the start and end of a sentence with a symbol; I chose '#'. So we will get lines like
छ # 130060
which means that the word 'छ' appears at the end of a sentence 130060 times.
We assume the symbols ['?', '!', '.', ';', '\n'] represent the end of a sentence. Similarly, the symbols ['-', ',', '\'', '"', '\t', '(', ')', '<', '>', '‘', '’', '“', '”', '–'] separate words. This Python script is used to generate the ngram language models.
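The generation script is linked rather than shown; a minimal sketch of the counting logic, under the boundary and separator conventions above (the function name is mine), could look like:

```python
import re
from collections import Counter

# Sentence enders and word separators, following the lists in the text.
SENTENCE_END = ['?', '!', '.', ';', '\n']
SEPARATORS = ['-', ',', "'", '"', '\t', '(', ')', '<', '>',
              '\u2018', '\u2019', '\u201c', '\u201d', '\u2013']

END_RE = re.compile('[' + re.escape(''.join(SENTENCE_END)) + ']')
SEP_RE = re.compile('[' + re.escape(''.join(SEPARATORS)) + ' ]+')

def ngrams(text, n):
    """Count n-grams, with '#' marking the start and end of each sentence."""
    counts = Counter()
    for sentence in END_RE.split(text):
        words = [w for w in SEP_RE.split(sentence) if w]
        if not words:
            continue
        padded = ['#'] + words + ['#']
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts
```

With `n=2`, a count like `('छ', '#')` records how often 'छ' ends a sentence, matching the example line above.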
You can download the language models from the links below:
The unigram model consists of 430669 words.
Download Unigram Model ( 2.7 MB )
The bigram model consists of 4196838 word combinations. The first few lines of the bigram model are:
छ # 130060
छन् # 60878
थियो # 39912
हो # 34355
थिए # 30806
बताए # 24951
# तर 16898
Download Bigram Model ( 37.2 MB )
Download Trigram Model ( 87.7 MB )
Download Four-gram Model ( 123.6 MB )
Download Five-gram Model ( 148.7 MB )
The data above was extracted from Nepali newspaper websites, so it may contain several misspelled words. Another small but rather accurate source of Nepali text is the Nepali dictionary database from Madanpuraskar. You can download the SQL file here.
If you want the contents of the database, you can directly download the words list and the meanings list; I just extracted the relevant rows from the database. The contents of these files are not processed. You can download the ready-made ngram models below: