
This application identifies keywords related to a specified keyword, a feature used by all major search engines. It is described in our article Fast clustering algorithms for massive datasets. Here we provide everything you need to create your first API from scratch. Just read and download the material below; it will keep you busy for a little while.

We provide the source code, as part of our Data Science Apprenticeship. The application can be tested at www.frenchlane.com/kw8.html.

The source code and data, available for download below, consist of:

  • The HTML page where the application is hosted: in particular, it contains the web form with all the parameters used by the API
  • The keyword co-frequencies table as described in my article. It is saved as 27 files: for instance, kwsum2s_SRCH_CLICKS_p.txt corresponds to keywords starting with the letter p, while kwsum2s_SRCH_CLICKS_0.txt corresponds to keywords not starting with a letter, e.g. 2013, 98029 (a zip code) or 1040 (as in "IRS tax form 1040"). It is available here as a compressed zip file (7 MB; click on the link to download it). The uncompressed version consists of 27 files, so that a lookup only scans one file, making the app run about 27 times faster!
  • The Perl script kw8x3.pl that computes keyword correlations, given the keyword co-frequencies table with pre-computed frequencies (it is provided here as a text file for convenience; rename it kw8x3.pl when using it).
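Because the table is sharded into 27 files by first character, a lookup only needs to open one shard. Here is a minimal Python sketch (the article notes the Perl translates easily into Python) of the file-naming rule described above; treating an empty keyword as falling into the "0" shard is my own assumption:

```python
def shard_file(keyword):
    """Return the co-frequency shard holding `keyword`:
    one file per letter a-z, plus a catch-all '0' shard for
    keywords that do not start with a letter."""
    first = keyword.strip().lower()[:1]
    shard = first if first.isalpha() else "0"
    return "kwsum2s_SRCH_CLICKS_%s.txt" % shard

print(shard_file("payroll"))  # kwsum2s_SRCH_CLICKS_p.txt
print(shard_file("2013"))     # kwsum2s_SRCH_CLICKS_0.txt
```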

The application is written in very simple Perl, but can easily be translated into Python. It does not require special Perl libraries. We will later provide another example of an API that requires downloading special libraries (along with web crawler source code and instructions) to our DSA (Data Science Apprenticeship) students.

To get our API to work, first install Cygwin on your computer or server, then install Perl. If you want the API to work as a web app (as on frenchlane.com), it must be installed on a web server. Perl scripts (files with the .pl extension) must be made executable, usually in a /cgi-bin/ directory, using the UNIX command chmod 755; in our case, chmod 755 kw8x3.pl.

Two examples of API calls:

Here we use the API to find keywords related to the keyword data.

Click on a URL to replicate the results. Note that in the first example, the parameter mode is set to Silent and correl is not specified. In the second example, mode is set to Verbose and correl to $n12/sqrt($n1*$n2), as suggested in our article (where n1 = x, n2 = y, n12 = z).

Example 1:

http://www.frenchlane.com/cgi-bin/kw8x3.pl?query=data&ndisplay=...

Results returned:

data recovery
data sheet
data base
data cable
data management
recovery
data entry
data protection
data from
data storage

Example 2:

http://www.frenchlane.com/cgi-bin/kw8x3.pl?query=data&ndisplay=...

Results returned:

0.282 : data =data recovery= 2143:171:0.245
0.167 : data =data sheet= 2143:60:1.066
0.139 : data =data base= 2143:42:0.571
0.138 : data =data cable= 2143:41:1.414
0.134 : data =data management= 2143:39:0.512
0.121 : data =recovery= 2143:928:0.637
0.116 : data =data entry= 2143:29:1.068
0.112 : data =data protection= 2143:27:1.074
0.105 : data =data from= 2143:24:1
0.103 : data =data storage= 2143:23:1.217

Explanations

Results can be recovered manually (from the web app itself, with your browser, for instance when you click on the above links), or with a web crawler for batch or real-time processing. Note that the correlation formula used in this example is the same as the one described in our article: $n12/sqrt($n1*$n2). The only difference is that in our article, $n1, $n2 and $n12 are respectively called x, y and z.
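For batch processing, you first need to assemble the API call programmatically. A short Python sketch (Python being the translation target mentioned earlier) that builds the URL from the parameters discussed above (query, ndisplay, mode, correl); the default values here are illustrative assumptions, not values taken from the original calls:

```python
from urllib.parse import urlencode

def build_api_url(query, ndisplay=10, mode="Verbose", correl=""):
    """Assemble a kw8x3.pl API call; correl may carry a
    user-supplied formula such as '$n12/sqrt($n1*$n2)',
    or stay empty to use the default formula."""
    base = "http://www.frenchlane.com/cgi-bin/kw8x3.pl"
    params = {"query": query, "ndisplay": ndisplay,
              "mode": mode, "correl": correl}
    return base + "?" + urlencode(params)
```

The returned URL can then be fetched with urllib.request.urlopen to retrieve results in batch.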

In the second example of an API call (above), the results returned for the keyword pair {data, recovery} are $n1 = 2143, $n2 = 928, and correlation = 0.121. I don't remember what 0.637 stands for; maybe someone can help me by looking at the code? Note that n1 = 2143 is the number of occurrences of the keyword data as reported in the co-frequencies table that you have just downloaded, n2 = 928 is the number of occurrences of the keyword recovery, while n12 is the number of simultaneous occurrences of data and recovery (e.g. in a same web page or user query), as reported in our co-frequencies table. The creation of the co-frequencies table is described in our original article.
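You can check the printed correlation yourself. The output does not show n12 for the pair {data, recovery}; the value 171 used below is an inference (it matches the =data recovery= line and reproduces the printed 0.121), not a number read from the table. In Python:

```python
import math

def get_rho(n1, n2, n12):
    """Default correlation: n12 / sqrt(n1 * n2), the same
    formula as getRho in formula.pl."""
    return n12 / math.sqrt(n1 * n2)

# n1 and n2 come from the output above; n12 = 171 is an
# assumed co-frequency chosen because it reproduces 0.121.
rho = get_rho(2143, 928, 171)
print(round(rho, 3))  # 0.121
```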

One of the tricky parts of this API is that it accepts a user-provided formula to compute the keyword correlations, based on $n1, $n2 and $n12, whenever the correl parameter (in the API call) is not left empty. To support this, kw8x3.pl creates an auxiliary Perl script called formula.pl in the same directory where the parent script (kw8x3.pl) is located. The parent script then calls the getRho subroutine stored in formula.pl to compute the correlations. FYI, here's the default code for formula.pl:

sub getRho{

  # Default correlation: co-frequency divided by the geometric mean
  # of the two keyword frequencies. $n1, $n2 and $n12 are set by the
  # parent script kw8x3.pl before getRho is called.
  my $rho;

  $rho=$n12/sqrt($n1*$n2);
  return($rho);
}
1;  # a required Perl file must end with a true value
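The same "write an auxiliary module, then call it" pattern can be sketched in Python. This is only one possible translation, not the actual mechanics of kw8x3.pl; and note that, as in the Perl original, the user-supplied formula is executed as code, so this is only safe with trusted input:

```python
import math
import os
import sys

DEFAULT_CORREL = "n12 / math.sqrt(n1 * n2)"

def write_formula(correl=DEFAULT_CORREL, path="formula.py"):
    """Generate the auxiliary module, mirroring how kw8x3.pl
    writes formula.pl next to the parent script.
    WARNING: the formula string is executed as code; only
    use it with trusted input."""
    with open(path, "w") as f:
        f.write("import math\n\n"
                "def get_rho(n1, n2, n12):\n"
                "    return %s\n" % correl)

write_formula()
sys.path.insert(0, os.getcwd())  # make formula.py importable
import formula
print(round(formula.get_rho(2143, 928, 171), 3))  # 0.121
```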

The path where formula.pl is stored is /home/cluster1/data/d/x/a1168268/cgi-bin/, so you will have to change this path accordingly when installing our app on your server. You can also improve this API a bit by using a list of stop words, that is, words such as from, the and how that you want to ignore.

Finally, keep in mind that this is just a starting point. If you want to make it a high-quality, "weapons grade" app, you'll need to add a few features. In particular, you'll have to use a look-up table of keywords that cannot be broken down into individual tokens, such as "New York" or "San Francisco". You'll also have to use a stop list of keywords, and do a lot (but not too much!) of keyword cleaning (you can normalize traveling as travel, but not booking as book). The feed that you use to create your co-frequencies table is also critical: it must contain millions of keywords. If you use too few, results will look poor; if you use too many, results will look noisy. In our case, we used a combination of feeds:

  • about a million categories and web site descriptions from DMOZ (public data)
  • many millions of user queries from search engines (private data)
  • text extracted with a web crawler, from several million web pages (public data)
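The clean-up steps described above can be sketched as follows; the stop list, multi-token look-up table and normalization dictionary are all tiny illustrative placeholders, not the lists used in the actual app:

```python
# Illustrative placeholders; the real app would use much
# larger lists built from your own keyword feed.
STOP_WORDS = {"from", "the", "how", "a", "of"}
MULTI_TOKEN = {"new york", "san francisco"}   # never split these
NORMALIZE = {"traveling": "travel"}           # but NOT booking -> book

def clean_keyword(kw):
    """Keep known multi-token keywords whole, drop stop words,
    and apply conservative normalization."""
    kw = kw.strip().lower()
    if kw in MULTI_TOKEN:
        return kw
    tokens = [NORMALIZE.get(t, t) for t in kw.split()
              if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_keyword("New York"))             # new york
print(clean_keyword("data from"))            # data
print(clean_keyword("traveling the world"))  # travel world
```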

If you have questions about the code or about this API, please ask your questions below, and I will try to answer as soon as I can. Thanks!


Replies to This Discussion

Vincent

I am not running this as a web app, and I do not see a link to download formula.pl. Without it, the Perl script is not doing much. Please let me know.

Thanks

Hi Sanjay, the code for formula.pl is in my article. It consists of one subroutine:

sub getRho{

  my $rho;

  $rho=$n12/sqrt($n1*$n2);
  return($rho)
}
1;

Please pardon my ignorance, but could you provide some additional detail regarding the installation of Cygwin and Perl?

Is Perl to be accessed through the Cygwin installation?

I would appreciate your steering me in the correct direction.

Respectfully,

 

--  Dean Pangelinan

Perl and Cygwin are separate, but once you've installed Cygwin, you can call Perl from within Cygwin consoles.

Just type perl myprogram.pl in a Cygwin window, where myprogram.pl is your Perl program.

Thank you Dr. Granville.

I'm on my way!

 

--  Dean
