MySQL-Fulltext: A Simple Search Engine Using MySQL

Overview

This article walks through building a simple search engine implemented in C, from start to finish.

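I usually work with SQL Server and C#, but a friend whom I teach C/C++ wants to stay away from Microsoft. In the past MySQL was not my database of choice because the standard installation did not support transactions, but it has matured considerably. I use 64-bit MySQL 5.6 with the InnoDB engine and Unicode (utf8) encoding, which is the default for my new databases.

Full-text search is a new InnoDB feature, first introduced in MySQL 5.6.

I usually prefer C++ to C, even in small projects: I don't have to know every function name, some common operations are built in, and the IntelliSense support is nice. C++ also offers the STL with its collections and string helpers.

The C++ interface to MySQL is rather weak, whereas the C interface is mature, so I decided to use the C interface.

The C DLL is published together with a WCF service that handles the AJAX requests. In Visual Studio Ultimate 2012 I used the C# "WCF Service Application" template. I looked for a way to build a web service in C++, but only found examples that consume web services from C++.

The user interface is an HTML page that uses jQuery and jQuery UI autocomplete. The page is added to the "WCF Service Application" project, which is named VisionWeb.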

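The web page looks like this:

[Image 1: screenshot of the search page]

I configured this project for .NET Framework 4.0 on a 64-bit system. If you use a 32-bit MySQL server, you have to change the settings accordingly. Remember to set the UNICODE option to its default value.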

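Configuring MySQL

You will probably open the VisionDAL project from the VisionSmall solution, in which case you may have to adjust the settings for the MySQL C interface. Here I describe how to set up the MySQL interface in a new project. Check that those settings match your installation, especially the paths to libmysql.lib and VisionDAL.dll.

In Visual Studio, add a VisionDAL project using the "Other Languages/Visual C++/Empty Project" template. In the wizard you only need to change the "Application type" to DLL. Rename VisionDAL.cpp to VisionDAL.c, which tells Visual Studio to compile the file as C instead of C++. Add a header file named VisionDAL.h to the project.

In Solution Explorer, right-click the VisionDAL project and choose Properties. Under "Configuration Properties"/Linker/Input, choose "Additional Dependencies" and add libmysql.lib to the list; don't forget the ";" separator.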

Under "Configuration Properties"/Linker/General, choose "Additional Library Directories" and add the MySQL library directory; for me that is C:\Program Files\MySQL\MySQL Server 5.6\lib. Now the C interface is linked in, but the DLL that implements the calls in libmysql.lib must also be on the system search path for executables: in the Control Panel choose System, click "Advanced system settings", click "Environment Variables", select Path under "System Variables", and add the directory that contains libmysql.lib (the DLL is in the same folder as the lib file): C:\Program Files\MySQL\MySQL Server 5.6\lib.

VisionDAL.dll has to be on the path too, because IIS won't find it if you only put the DLL into the bin folder of the website. Add <path to the solution>/x64/debug to the path. I had to reboot for this setting to take effect. When the website gets its first request it loads VisionDAL.dll; if you then rebuild the project, you get a write error on VisionDAL.dll. To fix it, restart the website or use a tool like Unlocker.

Then we specify the include directories for VisionDAL. Under "Configuration Properties"/"C/C++", add your MySQL header file path, for example: C:\Program Files\MySQL\MySQL Server 5.6\include.

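Next, under "C/C++"/"Precompiled Headers", switch the "Precompiled Header" setting to "Not Using Precompiled Headers". To suppress the error messages that strcpy and fopen would otherwise produce, set preprocessor definitions: under "C/C++"/Preprocessor/"Preprocessor Definitions", add SE_STANDARD_FILE_FUNCTIONS and _CRT_SECURE_NO_WARNINGS.

If you link now, the references into libmysql are still unresolved, because the library is built for 64-bit. Open the VisionDAL project properties, choose "Configuration Manager", and set the platform to x64.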


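Now we create the sample database named Vision.

Open MySQL Workbench and, under SQL Development, open your instance. A new window "SQL File 1" appears. Double-click the Sql.txt file in the VisionDAL project, copy its whole content to the clipboard, and paste it into the "SQL File 1" window in Workbench. Click the lightning-bolt icon (the third icon from the left) to create the sample database.

Next we need the credentials for logging in to the database.

We keep them in a configuration file, <installation directory>VisionSmall\x64\Debug\VisionConfiguration.txt, which looks like this: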

Host: localhost
User: root
Password: frob4frob
Database: vision
Port: 3306

Change these values to match your MySQL configuration.
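A file in this "Key: value" layout is easy to parse line by line. The following is only a sketch in portable C; the article's own routine is ConfigurationRead, whose source is not shown here, so the struct VisionConfig, its field sizes, and the functions config_parse_line and config_read are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical settings struct; the field sizes are arbitrary choices. */
typedef struct {
    char host[64];
    char user[64];
    char password[64];
    char database[64];
    unsigned int port;
} VisionConfig;

/* Copy src into dst (capacity cap), truncating if needed, always NUL-terminated. */
static void copy_value(char *dst, size_t cap, const char *src)
{
    size_t n = strlen(src);
    if (n >= cap)
        n = cap - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Parse one "Key: value" line into cfg.
   Returns 1 if the key was recognized, 0 otherwise. */
int config_parse_line(VisionConfig *cfg, const char *line)
{
    char key[32];
    char value[64];

    assert(cfg != NULL && line != NULL);

    /* %31[^:] reads the key up to the colon; the blank before %63 skips
       spaces, and [^\r\n] reads the value up to the end of the line. */
    if (sscanf(line, " %31[^:]: %63[^\r\n]", key, value) != 2)
        return 0;

    if (strcmp(key, "Host") == 0)
        copy_value(cfg->host, sizeof cfg->host, value);
    else if (strcmp(key, "User") == 0)
        copy_value(cfg->user, sizeof cfg->user, value);
    else if (strcmp(key, "Password") == 0)
        copy_value(cfg->password, sizeof cfg->password, value);
    else if (strcmp(key, "Database") == 0)
        copy_value(cfg->database, sizeof cfg->database, value);
    else if (strcmp(key, "Port") == 0)
        cfg->port = (unsigned int)strtoul(value, NULL, 10);
    else
        return 0;
    return 1;
}

/* Read a whole configuration file.
   Returns the number of recognized settings, or -1 if the file cannot be opened. */
int config_read(VisionConfig *cfg, const char *path)
{
    char line[256];
    int count = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL)
        count += config_parse_line(cfg, line);
    fclose(f);
    return count;
}
```

With the sample file above, config_read would be expected to report five recognized settings; a caller could treat any other count as an incomplete configuration.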

The Vision Database

The database contains only one table:

CREATE TABLE `document` (
  `DocumentID` int(11) NOT NULL AUTO_INCREMENT,
  `Title` varchar(255) DEFAULT NULL,
  `Text` text,
  PRIMARY KEY (`DocumentID`),
  FULLTEXT KEY `ft` (`Title`,`Text`),
  FULLTEXT KEY `ftTitle` (`Title`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8;

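For searching we use the full-text index named 'ft'; for looking up autocomplete words we use the full-text index named 'ftTitle'.

If you have a full-text index over several columns, Microsoft SQL Server lets you choose at query time which columns are included in the search. In MySQL, all columns of a full-text index are always searched, so we have to define the additional full-text index 'ftTitle'.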

Querying MySQL with the C-Interface

First we need to connect to the database and get a MYSQL pointer for further access:
MYSQL *Connect(){
    MYSQL *conn; // Connection handle

    // Allocate and initialize the handle
    conn = mysql_init(NULL);
    if(conn == NULL) {
        fprintf(stderr, "sorry, out of memory ...\n");
        return NULL;
    }
    // Connect to MySQL with the values from the configuration file
    if(mysql_real_connect(
        conn, Configuration.Host, Configuration.User, Configuration.Password,
        Configuration.Database, Configuration.Port, NULL, 0) == NULL) {
            fprintf(stderr, "sorry, no database connection ...\n");
            mysql_close(conn); // release the handle again
            return NULL;
    }
    return conn;
}
At startup we fill the global Configuration struct with the values from the configuration file VisionConfiguration.txt, which should be in the same directory as our executing program. The routine that reads the settings is ConfigurationRead. To get the path of the currently executing module it uses GetModuleFileName from the Win32 API:
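TCHAR *GetExecutablePath(){
    // MAX_PATH characters, not bytes: a TCHAR is two bytes when UNICODE is set
    TCHAR *pBuf = (TCHAR *)malloc(MAX_PATH * sizeof(TCHAR));
    DWORD length;
    if(pBuf == NULL)
        return NULL;
    // Returns the number of characters copied, 0 on failure
    length = GetModuleFileName(NULL, pBuf, MAX_PATH);
    if(length == 0){
        free(pBuf); // don't leak the buffer on failure
        return NULL;
    }
    return pBuf; // the caller frees the buffer
}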
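Once the module path is known, the configuration file path is obtained by replacing the file name at the end. How ConfigurationRead does this is not shown, so the following is only a sketch: the helper build_sibling_path is hypothetical, and it works on plain char strings to stay portable, whereas the DLL itself uses TCHAR.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<directory of module_path><file_name>" in out (capacity cap).
   Accepts both '\\' and '/' as directory separators.
   Returns 0 on success, -1 if the result does not fit into out. */
int build_sibling_path(const char *module_path, const char *file_name,
                       char *out, size_t cap)
{
    const char *slash;
    const char *alt;
    size_t dir_len;
    int written;

    assert(module_path != NULL && file_name != NULL && out != NULL);

    /* Find the last separator of either kind. */
    slash = strrchr(module_path, '\\');
    alt = strrchr(module_path, '/');
    if (alt != NULL && (slash == NULL || alt > slash))
        slash = alt;

    /* Keep the directory part including its trailing separator. */
    dir_len = (slash != NULL) ? (size_t)(slash - module_path) + 1 : 0;

    written = snprintf(out, cap, "%.*s%s", (int)dir_len, module_path, file_name);
    return (written < 0 || (size_t)written >= cap) ? -1 : 0;
}
```

Checking the return value matters here: snprintf never writes past cap, but a truncated path would silently point to the wrong file.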
There is only one routine we want to expose: GetDocuments. Declaration in the header file:
#define FORMAT_TEXT 0
#define FORMAT_JSON 1
__declspec(dllexport) TCHAR* __cdecl GetDocuments(TCHAR *search, int format, int forAutocomplete);
Definition in the source file:
__declspec(dllexport) TCHAR* GetDocuments(TCHAR *search, int format, int forAutocomplete)

__declspec(dllexport) on the declaration and the definition causes the function to be added to VisionDAL.lib and exported from VisionDAL.dll. __cdecl defines how the procedure is called; here we use the C calling convention. TCHAR is a define that is the same as WCHAR when UNICODE is defined; otherwise it is a plain char. In our case UNICODE is turned on.

    Note that there are different Unicode formats:
  • in C code we use a two-byte value (a wide character, UTF-16) for each char value; .NET strings use the same representation in memory
  • MySQL stores the text as UTF-8, and the .NET web service also sends its JSON as UTF-8: one byte for each ASCII character, and more than one byte only when needed
  • in console applications you usually use one byte for each char, with codepage 850 for the values greater than 127.
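The difference between a wide character and its UTF-8 form becomes concrete with a small encoder. This is a sketch in portable C and not part of VisionDAL; the function name utf8_encode is an assumption, and on Windows the article's code leaves this job to WideCharToMultiByte with CP_UTF8.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as UTF-8.
   Writes up to 4 bytes into out and returns the number of bytes written,
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                        /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                     /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For the umlaut ü (U+00FC) this yields the two bytes 0xC3 0xBC, while 'A' stays a single byte. Note that MySQL's utf8 character set stores at most three bytes per character, so the four-byte branch does not apply to it.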

The parameter format is FORMAT_TEXT or FORMAT_JSON and switches the output between text and JSON.

If forAutocomplete is true, only the Title is searched and returned.

VisionDALClientConsole

VisionDALClientConsole is a tiny Windows console application for testing our GetDocuments procedure.

It has a reference to the VisionDAL project. Its output files go to VisionSmall\x64\Debug together with the output from VisionDAL.

VisionDALClientConsole asks for the search string (the wildcard is "*"), searches the columns Title and Text, and prints the text returned by the GetDocuments call.

A sample run:

[Image 2: sample run of VisionDALClientConsole]

The main routine:

int _tmain(int argc, TCHAR* argv[])
{
    int c;
    TCHAR *result;
    TCHAR *search = (TCHAR *)malloc(1000 * sizeof(TCHAR));
    char *searchA = (char *)malloc(1000);
    int retval = 1; /* the author's convention: 1 = success, 0 = failure */
    char buffer[32000];

    if(search == NULL || searchA == NULL)
        return 0;
    buffer[0]=0;
    printf("Search for: ");
    /* wscanf doesn't get umlauts; the width keeps scanf inside searchA */
    if(scanf("%999[^\n]", searchA) <= 0){
        printf("Could not read input - retrieving all Documents \n");
        *search=0;
    }else{
        /* console input is codepage 850; convert it to wide chars */
        MultiByteToWideChar(850,0,searchA, -1,search, 1000);
    }
    result=GetDocuments(search, FORMAT_TEXT, 0);
    if(result == NULL){
        retval = 0;
    }else{
        WideCharToMultiByte(850,0,result, -1,buffer, 32000,NULL,NULL);
        printf("%s", buffer);
    }
    /* discard the rest of the input line (fflush(stdin) is not portable) */
    while((c = getchar()) != '\n' && c != EOF)
        ;
    printf("Press RETURN Key to Exit\n");
    getchar();
    return retval;
}

In Microsoft C V. 12 there are routines to deal with UTF-16 strings. Their names start with an additional w or replace str with wcs, for example wscanf, wprintf, and wcslen instead of strlen. Using wscanf did not get the umlauts right, so I used MultiByteToWideChar with codepage 850 to get the wide chars and WideCharToMultiByte to convert back to chars.

Querying the MySQL Database

Above I showed how to connect to the database and get a MYSQL pointer named conn.

Next we build the SQL-Query:

mysql_query(conn, "SET NAMES 'utf8'");
if(forAutocomplete){
    if(search == NULL || wcslen(search) == 0){
        WideCharToMultiByte(CP_UTF8,0,
          L"SELECT Title from Document LIMIT 20",-1,sql,1000,NULL,NULL);
    }else{
        // Adjacent string literals are joined by the compiler.
        // Note: search is inserted unescaped, so a quote in it breaks the statement.
        wsprintf(lbuffer, L"SELECT Title, match(Title) against('%ls' IN "
          L"BOOLEAN MODE) as Score from Document where match(Title) against('%ls' "
          L"IN BOOLEAN MODE) > 0.001 order by Score Desc LIMIT 20",
            search, search);
        WideCharToMultiByte(CP_UTF8,0,lbuffer,-1,sql,1000,NULL,NULL);
    }
}else if(search == NULL || wcslen(search) == 0){
    WideCharToMultiByte(CP_UTF8,0,L"SELECT DocumentID, Title, Text from Document",-1,sql,1000,NULL,NULL);
}else{
    wsprintf(lbuffer, L"SELECT DocumentID, Title, Text, match(Title, Text) "
             L"against('%ls' IN BOOLEAN MODE) as Score from Document where match(Title, Text) "
             L"against('%ls' IN BOOLEAN MODE) > 0.001 order by Score Desc",
        search, search);
    WideCharToMultiByte(CP_UTF8,0,lbuffer,-1,sql,1000,NULL,NULL);
}

match(Title, Text) against('%ls' IN BOOLEAN MODE) searches for the search string in the columns Title and Text and returns a value that tells how good the match is. Only documents with a Score greater than 0.001 are displayed, and the result is ordered by the score.

IN BOOLEAN MODE causes a search for multiple words to be interpreted as an OR search.

In the search string you can use "*" as a wildcard, which matches 0 to n characters; for example, "as*" finds ASP. The search is not case-sensitive. Some things are annoying: "as**" won't find anything, and "*sp" won't match anything either, because MySQL only accepts the wildcard at the end of a word. In Microsoft SQL Server, you can match wildcards at the beginning of a string.
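Because the search string ends up inside a quoted SQL literal, a single quote typed by the user would end that literal early. The sketch below shows one way to assemble the autocomplete query defensively in portable C. It is not the article's code: the function names clean_search_term and build_title_query are made up, and simply dropping quotes and backslashes is a simplification. With a live connection, the MySQL C API function mysql_real_escape_string is the usual tool for this.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy term into out, dropping the characters that could terminate or
   escape out of a single-quoted SQL string literal (' and \).
   Returns 0 on success, -1 if out (capacity cap) is too small. */
int clean_search_term(const char *term, char *out, size_t cap)
{
    size_t n = 0;

    assert(term != NULL && out != NULL && cap > 0);

    for (; *term != '\0'; ++term) {
        if (*term == '\'' || *term == '\\')
            continue;               /* skip dangerous characters */
        if (n + 1 >= cap)
            return -1;              /* no room for this char plus the NUL */
        out[n++] = *term;
    }
    out[n] = '\0';
    return 0;
}

/* Build the boolean-mode autocomplete query for term into sql.
   Returns 0 on success, -1 if the term or the statement does not fit. */
int build_title_query(const char *term, char *sql, size_t cap)
{
    char clean[256];
    int written;

    if (clean_search_term(term, clean, sizeof clean) != 0)
        return -1;

    /* snprintf never writes more than cap bytes and reports the needed length. */
    written = snprintf(sql, cap,
        "SELECT Title, MATCH(Title) AGAINST('%s' IN BOOLEAN MODE) AS Score "
        "FROM Document WHERE MATCH(Title) AGAINST('%s' IN BOOLEAN MODE) > 0.001 "
        "ORDER BY Score DESC LIMIT 20", clean, clean);
    return (written < 0 || (size_t)written >= cap) ? -1 : 0;
}
```

The boolean-mode operators such as "*" pass through untouched, so "as*" still works as a prefix search.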

Fetching the Data

if(mysql_query(conn, sql)) {
    fprintf(stderr, "%s\n", mysql_error(conn));
    fprintf(stderr, "%s\n", sql);
    return NULL;
}
// Process results
result = mysql_store_result(conn);
    ...
// wsprintf returns the number of characters written, so adding it
// moves resultBufferp to the end of the text appended so far.
// Note: wsprintf has a documented output limit of 1,024, so long Text values may be cut off.
while((row = mysql_fetch_row(result)) != NULL) {
    if(format == FORMAT_TEXT){
        MultiByteToWideChar(CP_UTF8,0,row[0], -1,buffer, 255);
        resultBufferp += wsprintf(resultBufferp,L"%s\t", buffer);
        MultiByteToWideChar(CP_UTF8,0,row[1], -1,buffer, 255);
        resultBufferp += wsprintf(resultBufferp,L"%s\t", buffer);
        MultiByteToWideChar(CP_UTF8,0,row[2], -1,buffer, 32000);
        resultBufferp += wsprintf(resultBufferp,L"%s\n", buffer);
    }else if(format == FORMAT_JSON){
        if(!forAutocomplete){
            MultiByteToWideChar(CP_UTF8,0,row[0], -1,buffer, 255);
            resultBufferp += wsprintf(resultBufferp,L"{\"DocumentID\": %s, ", buffer);
            MultiByteToWideChar(CP_UTF8,0,row[1], -1,buffer, 255);
            resultBufferp += wsprintf(resultBufferp,L"\"Title\": \"%s\", ", buffer);
            MultiByteToWideChar(CP_UTF8,0,row[2], -1,buffer, 32000);
            resultBufferp += wsprintf(resultBufferp,L"\"Text\": \"%s\"},", buffer);
        }else{
            MultiByteToWideChar(CP_UTF8,0,row[0], -1,buffer, 255);
            resultBufferp += wsprintf(resultBufferp,L"\"%s\",", buffer);
        }
    }
}

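mysql_query sends the query to the server. mysql_store_result prepares the result as a set that you can iterate over with mysql_fetch_row(result). Each row is an array of strings, no matter what data type a column has; I prefer the typed columns of ADO.NET. In .NET we would probably use a StringBuilder to assemble the result string; here we allocate a character array with malloc and advance the resultBufferp pointer as we append. We use MultiByteToWideChar to convert to WCHAR.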
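The StringBuilder idea can be approximated in C with a buffer that grows on demand, which avoids guessing the result size up front. This is a sketch in portable C with narrow chars, not the code used in VisionDAL; the type StrBuf and the functions strbuf_appendf and strbuf_free are hypothetical names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;  /* NUL-terminated contents, NULL before the first append */
    size_t len;   /* characters stored, not counting the terminator */
    size_t cap;   /* allocated bytes */
} StrBuf;

/* Append printf-style formatted text, growing the buffer as needed.
   Returns 0 on success, -1 on a formatting or allocation failure. */
int strbuf_appendf(StrBuf *sb, const char *fmt, ...)
{
    va_list args;
    int needed;

    assert(sb != NULL && fmt != NULL);

    /* First pass: ask vsnprintf how many characters the text needs. */
    va_start(args, fmt);
    needed = vsnprintf(NULL, 0, fmt, args);
    va_end(args);
    if (needed < 0)
        return -1;

    /* Grow geometrically until the text plus terminator fits. */
    if (sb->len + (size_t)needed + 1 > sb->cap) {
        size_t new_cap = sb->cap ? sb->cap : 64;
        char *p;
        while (new_cap < sb->len + (size_t)needed + 1)
            new_cap *= 2;
        p = realloc(sb->data, new_cap);
        if (p == NULL)
            return -1;
        sb->data = p;
        sb->cap = new_cap;
    }

    /* Second pass: write at the current end, then advance the length. */
    va_start(args, fmt);
    vsnprintf(sb->data + sb->len, sb->cap - sb->len, fmt, args);
    va_end(args);
    sb->len += (size_t)needed;
    return 0;
}

/* Release the buffer and reset the struct to its empty state. */
void strbuf_free(StrBuf *sb)
{
    free(sb->data);
    sb->data = NULL;
    sb->len = 0;
    sb->cap = 0;
}
```

A caller starts with a zero-initialized StrBuf, appends one row per call, and hands sb.data to the consumer; the length field plays the role that the moving resultBufferp pointer plays in the listing above.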

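The JSON Format

I decided against XML and chose the lightweight JSON format for the AJAX communication between the web page and the web service.

The JSON output looks like this:

[{"DocumentID": 1, "Title": "ASP MVC 4", "Text": "Was für Profis"},
 {"DocumentID": 2, "Title": "JQuery", "Text": "Hat Ajax Support"},
 {"DocumentID": 3, "Title": "WebServices", "Text": "Visual C++ kanns nicht"},
 {"DocumentID": 4, "Title": "Boost", "Text": "Muss Extra installiert werden"}]

When the forAutocomplete parameter is true, the JSON looks like this:

["ASP MVC 4","JQuery","WebServices","Boost"]

"[]" marks the beginning and end of an array, and "{}" marks the beginning and end of an object. Within an object, the part before ":" is the property name and the part after it is the property value. This is much the same notation you use when writing JavaScript. With the JavaScript command JSON.parse you get a complete object whose properties can be accessed with the usual "." notation.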

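One caveat when assembling JSON by hand in C: a Title or Text value that contains a double quote, a backslash, or a line break has to be escaped, otherwise JSON.parse rejects the output. The listing in this article does not do that, so the following is an addition of mine, a sketch in portable C with the hypothetical name json_quote. Bytes of UTF-8 sequences are passed through unchanged.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write src as a JSON string literal, including the surrounding quotes,
   into out. Returns 0 on success, -1 if out (capacity cap) is too small. */
int json_quote(const char *src, char *out, size_t cap)
{
    const unsigned char *p;
    size_t n = 0;

    assert(src != NULL && out != NULL);

    if (cap < 3)                    /* need room for "" plus the NUL */
        return -1;
    out[n++] = '"';

    for (p = (const unsigned char *)src; *p != '\0'; ++p) {
        char esc[7];                /* longest escape: \u00XX plus NUL */
        size_t len;

        switch (*p) {
        case '"':  strcpy(esc, "\\\""); break;
        case '\\': strcpy(esc, "\\\\"); break;
        case '\n': strcpy(esc, "\\n");  break;
        case '\r': strcpy(esc, "\\r");  break;
        case '\t': strcpy(esc, "\\t");  break;
        default:
            if (*p < 0x20) {        /* other control characters */
                snprintf(esc, sizeof esc, "\\u%04x", (unsigned)*p);
            } else {                /* ordinary byte, copied as-is */
                esc[0] = (char)*p;
                esc[1] = '\0';
            }
        }

        len = strlen(esc);
        if (n + len + 2 > cap)      /* keep room for closing quote and NUL */
            return -1;
        memcpy(out + n, esc, len);
        n += len;
    }

    out[n++] = '"';
    out[n] = '\0';
    return 0;
}
```

In the result loop, each string column would pass through such a function before being appended, and the format strings would then no longer add the quotes themselves.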
Hosting the Webservice for the GetDocuments method

I created the project VisionWeb using the template "Visual C#/WCF/WCF Service Application", which added the needed references such as System.ServiceModel.

Next we use NuGet to add the needed JavaScript libraries. Choose "Tools/Library Package Manager/Package Manager Console" and issue the commands:

Install-Package jQuery
Install-Package jQuery.UI.Combined
Next we define the service contract in the file "App_Code/IVisionService.cs":
using System.ServiceModel;
using System.ServiceModel.Web;

namespace VisionServices
{
    [ServiceContract(SessionMode = SessionMode.Allowed)]
    public interface IVisionService
    {
        [OperationContract]
        [WebInvoke(
            Method = "POST",
            BodyStyle = WebMessageBodyStyle.WrappedRequest,
            RequestFormat = WebMessageFormat.Json,
            ResponseFormat = WebMessageFormat.Json)]
        string GetDocuments(string search, int format, int forautocomplete);
    }
}

The WebInvoke attribute makes the service callable from Ajax. As the method I chose POST, which sends the parameters in the body of the HTTP request. The alternative, GET, would encode and expose the parameters in the URL.

We specify that request and response are sent in JSON format. BodyStyle = WebMessageBodyStyle.WrappedRequest must be used when there is more than one parameter. You can use WebMessageBodyStyle.Bare if you have zero or one parameter.

Implementing the Webservice

The implementation is defined in "App_Code/VisionService.cs":

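using System.Runtime.InteropServices;

namespace VisionServices
{
    public class PInvoke
    {
        // VisionDAL.dll must be on the search path; TCHAR* maps to a Unicode string.
        // Caveat: the marshaller releases a returned string itself, which does not
        // match a buffer allocated with malloc in the DLL.
        [DllImport("VisionDAL.dll", CharSet = CharSet.Unicode, CallingConvention = CallingConvention.Cdecl)]
        public static extern string GetDocuments(string search, int format, int forAutocomplete);
    }
    public class VisionService : IVisionService
    {
        public string GetDocuments(string search, int format, int forautocomplete)
        {
            // Forward the call to the native DLL; the result is already a string
            return PInvoke.GetDocuments(search, format, forautocomplete);
        }
    }
}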

The VisionService.svc File

<%@ ServiceHost Language="C#" Debug="true" Service="VisionServices.VisionService" CodeBehind="App_Code\VisionService.cs" %>

This defines the service endpoint, which is reached at "http://<your webserver>:<your port>/VisionService.svc"; the URL for calling the GetDocuments function is "http://<your webserver>:<your port>/VisionService.svc/GetDocuments".

The Web.config

<?xml version="1.0"?>
<configuration>
  <appSettings/>
  <system.web>
    <httpRuntime/>
    <compilation debug="true"/>
  </system.web>
  <system.serviceModel>
    <services>
