PHP+MongoDB+Coreseek/Sphinx(xmlpipe2数据源)打造千万级搜索引擎

近几年来,Linux+Nginx+PHP+MongoDB(LNPM)的组合越来越火,甚至有取代Linux+Nginx/Apache+PHP+Mysql组合的趋势。原因是MongoDB强大,灵活,易扩展,更关键的易用。MongoDB不用事先设计好表结构,往里面插入什么都可以,而且管理方便。因此成为创业团队的首选数据库,更是移动互联网的一枝新秀。

然而MongoDB和关系型数据库也有很多相似之处,比如全文索引不支持中文。MongoDB在2.6版本中开始默认支持全文索引,一如既往的不支持伟大的Chinese,所以如果需要搜索功能,就要另辟蹊径。
Sphinx和Lucene是做搜索引擎的不错的选择。个人观点Lucene对Java的支持比较好,而Sphixn对PHP的支持较好,所以我选择了Sphinx。其实Sphinx对中文的支持也不是很好,因为Sphinx是根据空格来分词(适用与英文),根本不适用中文分词。幸好有人提供了基于Sphinx的支持中文的插件Coreseek和Sphinx—for—chinese。
Coreseek有完整的文档,目前支持最新版的Sphinx,所以我选择Coreseek。
Sphinx-for-chinese严重缺乏文档。

安装:
Coreseek安装。
Sphinx-for-chinese安装。
创建索引:
Coreseek支持与Mysql直接对接,只要在Coreseek配置文件里填上Mysql的信息,Coreseek就会自动读取Mysql数据来创建索引(当然前提是你做了生成索引的相应设置或执行生成索引的命令)。然而Sphinx不支持与MongoDB直接对接,可以把Mongo数据源转换为Python数据源或转换成xmlpipe2数据源。
本人不会Python,所以用php些了一个xml管道用于把MongoDB数据传输到Coreseek。

<?php

class SphinxXmlpipe{

    private $xmlWriter;
    private $fields = array();
    private $attributes = array();
    private $documents = array();

    public function setFields($fields) {
    	$this->fields = $fields;
    }

    public function setAttributes($attributes) {
    	$this->attributes = $attributes;
    }

    public function beginOutput() {
    	//create a new xml document
    	$this->xmlWriter = new \XMLWriter();
    	$this->xmlWriter->openMemory();
        $this->xmlWriter->setIndent(true);
        $this->xmlWriter->startDocument('1.0', 'UTF-8');

        $this->xmlWriter->startElement('sphinx:docset');
        $this->xmlWriter->startElement('sphinx:schema');

	// add fileds to the schma
	foreach($this->fields as $field) {
	    $this->xmlWriter->startElement('sphinx:field');
	    $this->xmlWriter->writeAttribute('name', $field);
	    $this->xmlWriter->endElement();
    	}

	/*
	// add atttributes to the schema
	foreach($this->attributes as $attributes) {
	    $this->xmlWriter->startElement('sphinx:attr');
	    foreach($attributes as $key => $value) {
		$this->xmlWriter->writeAttribute($key, $value);
	    }
	    $this->xmlWriter->endElement();
	}
	*/
	$this->xmlWriter->endElement(); // schema
    }

    public function addDocument($doc) {
	$this->xmlWriter->startElement('sphinx:document');
	$this->xmlWriter->writeAttribute('id', $doc['book_id']);

	foreach($doc as $key => $value) {
	    $this->xmlWriter->startElement($key);
	    $this->xmlWriter->text($value);
	    $this->xmlWriter->endElement();
	}

	$this->xmlWriter->endElement(); // document
    }

    public function endOutput() {
	// end sphinx:docset
	$this->xmlWriter->endElement();
	$this->xmlWriter->endDocument();
	echo $this->xmlWriter->outputMemory();
    }

    public function xmlpipe2() {
	$this->setfields( array(
	    'book_id',
	    'book_name',
	));

	$this->setAttributes( array(
	    array(
		'name' => 'book_id',
		'type' => 'int',
		'bits' => '16',
		'default' => '1',
 	    ),
	));

	$this->beginOutput();

	$mBook = D('book');
	$count = $mBook->count();
	$limit = c('XMLPIPE_BOOKS_COUNT_PER_TIME');
	$tCont = (int)$count/$limit;
	$oCount = $count%$limit;
	if($tCont>0) {
	    do {
		$books = $mBook->field('book_id,book_name','_id=>0')->limit($limit)->select();
		foreach($books as $book) {
		    $this->addDocument($book);
		}
		unset($books);
		$tCont--;
	    } while($tCont>0);

	    $books = $mBook->field('book_id,book_name','_id=>0')->limit($oCount)->select();
	    foreach($books as $book) {
		$this->addDocument($book);
	    }
	    unset($books);
	} else {
	    $books = $mBook->field('book_id,book_name','_id=>0')->limit($oCount)->select();
	    foreach($books as $book) {
		$this->addDocument($book);
	    }
	    unset($books);
	} 

	$this->endOutput();
    }

}

输出的xml格式如下

<document>
<id>123</id>
<group>45</group>
<timestamp>1132223498</timestamp>
<title>test title</title>
<body>
this is my document body
</body>
</document>

<document>
<id>124</id>
<group>46</group>
<timestamp>1132223498</timestamp>
<title>another test</title>
<body>
this is another document
</body>
</document>

相应的Coreseek设置

source src1
{
        type			= xmlpipe2
	xmlpipe_command	= cd /var/www/PHPParser && php index.php /Home/SphinxXmlpipe/xmlpipe2

	xmlpipe_field		= book_id
	xmlpipe_field		= book_name

	xmlpipe_attr_timestamp	= book_id
	xmlpipe_attr_uint	= book_id

	xmlpipe_fixup_utf8	= 1
}

搜索:
1、PHP提供了Sphinx扩展,适用于Coreseek。
2、Sphinx 安装包提供了sphinxapi,在api目录下。
我用的PHP扩展
sphinx搜索代码示例

public function getResultBySearchText($search_text) {
        $sphinxClient = new \SphinxClient();
        $sphinxClient->setServer('localhost', 9312);   // server = localhost,port = 9312.
        $sphinxClient->setMatchMode(SPH_MATCH_ANY);
        $sphinxClient->setMaxQueryTime(5000);  // set search time 5  seconds. 

        $result = $sphinxClient->query($search_text); 

        if(isset($result['matches'])) {
            $rel['time'] = $result['time'];
            $rel['matches'] = $result['matches'];
            return $rel;
        } else {
            $rel['time'] = $result['time'];
            return $rel;
        }
    }

由于用的xmlpipe数据源,所以返回的是文档id,还需要根据id去mongo提取数据。至于如何提取mongo数据,我就不写了,如果需要帮助就连系我吧(sq371426@163.com)。

JackSun

JackSun

I'm a coder.

You may also like...

Leave a Reply

Your email address will not be published.