Extracting text from images with the Google Vision API in a Spring Boot project, and searching it with Solr full-text search

Nothing much to do tonight, so I am writing down an investigation I did a while ago, as a memo to myself.

First, installing Solr:

Official site: http://lucene.apache.org/solr/. Download Solr from there; the current version is 7.5.

Some commonly used commands:

./bin/solr start -e cloud         # start Solr (interactive cloud example)
./bin/solr stop -all              # stop all running Solr nodes

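After starting (on the first start it should ask how many Solr nodes to deploy and so on; answer however you like, but leave the ports unchanged), and as long as the settings have not been changed, the Solr admin UI is reachable in the browser (http://127.0.0.1:8983/solr).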

A quick test:

# MyCol001 is assumed to be the collection name
# Post a file (Linux)
bin/post -c MyCol001 /workspace/1.Java/FullTextSearch/pdf/002.pdf
# Post files (Windows)
java -jar -Dc=MyCol001 -Dauto example\exampledocs\post.jar example\exampledocs\*
# Either a directory or a file name can be given
# Delete all data
bin/post -c MyCol001 -d "<delete><query>*:*</query></delete>"

If the results show up in the admin UI after running the commands above, everything is working.
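The same check can also be done over HTTP instead of through the admin UI. Below is a minimal sketch that only builds the request URL for Solr's /select handler, assuming the default address and the MyCol001 collection used in this post. The class and method names (SolrSelectUrl, build) are hypothetical helpers, not part of Solr.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SolrSelectUrl {

    // Builds a /select URL for the given collection; the query is URL-encoded.
    static String build(String baseUrl, String collection, String query, int rows) {
        try {
            String q = URLEncoder.encode(query, "UTF-8");
            return baseUrl + "/" + collection + "/select?q=" + q + "&rows=" + rows;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // rows=0 returns only the match count (numFound), which is enough for a smoke test.
        System.out.println(build("http://127.0.0.1:8983/solr", "MyCol001", "*:*", 0));
    }
}
```

Opening the printed URL in a browser (or with curl) should return a response whose numFound is greater than zero once documents have been posted.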

Next, the code side:

1. Add the SolrJ and Google Vision dependencies to the Spring Boot project

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-solrj</artifactId>
    <version>7.5.0</version>
</dependency>
<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-vision</artifactId>
    <version>1.40.0</version>
</dependency>

2. Edit the configuration file application.properties

spring.data.solr.host=http://127.0.0.1:8983/solr/MyCol001

3. Rough usage of SolrJ

@Autowired
private SolrClient client;
// Inside a method
// Query
ModifiableSolrParams params = new ModifiableSolrParams();
params.add("q", "attr_content:\"" + searchW + "\"");
params.add("start", "0");
params.add("rows", "9999");
QueryResponse query = client.query(params);
// Add specific text to the index
SolrInputDocument doc = new SolrInputDocument();
doc.setField("id", solrId);
doc.setField("attr_content", <content to index>);
client.add(doc);
client.commit();
// Add a file to the index (formats such as XLS and PDF are parsed automatically, and the extracted text is indexed)
HttpSolrClient mHttpSolrClient = (HttpSolrClient) client;
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
MimetypesFileTypeMap mimeTypesMap = new MimetypesFileTypeMap();
String mimeType = mimeTypesMap.getContentType(new File(fileName));
up.addFile(new File(fileName), mimeType);
up.setParam("literal.id", solrId);
up.setParam("resource.name", fileName);
up.setParam("uprefix", "attr_");
up.setParam("stream_size", "0");
up.setParam("fmap.content", "attr_content");
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
mHttpSolrClient.setUseMultiPartPost(true);
mHttpSolrClient.request(up);
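One thing to watch in the query above: searchW is concatenated straight into a quoted phrase, so a search term containing a double quote or a backslash would break the query syntax. A minimal sketch of a helper that escapes those two characters before building the phrase query follows. SolrPhrase and phraseQuery are hypothetical names, and the sketch assumes the standard Lucene query syntax, in which only the double quote and the backslash are special inside a quoted phrase. For unquoted terms, SolrJ ships ClientUtils.escapeQueryChars.

```java
final class SolrPhrase {

    // Builds field:"term", escaping backslashes first and then double quotes,
    // so that the escapes added for quotes are not escaped a second time.
    static String phraseQuery(String field, String term) {
        String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(phraseQuery("attr_content", "say \"hi\""));
    }
}
```

The result can be passed as the q parameter in place of the hand-built string, e.g. params.add("q", SolrPhrase.phraseQuery("attr_content", searchW)).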

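That covers adding to and querying Solr. The next step is to parse an image with the Google Vision API and then put the recognized text into Solr, using the earlier example that adds specific text to the index.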

List<AnnotateImageRequest> requests = new ArrayList<>();
ByteString imgBytes;
try {
    imgBytes = ByteString.readFrom(new FileInputStream(filePath));
    Image img = Image.newBuilder().setContent(imgBytes).build();
    Feature feat = Feature.newBuilder().setType(Type.TEXT_DETECTION).build();
    AnnotateImageRequest request = AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
    requests.add(request);
    try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
        BatchAnnotateImagesResponse response = client.batchAnnotateImages(requests);
        List<AnnotateImageResponse> responses = response.getResponsesList();
        for (AnnotateImageResponse res : responses) {
            if (res.hasError()) {
                return strResult;
            }
            // For full list of available annotations, see http://g.co/cloud/vision/docs
            for (EntityAnnotation annotation : res.getTextAnnotationsList()) {
                // The first annotation holds the full detected text, so stop after it
                strResult = annotation.getDescription();
                // out.printf("Position : %s\n", annotation.getBoundingPoly());
                break;
            }
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}

This is basically Google's sample, used as is…
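With both indexing paths in place (Solr's /update/extract for documents, Vision plus a plain text field for images), something has to decide which path a given file takes. A minimal sketch of one way to do that, by file extension, is below. IndexRoute, Route, and routeFor are hypothetical names, and the list of image extensions is an assumption to adjust to whatever formats you actually send to Vision.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class IndexRoute {

    enum Route { VISION_OCR, SOLR_EXTRACT }

    // Assumed set of image extensions; not taken from the Vision documentation.
    private static final Set<String> IMAGE_EXTENSIONS =
            new HashSet<>(Arrays.asList("jpg", "jpeg", "png", "gif", "bmp"));

    // Images go through Vision text detection; everything else goes to /update/extract.
    static Route routeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return IMAGE_EXTENSIONS.contains(ext) ? Route.VISION_OCR : Route.SOLR_EXTRACT;
    }

    public static void main(String[] args) {
        System.out.println(routeFor("scan.PNG"));
        System.out.println(routeFor("/workspace/1.Java/FullTextSearch/pdf/002.pdf"));
    }
}
```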

Finally, because the API costs money, an environment variable has to point to the Google key file used for authentication.

On Linux:

export GOOGLE_APPLICATION_CREDENTIALS="/workspace/Key.json"

On Windows, add an environment variable named GOOGLE_APPLICATION_CREDENTIALS with the value C:/workspace/Key.json
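If the variable is missing, the problem only surfaces once the Vision client is created. A small startup check can catch it earlier. The following is a sketch using only the JDK; CredentialsCheck and problem are hypothetical names, and the environment is passed in as a map purely so the check is easy to test.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;

final class CredentialsCheck {

    static final String VAR = "GOOGLE_APPLICATION_CREDENTIALS";

    // Returns a description of the problem, or an empty Optional when the
    // variable is set and points at a readable file.
    static Optional<String> problem(Map<String, String> env) {
        String path = env.get(VAR);
        if (path == null || path.trim().isEmpty()) {
            return Optional.of(VAR + " is not set");
        }
        if (!Files.isReadable(Paths.get(path))) {
            return Optional.of(VAR + " points to an unreadable file: " + path);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Check the real process environment.
        System.out.println(problem(System.getenv()).orElse("credentials file found"));
    }
}
```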

That's a wrap...
