当前访客身份:游客 [ 登录  | 注册加入尚学堂]
直播

我来了!

拥有积分:3962
尚学堂雄起!!威武。。。

博客分类

笔记中心

课题中心

提问中心

答题中心

解答题中心

tika提取pdf信息异常

我来了! 发表于 2年前 (2014-11-09 23:27:23)  |  评论(0)  |  阅读次数(418)| 0 人收藏此文章,   我要收藏   
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:305)
at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:398)
at org.apache.pdfbox.util.PDFTextStripper.writeString(PDFTextStripper.java:866)
at org.apache.pdfbox.util.PDFTextStripper.writeLine(PDFTextStripper.java:1896)
at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:744)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:461)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)

在使用apache tika提取pdf信息时,报以上错误。根据错误信息提示,可能读取超过请求限制(10万字)。

我的代码如下:

		Parser parser = new PDFParser();
		//parser.
		BodyContentHandler handler = new BodyContentHandler();
		Metadata metadata = new Metadata();
		InputStream stream = null;
		try {
			
			stream = new FileInputStream(new File("1.pdf"));
			parser.parse(stream, handler, metadata, new ParseContext());
			
			 for (String name : metadata.names()) {
                 System.out.println(name + ":\t" + metadata.get(name));
             }
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (SAXException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (TikaException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} finally {
			try {
				stream.close();
			} catch (IOException e) {
				// TODO Auto-generated catch block
				e.printStackTrace();
			}
		}

  对读取字数限制,可能在某个构造函数里我没有传入最大限制,而使用了默认的十万字。检查一下上面的代码,我注意到了

BodyContentHandler的构造函数:
org.apache.tika.sax.BodyContentHandler.BodyContentHandler(int writeLimit)

  看样子有关系。修改一下构造函数的数字为:10*1024*1024(这个数字有pdf文档大小决定)。

重新调试程序,即可获得pdf的元数据信息如下:

  

dc:subject:	
meta:save-date:	2014-07-22T21:02:38Z
subject:	PostgreSQL 9.3 Documentation
Author:	The PostgreSQL Global Development Group
dcterms:created:	2014-07-22T20:55:33Z
date:	2014-07-22T21:02:38Z
creator:	The PostgreSQL Global Development Group
Creation-Date:	2014-07-22T20:55:33Z
title:	PostgreSQL 9.3 Documentation
trapped:	False
meta:author:	The PostgreSQL Global Development Group
created:	Wed Jul 23 04:55:33 CST 2014
meta:keyword:	
cp:subject:	PostgreSQL 9.3 Documentation
dc:format:	application/pdf; version=1.4
PTEX.Fullbanner:	This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012/Debian) kpathsea version 6.1.0
xmp:CreatorTool:	LaTeX with hyperref package
Keywords:	
dc:title:	PostgreSQL 9.3 Documentation
Last-Save-Date:	2014-07-22T21:02:38Z
meta:creation-date:	2014-07-22T20:55:33Z
dcterms:modified:	2014-07-22T21:02:38Z
dc:creator:	The PostgreSQL Global Development Group
pdf:PDFVersion:	1.4
Last-Modified:	2014-07-22T21:02:38Z
modified:	2014-07-22T21:02:38Z
xmpTPg:NPages:	2861
pdf:encrypted:	false
producer:	pdfTeX-1.40.13; modified using iText® 5.1.3 ©2000-2011 1T3XT BVBA
Content-Type:	application/pdf

  

分享到:0
关注微信,跟着我们扩展技术视野。每天推送IT新技术文章,每周聚焦一门新技术。微信二维码如下:
微信公众账号:尚学堂(微信号:bjsxt-java)
声明:博客文章版权属于原创作者,受法律保护。如果侵犯了您的权利,请联系管理员,我们将及时删除!
(邮箱:webmaster#sxt.cn(#换为@))
北京总部地址:北京市海淀区西三旗桥东建材城西路85号神州科技园B座三层尚学堂 咨询电话:400-009-1906 010-56233821
Copyright 2007-2015 北京尚学堂科技有限公司 京ICP备13018289号-1 京公网安备11010802015183