
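Preface

POI is a well-known Apache library for reading and writing Microsoft Office documents. Many people have probably used POI to export reports or to create and read Word documents, and it does make those tasks much more convenient. The tool I built recently reads the Word and Excel files on a computer.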

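POI structure

Package name: description

HSSF: reads and writes files in the Microsoft Excel XLS format.

XSSF: reads and writes files in the Microsoft Excel OOXML XLSX format.

HWPF: reads and writes files in the Microsoft Word DOC format.

HSLF: reads and writes files in the Microsoft PowerPoint format.

HDGF: reads files in the Microsoft Visio format.

HPBF: reads files in the Microsoft Publisher format.

HSMF: reads files in the Microsoft Outlook format.

Below I go through some of the pitfalls I ran into, first with Word and then with Excel: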

Word

For Word files, all I need is the body text of the document. So I can write a method that reads a doc or docx file:

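    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    // First attempt: choose the extractor from the file extension.
    private static String readDoc(String filePath, InputStream is) throws IOException {
        String text = "";
        try {
            if (filePath.endsWith(".doc")) {
                // HWPF handles the binary (OLE2) Word format.
                WordExtractor ex = new WordExtractor(is);
                text = ex.getText();
                ex.close();
            } else if (filePath.endsWith(".docx")) {
                // XWPF handles the OOXML Word format.
                XWPFDocument doc = new XWPFDocument(is);
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                text = extractor.getText();
                extractor.close();
            }
        } catch (Exception e) {
            logger.error(filePath, e);
        } finally {
            // Close the stream exactly once, whichever branch ran.
            if (is != null) {
                is.close();
            }
        }
        return text;
    }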

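In theory, this code should work for most doc and docx files. But I ran into a strange problem: when reading certain doc files, the code would often throw this exception:

    org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents.

What does this exception mean? Put simply, the file you opened is not a doc file, and you should read it with the docx code path. But the file we opened clearly has a .doc extension!

The point is that doc and docx are fundamentally different: doc is an OLE2 file, whereas docx is an OOXML file. If you open a docx file with an archive tool, you will find a number of folders: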

[Figure: the folders and files inside a docx file opened with an archive tool]

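In essence, a docx file is a zip archive containing a number of XML files. So although some docx files are small on disk, the XML inside them can be quite large, which is why reading certain docx files that do not look big can still consume a lot of memory.

I then opened the problematic doc file with an archive tool and, sure enough, its contents looked just like the figure above, so for all practical purposes it is a docx file. It was probably saved in some compatibility mode, which led to this nasty problem. So telling doc from docx by the file extension is no longer reliable.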
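The same inspection can be done in code instead of with an archive tool. The following is only a minimal sketch using the JDK's java.util.zip; the class name ZipEntryLister and the in-memory sample archive are made up for illustration, and the entry names merely imitate the layout of a docx file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; it is not part of POI.
class ZipEntryLister {

    /** Returns the entry names of a zip archive held in memory, in archive order. */
    static List<String> listEntries(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny stand-in archive whose entry names imitate a docx layout. */
    static byte[] sampleArchive() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            for (String name : new String[] {"[Content_Types].xml", "_rels/.rels", "word/document.xml"}) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write("<xml/>".getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // prints [[Content_Types].xml, _rels/.rels, word/document.xml]
        System.out.println(listEntries(sampleArchive()));
    }
}
```

Running listEntries over the bytes of a suspicious .doc file and seeing names such as word/document.xml would be a strong hint that the file is really a docx.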

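Honestly, I don't think this is a rare problem, yet I could not find anything about it on Google. The Stack Overflow question "how to know whether a file is .docx or .doc format from Apache POI" uses a ZipInputStream to decide whether a file is a docx file: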

    boolean isZip = new ZipInputStream(fileStream).getNextEntry() != null;

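But I don't think this is a good approach, because I would have to construct a ZipInputStream just for the check. The check also seems to consume part of the InputStream, so reading a normal doc file afterwards goes wrong. Alternatively you could use a File object to check whether the file is a zip archive, but that does not work for me either: I also need to read doc and docx files that sit inside archives, so my input has to be an InputStream. I went back and forth with people on Stack Overflow for quite a while, and at times I honestly wondered whether they understood the question, but in the end someone came up with a solution that delighted me: FileMagic, a feature newly added in POI 3.17: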

    public enum FileMagic {
        /** OLE2 / BIFF8+ stream used for Office 97 and higher documents */
        OLE2(HeaderBlockConstants._signature),
        /** OOXML / ZIP stream */
        OOXML(OOXML_FILE_HEADER),
        /** XML file */
        XML(RAW_XML_FILE_HEADER),
        /** BIFF2 raw stream - for Excel 2 */
        BIFF2(new byte[]{
                0x09, 0x00, // sid=0x0009
                0x04, 0x00, // size=0x0004
                0x00, 0x00, // unused
                0x70, 0x00 // 0x70 = multiple values
        }),
        /** BIFF3 raw stream - for Excel 3 */
        BIFF3(new byte[]{
                0x09, 0x02, // sid=0x0209
                0x06, 0x00, // size=0x0006
                0x00, 0x00, // unused
                0x70, 0x00 // 0x70 = multiple values
        }),
        /** BIFF4 raw stream - for Excel 4 */
        BIFF4(new byte[]{
                0x09, 0x04, // sid=0x0409
                0x06, 0x00, // size=0x0006
                0x00, 0x00, // unused
                0x70, 0x00 // 0x70 = multiple values
        }, new byte[]{
                0x09, 0x04, // sid=0x0409
                0x06, 0x00, // size=0x0006
                0x00, 0x00, // unused
                0x00, 0x01
        }),
        /** Old MS Write raw stream */
        MSWRITE(
                new byte[]{0x31, (byte)0xbe, 0x00, 0x00 },
                new byte[]{0x32, (byte)0xbe, 0x00, 0x00 }),
        /** RTF document */
        RTF("{\\rtf"),
        /** PDF document */
        PDF("%PDF"),
        // keep UNKNOWN always as last enum!
        /** UNKNOWN magic */
        UNKNOWN(new byte[0]);

        final byte[][] magic;

        FileMagic(long magic) {
            this.magic = new byte[1][8];
            LittleEndian.putLong(this.magic[0], 0, magic);
        }

        FileMagic(byte[]... magic) {
            this.magic = magic;
        }

        FileMagic(String magic) {
            this(magic.getBytes(LocaleUtil.CHARSET_1252));
        }

        public static FileMagic valueOf(byte[] magic) {
            for (FileMagic fm : values()) {
                int i=0;
                boolean found = true;
                for (byte[] ma : fm.magic) {
                    for (byte m : ma) {
                        byte d = magic[i++];
                        if (!(d == m || (m == 0x70 && (d == 0x10 || d == 0x20 || d == 0x40)))) {
                            found = false;
                            break;
                        }
                    }
                    if (found) {
                        return fm;
                    }
                }
            }
            return UNKNOWN;
        }

        /**
         * Get the file magic of the supplied InputStream (which MUST
         * support mark and reset).<p>
         *
         * If unsure if your InputStream does support mark / reset,
         * use {@link #prepareToCheckMagic(InputStream)} to wrap it and make
         * sure to always use that, and not the original!<p>
         *
         * Even if this method returns {@link FileMagic#UNKNOWN} it could potentially mean,
         * that the ZIP stream has leading junk bytes
         *
         * @param inp An InputStream which supports either mark/reset
         */
        public static FileMagic valueOf(InputStream inp) throws IOException {
            if (!inp.markSupported()) {
                throw new IOException("getFileMagic() only operates on streams which support mark(int)");
            }
            // Grab the first 8 bytes
            byte[] data = IOUtils.peekFirst8Bytes(inp);
            return FileMagic.valueOf(data);
        }

        /**
         * Checks if an {@link InputStream} can be reseted (i.e. used for checking the header magic) and wraps it if not
         *
         * @param stream stream to be checked for wrapping
         * @return a mark enabled stream
         */
        public static InputStream prepareToCheckMagic(InputStream stream) {
            if (stream.markSupported()) {
                return stream;
            }
            // we used to process the data via a PushbackInputStream, but user code could provide a too small one
            // so we use a BufferedInputStream instead now
            return new BufferedInputStream(stream);
        }
    }

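The listing above shows the main part of the class. It determines the file type from the first 8 bytes of the InputStream, which is without doubt the most elegant solution. Early on I had in fact wondered whether archive files are identified by their first few bytes, the so-called magic number. Because FileMagic's dependencies are compatible with version 3.16, I only needed to add this one class to my project. So the correct way to read a Word file is now: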

    // FileMagic is in org.apache.poi.poifs.filesystem as of POI 3.17.
    private static String readDoc(String filePath, InputStream is) throws IOException {
        String text = "";
        // Wrap the stream if necessary so its header can be peeked and then rewound.
        is = FileMagic.prepareToCheckMagic(is);
        try {
            // Decide by content, not by file extension; check the header only once.
            FileMagic magic = FileMagic.valueOf(is);
            if (magic == FileMagic.OLE2) {
                WordExtractor ex = new WordExtractor(is);
                text = ex.getText();
                ex.close();
            } else if (magic == FileMagic.OOXML) {
                XWPFDocument doc = new XWPFDocument(is);
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                text = extractor.getText();
                extractor.close();
            }
        } catch (Exception e) {
            logger.error("for file " + filePath, e);
        } finally {
            if (is != null) {
                is.close();
            }
        }
        return text;
    }
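To make the magic-number idea concrete without POI on the classpath, here is a rough sketch that relies only on the JDK. HeaderSniffer and its Kind enum are hypothetical names, not POI API, and the sketch only knows the two signatures that matter in this article: the 8-byte OLE2 header and the "PK" local file header that starts a typical zip archive. It peeks with mark/reset so that the stream can still be read from the first byte afterwards.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical stand-in for the idea behind FileMagic; this is not POI code.
class HeaderSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // First 8 bytes of an OLE2 compound file (legacy doc, xls, ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // "PK\3\4": the local file header that starts a typical zip archive (docx, xlsx).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a header by its leading bytes only. */
    static Kind classify(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    /** Peeks at up to 8 bytes and rewinds, so the caller can still read from byte 0. */
    static Kind sniff(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        try {
            in.mark(8);
            byte[] header = new byte[8];
            int total = 0;
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) break; // stream is shorter than 8 bytes
                total += n;
            }
            in.reset();
            return classify(Arrays.copyOf(header, total));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fakeZip = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00, 0x08, 0x00};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(fakeZip));
        System.out.println(sniff(in)); // prints ZIP
    }
}
```

Unlike the ZipInputStream probe, the stream is left positioned at its first byte, which is the property that lets the real extractor be constructed on the same stream afterwards.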

Excel

For Excel I won't bother comparing my earlier approach with the current one; here is simply what I now consider the best practice:

    // StreamingReader comes from the xlsx-streamer library (com.monitorjbl.xlsx.StreamingReader).
    private static String readExcel(String filePath, InputStream inp) throws Exception {
        Workbook wb;
        StringBuilder sb = new StringBuilder();
        try {
            if (filePath.endsWith(".xls")) {
                wb = new HSSFWorkbook(inp);
            } else {
                wb = StreamingReader.builder()
                        .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
                        .bufferSize(4096)   // buffer size to use when reading InputStream to file (defaults to 1024)
                        .open(inp);         // InputStream or File for XLSX file (required)
            }
            sb = readSheet(wb, sb, filePath.endsWith(".xls"));
            wb.close();
        } catch (OLE2NotOfficeXmlFileException e) {
            logger.error(filePath, e);
        } finally {
            if (inp != null) {
                inp.close();
            }
        }
        return sb.toString();
    }

    private static String readExcelByFile(String filepath, File file) {
        Workbook wb;
        StringBuilder sb = new StringBuilder();
        try {
            if (filepath.endsWith(".xls")) {
                wb = WorkbookFactory.create(file);
            } else {
                wb = StreamingReader.builder()
                        .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
                        .bufferSize(4096)   // buffer size to use when reading InputStream to file (defaults to 1024)
                        .open(file);        // InputStream or File for XLSX file (required)
            }
            sb = readSheet(wb, sb, filepath.endsWith(".xls"));
            wb.close();
        } catch (Exception e) {
            logger.error(filepath, e);
        }
        return sb.toString();
    }

    // The int cell-type constants used below are deprecated in recent POI 3.x releases.
    @SuppressWarnings("deprecation")
    private static StringBuilder readSheet(Workbook wb, StringBuilder sb, boolean isXls) throws Exception {
        // One formatter is enough; it renders numeric cells the way Excel displays them.
        DataFormatter formatter = new DataFormatter();
        for (Sheet sheet : wb) {
            for (Row r : sheet) {
                for (Cell cell : r) {
                    if (cell.getCellType() == Cell.CELL_TYPE_STRING) {
                        sb.append(cell.getStringCellValue());
                        sb.append(" ");
                    } else if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) {
                        if (isXls) {
                            sb.append(formatter.formatCellValue(cell));
                        } else {
                            // Cells from the streaming reader are read as text here.
                            sb.append(cell.getStringCellValue());
                        }
                        sb.append(" ");
                    }
                }
            }
        }
        return sb;
    }
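The reason a streaming reader helps with large xlsx files is that the sheet XML can be processed piece by piece instead of being loaded as a whole. The sketch below illustrates only that general idea with the JDK's StAX parser; it is not how POI or StreamingReader is implemented. The class SheetXmlStreamer and the sample XML are assumptions made for illustration. The sketch collects the raw text of each v (value) element, which in a real workbook may be an index into the shared-strings table rather than the displayed text.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration of pull parsing; not part of POI or xlsx-streamer.
class SheetXmlStreamer {

    /** Appends the text of every v element, each followed by a space, in document order. */
    static String cellText(String sheetXml) {
        StringBuilder sb = new StringBuilder();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(sheetXml));
            while (reader.hasNext()) {
                // Only the current parsing event is held in memory, never the whole tree.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "v".equals(reader.getLocalName())) {
                    sb.append(reader.getElementText()).append(' ');
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed sheet XML", e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<worksheet><sheetData><row r=\"1\">"
                + "<c r=\"A1\"><v>1</v></c><c r=\"B1\"><v>2.5</v></c>"
                + "</row></sheetData></worksheet>";
        System.out.println(cellText(xml));
    }
}
```

The output format, values separated by single spaces, mirrors what readSheet builds in the article's own code.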