@Chetan496
Last active December 15, 2016 16:55
Apache POI snippets

/* Contains code samples and explanations of Apache POI */

HSLF - Horrible Slide Layout Format

POI's component for reading, creating, and editing PPT files (the binary PowerPoint format)

OLE (Object Linking and Embedding) is a proprietary Microsoft technology that allows embedding and linking to other documents and objects. In particular, it allows one document to be embedded within another.

POIFS is a pure Java implementation of the OLE 2 Compound Document format.

From the Apache POI FS intro page: A common confusion is on just what POIFS buys you or what OLE 2 Compound Document format is exactly. POIFS does not buy you DOC, or XLS, but is necessary to generate or read DOC or XLS files. You see, all file formats based on the OLE 2 Compound Document Format have a common structure. The OLE 2 Compound Document Format is essentially a convoluted archive format. Think of POIFS as a "zip" library. Once you can get the data in a zip file you still need to interpret the data. As a general rule, while all of our formats use POIFS, most of them attempt to abstract you from it. There are some circumstances where this is not possible, but as a general rule this is true.

If you're an end user type just looking to generate XLS files, then you'd be looking for HSSF not POIFS; however, if you have legacy code that uses MFC property sets, POIFS is for you! Regardless, you may or may not need to know how to use POIFS but ultimately if you use technologies that come from the POI project, you're using POIFS underneath. Perhaps we should have a branding campaign "POIFS Inside!". ;-)

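Note that only the newer OOXML formats (DOCX, XLSX, PPTX) are compressed XML files inside a ZIP package. The legacy DOC, XLS and PPT files that POIFS handles are binary OLE 2 compound documents, not ZIP archives.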
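The two container types can be told apart from the first bytes of a file: OLE 2 compound documents start with the signature D0 CF 11 E0 A1 B1 1A E1, while ZIP-based packages start with "PK" followed by the bytes 03 04. The following is a minimal sketch using only the JDK; the class name OfficeContainerSniffer and the method sniff are made up for illustration and are not part of Apache POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not a POI class): guesses the container type from the header bytes.
public class OfficeContainerSniffer {

    // Signature at the start of every OLE 2 compound document (.doc, .xls, .ppt).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of a ZIP archive (.docx, .xlsx, .pptx).
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    /** Returns "OLE2", "ZIP" or "UNKNOWN" for the given leading bytes of a file. */
    public static String sniff(byte[] header) {
        if (startsWith(header, OLE2)) {
            return "OLE2";
        }
        if (startsWith(header, ZIP)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: OfficeContainerSniffer <file>");
            return;
        }
        byte[] header = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int r;
            // Keep reading until 8 bytes are available or the file ends.
            while (n < header.length && (r = in.read(header, n, header.length - n)) > 0) {
                n += r;
            }
        }
        System.out.println(sniff(Arrays.copyOf(header, n)));
    }
}
```

A file reported as OLE2 is one that POIFS can open; a ZIP result only says the file is a ZIP archive, not that it is necessarily an Office document.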

It is possible for one OLE 2 based document to have other OLE 2 documents embedded in it. For example, an Excel file may have a Word document and a PowerPoint slideshow embedded as part of it.

Normally, these other documents are stored in subdirectories of the OLE 2 (POIFS) filesystem. The exact location of the embedded documents will vary depending on the type of the master document, and the exact directory names will differ each time. To figure out exactly which directory to look in, you will either need to process the appropriate OLE 2 linking entry in the master document, or simply iterate over all the directories in the filesystem.

As a general rule, you will find the same OLE 2 entries in the subdirectories, as you would've found at the root of the filesystem were a document to not be embedded.
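The "iterate over all the directories" approach can be sketched as a recursive walk over the POIFS tree. This assumes a POI version in which POIFSFileSystem can be opened from a File and closed (true of the 3.15-era releases and later, as far as I know); the class name PoifsWalker, its methods, and the path format are my own choices.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.Entry;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical helper: lists every directory and document inside an OLE 2 file.
public class PoifsWalker {

    /** Returns the paths of all entries; directory paths end with a slash. */
    public static List<String> listEntries(POIFSFileSystem fs) {
        List<String> out = new ArrayList<>();
        collect(fs.getRoot(), "", out);
        return out;
    }

    // Depth-first walk: a DirectoryEntry is iterable over its child entries.
    private static void collect(DirectoryEntry dir, String path, List<String> out) {
        for (Entry entry : dir) {
            String entryPath = path + "/" + entry.getName();
            if (entry.isDirectoryEntry()) {
                out.add(entryPath + "/");
                collect((DirectoryEntry) entry, entryPath, out);
            } else {
                out.add(entryPath);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: PoifsWalker <ole2-file>");
            return;
        }
        // Opening from a File avoids loading the whole document into memory.
        try (POIFSFileSystem fs = new POIFSFileSystem(new File(args[0]))) {
            for (String path : listEntries(fs)) {
                System.out.println(path);
            }
        }
    }
}
```

Subdirectories that contain the same entries you would expect at the root of a standalone file are the likely embedded documents.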

To conclude: every OLE 2 document (such as a legacy MS Office file) can be treated as a filesystem.

Its child directories may contain other embedded documents.

Files embedded in PowerPoint:

PowerPoint does not normally store embedded files in the OLE2 layer. Instead, they are held within records of the main PowerPoint file. See the HSLF Tutorial for how to retrieve embedded OLE objects from a presentation.

In short, an MS Office document may hold embedded documents either in sub-directories of its OLE 2 filesystem or, as with PowerPoint, in records of the main document.

POIFS provides a simple tool for listing the contents of OLE2 files. This lets you see what your POIFS file contains, and hence whether it has any embedded documents in it, and where.

The tool to use is org.apache.poi.poifs.dev.POIFSLister. This tool may be run from the command line, and takes a filename as its parameter. It will print out all the directories and files contained within the POIFS file.

Relevant classes for working with PowerPoint files:

  1. HSLFSlideShow
  2. HSLFShape
  3. HSLFObjectData
  4. HSLFSlide
  5. HSLFTextShape (may prove to be useful)
  6. HSLFTextBox (subclass of the above class)
  7. HSLFGroupShape
  8. HSLFTextParagraph
  9. HSLFTextRun - represents a run of text, all having the same style -- could be useful to identify the section headers
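As a rough illustration of how the text classes above fit together, the sketch below walks slide, paragraph and run, and collects the text of every bold run. Treating bold runs as candidate section headers is only a heuristic I am assuming here, and the class name BoldRunFinder is hypothetical. It relies on the List-based HSLF usermodel API (getSlides(), getTextParagraphs(), getTextRuns()) found in POI 3.15 and later.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFSlide;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.hslf.usermodel.HSLFTextParagraph;
import org.apache.poi.hslf.usermodel.HSLFTextRun;

// Hypothetical helper: collects the raw text of all bold runs in a presentation.
public class BoldRunFinder {

    public static List<String> boldRuns(HSLFSlideShow ppt) {
        List<String> out = new ArrayList<>();
        for (HSLFSlide slide : ppt.getSlides()) {
            // One inner list of paragraphs per text shape on the slide.
            for (List<HSLFTextParagraph> paragraphs : slide.getTextParagraphs()) {
                for (HSLFTextParagraph paragraph : paragraphs) {
                    // Each run is a stretch of text sharing one character style.
                    for (HSLFTextRun run : paragraph.getTextRuns()) {
                        if (run.isBold()) {
                            out.add(run.getRawText());
                        }
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: BoldRunFinder <file.ppt>");
            return;
        }
        try (InputStream is = new FileInputStream(args[0]);
             HSLFSlideShow ppt = new HSLFSlideShow(is)) {
            for (String text : boldRuns(ppt)) {
                System.out.println(text);
            }
        }
    }
}
```

Other run properties, such as font size, could be checked in the same loop if bold alone turns out to be too noisy.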

Reference links for PowerPoint:

https://poi.apache.org/slideshow/how-to-shapes.html#OLE

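Reading a PPT using Apache POI:

import java.io.FileInputStream;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
// The constructor reads the whole stream, so it can be closed straight away.
FileInputStream is = new FileInputStream("slideshow.ppt");
HSLFSlideShow ppt = new HSLFSlideShow(is);
is.close();

This gives you a handle (an HSLFSlideShow) on the presentation read from the given input stream.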

Getting the title of the slide:

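HSLFSlide.getTitle() (an instance method: call it on a slide obtained from ppt.getSlides(); it returns null if the slide has no title)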
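Putting the two snippets together, a minimal sketch that prints the title of every slide might look like this. The class name SlideTitles and the "(no title)" placeholder are my own; the code assumes the List-returning getSlides() of POI 3.15 and later.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFSlide;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;

// Hypothetical helper: collects slide titles in presentation order.
public class SlideTitles {

    /** Returns one entry per slide; an entry is null when the slide has no title. */
    public static List<String> titles(HSLFSlideShow ppt) {
        List<String> out = new ArrayList<>();
        for (HSLFSlide slide : ppt.getSlides()) {
            out.add(slide.getTitle());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: SlideTitles <file.ppt>");
            return;
        }
        try (InputStream is = new FileInputStream(args[0]);
             HSLFSlideShow ppt = new HSLFSlideShow(is)) {
            int number = 1;
            for (String title : titles(ppt)) {
                System.out.println(number++ + ": " + (title == null ? "(no title)" : title));
            }
        }
    }
}
```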
