Parsing Microsoft Office Open XML document properties with Node.js
Recently, I built an Algolia search index for a large repository of Microsoft Office documents. Algolia provides a hosted search platform that makes it easy to implement search within websites and mobile applications. It’s especially good for searching semi-structured data and returning relevant results quickly.
You can use tools like textract or Apache Tika to extract text from Microsoft Office documents. If you also want to include the standard document properties in your index, the office-document-properties Node.js module reads them from Microsoft Office Open XML documents (docx, docm, pptx, pptm, xlsx, xlsm).
To get started, install the module from npm with the command npm install office-document-properties.
To read document properties from a file, import the module and use the fromFilePath method.
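The original snippet is not shown here, so the following is only a minimal sketch of how such a call might look. It assumes that fromFilePath takes a file path and a Node-style callback of the form (err, data), which you should confirm against the project README. The helper name readPropertiesFromFile and the file name report.docx are hypothetical. The module is passed in as an argument so the helper can be exercised without the package installed.

```javascript
// Minimal sketch: wrap fromFilePath in a Promise.
// ASSUMPTION: fromFilePath(filePath, callback) invokes callback(err, data).
// `reader` is the object returned by require('office-document-properties').
function readPropertiesFromFile(reader, filePath) {
  return new Promise(function (resolve, reject) {
    reader.fromFilePath(filePath, function (err, data) {
      if (err) {
        reject(err);
        return;
      }
      resolve(data); // whatever the module passes as its second callback argument
    });
  });
}

// Example usage (hypothetical file name):
// const getDocumentProperties = require('office-document-properties');
// readPropertiesFromFile(getDocumentProperties, 'report.docx')
//   .then(function (properties) { console.log(properties); })
//   .catch(function (err) { console.error(err); });
```

Returning a Promise is a design choice for this sketch, not something the module requires. It makes it easier to process many files in sequence when building an index.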
Read document properties from a buffer using the fromBuffer method:
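Reading from a buffer is handy when the document is already in memory, for example after a download or an upload. The sketch below assumes that fromBuffer takes a Buffer and a Node-style callback of the form (err, data); check the project README for the exact signature. The helper name readPropertiesFromBuffer and the file name report.docx are hypothetical.

```javascript
// Minimal sketch: wrap fromBuffer in a Promise.
// ASSUMPTION: fromBuffer(buffer, callback) invokes callback(err, data).
// `reader` is the object returned by require('office-document-properties').
function readPropertiesFromBuffer(reader, buffer) {
  return new Promise(function (resolve, reject) {
    // Fail early with a clear error if the caller passes something else.
    if (!Buffer.isBuffer(buffer)) {
      reject(new TypeError('Expected a Buffer'));
      return;
    }
    reader.fromBuffer(buffer, function (err, data) {
      if (err) {
        reject(err);
        return;
      }
      resolve(data);
    });
  });
}

// Example usage (hypothetical file name):
// const fs = require('fs');
// const getDocumentProperties = require('office-document-properties');
// const buffer = fs.readFileSync('report.docx');
// readPropertiesFromBuffer(getDocumentProperties, buffer)
//   .then(function (properties) { console.log(properties); })
//   .catch(function (err) { console.error(err); });
```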
I hope you find this module useful for your data indexing needs. Be sure to visit the GitHub project to report any issues you encounter.