Converting all kinds of documents into text#

Have a collection of documents? Word docs, HTML files, PDFs, image-based PDFs, and anything else? Don't worry, Apache Tika has you covered.

Installation#

These installation instructions are for OS X with Homebrew, but it's possible to get the same software running on Windows.

Tesseract#

Tesseract is a piece of software that performs OCR, converting images of text into actual text. If we need to perform OCR on more languages than just English, we'll also need to install tesseract-lang to add more languages to the mix.

brew install tesseract tesseract-lang

Tika#

Tika is an incredible piece of software that converts just about any kind of document to text. It requires Java. I installed Java from https://www.java.com/en/download/ and Tika didn't work with it, so use the install commands below instead.

brew install --cask adoptopenjdk
brew install tika

Tika will automatically find and use tesseract, as long as the tesseract command is on your PATH.
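If you want to double-check that both pieces are visible before going further, here is a minimal sketch using only the standard library. The helper name `find_tools` is my own invention, not part of Tika. Finding `java` on the PATH only tells you a command with that name exists, not that a working JDK is behind it.

```python
import shutil


def find_tools(names=("tesseract", "java")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report what was found so a missing install is obvious
for name, path in find_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If either one comes back as not found, rerun the matching install command above.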

Python bindings for Tika#

Tika is a piece of software that exists outside of Python. If we want Python to be able to use Tika, we'll need to install the Python bindings for Tika.

pip install tika

If you'd like to just run this all from the notebook, uncomment and run the cell below. You'll need to type in your password for the adoptopenjdk one, so be sure to pay attention to when it asks you.

# !brew install tesseract tesseract-lang
# !brew install --cask adoptopenjdk
# !brew install tika 
# !pip install tika
# Download the image
!curl -O https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  822k  100  822k    0     0  3259k      0 --:--:-- --:--:-- --:--:-- 3264k

!tesseract Dr._Jekyll_and_Mr._Hyde_Text.jpg stdout
at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“I understood, a drawer...”

But here I took pity on my visitor’s suspense, and some perhaps
on my own growing curiosity.

“There it is, sir,” said I, pointing to the drawer, where it lay on the
floor behind a table and still covered with the sheet.

He sprang to it, and then paused, and laid his hand upon his
heart: I could hear his teeth grate with the convulsive action of his
jaws; and his face was so ghastly to see that I grew alarmed both for
his life and reason.

“Compose yourself,’ said I.

He turned a dreadful smile to me, and as if with the decision of
despair, plucked away the sheet. At sight of the contents, he uttered
one loud sob of such immense relief that I sat petrified. And the
next moment, in a voice that was already fairly well under control,
“Have you a graduated glass?” he asked.

I rose from my place with something of an effort and gave him
what he asked.

He thanked me with a smiling nod, measured out a few min-
ims of the red tincture and added one of the powders. The mix-
ture, which was at first of a reddish hue, began, in proportion as the


Using Tika#

Starting it up#

import tika
import requests
from tika import parser

# Kept for backwards compatibility; the Tika server itself starts on the first parse call
tika.initVM()

Doing your parsing#

There are two ways to do it!

Right from the web

response = requests.get(...)
results = parser.from_buffer(response.content)

From a downloaded file

results = parser.from_file(filename)
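Since the whole point is usually a collection of documents, here is a rough sketch of running `parser.from_file` over every file in a folder. The name `extract_folder` and its `parse` and `pattern` arguments are hypothetical, just for illustration. Passing in your own `parse` function makes it possible to try the loop without Tika running.

```python
from pathlib import Path


def extract_folder(folder, parse=None, pattern="*"):
    """Return {filename: extracted text} for every matching file in folder."""
    if parse is None:
        # Imported lazily so the helper can be tried without Tika installed
        from tika import parser
        parse = parser.from_file
    texts = {}
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue  # skip subfolders
        results = parse(str(path))
        # 'content' can be None when Tika finds no text in the file
        texts[path.name] = (results.get("content") or "").strip()
    return texts


# Example usage (assumes a folder called "documents" exists):
# texts = extract_folder("documents", pattern="*.pdf")
```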

Note that if you want to do non-English OCR, you need to change things up a bit by sending the language code in a header. The one below is for Ancient Greek (grc); modern Greek is ell. See what your tesseract supports with tesseract --list-langs

headers = {
    "X-Tika-OCRLanguage": "grc"
}

results = parser.from_buffer(response.content, headers=headers)
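To avoid asking for a language that isn't installed, you can read the list from tesseract first. This is a sketch under a few assumptions: the helper names are made up, it assumes a recent Tesseract that prints the list to standard output, and it assumes the first line is a "List of available languages" header.

```python
import subprocess


def parse_langs(listing):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in listing.splitlines()]
    # Skip blank lines and the "List of available languages ..." header
    return [line for line in lines if line and not line.startswith("List of")]


def ocr_headers(lang, available):
    """Build the Tika header dict, refusing languages tesseract does not have."""
    if lang not in available:
        raise ValueError(f"{lang!r} is not installed; choose from {available}")
    return {"X-Tika-OCRLanguage": lang}


def installed_langs():
    """Run tesseract and return its language codes (needs tesseract on PATH)."""
    done = subprocess.run(["tesseract", "--list-langs"],
                          capture_output=True, text=True, check=True)
    return parse_langs(done.stdout)


# Example usage:
# headers = ocr_headers("grc", installed_langs())
# results = parser.from_buffer(response.content, headers=headers)
```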

Examples#

PDF example#

The first time it will be very slow, because the Python tika package downloads the Tika server and starts it in the background before it can parse anything.

response = requests.get('https://data.ct.gov/download/fxjv-82m6/application/pdf')
results = parser.from_buffer(response.content)
results.keys()
dict_keys(['status', 'content', 'metadata'])
results['status']
200
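That status is worth checking before you trust the content. Here is a sketch that wraps the download, the parse and both checks in one function. `text_from_url` and its `fetch` and `parse` arguments are hypothetical names; by default they fall back to `requests.get` and `parser.from_buffer` as used above.

```python
def text_from_url(url, fetch=None, parse=None, timeout=30):
    """Download url and return the text Tika extracts from it."""
    if fetch is None:
        import requests
        fetch = requests.get
    if parse is None:
        from tika import parser
        parse = parser.from_buffer
    response = fetch(url, timeout=timeout)
    response.raise_for_status()  # stop on 404s and other HTTP errors
    results = parse(response.content)
    if results.get("status") != 200:
        raise RuntimeError(f"Tika returned status {results.get('status')}")
    # 'content' can be None when Tika finds no text
    return (results.get("content") or "").strip()


# Example usage:
# text = text_from_url('https://data.ct.gov/download/fxjv-82m6/application/pdf')
```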
# Only showing the first 1000 chars because there are SO MANY
results['content'][:1000]
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n  \n\n  \n\n \n\n \n\nConnecticut \n\nOpen Data \n\nPolicy \nEffective April 22, 2015 \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nPromulgated in accordance with and \n\nunder the authority of Executive \n\nOrder 39 of Governor Dannel P. \n\nMalloy \n\n \n\n  \n\n  \n\n \n\n  \n \n\n  \n\n\n\n \n\n \n\nContents \n\n \n\n \n1.0 Definitions .......................................................................................................................... 3 \n\n2.0  Introduction...................................................................................................................... 5 \n\n2.1  Intent ............................................................................................................................ 5 \n\n2.2  Scope ............................................................................................................................ 5 \n\n2.3  Legal Considerations ....................................................................................................... 5 '
# Only showing the first 10000 chars
print(results['content'][:10000].strip())
Connecticut 

Open Data 

Policy 
Effective April 22, 2015 

 

 

 

 

 

 

 

Promulgated in accordance with and 

under the authority of Executive 

Order 39 of Governor Dannel P. 

Malloy 

 

  

  

 

  
 

  



 

 

Contents 

 

 
1.0 Definitions .......................................................................................................................... 3 

2.0  Introduction...................................................................................................................... 5 

2.1  Intent ............................................................................................................................ 5 

2.2  Scope ............................................................................................................................ 5 

2.3  Legal Considerations ....................................................................................................... 5 

3.0  Open Data Policy Requirements ......................................................................................... 6 

3.1  General Requirements .................................................................................................... 6 

3.2  Open Data Criteria and Assessment Requirements ............................................................. 6 

3.3  Additional Requirements ................................................................................................. 7 

4.0  Governance and Oversight ................................................................................................. 7 

Connecticut’s Chief Data Officer (CDO) ........................................................................................ 7 

4.1  Agency Data Officer (ADO) .............................................................................................. 7 

4.2  Agency Data Stewards (ADS) ............................................................................................ 7 

5.0  State Data Standards ......................................................................................................... 8 

5.1  Data Set Selection .......................................................................................................... 8 

5.2  Data Set Publishing ......................................................................................................... 8 

5.3  Maintenance ................................................................................................................. 8 

5.4  Ownership & Responsibility ............................................................................................. 8 

 

 

 

  

 

  



 

 

 

1.0 Definitions  

 Agency Data Officer: Responsible for fulfilling a State agency’s responsibilities under Executive 

Order 39 of Governor Dannel P. Malloy.  The Agency Data Officer shall be an employee, 

knowledgeable about the overall business practices of the agency and the data it collects.  

API: An application programming interface, which is a set of definitions of the ways one piece 

of computer software communicates with another. It is a method of achieving abstraction, 

usually (but not necessarily) between higher-level and lower-level software. 

 Catalog: A catalog is a searchable and interactive collection of data sets or web services often 

known as a portal or repository. 

 Chief Data Officer: An individual within the Office of Policy and Management, designated by 

the Governor, to coordinate implementation of and compliance with Executive Order 39 of 

Governor Dannel P. Malloy, and coordinate initiatives to improve access to state data. 

  

Data: Statistical or factual information that: (a) is reflected in a list, table, graph, chart, or other 

non-narrative form, that can be digitally transmitted or processed; (b) is regularly created and 

maintained by or on behalf of an executive branch agency; and (c) records a measurement, 

transaction, or determination related to the mission of the agency or is provided to the agency 

by third parties pursuant to law. 

  

Database: A collection of data stored according to a schema and manipulated according to the 

rules set out in one Data Modelling Facility.  

 

Data Inventory: An itemized list of current data assets such as: databases, data sets, 

spreadsheets, collections, or geospatial data in the possession of a State Agency 

  

Data Set: A data set is an organized collection of data. The most basic representation of a data 

set is data elements presented in tabular form. Each column represents a particular variable. 

Each row corresponds to a given value of that column’s variable. A data set may also present 

information in a variety of non-tabular formats, such as an extensible mark-up language (XML) 

file, a geospatial data file, or an image file, etc. 



 Machine Processed: Refers to information or data that is in a format that can be easily 

processed by a computer without human intervention while ensuring no semantic meaning is 

lost. 

 Metadata: To facilitate common understanding, a number of characteristics, or attributes, of 

data are defined. These characteristics of data are known as “metadata”, that is, “data that 

describes data.” For any particular datum, the metadata may describe how the datum is 

represented, ranges of acceptable values, its relationship to other data, and how it should be 

labeled. Metadata also may provide other relevant information, such as the responsible 

steward, associated laws and regulations, and access management policy. Each of the types of 

data described herein has a corresponding set of metadata. The metadata for structured data 

objects describes the structure, data elements, interrelationships, and other characteristics of 

information, including its creation, disposition, access and handling controls, formats, content, 

and context, as well as related audit trails. Metadata includes data element names (such as 

Organization Name, Address, etc.), their definition, and their format (numeric, date, text, etc.). 

In contrast, data is the actual data values such as the “US Patent and Trade Office” or the 

“Social Security Administration” for the metadata called “Organization Name” and including a 

description of the data sources. Metadata may also include metrics about an organization’s 

data including its data quality (accuracy, completeness, etc.). 

 Standardized: Utilizing standards developed or adopted by voluntary consensus standards 

bodies, both domestic and international. These standards include provisions requiring that 

owners of relevant intellectual property have agreed to make that intellectual property 

available on a non-discriminatory, royalty-free or reasonable royalty basis to all interested 

parties. 

Web Service: A Web service is a method of communication between two or more electronic 

devices over a network. It has an interface described in a machine-processable format 

(specifically Representational State Transfer or REST).  

 

            

  



 

  

2.0  Introduction 

2.1  Intent 

Connecticut Open Data, as supported by the Open Data Policy (ODP) is intended to: 

● Increase agency accountability and responsiveness 

● Improve public knowledge of the government and its operations. 

● Provide timely data that is easily accessible to the public 

● Encourage public participation and interaction with government agencies, policies and 

issues 

● Foster agency/interagency efficiency 

● Create economic opportunity 

● Facilitate partnerships with non-governmental organizations 

● Empower citizens to create value from Open Data 

● Encourage the use of open frameworks and products, allowing third parties to embrace 

and expand on the state’s open data services. 

2.2  Scope 

The ODP applies to data in the custody or under the control of the State Agencies with a 

department head as defined by section 4-5 of the General Statutes. While the ODP applies to all 

government data: legal, policy, and contractual obligations limit the application of this ODP in 

some cases. In addition, this ODP sets out specific criteria that must be met before data can be 

considered Open Data. 

  

2.3  Legal Considerations 

The following legal considerations guide the development of the ODP and provide context for 

its application. 

Federal Freedom of Information Act (FOIA) 

Enacted in 1966, and taking effect on July 5, 1967, FOIA provides that any person has a right, 

enforceable in court, to obtain access to federal agency records, except to the extent that such 

records (or portions of them) are protected from public disclosure by several exemptions or by 

one of three special law enforcement record exclusions. For additional information regarding 

the FOIA, visit www.foia.gov. 

Connecticut Freedom of Information Act 

The Connecticut Freedom of Information Act (CTFOIA) is under the authority of the Connecticut 

Freedom of Information Commission. Their mission is “to administer and enforce the provisions 

http://www.cga.ct.gov/current/pub/chap_046.htm#sec_4-5
http://www.foia.gov/
http://www.foia.gov/


of the Connecticut freedom of information act, and to thereby ensure citizen access to the 

records and meetings of public agencies in the state of Connecticut”. For additional information 

on the Connecticut Freedom of Information Commission, go to www.ct.gov/foi. 

Open Data Licensing 

Explicit licensing is essential to provide clarity and certainty to users and reusers. Generally, 

Open Data provided by State Agencies should be identified as “Public Domain,” however in 

some instances it may be necessary, or desirable to apply an Open Data license. Open Data 

licenses are available from Open Data Commons: http://opendatacommons.org/licenses/odbl/. 

  

3.0  Open Data Policy Requirements 

3.1  General Requirements 

The following poli
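As you can see, text pulled out of a PDF tends to be full of trailing spaces and long runs of blank lines. If that bothers you, a small cleanup pass helps. This is just one possible approach, and `tidy` is a name I made up, not something Tika provides.

```python
import re


def tidy(text):
    """Strip trailing spaces and collapse runs of blank lines to a single one."""
    lines = [line.rstrip() for line in text.splitlines()]
    joined = "\n".join(lines).strip()
    # Three or more newlines in a row means more than one blank line
    return re.sub(r"\n{3,}", "\n\n", joined)


# Example usage:
# print(tidy(results['content'])[:1000])
```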

Word doc example#

response = requests.get('https://pasteur.epa.gov/uploads/10.23719/1500001/LDPE_nanoclay_Highlights_.docx')
results = parser.from_buffer(response.content)
print(results['content'].strip())
Highlights 

Evaluating Weathering of Food Packaging Polyethylene-Nano-clay Composites: Release of Nanoparticles and their Impacts

Changseok Han1, Amy Zhao1, and Eunice Varughese2, E. Sahle-Demessie*1




1. UV or O3 degradation food packaging composites released nanoclay particles. 
2. Properties of nanocomposites changed during accelerated weathering.
3. Nanoclay release was proportional to weathering time.
4. Toxicity of released nanoclay at test concentrations were not significant.

OCR image example#

It will work the same with an image-based PDF instead of an image.

response = requests.get('https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg')
results = parser.from_buffer(response.content)
results['status']
200
print(results['content'].strip())
at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“I understood, a drawer...”

But here I took pity on my visitor’s suspense, and some perhaps
on my own growing curiosity.

“There it is, sir,” said I, pointing to the drawer, where it lay on the
floor behind a table and still covered with the sheet.

He sprang to it, and then paused, and laid his hand upon his
heart: I could hear his teeth grate with the convulsive action of his
jaws; and his face was so ghastly to see that I grew alarmed both for
his life and reason.

“Compose yourself,’ said I.

He turned a dreadful smile to me, and as if with the decision of
despair, plucked away the sheet. At sight of the contents, he uttered
one loud sob of such immense relief that I sat petrified. And the
next moment, in a voice that was already fairly well under control,
“Have you a graduated glass?” he asked.

I rose from my place with something of an effort and gave him
what he asked.

He thanked me with a smiling nod, measured out a few min-
ims of the red tincture and added one of the powders. The mix-
ture, which was at first of a reddish hue, began, in proportion as the
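The OCR keeps the page's line breaks, so words split across lines come through as "or-" and "dinary" or "mix-" and "ture". A simple regular expression can stitch them back together. This is a rough sketch with a made-up helper name, and it will also join real hyphenated compounds that happen to break at the end of a line.

```python
import re


def join_hyphenated(text):
    """Rejoin words that were split with a hyphen at the end of a line."""
    # A word character, a hyphen, a newline, then another word character
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


# Example usage:
# print(join_hyphenated(results['content'].strip()))
```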

Using local files#

# Save the file locally
!curl -O https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg

results = parser.from_file('Dr._Jekyll_and_Mr._Hyde_Text.jpg')
print(results['content'].strip())
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  822k  100  822k    0     0  2594k      0 --:--:-- --:--:-- --:--:-- 2595k
at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“I understood, a drawer...”

But here I took pity on my visitor’s suspense, and some perhaps
on my own growing curiosity.

“There it is, sir,” said I, pointing to the drawer, where it lay on the
floor behind a table and still covered with the sheet.

He sprang to it, and then paused, and laid his hand upon his
heart: I could hear his teeth grate with the convulsive action of his
jaws; and his face was so ghastly to see that I grew alarmed both for
his life and reason.

“Compose yourself,’ said I.

He turned a dreadful smile to me, and as if with the decision of
despair, plucked away the sheet. At sight of the contents, he uttered
one loud sob of such immense relief that I sat petrified. And the
next moment, in a voice that was already fairly well under control,
“Have you a graduated glass?” he asked.

I rose from my place with something of an effort and gave him
what he asked.

He thanked me with a smiling nod, measured out a few min-
ims of the red tincture and added one of the powders. The mix-
ture, which was at first of a reddish hue, began, in proportion as the