EPUB and PDF are widely used digital document formats. This post shows how to parse them with LangChain, a Python library, to build a dataset for a custom language model.

Parsing EPUB with LangChain

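LangChain’s UnstructuredEPubLoader loads an EPUB into a list of documents, each holding a piece of the book’s text plus metadata. It depends on the unstructured package and a pandoc install. By default the whole book is loaded as one document; load_and_split() then breaks it into smaller chunks with a text splitter.

# Requires: pip install langchain unstructured (and pandoc)
# Newer LangChain releases expose the loaders from langchain_community.document_loaders.
from langchain.document_loaders import UnstructuredEPubLoader

loader = UnstructuredEPubLoader("example_data/book.epub")
documents = loader.load_and_split()  # Load the book, then split it into chunks

print(documents[0].page_content)  # Print the first chunk's text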
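
To get a feel for what a loader like this has to work with, it helps to know that an EPUB file is a ZIP archive containing XHTML files. The sketch below uses only the standard library to pull the visible text out of each XHTML entry. It is an illustration under simplifying assumptions, not how LangChain implements its loader: the names `_TextExtractor` and `epub_sections` are made up for this post, and entries are read in archive order rather than in the reading order a full parser would take from the package manifest.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style> and <head> content."""

    _SKIPPED = ("script", "style", "head")

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def epub_sections(source):
    """Return (entry_name, text) pairs for each XHTML/HTML file in an EPUB.

    `source` is a file path or a binary file-like object.
    """
    sections = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextExtractor()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                sections.append((name, " ".join(parser.parts)))
    return sections


# Build a tiny EPUB-like archive in memory so the example runs without a real book.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", "application/epub+zip")
    zf.writestr(
        "OEBPS/ch1.xhtml",
        "<html><head><title>Ch 1</title></head>"
        "<body><h1>Loomings</h1><p>Call me Ishmael.</p></body></html>",
    )
buf.seek(0)

sections = epub_sections(buf)
print(sections)
```

For real books a dedicated loader is the better choice, since it also handles reading order, encodings, and embedded metadata.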

Parsing PDF with LangChain

LangChain offers several ways to load PDFs, including the PyPDF, PyMuPDF, MathPix, and Unstructured loaders. Each loads a PDF into a list of documents that carry the text in page_content along with metadata.

PyMuPDF example:

# Requires: pip install pymupdf
from langchain.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("example_data/document.pdf")
pages = loader.load_and_split()

print(pages[0].page_content)  # Print the first page's text
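
Since the goal is a training dataset, one possible next step is to write the loaded pages to a JSONL file, one record per page. The sketch below is only an assumption about how such a dataset might be laid out, not something LangChain prescribes. `Page` is a stand-in with the same two attributes used above (`page_content` and `metadata`) so the example runs without LangChain installed, and `pages_to_jsonl`, `min_chars`, and the record keys are hypothetical names.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Page:
    """Stand-in for a loaded document: text plus a metadata dict."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_jsonl(pages, min_chars=1):
    """Serialize pages to a JSONL string, skipping pages with too little text."""
    lines = []
    for page in pages:
        text = " ".join(page.page_content.split())  # collapse whitespace runs
        if len(text) < min_chars:
            continue  # drop blank or near-blank pages
        record = {
            "text": text,
            "source": page.metadata.get("source"),
            "page": page.metadata.get("page"),
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


sample_pages = [
    Page("Call me   Ishmael.\n", {"source": "document.pdf", "page": 0}),
    Page("   ", {"source": "document.pdf", "page": 1}),  # blank page, dropped
]
jsonl = pages_to_jsonl(sample_pages)
print(jsonl)
```

Because the function only reads `page_content` and `metadata`, the list returned by a loader could be passed in place of `sample_pages`.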

Gutenberg Open Books Dataset

Project Gutenberg offers more than 60,000 free eBooks, typically as plain text, HTML, and EPUB. The Python gutenberg library downloads the plain-text editions and cleans them up.

Example:

from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

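text = strip_headers(load_etext(2701)).strip()
print(text[:500])  # Preview the first 500 characters instead of the whole book

load_etext(2701) downloads the book with ID 2701 (Moby Dick), and strip_headers() removes the Project Gutenberg header and footer.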
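
To make the header and footer removal concrete, here is a minimal sketch of the idea using only the standard library. It assumes the text contains marker lines of the form "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" and "*** END OF THE PROJECT GUTENBERG EBOOK ... ***"; the function name `strip_gutenberg_boilerplate` is hypothetical, and the gutenberg library's own strip_headers() is not implemented this way and covers more marker variants.

```python
import re

# Marker lines assumed to delimit the book text (older files say THIS instead of THE).
_START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
_END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the START and END marker lines, if present."""
    start = _START.search(text)
    begin = start.end() if start else 0
    end = _END.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


sample = (
    "The Project Gutenberg eBook of Moby Dick\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK ***\n"
    "\n"
    "Call me Ishmael.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK MOBY DICK ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Text without either marker is returned unchanged apart from trimmed whitespace, so the function is safe to run over files from other sources.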

Finding Free PDFs on Google

Use Google’s filetype: search operator to find PDFs of a book, e.g. "Moby Dick filetype:pdf".
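
When searching for many titles, it can help to generate these queries programmatically. The sketch below builds the search URL with the standard library; `pdf_search_url` is a hypothetical helper name, and the code only constructs the link for opening in a browser, it does not fetch or scrape results.

```python
from urllib.parse import urlencode


def pdf_search_url(title):
    """Build a Google search URL that restricts results to PDF files."""
    query = f"{title} filetype:pdf"
    return "https://www.google.com/search?" + urlencode({"q": query})


url = pdf_search_url("Moby Dick")
print(url)
```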

LangChain provides tools for parsing EPUB and PDF files, which supports tasks like building book apps, document management systems, or text analysis. Resources like Project Gutenberg and Google search make many free books easy to find.