I wanted to scrape the 特集一覧 (feature article list) section of the NHK news site and make a nicely formatted ebook from the data.
See the repo here.
Below is what it looks like.
Over 600 articles, going back just over a year, nicely formatted without distractions!
Looking in the network panel, the site fetches each page of articles from a JSON file, starting with new_tokushu_001.json. This saves us from having to scrape an article list out of the HTML.
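Judging by the fields the code below reads, each of those JSON files looks roughly like this (structure inferred from the code; values elided):
{
  "channel": {
    "hasNext": true,
    "item": [
      { "title": "…", "pubDate": "…", "link": "…", "imgPath": "…" }
    ]
  }
}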
Get a list of links to all of the articles.
async function getLinks() {
  const items = [];
  for (let i = 1; ; i++) {
    const page = i.toString().padStart(3, "0");
    console.log(`Getting json ${page}`);
    const url = `${NHK_NEWS_URL}/json16/tokushu/new_tokushu_${page}.json`;
    const response = await fetch(url);
    const data = await response.json();
    items.push(...data.channel.item);
    if (!data.channel.hasNext) {
      break;
    }
    console.log("Sleeping 1s");
    await sleep(1000);
  }
  const links = items.map((item, index) => ({
    index,
    url: `${NHK_NEWS_URL}/${item.link}`,
    item,
  }));
  console.log("Writing to output/links.json");
  await writeUtf8("output/links.json", toJson(links));
}
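The small helpers used throughout these snippets (sleep, readUtf8/writeUtf8, toJson/fromJson and friends) aren't shown; a sketch of what they could be, not necessarily the exact versions in the repo:
import fs from "node:fs/promises";
import { readFileSync } from "node:fs";

// Pause between requests so we don't hammer the site.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Thin wrappers around the fs API and JSON serialisation.
const readUtf8 = (p) => fs.readFile(p, "utf8");
const readUtf8Sync = (p) => readFileSync(p, "utf8");
const writeUtf8 = (p, data) => fs.writeFile(p, data, "utf8");
const toJson = (value) => JSON.stringify(value, null, 2);
const fromJson = (text) => JSON.parse(text);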
Use Puppeteer to get each article after it has loaded, and run Mozilla's Readability.js on it.
NOTE: The code we pass to Puppeteer runs in the browser context, so we also need to provide any libraries we wish to use. In this case Readability.js is just a single file, so I read it in from node_modules, but in more complicated situations it's probably worth bundling whatever you need and passing along that bundle.
const READABILITY_JS = readUtf8Sync("node_modules/@mozilla/readability/Readability.js");
async function getPages() {
  const links = fromJson(await readUtf8("output/links.json"));
  const browser = await startBrowser();
  const pages = [];
  for (const link of links) {
    console.log(`Getting page ${link.index}`);
    const url = link.url;
    const page = await browser.newPage();
    await page.goto(url, {
      waitUntil: "networkidle0",
    });
    // Inject Readability.js into the page and parse the loaded document.
    const readability = await page.evaluate(`
      (() => {
        ${READABILITY_JS}
        return new Readability(document).parse();
      })();
    `);
    pages.push({
      ...link,
      readability,
    });
    // Close the tab so we don't accumulate hundreds of open pages.
    await page.close();
    console.log("Sleeping 1s");
    await sleep(1000);
  }
  console.log("Writing to output/pages.json");
  await writeUtf8("output/pages.json", toJson(pages));
  await browser.close();
}
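startBrowser isn't shown above; it's presumably just a thin wrapper around Puppeteer's launcher, something like:
import puppeteer from "puppeteer";

// Launch a headless browser; the caller is responsible for closing it when done.
async function startBrowser() {
  return puppeteer.launch();
}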
Get a list of images to download, and replace image sources with file URLs. Format the publishing date as YYYY年M月D日 H時mm分, same as on the website. Also map the data to a nice shape for an article object.
async function mapToArticles() {
  const pages = fromJson(await readUtf8("output/pages.json"));
  const articles = [];
  const imagesToFetch = [];
  for (const page of pages) {
    console.log(`Mapping page ${page.index}`);
    // Article thumbnail image.
    const thumbnailSrc = `${NHK_NEWS_URL}/${page.item.imgPath}`;
    const thumbnailFileName = `${sha256sum(thumbnailSrc)}.jpeg`;
    imagesToFetch.push({ url: thumbnailSrc, path: `{{IMAGE_DIR}}/${thumbnailFileName}` });
    const newThumbnailSrc = `file://{{IMAGE_DIR}}/${thumbnailFileName}`;
    // Images in article body.
    const document = new JSDOM(page.readability.content).window.document;
    document.querySelectorAll("img").forEach((element) => {
      const src = element.getAttribute("src");
      const fileName = `${sha256sum(src)}.jpeg`;
      imagesToFetch.push({ url: src, path: `{{IMAGE_DIR}}/${fileName}` });
      const newSrc = `file://{{IMAGE_DIR}}/${fileName}`;
      element.setAttribute("src", newSrc);
    });
    articles.push({
      index: page.index,
      url: page.url,
      title: page.item.title,
      time: formatTimestamp(page.item.pubDate),
      image: newThumbnailSrc,
      content: document.body.innerHTML,
    });
  }
  console.log("Writing to output/articles.json");
  await writeUtf8("output/articles.json", toJson(articles));
  console.log("Writing to output/images-to-fetch.json");
  await writeUtf8("output/images-to-fetch.json", toJson(imagesToFetch));
}
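sha256sum and formatTimestamp are small helpers too. A sketch of what they might look like, assuming pubDate is something the Date constructor can parse and using Node's built-in crypto module:
import { createHash } from "node:crypto";

// Hash an image URL into a stable, filesystem-safe file name.
const sha256sum = (text) => createHash("sha256").update(text).digest("hex");

// Format a date as YYYY年M月D日 H時mm分, matching the NHK site.
function formatTimestamp(pubDate) {
  const d = new Date(pubDate);
  const minutes = d.getMinutes().toString().padStart(2, "0");
  return `${d.getFullYear()}年${d.getMonth() + 1}月${d.getDate()}日 ${d.getHours()}時${minutes}分`;
}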
Download all of the images we previously scanned for.
async function downloadImages() {
  await mkdir("output/images");
  const imageDir = path.resolve("output/images");
  const images = fromJson(await readUtf8("output/images-to-fetch.json"));
  const failures = [];
  for (const image of images) {
    try {
      const imagePath = image.path.replace("{{IMAGE_DIR}}", imageDir);
      console.log(`Downloading ${image.url} to ${imagePath}`);
      await download(image.url, imagePath);
    } catch {
      console.log("Failed");
      failures.push(image);
    }
    console.log("Sleeping 0.1s");
    await sleep(100);
  }
  if (failures.length > 0) {
    console.log("Writing to output/failures.json");
    await writeUtf8("output/failures.json", toJson(failures));
  }
  console.log("Done");
}
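download is the one helper doing real work here; a minimal version using the built-in fetch (Node 18+) and fs/promises could be:
import { writeFile } from "node:fs/promises";

// Fetch a URL and write the raw bytes to disk, throwing on HTTP errors
// so the caller's try/catch can record the failure.
async function download(url, filePath) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${url}`);
  }
  await writeFile(filePath, Buffer.from(await response.arrayBuffer()));
}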
Output one big HTML file with absolute paths for the next step.
async function buildBook() {
  const imageDir = path.resolve("output/images");
  const articles = fromJson(await readUtf8("output/articles.json"));
  const body = articles.map((article) => {
    const image = article.image.replace("{{IMAGE_DIR}}", imageDir);
    const content = article.content.replace(/{{IMAGE_DIR}}/g, imageDir);
    return `
      <h1>${article.title}</h1>
      <img src="${image}">
      <p>${article.time}</p>
      ${content}
    `;
  }).join("\n");
  console.log("Writing to output/nhk.html");
  await writeUtf8("output/nhk.html", body);
}
Use pandoc to build an EPUB.
$ pandoc nhk.html -o nhk.epub
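If pandoc complains that the document has no title (EPUBs want one for their metadata), it can be supplied on the command line, e.g.:
$ pandoc nhk.html -o nhk.epub --metadata title="NHK 特集"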
Load the book up in calibre and convert. I noticed that one article was a bit messed up, since it was just an aggregation of links to other articles, so I deleted it. I also regenerated the navigation/table of contents so each article was a chapter with no subsections, and compressed the images.
NOTE: I ran mogrify -resize 320x\> -colorspace gray *.jpeg on the images first, then used the built-in compressor in calibre.