I wanted to scrape the 特集一覧 (feature article list) section of the NHK news site and make a nicely formatted ebook from the data.
See the repo here.
Below is what it looks like.
Over 600 articles, going back just over a year, nicely formatted without distractions!
Looking in the network panel, the site fetches each page of articles from a JSON file, starting with new_tokushu_001.json. This saves us from having to scrape an article list out of the HTML.
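Judging by the fields the code below reads, each of those JSON files looks roughly like this (structure inferred from the code; values elided):
{
  "channel": {
    "hasNext": true,
    "item": [
      { "title": "…", "pubDate": "…", "link": "…", "imgPath": "…" }
    ]
  }
}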
Get a list of links to all of the articles.
async function getLinks() {
  const items = [];
  for (let i = 1; ; i++) {
    const page = i.toString().padStart(3, "0");
    console.log(`Getting json ${page}`);
    const url = `${NHK_NEWS_URL}/json16/tokushu/new_tokushu_${page}.json`;
    const response = await fetch(url);
    const data = await response.json();
    items.push(...data.channel.item);
    if (!data.channel.hasNext) {
      break;
    }
    console.log("Sleeping 1s");
    await sleep(1000);
  }
  const links = items.map((item, index) => ({
    index,
    url: `${NHK_NEWS_URL}/${item.link}`,
    item,
  }));
  console.log("Writing to output/links.json");
  await writeUtf8("output/links.json", toJson(links));
}
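The small helpers used throughout these snippets (sleep, readUtf8/writeUtf8, toJson/fromJson and friends) aren't shown; a sketch of what they could be, not necessarily the exact versions in the repo:
import fs from "node:fs/promises";
import { readFileSync } from "node:fs";

// Pause between requests so we don't hammer the site.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Thin wrappers around the fs API and JSON serialisation.
const readUtf8 = (p) => fs.readFile(p, "utf8");
const readUtf8Sync = (p) => readFileSync(p, "utf8");
const writeUtf8 = (p, data) => fs.writeFile(p, data, "utf8");
const toJson = (value) => JSON.stringify(value, null, 2);
const fromJson = (text) => JSON.parse(text);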
Use Puppeteer to get each article after it has loaded, and run Mozilla's Readability.js on it.
NOTE: The code we pass to Puppeteer runs in the browser context, so we also need to provide any libraries we wish to use. In this case Readability.js is just a single file, so I read it in from node_modules, but in more complicated situations it's probably worth bundling whatever you need and passing along that bundle.
const READABILITY_JS = readUtf8Sync("node_modules/@mozilla/readability/Readability.js");
async function getPages() {
  const links = fromJson(await readUtf8("output/links.json"));
  const browser = await startBrowser();
  const pages = [];
  for (const link of links) {
    console.log(`Getting page ${link.index}`);
    const url = link.url;
    const page = await browser.newPage();
    await page.goto(url, {
      waitUntil: "networkidle0",
    });
    // Inject Readability.js into the page and parse the loaded document.
    const readability = await page.evaluate(`
      (() => {
        ${READABILITY_JS}
        return new Readability(document).parse();
      })();
    `);
    pages.push({
      ...link,
      readability,
    });
    // Close the tab so we don't accumulate hundreds of open pages.
    await page.close();
    console.log("Sleeping 1s");
    await sleep(1000);
  }
  console.log("Writing to output/pages.json");
  await writeUtf8("output/pages.json", toJson(pages));
  await browser.close();
}
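startBrowser isn't shown above; it's presumably just a thin wrapper around Puppeteer's launcher, something like:
import puppeteer from "puppeteer";

// Launch a headless browser; the caller is responsible for closing it when done.
async function startBrowser() {
  return puppeteer.launch();
}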
Get a list of images to download, and replace image sources with file URLs. Format the publishing date as YYYY年M月D日 H時mm分, same as on the website. Also map the data to a nice shape for an article object.
async function mapToArticles() {
  const pages = fromJson(await readUtf8("output/pages.json"));
  const articles = [];
  const imagesToFetch = [];
  for (const page of pages) {
    console.log(`Mapping page ${page.index}`);
    // Article thumbnail image.
    const thumbnailSrc = `${NHK_NEWS_URL}/${page.item.imgPath}`;
    const thumbnailFileName = `${sha256sum(thumbnailSrc)}.jpeg`;
    imagesToFetch.push({ url: thumbnailSrc, path: `{{IMAGE_DIR}}/${thumbnailFileName}` });
    const newThumbnailSrc = `file://{{IMAGE_DIR}}/${thumbnailFileName}`;
    // Images in article body.
    const document = new JSDOM(page.readability.content).window.document;
    document.querySelectorAll("img").forEach((element) => {
      const src = element.getAttribute("src");
      const fileName = `${sha256sum(src)}.jpeg`;
      imagesToFetch.push({ url: src, path: `{{IMAGE_DIR}}/${fileName}` });
      const newSrc = `file://{{IMAGE_DIR}}/${fileName}`;
      element.setAttribute("src", newSrc);
    });
    articles.push({
      index: page.index,
      url: page.url,
      title: page.item.title,
      time: formatTimestamp(page.item.pubDate),
      image: newThumbnailSrc,
      content: document.body.innerHTML,
    });
  }
  console.log("Writing to output/articles.json");
  await writeUtf8("output/articles.json", toJson(articles));
  console.log("Writing to output/images-to-fetch.json");
  await writeUtf8("output/images-to-fetch.json", toJson(imagesToFetch));
}
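sha256sum and formatTimestamp are small helpers too. A sketch of what they might look like, assuming pubDate is something the Date constructor can parse and using Node's built-in crypto module:
import { createHash } from "node:crypto";

// Hash an image URL into a stable, filesystem-safe file name.
const sha256sum = (text) => createHash("sha256").update(text).digest("hex");

// Format a date as YYYY年M月D日 H時mm分, matching the NHK site.
function formatTimestamp(pubDate) {
  const d = new Date(pubDate);
  const minutes = d.getMinutes().toString().padStart(2, "0");
  return `${d.getFullYear()}年${d.getMonth() + 1}月${d.getDate()}日 ${d.getHours()}時${minutes}分`;
}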
Download all of the images we previously scanned for.
async function downloadImages() {
  await mkdir("output/images");
  const imageDir = path.resolve("output/images");
  const images = fromJson(await readUtf8("output/images-to-fetch.json"));
  const failures = [];
  for (const image of images) {
    try {
      const imagePath = image.path.replace("{{IMAGE_DIR}}", imageDir);
      console.log(`Downloading ${image.url} to ${imagePath}`);
      await download(image.url, imagePath);
    } catch {
      console.log("Failed");
      failures.push(image);
    }
    console.log("Sleeping 0.1s");
    await sleep(100);
  }
  if (failures.length > 0) {
    console.log("Writing to output/failures.json");
    await writeUtf8("output/failures.json", toJson(failures));
  }
  console.log("Done");
}
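download is the one helper doing real work here; a minimal version using the built-in fetch (Node 18+) and fs/promises could be:
import { writeFile } from "node:fs/promises";

// Fetch a URL and write the raw bytes to disk, throwing on HTTP errors
// so the caller's try/catch can record the failure.
async function download(url, filePath) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${url}`);
  }
  await writeFile(filePath, Buffer.from(await response.arrayBuffer()));
}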
Output one big HTML file with absolute paths for the next step.
async function buildBook() {
  const imageDir = path.resolve("output/images");
  const articles = fromJson(await readUtf8("output/articles.json"));
  const body = articles.map((article) => {
    const image = article.image.replace("{{IMAGE_DIR}}", imageDir);
    const content = article.content.replace(/{{IMAGE_DIR}}/g, imageDir);
    return `
      <h1>${article.title}</h1>
      <img src="${image}">
      <p>${article.time}</p>
      ${content}
    `;
  }).join("\n");
  console.log("Writing to output/nhk.html");
  await writeUtf8("output/nhk.html", body);
}
Use pandoc to build an EPUB.
$ pandoc nhk.html -o nhk.epub
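If pandoc complains that the document has no title (EPUBs want one for their metadata), it can be supplied on the command line, e.g.:
$ pandoc nhk.html -o nhk.epub --metadata title="NHK 特集"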
Load the book up in calibre and convert. I noticed that one article was a bit messed up, since it was just an aggregation of links to other articles, so I deleted it. I also regenerated the navigation/table of contents so each article was a chapter with no subsections, and compressed the images.
NOTE: I ran mogrify -resize 320x\> -colorspace gray *.jpeg on the images first, then used the built-in compressor in calibre.