---
layout: page
title: "Hack of the day: extracting comments from Nicovideo"
date: 2019-01-03 23:07:42+0100
comments: true
categories:
  - General
  - Anime
tags:
  - Linux
  - anime
  - hacks
---

A while ago I posted a way to download 実況動画 (live-commentary videos) from Nicovideo. One of my acquaintances publishes videos there, and he wanted to collect the various comments to reply to them in a follow-up video (what Nico users call a コメント返し, a "comment reply" video). When there are more than just a few comments, it's hard to tell what they refer to without watching the video on Nicovideo itself, and its UI is suboptimal. Ideally one would check the comments ordered by their position in the video. Existing programs do this, but their user interfaces are utterly confusing at best. This called for something simpler, and that's what I tried to do.

## Accessing the Nicovideo comments

It turned out that a Python 3.x module to access the Nicovideo API (coupled with scraping), nicotools, already exists. It even offers the ability to download the comments! Its main drawbacks, for my purposes, are that it only saves the comments as raw JSON or raw XML (it does more than that, but I only considered my needs), and that its documentation is a little outdated (I'll get to that in a bit).

A typical example of the JSON it fetches is below:


```json
{
  "chat": {
    "thread": "1546445582",
    "no": 1,
    "vpos": 861,
    "date": 1546468307,
    "date_usec": 432985,
    "premium": 1,
    "anonymity": 1,
    "user_id": "cT63iaz7BkGIQOV2aOebGQJs_nA",
    "mail": "184",
    "content": "うぽつ"
  }
}
```

There are several keys, but the ones of interest are the ones marked `chat`. The text of the comment is in the `content` field; in this case うぽつ is short for アップロードおつかれさまです, that is, "thanks for uploading" (very loosely translated). `no` is a progressive number that tells us, in this case, that the comment is the first one (not necessarily at the start of the video). `vpos` is important, but weird: it tells where the comment appears in the video, but its unit is hundredths of a second (10 ms) since the start (from 0), so this comment appeared at 8610 ms into the video. The other fields, as described here, are:

- `mail`: a value of 184 indicates an anonymous comment
- `user_id`: the user ID of the commenter, or a random string in case of an anonymous comment
- `anonymity`: self-explanatory
- `date`: date of the comment as a UNIX timestamp
- `date_usec`: microsecond part of the date (as far as I understood)
- `thread`: ID of the conversation

## Making it into something legible

As you saw, that information is not very useful for humans. The first thing to do is to convert the values into something legible. In my case, I needed just the comment and the timestamp, so something like this:


```python
from datetime import datetime

def process_element(element):

    element = element["chat"]

    video_pos = element["vpos"] * 10  # vpos is in units of 10 ms
    comment = element["content"]
    video_time = datetime.fromtimestamp(video_pos / 1000).strftime("%M:%S.%f")

    return (video_time, comment)
```

What this does is convert the `vpos` field into milliseconds, use `datetime.fromtimestamp` to turn it into a proper datetime object, then format it as minutes and seconds (Nicovideo downgrades every video longer than 30 minutes to 360p quality, so videos longer than an hour are unlikely).
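One caveat: `datetime.fromtimestamp` interprets its argument in the local timezone, which happens to be harmless for whole-hour UTC offsets but would shift the minutes elsewhere (e.g. UTC+5:30). A timezone-independent sketch of the same conversion, using plain integer arithmetic (`process_element_tz_safe` is my name for it, not part of the script):

```python
def process_element_tz_safe(element):
    """Like process_element, but independent of the local timezone."""
    chat = element["chat"]
    total_us = chat["vpos"] * 10 * 1000  # vpos units (10 ms) -> microseconds
    minutes, rem = divmod(total_us, 60_000_000)
    seconds, microseconds = divmod(rem, 1_000_000)
    # Same "%M:%S.%f" shape as the strftime-based version
    video_time = "%02d:%02d.%06d" % (minutes, seconds, microseconds)
    return (video_time, chat["content"])
```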

This can then be wrapped up nicely, assuming we have saved the JSON representation:


```python
import simplejson as json  # plain old json will suffice too

with open(filename) as handle:
    data = json.load(handle)
    # Get rid of the other elements
    valid_elements = [item for item in data if "chat" in item]
    entries = [process_element(item) for item in valid_elements]
    # Sort from earliest to latest, crude but effective
    entries.sort(key=lambda x: x[0])

Afterwards, it's trivial to write the results as a tab-delimited file.
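A minimal sketch of that step (the helper name is mine):

```python
def write_tsv(entries, path):
    """Write (time, comment) tuples as a tab-delimited file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Time\tComment\n")
        handle.writelines("\t".join(entry) + "\n" for entry in entries)

write_tsv([("00:08.610000", "うぽつ")], "comments.tsv")
```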

## Subtitles

I then thought that perhaps it would be useful to have the comments as subtitles, because that way they could be superimposed on the video (like Nicovideo does) to gather more context, but without using Nicovideo's awful video player. A quick search led me to the pysrt library, which can create SRT files relatively easily. Generating an SRT file matching the video is then quite simple:


```python
import pysrt

def build_srt(data):

    srt_file = pysrt.SubRipFile()
    for index, processed_content in enumerate(data):
        time, text = processed_content
        # HACK: reparse because we already have sorted data
        time = datetime.strptime(time, "%M:%S.%f")
        subtime_st = pysrt.SubRipTime(minutes=time.minute, seconds=time.second,
                                      milliseconds=time.microsecond / 1000)
        # Show each comment for two seconds
        subtime_end = pysrt.SubRipTime(minutes=time.minute,
                                       seconds=time.second + 2,
                                       milliseconds=time.microsecond / 1000)
        entry = pysrt.SubRipItem(index, subtime_st, subtime_end, text=text)
        srt_file.append(entry)

    return srt_file


srt_file = build_srt(entries)
srt_file.save(srt_name)
```

The function takes a list like the entries we generated above, containing (time, comment) tuples, and returns a SubRipFile instance.

Each entry in an SRT file has its own index, here starting from 0, so we just rely on enumerate to get the right one. Having sorted the list beforehand makes it simple to preserve the correct ordering: this is the reason for the hacky reparsing of the date (better solutions welcome!). The generated file can then be used with mpv or vlc (the latter has some trouble, though) to display the comments superimposed on the video.
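Speaking of better solutions: since `vpos` is already numeric, one could sort on it directly and only format afterwards, avoiding the reparse entirely. A sketch (not how the script below does it; the function name is mine):

```python
def process_elements_sorted(elements):
    """Sort chat entries by their numeric position instead of a formatted string."""
    chats = [item["chat"] for item in elements if "chat" in item]
    chats.sort(key=lambda chat: chat["vpos"])  # vpos is an integer: no reparsing needed
    # Keep the position in milliseconds alongside the comment text
    return [(chat["vpos"] * 10, chat["content"]) for chat in chats]
```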

## Automating things

So far the process takes two steps: downloading the file, then parsing it. It would be nicer to skip saving the JSON file and process the data directly after obtaining it. nicotools can be used as a library, but unfortunately its API does not provide such a feature. However, everything is defined as a series of classes, so it's easy to extend the Comment class, which fetches the comments, for our purpose, using inheritance:


```python
import asyncio

import simplejson as json
from nicotools.download import Comment, utils

class CommentStream(Comment):

    # If for any reason we don't want SRT output
    srt = False

    def saver(self, video_id: str, is_xml: bool,
              coroutine: asyncio.Task) -> bool:

        # This only works with JSON output
        if is_xml:
            super().saver(video_id, is_xml, coroutine)
            return True

        comment_data = coroutine.result()

        data = json.loads(comment_data)
        contents = [process_element(item)
                    for item in data if "chat" in item]

        # "extention" (sic) is how nicotools spells the keyword
        file_path = utils.make_name(self.glossary[video_id], "",
                                    extention="txt")
        file_srt = utils.make_name(self.glossary[video_id], "",
                                   extention="srt")

        contents.sort(key=lambda x: x[0])

        with file_path.open("w", encoding="utf-8") as f:
            f.write("Time\tComment\n")
            f.writelines("\t".join(item) + "\n" for item in contents)

        if self.srt:
            srt_data = build_srt(contents)
            srt_data.save(str(file_srt))

        return True
```

nicotools uses asyncio to perform its tasks, and saving the data is handled by the saver method. Thus, we create a subclass of Comment and override its saver method: there we build our data and save our results, using the helpers provided by nicotools, which create a pathlib.Path instance from the video name (extracted from self.glossary[video_id]).
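The pattern itself is ordinary Python inheritance: intercept the call, handle the case you care about, and delegate the rest to the parent. A stripped-down illustration with made-up classes (nothing here is nicotools API):

```python
class Saver:
    # Stand-in for nicotools' Comment: knows how to save raw data
    def saver(self, data, is_xml):
        return "raw: %s" % data

class JSONSaver(Saver):
    def saver(self, data, is_xml):
        if is_xml:
            # Not our case: fall back to the parent implementation
            return super().saver(data, is_xml)
        return "processed: %s" % data
```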

## Wrapping it up

With our modified class, we now need to actually connect to Nicovideo and download the comments. For this job, nicotools requires video IDs, which on Nico are in the form smXXXXX. To make things easier for my acquaintance, I thought it would be better to extract them from one or more URLs, so an easy helper function was born:


```python
import os
from urllib.parse import urlparse

def extract_ids(urls):

    video_ids = list()

    for url in urls:
        parsed = urlparse(url)
        # The last path element is always the video ID
        nico_id = os.path.split(parsed.path)[-1]
        video_ids.append(nico_id)

    return video_ids
```
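As a quick check with a made-up URL (the ID here is invented), the last path component is indeed the video ID:

```python
import os
from urllib.parse import urlparse

url = "https://www.nicovideo.jp/watch/sm12345678"  # made-up ID
nico_id = os.path.split(urlparse(url).path)[-1]
print(nico_id)  # sm12345678
```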

We can then get to the Real Thing(TM) and download the comments. First of all, we need a valid username and password. Since my acquaintance uses youtube-dl, which parses netrc files, I just used the netrc module from the standard library to extract what was needed:


```python
import netrc

user, _, password = netrc.netrc().hosts["niconico"]
```

After taking care of that, I got stumped, because when testing the modified class (but also the original!) with the examples in the documentation, I always got an exception stating that the session had expired. A look at the actual code of nicotools showed me that the examples were wrong: before doing anything, we need to establish a sort of persistent session that will allow us to download things. This is done by the Info class in nicotools:


```python
from nicotools.download import Info

# video_ids is a list of smXXX IDs
database = Info(video_ids, mail=user, password=password).info
```

We then need to pass this data to our CommentStream class and let it handle the rest:

```python
com = CommentStream(database, user, password, save_dir=my_dest_dir)
com.srt = True  # If needed
com.start()
```

start() will authenticate, parse, and download what is needed, then execute our saver method and save the results.

There! Like this we can get the comments, have them in a human-readable format, and optionally also produce an SRT file to superimpose them on the video.

## The full script

The script I built from the pieces I showed here is reproduced below. It is licensed under the BSD license, and even though it worked for me (and whoever I made it for) your mileage may vary.

It requires:

- Python 3.6 (required by nicotools)
- nicotools (and its dependencies)
- pysrt
- simplejson

```python
#!/usr/bin/python3
# Copyright 2018 Luca Beltrame
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from this
# software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

import argparse
import asyncio
from datetime import datetime
import netrc
import os
from urllib.parse import urlparse

import pysrt
from nicotools.download import Comment, Info, utils
import simplejson as json


def process_element(element):

    element = element["chat"]

    video_pos = element["vpos"] * 10  # vpos is in units of 10 ms
    comment = element["content"]
    video_time = datetime.fromtimestamp(video_pos / 1000).strftime("%M:%S.%f")

    return (video_time, comment)


def extract_ids(urls):

    video_ids = list()

    for url in urls:
        parsed = urlparse(url)
        # The last path element is always the video ID
        nico_id = os.path.split(parsed.path)[-1]
        video_ids.append(nico_id)

    return video_ids


def build_srt(data):

    srt_file = pysrt.SubRipFile()
    for index, processed_content in enumerate(data):
        time, text = processed_content
        # HACK: reparse because we already have sorted data
        time = datetime.strptime(time, "%M:%S.%f")
        subtime_st = pysrt.SubRipTime(minutes=time.minute, seconds=time.second,
                                      milliseconds=time.microsecond / 1000)
        # Show each comment for two seconds
        subtime_end = pysrt.SubRipTime(minutes=time.minute,
                                       seconds=time.second + 2,
                                       milliseconds=time.microsecond / 1000)
        entry = pysrt.SubRipItem(index, subtime_st, subtime_end, text=text)
        srt_file.append(entry)

    return srt_file


class CommentStream(Comment):

    # If for any reason we don't want SRT output
    srt = False

    def saver(self, video_id: str, is_xml: bool,
              coroutine: asyncio.Task) -> bool:

        # This only works with JSON output
        if is_xml:
            super().saver(video_id, is_xml, coroutine)
            return True

        comment_data = coroutine.result()

        data = json.loads(comment_data)
        contents = [process_element(item)
                    for item in data if "chat" in item]

        # "extention" (sic) is how nicotools spells the keyword
        file_path = utils.make_name(self.glossary[video_id], "",
                                    extention="txt")
        file_srt = utils.make_name(self.glossary[video_id], "",
                                   extention="srt")

        contents.sort(key=lambda x: x[0])

        with file_path.open("w", encoding="utf-8") as f:
            f.write("Time\tComment\n")
            f.writelines("\t".join(item) + "\n" for item in contents)

        if self.srt:
            srt_data = build_srt(contents)
            srt_data.save(str(file_srt))

        return True


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("video", help="Video URL(s)", nargs="+")
    parser.add_argument("-d", "--destination", help="Destination directory",
                        default="./")
    parser.add_argument("--no-srt", action="store_false",
                        help="Don't generate SRT")

    options = parser.parse_args()
    user, _, password = netrc.netrc().hosts["niconico"]
    video_ids = extract_ids(options.video)

    database = Info(video_ids, mail=user, password=password).info
    com = CommentStream(database, user, password,
                        save_dir=options.destination)
    com.srt = options.no_srt
    com.start()


if __name__ == "__main__":
    main()
```