---
categories:
- General
- Anime
comments: true
date: 2019-01-03 23:07:42+0100
layout: page
tags:
- Linux
- anime
- hacks
title: 'Hack of the day: extracting comments from Nicovideo'
---
A [while ago](https://www.dennogumi.org/2018/01/hack-of-the-day-downloading-jikkyou-from-nicovideo/) I posted a way to download 実況動画 (live-commentary videos) from Nicovideo. One of my acquaintances publishes videos there, and he wanted to collect the various comments so he could reply to them in a follow-up video (what Nico users call a コメント返し, "comment reply"). When there are more than just a few comments, it's hard to tell which part of the video they refer to without watching them on Nicovideo itself, and its UI is suboptimal. Ideally one would browse the comments ordered by their position in the video. There are [existing programs](http://xenog.web.fc2.com/) for this, but their user interfaces are utterly confusing at best. This called for something simpler, and that's what I tried to do.
## Accessing the Nicovideo comments
It turned out that [a Python 3.x module to access the Nicovideo API, coupled with scraping](https://pypi.org/project/nicotools/), already exists: nicotools. It even offers the ability to download the comments! Its main drawbacks, at least for my needs, are that it only saves the comments as raw JSON or raw XML (it does more than that, but I only considered my use case), and that the documentation is a little outdated (I'll get to that in a bit).
A typical example of the JSON it fetches is below:
```json
{
"chat": {
"thread": "1546445582",
"no": 1,
"vpos": 861,
"date": 1546468307,
"date_usec": 432985,
"premium": 1,
"anonymity": 1,
"user_id": "cT63iaz7BkGIQOV2aOebGQJs_nA",
"mail": "184",
"content": "うぽつ"
}
}
```
There are several keys, but the ones of interest are those under `chat`. The text of the comment is in the `content` field: in this case うぽつ is short for アップロードおつかれさまです, that is, "thanks for uploading" (*very* loosely translated). `no` is a progressive number that tells us, in this case, that the comment is the first one (not necessarily at the start of the video). `vpos` is important, but weird: it tells where the comment appears in the video, but its unit is hundredths of a second from the start (counting from 0), so this comment appeared 8610 ms into the video. The other fields, as described [here](https://qiita.com/tor4kichi/items/4df5b11ec564bb8f8d16), are:
- `mail`: a value of 184 indicates an anonymous comment
- `user_id`: the user ID of the commenter, or a random string in case of an anonymous comment
- `anonymity`: self-explanatory
- `date`: date of the comment as a UNIX timestamp
- `date_usec`: microsecond part of the date (as far as I understood)
- `thread`: ID of the conversation
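The `vpos` arithmetic is easy to check by hand for the sample comment above:

```python
vpos = 861                 # from the JSON above, in units of 10 ms
milliseconds = vpos * 10   # 8610 ms into the video
minutes, rest = divmod(milliseconds, 60_000)
seconds = rest / 1000
print(f"{minutes:02d}:{seconds:05.2f}")  # prints 00:08.61
```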
## Making it into something legible
As you saw, the information there is not very human-friendly. The first thing to do is convert the values into something useful. In my case I needed just the comment and the timestamp, so something like this:
```python
from datetime import datetime
def process_element(element):
element = element["chat"]
video_pos = element["vpos"] * 10
comment = element["content"]
video_time = datetime.fromtimestamp(video_pos / 1000).strftime("%M:%S.%f")
return (video_time, comment)
```
What this does is convert the `vpos` field into milliseconds, then use `datetime.fromtimestamp` to turn it into a proper `datetime` object,
and finally format it as minutes and seconds (Nicovideo downgrades every video longer than 30 minutes to 360p quality, so videos longer than an hour are unlikely).
This can then be wrapped up nicely, assuming we have the JSON representation:
```python
import simplejson as json # plain old json will suffice too
with open(filename) as handle:
data = json.load(handle)
# Get rid of the other elements
valid_elements = [item for item in data if "chat" in item]
entries = [process_element(item) for item in valid_elements]
# Sort from earliest to latest, crude but effective
entries.sort(key=lambda x: x[0])
```
Afterwards, it's trivial to write the results as a tab-delimited file.
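That last step is indeed short; a minimal sketch (the sample entries and the `comments.txt` file name are made up for illustration):

```python
# Sample (time, comment) tuples, as produced by process_element
entries = [("00:08.610000", "うぽつ"), ("00:12.340000", "いい動画でした")]

with open("comments.txt", "w", encoding="utf-8") as handle:
    handle.write("Time\tComment\n")
    handle.writelines("\t".join(entry) + "\n" for entry in entries)
```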
## Subtitles
I then thought it would be useful to have the comments as subtitles: that way they could be superimposed on the video (like Nicovideo does) to provide more context, but without using Nicovideo's awful video player. A quick search led me to the [pysrt library](https://pypi.org/project/pysrt/), which can create SRT files with relatively little effort. Generating an SRT file matching the video is then quite simple:
```python
import pysrt
def build_srt(data):
srt_file = pysrt.SubRipFile()
for index, processed_content in enumerate(data):
time, text = processed_content
# HACK: reparse because we already have sorted data
time = datetime.strptime(time, "%M:%S.%f")
subtime_st = pysrt.SubRipTime(minutes=time.minute, seconds=time.second,
milliseconds=time.microsecond / 1000)
subtime_end = pysrt.SubRipTime(minutes=time.minute,
seconds=time.second + 2,
milliseconds=time.microsecond / 1000)
entry = pysrt.SubRipItem(index, subtime_st, subtime_end, text=text)
srt_file.append(entry)
return srt_file
srt_file = build_srt(valid_elements)
srt_file.save(srt_name)
```
The function takes a list like the `entries` we generated above, containing `(time, comment)` tuples, and returns a `SubRipFile` instance.
Each entry in an SRT file carries its own numeric index, so we just rely on `enumerate` to produce one. Having sorted the list beforehand makes it simple to preserve the correct ordering: this is the reason for the hacky reparsing of the date (better solutions welcome!). The generated file can then be used with `mpv` or `vlc` (the latter has some trouble, though) to display the comments superimposed on the video.
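One cleaner alternative to the reparsing hack (sketched here under my own naming, not what the script below does) is to keep the position as integer milliseconds, sort on that, and only format at the very end:

```python
def process_element_ms(element):
    # Like process_element, but keeps the position numeric
    chat = element["chat"]
    return (chat["vpos"] * 10, chat["content"])

def split_position(position_ms):
    # Break milliseconds into the minute/second/millisecond pieces
    # that a subtitle timestamp needs
    minutes, rest = divmod(position_ms, 60_000)
    seconds, milliseconds = divmod(rest, 1000)
    return minutes, seconds, milliseconds

elements = [{"chat": {"vpos": 1234, "content": "いい動画"}},
            {"chat": {"vpos": 861, "content": "うぽつ"}}]
# Tuples sort by their first item, so this orders comments by position
entries = sorted(process_element_ms(e) for e in elements)
# entries == [(8610, 'うぽつ'), (12340, 'いい動画')]
```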
## Automating things
So far the process takes two steps: downloading the file, then parsing it. It would be nicer to skip saving the JSON to disk and process the data directly after fetching it. nicotools can be used as a library, but unfortunately its API does not offer such a feature. However, everything is defined as a series of classes, so it's easy to extend the `Comment` class used to fetch comments for our purpose, using inheritance:
```python
import asyncio

import simplejson as json  # plain old json will suffice too
from nicotools.download import Comment, utils
class CommentStream(Comment):
# If for any reason we don't want SRT output
srt = False
def saver(self, video_id: str, is_xml: bool,
coroutine: asyncio.Task) -> bool:
# This only works with JSON output
if is_xml:
super().saver(video_id, is_xml, coroutine)
return True
comment_data = coroutine.result()
data = json.loads(comment_data)
contents = [process_element(item)
for item in data if "chat" in item]
file_path = utils.make_name(self.glossary[video_id], "",
extention="txt")
file_srt = utils.make_name(self.glossary[video_id], "",
extention="srt")
contents.sort(key=lambda x: x[0])
with file_path.open("w", encoding="utf-8") as f:
f.write("Time\tComment\n")
            f.writelines("\t".join(item) + "\n" for item in contents)
if self.srt:
srt_data = build_srt(contents)
srt_data.save(str(file_srt))
return True
```
nicotools uses `asyncio` to perform its tasks, and saving the data is handled by the `saver` method. Thus, we create a subclass of `Comment` and override its `saver` method: there we build our data and save the results, using the helpers provided by nicotools, which create a `pathlib.Path` instance from the video name (extracted from `self.glossary[video_id]`).
## Wrapping it up
With our modified class, we now need to actually connect to Nicovideo and download the comments. For this job nicotools requires video IDs, which on Nico are in the form `smXXXXX`. To make things easier for my acquaintance, I thought it would be better to extract them from one or more URLs, so an easy helper function was born:
```python
import os
from urllib.parse import urlparse
def extract_ids(urls):
    video_ids = list()

    for url in urls:
        parsed = urlparse(url)
        # The last path element is always the video ID
        nico_id = os.path.split(parsed.path)[-1]
        video_ids.append(nico_id)

    return video_ids
```
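A quick sanity check of the helper's logic (the video ID is made up):

```python
import os
from urllib.parse import urlparse

url = "https://www.nicovideo.jp/watch/sm12345678"
# urlparse gives us the path, os.path.split the last path element
nico_id = os.path.split(urlparse(url).path)[-1]
# nico_id == "sm12345678"
```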
We can then get to the Real Thing(TM) and download the comments. First of all, we need a valid username and password. Since my acquaintance uses `youtube-dl`, which reads netrc files, I just used the `netrc` module from the standard library to extract what was needed:
```python
import netrc

user, _, password = netrc.netrc().hosts["niconico"]
```
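For reference, the corresponding entry in `~/.netrc` looks like this (the credentials are placeholders; `niconico` is the machine name `youtube-dl` expects for Nicovideo):

```
machine niconico
login your-mail@example.com
password correct-horse-battery-staple
```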
After taking care of that, I got stumped, because when testing the modified class (but also the original!) with the examples in the documentation, I always got an exception stating that the session had expired. A look at the actual code of nicotools showed me that the examples were wrong: before doing anything, we need to establish a sort of persistent session that will allow us to download things. This is done by the `Info` class in nicotools:
```python
from nicotools.download import Info
# video_ids is a list of smXXX IDs
database = Info(video_ids, mail=user, password=password).info
```
We then need to pass these data to our `CommentStream` class and have it handle the rest:
```python
com = CommentStream(database, user, password, save_dir=my_dest_dir)
com.srt = True  # If needed
com.start()
```
`start()` will authenticate, parse, and download what's needed, then it will execute our `saver` method and save the results.
There! Like this we can get the comments, have them in a human-readable format, and optionally also produce an SRT file to superimpose them on the video.
## The full script
The script I built from the pieces shown here is reproduced below. It is licensed under the BSD license, and even though it worked for me (and for the person I made it for), your mileage may vary.
It requires:
- Python 3.6 (required by nicotools)
- nicotools (and its dependencies)
- pysrt
- simplejson
```python
#!/usr/bin/python3
# Copyright 2018 Luca Beltrame
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from this
# software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
import asyncio
import argparse
from datetime import datetime
import netrc
import os
from urllib.parse import urlparse
import pysrt
from nicotools.download import Comment, Info, utils
import simplejson as json
def process_element(element):
element = element["chat"]
video_pos = element["vpos"] * 10
comment = element["content"]
video_time = datetime.fromtimestamp(video_pos / 1000).strftime("%M:%S.%f")
return (video_time, comment)
def extract_ids(urls):
    video_ids = list()

    for url in urls:
        parsed = urlparse(url)
        nico_id = os.path.split(parsed.path)[-1]
        video_ids.append(nico_id)

    return video_ids
def build_srt(data):
srt_file = pysrt.SubRipFile()
for index, processed_content in enumerate(data):
time, text = processed_content
# HACK: reparse because we already have sorted data
time = datetime.strptime(time, "%M:%S.%f")
subtime_st = pysrt.SubRipTime(minutes=time.minute, seconds=time.second,
milliseconds=time.microsecond / 1000)
subtime_end = pysrt.SubRipTime(minutes=time.minute,
seconds=time.second + 2,
milliseconds=time.microsecond / 1000)
entry = pysrt.SubRipItem(index, subtime_st, subtime_end, text=text)
srt_file.append(entry)
return srt_file
class CommentStream(Comment):
# If for any reason we don't want SRT output
srt = False
def saver(self, video_id: str, is_xml: bool,
coroutine: asyncio.Task) -> bool:
# This only works with JSON output
if is_xml:
super().saver(video_id, is_xml, coroutine)
return True
comment_data = coroutine.result()
data = json.loads(comment_data)
contents = [process_element(item)
for item in data if "chat" in item]
file_path = utils.make_name(self.glossary[video_id], "",
extention="txt")
file_srt = utils.make_name(self.glossary[video_id], "",
extention="srt")
contents.sort(key=lambda x: x[0])
with file_path.open("w", encoding="utf-8") as f:
f.write("Time\tComment\n")
            f.writelines("\t".join(item) + "\n" for item in contents)
if self.srt:
srt_data = build_srt(contents)
srt_data.save(str(file_srt))
return True
def main():
parser = argparse.ArgumentParser()
parser.add_argument("video", help="Video URL(s)", nargs="+")
parser.add_argument("-d", "--destination", help="Destination directory",
default="./")
parser.add_argument("--no-srt", action="store_false",
help="Don't generate SRT")
options = parser.parse_args()
user, _, password = netrc.netrc().hosts["niconico"]
video_ids = extract_ids(options.video)
database = Info(video_ids, mail=user, password=password).info
com = CommentStream(database, user, password,
save_dir=options.destination)
com.srt = options.no_srt
com.start()
if __name__ == "__main__":
main()
```