Yande.re is one of the best-known anime/game/CG imageboards, devoted mostly to high-quality scans (artbooks etc.).
It has a strong community (which gives images adequate scores) and a well-organized (not too verbose) tagging system.
Yande.re also offers a wide variety of picture composition and quality: from trash-like partial scans and artbook pages
completely filled with text to clean art, and from bad-quality scans and almost invisible line art to pure full-color digital works.
That is why Yande.re is a good source for studying non-photographic images and their metadata (similar to
Gwern's Danbooru dataset) and for building tools to auto-classify all of that (or simply to make your eyes happy).
This release contains:
- JSON metadata for 618.801 Yande.re posts, from the first post up to ID 700.000 (29.10.2020), except those that failed to download (deleted posts etc.)
* with a simple Python script showing how to grab it
* with a pretty-printed example to illustrate structure and content
- 397.691 "sample" images as prepared by the site, with reasonable quality (samples were introduced at ID=165352, 15.12.2010)
* longer side = 1500 px, but no more than 1.8 MPix
* JPEG quality 92%, with some optimizations applied
- additional TSV (tab separated text) metadata
* key parameters of 403.933 posts, including some calculated stats
~ derived from JSON
~ computed with ImageMagick over the above-mentioned "samples"
* tag list, including some calculated stats and external references
* tag-to-post relation as separate table - 4.261.026 rows
- some database (Oracle), Batch (Windows) and Python scripts
* data structures definition
* key processing steps in database
* some query examples
* tools for stats computing
- more detailed readme for DATA
- BONUS 1: usage example, building a "BUST DATASET" with the [Nagadomi face detector](https://github.com/nagadomi/lbpcascade_animeface)
- scripts and some description
- several zipped folders (BUST suffix) with transformed and cleaned-up "busts" (upper body)
- RAW results have to be manually filtered and arranged
- an upper body detector can be built once the dataset becomes big and clean enough
- BONUS 2: usage example with the [notAI-tech NudeNet](https://github.com/notAI-tech/NudeNet) TensorFlow-based object detector
- scripts and more description
- several zipped folders (NUDE suffix) with marked samples
- given enough resources, it can be used
* for rechecking after manual de-hentaing (as I did)
* for semi-manual blurring to make images "less explicit"
* [for scene segmentation and person distinction based on related group of body parts](https://www.kaggle.com/printcraft/anime-and-cg-characters-detection-using-yolov5)
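The tag-to-post relation listed above can be rebuilt from the JSON metadata. Here is a minimal sketch, assuming each post is a JSON object with an integer `id` and a space-separated `tags` string (the usual Moebooru layout; check the pretty-printed example for the actual field names). The function name `tag_post_rows` is hypothetical and is not one of the bundled scripts.

```python
import json

def tag_post_rows(posts):
    """Yield (post_id, tag) pairs, one per tag of each post."""
    for post in posts:
        # Assumption: "tags" is a single space-separated string.
        for tag in post.get("tags", "").split():
            yield post["id"], tag

# Two made-up posts standing in for the real metadata.
sample = json.loads("""[
  {"id": 1, "tags": "tag_a tag_b"},
  {"id": 2, "tags": "tag_b"}
]""")
rows = list(tag_post_rows(sample))
```

Writing `rows` out with tab separators would give a table of the same shape as the tag-to-post TSV.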
The release includes "samples" for only about 2/3 of all posts: those with "good enough" images whose originals are worth getting:
- file_ext in ('jpg','png')
- greatest(image_height,image_width)>=1200 -- not too small
and least(image_height,image_width)>=1000
and image_height*image_width>=1310720 -- (1280x1024)
and image_width/image_height between 0.4 and 2.1 -- not too disproportional
- rating in ('s','q') in separate folders/zips
* 457 explicit (evident sex, masturbation, penis) and 3033 explicit-like (mostly because genitals are
too exposed or absent) samples were excluded from 'questionable'; this is marked in the metadata as "directories"
- grabbed files were renamed to contain "ID - up_to_3_copyrights ~ up_to_5_characters (up_to_2_artists)"
* tags are concatenated with "+", spaces are replaced with underscores
* maximum file name length is 220 characters; character tags may be truncated if too long
* this enables file system search and sampling (with masked XCOPY, UNZIP etc)
- some gentle deduplication was done (minus 2752 images), preferring the 's' rating and newer posts
* applied when there is no visible artistic difference, only possible technical ones
* a few practically blank pages were thrown out
* still, lots (~5000) of similar images remain (that's typical of Yande.re)
- no filtering by score and/or tags was applied
* the initial idea was to include only "the best of" and exclude "banned tags"
* but the border of "acceptable quality" turned out to be fuzzy
* user score vs tags vs tech-metadata may be a subject for analysis
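The format, size and rating conditions listed above can be expressed as a single predicate. This is a sketch in Python rather than the author's database query; the parameter names follow the column names used in the conditions, and the function name `worth_sampling` is hypothetical.

```python
def worth_sampling(file_ext, rating, image_width, image_height):
    """Return True if a post passes the format, rating and size filter."""
    if file_ext not in ("jpg", "png") or rating not in ("s", "q"):
        return False
    longer = max(image_width, image_height)
    shorter = min(image_width, image_height)
    return (
        longer >= 1200                                   # not too small
        and shorter >= 1000
        and image_width * image_height >= 1310720        # 1280x1024
        and 0.4 <= image_width / image_height <= 2.1     # not too disproportional
    )
```

The bounds are inclusive, like `between` in the original conditions. The division is safe because the `shorter >= 1000` check runs first.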
Sample images are archived in groups of 10.000 IDs: NNxxxx.[Q=questionable], NN=16..69
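To locate a given post among the archives, the grouping and the file naming scheme described above can be approximated as follows. This is only a sketch: the function names are hypothetical, the exact form of the "questionable" marker on archive names is not reproduced, and both the truncation rule (dropping trailing character tags) and the assumption that the 220-character limit applies to the name without extension are guesses, not the author's documented behavior.

```python
def archive_group(post_id):
    """Map a post ID to its 10.000-ID group name, e.g. 165352 -> '16xxxx'."""
    return f"{post_id // 10000}xxxx"

def sample_stem(post_id, copyrights, characters, artists, max_len=220):
    """Build 'ID - copyrights ~ characters (artists)' without extension."""
    def join(tags, limit):
        # Tags are joined with "+", spaces become underscores.
        return "+".join(t.replace(" ", "_") for t in tags[:limit])

    chars = list(characters[:5])
    while True:
        stem = (f"{post_id} - {join(copyrights, 3)} ~ "
                f"{join(chars, 5)} ({join(artists, 2)})")
        # Assumption: over-long names lose trailing character tags.
        if len(stem) <= max_len or not chars:
            return stem
        chars.pop()
```

For example, `sample_stem(165352, ["some series"], ["alice", "bob"], ["artist one"])` gives `165352 - some_series ~ alice+bob (artist_one)`, and `archive_group(165352)` gives `16xxxx`.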
I recommend using FastStone MaxView to browse images inside zips.
[HERE](https://sukebei.nyaa.si/view/3204613) is a release for konachan.com created the same way.
[THERE ARE](https://nyaa.si/user/AlexPUA) some rips on the Nyaa tracker for Safebooru and Zerochan. No nipples to detect there.