Yande.re is one of the best-known anime/game/CG imageboards, devoted mostly to high-quality scans (artbooks etc.).
It has a strong community (which gives images adequate scores) and a well-organized (not too verbose) tagging system.
Yande.re also offers a wide variety of picture composition and quality: from trash-like partial scans and completely
text-filled artbook pages to clean art, from poor-quality scans and almost invisible line art to pure full-color digital.
That’s why Yande.re is a good source for investigating non-photographic images and their metadata (similar to
Gwern's Danbooru dataset) and for building tools to auto-classify all of that (or simply to make your eyes happy).
This release contains:
- JSON metadata for 618,801 Yande.re posts, from the first post up to ID 700,000 (29.10.2020), except those that failed to download (deleted posts etc.)
- with a simple Python script showing how to grab it
- with a pretty-printed example to illustrate structure and content
- 397,691 “sample” images as prepared by the site, at reasonable quality (samples were introduced from ID=165352, 15.12.2010)
- longer side = 1500 px, but no more than 1.8 MPix
- JPEG quality 92%, some optimizations applied
- additional TSV (tab-separated text) metadata
- key parameters of 403,933 posts, including some calculated stats
~ derived from JSON
~ computed with ImageMagick over the above-mentioned “samples”
- tag list, including some calculated stats over the tags and external references
- tag-to-post relation as a separate table (4,261,026 rows)
- some database (Oracle), Batch (Windows) and Python scripts
- data structures definition
- key processing steps in database
- some query examples
- tools for stats computing
- more detailed readme for DATA
- BONUS 1: usage example, building a “BUST DATASET” based on Nagadomi's face detector
- scripts and some description
- several zipped folders (BUST-suffixed) with transformed and cleaned-up “busts” (upper body)
- RAW results have to be manually filtered and arranged
- an upper body detector can be built once the dataset becomes big and clean enough
- BONUS 2: usage example with the notAI-tech NudeNet TensorFlow-based object detector
- scripts and more description
- several zipped folders (NUDE-suffixed) with marked samples
- it can be used when enough resources are available
The release includes “samples” for only about 2/3 of all posts, namely those with “good enough” images whose originals are worth getting:
- file_ext in ('jpg','png')
- greatest(image_height,image_width)>=1200 -- not too small
and least(image_height,image_width)>=1000
and image_height*image_width>=1310720 -- (1280x1024)
and image_width/image_height between 0.4 and 2.1 -- not too disproportional
- rating in ('s','q'), in separate folders/zips
- 457 explicit (evident sex, masturbation, penis) and 3033 explicit-like (mostly because genitalia are
too exposed or absent) samples were excluded from ‘questionable’; this is marked in the metadata as “directories”
- grabbed files are renamed to contain “ID - up_to_3_copyrights ~ up_to_5_characters (up_to_2_artists)”
- tags are concatenated via “+”, spaces are replaced with underscores
- maximum file name length is 220 characters; character tags may be truncated if too long
- this enables file system search and sampling (with masked XCOPY, UNZIP etc.)
- some gentle deduplication was done (minus 2752 images), preferring the ‘s’ rating and newer posts
- applied when images show no visible artistic difference, only possibly technical ones
- a few practically blank pages were thrown out
- so lots (~5000) of similar images remain (that’s typical for Yande.re)
- no filter applied by score and/or tags
- the initial idea was to include only “the best of” and exclude “banned tags”
- but the border of “acceptable quality” turned out to be fuzzy
- user score vs tags vs tech-metadata may be a field for analysis
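The “good enough” conditions from the list above can be restated in Python, for instance to pre-filter posts before fetching anything. This is only a minimal sketch, not the script shipped in the release: the field names (file_ext, image_height, image_width, rating) are taken from the conditions as written, while passing each post as a plain dict and the function name is_good_enough are assumptions made for illustration.

```python
def is_good_enough(post):
    """Return True if a post passes the size/format/rating conditions listed above.

    `post` is assumed to be a dict with the keys used in the conditions;
    the real JSON/TSV field names may differ.
    """
    w, h = post["image_width"], post["image_height"]
    return (
        post["file_ext"] in ("jpg", "png")
        and max(h, w) >= 1200           # not too small
        and min(h, w) >= 1000
        and h * w >= 1310720            # at least 1280x1024 pixels
        and 0.4 <= w / h <= 2.1         # not too disproportional
        and post["rating"] in ("s", "q")
    )


if __name__ == "__main__":
    example = {"file_ext": "jpg", "image_width": 1500,
               "image_height": 1200, "rating": "s"}
    print(is_good_enough(example))
```

The division is safe because the earlier conditions already force both sides to be at least 1000 px before the ratio is evaluated.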
Sample images are archived in groups of 10,000 IDs: NNxxxx.[Q=questionable], NN=16…69
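To make the naming rule and the ID grouping more concrete, here is a rough Python sketch. It follows the template “ID - up_to_3_copyrights ~ up_to_5_characters (up_to_2_artists)” literally. Several details are assumptions, not facts about the release scripts: the helper names build_name and id_group, counting the extension inside the 220-character limit, and shortening an overlong name by dropping whole character tags from the end.

```python
MAX_NAME_LEN = 220  # limit from the notes; including the extension is an assumption


def build_name(post_id, copyrights, characters, artists, ext="jpg"):
    """Build 'ID - copyrights ~ characters (artists).ext' for one post."""

    def join(tags, limit):
        # Keep at most `limit` tags, replace spaces with underscores, join with "+".
        return "+".join(t.replace(" ", "_") for t in tags[:limit])

    chars = list(characters[:5])
    while True:
        name = (f"{post_id} - {join(copyrights, 3)} ~ {join(chars, 5)} "
                f"({join(artists, 2)}).{ext}")
        if len(name) <= MAX_NAME_LEN or not chars:
            return name
        chars.pop()  # assumption: shorten by dropping the last character tag


def id_group(post_id):
    """Return the NN part of the 10,000-ID archive group a post falls into."""
    return post_id // 10_000


if __name__ == "__main__":
    print(build_name(165352, ["touhou"], ["hakurei reimu"], ["some artist"]))
    print(id_group(165352))
```

With names built this way, a wildcard such as `* - touhou*` is enough to pick out one copyright from a folder or from a zip listing, which is the kind of file system search mentioned above.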
I recommend using FastStone MaxView to browse images inside the zips.
HERE is a release for konachan.com created the same way.
THERE ARE some rips on the Nyaa tracker for Safebooru and Zerochan. No nipples to detect there.