Mining Book and Author Data From Goodreads

Data are scraped using a site-wide crawler on Goodreads, extracting details from /book/show and /author/show type pages. This enables the collection of a large, rich, high-quality dataset which can be used for quantitative analyses, visualization and augmenting other datasets. The crawlers are focused, enabling the collection of very specific datasets, such as all books from a Listopia list. (Listopia is a collection of lists of books, each ordered by members' votes. Each member who votes on a list can order their individual votes, which are then used to generate the order of the main list.)
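A focused crawler has to decide which links are worth following. The sketch below shows one way that decision could look for the two page types named above, using only the standard library. The function name `classify` and the example URLs are hypothetical, and this is not the project's actual crawler code.

```python
import re
from urllib.parse import urlparse

# Path prefixes for the two page types mentioned in the description.
PAGE_TYPES = {
    "book": re.compile(r"^/book/show/"),
    "author": re.compile(r"^/author/show/"),
}


def classify(url):
    """Return "book" or "author" for matching pages, otherwise None."""
    path = urlparse(url).path
    for page_type, pattern in PAGE_TYPES.items():
        if pattern.match(path):
            return page_type
    return None


# Hypothetical URLs, used only to exercise the function.
book_kind = classify("https://www.goodreads.com/book/show/123")
author_kind = classify("https://www.goodreads.com/author/show/45")
other_kind = classify("https://www.goodreads.com/list/show/1")
```

A crawler could use the returned label to pick a parser for the page, and skip pages where the result is None.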

Keywords: web crawling, data mining, asynchronous programming
Project Timeline: Sep 2017 - Current
Technologies: Python
Libraries: click, rich, scrapy, selenium



Wikidata Toolkit - Bots to Fix Data Inconsistencies on Wikidata

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines, and acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. The Wikidata toolkit provides bots and scripts that work on television series and episodes to improve data quality. Here are some stats for the bot.

A core part of the implementation is a simple rule engine, where arbitrary constraints on the data can be verified and enforced.
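As a rough illustration of the rule-engine idea, a rule can pair a check (verify) with a fix (enforce). The sketch below uses plain dictionaries in place of Wikidata items. `Rule`, `apply_rules` and the sample field names are hypothetical and do not reflect the toolkit's actual code or the pywikibot API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Item = Dict[str, object]  # hypothetical stand-in for a Wikidata item


@dataclass
class Rule:
    """A constraint on an item, plus a way to enforce it."""
    name: str
    check: Callable[[Item], bool]  # True when the item satisfies the rule
    fix: Callable[[Item], None]    # mutates the item so that check passes


def apply_rules(item: Item, rules: List[Rule]) -> List[str]:
    """Enforce each violated rule and return the names of the rules applied."""
    applied = []
    for rule in rules:
        if not rule.check(item):
            rule.fix(item)
            applied.append(rule.name)
    return applied


# Hypothetical rule: every episode must carry an "instance_of" value.
has_instance_of = Rule(
    name="episode has instance_of",
    check=lambda item: "instance_of" in item,
    fix=lambda item: item.update(instance_of="television series episode"),
)

episode = {"label": "Pilot", "series": "Example Show"}
applied = apply_rules(episode, [has_instance_of])
```

Keeping the check and the fix separate means the same rules can run in a report-only mode (checks alone) before any edits are made.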

Keywords: open source, rule engine
Project Timeline: July 2019 - December 2020
Technologies: Python
Libraries: click, pywikibot



Clomask - Object Detection, Segmentation and Classification using Mask-RCNN

Clomask is an application of Mask-RCNN (Mask Region-based Convolutional Neural Networks, which generate binary image masks for objects in an image) to perform object detection and semantic segmentation (associating each pixel of the image with a label) of retail products such as bottles, candy bags and cereal boxes. We trained it on a synthetically generated dataset, achieving a weighted mAP of 0.586. (Average precision is the precision for a single image averaged over multiple intersection-over-union values; mean average precision (mAP) is the same measure over the entire dataset.)
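The mAP figure is built on intersection-over-union (IoU). A minimal sketch of IoU for two axis-aligned bounding boxes follows. The `(x1, y1, x2, y2)` box format and the function name `box_iou` are assumptions for illustration, and this is not the project's evaluation code, which may have computed IoU on masks instead.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
same = box_iou((0, 0, 2, 2), (0, 0, 2, 2))
apart = box_iou((0, 0, 1, 1), (2, 2, 3, 3))
```

A prediction counts as a match at a given threshold when its IoU with a ground-truth object meets that threshold; averaging precision over several thresholds gives the per-image score described above.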

In addition to working on data synthesis, performance tuning and technical writing, I also developed a scalable architecture to serve the model over the web using S3 and SQS.

Keywords: computer vision, deep learning
Project Timeline: Sep 2018 - March 2019
Technologies: Python, Mask-RCNN, AWS
Libraries: numpy, opencv



Social Network Analysis of Dynamic Character Graphs in 18/19th Century Novels

We generate character graphs from literature using named entity recognition (NER: locating and classifying named entities into categories such as person names, organizations and locations) to extract characters (nodes), and co-occurrence to infer links (edges) between these characters. We also maintain the points in time (in the novel) where characters interact, resulting in dynamic (temporal) graphs that can reveal interesting patterns in narrative structure and storylines.
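The co-occurrence step can be sketched as follows, assuming the character names have already been extracted by NER. Sentence-level co-occurrence, simple substring matching, and the name `build_temporal_graph` are simplifying assumptions for illustration, not the project's actual method.

```python
from collections import defaultdict
from itertools import combinations


def build_temporal_graph(sentences, characters):
    """Map each character pair to the sentence indices where both appear."""
    edges = defaultdict(list)
    for t, sentence in enumerate(sentences):
        # Characters mentioned in this sentence (naive substring match).
        present = sorted(c for c in characters if c in sentence)
        for a, b in combinations(present, 2):
            edges[(a, b)].append(t)  # t is the "point in time" in the text
    return dict(edges)


# Made-up sentences, used only to exercise the function.
sentences = [
    "Elizabeth spoke with Darcy at the ball.",
    "Jane wrote a letter.",
    "Darcy and Elizabeth walked with Jane.",
]
characters = {"Elizabeth", "Darcy", "Jane"}
graph = build_temporal_graph(sentences, characters)
```

Each edge keeps the list of positions where the pair co-occurs, so the graph can be sliced by position in the novel to see how relationships change over the course of the story.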

Keywords: computational linguistics, natural language processing, social network analysis, graph theory
Project Timeline: March 2018 - March 2019
Technologies: Python, Gephi
Libraries: flair, igraph, networkx, stanford-nlp, nltk, selenium, scrapy, d3



Analyzing 19th Century Literature w.r.t. The Bechdel Test

We infer conversations between characters in literature using named entity recognition to extract characters, and an ad-hoc algorithm to infer interactions. These data are then used to perform a qualitative analysis of 19th century literature w.r.t. the Bechdel test, a popular measure of women's representation in fiction and media.

Keywords: computational linguistics, natural language processing, human centered data science, reproducibility
Project Timeline: Nov 2018 - Dec 2018
Technologies: Python
Libraries: stanford-nlp, nltk, scrapy, matplotlib



FollowApp - A Low-Cost High-Efficacy Vaccination Reminder

Developed as part of a technology innovation program at Morgan Stanley, FollowApp is an end-to-end application that reminds parents and/or guardians of pending vaccinations for their children through automated calls. Their responses to the call are recorded for scheduling future calls. Among other modules, we developed an interface that was agnostic to the IVR service provider, allowing us to switch providers (Twilio, Exotel, imimobile) with minimal development effort. (Interactive Voice Response, or IVR, is an automated telephony system that interacts with callers, gathers information and routes calls to the appropriate recipients.) The project won the first prize for technological innovation, and has been covered in media several times.
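The provider-agnostic interface can be illustrated with a small sketch. It is written in Python to keep one language across this page's examples, although the project itself used Java. All class and method names are hypothetical, and the two providers are fakes that do not call any real telephony API.

```python
from abc import ABC, abstractmethod


class IvrProvider(ABC):
    """Hypothetical interface that every IVR provider adapter implements."""

    @abstractmethod
    def place_call(self, phone_number: str, message: str) -> str:
        """Place a call and return a provider-specific call id."""


class FakeProviderA(IvrProvider):
    def place_call(self, phone_number: str, message: str) -> str:
        return f"A:{phone_number}"  # stand-in for a real provider call


class FakeProviderB(IvrProvider):
    def place_call(self, phone_number: str, message: str) -> str:
        return f"B:{phone_number}"  # stand-in for a real provider call


class ReminderService:
    """Depends only on the interface, never on a concrete provider."""

    def __init__(self, provider: IvrProvider):
        self.provider = provider

    def remind(self, phone_number: str, vaccine: str) -> str:
        message = f"Reminder: {vaccine} vaccination is pending."
        return self.provider.place_call(phone_number, message)


# Switching providers only changes the object passed to the service.
call_a = ReminderService(FakeProviderA()).remind("555-0100", "polio")
call_b = ReminderService(FakeProviderB()).remind("555-0100", "polio")
```

With this shape, adding a provider means writing one new adapter class; the reminder logic stays untouched.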

Keywords: scalability, system design, social good, api development
Project Timeline: June 2016 - Aug 2017
Technologies: Java, SQL, AngularJS, AWS, IVR
Libraries: springboot, spring mvc, mysql, lombok