A Quickstart Guide to Semantic SEO with JSON-LD and Schema Markup
“Documents can’t really be merged and integrated and queried. But data are protean, able to shift into whatever shape best suits your needs.”
While discussions about Semantic SEO are easy to come by online and at search conferences, I couldn’t find a single resource that gave me enough of a foothold to get started properly. This document is my attempt to create the comprehensive training guide I wish I’d found when I started.
Semantic SEO works by linking and publishing metadata and making it available to external users – people, web crawlers, and graph-based knowledge repositories, also known as knowledge graphs. Linked data publishing allows for machine learning algorithms to execute semantic queries and generate SERP features, which in turn supports traffic acquisition.
I will unpack the notion of entity-oriented search and will provide you with schema markup and JSON tutorials which will allow you to:
- Write JSON-LD, deploy schema markup, and build semantic nodes.
- Create entities and list them in the knowledge graphs.
- Map out an entity-oriented and a SERP-based strategy.
- Optimize content with natural language processing (NLP) tools.
If you don’t have the time to learn how to write JSON-LD, you can jump to the simple method and generate the code with online tools. For better context, I would still recommend reading the Introduction.
This guide is a work in progress, so if you spot inaccuracies, omissions, oversights, or any other mistakes, please contact me at email@example.com and I will correct them.
My understanding of semantic search has benefited greatly from the synthetic work of Krisztian Balog and my exchange with Dixon Jones who offered a coherent and comprehensive overview of the state of the art.
Most SEOs started looking into schema markup in order to optimise a company’s SERP presence with knowledge cards, rich cards, answer boxes, featured snippets and other rich results such as these:
I will argue that, for an SEO or a digital marketer, defining and connecting your webpages with semantic markup is a valuable activity for two reasons:
First, it’s an efficient marketing tactic.
Content discovery and improved SERP placement lead to higher click-through rates, deeper engagement and improved brand awareness. You can take a look at these case studies, including this Schema App study that showed a 160% increase in YoY click volume after the SAP website was marked up.
Additionally, the web is rich with accessible, high-potential opportunities for rich results, especially for long-tail keywords. Over 25% of all web queries mention or target specific entities. Only one third of websites contain defined entities. Only 12.29% of search queries trigger rich results.
Moreover, the growing importance of semantic markup is supported by record global sales of smart speakers and virtual assistants: the number of units sold grew to 147 million in 2019. Hands-free voice command is becoming a significant way for consumers to search (especially in the 26-35 age group), and 40.7% of all voice answers come from rich or featured snippets.
Second, it represents an opportunity to play a transformative role in the semantic web environment. A marked-up web document, originally designed for human consumption, becomes a machine-understandable data set. Schema markup allows you to create nodes that can be mapped and processed by the semantic web and by Google’s Knowledge Graph.
Or, as Aaron Swartz put it:
“Documents can’t really be merged and integrated and queried; they serve mostly as isolated instances to be viewed and reviewed. But data are protean, able to shift into whatever shape best suits your needs.”
In other words it is how documents:
…are plugged into a metadata ecosystem or a Giant Global Graph:
This is the Linked Open Data Cloud, a continuously updated visualization of the web’s connected data. Pink circles represent user-generated data, often annotated with schema.org.
Semantic Networks & the Semantic Web
A semantic network is a graphic depiction of knowledge that can be organized into a taxonomy. Such networks have been used in mathematics, psychology, and linguistics since Euler’s solution of the Königsberg bridge problem in 1736.
Semantic networks became popular in artificial intelligence and natural language processing in the 1960s.
The Semantic Web is an extension of the existing Web where machine-readable data is layered over the information that is provided for people. The advantage of semantic data is that software can process it.
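To make the idea of a machine-readable layer concrete, here is a minimal sketch in Python, using only the standard library, of how a piece of software might read the structured data embedded in a page. The page, the company name and the `JsonLdExtractor` class are all hypothetical and exist only for illustration; real crawlers are far more elaborate.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: human-readable HTML with a machine-readable JSON-LD layer.
PAGE = """
<html>
  <head>
    <title>Example Coffee Roasters</title>
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "name": "Example Coffee Roasters",
      "url": "https://www.example.com"
    }
    </script>
  </head>
  <body><h1>Welcome to Example Coffee Roasters</h1></body>
</html>
"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every JSON-LD script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Only script tags typed as JSON-LD carry the structured data.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


extractor = JsonLdExtractor()
extractor.feed(PAGE)
entity = extractor.items[0]
print(entity["@type"], "-", entity["name"])
```

A person reading the page sees only the headline; the program sees a typed entity with named properties, without having to guess at the meaning of the visible text.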
The history of efforts to enable Web-scale exchange of structured and linked data dates back to the 1990s. A 2001 Scientific American article by Tim Berners-Lee et al. was probably the most ambitious statement of this program. It imagined a web of linked data, where semantics, structure, and shared standards would allow humans to communicate with intelligent machines in order to access automated services. Berners-Lee originally expressed his vision of the Semantic Web in 1999 as follows:
“I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize.”
Another advocate of the semantic web was Aaron Swartz, co-founder of reddit, political organizer and hacktivist who was instrumental in the fight against SOPA. In his unfinished A Programmable Web he wrote:
“The Semantic Web is based on bet, a bet that giving the world tools to easily collaborate and communicate will lead to possibilities so wonderful we can scarcely even imagine them right now.”
History of the Knowledge Graph: a solution to Google’s scale problem
The first-ever website, info.cern.ch, was published by Tim Berners-Lee on August 6, 1991. Over the next two decades, hundreds of trillions of webpages came to be stored on over 8M servers. Even though Google indexes only a small percentage of them, Google’s ranking models quickly faced a scale problem. By 2005, a Browseable Fact Repository emerged as Google’s single-point-of-truth database solution. Google started to move from indexing URLs to indexing data. Knowledge graphs quickly emerged as the most popular organizational model for semantic metadata in both commercial and research applications. Since then, the distinction between semantic networks and knowledge graphs has blurred.
In 2006, Google patented a Browseable Fact Repository, an early version of Google’s Knowledge Graph. It was built by the Google Annotation Framework team headed by Andrew Hogue. The idea was to start indexing data instead of URLs and match them against entities listed in a single-point-of-truth database.
At the same time Danny Hillis and John Giannandrea founded Metaweb Technologies which developed Freebase, an online collection of structured data harvested from multiple sources, including user-submitted wiki contributions.
Metaweb was acquired by Google in 2010, and its technology became the basis of the Google Knowledge Graph. John Giannandrea was hired by Google and appointed Head of Search in 2016. In 2018 he moved to Apple, where he was appointed Senior VP of ML and AI strategy. He is currently in charge of Siri and probably of the upcoming Apple Search Engine.
One of the challenges in the 2000s was to enable different applications to work easily with data from different silos. Text search, with its wide coverage well beyond narrow verticals, emerged as the ideal candidate for a common ontology that both humans and machines can understand.
In 2011, major search engines Bing, Google, and Yahoo (later joined by Yandex) presented webmasters with Schema.org, a standardized semantic vocabulary of tags (or microdata) used to mark up web content and make it machine-readable.
In December 2012, Ray Kurzweil was personally hired by Google co-founder Larry Page as Director of Engineering to “work on new projects involving machine learning and language processing”. He contributed to the Hummingbird algorithm, which placed greater emphasis on NLP and entities annotated by the Knowledge Graph.
The Knowledge Graph is a database that provides descriptions of real-world entities and their interrelations, or in other words, answers, not just links. It leveraged DBpedia and Freebase as data sources and later incorporated content annotated with Schema.org. Google won’t say what percentage of queries evoke a Knowledge Graph answer but seems comfortable with a ballpark estimate of about 25%.
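As a rough illustration of what “entities and their interrelations” means in practice, the sketch below stores a few facts mentioned in this section as subject, predicate, object triples and queries them by pattern. It is a toy model in Python, not a description of how Google’s Knowledge Graph is implemented; the `TRIPLES` list and the `query` helper are names I made up for the example.

```python
# A toy knowledge graph: each fact is a (subject, predicate, object) triple.
# The facts are taken from this section; the data model is an assumption.
TRIPLES = [
    ("Tim Berners-Lee", "published", "info.cern.ch"),
    ("Danny Hillis", "founded", "Metaweb Technologies"),
    ("John Giannandrea", "founded", "Metaweb Technologies"),
    ("Metaweb Technologies", "developed", "Freebase"),
    ("Google", "acquired", "Metaweb Technologies"),
]


def query(subject=None, predicate=None, obj=None):
    """Return the triples that match every field that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# "Who founded Metaweb Technologies?" is answered from the graph, not from a document.
founders = [s for (s, _, _) in query(predicate="founded", obj="Metaweb Technologies")]
print(founders)

# Every fact in which Metaweb Technologies appears as the object.
print(query(obj="Metaweb Technologies"))
```

The point of the structure is that a question is resolved by matching a pattern against stored relations, which is what makes a direct answer possible instead of a list of links.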
Today, major companies, such as Facebook, Airbnb, Amazon and Uber have created their own “knowledge graphs” that power semantic searches and enable smarter processing and delivery of data.
In July 2020, the Wikimedia Foundation announced Abstract Wikipedia, based on a 22-page paper by Denny Vrandečić, founder of Wikidata. It would allow contributors to create content using abstract notation which could then be translated to different natural languages, balancing out content more evenly, no matter the language you speak.
The Semantic Web is growing slowly, even though the benefits of developing for it are not always immediate or visible. We all stand to gain from incorporating semantic markup into our web pages, as every site that does strengthens the foundations of an open, transparent, decentralized internet.
Semantic SEO emerged as a practice shortly after the publication of the 2015 patent ‘Ranking search results based on entity metrics’, which described how, in some instances, search results are based on entities found inside Google’s Knowledge Graph. The meaning and relationships of entities on web pages are communicated to knowledge graphs via search engines, with a machine-native metadata vocabulary called schema.org. The simplest and most popular encoding methodology used for schema.org is JSON-LD (RDF in its developer-friendly form).
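To give a first taste of what this looks like, here is a minimal sketch in Python that assembles a schema.org entity and serializes it as a JSON-LD script tag ready to paste into a page. The organization, its URLs and the `to_script_tag` helper are placeholders invented for this example; `@context`, `@type`, `@id`, `name`, `url` and `sameAs` are standard JSON-LD keywords and schema.org properties.

```python
import json

# Hypothetical entity described with the schema.org vocabulary.
organization = {
    "@context": "https://schema.org",                  # vocabulary in use
    "@type": "Organization",                           # kind of entity
    "@id": "https://www.example.com/#organization",    # identifier for this node
    "name": "Example Coffee Roasters",
    "url": "https://www.example.com",
    "sameAs": ["https://social.example/examplecoffee"],  # placeholder profile URL
}


def to_script_tag(data):
    """Serialize a dict as a JSON-LD script tag for an HTML page."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a stray "</script>" in a value cannot close the tag early;
    # "\/" is a valid JSON escape for "/", so the data itself is unchanged.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = to_script_tag(organization)
print(snippet)
```

Building the markup from a dictionary rather than typing the JSON by hand guarantees that quoting and commas are valid, which is the most common source of errors in hand-written JSON-LD.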