Creating a body language of online learning with graph databases

August 26, 2016



 Sean York is chief architect of innovation and advanced development for Pearson Global Higher Education.

Sean York of Pearson discusses how graph technology becomes a medium for enriching online environments.


PwC: What has your group’s research been focused on?

Sean York: One of the big areas of our research has to do with conversations in online learning environments, what we call automated online discourse analysis. In a real-world classroom of 150 people, not all people can talk during a 50-minute class period. Not everybody gets a chance every time. In the online world, everybody can be required to talk. But if 150 people all talk and opine and discuss course material during a week, what parts of those conversations should an instructor focus on and what should an instructor take away from the content of those conversations?

PwC: Was it the need for that kind of conversation analysis that led you to try graph databases? What were your goals?

Sean York: Yes, we were early adopters of graph stores. I came into the picture before Titan actually became Titan. In 2011, we were using Neo4j and the predecessors to Titan: the open source TinkerPop graph stack and its associated query language, Gremlin. We first engaged with Aurelius in 2011. Back then, we wanted to build a social and cooperative learning application based on a single platform that could be automatically instrumented for analytics. It quickly became apparent that we would need a graph, so we could reason about the highly connected and highly complex data sets that we would be generating.

I was pursuing my master’s in education, working with graphs, systems theory, and set theory. We connected with Aurelius and started working with them. We conducted some studies and some proofs of concept.

A couple of years later, we did a further assessment and still liked what had become the Titan database. We formed a team to support this social and cooperative learning analytics concept. And for the past couple of years, we’ve continued with Titan as our back-end database.

PwC: What was the main challenge you faced at that point?

Sean York: If you wanted to do something new with someone’s data, it was very hard to access the data in various systems. We wanted a central place where we could gather data and use the graph to reason about that data in the intuitive and exploratory way that the graph schema offers and is very good at. But we also wanted to scale our solution to potentially millions of users and provide the analytics that the graph approach could surface.

PwC: What sorts of analytics are specific to graphs?

Sean York: Our team has a research focus on innovation opportunities in online learning. When you look at a graph schema, it’s very much like a concept map, or a semantic map. It’s a very easy way to reason about things. We explore the graph to find interesting aspects of our data model, discover patterns, and extract that data for visualizations, machine learning models, and different kinds of data analysis. Part of what we’re trying to do is surface a body language of online learning. We’d like to figure out what the contextual cues are, so we can provide situational awareness to instructors and students in the online learning environment.

We’re looking at what we can adopt from the face-to-face model to see how it translates, how it can be instrumented, and how to think about it and understand it. We want to surface metrics and analytics that can give instructors new signposts, signifiers, and ways of understanding the activity that occurs in those environments, in order to enrich 21st-century eLearning contexts.

PwC: What kinds of analytics help most in a 21st-century learning context?

Sean York: We’re very interested in discussion analytics right now. A naive metric of discussion participation might be how many times each author has contributed to the thread of a discussion forum, or, if you go a little deeper, maybe word counts or timestamps. But those kinds of metrics are not particularly germane to learning.

We wanted to understand discussion analytics in a much richer and more robust way. So we modeled conversations as graph tree structures of the threads of discussion. We used natural language processing to extract the important concepts from those threaded discussions and made those concepts part of the graph as well.
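As a rough illustration of that modeling idea (not Pearson's actual schema), a thread can be stored as a tree of reply edges, with extracted concepts attached to posts as additional vertices. Everything in this sketch is hypothetical: the sample posts, the edge labels, and especially `extract_concepts`, which uses a fixed vocabulary lookup as a stand-in for a real natural language processing step.

```python
from collections import defaultdict

# Hypothetical sample thread: (post_id, parent_id, author, text)
POSTS = [
    ("p1", None, "ana", "How does photosynthesis store energy?"),
    ("p2", "p1", "ben", "Chlorophyll absorbs light during photosynthesis."),
    ("p3", "p2", "ana", "So chlorophyll drives glucose production?"),
    ("p4", "p1", "cy", "Glucose is the stored energy."),
]

# Assumption: a fixed vocabulary stands in for real NLP concept extraction.
VOCABULARY = {"photosynthesis", "chlorophyll", "glucose"}


def extract_concepts(text):
    """Return the vocabulary terms that appear in the text."""
    words = {word.strip(".,?!").lower() for word in text.split()}
    return words & VOCABULARY


def build_graph(posts):
    """Build labeled edges: author->post, post->parent post, post->concept."""
    edges = defaultdict(set)  # (label, source) -> set of targets
    for post_id, parent_id, author, text in posts:
        edges[("authored", author)].add(post_id)
        if parent_id is not None:
            edges[("reply_to", post_id)].add(parent_id)
        for concept in extract_concepts(text):
            edges[("mentions", post_id)].add(concept)
    return edges


def authors_discussing(edges, concept):
    """Traverse concept <- post <- author to find who discusses a concept."""
    posts = {
        source
        for (label, source), targets in edges.items()
        if label == "mentions" and concept in targets
    }
    return sorted(
        source
        for (label, source), targets in edges.items()
        if label == "authored" and targets & posts
    )


graph = build_graph(POSTS)
print(authors_discussing(graph, "chlorophyll"))
```

Because posts, authors, and concepts all live in one structure, a question such as "who is discussing this topic?" becomes a short traversal rather than a join across separate systems.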

From there we could ask interesting questions: What are the important topics being discussed? Who’s discussing them? If we find a dominant leader of discussions on one topic, what are all the other discussions they contributed to? Are they contributing the same sort of content domain concepts to that discussion as they were to the other discussions? And how do individuals impact the evolution and the productivity of a conversation, the direction of it? What methods can we use to rank the participation of people in this sort of rich but unstructured discussion environment?

So instead of counting how many times a person participates directly, you might count the number of posts that follow a person’s contribution, or look at the structure of the conversation that develops after it. You start to get a sense of their influence on the discussion. That’s the research domain we’re looking at and thinking about.
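One simple way to picture this kind of influence measure is to count, for every post, how many replies sit beneath it in the thread tree, and then total those counts per author. The sketch below is only an assumption about how such a count could be computed; the sample thread and the function names are hypothetical, and it makes no attempt to exclude an author's own replies further down the thread.

```python
from collections import defaultdict

# Hypothetical sample thread: (post_id, parent_id, author)
POSTS = [
    ("p1", None, "ana"),
    ("p2", "p1", "ben"),
    ("p3", "p2", "cy"),
    ("p4", "p2", "ana"),
    ("p5", "p1", "cy"),
]


def downstream_counts(posts):
    """Count the replies (direct and indirect) beneath each post."""
    children = defaultdict(list)
    for post_id, parent_id, _ in posts:
        if parent_id is not None:
            children[parent_id].append(post_id)

    def count(post_id):
        # Each child contributes itself plus everything beneath it.
        return sum(1 + count(child) for child in children[post_id])

    return {post_id: count(post_id) for post_id, _, _ in posts}


def influence_by_author(posts):
    """Total the downstream reply counts of each author's posts."""
    per_post = downstream_counts(posts)
    totals = defaultdict(int)
    for post_id, _, author in posts:
        totals[author] += per_post[post_id]
    return dict(totals)


print(influence_by_author(POSTS))
```

In this sample, "cy" posts as often as "ana" but nothing follows either of cy's posts, which is exactly the distinction a raw participation count would miss.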

PwC: What are some of the conclusions you’ve drawn?

Sean York: Most of our work has been research to this point. But one interesting example is that by deriving social network structures from discourse, you can then apply graph metrics that give you a solid and replicable yardstick with which you can compare those discussions to other data that you have, such as grades, or other outcomes or participation data. There are some very interesting possibilities, not necessarily for auto-grading, but for providing instructors with support for making data-driven assessment decisions in these loosely structured online learning environments.
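To make the idea of deriving a social network from discourse concrete, the sketch below turns reply relationships into directed author-to-author edges and computes a normalized in-degree centrality, one common graph metric. The choice of metric, the sample data, and all function names are assumptions for illustration; the interview does not say which metrics the team applied.

```python
from collections import defaultdict

# Hypothetical sample thread: (post_id, parent_id, author)
POSTS = [
    ("p1", None, "ana"),
    ("p2", "p1", "ben"),
    ("p3", "p2", "cy"),
    ("p4", "p2", "ana"),
    ("p5", "p1", "cy"),
    ("p6", "p5", "dee"),
]


def reply_network(posts):
    """Derive weighted directed edges (replier -> author replied to)."""
    author_of = {post_id: author for post_id, _, author in posts}
    weights = defaultdict(int)
    for _, parent_id, author in posts:
        if parent_id is None:
            continue
        target = author_of[parent_id]
        if target != author:  # ignore self-replies
            weights[(author, target)] += 1
    return dict(weights)


def in_degree_centrality(weights, authors):
    """Fraction of the other participants who replied to each author."""
    others = len(authors) - 1
    incoming = defaultdict(set)
    for source, target in weights:
        incoming[target].add(source)
    if others <= 0:
        return {author: 0.0 for author in authors}
    return {author: len(incoming[author]) / others for author in authors}


authors = sorted({author for _, _, author in POSTS})
network = reply_network(POSTS)
centrality = in_degree_centrality(network, authors)
print(centrality)
```

A score like this is comparable across discussions of different sizes, which is what would let it be set alongside grades or other outcome data.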

In another example, we’re seeing a real disconnect between the patterns of participation by instructors in these discussions and what we observe with students. The students are logging in around the clock, distributed across the whole week. But the instructor may log in Friday evening, work for 20 minutes, and post across everything that’s there.

We use natural language processing to identify questions being asked in the text, and sometimes you can see instructors employing a question-and-answer, interview-type strategy as they prompt students to go deeper into an area of personal, professional, or course knowledge that they’re discussing.

That’s an example of looking at important conversation inflection points and strategies and then raising awareness to the participants, so they can get their bearings in the course. You can surface a questionography, or a list of all the questions being asked in a certain context. And the instructor can use that list as a tool to prepare for an upcoming class, such as a starting point for the next lecture.
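A questionography of this kind could be approximated, very crudely, by pulling out every sentence that ends in a question mark. The sketch below assumes that punctuation heuristic as a stand-in for real natural language processing, which would also catch questions phrased without a question mark; the sample posts and function names are hypothetical.

```python
import re

# Hypothetical sample posts: (author, text)
POSTS = [
    ("ana", "I liked the reading. Why does the model fail here?"),
    ("ben", "It overfits."),
    ("cy", "Can you share an example? Thanks!"),
]


def extract_questions(text):
    """Split text into sentences and keep those ending in '?'."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    return [sentence for sentence in sentences if sentence.endswith("?")]


def questionography(posts):
    """List every question asked, paired with the author who asked it."""
    return [
        (author, question)
        for author, text in posts
        for question in extract_questions(text)
    ]


for author, question in questionography(POSTS):
    print(f"{author}: {question}")
```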

PwC: How does a graph database help in this context?

Sean York: The graph reflects the state of data from multiple systems. The graph is the place for us to situate in a consistent manner the data that we see coming out of a learning management system, a homework product, or an e-text product.

The graph allows us to very quickly test an algorithm, explore some data, perform a visualization or a research study, and then fold that result directly back into the system that’s providing analytics and application support services to various learning products.

PwC: There’s a fundamental problem of information overload in online education as you scale up, particularly to massive open online courses [MOOCs]. Does the modeling and data description that graphs enable make it possible to reduce that overload?

Sean York: I think what we’re aiming for would be incredibly valuable in the MOOC space. It would allow both students and teachers to see 10,000-foot views of conversation, zoom in and out and see each other’s context, understand who’s contributing what, and find experts and peer tutors based on potentially complementary knowledge areas. These are all possibilities I see coming out of these technologies.

PwC: What are the biggest challenges you’re facing?

Sean York: We’ve not only been navigating our way through a very difficult, unexplored problem domain, but also trying to do it with a continually evolving set of tools.

Those are two difficult things to try to do at the same time. It’s like trying to paint a train going 90 miles per hour through the countryside. We face some typical innovator challenges in that sense. We really want to make an impact with this work. We think we’re on to something really interesting, but we must find the right audience, customers, and adopters—the right champions. That’s a set of challenges separate from the technology stack itself.

This is just one domain where we’re choosing to apply our skills to try to improve things a little bit. We want to see what is out there that we can apply, that can make some difference, and that can move the needle. If we can succeed in even one such area, the effort will be worth it, because we have the potential to serve so many people.


