Top Nguyen Blog
Parsing and manipulating HTML documents in C#

23 December, 2016

A library that lets you parse HTML files as they are found on the web. The parser is very tolerant of malformed real-world HTML, and the object model is similar to the one System.Xml offers, but for HTML documents (or streams).

What Is AngleSharp?

AngleSharp is a library for parsing and manipulating HTML documents. It fully supports HTML5 and, just like browsers, it can handle malformed HTML. After parsing the HTML, it provides a DOM, which you can query with CSS selectors or filter with LINQ. The DOM is fully interactive, so you can remove or add new elements and the DOM will reflect the changes. AngleSharp is available as a Portable Class Library, making it usable for Xamarin and Windows Phone projects as well.
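
To make the "fully interactive" part concrete, here is a minimal sketch that removes one element and adds another, then serializes the list to show that the DOM reflects both changes. It assumes the AngleSharp 0.9.x API that the rest of this post uses (in AngleSharp 0.10 and later the parser lives in AngleSharp.Html.Parser and the method is called ParseDocument). The class name, method name, and sample markup are made up for illustration.

```csharp
using System;
using AngleSharp.Parser.Html; // AngleSharp 0.9.x namespace (assumption, see note above)

public static class DomEditingSketch
{
    // Parses a small list, removes one item, appends a new one,
    // and returns the serialized <ul> so the change is visible.
    public static string EditList()
    {
        var parser = new HtmlParser();
        var document = parser.Parse(
            "<ul><li id='first'>One</li><li id='second'>Two</li></ul>");

        var list = document.QuerySelector("ul");

        // Remove an existing element from the tree
        document.QuerySelector("#first").Remove();

        // Create a new element and attach it to the list
        var item = document.CreateElement("li");
        item.TextContent = "Three";
        list.AppendChild(item);

        return list.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(EditList());
    }
}
```

The printed markup should no longer contain the "One" item and should end with the new "Three" item, because OuterHtml is produced from the live tree rather than from the original string.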

How To Use

If you already have the HTML as a string and want to work with it in C#, parse it like this:

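// Namespaces as of AngleSharp 0.9.x, the release line current when this post was written.
// In AngleSharp 0.10 and later, HtmlParser lives in AngleSharp.Html.Parser
// and Parse(...) is named ParseDocument(...).
using System;
using System.Linq;
using AngleSharp.Extensions;   // Text() extension method
using AngleSharp.Parser.Html;  // HtmlParser

var parser = new HtmlParser();
var document = parser.Parse(@"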
<html> 
<head>
        <title>AngleSharp by TopNguyen</title>
</head> 
<body> 
        <script> 
            document.addEventListener('changeAuthor', function(){ 
                    document.querySelector('#author').innerText = 'Another'
            }); 
        </script> 
        <ul> 
        <li id='author'>Top Nguyen</li> 
        <li class='description'>I am Top Nguyen, I am C# .NET Senior Developer.</li> 
        <li class='description introduce'>I do the work I do because I love it. Keep simple, everything possible.</li> 
        <li class='description introduce'>This is my blog, stories from the life and trials of a web developer.</li> 
        </ul> 
</body>
</html>");

// Print the full HTML of the parsed document
Console.WriteLine("FULL HTML:");
Console.WriteLine(document.DocumentElement.OuterHtml);

//Do something with LINQ
var descriptionItemsLinq = document.All.Where(m => m.LocalName == "li" && m.ClassList.Contains("description"));

//Or directly with CSS selectors
var descriptionItemsCssSelector = document.QuerySelectorAll("li.description");

// directly with ID selector
var author = document.QuerySelector("#author");

Console.WriteLine();
Console.WriteLine("LINQ:");

foreach (var item in descriptionItemsLinq)
{
    Console.WriteLine(item.Text());
}

Console.WriteLine();
Console.WriteLine("CSS:");

foreach (var item in descriptionItemsCssSelector)
{
    Console.WriteLine(item.Text());
}

Console.WriteLine();
Console.WriteLine("Author by ID:");
Console.WriteLine(author.Text());

Console.ReadKey();

Running it produces this result:

Result

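AngleSharp also supports DOM events, so you can register a listener and dispatch an event from C#. Note that the JavaScript listener inside the page's script tag does not run with the plain parser used here; executing scripts requires a separate scripting package on top of AngleSharp.

Put the following code, which creates and dispatches an event, before Console.ReadKey():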

//------------------------------------------------------------
//    Register event listener from C# (we also have one in JS)
//------------------------------------------------------------

document.AddEventListener("changeAuthor", (s, ev) =>
{
    author.TextContent = "Another Author";
});

var e = document.CreateEvent("event");
e.Init("changeAuthor", false, false);

// execute event
document.Dispatch(e);

Console.WriteLine();
Console.WriteLine(new string('-', 10));
Console.WriteLine("After execute event");
Console.WriteLine(new string('-', 10));
Console.WriteLine();

Console.WriteLine("Author by ID:");
Console.WriteLine(author.Text());
Console.WriteLine("FULL HTML:");
Console.WriteLine(document.DocumentElement.OuterHtml);

Console.ReadKey();

And you get this result:

Result - Javascript

You can get my full demo code from this link.

An Important NuGet Note

PM> Install-Package AngleSharp

Wiki link

Website link

Summary

The performance of AngleSharp is quite close to the performance of browsers. Even very large pages can be processed within milliseconds. AngleSharp tries to minimize memory allocations and reuses elements internally to avoid unnecessary object creation.
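
If you want to check the speed on your own machine rather than take it on faith, a rough timing sketch could look like the one below. It again assumes the AngleSharp 0.9.x API used in this post; the class name, the synthetic page, and the item count are hypothetical choices for illustration, and a single Stopwatch run that includes JIT warm-up is not a real benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using AngleSharp.Parser.Html; // AngleSharp 0.9.x namespace (assumption)

public static class ParseTimingSketch
{
    // Builds a synthetic page with the given number of list items.
    public static string BuildPage(int itemCount)
    {
        var html = new StringBuilder("<html><body><ul>");
        for (var i = 0; i < itemCount; i++)
        {
            html.Append("<li class='description'>Item ").Append(i).Append("</li>");
        }
        return html.Append("</ul></body></html>").ToString();
    }

    public static void Main()
    {
        // Hypothetical size; pick something close to the pages you handle
        var page = BuildPage(10000);
        var parser = new HtmlParser();

        // Time only the parse step, not the string building
        var watch = Stopwatch.StartNew();
        var document = parser.Parse(page);
        watch.Stop();

        Console.WriteLine("Parsed {0} <li> elements in {1} ms",
            document.QuerySelectorAll("li").Length, watch.ElapsedMilliseconds);
    }
}
```

Run it a few times and ignore the first measurement, since that one also pays for loading and compiling the library.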

Happy Coding!