Convert HTML to plain text in JS without browser environment

I have a CouchDB view map function that generates an abstract of a stored HTML document (first x characters of text). Unfortunately I have no browser environment to convert HTML to plain text.

Currently I use this multi-stage regexp:

    html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
        .replace(/<script([\s\S]*?)<\/script>/gi, ' ')
        .replace(/(<(?:.|\n)*?>)/gm, ' ')
        .replace(/\s+/gm, ' ');
    

While it’s a very good filter, it’s obviously not a perfect one, and some leftovers occasionally slip through. Is there a better way to convert HTML to plain text without a browser environment?

4 Solutions collected from the web for “Convert HTML to plain text in JS without browser environment”

    This regular expression works:

    text.replace(/<[^>]*>/g, '');
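
    One kind of leftover that a tag-stripping regexp does not handle is character entities such as &amp; or &#39;. A minimal sketch of decoding them after the tags are removed is shown here; the names decodeEntities, stripTags and NAMED_ENTITIES are made up for illustration, and the entity table is deliberately tiny (a real one has over two thousand names).

```javascript
// Hypothetical helpers, not part of any library. Only a handful of named
// entities are covered; unknown ones are left untouched.
var NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };

function decodeEntities(text) {
  // One pass over the string, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1).toLowerCase() === 'x';
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // String.fromCharCode only covers the Basic Multilingual Plane,
      // so anything outside it is left as written.
      return code > 0 && code <= 0xffff ? String.fromCharCode(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

function stripTags(html) {
  // Strip tags first, then decode, so "&lt;b&gt;" survives as the literal text "<b>".
  return decodeEntities(html.replace(/<[^>]*>/g, ''));
}
```

    Note that this maps &nbsp; to an ordinary space, which is probably what you want for an abstract but is an assumption of this sketch.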
    

    Convert HTML to plain text the way Gmail does:

    html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
    html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
    html = html.replace(/<\/div>/ig, '\n');
    html = html.replace(/<\/li>/ig, '\n');
    html = html.replace(/<li>/ig, '  *  ');
    html = html.replace(/<\/ul>/ig, '\n');
    html = html.replace(/<\/p>/ig, '\n');
    html = html.replace(/<br\s*[\/]?>/gi, "\n");
    html = html.replace(/<[^>]+>/ig, '');
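
    Tying this back to the CouchDB view from the question, a rough sketch of a map function that emits an abstract could look like the following. The field name doc.html and the 200-character limit are assumptions, and unlike the steps above it collapses everything to single spaces rather than keeping line breaks, since an abstract rarely needs them. The emit stub exists only so the sketch can run outside CouchDB.

```javascript
// Stand-in for CouchDB's built-in emit(), only for running this outside CouchDB.
var emitted = [];
function emit(key, value) { emitted.push([key, value]); }

// Map function sketch; `doc.html` and the limit of 200 are assumed, not from the question.
var map = function (doc) {
  function toPlainText(html) {
    return html
      .replace(/<!--[\s\S]*?-->/g, ' ')             // HTML comments
      .replace(/<style[\s\S]*?<\/style>/gi, ' ')    // style blocks including content
      .replace(/<script[\s\S]*?<\/script>/gi, ' ')  // script blocks including content
      .replace(/<[^>]+>/g, ' ')                     // remaining tags become a space
      .replace(/\s+/g, ' ')                         // collapse whitespace runs
      .replace(/^ | $/g, '');                       // trim the single leading/trailing space
  }
  if (typeof doc.html === 'string') {
    emit(doc._id, toPlainText(doc.html).slice(0, 200));
  }
};
```

    In a design document only the function expression itself would be stored as the view's map; the helper is defined inside it so that the function stays self-contained.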
    

    If you can use jQuery (which itself needs a DOM, so this only helps where one is available):

    var text = jQuery('<div>').html(html).text();
    

    With TextVersionJS (http://textversionjs.com) you can convert your HTML to plain text. It’s pure JavaScript (with tons of RegExps), so you can use it in the browser and in Node.js as well.

    In Node.js it looks like this:

    var createTextVersion = require("textversionjs");
    var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
    
    var textVersion = createTextVersion(yourHtml);
    

    (I copied the example from the page. You will have to npm install the module first.)

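    It’s pretty simple. If jQuery is available, you can also implement a “toText” method on the String prototype:

    String.prototype.toText = function () {
        // Use `this` (the string being converted) and wrap it in a <div>
        // so that it is always parsed as HTML, never as a selector.
        return $('<div>').html(String(this)).text();
    };
    
    // Let's test it out!
    var html = "<a href=\"http://www.google.com\">link</a>&nbsp;<br /><b>TEXT</b>";
    var text = html.toText();
    console.log("Text: " + text); // Result will be "link TEXT" (the gap is a non-breaking space)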