Sunday, March 27, 2011

Get raw text from markdown'ed text

In my DB, I have text that is markdown'ed. The same way SO does when showing question excerpts, I would like to get the first N characters of the text with all formatting removed. Of course, the MD -> HTML step must be avoided; the work must be done directly on the MD'ed text. Performance is a requirement. Thanks.

From Stack Overflow
  • The way that I would handle this is by defining a formatter interface for the class containing/representing your marked down text. You'd then have concrete implementations that support HTML formatting and plain text formatting. All you would need to do is inject the correct implementation and call the formatter.

    Your plain text formatter could simply iterate through the characters in the string, copying characters until it hits some markdown. It would then skip the markdown and start outputting again when it hits the text.

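    public interface IFormatter
    {
        // Converts raw Markdown into the target representation.
        string Format(string markdown);
    }

    public class HtmlFormatter : IFormatter
    {
        public string Format(string markdown)
        {
            // ...string translated to HTML...
            throw new System.NotImplementedException();
        }
    }

    public class PlainTextFormatter : IFormatter
    {
        public string Format(string markdown)
        {
            // ...go through and remove all markdown and return rest
            throw new System.NotImplementedException();
        }
    }

    public class Post
    {
        // The raw Markdown as stored in the database.
        public string Markdown { get; set; }
        public IFormatter Formatter { get; set; }

        public Post(IFormatter formatter)
        {
            // Fall back to HTML output when no formatter is injected.
            this.Formatter = formatter ?? new HtmlFormatter();
        }

        public string Format()
        {
            return this.Formatter.Format(this.Markdown);
        }
    }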
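
    As a rough illustration of the character-by-character approach described above, here is a minimal sketch of what the body of a plain text formatter might look like. The class name MarkdownExcerpt and the method Strip are hypothetical, and only a small subset of Markdown is handled (heading and quote markers, emphasis and code delimiters, inline links and images). It is not a full Markdown parser: underscores inside words are dropped as well, and reference-style links are not resolved.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical helper, not part of any Markdown library.
    public static class MarkdownExcerpt
    {
        // Returns at most maxLength characters of plain text taken from the
        // start of the Markdown source. Scanning stops as soon as enough
        // characters have been collected, so long texts are not fully parsed.
        public static string Strip(string markdown, int maxLength)
        {
            if (string.IsNullOrEmpty(markdown)) return string.Empty;

            var output = new StringBuilder();
            bool atLineStart = true;
            int i = 0;

            while (i < markdown.Length && output.Length < maxLength)
            {
                char c = markdown[i];
                bool hasNext = i + 1 < markdown.Length;

                if (c == '\n' || c == '\r')
                {
                    // Collapse line breaks into a single space.
                    if (output.Length > 0 && output[output.Length - 1] != ' ')
                        output.Append(' ');
                    atLineStart = true;
                    i++;
                }
                else if (atLineStart && (c == '#' || c == '>' || c == ' ' || c == '\t'))
                {
                    i++; // heading marker, quote marker, or indentation
                }
                else if (c == ']' && hasNext && markdown[i + 1] == '(')
                {
                    // Skip the "(url)" part of an inline link or image.
                    int close = markdown.IndexOf(')', i + 2);
                    i = close < 0 ? markdown.Length : close + 1;
                }
                else if (c == '!' && hasNext && markdown[i + 1] == '[')
                {
                    i++; // image marker; the alt text is kept
                }
                else if (c == '*' || c == '_' || c == '`' || c == '[' || c == ']')
                {
                    i++; // emphasis, code, list, or link delimiter
                }
                else
                {
                    output.Append(c);
                    atLineStart = false;
                    i++;
                }
            }

            return output.ToString().TrimEnd();
        }
    }
    ```

    For example, MarkdownExcerpt.Strip("# Title\n\nSome **bold** text with a [link](http://example.com).", 100) yields "Title Some bold text with a link." and the same call with a maxLength of 10 yields "Title Some".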
    
    Nicolas Cadilhac : Not the answer to my question, but a good point for the surrounding framework. Voted.
  • Forgive me if I'm misunderstanding what you need here, but if this database gets more reads (page views) than inserts (new markdown'ed records), the biggest performance gain may come from saving a version of the text with all markup stripped in a separate field. That way your front end doesn't have to repeatedly parse what it reads from the database before sending it to the browser; the text is parsed only once, when a record is added.

    Whether or not this actually makes sense from a performance standpoint depends on variables specific to your situation: how big the text entries are, how often records are inserted versus read, etc.
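
    A minimal sketch of that idea, assuming the excerpt is computed once when the record is created and then persisted next to the raw text. The class StoredPost, its members, and the toPlainText delegate are all hypothetical; the delegate stands in for whatever stripping routine you end up using.

    ```csharp
    using System;

    // Hypothetical record type: both fields would map to database columns.
    public class StoredPost
    {
        public string Markdown { get; private set; } // raw text, kept for editing
        public string Excerpt { get; private set; }  // stripped text, computed once

        public StoredPost(string markdown, Func<string, string> toPlainText, int excerptLength)
        {
            Markdown = markdown;

            // The stripping cost is paid here, at insert time, not on every read.
            string plain = toPlainText(markdown);
            Excerpt = plain.Length <= excerptLength
                ? plain
                : plain.Substring(0, excerptLength);
        }
    }
    ```

    With this shape, page views only read the Excerpt column; the excerpt needs to be regenerated only when the Markdown itself is edited.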

    Nicolas Cadilhac : A good point that I will keep in mind. Thanks. However it doesn't answer the question about the best way to generate this excerpt.
  • "In my DB, I have text that is markdown'ed. The same way SO does when showing question excerpts, I would like to get the first N characters of the text with all formatting removed."

    We store both representations of the text in the database:

    1. Raw Markdown suitable for editing
    2. HTML-ized version suitable for output

    and when we display it, we use the HTML-ized output version and simply apply our standard HTML stripping algorithms.
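
    The actual stripping code is not shown in this answer, so the following is only a sketch of one simple way to get an excerpt from the stored HTML, under my own assumptions: a regular expression removes the tags, System.Net.WebUtility.HtmlDecode decodes the entities, and whitespace is collapsed before truncating. The class name HtmlExcerpt and the method FromHtml are hypothetical. A regex is not a real HTML parser, so this is only reasonable for HTML you generated yourself, and replacing tags with spaces will split a word that has a tag in the middle of it.

    ```csharp
    using System.Net;
    using System.Text.RegularExpressions;

    // Hypothetical helper for turning trusted, generated HTML into an excerpt.
    public static class HtmlExcerpt
    {
        public static string FromHtml(string html, int maxLength)
        {
            // Replace every tag with a space so adjacent blocks do not run together.
            string text = Regex.Replace(html, "<[^>]*>", " ");

            // Turn entities such as &amp; back into plain characters.
            text = WebUtility.HtmlDecode(text);

            // Collapse runs of whitespace and trim the ends.
            text = Regex.Replace(text, @"\s+", " ").Trim();

            return text.Length <= maxLength ? text : text.Substring(0, maxLength);
        }
    }
    ```

    For example, HtmlExcerpt.FromHtml("<p>Some <strong>bold</strong> text &amp; more</p>", 100) yields "Some bold text & more".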

  • Here is the path I'm taking: I will modify the markdown code so that, with a switch, it can produce either HTML or plain text. Once the excerpt has been generated, I will surely store it in the DB.

    I won't tag any answer as the solution since there are many ways to do it. Everyone gets my vote ;)
